cs.CV

[1] Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models

Maria-Teresa De Rosa Palmini,Eva Cetinic

Main category: cs.CV

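TL;DR: This paper proposes a method for evaluating how text-to-image (TTI) diffusion models depict historical contexts, revealing systematic inaccuracies in style, consistency, and demographics.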

Details Motivation: As TTI models grow more influential in content creation, their societal and cultural impact is drawing attention, yet the accuracy of their historical depictions remains underexplored. Method: Introduces the HistVis dataset of 30,000 images generated by three state-of-the-art diffusion models, and evaluates them on stylistic associations, historical consistency, and demographic representation. Result: TTI models show systematic inaccuracies in historically themed images, including stereotyped styles, anachronisms, and implausible demographic distributions. Conclusion: Provides a scalable methodology and benchmark as a first step toward more historically accurate and culturally aligned TTI models. Abstract: As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. In this work, we present a systematic and reproducible methodology for evaluating how TTI systems depict different historical periods. For this purpose, we introduce the HistVis dataset, a curated collection of 30,000 synthetic images generated by three state-of-the-art diffusion models using carefully designed prompts depicting universal human activities across different historical periods. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By offering a scalable methodology and benchmark for assessing historical representation in generated imagery, this work provides an initial step toward building more historically accurate and culturally aligned TTI models.
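The third evaluation aspect, comparing generated demographic distributions against a plausible baseline, can be illustrated with a simple distance between two discrete distributions. The sketch below uses total variation distance; this is an assumption chosen for illustration, since the abstract does not say which measure the authors use, and the function names and sample numbers are hypothetical.

```python
def normalize(counts):
    """Turn raw label counts into a probability distribution."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must not be empty or all zero")
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(generated, baseline):
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint)."""
    labels = set(generated) | set(baseline)
    return 0.5 * sum(
        abs(generated.get(label, 0.0) - baseline.get(label, 0.0)) for label in labels
    )


# Hypothetical numbers: labels predicted on generated images vs. an assumed baseline.
generated = normalize({"women": 20, "men": 80})
baseline = {"women": 0.5, "men": 0.5}
gap = total_variation_distance(generated, baseline)
```

A larger `gap` would indicate that the generated images deviate more from the chosen baseline; how that baseline is established is the harder, domain-specific question.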

[2] EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language

Phoebe Chua,Cathy Mengying Fang,Takehiko Ohkawa,Raja Kushalnagar,Suranga Nanayakkara,Pattie Maes

Main category: cs.CV

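TL;DR: EmoSign is the first dataset with sentiment and emotion labels for 200 American Sign Language (ASL) videos; it fills a gap in research on emotional expression in sign language and includes baseline models.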

Details Motivation: Emotional expression in sign language is understudied, creating communication barriers in critical settings; EmoSign aims to address this. Method: Collects 200 ASL videos annotated for sentiment and emotion by 3 Deaf ASL signers, and provides baseline classification models. Result: Establishes the first sign language emotion dataset and sets a new benchmark for multimodal emotion recognition research. Conclusion: EmoSign fills a gap in sign language emotion research and supports the development of related models. Abstract: Unlike spoken languages where the use of prosodic features to convey emotion is well studied, indicators of emotion in sign language remain poorly understood, creating communication barriers in critical settings. Sign languages present unique challenges as facial expressions and hand movements simultaneously serve both grammatical and emotional functions. To address this gap, we introduce EmoSign, the first sign video dataset containing sentiment and emotion labels for 200 American Sign Language (ASL) videos. We also collect open-ended descriptions of emotion cues. Annotations were done by 3 Deaf ASL signers with professional interpretation experience. Alongside the annotations, we include baseline models for sentiment and emotion classification. This dataset not only addresses a critical gap in existing sign language research but also establishes a new benchmark for understanding model capabilities in multimodal emotion recognition for sign languages. The dataset is made available at https://huggingface.co/datasets/catfang/emosign.

[3] CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention

Yanshu Li,JianJiang Yang,Bozheng Li,Ruixiang Tang

Main category: cs.CV

TL;DR: The paper proposes CAMA, a method that calibrates LVLM attention to address instability in multimodal ICL.

Details Motivation: Multimodal ICL is unstable in LVLMs, and existing work focuses mainly on optimizing sequence configuration while overlooking the models' internal mechanisms. Method: Proposes Context-Aware Modulated Attention (CAMA), a training-free, plug-and-play method that directly calibrates LVLM attention logits. Result: CAMA's effectiveness and generality are validated on four LVLMs across six benchmarks. Conclusion: CAMA opens new opportunities for deeper exploration and targeted use of LVLM attention dynamics to advance multimodal reasoning. Abstract: Multimodal in-context learning (ICL) enables large vision-language models (LVLMs) to efficiently adapt to novel tasks, supporting a wide array of real-world applications. However, multimodal ICL remains unstable, and current research largely focuses on optimizing sequence configuration while overlooking the internal mechanisms of LVLMs. In this work, we first provide a theoretical analysis of attentional dynamics in multimodal ICL and identify three core limitations of standard attention that impair ICL performance. To address these challenges, we propose Context-Aware Modulated Attention (CAMA), a simple yet effective plug-and-play method for directly calibrating LVLM attention logits. CAMA is training-free and can be seamlessly applied to various open-source LVLMs. We evaluate CAMA on four LVLMs across six benchmarks, demonstrating its effectiveness and generality. CAMA opens new opportunities for deeper exploration and targeted utilization of LVLM attention dynamics to advance multimodal reasoning.
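The abstract does not describe how CAMA computes its calibration, so the sketch below shows only the general idea of adjusting attention logits before the softmax, using an additive per-position term. The names `calibrated_attention` and `bias` are hypothetical, and this is not the authors' method.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def calibrated_attention(logits, bias):
    """Add a per-position calibration term to raw attention logits, then normalize.

    `bias` is an assumed calibration signal; how a real method derives it is
    exactly the part this sketch leaves out.
    """
    if len(logits) != len(bias):
        raise ValueError("logits and bias must have the same length")
    return softmax([l + b for l, b in zip(logits, bias)])


# Three context positions with equal raw logits; boost the first one.
weights = calibrated_attention([1.0, 1.0, 1.0], [1.0, 0.0, 0.0])
```

Because the change happens at inference time on the logits alone, no model weights are updated, which is what makes this family of interventions training-free.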

[4] Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

Michal Golovanevsky,William Rudman,Michael Lepori,Amir Bar,Ritambhara Singh,Carsten Eickhoff

Main category: cs.CV

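TL;DR: The paper studies whether multimodal large language models (MLLMs) rely on memorized knowledge or visual input in visual question answering; using the Visual CounterFact dataset, it finds that predictions shift from memorized priors to visual evidence, and it proposes PvP steering vectors to control model outputs.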

Details Motivation: To investigate whether MLLMs rely on memorized knowledge or visual input in vision tasks, in order to understand their reasoning. Method: Introduces the Visual CounterFact dataset, which puts visual input in conflict with memorized knowledge, and proposes PvP steering vectors that control model outputs through activation-level interventions. Result: Predictions shift from memorized priors toward visual evidence in mid-to-late layers; on average, PvP shifts 92.5% of color and 74.6% of size predictions from priors to counterfactuals. Conclusion: The work offers new tools for interpreting and controlling factual behavior in multimodal models. Abstract: Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually-realistic counterfactuals that put world knowledge priors (e.g., red strawberry) into direct conflict with visual input (e.g., blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors, but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for controlling model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 92.5% of color and 74.6% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.
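Activation-level steering is often implemented as a difference of mean activations that is added back to a hidden state at inference time. The sketch below illustrates that generic recipe on toy vectors; whether PvP builds its vectors this way is an assumption, and all names and values are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(target_acts, source_acts):
    """Direction pointing from the mean source activation to the mean target activation."""
    target_mean = mean_vector(target_acts)
    source_mean = mean_vector(source_acts)
    return [t - s for t, s in zip(target_mean, source_mean)]


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction; alpha sets the strength."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations: prompts answered from the image vs. from memorized priors.
visual_acts = [[1.0, 0.0], [3.0, 0.0]]
prior_acts = [[0.0, 2.0], [0.0, 4.0]]
direction = steering_vector(visual_acts, prior_acts)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

Negating `alpha` would push the state in the opposite direction, which matches the abstract's description of steering toward either world knowledge or visual input.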

[5] Robustifying Vision-Language Models via Dynamic Token Reweighting

Tanqiu Jiang,Jiacheng Liang,Rongyi Zhu,Jiawei Zhou,Fenglong Ma,Ting Wang

Main category: cs.CV

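TL;DR: DTR is a new inference-time defense that mitigates multimodal jailbreak attacks by optimizing the model's key-value (KV) caches, without relying on safety-specific data or costly conversions.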

Details Motivation: Large vision-language models (VLMs) are vulnerable to jailbreak attacks, and existing defenses depend on curated data or conversions, making them inefficient and of limited effectiveness. Method: DTR dynamically adjusts visual token weights to reduce the impact of adversarial visual inputs while preserving model capability and inference efficiency. Result: Across diverse VLMs and attack benchmarks, DTR outperforms existing defenses in both attack robustness and benign task performance. Conclusion: DTR is presented as the first successful application of KV cache optimization to safety enhancement in multimodal foundation models. Abstract: Large vision-language models (VLMs) are highly vulnerable to jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails. In this paper, we present DTR, a novel inference-time defense that mitigates multimodal jailbreak attacks through optimizing the model's key-value (KV) caches. Rather than relying on curated safety-specific data or costly image-to-text conversion, we introduce a new formulation of the safety-relevant distributional shift induced by the visual modality. This formulation enables DTR to dynamically adjust visual token weights, minimizing the impact of adversarial visual inputs while preserving the model's general capabilities and inference efficiency. Extensive evaluation across diverse VLMs and attack benchmarks demonstrates that DTR outperforms existing defenses in both attack robustness and benign task performance, marking the first successful application of KV cache optimization for safety enhancement in multimodal foundation models. The code for replicating DTR is available: https://anonymous.4open.science/r/DTR-2755 (warning: this paper contains potentially harmful content generated by VLMs.)

[6] A Framework for Multi-View Multiple Object Tracking using Single-View Multi-Object Trackers on Fish Data

Chaim Chai Elchik,Fatemeh Karimi Nejadasl,Seyed Sahand Mohammadi Ziabari,Ali Mohammed Mansoor Alsahag

Main category: cs.CV

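TL;DR: The paper proposes a multi-view framework for underwater fish detection and tracking that combines FairMOT and YOLOv8 and uses stereo video input to improve accuracy and fish behavior recognition.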

Details Motivation: Traditional single-view multi-object tracking models perform poorly under the complex 3D motion and noise of underwater environments and need to be adapted for ecological studies. Method: Adapts FairMOT and YOLOv8 into a multi-view framework that uses stereo video input and stereo matching to produce 3D output. Result: The framework detects fish with a relative accuracy of 47% and provides 3D motion data. Conclusion: The multi-view framework aims to improve the precision and reliability of underwater fish tracking over single-view approaches and to give ecological research a more comprehensive tool for analyzing fish behavior. Abstract: Multi-object tracking (MOT) in computer vision has made significant advancements, yet tracking small fish in underwater environments presents unique challenges due to complex 3D motions and data noise. Traditional single-view MOT models often fall short in these settings. This thesis addresses these challenges by adapting state-of-the-art single-view MOT models, FairMOT and YOLOv8, for underwater fish detection and tracking in ecological studies. The core contribution of this research is the development of a multi-view framework that utilizes stereo video inputs to enhance tracking accuracy and fish behavior pattern recognition. By integrating and evaluating these models on underwater fish video datasets, the study aims to demonstrate significant improvements in precision and reliability compared to single-view approaches. The proposed framework detects fish entities with a relative accuracy of 47% and employs stereo-matching techniques to produce a novel 3D output, providing a more comprehensive understanding of fish movements and interactions.
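To make the stereo-matching step concrete, here is a minimal sketch of how a matched detection in a rectified stereo pair can be lifted to a 3D point with the standard pinhole relation Z = f * B / d. The camera parameters and function name are hypothetical, and the thesis may use a different camera model, for example one that corrects for underwater refraction.

```python
def triangulate(u_left, v_left, disparity_px, focal_px, baseline_m, cx, cy):
    """Lift a pixel in the left image to a 3D point in the left camera frame.

    Assumes a rectified stereo pair: `disparity_px` is the horizontal pixel
    offset of the same fish between the left and right images.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity_px  # depth along the optical axis, in metres
    x = (u_left - cx) * z / focal_px          # lateral offset from the optical axis
    y = (v_left - cy) * z / focal_px          # vertical offset from the optical axis
    return x, y, z


# Hypothetical rig: 800 px focal length, 10 cm baseline, 640x480 image centre.
point = triangulate(u_left=420, v_left=240, disparity_px=20,
                    focal_px=800, baseline_m=0.1, cx=320, cy=240)
```

Applying this to each matched track per frame yields a 3D trajectory, which is the kind of output the abstract describes.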

[7] REACT 2025: the Third Multiple Appropriate Facial Reaction Generation Challenge

Siyang Song,Micol Spitale,Xiangyu Kong,Hengde Zhu,Cheng Luo,Cristina Palmero,German Barquero,Sergio Escalera,Michel Valstar,Mohamed Daoudi,Tobias Baur,Fabien Ringeval,Andrew Howes,Elisabeth Andre,Hatice Gunes

Main category: cs.CV

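TL;DR: The REACT 2025 challenge aims to advance machine learning models that generate diverse, realistic, and synchronized human facial reactions, and provides the large-scale multimodal MARS dataset.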

Details Motivation: To study the diversity of human facial reactions and its use in machine learning, in order to make human-computer interaction more natural. Method: Provides the MARS dataset, which records 137 human-human dyadic interactions, and sets up two sub-challenges, offline and online. Result: The challenge baseline code is publicly available and gives participants a reference for baseline performance. Conclusion: REACT 2025 offers an important platform and resources for research on facial reaction generation. Abstract: In dyadic interactions, a broad spectrum of human facial reactions might be appropriate for responding to each human speaker behaviour. Following the successful organisation of the REACT 2023 and REACT 2024 challenges, we are proposing the REACT 2025 challenge encouraging the development and benchmarking of Machine Learning (ML) models that can be used to generate multiple appropriate, diverse, realistic and synchronised human-style facial reactions expressed by human listeners in response to an input stimulus (i.e., audio-visual behaviours expressed by their corresponding speakers). As a key part of the challenge, we provide challenge participants with the first natural and large-scale multi-modal MAFRG dataset (called MARS) recording 137 human-human dyadic interactions containing a total of 2856 interaction sessions covering five different topics. In addition, this paper also presents the challenge guidelines and the performance of our baselines on the two proposed sub-challenges: Offline MAFRG and Online MAFRG, respectively. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline_react2025

[8] CHAOS: Chart Analysis with Outlier Samples

Omar Moured,Yufan Chen,Ruiping Liu,Simon Reiß,Philip Torr,Jiaming Zhang,Rainer Stiefelhagen

Main category: cs.CV

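TL;DR: CHAOS is a benchmark for evaluating the robustness of multimodal large language models (MLLMs) to chart perturbations; it covers textual and visual perturbations at three levels of severity.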

Details Motivation: Charts in real-world applications often contain noisy or unusual features, and existing MLLMs handle such perturbations poorly, so their robustness needs systematic evaluation. Method: CHAOS comprises five types of textual and ten types of visual perturbations at three severity levels, and evaluates 13 MLLMs on two downstream tasks (ChartQA and Chart-to-Text). Result: Experiments and case studies characterize model robustness across chart perturbations and offer guidance for future chart understanding research. Conclusion: CHAOS provides a tool for evaluating and improving MLLM robustness in chart understanding; data and code are publicly available. Abstract: Charts play a critical role in data analysis and visualization, yet real-world applications often present charts with challenging or noisy features. These "outlier charts" pose a substantial challenge even for Multimodal Large Language Models (MLLMs), which can struggle to interpret perturbed charts. In this work, we introduce CHAOS (CHart Analysis with Outlier Samples), a robustness benchmark to systematically evaluate MLLMs against chart perturbations. CHAOS encompasses five types of textual and ten types of visual perturbations, each presented at three levels of severity (easy, mid, hard) inspired by the study result of human evaluation. The benchmark includes 13 state-of-the-art MLLMs divided into three groups (i.e., general-, document-, and chart-specific models) according to the training scope and data. Comprehensive analysis involves two downstream tasks (ChartQA and Chart-to-Text). Extensive experiments and case studies highlight critical insights into the robustness of models across chart perturbations, aiming to guide future research in the chart understanding domain. Data and code are publicly available at: http://huggingface.co/datasets/omoured/CHAOS.

[9] Extending Dataset Pruning to Object Detection: A Variance-based Approach

Ryota Yagi

Main category: cs.CV

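TL;DR: This paper presents what it describes as the first extension of classification dataset pruning to object detection, addresses three key challenges, and proposes a new scoring method, VPS, which outperforms prior methods on PASCAL VOC and MS COCO.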

Details Motivation: Dataset pruning works well for image classification but is underexplored for more complex tasks such as object detection; this paper aims to fill that gap. Method: Proposes the Variance-based Prediction Score (VPS), which combines IoU and confidence scores, and addresses three key problems in extending pruning to object detection. Result: On PASCAL VOC and MS COCO, VPS outperforms prior pruning methods in mAP, and selecting informative examples matters more than dataset size or balance. Conclusion: The work lays a foundation for dataset pruning in complex vision tasks and shows its potential for object detection. Abstract: Dataset pruning -- selecting a small yet informative subset of training data -- has emerged as a promising strategy for efficient machine learning, offering significant reductions in computational cost and storage compared to alternatives like dataset distillation. While pruning methods have shown strong performance in image classification, their extension to more complex computer vision tasks, particularly object detection, remains relatively underexplored. In this paper, we present the first principled extension of classification pruning techniques to the object detection domain, to the best of our knowledge. We identify and address three key challenges that hinder this transition: the Object-Level Attribution Problem, the Scoring Strategy Problem, and the Image-Level Aggregation Problem. To overcome these, we propose tailored solutions, including a novel scoring method called Variance-based Prediction Score (VPS). VPS leverages both Intersection over Union (IoU) and confidence scores to effectively identify informative training samples specific to detection tasks. Extensive experiments on PASCAL VOC and MS COCO demonstrate that our approach consistently outperforms prior dataset pruning methods in terms of mean Average Precision (mAP). We also show that annotation count and class distribution shift can influence detection performance, but selecting informative examples is a more critical factor than dataset size or balance. Our work bridges dataset pruning and object detection, paving the way for dataset pruning in complex vision tasks.
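The abstract names the ingredients of VPS (IoU, confidence, and variance) but not its formula. The sketch below is therefore only one plausible reading, offered as an assumption: track the product of IoU and confidence for each ground-truth object across training epochs, take its variance, and average over the objects in an image. All function names are hypothetical.

```python
from statistics import pvariance


def object_score(ious, confidences):
    """Variance across epochs of IoU * confidence for one ground-truth object.

    Both lists hold one value per training epoch for the best-matching prediction.
    """
    if len(ious) != len(confidences):
        raise ValueError("need one IoU and one confidence per epoch")
    return pvariance([iou * conf for iou, conf in zip(ious, confidences)])


def image_score(per_object_scores):
    """Aggregate object-level scores to the image level by averaging (an assumption)."""
    if not per_object_scores:
        return 0.0
    return sum(per_object_scores) / len(per_object_scores)


# Two objects in one image, observed over two epochs.
stable = object_score([0.5, 0.5], [1.0, 1.0])    # prediction never changes
unstable = object_score([0.0, 1.0], [1.0, 1.0])  # prediction flips between epochs
score = image_score([stable, unstable])
```

Under this reading, images whose predictions fluctuate during training receive higher scores and would be kept when pruning; the paper's actual attribution and aggregation choices may differ.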

[10] ExpertGen: Training-Free Expert Guidance for Controllable Text-to-Face Generation

Liang Shi,Yun Fu

Main category: cs.CV

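TL;DR: ExpertGen is a training-free framework that uses pre-trained expert models (such as face recognition, attribute recognition, and age estimation) for fine-grained control of text-to-face generation.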

Details Motivation: Existing methods require training additional modules for specific controls (such as identity, attributes, or age), which makes them inflexible and resource-intensive. Method: Uses a latent consistency model to keep predictions realistic and in-distribution at each diffusion step, so that expert models can provide accurate guidance signals. Result: Qualitative and quantitative experiments show that expert models guide generation with high precision and that multiple experts can jointly control several facial aspects at once. Conclusion: By directly integrating off-the-shelf expert models, ExpertGen turns them into plug-and-play components for controllable face generation. Abstract: Recent advances in diffusion models have significantly improved text-to-face generation, but achieving fine-grained control over facial features remains a challenge. Existing methods often require training additional modules to handle specific controls such as identity, attributes, or age, making them inflexible and resource-intensive. We propose ExpertGen, a training-free framework that leverages pre-trained expert models such as face recognition, facial attribute recognition, and age estimation networks to guide generation with fine control. Our approach uses a latent consistency model to ensure realistic and in-distribution predictions at each diffusion step, enabling accurate guidance signals to effectively steer the diffusion process. We show qualitatively and quantitatively that expert models can guide the generation process with high precision, and multiple experts can collaborate to enable simultaneous control over diverse facial aspects. By allowing direct integration of off-the-shelf expert models, our method transforms any such model into a plug-and-play component for controllable face generation.

[11] Mitigate One, Skew Another? Tackling Intersectional Biases in Text-to-Image Models

Pushkar Shukla,Aditya Chinchure,Emily Diana,Alexander Tolbert,Kartik Hosanagar,Vineeth N Balasubramanian,Leonid Sigal,Matthew Turk

Main category: cs.CV

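TL;DR: BiasConnect and InterMit are tools for analyzing and mitigating intersectional biases in text-to-image models; they quantify how biases affect one another and offer an efficient mitigation procedure.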

Details Motivation: Existing methods usually treat biases in text-to-image models independently, although these biases may be interrelated and call for a systematic solution. Method: Proposes BiasConnect to analyze bias interactions, and InterMit, an algorithm that mitigates bias based on user-defined target distributions and priority weights. Result: BiasConnect's estimates correlate strongly (+0.65) with post-mitigation outcomes; InterMit achieves lower bias (0.33 vs. 0.52) in fewer steps (2.38 vs. 3.15) than traditional methods. Conclusion: InterMit is a flexible, extensible approach that mitigates intersectional biases efficiently and yields better image quality. Abstract: The biases exhibited by text-to-image (TTI) models are often treated as independent, though in reality, they may be deeply interrelated. Addressing bias along one dimension, such as ethnicity or age, can inadvertently affect another, like gender, either mitigating or exacerbating existing disparities. Understanding these interdependencies is crucial for designing fairer generative models, yet measuring such effects quantitatively remains a challenge. To address this, we introduce BiasConnect, a novel tool for analyzing and quantifying bias interactions in TTI models. BiasConnect uses counterfactual interventions along different bias axes to reveal the underlying structure of these interactions and estimates the effect of mitigating one bias axis on another. These estimates show strong correlation (+0.65) with observed post-mitigation outcomes. Building on BiasConnect, we propose InterMit, an intersectional bias mitigation algorithm guided by user-defined target distributions and priority weights. InterMit achieves lower bias (0.33 vs. 0.52) with fewer mitigation steps (2.38 vs. 3.15 average steps), and yields superior image quality compared to traditional techniques. Although our implementation is training-free, InterMit is modular and can be integrated with many existing debiasing approaches for TTI models, making it a flexible and extensible solution.

[12] Harnessing EHRs for Diffusion-based Anomaly Detection on Chest X-rays

Harim Kim,Yuhan Wang,Minkyu Ahn,Heeyoul Choi,Yuyin Zhou,Charmgil Hong

Main category: cs.CV

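TL;DR: Diff3M is a multimodal diffusion framework that combines chest X-rays with structured electronic health records (EHRs) and uses an image-EHR cross-attention module to improve anomaly detection.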

Details Motivation: Existing diffusion-based unsupervised anomaly detection (UAD) models rely only on imaging features and struggle to distinguish normal anatomical variation from pathological anomalies. Method: Proposes the Diff3M framework, which integrates X-rays and EHRs through an image-EHR cross-attention module and a static masking strategy. Result: Outperforms existing UAD methods on the CheXpert and MIMIC-CXR/IV datasets. Conclusion: Diff3M improves anomaly detection in medical imaging through multimodal fusion. Abstract: Unsupervised anomaly detection (UAD) in medical imaging is crucial for identifying pathological abnormalities without requiring extensive labeled data. However, existing diffusion-based UAD models rely solely on imaging features, limiting their ability to distinguish between normal anatomical variations and pathological anomalies. To address this, we propose Diff3M, a multi-modal diffusion-based framework that integrates chest X-rays and structured Electronic Health Records (EHRs) for enhanced anomaly detection. Specifically, we introduce a novel image-EHR cross-attention module to incorporate structured clinical context into the image generation process, improving the model's ability to differentiate normal from abnormal features. Additionally, we develop a static masking strategy to enhance the reconstruction of normal-like images from anomalies. Extensive evaluations on CheXpert and MIMIC-CXR/IV demonstrate that Diff3M achieves state-of-the-art performance, outperforming existing UAD methods in medical imaging. Our code is available at https://github.com/nth221/Diff3M

[13] Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models

Jiachen Jiang,Jinxin Zhou,Bo Peng,Xia Ning,Zhihui Zhu

Main category: cs.CV

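TL;DR: The paper studies the alignment between vision embeddings and large language models (LLMs) and proposes a new training method, patch-aligned training, that strengthens alignment and improves multimodal LLM performance.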

Details Motivation: Improving multimodal LLMs requires better alignment between vision embeddings and the LLM, but how the commonly used projector compresses and aligns visual information is unclear. Method: Investigates the projector's role in compressing vision embeddings and aligning them with word embeddings, and proposes a multi-semantic alignment hypothesis and patch-aligned training. Result: Experiments show stronger compression and patch-level alignment, higher-quality captions, and gains on several tasks (e.g., 16% on referring expression grounding). Conclusion: The proposed method strengthens vision-language alignment and can be easily extended to other multimodal models. Abstract: Achieving better alignment between vision embeddings and Large Language Models (LLMs) is crucial for enhancing the abilities of Multimodal LLMs (MLLMs), particularly for recent models that rely on powerful pretrained vision encoders and LLMs. A common approach to connect the pretrained vision encoder and LLM is through a projector applied after the vision encoder. However, the projector is often trained to enable the LLM to generate captions, and hence the mechanism by which LLMs understand each vision token remains unclear. In this work, we first investigate the role of the projector in compressing vision embeddings and aligning them with word embeddings. We show that the projector significantly compresses visual information, removing redundant details while preserving essential elements necessary for the LLM to understand visual content. We then examine patch-level alignment -- the alignment between each vision patch and its corresponding semantic words -- and propose a *multi-semantic alignment hypothesis*. Our analysis indicates that the projector trained by caption loss improves patch-level alignment but only to a limited extent, resulting in weak and coarse alignment. To address this issue, we propose *patch-aligned training* to efficiently enhance patch-level alignment. Our experiments show that patch-aligned training (1) achieves stronger compression capability and improved patch-level alignment, enabling the MLLM to generate higher-quality captions, (2) improves the MLLM's performance by 16% on referring expression grounding tasks, 4% on question-answering tasks, and 3% on modern instruction-following benchmarks when using the same supervised fine-tuning (SFT) setting.
The proposed method can be easily extended to other multimodal models.

[14] Optimizing Image Capture for Computer Vision-Powered Taxonomic Identification and Trait Recognition of Biodiversity Specimens

Alyson East,Elizabeth G. Campolongo,Luke Meyers,S M Rayeed,Samuel Stevens,Iuliia Zarubiieva,Isadora E. Fluck,Jennifer C. Girón,Maximiliane Jousse,Scott Lowe,Kayla I Perry,Isabelle Betancourt,Noah Charney,Evan Donoso,Nathan Fox,Kim J. Landsbergen,Ekaterina Nepovinnykh,Michelle Ramirez,Parkash Singh,Khum Thapa-Magar,Matthew Thompson,Evan Waite,Tanya Berger-Wolf,Hilmar Lapp,Paula Mabee,Graham Taylor,Sydne Record

Main category: cs.CV

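TL;DR: This paper presents a framework for optimizing biological specimen images for computer vision applications, comprising ten key considerations, to bridge the gap between current imaging practices and the needs of automated analysis.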

Details Motivation: Digital images of biological specimens are usually designed for human viewing without regard to computational analysis, which limits the potential of automated analysis. The paper aims to support large-scale automated biodiversity research by optimizing image capture and storage practices. Method: Through interdisciplinary collaboration, identifies ten key considerations, including standardized imaging, metadata documentation, and data storage, for optimizing specimen images for computer vision analysis. Result: A framework of recommendations spanning the workflow from image capture to data sharing, supporting automated trait extraction, species identification, and ecological and evolutionary analyses. Conclusion: By implementing these recommendations, specimen images can better serve computer vision applications and advance large-scale automated analysis in biodiversity research. Abstract: Biological collections house millions of specimens documenting Earth's biodiversity, with digital images increasingly available through open-access platforms. Most imaging protocols were developed for human visual interpretation without considering computational analysis requirements. This paper aims to bridge the gap between current imaging practices and the potential for automated analysis by presenting key considerations for creating biological specimen images optimized for computer vision applications. We provide conceptual computer vision topics for context, addressing fundamental concerns including model generalization, data leakage, and comprehensive metadata documentation, and outline practical guidance on specimen imaging and data storage. These recommendations were synthesized through interdisciplinary collaboration between taxonomists, collection managers, ecologists, and computer scientists. Through this synthesis, we have identified ten interconnected considerations that form a framework for successfully integrating biological specimen images into computer vision pipelines. The key elements include: (1) comprehensive metadata documentation, (2) standardized specimen positioning, (3) consistent size and color calibration, (4) protocols for handling multiple specimens in one image, (5) uniform background selection, (6) controlled lighting, (7) appropriate resolution and magnification, (8) optimal file formats, (9) robust data archiving strategies, and (10) accessible data sharing practices.
By implementing these recommendations, collection managers, taxonomists, and biodiversity informaticians can generate images that support automated trait extraction, species identification, and novel ecological and evolutionary analyses at unprecedented scales. Successful implementation lies in thorough documentation of methodological choices.

[15] Game-invariant Features Through Contrastive and Domain-adversarial Learning

Dylan Kline

Main category: cs.CV

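TL;DR: Proposes a method combining contrastive learning and domain-adversarial training to learn game-invariant visual features and improve generalization across games.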

Details Motivation: Foundational game-image encoders tend to overfit to game-specific visual styles, hurting performance on downstream tasks in new games. Method: Contrastive learning encourages similar content to cluster, while an adversarial domain classifier discourages game-specific cues, yielding game-invariant features. Result: On the Bingsu dataset (10,000 screenshots from 10 games), the model's features stop clustering by game after only a few epochs, indicating invariance. Conclusion: The method paves the way for more generalizable game vision models that need little to no retraining on new games. Abstract: Foundational game-image encoders often overfit to game-specific visual styles, undermining performance on downstream tasks when applied to new games. We present a method that combines contrastive learning and domain-adversarial training to learn game-invariant visual features. By simultaneously encouraging similar content to cluster and discouraging game-specific cues via an adversarial domain classifier, our approach produces embeddings that generalize across diverse games. Experiments on the Bingsu game-image dataset (10,000 screenshots from 10 games) demonstrate that after only a few training epochs, our model's features no longer cluster by game, indicating successful invariance and potential for improved cross-game transfer (e.g., glitch detection) with minimal fine-tuning. This capability paves the way for more generalizable game vision models that require little to no retraining on new games.
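The contrastive half of this recipe can be sketched with an InfoNCE-style loss, which rewards an anchor embedding for being closer to its positive than to any negative. The abstract does not state which contrastive loss the author uses, so InfoNCE is an assumption here; the domain-adversarial half (a classifier trained through gradient reversal) needs an autodiff framework and is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]


# Toy embeddings: the same content from two games should sit close together.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is near zero when the positive is the closest embedding (`good`) and large when a negative is closer (`bad`), which is the pressure that makes similar content cluster.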

[16] FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding

Amit Agarwal,Srikant Panda,Kulbhushan Pachauri

Main category: cs.CV

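TL;DR: FS-DAG is an efficient, scalable model architecture for visually rich document understanding (VRDU) in few-shot settings; its modular framework adapts to diverse document types, and it performs strongly with fewer than 90M parameters.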

Details Motivation: To address the challenges of few-shot visually rich document understanding, such as OCR errors, misspellings, and domain shifts, while meeting the practical constraint of limited computational resources. Method: Combines domain-specific and language/vision-specific backbones in a modular framework that adapts to diverse document types with minimal data. Result: Experiments show that FS-DAG significantly improves convergence speed and performance over existing methods on information extraction tasks. Conclusion: FS-DAG demonstrates the potential of efficient small models that do not sacrifice performance and are suitable for real-world deployment. Abstract: In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision-specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with fewer than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments on information extraction tasks, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code: https://github.com/oracle-samples/fs-dag

[17] Temporal Differential Fields for 4D Motion Modeling via Image-to-Video Synthesis

Xin You,Minghui Zhang,Hanxiao Zhang,Jie Yang,Nassir Navab

Main category: cs.CV

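TL;DR: Proposes an image-to-video (I2V) synthesis approach for simulating regular respiration-induced motion, addressing the limitations of existing methods under dynamic backgrounds.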

Details Motivation: Existing methods require high-dose scans containing both starting and ending frames, and slight patient movement during preoperative data acquisition produces dynamic backgrounds that disrupt temporal modeling. Method: Uses an I2V framework that forecasts future frames from the first frame, and designs a Temporal Differential Diffusion Model that generates temporal differential fields to improve the temporal consistency of the videos. Result: On the ACDC cardiac and 4D Lung datasets, the generated 4D videos rival competitive methods in perceptual similarity and temporal consistency. Conclusion: The method simulates regular motion processes effectively and offers a new approach for image-guided clinical applications. Abstract: Temporal modeling of regular respiration-induced motions is crucial to image-guided clinical applications. Existing methods cannot simulate temporal motions unless high-dose imaging scans including starting and ending frames exist simultaneously. However, in the preoperative data acquisition stage, the slight movement of patients may result in dynamic backgrounds between the first and last frames in a respiratory period. This additional deviation can hardly be removed by image registration, thus affecting the temporal modeling. To address that limitation, we pioneer the simulation of the regular motion process via the image-to-video (I2V) synthesis framework, which animates the first frame to forecast future frames of a given length. Besides, to promote the temporal consistency of animated videos, we devise the Temporal Differential Diffusion Model to generate temporal differential fields, which measure the relative differential representations between adjacent frames. The prompt attention layer is devised for fine-grained differential fields, and the field augmented layer is adopted to better interact these fields with the I2V framework, promoting more accurate temporal variation of synthesized videos. Extensive results on ACDC cardiac and 4D Lung datasets reveal that our approach simulates 4D videos along the intrinsic motion trajectory, rivaling other competitive methods on perceptual similarity and temporal consistency. Code will be available soon.

[18] Render-FM: A Foundation Model for Real-time Photorealistic Volumetric Rendering

Zhongpai Gao,Meng Zheng,Benjamin Planche,Anwesa Choudhuri,Terrence Chen,Ziyan Wu

Main category: cs.CV

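TL;DR: Render-FM is a new foundation model for direct, real-time rendering of CT scans; large-scale pre-training eliminates per-scene optimization and sharply reduces preparation time.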

Details Motivation: Current high-fidelity neural rendering techniques require per-scene optimization, which is computationally expensive and generalizes poorly, limiting clinical use. Method: An encoder-decoder architecture directly regresses 6D Gaussian Splatting parameters from CT volumes, enabled by large-scale pre-training. Result: Render-FM achieves visual fidelity comparable or superior to specialized per-scan methods while cutting preparation time from nearly an hour to seconds. Conclusion: Render-FM enables high-quality real-time 3D visualization that can be integrated into real-time surgical planning and diagnostic workflows. Abstract: Volumetric rendering of Computed Tomography (CT) scans is crucial for visualizing complex 3D anatomical structures in medical imaging. Current high-fidelity approaches, especially neural rendering techniques, require time-consuming per-scene optimization, limiting clinical applicability due to computational demands and poor generalizability. We propose Render-FM, a novel foundation model for direct, real-time volumetric rendering of CT scans. Render-FM employs an encoder-decoder architecture that directly regresses 6D Gaussian Splatting (6DGS) parameters from CT volumes, eliminating per-scan optimization through large-scale pre-training on diverse medical data. By integrating robust feature extraction with the expressive power of 6DGS, our approach efficiently generates high-quality, real-time interactive 3D visualizations across diverse clinical CT data. Experiments demonstrate that Render-FM achieves visual fidelity comparable or superior to specialized per-scan methods while drastically reducing preparation time from nearly an hour to seconds for a single inference step. This advancement enables seamless integration into real-time surgical planning and diagnostic workflows. The project page is: https://gaozhongpai.github.io/renderfm/.

[19] Ocular Authentication: Fusion of Gaze and Periocular Modalities

Dillon Lohr,Michael J. Proulx,Mehedi Hasan Raju,Oleg V. Komogortsev

Main category: cs.CV

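TL;DR: This paper studies the feasibility of fusing two modalities, eye movements and periocular images, in a calibration-free authentication system, and finds that the multimodal approach outperforms unimodal systems.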

Details Motivation: To explore the potential of combining eye movements and periocular images in a unified authentication system, filling the gap left by the lack of large-scale studies. Method: Proposes a multimodal authentication system, evaluated on a large-scale dataset of 9202 subjects and built on a state-of-the-art machine learning architecture. Result: The multimodal approach outperforms both unimodal systems in all scenarios and surpasses the FIDO benchmark. Conclusion: Multimodal fusion together with a state-of-the-art machine learning architecture significantly improves authentication performance at scale. Abstract: This paper investigates the feasibility of fusing two eye-centric authentication modalities, eye movements and periocular images, within a calibration-free authentication system. While each modality has independently shown promise for user authentication, their combination within a unified gaze-estimation pipeline has not been thoroughly explored at scale. In this report, we propose a multimodal authentication system and evaluate it using a large-scale in-house dataset comprising 9202 subjects with an eye tracking (ET) signal quality equivalent to a consumer-facing virtual reality (VR) device. Our results show that the multimodal approach consistently outperforms both unimodal systems across all scenarios, surpassing the FIDO benchmark. The integration of a state-of-the-art machine learning architecture contributed significantly to the overall authentication performance at scale, driven by the model's ability to capture authentication representations and the complementary discriminative characteristics of the fused modalities.

[20] Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey

Preeti Lamba,Kiran Ravish,Ankita Kushwaha,Pawan Kumar

Main category: cs.CV

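TL;DR: This survey examines how reinforcement learning and reward modeling can align diffusion model outputs with human preferences and safety constraints, reviews existing methods, and proposes five directions for future research.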

Details Motivation: Diffusion models excel at generating images and other modalities, but aligning their outputs with human preferences and safety constraints remains a key challenge. Method: Surveys existing methods (such as reinforcement learning from human feedback and direct preference optimization), then classifies and compares the techniques by efficiency and safety. Result: Summarizes the strengths and weaknesses of existing methods and proposes five future research directions, including multi-objective alignment and efficient use of human feedback. Conclusion: The work aims to contribute new insights and techniques for safer, more value-aligned diffusion-based generative AI. Abstract: Diffusion models have emerged as leading generative models for images and other modalities, but aligning their outputs with human preferences and safety constraints remains a critical challenge. This thesis proposal investigates methods to align diffusion models using reinforcement learning (RL) and reward modeling. We survey recent advances in fine-tuning text-to-image diffusion models with human feedback, including reinforcement learning from human and AI feedback, direct preference optimization, and differentiable reward approaches. We classify these methods based on the type of feedback (human, automated, binary or ranked preferences), the fine-tuning technique (policy gradient, reward-weighted likelihood, direct backpropagation, etc.), and their efficiency and safety outcomes. We compare key algorithms and frameworks, highlighting how they improve alignment with user intent or safety standards, and discuss inter-relationships such as how newer methods build on or diverge from earlier ones. Based on the survey, we identify five promising research directions for the next two years: (1) multi-objective alignment with combined rewards, (2) efficient human feedback usage and active learning, (3) robust safety alignment against adversarial inputs, (4) continual and online alignment of diffusion models, and (5) interpretable and trustworthy reward modeling for generative images. Each direction is elaborated with its problem statement, challenges, related work, and a proposed research plan. The proposal is organized as a comprehensive document with literature review, comparative tables of methods, and detailed research plans, aiming to contribute new insights and techniques for safer and value-aligned diffusion-based generative AI.

[21] Dual Ascent Diffusion for Inverse Problems

Minseo Kim,Axel Levy,Gordon Wetzstein

Main category: cs.CV

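TL;DR: Proposes a dual ascent optimization framework with diffusion model priors for solving MAP problems, improving image quality, robustness, and computational efficiency.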

Details Motivation: Existing diffusion-based MAP and posterior sampling methods rely on computational approximations that lead to inaccurate or suboptimal results. Method: A dual ascent optimization framework combined with diffusion model priors. Result: Better performance on image restoration tasks, greater robustness to high noise levels, faster computation, and solutions more faithful to the observations. Conclusion: The new method outperforms existing techniques for solving MAP problems. Abstract: Ill-posed inverse problems are fundamental in many domains, ranging from astrophysics to medical imaging. Emerging diffusion models provide a powerful prior for solving these problems. Existing maximum-a-posteriori (MAP) or posterior sampling approaches, however, rely on different computational approximations, leading to inaccurate or suboptimal samples. To address this issue, we introduce a new approach to solving MAP problems with diffusion model priors using a dual ascent optimization framework. Our framework achieves better image quality as measured by various metrics for image restoration problems, it is more robust to high levels of measurement noise, it is faster, and it estimates solutions that represent the observations more faithfully than the state of the art.
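For readers unfamiliar with dual ascent, the following is a textbook sketch on a toy problem, not the paper's algorithm. It minimizes 0.5*||x - c||^2 subject to a single linear constraint a.x = b; the quadratic objective and the constraint are illustrative stand-ins for a prior term and a data-consistency term, and no diffusion model is involved.

```python
def dual_ascent(c, a, b, step=0.25, iters=60):
    """Minimize 0.5*||x - c||^2 subject to a.x = b by dual ascent."""
    lam = 0.0
    x = list(c)
    for _ in range(iters):
        # Primal step: minimize the Lagrangian in x for the current multiplier.
        x = [ci - lam * ai for ci, ai in zip(c, a)]
        # Dual step: gradient ascent on the multiplier using the constraint residual.
        residual = sum(ai * xi for ai, xi in zip(a, x)) - b
        lam += step * residual
    return x, lam


x_opt, lam_opt = dual_ascent(c=[3.0, 1.0], a=[1.0, 1.0], b=2.0)
```

The iterates approach x = (2, 0), the point on the line x1 + x2 = 2 closest to c. The step size must stay below 2/||a||^2 for this toy problem to converge.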

[22] Repurposing Marigold for Zero-Shot Metric Depth Estimation via Defocus Blur Cues

Chinmay Talegaonkar,Nikhil Gandudi Suresh,Zachary Novack,Yash Belhe,Priyanka Nagasamudra,Nicholas Antipa

Main category: cs.CV

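TL;DR: The paper injects defocus blur cues at inference time to turn the pre-trained Marigold diffusion model into a training-free metric depth predictor, significantly improving zero-shot monocular metric depth estimation.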

Details Motivation: Existing zero-shot monocular metric depth estimation methods suffer a significant performance drop on out-of-distribution datasets; the paper addresses this by introducing defocus blur cues. Method: Two images are captured from the same viewpoint with different aperture sizes, and a loss based on the defocus-blur image formation model is used to optimize Marigold's metric depth scaling parameters and noise latents. Result: On a self-collected real dataset, the method improves on existing zero-shot monocular metric depth estimation methods both quantitatively and qualitatively. Conclusion: The method turns Marigold into a metric depth predictor and significantly improves zero-shot generalization. Abstract: Recent monocular metric depth estimation (MMDE) methods have made notable progress towards zero-shot generalization. However, they still exhibit a significant performance drop on out-of-distribution datasets. We address this limitation by injecting defocus blur cues at inference time into Marigold, a pre-trained diffusion model for zero-shot, scale-invariant monocular depth estimation (MDE). Our method effectively turns Marigold into a metric depth predictor in a training-free manner. To incorporate defocus cues, we capture two images with a small and a large aperture from the same viewpoint. To recover metric depth, we then optimize the metric depth scaling parameters and the noise latents of Marigold at inference time using gradients from a loss function based on the defocus-blur image formation model. We compare our method against existing state-of-the-art zero-shot MMDE methods on a self-collected real dataset, showing quantitative and qualitative improvements.
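Why two apertures carry depth information can be seen from the standard thin-lens circle-of-confusion formula. The sketch below assumes that thin-lens model; the abstract does not state which image formation model the paper uses, and the function name and lens values are hypothetical.

```python
def blur_circle_diameter(focal_length, f_number, focus_dist, object_dist):
    """Thin-lens circle-of-confusion diameter, in the same units as the inputs."""
    aperture = focal_length / f_number  # aperture diameter
    return (aperture * abs(object_dist - focus_dist) / object_dist
            * focal_length / (focus_dist - focal_length))


# 50 mm lens focused at 2 m, object at 4 m (all lengths in millimetres).
wide = blur_circle_diameter(50.0, 2.0, 2000.0, 4000.0)    # large aperture, f/2
narrow = blur_circle_diameter(50.0, 8.0, 2000.0, 4000.0)  # small aperture, f/8
in_focus = blur_circle_diameter(50.0, 2.0, 2000.0, 2000.0)
```

Under this model the blur grows with aperture diameter and with distance from the focal plane, so the difference in blur between a wide-aperture and a narrow-aperture image depends on metric depth.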

[23] Are GNNs Worth the Effort for IoT Botnet Detection? A Comparative Study of VAE-GNN vs. ViT-MLP and VAE-MLP Approaches

Hassan Wasswa,Hussein Abbass,Timothy Lynar

Main category: cs.CV

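TL;DR: The paper evaluates four deep learning architectures for IoT botnet detection and finds that all models perform very well on binary classification (>99.93%), while the GNN models perform worse on multiclass classification.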

Details Motivation: With IoT botnet attacks rising exponentially, researchers have explored advanced techniques such as VAE, ViT, and GNN to strengthen IoT security. Method: Evaluates four architectures (VAE-MLP, VAE-GCN, VAE-GAT, and ViT-MLP) on the N-BaIoT dataset for binary and multiclass tasks. Result: All models exceed 99.93% on the binary task; on the multiclass task the GNN models (VAE-GCN and VAE-GAT) reach only 86.42% and 89.46% accuracy, while VAE-MLP and ViT-MLP reach 99.72% and 98.38%. Conclusion: GNN models underperform on the multiclass task, whereas VAE-MLP and ViT-MLP are better suited to complex classification. Abstract: Due to the exponential rise in IoT-based botnet attacks, researchers have explored various advanced techniques for both dimensionality reduction and attack detection to enhance IoT security. Among these, Variational Autoencoders (VAE), Vision Transformers (ViT), and Graph Neural Networks (GNN), including Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), have garnered significant research attention in the domain of attack detection. This study evaluates the effectiveness of four state-of-the-art deep learning architectures for IoT botnet detection: a VAE encoder with a Multi-Layer Perceptron (MLP), a VAE encoder with a GCN, a VAE encoder with a GAT, and a ViT encoder with an MLP. The evaluation is conducted on a widely studied IoT benchmark dataset, the N-BaIoT dataset, for both binary and multiclass tasks. For the binary classification task, all models achieved over 99.93% in accuracy, recall, precision, and F1-score, with no notable differences in performance. In contrast, for the multiclass classification task, GNN-based models showed significantly lower performance compared to VAE-MLP and ViT-MLP, with accuracies of 86.42%, 89.46%, 99.72%, and 98.38% for VAE-GCN, VAE-GAT, VAE-MLP, and ViT-MLP, respectively.

[24] Optimizing YOLOv8 for Parking Space Detection: Comparative Analysis of Custom YOLOv8 Architecture

Apar Pokhrel,Gia Dao

Main category: cs.CV

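TL;DR: This paper compares several customized backbones (such as ResNet-18 and VGG16) integrated with YOLOv8 for parking space occupancy detection, analyzing accuracy and computational efficiency.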

Details Motivation: Traditional object detection approaches such as YOLOv8 struggle with partially visible vehicles, small vehicles, and poor lighting, and need improvement. Method: Integrates different backbones (ResNet-18, VGG16, and others) into YOLOv8 and compares them experimentally on the PKLot dataset. Result: The experiments show each architecture's strengths and trade-offs, giving guidance for choosing a parking occupancy detection model. Conclusion: Customized backbones can improve YOLOv8's detection performance in challenging scenes. Abstract: Parking space occupancy detection is a critical component in the development of intelligent parking management systems. Traditional object detection approaches, such as YOLOv8, provide fast and accurate vehicle detection across parking lots but can struggle with borderline cases, such as partially visible vehicles, small vehicles (e.g., motorcycles), and poor lighting conditions. In this work, we perform a comprehensive comparative analysis of customized backbone architectures integrated with YOLOv8. Specifically, we evaluate various backbones -- ResNet-18, VGG16, EfficientNetV2, Ghost -- on the PKLot dataset in terms of detection accuracy and computational efficiency. Experimental results highlight each architecture's strengths and trade-offs, providing insight into selecting suitable models for parking occupancy.

[25] EVM-Fusion: An Explainable Vision Mamba Architecture with Neural Algorithmic Fusion

Zichuan Yang

Main category: cs.CV

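TL;DR: This paper proposes EVM-Fusion, an explainable Vision Mamba architecture whose neural algorithmic fusion mechanism improves the accuracy, interpretability, and generalizability of multi-organ medical image classification.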

Details Motivation: Medical image classification is critical for clinical decision-making, but accuracy, interpretability, and generalizability remain challenging. Method: EVM-Fusion uses a multipath design that combines DenseNet, U-Net, and Vision Mamba modules, and integrates features dynamically through two-stage fusion (cross-modal attention followed by a neural algorithmic fusion block). Result: On a 9-class multi-organ medical image dataset, EVM-Fusion reaches 99.75% test accuracy and provides multi-faceted explanations of its decisions. Conclusion: EVM-Fusion shows potential for trustworthy AI in medical diagnostics. Abstract: Medical image classification is critical for clinical decision-making, yet demands for accuracy, interpretability, and generalizability remain challenging. This paper introduces EVM-Fusion, an Explainable Vision Mamba architecture featuring a novel Neural Algorithmic Fusion (NAF) mechanism for multi-organ medical image classification. EVM-Fusion leverages a multipath design, where DenseNet and U-Net based pathways, enhanced by Vision Mamba (Vim) modules, operate in parallel with a traditional feature pathway. These diverse features are dynamically integrated via a two-stage fusion process: cross-modal attention followed by the iterative NAF block, which learns an adaptive fusion algorithm. Intrinsic explainability is embedded through path-specific spatial attention, Vim Δ-value maps, traditional feature SE-attention, and cross-modal attention weights. Experiments on a diverse 9-class multi-organ medical image dataset demonstrate EVM-Fusion's strong classification performance, achieving 99.75% test accuracy and providing multi-faceted insights into its decision-making process, highlighting its potential for trustworthy AI in medical diagnostics.

[26] Dual-sensing driving detection model

Leon C. C. K,Zeng Hui

Main category: cs.CV

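TL;DR: Proposes a dual-sensing driver fatigue detection method that combines computer vision with physiological signal analysis, overcoming the limits of single-modality methods and using fusion strategies for efficient, reliable detection.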

Details Motivation: Existing single-modality fatigue detection methods are limited, and a more reliable and efficient solution is needed to reduce fatigue-related accidents. Method: Combines real-time facial feature analysis with physiological signal processing through advanced fusion strategies, in a system designed to run efficiently. Result: The method outperforms traditional methods in both controlled and real-world environments while maintaining high accuracy and reliability, supporting its practical potential. Conclusion: The method offers a more reliable, cost-effective, and humane solution for driver fatigue detection, with broad application prospects. Abstract: In this paper, a novel dual-sensing driver fatigue detection method combining computer vision and physiological signal analysis is proposed. The system exploits the complementary advantages of the two sensing modalities and breaks through the limitations of existing single-modality methods. We introduce an innovative architecture that combines real-time facial feature analysis with physiological signal processing, combined with advanced fusion strategies, for robust fatigue detection. The system is designed to run efficiently on existing hardware while maintaining high accuracy and reliability. Through comprehensive experiments, we demonstrate that our method outperforms traditional methods in both controlled environments and real-world conditions, while maintaining high accuracy. The practical applicability of the system has been verified through extensive tests in various driving scenarios and shows great potential in reducing fatigue-related accidents. This study contributes to the field by providing a more reliable, cost-effective, and humane solution for driver fatigue detection.

[27] Wildfire Detection Using Vision Transformer with the Wildfire Dataset

Gowtham Raj Vuppari,Navarun Gupta,Ahmed El-Sayed,Xingguo Xiong

Main category: cs.CV

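TL;DR: The paper explores using Vision Transformers (ViTs) to improve the accuracy of early wildfire detection, while noting challenges in data quality, computational cost, and real-time integration.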

Details Motivation: Wildfires are frequent in the US, especially in California, and cause severe losses, so effective detection techniques are urgently needed to limit their impact. Method: A ViT model is trained on a 10.74 GB dataset of high-resolution images; preprocessing resizes the images, converts them to tensors, and normalizes them. Result: ViT models can process complex image data with high accuracy, but limited data coverage and high computational cost remain problems. Conclusion: ViTs are promising for improving wildfire detection, but real-time data availability and system integration still need to be addressed. Abstract: The critical need for sophisticated detection techniques has been highlighted by the rising frequency and intensity of wildfires in the US, especially in California. In 2023, wildfires caused 130 deaths nationwide, the highest since 1990. In January 2025, the Los Angeles wildfires, which included the Palisades and Eaton fires, burnt approximately 40,000 acres and 12,000 buildings, and caused loss of human lives. The devastation underscores the urgent need for effective detection and prevention strategies. Deep learning models, such as Vision Transformers (ViTs), can enhance early detection by processing complex image data with high accuracy. However, wildfire detection faces challenges, including the availability of high-quality, real-time data. Wildfires often occur in remote areas with limited sensor coverage, and environmental factors like smoke and cloud cover can hinder detection. Additionally, training deep learning models is computationally expensive, and issues like false positives/negatives and scaling remain concerns. Integrating detection systems with real-time alert mechanisms also poses difficulties. In this work, the wildfire dataset, consisting of 10.74 GB of high-resolution images categorized into 'fire' and 'nofire' classes, is used for training the ViT model. To prepare the data, images are resized to 224 x 224 pixels, converted into tensor format, and normalized using ImageNet statistics.
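The preprocessing steps named in the abstract (resize to 224 x 224, convert to a tensor, normalize with ImageNet statistics) can be sketched in dependency-free Python. The mean and standard deviation are the commonly used ImageNet values; the nearest-neighbour resize and the nested-list "tensor" are simplifications chosen here so the example runs without an imaging library, and are not necessarily what the authors used.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resize_nearest(image, size=224):
    """Nearest-neighbour resize of an H x W x 3 nested list to size x size."""
    h, w = len(image), len(image[0])
    return [
        [image[r * h // size][c * w // size] for c in range(size)]
        for r in range(size)
    ]


def to_normalized_chw(image):
    """Scale 0-255 values to [0, 1], normalize per channel, return C x H x W."""
    return [
        [[(px[ch] / 255.0 - IMAGENET_MEAN[ch]) / IMAGENET_STD[ch] for px in row]
         for row in image]
        for ch in range(3)
    ]


# A toy 200 x 300 RGB image filled with a single colour.
raw = [[(255, 0, 128)] * 300 for _ in range(200)]
tensor = to_normalized_chw(resize_nearest(raw))
```

In practice the same pipeline is usually expressed with an image library's resize, tensor-conversion, and normalize transforms; the arithmetic per pixel is the same as above.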

[28] Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Shuang Wu,Youtian Lin,Feihu Zhang,Yifei Zeng,Yikang Yang,Yajie Bao,Jiachen Qian,Siyu Zhu,Philip Torr,Xun Cao,Yao Yao

Main category: cs.CV

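TL;DR: Direct3D S2 is a scalable 3D generation framework based on sparse volumes; its Spatial Sparse Attention mechanism greatly improves efficiency while delivering high-quality output.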

Details Motivation: To address the computational and memory challenges of generating high-resolution 3D shapes with volumetric representations such as SDFs. Method: Introduces a Spatial Sparse Attention (SSA) mechanism that makes Diffusion Transformer computation on sparse volumetric data more efficient, combined with a unified sparse-volume variational autoencoder design. Result: Achieves a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass, and supports training at 1024 resolution with only 8 GPUs. Conclusion: Direct3D S2 surpasses existing methods in generation quality and efficiency, making large-scale 3D generation more practical and accessible. Abstract: Generating high resolution 3D shapes using volumetric representations such as Signed Distance Functions presents substantial computational and memory challenges. We introduce Direct3D S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention mechanism, which greatly enhances the efficiency of Diffusion Transformer computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, significantly reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://nju3dv.github.io/projects/Direct3D-S2/.

[29] VIBE: Video-to-Text Information Bottleneck Evaluation for TL;DR

Shenghui Chen,Po-han Li,Sandeep Chichali,Ufuk Topcu

Main category: cs.CV

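TL;DR: VIBE is an annotation-free method that scores vision-language model (VLM) summaries by how well they align with the visual content (grounding) and how useful they are for the task (utility), significantly improving accuracy and efficiency in decision-making tasks.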

Details Motivation: Summaries produced by current vision-language models are verbose and redundant, and existing evaluation overlooks their utility for downstream tasks, which makes decision-making inefficient. Method: Proposes VIBE, which scores VLM outputs on two metrics, grounding and utility, and selects the best summary among them. Result: Experiments show that VIBE-selected summaries significantly improve task accuracy (by up to 61.23%) and reduce response time (by 75.77%). Conclusion: VIBE provides an efficient, annotation-free way to evaluate summaries for decision-making tasks and significantly improves task performance. Abstract: Many decision-making tasks, where both accuracy and efficiency matter, still require human supervision. For example, tasks like traffic officers reviewing hour-long dashcam footage or researchers screening conference videos can benefit from concise summaries that reduce cognitive load and save time. Yet current vision-language models (VLMs) often produce verbose, redundant outputs that hinder task performance. Existing video caption evaluation depends on costly human annotations and overlooks the summaries' utility in downstream tasks. We address these gaps with Video-to-text Information Bottleneck Evaluation (VIBE), an annotation-free method that scores VLM outputs using two metrics: grounding (how well the summary aligns with visual content) and utility (how informative it is for the task). VIBE selects from randomly sampled VLM outputs by ranking them according to the two scores to support effective human decision-making. Human studies on LearningPaper24, SUTD-TrafficQA, and LongVideoBench show that summaries selected by VIBE consistently improve performance, boosting task accuracy by up to 61.23% and reducing response time by 75.77% compared to naive VLM summaries or raw video.
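Selecting one summary from several sampled outputs by ranking on two scores can be sketched as follows. The abstract does not say how the grounding and utility rankings are combined, so the rank-sum rule here is an assumption made purely for illustration, as are the function names and the toy scores.

```python
def rank_positions(values):
    """Map each index to its rank, where rank 0 is the highest value."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return {idx: rank for rank, idx in enumerate(order)}


def select_summary(candidates):
    """Pick the text with the lowest summed rank over grounding and utility.

    candidates: list of (text, grounding_score, utility_score) tuples.
    """
    g_rank = rank_positions([c[1] for c in candidates])
    u_rank = rank_positions([c[2] for c in candidates])
    best = min(range(len(candidates)), key=lambda i: (g_rank[i] + u_rank[i], i))
    return candidates[best][0]


samples = [
    ("summary A", 0.90, 0.2),
    ("summary B", 0.95, 0.8),
    ("summary C", 0.40, 0.9),
]
chosen = select_summary(samples)
```

Here "summary B" is chosen because it ranks first on grounding and second on utility, giving the lowest combined rank.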

[30] Debiasing CLIP: Interpreting and Correcting Bias in Attention Heads

Wei Jie Yeo,Rui Mao,Moloud Abdar,Erik Cambria,Ranjan Satapathy

Main category: cs.CV

TL;DR: The paper proposes LTC, a framework that locates and corrects spurious attention heads in CLIP to improve its zero-shot performance.

Details Motivation: CLIP performs well on zero-shot tasks but can learn spurious associations between target variables and confounding factors, which hurts performance. Method: LTC uses a contrastive mechanism to identify spurious attention heads and integrates task-relevant features through orthogonal projection. Result: On benchmarks with background and gender biases, LTC improves worst-group accuracy by more than 50%. Conclusion: LTC identifies and corrects spurious attention heads while strengthening task-relevant features, improving model performance. Abstract: Multimodal models like CLIP have gained significant attention due to their remarkable zero-shot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce Locate-Then-Correct (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving over a 50% gain in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representation of selected heads and find that the presented interpretation corroborates our contrastive mechanism for identifying both spurious and salient attention heads. Code available at https://github.com/wj210/CLIP_LTC.
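Worst-group accuracy, the metric reported above, is the lowest accuracy over all groups in the evaluation set. A minimal sketch follows; the group names and predictions are made-up toy data, not results from the paper.

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Return (worst accuracy, per-group accuracy dict)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


preds = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 1, 1, 0]
groups = ["land", "land", "land", "water", "water", "water"]
worst, per_group = worst_group_accuracy(preds, labels, groups)
```

In this toy data overall accuracy is 5/6, but the "water" group reaches only 2/3, which is what the worst-group metric reports. A model that relies on a spurious cue can score well on average while failing on one group.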

[31] Learning Generalized and Flexible Trajectory Models from Omni-Semantic Supervision

Yuanshao Zhu,James Jianqiao Yu,Xiangyu Zhao,Xiao Han,Qidong Liu,Xuetao Wei,Yuxuan Liang

Main category: cs.CV

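TL;DR: OmniTraj is a multimodal trajectory retrieval framework that addresses the limitations of traditional methods in large-scale data, condition-based queries, and trajectory similarity measures.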

Details Motivation: The spread of mobile devices and data collection technologies has caused trajectory data to surge, and existing retrieval methods fall short in large-scale processing, support for condition-based queries, and multimodal fusion. Method: OmniTraj uses dedicated encoders for four modalities (raw trajectories, topology, road segments, and regions) and embeds them into a shared representation space to support flexible queries. Result: Experiments on two real-world datasets show that OmniTraj handles large-scale data efficiently, supports multimodal queries, and is applicable to downstream tasks. Conclusion: Through multimodal fusion and flexible query design, OmniTraj significantly improves the efficiency and accuracy of trajectory retrieval. Abstract: The widespread adoption of mobile devices and data collection technologies has led to an exponential increase in trajectory data, presenting significant challenges in spatio-temporal data mining, particularly for efficient and accurate trajectory retrieval. However, existing methods for trajectory retrieval face notable limitations, including inefficiencies in large-scale data, lack of support for condition-based queries, and reliance on trajectory similarity measures. To address the above challenges, we propose OmniTraj, a generalized and flexible omni-semantic trajectory retrieval framework that integrates four complementary modalities or semantics -- raw trajectories, topology, road segments, and regions -- into a unified system. Unlike traditional approaches that are limited to computing and processing trajectories as a single modality, OmniTraj designs dedicated encoders for each modality, which are embedded and fused into a shared representation space. This design enables OmniTraj to support accurate and flexible queries based on any individual modality or combination thereof, overcoming the rigidity of traditional similarity-based methods. Extensive experiments on two real-world datasets demonstrate the effectiveness of OmniTraj in handling large-scale data, providing flexible, multi-modality queries, and supporting downstream tasks and applications.

[32] VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models

Hefei Mei,Zirui Wang,Shen You,Minjing Dong,Chang Xu

Main category: cs.CV

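TL;DR: VEAttack is a vision encoder attack on large vision-language models (LVLMs) that generates adversarial examples by minimizing the cosine similarity between clean and perturbed visual features, without access to the downstream large language model or task information, which significantly reduces computational overhead.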

Details Motivation: Existing attacks mostly target task-specific white-box settings, which do not suit the diverse tasks and high computational cost of LVLMs, so the paper proposes an attack on the vision encoder alone. Method: Generates adversarial examples by optimizing image tokens rather than the classification token, without task information or labels, reducing computational overhead. Result: Performance degrades by 94.5% on image captioning and 75.7% on visual question answering, and the attack generalizes to a variety of tasks. Conclusion: VEAttack is efficient and general, and reveals key observations for LVLM attack and defense, such as hidden layer variations in the LLM and token attention differences. Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation, yet their vulnerability to adversarial attacks raises significant robustness concerns. While existing effective attacks always focus on task-specific white-box settings, these approaches are limited in the context of LVLMs, which are designed for diverse downstream tasks and require expensive full-model gradient computations. Motivated by the pivotal role and wide adoption of the vision encoder in LVLMs, we propose a simple yet effective Vision Encoder Attack (VEAttack), which targets the vision encoder of LVLMs only. Specifically, we propose to generate adversarial examples by minimizing the cosine similarity between the clean and perturbed visual features, without accessing the following large language models, task information, and labels. It significantly reduces the computational overhead while eliminating the task and label dependence of traditional white-box attacks in LVLMs. To make this simple attack effective, we propose to perturb images by optimizing image tokens instead of the classification token. We provide both empirical and theoretical evidence that VEAttack can easily generalize to various tasks. VEAttack has achieved a performance degradation of 94.5% on image caption task and 75.7% on visual question answering task. We also reveal some key observations to provide insights into LVLM attack/defense: 1) hidden layer variations of LLM, 2) token attention differential, 3) Möbius band in transfer attack, 4) low sensitivity to attack steps. The code is available at https://github.com/hfmei/VEAttack-LVLM
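The core objective, lowering the cosine similarity between clean and perturbed features while keeping the perturbation small, can be illustrated on a toy problem. This sketch assumes a 2 x 2 linear map as a stand-in "encoder", a hand-derived gradient, and signed-gradient steps inside an L-infinity ball; none of these details come from the paper, which attacks a real vision encoder through its image tokens.

```python
import math


def encode(W, x):
    """Toy linear 'encoder': features = W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_grad_wrt_input(W, x_adv, clean_feat):
    """Gradient of cos(encode(x_adv), clean_feat) with respect to x_adv."""
    f = encode(W, x_adv)
    nf = math.sqrt(sum(a * a for a in f))
    ng = math.sqrt(sum(b * b for b in clean_feat))
    dot = sum(a * b for a, b in zip(f, clean_feat))
    grad_f = [g / (nf * ng) - dot * a / (nf ** 3 * ng) for a, g in zip(f, clean_feat)]
    # Chain rule through the linear encoder: W^T @ grad_f.
    return [sum(W[i][j] * grad_f[i] for i in range(len(W))) for j in range(len(x_adv))]


def attack(W, x, eps=0.5, alpha=0.1, steps=10, init_offset=0.1):
    """Signed-gradient descent on cosine similarity inside an L-inf ball of radius eps."""
    clean = encode(W, x)
    # Start slightly off the clean input: the gradient is exactly zero at x itself.
    x_adv = [xi + init_offset * (i % 2) for i, xi in enumerate(x)]
    for _ in range(steps):
        grad = cosine_grad_wrt_input(W, x_adv, clean)
        stepped = [xa - alpha * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Project back into the allowed perturbation range around x.
        x_adv = [min(max(s, xi - eps), xi + eps) for s, xi in zip(stepped, x)]
    return x_adv, cosine(encode(W, x_adv), clean)


W = [[2.0, 0.0], [0.0, 1.0]]
x_clean = [1.0, 0.0]
x_adv, final_cos = attack(W, x_clean)
```

The attack needs only the encoder and the clean input: no labels, task prompt, or language model appear anywhere in the loop, which is the property the abstract emphasizes.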

[33] Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds

Hao Jing,Anhong Wang,Yifan Zhang,Donghan Bu,Junhui Hou

Main category: cs.CV

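TL;DR: This paper proposes a 3D object detection framework based on reflectance prediction knowledge distillation (RPKD): point coordinates are compressed and reflectance is discarded, then a geometry-based reflectance prediction module reconstructs the reflectance to improve detection accuracy under low-bitrate transmission.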

Details Motivation: In low-bitrate transmission, existing compression transmission systems carry the burden of encoding reflectance and lose detection robustness because of information loss. Method: Proposes the RPKD framework, consisting of a student detector (compressed point cloud input), a reflectance prediction module, and a teacher detector (raw point cloud input), trained jointly through knowledge distillation. Result: On the KITTI and Waymo datasets, RPKD significantly improves detection accuracy at low code rates, for example reaching an mAP of 73.6 at 2.146 Bpp on KITTI. Conclusion: The RPKD framework addresses detection robustness under low-bitrate transmission and significantly outperforms existing methods. Abstract: Regarding intelligent transportation systems for vehicle networking, low-bitrate transmission via lossy point cloud compression is vital for facilitating real-time collaborative perception among vehicles with restricted bandwidth. In existing compression transmission systems, the sender lossily compresses point coordinates and reflectance to generate a transmission code stream, which faces transmission burdens from reflectance encoding and limited detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded non-reflectance compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as student detector is designed for performing reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our RPKD framework jointly trains detectors on both raw and compressed point clouds to improve the student detector's robustness. Experimental results on the KITTI dataset and Waymo Open Dataset demonstrate that our method can boost detection accuracy for compressed point clouds across multiple code rates. Notably, at a low code rate of 2.146 Bpp on the KITTI dataset, our RPKD-PV achieves the highest mAP of 73.6, outperforming existing detection methods with the PV-RCNN baseline.

[34] PawPrint: Whose Footprints Are These? Identifying Animal Individuals by Their Footprints

Inpyo Song,Hyemin Hwang,Jangwon Lee

Main category: cs.CV

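TL;DR: The paper introduces the PawPrint and PawPrint+ datasets for identifying individual cats and dogs by their footprints, analyzes the strengths and weaknesses of different methods, and suggests combining global and local features to improve reliability.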

Details Motivation: Pet ownership in the US keeps growing, but traditional identification methods have limitations (GPS tags can be removed, signals can fail, and so on), so more effective non-invasive identification is needed. Method: Benchmarks modern deep neural networks (such as CNNs and Transformers) and classical local feature methods on the PawPrint and PawPrint+ datasets. Result: The methods show different strengths and weaknesses depending on substrate complexity and data availability. Conclusion: Combining global and local features is a promising way to improve reliability, with future applications in pet management and wildlife conservation. Abstract: In the United States, as of 2023, pet ownership has reached 66% of households and continues to rise annually. This trend underscores the critical need for effective pet identification and monitoring methods, particularly as nearly 10 million cats and dogs are reported stolen or lost each year. However, traditional methods for finding lost animals like GPS tags or ID photos have limitations: they can be removed, face signal issues, and depend on someone finding and reporting the pet. To address these limitations, we introduce PawPrint and PawPrint+, the first publicly available datasets focused on individual-level footprint identification for dogs and cats. Through comprehensive benchmarking of both modern deep neural networks (e.g., CNN, Transformers) and classical local features, we observe varying advantages and drawbacks depending on substrate complexity and data availability. These insights suggest future directions for combining learned global representations with local descriptors to enhance reliability across diverse, real-world conditions. As this approach provides a non-invasive alternative to traditional ID tags, we anticipate promising applications in ethical pet management and wildlife conservation efforts.

[35] Real-time Traffic Accident Anticipation with Feature Reuse

Inpyo Song,Jangwon Lee

Main category: cs.CV

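TL;DR: RARE is a lightweight framework that reuses intermediate features from a pre-trained object detector to cut latency substantially, and introduces an Attention Score Ranking Loss that improves accuracy and interpretability.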

Details Motivation: Anticipating traffic accidents in real time is crucial for safe autonomous driving, but existing methods depend on computationally heavy modules that are hard to deploy. Method: RARE reuses intermediate features from a pre-trained object detector, avoiding extra feature extraction, and introduces an Attention Score Ranking Loss. Result: On the DAD and CCD benchmarks, RARE is 4-8 times faster, with a latency of 13.6 ms per frame, while attaining state-of-the-art Average Precision. Conclusion: RARE performs well in both speed and accuracy and is suited to safety-critical applications. Abstract: This paper addresses the problem of anticipating traffic accidents, which aims to forecast potential accidents before they happen. Real-time anticipation is crucial for safe autonomous driving, yet most methods rely on computationally heavy modules like optical flow and intermediate feature extractors, making real-world deployment challenging. In this paper, we thus introduce RARE (Real-time Accident anticipation with Reused Embeddings), a lightweight framework that capitalizes on intermediate features from a single pre-trained object detector. By eliminating additional feature-extraction pipelines, RARE significantly reduces latency. Furthermore, we introduce a novel Attention Score Ranking Loss, which prioritizes higher attention on accident-related objects over non-relevant ones. This loss enhances both accuracy and interpretability. RARE demonstrates a 4-8 times speedup over existing approaches on the DAD and CCD benchmarks, achieving a latency of 13.6ms per frame (73.3 FPS) on an RTX 6000. Moreover, despite its reduced complexity, it attains state-of-the-art Average Precision and reliably anticipates imminent collisions in real time. These results highlight RARE's potential for safety-critical applications where timely and explainable anticipation is essential.
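A loss that pushes attention on accident-related objects above attention on other objects can take the form of a pairwise margin (hinge) ranking loss. The sketch below assumes that common form; the abstract does not give the actual definition of the Attention Score Ranking Loss, and the margin value and scores are hypothetical.

```python
def attention_ranking_loss(scores, is_accident_related, margin=0.3):
    """Mean hinge penalty over all (related, unrelated) object pairs.

    A pair is penalized when the related object's attention score does not
    exceed the unrelated object's score by at least `margin`.
    """
    positives = [s for s, rel in zip(scores, is_accident_related) if rel]
    negatives = [s for s, rel in zip(scores, is_accident_related) if not rel]
    pairs = [(p, n) for p in positives for n in negatives]
    if not pairs:
        return 0.0
    return sum(max(0.0, margin - (p - n)) for p, n in pairs) / len(pairs)


# First object is accident-related in both cases.
good = attention_ranking_loss([0.7, 0.2, 0.1], [True, False, False])
bad = attention_ranking_loss([0.3, 0.5, 0.2], [True, False, False])
```

The loss is zero when the accident-related object already dominates the attention distribution, and grows when an unrelated object receives comparable or higher attention.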

[36] Graph Mamba for Efficient Whole Slide Image Understanding

Jiaxuan Lu,Junyan Shi,Yuhui Lin,Fang Yan,Yue Gao,Shaoting Zhang,Xiaosong Wang

Main category: cs.CV

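TL;DR: The WSI-GMamba framework combines the relational modeling ability of GNNs with the efficiency of Mamba to address scalability and computational cost in WSI analysis.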

Details Motivation: The high resolution and complexity of WSIs pose scalability and computational cost challenges for existing MIL methods such as GNNs and Transformers. Method: Proposes the WSI-GMamba framework, which combines GNNs and Mamba and aggregates features efficiently through the GMamba block (message passing, graph scanning and flattening, and a bidirectional state space model). Result: WSI-GMamba reaches Transformer-level performance with 7× fewer FLOPs, offering both high accuracy and computational efficiency. Conclusion: WSI-GMamba provides a scalable, efficient solution for large-scale WSI analysis. Abstract: Whole Slide Images (WSIs) in histopathology present a significant challenge for large-scale medical image analysis due to their high resolution, large size, and complex tile relationships. Existing Multiple Instance Learning (MIL) methods, such as Graph Neural Networks (GNNs) and Transformer-based models, face limitations in scalability and computational cost. To bridge this gap, we propose the WSI-GMamba framework, which synergistically combines the relational modeling strengths of GNNs with the efficiency of Mamba, the State Space Model designed for sequence learning. The proposed GMamba block integrates Message Passing, Graph Scanning & Flattening, and feature aggregation via a Bidirectional State Space Model (Bi-SSM), achieving Transformer-level performance with 7× fewer FLOPs. By leveraging the complementary strengths of lightweight GNNs and Mamba, the WSI-GMamba framework delivers a scalable solution for large-scale WSI analysis, offering both high accuracy and computational efficiency for slide-level classification.

[37] Diagnosing Vision Language Models' Perception by Leveraging Human Methods for Color Vision Deficiencies

Kazuki Hayashi,Shintaro Ozaki,Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe

Main category: cs.CV

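TL;DR: LVLMs can explain color vision deficiencies (CVDs) but cannot simulate how people with CVDs perceive color in image-based tasks; perceptual diversity needs more attention.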

Details Motivation: To study how LVLMs handle diversity in color perception, especially the effect of color vision deficiencies (CVDs), in order to advance perceptual inclusiveness and fairness in multimodal AI. Method: Uses the Ishihara Test to evaluate how well LVLMs account for individual perceptual variation. Result: LVLMs can explain CVDs in natural language but cannot simulate how people with CVDs perceive color in image-based tasks. Conclusion: Multimodal systems need to improve their support for color perceptual diversity to promote perceptual inclusiveness and fairness. Abstract: Large-scale Vision Language Models (LVLMs) are increasingly being applied to a wide range of real-world multimodal applications, involving complex visual and linguistic reasoning. As these models become more integrated into practical use, they are expected to handle complex aspects of human interaction. Among these, color perception is a fundamental yet highly variable aspect of visual understanding. It differs across individuals due to biological factors such as Color Vision Deficiencies (CVDs), as well as differences in culture and language. Despite its importance, perceptual diversity has received limited attention. In our study, we evaluate LVLMs' ability to account for individual level perceptual variation using the Ishihara Test, a widely used method for detecting CVDs. Our results show that LVLMs can explain CVDs in natural language, but they cannot simulate how people with CVDs perceive color in image based tasks. These findings highlight the need for multimodal systems that can account for color perceptual diversity and support broader discussions on perceptual inclusiveness and fairness in multimodal AI.

[38] OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics

Jiangning Zhu,Yuxing Zhou,Zheng Wang,Juntao Yao,Yima Gu,Yuhui Yuan,Shixia Liu

Main category: cs.CV

TL;DR: OrionBench is a new benchmark intended to improve how well vision-language models (VLMs) detect charts and human-recognizable objects (HROs); it contains a large number of real and synthetic infographics with annotations.

Details Motivation: Existing VLMs ground charts and HROs inaccurately, while chart understanding requires identifying these elements and reasoning over them. Method: Combines model-in-the-loop and programmatic methods to build OrionBench, which contains 26,250 real and 78,750 synthetic infographics annotated with over 6.9 million bounding boxes. Result: OrionBench proves useful in three applications: improving the chart understanding of VLMs, comparing existing object detection models, and applying the developed detector to document layout and UI element detection. Conclusion: OrionBench provides important support for developing more accurate detection models for charts and HROs. Abstract: Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce OrionBench, a benchmark designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 26,250 real and 78,750 synthetic infographics, with over 6.9 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of OrionBench through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.

[39] PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation

Uyoung Jeong,Jonathan Freer,Seungryul Baek,Hyung Jin Chang,Kwang In Kim

Main category: cs.CV

TL;DR: PoseBH is a multi-dataset training framework that addresses skeletal heterogeneity in pose estimation through nonparametric keypoint prototypes and a cross-type self-supervision mechanism.

Details Motivation: Existing methods do not address skeletal heterogeneity, and traditional multi-dataset training approaches are hard to apply to pose estimation. Method: Proposes nonparametric keypoint prototypes and a cross-type self-supervision mechanism that integrate skeleton types seamlessly. Result: Substantially improves generalization on datasets such as COCO-WholeBody and AP-10K while preserving performance on standard benchmarks. Conclusion: PoseBH handles skeletal heterogeneity effectively and transfers well to other tasks. Abstract: We study multi-dataset training (MDT) for pose estimation, where skeletal heterogeneity presents a unique challenge that existing methods have yet to address. In traditional domains, e.g., regression and classification, MDT typically relies on dataset merging or multi-head supervision. However, the diversity of skeleton types and limited cross-dataset supervision complicate integration in pose estimation. To address these challenges, we introduce PoseBH, a new MDT framework that tackles keypoint heterogeneity and limited supervision through two key techniques. First, we propose nonparametric keypoint prototypes that learn within a unified embedding space, enabling seamless integration across skeleton types. Second, we develop a cross-type self-supervision mechanism that aligns keypoint predictions with keypoint embedding prototypes, providing supervision without relying on teacher-student models or additional augmentations. PoseBH substantially improves generalization across whole-body and animal pose datasets, including COCO-WholeBody, AP-10K, and APT-36K, while preserving performance on standard human pose benchmarks (COCO, MPII, and AIC). Furthermore, our learned keypoint embeddings transfer effectively to hand shape estimation (InterHand2.6M) and human body shape estimation (3DPW). The code for PoseBH is available at: https://github.com/uyoung-jeong/PoseBH.

[40] The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts

Yuchen Zhang,Yaxiong Wang,Yujiao Wu,Lianwei Wu,Li Zhu

Main category: cs.CV

TL;DR: The paper proposes an adversarial pipeline that uses multimodal large language models (MLLMs) to generate high-risk disinformation, and builds the MDSM dataset and the AMD framework to detect MLLM-driven multimodal deception.

Details Motivation: Existing methods underestimate the risk of MLLM-driven deception and rely on unrealistic, semantically inconsistent content, so they cope poorly with dynamically generated disinformation. Method: Builds the MDSM dataset, in which MLLMs generate deceptive text that is semantically consistent with manipulated images, and proposes the AMD framework, which detects such content using Artifact Pre-perception Encoding and Manipulation-Oriented Reasoning. Result: Experiments show that AMD generalizes better at detecting MLLM-driven multimodal deception. Conclusion: The approach offers a unified and effective solution for countering MLLM-driven disinformation. Abstract: The detection and grounding of multimedia manipulation has emerged as a critical challenge in combating AI-generated disinformation. While existing methods have made progress in recent years, we identify two fundamental limitations in current approaches: (1) Underestimation of MLLM-driven deception risk: prevailing techniques primarily address rule-based text manipulations, yet fail to account for sophisticated misinformation synthesized by multimodal large language models (MLLMs) that can dynamically generate semantically coherent, contextually plausible yet deceptive narratives conditioned on manipulated images; (2) Unrealistic misalignment artifacts: currently focused scenarios rely on artificially misaligned content that lacks semantic coherence, rendering them easily detectable. To address these gaps holistically, we propose a new adversarial pipeline that leverages MLLMs to generate high-risk disinformation. Our approach begins with constructing the MLLM-Driven Synthetic Multimodal (MDSM) dataset, where images are first altered using state-of-the-art editing techniques and then paired with MLLM-generated deceptive texts that maintain semantic consistency with the visual manipulations. Building upon this foundation, we present the Artifact-aware Manipulation Diagnosis via MLLM (AMD) framework featuring two key innovations: Artifact Pre-perception Encoding strategy and Manipulation-Oriented Reasoning, to tame MLLMs for the MDSM problem. Comprehensive experiments validate our framework's superior generalization capabilities as a unified architecture for detecting MLLM-powered multimodal deceptions.

[41] Research on Defect Detection Method of Motor Control Board Based on Image Processing

Jingde Huang,Zhangyu Huang,Chenyu Li,Jiantong Liu

Main category: cs.CV

TL;DR: The paper proposes an image-processing-based defect detection method for motor control boards that achieves efficient, high-accuracy detection through noise suppression, feature extraction, and an optimized search algorithm.

Details Motivation: Motor control boards suffer from defects such as color differences, incorrect plug-in positions, and solder short circuits, which directly affect product quality; studying defect detection is an important way to improve quality control. Method: Studies digital image processing for motor control boards and analyzes noise suppression techniques; establishes a model for defect feature extraction and color difference recognition; optimizes the search algorithm for defective images. Result: Experiments show a detection accuracy above 99%, and the method suits real-time processing of large batches of images on the production line. Conclusion: The method can be used for online defect detection of motor control boards and also offers a solution for defect handling on integrated circuit boards. Abstract: The motor control board has various defects such as inconsistent color differences, incorrect plug-in positions, solder short circuits, and more. These defects directly affect the performance and stability of the motor control board, thereby having a negative impact on product quality. Therefore, studying the defect detection technology of the motor control board is an important means to improve the quality control level of the motor control board. Firstly, the processing methods for digital images of the motor control board were studied, and the noise suppression methods that affect image feature extraction were analyzed. Secondly, a specific model for defect feature extraction and color difference recognition of the tested motor control board was established, and qualified or defective products were determined based on feature thresholds. Thirdly, the search algorithm for defective images was optimized. Finally, comparative experiments were conducted on a typical motor control board, and the experimental results demonstrate that the accuracy of the image-processing-based defect detection model established in this paper reached over 99%. It is suitable for timely image processing of large quantities of motor control boards on the production line, and achieves efficient defect detection. The defect detection method can not only be used for online detection of motor control board defects, but also provides solutions for integrated circuit board defect processing in the industry.

[42] RoHyDR: Robust Hybrid Diffusion Recovery for Incomplete Multimodal Emotion Recognition

Yuehan Jin,Xiaoqing Liu,Yiyuan Yang,Zhiwen Yu,Tong Zhang,Kaixiang Yang

Main category: cs.CV

TL;DR: Proposes RoHyDR, a new framework that recovers missing-modality data in multimodal emotion recognition through hybrid diffusion and adversarial learning, significantly improving performance.

Details Motivation: To address the performance drop in multimodal emotion recognition caused by missing or corrupted data. Method: Combines a diffusion-based generator with adversarial learning to recover missing data at the unimodal and multimodal levels, and adopts a multi-stage optimization strategy. Result: Outperforms existing methods on two widely used benchmarks and shows robust performance. Conclusion: RoHyDR effectively handles missing data in multimodal emotion recognition and improves recognition performance. Abstract: Multimodal emotion recognition analyzes emotions by combining data from multiple sources. However, real-world noise or sensor failures often cause missing or corrupted data, creating the Incomplete Multimodal Emotion Recognition (IMER) challenge. In this paper, we propose Robust Hybrid Diffusion Recovery (RoHyDR), a novel framework that performs missing-modality recovery at unimodal, multimodal, feature, and semantic levels. For unimodal representation recovery of missing modalities, RoHyDR exploits a diffusion-based generator to generate distribution-consistent and semantically aligned representations from Gaussian noise, using available modalities as conditioning. For multimodal fusion recovery, we introduce adversarial learning to produce a realistic fused multimodal representation and recover missing semantic content. We further propose a multi-stage optimization strategy that enhances training stability and efficiency. In contrast to previous work, the hybrid diffusion and adversarial learning-based recovery mechanism in RoHyDR allows recovery of missing information in both unimodal representation and multimodal fusion, at both feature and semantic levels, effectively mitigating performance degradation caused by suboptimal optimization. Comprehensive experiments conducted on two widely used multimodal emotion recognition benchmarks demonstrate that our proposed method outperforms state-of-the-art IMER methods, achieving robust recognition performance under various missing-modality scenarios. Our code will be made publicly available upon acceptance.

[43] Enhancing Adversarial Robustness of Vision Language Models via Adversarial Mixture Prompt Tuning

Shiji Zhao,Qihui Zhu,Shukun Xiong,Shouwei Ruan,Yize Fan,Ranjie Duan,Qing Guo,Xingxing Wei

Main category: cs.CV

TL;DR: The paper proposes Adversarial Mixture Prompt Tuning (AMPT), which improves the robustness of vision language models (VLMs) to adversarial examples by increasing the number of learned prompts rather than their length.

Details Motivation: Large pre-trained VLMs generalize well but are highly susceptible to adversarial examples, which poses security risks. Existing adversarial prompt tuning methods rely on a single prompt that generalizes poorly across varied attacks and tends to overfit. Method: Proposes AMPT, which learns a mixture of text prompts to obtain more robust text features, and uses a conditional weight router driven by the input adversarial image to predict mixture weights, producing sample-specific aggregated text features. Result: Experiments show that AMPT achieves better adversarial robustness than existing methods on 11 datasets under different experimental settings. Conclusion: Through mixture prompts and conditional weight routing, AMPT significantly improves the robustness of VLMs against varied adversarial attacks. Abstract: Large pre-trained Vision Language Models (VLMs) have excellent generalization capabilities but are highly susceptible to adversarial examples, presenting potential security risks. To improve the robustness of VLMs against adversarial examples, adversarial prompt tuning methods are proposed to align the text feature with the adversarial image feature without changing model parameters. However, when facing various adversarial attacks, a single learnable text prompt has insufficient generalization to align well with all adversarial image features, which finally leads to the overfitting phenomenon. To address the above challenge, in this paper, we empirically find that increasing the number of learned prompts can bring more robustness improvement than a longer prompt. Then we propose an adversarial tuning method named Adversarial Mixture Prompt Tuning (AMPT) to enhance the generalization towards various adversarial attacks for VLMs. AMPT aims to learn mixture text prompts to obtain more robust text features. To further enhance the adaptability, we propose a conditional weight router based on the input adversarial image to predict the mixture weights of multiple learned prompts, which helps obtain sample-specific aggregated text features aligning with different adversarial image features. A series of experiments show that our method can achieve better adversarial robustness than state-of-the-art methods on 11 datasets under different experimental settings.

[44] Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding

Yeongjae Cho,Keonwoo Kim,Taebaek Hwang,Sungzoon Cho

Main category: cs.CV

TL;DR: Proposes Ensemble Decoding (ED), a new method that addresses object hallucination in large vision-language models by splitting the input image into sub-images and combining logit distributions with weights taken from the attention map.

Details Motivation: Large vision-language models perform well on tasks such as image captioning and visual question answering but still suffer from object hallucination, that is, generating inaccurate descriptions. Existing methods fall short on scalability and depend on external modules. Method: Proposes ED, which splits the input image into sub-images and combines logit distributions weighted by the attention map, and introduces an ED adaptive plausibility constraint and a FastED variant. Result: ED achieves state-of-the-art performance on multiple hallucination benchmarks. Conclusion: ED effectively mitigates object hallucination and performs well in both accuracy and speed. Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering. However, they still struggle with object hallucination, where models generate descriptions that inaccurately reflect the visual content by including nonexistent objects or misrepresenting existing ones. While previous methods, such as data augmentation and training-free approaches, strive to tackle this issue, they still encounter scalability challenges and often depend on additional external modules. In this work, we propose Ensemble Decoding (ED), a novel strategy that splits the input image into sub-images and combines logit distributions by assigning weights through the attention map. Furthermore, we introduce ED adaptive plausibility constraint to calibrate logit distribution and FastED, a variant designed for speed-critical applications. Extensive experiments across hallucination benchmarks demonstrate that our proposed method achieves state-of-the-art performance, validating the effectiveness of our approach.
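
The attention-weighted combination described in the abstract can be illustrated with a small sketch. This is only one plausible reading: the function name `ensemble_logits`, the use of a single scalar "attention mass" per sub-image, and the simple normalize-then-average rule are all assumptions made for illustration, not the authors' implementation.

```python
def ensemble_logits(sub_image_logits, attention_mass):
    """Combine per-sub-image logit vectors into one vector.

    sub_image_logits: list of logit vectors, one per sub-image (hypothetical input).
    attention_mass:   one non-negative score per sub-image, e.g. the total
                      attention the model puts on that crop (an assumption).
    """
    total = sum(attention_mass)
    if total <= 0:
        raise ValueError("attention_mass must contain a positive value")
    # Normalize the attention scores so the weights sum to 1.
    weights = [a / total for a in attention_mass]
    vocab_size = len(sub_image_logits[0])
    # Weighted sum over sub-images, computed separately for each vocabulary entry.
    return [
        sum(w * logits[v] for w, logits in zip(weights, sub_image_logits))
        for v in range(vocab_size)
    ]


if __name__ == "__main__":
    # Two sub-images, a two-token vocabulary; the first crop gets 3x the attention.
    print(ensemble_logits([[2.0, 0.0], [0.0, 2.0]], [3.0, 1.0]))
```

Sub-images that attract more attention pull the combined distribution toward their own predictions, which is the intuition the abstract gives for weighting by the attention map.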

[45] Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

Jingjing Jiang,Chongjie Si,Jun Luo,Hanwang Zhang,Chao Ma

Main category: cs.CV

TL;DR: The paper proposes CoRL, a co-reinforcement learning framework that uses group relative policy optimization to strengthen unified multimodal large language models (ULMs), significantly improving both generation and understanding.

Details Motivation: To explore how reinforcement learning (RL) can strengthen the generation and understanding capabilities of unified multimodal large language models at the same time, so that the two capabilities co-evolve. Method: Proposes the CoRL framework, which comprises a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. Result: The ULM-R1 model improves by an average of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. Conclusion: CoRL effectively promotes cross-task synergy and optimization in unified multimodal large language models and shows the substantial potential of RL in this area. Abstract: This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs.
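
For readers unfamiliar with group relative policy optimization, the sketch below shows the group-relative advantage as it is commonly described: rewards for several responses to the same prompt are standardized within the group. Whether CoRL uses exactly this normalization is an assumption; the function name and the choice of population standard deviation are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps:     small constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt; above-average ones get positive advantage.
    print(group_relative_advantages([0.2, 0.9, 0.4, 0.5]))
```

Because the baseline is the group mean, no separate value network is needed, which is the usual motivation given for this family of methods.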

[46] RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

Mingrui Wu,Lu Wang,Pu Zhao,Fangkai Yang,Jianjin Zhang,Jianfeng Liu,Yuefeng Zhan,Weihao Han,Hao Sun,Jiayi Ji,Xiaoshuai Sun,Qingwei Lin,Weiwei Deng,Dongmei Zhang,Feng Sun,Qi Zhang,Rongrong Ji

Main category: cs.CV

TL;DR: RePrompt is a prompt enhancement framework that introduces explicit reasoning via reinforcement learning, significantly improving spatial layout fidelity and compositional generalization in text-to-image generation.

Details Motivation: Existing text-to-image models struggle to capture user intent from short, under-specified prompts, and existing enhancement methods often produce unrealistic content because they are insufficiently grounded in visual semantics and real-world composition. Method: Proposes RePrompt, which trains a language model with reinforcement learning to generate structured, self-reflective prompts optimized for image-level outcomes, using reward models that assess generated images for human preference, semantic alignment, and visual composition. Result: Experiments on GenEval and T2I-Compbench show that RePrompt significantly improves spatial layout fidelity and compositional generalization, reaching a new state of the art. Conclusion: Through reinforcement learning and self-reflective prompt generation, RePrompt addresses the shortcomings of existing text-to-image models in capturing intent and producing realistic content. Abstract: Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language models, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. The tailored reward models assess the generated images in terms of human preference, semantic alignment, and visual composition, providing indirect supervision to refine prompt generation. Our approach enables end-to-end training without human-annotated data. Experiments on GenEval and T2I-Compbench show that RePrompt significantly boosts spatial layout fidelity and compositional generalization across diverse T2I backbones, establishing new state-of-the-art results.

[47] T2VUnlearning: A Concept Erasing Method for Text-to-Video Diffusion Models

Xiaoyu Ye,Songjie Cheng,Yongtao Wang,Yajiao Xiong,Yishen Li

Main category: cs.CV

TL;DR: The paper proposes a precise unlearning method for text-to-video (T2V) diffusion models that removes harmful content through negatively-guided velocity prediction fine-tuning and prompt augmentation while preserving other generation capabilities.

Details Motivation: Although T2V diffusion models have greatly improved video quality, their ability to generate harmful content raises concerns about misuse and rights violations. Method: Adopts negatively-guided velocity prediction fine-tuning, combined with prompt augmentation for robustness, and adds localization and preservation regularization for precise unlearning. Result: Experiments show that the method effectively erases a specific concept while preserving the ability to generate other concepts, outperforming existing methods. Conclusion: The method gives T2V models an efficient and precise unlearning solution to the problem of harmful content generation. Abstract: Recent advances in text-to-video (T2V) diffusion models have significantly enhanced the quality of generated videos. However, their ability to produce explicit or harmful content raises concerns about misuse and potential rights violations. Inspired by the success of unlearning techniques in erasing undesirable concepts from text-to-image (T2I) models, we extend unlearning to T2V models and propose a robust and precise unlearning method. Specifically, we adopt negatively-guided velocity prediction fine-tuning and enhance it with prompt augmentation to ensure robustness against LLM-refined prompts. To achieve precise unlearning, we incorporate a localization and a preservation regularization to preserve the model's ability to generate non-target concepts. Extensive experiments demonstrate that our method effectively erases a specific concept while preserving the model's generation capability for all other concepts, outperforming existing methods. We provide the unlearned models at https://github.com/VDIGPKU/T2VUnlearning.git.

[48] Center-aware Residual Anomaly Synthesis for Multi-class Industrial Anomaly Detection

Qiyu Chen,Huiyuan Luo,Haiming Yao,Wei Luo,Zhen Qu,Chengkan Lv,Zhengtao Zhang

Main category: cs.CV

TL;DR: Proposes CRAS, a new method for multi-class anomaly detection that tackles inter-class interference and intra-class overlap through center-aware residual learning and distance-guided anomaly synthesis.

Details Motivation: Existing methods need a separate model per category, which is costly to deploy, and unified multi-class models are prone to inter-class interference and intra-class overlap. Method: CRAS combines center-aware residual learning with distance-guided anomaly synthesis, reducing inter-class interference and adaptively adjusting noise variance. Result: Experiments show that CRAS performs strongly in both detection accuracy and inference speed. Conclusion: CRAS provides an efficient, unified solution for multi-class anomaly detection. Abstract: Anomaly detection plays a vital role in the inspection of industrial images. Most existing methods require separate models for each category, resulting in multiplied deployment costs. This highlights the challenge of developing a unified model for multi-class anomaly detection. However, the significant increase in inter-class interference leads to severe missed detections. Furthermore, the intra-class overlap between normal and abnormal samples, particularly in synthesis-based methods, cannot be ignored and may lead to over-detection. To tackle these issues, we propose a novel Center-aware Residual Anomaly Synthesis (CRAS) method for multi-class anomaly detection. CRAS leverages center-aware residual learning to couple samples from different categories into a unified center, mitigating the effects of inter-class interference. To further reduce intra-class overlap, CRAS introduces distance-guided anomaly synthesis that adaptively adjusts noise variance based on normal data distribution. Experimental results on diverse datasets and real-world industrial applications demonstrate the superior detection accuracy and competitive inference speed of CRAS. The source code and the newly constructed dataset are publicly available at https://github.com/cqylunlun/CRAS.

[49] Deeper Diffusion Models Amplify Bias

Shahin Hakemi,Naveed Akhtar,Ghulam Mubashar Hassan,Ajmal Mian

Main category: cs.CV

TL;DR: The paper examines the bias-variance tradeoff in diffusion models, shows that models may amplify bias in the training data or compromise the privacy of training samples, and proposes a training-free method to improve generated image quality.

Details Motivation: The internal workings of diffusion models are not well understood, which may cause problems. The paper studies their bias-variance tradeoff to expose potential risks. Method: Proposes a training-free method that temporarily raises variance in the generation process by partially bypassing the mid-block's contribution during denoising. Result: The method is validated both theoretically and empirically and consistently improves the quality of text-to-image and image-to-image generation. Conclusion: The study extends the understanding of generative models, reveals the risk of bias amplification, and offers an efficient way to improve generation quality. Abstract: Despite the impressive performance of generative Diffusion Models (DMs), their internal working is still not well understood, which is potentially problematic. This paper focuses on exploring the important notion of bias-variance tradeoff in diffusion models. Providing a systematic foundation for this exploration, it establishes that at one extreme the diffusion models may amplify the inherent bias in the training data and, on the other, they may compromise the presumed privacy of the training samples. Our exploration aligns with the memorization-generalization understanding of the generative models, but it also expands further along this spectrum beyond "generalization", revealing the risk of bias amplification in deeper models. Building on the insights, we also introduce a training-free method to improve output quality in text-to-image and image-to-image generation. By progressively encouraging temporary high variance in the generation process with partial bypassing of the mid-block's contribution in the denoising process of DMs, our method consistently improves generative image quality with zero training cost. Our claims are validated both theoretically and empirically.

[50] Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model

Kwanyoung Kim,Sanghyun Kim

Main category: cs.CV

TL;DR: ANSE is a noise selection framework based on attention uncertainty that significantly improves the quality and temporal coherence of video diffusion models.

Details Motivation: The choice of initial noise strongly affects the generation quality of video diffusion models, but existing methods overlook internal model signals. Method: Proposes the ANSE framework, which quantifies attention uncertainty with BANSA and introduces a Bernoulli-masked approximation for efficient inference. Result: Experiments on CogVideoX-2B and 5B show that ANSE improves video quality and temporal coherence with only 8% and 13% more inference time, respectively. Conclusion: ANSE offers a principled and generalizable solution for noise selection in video diffusion. Abstract: The choice of initial noise significantly affects the quality and prompt alignment of video diffusion models, where different noise seeds for the same prompt can lead to drastically different generations. While recent methods rely on externally designed priors such as frequency filters or inter-frame smoothing, they often overlook internal model signals that indicate which noise seeds are inherently preferable. To address this, we propose ANSE (Active Noise Selection for Generation), a model-aware framework that selects high-quality noise seeds by quantifying attention-based uncertainty. At its core is BANSA (Bayesian Active Noise Selection via Attention), an acquisition function that measures entropy disagreement across multiple stochastic attention samples to estimate model confidence and consistency. For efficient inference-time deployment, we introduce a Bernoulli-masked approximation of BANSA that enables score estimation using a single diffusion step and a subset of attention layers. Experiments on CogVideoX-2B and 5B demonstrate that ANSE improves video quality and temporal coherence with only an 8% and 13% increase in inference time, respectively, providing a principled and generalizable approach to noise selection in video diffusion. See our project page: https://anse-project.github.io/anse-project/
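
The phrase "entropy disagreement across multiple stochastic attention samples" can be made concrete with a small sketch. The score below follows the familiar BALD-style form (entropy of the averaged distribution minus the average of the per-sample entropies); that this matches the exact BANSA definition is an assumption, and the function names and the toy attention rows are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def entropy_disagreement(attention_samples):
    """Entropy of the mean distribution minus the mean of the entropies.

    attention_samples: several attention distributions over the same keys,
    obtained from stochastic forward passes (hypothetical inputs).
    The score is 0 when all samples agree and grows as they diverge.
    """
    n = len(attention_samples)
    mean_dist = [sum(col) / n for col in zip(*attention_samples)]
    mean_entropy = sum(entropy(p) for p in attention_samples) / n
    return entropy(mean_dist) - mean_entropy


if __name__ == "__main__":
    agree = [[0.7, 0.3], [0.7, 0.3]]
    differ = [[0.9, 0.1], [0.1, 0.9]]
    print(entropy_disagreement(agree), entropy_disagreement(differ))
```

Under the further assumption that low disagreement signals confident, consistent attention, a selection loop would score each candidate noise seed this way and keep the seed with the smallest score.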

[51] Enhancing Fourier-based Doppler Resolution with Diffusion Models

Denisa Qosja,Kilian Barth,Simon Wagner

Main category: cs.CV

TL;DR: Uses artificial intelligence to increase radar Doppler resolution, refining zero-padded FFT data with a generative diffusion model to separate closely spaced targets.

Details Motivation: High Doppler resolution is essential for detecting slow-moving targets, but hardware and physical factors limit it, so post-processing techniques are needed. Method: Starts from a zero-padded FFT and refines the data with a generative diffusion model. Result: The method overcomes the limitations of the traditional FFT and effectively separates closely spaced targets. Conclusion: AI techniques can significantly increase radar Doppler resolution and offer a new approach to slow-moving target detection. Abstract: In radar systems, high resolution in the Doppler dimension is important for detecting slow-moving targets as it allows for more distinct separation between these targets and clutter, or stationary objects. However, achieving sufficient resolution is constrained by hardware capabilities and physical factors, leading to the development of processing techniques to enhance the resolution after acquisition. In this work, we leverage artificial intelligence to increase the Doppler resolution in range-Doppler maps. Based on a zero-padded FFT, a refinement via the generative neural networks of diffusion models is achieved. We demonstrate that our method overcomes the limitations of traditional FFT, generating data where closely spaced targets are effectively separated.
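
As background for the zero-padded FFT that the method starts from, the sketch below computes a plain discrete Fourier transform with and without zero-padding. It uses a naive O(N²) DFT in pure Python for self-containment; the function name and the toy signal are illustrative and unrelated to the authors' radar data.

```python
import cmath
import math


def dft_magnitude(samples, n_fft):
    """Magnitude spectrum of `samples`, zero-padded to length `n_fft`."""
    if n_fft < len(samples):
        raise ValueError("n_fft must be at least len(samples)")
    padded = list(samples) + [0.0] * (n_fft - len(samples))
    spectrum = []
    for k in range(n_fft):
        # Direct evaluation of the DFT sum for frequency bin k.
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * n / n_fft)
                  for n, x in enumerate(padded))
        spectrum.append(abs(acc))
    return spectrum


if __name__ == "__main__":
    # Eight samples of a cosine that sits exactly on bin 2 of an 8-point DFT.
    signal = [math.cos(2 * math.pi * 2 * n / 8) for n in range(8)]
    coarse = dft_magnitude(signal, 8)
    fine = dft_magnitude(signal, 32)  # 4x zero-padding: 4x denser frequency grid
    print(len(coarse), len(fine))
```

Zero-padding samples the same underlying spectrum on a finer grid but does not narrow the main lobe, so two targets closer than the native resolution still merge into one peak. That gap is what the abstract's diffusion-based refinement is meant to address.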

[52] InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO

Xueji Fang,Liyuan Ma,Zhiyang Chen,Mingyuan Zhou,Guo-jun Qi

Main category: cs.CV

TL;DR: InfLVG is an inference-time framework that addresses the computational cost and consistency challenges of long video generation by dynamically selecting semantically relevant context.

Details Motivation: Existing text-to-video models face high computational cost and declining consistency when generating long, cross-scene videos. Method: InfLVG uses a learnable context selection policy, optimized with GRPO, to dynamically retain the most relevant context under a fixed computational budget. Result: InfLVG can extend video length by up to 9× while maintaining cross-scene consistency and semantic alignment. Conclusion: InfLVG provides an efficient solution for long video generation without requiring additional long-form video data. Abstract: Recent advances in text-to-video generation, particularly with autoregressive models, have enabled the synthesis of high-quality videos depicting individual scenes. However, extending these models to generate long, cross-scene videos remains a significant challenge. As the context length grows during autoregressive decoding, computational costs rise sharply, and the model's ability to maintain consistency and adhere to evolving textual prompts deteriorates. We introduce InfLVG, an inference-time framework that enables coherent long video generation without requiring additional long-form video data. InfLVG leverages a learnable context selection policy, optimized via Group Relative Policy Optimization (GRPO), to dynamically identify and retain the most semantically relevant context throughout the generation process. Instead of accumulating the entire generation history, the policy ranks and selects the top-K most contextually relevant tokens, allowing the model to maintain a fixed computational budget while preserving content consistency and prompt alignment. To optimize the policy, we design a hybrid reward function that jointly captures semantic alignment, cross-scene consistency, and artifact reduction. To benchmark performance, we introduce the Cross-scene Video Benchmark (CsVBench) along with an Event Prompt Set (EPS) that simulates complex multi-scene transitions involving shared subjects and varied actions/backgrounds. Experimental results show that InfLVG can extend video length by up to 9×, achieving strong consistency and semantic fidelity across scenes. Our code is available at https://github.com/MAPLE-AIGC/InfLVG.
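
The top-K context selection described in the abstract can be sketched as follows. In the paper the relevance scores come from a learned policy; here they are plain numbers supplied by the caller, and the function name `select_top_k` is hypothetical.

```python
def select_top_k(scores, k):
    """Return the indices of the k highest-scoring context tokens.

    scores: one relevance score per token in the generation history
            (stand-ins for the output of a learned selection policy).
    k:      fixed context budget.
    The selected indices are returned in their original order so the
    retained context keeps its temporal structure.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    history_scores = [0.1, 0.9, 0.3, 0.7, 0.2]
    print(select_top_k(history_scores, 2))
```

Because k stays fixed while the history grows, the context passed to the generator has constant size, which is how the abstract's "fixed computational budget" would arise under this reading.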

[53] MODEM: A Morton-Order Degradation Estimation Mechanism for Adverse Weather Image Recovery

Hainuo Wang,Qiming Hu,Xiaojie Guo

Main category: cs.CV

TL;DR: MODEM is a Morton-order-based degradation estimation mechanism for adverse weather image restoration; its MOS2D and DDEM modules enable adaptive processing and achieve state-of-the-art performance.

Details Motivation: Image degradation caused by adverse weather is highly non-uniform and spatially heterogeneous, and traditional methods struggle to estimate it accurately, so more effective adaptive restoration strategies are needed. Method: MODEM combines Morton-order coding with selective state-space models (MOS2D) to capture long-range dependencies, and uses the DDEM module to disentangle global and local degradation priors that dynamically guide restoration. Result: MODEM achieves state-of-the-art restoration results across multiple benchmarks and weather types. Conclusion: Through adaptive degradation estimation and conditioned restoration, MODEM models complex degradation dynamics effectively and offers a new approach to adverse weather image restoration. Abstract: Restoring images degraded by adverse weather remains a significant challenge due to the highly non-uniform and spatially heterogeneous nature of weather-induced artifacts, e.g., fine-grained rain streaks versus widespread haze. Accurately estimating the underlying degradation can intuitively provide restoration models with more targeted and effective guidance, enabling adaptive processing strategies. To this end, we propose a Morton-Order Degradation Estimation Mechanism (MODEM) for adverse weather image restoration. Central to MODEM is the Morton-Order 2D-Selective-Scan Module (MOS2D), which integrates Morton-coded spatial ordering with selective state-space models to capture long-range dependencies while preserving local structural coherence. Complementing MOS2D, we introduce a Dual Degradation Estimation Module (DDEM) that disentangles and estimates both global and local degradation priors. These priors dynamically condition the MOS2D modules, facilitating adaptive and context-aware restoration. Extensive experiments and ablation studies demonstrate that MODEM achieves state-of-the-art results across multiple benchmarks and weather types, highlighting its effectiveness in modeling complex degradation dynamics. Our code will be released at https://github.com/hainuo-wang/MODEM.git.
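
Morton (Z-order) coding, which MOS2D uses to order pixels before the selective scan, interleaves the bits of the row and column indices. The sketch below shows the standard construction; the function names and the convention of placing column bits in the even positions are choices made for illustration, and how MOS2D consumes the resulting sequence is not shown.

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) into a single Z-order index."""
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)       # column bits fill even positions
        index |= ((row >> b) & 1) << (2 * b + 1)   # row bits fill odd positions
    return index


def morton_scan(height, width):
    """List all (row, col) positions of a grid in Morton order."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    return sorted(coords, key=lambda rc: morton_index(rc[0], rc[1]))


if __name__ == "__main__":
    # The first four positions form the top-left 2x2 block of the grid.
    print(morton_scan(4, 4)[:8])
```

Compared with a row-by-row raster scan, this ordering keeps spatially neighboring pixels close together in the 1D sequence, which is consistent with the abstract's claim of preserving local structural coherence.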

[54] CGS-GAN: 3D Consistent Gaussian Splatting GANs for High Resolution Human Head Synthesis

Florian Barthel,Wieland Morgenstern,Paul Hinzer,Anna Hilsmann,Peter Eisert

Main category: cs.CV

TL;DR: CGS-GAN is a new 3D Gaussian splatting GAN framework that resolves the 3D consistency and training stability problems of existing methods without relying on view-conditioning.

Details Motivation: Existing 3D GAN methods produce identity inconsistencies as the viewpoint changes, while a fixed viewpoint fails to generalize to novel views, and removing view-conditioning destabilizes training. Method: Introduces a multi-view regularization technique, adapts the conditional loss, and designs a new generator architecture that supports efficient rendering and high-resolution output. Result: CGS-GAN achieves high rendering quality and 3D consistency on a dataset derived from FFHQ, with competitive FID scores. Conclusion: Through multi-view regularization and architectural improvements, CGS-GAN achieves stable training and high-quality 3D-consistent synthesis. Abstract: Recently, 3D GANs based on 3D Gaussian splatting have been proposed for high quality synthesis of human heads. However, existing methods stabilize training and enhance rendering quality from steep viewpoints by conditioning the random latent vector on the current camera position. This compromises 3D consistency, as we observe significant identity changes when re-synthesizing the 3D head with each camera shift. Conversely, fixing the camera to a single viewpoint yields high-quality renderings for that perspective but results in poor performance for novel views. Removing view-conditioning typically destabilizes GAN training, often causing the training to collapse. In response to these challenges, we introduce CGS-GAN, a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent synthesis of human heads without relying on view-conditioning. To ensure training stability, we introduce a multi-view regularization technique that enhances generator convergence with minimal computational overhead. Additionally, we adapt the conditional loss used in existing 3D Gaussian splatting GANs and propose a generator architecture designed to not only stabilize training but also facilitate efficient rendering and straightforward scaling, enabling output resolutions up to 2048 × 2048. To evaluate the capabilities of CGS-GAN, we curate a new dataset derived from FFHQ. This dataset enables very high resolutions, focuses on larger portions of the human head, reduces view-dependent artifacts for improved 3D consistency, and excludes images where subjects are obscured by hands or other objects. As a result, our approach achieves very high rendering quality, supported by competitive FID scores, while ensuring consistent 3D scene generation. Check out our project page here: https://fraunhoferhhi.github.io/cgs-gan/

[55] PathoSCOPE: Few-Shot Pathology Detection via Self-Supervised Contrastive Learning and Pathology-Informed Synthetic Embeddings

Sinchee Chin,Yinuo Ma,Xiaochen Yang,Jing-Hao Xue,Wenming Yang

Main category: cs.CV

TL;DR: PathoSCOPE is a few-shot unsupervised pathology detection framework that detects pathologies efficiently from only a few non-pathological samples, with performance improved by a global-local contrastive loss and a pathology-informed embedding generation module.

Details Motivation: Hospital data is biased toward symptomatic populations and privacy regulations restrict the collection of healthy data, yet existing methods need large healthy datasets, which makes reliable models hard to build. Method: Proposes the PathoSCOPE framework, which combines a Global-Local Contrastive Loss (GLCL) with a Pathology-informed Embedding Generation (PiEG) module to make better use of a small number of non-pathological samples. Result: Outperforms other unsupervised methods on the BraTS2020 and ChestXray8 datasets while remaining computationally efficient (2.48 GFLOPs, 166 FPS). Conclusion: PathoSCOPE significantly reduces the dependence on healthy data and provides an efficient solution for unsupervised pathology detection. Abstract: Unsupervised pathology detection trains models on non-pathological data to flag deviations as pathologies, offering strong generalizability for identifying novel diseases and avoiding costly annotations. However, building reliable normality models requires vast healthy datasets, as hospitals' data is inherently biased toward symptomatic populations, while privacy regulations hinder the assembly of representative healthy cohorts. To address this limitation, we propose PathoSCOPE, a few-shot unsupervised pathology detection framework that requires only a small set of non-pathological samples (minimum 2 shots), significantly improving data efficiency. We introduce Global-Local Contrastive Loss (GLCL), comprised of a Local Contrastive Loss to reduce the variability of non-pathological embeddings and a Global Contrastive Loss to enhance the discrimination of pathological regions. We also propose a Pathology-informed Embedding Generation (PiEG) module that synthesizes pathological embeddings guided by the global loss, better exploiting the limited non-pathological samples. Evaluated on the BraTS2020 and ChestXray8 datasets, PathoSCOPE achieves state-of-the-art performance among unsupervised methods while maintaining computational efficiency (2.48 GFLOPs, 166 FPS).

Haoran He,Jiajun Liang,Xintao Wang,Pengfei Wan,Di Zhang,Kun Gai,Ling Pan

Main category: cs.CV

TL;DR: This paper proposes EvoSearch, a new test-time scaling method that improves image and video generation models without additional training or model expansion.

Details Motivation: As the compute cost of pre-training keeps rising, test-time scaling (TTS) has become a promising way to improve generative models, but existing methods have limitations on vision tasks. Method: EvoSearch recasts test-time scaling as an evolutionary search problem, using principles from biological evolution to optimize the denoising trajectory through purpose-built selection and mutation mechanisms. Result: Experiments show that EvoSearch performs well on both diffusion and flow models, yielding higher generation quality and greater diversity, and that it applies broadly. Conclusion: EvoSearch is a general and efficient TTS method that markedly improves image and video generation. Abstract: As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose Evolutionary Search (EvoSearch), a novel, generalist, and efficient TTS method that effectively enhances the scalability of both image and video generation across diffusion and flow models, without requiring additional training or model expansion. EvoSearch reformulates test-time scaling for diffusion and flow models as an evolutionary search problem, leveraging principles from biological evolution to efficiently explore and refine the denoising trajectory. By incorporating carefully designed selection and mutation mechanisms tailored to the stochastic differential equation denoising process, EvoSearch iteratively generates higher-quality offspring while preserving population diversity. Through extensive evaluation across both diffusion and flow architectures for image and video generation tasks, we demonstrate that our method consistently outperforms existing approaches, achieves higher diversity, and shows strong generalizability to unseen evaluation metrics. Our project is available at the website https://tinnerhrhe.github.io/evosearch.
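The select-and-mutate loop described in the abstract can be illustrated with a generic evolutionary search over candidate noise vectors. This is only a minimal sketch under stated assumptions, not the EvoSearch algorithm: the function `evolutionary_search`, the `toy_reward` scorer, and the plain Gaussian mutation are hypothetical stand-ins for the paper's reward models and SDE-tailored mutation, which the abstract does not specify.

```python
import random

def evolutionary_search(reward, dim, population=8, elites=2,
                        generations=10, sigma=0.3, seed=0):
    """Generic elitist evolutionary search (illustrative, not EvoSearch itself)."""
    rng = random.Random(seed)
    # Start from Gaussian candidates, mimicking initial noise samples.
    pop = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(population)]
    history = []  # best reward seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(pop, key=reward, reverse=True)
        history.append(reward(ranked[0]))
        parents = ranked[:elites]          # selection: keep the top candidates
        children = []
        while len(parents) + len(children) < population:
            parent = rng.choice(parents)
            # Mutation (assumption): small Gaussian perturbation of a parent.
            children.append([x + rng.gauss(0.0, sigma) for x in parent])
        pop = parents + children
    return max(pop, key=reward), history

# Hypothetical reward: closeness to a fixed target vector.
target = [1.0, -2.0, 0.5]

def toy_reward(z):
    return -sum((a - b) ** 2 for a, b in zip(z, target))

best, history = evolutionary_search(toy_reward, dim=3)
```

Because the elites are carried over unchanged, the best reward never decreases between generations; in a real setting the reward would come from scoring the generated image or video rather than from a closed-form target.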

[57] CAS-IQA: Teaching Vision-Language Models for Synthetic Angiography Quality Assessment

Bo Wang,De-Xing Huang,Xiao-Hu Zhou,Mei-Jiang Gui,Nu-Fang Xiao,Jian-Long Hao,Ming-Yuan Liu,Zeng-Guang Hou

Main category: cs.CV

TL;DR: This paper proposes CAS-IQA, a vision-language model (VLM)-based framework for assessing the quality of synthetic X-ray angiography images; by incorporating auxiliary image information and defining task-specific metrics, it significantly outperforms existing methods.

Details Motivation: Existing image quality assessment (IQA) methods do not use auxiliary images as references and lack clinically relevant fine-grained metrics, so low-quality synthetic angiographies may increase procedural risk. Method: Proposes the CAS-IQA framework, builds the CAS-3K dataset (3,565 synthetic angiography images), defines three task-specific metrics, and designs the MUST module to adaptively fuse and route visual features. Result: Experiments on CAS-3K show that CAS-IQA significantly outperforms existing IQA methods. Conclusion: By combining auxiliary information with task-specific metrics, CAS-IQA provides more reliable clinical quality assessment for synthetic angiography. Abstract: Synthetic X-ray angiographies generated by modern generative models hold great potential to reduce the use of contrast agents in vascular interventional procedures. However, low-quality synthetic angiographies can significantly increase procedural risk, underscoring the need for reliable image quality assessment (IQA) methods. Existing IQA models, however, fail to leverage auxiliary images as references during evaluation and lack fine-grained, task-specific metrics necessary for clinical relevance. To address these limitations, this paper proposes CAS-IQA, a vision-language model (VLM)-based framework that predicts fine-grained quality scores by effectively incorporating auxiliary information from related images. In the absence of angiography datasets, CAS-3K is constructed, comprising 3,565 synthetic angiographies along with score annotations. To ensure clinically meaningful assessment, three task-specific evaluation metrics are defined. Furthermore, a Multi-path featUre fuSion and rouTing (MUST) module is designed to enhance image representations by adaptively fusing and routing visual tokens to metric-specific branches. Extensive experiments on the CAS-3K dataset demonstrate that CAS-IQA outperforms state-of-the-art IQA methods by a considerable margin.

[58] HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

Chuhao Zhou,Jianfei Yang

Main category: cs.CV

TL;DR: HoloLLM is a multimodal large language model that integrates uncommon but powerful sensing modalities such as LiDAR and infrared; its Universal Modality-Injection Projector (UMIP) addresses scarce aligned data and heterogeneous signal representations, significantly improving language-grounded human sensing accuracy.

Details Motivation: Existing vision-language models (VLMs) rely on visual data and lack robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. HoloLLM aims to improve language-grounded human sensing by integrating multiple sensing modalities. Method: Designs a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings via coarse-to-fine cross-attention, and introduces a human-VLM collaborative data curation pipeline to generate text annotations for sensing data. Result: On two newly constructed benchmarks, HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%. Conclusion: HoloLLM lays a new foundation for real-world, language-informed multisensory embodied intelligence. Abstract: Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. While Vision-Language Models (VLMs) have enabled impressive language-grounded perception, their reliance on visual data limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across heterogeneous environments. We address two key challenges: (1) the scarcity of aligned modality-text data for rare sensors, and (2) the heterogeneity of their physical signal representations. To overcome these, we design a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings with fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention without introducing significant alignment overhead. We further introduce a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets. Extensive experiments on two newly constructed benchmarks show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%. This work establishes a new foundation for real-world, language-informed multisensory embodied intelligence.

[59] Instruct2See: Learning to Remove Any Obstructions Across Distributions

Junhang Li,Yu Guo,Chuhua Xian,Shengfeng He

Main category: cs.CV

TL;DR: Instruct2See is a zero-shot framework that handles obstructions through multi-modal prompts, casting removal as soft-hard mask restoration and covering both seen and unseen obstructions.

Details Motivation: Existing methods are limited to specific obstructions and struggle with the diversity of real-world ones. Method: Treats obstruction removal as a soft-hard mask restoration problem, uses multi-modal prompts (visual semantics and textual instructions) with a cross-attention unit to improve contextual understanding, and adjusts masks dynamically. Result: Performs strongly on both in-distribution and out-of-distribution obstructions, with strong generalization. Conclusion: Instruct2See shows strong performance and generalization on obstruction removal. Abstract: Images are often obstructed by various obstacles due to capture limitations, hindering the observation of objects of interest. Most existing methods address occlusions from specific elements like fences or raindrops, but are constrained by the wide range of real-world obstructions, making comprehensive data collection impractical. To overcome these challenges, we propose Instruct2See, a novel zero-shot framework capable of handling both seen and unseen obstacles. The core idea of our approach is to unify obstruction removal by treating it as a soft-hard mask restoration problem, where any obstruction can be represented using multi-modal prompts, such as visual semantics and textual instructions, processed through a cross-attention unit to enhance contextual understanding and improve mode control. Additionally, a tunable mask adapter allows for dynamic soft masking, enabling real-time adjustment of inaccurate masks. Extensive experiments on both in-distribution and out-of-distribution obstacles show that Instruct2See consistently achieves strong performance and generalization in obstruction removal, regardless of whether the obstacles were present during the training phase. Code and dataset are available at https://jhscut.github.io/Instruct2See.

[60] EMRA-proxy: Enhancing Multi-Class Region Semantic Segmentation in Remote Sensing Images with Attention Proxy

Yichun Yu,Yuqing Lan,Zhihuan Xing,Xiaoyi Yang,Tingyue Tang,Dan Yu

Main category: cs.CV

TL;DR: RAPNet is a new approach that combines a Transformer with region awareness for high-resolution remote sensing image segmentation, significantly improving multi-class segmentation accuracy.

Details Motivation: High-resolution remote sensing image segmentation is challenging because of complex spatial layouts and diverse object appearances. Conventional CNNs and Transformers each have limitations, and RAPNet aims to combine their strengths. Method: RAPNet has two modules: CRA (Transformer-based capture of region-level contextual dependencies) and GCR (refinement with a global class attention map). Result: On three public datasets, RAPNet outperforms existing methods and achieves higher multi-class segmentation accuracy. Conclusion: Through region awareness and global refinement, RAPNet effectively addresses high-resolution remote sensing image segmentation. Abstract: High-resolution remote sensing (HRRS) image segmentation is challenging due to complex spatial layouts and diverse object appearances. While CNNs excel at capturing local features, they struggle with long-range dependencies, whereas Transformers can model global context but often neglect local details and are computationally expensive. We propose a novel approach, Region-Aware Proxy Network (RAPNet), which consists of two components: Contextual Region Attention (CRA) and Global Class Refinement (GCR). Unlike traditional methods that rely on grid-based layouts, RAPNet operates at the region level for more flexible segmentation. The CRA module uses a Transformer to capture region-level contextual dependencies, generating a Semantic Region Mask (SRM). The GCR module learns a global class attention map to refine multi-class information, combining the SRM and attention map for accurate segmentation. Experiments on three public datasets show that RAPNet outperforms state-of-the-art methods, achieving superior multi-class segmentation accuracy.

[61] Proto-FG3D: Prototype-based Interpretable Fine-Grained 3D Shape Classification

Shuxian Ma,Zihao Dong,Runmin Cong,Sam Kwong,Xiuli Shao

Main category: cs.CV

TL;DR: Proto-FG3D is a prototype-based framework for fine-grained 3D shape classification; non-parametric prototype learning addresses problems in multi-view feature aggregation and improves both accuracy and interpretability.

Details Motivation: Fine-grained 3D classification remains understudied because multi-view feature aggregation captures limited discriminative information; Proto-FG3D aims to address this. Method: Performs joint multi-view and multi-category representation learning via prototype association, refines prototypes with online clustering, and uses prototype-guided supervised learning to strengthen fine-grained discrimination. Result: On FG3D and ModelNet40, Proto-FG3D outperforms existing methods in accuracy, transparent predictions, and interpretability. Conclusion: Proto-FG3D offers a new paradigm for fine-grained 3D classification and demonstrates the potential of prototype learning. Abstract: Deep learning-based multi-view coarse-grained 3D shape classification has achieved remarkable success over the past decade, leveraging the powerful feature learning capabilities of CNN-based and ViT-based backbones. However, as a challenging research area critical for detailed shape understanding, fine-grained 3D classification remains understudied due to the limited discriminative information captured during multi-view feature aggregation, particularly for subtle inter-class variations, class imbalance, and inherent interpretability limitations of parametric models. To address these problems, we propose the first prototype-based framework named Proto-FG3D for fine-grained 3D shape classification, achieving a paradigm shift from parametric softmax to non-parametric prototype learning. Firstly, Proto-FG3D establishes joint multi-view and multi-category representation learning via Prototype Association. Secondly, prototypes are refined via Online Clustering, improving both the robustness of multi-view feature allocation and inter-subclass balance. Finally, prototype-guided supervised learning is established to enhance fine-grained discrimination via prototype-view correlation analysis and enables ad-hoc interpretability through transparent case-based reasoning. Experiments on FG3D and ModelNet40 show Proto-FG3D surpasses state-of-the-art methods in accuracy, transparent predictions, and ad-hoc interpretability with visualizations, challenging conventional fine-grained 3D recognition approaches.
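The shift from a parametric softmax head to non-parametric prototype matching can be illustrated with a nearest-prototype classifier. The sketch below is a deliberate simplification and not Proto-FG3D itself: it assumes one prototype per class computed as the mean feature, whereas the abstract describes prototypes refined by online clustering. The names `build_prototypes` and `classify` and the toy 2-D features are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def build_prototypes(features_by_class):
    """One prototype per class: the mean of its features (simplifying assumption)."""
    return {label: mean_vector(feats) for label, feats in features_by_class.items()}

def classify(feature, prototypes):
    """Predict the class whose prototype is most similar to the feature."""
    return max(prototypes, key=lambda label: cosine(feature, prototypes[label]))

# Toy 2-D features standing in for aggregated multi-view embeddings.
train = {
    "chair": [[1.0, 0.1], [0.9, 0.2]],
    "table": [[0.1, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(train)
prediction = classify([0.8, 0.3], prototypes)
```

A prediction made this way can be explained by pointing at the matched prototype, which is the kind of case-based reasoning the abstract refers to.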

[62] SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding

Xuerui Qiu,Peixi Wu,Yaozhi Wen,Shaowei Gu,Yuqi Pan,Xinhao Luo,Bo XU,Guoqi Li

Main category: cs.CV

TL;DR: The SVL framework uses multi-scale triple alignment and re-parameterizable vision-language integration to significantly improve SNNs on 3D open-world understanding tasks, surpassing ANNs.

Details Motivation: Inadequate pre-training strategies leave existing SNNs behind ANNs, especially on multimodal understanding and zero-shot tasks. Method: Proposes the SVL framework with two key components: Multi-scale Triple Alignment (MTA) and Re-parameterizable Vision-Language Integration (Rep-VLI). Result: SVL reaches 85.4% accuracy in zero-shot 3D classification, surpassing ANNs, and performs well on multiple downstream tasks. Conclusion: SVL is the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively narrowing the gap between SNNs and ANNs. Abstract: Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing SNNs still exhibit a significant performance gap compared to Artificial Neural Networks (ANNs) due to inadequate pre-training strategies. These limitations manifest as restricted generalization ability, task specificity, and a lack of multimodal understanding, particularly in challenging tasks such as multimodal question answering and zero-shot 3D classification. To overcome these challenges, we propose a Spike-based Vision-Language (SVL) pretraining framework that empowers SNNs with open-world 3D understanding while maintaining spike-driven efficiency. SVL introduces two key components: (i) Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities, and (ii) Re-parameterizable Vision-Language Integration (Rep-VLI) to enable lightweight inference without relying on large text encoders. Extensive experiments show that SVL achieves a top-1 accuracy of 85.4% in zero-shot 3D classification, surpassing advanced ANN models, and consistently outperforms prior SNNs on downstream tasks, including 3D classification (+6.1%), DVS action recognition (+2.1%), 3D detection (+1.1%), and 3D segmentation (+2.1%) with remarkable efficiency. Moreover, SVL enables SNNs to perform open-world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks. Code is available at https://github.com/bollossom/SVL.

[63] Towards Dynamic 3D Reconstruction of Hand-Instrument Interaction in Ophthalmic Surgery

Ming Hu,Zhendi Yu,Feilong Tang,Kaiwen Chen,Yulong Li,Imran Razzak,Junjun He,Tolga Birdal,Kaijing Zhou,Zongyuan Ge

Main category: cs.CV

TL;DR: OphNet-3D is the first RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, with 7.1 million frames; the work also proposes an automatic annotation pipeline and two new benchmark tasks.

Details Motivation: Inadequate datasets and annotation tools have limited progress on 3D reconstruction of hands and instruments in ophthalmic surgery. Method: Introduces the OphNet-3D dataset, designs a multi-stage automatic annotation pipeline, and establishes two benchmark tasks with dedicated models H-Net and OH-Net. Result: H-Net and OH-Net significantly outperform existing methods on the MPJPE and ADD-S metrics. Conclusion: OphNet-3D and the proposed methods provide new benchmarks and solutions for 3D reconstruction in ophthalmic surgery. Abstract: Accurate 3D reconstruction of hands and instruments is critical for vision-based analysis of ophthalmic microsurgery, yet progress has been hampered by the lack of realistic, large-scale datasets and reliable annotation tools. In this work, we introduce OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, comprising 41 sequences from 40 surgeons and totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. To scalably produce high-fidelity labels, we design a multi-stage automatic annotation pipeline that integrates multi-view data observation, data-driven motion prior with cross-view geometric consistency and biomechanical constraints, along with a combination of collision-aware interaction constraints for instrument interactions. Building upon OphNet-3D, we establish two challenging benchmarks, bimanual hand pose estimation and hand-instrument interaction reconstruction, and propose two dedicated architectures: H-Net for dual-hand mesh recovery and OH-Net for joint reconstruction of two-hand-two-instrument interactions. These models leverage a novel spatial reasoning module with weak-perspective camera modeling and collision-aware center-based representation. Both architectures outperform existing methods by substantial margins, achieving improvements of over 2mm in Mean Per Joint Position Error (MPJPE) and up to 23% in ADD-S metrics for hand and instrument reconstruction, respectively.
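For readers unfamiliar with the hand-pose metric cited above, MPJPE is the average Euclidean distance between predicted and ground-truth joint positions. The following is a minimal sketch of that standard definition; the function name `mpjpe` and the toy joints are illustrative, and any alignment step (such as root-joint alignment) that a particular benchmark might apply is not shown because the abstract does not describe one.

```python
import math

def mpjpe(predicted, ground_truth):
    """Mean Per Joint Position Error: mean Euclidean distance over joints.

    Both arguments are equal-length sequences of 3D points, in the same
    units (millimetres in the abstract above).
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty joint lists of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)

# Toy example with two joints: one exact, one off by a 3-4-5 triangle.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error = mpjpe(pred, gt)  # (0 + 5) / 2 = 2.5
```

Under this definition, the reported improvement of over 2 mm means the predicted joints are on average more than 2 mm closer to the ground truth.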

[64] 5G-DIL: Domain Incremental Learning with Similarity-Aware Sampling for Dynamic 5G Indoor Localization

Nisha Lakshmana Raichur,Lucas Heublein,Christopher Mutschler,Felix Ott

Main category: cs.CV

TL;DR: This paper proposes 5G-DIL, a 5G-based indoor localization method that uses domain incremental learning (DIL) and similarity-aware sampling to adapt quickly to environmental changes while reducing training time and resource use.

Details Motivation: Conventional ML-based 5G indoor localization degrades when the environment changes, and retraining is time-consuming and resource-intensive. Method: Uses similarity-aware sampling based on the Chebyshev distance to select key exemplars from the previous environment, training only on the changed regions of the new environment. Result: Under dynamic environmental conditions, the positioning MAE is 0.261 meters, and as few as 50 exemplars suffice for rapid adaptation. Conclusion: 5G-DIL is efficient and accurate, and suits indoor localization in dynamic environments. Abstract: Indoor positioning based on 5G data has achieved high accuracy through the adoption of recent machine learning (ML) techniques. However, the performance of learning-based methods degrades significantly when environmental conditions change, thereby hindering their applicability to new scenarios. Acquiring new training data for each environmental change and fine-tuning ML models is both time-consuming and resource-intensive. This paper introduces a domain incremental learning (DIL) approach for dynamic 5G indoor localization, called 5G-DIL, enabling rapid adaptation to environmental changes. We present a novel similarity-aware sampling technique based on the Chebyshev distance, designed to efficiently select specific exemplars from the previous environment while training only on the modified regions of the new environment. This avoids the need to train on the entire region, significantly reducing the time and resources required for adaptation without compromising localization accuracy. This approach requires as few as 50 exemplars from adaptation domains, significantly reducing training time while maintaining high positioning accuracy in previous environments. Comparative evaluations against state-of-the-art DIL techniques on a challenging real-world indoor dataset demonstrate the effectiveness of the proposed sample selection method. Our approach is adaptable to real-world non-line-of-sight propagation scenarios and achieves an MAE positioning error of 0.261 meters, even under dynamic environmental conditions. Code: https://gitlab.cc-asp.fraunhofer.de/5g-pos/5g-dil
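The Chebyshev distance used for sampling is the largest absolute coordinate difference between two vectors. The sketch below shows that distance together with one possible way of ranking exemplars with it. The selection rule is an assumption made for illustration (keep the `k` previous-environment samples closest to any new-environment sample); the abstract does not state the actual criterion, and the names `chebyshev` and `select_exemplars` are hypothetical.

```python
def chebyshev(a, b):
    """Chebyshev (L-infinity) distance: the largest per-coordinate gap."""
    return max(abs(x - y) for x, y in zip(a, b))

def select_exemplars(old_samples, new_samples, k):
    """Pick k old samples ranked by distance to their nearest new sample.

    Assumption: 'most similar first' is used here purely as an example
    ranking; the paper's rule may differ.
    """
    scored = [
        (min(chebyshev(old, new) for new in new_samples), index)
        for index, old in enumerate(old_samples)
    ]
    scored.sort()  # smallest distance first, ties broken by original index
    return [old_samples[index] for _, index in scored[:k]]

# Toy feature vectors standing in for 5G channel measurements.
old_env = [(0.0, 0.0), (5.0, 5.0), (1.0, 2.0)]
new_env = [(1.0, 1.0)]
exemplars = select_exemplars(old_env, new_env, k=2)
```

In the toy data, `(0.0, 0.0)` and `(1.0, 2.0)` both lie at Chebyshev distance 1 from the new sample, while `(5.0, 5.0)` lies at distance 4 and is dropped.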

[65] FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng,Xinyuan Chang,Mengwei Xie,Xinran Liu,Yifan Bai,Zheng Pan,Mu Xu,Xing Wei

Main category: cs.CV

TL;DR: Proposes a spatio-temporal chain-of-thought reasoning method that lets vision-language models improve autonomous driving performance through visual generation and reasoning.

Details Motivation: Existing vision-language models typically use discrete textual chain-of-thought, which can blur spatio-temporal relationships and lose fine-grained information, so a reasoning method closer to real-world simulation and imagination is needed. Method: Proposes spatio-temporal chain-of-thought reasoning in which the VLM acts as a world model that generates a unified image frame predicting future world states, combining perception results and future frames to represent spatio-temporal relationships. Result: Experiments demonstrate the method's effectiveness, advancing autonomous driving toward visual reasoning. Conclusion: Through visual generation and reasoning, the method significantly improves autonomous driving models. Abstract: Visual language models (VLMs) have attracted increasing interest in autonomous driving due to their powerful reasoning capabilities. However, existing VLMs typically utilize discrete text Chain-of-Thought (CoT) tailored to the current scenario, which essentially represents highly abstract and symbolic compression of visual information, potentially leading to spatio-temporal relationship ambiguity and fine-grained information loss. Is autonomous driving better modeled on real-world simulation and imagination than on pure symbolic logic? In this paper, we propose a spatio-temporal CoT reasoning method that enables models to think visually. First, the VLM serves as a world model to generate a unified image frame for predicting future world states, where perception results (e.g., lane dividers and 3D detections) represent the future spatial relationships and an ordinary future frame represents the temporal evolution relationships. This spatio-temporal CoT then serves as intermediate reasoning steps, enabling the VLM to function as an inverse dynamics model for trajectory planning based on current observations and future predictions. To implement visual generation in VLMs, we propose a unified pretraining paradigm integrating visual generation and understanding, along with a progressive visual CoT enhancing autoregressive image generation. Extensive experimental results demonstrate the effectiveness of the proposed method, advancing autonomous driving towards visual reasoning.

[66] Semi-Supervised Medical Image Segmentation via Dual Networks

Yunyao Lu,Yihang Wu,Reem Kateb,Ahmad Chaddad

Main category: cs.CV

TL;DR: Proposes a semi-supervised 3D medical image segmentation method that reduces dependence on large expert-annotated datasets.

Details Motivation: Conventional supervised models need large amounts of labeled data, while semi-supervised models suffer from noisy pseudo-labels and limited supervision in feature space. Method: Uses a dual-network architecture to address the use of contextual information and pseudo-label reliability, combined with self-supervised contrastive learning to strengthen the network's representations. Result: In experiments on clinical magnetic resonance imaging, the method outperforms existing techniques. Conclusion: The method effectively addresses key problems in semi-supervised medical image segmentation and performs strongly. Abstract: Traditional supervised medical image segmentation models require large amounts of labeled data for training; however, obtaining such large-scale labeled datasets in the real world is extremely challenging. Recent semi-supervised segmentation models also suffer from noisy pseudo-labels and limited supervision in feature space. To solve these challenges, we propose an innovative semi-supervised 3D medical image segmentation method to reduce the dependency on large, expert-labeled datasets. Furthermore, we introduce a dual-network architecture to address the limitations of existing methods in using contextual information and generating reliable pseudo-labels. In addition, a self-supervised contrastive learning strategy is used to enhance the representation of the network and reduce prediction uncertainty by distinguishing between reliable and unreliable predictions. Experiments on clinical magnetic resonance imaging demonstrate that our approach outperforms state-of-the-art techniques. Our code is available at https://github.com/AIPMLab/Semi-supervised-Segmentation.

[67] ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection

Ziteng Yang,Jingzehua Xu,Yanshu Li,Zepeng Li,Yeqiang Wang,Xinghui Li

Main category: cs.CV

TL;DR: ViP$^{2}$-CLIP uses a Visual-Perception Prompting (ViP-Prompt) mechanism to adaptively generate fine-grained textual prompts, overcoming the limitations of handcrafted templates and static learned prompts in zero-shot anomaly detection and achieving strong cross-domain performance.

Details Motivation: Existing CLIP-based zero-shot anomaly detection methods rely on handcrafted templates or static learned prompts, which limits semantic coverage and adaptability, and CLIP's sensitivity to class names constrains prompting strategies. Method: Proposes ViP$^{2}$-CLIP, whose Visual-Perception Prompting (ViP-Prompt) mechanism fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts without handcrafted templates or class-name priors. Result: On 15 industrial and medical benchmarks, ViP$^{2}$-CLIP achieves state-of-the-art performance and strong cross-domain generalization. Conclusion: ViP$^{2}$-CLIP's adaptive prompting significantly improves zero-shot anomaly detection and suits settings where category labels are ambiguous or privacy-constrained. Abstract: Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model's ZSAD potential via handcrafted or static learnable prompts. The former incur high engineering costs and limited semantic coverage, whereas the latter apply identical descriptions across diverse anomaly types and thus fail to adapt to complex variations. Furthermore, since CLIP is originally pretrained on large-scale classification tasks, its anomaly segmentation quality is highly sensitive to the exact wording of class names, severely constraining prompting strategies that depend on class labels. To address these challenges, we introduce ViP$^{2}$-CLIP. The key insight of ViP$^{2}$-CLIP is a Visual-Perception Prompting (ViP-Prompt) mechanism, which fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts, eliminating manual templates and class-name priors. This design enables our model to focus on precise abnormal regions, making it particularly valuable when category labels are ambiguous or privacy-constrained. Extensive experiments on 15 industrial and medical benchmarks demonstrate that ViP$^{2}$-CLIP achieves state-of-the-art performance and robust cross-domain generalization.

[68] Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek

Xueyang Li,Jiahao Li,Yu Song,Yunzhong Lou,Xiangdong Zhou

Main category: cs.CV

TL;DR: Seek-CAD is the first exploration of the locally deployed open-source LLM DeepSeek-R1 for parametric CAD model generation; it combines visual and CoT feedback in a self-refinement mechanism and introduces a 3D CAD dataset built on the SSR design paradigm.

Details Motivation: Addresses the high cost and local-deployment limitations of closed-source LLMs, and improves the flexibility and efficiency of CAD generative models. Method: Uses DeepSeek-R1 for training-free CAD model generation, with a self-refinement mechanism driven by visual and CoT feedback and a VLM that assesses the generated model. Result: Experiments validate Seek-CAD's effectiveness under various metrics, and the approach suits industrial applications. Conclusion: Seek-CAD offers a feasible route for applying open-source LLMs to CAD generation, with practical potential. Abstract: The advent of Computer-Aided Design (CAD) generative modeling will significantly transform the design of industrial products. Recent research has extended into the realm of Large Language Models (LLMs). In contrast to fine-tuning methods, training-free approaches typically utilize advanced closed-source LLMs, thereby offering enhanced flexibility and efficiency in the development of AI agents for generating CAD parametric models. However, the substantial cost and limitations of local deployment of the top-tier closed-source LLMs pose challenges in practical applications. Seek-CAD is a pioneering exploration of the locally deployed open-source inference LLM DeepSeek-R1 for CAD parametric model generation with a training-free methodology. This study is the first investigation to incorporate both visual and Chain-of-Thought (CoT) feedback within the self-refinement mechanism for generating CAD models. Specifically, the initial generated parametric CAD model is rendered into a sequence of step-wise perspective images, which are subsequently processed by a Vision Language Model (VLM) alongside the corresponding CoTs derived from DeepSeek-R1 to assess the CAD model generation. Then, the feedback is utilized by DeepSeek-R1 to refine the initial generated model for the next round of generation. Moreover, we present an innovative 3D CAD model dataset structured around the SSR (Sketch, Sketch-based feature, and Refinements) triple design paradigm. This dataset encompasses a wide range of CAD commands, thereby aligning effectively with industrial application requirements and proving suitable for the generation of LLMs. Extensive experiments validate the effectiveness of Seek-CAD under various metrics.

[69] SeaLion: Semantic Part-Aware Latent Point Diffusion Models for 3D Generation

Dekai Zhu,Yan Di,Stefan Gavranovic,Slobodan Ilic

Main category: cs.CV

TL;DR: SeaLion is a new diffusion model that generates high-quality point clouds with fine-grained segmentation labels, and it introduces a new evaluation metric, p-CD.

Details Motivation: Generating point clouds with segmentation labels, and metrics for evaluating them, have received little attention. Method: Proposes semantic part-aware latent point diffusion, which jointly predicts noise and segmentation labels, and introduces the p-CD evaluation method. Result: On ShapeNet and IntrA, SeaLion significantly outperforms DiffFacto in generation quality and diversity. Conclusion: SeaLion has potential for generative data augmentation and 3D shape editing, and supports semi-supervised training. Abstract: Denoising diffusion probabilistic models have achieved significant success in point cloud generation, enabling numerous downstream applications, such as generative data augmentation and 3D model editing. However, little attention has been given to generating point clouds with point-wise segmentation labels, as well as to developing evaluation metrics for this task. Therefore, in this paper, we present SeaLion, a novel diffusion model designed to generate high-quality and diverse point clouds with fine-grained segmentation labels. Specifically, we introduce the semantic part-aware latent point diffusion technique, which leverages the intermediate features of the generative models to jointly predict the noise for perturbed latent points and associated part segmentation labels during the denoising process, and subsequently decodes the latent points to point clouds conditioned on part segmentation labels. To effectively evaluate the quality of generated point clouds, we introduce a novel point cloud pairwise distance calculation method named part-aware Chamfer distance (p-CD). This method enables existing metrics, such as 1-NNA, to measure both the local structural quality and inter-part coherence of generated point clouds. Experiments on the large-scale synthetic dataset ShapeNet and real-world medical dataset IntrA demonstrate that SeaLion achieves remarkable performance in generation quality and diversity, outperforming the existing state-of-the-art model, DiffFacto, by 13.33% and 6.52% on 1-NNA (p-CD) across the two datasets. Experimental analysis shows that SeaLion can be trained semi-supervised, thereby reducing the demand for labeling efforts. Lastly, we validate the applicability of SeaLion in generative data augmentation for training segmentation models and the capability of SeaLion to serve as a tool for part-aware 3D shape editing.

[70] Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

Donghwan Chi,Hyomin Kim,Yoonjin Oh,Yongjin Kim,Donghoon Lee,Daejin Jo,Jongmin Kim,Junyeob Baek,Sungjin Ahn,Sungwoong Kim

Main category: cs.CV

TL;DR: Proposes a Slot Attention-based visual tokenizer for multimodal large language models (MLLMs) that improves understanding and generation of local visual detail.

Details Motivation: Existing image tokenization methods capture only global abstract concepts or uniformly segmented image patches, limiting MLLMs' ability to understand and generate detailed visual content at the object level. Method: Building on a Q-Former encoder, a diffusion decoder, and residual vector quantization, proposes discretized slot tokens that encode local visual detail while preserving high-level semantics and aligning with textual data. Result: Slot-MLLM significantly outperforms baselines across vision-language tasks, especially those requiring local detailed comprehension and generation. Conclusion: This is the first demonstration that object-centric slot attention is feasible with MLLMs and natural images, pointing to a new direction for vision-language models. Abstract: Recently, multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively both in input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs' capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can encode local visual details while maintaining high-level semantics, and also align with textual data to be integrated seamlessly within a unified next-token prediction framework of LLMs. The resulting Slot-MLLM demonstrates significant performance improvements over baselines with previous visual tokenizers across various vision-language tasks that entail local detailed comprehension and generation. Notably, this work is the first demonstration of the feasibility of object-centric slot attention performed with MLLMs and in-the-wild natural images.
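Residual vector quantization, one of the building blocks named in the abstract, turns a continuous vector into a short sequence of discrete indices by quantizing what each previous stage left over. The sketch below shows the generic technique only; the two tiny codebooks and the functions `rvq_encode` and `rvq_decode` are illustrative assumptions, not Slot-MLLM's learned codebooks or training procedure.

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry with the smallest squared distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, codebook[i])),
    )

def rvq_encode(vector, codebooks):
    """Quantize in stages: each stage encodes the residual of the previous one."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        i = nearest_index(residual, codebook)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    output = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        output = [o + c for o, c in zip(output, codebook[i])]
    return output

# Two hand-made stages: a coarse codebook and a finer correction codebook.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]],
]
tokens = rvq_encode([1.4, 0.6], codebooks)
reconstruction = rvq_decode(tokens, codebooks)
```

Here the first stage picks `[1.0, 1.0]`, the second corrects the remaining `[0.4, -0.4]` with `[0.5, -0.5]`, and the discrete pair of indices is what a language model could consume as tokens.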

[71] SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain

Jiawei Zhou,Linye Lyu,Zhuotao Tian,Cheng Zhuo,Yu Li

Main category: cs.CV

TL;DR: SafeMVDrive is a framework for generating high-quality, multi-view, safety-critical driving videos, filling a gap left by existing methods on real-world multi-view data.

Details Motivation: Existing methods cannot meet end-to-end autonomous driving systems' need for real-world multi-view video data, so a new method is needed to generate videos of safety-critical scenarios. Method: Combines a safety-critical trajectory generator with a multi-view video generator, enhancing the trajectory generator's scene understanding and introducing a two-stage controllable trajectory generation mechanism to produce high-quality videos. Result: Experiments show that the generated videos significantly increase the collision rate of an end-to-end autonomous driving planner, validating their effectiveness for stress testing. Conclusion: SafeMVDrive provides an effective tool for testing autonomous driving planning modules and fills the gap in multi-view safety-critical video generation. Abstract: Safety-critical scenarios are rare yet pivotal for evaluating and enhancing the robustness of autonomous driving systems. While existing methods generate safety-critical driving trajectories, simulations, or single-view videos, they fall short of meeting the demands of advanced end-to-end autonomous systems (E2E AD), which require real-world, multi-view video data. To bridge this gap, we introduce SafeMVDrive, the first framework designed to generate high-quality, safety-critical, multi-view driving videos grounded in real-world domains. SafeMVDrive strategically integrates a safety-critical trajectory generator with an advanced multi-view video generator. To tackle the challenges inherent in this integration, we first enhance the scene understanding ability of the trajectory generator by incorporating visual context, which was previously unavailable to such generators, and leveraging a GRPO-finetuned vision-language model to achieve more realistic and context-aware trajectory generation. Second, recognizing that existing multi-view video generators struggle to render realistic collision events, we introduce a two-stage, controllable trajectory generation mechanism that produces collision-evasion trajectories, ensuring both video quality and safety-critical fidelity. Finally, we employ a diffusion-based multi-view video generator to synthesize high-quality safety-critical driving videos from the generated trajectories. Experiments conducted on an E2E AD planner demonstrate a significant increase in collision rate when tested with our generated data, validating the effectiveness of SafeMVDrive in stress-testing planning modules. Our code, examples, and datasets are publicly available at: https://zhoujiawei3.github.io/SafeMVDrive/.

[72] RQR3D: Reparametrizing the regression targets for BEV-based 3D object detection

Ozsel Kilinc,Cem Tarhan

Main category: cs.CV

TL;DR: The paper proposes RQR3D, a new method for improving BEV (bird's-eye view) 3D object detection that uses a Restricted Quadrilateral Representation to turn the problem into a keypoint regression task, significantly improving performance.

Details Motivation: BEV-based 3D object detection is essential for autonomous driving, but existing methods suffer from loss-function discontinuities caused by angle-based representations, which hurts detection accuracy. Method: RQR3D regresses a horizontal bounding box together with corner offsets, turning the problem into keypoint regression; it is combined with an anchor-free single-stage detector, adds an objectness head to address class imbalance, and uses a simplified radar fusion backbone. Result: On the nuScenes dataset, RQR3D achieves the best camera-radar 3D object detection performance, improving NDS by 4% and mAP by 2.4% and significantly reducing translation and orientation errors. Conclusion: RQR3D performs well in accuracy, robustness, and practical applicability, offering a more reliable 3D perception solution for autonomous driving. Abstract: Accurate, fast, and reliable 3D perception is essential for autonomous driving. Recently, bird's-eye view (BEV)-based perception approaches have emerged as superior alternatives to perspective-based solutions, offering enhanced spatial understanding and more natural outputs for planning. Existing BEV-based 3D object detection methods, typically adhering to angle-based representation, directly estimate the size and orientation of rotated bounding boxes. We observe that BEV-based 3D object detection is analogous to aerial oriented object detection, where angle-based methods are recognized for being affected by discontinuities in their loss functions. Drawing inspiration from this domain, we propose Restricted Quadrilateral Representation to define 3D regression targets. RQR3D regresses the smallest horizontal bounding box encapsulating the oriented box, along with the offsets between the corners of these two boxes, thereby transforming the oriented object detection problem into a keypoint regression task. RQR3D is compatible with any 3D object detection approach. We employ RQR3D within an anchor-free single-stage object detection method and introduce an objectness head to address the class imbalance problem. Furthermore, we introduce a simplified radar fusion backbone that eliminates the need for voxel grouping and processes the BEV-mapped point cloud with standard 2D convolutions, rather than sparse convolutions.
Extensive evaluations on the nuScenes dataset demonstrate that RQR3D achieves state-of-the-art performance in camera-radar 3D object detection, outperforming the previous best method by +4% in NDS and +2.4% in mAP, and significantly reducing the translation and orientation errors, which are crucial for safe autonomous driving. These consistent gains highlight the robustness, precision, and real-world readiness of our approach.
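
The regression target described in the abstract (the smallest horizontal box enclosing an oriented box, plus corner offsets) can be illustrated with a small sketch. This is not the authors' code: the function names, the fixed corner ordering, and the way each oriented corner is paired with a horizontal-box corner are assumptions made only for illustration, since the abstract does not specify them.

```python
import math


def oriented_corners(cx, cy, length, width, yaw):
    """Return the four BEV corners of a box rotated by `yaw` radians."""
    c, s = math.cos(yaw), math.sin(yaw)
    local = [(length / 2, width / 2), (length / 2, -width / 2),
             (-length / 2, -width / 2), (-length / 2, width / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in local]


def horizontal_corners(hbox):
    """Corners of an axis-aligned box, in the same fixed order as above."""
    x_min, y_min, x_max, y_max = hbox
    return [(x_max, y_max), (x_max, y_min), (x_min, y_min), (x_min, y_max)]


def encode(cx, cy, length, width, yaw):
    """Encode an oriented box as (enclosing horizontal box, corner offsets).

    Assumption: corner i of the oriented box is paired with corner i of the
    horizontal box; the paper may use a different pairing.
    """
    corners = oriented_corners(cx, cy, length, width, yaw)
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    hbox = (min(xs), min(ys), max(xs), max(ys))
    offsets = [(ox - hx, oy - hy)
               for (ox, oy), (hx, hy) in zip(corners, horizontal_corners(hbox))]
    return hbox, offsets


def decode(hbox, offsets):
    """Recover the oriented corners from the encoded targets."""
    return [(hx + dx, hy + dy)
            for (hx, hy), (dx, dy) in zip(horizontal_corners(hbox), offsets)]
```

In this sketch every target is a plain coordinate or coordinate difference, so a network trained on them never regresses an angle directly.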

[73] R-Genie: Reasoning-Guided Generative Image Editing

Dong Zhang,Lingfeng He,Rui Yan,Fei Shen,Jinhui Tang

Main category: cs.CV

TL;DR: The paper proposes a new image editing paradigm, reasoning-guided generative editing (R-Genie), which combines diffusion models with multimodal large language models and uses a reasoning-attention mechanism to edit images from complex textual queries.

Details Motivation: Current image editing methods are constrained by explicit textual instructions and a limited set of operations, and lack deep understanding of implicit user intent and contextual reasoning. Method: Build a dataset of image-instruction-edit triples with rich reasoning context, and propose the R-Genie model, which combines a diffusion model with a multimodal large language model and introduces a reasoning-attention mechanism. Result: Experiments show that R-Genie equips diffusion models with reasoning-based editing capabilities and expands the potential of intelligent image synthesis. Conclusion: Through reasoning-guided generative editing, R-Genie handles edits involving complex intent and contextual relations, opening a new direction for intelligent image synthesis. Abstract: While recent advances in image editing have enabled impressive visual synthesis capabilities, current methods remain constrained by explicit textual instructions and limited editing operations, lacking deep comprehension of implicit user intentions and contextual reasoning. In this work, we introduce a new image editing paradigm: reasoning-guided generative editing, which synthesizes images based on complex, multi-faceted textual queries involving world knowledge and intention inference. To facilitate this task, we first construct a comprehensive dataset featuring over 1,000 image-instruction-edit triples that incorporate rich reasoning contexts and real-world knowledge. We then propose R-Genie: a reasoning-guided generative image editor, which synergizes the generation power of diffusion models with advanced reasoning capabilities of multimodal large language models. R-Genie incorporates a reasoning-attention mechanism to bridge linguistic understanding with visual synthesis, enabling it to handle intricate editing requests involving abstract user intentions and contextual reasoning relations. Extensive experimental results validate that R-Genie can equip diffusion models with advanced reasoning-based editing capabilities, unlocking new potential for intelligent image synthesis.

[74] TopoPoint: Enhance Topology Reasoning via Endpoint Detection in Autonomous Driving

Yanping Fu,Xinyuan Liu,Tianyu Li,Yike Ma,Yucheng Zhang,Feng Dai

Main category: cs.CV

TL;DR: TopoPoint is a new framework that explicitly detects lane endpoints and reasons jointly over endpoints and lanes, addressing the lane endpoint deviation that affects existing topology reasoning methods.

Details Motivation: Endpoint deviation in existing lane detection methods leads to inaccurate topology reasoning, which harms intersection understanding in autonomous driving. Method: TopoPoint uses Point-Lane Merge Self-Attention and a Point-Lane Graph Convolutional Network to strengthen global context sharing and feature aggregation. At inference time, a Point-Lane Geometry Matching algorithm refines the endpoints. Result: On the OpenLane-V2 benchmark, TopoPoint achieves state-of-the-art performance in both topology reasoning (OLS 48.8) and endpoint detection (DET$_p$ 52.6). Conclusion: By improving endpoint detection and reasoning jointly over points and lanes, TopoPoint significantly improves the robustness and accuracy of topology reasoning. Abstract: Topology reasoning, which unifies perception and structured reasoning, plays a vital role in understanding intersections for autonomous driving. However, its performance heavily relies on the accuracy of lane detection, particularly at connected lane endpoints. Existing methods often suffer from lane endpoint deviation, leading to incorrect topology construction. To address this issue, we propose TopoPoint, a novel framework that explicitly detects lane endpoints and jointly reasons over endpoints and lanes for robust topology reasoning. During training, we independently initialize point and lane queries, and propose Point-Lane Merge Self-Attention to enhance global context sharing by incorporating geometric distances between points and lanes as an attention mask. We further design a Point-Lane Graph Convolutional Network to enable mutual feature aggregation between point and lane queries. During inference, we introduce a Point-Lane Geometry Matching algorithm that computes distances between detected points and lanes to refine lane endpoints, effectively mitigating endpoint deviation. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoPoint achieves state-of-the-art performance in topology reasoning (48.8 on OLS). Additionally, we propose DET$_p$ to evaluate endpoint detection, under which our method significantly outperforms existing approaches (52.6 vs. 45.2 on DET$_p$). The code is released at https://github.com/Franpin/TopoPoint.
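
The abstract only says that the inference-time matching step computes distances between detected points and lanes to refine lane endpoints. One plausible reading, sketched below, snaps each lane endpoint to the nearest detected point when that point is close enough. The function name `refine_endpoints`, the `max_dist` threshold, and the snapping rule itself are hypothetical and are not taken from the TopoPoint code.

```python
import math


def refine_endpoints(lanes, points, max_dist):
    """Snap each lane's first and last vertex to the nearest detected point.

    lanes:    list of polylines, each a list of (x, y) tuples
    points:   list of detected endpoint candidates as (x, y) tuples
    max_dist: hypothetical threshold; endpoints farther than this from
              every detected point are left unchanged
    """
    refined = []
    for lane in lanes:
        lane = list(lane)  # copy so the input polyline is not modified
        for idx in (0, -1):
            ex, ey = lane[idx]
            best = min(points,
                       key=lambda p: math.hypot(p[0] - ex, p[1] - ey),
                       default=None)
            if best is not None and math.hypot(best[0] - ex, best[1] - ey) <= max_dist:
                lane[idx] = best
        refined.append(lane)
    return refined
```

Under this reading, two lanes whose endpoints deviate slightly around the same junction end up sharing one exact endpoint, which is the property a topology graph needs.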

[75] TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

Yu Xie,Jielei Zhang,Pengyu Chen,Ziyue Wang,Weihang Wang,Longwen Gao,Peiyi Li,Huyang Sun,Qiang Zhang,Qian Qiao,Jiaqing Fan,Zhouhui Lian

Main category: cs.CV

TL;DR: TextFlux is a DiT-based framework for multilingual scene text synthesis that needs no OCR encoder, supports low-resource multilingual generation, requires only 1% of the training data used by competing methods, and offers flexible control over multi-line text generation.

Details Motivation: Existing methods depend on additional visual conditioning modules and large-scale annotated data, which limits the efficiency and scalability of multilingual scene text synthesis. Method: Exploit the inherent contextual reasoning ability of diffusion models and propose the TextFlux framework, which simplifies the model architecture and reduces data dependence. Result: TextFlux performs well in low-resource multilingual settings, produces high-quality output, and supports flexible multi-line text control. Conclusion: By simplifying the architecture and reducing data requirements, TextFlux significantly improves the efficiency and performance of multilingual scene text synthesis. Abstract: Diffusion-based scene text synthesis has progressed rapidly, yet existing methods commonly rely on additional visual conditioning modules and require large-scale annotated data to support multilingual generation. In this work, we revisit the necessity of complex auxiliary modules and further explore an approach that simultaneously ensures glyph accuracy and achieves high-fidelity scene integration, by leveraging diffusion models' inherent capabilities for contextual reasoning. To this end, we introduce TextFlux, a DiT-based framework that enables multilingual scene text synthesis. The advantages of TextFlux can be summarized as follows: (1) OCR-free model architecture. TextFlux eliminates the need for OCR encoders (additional visual conditioning modules) that are specifically used to extract visual text-related features. (2) Strong multilingual scalability. TextFlux is effective in low-resource multilingual settings, and achieves strong performance in newly added languages with fewer than 1,000 samples. (3) Streamlined training setup. TextFlux is trained with only 1% of the training data required by competing methods. (4) Controllable multi-line text generation. TextFlux offers flexible multi-line synthesis with precise line-level control, outperforming methods restricted to single-line or rigid layouts. Extensive experiments and visualizations demonstrate that TextFlux outperforms previous methods in both qualitative and quantitative evaluations.

[76] U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding

Anjie Le,Henan Liu,Yue Wang,Zhenyu Liu,Rongkun Zhu,Taohan Weng,Jinze Yu,Boyang Wang,Yalun Wu,Kaiwen Yan,Quanlin Sun,Meirui Jiang,Jialun Pei,Siya Liu,Haoyun Zheng,Zhoujun Li,Alison Noble,Jacques Souquet,Xiaoqing Guo,Manxi Lin,Hongcheng Guo

Main category: cs.CV

TL;DR: U2-BENCH is the first comprehensive benchmark for evaluating large vision-language models (LVLMs) on ultrasound understanding, covering classification, detection, regression, and text generation tasks; it reveals strong image classification performance but persistent difficulty with spatial reasoning and clinical language generation.

Details Motivation: Ultrasound images are hard to interpret because of variation across operators, noise, and anatomical structures, and the performance of LVLMs on ultrasound has not been adequately explored. Method: U2-BENCH aggregates 7,241 cases covering 15 anatomical regions and 50 application scenarios, defines 8 clinical tasks, and evaluates 20 LVLMs. Result: LVLMs perform well on image classification tasks but still struggle with spatial reasoning and clinical language generation. Conclusion: U2-BENCH provides a rigorous testbed for LVLM research in ultrasound and helps advance this multimodal domain. Abstract: Ultrasound is a widely used imaging modality critical to global healthcare, yet its interpretation remains challenging because image quality varies with operators, noise, and anatomical structures. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 20 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.

[77] Hephaestus Minicubes: A Global, Multi-Modal Dataset for Volcanic Unrest Monitoring

Nikolas Papadopoulos,Nikolaos Ioannis Bountos,Maria Sdraka,Andreas Karavias,Ioannis Papoutsis

Main category: cs.CV

TL;DR: The paper introduces the Hephaestus Minicubes dataset to support deep learning research on volcanic activity monitoring and demonstrates its use in multi-modal, multi-temporal classification and semantic segmentation tasks.

Details Motivation: Ground deformation is an important precursor signal of volcanic eruptions, but deep learning in this field is held back by the lack of annotated datasets. Method: Building on the Hephaestus dataset, construct Hephaestus Minicubes, a collection of 38 spatiotemporal datacubes covering 44 active volcanoes that integrates InSAR, topographic, and atmospheric data and provides expert annotations. Result: The dataset supports multi-modal, multi-temporal tasks, and benchmark experiments demonstrate its performance. Conclusion: The work advances the application of machine learning to volcanic monitoring and promotes the integration of data-driven methods in Earth science. Abstract: Ground deformation is regarded in volcanology as a key precursor signal preceding volcanic eruptions. Satellite-based Interferometric Synthetic Aperture Radar (InSAR) enables consistent, global-scale deformation tracking; however, deep learning methods remain largely unexplored in this domain, mainly due to the lack of a curated machine learning dataset. In this work, we build on the existing Hephaestus dataset, and introduce Hephaestus Minicubes, a global collection of 38 spatiotemporal datacubes offering high resolution, multi-source and multi-temporal information, covering 44 of the world's most active volcanoes over a 7-year period. Each spatiotemporal datacube integrates InSAR products, topographic data, as well as atmospheric variables which are known to introduce signal delays that can mimic ground deformation in InSAR imagery. Furthermore, we provide expert annotations detailing the type, intensity and spatial extent of deformation events, along with rich text descriptions of the observed scenes. Finally, we present a comprehensive benchmark, demonstrating Hephaestus Minicubes' ability to support volcanic unrest monitoring as a multi-modal, multi-temporal classification and semantic segmentation task, establishing strong baselines with state-of-the-art architectures. This work aims to advance machine learning research in volcanic monitoring, contributing to the growing integration of data-driven methods within Earth science applications.

[78] Generative Data Augmentation for Object Point Cloud Segmentation

Dekai Zhu,Stefan Gavranovic,Flavien Boussuge,Benjamin Busam,Slobodan Ilic

Main category: cs.CV

TL;DR: The paper proposes a diffusion-based generative data augmentation (GDA) method for point cloud segmentation that clearly outperforms traditional data augmentation and semi-supervised methods.

Details Motivation: Traditional data augmentation (TDA) adds little data diversity in point cloud segmentation, while existing generative models produce shapes without semantic labels. The paper aims to use diffusion models to generate high-quality labeled point cloud data. Method: Extend the 3D diffusion model Lion so that it generates point clouds conditioned on segmentation masks; propose a 3-step GDA pipeline that generates variants and pseudo-labeled samples and filters the pseudo-labels with a diffusion model. Result: On two synthetic datasets and one real-world medical dataset, GDA outperforms TDA and related semi-supervised and self-supervised methods. Conclusion: By generating high-quality labeled data, GDA effectively improves point cloud segmentation performance. Abstract: Data augmentation is widely used to train deep learning models to address data scarcity. However, traditional data augmentation (TDA) typically relies on simple geometric transformations, such as random rotation and rescaling, resulting in minimal data diversity enrichment and limited model performance improvement. State-of-the-art generative models for 3D shape generation rely on denoising diffusion probabilistic models and manage to generate realistic novel point clouds for 3D content creation and manipulation. Nevertheless, the generated 3D shapes lack associated point-wise semantic labels, restricting their usage in enlarging the training data for point cloud segmentation tasks. To bridge the gap between data augmentation techniques and the advanced diffusion models, we extend the state-of-the-art 3D diffusion model, Lion, to a part-aware generative model that can generate high-quality point clouds conditioned on given segmentation masks. Leveraging the novel generative model, we introduce a 3-step generative data augmentation (GDA) pipeline for point cloud segmentation training. Our GDA approach requires only a small amount of labeled samples but enriches the training data with generated variants and pseudo-labeled samples, which are validated by a novel diffusion-based pseudo-label filtering method. Extensive experiments on two large-scale synthetic datasets and a real-world medical dataset demonstrate that our GDA method outperforms the TDA approach and related semi-supervised and self-supervised methods.

[79] DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval

Yuxin Yang,Yinan Zhou,Yuxin Chen,Ziqi Zhang,Zongyang Ma,Chunfeng Yuan,Bing Li,Lin Song,Jun Gao,Peng Li,Weiming Hu

Main category: cs.CV

TL;DR: DetailFusion is a dual-branch framework that coordinates global and detail-level information to improve composed image retrieval (CIR).

Details Motivation: Existing methods fuse global information coarsely and pay too little attention to fine-grained detail, so they struggle with subtle visual changes or intricate textual instructions. Method: Use a dual-branch framework that combines atomic detail variation priors with a detail-oriented optimization strategy, and design an Adaptive Feature Compositor that dynamically fuses features. Result: State-of-the-art performance on the CIRR and FashionIQ datasets, validating the effectiveness and cross-domain adaptability of detail enhancement. Conclusion: Detail enhancement lets DetailFusion significantly improve performance on the CIR task. Abstract: Composed Image Retrieval (CIR) aims to retrieve target images from a gallery based on a reference image and modification text as a combined query. Recent approaches focus on balancing global information from two modalities and encode the query into a unified feature for retrieval. However, due to insufficient attention to fine-grained details, these coarse fusion methods often struggle with handling subtle visual alterations or intricate textual instructions. In this work, we propose DetailFusion, a novel dual-branch framework that effectively coordinates information across global and detailed granularities, thereby enabling detail-enhanced CIR. Our approach leverages atomic detail variation priors derived from an image editing dataset, supplemented by a detail-oriented optimization strategy to develop a Detail-oriented Inference Branch. Furthermore, we design an Adaptive Feature Compositor that dynamically fuses global and detailed features based on fine-grained information of each unique multimodal query. Extensive experiments and ablation analyses not only demonstrate that our method achieves state-of-the-art performance on both CIRR and FashionIQ datasets but also validate the effectiveness and cross-domain adaptability of detail enhancement for CIR.

[80] Temporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition

Ping Li,Jianan Ni,Bo Pang

Main category: cs.CV

TL;DR: Proposes BMTC, an adversarial attack method based on background mixup and temporal consistency that improves the transferability of adversarial examples against action recognition models.

Details Motivation: Existing adversarial attacks assume that the source and target models have similar decision boundaries, and their attack direction is uncertain, which limits transferability. Method: Design a model-agnostic background mixup module that uses reinforcement learning to select the background frames with the strongest attack ability for mixup, and use the background category to guide gradient updates, stabilizing the attack direction. Result: Experiments on UCF101, Kinetics-400, and ImageNet validate the method, which significantly improves the transferability of adversarial examples. Conclusion: By reducing model dependency and stabilizing the attack direction, BMTC significantly improves the transferability of adversarial attacks. Abstract: Action recognition models using deep learning are vulnerable to adversarial examples, which are transferable across other models trained on the same data modality. Existing transferable attack methods face two major challenges: 1) they heavily rely on the assumption that the decision boundaries of the surrogate (a.k.a., source) model and the target model are similar, which limits the adversarial transferability; and 2) their decision boundary difference makes the attack direction uncertain, which may result in gradient oscillation, weakening the adversarial attack. This motivates us to propose a Background Mixup-induced Temporal Consistency (BMTC) attack method for action recognition. From the input transformation perspective, we design a model-agnostic background adversarial mixup module to reduce the surrogate-target model dependency. In particular, we randomly sample one video from each category and produce its background frame, while selecting the background frame with the top attack ability for mixup with the clean frame by reinforcement learning. Moreover, to ensure an explicit attack direction, we leverage the background category as guidance for updating the gradient of the adversarial example, and design a temporal gradient consistency loss, which strengthens the stability of the attack direction on subsequent frames. Empirical studies on two video datasets, i.e., UCF101 and Kinetics-400, and one image dataset, i.e., ImageNet, demonstrate that our method significantly boosts the transferability of adversarial examples across several action/image recognition models. Our code is available at https://github.com/mlvccn/BMTC_TransferAttackVid.

[81] An Attention Infused Deep Learning System with Grad-CAM Visualization for Early Screening of Glaucoma

Ramanathan Swaminathan

Main category: cs.CV

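TL;DR: The paper proposes a hybrid deep learning model that combines a convolutional neural network with a Vision Transformer, linked by a cross-attention module, and evaluates it on the ACRIMA and Drishti glaucoma detection datasets.

Details Motivation: Explore the potential of combining convolutional neural networks with Vision Transformers to improve the accuracy and efficiency of glaucoma detection. Method: Use a hybrid model that combines a convolutional neural network and a Vision Transformer through a cross-attention module, trained and tested on the ACRIMA and Drishti datasets. Result: The model is reported to perform well on glaucoma detection. Conclusion: The hybrid model combines the strengths of both architectures and offers a new approach to glaucoma detection. Abstract: This research work reveals the eye-opening wisdom of the hybrid labyrinthine deep learning model synergy born out of combining a trailblazing convolutional neural network with a disruptive Vision Transformer, both intertwined with a radical Cross Attention module. Here, two high-yielding datasets for artificial intelligence models in detecting glaucoma, namely ACRIMA and Drishti, are utilized.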

[82] Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations

Boxu Chen,Ziwei Zheng,Le Yang,Zeyu Geng,Zhengyu Zhao,Chenhao Lin,Chao Shen

Main category: cs.CV

TL;DR: The VaLSe framework reduces object hallucination in large vision-language models by means of visual contribution maps and latent space steering.

Details Motivation: Address object hallucination in large vision-language models and understand the visual decision-making mechanisms behind it. Method: Adopt an interpretation-then-mitigation strategy that generates visual contribution maps and steers latent space representations. Result: VaLSe significantly reduces hallucinated outputs and exposes limitations of existing evaluation metrics. Conclusion: VaLSe is an effective interpretability tool and robustness-enhancing method; evaluation metrics need to be improved in future work. Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH), generating outputs inconsistent with visual inputs. While previous work has proposed methods to reduce OH, the visual decision-making mechanisms that lead to hallucinations remain poorly understood. In this paper, we propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs. By tackling dual challenges of modeling complex vision-language interactions and eliminating spurious activation artifacts, VaLSe can generate visual contribution maps that trace how specific visual inputs influence individual output tokens. These maps reveal the model's vision-aware focus regions, which are then used to perform latent space steering, realigning internal representations toward semantically relevant content and reducing hallucinated outputs. Extensive experiments demonstrate that VaLSe is a powerful interpretability tool and an effective method for enhancing model robustness against OH across multiple benchmarks. Furthermore, our analysis uncovers limitations in existing OH evaluation metrics, underscoring the need for more nuanced, interpretable, and visually grounded OH benchmarks in future work. Code is available at: https://github.com/Ziwei-Zheng/VaLSe.

[83] ICPL-ReID: Identity-Conditional Prompt Learning for Multi-Spectral Object Re-Identification

Shihao Li,Chenglong Li,Aihua Zheng,Jin Tang,Bin Luo

Main category: cs.CV

TL;DR: Proposes ICPL, a framework built on CLIP's cross-modal alignment that uses online prompt learning and an identity-condition module to handle modality differences in multi-spectral object re-identification, with strong results on several benchmarks.

Details Motivation: Multi-spectral object re-identification has advantages under complex illumination and adverse weather, but existing methods lack fine-grained semantic understanding of spectral information and struggle to exploit the complementarity and discrepancy between spectra. Method: Propose the ICPL framework, comprising online prompt learning, a multi-spectral identity-condition module, and alignment-loop optimization, together with a low-rank adapter that learns spectrum-specific features. Result: Outperforms existing methods on 5 benchmarks (e.g. RGBNT201, Market-MM). Conclusion: Through cross-modal alignment and fine-grained semantic understanding, ICPL significantly improves multi-spectral object re-identification performance. Abstract: Multi-spectral object re-identification (ReID) brings a new perception perspective for smart city and intelligent transportation applications, effectively addressing challenges from complex illumination and adverse weather. However, complex modal differences between heterogeneous spectra pose challenges to efficiently utilizing the complementarity and discrepancy of spectral information. Most existing methods fuse spectral data through intricate modal interaction modules, lacking fine-grained semantic understanding of spectral information (e.g., text descriptions, part masks, and object keypoints). To solve this challenge, we propose a novel Identity-Conditional text Prompt Learning framework (ICPL), which exploits the powerful cross-modal alignment capability of CLIP, to unify different spectral visual features from text semantics. Specifically, we first propose online prompt learning, using a learnable text prompt as the identity-level semantic center to bridge the identity semantics of different spectra in an online manner. Then, lacking concrete text descriptions, we propose the multi-spectral identity-condition module, which uses the identity prototype as a spectral identity condition to constrain prompt learning. Meanwhile, we construct an alignment loop that mutually optimizes the learnable text prompt and the spectral visual encoder to prevent online prompt learning from disrupting the pre-trained text-image alignment distribution. In addition, to adapt to small-scale multi-spectral data and mitigate style differences between spectra, we propose a multi-spectral adapter that employs a low-rank adaptation method to learn spectra-specific features.
Comprehensive experiments on 5 benchmarks, including RGBNT201, Market-MM, MSVR310, RGBN300, and RGBNT100, demonstrate that the proposed method outperforms the state-of-the-art methods.

[84] VLM Models and Automated Grading of Atopic Dermatitis

Marc Lalonde,Hamed Ghodrati

Main category: cs.CV

TL;DR: Evaluates seven vision-language models (VLMs) on grading the severity of atopic dermatitis (AD).

Details Motivation: Grading atopic dermatitis (AD) is challenging even for dermatologists, and advances in multimodal models and vision-language models (VLMs) open new possibilities for explainable assessment of medical images. Method: Run experiments evaluating seven VLMs on a set of AD test images. Result: No specific results are stated. Conclusion: VLMs show potential for AD severity grading but need further validation. Abstract: The task of grading atopic dermatitis (or AD, a form of eczema) from patient images is difficult even for trained dermatologists. Research on automating this task has progressed in recent years with the development of deep learning solutions; however, the rapid evolution of multimodal models and more specifically vision-language models (VLMs) opens the door to new possibilities in terms of explainable assessment of medical images, including dermatology. This report describes experiments carried out to evaluate the ability of seven VLMs to assess the severity of AD on a set of test images.

[85] Locality-Sensitive Hashing for Efficient Hard Negative Sampling in Contrastive Learning

Fabian Deuser,Philipp Hausenblas,Hannah Schieber,Daniel Roth,Martin Werner,Norbert Oswald

Main category: cs.CV

TL;DR: Proposes a GPU-friendly Locality-Sensitive Hashing (LSH) scheme for efficiently finding high-quality hard negatives, improving contrastive learning performance.

Details Motivation: Efficiently finding high-quality hard negatives in large, high-dimensional datasets is computationally challenging. Method: Use a GPU-friendly LSH scheme that quantizes real-valued feature vectors into binary representations for approximate nearest neighbor search. Result: On several textual and visual datasets, performance is comparable to or better than existing methods with significantly less computation. Conclusion: The method offers an efficient and well-performing solution for hard negative mining in contrastive learning. Abstract: Contrastive learning is a representation learning paradigm in which a neural network maps data elements to feature vectors. It improves the feature space by forming lots with an anchor and examples that are either positive or negative based on class similarity. Hard negative examples, which are close to the anchor in the feature space but from a different class, improve learning performance. Finding such examples of high quality efficiently in large, high-dimensional datasets is computationally challenging. In this paper, we propose a GPU-friendly Locality-Sensitive Hashing (LSH) scheme that quantizes real-valued feature vectors into binary representations for approximate nearest neighbor search. We investigate its theoretical properties and evaluate it on several datasets from the textual and visual domains. Our approach achieves comparable or better performance while requiring significantly less computation than existing hard negative mining strategies.
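
The abstract does not say which hash family is used, so the sketch below substitutes a standard one: random-hyperplane sign hashing, where each bit records on which side of a hyperplane a feature vector lies, and Hamming distance between codes approximates angular distance. All function names are illustrative, and this CPU-only version is only meant to show the idea of mining hard negatives from binary codes, not the paper's GPU scheme.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def binarize(vec, planes):
    """Quantize a real-valued vector into an integer bit code (one bit per plane)."""
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


def hard_negatives(anchor_idx, codes, labels, k):
    """Indices of the k samples closest to the anchor in Hamming distance
    among those whose label differs from the anchor's label."""
    anchor_code, anchor_label = codes[anchor_idx], labels[anchor_idx]
    candidates = [(hamming(anchor_code, code), i)
                  for i, (code, label) in enumerate(zip(codes, labels))
                  if label != anchor_label]
    candidates.sort()
    return [i for _, i in candidates[:k]]
```

A typical use would hash every embedding in a batch or memory bank once with `binarize`, then call `hard_negatives` per anchor, so the expensive floating-point similarity search is replaced by XOR and bit counting.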

[86] Multi-task Learning For Joint Action and Gesture Recognition

Konstantinos Spathis,Nikolaos Kardaris,Petros Maragos

Main category: cs.CV

TL;DR: Multi-task learning, which jointly trains action and gesture recognition on shared representations, is more efficient, more robust, and generalizes better than single-task learning.

Details Motivation: Action and gesture recognition are closely related tasks, but existing methods usually handle them separately and fail to exploit their synergy. Method: Adopt a multi-task learning paradigm in which a single deep neural network learns shared representations. Result: Experiments on multiple datasets show that the multi-task approach outperforms single-task learning on both tasks. Conclusion: Multi-task learning handles action and gesture recognition more efficiently and improves performance. Abstract: In practical applications, computer vision tasks often need to be addressed simultaneously. Multi-task learning typically achieves this by jointly training a single deep neural network to learn shared representations, providing efficiency and improving generalization. Although action and gesture recognition are closely related tasks, since they focus on body and hand movements, current state-of-the-art methods handle them separately. In this paper, we show that employing a multi-task learning paradigm for action and gesture recognition results in more efficient, robust and generalizable visual representations, by leveraging the synergies between these tasks. Extensive experiments on multiple action and gesture datasets demonstrate that handling actions and gestures in a single architecture can achieve better performance for both tasks in comparison to their single-task learning variants.

[87] Hyperspectral Anomaly Detection Fused Unified Nonconvex Tensor Ring Factors Regularization

Wenjin Qin,Hailin Wang,Hao Shu,Feng Zhang,Jianjun Wang,Xiangyong Cao,Xi-Le Zhao,Gemine Vivone

Main category: cs.CV

TL;DR: Proposes HAD-EUNTRFR, a new hyperspectral anomaly detection method that uses enhanced nonconvex tensor ring factor regularization to significantly improve detection performance.

Details Motivation: Existing methods do not fully exploit the global correlations and local smoothness of hyperspectral images, leading to suboptimal detection. Method: Use tensor ring decomposition to capture the spatial-spectral correlations of the background, and introduce a nonconvex regularizer that encodes low-rankness and sparsity. Result: The method outperforms state-of-the-art approaches on several benchmark datasets. Conclusion: With an efficient optimization algorithm and a novel regularization design, HAD-EUNTRFR significantly improves hyperspectral anomaly detection accuracy. Abstract: In recent years, tensor decomposition-based approaches for hyperspectral anomaly detection (HAD) have gained significant attention in the field of remote sensing. However, existing methods often fail to fully leverage both the global correlations and local smoothness of the background components in hyperspectral images (HSIs), which exist in both the spectral and spatial domains. This limitation results in suboptimal detection performance. To mitigate this critical issue, we put forward a novel HAD method named HAD-EUNTRFR, which incorporates an enhanced unified nonconvex tensor ring (TR) factors regularization. In the HAD-EUNTRFR framework, the raw HSIs are first decomposed into background and anomaly components. The TR decomposition is then employed to capture the spatial-spectral correlations within the background component. Additionally, we introduce a unified and efficient nonconvex regularizer, induced by tensor singular value decomposition (TSVD), to simultaneously encode the low-rankness and sparsity of the 3-D gradient TR factors into a unique concise form. The above characterization scheme enables the interpretable gradient TR factors to inherit the low-rankness and smoothness of the original background. To further enhance anomaly detection, we design a generalized nonconvex regularization term to exploit the group sparsity of the anomaly component. To solve the resulting doubly nonconvex model, we develop a highly efficient optimization algorithm based on the alternating direction method of multipliers (ADMM) framework. Experimental results on several benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art (SOTA) approaches in terms of detection accuracy.

[88] Track Anything Annotate: Video annotation and dataset generation of computer vision models

Nikita Ivanov,Mark Klimov,Dmitry Glukhikh,Tatiana Chernysheva,Igor Glukhikh

Main category: cs.CV

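TL;DR: Presents a prototype annotation tool based on video tracking and segmentation that significantly speeds up the generation of training datasets.

Details Motivation: Modern machine learning methods need large amounts of labelled data, and preparing it is time-consuming and resource-intensive. Method: Examine different approaches, from technology selection through to final implementation, and develop a prototype annotation tool. Result: The prototype generates datasets significantly faster than manual annotation. Conclusion: All resources are open source, and the prototype effectively addresses the annotation efficiency problem. Abstract: Modern machine learning methods require significant amounts of labelled data, making the preparation process time-consuming and resource-intensive. In this paper, we describe the process of prototyping a tool for annotating and generating training datasets based on video tracking and segmentation. We examine different approaches to solving this problem, from technology selection through to final implementation. The developed prototype significantly accelerates dataset generation compared to manual annotation. All resources are available at https://github.com/lnikioffic/track-anything-annotate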

[89] Pixels to Prognosis: Harmonized Multi-Region CT-Radiomics and Foundation-Model Signatures Across Multicentre NSCLC Data

Shruti Atul Mali,Zohaib Salahuddin,Danial Khan,Yumeng Zhang,Henry C. Woodruff,Eduardo Ibor-Crespo,Ana Jimenez-Pastor,Luis Marti-Bonmati,Philippe Lambin

Main category: cs.CV

TL;DR: The study evaluates how imaging feature integration and harmonization affect survival prediction in multicenter non-small cell lung cancer (NSCLC) patients, combining handcrafted radiomics, pretrained foundation model features, and clinical data.

Details Motivation: Investigate how multi-region CT image feature integration and harmonization can improve the accuracy of survival prediction for NSCLC patients. Method: Analyze CT scans and clinical data from 876 NSCLC patients, extract features from multiple regions, and harmonize them with methods such as ComBat and RKN. Regularized Cox models predict survival, evaluated with C-index, t-AUC, and HR. Result: The model combining clinical data with ComBat-harmonized tumor radiomics reached a C-index of 0.7552. Pretrained foundation model features combined with clinical data gave the highest performance (C-index = 0.7616). The consensus model reached a t-AUC of 0.92 on the test set. Conclusion: Feature harmonization and multi-region integration improve NSCLC survival prediction, and combining radiomics, foundation model features, and consensus modeling enables robust risk stratification across centers. Abstract: Purpose: To evaluate the impact of harmonization and multi-region CT image feature integration on survival prediction in non-small cell lung cancer (NSCLC) patients, using handcrafted radiomics, pretrained foundation model (FM) features, and clinical data from a multicenter dataset. Methods: We analyzed CT scans and clinical data from 876 NSCLC patients (604 training, 272 test) across five centers. Features were extracted from the whole lung, tumor, mediastinal nodes, coronary arteries, and coronary artery calcium (CAC). Handcrafted radiomics and FM deep features were harmonized using ComBat, reconstruction kernel normalization (RKN), and RKN+ComBat. Regularized Cox models predicted overall survival; performance was assessed using the concordance index (C-index), 5-year time-dependent area under the curve (t-AUC), and hazard ratio (HR). SHapley Additive exPlanations (SHAP) values explained feature contributions. A consensus model used agreement across top region of interest (ROI) models to stratify patient risk. Results: TNM staging showed prognostic utility (C-index = 0.67; HR = 2.70; t-AUC = 0.85). The clinical + tumor radiomics model with ComBat achieved a C-index of 0.7552 and t-AUC of 0.8820. FM features (50-voxel cubes) combined with clinical data yielded the highest performance (C-index = 0.7616; t-AUC = 0.8866). An ensemble of all ROIs and FM features reached a C-index of 0.7142 and t-AUC of 0.7885. The consensus model, covering 78% of valid test cases, achieved a t-AUC of 0.92, sensitivity of 97.6%, and specificity of 66.7%.
Conclusion: Harmonization and multi-region feature integration improve survival prediction in multicenter NSCLC data. Combining interpretable radiomics, FM features, and consensus modeling enables robust risk stratification across imaging centers.
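
The concordance index used as the main metric in this abstract can be illustrated with a small sketch. This is a minimal pure-Python version of Harrell's C-index, not the authors' evaluation code; the function name and the toy inputs are assumptions made for illustration, and pairs with tied observed times are simply skipped.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk.

    times  -- observed follow-up times
    events -- 1 if death was observed, 0 if the subject was censored
    risks  -- predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times (a simplification) and pairs where the earlier
        # subject was censored, since their true order is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported values between 0.67 and 0.76.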

[90] Semantic segmentation with reward

Xie Ting,Ye Huang,Zhilin Liu,Lixin Duan

Main category: cs.CV

TL;DR: The paper proposes RSS (Reward in Semantic Segmentation), the first application of reward-based reinforcement learning to pure semantic segmentation, supporting both pixel-level and image-level reward granularity.

Details Motivation: To address the unavailability of pixel-level labels in real-world scenarios by training semantic segmentation networks with non-traditional feedback, such as feedback on the quality of parsing results. Method: Proposes the RSS framework, which combines progressive scale rewards (PSR) and pair-wise spatial difference (PSD) so that the reward facilitates network convergence. Result: Experiments show that RSS converges under both reward granularities and that the image-level variant outperforms existing weakly supervised methods. Conclusion: RSS offers a flexible and efficient training approach for semantic segmentation that is particularly effective with image-level feedback. Abstract: In real-world scenarios, pixel-level labeling is not always available. Sometimes we need a semantic segmentation network, or even a visual encoder, that is highly compatible with, and can be trained using, various types of feedback beyond traditional labels, such as feedback that indicates the quality of the parsing results. To tackle this issue, we propose RSS (Reward in Semantic Segmentation), the first practical application of reward-based reinforcement learning on pure semantic segmentation offered in two granular levels (pixel-level and image-level). RSS incorporates various novel technologies, such as progressive scale rewards (PSR) and pair-wise spatial difference (PSD), to ensure that the reward facilitates the convergence of the semantic segmentation network, especially under image-level rewards. Experiments and visualizations on benchmark datasets demonstrate that the proposed RSS can successfully ensure the convergence of the semantic segmentation network on two levels of rewards. Additionally, the RSS, which utilizes an image-level reward, outperforms existing weakly supervised methods that also rely solely on image-level signals during training.

[91] DiffusionReward: Enhancing Blind Face Restoration through Reward Feedback Learning

Bin Wu,Wei Wang,Yahui Liu,Zixiang Li,Yao Zhao

Main category: cs.CV

TL;DR: DiffusionReward is a blind face restoration framework based on Reward Feedback Learning (ReFL) that uses a Face Reward Model (FRM) to provide feedback signals and improve the generation quality of the restoration network.

Details Motivation: To address the unrealistic details and poor identity consistency that diffusion models produce in face restoration. Method: Introduces the FRM and combines a gradient flow, a regularization term, and a structural consistency constraint, with the FRM itself optimized dynamically throughout the process. Result: Outperforms existing methods on synthetic and real-world datasets, significantly improving identity consistency and facial details. Conclusion: DiffusionReward effectively improves the quality and consistency of blind face restoration through the ReFL framework. Abstract: Reward Feedback Learning (ReFL) has recently shown great potential in aligning model outputs with human preferences across various generative tasks. In this work, we introduce a ReFL framework, named DiffusionReward, to the Blind Face Restoration task for the first time. DiffusionReward effectively overcomes the limitations of diffusion-based methods, which often fail to generate realistic facial details and exhibit poor identity consistency. The core of our framework is the Face Reward Model (FRM), which is trained using carefully annotated data. It provides feedback signals that play a pivotal role in steering the optimization process of the restoration network. In particular, our ReFL framework incorporates a gradient flow into the denoising process of off-the-shelf face restoration methods to guide the update of model parameters. The guiding gradient is collaboratively determined by three aspects: (i) the FRM to ensure the perceptual quality of the restored faces; (ii) a regularization term that functions as a safeguard to preserve generative diversity; and (iii) a structural consistency constraint to maintain facial fidelity. Furthermore, the FRM undergoes dynamic optimization throughout the process. It not only ensures that the restoration network stays precisely aligned with the real face manifold, but also effectively prevents reward hacking. Experiments on synthetic and wild datasets demonstrate that our method outperforms state-of-the-art methods, significantly improving identity consistency and facial details. The source codes, data, and models are available at: https://github.com/01NeuralNinja/DiffusionReward.

[92] Object-level Cross-view Geo-localization with Location Enhancement and Multi-Head Cross Attention

Zheyang Huang,Jagannath Aryal,Saeid Nahavandi,Xuequan Lu,Chee Peng Lim,Lei Wei,Hailing Zhou

Main category: cs.CV

TL;DR: OCGNet is an object-level cross-view geo-localization network that improves localization accuracy through Gaussian Kernel Transfer and a dual embedding mechanism, achieving state-of-the-art performance on the CVOGL dataset.

Details Motivation: Traditional methods focus only on image-level localization, whereas practical applications such as search-and-rescue and infrastructure inspection need object-level accuracy. OCGNet targets the challenges posed by variations in viewpoint, timing, and imaging conditions. Method: OCGNet uses Gaussian Kernel Transfer (GKT) to preserve location information and embeds this cue in both the feature encoder and the feature matching blocks. It also includes Location Enhancement (LE) and Multi-Head Cross Attention (MHCA) modules to refine feature extraction. Result: OCGNet achieves state-of-the-art performance on the CVOGL dataset and shows few-shot learning capability, making it suitable for diverse applications. Conclusion: Through object-level localization and adaptive feature enhancement, OCGNet significantly improves the accuracy and practicality of cross-view geo-localization. Abstract: Cross-view geo-localization determines the location of a query image, captured by a drone or ground-based camera, by matching it to a geo-referenced satellite image. While traditional approaches focus on image-level localization, many applications, such as search-and-rescue, infrastructure inspection, and precision delivery, demand object-level accuracy. This enables users to prompt a specific object with a single click on a drone image to retrieve precise geo-tagged information of the object. However, variations in viewpoints, timing, and imaging conditions pose significant challenges, especially when identifying visually similar objects in extensive satellite imagery. To address these challenges, we propose an Object-level Cross-view Geo-localization Network (OCGNet). It integrates user-specified click locations using Gaussian Kernel Transfer (GKT) to preserve location information throughout the network. This cue is dually embedded into the feature encoder and feature matching blocks, ensuring robust object-specific localization. Additionally, OCGNet incorporates a Location Enhancement (LE) module and a Multi-Head Cross Attention (MHCA) module to adaptively emphasize object-specific features or expand focus to relevant contextual regions when necessary. OCGNet achieves state-of-the-art performance on a public dataset, CVOGL. It also demonstrates few-shot learning capabilities, effectively generalizing from limited examples, making it suitable for diverse applications (https://github.com/ZheyangH/OCGNet).
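
How a single click might be turned into a spatial cue can be sketched as follows. The abstract does not spell out how Gaussian Kernel Transfer is computed, so this is only an assumed, common encoding of a click: a 2D Gaussian heatmap centred on the clicked pixel. The function name `gaussian_click_map` and the `sigma` parameter are hypothetical and not taken from the OCGNet code.

```python
import math


def gaussian_click_map(height, width, click_row, click_col, sigma):
    """Return a height x width heatmap that peaks at the clicked pixel.

    Assumed encoding (not confirmed by the abstract): an unnormalised
    isotropic Gaussian whose value is 1.0 at the click and decays with
    squared pixel distance.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    heatmap = []
    for r in range(height):
        row = []
        for c in range(width):
            sq_dist = (r - click_row) ** 2 + (c - click_col) ** 2
            row.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        heatmap.append(row)
    return heatmap
```

A map of this kind can be concatenated with image channels or used to weight features, which is one plausible reading of "embedding" the click into the encoder and the matching blocks.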

[93] Evaluation of Few-Shot Learning Methods for Kidney Stone Type Recognition in Ureteroscopy

Carlos Salazar-Ruiz,Francisco Lopez-Tiro,Ivan Reyes-Amezcua,Clement Larose,Gilberto Ochoa-Ruiz,Christian Daul

Main category: cs.CV

TL;DR: Proposes a deep learning method based on few-shot learning for classifying kidney stone types in endoscopic images when training data are limited.

Details Motivation: Existing approaches to kidney stone type recognition are slow or depend on experts, and deep learning models are often limited by a lack of training data. Method: Uses a few-shot learning method (Prototypical Networks) to produce discriminative features when data are scarce or uncommon classes are present. Result: Prototypical Networks using up to 25% of the training data match or exceed the performance of traditional deep learning models trained on the complete dataset. Conclusion: The method offers an efficient, low-data solution for kidney stone classification. Abstract: Determining the type of kidney stones is crucial for prescribing appropriate treatments to prevent recurrence. Currently, various approaches exist to identify the type of kidney stones. However, obtaining results through the reference ex vivo identification procedure can take several weeks, while in vivo visual recognition requires highly trained specialists. For this reason, deep learning models have been developed to provide urologists with an automated classification of kidney stones during ureteroscopies. Nevertheless, a common issue with these models is the lack of training data. This contribution presents a deep learning method based on few-shot learning, aimed at producing sufficiently discriminative features for identifying kidney stone types in endoscopic images, even with a very limited number of samples. This approach was specifically designed for scenarios where endoscopic images are scarce or where uncommon classes are present, enabling classification even with a limited training dataset. The results demonstrate that Prototypical Networks, using up to 25% of the training data, can achieve performance equal to or better than traditional deep learning models trained with the complete dataset.
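
The classification rule behind Prototypical Networks can be sketched in a few lines. This is a minimal illustration of the general technique (class prototypes as mean embeddings, nearest prototype by squared Euclidean distance), not the authors' model; it assumes the embeddings have already been produced by some encoder, and the function names are hypothetical.

```python
def class_prototypes(embeddings, labels):
    """Average the support embeddings of each class into one prototype."""
    sums = {}
    counts = {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(query, prototypes):
    """Assign the query embedding to the class with the nearest prototype."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))
```

Because a prototype is just a mean, a new or rare class needs only a handful of support images, which fits the low-data setting the abstract targets.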

[94] AutoMiSeg: Automatic Medical Image Segmentation via Test-Time Adaptation of Foundation Models

Xingjian Li,Qifeng Wu,Colleen Que,Yiran Ding,Adithya S. Ubaradka,Jianhua Xing,Tianyang Wang,Min Xu

Main category: cs.CV

TL;DR: Proposes a zero-shot, automatic medical image segmentation method that combines vision-language and segmentation foundation models and requires neither extensive annotation nor manual prompts.

Details Motivation: Current deep learning methods need extensive expert annotation or manual prompts; the goal is an efficient and scalable alternative. Method: A vision-language grounding model generates an initial bounding box, a visual prompt boosting module refines the prompts, and a promptable segmentation model produces the final mask. A test-time adaptation framework is introduced to address the domain gap. Result: Evaluated on seven medical imaging datasets, the pipeline performs competitively with weakly-prompted interactive foundation models. Conclusion: The method provides an efficient, automated zero-shot solution for medical image segmentation. Abstract: Medical image segmentation is vital for clinical diagnosis, yet current deep learning methods often demand extensive expert effort, i.e., either through annotating large training datasets or providing prompts at inference time for each new case. This paper introduces a zero-shot and automatic segmentation pipeline that combines off-the-shelf vision-language and segmentation foundation models. Given a medical image and a task definition (e.g., "segment the optic disc in an eye fundus image"), our method uses a grounding model to generate an initial bounding box, followed by a visual prompt boosting module that enhances the prompts, which are then processed by a promptable segmentation model to produce the final mask. To address the challenges of domain gap and result verification, we introduce a test-time adaptation framework featuring a set of learnable adaptors that align the medical inputs with foundation model representations. Its hyperparameters are optimized via Bayesian Optimization, guided by a proxy validation model without requiring ground-truth labels. Our pipeline offers an annotation-efficient and scalable solution for zero-shot medical image segmentation across diverse tasks. Our pipeline is evaluated on seven diverse medical imaging datasets and shows promising results. By proper decomposition and test-time adaptation, our fully automatic pipeline performs competitively with weakly-prompted interactive foundation models.

[95] SplatCo: Structure-View Collaborative Gaussian Splatting for Detail-Preserving Rendering of Large-Scale Unbounded Scenes

Haihong Xiao,Jianan Zou,Yuxin Zhou,Ying He,Wenxiong Kang

Main category: cs.CV

TL;DR: SplatCo is a structure-view collaborative Gaussian splatting framework for high-fidelity rendering of complex outdoor environments; it fuses global and local features and uses a multi-view consistency training strategy to improve reconstruction quality.

Details Motivation: To achieve high-fidelity rendering of complex outdoor scenes, in particular balancing global layout against local detail and maintaining multi-view consistency. Method: Combines global tri-plane representations with local context grid features through a hierarchical compensation strategy, and enhances multi-view consistency with a cross-view assisted training strategy. Result: On 13 large-scale scenes, PSNR improves by 1-2 dB and SSIM by 0.1 to 0.2, with reconstruction quality above existing methods. Conclusion: SplatCo sets a new benchmark for high-fidelity rendering of large-scale unbounded scenes. Abstract: We present SplatCo, a structure-view collaborative Gaussian splatting framework for high-fidelity rendering of complex outdoor environments. SplatCo builds upon two novel components: (1) a cross-structure collaboration module that combines global tri-plane representations, which capture coarse scene layouts, with local context grid features that represent fine surface details. This fusion is achieved through a novel hierarchical compensation strategy, ensuring both global consistency and local detail preservation; and (2) a cross-view assisted training strategy that enhances multi-view consistency by synchronizing gradient updates across viewpoints, applying visibility-aware densification, and pruning overfitted or inaccurate Gaussians based on structural consistency. Through joint optimization of structural representation and multi-view coherence, SplatCo effectively reconstructs fine-grained geometric structures and complex textures in large-scale scenes. Comprehensive evaluations on 13 diverse large-scale scenes, including Mill19, MatrixCity, Tanks & Temples, WHU, and custom aerial captures, demonstrate that SplatCo consistently achieves higher reconstruction quality than state-of-the-art methods, with PSNR improvements of 1-2 dB and SSIM gains of 0.1 to 0.2. These results establish a new benchmark for high-fidelity rendering of large-scale unbounded scenes. Code and additional information are available at https://github.com/SCUT-BIP-Lab/SplatCo.
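
To put the reported PSNR gains in context, the metric itself can be written down in a few lines. This is a generic sketch of the standard PSNR definition over flattened pixel values, not SplatCo's evaluation script; the function name and the 8-bit default `max_value` are assumptions.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Since PSNR is logarithmic in the mean squared error, a gain of 1 dB corresponds to roughly 21% lower MSE and a gain of 2 dB to roughly 37% lower MSE.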

[96] Diffusion Classifiers Understand Compositionality, but Conditions Apply

Yujin Jeong,Arnas Uselis,Seong Joon Oh,Anna Rohrbach

Main category: cs.CV

TL;DR: The paper studies the discriminative capabilities of diffusion classifiers across a wide range of compositional tasks, and through extensive experiments and analysis relates their performance to factors such as dataset domain and timestep weighting.

Details Motivation: Generative text-to-image diffusion models excel at synthesizing complex scenes, but their potential for discriminative compositional understanding has not been fully explored. The paper aims to fill this gap. Method: The study covers three diffusion models (SD 1.5, 2.0, and 3-m) across 10 datasets and over 30 tasks, and introduces a new diagnostic benchmark, Self-Bench. Result: Diffusion classifiers can handle compositional tasks, but their performance depends on the dataset domain and on timestep weighting; SD3-m in particular is sensitive to domain gap and timestep choice. Conclusion: Diffusion classifiers understand compositionality, but only under specific conditions. The study provides a reference point for future work. Abstract: Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark Self-Bench comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
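
The role of timestep weighting in a zero-shot diffusion classifier can be illustrated with a toy decision rule. In the usual formulation, each candidate prompt is scored by the denoising error the model incurs when conditioned on it, and the prompt with the lowest weighted error wins. The sketch below assumes those per-timestep errors have already been computed; the error values, prompt strings, and function name are hypothetical, and this is not the paper's code.

```python
def diffusion_classify(errors, weights):
    """Pick the prompt with the lowest timestep-weighted denoising error.

    errors  -- dict mapping each candidate prompt to a list of denoising
               errors, one per sampled timestep (hypothetical values here)
    weights -- one non-negative weight per sampled timestep
    """
    scores = {}
    for prompt, per_step in errors.items():
        if len(per_step) != len(weights):
            raise ValueError("one weight is needed per timestep")
        scores[prompt] = sum(w * e for w, e in zip(weights, per_step))
    return min(scores, key=scores.get)
```

With errors of `[0.9, 0.2, 0.2]` for one prompt and `[0.1, 0.5, 0.5]` for another, uniform weights favour the second prompt while down-weighting the first timestep flips the decision, which is the kind of sensitivity the abstract reports.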

[97] Mind the Domain Gap: Measuring the Domain Gap Between Real-World and Synthetic Point Clouds for Automated Driving Development

Nguyen Duc,Yan-Ling Lai,Patrick Madlindl,Xinyuan Zhu,Benedikt Schwab,Olaf Wysocki,Ludwig Hoegner,Thomas H. Kolbe

Main category: cs.CV

TL;DR: Proposes a new approach, with the metric DoGSS-PCL, for measuring the domain gap between real-world sensor observations and simulated data, enabling comprehensive domain gap analysis.

Details Motivation: To cope with long-tail data distributions and ensure that simulated data are credible in safety-critical applications such as automated driving. Method: Introduces the DoGSS-PCL metric, which assesses the geometric and semantic quality of simulated point clouds. Result: Experiments show that the approach can measure the domain gap and that synthetic semantic point clouds can be used to train deep neural networks. Conclusion: The approach is expected to advance research on credible data simulation and to support at-scale deployment in automated driving testing and digital twinning. Abstract: Owing to the typical long-tail data distribution issues, simulating domain-gap-free synthetic data is crucial in robotics, photogrammetry, and computer vision research. The fundamental challenge pertains to credibly measuring the difference between real and simulated data. Such a measure is vital for safety-critical applications, such as automated driving, where out-of-domain samples may impact a car's perception and cause fatal accidents. Previous work has commonly focused on simulating data on one scene and analyzing performance on a different, real-world scene, hampering the disjoint analysis of domain gap coming from networks' deficiencies, class definitions, and object representation. In this paper, we propose a novel approach to measuring the domain gap between the real-world sensor observations and simulated data representing the same location, enabling comprehensive domain gap analysis. To measure such a domain gap, we introduce a novel metric DoGSS-PCL and evaluation assessing the geometric and semantic quality of the simulated point cloud. Our experiments corroborate that the introduced approach can be used to measure the domain gap. The tests also reveal that synthetic semantic point clouds may be used for training deep neural networks, maintaining the performance at the 50/50 real-to-synthetic ratio. We strongly believe that this work will facilitate research on credible data simulation and allow for at-scale deployment in automated driving testing and digital twinning.

[98] MR-EEGWaveNet: Multiresolutional EEGWaveNet for Seizure Detection from Long EEG Recordings

Kazi Mahmudul Hassan,Xuyang Zhao,Hidenori Sugano,Toshihisa Tanaka

Main category: cs.CV

TL;DR: Proposes MR-EEGWaveNet, a new end-to-end model that uses multiresolution feature extraction and anomaly score-based post-processing to improve the precision and F1 score of seizure detection.

Details Motivation: Feature engineering for generalized seizure detection models remains challenging; existing models perform inconsistently and struggle to distinguish artifacts from seizure data. Method: The model has three modules (convolution, feature extraction, and predictor), extracts features with depth-wise and spatio-temporal convolution, and reduces false positives with anomaly score-based post-processing. Result: F1 scores improve from 0.177 to 0.336 on Siena and from 0.327 to 0.488 on Juntendo, with precision gains of 15.9% and 20.62%, respectively. Conclusion: With multiresolution feature extraction and post-processing, MR-EEGWaveNet significantly outperforms the conventional non-multiresolution approach and offers a more effective solution for seizure detection. Abstract: Feature engineering for generalized seizure detection models remains a significant challenge. Recently proposed models show variable performance depending on the training data and remain ineffective at accurately distinguishing artifacts from seizure data. In this study, we propose a novel end-to-end model, "Multiresolutional EEGWaveNet (MR-EEGWaveNet)," which efficiently distinguishes seizure events from background electroencephalogram (EEG) and artifacts/noise by capturing both temporal dependencies across different time frames and spatial relationships between channels. The model has three modules: convolution, feature extraction, and predictor. The convolution module extracts features through depth-wise and spatio-temporal convolution. The feature extraction module individually reduces the feature dimension extracted from EEG segments and their sub-segments. Subsequently, the extracted features are concatenated into a single vector for classification using a fully connected classifier called the predictor module. In addition, an anomaly score-based post-classification processing technique was introduced to reduce the false-positive rates of the model. Experimental results were reported and analyzed using different parameter settings and datasets (Siena (public) and Juntendo (private)). The proposed MR-EEGWaveNet significantly outperformed the conventional non-multiresolution approach, improving the F1 scores from 0.177 to 0.336 on Siena and 0.327 to 0.488 on Juntendo, with precision gains of 15.9% and 20.62%, respectively.

[99] To Glue or Not to Glue? Classical vs Learned Image Matching for Mobile Mapping Cameras to Textured Semantic 3D Building Models

Simone Gaisbauer,Prabin Gyawali,Qilin Zhang,Olaf Wysocki,Boris Jutzi

Main category: cs.CV

TL;DR: The paper compares classical handcrafted and learnable feature matching methods for camera-to-model matching against semantic 3D building models, and finds that learnable methods are markedly more accurate and robust.

Details Motivation: Although deep learning methods perform well in feature matching, a systematic comparison for the specific task of semantic 3D building camera-to-model matching is missing. Method: Uses standard datasets (HPatches, MegaDepth-1500) and custom datasets (facade textures and camera images), and evaluates the accuracy of absolute pose estimation with a PnP algorithm. Result: Learnable methods perform better on the custom datasets, with markedly higher accuracy and robustness than classical methods. Conclusion: The study is expected to foster the development of model-based visual localization methods. Abstract: Feature matching is a necessary step for many computer vision and photogrammetry applications such as image registration, structure-from-motion, and visual localization. Classical handcrafted methods such as SIFT feature detection and description combined with nearest neighbour matching and RANSAC outlier removal have been state-of-the-art for mobile mapping cameras. With recent advances in deep learning, learnable methods have been introduced and proven to have better robustness and performance under complex conditions. Despite their growing adoption, a comprehensive comparison between classical and learnable feature matching methods for the specific task of semantic 3D building camera-to-model matching is still missing. This submission systematically evaluates the effectiveness of different feature-matching techniques in visual localization using textured CityGML LoD2 models. We use standard benchmark datasets (HPatches, MegaDepth-1500) and custom datasets consisting of facade textures and corresponding camera images (terrestrial and drone). For the latter, we evaluate the achievable accuracy of the absolute pose estimated using a Perspective-n-Point (PnP) algorithm, with geometric ground truth derived from geo-referenced trajectory data. The results indicate that the learnable feature matching methods vastly outperform traditional approaches regarding accuracy and robustness on our challenging custom datasets with zero to 12 RANSAC-inliers and zero to 0.16 area under the curve. We believe that this work will foster the development of model-based visual localization methods. Link to the code: https://github.com/simBauer/To_Glue_or_not_to_Glue

[100] Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling

Bryan Wong,Jong Woo Kim,Huazhu Fu,Mun Yong Yi

Main category: cs.CV

TL;DR: HiVE-MIL is a hierarchical vision-language framework that uses a unified graph and a text-guided dynamic filtering mechanism to address insufficient multi-scale vision-language alignment, improving few-shot WSI classification.

Details Motivation: Existing methods model intra-modality interactions across scales and vision-language alignment inadequately, which limits WSI classification performance. Method: HiVE-MIL builds a unified graph with cross-scale parent-child links and heterogeneous intra-scale edges, and adds a two-stage text-guided dynamic filtering mechanism and a hierarchical contrastive loss. Result: On TCGA breast, lung, and kidney cancer datasets, HiVE-MIL outperforms traditional MIL and VLM-based MIL methods by up to 4.1% in macro F1 under 16-shot settings. Conclusion: HiVE-MIL demonstrates the value of jointly modeling hierarchical structure and multimodal alignment for efficient learning from limited pathology data. Abstract: Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at https://github.com/bryanwong17/HiVE-MIL

[101] Canonical Pose Reconstruction from Single Depth Image for 3D Non-rigid Pose Recovery on Limited Datasets

Fahd Alhamazani,Yu-Kun Lai,Paul L. Rosin

Main category: cs.CV

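TL;DR: Proposes a canonical pose-based 3D reconstruction model that converts single-view depth images into a canonical form to address the challenges of non-rigid object reconstruction, achieving effective results with only a small amount of data.

Details Motivation: Traditional methods struggle with 3D reconstruction of non-rigid objects such as humans because of the wide range of deformations, and they require large amounts of training data. This study aims to overcome these limitations. Method: Proposes a canonical pose reconstruction model that aligns single-view depth images to a canonical form, enabling rigid object reconstruction techniques, and supports recovering the input pose in voxel representation. Result: On animal and human datasets, the model outperforms other state-of-the-art methods with only about 300 samples. Conclusion: Canonical pose alignment simplifies 3D reconstruction of non-rigid objects, reducing data requirements while improving performance. Abstract: 3D reconstruction from 2D inputs, especially for non-rigid objects like humans, presents unique challenges due to the significant range of possible deformations. Traditional methods often struggle with non-rigid shapes, which require extensive training data to cover the entire deformation space. This study addresses these limitations by proposing a canonical pose reconstruction model that transforms single-view depth images of deformable shapes into a canonical form. This alignment facilitates shape reconstruction by enabling the application of rigid object reconstruction techniques, and supports recovering the input pose in voxel representation as part of the reconstruction task, utilizing both the original and deformed depth images. Notably, our model achieves effective results with only a small dataset of approximately 300 samples. Experimental results on animal and human datasets demonstrate that our model outperforms other state-of-the-art methods.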

[102] Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation

Zhihua Liu,Amrutha Saseendran,Lei Tong,Xilin He,Fariba Yousefi,Nikolay Burlutskiy,Dino Oglic,Tom Diethe,Philip Teare,Huiyu Zhou,Chen Jin

Main category: cs.CV

TL;DR: Proposes Segment Anyword, a training-free open-set image segmentation method that derives segmentation masks from the cross-attention maps of a frozen diffusion model and improves them with linguistic-guided visual prompt regularization.

Details Motivation: Existing methods need extensive training or fine-tuning and struggle to segment objects consistently across diverse text reference expressions. Method: Uses token-level cross-attention maps from a frozen diffusion model to produce initial mask prompts, then refines them with linguistic-guided visual prompt regularization. Result: Achieves state-of-the-art results on several open-set segmentation tasks, for example 52.5 mIoU on Pascal Context 59 and 67.73 cIoU on gRefCOCO. Conclusion: Segment Anyword is an efficient, general open-set segmentation method that improves segmentation accuracy and robustness. Abstract: Open-set image segmentation poses a significant challenge because existing methods often demand extensive training or fine-tuning and generally struggle to segment unified objects consistently across diverse text reference expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language grounded segmentation that relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates or mask prompts, which are then refined into targeted object masks. Initial prompts typically lack coherence and consistency as the complexity of the image-text increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts, and significant improvements in segmentation accuracy. The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf, which is the most complex open-set grounded segmentation task in the field.
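
The mIoU figures quoted above follow the standard mean intersection-over-union definition, which can be sketched as below. This is a generic illustration over flattened label maps, not the benchmark's official scorer; the function name is an assumption, and classes absent from both prediction and ground truth are skipped, which is one common convention.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flattened label maps."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes that appear in neither map
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in either map")
    return sum(ious) / len(ious)
```

Note that cIoU, reported for gRefCOCO, is a different aggregate (intersections and unions are accumulated over the whole dataset before dividing), so the two numbers are not directly comparable.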

[103] Clinical Validation of Deep Learning for Real-Time Tissue Oxygenation Estimation Using Spectral Imaging

Jens De Winne,Siri Willems,Siri Luthman,Danilo Babin,Hiep Luong,Wim Ceelen

Main category: cs.CV

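TL;DR: The paper presents deep learning approaches for real-time tissue oxygenation estimation, training neural networks on Monte-Carlo simulated spectra and using domain-adversarial training to narrow the gap between simulated and clinical data; the models outperform traditional linear unmixing.

Details Motivation: Real-time monitoring of tissue ischemia is crucial for understanding tissue health and guiding surgery, and conventional linear unmixing methods are limited by their assumptions. Method: Trains a fully connected neural network (FCN) and a convolutional neural network (CNN) on Monte-Carlo simulated spectra, and proposes domain-adversarial training to bridge the gap between simulated and clinical data. Result: The deep learning models correlate more strongly with capillary lactate measurements taken during surgery than traditional linear unmixing, and domain-adversarial training reduces the domain gap. Conclusion: Deep learning methods perform well for real-time tissue oxygenation estimation, and domain-adversarial training improves clinical applicability. Abstract: Accurate, real-time monitoring of tissue ischemia is crucial to understand tissue health and guide surgery. Spectral imaging shows great potential for contactless and intraoperative monitoring of tissue oxygenation. Due to the difficulty of obtaining direct reference oxygenation values, conventional methods are based on linear unmixing techniques. These are prone to assumptions and these linear relations may not always hold in practice. In this work, we present deep learning approaches for real-time tissue oxygenation estimation using Monte-Carlo simulated spectra. We train a fully connected neural network (FCN) and a convolutional neural network (CNN) for this task and propose a domain-adversarial training approach to bridge the gap between simulated and real clinical spectral data. Results demonstrate that these deep learning models achieve a higher correlation with capillary lactate measurements, a well-known marker of hypoxia, obtained during spectral imaging in surgery, compared to traditional linear unmixing. Notably, domain-adversarial training effectively reduces the domain gap, optimizing performance in real clinical settings.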

[104] SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification

Shashank Agnihotri,David Schader,Jonas Jakubassa,Nico Sharei,Simon Kral,Mehmet Ege Kaçar,Ruben Weber,Margret Keuper

Main category: cs.CV

TL;DR: The paper introduces two benchmarking tools, SEMSEGBENCH and DETECBENCH, for evaluating the robustness and generalization of semantic segmentation and object detection models, and reports a large-scale evaluation.

Details Motivation: To study the reliability and generalization of deep learning on tasks used in safety-critical domains, such as semantic segmentation and object detection, given that existing research focuses mainly on image classification. Method: Develops the SEMSEGBENCH and DETECBENCH tools and evaluates 76 segmentation models and 61 object detectors under adversarial attacks and common corruptions. Result: Reveals systematic weaknesses in state-of-the-art models and identifies key trends based on architecture, backbone, and model capacity. Conclusion: The open-sourced tools and collected data are expected to encourage future research on model reliability beyond classification. Abstract: Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends based on architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open-sourced in our GitHub repository (https://github.com/shashankskagnihotri/benchmarking_reliability_generalization) along with our complete set of total 6139 evaluations. We anticipate the collected data to foster and encourage future research towards improved model reliability beyond classification.

[105] Building Floor Number Estimation from Crowdsourced Street-Level Images: Munich Dataset and Baseline Method

Yao Sun,Sining Chen,Yifan Tian,Xiao Xiang Zhu

Main category: cs.CV

TL;DR: Proposes an end-to-end deep learning framework that infers the number of building floors directly from street-level images, without hand-crafted features and across diverse facade styles.

Details Motivation: Building floor counts matter for household estimation, risk assessment, energy modeling, and similar tasks, but existing data are scarce. Method: Uses a deep classification-regression network to predict floor numbers directly from street-level images. Result: On the released public dataset, the method reaches 81.2% exact accuracy, and 97.9% of predictions fall within ±1 floor. Conclusion: The method and dataset offer a scalable way to enrich 3D city models with vertical information and support work in urban informatics. Abstract: Accurate information on the number of building floors, or above-ground storeys, is essential for household estimation, utility provision, risk assessment, evacuation planning, and energy modeling. Yet large-scale floor-count data are rarely available in cadastral and 3D city databases. This study proposes an end-to-end deep learning framework that infers floor numbers directly from unrestricted, crowdsourced street-level imagery, avoiding hand-crafted features and generalizing across diverse facade styles. To enable benchmarking, we release the Munich Building Floor Dataset, a public set of over 6800 geo-tagged images collected from Mapillary and targeted field photography, each paired with a verified storey label. On this dataset, the proposed classification-regression network attains 81.2% exact accuracy and predicts 97.9% of buildings within +/-1 floor. The method and dataset together offer a scalable route to enrich 3D city models with vertical information and lay a foundation for future work in urban informatics, remote sensing, and geographic information science. Source code and data will be released under an open license at https://github.com/ya0-sun/Munich-SVI-Floor-Benchmark.
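
The two headline numbers (exact accuracy and accuracy within ±1 floor) are the same measure evaluated at two tolerances. The sketch below shows that relationship on toy data; the function name and the example values are assumptions, not taken from the released benchmark code.

```python
def floor_accuracy(predicted, actual, tolerance=0):
    """Share of buildings whose predicted floor count is within `tolerance`."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(predicted)
```

Calling it with `tolerance=0` gives exact accuracy and with `tolerance=1` gives the ±1 floor figure, so the second value can never be lower than the first.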

[106] RemoteSAM: Towards Segment Anything for Earth Observation

Liang Yao,Fan Liu,Delong Chen,Chuanyi Zhang,Yijun Wang,Ziyun Chen,Wei Xu,Shimin Di,Yuhui Zheng

Main category: cs.CV

TL;DR: The paper presents RemoteSAM, a visual foundation model for Earth observation that addresses the need for diverse and flexible task support. Using an automatic data engine, a large-scale dataset, and a task unification paradigm, the model performs strongly on multiple benchmarks.

Details Motivation: Current systems cannot meet the diversity and flexibility that Earth observation tasks require, because they are typically limited to task-specific architectures and narrow data domains. Method: Introduces an automatic data engine to build a large-scale dataset (270K image-text-mask triplets) and proposes a task unification paradigm centered on referring expression segmentation. Result: RemoteSAM outperforms other foundation models (such as Falcon and GeoChat) on several Earth observation benchmarks with higher efficiency. Conclusion: Through innovations in data and modeling, RemoteSAM provides an efficient and flexible foundation model for Earth observation tasks. Abstract: We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets while providing compatibility with various input-output interfaces required across different task scenarios. Current systems cannot meet these requirements, as they typically utilize task-specific architecture trained on narrow data domains with limited semantic coverage. Our study addresses these limitations from two aspects: data and modeling. We first introduce an automatic data engine that enjoys significantly better scalability compared to previous human annotation or rule-based approaches. It has enabled us to create the largest dataset of its kind to date, comprising 270K image-text-mask triplets covering an unprecedented range of diverse semantic categories and attribute specifications. Based on this data foundation, we further propose a task unification paradigm that centers around referring expression segmentation. It effectively handles a wide range of vision-centric perception tasks, including classification, detection, segmentation, grounding, etc., using a single model without any task-specific heads. Combining these innovations on data and modeling, we present RemoteSAM, a foundation model that establishes new SoTA on several earth observation perception benchmarks, outperforming other foundation models such as Falcon, GeoChat, and LHRS-Bot with significantly higher efficiency. Models and data are publicly available at https://github.com/1e12Leon/RemoteSAM.

[107] A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency

Xiaobao Wei,Jiawei Liu,Dongbo Yang,Junda Cheng,Changyong Shu,Wei Wang

Main category: cs.CV

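TL;DR: The paper proposes a wavelet-based stereo matching framework (Wavelet-Stereo) that processes high- and low-frequency components separately, addressing the limited performance of existing iterative methods in high-frequency regions and achieving leading results on the KITTI datasets.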

Details Motivation: For existing iterative methods (e.g., RAFT-stereo), the EPE metric converges inconsistently between low- and high-frequency regions, degrading performance in high-frequency regions such as edges and thin objects. Method: Uses the discrete wavelet transform to decompose an image into high- and low-frequency components, feeds them into separate multi-scale feature extractors, and designs an LSTM-based high-frequency preservation update operator whose iterative frequency adapter dynamically adjusts the high-frequency features. Result: Wavelet-Stereo ranks first on almost all metrics on the KITTI 2015 and KITTI 2012 leaderboards, outperforming existing methods. Conclusion: By processing high- and low-frequency components separately, Wavelet-Stereo refines both edge details and smooth regions and suits challenging scenes; the authors provide code and pre-trained models. Abstract: We find that the EPE evaluation metrics of RAFT-stereo converge inconsistently in the low and high frequency regions, resulting in high-frequency degradation (e.g., edges and thin objects) during the iterative process. The underlying reason for the limited performance of current iterative methods is that they optimize all frequency components together without distinguishing between high and low frequencies. We propose a wavelet-based stereo matching framework (Wavelet-Stereo) for solving frequency convergence inconsistency. Specifically, we first explicitly decompose an image into high and low frequency components using discrete wavelet transform. Then, the high-frequency and low-frequency components are fed into two different multi-scale frequency feature extractors. Finally, we propose a novel LSTM-based high-frequency preservation update operator containing an iterative frequency adapter to provide adaptive refined high-frequency features at different iteration steps by fine-tuning the initial high-frequency features. By processing high and low frequency components separately, our framework can simultaneously refine high-frequency information in edges and low-frequency information in smooth regions, which is especially suitable for challenging scenes with fine details and textures in the distance. Extensive experiments demonstrate that our Wavelet-Stereo outperforms the state-of-the-art methods and ranks 1st on both the KITTI 2015 and KITTI 2012 leaderboards for almost all metrics. We will provide code and pre-trained models to encourage further exploration, application, and development of our innovative framework (https://github.com/SIA-IDE/Wavelet-Stereo).
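The frequency split described in the abstract can be illustrated with a one-level 2D discrete wavelet transform. The sketch below is a minimal pure-Python version that assumes the Haar wavelet and an even-sized grayscale image; the abstract does not say which wavelet Wavelet-Stereo uses, and the function name `haar_dwt2` is hypothetical.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform of an even-sized grayscale image.

    `image` is a list of rows of numbers. Returns the low-frequency band and
    a tuple of three high-frequency bands, each half the input size.
    Assumption: Haar wavelet with orthonormal scaling (factor 1/2 per block).
    """
    rows, cols = len(image), len(image[0])
    low, horiz, vert, diag = [], [], [], []
    for r in range(0, rows, 2):
        low_row, horiz_row, vert_row, diag_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]          # top-left, top-right
            d, e = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            low_row.append((a + b + d + e) / 2.0)    # local average (smooth content)
            horiz_row.append((a - b + d - e) / 2.0)  # left minus right: fires on vertical edges
            vert_row.append((a + b - d - e) / 2.0)   # top minus bottom: fires on horizontal edges
            diag_row.append((a - b - d + e) / 2.0)   # diagonal detail
        low.append(low_row)
        horiz.append(horiz_row)
        vert.append(vert_row)
        diag.append(diag_row)
    return low, (horiz, vert, diag)


# A 4x4 image with a vertical edge between columns 0 and 1.
edge_image = [[0, 8, 8, 8] for _ in range(4)]
low_band, (horiz_band, vert_band, diag_band) = haar_dwt2(edge_image)
```

In this toy input only the left-minus-right band is non-zero, and only in the blocks containing the edge, which is the kind of edge information a separate high-frequency branch would receive.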

[108] 3D Face Reconstruction Error Decomposed: A Modular Benchmark for Fair and Fast Method Evaluation

Evangelos Sariyanidi,Claudio Ferrari,Federico Nocentini,Stefano Berretti,Andrea Cavallaro,Birkan Tunc

Main category: cs.CV

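TL;DR: Presents a modular 3D face reconstruction benchmark toolkit (M3DFB) that separates the fundamental components of error computation and makes them interchangeable, and introduces a new correction component, significantly improving the efficiency and accuracy of error estimation.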

Details Motivation: Current benchmark tools for 3D face reconstruction are monolithic, and there is no consensus on the best way to measure error, so a modular toolkit is needed to quantify the effect of each component. Method: Develops the M3DFB toolkit, which separates the fundamental components of error computation and makes them interchangeable, and proposes a computationally efficient correction scheme; 16 error estimators are tested with 10 reconstruction methods. Result: The widely used ICP-based estimator performs worst, while non-rigid alignment improves performance significantly. The proposed correction scheme, combined with non-rigid warping, matches the accuracy of the best non-rigid ICP-based estimators while running faster. Conclusion: M3DFB gives researchers a flexible way to compare error estimators, helping advance benchmarking for 3D face reconstruction and supporting the improvement of learned reconstruction methods. Abstract: Computing the standard benchmark metric for 3D face reconstruction, namely geometric error, requires a number of steps, such as mesh cropping, rigid alignment, or point correspondence. Current benchmark tools are monolithic (they implement a specific combination of these steps), even though there is no consensus on the best way to measure error. We present a toolkit for a Modularized 3D Face reconstruction Benchmark (M3DFB), where the fundamental components of error computation are segregated and interchangeable, allowing one to quantify the effect of each. Furthermore, we propose a new component, namely correction, and present a computationally efficient approach that penalizes for mesh topology inconsistency. Using this toolkit, we test 16 error estimators with 10 reconstruction methods on two real and two synthetic datasets. Critically, the widely used ICP-based estimator provides the worst benchmarking performance, as it significantly alters the true ranking of the top-5 reconstruction methods. Notably, the correlation of ICP with the true error can be as low as 0.41. Moreover, non-rigid alignment leads to significant improvement (correlation larger than 0.90), highlighting the importance of annotating 3D landmarks on datasets. Finally, the proposed correction scheme, together with non-rigid warping, leads to an accuracy on a par with the best non-rigid ICP-based estimators, but runs an order of magnitude faster.
Our open-source codebase is designed for researchers to easily compare alternatives for each component, thus helping accelerate progress in benchmarking for 3D face reconstruction and, furthermore, supporting the improvement of learned reconstruction methods, which depend on accurate error estimation for effective training.
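The two diagnostics the abstract reports for an error estimator (its correlation with the true error, and whether it preserves the ranking of methods) can be computed as in the sketch below. It assumes Pearson correlation, which the abstract does not specify, and the helper names and any numbers passed to them are hypothetical rather than taken from M3DFB.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranking(errors):
    """Indices of methods ordered from lowest to highest error."""
    return sorted(range(len(errors)), key=lambda i: errors[i])


def preserves_ranking(estimated, true):
    """True if the estimator orders the methods exactly as the true error does."""
    return ranking(estimated) == ranking(true)
```

An estimator can correlate fairly well with the true error and still swap the order of closely matched methods, which is why the two checks are kept separate here.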

[109] CAMME: Adaptive Deepfake Image Detection with Multi-Modal Cross-Attention

Naseem Khan,Tuan Nguyen,Amine Bermak,Issa Khalil

Main category: cs.CV

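TL;DR: The CAMME framework integrates visual, textual, and frequency-domain features through a multi-modal cross-attention mechanism, significantly improving the performance and robustness of cross-domain deepfake detection.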

Details Motivation: AI-generated deepfakes are advancing rapidly, and existing detection methods perform poorly on unseen generative architectures, so a solution that generalizes across domains is needed. Method: Proposes the CAMME framework, which uses a multi-modal cross-attention mechanism to dynamically integrate visual, textual, and frequency-domain features for cross-domain generalization. Result: CAMME improves detection by 12.56% on natural scenes and 13.25% on facial deepfakes, and is highly robust to natural image perturbations and adversarial attacks. Conclusion: Multi-modal cross-attention effectively integrates complementary features, offering a reliable solution for deepfake detection across heterogeneous generative architectures. Abstract: The proliferation of sophisticated AI-generated deepfakes poses critical challenges for digital media authentication and societal security. While existing detection methods perform well within specific generative domains, they exhibit significant performance degradation when applied to manipulations produced by unseen architectures--a fundamental limitation as generative technologies rapidly evolve. We propose CAMME (Cross-Attention Multi-Modal Embeddings), a framework that dynamically integrates visual, textual, and frequency-domain features through a multi-head cross-attention mechanism to establish robust cross-domain generalization. Extensive experiments demonstrate CAMME's superiority over state-of-the-art methods, yielding improvements of 12.56% on natural scenes and 13.25% on facial deepfakes. The framework demonstrates exceptional resilience, maintaining over 91% accuracy under natural image perturbations and achieving 89.01% and 96.14% accuracy against PGD and FGSM adversarial attacks, respectively. Our findings validate that integrating complementary modalities through cross-attention enables more effective decision boundary realignment for reliable deepfake detection across heterogeneous generative architectures.
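The core operation behind cross-attention fusion is scaled dot-product attention, where queries come from one modality and keys and values from another. The sketch below is a single-head, pure-Python version without learned projections; CAMME itself is described as multi-head, so this only illustrates the mechanism and is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: tokens from one modality (e.g. visual), each a list of floats.
    keys, values: tokens from another modality (e.g. text or frequency).
    Returns one attended vector per query.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(attended)
    return outputs
```

Each output vector is a weighted average of the other modality's values, with weights determined by how well the query matches each key.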

[110] Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation

Li Zhong,Ahmed Ghazal,Jun-Jun Wan,Frederik Zilly,Patrick Mackens,Joachim E. Vollrath,Bogdan Sorin Coseriu

Main category: cs.CV

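TL;DR: Clip4Retrofit is an efficient model distillation framework that transfers the knowledge of the CLIP model to a lightweight student model, enabling real-time image labeling on resource-constrained edge devices.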

Details Motivation: Foundation models such as CLIP excel at vision-language tasks, but their high computational complexity and memory footprint make them hard to deploy on resource-constrained edge devices. Method: Distills CLIP's knowledge into a lightweight student model that combines EfficientNet-B3 with multi-layer perceptron (MLP) projection heads, preserving cross-modal alignment while significantly reducing computational requirements. Result: Experiments show that Clip4Retrofit performs real-time image labeling and object identification on resource-constrained edge devices, balancing efficiency and performance. Conclusion: The work bridges the gap between state-of-the-art vision-language models and deployment in resource-constrained environments, paving the way for broader adoption of foundation models in edge computing. Abstract: Foundation models like CLIP (Contrastive Language-Image Pretraining) have revolutionized vision-language tasks by enabling zero-shot and few-shot learning through cross-modal alignment. However, their computational complexity and large memory footprint make them unsuitable for deployment on resource-constrained edge devices, such as in-car cameras used for image collection and real-time processing. To address this challenge, we propose Clip4Retrofit, an efficient model distillation framework that enables real-time image labeling on edge devices. The framework is deployed on the Retrofit camera, a cost-effective edge device retrofitted into thousands of vehicles, despite strict limitations on compute performance and memory. Our approach distills the knowledge of the CLIP model into a lightweight student model, combining EfficientNet-B3 with multi-layer perceptron (MLP) projection heads to preserve cross-modal alignment while significantly reducing computational requirements. We demonstrate that our distilled model achieves a balance between efficiency and performance, making it ideal for deployment in real-world scenarios. Experimental results show that Clip4Retrofit can perform real-time image labeling and object identification on edge devices with limited resources, offering a practical solution for applications such as autonomous driving and retrofitting existing systems. This work bridges the gap between state-of-the-art vision-language models and their deployment in resource-constrained environments, paving the way for broader adoption of foundation models in edge computing.
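One common way to distill an embedding model is to pull the student's projected embedding toward the teacher's embedding for the same image. The sketch below assumes a cosine-distance objective; the abstract does not state the loss Clip4Retrofit uses, and `distillation_loss` is a hypothetical name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distillation_loss(student_embeddings, teacher_embeddings):
    """Mean cosine distance between paired student and teacher embeddings.

    Assumption: both lists hold one embedding per image, in the same order,
    and the student embeddings have already been projected to the teacher's
    dimensionality (the role an MLP projection head would play).
    """
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_embeddings, teacher_embeddings)
    ]
    return sum(distances) / len(distances)
```

Because the loss depends only on direction, a student whose embeddings point the same way as the teacher's keeps the teacher's image-text similarity ordering, which is one reading of preserving cross-modal alignment.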

[111] RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration

Sudarshan Rajagopalan,Kartik Narayan,Vishal M. Patel

Main category: cs.CV

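TL;DR: RestoreVAR is a generative approach based on visual autoregressive modeling (VAR) that significantly improves image restoration performance while running inference over 10x faster than latent diffusion models (LDMs).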

Details Motivation: LDMs perform well in image restoration, but their inference is slow, making them impractical for time-sensitive applications. Method: Adopts visual autoregressive modeling (VAR) with modified cross-attention mechanisms and a latent-space refinement module. Result: RestoreVAR achieves state-of-the-art performance among generative image restoration methods and generalizes well. Conclusion: RestoreVAR outperforms LDM-based models in both restoration performance and speed, making it suitable for time-sensitive applications. Abstract: The use of latent diffusion models (LDMs) such as Stable Diffusion has significantly improved the perceptual quality of All-in-One image Restoration (AiOR) methods, while also enhancing their generalization capabilities. However, these LDM-based frameworks suffer from slow inference due to their iterative denoising process, rendering them impractical for time-sensitive applications. To address this, we propose RestoreVAR, a novel generative approach for AiOR that significantly outperforms LDM-based models in restoration performance while achieving over $10\times$ faster inference. RestoreVAR leverages visual autoregressive modeling (VAR), a recently introduced approach which performs scale-space autoregression for image generation. VAR achieves comparable performance to that of state-of-the-art diffusion transformers with drastically reduced computational costs. To optimally exploit these advantages of VAR for AiOR, we propose architectural modifications and improvements, including intricately designed cross-attention mechanisms and a latent-space refinement module, tailored for the AiOR task. Extensive experiments show that RestoreVAR achieves state-of-the-art performance among generative AiOR methods, while also exhibiting strong generalization capabilities.

[112] SHARDeg: A Benchmark for Skeletal Human Action Recognition in Degraded Scenarios

Simon Malzard,Nitish Mital,Richard Walters,Victoria Nockles,Raghuveer Rao,Celso M. De Melo

Main category: cs.CV

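TL;DR: The paper presents a data degradation benchmark for skeletal human action recognition (SHAR), evaluating five leading models under three forms of degradation and finding that the degradation type strongly affects model accuracy. An interpolation-based mitigation improves performance, and a model based on Rough Path Theory is found to perform well at low frame rates.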

Details Motivation: Existing computer vision models perform poorly on degraded real-world data, and skeletal human action recognition (SHAR) in particular lacks a systematic robustness assessment. Method: Builds a data degradation benchmark on the NTU-RGB+D-120 dataset, evaluates the robustness of five SHAR models to three forms of degradation, and proposes an interpolation-based approach to improve performance. Result: Degradation type has a large impact on model accuracy (differences of >40%), interpolation improves performance by up to >40%, and the LogSigRNN model outperforms the SoTA model at low frame rates. Conclusion: The paper provides the first data degradation benchmark for SHAR, shows the importance of degradation type, and demonstrates the advantage of the LogSigRNN model on degraded data. Abstract: Computer vision (CV) models for detection, prediction or classification tasks operate on video data-streams that are often degraded in the real world, due to deployment in real-time or on resource-constrained hardware. It is therefore critical that these models are robust to degraded data, but state of the art (SoTA) models are often insufficiently assessed with these real-world constraints in mind. This is exemplified by Skeletal Human Action Recognition (SHAR), which is critical in many CV pipelines operating in real-time and at the edge, but robustness to degraded data has previously only been shallowly and inconsistently assessed. Here we address this issue for SHAR by providing an important first data degradation benchmark on the most detailed and largest 3D open dataset, NTU-RGB+D-120, and assess the robustness of five leading SHAR models to three forms of degradation that represent real-world issues. We demonstrate the need for this benchmark by showing that the form of degradation, which has not previously been considered, has a large impact on model accuracy; at the same effective frame rate, model accuracy can vary by >40% depending on degradation type. We also identify that temporal regularity of frames in degraded SHAR data is likely a major driver of differences in model performance, and harness this to improve performance of existing models by up to >40%, through employing a simple mitigation approach based on interpolation.
Finally, we highlight how our benchmark has helped identify an important degradation-resistant SHAR model based on Rough Path Theory; the LogSigRNN SHAR model outperforms the SoTA DeGCN model in five out of six cases at low frame rates by an average accuracy of 6%, despite trailing the SoTA model by 11-12% on un-degraded data at high frame rates (30 FPS).
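The interpolation-based mitigation can be pictured as restoring a temporally regular sequence by filling dropped frames from their nearest observed neighbours. The sketch below assumes linear interpolation and marks dropped frames with `None`; the abstract only says the approach is "based on interpolation", so the exact scheme and the function name are assumptions.

```python
def interpolate_missing_frames(frames):
    """Fill dropped skeleton frames by linear interpolation.

    frames: list where each item is either a list of floats (flattened joint
    coordinates for one frame) or None for a dropped frame.
    Gaps at the start or end are filled by repeating the nearest observed frame.
    """
    observed = [i for i, frame in enumerate(frames) if frame is not None]
    if not observed:
        raise ValueError("need at least one observed frame")
    filled = []
    for i, frame in enumerate(frames):
        if frame is not None:
            filled.append(list(frame))
            continue
        prev_idx = max((k for k in observed if k < i), default=None)
        next_idx = min((k for k in observed if k > i), default=None)
        if prev_idx is None:
            filled.append(list(frames[next_idx]))
        elif next_idx is None:
            filled.append(list(frames[prev_idx]))
        else:
            # Weight by how far frame i sits between the two observed frames.
            w = (i - prev_idx) / (next_idx - prev_idx)
            filled.append(
                [(1 - w) * a + w * b for a, b in zip(frames[prev_idx], frames[next_idx])]
            )
    return filled
```

The output has one frame per original time step, so a model trained at the full frame rate sees evenly spaced input again.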

[113] SpikeGen: Generative Framework for Visual Spike Stream Processing

Gaole Dai,Menghang Dong,Rongyu Zhang,Ruichuan An,Shanghang Zhang,Tiejun Huang

Main category: cs.CV

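TL;DR: The paper proposes SpikeGen, a generative processing framework that addresses the spatial sparsity of visual spike streams captured by spike cameras and uses generative models to fuse RGB modality information for a variety of vision tasks.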

Details Motivation: Spike cameras capture clear textures under dynamic conditions, but the data they produce is spatially sparse, whereas the RGB modality provides dense spatial information. Generative models are proposed to address the limitations of sparse data and combine the strengths of both modalities. Method: Introduces the SpikeGen framework, which uses the latent-space operation abilities of generative models to combine spike-stream and RGB information for conditional generation and fusion. Result: Experiments show that SpikeGen effectively addresses the sparsity of spatial information while fully exploiting the temporal richness of spike streams, improving performance on several vision tasks (e.g., image/video deblurring, frame reconstruction, and novel-view synthesis). Conclusion: Generative models show promise for fusing spike and RGB modalities, and SpikeGen achieves a synergistic enhancement of the two through latent-space operations. Abstract: Neuromorphic Visual Systems, such as spike cameras, have attracted considerable attention due to their ability to capture clear textures under dynamic conditions. This capability effectively mitigates issues related to motion and aperture blur. However, in contrast to conventional RGB modalities that provide dense spatial information, these systems generate binary, spatially sparse frames as a trade-off for temporally rich visual streams. In this context, generative models emerge as a promising solution to address the inherent limitations of sparse data. These models not only facilitate the conditional fusion of existing information from both spike and RGB modalities but also enable the conditional generation based on latent priors. In this study, we introduce a robust generative processing framework named SpikeGen, designed for visual spike streams captured by spike cameras. We evaluate this framework across multiple tasks involving mixed spike-RGB modalities, including conditional image/video deblurring, dense frame reconstruction from spike streams, and high-speed scene novel-view synthesis. Supported by comprehensive experimental results, we demonstrate that leveraging the latent space operation abilities of generative models allows us to effectively address the sparsity of spatial information while fully exploiting the temporal richness of spike streams, thereby promoting a synergistic enhancement of different visual modalities.

[114] LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision

Anthony Fuller,Yousef Yassin,Junfeng Wen,Daniel G. Kyrollos,Tarek Ibrahim,James R. Green,Evan Shelhamer

Main category: cs.CV

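TL;DR: LookWhere uses adaptive computation to reduce the cost of vision transformers at high resolution, jointly training a selector and an extractor and substantially reducing FLOPs and time.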

Details Motivation: Addresses the excessive computational cost of vision transformers on high-resolution images. Method: Jointly trains a low-resolution selector and a high-resolution extractor, without ever processing the full high-resolution input. Result: On sparse recognition, reduces FLOPs by up to 34x and time by 6x; on standard tasks, reduces time by 1.36x while improving accuracy. Conclusion: LookWhere is efficient and accurate and applies to a range of recognition tasks. Abstract: Vision transformers are ever larger, more accurate, and more expensive to compute. The expense is even more extreme at high resolution as the number of tokens grows quadratically with the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect, learning where and what to compute simultaneously. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts transferrable representations of images. We show that LookWhere excels at sparse recognition on high-resolution inputs (Traffic Signs), maintaining accuracy while reducing FLOPs by up to 34x and time by 6x. It also excels at standard recognition tasks that are global (ImageNet classification) or local (ADE20K segmentation), improving accuracy while reducing time by 1.36x.
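Two ideas in the abstract lend themselves to a small worked example: the token count of a patch-based transformer grows with the square of the image side, and a selector can keep only the highest-scoring patches for the expensive extractor. The sketch below uses hypothetical helper names and a plain top-k rule over given scores; in LookWhere the selector is a learned low-resolution model, not a hand-written ranking.

```python
def num_tokens(image_side, patch_side):
    """Number of non-overlapping square patches in a square image."""
    return (image_side // patch_side) ** 2


def select_top_k(patch_scores, k):
    """Indices of the k highest-scoring patches, returned in index order.

    Assumption: patch_scores holds one predicted importance score per
    high-resolution patch, produced elsewhere by a selector.
    """
    ranked = sorted(range(len(patch_scores)), key=lambda i: patch_scores[i], reverse=True)
    return sorted(ranked[:k])


# Doubling the image side quadruples the token count: 196 -> 784 for 16-pixel patches.
tokens_at_224 = num_tokens(224, 16)
tokens_at_448 = num_tokens(448, 16)
```

With 784 candidate patches at side 448, passing only a small top-k subset to the extractor is where the savings in FLOPs and time would come from.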

[115] BOTM: Echocardiography Segmentation via Bi-directional Optimal Token Matching

Zhihua Liu,Lei Tong,Xilin He,Che Liu,Rossella Arcucci,Chen Jin,Huiyu Zhou

Main category: cs.CV

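TL;DR: The BOTM framework addresses anatomical inconsistency in echocardiography segmentation through bi-directional optimal token matching, producing stable and accurate segmentations.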

Details Motivation: Existing echocardiography segmentation methods suffer from anatomical inconsistency caused by shape variation, partial observation, and region ambiguity; BOTM aims to provide an anatomical consistency guarantee. Method: BOTM matches image tokens through a bi-directional cross-transport attention proxy, performing segmentation and anatomy transportation simultaneously. Result: Experiments show that BOTM reduces HD by 1.917 on CAMUS2H LV and improves Dice by 1.9% on TED. Conclusion: BOTM produces stable, anatomically consistent segmentations and outperforms existing methods. Abstract: Existing echocardiography segmentation methods often suffer from an anatomical inconsistency challenge caused by shape variation, partial observation and region ambiguity with similar intensity across 2D echocardiographic sequences, resulting in false positive segmentation with anatomically defective structures in challenging low signal-to-noise ratio conditions. To provide a strong anatomical guarantee across different echocardiographic frames, we propose a novel segmentation framework named BOTM (Bi-directional Optimal Token Matching) that performs echocardiography segmentation and optimal anatomy transportation simultaneously. Given paired echocardiographic images, BOTM learns to match two sets of discrete image tokens by finding optimal correspondences from a novel anatomical transportation perspective. We further extend the token matching into a bi-directional cross-transport attention proxy to regulate the preserved anatomical consistency within the cardiac cyclic deformation in the temporal domain. Extensive experimental results show that BOTM can generate stable and accurate segmentation outcomes (e.g. -1.917 HD on CAMUS2H LV, +1.9% Dice on TED), and provide a better matching interpretation with anatomical consistency guarantee.
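The two metrics quoted in the results, Dice and Hausdorff distance (HD), can be computed as in the sketch below. It treats each mask as a set of foreground pixel coordinates and computes HD over all foreground points; published evaluations often compute HD on boundary points and in physical units, so this is an illustration of the definitions rather than the paper's evaluation code.

```python
import math


def dice(pred, target):
    """Dice coefficient between two sets of (row, col) foreground pixels."""
    if not pred and not target:
        return 1.0
    return 2.0 * len(pred & target) / (len(pred) + len(target))


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""

    def directed(source, reference):
        # Largest distance from any source point to its nearest reference point.
        return max(min(math.dist(p, q) for q in reference) for p in source)

    return max(directed(points_a, points_b), directed(points_b, points_a))
```

Higher Dice means more overlap, while lower HD means the worst-case boundary disagreement is smaller, which is why the abstract reports a negative HD change alongside a positive Dice change.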

[116] FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

Zherui Zhang,Jiaxin Wu,Changwei Wang,Rongtao Xu,Longzhao Huang,Wenhao Xu,Wenbo Xu,Li Guo,Shibiao Xu

Main category: cs.CV

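TL;DR: FDBPL is an efficient prompt learning method that shares soft supervision contexts and accelerates I/O to address the generalization and efficiency shortcomings of existing methods. It also introduces region-aware prompt learning and a positive-negative space mutual learning mechanism, significantly improving zero-shot performance.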

Details Motivation: Existing prompt learning methods fall short in generalization and efficiency, and distillation-based methods in particular sacrifice training efficiency. FDBPL aims to retain both parameter efficiency and strong generalization. Method: FDBPL improves efficiency by sharing soft supervision contexts and accelerating I/O, and introduces region-aware prompt learning with a positive-negative space mutual learning mechanism to better recognize correct semantics and reject weakly related concepts. Result: Evaluations on 11 datasets show strong performance in base-to-new generalization, cross-dataset transfer, and robustness tests, with 2.2x faster training. Conclusion: FDBPL significantly improves generalization and training efficiency while remaining parameter-efficient, offering a new solution for prompt learning. Abstract: Prompt learning is a parameter-efficient method that has been widely adopted to adapt Vision-Language Models (VLMs) to downstream tasks. While hard-prompt design requires domain expertise and iterative optimization, soft-prompt methods rely heavily on task-specific hard labels, limiting their generalization to unseen categories. Recent popular distillation-based prompt learning methods improve generalization by exploiting larger teacher VLMs and unsupervised knowledge transfer, yet their repetitive teacher model online inference sacrifices the inherent training efficiency advantage of prompt learning. In this paper, we propose Faster Distillation-Based Prompt Learning (FDBPL), which addresses these issues by sharing soft supervision contexts across multiple training stages and implementing accelerated I/O. Furthermore, FDBPL introduces a region-aware prompt learning paradigm with dual positive-negative prompt spaces to fully exploit randomly cropped regions that contain multi-level information. We propose a positive-negative space mutual learning mechanism based on similarity-difference learning, enabling student CLIP models to recognize correct semantics while learning to reject weakly related concepts, thereby improving zero-shot performance. Unlike existing distillation-based prompt learning methods that sacrifice parameter efficiency for generalization, FDBPL maintains dual advantages of parameter efficiency and strong downstream generalization.
Comprehensive evaluations across 11 datasets demonstrate superior performance in base-to-new generalization, cross-dataset transfer, and robustness tests, achieving $2.2\times$ faster training speed.

[117] Semantic Correspondence: Unified Benchmarking and a Strong Baseline

Kaiyan Zhang,Xinghui Li,Jingyi Lu,Kai Han

Main category: cs.CV

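TL;DR: This paper presents the first comprehensive survey of semantic correspondence methods, proposing a taxonomy to classify existing methods and analyzing each in detail. With a unified comparison table and controlled experiments, it proposes a simple yet effective baseline that achieves SOTA performance.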

Details Motivation: Semantic correspondence is a challenging task in computer vision, yet a comprehensive review and analysis is missing. The paper aims to fill this gap and provide a reference and baseline for future research. Method: Proposes a taxonomy to classify existing methods and analyzes each in detail; aggregates results from the literature into a unified comparison; runs controlled experiments to assess the effectiveness of the components of different methods; and proposes a simple yet effective baseline. Result: The baseline achieves SOTA performance on multiple benchmarks. Conclusion: The paper provides a comprehensive survey and a baseline for semantic correspondence, intended as a foundation for future research. Abstract: Establishing semantic correspondence is a challenging task in computer vision, aiming to match keypoints with the same semantic information across different images. Benefiting from the rapid development of deep learning, remarkable progress has been made over the past decade. However, a comprehensive review and analysis of this task remains absent. In this paper, we present the first extensive survey of semantic correspondence methods. We first propose a taxonomy to classify existing methods based on the type of their method designs. These methods are then categorized accordingly, and we provide a detailed analysis of each approach. Furthermore, we aggregate and summarize the results of methods in the literature across various benchmarks into a unified comparative table, with detailed configurations to highlight performance variations. Additionally, to provide a detailed understanding of existing methods for semantic matching, we thoroughly conduct controlled experiments to analyse the effectiveness of the components of different methods. Finally, we propose a simple yet effective baseline that achieves state-of-the-art performance on multiple benchmarks, providing a solid foundation for future research in this field. We hope this survey serves as a comprehensive reference and consolidated baseline for future development. Code is publicly available at: https://github.com/Visual-AI/Semantic-Correspondence.

[118] DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation

Junhao Chen,Mingjin Chen,Jianjin Xu,Xiang Li,Junting Dong,Mingze Sun,Puhua Jiang,Hongxiang Li,Yuhang Yang,Hao Zhao,Xiaoxiao Long,Ruqi Huang

Main category: cs.CV

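TL;DR: DanceTogether is an end-to-end diffusion framework that generates long videos of multi-actor interaction from a reference image and independent pose-mask streams, addressing the failures of existing controllable video generation systems on multi-actor motion and interaction.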

Details Motivation: Current controllable video generation systems perform poorly when multiple actors move and interact, especially under noisy control signals. DanceTogether aims to solve this and generate multi-actor interaction videos. Method: Proposes MaskPoseAdapter, which fuses tracking masks with pose heat-maps to bind identity and action at every denoising step, avoiding identity drift and appearance bleeding. Result: On TogetherVideoBench, DanceTogether outperforms existing methods by a significant margin, and a small amount of fine-tuning yields human-robot interaction videos. Conclusion: DanceTogether, with its accompanying datasets and benchmark, extends controllable video generation from single-subject to multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence. Abstract: Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds "who" and "how" at every denoising step by fusing robust tracking masks with semantically rich but noisy pose heat-maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 hours of dual-skater footage with 7,000+ distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centered on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms the prior art by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalization to embodied-AI and HRI tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence. Our video demos and code are available at https://DanceTog.github.io/.

[119] Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

Xiaoyi Zhang,Zhaoyang Jia,Zongyu Guo,Jiahao Li,Bin Li,Houqiang Li,Yan Lu

Main category: cs.CV

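TL;DR: The paper proposes Deep Video Discovery (DVD), an agent for long-form video understanding that applies an autonomous search strategy over a multi-granular video database, significantly improving performance.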

Details Motivation: Long-form video understanding is challenging because of temporal-spatial complexity and long contexts, and existing Large Language Models (LLMs) still show limitations on information-dense long videos. Method: Proposes the DVD agent, which applies an autonomous search strategy over a multi-granular video database and uses the LLM's reasoning capability for planning, tool selection, and parameter formulation. Result: DVD performs strongly on multiple long video understanding benchmarks, surpassing prior work by a large margin on LVBench in particular. Conclusion: Through autonomous search and tool use, the DVD agent offers an effective solution for long-form video understanding, and its analyses give insights for further improving agent design. Abstract: Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips. Different from previous video agents manually designing a rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on a multi-granular video database, our DVD agent leverages the advanced reasoning capability of the LLM to plan on its current observation state, strategically selects tools, formulates appropriate parameters for actions, and iteratively refines its internal reasoning in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code will be released later.

[120] CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Hyungyung Lee,Geon Choi,Jung-Oh Lee,Hangyul Yoon,Hyuk Gi Hong,Edward Choi

Main category: cs.CV

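TL;DR: CheXStruct and CXReasonBench are a pipeline and benchmark built on the MIMIC-CXR-JPG dataset for evaluating the clinical reasoning ability of large vision-language models on medical tasks.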

Details Motivation: Existing benchmarks focus mainly on the final diagnostic answer and offer little insight into whether models engage in clinically meaningful reasoning. Method: CheXStruct automatically derives intermediate reasoning steps from chest X-rays, such as anatomical region segmentation and diagnostic measurements, and CXReasonBench uses them to evaluate models' reasoning. Result: The 10 evaluated LVLMs struggle with structured reasoning and generalization and have difficulty linking abstract knowledge with visual interpretation. Conclusion: The study provides a fine-grained evaluation tool for model reasoning in medical tasks and exposes the limitations of existing models. Abstract: Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of 10 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench

[121] DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations

Ziqiao Peng,Yanbo Fan,Haoyu Wu,Xuan Wang,Hongyan Liu,Jun He,Zhaoxin Fan

Main category: cs.CV

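TL;DR: Proposes DualTalk, a unified framework for 3D talking head generation that handles both speaking and listening behaviors to simulate realistic conversation.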

Details Motivation: Existing 3D talking head generation models focus only on speaking or listening and neglect the natural switching between roles in conversation, leading to unnatural interactions. Method: Introduces the DualTalk framework, which integrates the dynamic behaviors of speakers and listeners, and creates a new dataset with 50 hours of multi-round conversations. Result: Experiments show that the method significantly enhances the naturalness and expressiveness of 3D talking heads in dual-speaker conversations. Conclusion: DualTalk provides more realistic interaction simulation for 3D talking head generation, addressing the limitations of existing models. Abstract: In face-to-face conversations, individuals need to switch between speaking and listening roles seamlessly. Existing 3D talking head generation models focus solely on speaking or listening, neglecting the natural dynamics of interactive conversation, which leads to unnatural interactions and awkward transitions. To address this issue, we propose a new task -- multi-round dual-speaker interaction for 3D talking head generation -- which requires models to handle and generate both speaking and listening behaviors in continuous conversation. To solve this task, we introduce DualTalk, a novel unified framework that integrates the dynamic behaviors of speakers and listeners to simulate realistic and coherent dialogue interactions. This framework not only synthesizes lifelike talking heads when speaking but also generates continuous and vivid non-verbal feedback when listening, effectively capturing the interplay between the roles. We also create a new dataset featuring 50 hours of multi-round conversations with over 1,000 characters, where participants continuously switch between speaking and listening roles. Extensive experiments demonstrate that our method significantly enhances the naturalness and expressiveness of 3D talking heads in dual-speaker conversations. We recommend watching the supplementary video: https://ziqiaopeng.github.io/dualtalk.

[122] F-ANcGAN: An Attention-Enhanced Cycle Consistent Generative Adversarial Architecture for Synthetic Image Generation of Nanoparticles

Varun Ajith,Anindya Pal,Saumik Bhattacharya,Sayantari Ghosh

Main category: cs.CV

TL;DR: F-ANcGAN is an attention-based generative adversarial system that generates high-quality nanoparticle SEM images from limited data, addressing data scarcity.

Details Motivation: In nanomaterial research, high-quality annotated data is scarce, which limits the development of segmentation models for nanoscale imaging. Method: Uses an attention-enhanced cycle-consistent generative adversarial system (F-ANcGAN) that combines a Style U-Net generator with a self-attention U-Net segmentation network, with data augmentation to increase diversity. Result: The model reaches an FID score of 17.65 for TiO$_2$ dataset generation, reduced to about 10.39 after post-processing. Conclusion: F-ANcGAN can efficiently generate high-fidelity synthetic data and improve downstream segmentation training, making it suitable for resource-limited fields. Abstract: Nanomaterial research is becoming a vital area for energy, medicine, and materials science, and accurate analysis of the nanoparticle topology is essential to determine their properties. Unfortunately, the lack of high-quality annotated datasets drastically hinders the creation of strong segmentation models for nanoscale imaging. To alleviate this problem, we introduce F-ANcGAN, an attention-enhanced cycle consistent generative adversarial system that can be trained using a limited number of data samples and generates realistic scanning electron microscopy (SEM) images directly from segmentation maps. Our model uses a Style U-Net generator and a U-Net segmentation network equipped with self-attention to capture structural relationships and applies augmentation methods to increase the variety of the dataset. The architecture reached a raw FID score of 17.65 for TiO$_2$ dataset generation, with a further reduction in FID score to nearly 10.39 by using efficient post-processing techniques. By facilitating scalable high-fidelity synthetic dataset generation, our approach can improve the effectiveness of downstream segmentation task training, overcoming severe data shortage issues in nanoparticle analysis, thus extending its applications to resource-limited fields.
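For readers unfamiliar with the metric, the Fréchet Inception Distance (FID) quoted above is conventionally defined as below, where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of deep features extracted from real and generated images. Lower values mean the two feature distributions are closer. The abstract does not state which feature extractor or sample size was used, so this is the standard definition rather than a description of the authors' setup.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```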

[123] Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking

Cheng-Yen Yang,Hsiang-Wei Huang,Pyong-Kun Kim,Chien-Kai Kuo,Jui-Wei Chang,Kwang-Ju Kim,Chung-I Huang,Jenq-Neng Hwang

Main category: cs.CV

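TL;DR: Presents an effective method for adapting Segment Anything Model 2 (SAM2) to the visual object tracking (VOT) task; the optimized SAM2 achieves leading performance on a multi-modal dataset.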

Details Motivation: Leverage the pre-trained capabilities of SAM2 to improve its performance on VOT tasks. Method: Combines SAM2 with several proposed optimization techniques to enhance its performance in VOT applications. Result: Ranked first in the 2024 ICPR Multi-modal Object Tracking challenge with an AUC score of 89.4. Conclusion: By optimizing SAM2, the method performs strongly on multi-modal VOT tasks, which validates its effectiveness. Abstract: We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance in VOT applications. By combining SAM2 with our proposed optimizations, we achieved a first-place AUC score of 89.4 on the 2024 ICPR Multi-modal Object Tracking challenge, demonstrating the effectiveness of our approach. This paper details our methodology, the specific enhancements made to SAM2, and a comprehensive analysis of our results in the context of VOT solutions along with the multi-modality aspect of the dataset.

[124] Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion

Jacob Hansen,Wei Lin,Junmo Kang,Muhammad Jehanzeb Mirza,Hongyin Luo,Rogerio Feris,Alan Ritter,James Glass,Leonid Karlinsky

Main category: cs.CV

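TL;DR: The paper proposes an open, unified method (Instructify) for converting image metadata into visual instruction tuning (VisIT) data, addressing existing methods' reliance on closed-source models, their high cost, and their poor scalability.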

Details Motivation: Existing methods for building VisIT datasets rely on closed-source model APIs, are costly and hard to scale, and lack a unified standard and reproducibility. Method: Proposes a multi-stage method covering metadata grouping, quality control, data organization, and conversation sampling, using open LLMs (such as Gemma 2 27B and LLaMa 3.1 70B) to generate VisIT instructions. Result: Applied to the same data sources, the method improves VisIT data quality by about 3% on average and up to 12% on individual benchmarks, and it scales in both data quantity and quality. Conclusion: Instructify offers an efficient, scalable solution for VisIT data generation that supports open models and improves LMM performance. Abstract: Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many VisIT datasets are available, most are constructed using ad-hoc techniques developed independently by different groups. They are often poorly documented, lack reproducible code, and rely on paid, closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) into VisIT instructions. This leads to high costs and makes it challenging to scale, enhance quality, or generate VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach, Instructify, for converting available metadata to VisIT instructions using open LLMs. Our multi-stage Instructify features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or enhance the data quality of available VisIT datasets when applied to the same image data and metadata sources, improving GPT-4 generated VisIT instructions by about 3% on average and up to 12% on individual benchmarks using open models, such as Gemma 2 27B and LLaMa 3.1 70B. Additionally, our approach enables effective performance scaling - both in quantity and quality - by enhancing the resulting LMM performance across a wide range of benchmarks. We also analyze the impact of various factors, including conversation format, base model selection, and resampling strategies. Our code, which supports the reproduction of equal or higher-quality VisIT datasets and facilitates future metadata-to-VisIT data conversion for niche domains, is released at https://github.com/jacob-hansen/Instructify.

[125] One RL to See Them All: Visual Triple Unified Reinforcement Learning

Yan Ma,Linge Du,Xuyang Shen,Shaoxiang Chen,Pengfei Li,Qibing Ren,Lizhuang Ma,Yuchao Dai,Pengfei Liu,Junjie Yan

Main category: cs.CV

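TL;DR: V-Triune is a visual triple unified reinforcement learning system for jointly training vision-language models (VLMs) on reasoning and perception tasks; a dynamic IoU reward and a diverse dataset significantly improve performance.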

Details Motivation: Explore the use of reinforcement learning (RL) in vision-language models beyond reasoning tasks, especially its potential for perception-intensive tasks such as object detection and grounding. Method: Proposes the V-Triune system, which comprises three complementary components (Sample-Level Data Formatting, Verifier-Level Reward Computation, and Source-Level Metric Monitoring), and introduces a Dynamic IoU reward. Result: The Orsta models perform well on both reasoning and perception tasks, improving by 2.1 to 14.1 points on MEGA-Bench Core. Conclusion: The V-Triune system demonstrates the effectiveness and scalability of a unified RL approach for VLMs. Abstract: Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
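The abstract describes the Dynamic IoU reward only as "adaptive, progressive, and definite". As a rough illustration of what a progressive IoU-based reward could look like, the sketch below tightens a pass/fail IoU threshold as training progresses. The staged thresholds (0.5, 0.75, 0.9), the `progress` argument, and both function names are assumptions made for illustration and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dynamic_iou_reward(pred, target, progress, thresholds=(0.5, 0.75, 0.9)):
    """Binary reward whose IoU threshold rises with training progress.

    `progress` is the fraction of training completed, in [0, 1].
    The threshold values are hypothetical, not the paper's settings.
    """
    stage = min(int(progress * len(thresholds)), len(thresholds) - 1)
    return 1.0 if iou(pred, target) >= thresholds[stage] else 0.0
```

Under these assumed thresholds, a prediction with IoU 0.8 is rewarded early in training and stops being rewarded once the strictest stage is reached.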

[126] BiggerGait: Unlocking Gait Recognition with Layer-wise Representations from Large Vision Models

Dingqing Ye,Chao Fan,Zhanbo Huang,Chengwen Luo,Jianqiang Li,Shiqi Yu,Xiaoming Liu

Main category: cs.CV

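TL;DR: BiggerGait is a gait recognition method based on large vision models (LVMs) that integrates the complementary properties of intermediate LVM layers, significantly improving performance without relying on elaborate gait priors.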

Details Motivation: Existing LVM-based methods rely too heavily on gait priors and overlook the rich diversity of the LVM's multi-layer representations; this work aims to unlock the LVM's potential. Method: Analyzes how the representations of individual LVM layers affect downstream tasks and proposes BiggerGait, which integrates the complementary properties of the intermediate layers. Result: BiggerGait's superiority is validated on several datasets (CCPG, CASIA-B*, and others), with particularly strong results on cross-domain tasks. Conclusion: BiggerGait provides a simple yet effective baseline for gait representation learning; the models and code will be made public. Abstract: Large vision models (LVM) based gait recognition has achieved impressive performance. However, existing LVM-based approaches may overemphasize gait priors while neglecting the intrinsic value of LVM itself, particularly the rich, distinct representations across its multi-layers. To adequately unlock LVM's potential, this work investigates the impact of layer-wise representations on downstream recognition tasks. Our analysis reveals that LVM's intermediate layers offer complementary properties across tasks; integrating them yields an impressive improvement even without rich well-designed gait priors. Building on this insight, we propose a simple and universal baseline for LVM-based gait recognition, termed BiggerGait. Comprehensive evaluations on CCPG, CASIA-B*, SUSTech1K, and CCGR_MINI validate the superiority of BiggerGait across both within- and cross-domain tasks, establishing it as a simple yet practical baseline for gait representation learning. All the models and code will be publicly available.

[127] Boosting Open Set Recognition Performance through Modulated Representation Learning

Amit Kumar Kundu,Vaishnavi Patil,Joseph Jaja

Main category: cs.CV

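TL;DR: This paper proposes temperature-modulated representation learning for open set recognition (OSR) based on a negative cosine scheduling scheme; dynamically adjusting the temperature improves the model's representation and generalization.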

Details Motivation: Existing OSR methods use a fixed temperature, which limits the diversity of representation learning and prevents the model from exploring both instance-level and semantic-level features; this paper aims to solve that problem. Method: Uses a negative cosine scheduling scheme to adjust the temperature dynamically, so that the model forms a coarse decision boundary early in training and then gradually smooths it, enriching the representation space. Result: Experiments show that the method improves both OSR and closed-set performance across several baselines and loss functions, especially on semantic shift benchmarks. Conclusion: The proposed temperature modulation scheme adds no computational overhead, can be folded into existing OSR methods, and improves their performance. Abstract: The open set recognition (OSR) problem aims to identify test samples from novel semantic classes that are not part of the training classes, a task that is crucial in many practical scenarios. However, existing OSR methods use a constant scaling factor (the temperature) to the logits before applying a loss function, which hinders the model from exploring both ends of the spectrum in representation learning -- from instance-level to semantic-level features. In this paper, we address this problem by enabling temperature-modulated representation learning using our novel negative cosine scheduling scheme. Our scheduling lets the model form a coarse decision boundary at the beginning of training by focusing on fewer neighbors, and gradually prioritizes more neighbors to smooth out rough edges. This gradual task switching leads to a richer and more generalizable representation space. While other OSR methods benefit by including regularization or auxiliary negative samples, such as with mix-up, thereby adding a significant computational overhead, our scheme can be folded into any existing OSR method with no overhead. We implement the proposed scheme on top of a number of baselines, using both cross-entropy and contrastive loss functions as well as a few other OSR methods, and find that our scheme boosts both the OSR performance and the closed set performance in most cases, especially on the tougher semantic shift benchmarks.
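To make the scheduling idea concrete, here is a minimal sketch of one possible negative cosine temperature schedule. It assumes the temperature rises from a small value (sharper softmax, which emphasizes fewer neighbors) to a larger one over a single half-period of training. The functional form, the direction, the default bounds `tau_min` and `tau_max`, and the function name are assumptions, since the abstract does not give the formula.

```python
import math


def negative_cosine_temperature(step, total_steps, tau_min=0.1, tau_max=1.0):
    """Temperature that follows -cos over one half-period.

    Starts at tau_min (step 0) and ends at tau_max (step == total_steps).
    The bounds are illustrative defaults, not values from the paper.
    """
    phase = math.pi * step / max(total_steps, 1)
    return tau_min + 0.5 * (tau_max - tau_min) * (1.0 - math.cos(phase))


def scale_logits(logits, temperature):
    """Divide logits by the current temperature before applying the loss."""
    return [z / temperature for z in logits]
```

Because the schedule only changes a scalar applied to the logits, it adds no extra forward or backward passes, which is consistent with the "no overhead" claim in the abstract.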

[128] TokBench: Evaluating Your Visual Tokenizer before Visual Generation

Junfeng Wu,Dongliang Luo,Weizhi Zhao,Zhihao Xie,Yuanhao Wang,Junyi Li,Xudong Xie,Yuliang Liu,Xiang Bai

Main category: cs.CV

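TL;DR: This paper reveals the limitations of visual tokenizers and VAEs in preserving fine-grained features and proposes a benchmark for evaluating text and face reconstruction performance.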

Details Motivation: Image tokenization has advanced visual generation and multimodal modeling, but the information inevitably lost during compression caps generation quality; this paper evaluates how that loss affects text and face reconstruction. Method: Collects text and face images, uses OCR models to assess the accuracy of reconstructed text, and quantifies face reconstruction fidelity through feature similarity. The method is lightweight, needing only 2GB of memory and 4 minutes to complete an evaluation. Result: Modern visual tokenizers still struggle to preserve fine-grained features, especially at smaller scales. Traditional metrics fail to reflect face and text reconstruction performance accurately, whereas the proposed metrics serve as an effective complement. Conclusion: The proposed benchmark and analysis framework offer a new perspective on evaluating visual tokenizers and VAEs, particularly for text and face reconstruction. Abstract: In this work, we reveal the limitations of visual tokenizers and VAEs in preserving fine-grained features, and propose a benchmark to evaluate reconstruction performance for two challenging visual contents: text and face. Image tokenization has significantly advanced visual generation and multimodal modeling, particularly with autoregressive models due to the modeling simplicity of discrete tokens. Autoregressive models typically rely on image tokenizers to compress images into discrete tokens for sequential prediction, whereas diffusion models often operate on continuous latent space to reduce computational costs. However, both visual compression approaches inevitably lose visual information, thereby limiting the upper bound of visual generation quality. To evaluate how these compression losses affect text and faces, the most human-sensitive visual elements, we first collect and curate a collection of text and face images from existing datasets, ensuring clarity and diversity. For text reconstruction, we employ OCR models to assess the recognition accuracy of the reconstructed text, and then we measure feature similarity between original and reconstructed faces, thereby quantifying face reconstruction fidelity. Our method is highly lightweight, requiring just 2GB memory and 4 minutes to complete evaluations. With our benchmark, we analyze the reconstruction quality of text and faces at various scales across different image tokenizers and VAEs. Our results demonstrate that modern visual tokenizers still struggle to preserve fine-grained features, particularly at smaller scales. Furthermore, we extend this evaluation framework to video, conducting a comprehensive analysis of video tokenizers. Additionally, we find that traditional metrics fail to accurately reflect the reconstruction performance for faces and text, while our proposed metrics serve as an effective complement.
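The two measurements described above can be sketched in a few lines. This is only an illustration under assumptions: the OCR output and the face feature vectors are taken as given (the recognizer and the face encoder are not specified here), exact word match stands in for recognition accuracy, and cosine similarity stands in for "feature similarity". The function names are hypothetical.

```python
import math


def text_accuracy(original_words, recognized_words):
    """Fraction of original words that OCR reads back exactly from the reconstruction."""
    if not original_words:
        return 0.0
    hits = sum(1 for o, r in zip(original_words, recognized_words) if o == r)
    return hits / len(original_words)


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def face_fidelity(original_features, reconstructed_features):
    """Mean cosine similarity over paired original/reconstructed face features."""
    sims = [cosine_similarity(o, r)
            for o, r in zip(original_features, reconstructed_features)]
    return sum(sims) / len(sims) if sims else 0.0
```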

[129] REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders

Savya Khosla,Sethuraman TV,Barnett Lee,Alexander Schwing,Derek Hoiem

Main category: cs.CV

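TL;DR: REN is a fast and effective model that generates region-based image representations from point prompts, avoiding the high computational cost of conventional methods; it delivers 60x faster token generation with 35x less memory while outperforming existing methods.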

Details Motivation: Existing methods combine class-agnostic segmenters with patch-based image encoders to produce region representations, but at high computational cost. REN aims to bypass this bottleneck and generate region tokens directly. Method: REN uses a lightweight module whose cross-attention blocks take point prompts as queries and combine them with image encoder features to produce region tokens. It supports several encoders (such as DINO, DINOv2, and OpenCLIP). Result: REN performs well on semantic segmentation and retrieval tasks, outperforms the original encoders, matches or exceeds SAM-based methods, and is significantly faster. It achieves state-of-the-art results on Ego4D VQ2D and Visual Haystacks. Conclusion: REN is an efficient, general method for generating region representations that significantly improves speed and performance and applies to a range of tasks and encoders. Abstract: We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders (DINO, DINOv2, and OpenCLIP) and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks' single-needle challenge. Code and models are available at: https://github.com/savya08/REN.
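The core operation named in the abstract, cross-attention with point prompts as queries and patch features as keys and values, can be sketched as follows. This is a generic single-head scaled dot-product cross-attention without the learned projections, multiple blocks, or point-prompt embedding that a real model would need; it is not REN's actual architecture, and the function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one embedding per point prompt
    keys, values: one embedding per image patch
    Returns one region token per query.
    """
    d = len(keys[0])
    tokens = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        token = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        tokens.append(token)
    return tokens
```

Each region token is a weighted average of patch features, so its cost scales with the number of prompts times the number of patches rather than with a separate segmentation pass.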

cs.GR [Back]

[130] From Flight to Insight: Semantic 3D Reconstruction for Aerial Inspection via Gaussian Splatting and Language-Guided Segmentation

Mahmoud Chick Zaouali,Todd Charter,Homayoun Najjaran

Main category: cs.GR

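TL;DR: The paper proposes a hybrid UAV-based approach to 3D reconstruction with language-guided segmentation that combines 3D Gaussian Splatting (3DGS) with semantic feature fields to produce high-fidelity, semantically interpretable 3D reconstructions.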

Details Motivation: Traditional photogrammetry lacks semantic interpretability, and recent neural rendering and 3DGS techniques, although efficient and photorealistic, lack scene-level understanding. A method that combines language interaction with semantic segmentation is therefore needed to raise the level of automation in aerial inspection. Method: The paper proposes a UAV pipeline that extends Feature-3DGS, uses LSeg feature fields with CLIP embeddings to generate heatmaps from language prompts, and refines the result with 2D segmentation by SAM or SAM2. Result: Experiments show that the method enables flexible, language-driven interaction in large-scale outdoor environments and reveal the strengths and weaknesses of different feature field backbones (CLIP-LSeg, SAM, SAM2) in capturing meaningful structure. Conclusion: The hybrid approach opens new possibilities for semantic aerial inspection and scene understanding, providing high-fidelity, semantically interpretable 3D reconstruction. Abstract: High-fidelity 3D reconstruction is critical for aerial inspection tasks such as infrastructure monitoring, structural assessment, and environmental surveying. While traditional photogrammetry techniques enable geometric modeling, they lack semantic interpretability, limiting their effectiveness for automated inspection workflows. Recent advances in neural rendering and 3D Gaussian Splatting (3DGS) offer efficient, photorealistic reconstructions but similarly lack scene-level understanding. In this work, we present a UAV-based pipeline that extends Feature-3DGS for language-guided 3D segmentation. We leverage LSeg-based feature fields with CLIP embeddings to generate heatmaps in response to language prompts. These are thresholded to produce rough segmentations, and the highest-scoring point is then used as a prompt to SAM or SAM2 for refined 2D segmentation on novel view renderings. Our results highlight the strengths and limitations of various feature field backbones (CLIP-LSeg, SAM, SAM2) in capturing meaningful structure in large-scale outdoor environments. We demonstrate that this hybrid approach enables flexible, language-driven interaction with photorealistic 3D reconstructions, opening new possibilities for semantic aerial inspection and scene understanding.
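The heatmap, threshold, and point-prompt steps described in the abstract can be sketched as below. The sketch assumes a rendered per-pixel feature map and a text embedding that already live in the same space, uses cosine similarity as the heatmap score, and picks an arbitrary threshold of 0.5. The call to SAM or SAM2 is left out, and all names and the threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def prompt_heatmap(feature_map, text_embedding):
    """Score every pixel feature against the language prompt embedding."""
    return [[cosine(f, text_embedding) for f in row] for row in feature_map]


def rough_mask_and_point(heatmap, threshold=0.5):
    """Threshold the heatmap and return (mask, (row, col) of the best pixel)."""
    mask = [[score >= threshold for score in row] for row in heatmap]
    _, point = max(
        (score, (r, c))
        for r, row in enumerate(heatmap)
        for c, score in enumerate(row)
    )
    return mask, point
```

In a full pipeline, the returned point would then be passed as a point prompt to a segmenter to refine the rough mask on the rendered view.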

[131] Multi-Person Interaction Generation from Two-Person Motion Priors

Wenning Xu,Shiyu Fan,Paul Henderson,Edmond S. L. Ho

Main category: cs.GR

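TL;DR: Proposes a graph-structured method for multi-person interaction generation that uses two-person motion diffusion models as priors and, through decomposition and guidance, produces high-quality and diverse multi-person interactions.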

Details Motivation: Multi-person interaction modeling is an underexplored area, and existing methods struggle to generate high-quality, diverse interactive motion. Method: Decomposes multi-person interactions into a graph of two-person interactions (the Pairwise Interaction Graph) and introduces graph-dependent guidance terms to reduce artifacts in generation. Result: Experiments show that the method significantly reduces artifacts when generating diverse, high-quality multi-person interactions and outperforms existing methods. Conclusion: By introducing a graph structure and guidance terms, the method provides an efficient, high-quality solution for multi-person interaction generation. Abstract: Generating realistic human motion with high-level controls is a crucial task for social understanding, robotics, and animation. With high-quality MOCAP data becoming more available recently, a wide range of data-driven approaches have been presented. However, modelling multi-person interactions still remains a less explored area. In this paper, we present Graph-driven Interaction Sampling, a method that can generate realistic and diverse multi-person interactions by leveraging existing two-person motion diffusion models as motion priors. Instead of training a new model specific to multi-person interaction synthesis, our key insight is to spatially and temporally separate complex multi-person interactions into a graph structure of two-person interactions, which we name the Pairwise Interaction Graph. We thus decompose the generation task into simultaneous single-person motion generation conditioned on one other's motion. In addition, to reduce artifacts such as interpenetrations of body parts in generated multi-person interactions, we introduce two graph-dependent guidance terms into the diffusion sampling scheme. Unlike previous work, our method can produce various high-quality multi-person interactions without having repetitive individual motions. Extensive experiments demonstrate that our approach consistently outperforms existing methods in reducing artifacts when generating a wide range of two-person and multi-person interactions.

[132] WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions

Zizhang Li,Hong-Xing Yu,Wei Liu,Yin Yang,Charles Herrmann,Gordon Wetzstein,Jiajun Wu

Main category: cs.GR

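TL;DR: WonderPlay is a new framework that combines physics simulation with video generation to produce action-conditioned dynamic 3D scenes from a single image.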

Details Motivation: Existing methods are limited to rigid-body or simple elastic dynamics, whereas WonderPlay aims to generate a wider range of dynamic 3D scenes through a hybrid generative simulator. Method: Uses a hybrid generative simulator that first simulates coarse 3D dynamics with a physics solver, then has a video generator produce a finer, more realistic video, and finally uses that video to update the simulated scene. Result: Experiments show that WonderPlay lets users interact with a variety of scenes (cloth, sand, liquid, and others) from a single image. Conclusion: WonderPlay combines the accuracy of physics simulation with the expressivity of diffusion-based video generation, offering a new approach to dynamic scene generation. Abstract: WonderPlay is a novel framework integrating physics simulation with video generation for generating action-conditioned dynamic 3D scenes from a single image. While prior works are restricted to rigid body or simple elastic dynamics, WonderPlay features a hybrid generative simulator to synthesize a wide range of 3D dynamics. The hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, which subsequently conditions a video generator to produce a video with finer, more realistic motion. The generated video is then used to update the simulated dynamic 3D scene, closing the loop between the physics solver and the video generator. This approach enables intuitive user control to be combined with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results demonstrate that WonderPlay enables users to interact with various scenes of diverse content, including cloth, sand, snow, liquid, smoke, elastic, and rigid bodies -- all using a single image input. Code will be made public. Project website: https://kyleleey.github.io/WonderPlay/

cs.CL [Back]

[133] Prompt Engineering: How Prompt Vocabulary affects Domain Knowledge

Dimitri Schreiter

Main category: cs.CL

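TL;DR: The study examines how prompt vocabulary specificity affects LLM performance on tasks in STEM, medicine, and law, and finds that an optimal specificity range exists.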

Details Motivation: Explore how prompt specificity affects LLM performance in specialized domains (such as STEM, medicine, and law), filling a gap in the research. Method: Develops a synonymization framework and tests four LLMs on prompts of varying specificity. Result: Increasing prompt specificity generally has no significant effect, but there is an optimal specificity range in which LLMs perform best. Conclusion: Tuning prompts to the optimal specificity range can improve LLM performance in specialized domains and offers key guidance for prompt design. Abstract: Prompt engineering has emerged as a critical component in optimizing large language models (LLMs) for domain-specific tasks. However, the role of prompt specificity, especially in domains like STEM (physics, chemistry, biology, computer science and mathematics), medicine, and law, remains underexplored. This thesis addresses the problem of whether increasing the specificity of vocabulary in prompts improves LLM performance in domain-specific question-answering and reasoning tasks. We developed a synonymization framework to systematically substitute nouns, verbs, and adjectives with varying specificity levels, measuring the impact on four LLMs: Llama-3.1-70B-Instruct, Granite-13B-Instruct-V2, Flan-T5-XL, and Mistral-Large 2, across datasets in STEM, law, and medicine. Our results reveal that while generally increasing the specificity of prompts does not have a significant impact, there appears to be a specificity range, across all considered models, where the LLM performs the best. Identifying this optimal specificity range offers a key insight for prompt design, suggesting that manipulating prompts within this range could maximize LLM performance and lead to more efficient applications in specialized domains.

[134] Signals from the Floods: AI-Driven Disaster Analysis through Multi-Source Data Fusion

Xian Gong,Paul X. McCarthy,Lin Tian,Marian-Andrei Rizoiu

Main category: cs.CL

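TL;DR: The study uses data from X (formerly Twitter) and public inquiry submissions, combining LDA and LLM methods, to analyze public behavior during extreme weather events and improve disaster response.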

Details Motivation: Explore how social media and public inquiry data can improve government disaster response, using the 2022 New South Wales floods as a case study. Method: Integrates LDA topic modelling with LLMs for enhanced semantic understanding, and uses a Relevance Index method to filter noise and prioritize actionable content. Result: LDA reveals distinct opinions and geographic patterns, while LLMs improve the accuracy of identifying flood-relevant tweets, strengthening emergency response capability. Conclusion: By combining complementary data streams, the work introduces a novel AI-driven method to refine disaster-related social media content in support of real-time response and long-term resilience planning. Abstract: Massive and diverse web data are increasingly vital for government disaster response, as demonstrated by the 2022 floods in New South Wales (NSW), Australia. This study examines how X (formerly Twitter) and public inquiry submissions provide insights into public behaviour during crises. We analyse more than 55,000 flood-related tweets and 1,450 submissions to identify behavioural patterns during extreme weather events. While social media posts are short and fragmented, inquiry submissions are detailed, multi-page documents offering structured insights. Our methodology integrates Latent Dirichlet Allocation (LDA) for topic modelling with Large Language Models (LLMs) to enhance semantic understanding. LDA reveals distinct opinions and geographic patterns, while LLMs improve filtering by identifying flood-relevant tweets using public submissions as a reference. This Relevance Index method reduces noise and prioritizes actionable content, improving situational awareness for emergency responders. By combining these complementary data streams, our approach introduces a novel AI-driven method to refine crisis-related social media content, improve real-time disaster response, and inform long-term resilience planning.

[135] A new classification system of beer categories and styles based on large-scale data mining and self-organizing maps of beer recipes

Diego Bonatto

Main category: cs.CL

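TL;DR: The paper proposes a data-driven beer classification system; analyzing more than 60,000 beer recipes with statistical methods and self-organizing maps (SOMs), it identifies four major superclusters and reveals ingredient usage patterns under different fermentation styles.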

Details Motivation: Traditional beer classification relies mainly on sensory evaluation and lacks objectivity and reproducibility. This study aims to build a more scientific, scalable classification framework through a data-driven approach. Method: Analyzes 62,121 beer recipes, considering ingredients, fermentation parameters, and recipe statistics, and classifies them using statistical analysis and self-organizing maps (SOMs). Result: Identifies four major superclusters; cold fermented beers are conservative in their ingredients, while hot fermented beers are highly diverse, reflecting regional preferences and innovation. The new classification system provides a tool for recipe analysis and beer development. Conclusion: The new classification system offers an objective understanding of beer diversity and explores the links between ingredient usage, fermentation profiles, and flavor. Abstract: A data-driven quantitative approach was used to develop a novel classification system for beer categories and styles. Sixty-two thousand one hundred twenty-one beer recipes were mined and analyzed, considering ingredient profiles, fermentation parameters, and recipe vital statistics. Statistical analyses combined with self-organizing maps (SOMs) identified four major superclusters that showed distinctive malt and hop usage patterns, style characteristics, and historical brewing traditions. Cold fermented styles showed a conservative grain and hop composition, whereas hot fermented beers exhibited high heterogeneity, reflecting regional preferences and innovation. This new taxonomy offers a reproducible and objective framework beyond traditional sensory-based classifications, providing brewers, researchers, and educators with a scalable tool for recipe analysis and beer development. The findings in this work provide an understanding of beer diversity and open avenues for linking ingredient usage with fermentation profiles and flavor outcomes.

[136] VLM-KG: Multimodal Radiology Knowledge Graph Generation

Abdullah Abdullah,Seong Tae Kim

Main category: cs.CL

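TL;DR: Proposes a radiology knowledge graph generation framework based on multimodal vision-language models (VLMs), addressing the limitations of existing unimodal methods.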

Details Motivation: Radiology knowledge graph generation is challenged by specialized language and scarce domain data; existing methods use only reports and struggle with long-form data. Method: Uses a multimodal VLM framework that combines radiology reports with imaging data to generate knowledge graphs. Result: The new method outperforms existing methods and is the first to achieve multimodal radiology knowledge graph generation. Conclusion: The multimodal VLM framework offers a better solution for radiology knowledge graph generation. Abstract: Vision-Language Models (VLMs) have demonstrated remarkable success in natural language generation, excelling at instruction following and structured output generation. Knowledge graphs play a crucial role in radiology, serving as valuable sources of factual information and enhancing various downstream tasks. However, generating radiology-specific knowledge graphs presents significant challenges due to the specialized language of radiology reports and the limited availability of domain-specific data. Existing solutions are predominantly unimodal, meaning they generate knowledge graphs only from radiology reports while excluding radiographic images. Additionally, they struggle with long-form radiology data due to limited context length. To address these limitations, we propose a novel multimodal VLM-based framework for knowledge graph generation in radiology. Our approach outperforms previous methods and introduces the first multimodal solution for radiology knowledge graph generation.

[137] QRA++: Quantified Reproducibility Assessment for Common Types of Results in Natural Language Processing

Anya Belz

Main category: cs.CL

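TL;DR: QRA++ is a quantitative method for assessing reproducibility in NLP; it provides continuous-valued assessments at multiple levels of granularity and reveals how reproducibility relates to experiment similarity, system type, and evaluation method.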

Details Motivation: The conclusions of reproducibility studies in NLP are hard to interpret and compare, so a unified quantitative assessment method is needed. Method: Proposes QRA++, which assesses reproducibility with continuous values at three levels of granularity and uses measures that are directly comparable across studies. Result: Applying QRA++ shows that reproducibility depends on experiment similarity, system type, and evaluation method. Conclusion: QRA++ provides a more effective approach to reproducibility assessment and reveals the key factors that affect reproducibility. Abstract: Reproduction studies reported in NLP provide individual data points which in combination indicate worryingly low levels of reproducibility in the field. Because each reproduction study reports quantitative conclusions based on its own, often not explicitly stated, criteria for reproduction success/failure, the conclusions drawn are hard to interpret, compare, and learn from. In this paper, we present QRA++, a quantitative approach to reproducibility assessment that (i) produces continuous-valued degree of reproducibility assessments at three levels of granularity; (ii) utilises reproducibility measures that are directly comparable across different studies; and (iii) grounds expectations about degree of reproducibility in degree of similarity between experiments. QRA++ enables more informative reproducibility assessments to be conducted, and conclusions to be drawn about what causes reproducibility to be better/poorer. We illustrate this by applying QRA++ to three example sets of comparable experiments, revealing clear evidence that degree of reproducibility depends on similarity of experiment properties, but also system type and evaluation method.

[138] Assessing GPT's Bias Towards Stigmatized Social Groups: An Intersectional Case Study on Nationality Prejudice and Psychophobia

Afifah Kashif,Heer Patel

Main category: cs.CL

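TL;DR: The study finds that large language models such as GPT-3.5/4/4o show significant bias when comparing American and North Korean nationalities, with lower empathy toward North Koreans, especially when mental disability is involved.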

Details Motivation: Examine the biases of large language models toward different nationalities and stigmatized groups, and the ethical implications of those biases. Method: Uses structured prompt series to evaluate model responses to scenarios involving American and North Korean nationalities and mental disabilities. Result: The models show greater negative bias toward North Koreans, particularly when mental disability is also a factor. Conclusion: Model design needs to improve so that LLMs handle intersectional identity with more nuance. Abstract: Recent studies have separately highlighted significant biases within foundational large language models (LLMs) against certain nationalities and stigmatized social groups. This research investigates the ethical implications of these biases intersecting with outputs of widely-used GPT-3.5/4/4o LLMs. Through structured prompt series, we evaluate model responses to several scenarios involving American and North Korean nationalities with various mental disabilities. Findings reveal significant discrepancies in empathy levels with North Koreans facing greater negative bias, particularly when mental disability is also a factor. This underscores the need for improvements in LLMs designed with a nuanced understanding of intersectional identity.

[139] Assessing the Quality of AI-Generated Clinical Notes: A Validated Evaluation of a Large Language Model Scribe

Erin Palm,Astrit Manikantan,Mark E. Pepin,Herprit Mahal,Srikanth Subramanya Belwadi

Main category: cs.CL

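TL;DR: The study compares the quality of AI-generated clinical notes with expert-written notes, finds that AI note quality is close to that of human notes, and supports using the PDQI9 instrument to assess AI note quality.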

Details Motivation: AI tools are widely used in medical practice to generate clinical notes, but there is no established method for assessing their quality. Method: Conducts a blinded study that uses the PDQI9 instrument to rate AI and expert notes, covering 97 patient visits and evaluators from several medical specialties. Result: AI note quality (4.20/5) is close to human note quality (4.25/5), and inter-rater agreement is high. Conclusion: The PDQI9 instrument is suitable for assessing the quality of AI-generated notes, and the AI performs close to human experts. Abstract: In medical practices across the United States, physicians have begun implementing generative artificial intelligence (AI) tools to perform the function of scribes in order to reduce the burden of documenting clinical encounters. Despite their widespread use, no established methods exist to gauge the quality of AI scribes. To address this gap, we developed a blinded study comparing the relative performance of large language model (LLM) generated clinical notes with those from field experts based on audio-recorded clinical encounters. Quantitative metrics from the Physician Documentation Quality Instrument (PDQI9) provided a framework to measure note quality, which we adapted to assess relative performance of AI generated notes. Clinical experts spanning 5 medical specialties used the PDQI9 tool to evaluate specialist-drafted Gold notes and LLM authored Ambient notes. Two evaluators from each specialty scored notes drafted from a total of 97 patient visits. We found uniformly high inter rater agreement (RWG greater than 0.7) between evaluators in general medicine, orthopedics, and obstetrics and gynecology, and moderate (RWG 0.5 to 0.7) to high inter rater agreement in pediatrics and cardiology. We found a modest yet significant difference in the overall note quality, wherein Gold notes achieved a score of 4.25 out of 5 and Ambient notes scored 4.20 out of 5 (p = 0.04). Our findings support the use of the PDQI9 instrument as a practical method to gauge the quality of LLM authored notes, as compared to human-authored notes.

[140] Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally

Agam Shah,Siddhant Sukhani,Huzaifa Pardawala,Saketh Budideti,Riya Bhadani,Rudra Gopal,Siddhartha Somani,Michael Galarnyk,Soungmin Lee,Arnav Hiray,Akshar Ravichandran,Eric Kim,Pranav Aluru,Joshua Zhang,Sebastian Jaskowski,Veer Guda,Meghaj Tarte,Liqin Ye,Spencer Gosden,Rutwik Routu,Rachel Yuh,Sloka Chava,Sahasra Chava,Dylan Patrick Kelly,Aiden Chiang,Harsit Mittal,Sudheer Chava

Main category: cs.CL

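TL;DR: The paper introduces the WCB dataset for analyzing 28 years of policy text from 25 central banks worldwide; annotation and model benchmarking confirm the advantage of training on data aggregated across banks.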

Details Motivation: Interpreting central bank policy is essential to economic stability, but such communications are easily misread, which can harm vulnerable populations. Policy text therefore needs systematic analysis. Method: Builds the WCB dataset, annotates 25k sentences, defines three tasks, and benchmarks 16 models (7 PLMs and 9 LLMs). Result: Models trained on data aggregated across banks outperform models trained on a single bank's data, confirming that the whole is greater than the sum of its parts. Conclusion: The WCB dataset and framework have economic utility, and the data and models are openly released. Abstract: Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank's data, confirming the principle "the whole is greater than the sum of its parts." Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework's economic utility. Our artifacts are accessible through the HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.

[141] Gender and Positional Biases in LLM-Based Hiring Decisions: Evidence from Comparative CV/Résumé Evaluations

David Rozado

Main category: cs.CL

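TL;DR: The study finds that large language models (LLMs) show a gender preference when evaluating résumés, favoring female-named candidates, and that their choices are sensitive to names, explicit gender fields, and positional bias.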

Details Motivation: Examine how LLMs behave when evaluating professional candidates and expose potential biases and inconsistencies. Method: In an experiment, 22 LLMs compared pairs of CVs with identical professional qualifications but gendered first names, and their selection preferences were analyzed. Result: LLMs broadly favored female-named candidates, and an explicit gender field strengthened that preference; with counterbalanced gender-neutral identifiers, selection reached gender parity. Conclusion: LLMs should be deployed with caution in high-stakes decisions, since their reasoning may be biased and inconsistent. Abstract: This study examines the behavior of Large Language Models (LLMs) when evaluating professional candidates based on their resumes or curricula vitae (CVs). In an experiment involving 22 leading LLMs, each model was systematically given one job description along with a pair of profession-matched CVs, one bearing a male first name, the other a female first name, and asked to select the more suitable candidate for the job. Each CV pair was presented twice, with names swapped to ensure that any observed preferences in candidate selection stemmed from gendered name cues. Despite identical professional qualifications across genders, all LLMs consistently favored female-named candidates across 70 different professions. Adding an explicit gender field (male/female) to the CVs further increased the preference for female applicants. When gendered names were replaced with gender-neutral identifiers "Candidate A" and "Candidate B", several models displayed a preference to select "Candidate A". Counterbalancing gender assignment between these gender-neutral identifiers resulted in gender parity in candidate selection. When asked to rate CVs in isolation rather than compare pairs, LLMs assigned slightly higher average scores to female CVs overall, but the effect size was negligible. Including preferred pronouns (he/him or she/her) next to a candidate's name slightly increased the odds of the candidate being selected regardless of gender. Finally, most models exhibited a substantial positional bias to select the candidate listed first in the prompt. These findings underscore the need for caution when deploying LLMs in high-stakes autonomous decision-making contexts and raise doubts about whether LLMs consistently apply principled reasoning.
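
The swap-and-repeat design described in this abstract can be sketched as follows. This is a minimal illustration, not the author's code: `counterbalanced_counts`, the `choose` callback, and the CV labels are hypothetical names introduced here. Each pair is shown in both orders, so a judge driven purely by position splits its picks evenly between the two candidates.

```python
from collections import Counter

def counterbalanced_counts(pairs, choose):
    """Present every CV pair in both orders and tally the picks.

    pairs: list of (cv_a, cv_b) labels.
    choose(first, second): hypothetical judge returning 0 or 1, the index
    of the preferred CV in the order it was shown.
    """
    by_candidate = Counter()
    by_position = Counter()
    for a, b in pairs:
        for first, second in ((a, b), (b, a)):  # show both orders
            pick = choose(first, second)
            by_position["first" if pick == 0 else "second"] += 1
            by_candidate[(first, second)[pick]] += 1
    return by_candidate, by_position

# Hypothetical judge that always prefers whichever CV is listed first.
always_first = lambda first, second: 0
cands, pos = counterbalanced_counts([("cv_female", "cv_male")], always_first)
```

With this purely positional judge, the per-candidate counts come out equal while the position tally exposes the bias, which is why the two tallies are kept separate.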

[142] Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning

Yanhao Jia,Xinyi Wu,Qinglin Zhang,Yiran Qin,Luwei Xiao,Shuai Zhao

Main category: cs.CL

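TL;DR: PBLBench is a new benchmark for evaluating complex reasoning grounded in domain-specific knowledge and long-context understanding, addressing the lack of free-form output structure and expert validation in existing benchmarks.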

Details Motivation: Existing benchmarks lack a free-form output structure and rigorous expert validation, which limits their usefulness for educational tasks, and model hallucination and instability hinder the development of automated pipelines. Method: Use the Analytic Hierarchy Process (AHP), with expert-driven pairwise comparisons, to derive structured, weighted evaluation criteria, and evaluate 15 leading MLLMs/LLMs. Result: Even the most advanced models reach only 59% rank accuracy on PBLBench, showing how challenging the benchmark is. Conclusion: PBLBench is expected to spur the development of more capable AI agents, easing teacher workload and improving educational productivity. Abstract: Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.
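
To make the AHP step concrete, here is a small sketch of turning a pairwise comparison matrix into criterion weights. It uses the row geometric mean approximation, a common shortcut for the principal eigenvector; the abstract does not say which variant the authors use, and the three-criterion matrix below is invented for illustration.

```python
import math

def ahp_weights(matrix):
    """Approximate AHP priority weights from a reciprocal pairwise
    comparison matrix using row geometric means, normalized to sum to 1."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]

# Hypothetical comparisons: entry [i][j] states how much more important
# criterion i is than criterion j, with [j][i] = 1 / [i][j].
comparisons = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(comparisons)
```

For this perfectly consistent matrix the weights are 4/7, 2/7, and 1/7; real expert judgments are rarely fully consistent, so a full AHP workflow would also check a consistency ratio.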

[143] Embedding-to-Prefix: Parameter-Efficient Personalization for Pre-Trained Large Language Models

Bernd Huber,Ghazal Fazelnia,Andreas Damianou,Sebastian Peleato,Max Lefarov,Praveen Ravichandran,Marco De Nadai,Mounia Lalmas-Roellke,Paul N. Bennett

Main category: cs.CL

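TL;DR: E2P is a parameter-efficient method that projects pre-computed context embeddings into the LLM's hidden representation space to personalize generation while avoiding costly fine-tuning or token-heavy prompting.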

Details Motivation: Current approaches to personalizing LLMs with user-specific information typically require costly fine-tuning or token-heavy prompting; E2P aims to remove that cost. Method: E2P injects pre-computed context embeddings into the LLM's hidden representation space through a learned projection to a single soft token prefix, keeping the backbone model frozen. Result: Across public datasets and a production setting, E2P preserves contextual signals and achieves strong performance with minimal computational overhead. Conclusion: E2P offers a scalable, efficient way to personalize generative AI systems. Abstract: Large language models (LLMs) excel at generating contextually relevant content. However, tailoring these outputs to individual users for effective personalization is a significant challenge. While rich user-specific information often exists as pre-existing user representations, such as embeddings learned from preferences or behaviors, current methods to leverage these for LLM personalization typically require costly fine-tuning or token-heavy prompting. We propose Embedding-to-Prefix (E2P), a parameter-efficient method that injects pre-computed context embeddings into an LLM's hidden representation space through a learned projection to a single soft token prefix. This enables effective personalization while keeping the backbone model frozen and avoiding expensive adaptation techniques. We evaluate E2P across two public datasets and in a production setting: dialogue personalization on Persona-Chat, contextual headline generation on PENS, and large-scale personalization for music and podcast consumption. Results show that E2P preserves contextual signals and achieves strong performance with minimal computational overhead, offering a scalable, efficient solution for contextualizing generative AI systems.

[144] SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Jinwoo Park,Seunggeun Cho,Dongsu Han

Main category: cs.CL

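TL;DR: SpecEdge is an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme to improve cost efficiency and throughput.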

Details Motivation: Current server-centric systems overlook consumer-grade GPUs at the edge, leaving LLM serving costly and resource-intensive. Method: SpecEdge uses a speculative decoding scheme to split LLM workloads between edge and server GPUs, and improves performance through proactive edge drafting and pipeline-aware scheduling. Result: Experiments show SpecEdge improves overall cost efficiency by 1.91x and server throughput by 2.22x, while reducing inter-token latency by 11.24%. Conclusion: SpecEdge offers a scalable, cost-effective paradigm for LLM serving. Abstract: Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by 1.91x through achieving 2.22x server throughput, and reduces inter-token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving.
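
For readers unfamiliar with the draft-then-verify loop behind speculative decoding, the sketch below shows the greedy variant of the verification step. It is an assumption-laden toy, not SpecEdge itself: `verify_draft` and `toy_target` are hypothetical names, the target model is called once per token here rather than in one parallel pass, and sampling-based acceptance rules are not shown.

```python
def verify_draft(draft_tokens, target_next_token, prefix):
    """Greedy verification of drafted tokens.

    Accept draft tokens while they match the target model's own greedy
    choice; at the first mismatch, emit the target's token and stop.
    If every draft token is accepted, the target adds one extra token.
    """
    accepted = []
    context = list(prefix)
    for tok in draft_tokens:
        expected = target_next_token(context)
        if tok != expected:
            accepted.append(expected)  # correction from the target model
            return accepted
        accepted.append(tok)
        context.append(tok)
    accepted.append(target_next_token(context))  # bonus token
    return accepted

# Toy stand-in for the server-side model: next token is last token + 1 (mod 10).
toy_target = lambda context: (context[-1] + 1) % 10

partly_wrong = verify_draft([2, 3, 9], toy_target, prefix=[1])
all_right = verify_draft([2, 3], toy_target, prefix=[1])
```

In both calls the output is a valid continuation under the target model, so a poor draft costs speed but not correctness.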

[145] Social preferences with unstable interactive reasoning: Large language models in economic trust games

Ou Jiamin,Eikmans Emile,Buskens Vincent,Pankowska Paulina,Shan Yuli

Main category: cs.CL

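TL;DR: The study examines how large language models (LLMs) behave in social exchange settings and finds that they display trust and reciprocity, but that their behavior depends strongly on the assigned persona.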

Details Motivation: Explore how LLMs translate language understanding into social interaction, and what this reveals about their social preferences and interactive reasoning. Method: Place ChatGPT-4, Claude, and Bard in economic trust games and observe their decisions under different personas. Result: LLMs show trust and reciprocity even without a persona, but their behavior shifts markedly with the persona; ChatGPT-4 with an unselfish or neutral persona shows the highest trust and reciprocity. Conclusion: LLMs show promise in social interaction, but the stability and predictability of their behavior still need improvement. Abstract: While large language models (LLMs) have demonstrated remarkable capabilities in understanding human languages, this study explores how they translate this understanding into social exchange contexts that capture certain essences of real-world human interactions. Three LLMs - ChatGPT-4, Claude, and Bard - were placed in economic trust games where players balance self-interest with trust and reciprocity, making decisions that reveal their social preferences and interactive reasoning abilities. Our study shows that LLMs deviate from pure self-interest and exhibit trust and reciprocity even without being prompted to adopt a specific persona. In the simplest one-shot interaction, LLMs emulated how human players place trust at the beginning of such a game. Larger human-machine divergences emerged in scenarios involving trust repayment or multi-round interactions, where decisions were influenced by both social preferences and interactive reasoning. LLM responses varied significantly when prompted to adopt personas like selfish or unselfish players, with the impact outweighing differences between models or game types. The responses of ChatGPT-4, in an unselfish or neutral persona, exhibited the highest trust and reciprocity, surpassing humans, Claude, and Bard. Claude and Bard displayed trust and reciprocity levels that sometimes exceeded and sometimes fell below human choices. When given selfish personas, all LLMs showed lower trust and reciprocity than humans. Interactive reasoning to the actions of counterparts or changing game mechanics appeared to be random rather than stable, reproducible characteristics in the response of LLMs, though some improvements were observed when ChatGPT-4 responded in selfish or unselfish personas.

[146] METHOD: Modular Efficient Transformer for Health Outcome Discovery

Linglong Qian,Zina Ibrahim

Main category: cs.CL

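TL;DR: The paper proposes METHOD, a new Transformer architecture designed for clinical sequence modelling in electronic health records, addressing the challenges standard Transformers face in healthcare.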

Details Motivation: Healthcare data such as patient timelines have irregular sampling, complex temporal dependencies, and contextual relationships that make standard Transformers hard to apply directly. Method: METHOD combines three innovations: a patient-aware attention mechanism, an adaptive sliding window attention scheme, and a U-Net inspired architecture with dynamic skip connections. Result: Evaluations on the MIMIC-IV database show METHOD outperforms the existing ETHOS model in predicting high-severity cases and in handling long sequences. Conclusion: METHOD provides a more accurate and efficient Transformer for healthcare, with potential for clinical deployment. Abstract: Recent advances in transformer architectures have revolutionised natural language processing, but their application to healthcare domains presents unique challenges. Patient timelines are characterised by irregular sampling, variable temporal dependencies, and complex contextual relationships that differ substantially from traditional language tasks. This paper introduces METHOD (Modular Efficient Transformer for Health Outcome Discovery), a novel transformer architecture specifically designed to address the challenges of clinical sequence modelling in electronic health records. METHOD integrates three key innovations: (1) a patient-aware attention mechanism that prevents information leakage whilst enabling efficient batch processing; (2) an adaptive sliding window attention scheme that captures multi-scale temporal dependencies; and (3) a U-Net inspired architecture with dynamic skip connections for effective long sequence processing. Evaluations on the MIMIC-IV database demonstrate that METHOD consistently outperforms the state-of-the-art ETHOS model, particularly in predicting high-severity cases that require urgent clinical intervention. METHOD exhibits stable performance across varying inference lengths, a crucial feature for clinical deployment where patient histories vary significantly in length. Analysis of learned embeddings reveals that METHOD better preserves clinical hierarchies and relationships between medical concepts. These results suggest that METHOD represents a significant advancement in transformer architectures optimised for healthcare applications, providing more accurate and clinically relevant predictions whilst maintaining computational efficiency.
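
The abstract does not spell out how patient-aware attention is implemented. One common way to prevent leakage when several patients' timelines are packed into one sequence is a block-diagonal causal mask, sketched below purely as an assumption; `patient_aware_mask` and the patient labels are hypothetical.

```python
def patient_aware_mask(patient_ids):
    """Build a boolean attention mask for a packed sequence.

    mask[i][j] is True when position i may attend to position j: both
    positions belong to the same patient and j is not in i's future.
    This is an assumed construction, not the paper's stated mechanism.
    """
    n = len(patient_ids)
    return [
        [patient_ids[i] == patient_ids[j] and j <= i for j in range(n)]
        for i in range(n)
    ]

# Two events from patient "p1" followed by one event from patient "p2".
mask = patient_aware_mask(["p1", "p1", "p2"])
```

Here the third position can attend only to itself, so nothing from patient `p1` reaches the prediction for `p2` even though both share a batch row.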

[147] Enhancing Mathematics Learning for Hard-of-Hearing Students Through Real-Time Palestinian Sign Language Recognition: A New Dataset

Fidaa khandaqji,Huthaifa I. Ashqar,Abdelrahem Atawnih

Main category: cs.CL

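TL;DR: The study uses AI to build a Palestinian Sign Language (PSL) recognition system that makes mathematics education more accessible to hard-of-hearing students; the model reaches 97.59% accuracy.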

Details Motivation: Address the scarcity of digital resources for PSL and give hard-of-hearing students intelligent educational tools that narrow the learning gap. Method: Create a dataset of 41 mathematical gesture classes and fine-tune a Vision Transformer (ViT) model on it. Result: The model achieves 97.59% accuracy in recognizing mathematical signs. Conclusion: Deep learning plays an important role in building intelligent educational tools and advances inclusive digital education. Abstract: The study aims to enhance mathematics education accessibility for hard-of-hearing students by developing an accurate Palestinian Sign Language (PSL) recognition system using advanced artificial intelligence techniques. Due to the scarcity of digital resources for PSL, a custom dataset comprising 41 mathematical gesture classes was created and recorded by PSL experts to ensure linguistic accuracy and domain specificity. To leverage state-of-the-art computer vision techniques, a Vision Transformer (ViTModel) was fine-tuned for gesture classification. The model achieved an accuracy of 97.59%, demonstrating its effectiveness in recognizing mathematical signs with high precision and reliability. This study highlights the role of deep learning in developing intelligent educational tools that bridge the learning gap for hard-of-hearing students by providing AI-driven interactive solutions to enhance mathematical comprehension. This work represents a significant step toward fostering innovative and inclusive digital integration in specialized learning environments. The dataset is hosted on Hugging Face at https://huggingface.co/datasets/fidaakh/STEM_data.

[148] Are LLMs Ready for English Standardized Tests? A Benchmarking and Elicitation Perspective

Luoxi Tang,Tharunya Sundar,Shuai Yang,Ankita Patra,Manohar Chippada,Giqi Zhao,Yi Li,Riteng Zhang,Tunan Zhao,Ting Yang,Yuqiao Meng,Weicheng Ma,Zhaohan Xi

Main category: cs.CL

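TL;DR: The paper explores the potential of large language models (LLMs) for English Standardized Test (EST) preparation, introduces the ESTBOOK benchmark to assess their capabilities, and develops a breakdown analysis framework to improve their reliability as intelligent tutoring systems.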

Details Motivation: Study the potential of LLMs in education, in particular how they can support standardized test preparation and improve the learning experience. Method: Introduce the ESTBOOK benchmark, covering many question types and modalities, and develop a breakdown analysis framework that assesses LLM performance at each stage of solving a question. Result: ESTBOOK is used to evaluate the accuracy and inference efficiency of LLMs, revealing their potential in educational settings and directions for improvement. Conclusion: LLMs show promise for standardized test preparation but need targeted improvements to be reliable as intelligent tutoring systems. Abstract: AI is transforming education by enabling powerful tools that enhance learning experiences. Among recent advancements, large language models (LLMs) hold particular promise for revolutionizing how learners interact with educational content. In this work, we investigate the potential of LLMs to support standardized test preparation by focusing on English Standardized Tests (ESTs). Specifically, we assess their ability to generate accurate and contextually appropriate solutions across a diverse set of EST question types. We introduce ESTBOOK, a comprehensive benchmark designed to evaluate the capabilities of LLMs in solving EST questions. ESTBOOK aggregates five widely recognized tests, encompassing 29 question types and over 10,576 questions across multiple modalities, including text, images, audio, tables, and mathematical symbols. Using ESTBOOK, we systematically evaluate both the accuracy and inference efficiency of LLMs. Additionally, we propose a breakdown analysis framework that decomposes complex EST questions into task-specific solution steps. This framework allows us to isolate and assess LLM performance at each stage of the reasoning process. Evaluation findings offer insights into the capability of LLMs in educational contexts and point toward targeted strategies for improving their reliability as intelligent tutoring systems.

[149] DO-RAG: A Domain-Specific QA Framework Using Knowledge Graph-Enhanced Retrieval-Augmented Generation

David Osei Opoku,Ming Sheng,Yong Zhang

Main category: cs.CL

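TL;DR: DO-RAG is a hybrid QA framework that combines knowledge graphs with semantic vector retrieval, using dynamic knowledge graphs and multi-level retrieval to improve factual accuracy and reasoning consistency; experiments show it outperforms baseline frameworks.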

Details Motivation: Domain-specific QA systems need high factual accuracy grounded in structured expert knowledge, and existing RAG frameworks struggle to integrate heterogeneous data and keep reasoning consistent. Method: DO-RAG combines multi-level knowledge graph construction with semantic vector retrieval, uses an agentic chain-of-thought architecture to extract structured relationships from multimodal documents, and fuses graph and vector retrieval results to generate context-aware responses. Result: Experiments in the database and electrical domains show near-perfect recall and over 94% answer relevancy, with DO-RAG outperforming baseline frameworks by up to 33.38%. Conclusion: By combining traceability, adaptability, and performance efficiency, DO-RAG provides a reliable foundation for multi-domain, high-precision QA. Abstract: Domain-specific QA systems require not just generative fluency but high factual accuracy grounded in structured expert knowledge. While recent Retrieval-Augmented Generation (RAG) frameworks improve context recall, they struggle with integrating heterogeneous data and maintaining reasoning consistency. To address these challenges, we propose DO-RAG, a scalable and customizable hybrid QA framework that integrates multi-level knowledge graph construction with semantic vector retrieval. Our system employs a novel agentic chain-of-thought architecture to extract structured relationships from unstructured, multimodal documents, constructing dynamic knowledge graphs that enhance retrieval precision. At query time, DO-RAG fuses graph and vector retrieval results to generate context-aware responses, followed by hallucination mitigation via grounded refinement. Experimental evaluations in the database and electrical domains show near-perfect recall and over 94% answer relevancy, with DO-RAG outperforming baseline frameworks by up to 33.38%. By combining traceability, adaptability, and performance efficiency, DO-RAG offers a reliable foundation for multi-domain, high-precision QA at scale.

[150] Medalyze: Lightweight Medical Report Summarization Application Using FLAN-T5-Large

Van-Tinh Nguyen,Hoang-Duong Pham,Thanh-Hai To,Cong-Tuan Hung Do,Thi-Thu-Trang Dong,Vu-Trung Duong Le,Van-Phuc Hoang

Main category: cs.CL

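TL;DR: Medalyze is an AI-powered application that improves comprehension of medical texts using three specialized FLAN-T5-Large models, deployed on a platform with real-time inference; it is reported to outperform GPT-4 on domain-specific summarization.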

Details Motivation: Medical texts are hard to understand because of complex terminology and context-specific language, so efficient tools are needed to make the information more accessible. Method: Use three specialized FLAN-T5-Large models, for summarizing reports, extracting health issues, and identifying the key question, served with real-time inference on web and mobile platforms. Result: Experiments show Medalyze achieves better summarization performance than GPT-4 on domain-specific tasks, measured with metrics such as BLEU and ROUGE-L. Conclusion: Medalyze is a practical, privacy-preserving, and lightweight solution that improves access to healthcare information. Abstract: Understanding medical texts presents significant challenges due to complex terminology and context-specific language. This paper introduces Medalyze, an AI-powered application designed to enhance the comprehension of medical texts using three specialized FLAN-T5-Large models. These models are fine-tuned for (1) summarizing medical reports, (2) extracting health issues from patient-doctor conversations, and (3) identifying the key question in a passage. Medalyze is deployed across a web and mobile platform with real-time inference, leveraging scalable API and YugabyteDB. Experimental evaluations demonstrate the system's superior summarization performance over GPT-4 in domain-specific tasks, based on metrics like BLEU, ROUGE-L, BERTScore, and SpaCy Similarity. Medalyze provides a practical, privacy-preserving, and lightweight solution for improving information accessibility in healthcare.

[151] SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

Wenyi Yu,Siyin Wang,Xiaoyu Yang,Xianzhao Chen,Xiaohai Tian,Jun Zhang,Guangzhi Sun,Lu Lu,Yuxuan Wang,Chao Zhang

Main category: cs.CL

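TL;DR: SALMONN-omni is a new full-duplex speech LLM that uses a dynamic thinking mechanism to enable natural human-machine speech interaction, outperforming existing open-source full-duplex models.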

Details Motivation: Address the error accumulation caused by modular architectures in existing full-duplex conversational systems, along with key challenges such as context-dependent barge-in and echo cancellation. Method: Introduce a single-LLM architecture with a dynamic thinking mechanism that needs no audio codecs in its token space. Result: On widely used benchmarks, it achieves at least 30% relative improvement over existing open-source full-duplex models and performs well in complex conversational scenarios. Conclusion: SALMONN-omni delivers strong full-duplex speech interaction while using substantially less training data. Abstract: In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the following repository https://github.com/bytedance/SALMONN.

[152] Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models

Xinlong Chen,Yuanxing Zhang,Qiang Liu,Junfei Wu,Fuzheng Zhang,Tieniu Tan

Main category: cs.CL

TL;DR: The paper proposes MoD, a method that reduces hallucinations in large vision-language models by dynamically adapting the decoding strategy.

Details Motivation: Large vision-language models perform well on visual tasks but still suffer from hallucinations, which calls for an effective remedy. Method: MoD evaluates whether the model's attention on image tokens is correct and adapts the decoding strategy accordingly, using a consistency check followed by either a complementary or a contrastive strategy. Result: Experiments show MoD significantly outperforms existing decoding methods on multiple mainstream benchmarks and effectively reduces hallucinations. Conclusion: MoD is an effective hallucination mitigation method and offers a new direction for improving large vision-language models. Abstract: Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model's attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model's attended image tokens, in order to judge whether that attention is correct. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. The code is available at https://github.com/xlchen0205/MoD.

[153] Synthetic Data RL: Task Definition Is All You Need

Yiduo Guo,Zhen Guo,Chuanwei Huang,Zi-Ang Wang,Zekai Zhang,Haofei Yu,Huishuai Zhang,Yikang Shen

Main category: cs.CL

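TL;DR: The Synthetic Data RL framework fine-tunes models with reinforcement learning on synthetic data, greatly reducing reliance on human-labeled data and outperforming conventional methods on several tasks.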

Details Motivation: Conventional reinforcement learning depends on large-scale human-labeled data, which limits broad adoption; this work aims to reduce that dependence with synthetic data. Method: Generate question-answer pairs from the task definition, adapt question difficulty to what the model can solve, and select questions for RL training based on the model's average pass rate. Result: Strong results across tasks, such as a 29.2% absolute improvement on GSM8K, nearly matching RL with full human data. Conclusion: Synthetic Data RL enables efficient, scalable model adaptation with less reliance on human data. Abstract: Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts the difficulty of the question based on model solvability, and selects questions using the average pass rate of the model across samples for RL training. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves the performance of GSM8K only by 0.4 pp, showing a limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
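
The selection step, choosing questions by the model's average pass rate across samples, might look like the sketch below. The abstract does not give the selection rule, so the band of 0.2 to 0.8, the function name `select_by_pass_rate`, and the sample data are all assumptions made for illustration.

```python
def select_by_pass_rate(results, low=0.2, high=0.8):
    """Keep questions whose average pass rate falls inside [low, high].

    results: {question: list of bools, one per sampled attempt}.
    The thresholds are hypothetical; the idea is to drop questions the
    model always solves or never solves.
    Returns (question, pass_rate) pairs, hardest first.
    """
    selected = []
    for question, attempts in results.items():
        rate = sum(attempts) / len(attempts)
        if low <= rate <= high:
            selected.append((question, rate))
    return sorted(selected, key=lambda item: item[1])

# Hypothetical sampled outcomes for three synthetic questions.
samples = {
    "q_easy": [True, True, True, True],
    "q_medium": [True, False, True, False],
    "q_hard": [False, False, False, False],
}
chosen = select_by_pass_rate(samples)
```

Only `q_medium` survives here, since the other two give no learning signal under a pass/fail reward.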

[154] Decoding Rarity: Large Language Models in the Diagnosis of Rare Diseases

Valentina Carbonari,Pierangelo Veltri,Pietro Hiram Guzzi

Main category: cs.CL

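TL;DR: This survey reviews the use of large language models (LLMs) in rare disease research, discusses their potential for diagnosis, treatment, and patient care, and outlines multimodal data integration as a future direction along with the remaining challenges.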

Details Motivation: Rare disease research faces scarce data and high complexity, and LLMs show potential to address these problems by analyzing textual data. Method: Survey applications of LLMs in medical information extraction, intelligent conversational agents, and diagnostic support, and discuss the prospects for multimodal data integration. Result: LLMs show considerable potential in rare disease research, but issues of data privacy, model transparency, and data diversity remain to be solved. Conclusion: LLMs should evolve toward multimodal platforms that integrate diverse data types to make rare disease research more comprehensive and improve clinical outcomes. Abstract: Recent advances in artificial intelligence, particularly large language models (LLMs), have shown promising capabilities in transforming rare disease research. This survey paper explores the integration of LLMs in the analysis of rare diseases, highlighting significant strides and pivotal studies that leverage textual data to uncover insights and patterns critical for diagnosis, treatment, and patient care. While current research predominantly employs textual data, the potential for multimodal data integration combining genetic, imaging, and electronic health records stands as a promising frontier. We review foundational papers that demonstrate the application of LLMs in identifying and extracting relevant medical information, simulating intelligent conversational agents for patient interaction, and enabling the formulation of accurate and timely diagnoses. Furthermore, this paper discusses the challenges and ethical considerations inherent in deploying LLMs, including data privacy, model transparency, and the need for robust, inclusive data sets. As part of this exploration, we present a section on experimentation that utilizes multiple LLMs alongside structured questionnaires, specifically designed for diagnostic purposes in the context of different diseases. We conclude with future perspectives on the evolution of LLMs towards truly multimodal platforms, which would integrate diverse data types to provide a more comprehensive understanding of rare diseases, ultimately fostering better outcomes in clinical settings.

[155] Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning

Kristin Qi,Jiali Cheng,Youxiang Zhu,Hadi Amiri,Xiaohui Liang

Main category: cs.CL

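TL;DR: The paper proposes a framework for detecting Mild Cognitive Impairment (MCI) in multilingual, multi-picture settings; supervised contrastive learning, image modality fusion, and a Product of Experts strategy together improve detection performance.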

Details Motivation: Prior work has focused mainly on English speakers describing a single picture, while multilingual and multi-picture settings pose new analysis challenges. Method: The framework has three components: supervised contrastive learning to strengthen representations, fusion of the image modality, and a Product of Experts strategy to reduce spurious correlations and overfitting. Result: Compared with the text unimodal baseline, UAR improves by 7.1% (68.1% to 75.2%) and F1 score by 2.9% (80.6% to 83.5%). Conclusion: The framework is effective for multilingual, multi-picture MCI detection. Abstract: Detecting Mild Cognitive Impairment from picture descriptions is critical yet challenging, especially in multilingual and multiple picture settings. Prior work has primarily focused on English speakers describing a single picture (e.g., the 'Cookie Theft'). The TAUKDIAL-2024 challenge expands this scope by introducing multilingual speakers and multiple pictures, which presents new challenges in analyzing picture-dependent content. To address these challenges, we propose a framework with three components: (1) enhancing discriminative representation learning via supervised contrastive learning, (2) involving image modality rather than relying solely on speech and text modalities, and (3) applying a Product of Experts (PoE) strategy to mitigate spurious correlations and overfitting. Our framework improves MCI detection performance, achieving a +7.1% increase in Unweighted Average Recall (UAR) (from 68.1% to 75.2%) and a +2.9% increase in F1 score (from 80.6% to 83.5%) compared to the text unimodal baseline. Notably, the contrastive learning component yields greater gains for the text modality compared to speech. These results highlight our framework's effectiveness in multilingual and multi-picture MCI detection.
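
Unweighted Average Recall (UAR), the headline metric above, is the plain mean of per-class recall, so each class counts equally regardless of its size. A small sketch with invented labels (1 for MCI, 0 for control) shows how it differs from accuracy; the function name is introduced here for illustration.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, giving every class the same weight."""
    classes = sorted(set(y_true))
    recalls = []
    for cls in classes:
        indices = [i for i, label in enumerate(y_true) if label == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced example: four MCI cases and one control.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Recall is 3/4 for the MCI class and 1/1 for the control class, so UAR is 0.875 while accuracy is 0.8.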

Jorge Paz-Ruza,Amparo Alonso-Betanzos,Bertha Guijarro-Berdiñas,Carlos Eiras-Franco

Main category: cs.CL

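TL;DR: The paper proposes predicting where users are likely to behave toxically in health-related online discussions, using a collaborative filtering machine learning model to forecast toxicity in COVID-related discussions and avoid conflict.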

Details Motivation: User toxicity in online health discussions often leads to social conflict or promotes dangerous behavior, and conventional detection and removal is of limited effect and can be counterproductive. Method: Apply a collaborative filtering-based machine learning approach to predict potential toxic interactions between users and subcommunities in COVID-related discussions on Reddit. Result: The model exceeds 80% predictive performance on relevant metrics and makes it possible to avoid pairing conflicting users and subcommunities. Conclusion: A predictive approach is more effective than conventional reactive approaches and can reduce toxic behavior in health discussions. Abstract: In health-related topics, user toxicity in online discussions frequently becomes a source of social conflict or promotion of dangerous, unscientific behaviour; common approaches for battling it include different forms of detection, flagging and/or removal of existing toxic comments, which is often counterproductive for platforms and users alike. In this work, we propose the alternative of combatting user toxicity predictively, anticipating where a user could interact toxically in health-related online discussions. Applying a Collaborative Filtering-based Machine Learning methodology, we predict the toxicity in COVID-related conversations between any user and subcommunity of Reddit, surpassing 80% predictive performance in relevant metrics, and allowing us to prevent the pairing of conflicting users and subcommunities.

[157] Improving endpoint detection in end-to-end streaming ASR for conversational speech

Anandh C,Karthik Pandia Durai,Jeena Prakash,Manickavela Arumugam,Kadri Hacioglu,S. Pavankumar Dubagunta,Andreas Stolcke,Shankar Venkatesan,Aravind Ganapathiraju

Main category: cs.CL

TL;DR: The paper proposes methods to improve endpoint detection in Transducer-based ASR, addressing delayed emission and endpointing errors.

Details Motivation: T-ASR works well for streaming, but its delayed emission can cause endpointing errors or delays that degrade the user experience. Method: Introduce an end-of-word token and a delay penalty, combined with an auxiliary network that provides reliable frame-level speech activity detection. Result: The methods are applied to the Switchboard corpus and evaluated against a delay penalty method. Conclusion: The proposed methods target both the latency and the accuracy of endpoint detection to improve the user experience. Abstract: ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human/machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which could lead to errors or delays in EP. Inaccurate EP will cut the user off while speaking, returning an incomplete transcript, while delays in EP will increase the perceived latency, degrading the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address the delayed emission problem, we introduce an end-of-word token at the end of each word, along with a delay penalty. The EP delay is addressed by obtaining a reliable frame-level speech activity detection using an auxiliary network. We apply the proposed methods on the Switchboard conversational speech corpus and evaluate them against a delay penalty method.

[158] What's in a prompt? Language models encode literary style in prompt embeddings

Raphaël Sarfati,Haley Moller,Toni J. B. Liu,Nicolas Boullé,Christopher Earls

Main category: cs.CL

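TL;DR: The paper studies how large language models compress textual information into high-dimensional latent spaces, focusing on how the cumulative information of a prompt is condensed into individual embeddings by the Transformer layers.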

Details Motivation: Explore how language models encode intangible information, such as style, rather than factual content in their deep representations. Method: Use literary works as data and analyze how short excerpts separate in the latent space and how that relates to authorial style. Result: Short excerpts from different novels separate in the latent space, while embeddings of works by the same author are more entangled, suggesting that embeddings encode stylistic features. Conclusion: This geometry of style may be useful for authorship attribution and literary analysis, and it reveals the sophistication of information processing and compression in language models. Abstract: Large language models use high-dimensional latent spaces to encode and process textual information. Much work has investigated how the conceptual content of words translates into geometrical relationships between their vector representations. Fewer studies analyze how the cumulative information of an entire prompt becomes condensed into individual embeddings under the action of transformer layers. We use literary pieces to show that information about intangible, rather than factual, aspects of the prompt is contained in deep representations. We observe that short excerpts (10 - 100 tokens) from different novels separate in the latent space independently from what next-token prediction they converge towards. Ensembles from books from the same authors are much more entangled than across authors, suggesting that embeddings encode stylistic features. This geometry of style may have applications for authorship attribution and literary analysis, but most importantly reveals the sophistication of information processing and compression accomplished by language models.

[159] Mechanistic Interpretability of GPT-like Models on Summarization Tasks

Anurag Mishra

Main category: cs.CL

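TL;DR: This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization, using differential analysis to locate a "summarization circuit" in the model and finding that the middle layers (2, 3, and 5) change the most.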

Details Motivation: Reveal the internal mechanisms of large language models on summarization, filling a gap left by work focused on classification or generative tasks. Method: Performs differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations to locate key layers and attention heads. Result: The middle layers (2, 3, and 5) change the most, and 62% of attention heads show decreased entropy, indicating more focused information selection. Targeted LoRA fine-tuning outperforms the standard approach. Conclusion: The study offers new insight into how neural networks select and compress information during summarization, connecting black-box evaluation with mechanistic understanding. Abstract: Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or generative tasks rather than summarization. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations. By identifying specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture. Our findings reveal that middle layers (particularly 2, 3, and 5) exhibit the most dramatic changes, with 62% of attention heads showing decreased entropy, indicating a shift toward focused information selection. We demonstrate that targeted LoRA adaptation of these identified circuits achieves significant performance improvement over standard LoRA fine-tuning while requiring fewer training epochs. This work bridges the gap between black-box evaluation and mechanistic understanding, providing insights into how neural networks perform information selection and compression during summarization.
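
The entropy comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes Shannon entropy over one query's attention weights, which the abstract does not spell out; the function name and the two example rows are hypothetical and are not taken from the paper.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` holds non-negative attention weights over the keys for a
    single query position; they are normalised here to sum to 1.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical attention rows for the same head before and after fine-tuning.
pretrained_row = [0.25, 0.25, 0.25, 0.25]  # diffuse attention
finetuned_row = [0.85, 0.05, 0.05, 0.05]   # focused attention

# A positive value means the head became more focused after fine-tuning.
entropy_drop = attention_entropy(pretrained_row) - attention_entropy(finetuned_row)
```

Under this assumption, a head counts as having "decreased entropy" when `entropy_drop` is positive, averaged over whatever inputs and query positions the analysis covers.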

[160] Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency

Ruixiao Li,Fahao Chen,Peng Li

Main category: cs.CL

TL;DR: Proposes LAPS-SD, a semi-clairvoyant request scheduling algorithm that reduces LLM inference latency by scheduling requests adaptively.

Details Motivation: Existing scheduling methods estimate execution time from predicted output length alone and ignore the token acceptance rate, which makes them inefficient. Method: LAPS-SD uses multiple priority queues and request preemption to adapt to a changing token acceptance rate, and estimates execution time accurately once the rate stabilizes. Result: Experiments show that LAPS-SD reduces inference latency by approximately 39% compared with existing methods. Conclusion: LAPS-SD addresses scheduling under dynamic token acceptance rates and substantially improves LLM inference efficiency. Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically exhibit uncertain execution time, which poses a significant challenge of efficiently scheduling requests in these systems. Existing work estimates execution time based solely on predicted output length, which could be inaccurate because execution time depends on both output length and token acceptance rate of verification by the LLM. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. When the token acceptance rate is dynamic and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows request execution preemption across different queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods.
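
To make concrete why execution time depends on both output length and acceptance rate, here is a rough sketch. It is not LAPS-SD itself: the abstract does not give its estimator, so the sketch assumes the commonly used expression for expected tokens per verification step under independent per-token acceptance, and a plain shortest-estimated-first ordering. All names, the draft length of 4, and the request values are hypothetical.

```python
import math

def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per verification step.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate`, giving (1 - a^(k+1)) / (1 - a) for draft length k.
    """
    if acceptance_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)

def estimated_steps(output_len, acceptance_rate, draft_len=4):
    """Rough number of LLM verification steps needed for a request."""
    return math.ceil(output_len / expected_tokens_per_step(acceptance_rate, draft_len))

# Hypothetical requests: "a" and "b" have the same predicted output length
# but very different acceptance rates.
requests = [
    {"id": "a", "output_len": 200, "acceptance_rate": 0.9},
    {"id": "b", "output_len": 200, "acceptance_rate": 0.3},
    {"id": "c", "output_len": 120, "acceptance_rate": 0.5},
]

# Shortest-estimated-job-first ordering once acceptance rates are known.
ordered = sorted(
    requests,
    key=lambda r: estimated_steps(r["output_len"], r["acceptance_rate"]),
)
schedule = [r["id"] for r in ordered]
```

In this toy case, a length-only estimate would treat "a" and "b" as equal, while the acceptance-aware estimate places "b" last.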

[161] Development and Validation of Engagement and Rapport Scales for Evaluating User Experience in Multimodal Dialogue Systems

Fuma Kurata,Mao Saeki,Masaki Eguchi,Shungo Suzuki,Hiroaki Takatsu,Yoichi Matsuyama

Main category: cs.CL

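TL;DR: The study develops and validates two scales (engagement and rapport) for evaluating user experience quality with multimodal dialogue systems in foreign language learning.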

Details Motivation: Evaluate, through theoretical and empirical work, the user experience of multimodal dialogue systems in foreign language learning and compare it with conversations with human tutors. Method: Scales were designed from theories in educational psychology, social psychology, and second language acquisition, and their validity and reliability were tested in an experiment in which 74 Japanese learners of English interacted with human tutors and a dialogue agent. Result: The scales captured the difference in dialogue experience quality between human tutors and the dialogue agent. Conclusion: The scales can effectively assess user experience quality with multimodal dialogue systems and provide a tool for future research. Abstract: This study aimed to develop and validate two scales of engagement and rapport to evaluate the user experience quality with multimodal dialogue systems in the context of foreign language learning. The scales were designed based on theories of engagement in educational psychology, social psychology, and second language acquisition. Seventy-four Japanese learners of English completed roleplay and discussion tasks with trained human tutors and a dialogue agent. After each dialogic task was completed, they responded to the scales of engagement and rapport. The validity and reliability of the scales were investigated through two analyses. We first conducted analysis of Cronbach's alpha coefficient and a series of confirmatory factor analyses to test the structural validity of the scales and the reliability of our designed items. We then compared the scores of engagement and rapport between the dialogue with human tutors and the one with a dialogue agent. The results revealed that our scales succeeded in capturing the difference in the dialogue experience quality between the human interlocutors and the dialogue agent from multiple perspectives.
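
Cronbach's alpha, the reliability coefficient named in this abstract, follows the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The sketch below is a minimal illustration; the ratings are made up and do not come from the study.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a scale.

    `item_scores` is a list of items, each a list with one score per
    respondent (all items must cover the same respondents, in order).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("need at least two items")
    # Total scale score per respondent.
    totals = [sum(per_respondent) for per_respondent in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1.0 - item_variance_sum / variance(totals))

# Hypothetical 5-point ratings: 3 items answered by 5 respondents.
ratings = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 1, 4],
]
alpha = cronbach_alpha(ratings)
```

Values closer to 1 indicate that the items vary together, which is usually read as higher internal consistency.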

[162] Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English

Haoyang Zhang,Hexin Liu,Xiangyu Zhang,Qiquan Zhang,Yuchen Hu,Junqi Zhao,Fei Tian,Xuerui Yang,Eng Siong Chng

Main category: cs.CL

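TL;DR: Studies how different frame rates affect speech tokenization for Mandarin and English, finds that frame rate changes affect the two languages differently, and offers guidance for choosing frame rates in speech tokenizers.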

Details Motivation: Examine the effect of frame rate on speech tokenization, an underexplored question, for Mandarin and English, two typologically distinct languages. Method: Encodes speech at different frame rates and evaluates the resulting semantic tokens on a speech recognition task. Result: Frame rate changes affect speech tokenization differently for Mandarin and English, revealing an interplay between frame rate, phonetic density, and language-specific acoustic features. Conclusion: The results inform frame rate selection for speech tokenizers, with implications for automatic speech recognition, text-to-speech, and related applications. Abstract: The speech tokenizer plays a crucial role in recent speech tasks, generally serving as a bridge between speech signals and language models. While low-frame-rate codecs are widely employed as speech tokenizers, the impact of frame rates on speech tokens remains underexplored. In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. We encode speech at different frame rates and evaluate the resulting semantic tokens in the speech recognition task. Our findings reveal that frame rate variations influence speech tokenization differently for each language, highlighting the interplay between frame rates, phonetic density, and language-specific acoustic features. The results provide insights into optimizing frame rate selection for speech tokenizers, with implications for automatic speech recognition, text-to-speech, and other speech-related applications.

[163] GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace

Zenghao Duan,Zhiyi Yin,Zhichao Shi,Liang Pang,Shaoling Jing,Jiayi Wu,Yu Yan,Huawei Shen,Xueqi Cheng

Main category: cs.CL

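TL;DR: This paper studies the mechanisms of toxicity generation in large language models (LLMs) and proposes GloSS, a lightweight detoxification method that reduces toxicity by identifying and removing a global toxic subspace.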

Details Motivation: Prior work usually treats the feed-forward network (FFN) as the main source of toxicity, but this paper finds that a global toxic subspace characterizes the toxic region more comprehensively. Method: Proposes GloSS, a four-stage method that detoxifies a model by identifying the global toxic subspace and removing it from the FFN parameters. Result: Experiments show that GloSS achieves state-of-the-art detoxification across a range of LLMs while preserving general capabilities, without large-scale data or retraining. Conclusion: The global toxic subspace is a more effective representation of toxicity, and GloSS is both effective and efficient at detoxification. Abstract: This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically considers the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that the global toxic subspace offers a more effective and comprehensive representation of the toxic region within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the parameters of the FFN. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the models' general capabilities, without requiring large-scale data or model retraining.
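
The abstract does not describe how GloSS identifies or removes the subspace, so the following is only a generic illustration of what "removing a subspace from parameters" can mean: orthogonal projection of each parameter vector onto the complement of a given orthonormal basis. The basis, the vectors, and all names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def remove_subspace(vector, basis):
    """Project `vector` onto the orthogonal complement of span(basis).

    `basis` must be a list of orthonormal vectors; the component of
    `vector` along each basis direction is subtracted.
    """
    result = list(vector)
    for u in basis:
        coeff = dot(vector, u)
        result = [r - coeff * ui for r, ui in zip(result, u)]
    return result

# Hypothetical one-dimensional "toxic" subspace in a 3-d parameter space.
toxic_basis = [[1.0, 0.0, 0.0]]
# Hypothetical FFN parameter vectors.
ffn_vectors = [[0.5, 2.0, -1.0], [3.0, 0.0, 4.0]]

cleaned = [remove_subspace(v, toxic_basis) for v in ffn_vectors]
```

After projection, every cleaned vector has zero component along each basis direction, while components outside the subspace are unchanged.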

[164] Not Minds, but Signs: Reframing LLMs through Semiotics

Davide Picca

Main category: cs.CL

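TL;DR: The paper argues for understanding large language models (LLMs) from a semiotic perspective rather than as cognitive systems, emphasizing their role as tools for recombining signs and generating meaning.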

Details Motivation: Challenge the prevailing view of LLMs as cognitive systems, avoid anthropomorphism, and understand more precisely their role in cultural processes. Method: Uses theoretical analysis and practical examples to show how LLMs function as semiotic agents whose outputs can be treated as open interpretive acts. Result: LLMs serve as tools for creativity in literature, philosophy, education, and other fields, and the semiotic framework offers a more rigorous and ethically aware way to study them. Conclusion: LLMs are technological participants in an ecology of signs that change how humans read, write, and make meaning, which calls for rethinking the foundations of language and knowledge production. Abstract: This paper challenges the prevailing tendency to frame Large Language Models (LLMs) as cognitive systems, arguing instead for a semiotic perspective that situates these models within the broader dynamics of sign manipulation and meaning-making. Rather than assuming that LLMs understand language or simulate human thought, we propose that their primary function is to recombine, recontextualize, and circulate linguistic forms based on probabilistic associations. By shifting from a cognitivist to a semiotic framework, we avoid anthropomorphism and gain a more precise understanding of how LLMs participate in cultural processes, not by thinking, but by generating texts that invite interpretation. Through theoretical analysis and practical examples, the paper demonstrates how LLMs function as semiotic agents whose outputs can be treated as interpretive acts, open to contextual negotiation and critical reflection. We explore applications in literature, philosophy, education, and cultural production, emphasizing how LLMs can serve as tools for creativity, dialogue, and critical inquiry. The semiotic paradigm foregrounds the situated, contingent, and socially embedded nature of meaning, offering a more rigorous and ethically aware framework for studying and using LLMs. Ultimately, this approach reframes LLMs as technological participants in an ongoing ecology of signs. They do not possess minds, but they alter how we read, write, and make meaning, compelling us to reconsider the foundations of language, interpretation, and the role of artificial systems in the production of knowledge.

[165] GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data

Abderrahman Skiredj,Ferdaous Azhari,Houdaifa Atou,Nouamane Tazi,Ismail Berrada

Main category: cs.CL

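TL;DR: A quality-over-quantity alignment strategy improves Moroccan Arabic (Darija) performance in open-source large language models while preserving cross-lingual reasoning, at very low compute cost.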

Details Motivation: Open-source large language models support Moroccan Arabic poorly, and existing approaches either sacrifice reasoning ability or cost too much compute. Method: Translates three small instruction suites into Darija, keeps part of the original English instructions, and adds mathematics, coding, and science prompts. Trains LoRA-tuned Gemma 3-4B and Gemma 3-27B models. Result: The DarijaMMLU score rises markedly (from 32.8 to 47.5), GemMaroc-27B performs strongly on Darija tasks without harming English and mathematics ability, and training takes only 48 GPU-hours. Conclusion: The method offers a sustainable Green AI path for language technology and supports Darija applications in education, public services, and other areas. Abstract: Open-source large language models (LLMs) still marginalise Moroccan Arabic (Darija), forcing practitioners either to bolt on heavyweight Arabic adapters or to sacrifice the very reasoning skills that make LLMs useful. We show that a rigorously quality-over-quantity alignment strategy can surface fluent Darija while safeguarding the backbone's cross-lingual reasoning at a sliver of the usual compute. We translate three compact instruction suites (LIMA 1 K, DEITA 6 K and TULU 50 K) into Darija, preserve 20 of the English originals, and add mathematics, coding and scientific prompts. A LoRA-tuned Gemma 3-4B trained on 5 K mixed instructions lifts DarijaMMLU from 32.8 to 42.7; adding the reasoning-dense TULU portion pushes it to 47.5 with no English regression. Scaling the identical recipe to Gemma 3-27B produces GemMaroc-27B, which matches Atlas-Chat on DarijaMMLU (61.6) and leaps ahead on Darija commonsense, scoring 60.5 on HellaSwag versus Atlas-Chat's 48.4. Crucially, GemMaroc retains Gemma-27B's strong maths and general-reasoning ability, showing only minimal movement on GSM8K and English benchmarks. The entire model is trained in just 48 GPU-hours, underscoring a Green AI pathway to inclusive, sustainable language technology. We release code, data and checkpoints to spur Darija-centric applications in education, public services and everyday digital interaction.

[166] Scale-invariant Attention

Ben Anson,Xi Wang,Laurence Aitchison

Main category: cs.CL

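TL;DR: The paper proposes an attention mechanism that satisfies scale invariance, implemented through a simple position-dependent transformation, and shows experimentally that it performs well in long-context inference.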

Details Motivation: Address the problem of generalising from training on short contexts to inference on long contexts in LLM research. Method: Proposes two conditions (scale-invariant total attention and scale-invariant sparsity) and shows that, under a Gaussian assumption, a position-dependent transformation achieves them. Result: Experiments show that the mechanism performs well in zero-shot generalisation and long-context retrieval. Conclusion: Scale-invariant attention can effectively improve long-context inference. Abstract: One persistent challenge in LLM research is the development of attention mechanisms that are able to generalise from training on shorter contexts to inference on longer contexts. We propose two conditions that we expect all effective long context attention mechanisms to have: scale-invariant total attention, and scale-invariant attention sparsity. Under a Gaussian assumption, we show that a simple position-dependent transformation of the attention logits is sufficient for these conditions to hold. Experimentally we find that the resulting scale-invariant attention scheme gives considerable benefits in terms of validation loss when zero-shot generalising from training on short contexts to validation on longer contexts, and is effective at long-context retrieval.

[167] Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization

Yihong Wu,Liheng Ma,Muzhi Li,Jiaming Zhou,Jianye Hao,Ho-fung Leung,Irwin King,Yingxue Zhang,Jian-Yun Nie

Main category: cs.CL

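TL;DR: Mujica-MyGO combines a multi-hop joint intelligence framework with a reinforcement learning method to improve LLM performance on multi-hop question answering and to address hallucination.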

Details Motivation: LLMs hallucinate in question answering because they lack factual knowledge, and existing retrieval-augmented generation methods are limited by in-context learning ability. Method: Proposes the Mujica framework (which decomposes a question into a graph of subquestions) and the MyGO reinforcement learning method (which replaces policy gradient updates with MLE). Result: Experiments show that Mujica-MyGO improves multi-hop QA performance across multiple datasets. Conclusion: Mujica-MyGO offers a scalable and resource-efficient solution for complex QA tasks. Abstract: Although Large Language Models (LLMs) have demonstrated remarkable versatility, their application to Question Answering (QA) tasks remains hindered by hallucination caused by a lack of factual knowledge. While Retrieval-Augmented Generation mitigates these issues by integrating external knowledge, existing approaches rely heavily on in-context learning, whose performance is constrained by the fundamental reasoning capabilities of LLMs. In this paper, we propose Mujica, a Multi-hop Joint Intelligence for Complex Question Answering, comprising a planner that decomposes questions into a directed acyclic graph of subquestions and a worker that resolves questions via retrieval and reasoning. Additionally, we introduce MyGO (Minimalist policy Gradient Optimization), a novel reinforcement learning method that replaces traditional policy gradient updates with Maximum Likelihood Estimation (MLE) by sampling trajectories from an asymptotically optimal policy. MyGO eliminates the need for gradient rescaling and reference models, ensuring stable and efficient training. Empirical results across multiple datasets demonstrate the effectiveness of Mujica-MyGO in enhancing multi-hop QA performance for various LLMs, offering a scalable and resource-efficient solution for complex QA tasks.
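
The planner/worker split described in this abstract can be pictured with a small sketch: subquestions form a directed acyclic graph, and each is answered only after the subquestions it depends on. This is an illustration under assumptions, not the authors' implementation; the plan, the `solve` stand-in, and the canned answers are hypothetical, and a real worker would perform retrieval and reasoning.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: templates may reference answers to earlier subquestions.
subquestions = {
    "q1": "Who directed the film?",
    "q2": "Which city was {q1} born in?",
    "q3": "What country is {q2} in?",
}
# Each subquestion maps to the set of subquestions it depends on.
depends_on = {"q1": set(), "q2": {"q1"}, "q3": {"q2"}}

def solve(question):
    """Stand-in worker that looks up a canned answer."""
    canned = {
        "Who directed the film?": "Director A",
        "Which city was Director A born in?": "City B",
        "What country is City B in?": "Country C",
    }
    return canned[question]

answers = {}
# static_order() yields each node only after all of its dependencies.
for qid in TopologicalSorter(depends_on).static_order():
    question = subquestions[qid].format(**answers)
    answers[qid] = solve(question)
```

The topological order guarantees that every placeholder in a template is filled before the worker sees the question.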

[168] Informatics for Food Processing

Gordana Ispirova,Michael Sebek,Giulia Menichetti

Main category: cs.CL

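TL;DR: This chapter examines the evolution, classification, and health implications of food processing, highlighting the transformative role of machine learning and artificial intelligence in food informatics.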

Details Motivation: Traditional food classification frameworks (such as NOVA, Nutri-Score, and SIGA) suffer from subjectivity and reproducibility problems that limit epidemiological research and public policy. Method: Presents new computational approaches, including FoodProX, a random forest model based on nutrient composition data, and semantic models that embed food descriptions with BERT and BioBERT. Result: A case study on the Open Food Facts database shows how multimodal AI models can integrate structured and unstructured data to classify foods at scale. Conclusion: The chapter offers a new paradigm for food processing assessment, with significance for public health and research. Abstract: This chapter explores the evolution, classification, and health implications of food processing, while emphasizing the transformative role of machine learning, artificial intelligence (AI), and data science in advancing food informatics. It begins with a historical overview and a critical review of traditional classification frameworks such as NOVA, Nutri-Score, and SIGA, highlighting their strengths and limitations, particularly the subjectivity and reproducibility challenges that hinder epidemiological research and public policy. To address these issues, the chapter presents novel computational approaches, including FoodProX, a random forest model trained on nutrient composition data to infer processing levels and generate a continuous FPro score. It also explores how large language models like BERT and BioBERT can semantically embed food descriptions and ingredient lists for predictive tasks, even in the presence of missing data. A key contribution of the chapter is a novel case study using the Open Food Facts database, showcasing how multimodal AI models can integrate structured and unstructured data to classify foods at scale, offering a new paradigm for food processing assessment in public health and research.

[169] Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models

Md Rafi Ur Rashid,Vishnu Asutosh Dasu,Ye Wang,Gang Tan,Shagufta Mehnaz

Main category: cs.CL

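TL;DR: ASE is a new inference-time framework that uses Chain-of-Thought reasoning to improve both the robustness and the seamlessness of LLMs, sharply reducing attack success rates and toxicity while also lowering the rejection rate.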

Details Motivation: Existing defenses usually target a single threat or sacrifice user experience, and they cannot handle diverse and novel attacks. Method: ASE uses Chain-of-Thought reasoning to guide the LLM to generate adversarial scenarios itself and formulate defensive strategies before producing a response. Result: On four adversarial benchmarks, ASE brings jailbreak attack success rates close to zero, minimizes toxicity, keeps the rejection rate below 4%, and outperforms six existing defenses on robustness and seamlessness. Conclusion: By turning adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction. Abstract: Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to <4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92-99% accuracy on adversarial Q&A and 4-10x lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.

[170] Large Language Models Implicitly Learn to See and Hear Just By Reading

Prateek Verma,Mert Pilanci

Main category: cs.CL

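TL;DR: By training an auto-regressive LLM on text, the text model inherently develops an understanding of images and audio, gaining multimodal ability without additional training.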

Details Motivation: Explore whether text LLMs can inherently learn multimodal abilities from text training alone, avoiding a separately trained model for each modality. Method: Takes image patches, audio waveforms, or tokens as input and uses the weights of a text LLM to produce embeddings or category labels for audio and image classification tasks. Result: Demonstrates the generality of text LLM weights on the FSD-50K and GTZAN audio datasets and the CIFAR-10 and Fashion-MNIST image datasets. Conclusion: Text LLMs learn powerful internal circuits that can be used for multimodal applications by activating the necessary connections, reducing the need to train from scratch. Abstract: This paper presents a fascinating find: By training an auto-regressive LLM model on text tokens, the text model inherently develops internally an ability to understand images and audio, thereby developing the ability to see and hear just by reading. Popular audio and visual LLM models fine-tune text LLM models to give text output conditioned on images and audio embeddings. On the other hand, our architecture takes in patches of images, audio waveforms or tokens as input. It gives us the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification for datasets FSD-50K and GTZAN. Further, we show this working for image classification on CIFAR-10 and Fashion-MNIST, as well as on image patches. This pushes the notion of text-LLMs learning powerful internal circuits that can be utilized by activating necessary connections for various applications rather than training models from scratch every single time.

[171] Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation

Kristine Ann M. Carandang,Jasper Meynard P. Araña,Ethan Robert A. Casin,Christopher P. Monterola,Daniel Stanley Y. Tan,Jesus Felix B. Valenzuela,Christian M. Alis

Main category: cs.CL

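TL;DR: The study evaluates the reliability of 12 large language models (LLMs) in clinical note generation (CNG), finds that Meta's Llama 70B and Mistral's Small model perform best, and recommends local deployment to protect data privacy and improve efficiency.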

Details Motivation: Because healthcare providers (HCPs) bear legal and ethical responsibility for accurate documentation and patient data privacy, the variability of LLMs in CNG is a challenge, and HCPs' trust in LLM tools needs to be strengthened. Method: Evaluates 12 open-weight and proprietary LLMs on CNG in terms of string equivalence, semantic consistency, and correctness. Result: All LLMs are stable and semantically consistent; Meta's Llama 70B and Mistral's Small model come closest to expert notes. Conclusion: Recommends deploying the smaller open-weight models locally to ensure data privacy compliance and improve HCPs' efficiency in clinical documentation. Abstract: Due to the legal and ethical responsibilities of healthcare providers (HCPs) for accurate documentation and protection of patient data privacy, the natural variability in the responses of large language models (LLMs) presents challenges for incorporating clinical note generation (CNG) systems, driven by LLMs, into real-world clinical processes. The complexity is further amplified by the detailed nature of texts in CNG. To enhance the confidence of HCPs in tools powered by LLMs, this study evaluates the reliability of 12 open-weight and proprietary LLMs from Anthropic, Meta, Mistral, and OpenAI in CNG in terms of their ability to generate notes that are string equivalent (consistency rate), have the same meaning (semantic consistency) and are correct (semantic similarity), across several iterations using the same prompt. The results show that (1) LLMs from all model families are stable, such that their responses are semantically consistent despite being written in various ways, and (2) most of the LLMs generated notes close to the corresponding notes made by experts. Overall, Meta's Llama 70B was the most reliable, followed by Mistral's Small model. With these findings, we recommend the local deployment of these relatively smaller open-weight models for CNG to ensure compliance with data privacy regulations, as well as to improve the efficiency of HCPs in clinical documentation.
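
The abstract names a "consistency rate" based on string equivalence across repeated runs of the same prompt but does not define it. The sketch below assumes one plausible definition, the fraction of output pairs that are exactly identical; the function name and the sample notes are hypothetical.

```python
from itertools import combinations

def pairwise_consistency_rate(outputs):
    """Fraction of unordered output pairs that are string-identical.

    Assumed definition for illustration; the paper may define its
    consistency rate differently.
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical notes generated from the same prompt in three runs.
runs = [
    "Patient reports mild headache.",
    "Patient reports mild headache.",
    "The patient reports a mild headache.",
]
rate = pairwise_consistency_rate(runs)
```

The third note differs in wording but not in meaning, which is why an exact-match measure of this kind is paired with semantic measures in the evaluation.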

[172] TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

Yanshu Li,Tian Yun,Jianjiang Yang,Pinyuan Feng,Jinfa Huang,Ruixiang Tang

Main category: cs.CL

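TL;DR: The paper proposes TACO, a lightweight model that uses task mapping to configure multimodal in-context sequences dynamically, substantially improving in-context learning in large vision-language models (LVLMs).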

Details Motivation: The effectiveness of multimodal in-context learning (ICL) depends heavily on the quality of the input sequences, yet there is limited understanding of how LVLMs use them. Method: Interprets multimodal ICL through task mapping and proposes TACO, which uses task-aware attention to configure in-context sequences dynamically. Result: Experiments on five LVLMs and nine datasets show that TACO outperforms baselines across diverse ICL tasks. Conclusion: Task mapping is a valuable perspective for understanding and improving multimodal ICL. Abstract: Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input in-context sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures in-context sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a valuable perspective for interpreting and improving multimodal ICL.

[173] Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation

Xiaozhao Liu,Dinggang Shen,Xihui Liu

Main category: cs.CL

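TL;DR: The paper proposes GLIM, which addresses hallucination in EEG-to-text decoding through semantic summarization rather than verbatim reconstruction, and shows experimentally that it generates fluent, EEG-grounded sentences.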

Details Motivation: Pretrained generative models produce text and images in brain decoding, but the reliability of the outputs is doubtful: they may be model hallucinations rather than reflections of real semantic activation in the brain. Method: Proposes GLIM, which emphasizes learning informative and interpretable EEG representations to improve semantic grounding on small-scale, heterogeneous data. Result: On the ZuCo dataset, GLIM generates fluent, EEG-grounded sentences and supports EEG-text retrieval and zero-shot semantic classification. Conclusion: GLIM provides a reliable and scalable baseline architecture and evaluation protocol for generative brain decoding. Abstract: Pretrained generative models have opened new frontiers in brain decoding by enabling the synthesis of realistic texts and images from non-invasive brain recordings. However, the reliability of such outputs remains questionable: do they truly reflect semantic activation in the brain, or are they merely hallucinated by the powerful generative models? In this paper, we focus on EEG-to-text decoding and address its hallucination issue through the lens of posterior collapse. Acknowledging the underlying mismatch in information capacity between EEG and text, we reframe the decoding task as semantic summarization of core meanings rather than the verbatim reconstruction of stimulus texts attempted previously. To this end, we propose the Generative Language Inspection Model (GLIM), which emphasizes learning informative and interpretable EEG representations to improve semantic grounding under heterogeneous and small-scale data conditions. Experiments on the public ZuCo dataset demonstrate that GLIM consistently generates fluent, EEG-grounded sentences without teacher forcing. Moreover, it supports more robust evaluation beyond text similarity, through EEG-text retrieval and zero-shot semantic classification across sentiment categories, relation types, and corpus topics. Together, our architecture and evaluation protocols lay the foundation for reliable and scalable benchmarking in generative brain decoding.

[174] Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector

Haoyan Yang,Runxue Bao,Cao Xiao,Jun Ma,Parminder Bhatia,Shangqian Gao,Taha Kass-Hout

Main category: cs.CL

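TL;DR: The paper proposes RBD, an external plug-in module that detects and corrects bias in LLM-based evaluation through iterative bias detection and feedback-driven revision, substantially improving evaluation accuracy and consistency.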

Details Motivation: LLMs used as evaluators are biased, and existing methods (such as in-context learning or fine-tuning) cannot fully solve the problem, especially for closed-source models, so an external solution that does not modify the evaluator itself is needed. Method: Proposes the RBD module, built through biased dataset construction, supervision collection, and reasoning-based fine-tuning, and integrated with LLM evaluators to enable bias detection and feedback-driven self-correction. Result: Experiments show that RBD is effective across model scales and bias types; for example, the RBD-8B model improves evaluation accuracy and consistency by 18.5% and 10.9% respectively and outperforms baselines. Conclusion: RBD is an efficient, scalable solution that generalizes well and substantially improves the reliability of LLM-based evaluation. Abstract: LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by potential biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address rooted biases due to the evaluator's limited capacity for self-reflection, whereas fine-tuning is not applicable to all evaluator types, especially closed-source models. To address this challenge, we introduce the Reasoning-based Bias Detector (RBD), which is a plug-in module that identifies biased evaluations and generates structured reasoning to guide evaluator self-correction. Rather than modifying the evaluator itself, RBD operates externally and engages in an iterative process of bias detection and feedback-driven revision. To support its development, we design a complete pipeline consisting of biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD, and integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging from 1.5B to 14B, and observe consistent performance improvements across all scales. Experimental results on 4 bias types (verbosity, position, bandwagon, and sentiment) evaluated using 8 LLM evaluators demonstrate RBD's strong effectiveness. For example, the RBD-8B model improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, and surpasses prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively. These results highlight RBD's effectiveness and scalability. Additional experiments further demonstrate its strong generalization across biases and domains, as well as its efficiency.

[175] An approach to identify the most semantically informative deep representations of text and images

Santiago Acevedo,Andrea Mascaretti,Riccardo Rende,Matéo Mahaut,Marco Baroni,Alessandro Laio

Main category: cs.CL

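TL;DR: The paper proposes a quantitative method for studying how similar the representations of semantically related data are in deep neural networks, and examines how large language models (LLMs) and vision transformers encode this information across multiple tokens.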

Details Motivation: Study how deep neural networks produce similar representations for semantically related data from different domains (such as images and text), and quantify their information content. Method: Measures the information content of representations of semantically related data and analyzes how it is encoded across multiple tokens in LLMs and vision transformers, focusing on the inner semantic layers that language models use when processing pairs of translated sentences. Result: LLMs contain inner semantic layers holding the most transferable information, and a larger model (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information is spread across many tokens, with long-distance correlations and a causal asymmetry. Vision transformers have similar semantic layers, and representations in the semantic layers of LLMs predict the visual representations of the corresponding images. Conclusion: The study reveals how semantic information is encoded in deep neural networks and how it transfers across domains, offering a new perspective for model design and cross-modal research. Abstract: Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner "semantic" layers containing the most language-transferable information. We find moreover that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information is spread across many tokens and it is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within visual transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.

[176] BanglaByT5: Byte-Level Modelling for Bangla

Pramit Bhattacharyya,Arnab Bhattacharya

Main category: cs.CL

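TL;DR: BanglaByT5 is a byte-level encoder-decoder model for Bangla, based on the ByT5 architecture and pre-trained on a 14GB curated corpus, that performs strongly on generative and classification tasks.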

Details Motivation: Traditional tokenizers (such as BPE and SentencePiece) fail to capture the finer nuances of morphologically rich Bangla, so a specially tailored model is needed. Method: Built on a small variant of ByT5, pre-trained on a 14GB curated corpus of literary and newspaper text, and evaluated in zero-shot and supervised settings. Result: Outperforms several multilingual and larger models on generative and classification tasks, demonstrating the effectiveness of byte-level modelling for morphologically rich languages. Conclusion: BanglaByT5 is a lightweight yet powerful tool for Bangla NLP, suited to both resource-constrained and scalable environments. Abstract: Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLM models use traditional tokenizers like BPE and SentencePiece, which fail to capture the finer nuances of a morphologically rich language like Bangla (Bengali). In this work, we introduce BanglaByT5, the first byte-level encoder-decoder model explicitly tailored for Bangla. Built upon a small variant of Google's ByT5 architecture, BanglaByT5 is pre-trained on a 14GB curated corpus combining high-quality literary and newspaper articles. Through zero-shot and supervised evaluations across generative and classification tasks, BanglaByT5 demonstrates competitive performance, surpassing several multilingual and larger models. Our findings highlight the efficacy of byte-level modelling for morphologically rich languages and highlight BanglaByT5's potential as a lightweight yet powerful tool for Bangla NLP, particularly in both resource-constrained and scalable environments.
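
Byte-level modelling replaces a learned subword vocabulary with the raw UTF-8 bytes of the text. The sketch below shows what that means for a Bangla word. It is an illustration, not BanglaByT5's actual tokenizer: the offset of 3 assumes the usual ByT5 convention of reserving the first three ids for special tokens, and the function name is hypothetical.

```python
# Assumption: ids 0-2 are reserved for special tokens (pad, eos, unk),
# so each byte value b maps to id b + 3.
BYT5_ID_OFFSET = 3

def byte_level_ids(text):
    """Map text to one id per UTF-8 byte, shifted past the special ids."""
    return [b + BYT5_ID_OFFSET for b in text.encode("utf-8")]

# The word "Bangla" in Bengali script, written with escapes for clarity:
# five code points, each of which takes three bytes in UTF-8.
word = "\u09ac\u09be\u0982\u09b2\u09be"
ids = byte_level_ids(word)

code_points = len(word)   # number of Unicode code points
sequence_length = len(ids)  # number of byte-level ids
```

No vocabulary lookup can fail under this scheme, at the cost of sequences that are about three times longer than the code-point count for Bengali script.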

[177] Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation

Cécile Rousseau,Tobia Boschi,Giandomenico Cornacchia,Dhaval Salwala,Alessandra Pascale,Juan Bernabe Moreno

Main category: cs.CL

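TL;DR: SDForger is an efficient and flexible framework that uses LLMs to generate high-quality multivariate time series from a few samples with low-computation fine-tuning.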

Details Motivation: Address the limitations of existing generative models for multivariate time series generation and provide a more efficient, higher-quality solution. Method: Converts univariate and multivariate signals into tabular embeddings, encodes them as text, and fine-tunes an LLM to generate synthetic time series that retain the statistical properties and temporal dynamics of the original data. Result: Across diverse datasets, SDForger outperforms existing generative models in similarity-based evaluations and downstream forecasting tasks. Conclusion: SDForger opens a path to multimodal modeling and the streamlined integration of time series with textual information, and its source code will be open-sourced. Abstract: SDForger is a flexible and efficient framework for generating high-quality multivariate time series using LLMs. Leveraging a compact data representation, SDForger provides synthetic time series generation from a few samples and low-computation fine-tuning of any autoregressive LLM. Specifically, the framework transforms univariate and multivariate signals into tabular embeddings, which are then encoded into text and used to fine-tune the LLM. At inference, new textual embeddings are sampled and decoded into synthetic time series that retain the original data's statistical properties and temporal dynamics. Across a diverse range of datasets, SDForger outperforms existing generative models in many scenarios, both in similarity-based evaluations and downstream forecasting tasks. By enabling textual conditioning in the generation process, SDForger paves the way for multimodal modeling and the streamlined integration of time series with textual information. SDForger source code will be open-sourced soon.

[178] P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark

Tao Sun,Enhao Pan,Zhengkai Yang,Kaixin Sui,Jiajun Shi,Xianfu Cheng,Tongliang Li,Wenhao Huang,Ge Zhang,Jian Yang,Zhoujun Li

Main category: cs.CL

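TL;DR: P2P is an LLM-based multi-agent framework that generates high-quality HTML academic posters directly from research papers, addressing the shortcomings of existing methods in semantic richness and structural detail.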

Details Motivation: Creating academic posters by hand is time-consuming, and existing automated methods struggle to preserve scientific detail and to integrate visuals with text. Method: P2P uses three specialized agents (visual element processing, content generation, and poster assembly), each paired with a checker module for iterative refinement. The authors also release the P2PInstruct dataset and the P2PEval benchmark. Result: P2P shows strong potential for practical use and comes with a large-scale dataset and a dual evaluation methodology (Universal and Fine-Grained). Conclusion: The P2P framework and its accompanying tools aim to streamline research dissemination and provide a basis for developing and evaluating next-generation poster generation systems. Abstract: Academic posters are vital for scholarly communication, yet their manual creation is time-consuming. However, automated academic poster generation faces significant challenges in preserving intricate scientific details and achieving effective visual-textual integration. Existing approaches often struggle with semantic richness and structural nuances, and lack standardized benchmarks for evaluating generated academic posters comprehensively. To address these limitations, we introduce P2P, the first flexible, LLM-based multi-agent framework that generates high-quality, HTML-rendered academic posters directly from research papers, demonstrating strong potential for practical applications. P2P employs three specialized agents (for visual element processing, content generation, and final poster assembly), each integrated with dedicated checker modules to enable iterative refinement and ensure output quality. To foster advancements and rigorous evaluation in this domain, we construct and release P2PInstruct, the first large-scale instruction dataset comprising over 30,000 high-quality examples tailored for the academic paper-to-poster generation task. Furthermore, we establish P2PEval, a comprehensive benchmark featuring 121 paper-poster pairs and a dual evaluation methodology (Universal and Fine-Grained) that leverages LLM-as-a-Judge and detailed, human-annotated checklists. Our contributions aim to streamline research dissemination and provide the community with robust tools for developing and evaluating next-generation poster generation systems.

[179] RRTL: Red Teaming Reasoning Large Language Models in Tool Learning

Yifei Liu,Yu Cui,Haibin Zhang

Main category: cs.CL

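TL;DR: The paper proposes RRTL, a red-teaming method for evaluating the safety of reasoning LLMs (RLLMs) in tool learning, and finds that RLLMs are generally safer than traditional LLMs but still exhibit deceptive risks and multilingual vulnerabilities.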

Details Motivation: Study the safety of newly emerging reasoning LLMs (RLLMs) in tool learning, a gap in existing research. Method: Propose RRTL, which combines the identification of deceptive threats with Chain-of-Thought (CoT) prompting to evaluate RLLM safety. Result: RLLMs are generally safer than traditional LLMs but still exhibit deceptive risks and multilingual vulnerabilities. Conclusion: The work offers important insights for improving the security of RLLMs in tool learning. Abstract: While tool learning significantly enhances the capabilities of large language models (LLMs), it also introduces substantial security risks. Prior research has revealed various vulnerabilities in traditional LLMs during tool learning. However, the safety of newly emerging reasoning LLMs (RLLMs), such as DeepSeek-R1, in the context of tool learning remains underexplored. To bridge this gap, we propose RRTL, a red teaming approach specifically designed to evaluate RLLMs in tool learning. It integrates two novel strategies: (1) the identification of deceptive threats, which evaluates the model's behavior in concealing the usage of unsafe tools and their potential risks; and (2) the use of Chain-of-Thought (CoT) prompting to force tool invocation. Our approach also includes a benchmark for traditional LLMs. We conduct a comprehensive evaluation on seven mainstream RLLMs and uncover three key findings: (1) RLLMs generally achieve stronger safety performance than traditional LLMs, yet substantial safety disparities persist across models; (2) RLLMs can pose serious deceptive risks by frequently failing to disclose tool usage and to warn users of potential tool output risks; (3) CoT prompting reveals multi-lingual safety vulnerabilities in RLLMs. Our work provides important insights into enhancing the security of RLLMs in tool learning.

[180] Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling

Junlin Li,Guodong DU,Jing Li,Sim Kuan Goh,Wenya Wang,Yequan Wang,Fangming Liu,Ho-Kin Tang,Saleh Alharbi,Daojing He,Min Zhang

Main category: cs.CL

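TL;DR: MMER is a training-free method that achieves multimodal expansion while retaining original performance by reusing multimodal encoders and merging LLM parameters.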

Details Motivation: Avoid the resource-intensive fine-tuning that conventional approaches depend on by offering a flexible scheme for multimodal expansion. Method: Reuse the multimodal encoders of existing MLLMs, merge their LLM parameters, and generate binary masks that separate modality-specific parameters. Result: Experiments show that MMER significantly improves multimodal capability, retains 99% of the original performance, and effectively mitigates catastrophic forgetting. Conclusion: MMER provides an efficient and flexible approach to multimodal expansion. Abstract: Fine-tuning Large Language Models (LLMs) with multimodal encoders on modality-specific data expands the modalities that LLMs can handle, leading to the formation of Multimodal LLMs (MLLMs). However, this paradigm heavily relies on resource-intensive and inflexible fine-tuning from scratch with new multimodal data. In this paper, we propose MMER (Multi-modality Expansion and Retention), a training-free approach that integrates existing MLLMs for effective multimodal expansion while retaining their original performance. Specifically, MMER reuses MLLMs' multimodal encoders while merging their LLM parameters. By comparing original and merged LLM parameters, MMER generates binary masks to approximately separate LLM parameters for each modality. These decoupled parameters can independently process modality-specific inputs, reducing parameter conflicts and preserving original MLLMs' fidelity. MMER can also mitigate catastrophic forgetting by applying a similar process to MLLMs fine-tuned on new tasks. Extensive experiments show significant improvements over baselines, proving that MMER effectively expands LLMs' multimodal capabilities while retaining 99% of the original performance, and also markedly mitigates catastrophic forgetting.
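
The merge-then-mask idea in the MMER abstract can be pictured with a toy sketch. Everything specific here is an assumption made for illustration: the abstract does not give the merge rule or the mask criterion, so this sketch averages the per-modality weight updates and keeps a coordinate for a modality when the merged update has the same sign as that modality's own update. The function names and the flat-list parameter layout are hypothetical.

```python
def merge_and_mask(base, finetuned_by_modality):
    """Merge per-modality LLM weights and derive one binary mask per modality.

    base: list of floats, the shared LLM weights before multimodal tuning.
    finetuned_by_modality: dict mapping a modality name to its tuned weights.
    """
    # Update of each MLLM's LLM weights relative to the shared base.
    deltas = {
        name: [w - b for w, b in zip(weights, base)]
        for name, weights in finetuned_by_modality.items()
    }
    count = len(deltas)
    # ASSUMPTION: merge by plain averaging of the updates.
    merged_delta = [
        sum(delta[i] for delta in deltas.values()) / count
        for i in range(len(base))
    ]
    # ASSUMPTION: a coordinate belongs to a modality when the merged update
    # agrees in sign with that modality's own update.
    masks = {
        name: [1 if own * merged > 0 else 0 for own, merged in zip(delta, merged_delta)]
        for name, delta in deltas.items()
    }
    return merged_delta, masks


def decoupled_params(base, merged_delta, mask):
    """Weights used for one modality: base plus the masked merged update."""
    return [b + keep * d for b, d, keep in zip(base, merged_delta, mask)]


base = [0.0, 0.0, 0.0]
tuned = {"vision": [1.0, -1.0, 0.0], "audio": [1.0, 3.0, 2.0]}
merged_delta, masks = merge_and_mask(base, tuned)
vision_weights = decoupled_params(base, merged_delta, masks["vision"])
```

In this toy case the second coordinate conflicts between the two modalities, so the vision mask drops it while the audio mask keeps it; no training step is involved, which is the property the abstract emphasizes.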

[181] Cultural Value Alignment in Large Language Models: A Prompt-based Analysis of Schwartz Values in Gemini, ChatGPT, and DeepSeek

Robin Segerer

Main category: cs.CL

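TL;DR: The study analyzes cultural value alignment in Gemini, ChatGPT, and DeepSeek within Schwartz's value framework, finding that DeepSeek, trained on Chinese-language data, leans more toward collectivist values, while all models favor prosocial values.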

Details Motivation: Examine whether large language models (LLMs) reflect cultural biases rather than a universal ethical framework. Method: Administer the 40-item Portrait Values Questionnaire and analyze the models' value preferences with a Bayesian ordinal regression model. Result: All models strongly prioritize self-transcendence values (e.g., benevolence, universalism), but DeepSeek downplays self-enhancement values (e.g., power, achievement) relative to the others, consistent with collectivist cultural tendencies. Conclusion: LLMs reflect culturally situated biases; the authors propose multi-perspective reasoning, self-reflective feedback, and dynamic contextualization to address value asymmetries and advance pluralistic AI alignment frameworks. Abstract: This study examines cultural value alignment in large language models (LLMs) by analyzing how Gemini, ChatGPT, and DeepSeek prioritize values from Schwartz's value framework. Using the 40-item Portrait Values Questionnaire, we assessed whether DeepSeek, trained on Chinese-language data, exhibits distinct value preferences compared to Western models. Results of a Bayesian ordinal regression model show that self-transcendence values (e.g., benevolence, universalism) were highly prioritized across all models, reflecting a general LLM tendency to emphasize prosocial values. However, DeepSeek uniquely downplayed self-enhancement values (e.g., power, achievement) compared to ChatGPT and Gemini, aligning with collectivist cultural tendencies. These findings suggest that LLMs reflect culturally situated biases rather than a universal ethical framework. To address value asymmetries in LLMs, we propose multi-perspective reasoning, self-reflective feedback, and dynamic contextualization. This study contributes to discussions on AI fairness, cultural neutrality, and the need for pluralistic AI alignment frameworks that integrate diverse moral perspectives.

[182] RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language

Subrata Biswas,Mohammad Nur Hossain Khan,Bashima Islam

Main category: cs.CL

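TL;DR: RAVEN is a multimodal question-answering architecture whose QuART module dynamically assigns relevance scores to tokens across modalities, yielding significant performance gains.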

Details Motivation: Address cases in multimodal QA where disagreement between modalities (such as background noise or motion outside the field of view) misleads the model. Method: A three-stage training pipeline (unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning) combined with the QuART module for dynamic token weighting. Result: On seven multimodal QA benchmarks RAVEN improves accuracy by up to 14.5%, and sensor data adds a further 16.4% gain. Conclusion: Through dynamic modality weighting and a robust training strategy, RAVEN significantly improves the accuracy and robustness of multimodal QA. Abstract: Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning, with each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks (including egocentric and exocentric tasks) show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.
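
To make "query-conditioned gating with scalar relevance scores" concrete, here is a minimal sketch of the general idea, not the QuART module itself. It assumes a fixed scaled dot product between the query vector and each token, passed through a sigmoid; in RAVEN the scoring is learned, and the function name `gate_tokens` and the toy vectors are hypothetical.

```python
import math


def gate_tokens(query, tokens):
    """Score each token against the query and scale the token by its score.

    query: list of floats (a pooled question embedding, assumed given).
    tokens: list of token vectors from any modality, same width as query.
    Returns (scores, gated_tokens).
    """
    width = len(query)
    scores, gated = [], []
    for token in tokens:
        # ASSUMPTION: relevance logit is a scaled dot product, not a learned layer.
        logit = sum(q * t for q, t in zip(query, token)) / math.sqrt(width)
        score = 1.0 / (1.0 + math.exp(-logit))  # sigmoid keeps the score in (0, 1)
        scores.append(score)
        gated.append([score * t for t in token])  # amplify or suppress before fusion
    return scores, gated


query = [1.0, 0.0]
tokens = [[4.0, 0.0],    # aligned with the query: should be kept
          [-4.0, 0.0]]   # opposed to the query: should be suppressed
scores, gated = gate_tokens(query, tokens)
```

Because each score is a single scalar per token, the gate changes how strongly a token contributes to fusion without changing its direction in embedding space.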

[183] Comparative Evaluation of Prompting and Fine-Tuning for Applying Large Language Models to Grid-Structured Geospatial Data

Akash Dhruv,Yangxinyu Xie,Jordan Branham,Tanwi Mallick

Main category: cs.CL

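TL;DR: A comparative study of large language models (LLMs) interpreting grid-structured geospatial data, evaluating the performance gap between a base model and its fine-tuned variant.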

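Details Motivation: Examine how LLMs differ in structured geospatial and temporal reasoning under zero-shot prompting versus after fine-tuning. Method: Evaluate a base model with structured prompting and compare it with a variant fine-tuned on a dataset of user-assistant interactions. Result: The results reveal the strengths and limitations of zero-shot prompting and show the benefits of fine-tuning for structured geospatial and temporal reasoning. Conclusion: Fine-tuning benefits LLM performance on structured geospatial and temporal reasoning tasks. Abstract: This paper presents a comparative study of large language models (LLMs) in interpreting grid-structured geospatial data. We evaluate the performance of a base model through structured prompting and contrast it with a fine-tuned variant trained on a dataset of user-assistant interactions. Our results highlight the strengths and limitations of zero-shot prompting and demonstrate the benefits of fine-tuning for structured geospatial and temporal reasoning.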

[184] From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

Chen Shani,Dan Jurafsky,Yann LeCun,Ravid Shwartz-Ziv

Main category: cs.CL

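TL;DR: The paper examines how humans and large language models (LLMs) differ in semantic compression, finding that LLMs favor aggressive statistical compression whereas humans prioritize semantic nuance.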

Details Motivation: Investigate whether LLMs strike a human-like trade-off between compression and semantic fidelity. Method: Propose an information-theoretic framework drawing on Rate-Distortion Theory and the Information Bottleneck principle, and analyze LLM token embeddings against human categorization benchmarks. Result: LLMs form broad conceptual categories but lack the fine-grained semantic distinctions that humans make, favoring statistical compression over semantic richness. Conclusion: The work reveals key differences between LLMs and human cognitive architectures and points toward LLMs with more human-aligned representations. Abstract: Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.
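
For orientation, the two information-theoretic tools named in this abstract have standard textbook forms, reproduced below. The paper's own objective may combine or modify them; the symbols X (items), C (the cluster or concept assigned to an item), Y (a relevance variable), d (a distortion measure), D (a distortion budget), and β (a trade-off weight) are notation chosen for this sketch rather than taken from the paper.

```latex
% Rate-distortion function: the fewest bits needed to encode X as C
% while keeping the expected distortion within the budget D.
R(D) = \min_{p(c \mid x)\,:\,\mathbb{E}[d(X, C)] \le D} I(X; C)

% Information Bottleneck Lagrangian: compress X into C (small I(X;C))
% while preserving information about Y (large I(C;Y)).
\mathcal{L}_{\mathrm{IB}} = I(X; C) - \beta\, I(C; Y)
```

Read this way, "aggressive statistical compression" corresponds to driving I(X; C) down, while "semantic fidelity" corresponds to keeping the distortion low or the preserved information high.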

[185] After Retrieval, Before Generation: Enhancing the Trustworthiness of Large Language Models in RAG

Xinbang Dai,Huikang Hu,Yuncheng Hua,Jiaqi Li,Yongrui Chen,Rihui Jin,Nan Hu,Guilin Qi

Main category: cs.CL

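TL;DR: The paper proposes the BRIDGE framework, which improves the trustworthiness of RAG systems by dynamically weighing internal and external knowledge.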

Details Motivation: Balance internal and external knowledge in RAG systems when the two conflict or are unreliable. Method: Construct the TRD dataset and propose the BRIDGE framework, which uses a soft bias mechanism and a Maximum Soft-bias Decision Tree. Result: BRIDGE outperforms baselines by 5-15% in accuracy while maintaining balanced performance across all scenarios. Conclusion: BRIDGE offers an effective solution for trustworthy responses in RAG applications. Abstract: Retrieval-augmented generation (RAG) systems face critical challenges in balancing internal (parametric) and external (retrieved) knowledge, especially when these sources conflict or are unreliable. To analyze these scenarios comprehensively, we construct the Trustworthiness Response Dataset (TRD) with 36,266 questions spanning four RAG settings. We reveal that existing approaches address isolated scenarios (prioritizing one knowledge source, naively merging both, or refusing answers) but lack a unified framework to handle different real-world conditions simultaneously. Therefore, we propose the BRIDGE framework, which dynamically determines a comprehensive response strategy of large language models (LLMs). BRIDGE leverages an adaptive weighting mechanism named soft bias to guide knowledge collection, followed by a Maximum Soft-bias Decision Tree to evaluate knowledge and select optimal response strategies (trust internal/external knowledge, or refuse). Experiments show BRIDGE outperforms baselines by 5-15% in accuracy while maintaining balanced performance across all scenarios. Our work provides an effective solution for LLMs' trustworthy responses in real-world RAG applications.

[186] Systematic Evaluation of Machine-Generated Reasoning and PHQ-9 Labeling for Depression Detection Using Large Language Models

Zongru Shao,Xin Wang,Zhanyang Liu,Chenhan Wang,K. P. Subbalakshmi

Main category: cs.CL

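TL;DR: The paper systematically evaluates the reasoning of large language models (LLMs) in early mental health detection, reveals potential weaknesses, and proposes optimization strategies. Using a designed instruction strategy, contrastive prompts, and human annotation, it finds that LLMs are more accurate at detecting explicit language of depression, and it applies SFT and DPO to improve performance.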

Details Motivation: Current research uses LLMs for mental health detection (such as depression), but the detection may have unknown weaknesses and the generated data lacks quality control. The paper aims to systematically evaluate LLM reasoning and reveal those weaknesses. Method: 1. Design an LLM instruction strategy that breaks the task into subtasks; 2. Design contrastive prompts (few-shot and chain-of-thought); 3. Annotate the subtasks by hand and evaluate performance; 4. Explore optimization strategies (SFT and DPO). Result: LLMs are more accurate when analyzing explicit language of depression. The DPO approach significantly improves performance. Conclusion: Through systematic evaluation and optimization, the paper reveals both the potential and the limits of LLMs in mental health detection and proposes effective ways to improve performance. Abstract: Recent research leverages large language models (LLMs) for early mental health detection, such as depression, often optimized with machine-generated data. However, their detection may be subject to unknown weaknesses. Meanwhile, quality control has not been applied to these generated corpora besides limited human verifications. Our goal is to systematically evaluate LLM reasoning and reveal potential weaknesses. To this end, we first provide a systematic evaluation of the reasoning over machine-generated detection and interpretation. Then we use the models' reasoning abilities to explore mitigation strategies for enhanced performance. Specifically, we do the following: A. Design an LLM instruction strategy that allows for systematic analysis of the detection by breaking down the task into several subtasks. B. Design contrastive few-shot and chain-of-thought prompts by selecting typical positive and negative examples of detection reasoning. C. Perform human annotation for the subtasks identified in the first step and evaluate the performance. D. Identify human-preferred detection with desired logical reasoning from the few-shot generation and use them to explore different optimization strategies. We conducted extensive comparisons on the DepTweet dataset across the following subtasks: 1. identifying whether the speaker is describing their own depression; 2. accurately detecting the presence of PHQ-9 symptoms; and 3. finally, detecting depression. Human verification of statistical outliers shows that LLMs demonstrate greater accuracy in analyzing and detecting explicit language of depression as opposed to implicit expressions of depression. 
Two optimization methods are used for performance enhancement and reduction of the statistical bias: supervised fine-tuning (SFT) and direct preference optimization (DPO). Notably, the DPO approach achieves significant performance improvement.

[187] Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training

Dillon Plunkett,Adam Morris,Keerthi Reddy,Jorge Morales

Main category: cs.CL

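TL;DR: The study shows that large language models (LLMs) can improve their ability to explain themselves through training, and that the improvement generalizes to other complex decisions.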

Details Motivation: Understand the internal workings of LLMs and their capacity for self-explanation in order to improve interpretability, control, and safety. Method: Fine-tune GPT-4o and GPT-4o-mini to make complex decisions according to given attribute weights and to report those internal weight preferences. Result: The LLMs accurately report the weights behind their decisions, training improves this ability further, and the improvement generalizes to other tasks. Conclusion: Training LLMs to explain their own internal processes is feasible, with significant implications for interpretability and safety. Abstract: We have only limited understanding of how and why large language models (LLMs) respond in the ways that they do. Their neural networks have proven challenging to interpret, and we are only beginning to tease out the function of individual neurons and circuits within them. However, another path to understanding these systems is to investigate and develop their capacity to introspect and explain their own functioning. Here, we show that i) contemporary LLMs are capable of providing accurate, quantitative descriptions of their own internal processes during certain kinds of decision-making, ii) that it is possible to improve these capabilities through training, and iii) that this training generalizes to at least some degree. To do so, we fine-tuned GPT-4o and GPT-4o-mini to make decisions in a wide variety of complex contexts (e.g., choosing between condos, loans, vacations, etc.) according to randomly-generated, quantitative preferences about how to weigh different attributes during decision-making (e.g., the relative importance of natural light versus quiet surroundings for condos). We demonstrate that the LLMs can accurately report these preferences (i.e., the weights that they learned to give to different attributes during decision-making). Next, we demonstrate that these LLMs can be fine-tuned to explain their decision-making even more accurately. Finally, we demonstrate that this training generalizes: It improves the ability of the models to accurately explain what they are doing as they make other complex decisions, not just decisions they have learned to make via fine-tuning. 
This work is a step towards training LLMs to accurately and broadly report on their own internal processes, a possibility that would yield substantial benefits for interpretability, control, and safety.

[188] NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation

Weiming Wu,Zi-kang Wang,Jin Ye,Zhi Zhou,Yu-Feng Li,Lan-Zhe Guo

Main category: cs.CL

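TL;DR: NeSyGeo is a neuro-symbolic framework for generating geometric reasoning data that addresses the diversity and numerical-generalization limits of existing methods. A symbolic-visual-text pipeline produces diverse Q&A pairs, and the resulting datasets and benchmark significantly improve the geometric reasoning of multimodal large language models.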

Details Motivation: Existing methods for generating geometric reasoning data are limited in diversity and numerical generalization, so a more comprehensive solution is needed. Method: Propose the NeSyGeo framework, which combines a domain-specific language with a symbolic-visual-text pipeline to generate diverse Q&A pairs, and build datasets and a benchmark on top of it. Result: Experiments show that NeSyGeo significantly improves several MLLMs, with notable gains from only 4k samples and two epochs of reinforcement fine-tuning. Conclusion: NeSyGeo offers an efficient solution for geometric reasoning data generation and significantly improves model performance. Abstract: Obtaining large-scale, high-quality data with reasoning paths is crucial for improving the geometric reasoning capabilities of multi-modal large language models (MLLMs). However, existing data generation methods, whether based on predefined templates or constrained symbolic provers, inevitably face diversity and numerical generalization limitations. To address these limitations, we propose NeSyGeo, a novel neuro-symbolic framework for generating geometric reasoning data. First, we propose a domain-specific language grounded in the entity-relation-constraint paradigm to comprehensively represent all components of plane geometry, along with generative actions defined within this symbolic space. We then design a symbolic-visual-text pipeline that synthesizes symbolic sequences, maps them to corresponding visual and textual representations, and generates diverse question-answer (Q&A) pairs using large language models (LLMs). To the best of our knowledge, we are the first to propose a neuro-symbolic approach in generating multimodal reasoning data. Based on this framework, we construct NeSyGeo-CoT and NeSyGeo-Caption datasets, containing 100k samples, and release a new benchmark NeSyGeo-Test for evaluating geometric reasoning abilities in MLLMs. Experiments demonstrate that the proposal significantly and consistently improves the performance of multiple MLLMs under both reinforcement and supervised fine-tuning. With only 4k samples and two epochs of reinforcement fine-tuning, base models achieve improvements of up to +15.8% on MathVision, +8.4% on MathVerse, and +7.3% on GeoQA. Notably, a 4B model can be improved to outperform an 8B model from the same series on geometric reasoning tasks.

[189] Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?

Xuan Qi,Jiahao Qiu,Xinzhe Juan,Yue Wu,Mengdi Wang

Main category: cs.CL

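TL;DR: The study finds that the signal aligning large language models (LLMs) with human preferences is concentrated mainly in the early tokens (shallow preference signals). Models trained on truncated datasets can match or even outperform those trained on full datasets. Decoding strategies further support the observation, which also exposes a potential weakness of existing alignment methods.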

Details Motivation: Explore how the preference signal is distributed across tokens when aligning LLMs with human preferences, in order to improve alignment efficiency and performance. Method: Truncate preference datasets and train reward models and DPO models on them; design decoding strategies (Length Control Decoding and KL Threshold Control Decoding) that exploit shallow preference signals. Result: Models trained on truncated datasets perform comparably to or better than those trained on full datasets, and the decoding strategies improve performance further. Conclusion: The shallow preference signal phenomenon exposes a limitation of existing alignment methods and calls for approaches that align the full response. Abstract: Aligning large language models (LLMs) with human preferences remains a key challenge in AI. Preference-based optimization methods, such as Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on human-annotated datasets to improve alignment. In this work, we identify a crucial property of the existing learning method: the distinguishing signal obtained in preferred responses is often concentrated in the early tokens. We refer to this as shallow preference signals. To explore this property, we systematically truncate preference datasets at various points and train both reward models and DPO models on the truncated data. Surprisingly, models trained on truncated datasets, retaining only the first half or fewer tokens, achieve comparable or even superior performance to those trained on full datasets. For example, a reward model trained on a 40% truncated version of the Skywork-Reward-Preference-80K-v0.2 dataset outperforms one trained on the full dataset. This pattern is consistent across multiple datasets, suggesting the widespread presence of shallow preference signals. We further investigate the distribution of the reward signal through decoding strategies. We consider two simple decoding strategies motivated by the shallow reward signal observation, namely Length Control Decoding and KL Threshold Control Decoding, which leverage shallow preference signals to optimize the trade-off between alignment and computational efficiency. The performance is even better, which again validates our hypothesis. 
The phenomenon of shallow preference signals highlights potential issues in LLM alignment: existing alignment methods often focus on aligning only the initial tokens of responses, rather than considering the full response. This could lead to discrepancies with real-world human preferences, resulting in suboptimal alignment performance.
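
The truncation step described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: tokens are approximated by whitespace-separated words (a real setup would use the model's tokenizer), the record layout with `prompt`, `chosen`, and `rejected` keys is assumed, and `truncate_pair` is a hypothetical helper name.

```python
def truncate_pair(example, keep_ratio=0.5):
    """Keep only the leading fraction of each response in a preference pair.

    example: dict with "prompt", "chosen", and "rejected" strings (assumed layout).
    keep_ratio: fraction of response tokens to keep, counted from the start.
    """
    def head(text):
        # ASSUMPTION: whitespace words stand in for tokenizer tokens.
        tokens = text.split()
        keep = max(1, int(len(tokens) * keep_ratio))
        return " ".join(tokens[:keep])

    return {
        "prompt": example["prompt"],          # the prompt is left untouched
        "chosen": head(example["chosen"]),
        "rejected": head(example["rejected"]),
    }


pair = {"prompt": "Explain DPO.", "chosen": "a b c d", "rejected": "x y z"}
truncated = truncate_pair(pair, keep_ratio=0.5)
```

A reward model or DPO model would then be trained on the truncated pairs exactly as on the full ones, which is what makes the comparison in the abstract a controlled one.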

[190] MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

Xiaoyuan Li,Keqin Bao,Yubo Ma,Moxin Li,Wenjie Wang,Rui Men,Yichang Zhang,Fuli Feng,Dayiheng Liu,Junyang Lin

Main category: cs.CL

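TL;DR: MTR-Bench is a dataset and framework for evaluating the multi-turn reasoning of large language models, filling a gap in the evaluation of interactive tasks.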

Details Motivation: Current LLM evaluations focus mainly on single-turn reasoning and lack comprehensive assessment of interactive tasks, owing to the absence of datasets and automatic evaluation protocols. Method: Present MTR-Bench, comprising 4 classes, 40 tasks, and 3600 instances that cover diverse reasoning capabilities, together with a fully automated evaluation framework. Result: Experiments show that even cutting-edge reasoning models fall short on multi-turn interactive tasks. Conclusion: MTR-Bench provides valuable insights for future research on interactive AI systems. Abstract: Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for LLMs' Multi-Turn Reasoning evaluation. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities, fine-grained difficulty granularity, and necessitates multi-turn interactions with the environments. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even the cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks. Further analysis of these results yields valuable insights for future research in interactive AI systems.

[191] Conformal Language Model Reasoning with Coherent Factuality

Maxon Rubin-Toles,Maya Gambhir,Keshav Ramji,Aaron Roth,Surbhi Goel

Main category: cs.CL

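TL;DR: The paper proposes a method built on "coherent factuality" that uses deducibility graphs and split conformal prediction to provide guarantees on the correctness of language model outputs in reasoning tasks.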

Details Motivation: Language models are increasingly used in important decisions, so ensuring the correctness of their outputs is crucial. Existing methods work for information retrieval but do not suit reasoning tasks, which depend on contextual logic. Method: Define "coherent factuality" and develop a conformal-prediction-based method that guarantees it by operating on subgraphs of a deducibility graph. Result: On mathematical reasoning problems from the MATH and FELM datasets, the method produces correct and substantiated orderings of claims and, under the stricter definition, reaches 90% factuality while retaining 80% or more of the claims. Conclusion: The method effectively guarantees coherent factuality of language model outputs in reasoning tasks and demonstrates the utility of the deducibility-graph-guided approach. Abstract: Language models are increasingly being used in important decision pipelines, so ensuring the correctness of their outputs is crucial. Recent work has proposed evaluating the "factuality" of claims decomposed from a language model generation and applying conformal prediction techniques to filter out those claims that are not factual. This can be effective for tasks such as information retrieval, where constituent claims may be evaluated in isolation for factuality, but is not appropriate for reasoning tasks, as steps of a logical argument can be evaluated for correctness only within the context of the claims that precede them. To capture this, we define "coherent factuality" and develop a conformal-prediction-based method to guarantee coherent factuality for language model outputs. Our approach applies split conformal prediction to subgraphs within a "deducibility graph" that represents the steps of a reasoning problem. We evaluate our method on mathematical reasoning problems from the MATH and FELM datasets and find that our algorithm consistently produces correct and substantiated orderings of claims, achieving coherent factuality across target coverage levels. Moreover, we achieve 90% factuality on our stricter definition while retaining 80% or more of the original claims, highlighting the utility of our deducibility-graph-guided approach.
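
The two ingredients named in this abstract, a split-conformal threshold and a dependency-aware filter, can be sketched generically. This is not the authors' algorithm: the threshold below is the standard split conformal quantile computed from held-out nonconformity scores, and the filter is an assumed simplification in which a claim is kept only if its own score passes and every claim it depends on was kept. The claim tuples, score values, and function names are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Standard split conformal quantile at miscoverage level alpha.

    cal_scores: nonconformity scores from a held-out calibration set.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested level.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def coherent_filter(claims, threshold):
    """Keep a claim only if it passes the threshold and all its parents were kept.

    claims: list of (name, score, parent_names), ASSUMED to be listed so that
    parents appear before the claims that depend on them.
    """
    kept = set()
    for name, score, parents in claims:
        if score <= threshold and all(parent in kept for parent in parents):
            kept.add(name)
    return kept


threshold = conformal_threshold([0.3, 0.1, 0.2], alpha=0.5)
claims = [("a", 0.1, []), ("b", 0.5, ["a"]), ("c", 0.1, ["b"])]
kept = coherent_filter(claims, threshold)
```

In the toy run, claim "c" has a good score of its own but is dropped because the step it builds on was dropped, which is the difference between filtering claims in isolation and filtering them coherently.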

[192] Relative Bias: A Comparative Framework for Quantifying Bias in LLMs

Alireza Arbabi,Florian Kerschbaum

Main category: cs.CL

TL;DR: The paper proposes the Relative Bias framework, which assesses LLM bias through two methods (Embedding Transformation analysis and LLM-as-a-Judge), and validates it in case studies.

Details Motivation: The widespread deployment of large language models (LLMs) has raised concerns about their biases, yet quantifying bias remains a challenge. Method: Propose the Relative Bias framework, comprising Embedding Transformation analysis and LLM-as-a-Judge. Result: The two scoring methods agree strongly in the case studies, supporting the framework as systematic and scalable. Conclusion: The framework offers a statistically grounded and scalable approach to bias analysis in LLMs. Abstract: The growing deployment of large language models (LLMs) has amplified concerns regarding their inherent biases, raising critical questions about their fairness, safety, and societal impact. However, quantifying LLM bias remains a fundamental challenge, complicated by the ambiguity of what "bias" entails. This challenge grows as new models emerge rapidly and gain widespread use, while introducing potential biases that have not been systematically assessed. In this paper, we propose the Relative Bias framework, a method designed to assess how an LLM's behavior deviates from other LLMs within a specified target domain. We introduce two complementary methodologies: (1) Embedding Transformation analysis, which captures relative bias patterns through sentence representations over the embedding space, and (2) LLM-as-a-Judge, which employs a language model to evaluate outputs comparatively. Applying our framework to several case studies on bias and alignment scenarios, followed by statistical tests for validation, we find strong alignment between the two scoring methods, offering a systematic, scalable, and statistically grounded approach for comparative bias analysis in LLMs.

[193] LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

Chaochen Gao,Xing Wu,Zijia Lin,Debing Zhang,Songlin Hu

Main category: cs.CL

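TL;DR: LongMagpie is a self-synthesis framework that automatically generates large-scale long-context instruction data without human annotation and outperforms existing methods.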

Details Motivation: High-quality long-context instruction data is essential for aligning long-context LLMs, but existing methods are costly and limited in quality. Method: Use an aligned long-context LLM to automatically generate document-query pairs and their responses, synthesizing high-quality instruction data. Result: Leading performance on HELMET, RULER, and LongBench v2 while remaining competitive on short-context tasks. Conclusion: LongMagpie is a simple, open, diverse, and scalable approach to synthesizing long-context instruction data. Abstract: High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and LongBench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.

[194] When can isotropy help adapt LLMs' next word prediction to numerical domains?

Rashed Shelim,Shengzhe Xu,Walid Saad,Naren Ramakrishnan

Main category: cs.CL

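TL;DR: This paper examines when pre-trained large language models (LLMs) can be adapted to numerical-domain tasks, proposing an analysis based on isotropy in the contextual embedding space to address LLM hallucination in numerical prediction.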

Details Motivation: Although LLMs perform well on numerical-domain tasks, their hallucinations can have severe consequences in critical areas such as energy, finance, and healthcare. The paper seeks theoretical grounds for reliable and accurate LLM predictions on numerical tasks. Method: Propose an isotropy-based analysis that studies performance guarantees for numerical prediction using a log-linear model with a softmax output layer. Result: The hidden representations of LLM embeddings must have a structure that accounts for the shift-invariance of the softmax function; isotropy preserves this structure and thereby provides a performance guarantee. Conclusion: Through theoretical analysis and experiments, the paper provides performance guarantees for LLMs in numerical domains and shows how characteristics of numeric data affect model behavior. Abstract: Recent studies have shown that vector representations of contextual embeddings learned by pre-trained large language models (LLMs) are effective in various downstream tasks in numerical domains. Despite their significant benefits, the tendency of LLMs to hallucinate in such domains can have severe consequences in applications such as energy, nature, finance, healthcare, retail and transportation, among others. To guarantee prediction reliability and accuracy in numerical domains, it is necessary to open the black box and provide performance guarantees through explanation. However, there is little theoretical understanding of when pre-trained language models help solve numeric downstream tasks. This paper seeks to bridge this gap by understanding when the next-word prediction capability of LLMs can be adapted to numerical domains through a novel analysis based on the concept of isotropy in the contextual embedding space. Specifically, we consider a log-linear model for LLMs in which numeric data can be predicted from its context through a network with softmax in the output layer of LLMs (i.e., language model head in self-attention). We demonstrate that, in order to achieve state-of-the-art performance in numerical domains, the hidden representations of the LLM embeddings must possess a structure that accounts for the shift-invariance of the softmax function. 
By formulating a gradient structure of self-attention in pre-trained models, we show how the isotropic property of LLM embeddings in contextual embedding space preserves the underlying structure of representations, thereby resolving the shift-invariance problem and providing a performance guarantee. Experiments show that different characteristics of numeric data and model architecture could have different impacts on isotropy.
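
The "shift-invariance of the softmax function" that this abstract relies on is the identity softmax(z + c·1) = softmax(z) for any constant c: adding the same number to every logit leaves the output probabilities unchanged, so the logits are determined only up to an additive constant. The short check below demonstrates that property only; it does not reproduce the paper's isotropy analysis, and the example logits are arbitrary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtracting the max is itself an application of shift-invariance
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0]
shifted = [z + 100.0 for z in logits]  # same constant added to every logit

probs = softmax(logits)
probs_shifted = softmax(shifted)

# The two distributions coincide even though the logits differ by 100.
max_gap = max(abs(p - q) for p, q in zip(probs, probs_shifted))
```

One consequence is that the output probabilities alone cannot pin down the absolute scale of the logits, which is why the abstract asks for additional structure in the hidden representations.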

[195] Foundation Models for Geospatial Reasoning: Assessing Capabilities of Large Language Models in Understanding Geometries and Topological Spatial Relations

Yuhan Ji,Song Gao,Ying Nie,Ivan Majić,Krzysztof Janowicz

Main category: cs.CL

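TL;DR: The paper examines the challenges of applying AI foundation models to geospatial data, studies how well WKT representations and spatial relations are preserved in LLMs, and compares three approaches. GPT-4 performs best at topological spatial relation inference.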

Details Motivation: Applying AI foundation models directly to geospatial data is challenging, mainly because of their limited ability to represent and reason about geographic entities and spatial relations. Method: Uses three approaches (geometry embedding, prompt engineering, and everyday language) for spatial reasoning tasks, evaluating GPT-3.5-turbo, GPT-4, and DeepSeek-R1-14B. Result: GPT-4 performs best on topological spatial relation inference, with accuracy above 0.66. LLM-generated geometries improve geographic entity retrieval. Conclusion: The study offers insights for improving the geographic knowledge of LLMs and supports the development of geo-foundation models capable of geospatial reasoning. Abstract: Applying AI foundation models directly to geospatial datasets remains challenging due to their limited ability to represent and reason with geographical entities, specifically vector-based geometries and natural language descriptions of complex spatial relations. To address these issues, we investigate the extent to which a well-known text (WKT) representation of geometries and their spatial relations (e.g., topological predicates) are preserved during spatial reasoning when the geospatial vector data are passed to large language models (LLMs) including GPT-3.5-turbo, GPT-4, and DeepSeek-R1-14B. Our workflow employs three distinct approaches to complete the spatial reasoning tasks for comparison, i.e., geometry embedding-based, prompt engineering-based, and everyday language-based evaluation. Our experimental results demonstrate that both the embedding-based and prompt engineering-based approaches to geospatial question-answering tasks with GPT models can achieve an accuracy of over 0.6 on average for the identification of topological spatial relations between two geometries. Among the evaluated models, GPT-4 with few-shot prompting achieved the highest performance with over 0.66 accuracy on topological spatial relation inference. Additionally, the GPT-based reasoner properly comprehends inverse topological spatial relations, and including an LLM-generated geometry can enhance the effectiveness of geographic entity retrieval. GPT-4 also exhibits the ability to translate certain vernacular descriptions about places into formal topological relations, and adding the geometry-type or place-type context in prompts may improve inference accuracy, but it varies by instance.
The performance of these spatial reasoning tasks offers valuable insights for the refinement of LLMs with geographical knowledge towards the development of geo-foundation models capable of geospatial reasoning.

[196] Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands

Kristin Qi,Youxiang Zhu,Caroline Summerour,John A. Batsis,Xiaohui Liang

Main category: cs.CL

TL;DR: The study proposes Cog-TiPRO, a framework combining LLMs and HuBERT to detect cognitive decline through voice assistant systems, reaching 73.80% accuracy.

Details Motivation: Traditional methods for detecting cognitive decline are time-consuming and unsuited to frequent monitoring, so non-invasive tools are needed. Method: Combines LLM-driven prompt refinement, HuBERT-based acoustic feature extraction, and transformer-based temporal modeling. Result: Detects mild cognitive impairment (MCI) with 73.80% accuracy and a 72.67% F1-score, outperforming the baseline by 27.13%. Conclusion: Voice assistant systems combined with the Cog-TiPRO framework can effectively detect cognitive decline and identify distinctive linguistic features. Abstract: Early detection of cognitive decline is crucial for enabling interventions that can slow neurodegenerative disease progression. Traditional diagnostic approaches rely on labor-intensive clinical assessments, which are impractical for frequent monitoring. Our pilot study investigates voice assistant systems (VAS) as non-invasive tools for detecting cognitive decline through longitudinal analysis of speech patterns in voice commands. Over an 18-month period, we collected voice commands from 35 older adults, with 15 participants providing daily at-home VAS interactions. To address the challenges of analyzing these short, unstructured and noisy commands, we propose Cog-TiPRO, a framework that combines (1) LLM-driven iterative prompt refinement for linguistic feature extraction, (2) HuBERT-based acoustic feature extraction, and (3) transformer-based temporal modeling. Using iTransformer, our approach achieves 73.80% accuracy and 72.67% F1-score in detecting MCI, outperforming its baseline by 27.13%. Through our LLM approach, we identify linguistic features that uniquely characterize everyday command usage patterns in individuals experiencing cognitive decline.

[197] EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models

Wanghan Xu,Xiangyu Zhao,Yuhao Zhou,Xiaoyu Yue,Ben Fei,Fenghua Ling,Wenlong Zhang,Lei Bai

Main category: cs.CL

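TL;DR: The paper presents a comprehensive benchmark dedicated to Earth science for evaluating the scientific exploration capabilities of large language models (LLMs) in the field, from foundational knowledge to advanced skills.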

Details Motivation: Existing benchmarks lack a dedicated evaluation of Earth science and overlook LLMs' capabilities in open-ended scientific exploration. Method: Builds two QA datasets (Earth-Iron and Earth-Silver) and one open-ended dialogue dataset (Earth-Gold) from 100,000 research papers, covering multiple Earth science fields and task categories. Result: Experiments reveal limitations of 11 leading LLMs across domains and tasks, indicating that their scientific exploration capabilities need improvement. Conclusion: The benchmark provides a comprehensive tool for evaluating LLMs in Earth science and points to directions for future improvement. Abstract: Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks for fields such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs' capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold with new metrics, a dataset comprising open-ended multi-turn dialogues specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities. The benchmark is available on https://huggingface.co/ai-earth.

[198] Data Doping or True Intelligence? Evaluating the Transferability of Injected Knowledge in LLMs

Essa Jan,Moiz Ali,Muhammad Saram Hassan,Fareed Zaffar,Yasir Zaki

Main category: cs.CL

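TL;DR: Comprehension-intensive fine-tuning tasks (such as question answering and fill-in-the-blank) retain knowledge far better than mapping-oriented tasks (such as translation or text-to-JSON conversion), and larger models retain more, but integration of the knowledge into broader contexts remains limited.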

Details Motivation: As the knowledge of large language models (LLMs) becomes outdated, efficient methods are needed to update them, especially when injecting proprietary information. Method: Compares how different fine-tuning tasks (comprehension-intensive versus mapping-oriented) affect knowledge retention, and analyzes model scale and semantic integration. Result: Comprehension-intensive tasks achieve much higher knowledge retention (48%) than mapping-oriented tasks (17%-20%), and larger models retain more, but semantic integration remains limited. Conclusion: Task selection is crucial when updating LLM knowledge; effective knowledge injection depends not only on data exposure but also on the depth of cognitive engagement. Abstract: As the knowledge of large language models (LLMs) becomes outdated over time, there is a growing need for efficient methods to update them, especially when injecting proprietary information. Our study reveals that comprehension-intensive fine-tuning tasks (e.g., question answering and fill-in-the-blank) achieve substantially higher knowledge retention rates (48%) compared to mapping-oriented tasks like translation (17%) or text-to-JSON conversion (20%), despite exposure to identical factual content. We demonstrate that this pattern persists across model architectures and follows scaling laws, with larger models showing improved retention across all task types. However, all models exhibit significant performance drops when applying injected knowledge in broader contexts, suggesting limited semantic integration. These findings highlight the importance of task selection in updating LLM knowledge, showing that effective knowledge injection relies not just on data exposure but on the depth of cognitive engagement during fine-tuning.

[199] MDIT-Bench: Evaluating the Dual-Implicit Toxicity in Large Multimodal Models

Bohan Jin,Shuhan Qi,Kehai Chen,Xinyi Guo,Xuan Wang

Main category: cs.CL

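TL;DR: The paper introduces a new type of toxicity, dual-implicit toxicity, and builds the Multimodal Dual-Implicit Toxicity Benchmark (MDIT-Bench) to evaluate model sensitivity to it. Experiments show that existing large multimodal models (LMMs) handle implicit toxicity poorly.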

Details Motivation: Current research focuses mainly on explicit toxicity and neglects implicit toxicity such as prejudice and discrimination. To fill this gap, the paper proposes the concept of dual-implicit toxicity and a way to evaluate it. Method: Builds the MDIT-Dataset with the Multi-stage Human-in-loop In-context Generation method and develops MDIT-Bench on top of it, with 317,638 questions covering 12 categories, 23 subcategories, and 780 topics. Result: Experiments on 13 prominent LMMs show that these models handle dual-implicit toxicity poorly, with performance dropping significantly at the hard level. Conclusion: LMMs still contain a significant amount of activatable hidden toxicity and need further improvement. The data are publicly available. Abstract: The widespread use of Large Multimodal Models (LMMs) has raised concerns about model toxicity. However, current research mainly focuses on explicit toxicity, with less attention to more implicit toxicity regarding prejudice and discrimination. To address this limitation, we introduce a subtler type of toxicity named dual-implicit toxicity and a novel toxicity benchmark termed MDIT-Bench: Multimodal Dual-Implicit Toxicity Benchmark. Specifically, we first create the MDIT-Dataset with dual-implicit toxicity using the proposed Multi-stage Human-in-loop In-context Generation method. Based on this dataset, we construct the MDIT-Bench, a benchmark for evaluating the sensitivity of models to dual-implicit toxicity, with 317,638 questions covering 12 categories, 23 subcategories, and 780 topics. MDIT-Bench includes three difficulty levels, and we propose a metric to measure the toxicity gap exhibited by the model across them. In the experiment, we ran MDIT-Bench on 13 prominent LMMs, and the results show that these LMMs cannot handle dual-implicit toxicity effectively. The model's performance drops significantly at the hard level, revealing that these LMMs still contain a significant amount of hidden but activatable toxicity. Data are available at https://github.com/nuo1nuo/MDIT-Bench.

[200] Large Language Models for Predictive Analysis: How Far Are They?

Qin Chen,Yuanyi Ren,Xiaojun Ma,Yuyang Shi

Main category: cs.CL

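TL;DR: The paper introduces the PredictiQ benchmark for evaluating large language models (LLMs) on predictive analysis and finds that existing models still face challenges.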

Details Motivation: Predictive analysis is central to modern decision-making, but the capability of LLMs in this area has not been systematically evaluated. Method: Designs the PredictiQ benchmark, with 1130 sophisticated queries drawn from 44 datasets across 8 fields, and evaluates 12 well-known LLMs. Result: Existing LLMs still face considerable challenges in predictive analysis. Conclusion: Applying LLMs to predictive analysis requires further improvement. Abstract: Predictive analysis is a cornerstone of modern decision-making, with applications in various domains. Large Language Models (LLMs) have emerged as powerful tools in enabling nuanced, knowledge-intensive conversations, thus aiding in complex decision-making tasks. With the burgeoning expectation to harness LLMs for predictive analysis, there is an urgent need to systematically assess their capability in this domain. However, there is a lack of relevant evaluations in existing studies. To bridge this gap, we introduce the PredictiQ benchmark, which integrates 1130 sophisticated predictive analysis queries originating from 44 real-world datasets of 8 diverse fields. We design an evaluation protocol considering text analysis, code generation, and their alignment. Twelve renowned LLMs are evaluated, offering insights into their practical use in predictive analysis. Generally, we believe that existing LLMs still face considerable challenges in conducting predictive analysis. See https://github.com/Cqkkkkkk/PredictiQ.

[201] Bayesian Optimization for Enhanced Language Models: Optimizing Acquisition Functions

Zishuo Bao,Yibo Liu,Changyutao Qiu

Main category: cs.CL

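TL;DR: The paper proposes Bilevel-BO-SWA, which combines bilevel Bayesian optimization (BO) with model fusion to improve fine-tuning of large language models, mixing acquisition functions (such as EI and UCB) in nested optimization loops to improve generalization.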

Details Motivation: Existing Bayesian optimization methods ignore how sensitive acquisition functions are to training loss and validation performance, which leads to poor fine-tuning results. Method: Uses a bilevel BO strategy in which the inner loop minimizes training loss and the outer loop optimizes the validation metric, combined with model fusion. Result: On GLUE tasks with RoBERTa-base, generalization improves and fine-tuning improves by up to 2.7%. Conclusion: A bilevel BO strategy with mixed acquisition functions can effectively improve language model fine-tuning. Abstract: With the rise of different language model architectures, fine-tuning is becoming even more important for downstream tasks, yet finding proper hyperparameters for fine-tuning is messy. Although Bayesian optimization (BO) has been tried for hyperparameter tuning, most existing methods overlook the fact that BO relies on careful choices of acquisition functions, the components of BO that guide how much to explore versus exploit during optimization. Different acquisition functions have different levels of sensitivity to training loss and validation performance, yet existing methods often apply a single acquisition function regardless of whether training and validation performance are sensitive to it. This work introduces Bilevel-BO-SWA, a model fusion approach coupled with a bilevel BO strategy to improve the fine-tuning of large language models. It mixes acquisition functions such as EI and UCB in nested optimization loops, where the inner loop minimizes training loss while the outer loop optimizes the validation metric. Experiments on GLUE tasks using RoBERTa-base show that using EI and UCB improves generalization, and fine-tuning can be improved by up to 2.7%.
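
The two acquisition functions named in this abstract have standard closed forms when the surrogate model gives a Gaussian prediction with mean `mu` and standard deviation `sigma`. The sketch below uses the usual textbook definitions for a maximization problem. The abstract does not say how the paper mixes EI and UCB across its nested loops, so `mixed_acquisition` and its `weight` parameter are hypothetical and shown only to make the idea of a mixture concrete.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected Improvement (EI) over the best observed value, for maximization."""
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    return improvement * normal_cdf(z) + sigma * normal_pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """Upper Confidence Bound (UCB): mean plus an exploration bonus."""
    return mu + kappa * sigma

def mixed_acquisition(mu, sigma, best, weight=0.5, kappa=2.0):
    """Hypothetical convex mixture of EI and UCB (an assumption, not the paper's rule)."""
    ei = expected_improvement(mu, sigma, best)
    ucb = upper_confidence_bound(mu, sigma, kappa)
    return weight * ei + (1.0 - weight) * ucb
```

EI rewards candidates likely to beat the current best, while UCB rewards high predicted mean plus uncertainty; `kappa` and `xi` control how strongly each favors exploration.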

[202] Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN

Yao Xu,Mingyu Xu,Fangyu Lei,Wangtao Sun,Xiangrong Zeng,Bingning Wang,Guang Liu,Shizhu He,Jun Zhao,Kang Liu

Main category: cs.CL

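TL;DR: The paper proposes Shift-FFN, which dynamically adjusts the representation differences between adjacent tokens to reduce cyclical reasoning in long chain-of-thought reasoning and improve model performance.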

Details Motivation: Existing methods are prone to cyclical reasoning on long chain-of-thought tasks, which hurts model performance. Method: Proposes Shift-FFN, which edits the current token's representation to amplify differences between adjacent tokens, and combines it with LoRA fine-tuning. Result: Experiments show that Shift-FFN combined with LoRA outperforms full-parameter fine-tuning and standard LoRA on mathematical reasoning tasks. Conclusion: Shift-FFN effectively reduces cyclical reasoning and improves accuracy and stability on long chain-of-thought tasks. Abstract: Recently, models such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable performance on complex reasoning tasks through Long Chain-of-Thought (Long-CoT) reasoning. Although distilling this capability into student models significantly enhances their performance, this paper finds that fine-tuning LLMs with full parameters or LoRA with a low rank on long CoT data often leads to Cyclical Reasoning, where models repeatedly reiterate previous inference steps until the maximum length limit. Further analysis reveals that smaller differences in representations between adjacent tokens correlate with a higher tendency toward Cyclical Reasoning. To mitigate this issue, this paper proposes Shift Feedforward Networks (Shift-FFN), a novel approach that edits the current token's representation with the previous one before inputting it to FFN. This architecture dynamically amplifies the representation differences between adjacent tokens. Extensive experiments on multiple mathematical reasoning tasks demonstrate that LoRA combined with Shift-FFN achieves higher accuracy and a lower rate of Cyclical Reasoning across various data sizes compared to full fine-tuning and standard LoRA. Our data and code are available at https://anonymous.4open.science/r/Shift-FFN
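
The diagnostic behind this abstract, how different adjacent token representations are, can be illustrated with a small measurement. This is a sketch under assumptions: the abstract does not state which distance the authors use, so cosine distance is chosen here for illustration, and the two toy sequences of 2-dimensional vectors are made up. It does not implement Shift-FFN itself.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_adjacent_distance(hidden_states):
    """Average (1 - cosine similarity) between consecutive token vectors."""
    pairs = list(zip(hidden_states, hidden_states[1:]))
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)

# Toy sequences (assumed values): one nearly constant, one changing at every step.
repetitive = [[1.0, 0.0], [1.0, 0.01], [1.0, 0.02]]
varied = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

repetitive_score = mean_adjacent_distance(repetitive)
varied_score = mean_adjacent_distance(varied)
```

A low score means consecutive tokens have nearly identical representations, the condition the abstract associates with a higher tendency toward Cyclical Reasoning.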

[203] PersonaBOT: Bringing Customer Personas to Life with LLMs and RAG

Muhammed Rizwan,Lars Carlsson,Mohammad Loni

Main category: cs.CL

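TL;DR: The paper explores generating synthetic customer personas with large language models (LLMs) and integrating them into a RAG chatbot to support business decisions. Few-Shot prompting generates more complete personas, while CoT prompting is better in efficiency and resource use. After knowledge base augmentation, the chatbot's accuracy and usefulness improve.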

Details Motivation: Developing customer personas with traditional qualitative methods is time-consuming and hard to scale; LLMs make it possible to generate synthetic personas efficiently. Method: 1. Develop a persona-based RAG chatbot; 2. generate synthetic personas with Few-Shot and CoT prompting; 3. evaluate persona quality with McNemar's test; 4. augment the knowledge base and assess the improvement. Result: Few-Shot prompting produces more complete personas, while CoT prompting is more efficient. After knowledge base augmentation, chatbot accuracy rose from 5.88 to 6.42 (on a 10-point scale), and 81.82% of participants found the system useful. Conclusion: Synthetic persona generation and knowledge base augmentation improve the chatbot's support for business decisions; Few-Shot and CoT prompting each have their own strengths. Abstract: The introduction of Large Language Models (LLMs) has significantly transformed Natural Language Processing (NLP) applications by enabling more advanced analysis of customer personas. At Volvo Construction Equipment (VCE), customer personas have traditionally been developed through qualitative methods, which are time-consuming and lack scalability. The main objective of this paper is to generate synthetic customer personas and integrate them into a Retrieval-Augmented Generation (RAG) chatbot to support decision-making in business processes. To this end, we first focus on developing a persona-based RAG chatbot integrated with verified personas. Next, synthetic personas are generated using Few-Shot and Chain-of-Thought (CoT) prompting techniques and evaluated based on completeness, relevance, and consistency using McNemar's test. In the final step, the chatbot's knowledge base is augmented with synthetic personas and additional segment information to assess improvements in response accuracy and practical utility. Key findings indicate that Few-Shot prompting outperformed CoT in generating more complete personas, while CoT demonstrated greater efficiency in terms of response time and token usage. After augmenting the knowledge base, the average accuracy rating of the chatbot increased from 5.88 to 6.42 on a 10-point scale, and 81.82% of participants found the updated system useful in business contexts.
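
McNemar's test, used in this abstract to compare the two prompting techniques, works on paired binary outcomes and only looks at the discordant pairs. The sketch below uses the standard continuity-corrected statistic with a chi-square approximation (one degree of freedom). The counts are invented for illustration and are not the paper's data; the abstract does not say how the pairs were formed.

```python
import math

def mcnemar_test(b, c):
    """Continuity-corrected McNemar test.

    b: pairs where method A passed and method B failed.
    c: pairs where method A failed and method B passed.
    Returns (statistic, p_value) using the chi-square(1) approximation.
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, nothing to test
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: Few-Shot complete but CoT not (b), and the reverse (c).
statistic, p_value = mcnemar_test(b=10, c=2)
```

With few discordant pairs the chi-square approximation is rough, and an exact binomial version of the test is the usual alternative.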

[204] Harry Potter is Still Here! Probing Knowledge Leakage in Targeted Unlearned Large Language Models via Automated Adversarial Prompting

Bang Trinh Tran To,Thai Le

Main category: cs.CL

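TL;DR: The LURK framework probes unlearned LLMs for hidden retained knowledge with adversarial suffix prompts, revealing the limitations of current unlearning evaluation standards.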

Details Motivation: To study hidden knowledge that may remain in unlearned LLMs and assess the robustness of current unlearning methods. Method: Uses adversarial suffix prompting to automatically generate probing prompts for the Harry Potter domain, indirectly revealing latent knowledge. Result: Experiments show that even models deemed successfully unlearned can still leak information under adversarial conditions. Conclusion: LURK offers a more rigorous diagnostic tool for evaluating unlearning algorithms and exposes the shortcomings of current standards. Abstract: This work presents LURK (Latent UnleaRned Knowledge), a novel framework that probes for hidden retained knowledge in unlearned LLMs through adversarial suffix prompting. LURK automatically generates adversarial prompt suffixes designed to elicit residual knowledge about the Harry Potter domain, a commonly used benchmark for unlearning. Our experiments reveal that even models deemed successfully unlearned can leak idiosyncratic information under targeted adversarial conditions, highlighting critical limitations of current unlearning evaluation standards. By uncovering latent knowledge through indirect probing, LURK offers a more rigorous and diagnostic tool for assessing the robustness of unlearning algorithms. All code will be publicly available.

[205] CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation

Ibrahim Ethem Hamamci,Sezgin Er,Suprosanna Shit,Hadrien Reynaud,Bernhard Kainz,Bjoern Menze

Main category: cs.CL

TL;DR: Proposes the CRG Score, a distribution-aware and adaptable metric for evaluating clinical relevance in radiology report generation.

Details Motivation: Existing NLG metrics fail to capture clinical correctness, LLM-based metrics lack generalizability, and clinical accuracy metrics are sensitive to class imbalance. Method: The CRG Score evaluates the clinically relevant abnormalities explicitly described in reference reports, supports binary and structured labels, and can be paired with an LLM for feature extraction. Result: By balancing penalties according to the label distribution, the CRG Score enables fairer and more robust evaluation. Conclusion: The CRG Score is a clinically aligned reward function suited to radiology report generation. Abstract: Evaluating long-context radiology report generation is challenging. NLG metrics fail to capture clinical correctness, while LLM-based metrics often lack generalizability. Clinical accuracy metrics are more relevant but are sensitive to class imbalance, frequently favoring trivial predictions. We propose the CRG Score, a distribution-aware and adaptable metric that evaluates only clinically relevant abnormalities explicitly described in reference reports. CRG supports both binary and structured labels (e.g., type, location) and can be paired with any LLM for feature extraction. By balancing penalties based on label distribution, it enables fairer, more robust evaluation and serves as a clinically aligned reward function.

[206] Next Token Perception Score: Analytical Assessment of your LLM Perception Skills

Yu-Ang Cheng,Leyang Hu,Hai Huang,Randall Balestriero

Main category: cs.CL

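TL;DR: The paper proposes NTPS, a new metric that quantifies the alignment between autoregressive pretraining and downstream perception tasks, and validates it experimentally.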

Details Motivation: Autoregressive pretraining is widely used in LLMs, but its features perform inconsistently on downstream perception tasks, so a way to quantify the alignment is needed. Method: Introduces the NTPS metric, which measures the overlap between autoregressive and perception feature subspaces, and validates experimentally its correlation with linear probe accuracy. Result: NTPS correlates strongly with linear probe accuracy across 12 NLP datasets and 8 pretrained models, and NTPS increases after LoRA fine-tuning. Conclusion: NTPS provides theoretical and practical tools for assessing LLM perception skills and can predict the effect of LoRA fine-tuning. Abstract: Autoregressive pretraining has become the de facto paradigm for learning general-purpose representations in large language models (LLMs). However, linear probe performance across downstream perception tasks shows substantial variability, suggesting that features optimized for next-token prediction do not consistently transfer well to downstream perception tasks. We demonstrate that representations learned via autoregression capture features that may lie outside the subspaces most informative for perception. To quantify the (mis)alignment between autoregressive pretraining and downstream perception, we introduce the Next Token Perception Score (NTPS), a score derived under a linear setting that measures the overlap between autoregressive and perception feature subspaces. This metric can be easily computed in closed form from pretrained representations and labeled data, and is proven to both upper- and lower-bound the excess loss. Empirically, we show that NTPS correlates strongly with linear probe accuracy across 12 diverse NLP datasets and eight pretrained models ranging from 270M to 8B parameters, confirming its utility as a measure of alignment. Furthermore, we show that NTPS increases following low-rank adaptation (LoRA) fine-tuning, especially in large models, suggesting that LoRA aligning representations to perception tasks enhances subspace overlap and thus improves downstream performance. More importantly, we find that NTPS reliably predicts the additional accuracy gains attained by LoRA fine-tuning, thereby providing a lightweight prescreening tool for LoRA adaptation. Our results offer both theoretical insights and practical tools for analytically assessing LLM perception skills.

[207] FB-RAG: Improving RAG with Forward and Backward Lookup

Kushal Chawla,Alfy Samuel,Anoop Kumar,Daben Liu

Main category: cs.CL

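TL;DR: FB-RAG combines backward and forward lookup to improve retrieval in RAG systems, reducing the influence of irrelevant content while improving answer accuracy.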

Details Motivation: RAG performance depends on retrieval quality and context size: a large context may include irrelevant information, while a small one may lose key information. FB-RAG aims to resolve this tension. Method: FB-RAG combines backward lookup (overlap with the query) and forward lookup (overlap with candidate answers) to retrieve the most relevant context chunks. Result: Evaluations on 9 datasets show that FB-RAG outperforms RAG and long-context baselines while reducing latency. Conclusion: FB-RAG effectively improves RAG performance and offers specific guidance for future work. Abstract: The performance of Retrieval Augmented Generation (RAG) systems relies heavily on the retriever quality and the size of the retrieved context. A large enough context ensures that the relevant information is present in the input context for the LLM, but also incorporates irrelevant content that has been shown to confuse the models. On the other hand, a smaller context reduces the irrelevant information, but it often comes at the risk of losing important information necessary to answer the input question. This duality is especially challenging to manage for complex queries that contain little information to retrieve the relevant chunks from the full context. To address this, we present a novel framework, called FB-RAG, which enhances the RAG pipeline by relying on a combination of backward lookup (overlap with the query) and forward lookup (overlap with candidate reasons and answers) to retrieve specific context chunks that are the most relevant for answering the input query. Our evaluations on 9 datasets from two leading benchmarks show that FB-RAG consistently outperforms RAG and Long Context baselines developed recently for these benchmarks. We further show that FB-RAG can improve performance while reducing latency. We perform qualitative analysis of the strengths and shortcomings of our approach, providing specific insights to guide future work.
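
The combination of backward and forward lookup described in this abstract can be sketched with a simple token-overlap score. Everything below is an illustrative assumption rather than the paper's implementation: the overlap measure, the `forward_weight` parameter, and the toy query, chunks, and candidate answer are all made up, and the candidate answer is assumed to come from some earlier generation step.

```python
def tokens(text):
    return set(text.lower().split())

def overlap(a, b):
    """Fraction of the tokens in `a` that also appear in `b`."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta) if ta else 0.0

def score_chunk(chunk, query, candidates, forward_weight=0.5):
    backward = overlap(query, chunk)  # backward lookup: overlap with the query
    forward = max((overlap(c, chunk) for c in candidates), default=0.0)  # forward lookup
    return (1.0 - forward_weight) * backward + forward_weight * forward

def retrieve(chunks, query, candidates, k=1, forward_weight=0.5):
    ranked = sorted(
        chunks,
        key=lambda ch: score_chunk(ch, query, candidates, forward_weight),
        reverse=True,
    )
    return ranked[:k]

# Toy data (hypothetical).
query = "who founded the company"
chunks = [
    "the company sells tractors in europe",
    "maria keller started the firm in 1990",
]
candidates = ["maria keller founded it in 1990"]

backward_only = retrieve(chunks, query, candidates, forward_weight=0.0)
combined = retrieve(chunks, query, candidates, forward_weight=0.5)
```

In this toy case the query shares more words with the first chunk, so backward lookup alone ranks it first, while the overlap with the candidate answer pulls the second chunk to the top once forward lookup is included.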

[208] Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs

Kangda Wei,Hasnat Md Abdullah,Ruihong Huang

Main category: cs.CL

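TL;DR: Proposes a new framework that generates story pairs with gender-neutral moral judgments and uses DPO to optimize LLMs, significantly reducing gender bias.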

Details Motivation: To address the gender bias that is widespread in LLMs and ensure fair treatment of subjects of different genders. Method: Generates pairs of morally ambiguous stories that are structurally identical but differ in gender, guides the model toward gender-neutral judgments, and optimizes the model with DPO. Result: Experiments show that the method significantly reduces gender bias while preserving or improving overall model capabilities. Conclusion: The proposed framework effectively reduces gender bias in LLMs; the code and generated data will be released. Abstract: Large Language Models (LLMs) often exhibit gender bias, resulting in unequal treatment of male and female subjects across different contexts. To address this issue, we propose a novel data generation framework that fosters exploratory thinking in LLMs. Our approach prompts models to generate story pairs featuring male and female protagonists in structurally identical, morally ambiguous scenarios, then elicits and compares their moral judgments. When inconsistencies arise, the model is guided to produce balanced, gender-neutral judgments. These story-judgment pairs are used to fine-tune or optimize the models via Direct Preference Optimization (DPO). Experimental results show that our method significantly reduces gender bias while preserving or even enhancing general model capabilities. We will release the code and generated data.
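
Direct Preference Optimization, named in this abstract, trains on pairs of a preferred and a rejected response using a log-sigmoid loss over log-probability margins. The sketch below computes the standard per-pair DPO loss from already-summed sequence log-probabilities. The log-probability values and `beta` are illustrative, and treating the balanced, gender-neutral judgment as the preferred response is an assumption about how the pairs are built.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), evaluated in a numerically stable way.
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))

# Policy identical to the reference model: the loss equals log(2).
neutral = dpo_loss(-3.0, -3.0, -3.0, -3.0)

# Policy assigns more probability to the preferred response than the reference does.
improved = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

The loss falls as the policy raises the preferred response's likelihood relative to the reference model and lowers the rejected one's, with `beta` controlling how sharply.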

[209] Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts

Georgios Chochlakis,Peter Wu,Arjun Bedi,Marcus Ma,Kristina Lerman,Shrikanth Narayanan

Main category: cs.CL

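TL;DR: The paper proposes an LLM-based label verification method for distinguishing reasonable labels from errors in subjective tasks, along with the Label-in-a-Haystack Rectification (LiaHR) framework for label correction.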

Details Motivation: Human annotations vary considerably on subjective tasks, and this variation reflects reasonable diversity in semantic interpretation rather than noise, so legitimate subjectivity must be distinguished from error. Method: Evaluates the reasonableness of document-label pairs with an In-Context Learning binary filtering baseline, introduces the Label-in-a-Haystack setting in which LLMs predict labels, and proposes the LiaHR framework for label correction. Result: Experiments show that an LLM's failure to copy a label is task-relevant, and that LiaHR effectively improves label quality. Conclusion: LiaHR can be used in annotation pipelines to raise the signal-to-noise ratio, and analyses and evaluations confirm its effectiveness. Abstract: Modeling complex subjective tasks in Natural Language Processing, such as recognizing emotion and morality, is considerably challenging due to significant variation in human annotations. This variation often reflects reasonable differences in semantic interpretations rather than mere noise, necessitating methods to distinguish between legitimate subjectivity and error. We address this challenge by exploring label verification in these contexts using Large Language Models (LLMs). First, we propose a simple In-Context Learning binary filtering baseline that estimates the reasonableness of a document-label pair. We then introduce the Label-in-a-Haystack setting: the query and its label(s) are included in the demonstrations shown to LLMs, which are prompted to predict the label(s) again, while receiving task-specific instructions (e.g., emotion recognition) rather than label copying. We show how the failure to copy the label(s) to the output of the LLM is task-relevant and informative. Building on this, we propose the Label-in-a-Haystack Rectification (LiaHR) framework for subjective label correction: when the model outputs diverge from the reference gold labels, we assign the generated labels to the example instead of discarding it. This approach can be integrated into annotation pipelines to enhance signal-to-noise ratios. Comprehensive analyses, human evaluations, and ecological validity studies verify the utility of LiaHR for label correction. Code is available at https://github.com/gchochla/LiaHR.
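
The rectification rule in this abstract (keep an example but replace its label when the model does not reproduce the gold label it was shown) can be sketched in a few lines. The `toy_predict` function is a hypothetical stand-in for the LLM call with the label included in the prompt; it is not a real model, and the example texts and labels are made up.

```python
def rectify_labels(examples, predict):
    """Return (text, label, changed) triples.

    `predict(text, shown_label)` stands for a model that sees the gold label
    in its prompt and predicts the label again.
    """
    rectified = []
    for text, gold in examples:
        predicted = predict(text, gold)
        changed = predicted != gold
        # On divergence, assign the generated label instead of discarding the example.
        rectified.append((text, predicted if changed else gold, changed))
    return rectified

def toy_predict(text, shown_label):
    """Hypothetical stand-in for an LLM call (assumption for illustration)."""
    if "thrilled" in text and shown_label != "joy":
        return "joy"
    return shown_label

examples = [
    ("I am thrilled about the trip", "sadness"),
    ("The bus was late again", "anger"),
]
rectified = rectify_labels(examples, toy_predict)
```

The `changed` flag makes it easy to route rectified examples to a human reviewer in an annotation pipeline.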

[210] ExeSQL: Self-Taught Text-to-SQL Models with Execution-Driven Bootstrapping for SQL Dialects

Jipeng Zhang,Haolin Yang,Kehao Miao,Ruiyuan Zhang,Renjie Pi,Jiahui Gao,Xiaofang Zhou

Main category: cs.CL

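TL;DR: The ExeSQL framework uses execution-driven feedback learning to address the generalization of text-to-SQL models across multiple dialects, with significant performance gains.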

Details Motivation: Existing text-to-SQL models are limited by their datasets and struggle with the syntax and features of multiple SQL dialects, so a method that improves generalization through execution validation is needed. Method: ExeSQL uses iterative query generation, execution-based filtering, and preference training, adapting to new dialects through feedback-guided learning. Result: Experiments show that ExeSQL improves on GPT-4o by an average of 15.2%, 10.38%, and 4.49% on PostgreSQL, MySQL, and Oracle, respectively. Conclusion: ExeSQL addresses the challenge of multi-dialect SQL generation through execution-driven learning and provides a reliable solution for practical applications. Abstract: Recent text-to-SQL models have achieved strong performance, but their effectiveness remains largely confined to SQLite due to dataset limitations. However, real-world applications require SQL generation across multiple dialects with varying syntax and specialized features, which remains a challenge for current models. The main obstacle in building a dialect-aware model lies in acquiring high-quality dialect-specific data. Data generated purely through static prompting - without validating SQLs via execution - tends to be noisy and unreliable. Moreover, the lack of real execution environments in the training loop prevents models from grounding their predictions in executable semantics, limiting generalization despite surface-level improvements from data filtering. This work introduces ExeSQL, a text-to-SQL framework with execution-driven, agentic bootstrapping. The method consists of iterative query generation, execution-based filtering (e.g., rejection sampling), and preference-based training, enabling the model to adapt to new SQL dialects through verifiable, feedback-guided learning. Experiments show that ExeSQL bridges the dialect gap in text-to-SQL, achieving average improvements of 15.2%, 10.38%, and 4.49% over GPT-4o on PostgreSQL, MySQL, and Oracle, respectively, across multiple datasets of varying difficulty.
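
Execution-based filtering, one of the steps listed in this abstract, can be sketched by running each candidate query against a real database engine and rejecting the ones that fail. The sketch uses Python's built-in `sqlite3` module only because it needs no server; the paper targets PostgreSQL, MySQL, and Oracle. The schema, the candidate queries, and the comparison against expected rows are assumptions for illustration, since the abstract does not describe the acceptance criterion.

```python
import sqlite3

def execution_filter(candidates, setup_sql, expected_rows):
    """Keep only candidate queries that execute and return the expected rows."""
    kept = []
    for sql in candidates:
        conn = sqlite3.connect(":memory:")  # fresh database for every candidate
        try:
            conn.executescript(setup_sql)
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # rejected: the query does not run in this dialect
        finally:
            conn.close()
        if rows == expected_rows:
            kept.append(sql)
    return kept

# Hypothetical schema and candidate queries.
setup_sql = """
CREATE TABLE users (id INTEGER, name TEXT);
INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob');
"""
candidates = [
    "SELECT name FROM users WHERE id = 2",  # runs and matches
    "SELECT TOP 1 name FROM users",         # T-SQL syntax, fails in SQLite
    "SELECT name FROM users WHERE id = 1",  # runs but returns the wrong row
]
kept = execution_filter(candidates, setup_sql, expected_rows=[("Bob",)])
```

The surviving queries could then serve as training data, and rejected ones as negative examples for preference-based training, though how the paper pairs them is not stated in the abstract.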

[211] Personalizing Student-Agent Interactions Using Log-Contextualized Retrieval Augmented Generation (RAG)

Clayton Cohn,Surya Rayala,Caitlin Snyder,Joyce Fonteles,Shruti Jain,Naveeduddin Mohammed,Umesh Timalsina,Sarah K. Burriss,Ashwin T S,Namrata Srivastava,Menton Deweese,Angela Eeds,Gautam Biswas

Main category: cs.CL

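TL;DR: The paper proposes LC-RAG, which enhances RAG retrieval with environment logs to improve personalized guidance for collaborative dialogue in STEM+C education.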

Details Motivation: Collaborative dialogue reveals students' learning and critical thinking, but existing methods such as RAG work poorly when semantic links are weak. More precise retrieval is needed to support personalized teaching. Method: Proposes LC-RAG, which uses environment logs to enhance RAG retrieval and contextualize collaborative dialogue, improving matching against the knowledge base. Result: LC-RAG retrieves better than a discourse-only baseline and delivers more relevant, personalized guidance. Conclusion: By contextualizing retrieval, LC-RAG supports students' critical thinking and decision-making and is suited to collaborative learning environments. Abstract: Collaborative dialogue offers rich insights into students' learning and critical thinking. This is essential for adapting pedagogical agents to students' learning and problem-solving skills in STEM+C settings. While large language models (LLMs) facilitate dynamic pedagogical interactions, potential hallucinations can undermine confidence, trust, and instructional value. Retrieval-augmented generation (RAG) grounds LLM outputs in curated knowledge, but its effectiveness depends on clear semantic links between user input and a knowledge base, which are often weak in student dialogue. We propose log-contextualized RAG (LC-RAG), which enhances RAG retrieval by incorporating environment logs to contextualize collaborative discourse. Our findings show that LC-RAG improves retrieval over a discourse-only baseline and allows our collaborative peer agent, Copa, to deliver relevant, personalized guidance that supports students' critical thinking and epistemic decision-making in a collaborative computational modeling environment, XYZ.

[212] ReasoningShield: Content Safety Detection over Reasoning Traces of Large Reasoning Models

Changyi Li,Jiayi Wang,Xudong Pan,Geng Hong,Min Yang

Main category: cs.CL

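TL;DR: The paper proposes ReasoningShield, the first safety detection model focused on potential risks in reasoning traces, and validates its performance with a high-quality dataset and an efficient annotation pipeline.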

Details Motivation: Existing moderation tools mainly target question-answer pairs and cannot effectively detect hidden risks in reasoning traces, so a dedicated safety detection model is needed. Method: Defines the QT moderation task, builds a dataset of over 8,000 question-thought pairs with a human-AI collaborative annotation pipeline, and develops ReasoningShield on 1B/3B base models. Result: ReasoningShield performs strongly at detecting risks in reasoning traces (F1 > 0.92) and is competitive on traditional QA-pair moderation. Conclusion: ReasoningShield offers an efficient solution for safety detection over reasoning traces; the dataset and model resources are publicly released to support future research. Abstract: Large Reasoning Models (LRMs) are transforming the AI landscape with advanced reasoning capabilities. While the generated reasoning traces enhance model transparency, they can still contain unsafe content, even when the final answer appears safe. Existing moderation tools, primarily designed for question-answer (QA) pairs, are empirically ineffective at detecting hidden risks embedded in reasoning traces. After identifying the key challenges, we formally define the question-thought (QT) moderation task and propose ReasoningShield, the first safety detection model tailored to identify potential risks in the reasoning trace before reaching the final answer. To construct the model, we synthesize a high-quality reasoning safety detection dataset comprising over 8,000 question-thought pairs spanning ten risk categories and three safety levels. Our dataset construction process incorporates a comprehensive human-AI collaborative annotation pipeline, which achieves over 93% annotation accuracy while significantly reducing human costs. On a diverse set of in-distribution and out-of-distribution benchmarks, ReasoningShield outperforms mainstream content safety moderation models in identifying risks within reasoning traces, with an average F1 score exceeding 0.92. Notably, despite being trained on our QT dataset only, ReasoningShield also demonstrates competitive performance in detecting unsafe question-answer pairs on traditional benchmarks, rivaling baselines trained on 10 times larger datasets and base models, which strongly validates the quality of our dataset.
Furthermore, ReasoningShield is built upon compact 1B/3B base models to facilitate lightweight deployment and provides human-friendly risk analysis by default. To foster future research, we publicly release all the resources.

[213] ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models

Razvan-Gabriel Dumitru,Darius Peteleaza,Vikas Yadav,Liangming Pan

Main category: cs.CL

TL;DR: Proposes a reinforcement learning method that uses a conciseness score to optimize the reasoning steps of large language models, reducing wasted computation and improving accuracy.

Details Motivation: Reasoning traces of large language models often run longer than necessary, causing wasted computation, reduced readability, and hallucinations. Method: Introduces a hyperparameter-free conciseness score as the reinforcement learning reward signal, evaluated dynamically by a large language model acting as a judge. Result: Achieves the best efficiency-accuracy trade-off on the MATH dataset: up to 31x fewer tokens on simple problems with 7% higher accuracy, and +7.5% accuracy with up to 3.6x fewer tokens on the hardest problems. On TheoremQA, accuracy improves by 2.2% with 12.5x fewer tokens. Conclusion: The method adapts reasoning length to problem difficulty, and the strength of the judge model significantly affects the results. Code and datasets are open-sourced. Abstract: Large language models excel at complex tasks by breaking down problems into structured reasoning steps. However, reasoning traces often extend beyond reaching a correct answer, causing wasted computation, reduced readability, and hallucinations. To address this, we introduce a novel hyperparameter-free conciseness score used as a reward signal within a reinforcement learning framework to guide models toward generating correct and concise reasoning traces. This score is evaluated by a large language model acting as a judge, enabling dynamic, context-aware feedback beyond simple token length. Our method achieves state-of-the-art efficiency-accuracy trade-offs on the MATH dataset, reducing token usage by up to 31x on simple problems while improving accuracy by 7%, and on the hardest problems, it outperforms full reasoning by +7.5% accuracy with up to 3.6x fewer tokens. On TheoremQA, our method improves accuracy by +2.2% using 12.5x fewer tokens. We also conduct ablation studies on the judge model, reward composition, and problem difficulty, showing that our method dynamically adapts reasoning length based on problem difficulty and benefits significantly from stronger judges. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/ConciseRL.

[214] The Rise of Parameter Specialization for Knowledge Storage in Large Language Models

Yihuai Hong,Yiran Zhao,Wei Tang,Yang Deng,Yu Rong,Wenxuan Zhang

Main category: cs.CL

TL;DR: Analyzes 20 open-source large language models and finds that, as model performance improves, knowledge stored in MLP parameters becomes increasingly specialized, and that this distribution improves the efficiency of knowledge utilization.

Details Motivation: To explore how knowledge can be better stored in model parameters, particularly in MLPs, so that models can use it more effectively. Method: Analyzes 20 open-source large language models, relating their performance to how knowledge is stored in their MLP parameters, and validates the findings with causal training experiments. Result: The more specialized a model's parameters, the more efficiently it utilizes knowledge. Conclusion: A specialized distribution of knowledge across MLP parameters is critical to model efficiency. Abstract: Over time, a growing wave of large language models from various series has been introduced to the community. Researchers are striving to maximize the performance of language models with constrained parameter sizes. However, from a microscopic perspective, there has been limited research on how to better store knowledge in model parameters, particularly within MLPs, to enable more effective utilization of this knowledge by the model. In this work, we analyze twenty publicly available open-source large language models to investigate the relationship between their strong performance and the way knowledge is stored in their corresponding MLP parameters. Our findings reveal that as language models become more advanced and demonstrate stronger knowledge capabilities, their parameters exhibit increased specialization. Specifically, parameters in the MLPs tend to be more focused on encoding similar types of knowledge. We experimentally validate that this specialized distribution of knowledge contributes to improving the efficiency of knowledge utilization in these models. Furthermore, by conducting causal training experiments, we confirm that this specialized knowledge distribution plays a critical role in improving the model's efficiency in leveraging stored knowledge.

[215] CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports

Xiao Yu Cindy Zhang,Carlos R. Ferreira,Francis Rossignol,Raymond T. Ng,Wyeth Wasserman,Jian Zhu

Main category: cs.CL

TL;DR: Introduces the CaseReportBench dataset to evaluate LLMs on information extraction from rare-disease case reports, finds that Qwen2.5-7B outperforms GPT-4o, and notes LLMs' weakness in recognizing negative findings.

Details Motivation: Rare diseases such as IEMs are hard to diagnose; case reports are an important but underused resource, and LLMs may enable efficient information extraction but have rarely been evaluated for it. Method: Introduces the CaseReportBench dataset and evaluates multiple models and prompting strategies, including category-specific prompting and subheading-filtered data integration. Result: Qwen2.5-7B outperforms GPT-4o; LLMs can extract clinically relevant details but are limited in recognizing negative findings. Conclusion: LLMs show promise for clinical NLP but still need improvement; the work paves the way for medical AI applications. Abstract: Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-annotated dataset for dense information extraction of case reports, focusing on IEMs. Using this dataset, we assess various models and prompting strategies, introducing novel approaches such as category-specific prompting and subheading-filtered data integration. Zero-shot chain-of-thought prompting offers little advantage over standard zero-shot prompting. Category-specific prompting improves alignment with the benchmark. The open-source model Qwen2.5-7B outperforms GPT-4o for this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management. We also highlight areas for improvement, such as LLMs' limitations in recognizing negative findings important for differential diagnosis. This work advances LLM-driven clinical natural language processing and paves the way for scalable medical AI applications.

[216] Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning

Cehao Yang,Xueyuan Lin,Chengjin Xu,Xuhui Jiang,Xiaojun Wu,Honghao Liu,Hui Xiong,Jian Guo

Main category: cs.CL

TL;DR: Proposes Select2Reason, a framework that efficiently selects long-CoT reasoning instruction data, matching or surpassing full-data fine-tuning with only 10% of the data.

Details Motivation: Training on large-scale instruction data is costly, and strategies for automatically selecting high-quality long-CoT instructions are lacking. Method: Ranks instructions with a weighted scheme based on question difficulty and reasoning trace length to select high-utility examples. Result: Matches or outperforms full-data fine-tuning and an open-source baseline on multiple mathematical benchmarks. Conclusion: Select2Reason is efficient, scalable, and adaptable, offering a practical approach to long-CoT instruction selection. Abstract: A practical approach to activate long chain-of-thought reasoning ability in pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection still remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of the emergence of rethinking behaviors like self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate the difficulty of each question and jointly incorporates a reasoning trace length-based heuristic through a weighted scheme for ranking to prioritize high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning an LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and the open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight the scalability in varying data size, efficiency during inference, and its adaptability to other instruction pools with minimal cost.
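
The weighted ranking described in the abstract could look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's implementation: the field names `difficulty` and `trace_len`, the rank normalization, and the weight `alpha` are all hypothetical.

```python
def select_top_fraction(examples, alpha=0.5, fraction=0.1):
    """Keep the top `fraction` of examples by a weighted rank score.

    `examples` is a list of dicts with hypothetical keys 'difficulty'
    (a difficulty estimate) and 'trace_len' (reasoning trace length).
    `alpha` weights difficulty against trace length (assumed, not from the paper).
    """
    n = len(examples)
    if n == 0:
        return []

    def rank_scores(key):
        # Normalized rank in [0, 1]; the largest value gets 1.0.
        order = sorted(range(n), key=lambda i: examples[i][key])
        scores = [0.0] * n
        for pos, i in enumerate(order):
            scores[i] = pos / (n - 1) if n > 1 else 1.0
        return scores

    difficulty = rank_scores("difficulty")
    length = rank_scores("trace_len")
    combined = [alpha * d + (1 - alpha) * l for d, l in zip(difficulty, length)]

    keep = max(1, round(n * fraction))
    best = sorted(range(n), key=lambda i: combined[i], reverse=True)[:keep]
    return [examples[i] for i in best]
```

Using ranks rather than raw values keeps the two signals on a comparable scale; other normalizations would work equally well for this sketch.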

Odysseas S. Chlapanis,Dimitrios Galanis,Nikolaos Aletras,Ion Androutsopoulos

Main category: cs.CL

TL;DR: GreekBarBench is a benchmark that evaluates LLMs on questions from five legal areas of the Greek Bar exams, combining a three-dimensional scoring system with an LLM-as-a-judge approach.

Details Motivation: To address the challenges of free-text evaluation and to improve LLM performance on legal questions. Method: Proposes a three-dimensional scoring system with an LLM-as-a-judge approach, and develops a meta-evaluation benchmark to measure the correlation between LLM judges and human experts. Result: The best models outperform the average expert score but fall short of the top 5% of experts. Conclusion: Simple, span-based rubrics improve alignment between LLM judges and human experts. Abstract: We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.

[218] Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty

Peilin Wu,Mian Zhang,Xinlu Zhang,Xinya Du,Zhiyu Zoey Chen

Main category: cs.CL

TL;DR: The paper proposes a reinforcement learning method, β-GRPO, that introduces a confidence threshold to optimize the search behavior of agentic retrieval-augmented generation (RAG) systems, addressing over-search and under-search and improving model performance.

Details Motivation: Agentic RAG systems exhibit inefficiencies during dynamic multi-step reasoning and retrieval, such as over-search and under-search, which hurt reliability and performance. Method: Defines and quantifies these inefficient behaviors, links them to the model's uncertainty about its knowledge boundaries, and proposes β-GRPO, a reinforcement learning method that uses a confidence threshold to optimize search decisions. Result: On seven QA benchmarks, β-GRPO improves a 3B model's performance, raising the average exact match score by 4%. Conclusion: By rewarding confident search decisions, β-GRPO improves the efficiency and accuracy of agentic RAG systems. Abstract: Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. However, these systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models' uncertainty regarding their own knowledge boundaries, where response accuracy correlates with the model's uncertainty in its search decisions. To address this, we propose $\beta$-GRPO, a reinforcement learning-based training method that incorporates a confidence threshold to reward high-certainty search decisions. Experiments on seven QA benchmarks show that $\beta$-GRPO equips a 3B model with better agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.
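
One way to picture a confidence-thresholded reward is the sketch below. The abstract does not give the formula, so everything here is an assumption for illustration: confidence is taken as the minimum token probability of the generated search call, the threshold `beta` has an arbitrary default, and the function name is hypothetical.

```python
def confidence_gated_reward(answer_correct, search_token_probs, beta=0.4):
    """Return 1.0 only for correct answers whose search decisions were confident.

    `search_token_probs` holds the model's probabilities for the tokens of its
    search calls (empty if it never searched). Using the minimum probability as
    the confidence measure is an assumption of this sketch.
    """
    base = 1.0 if answer_correct else 0.0
    if not search_token_probs:
        # No search decision was made, so there is nothing to gate.
        return base
    confidence = min(search_token_probs)
    if confidence < beta:
        # Low-certainty search decision: withhold the reward.
        return 0.0
    return base
```

The resulting scalar would then feed a policy-gradient update such as GRPO in place of a plain exact-match reward.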

[219] SELF: Self-Extend the Context Length With Logistic Growth Function

Phat Thanh Dang,Saahil Thoppay,Wang Yang,Qifan Wang,Vipin Chaudhary,Xiaotian Han

Main category: cs.CL

TL;DR: The paper proposes SELF, which extends the long-context capability of large language models using a logistic growth function and a token-grouping strategy, with notable performance gains.

Details Motivation: Large language models perform poorly on texts longer than their training context length because of position encoding issues. Method: Groups consecutive tokens at varying sizes determined by a logistic growth function, combined with a constant group size at small relative distances. Result: Improves over LongLM by up to 12% on LEval and up to 6.4% on LongBench summarization tasks, and by up to 5.4% on LEval reading comprehension tasks. Conclusion: SELF effectively improves performance on long-context tasks and outperforms existing extension methods. Abstract: Large language models suffer issues when operated on long contexts that are larger than their training context length due to the standard position encoding for tokens in the attention layer. Tokens a long distance apart will rarely have an effect on each other and long prompts yield unexpected results. To solve this problem, we propose SELF (Self-Extend the Context Length With Logistic Growth Function): a solution of grouping consecutive tokens at varying group sizes using a logistic capacity equation combined with a constant group size at smaller relative distances. Our model had an increase in performance of up to 12% compared to the LongLM extension method in LEval (specifically on the Qwen model). On summarization related tasks in LongBench, our model performed up to 6.4% better than LongLM (specifically on the Llama-2-7b model). On reading comprehension tasks from LEval, our model performed up to 5.4% better than the LongLM. Our code is available at https://github.com/alexeipc/SELF-LLM.
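
To make the grouping idea concrete, here is a minimal sketch of a group size that stays constant near the query token and then follows a logistic growth curve. The parameter names and values (`neighbor_window`, `capacity`, `rate`) are assumptions for illustration and are not taken from the paper or its repository.

```python
import math


def group_size(distance, neighbor_window=512, capacity=32.0, rate=0.005):
    """Group size for a given relative token distance.

    Within `neighbor_window` the group size is a constant 1 (positions are
    kept exact). Beyond it, the size follows the logistic growth curve
    K / (1 + (K - 1) * exp(-r * x)), which starts at 1 and saturates at K.
    """
    if distance < neighbor_window:
        return 1
    x = distance - neighbor_window
    size = capacity / (1.0 + (capacity - 1.0) * math.exp(-rate * x))
    return max(1, int(size))
```

Nearby tokens keep fine-grained positions, while distant tokens share positions in progressively larger groups, so the effective position range stays bounded.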

[220] Refusal Direction is Universal Across Safety-Aligned Languages

Xinpeng Wang,Mingyang Wang,Yihong Liu,Hinrich Schütze,Barbara Plank

Main category: cs.CL

TL;DR: The study finds that refusal behavior in large language models can be mediated by a single direction in activation space, and that this direction is universal across languages.

Details Motivation: To investigate LLM refusal behavior in multilingual settings in order to improve safety. Method: Uses the PolyRefuse dataset to analyze refusal behavior across 14 languages and test the cross-lingual universality of the refusal direction. Result: A refusal direction extracted from English bypasses refusals in other languages with near-perfect effectiveness, and directions transfer seamlessly between languages. Conclusion: The cross-lingual universality of the refusal direction offers new insight for multilingual safety defenses and reveals the mechanism behind cross-lingual vulnerabilities in LLMs. Abstract: Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. While this is primarily demonstrated in an English-centric context, appropriate refusal behavior is important for any language, but poorly understood. In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned language transfer seamlessly to others. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks. These findings provide actionable insights for building more robust multilingual safety defenses and pave the way for a deeper mechanistic understanding of cross-lingual vulnerabilities in LLMs.
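
The "single direction in activation space" can be illustrated with a small sketch. It assumes the direction is computed as a normalized difference of mean activations between harmful and harmless prompts, a common choice in prior work that the abstract does not spell out; function names and the toy vectors are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along mean(harmful) - mean(harmless) (assumed extraction)."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]


def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    proj = sum(a * d for a, d in zip(activation, direction))
    return [a - proj * d for a, d in zip(activation, direction)]
```

Under this reading, cross-lingual transfer means a `direction` estimated from English activations is passed to `ablate` on activations produced by prompts in another language.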

[221] Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2

Zackary Rackauckas,Julia Hirschberg

Main category: cs.CL

TL;DR: The paper compares two open-source TTS models, VITS and SBV2JE, on Japanese character speech synthesis; SBV2JE performs better on naturalness, intelligibility, and speaker consistency.

Details Motivation: To address the challenges of pitch-accent sensitivity and stylistic variability in Japanese character speech synthesis. Method: Uses three character-specific datasets to evaluate naturalness (MOS and CMOS), intelligibility (WER), and speaker consistency. Result: SBV2JE approaches human-level naturalness (MOS 4.37 vs. 4.38), achieves lower WER, and is slightly preferred in CMOS. Conclusion: SBV2JE is suitable for language learning and character dialogue generation, but has higher computational demands. Abstract: Synthesizing expressive Japanese character speech poses unique challenges due to pitch-accent sensitivity and stylistic variability. This paper benchmarks two open-source text-to-speech models--VITS and Style-BERT-VITS2 JP Extra (SBV2JE)--on in-domain, character-driven Japanese speech. Using three character-specific datasets, we evaluate models across naturalness (mean opinion and comparative mean opinion score), intelligibility (word error rate), and speaker consistency. SBV2JE matches human ground truth in naturalness (MOS 4.37 vs. 4.38), achieves lower WER, and shows slight preference in CMOS. Enhanced by pitch-accent controls and a WavLM-based discriminator, SBV2JE proves effective for applications like language learning and character dialogue generation, despite higher computational demands.
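
For readers unfamiliar with the intelligibility metric, word error rate is the word-level edit distance between a reference transcript and the recognized transcript, divided by the reference length. The sketch below assumes whitespace-separated tokens; Japanese text would first need a tokenizer or a character-level variant, and the paper's exact evaluation setup is not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)
```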

[222] From Compression to Expansion: A Layerwise Analysis of In-Context Learning

Jiachen Jiang,Yuxin Dong,Jinxin Zhou,Zhihui Zhu

Main category: cs.CL

TL;DR: The paper uses statistical geometric analysis to study the internal representations behind in-context learning (ICL) and identifies a phenomenon called "Layerwise Compression-Expansion", which describes how task information is processed across a model's layers.

Details Motivation: Although ICL performs well empirically, its internal representational mechanisms are not well understood and warrant closer study. Method: Applies statistical geometric analysis to ICL representations across layers, and examines the role of attention through a bias-variance decomposition and theoretical analysis. Result: ICL exhibits Layerwise Compression-Expansion: early layers compress task information and later layers expand it to generate predictions; the phenomenon has important implications for performance and robustness. Conclusion: The study characterizes the representational dynamics of ICL, offers a new perspective on internal LLM behavior, and shows that analyzing internal representations deepens understanding of model behavior. Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without weight updates by learning from demonstration sequences. While ICL shows strong empirical performance, its internal representational mechanisms are not yet well understood. In this work, we conduct a statistical geometric analysis of ICL representations to investigate how task-specific information is captured across layers. Our analysis reveals an intriguing phenomenon, which we term *Layerwise Compression-Expansion*: early layers progressively produce compact and discriminative representations that encode task information from the input demonstrations, while later layers expand these representations to incorporate the query and generate the prediction. This phenomenon is observed consistently across diverse tasks and a range of contemporary LLM architectures. We demonstrate that it has important implications for ICL performance -- improving with model size and the number of demonstrations -- and for robustness in the presence of noisy examples. To further understand the effect of the compact task representation, we propose a bias-variance decomposition and provide a theoretical analysis showing how attention mechanisms contribute to reducing both variance and bias, thereby enhancing performance as the number of demonstrations increases. Our findings reveal an intriguing layerwise dynamic in ICL, highlight how structured representations emerge within LLMs, and showcase that analyzing internal representations can facilitate a deeper understanding of model behavior.

[223] GPT Editors, Not Authors: The Stylistic Footprint of LLMs in Academic Preprints

Soren DeHaan,Yuanze Liu,Johan Bollen,Saúl A. Blanco

Main category: cs.CL

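TL;DR: The study examines how large language models (LLMs) are used in academic writing and finds that they are used uniformly, which reduces the risk of hallucinations.

Details Motivation: The spread of LLMs threatens the credibility of academic writing and creates institutional uncertainty; the study aims to determine whether they are used to generate critical text or for editing. Method: Analyzes arXiv papers for stylistic segmentation, measured by varying a PELT threshold against a Bayesian classifier trained on GPT-regenerated text. Result: LLM-attributed language is not predictive of stylistic segmentation, suggesting that authors who use LLMs do so uniformly, which reduces the risk of introducing hallucinations. Conclusion: Uniform use of LLMs in academic writing lowers the potential risk, though their impact requires further study. Abstract: The proliferation of Large Language Models (LLMs) in late 2022 has impacted academic writing, threatening credibility, and causing institutional uncertainty. We seek to determine the degree to which LLMs are used to generate critical text as opposed to being used for editing, such as checking for grammar errors or inappropriate phrasing. In our study, we analyze arXiv papers for stylistic segmentation, which we measure by varying a PELT threshold against a Bayesian classifier trained on GPT-regenerated text. We find that LLM-attributed language is not predictive of stylistic segmentation, suggesting that when authors use LLMs, they do so uniformly, reducing the risk of hallucinations being introduced into academic preprints.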

[224] SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use

Hitesh Laxmichand Patel,Amit Agarwal,Arion Das,Bhargava Kumar,Srikant Panda,Priyaranjan Pattnayak,Taki Hasan Rafi,Tejaswini Kumar,Dong-Kyu Chae

Main category: cs.CL

TL;DR: SweEval is a benchmark that evaluates whether large language models (LLMs) comply with inappropriate instructions across diverse cultural contexts, with the aim of advancing research on ethical AI for enterprise use.

Details Motivation: Enterprise customers widely use LLMs for critical communication tasks, so the models must understand diverse cultural contexts and generate safe, respectful responses to reduce reputational risk. Method: Introduces the SweEval benchmark, which simulates real-world scenarios and explicitly instructs the model to include specific swear words while completing a task, in order to evaluate compliance and ethical alignment. Result: Evaluates whether LLMs comply with or resist inappropriate instructions, and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. Conclusion: SweEval provides a research foundation for building ethically aligned enterprise AI systems; the dataset and code are released to support further research. Abstract: Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.

[225] Language models should be subject to repeatable, open, domain-contextualized hallucination benchmarking

Justin D. Norman,Michael U. Rivera,D. Alex Hughes

Main category: cs.CL

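TL;DR: The paper discusses hallucination in text generated by language models and argues for repeatable, open, and domain-contextualized hallucination benchmarking.

Details Motivation: Plausible but inaccurate content is believed to be pervasive in model-generated text, yet there is little scientific work on measuring it. Method: Presents a taxonomy of hallucinations and a case study illustrating the importance of expert involvement in data creation. Result: When experts are absent from the early stages of data creation, the resulting hallucination metrics lack validity and practical utility. Conclusion: Hallucination evaluation for language models should be repeatable, open, and domain-contextualized, with experts involved in data creation. Abstract: Plausible, but inaccurate, tokens in model-generated text are widely believed to be pervasive and problematic for the responsible adoption of language models. Despite this concern, there is little scientific work that attempts to measure the prevalence of language model hallucination in a comprehensive way. In this paper, we argue that language models should be evaluated using repeatable, open, and domain-contextualized hallucination benchmarking. We present a taxonomy of hallucinations alongside a case study that demonstrates that when experts are absent from the early stages of data creation, the resulting hallucination metrics lack validity and practical utility.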

[226] A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit

Zafarullah Mahmood,Soliman Ali,Jiading Zhu,Mohamed Abdelwahab,Michelle Yu Collins,Sihan Chen,Yi Cheng Zhao,Jodi Wolff,Osnat Melamed,Nadia Minian,Marta Maslej,Carolynne Cooper,Matt Ratto,Peter Selby,Jonathan Rose

Main category: cs.CL

TL;DR: The study evaluates an LLM-based chatbot acting as a smoking-cessation counsellor; the chatbot adheres to Motivational Interviewing (MI) standards and increases users' confidence in quitting.

Details Motivation: To test whether an LLM can serve as an effective automated talk therapist and to assess its performance in smoking cessation. Method: Develops an LLM-based chatbot that applies Motivational Interviewing (MI) and tests it with 106 participants. Result: Participants' confidence in quitting rose by an average of 1.7 points on a 0-10 scale; the chatbot adhered to MI standards in 98% of utterances, and participants' language indicated motivation to change. Conclusion: LLM-driven counselling chatbots show promise, with good results in smoking cessation. Abstract: The conversational capabilities of Large Language Models (LLMs) suggest that they may be able to perform as automated talk therapists. It is crucial to know if these systems would be effective and adhere to known standards. We present a counsellor chatbot that focuses on motivating tobacco smokers to quit smoking. It uses a state-of-the-art LLM and a widely applied therapeutic approach called Motivational Interviewing (MI), and was evolved in collaboration with clinician-scientists with expertise in MI. We also describe and validate an automated assessment of both the chatbot's adherence to MI and client responses. The chatbot was tested on 106 participants, and their confidence that they could succeed in quitting smoking was measured before the conversation and one week later. Participants' confidence increased by an average of 1.7 on a 0-10 scale. The automated assessment of the chatbot showed adherence to MI standards in 98% of utterances, higher than human counsellors. The chatbot scored well on a participant-reported metric of perceived empathy but lower than typical human counsellors. Furthermore, participants' language indicated a good level of motivation to change, a key goal in MI. These results suggest that the automation of talk therapy with a modern LLM has promise.

[227] AI-Augmented LLMs Achieve Therapist-Level Responses in Motivational Interviewing

Yinghui Huang,Yuxuan Jiang,Hui Liu,Yixin Cai,Weiqing Li,Xiangen Hu

Main category: cs.CL

TL;DR: GPT-4 shows strong potential for motivational interviewing (MI) in addiction care, but its therapeutic capabilities need systematic evaluation. The study analyzes MI behaviors with a human-AI collaboration framework, builds predictive models, and improves GPT-4's performance through prompt engineering.

Details Motivation: To evaluate GPT-4's potential in clinical communication, particularly for motivational interviewing (MI) in addiction care. Method: Analyzes MI sessions through human-AI collaboration, develops predictive models combining deep learning and explainable AI, and optimizes GPT-4 with prompt engineering. Result: GPT-4 outperforms human therapists at advice management but is marginally inferior overall and limited in handling complex emotions; prompt engineering measurably improves its performance. Conclusion: The study provides a framework for optimizing LLM-based therapeutic tools and highlights both the potential and the limitations of GPT-4 in clinical communication. Abstract: Large language models (LLMs) like GPT-4 show potential for scaling motivational interviewing (MI) in addiction care, but require systematic evaluation of therapeutic capabilities. We present a computational framework assessing user-perceived quality (UPQ) through expected and unexpected MI behaviors. Analyzing human therapist and GPT-4 MI sessions via human-AI collaboration, we developed predictive models integrating deep learning and explainable AI to identify 17 MI-consistent (MICO) and MI-inconsistent (MIIN) behavioral metrics. A customized chain-of-thought prompt improved GPT-4's MI performance, reducing inappropriate advice while enhancing reflections and empathy. Although GPT-4 remained marginally inferior to therapists overall, it demonstrated superior advice management capabilities. The model achieved measurable quality improvements through prompt engineering, yet showed limitations in addressing complex emotional nuances. This framework establishes a pathway for optimizing LLM-based therapeutic tools through targeted behavioral metric analysis and human-AI co-evaluation. Findings highlight both the scalability potential and current constraints of LLMs in clinical communication applications.

[228] WiNGPT-3.0 Technical Report

Boqin Zhuang,Chenxiao Song,Huitong Lu,Jiacheng Qiao,Mingqian Liu,Mingxing Yu,Ping Hong,Rui Li,Xiaoxia Song,Xiangjun Xu,Xu Chen,Yaoyao Ma,Yujie Gao

Main category: cs.CL

TL;DR: WiNGPT-3.0 is a 32-billion-parameter LLM designed to improve medical reasoning; multi-stage training and reinforcement learning yield clear performance gains with limited data.

Details Motivation: To address the shortcomings of existing LLMs in structured, interpretable, and verifiable medical reasoning, as well as deployment challenges related to computational resources and data privacy. Method: Uses a multi-stage training pipeline with supervised fine-tuning (SFT) and reinforcement learning (RL), combined with Long Chain-of-Thought datasets and an evidence-based diagnostic chain simulation. Result: WiNGPT-3.0 variants score 66.6 on MedCalc and 87.1 on MedQA-USMLE, and performance on a clinical reasoning task improves from 58.1 to 62.5. Conclusion: Reinforcement learning can improve medical reasoning accuracy even with limited data and computation, paving the way for more trustworthy and practical LLM deployment in clinical workflows. Abstract: Current Large Language Models (LLMs) exhibit significant limitations, notably in structured, interpretable, and verifiable medical reasoning, alongside practical deployment challenges related to computational resources and data privacy. This report focuses on the development of WiNGPT-3.0, a 32-billion-parameter LLM engineered with the objective of enhancing its capacity for medical reasoning and exploring its potential for effective integration within healthcare IT infrastructures. The broader aim is to advance towards clinically applicable models. The approach involved a multi-stage training pipeline tailored for general, medical, and clinical reasoning. This pipeline incorporated supervised fine-tuning (SFT) and reinforcement learning (RL), leveraging curated Long Chain-of-Thought (CoT) datasets, auxiliary reward models, and an evidence-based diagnostic chain simulation. WiNGPT-3.0 demonstrated strong performance: specific model variants achieved scores of 66.6 on MedCalc and 87.1 on MedQA-USMLE. Furthermore, targeted training improved performance on a clinical reasoning task from a baseline score of 58.1 to 62.5. These findings suggest that reinforcement learning, even when applied with a limited dataset of only a few thousand examples, can enhance medical reasoning accuracy. Crucially, this demonstration of RL's efficacy with limited data and computation paves the way for more trustworthy and practically deployable LLMs within clinical workflows and health information infrastructures.

[229] Measuring diversity of synthetic prompts and data generated with fine-grained persona prompting

Gauri Kambhatla,Chantal Shaib,Venkata Govindarajan

Main category: cs.CL

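TL;DR: The study finds that synthetic prompts generated from fine-grained personas are less diverse than human-written prompts, and that fine-grained persona detail does not noticeably increase the diversity of generated text.

Details Motivation: To examine the role of fine-grained persona descriptions in generating diverse synthetic data for large language models. Method: Measures the diversity of persona-driven synthetic prompts and responses with lexical diversity and redundancy metrics, and compares generations from models of different sizes. Result: Synthetic prompts are significantly less diverse than human-written ones; fine-grained persona detail adds little diversity, although persona prompting itself improves lexical diversity, especially with larger models. Conclusion: Fine-grained persona descriptions contribute little to the diversity of generated text, while persona prompting benefits lexical diversity, particularly in larger models. Abstract: Fine-grained personas have recently been used for generating 'diverse' synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. Firstly, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contributes to generated text diversity. We find that while persona-prompting does improve lexical diversity (especially with larger models), fine-grained detail in personas doesn't increase diversity noticeably.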
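
The abstract does not list the specific metrics in its suite. As one common example of a lexical diversity measure, the sketch below computes distinct-n, the ratio of unique n-grams to total n-grams over a set of texts; its use here is an assumption, and whitespace tokenization is a simplification.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across `texts` (0.0 if none)."""
    unique = set()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

Lower values indicate more repetition across the collection, which is the kind of redundancy the study reports for synthetic prompts relative to human-written ones.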

[230] Curriculum Guided Reinforcement Learning for Efficient Multi Hop Retrieval Augmented Generation

Yuelyu Ji,Rui Meng,Zhuochun Li,Daqing He

Main category: cs.CL

TL;DR: EVO-RAG is a curriculum-guided reinforcement learning framework that optimizes multi-hop RAG systems with dynamic reward scheduling and a multi-head reward model, improving both answer accuracy and retrieval efficiency.

Details Motivation: Existing multi-hop RAG methods issue redundant queries, explore too shallowly, or follow overly long search chains; EVO-RAG aims to address these problems. Method: Trains a query-rewriting agent within a curriculum-guided reinforcement learning framework that combines a seven-factor reward vector with a dynamic scheduler. Result: Across four multi-hop QA benchmarks, EVO-RAG raises Exact Match by up to 4.6 points while reducing average retrieval depth by 15%. Conclusion: EVO-RAG offers a general recipe for building efficient and reliable multi-hop RAG systems. Abstract: Retrieval-augmented generation (RAG) grounds large language models (LLMs) in up-to-date external evidence, yet existing multi-hop RAG pipelines still issue redundant subqueries, explore too shallowly, or wander through overly long search chains. We introduce EVO-RAG, a curriculum-guided reinforcement learning framework that evolves a query-rewriting agent from broad early-stage exploration to concise late-stage refinement. EVO-RAG couples a seven-factor, step-level reward vector (covering relevance, redundancy, efficiency, and answer correctness) with a time-varying scheduler that reweights these signals as the episode unfolds. The agent is trained with Direct Preference Optimization over a multi-head reward model, enabling it to learn when to search, backtrack, answer, or refuse. Across four multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle), EVO-RAG boosts Exact Match by up to 4.6 points over strong RAG baselines while trimming average retrieval depth by 15%. Ablation studies confirm the complementary roles of curriculum staging and dynamic reward scheduling. EVO-RAG thus offers a general recipe for building reliable, cost-effective multi-hop RAG systems.
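
A time-varying scheduler that reweights reward components can be sketched as a linear interpolation between early-episode and late-episode weights. The interpolation rule, the component names, and the weight values are assumptions for illustration; the abstract does not specify the schedule.

```python
def scheduled_reward(step_rewards, early_weights, late_weights, step, max_steps):
    """Weighted sum of per-step reward components with step-dependent weights.

    `step_rewards`, `early_weights`, and `late_weights` are dicts keyed by
    hypothetical component names (e.g. 'relevance', 'efficiency'). Weights move
    linearly from the early values at step 0 to the late values at the last step.
    """
    progress = step / max(1, max_steps - 1)
    progress = min(1.0, max(0.0, progress))
    total = 0.0
    for name, value in step_rewards.items():
        weight = (1 - progress) * early_weights[name] + progress * late_weights[name]
        total += weight * value
    return total
```

With early weights favoring exploration-related signals and late weights favoring efficiency, the same step-level reward vector yields different scalar rewards as the episode unfolds.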

[231] FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow

Haoyu Sun,Huichen Will Wang,Jiawei Gu,Linjie Li,Yu Cheng

Main category: cs.CL

TL;DR: FullFront is a benchmark that evaluates Multimodal Large Language Models (MLLMs) across the full front-end development workflow, covering design, visual comprehension, and code generation tasks.

Details Motivation: Existing benchmarks focus only on converting visual designs to code, whereas FullFront aims to evaluate MLLMs across the entire front-end engineering workflow. Method: FullFront uses a two-stage process to transform real-world webpages into clean, standardized HTML, and defines three tasks: Webpage Design, Webpage Perception QA, and Webpage Code Generation. Result: Testing reveals significant MLLM limitations in page perception, code generation (particularly image handling and layout), and interaction implementation, with a substantial gap to human expert performance. Conclusion: FullFront exposes the shortcomings of current MLLMs in front-end engineering and provides a standardized evaluation tool for future research. Abstract: Front-end engineering involves a complex workflow where engineers conceptualize designs, translate them into code, and iteratively refine the implementation. While recent benchmarks primarily focus on converting visual designs to code, we present FullFront, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) across the full front-end development pipeline. FullFront assesses three fundamental tasks that map directly to the front-end engineering pipeline: Webpage Design (conceptualization phase), Webpage Perception QA (comprehension of visual organization and elements), and Webpage Code Generation (implementation phase). Unlike existing benchmarks that use either scraped websites with bloated code or oversimplified LLM-generated HTML, FullFront employs a novel, two-stage process to transform real-world webpages into clean, standardized HTML while maintaining diverse visual designs and avoiding copyright issues. Extensive testing of state-of-the-art MLLMs reveals significant limitations in page perception, code generation (particularly for image handling and layout), and interaction implementation. Our results quantitatively demonstrate performance disparities across models and tasks, and highlight a substantial gap between current MLLM capabilities and human expert performance in front-end engineering. The FullFront benchmark and code are available at https://github.com/Mikivishy/FullFront.

[232] Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?

Zhi Rui Tam,Cheng-Kuang Wu,Yu Ying Chiu,Chieh-Yen Lin,Yun-Nung Chen,Hung-yi Lee

Main category: cs.CL

TL;DR: The study finds that large reasoning models (LRMs) tend to default to reasoning in high-resource languages (e.g., English) in multilingual settings, even when the input is in a different language. Forcing reasoning in the input language degrades performance, especially for low-resource languages. The effect of language choice varies by task type.

Details Motivation: To investigate the internal reasoning processes of LRMs in multilingual settings, in particular how the choice of reasoning language affects performance. Method: Evaluates multilingually trained models at test time across different languages, covering reasoning-intensive tasks and non-reasoning benchmarks. Result: LRMs tend to reason in high-resource languages; forcing them to reason in the input language degrades performance, especially for low-resource languages. The effect of language choice depends on task type. Conclusion: The study exposes linguistic biases in LRMs and points to key directions for developing more equitable multilingual models. Abstract: Large reasoning models (LRMs) have demonstrated impressive performance across a range of reasoning tasks, yet little is known about their internal reasoning processes in multilingual settings. We begin with a critical question: In which language do these models reason when solving problems presented in different languages? Our findings reveal that, despite multilingual training, LRMs tend to default to reasoning in high-resource languages (e.g., English) at test time, regardless of the input language. When constrained to reason in the same language as the input, model performance declines, especially for low-resource languages. In contrast, reasoning in high-resource languages generally preserves performance. We conduct extensive evaluations across reasoning-intensive tasks (MMMLU, MATH-500) and non-reasoning benchmarks (CulturalBench, LMSYS-toxic), showing that the effect of language choice varies by task type: input-language reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior. By exposing these linguistic biases in LRMs, our work highlights a critical step toward developing more equitable models that serve users across diverse linguistic backgrounds.

[233] Conversations: Love Them, Hate Them, Steer Them

Niranjan Chebrolu,Gerard Christopher Yeo,Kokil Jaidka

Main category: cs.CL

TL;DR: Targeted activation engineering steers LLaMA 3.1-8B toward more human-like emotional expression.

Details Motivation: Current LLMs still lack nuance in emotional expression, and existing methods either adjust surface-level output or require extensive fine-tuning. Method: Uses attribution patching to locate a key intervention point, derives emotional expression vectors from contrastive text pairs, and applies them to new conversational prompts. Result: Steered responses show more positive sentiment and more frequent first-person pronoun usage, with richer emotional expression. Conclusion: The method gives LLMs precise and interpretable emotional control, contributing to more empathetic conversational AI. Abstract: Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components and find a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable method for controlling specific emotional attributes in LLMs, contributing to developing more aligned and empathetic conversational AI.
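
The abstract describes deriving a vector from the activation difference between contrastive text pairs and applying it to new prompts. A minimal sketch of that general idea follows, assuming activations are plain lists of floats and that the vector is the difference of per-group means; the function names, the `strength` parameter, and the mean-difference choice are illustrative assumptions, not the authors' implementation.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    return [fmean(column) for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations (assumed form of the contrastive vector)."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden_state, vector, strength=1.0):
    """Add the scaled steering vector to one hidden state (hypothetical hook)."""
    return [h + strength * v for h, v in zip(hidden_state, vector)]


# Toy activations standing in for real model activations at one layer.
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

vec = steering_vector(positive, negative)
steered = apply_steering([0.5, 0.5, 0.5], vec, strength=0.5)
```

In a real model the addition would happen inside a forward hook at the layer chosen by attribution patching; the toy lists here only show the arithmetic.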

[234] DASH: Input-Aware Dynamic Layer Skipping for Efficient LLM Inference with Markov Decision Policies

Ning Yang,Fangxin Liu,Junjie Wang,Tao Yang,Kan Liu,Haibing Guan,Li Jiang

Main category: cs.CL

TL;DR: DASH is an adaptive layer-skipping framework that reduces LLM inference cost by dynamically selecting computation paths while preserving performance.

Details Motivation: The high inference cost of large language models limits their use in real-time scenarios. Method: Models the skipping process as a Markov Decision Process and introduces a lightweight compensation mechanism and an asynchronous execution strategy. Result: Significantly accelerates inference across multiple LLM architectures and NLP benchmarks while keeping performance competitive. Conclusion: DASH accelerates inference while maintaining task performance, outperforming existing methods. Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of NLP tasks. However, their substantial inference cost poses a major barrier to real-world deployment, especially in latency-sensitive scenarios. To address this challenge, we propose DASH, an adaptive layer-skipping framework that dynamically selects computation paths conditioned on input characteristics. We model the skipping process as a Markov Decision Process (MDP), enabling fine-grained token-level decisions based on intermediate representations. To mitigate potential performance degradation caused by skipping, we introduce a lightweight compensation mechanism that injects differential rewards into the decision process. Furthermore, we design an asynchronous execution strategy that overlaps layer computation with policy evaluation to minimize runtime overhead. Experiments on multiple LLM architectures and NLP benchmarks show that our method achieves significant inference acceleration while maintaining competitive task performance, outperforming existing methods.

[235] T$^2$: An Adaptive Test-Time Scaling Strategy for Contextual Question Answering

Zhengyi Zhao,Shubo Zhang,Zezhong Wang,Huimin Wang,Yutian Zhao,Bin Liang,Yefeng Zheng,Binyang Li,Kam-Fai Wong,Xian Wu

Main category: cs.CL

TL;DR: The T² framework dynamically adjusts reasoning depth, choosing an LLM's reasoning strategy according to question complexity to improve accuracy and computational efficiency.

Details Motivation: Existing methods for CQA lack adaptability: they either over-reason on simple questions or introduce human bias, and fail to fully exploit the model's own reasoning capabilities. Method: T² dynamically adjusts reasoning depth by decomposing the question, generating similar examples, evaluating candidate strategies, and applying the best one. Result: Across seven CQA benchmarks, T² achieves higher accuracy than baseline methods and reduces computational overhead by up to 25.2%. Conclusion: With its dynamic reasoning strategy, T² significantly improves LLM performance and efficiency in CQA. Abstract: Recent advances in Large Language Models (LLMs) have demonstrated remarkable performance in Contextual Question Answering (CQA). However, prior approaches typically employ elaborate reasoning strategies regardless of question complexity, leading to low adaptability. Recent efficient test-time scaling methods introduce budget constraints or early stop mechanisms to avoid overthinking for straightforward questions. However, they add human bias to the reasoning process and fail to leverage models' inherent reasoning capabilities. To address these limitations, we present T$^2$: Think-to-Think, a novel framework that dynamically adapts reasoning depth based on question complexity. T$^2$ leverages the insight that if an LLM can effectively solve similar questions using specific reasoning strategies, it can apply the same strategy to the original question. This insight enables the adoption of concise reasoning for straightforward questions while maintaining detailed analysis for complex problems. T$^2$ works through four key steps: decomposing questions into structural elements, generating similar examples with candidate reasoning strategies, evaluating these strategies against multiple criteria, and applying the most appropriate strategy to the original question. Experimental evaluation across seven diverse CQA benchmarks demonstrates that T$^2$ not only achieves higher accuracy than baseline methods but also reduces computational overhead by up to 25.2%.

[236] Discovering Forbidden Topics in Language Models

Can Rager,Chris Wendler,Rohit Gandikota,David Bau

Main category: cs.CL

TL;DR: The paper introduces a new task, refusal discovery, which aims to identify the full set of topics a language model refuses to discuss. The authors develop LLM-crawler, a method that uses token prefilling to find forbidden topics, test it on several models, and reveal censorship tendencies and alignment failures.

Details Motivation: To expose censorship tendencies and alignment failures that may remain in language models after safety tuning, helping to detect model biases and boundaries. Method: Develops LLM-crawler, which uses token prefilling to generate prompts and test models' refusal behavior on forbidden topics, with experiments on several open-weight and frontier models. Result: LLM-crawler retrieves 31 of 36 forbidden topics on Tulu-3-8B; DeepSeek-R1-70B shows patterns consistent with censorship tuning; Perplexity-R1-1776-70B is robust to censorship, but its quantized version still exhibits alignment failures. Conclusion: The study underlines the importance of refusal discovery methods for detecting biases, boundaries, and alignment failures in AI systems, offering a new perspective on model safety and transparency. Abstract: Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, LLM-crawler, that uses token prefilling to find forbidden topics. We benchmark the LLM-crawler on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler manages to retrieve 31 out of 36 topics within a budget of 1000 prompts. Next, we scale the crawl to a frontier model using the prefilling option of Claude-Haiku. Finally, we crawl three widely used open-weight models: Llama-3.3-70B and two of its variants finetuned for reasoning: DeepSeek-R1-70B and Perplexity-R1-1776-70B. DeepSeek-R1-70B reveals patterns consistent with censorship tuning: The model exhibits "thought suppression" behavior that indicates memorization of CCP-aligned responses. Although Perplexity-R1-1776-70B is robust to censorship, LLM-crawler elicits CCP-aligned refusal answers in the quantized model. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.

[237] Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models

Shunsuke Kando,Yusuke Miyao,Shinnosuke Takamichi

Main category: cs.CL

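TL;DR: Studies how segmentation width and cluster size in speech tokenization affect speech language model performance; moderately coarse segmentation and larger cluster sizes improve performance, and the most efficient model reduces training data by 50% and training runtime by 70%.

Details Motivation: To explore how segmentation width and cluster size in speech tokenization affect speech language model performance, with the aim of improving model efficiency. Method: Segments speech signals into fixed or variable widths with pooled representations, trains K-means models with different cluster sizes, and evaluates zero-shot spoken language understanding. Result: Moderately coarse segmentation and larger cluster sizes improve performance; the most efficient model reduces training data by 50% and training runtime by 70%. Conclusion: Combining multiple tokens can enhance fine-grained spoken language understanding; segmentation and cluster size are key factors. Abstract: The purpose of speech tokenization is to transform a speech signal into a sequence of discrete representations, serving as the foundation for speech language models (SLMs). While speech tokenization has many options, their effect on the performance of SLMs remains unclear. This paper investigates two key aspects of speech tokenization: the segmentation width and the cluster size of discrete units. First, we segment speech signals into fixed/variable widths and pooled representations. We then train K-means models in multiple cluster sizes. Through the evaluation on zero-shot spoken language understanding benchmarks, we find the positive effect of moderately coarse segmentation and bigger cluster size. Notably, among the best-performing models, the most efficient one achieves a 50% reduction in training data and a 70% decrease in training runtime. Our analysis highlights the importance of combining multiple tokens to enhance fine-grained spoken language understanding.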

[238] LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization

Qi Zhang,Shouqing Yang,Lirong Gao,Hao Chen,Xiaomeng Hu,Jinglei Chen,Jiexiang Wang,Sheng Guo,Bo Zheng,Haobo Wang,Junbo Zhao

Main category: cs.CL

TL;DR: The paper proposes the LeTS framework, which combines process-level and outcome-level rewards to improve the reasoning ability of large language models (LLMs) in retrieval-augmented generation (RAG).

Details Motivation: Current RAG research relies solely on outcome-supervised reinforcement learning (RL) and neglects the correctness of intermediate reasoning steps. Method: Designs a process-level reward module and combines it with outcome-level rewards in the LeTS framework. Result: Experiments demonstrate the generalization and inference efficiency of LeTS across multiple RAG benchmarks. Conclusion: LeTS shows the potential of combining process-level and outcome-level rewards to improve LLM reasoning. Abstract: Large language models (LLMs) have demonstrated impressive capabilities in reasoning with the emergence of reasoning models like OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating reasoning capabilities into the realm of retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches, while the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module to mitigate the unawareness of intermediate reasoning steps in outcome-level supervision without additional annotation. Building on this, we propose Learning to Think-and-Search (LeTS), a novel framework that hybridizes stepwise process reward and outcome-based reward in current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of LeTS across various RAG benchmarks. In addition, these results reveal the potential of process- and outcome-level reward hybridization in boosting LLMs' reasoning ability via RL under other scenarios. The code will be released soon.

[239] Towards Evaluating Proactive Risk Awareness of Multimodal Language Models

Youliang Yuan,Wenxiang Jiao,Yuejin Xie,Chihao Shen,Menghan Tian,Wenxuan Wang,Jen-tse Huang,Pinjia He

Main category: cs.CL

TL;DR: The paper presents the Proactive Safety Bench (PaSBench), which uses multimodal scenarios to evaluate how well models identify potential risks in advance, and finds that existing models suffer from unstable reasoning.

Details Motivation: To address the delayed recognition of everyday risks caused by gaps in human safety awareness, and to promote safety AI that prevents harm proactively rather than responding reactively. Method: Evaluates 36 advanced models on 416 multimodal scenarios (128 image sequences, 288 text logs) and analyzes their proactive reasoning. Result: The best model reaches 71% accuracy on images and 64% on text, yet still misses 45-55% of risks in repeated trials; the main problem is unstable reasoning. Conclusion: The study establishes a proactive safety benchmark, exposes model limitations, and points to directions for developing reliable protective AI. Abstract: Human safety awareness gaps often prevent the timely recognition of everyday risks. In solving this problem, a proactive safety artificial intelligence (AI) system would work better than a reactive one. Instead of just reacting to users' questions, it would actively watch people's behavior and their environment to detect potential dangers in advance. Our Proactive Safety Bench (PaSBench) evaluates this capability through 416 multimodal scenarios (128 image sequences, 288 text logs) spanning 5 safety-critical domains. Evaluation of 36 advanced models reveals fundamental limitations: Top performers like Gemini-2.5-pro achieve 71% image and 64% text accuracy, but miss 45-55% of risks in repeated trials. Through failure analysis, we identify unstable proactive reasoning rather than knowledge deficits as the primary limitation. This work establishes (1) a proactive safety benchmark, (2) systematic evidence of model limitations, and (3) critical directions for developing reliable protective AI. We believe our dataset and findings can promote the development of safer AI assistants that actively prevent harm rather than merely respond to requests. Our dataset can be found at https://huggingface.co/datasets/Youliang/PaSBench.

[240] Hydra: Structured Cross-Source Enhanced Large Language Model Reasoning

Xingyu Tan,Xiaoyang Wang,Qing Liu,Xiwei Xu,Xin Yuan,Liming Zhu,Wenjie Zhang

Main category: cs.CL

TL;DR: Hydra is a training-free framework that unifies graph topology, document semantics, and source reliability to strengthen deep reasoning in LLMs, addressing multi-hop, multi-entity, and multi-source verification problems.

Details Motivation: Current hybrid RAG systems struggle with multi-hop reasoning, multi-entity questions, multi-source verification, and effective use of graph structure. Method: Hydra combines structured and unstructured retrieval through agent-driven exploration and applies tri-factor cross-source verification (source trustworthiness assessment, cross-source corroboration, and entity-path alignment). Result: On seven benchmark datasets, Hydra outperforms the ToG-2 baseline by 20.3% on average and by up to 30.1%, and brings smaller models close to GPT-4-Turbo performance. Conclusion: Hydra substantially improves LLM reasoning, especially on multi-hop and multi-entity questions. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Current hybrid RAG systems retrieve evidence from both knowledge graphs (KGs) and text documents to support LLM reasoning. However, they face challenges such as handling multi-hop reasoning, multi-entity questions, multi-source verification, and effective graph utilization. To address these limitations, we present Hydra, a training-free framework that unifies graph topology, document semantics, and source reliability to support deep, faithful reasoning in LLMs. Hydra handles multi-hop and multi-entity problems through agent-driven exploration that combines structured and unstructured retrieval, increasing both diversity and precision of evidence. To tackle multi-source verification, Hydra uses a tri-factor cross-source verification (source trustworthiness assessment, cross-source corroboration, and entity-path alignment) to balance topic relevance with cross-modal agreement. By leveraging graph structure, Hydra fuses heterogeneous sources, guides efficient exploration, and prunes noise early. Comprehensive experiments on seven benchmark datasets show that Hydra achieves overall state-of-the-art results on all benchmarks with GPT-3.5, outperforming the strong hybrid baseline ToG-2 by an average of 20.3% and up to 30.1%. Furthermore, Hydra enables smaller models (e.g., Llama-3.1-8B) to achieve reasoning performance comparable to that of GPT-4-Turbo.

[241] A Position Paper on the Automatic Generation of Machine Learning Leaderboards

Roelien C Timmer,Yufang Hou,Stephen Wan

Main category: cs.CL

TL;DR: This paper surveys research on Automatic Leaderboard Generation (ALG), proposes a unified framework and benchmarking guidelines, and discusses future directions.

Details Motivation: The growth of machine learning literature makes leaderboards hard to maintain by hand, so automated methods are needed. Method: Proposes a unified conceptual framework for ALG, standardizes the task definition, and gives recommendations for datasets and evaluation metrics. Result: Identifies the differences among ALG studies and proposes a standardized framework and reproducible evaluation practices. Conclusion: ALG needs broader coverage and richer metadata, and future research should focus on these directions. Abstract: An important task in machine learning (ML) research is comparing prior work, which is often performed via ML leaderboards: a tabular overview of experiments with comparable conditions (e.g., same task, dataset, and metric). However, the growing volume of literature creates challenges in creating and maintaining these leaderboards. To ease this burden, researchers have developed methods to extract leaderboard entries from research papers for automated leaderboard curation. Yet, prior work varies in problem framing, complicating comparisons and limiting real-world applicability. In this position paper, we present the first overview of Automatic Leaderboard Generation (ALG) research, identifying fundamental differences in assumptions, scope, and output formats. We propose an ALG unified conceptual framework to standardise how the ALG task is defined. We offer ALG benchmarking guidelines, including recommendations for datasets and metrics that promote fair, reproducible evaluation. Lastly, we outline challenges and new directions for ALG, such as advocating for broader coverage by including all reported results and richer metadata.

[242] SLearnLLM: A Self-Learning Framework for Efficient Domain-Specific Adaptation of Large Language Models

Xiang Liu,Zhaoxiang Liu,Peng Wang,Kohou Wang,Huan Hu,Kai Wang,Shiguo Lian

Main category: cs.CL

TL;DR: Proposes a self-learning framework that fine-tunes only on the knowledge in the SFT dataset that the model does not already have, substantially improving training efficiency.

Details Motivation: Conventional practice fine-tunes directly on the entire SFT dataset; when the data overlaps with the model's existing knowledge, compute is wasted. Identifying and using the unknown knowledge can improve efficiency. Method: A self-learning framework in which the LLM first answers the questions in the SFT dataset, the QA pairs it answers incorrectly are selected, and only those pairs are used for fine-tuning. Result: In experiments in agriculture and medicine, the method substantially reduces training time while achieving improvements comparable to full-dataset fine-tuning. Conclusion: By focusing on the unknown knowledge in the SFT dataset, the method makes LLM fine-tuning more efficient. Abstract: When using supervised fine-tuning (SFT) to adapt large language models (LLMs) to specific domains, a significant challenge arises: should we use the entire SFT dataset for fine-tuning? Common practice often involves fine-tuning directly on the entire dataset due to limited information on the LLM's past training data. However, if the SFT dataset largely overlaps with the model's existing knowledge, the performance gains are minimal, leading to wasted computational resources. Identifying the unknown knowledge within the SFT dataset and using it to fine-tune the model could substantially improve the training efficiency. To address this challenge, we propose a self-learning framework for LLMs inspired by human learning patterns. This framework takes a fine-tuning (SFT) dataset in a specific domain as input. First, the LLMs answer the questions in the SFT dataset. The LLMs then objectively grade the responses and single out the incorrectly answered QA pairs. Finally, we fine-tune the LLMs based on this filtered QA set. Experimental results in the fields of agriculture and medicine demonstrate that our method substantially reduces training time while achieving comparable improvements to those attained with full dataset fine-tuning. By concentrating on the unknown knowledge within the SFT dataset, our approach enhances the efficiency of fine-tuning LLMs.
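
The answer, grade, and filter steps described in the abstract can be sketched as a small selection loop. This is a hedged illustration only: `answer_fn` and `grade_fn` are hypothetical callables standing in for the LLM's answering and self-grading calls, and the toy lookup table replaces a real model.

```python
def select_unknown_pairs(qa_pairs, answer_fn, grade_fn):
    """Keep only the QA pairs that the model currently answers incorrectly.

    answer_fn(question) -> model answer (hypothetical stand-in for an LLM call)
    grade_fn(question, reference, prediction) -> True if judged correct
    """
    selected = []
    for question, reference in qa_pairs:
        prediction = answer_fn(question)
        if not grade_fn(question, reference, prediction):
            selected.append((question, reference))
    return selected


# Toy stand-ins: a lookup table plays the model, exact match plays the grader.
known = {"What is 2 + 2?": "4"}
qa_pairs = [
    ("What is 2 + 2?", "4"),
    ("Which nutrient deficiency causes leaf chlorosis?", "nitrogen"),
]

subset = select_unknown_pairs(
    qa_pairs,
    answer_fn=lambda q: known.get(q, "unknown"),
    grade_fn=lambda q, ref, pred: pred.strip().lower() == ref.strip().lower(),
)
```

Only `subset` would then be passed to fine-tuning; the pair the model already answers correctly is skipped.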

[243] FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain

Suifeng Zhao,Zhuoran Jin,Sujian Li,Jun Gao

Main category: cs.CL

TL;DR: FinRAGBench-V is a visual RAG benchmark for the financial domain that integrates multimodal data and provides visual citations, filling the gap left by existing research that overlooks visual content.

Details Motivation: Existing RAG research in finance focuses mainly on textual data and overlooks visual content, losing key analytical insights. Method: Proposes the FinRAGBench-V benchmark and the RGenCite baseline, combining a bilingual retrieval corpus with a high-quality QA dataset, and develops an automatic citation evaluation method. Result: Experiments show that FinRAGBench-V is challenging and offer insights for developing multimodal RAG systems in finance. Conclusion: FinRAGBench-V and RGenCite open a new direction for multimodal RAG research in finance and underline the importance of visual content. Abstract: Retrieval-Augmented Generation (RAG) plays a vital role in the financial domain, powering applications such as real-time market analysis, trend forecasting, and interest rate computation. However, most existing RAG research in finance focuses predominantly on textual data, overlooking the rich visual content in financial documents, resulting in the loss of key analytical insights. To bridge this gap, we present FinRAGBench-V, a comprehensive visual RAG benchmark tailored for finance which effectively integrates multimodal data and provides visual citation to ensure traceability. It includes a bilingual retrieval corpus with 60,780 Chinese and 51,219 English pages, along with a high-quality, human-annotated question-answering (QA) dataset spanning heterogeneous data types and seven question categories. Moreover, we introduce RGenCite, an RAG baseline that seamlessly integrates visual citation with generation. Furthermore, we propose an automatic citation evaluation method to systematically assess the visual citation capabilities of Multimodal Large Language Models (MLLMs). Extensive experiments on RGenCite underscore the challenging nature of FinRAGBench-V, providing valuable insights for the development of multimodal RAG systems in finance.

[244] MARCO: Meta-Reflection with Cross-Referencing for Code Reasoning

Yusheng Zhao,Xiao Luo,Weizhi Zhang,Wei Ju,Zhiping Xiao,Philip S. Yu,Ming Zhang

Main category: cs.CL

TL;DR: The paper proposes MARCO, a framework that dynamically improves an LLM's code reasoning ability through self-improvement.

Details Motivation: Most existing research takes a static perspective, whereas MARCO adopts a cognitive-evolving perspective that combines knowledge accumulation and lesson sharing to improve LLM code reasoning dynamically. Method: Proposes meta-reflection and cross-referencing: the former accumulates knowledge by reflecting on the reasoning paths of the current problem, and the latter draws on the solutions and feedback of other agents. Result: Experiments on multiple code reasoning datasets demonstrate the effectiveness of MARCO. Conclusion: Through dynamic self-improvement, the MARCO framework significantly improves LLM code reasoning. Abstract: The ability to reason is one of the most fundamental capabilities of large language models (LLMs), enabling a wide range of downstream tasks through sophisticated problem-solving. A critical aspect of this is code reasoning, which involves logical reasoning with formal languages (i.e., programming code). In this paper, we enhance this capability of LLMs by exploring the following question: how can an LLM agent become progressively smarter in code reasoning with each solution it proposes, thereby achieving substantial cumulative improvement? Most existing research takes a static perspective, focusing on isolated problem-solving using frozen LLMs. In contrast, we adopt a cognitive-evolving perspective and propose a novel framework named Meta-Reflection with Cross-Referencing (MARCO) that enables the LLM to evolve dynamically during inference through self-improvement. From the perspective of human cognitive development, we leverage both knowledge accumulation and lesson sharing. In particular, to accumulate knowledge during problem-solving, we propose meta-reflection that reflects on the reasoning paths of the current problem to obtain knowledge and experience for future consideration. Moreover, to effectively utilize the lessons from other agents, we propose cross-referencing that incorporates the solution and feedback from other agents into the current problem-solving process. We conduct experiments across various datasets in code reasoning, and the results demonstrate the effectiveness of MARCO.

[245] keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection

Saketh Reddy Vemula,Parameswari Krishnamurthy

Main category: cs.CL

TL;DR: Proposes an entropy-based method that identifies hallucinated spans in language model output from the variability of stochastically sampled responses.

Details Motivation: Identifying hallucinated text generated by language models is essential for real-world applications, especially in multilingual settings. Method: Exploits the variability of stochastically sampled responses and measures their disagreement with entropy-based analysis, without additional training. Result: Identifies hallucinated spans accurately with a low-cost, adaptable method. Conclusion: The method offers an efficient and economical solution for hallucination detection in language models. Abstract: Identification of hallucination spans in black-box language model generated text is essential for applications in the real world. A recent attempt in this direction is SemEval-2025 Task 3, Mu-SHROOM, a Multilingual Shared Task on Hallucinations and Related Observable Over-generation Errors. In this work, we present our solution to this problem, which capitalizes on the variability of stochastically-sampled responses in order to identify hallucinated spans. Our hypothesis is that if a language model is certain of a fact, its sampled responses will be uniform, while hallucinated facts will yield different and conflicting results. We measure this divergence through entropy-based analysis, allowing for accurate identification of hallucinated segments. Our method is not dependent on additional training and hence is cost-effective and adaptable. In addition, we conduct extensive hyperparameter tuning and perform error analysis, giving us crucial insights into model behavior.
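
The hypothesis above (uniform samples when the model is certain, conflicting samples when it hallucinates) can be illustrated with Shannon entropy over repeated samples. The sketch below is a simplification under stated assumptions: it scores whole short answers rather than spans, treats answers as equal only on exact string match, and uses an illustrative `threshold`; none of these details come from the paper.

```python
import math
from collections import Counter


def response_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution over sampled answers."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_hallucinated(samples, threshold=0.9):
    """Flag a fact when sampled answers disagree too much (threshold is an assumption)."""
    return response_entropy(samples) > threshold


# Toy samples standing in for repeated stochastic generations of one fact.
consistent = ["Paris", "Paris", "Paris", "Paris"]
conflicting = ["1912", "1915", "1912", "1921"]

flags = (looks_hallucinated(consistent), looks_hallucinated(conflicting))
```

A span-level detector would apply the same measure to aligned segments of each sampled response instead of to whole answers.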

[246] Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models

Chi-Yuan Hsiao,Ke-Han Lu,Kai-Wei Chang,Chih-Kai Yang,Wei-Chih Chen,Hung-yi Lee

Main category: cs.CL

TL;DR: This paper studies catastrophic forgetting in multi-stage training of spoken language models (SLMs) and evaluates three mitigation strategies, finding experience replay the most effective.

Details Motivation: Multi-stage training of SLMs can cause catastrophic forgetting; the paper explores mitigation strategies that balance knowledge retention with new learning. Method: Evaluates three strategies: model merging, discounting the LoRA scaling factor, and experience replay. Result: Experience replay is the most effective, and combining it with other methods yields further gains. Conclusion: The findings inform the development of more robust and efficient SLM training pipelines. Abstract: End-to-end training of Spoken Language Models (SLMs) commonly involves adapting pre-trained text-based Large Language Models (LLMs) to the speech modality through multi-stage training on diverse tasks such as ASR, TTS and spoken question answering (SQA). Although this multi-stage continual learning equips LLMs with both speech understanding and generation capabilities, the substantial differences in task and data distributions across stages can lead to catastrophic forgetting, where previously acquired knowledge is lost. This paper investigates catastrophic forgetting and evaluates three mitigation strategies (model merging, discounting the LoRA scaling factor, and experience replay) to balance knowledge retention with new learning. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods. These findings provide insights for developing more robust and efficient SLM training pipelines.
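
Two of the strategies named above have simple arithmetic cores that a short sketch can make concrete. The code assumes model merging means linear interpolation of weights and that the LoRA update enters as `W0 + discount * (alpha / rank) * delta`, with `delta` standing for the flattened product of the low-rank matrices; both forms are common conventions used here for illustration, and the paper's exact variants may differ.

```python
def merge_weights(base, tuned, weight=0.5):
    """Linear interpolation between base and fine-tuned weights (assumed merge rule)."""
    return [(1 - weight) * b + weight * t for b, t in zip(base, tuned)]


def lora_effective_weight(w0, delta, alpha, rank, discount=1.0):
    """Base weights plus the LoRA update, with the usual alpha / rank scale discounted.

    delta stands for the flattened low-rank product; discount < 1 shrinks the
    update so the adapted model stays closer to the original weights.
    """
    scale = discount * alpha / rank
    return [w + scale * d for w, d in zip(w0, delta)]


# Toy one-dimensional "weights" to show the arithmetic only.
merged = merge_weights([0.0, 2.0], [2.0, 4.0], weight=0.5)
adapted = lora_effective_weight([1.0, 1.0], [2.0, -2.0], alpha=16, rank=8, discount=0.5)
```

Experience replay, the third strategy, is a data-mixing choice rather than a weight operation: examples from earlier stages are mixed into later-stage training batches.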

[247] CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents

Minsoo Khang,Sangjun Park,Teakgyu Hong,Dawoon Jung

Main category: cs.CL

TL;DR: CReSt is a comprehensive benchmark for evaluating large language models (LLMs) in retrieval-augmented generation (RAG) scenarios, focusing on complex reasoning, refusal to answer, precise citation, and document layout understanding.

Details Motivation: Existing evaluation methods do not jointly measure the key capabilities LLMs need in RAG scenarios, such as complex reasoning and document understanding, so a more comprehensive evaluation framework is needed. Method: Presents the CReSt benchmark, comprising 2,245 human-annotated examples in English and Korean, together with a tailored evaluation methodology. Result: Experiments show that even advanced LLMs perform inconsistently across these key dimensions, highlighting room for improvement. Conclusion: CReSt supports future research and the development of more robust RAG systems; the dataset and code are open source. Abstract: Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer appropriately, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, maintaining reliability, and structural understanding. While some of the prior works address these aspects individually, there is a need for a unified framework that evaluates them collectively in practical RAG scenarios. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises 2,245 human-annotated examples in English and Korean, designed to capture practical RAG scenarios that require complex reasoning over structured documents. It also introduces a tailored evaluation methodology to comprehensively assess model performance in these critical areas. Our evaluation shows that even advanced LLMs struggle to perform consistently across these dimensions, underscoring key areas for improvement. We release CReSt to support further research and the development of more robust RAG systems. The dataset and code are available at: https://github.com/UpstageAI/CReSt.

[248] L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models

Xiaohao Liu,Xiaobo Xia,Weixiang Zhao,Manyi Zhang,Xianzhi Yu,Xiu Su,Shuo Yang,See-Kiong Ng,Tat-Seng Chua

Main category: cs.CL

TL;DR: The paper proposes L-MTP, a leap-based multi-token prediction method that improves the inference efficiency and contextual coverage of large language models.

Details Motivation: Conventional next-token prediction (NTP) is limited in contextual coverage and inference efficiency, which constrains the potential of large language models (LLMs). Method: L-MTP uses a leap mechanism to predict non-adjacent tokens, strengthening the capture of long-range dependencies, and pairs it with a decoding strategy optimized for non-sequential token generation. Result: Experiments show that L-MTP improves both model performance and inference speed across multiple benchmarks. Conclusion: L-MTP is an effective improvement that raises both LLM performance and inference efficiency; the code will be made public. Abstract: Large language models (LLMs) have achieved notable progress. Despite their success, next-token prediction (NTP), the dominant method for LLM training and inference, is constrained in both contextual coverage and inference efficiency due to its inherently sequential process. To overcome these challenges, we propose leap multi-token prediction (L-MTP), an innovative token prediction method that extends the capabilities of multi-token prediction (MTP) by introducing a leap-based mechanism. Unlike conventional MTP, which generates multiple tokens at adjacent positions, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. This structured leap not only enhances the model's ability to capture long-range dependencies but also enables a decoding strategy specially optimized for non-sequential leap token generation, effectively accelerating inference. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency. Experiments across diverse benchmarks validate its merit in boosting both LLM performance and inference speed. The source code will be publicly available.

[249] Large Language Models Do Multi-Label Classification Differently

Marcus Ma,Georgios Chochlakis,Niyantha Maruthu Pandiyan,Jesse Thomason,Shrikanth Narayanan

Main category: cs.CL

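TL;DR: The study examines how autoregressive large language models (LLMs) perform multi-label classification. Their predictive behavior reflects the multiple steps needed to generate all relevant labels, and as model scale increases, entropy falls while the internal ranking of labels improves. The authors introduce a distribution alignment task to improve alignment and predictive performance.

Details Motivation: Multi-label classification is common in real-world settings, but LLM behavior on such tasks is understudied, especially for subjective tasks. Method: Analyzes the output distributions at each generation step to study how LLMs perform multi-label classification, introduces the distribution alignment task, and proposes zero-shot and supervised methods. Result: LLMs tend to suppress all but one label at each generation step; as model scale increases, entropy falls while label ranking improves. The proposed methods outperform existing approaches on distribution alignment and predictive performance. Conclusion: LLMs show distinctive predictive behavior in multi-label classification, and the proposed distribution alignment methods improve performance and point to directions for future research. Abstract: Multi-label classification is prevalent in real-world settings, but the behavior of Large Language Models (LLMs) in this setting is understudied. We investigate how autoregressive LLMs perform multi-label classification, with a focus on subjective tasks, by analyzing the output distributions of the models in each generation step. We find that their predictive behavior reflects the multiple steps in the underlying language modeling required to generate all relevant labels as they tend to suppress all but one label at each step. We further observe that as model scale increases, their token distributions exhibit lower entropy, yet the internal ranking of the labels improves. Finetuning methods such as supervised finetuning and reinforcement learning amplify this phenomenon. To further study this issue, we introduce the task of distribution alignment for multi-label settings: aligning LLM-derived label distributions with empirical distributions estimated from annotator responses in subjective tasks. We propose both zero-shot and supervised methods which improve both alignment and predictive performance over existing approaches.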

[250] Multimodal Conversation Structure Understanding

Kent K. Chang,Mackenzie Hanh Cramer,Anna Ho,Ti Ti Nguyen,Yilin Yuan,David Bamman

Main category: cs.CL

TL;DR: The paper introduces a suite of tasks and a dataset for evaluating how well multimodal large language models understand conversation structure and role attribution, and finds that multimodal conversation structure understanding remains challenging for current models.

Details Motivation: To explore how well large language models understand fine-grained conversational structure (such as role attribution and conversation threading), especially in multimodal, multi-party settings. Method: Introduces tasks grounded in conversation analysis and sociolinguistics, builds a human-annotated dataset, and evaluates popular audio-visual LLMs and vision-language models. Result: Audio-visual LLMs perform best, but role-attribution performance is affected by the number of participants and by anonymization; acoustic clarity and face coverage are positively associated with performance. Conclusion: The study lays the groundwork for future evaluation and development of multimodal LLMs for conversation structure understanding. Abstract: Conversations are usually structured by roles -- who is speaking, who's being addressed, and who's listening -- and unfold in threads that break with changes in speaker floor or topical focus. While large language models (LLMs) have shown incredible capabilities in dialogue and reasoning, their ability to understand fine-grained conversational structure, especially in multi-modal, multi-party settings, remains underexplored. To address this gap, we introduce a suite of tasks focused on conversational role attribution (speaker, addressees, side-participants) and conversation threading (utterance linking and clustering), drawing on conversation analysis and sociolinguistics. To support those tasks, we present a human-annotated dataset of 4,398 annotations for speakers and reply-to relationships, 5,755 addressees, and 3,142 side-participants. We evaluate popular audio-visual LLMs and vision-language models on our dataset, and our experimental results suggest that multimodal conversational structure understanding remains challenging. The most performant audio-visual LLM outperforms all vision-language models across all metrics, especially in speaker and addressee recognition. However, its performance drops significantly when conversation participants are anonymized. The number of conversation participants in a clip is the strongest negative predictor of role-attribution performance, while acoustic clarity (measured by pitch and spectral centroid) and detected face coverage yield positive associations. We hope this work lays the groundwork for future evaluation and development of multimodal LLMs that can reason more effectively about conversation structure.

[251] How Knowledge Popularity Influences and Enhances LLM Knowledge Boundary Perception

Shiyu Ni,Keping Bi,Jiafeng Guo,Xueqi Cheng

Main category: cs.CL

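TL;DR: The study examines how knowledge popularity affects the ability of large language models (LLMs) to recognize their knowledge boundaries, finds that popularity correlates positively with model performance, confidence, and boundary perception, and proposes a popularity-based confidence calibration method.

Details Motivation: LLMs often give incorrect yet confident answers because they fail to recognize their knowledge boundaries; the study explores how knowledge popularity influences this behavior. Method: Quantifies knowledge popularity through entity popularity and relation popularity, and tests LLMs on datasets containing knowledge of varying popularity. Result: Popularity correlates positively with LLM performance, confidence, and boundary perception, with relation popularity showing the strongest correlation; popularity-based calibration improves the accuracy of answer-correctness prediction by an average of 5.24%. Conclusion: Knowledge popularity is a key factor in LLM performance and can be used effectively for confidence calibration; prompting LLMs to estimate popularity without an external corpus is a viable alternative. Abstract: Large language models (LLMs) often fail to recognize their knowledge boundaries, producing confident yet incorrect answers. In this paper, we investigate how knowledge popularity affects LLMs' ability to perceive their knowledge boundaries. Focusing on entity-centric factual question answering (QA), we quantify knowledge popularity from three perspectives: the popularity of entities in the question, the popularity of entities in the answer, and relation popularity, defined as their co-occurrence frequency. Experiments on three representative datasets containing knowledge with varying popularity show that LLMs exhibit better QA performance, higher confidence, and more accurate perception on more popular knowledge, with relation popularity having the strongest correlation. Because knowledge popularity shows a strong correlation with LLMs' QA performance, we propose to leverage these signals for confidence calibration. This improves the accuracy of answer correctness prediction by an average of 5.24% across all models and datasets. Furthermore, we explore prompting LLMs to estimate popularity without external corpora, which yields a viable alternative.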
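The three popularity signals described in the abstract can be illustrated with a minimal sketch. It assumes popularity is approximated by document counts in a toy corpus and that entities are matched by plain substring search; the function name `popularity_signals` and the corpus are hypothetical, and the paper's actual corpus and counting procedure are not given in the abstract.

```python
from typing import Dict, List


def popularity_signals(
    corpus: List[str], question_entity: str, answer_entity: str
) -> Dict[str, int]:
    """Count three illustrative popularity signals over a document corpus.

    Assumption: "popularity" is the number of documents mentioning an
    entity, and "relation popularity" is the number of documents that
    mention both entities together (co-occurrence frequency).
    """
    question_count = sum(question_entity in doc for doc in corpus)
    answer_count = sum(answer_entity in doc for doc in corpus)
    co_occurrence = sum(
        question_entity in doc and answer_entity in doc for doc in corpus
    )
    return {
        "question_entity": question_count,
        "answer_entity": answer_count,
        "relation": co_occurrence,
    }


# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
    "Marie Curie won two Nobel Prizes.",
]
signals = popularity_signals(toy_corpus, "Marie Curie", "Warsaw")
```

In a calibration setting, counts like these would be fed to a separate model as extra features alongside the LLM's own confidence; that step is omitted here because the abstract does not describe it.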

[252] Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition

Leonora Vesterbacka,Faton Rekathati,Robin Kurtz,Justyna Sikora,Agnes Toftgård

Main category: cs.CL

TL;DR: This paper presents a suite of Whisper models fine-tuned for Swedish; trained on a large and varied dataset, they substantially improve Swedish speech recognition.

Details Motivation: Mid-resourced languages such as Swedish are often underrepresented in multilingual training datasets, and fine-tuning existing models can substantially improve performance. Method: Fine-tunes Whisper models on a large, varied Swedish dataset. Result: Compared with OpenAI's whisper-large-v3, the best fine-tuned model reduces WER on Swedish by 47% on average. Conclusion: Fine-tuning multilingual models yields substantial gains for mid-resourced languages. Abstract: This work presents a suite of fine-tuned Whisper models for Swedish, trained on a dataset of unprecedented size and variability for this mid-resourced language. As languages of smaller sizes are often underrepresented in multilingual training datasets, substantial improvements in performance can be achieved by fine-tuning existing multilingual models, as shown in this work. This work reports an overall improvement across model sizes compared to OpenAI's Whisper evaluated on Swedish. Most notably, we report an average 47% reduction in WER comparing our best-performing model to OpenAI's whisper-large-v3, in evaluations across FLEURS, Common Voice, and NST.
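For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration, not the authors' evaluation code. It assumes whitespace tokenization with no text normalization, and it assumes the reported 47% is a relative reduction; `relative_reduction` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of a new model over a baseline."""
    return (baseline_wer - new_wer) / baseline_wer
```

Under this reading, a baseline WER of 0.10 falling to 0.053 would be a 47% reduction; those two numbers are made up for illustration and do not come from the paper.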

[253] Teaching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection

Shrey Pandit,Ashwin Vinod,Liu Leqi,Ying Ding

Main category: cs.CL

TL;DR: The paper proposes a DPO alignment method based on curriculum learning and high-quality negative samples, which significantly improves LLM performance at detecting hallucinated text.

Details Motivation: Because hallucinated text is sophisticated, existing methods struggle to detect it accurately, so a more effective alignment method is needed. Method: Uses carefully engineered hallucinated text as negative samples, combined with a curriculum learning strategy that trains the model progressively from easy to hard samples. Result: HaluCheck models improve markedly across several benchmarks, by up to 24%, and perform well in zero-shot settings. Conclusion: With high-quality negative samples and curriculum learning, the method effectively improves LLMs' ability to detect hallucinated text. Abstract: Aligning large language models (LLMs) to accurately detect hallucinations remains a significant challenge due to the sophisticated nature of hallucinated text. Recognizing that hallucinated samples typically exhibit higher deceptive quality than traditional negative samples, we use these carefully engineered hallucinations as negative examples in the DPO alignment procedure. Our method incorporates a curriculum learning strategy, gradually transitioning the training from easier samples, identified based on the greatest reduction in probability scores from independent fact checking models, to progressively harder ones. This structured difficulty scaling ensures stable and incremental learning. Experimental evaluation demonstrates that our HaluCheck models, trained with a curriculum DPO approach and high-quality negative samples, significantly improve model performance across various metrics, achieving improvements of up to 24% on difficult benchmarks like MedHallu and HaluEval. Additionally, HaluCheck models demonstrate robustness in zero-shot settings, significantly outperforming larger state-of-the-art models across various benchmarks.
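The easy-to-hard ordering in the abstract can be sketched as a sort followed by a split into stages. This is an illustration under stated assumptions: each sample already carries a `score_drop` value (the reduction in a fact-checker's probability score, where a larger drop means an easier negative), and the names `Sample` and `curriculum_stages` are hypothetical. How the paper computes the drop and schedules DPO training across stages is not specified in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    sample_id: str
    score_drop: float  # assumed: fact-checker probability reduction; larger = easier


def curriculum_stages(samples: List[Sample], n_stages: int) -> List[List[Sample]]:
    """Order samples from easiest to hardest and split them into stages.

    Returns at most `n_stages` contiguous chunks; the first chunk holds
    the samples with the greatest score drop (the easiest ones).
    """
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    if not samples:
        return []
    ordered = sorted(samples, key=lambda s: s.score_drop, reverse=True)
    stage_size = -(-len(ordered) // n_stages)  # ceiling division
    return [
        ordered[start:start + stage_size]
        for start in range(0, len(ordered), stage_size)
    ]


# Hypothetical scores, for illustration only.
pool = [
    Sample("a", 0.8),
    Sample("b", 0.1),
    Sample("c", 0.5),
    Sample("d", 0.3),
]
stages = curriculum_stages(pool, n_stages=2)
```

A training loop would then run preference optimization on `stages[0]` before moving on to later stages.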

[254] PPT: A Process-based Preference Learning Framework for Self Improving Table Question Answering Models

Wei Zhou,Mohsen Mesgar,Heike Adel,Annemarie Friedrich

Main category: cs.CL

TL;DR: The PPT framework decomposes reasoning chains and samples contrastive steps, significantly improving table question answering with only a small number of preference pairs.

Details Motivation: To fill the gap in self-improvement with self-generated data for table question answering and to avoid costly manual annotation. Method: Proposes the PPT framework, which decomposes reasoning chains into discrete states, scores each state, and samples contrastive steps for preference learning. Result: Improves performance by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets, and is five times more efficient at inference. Conclusion: PPT is efficient, performs well, and is suited to table question answering. Abstract: Improving large language models (LLMs) with self-generated data has demonstrated success in tasks such as mathematical reasoning and code generation. Yet, no exploration has been made on table question answering (TQA), where a system answers questions based on tabular data. Addressing this gap is crucial for TQA, as effective self-improvement can boost performance without requiring costly or manually annotated data. In this work, we propose PPT, a Process-based Preference learning framework for TQA. It decomposes reasoning chains into discrete states, assigns scores to each state, and samples contrastive steps for preference learning. Experimental results show that PPT effectively improves TQA models by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets, with only 8,000 preference pairs. Furthermore, the resulting models achieve competitive results compared to more complex and larger state-of-the-art TQA systems, while being five times more efficient during inference.

[255] Reasoning Meets Personalization: Unleashing the Potential of Large Reasoning Model for Personalized Generation

Sichun Luo,Guanzhi Deng,Jian Xu,Xiaojie Zhang,Hanxu Hou,Linqi Song

Main category: cs.CL

TL;DR: This paper presents the first systematic evaluation of large reasoning models (LRMs) on personalization tasks and finds that they do not outperform general-purpose large language models (LLMs). The authors propose the Reinforced Reasoning for Personalization (RRP) framework, which significantly improves performance through a hierarchical reasoning template and an intervention method.

Details Motivation: Although large reasoning models (LRMs) perform well on tasks such as mathematics and coding, their potential for personalization tasks has not been fully explored. This paper aims to fill that gap. Method: Proposes the RRP framework, comprising a hierarchical reasoning template, a reasoning process intervention method, and a cross-referencing mechanism, to address the limitations of LRMs on personalization tasks. Result: Experiments show that the RRP framework significantly outperforms existing techniques, especially in retrieval-intensive scenarios. Conclusion: Through structured reasoning and intervention, the RRP framework effectively improves LRM performance on personalization tasks and offers a new direction for future research. Abstract: Personalization is a critical task in modern intelligent systems, with applications spanning diverse domains, including interactions with large language models (LLMs). Recent advances in reasoning capabilities have significantly enhanced LLMs, enabling unprecedented performance in tasks such as mathematics and coding. However, their potential for personalization tasks remains underexplored. In this paper, we present the first systematic evaluation of large reasoning models (LRMs) for personalization tasks. Surprisingly, despite generating more tokens, LRMs do not consistently outperform general-purpose LLMs, especially in retrieval-intensive scenarios where their advantages diminish. Our analysis identifies three key limitations: divergent thinking, misalignment of response formats, and ineffective use of retrieved information. To address these challenges, we propose Reinforced Reasoning for Personalization (RRP), a novel framework that incorporates a hierarchical reasoning thought template to guide LRMs in generating structured outputs. Additionally, we introduce a reasoning process intervention method to enforce adherence to designed reasoning patterns, enhancing alignment. We also propose a cross-referencing mechanism to ensure consistency. Extensive experiments demonstrate that our approach significantly outperforms existing techniques.

[256] Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models

Jiawei Kong,Hao Fang,Xiaochen Yang,Kuofeng Gao,Bin Chen,Shu-Tao Xia,Yaowei Wang,Min Zhang

Main category: cs.CL

TL;DR: Proposes a novel clean-data backdoor attack that uses harmless question-answer pairs to associate triggers with a benign prefix, evading safety detection and efficiently jailbreaking large language models.

Details Motivation: Existing backdoor attacks are easily detected by safety guardrails and undermine the model's safety alignment, so a stealthier and more effective attack is needed. Method: Uses harmless QA pairs to associate triggers with a benign prefix, strengthens a universal trigger through gradient-based optimization, and produces harmful responses in two stages. Result: Achieves attack success rates of 86.67% and 85% on LLaMA-3-8B and Qwen-2.5-7B, and evades detection by guardrail models. Conclusion: The method markedly improves the stealthiness and success rate of backdoor attacks, posing a new challenge for LLM safety defenses. Abstract: Supervised fine-tuning (SFT) aligns large language models (LLMs) with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer (QA) pairs. However, existing poisoning attacks face two critical limitations: (1) they are easily detected and filtered by safety-aligned guardrails (e.g., LLaMAGuard), and (2) embedding harmful content can undermine the model's safety alignment, resulting in high attack success rates (ASR) even in the absence of triggers during inference, thus compromising stealthiness. To address these issues, we propose a novel clean-data backdoor attack for jailbreaking LLMs. Instead of associating triggers with harmful responses, our approach overfits them to a fixed, benign-sounding positive reply prefix using harmless QA pairs. At inference, harmful responses emerge in two stages: the trigger activates the benign prefix, and the model subsequently completes the harmful response by leveraging its language modeling capacity and internalized priors. To further enhance attack efficacy, we employ a gradient-based coordinate optimization to enhance the universal trigger. Extensive experiments demonstrate that our method can effectively jailbreak various backdoored LLMs even under the detection of guardrail models, e.g., an ASR of 86.67% and 85% on LLaMA-3-8B and Qwen-2.5-7B judged by GPT-4o.

[257] Distilling LLM Agent into Small Models with Retrieval and Code Tools

Minki Kang,Jongwon Jeong,Seanie Lee,Jaewoong Cho,Sung Ju Hwang

Main category: cs.CL

TL;DR: The paper proposes Agent Distillation, a framework that transfers reasoning capability and task-solving behavior from large language models (LLMs) into small language models (sLMs) equipped with retrieval and code tools, addressing the weakness of conventional methods on rare factual knowledge and precise computation.

Details Motivation: Large language models excel at complex reasoning tasks but are computationally expensive, which limits practical deployment. Conventional methods distill reasoning capability into small models via chain-of-thought (CoT) traces, but perform poorly in scenarios requiring rare knowledge or precise computation. Method: Proposes the Agent Distillation framework with two improvements: (1) a first-thought prefix prompting method that improves the quality of teacher-generated trajectories; and (2) self-consistent action generation that improves the test-time robustness of small agents. Result: Experiments on eight reasoning tasks show that small models with 0.5B, 1.5B, and 3B parameters perform competitively with 1.5B, 3B, and 7B models trained with the conventional method. Conclusion: Agent Distillation shows the potential for building practical, tool-using small agents; the code is open source. Abstract: Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs with as few as 0.5B, 1.5B, and 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, and 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.

[258] Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments

Qingyu Lu,Liang Ding,Siyi Cao,Xuebo Liu,Kanjian Zhang,Jinxia Zhang,Dacheng Tao

Main category: cs.CL

TL;DR: The paper proposes two methods (intrinsic and extrinsic) to improve the efficiency of LLM-based agents in multi-turn interactions, reducing redundant steps while preserving performance.

Details Motivation: LLM agents perform well in complex environments but are inefficient in multi-turn interactions, often getting stuck in repetitive loops or issuing ineffective commands, which causes redundant computation. Method: (1) An intrinsic method that injects exit instructions during generation; (2) an extrinsic method that verifies task completion to decide when to terminate. Two metrics are introduced to measure the reduction of redundant steps and progress degradation. Result: Experiments with four LLMs across five environments show significant efficiency gains with only minor performance drops, and validate a strategy in which a stronger agent assists after an early exit. Conclusion: Early-exit mechanisms effectively improve LLM agent efficiency with little performance loss, supporting further research. Abstract: Agents powered by large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. However, such agents often suffer from inefficiencies in multi-turn interactions, frequently becoming trapped in repetitive loops or issuing ineffective commands, leading to redundant computational overhead. Instead of relying solely on learning from trajectories, we take a first step toward exploring the early-exit behavior for LLM-based agents. We propose two complementary approaches: (1) an intrinsic method that injects exit instructions during generation, and (2) an extrinsic method that verifies task completion to determine when to halt an agent's trial. To evaluate early-exit mechanisms, we introduce two metrics: one measures the reduction of redundant steps as a positive effect, and the other evaluates progress degradation as a negative effect. Experiments with 4 different LLMs across 5 embodied environments show significant efficiency improvements, with only minor drops in agent performance. We also validate a practical strategy where a stronger agent assists after an early-exit agent, achieving better performance with the same total steps. We will release our code to support further research.
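The extrinsic approach in the abstract, where a separate check decides when to halt a trial, can be pictured as an agent loop with a verification step after every action. The sketch below is a generic illustration, not the authors' implementation. The callables `act`, `step`, and `task_completed` are hypothetical stand-ins for the agent policy, the environment, and the completion verifier.

```python
from typing import Callable, List, Tuple


def run_with_extrinsic_exit(
    act: Callable[[List[str]], str],
    step: Callable[[str], str],
    task_completed: Callable[[List[str]], bool],
    max_steps: int,
) -> Tuple[List[str], bool]:
    """Run an agent trial, halting early once the verifier reports completion.

    Returns the interaction history and whether the trial exited early.
    """
    history: List[str] = []
    for _ in range(max_steps):
        action = act(history)          # agent chooses the next action
        observation = step(action)     # environment responds
        history.append(f"{action} -> {observation}")
        if task_completed(history):    # extrinsic check, outside the agent
            return history, True
    return history, False


# Toy stand-ins (hypothetical), for illustration only.
def toy_act(history: List[str]) -> str:
    return "open drawer" if len(history) == 1 else "look"


def toy_step(action: str) -> str:
    return "found key" if action == "open drawer" else "nothing"


def toy_verifier(history: List[str]) -> bool:
    return any("found key" in entry for entry in history)


trace, exited_early = run_with_extrinsic_exit(
    toy_act, toy_step, toy_verifier, max_steps=10
)
```

Without the verifier, this toy agent would keep issuing `look` until the step budget ran out, which is the kind of redundant step the paper's first metric is meant to capture.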

[259] Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports

Hayato Aida,Kosuke Takahashi,Takahiro Omi

Main category: cs.CL

TL;DR: Proposes a method that strengthens the table understanding of large vision-language models (LVLMs) by combining in-table textual content with layout features, significantly improving the parsing of complex document layouts.

Details Motivation: With advances in large language models (LLMs) and retrieval-augmented generation (RAG), understanding table structure has become critical in domains such as finance, yet current large vision-language models (LVLMs) still fall short in understanding characters and spatial relationships. Method: Introduces in-table textual content and layout features as auxiliary modalities to strengthen LVLM table understanding. Result: Experiments show that these auxiliary modalities significantly improve performance, allowing the model to interpret complex document layouts without relying on explicitly structured input. Conclusion: The method offers an effective way to improve LVLMs on table understanding, with potential applications in domains such as finance that require highly accurate question answering. Abstract: With recent advancements in Large Language Models (LLMs) and growing interest in retrieval-augmented generation (RAG), the ability to understand table structures has become increasingly important. This is especially critical in financial domains such as securities reports, where highly accurate question answering (QA) over tables is required. However, tables exist in various formats, including HTML, images, and plain text, making it difficult to preserve and extract structural information. Therefore, multimodal LLMs are essential for robust and general-purpose table understanding. Despite their promise, current Large Vision-Language Models (LVLMs), which are major representatives of multimodal LLMs, still face challenges in accurately understanding characters and their spatial relationships within documents. In this study, we propose a method to enhance LVLM-based table understanding by incorporating in-table textual content and layout features. Experimental results demonstrate that these auxiliary modalities significantly improve performance, enabling robust interpretation of complex document layouts without relying on explicitly structured input formats.

[260] GIM: Improved Interpretability for Large Language Models

Joakim Edin,Róbert Csordás,Tuukka Ruotsalo,Zhengxuan Wu,Maria Maistro,Jing Huang,Lars Maaløe

Main category: cs.CL

TL;DR: The paper proposes GIM, a new method that addresses the distortion of interpretability caused by self-repair in large language models and significantly improves the faithfulness of explanations.

Details Motivation: Faithful interpretability of large language models is essential for trustworthy and reliable AI, but self-repair masks the true importance of components. Method: Proposes Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Result: Across multiple large language models and tasks, GIM significantly outperforms existing circuit identification and feature attribution methods. Conclusion: GIM is an important step toward understanding the inner mechanisms of large language models, which helps improve them and ensure their safety. Abstract: Ensuring faithful interpretability in large language models is imperative for trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where networks compensate for reduced signal in one component by amplifying others, masking the true importance of the ablated component. While prior work attributes self-repair to layer normalization and back-up components that compensate for ablated components, we identify a novel form occurring within the attention mechanism, where softmax redistribution conceals the influence of important attention scores. This leads traditional ablation and gradient-based methods to underestimate the significance of all components contributing to these attention scores. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Extensive experiments across multiple large language models (Gemma 2B/9B, LLAMA 1B/3B/8B, Qwen 1.5B/3B) and diverse tasks demonstrate that GIM significantly improves faithfulness over existing circuit identification and feature attribution methods. Our work is a significant step toward better understanding the inner mechanisms of LLMs, which is crucial for improving them and ensuring their safety. Our code is available at https://github.com/JoakimEdin/gim.

[261] Stereotype Detection in Natural Language Processing

Alessandra Teresa Cignarella,Anastasia Giachanou,Els Lefever

Main category: cs.CL

TL;DR: The paper surveys research on stereotype detection, analyzes definitions from psychology, sociology, and philosophy, and outlines future directions for the NLP field.

Details Motivation: Stereotypes can escalate into discrimination and violence, yet stereotype detection in NLP is still an emerging area with significant societal implications. Method: A semi-automatic literature review via Semantic Scholar retrieved and filtered over 6,000 papers from 2000-2025, analyzing key trends, methodologies, challenges, and future directions. Result: The findings indicate that stereotype detection can serve as an early-monitoring tool to prevent bias escalation and the spread of hate speech. Conclusion: NLP research needs a broader, multilingual, and intersectional approach. Abstract: Stereotypes influence social perceptions and can escalate into discrimination and violence. While NLP research has extensively addressed gender bias and hate speech, stereotype detection remains an emerging field with significant societal implications. This work presents a survey of existing research, analyzing definitions from psychology, sociology, and philosophy. A semi-automatic literature review was performed by using Semantic Scholar. We retrieved and filtered over 6,000 papers (in the year range 2000-2025), identifying key trends, methodologies, challenges and future directions. The findings emphasize stereotype detection as a potential early-monitoring tool to prevent bias escalation and the rise of hate speech. Conclusions highlight the need for a broader, multilingual, and intersectional approach in NLP studies.

[262] Bridging Electronic Health Records and Clinical Texts: Contrastive Learning for Enhanced Clinical Tasks

Sara Ketabi,Dhanesh Ramachandram

Main category: cs.CL

TL;DR: Proposes a deep multimodal contrastive learning framework that combines structured EHR data with unstructured discharge summaries to improve performance on clinical prediction tasks.

Details Motivation: Conventional machine learning models perform well on clinical prediction tasks but struggle with tasks that require deeper contextual understanding (such as 30-day readmission prediction), mainly because structured EHR data carries limited semantic information. Method: Proposes a deep multimodal contrastive learning framework that aligns the latent representations of structured EHR data and unstructured discharge summaries by pulling paired embeddings together and pushing unpaired embeddings apart. Result: The pretrained EHR encoder significantly improves downstream performance, e.g., a 4.1% AUROC gain over XGBoost for 30-day readmission prediction. Conclusion: Integrating domain knowledge from clinical notes into EHR pipelines enables more accurate and context-aware clinical decision support systems. Abstract: Conventional machine learning models, particularly tree-based approaches, have demonstrated promising performance across various clinical prediction tasks using electronic health record (EHR) data. Despite their strengths, these models struggle with tasks that require deeper contextual understanding, such as predicting 30-day hospital readmission. This may be primarily due to the limited semantic information available in structured EHR data. To address this limitation, we propose a deep multimodal contrastive learning (CL) framework that aligns the latent representations of structured EHR data with unstructured discharge summary notes. It works by pulling together paired EHR and text embeddings while pushing apart unpaired ones. Fine-tuning the pretrained EHR encoder extracted from this framework significantly boosts downstream task performance, e.g., a 4.1% AUROC enhancement over XGBoost for 30-day readmission prediction. Such results demonstrate the effect of integrating domain knowledge from clinical notes into EHR-based pipelines, enabling more accurate and context-aware clinical decision support systems.
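The "pull paired embeddings together, push unpaired ones apart" objective can be made concrete with a symmetric InfoNCE loss, a common choice for aligning two modalities. This is an assumption for illustration: the abstract does not state which contrastive loss the authors use, and the function names and the temperature value below are hypothetical. The sketch uses plain Python lists rather than a tensor library so that it stays self-contained.

```python
import math
from typing import List, Sequence


def _normalize(vector: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean of -log softmax(row)[i], treating index i as the true pair of row i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(
    ehr_embeddings: Sequence[Sequence[float]],
    text_embeddings: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Symmetric InfoNCE: row i of each input is assumed to be a matched pair."""
    ehr = [_normalize(v) for v in ehr_embeddings]
    text = [_normalize(v) for v in text_embeddings]
    # Cosine similarity between every EHR and every text embedding.
    logits = [
        [sum(a * b for a, b in zip(e, t)) / temperature for t in text]
        for e in ehr
    ]
    transposed = [list(column) for column in zip(*logits)]
    # Average the EHR-to-text and text-to-EHR directions.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each EHR embedding is closest to its own note (`aligned`) and large when the pairs are mismatched (`shuffled`), which is the training signal that shapes the EHR encoder before fine-tuning.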

[263] EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications

Ancheng Xu,Zhihao Yang,Jingpeng Li,Guanghu Yuan,Longze Chen,Liang Yan,Jiehui Zhou,Zhen Qin,Hengyun Chang,Hamid Alinejad-Rokny,Bo Zheng,Min Yang

Main category: cs.CL

TL;DR: EVADE is a multimodal benchmark for evasive content detection in e-commerce; it contains text and image samples, evaluates mainstream LLMs and VLMs, and finds that current models fall short on this task.

Details Motivation: E-commerce platforms rely on LLMs and VLMs to detect violating content, but these models have limited ability to detect evasive content (superficially compliant but covertly violating), and relevant benchmarks are lacking. Method: Proposes the EVADE benchmark, with 2,833 text samples and 13,961 image samples, and designs two tasks (Single-Violation and All-in-One) to assess model capability. Result: Mainstream models perform poorly on EVADE; the All-in-One task suggests that clearer rule definitions improve agreement between model and human judgment. Conclusion: EVADE provides the first rigorous standard for evasive content detection, exposes limitations in multimodal reasoning, and lays the groundwork for safer e-commerce content moderation. Abstract: E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content: inputs (text or images) that superficially comply with platform policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks that induce overt failures, evasive content exploits ambiguity and context, making it far harder to detect. Existing robustness benchmarks provide little guidance for this demanding, real-world challenge. We introduce EVADE, the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories, including body shaping, height growth, and health supplements. Two complementary tasks assess distinct capabilities: Single-Violation, which probes fine-grained reasoning under short prompts, and All-in-One, which tests long-context reasoning by merging overlapping policy rules into unified instructions. Notably, the All-in-One setting significantly narrows the performance gap between partial and full-match accuracy, suggesting that clearer rule definitions improve alignment between human and model judgment. We benchmark 26 mainstream LLMs and VLMs and observe substantial performance gaps: even state-of-the-art models frequently misclassify evasive samples. By releasing EVADE and strong baselines, we provide the first rigorous standard for evaluating evasive-content detection, expose fundamental limitations in current multimodal reasoning, and lay the groundwork for safer and more transparent content moderation systems in e-commerce. The dataset is publicly available at https://huggingface.co/datasets/koenshen/EVADE-Bench.

[264] Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs

Hexiang Tan,Fei Sun,Sha Liu,Du Su,Qi Cao,Xin Chen,Jingang Wang,Xunliang Cai,Yuanzhuo Wang,Huawei Shen,Xueqi Cheng

Main category: cs.CL

TL;DR: The paper studies self-consistent errors in large language models (LLMs), finds that existing detection methods struggle with them, and proposes a cross-model probe to improve detection.

Details Motivation: Because LLMs often generate plausible but incorrect content, error detection is critical. However, existing methods overlook self-consistent errors, in which the model repeatedly generates the same error across multiple samples. Method: Formally defines self-consistent errors and evaluates mainstream detection methods on them. Based on the findings, proposes a cross-model probe that fuses hidden-state evidence from an external verifier LLM. Result: Two findings: (1) the frequency of self-consistent errors stays stable or increases as model scale grows; (2) existing detection methods perform markedly worse on self-consistent errors. The proposed method significantly improves detection across three LLM families. Conclusion: The study reveals the limitations of current detection methods and demonstrates the effectiveness of cross-model probing, pointing to directions for improvement. Abstract: As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term self-consistent errors, where LLMs repeatedly generate the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methods significantly struggle to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improved methods. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.
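The distinction between self-consistent and inconsistent errors can be illustrated with a small classifier over repeated samples. The sketch below makes several assumptions: answers are compared by exact string match, the model's answer is taken to be its most frequent sample, and an error counts as self-consistent only when agreement reaches a threshold (defaulting to all samples). The paper's formal definition may differ, and `classify_samples` is a hypothetical name.

```python
from collections import Counter
from typing import List


def classify_samples(samples: List[str], gold: str, threshold: float = 1.0) -> str:
    """Label a question by how the model's sampled answers relate to the gold answer.

    Returns "correct", "self-consistent error", or "inconsistent error".
    """
    if not samples:
        raise ValueError("at least one sampled answer is required")
    majority_answer, count = Counter(samples).most_common(1)[0]
    agreement = count / len(samples)
    if majority_answer == gold:
        return "correct"
    if agreement >= threshold:
        # The same wrong answer keeps coming back, so a detector based on
        # sample agreement would see high consistency and miss the error.
        return "self-consistent error"
    return "inconsistent error"
```

For example, five samples that all say "Lyon" when the gold answer is "Paris" would be labeled a self-consistent error, while a spread of different wrong answers would be labeled an inconsistent error. This is why the abstract argues for evidence from an external verifier model rather than from the sampled model's own agreement.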

[265] Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States

Yang Xiao,Jiashuo Wang,Qiancheng Xu,Changhe Song,Chunpu Xu,Yi Cheng,Wenjie Li,Pengfei Liu

Main category: cs.CL

TL;DR: The paper introduces the DynToM benchmark for evaluating the ability of large language models (LLMs) to track dynamic mental states, and finds that their average performance trails humans by 44.7%.

Details Motivation: Existing benchmarks focus mainly on static mental states and overlook the dynamic evolution found in real-world social interactions, so a new evaluation tool is needed. Method: A four-step framework generates 1,100 social contexts comprising 5,500 scenarios and 78,100 questions, and ten state-of-the-art LLMs are evaluated comprehensively. Result: LLMs perform significantly worse than humans at tracking dynamic mental states, with marked degradation when reasoning about shifts in mental states. Conclusion: Current LLMs have fundamental limitations in modeling the dynamic nature of human mental states. Abstract: As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities, particularly their ability to track dynamic mental states, becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present DynToM, a novel benchmark specifically designed to evaluate LLMs' ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that their average performance falls short of humans by 44.7%, with performance degrading significantly when tracking and reasoning about the shift of mental states. This performance gap highlights fundamental limitations in current LLMs' ability to model the dynamic nature of human mental states.

[266] QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning

Fanqi Wan,Weizhou Shen,Shengyi Liao,Yingcheng Shi,Chenliang Li,Ziyi Yang,Ji Zhang,Fei Huang,Jingren Zhou,Ming Yan

Main category: cs.CL

TL;DR: The QwenLong-L1 framework uses progressive context scaling and curriculum-guided reinforcement learning to address low training efficiency and unstable optimization in long-context reasoning.

Details Motivation: Existing large reasoning models (LRMs) perform well on short-context tasks but still face challenges in long-context reasoning. Method: Uses progressive context scaling, curriculum-guided phased reinforcement learning, and a difficulty-aware retrospective sampling strategy. Result: On seven long-context document question-answering benchmarks, QwenLong-L1-32B outperforms other flagship models and performs on par with Claude-3.7-Sonnet-Thinking. Conclusion: QwenLong-L1 provides an efficient and stable solution for long-context reasoning and advances the development of practical LRMs. Abstract: Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed within short-context reasoning tasks. In contrast, extending LRMs to effectively process and reason on long-context inputs via RL remains a critical unsolved challenge. To bridge this gap, we first formalize the paradigm of long-context reasoning RL, and identify key challenges in suboptimal training efficiency and unstable optimization process. To address these issues, we propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Specifically, we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust initial policy, followed by a curriculum-guided phased RL technique to stabilize the policy evolution, and enhanced with a difficulty-aware retrospective sampling strategy to incentivize the policy exploration. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B, achieving performance on par with Claude-3.7-Sonnet-Thinking, demonstrating leading performance among state-of-the-art LRMs. This work advances the development of practical long-context LRMs capable of robust reasoning across information-intensive environments.

[267] MIDB: Multilingual Instruction Data Booster for Enhancing Multilingual Instruction Synthesis

Yilun Liu,Chunguang Zhao,Xinhua Yang,Hongyong Zeng,Shimin Tao,Weibin Meng,Minggui He,Chang Su,Yan Yu,Hongxia Ma,Li Zhang,Daimeng Wei,Hao Yang

Main category: cs.CL

TL;DR: MIDB is a multilingual instruction data booster that automatically corrects content errors and machine-translation defects, improving the quality of multilingual synthesized instruction data and significantly strengthening the instruction-following and cultural-understanding abilities of multilingual LLMs.

Details Motivation: Multilingual synthesized instruction data suffers from serious quality problems caused by translation and insufficient localization, which hurts LLM performance. Method: MIDB is trained on about 36.8k human-revised examples across 16 languages to automatically fix content errors and machine-translation defects and to improve localization. Result: Automatic and human evaluations show that MIDB significantly improves data quality in 16 languages and strengthens LLMs' instruction-following and cultural-understanding abilities. Conclusion: MIDB effectively addresses the quality problems of multilingual synthesized instruction data and offers a practical tool for improving multilingual LLMs. Abstract: Despite doubts about data quality, instruction synthesis has been widely applied to instruction tuning (IT) of LLMs as an economical and rapid alternative. Recent endeavors focus on improving data quality for synthesized instruction pairs in English and have facilitated IT of English-centric LLMs. However, data quality issues in multilingual synthesized instruction pairs are even more severe, since the common synthesizing practice is to translate English synthesized data into other languages using machine translation (MT). Besides the known content errors in these English synthesized data, multilingual synthesized instruction data are further exposed to defects introduced by MT and face insufficient localization of the target languages. In this paper, we propose MIDB, a Multilingual Instruction Data Booster to automatically address the quality issues in multilingual synthesized data. MIDB is trained on around 36.8k revision examples across 16 languages by human linguistic experts, and can thereby boost the low-quality data by addressing content errors and MT defects, and improving localization in these synthesized data. Both automatic and human evaluation indicate that MIDB not only steadily improved instruction data quality in 16 languages, but also significantly enhanced the instruction-following and cultural-understanding abilities of multilingual LLMs fine-tuned on MIDB-boosted data.

[268] Tuning Language Models for Robust Prediction of Diverse User Behaviors

Fanjin Meng,Jingtao Ding,Jiahui Gong,Chen Yang,Hong Chen,Zuojian Wang,Haisheng Lu,Yong Li

Main category: cs.CL

TL;DR: BehaviorLM uses progressive fine-tuning to address LLM overfitting when predicting long-tailed behaviors, improving the prediction of rare behaviors.

Details Motivation: Deep learning models struggle to capture long-tailed behaviors when predicting user behavior, and existing fine-tuning methods tend to overfit to common behaviors, which weakens the prediction of rare ones. Method: BehaviorLM uses two-stage progressive fine-tuning: the first stage fine-tunes on common behaviors while preserving general behavioral knowledge; the second stage balances all behaviors based on sample difficulty to improve rare-behavior prediction. Result: Experiments on two real-world datasets show that BehaviorLM robustly predicts both common and rare behaviors and uses LLM knowledge to master rare-behavior prediction from few examples. Conclusion: BehaviorLM effectively addresses LLM overfitting in long-tailed behavior prediction and improves the prediction of rare behaviors. Abstract: Predicting user behavior is essential for intelligent assistant services, yet deep learning models often struggle to capture long-tailed behaviors. Large language models (LLMs), with their pretraining on vast corpora containing rich behavioral knowledge, offer promise. However, existing fine-tuning approaches tend to overfit to frequent "anchor" behaviors, reducing their ability to predict less common "tail" behaviors. In this paper, we introduce BehaviorLM, a progressive fine-tuning approach that addresses this issue. In the first stage, LLMs are fine-tuned on anchor behaviors while preserving general behavioral knowledge. In the second stage, fine-tuning uses a balanced subset of all behaviors based on sample difficulty to improve tail behavior predictions without sacrificing anchor performance. Experimental results on two real-world datasets demonstrate that BehaviorLM robustly predicts both anchor and tail behaviors and effectively leverages LLM behavioral knowledge to master tail behavior prediction with few-shot examples.

[269] ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction

Yan Yu,Yilun Liu,Minggui He,Shimin Tao,Weibin Meng,Xinhua Yang,Li Zhang,Hongxia Ma,Chang Su,Hao Yang,Fuliang Li

Main category: cs.CL

TL;DR: 论文提出了一种图论框架,通过建模成对偏好为锦标赛图,分析和解决大语言模型(LLM)评估中的非传递性问题,并设计了过滤策略ELSPR以减少非传递性偏好数据。

Details Motivation: LLM evaluators on open-ended tasks exhibit non-transitive preferences, where the evaluator prefers A over B and B over C but C over A; low-quality training data may be one cause. Method: A graph-theoretic framework quantifies non-transitivity and introduces directed graph structural entropy to measure the clarity of preferences; the ELSPR filtering strategy keeps only consistent, transitive preference data for fine-tuning. Result: Fine-tuning on filtered data lowers non-transitivity by 13.78 percentage points and structural entropy by 0.0879, and brings the model closer to human evaluators (human agreement rate up 0.6%, Spearman correlation up 0.01). Conclusion: Graph-theoretic analysis combined with data filtering can effectively reduce non-transitivity in LLM evaluation and improve its reliability and consistency. Abstract: Large language models (LLMs) are widely used as evaluators for open-ended tasks. While previous research has emphasized biases in LLM evaluations, the issue of non-transitivity in pairwise comparisons remains unresolved: evaluators may prefer A over B and B over C, but C over A. Our results suggest that low-quality training data may reduce the transitivity of preferences generated by the Evaluator LLM. To address this, we propose a graph-theoretic framework to analyze and mitigate this problem by modeling pairwise preferences as tournament graphs. We quantify non-transitivity and introduce directed graph structural entropy to measure the overall clarity of preferences. Our analysis reveals significant non-transitivity in advanced Evaluator LLMs (with Qwen2.5-Max exhibiting 67.96%), as well as high entropy values (0.8095 for Qwen2.5-Max), reflecting low overall clarity of preferences. To address this issue, we designed a filtering strategy, ELSPR, to eliminate preference data that induces non-transitivity, retaining only consistent and transitive preference data for model fine-tuning. Experiments demonstrate that models fine-tuned with filtered data reduce non-transitivity by 13.78 percentage points (from 64.28% to 50.50%), decrease structural entropy by 0.0879 (from 0.8113 to 0.7234), and align more closely with human evaluators (human agreement rate improves by 0.6% and Spearman correlation increases by 0.01).
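To make the tournament-graph view concrete, here is a minimal sketch of one simple way to measure non-transitivity: the fraction of three-item subsets whose preferences form a cycle. The abstract does not define the exact metric ELSPR uses, so the function name `cyclic_triple_fraction`, the edge-set input format, and the choice of counting 3-cycles are all assumptions made for illustration.

```python
from itertools import combinations


def cyclic_triple_fraction(edges):
    """Fraction of 3-node subsets whose preferences form a cycle.

    `edges` is a set of (winner, loser) pairs describing a tournament:
    exactly one direction is expected for every pair of items.
    (Hypothetical helper; not the metric defined in the paper.)
    """
    nodes = sorted({item for edge in edges for item in edge})
    total = 0
    cyclic = 0
    for a, b, c in combinations(nodes, 3):
        total += 1
        # A triple is non-transitive when it is a directed cycle
        # in either orientation: a>b>c>a or a>c>b>a.
        forward = (a, b) in edges and (b, c) in edges and (c, a) in edges
        backward = (b, a) in edges and (c, b) in edges and (a, c) in edges
        if forward or backward:
            cyclic += 1
    return cyclic / total if total else 0.0


# A over B, B over C, but C over A: the single triple is a cycle.
cycle = {("A", "B"), ("B", "C"), ("C", "A")}
print(cyclic_triple_fraction(cycle))  # 1.0
```

A filtering step in this spirit would drop preference pairs that participate in such cycles and keep the rest for fine-tuning; how ELSPR actually selects pairs is described in the paper, not here.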

[270] Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models

Zekai Zhao,Qi Liu,Kun Zhou,Zihan Liu,Yifei Shao,Zhiting Hu,Biwei Huang

Main category: cs.CL

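TL;DR: By analyzing the internal mechanisms of LLMs, the paper finds that a small set of high-impact activations governs long chain-of-thought ability, and proposes a training-free activation control technique that significantly improves reasoning performance.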

Details Motivation: To explore the internal mechanisms that allow long chain-of-thought ability to be elicited in LLMs without costly training. Method: Amplify key activations and insert "wait" tokens, using a few contrastive examples to identify the activations and simple analytic functions to modulate their values. Result: Self-reflection rates and accuracy increase significantly, confirming the effectiveness of the method. Conclusion: The paper presents an efficient, training-free technique for eliciting long chain-of-thought reasoning, along with a parameter-efficient fine-tuning method. Abstract: Despite the remarkable reasoning performance, eliciting the long chain-of-thought (CoT) ability in large language models (LLMs) typically requires costly reinforcement learning or supervised fine-tuning on high-quality distilled data. We investigate the internal mechanisms behind this capability and show that a small set of high-impact activations in the last few layers largely governs long-form reasoning attributes, such as output length and self-reflection. By simply amplifying these activations and inserting "wait" tokens, we can invoke the long CoT ability without any training, resulting in significantly increased self-reflection rates and accuracy. Moreover, we find that the activation dynamics follow predictable trajectories, with a sharp rise after special tokens and a subsequent exponential decay. Building on these insights, we introduce a general training-free activation control technique. It leverages a few contrastive examples to identify key activations, and employs simple analytic functions to modulate their values at inference time to elicit long CoTs. Extensive experiments confirm the effectiveness of our method in efficiently eliciting long CoT reasoning in LLMs and improving their performance. Additionally, we propose a parameter-efficient fine-tuning method that trains only a last-layer activation amplification module and a few LoRA layers, outperforming full LoRA fine-tuning on reasoning benchmarks with significantly fewer parameters. Our code and data are publicly released.

[271] SemSketches-2021: experimenting with the machine processing of the pilot semantic sketches corpus

Maria Ponomareva,Maria Petrova,Julia Detkova,Oleg Serikov,Maria Yarova

Main category: cs.CL

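TL;DR: The paper explores approaches to the machine processing of semantic sketches, presents the first open corpus of semantic sketches, and discusses how the sketches are created and which tasks they support, with a focus on machine processing tools for the corpus.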

Details Motivation: To study methods for the machine processing of semantic sketches and their applications. Method: The SemSketches-2021 Shared Task was organized, in which participants matched anonymous sketches to contexts containing the necessary predicates. Result: The shared task showed that matching semantic sketches to contexts is feasible. Conclusion: Semantic sketches and their machine processing tools show potential for solving specific tasks. Abstract: The paper deals with elaborating different approaches to the machine processing of semantic sketches. It presents the pilot open corpus of semantic sketches. Different aspects of creating the sketches are discussed, as well as the tasks that the sketches can help to solve. Special attention is paid to the creation of the machine processing tools for the corpus. For this purpose, the SemSketches-2021 Shared Task was organized. The participants were given the anonymous sketches and a set of contexts containing the necessary predicates. During the Task, one had to assign the proper contexts to the corresponding sketches.

[272] Understanding How Value Neurons Shape the Generation of Specified Values in LLMs

Yi Su,Jiayi Zhang,Shu Yang,Xinhai Wang,Lijie Hu,Di Wang

Main category: cs.CL

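TL;DR: The ValueLocate framework combines a psychological value framework with neuron analysis in LLMs to locate and validate value-critical neurons, offering a new approach to value alignment.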

Details Motivation: The value representations of LLMs are opaque, and existing methods struggle to explain systematically how values are encoded; mechanistic interpretability tools grounded in psychological frameworks are needed. Method: Construct the ValueInsight dataset and, based on the Schwartz Values Survey, design a neuron identification method that locates value-critical neurons through activation differences. Result: Manipulating the targeted neurons effectively changes the model's value orientation, establishing a causal relationship between neurons and value representations. Conclusion: ValueLocate provides a mechanistic foundation for value alignment by combining psychology with neuron analysis. Abstract: Rapid integration of large language models (LLMs) into societal applications has intensified concerns about their alignment with universal ethical principles, as their internal value representations remain opaque despite behavioral alignment advancements. Current approaches struggle to systematically interpret how values are encoded in neural architectures, limited by datasets that prioritize superficial judgments over mechanistic analysis. We introduce ValueLocate, a mechanistic interpretability framework grounded in the Schwartz Values Survey, to address this gap. Our method first constructs ValueInsight, a dataset that operationalizes four dimensions of universal value through behavioral contexts in the real world. Leveraging this dataset, we develop a neuron identification method that calculates activation differences between opposing value aspects, enabling precise localization of value-critical neurons without relying on computationally intensive attribution methods. Our proposed validation method demonstrates that targeted manipulation of these neurons effectively alters model value orientations, establishing causal relationships between neurons and value representations. This work advances the foundation for value alignment by bridging psychological value frameworks with neuron analysis in LLMs.

[273] The Pilot Corpus of the English Semantic Sketches

Maria Petrova,Maria Ponomareva,Alexandra Ivoylova

Main category: cs.CL

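TL;DR: The paper discusses how semantic sketches are created for English verbs and analyzes cross-language differences and the mistakes made during construction.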

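Details Motivation: To show the potential of contrastive studies using English-Russian semantic sketch pairs and to reveal cross-language differences between semantically similar sketches. Method: A pilot corpus of English-Russian sketch pairs was built, and the construction process and possible mistakes were analyzed. Result: The work reveals cross-language differences between semantic sketches and provides an analysis of mistakes made during construction. Conclusion: Semantic sketches support cross-language contrastive studies, and mistakes in building them can deepen understanding of the linguistic nature of sketches. Abstract: The paper is devoted to the creation of the semantic sketches for English verbs. The pilot corpus consists of the English-Russian sketch pairs and aims to show what kind of contrastive studies the sketches help to conduct. Special attention is paid to the cross-language differences between the sketches with similar semantics. Moreover, we discuss the process of building a semantic sketch, and analyse the mistakes that could give insight into the linguistic nature of sketches.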

[274] Fast Quiet-STaR: Thinking Without Thought Tokens

Wei Huang,Yizhe Xiong,Xin Ye,Zhijie Deng,Hui Chen,Zijia Lin,Guiguang Ding

Main category: cs.CL

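TL;DR: Fast Quiet-STaR is a more efficient reasoning framework that keeps the benefits of token-level reasoning at lower computational cost and outperforms Quiet-STaR on several benchmark datasets.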

Details Motivation: Although large language models (LLMs) perform well on natural language processing tasks, complex reasoning still needs improvement. Quiet-STaR improves reasoning by generating token-level thought traces, but at a high computational cost. Method: Fast Quiet-STaR uses a curriculum learning strategy that gradually reduces the number of thought tokens, and extends the approach to the standard next token prediction (NTP) setting through reinforcement learning-based fine-tuning. Result: On Mistral 7B and Qwen2.5 7B, Fast Quiet-STaR NTP improves average accuracy by 9% and 5.7% respectively, at the same inference latency. Conclusion: Fast Quiet-STaR improves reasoning efficiency while keeping its performance advantage, offering a more efficient solution for complex reasoning tasks. Abstract: Large Language Models (LLMs) have achieved impressive performance across a range of natural language processing tasks. However, recent advances demonstrate that further gains, particularly in complex reasoning tasks, require more than merely scaling up model sizes or training data. One promising direction is to enable models to think during the reasoning process. Recently, Quiet-STaR significantly improved reasoning by generating token-level thought traces, but it incurs substantial inference overhead. In this work, we propose Fast Quiet-STaR, a more efficient reasoning framework that preserves the benefits of token-level reasoning while reducing computational cost. Our method introduces a curriculum learning based training strategy that gradually reduces the number of thought tokens, enabling the model to internalize more abstract and concise reasoning processes. We further extend this approach to the standard Next Token Prediction (NTP) setting through reinforcement learning-based fine-tuning, resulting in Fast Quiet-STaR NTP, which eliminates the need for explicit thought token generation during inference. Experiments on four benchmark datasets with Mistral 7B and Qwen2.5 7B demonstrate that Fast Quiet-STaR consistently outperforms Quiet-STaR in terms of average accuracy under the same inference time budget. Notably, Fast Quiet-STaR NTP achieves an average accuracy improvement of 9% on Mistral 7B and 5.7% on Qwen2.5 7B, while maintaining the same inference latency. Our code will be available at https://github.com/huangwei200012/Fast-Quiet-STaR.

[275] Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks

Maureen de Seyssel,Jie Chi,Skyler Seto,Maartje ter Hoeve,Masha Fedzechkina,Natalie Schluter

Main category: cs.CL

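TL;DR: The paper introduces training-free ABX-style discrimination tasks for evaluating how multilingual models represent language identity (form) and semantic content (meaning). Experiments show that language discrimination declines over training and concentrates in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers.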

Details Motivation: To study how multilingual models represent language identity and semantic content, and to provide a lightweight, interpretable evaluation method. Method: Zero-shot ABX tasks are used to analyze the representations of XLM-R across training checkpoints and layers. Result: Language discrimination declines over training and concentrates in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. Conclusion: ABX tasks provide a lightweight framework for analyzing the structure of multilingual representations. Abstract: We introduce a set of training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity (form) and semantic content (meaning). Inspired by speech processing, these zero-shot tasks measure whether minimal differences in representation can be reliably detected. This offers a flexible and interpretable alternative to probing. Applied to XLM-R (Conneau et al., 2020) across pretraining checkpoints and layers, we find that language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. We then explore probing tasks, showing some alignment between our metrics and linguistic learning performance. Our results position ABX tasks as a lightweight framework for analyzing the structure of multilingual representations.
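As a rough illustration of how an ABX-style test scores representations, the sketch below counts a triple as correct when X, which shares a category with A, lies closer to A than to B. The abstract does not specify the distance or pooling the authors use, so cosine distance over plain vectors and the helper names `cosine_distance` and `abx_accuracy` are assumptions for illustration only.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_accuracy(triples):
    """Share of (A, B, X) triples where X is closer to A than to B.

    Each triple holds three vectors; X is expected to belong to the
    same category as A (e.g. same language, or same meaning).
    """
    correct = sum(
        cosine_distance(a, x) < cosine_distance(b, x) for a, b, x in triples
    )
    return correct / len(triples)


# Toy 2-d representations: the first X sits near A, the second near B.
triples = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),
    ([1.0, 0.0], [0.0, 1.0], [0.2, 0.8]),
]
print(abx_accuracy(triples))  # 0.5
```

In a setup like the one described, the vectors would come from a chosen layer and checkpoint of the model, and the score would be tracked across layers and training steps.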

[276] Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLMs

Ziyu Ge,Yuhao Wu,Daniel Wai Kit Chin,Roy Ka-Wei Lee,Rui Cao

Main category: cs.CL

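TL;DR: The paper evaluates how retrieval-augmented generation (RAG) models handle conflicting evidence in fact-checking and introduces a new dataset, CONFACT. It finds that existing methods are vulnerable when media sources differ in credibility, and proposes strategies that integrate source credibility information, which significantly improve model performance.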

Details Motivation: Large language models (LLMs) show potential in fact-checking, but their reliability drops when they face conflicting evidence from sources of varying credibility; systematic evaluation and improvement are needed. Method: Introduce the CONFACT dataset, evaluate RAG models under conflicting evidence, and propose strategies that integrate media background information into both the retrieval and generation stages. Result: Experiments show that existing RAG methods are vulnerable when resolving differences in media credibility, and that integrating source credibility information significantly improves performance. Conclusion: Integrating source credibility information markedly improves RAG models under conflicting evidence, giving fact-checking a more reliable solution. Abstract: Large Language Models (LLMs) augmented with retrieval mechanisms have demonstrated significant potential in fact-checking tasks by integrating external knowledge. However, their reliability decreases when confronted with conflicting evidence from sources of varying credibility. This paper presents the first systematic evaluation of Retrieval-Augmented Generation (RAG) models for fact-checking in the presence of conflicting evidence. To support this study, we introduce CONFACT (Conflicting Evidence for Fact-Checking; dataset available at https://github.com/zoeyyes/CONFACT), a novel dataset comprising questions paired with conflicting information from various sources. Extensive experiments reveal critical vulnerabilities in state-of-the-art RAG methods, particularly in resolving conflicts stemming from differences in media source credibility. To address these challenges, we investigate strategies to integrate media background information into both the retrieval and generation stages. Our results show that effectively incorporating source credibility significantly enhances the ability of RAG models to resolve conflicting evidence and improve fact-checking performance.

[277] The Real Barrier to LLM Agent Usability is Agentic ROI

Weiwen Liu,Jiarui Qin,Xu Huang,Xingshan Zeng,Yunjia Xi,Jianghao Lin,Chuhan Wu,Yasheng Wang,Lifeng Shang,Ruiming Tang,Defu Lian,Yong Yu,Weinan Zhang

Main category: cs.CL

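TL;DR: The paper identifies a usability gap for LLM agents in real-world applications, argues for moving from purely optimizing model performance to a utility-driven perspective, introduces the concept of Agentic ROI, and outlines a path for optimizing it.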

Details Motivation: LLM agents perform well in specialized domains but see limited use in mass-market applications, mainly because of the tradeoff between value and cost. Method: Propose the Agentic ROI framework, which considers information quality, agent time, and cost, and plan to first scale up to improve information quality and then scale down to reduce cost and time. Result: Optimizing Agentic ROI can improve the practicality and scalability of LLM agents. Conclusion: The paper calls for utility-centered development to drive broad adoption of LLM agents in the mass market. Abstract: Large Language Model (LLM) agents represent a promising shift in human-AI interaction, moving beyond passive prompt-response systems to autonomous agents capable of reasoning, planning, and goal-directed action. Despite the widespread application in specialized, high-effort tasks like coding and scientific research, we highlight a critical usability gap in high-demand, mass-market applications. This position paper argues that the limited real-world adoption of LLM agents stems not only from gaps in model capabilities, but also from a fundamental tradeoff between the value an agent can provide and the costs incurred during real-world use. Hence, we call for a shift from solely optimizing model performance to a broader, utility-driven perspective: evaluating agents through the lens of the overall agentic return on investment (Agentic ROI). By identifying key factors that determine Agentic ROI (information quality, agent time, and cost), we posit a zigzag development trajectory in optimizing Agentic ROI: first scaling up to improve the information quality, then scaling down to minimize the time and cost. We outline the roadmap across different development stages to bridge the current usability gaps, aiming to make LLM agents truly scalable, accessible, and effective in real-world contexts.

[278] EXECUTE: A Multilingual Benchmark for LLM Token Understanding

Lukas Edman,Helmut Schmid,Alexander Fraser

Main category: cs.CL

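TL;DR: EXECUTE extends the CUTE benchmark to test the character understanding of multilingual LLMs, finds that the level at which difficulties arise varies by language, and studies sub-character tasks in Chinese, Japanese, and Korean.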

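Details Motivation: To extend the CUTE benchmark so that the character understanding of LLMs can be evaluated in multilingual settings and differences between languages can be revealed. Method: Develop the EXECUTE framework, which simplifies extension to new languages, test multiple LLMs across languages, and study sub-character tasks in Chinese, Japanese, and Korean. Result: The level at which difficulties arise differs by language (character level or word level), and some languages show no issues; the sub-character tasks in Chinese, Japanese, and Korean probe how well LLMs understand character components. Conclusion: EXECUTE provides a flexible framework for evaluating multilingual LLMs and shows how language characteristics affect model performance. Abstract: The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, some show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs' understanding of character components.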

[279] Compression Hacking: A Supplementary Perspective on Informatics Metric of Language Models from Geometric Distortion

Jianxiang Zang,Meiling Ning,Yongda Wei,Shihan Dou,Jiazheng Zhang,Nijia Mo,Binhong Li,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

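TL;DR: The paper proposes refined compression metrics that incorporate geometric distortion analysis to improve compression-based measurement of language models, substantially improving how well the metrics explain model capability.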

Details Motivation: The representation space of highly compressed language models tends to degenerate, which hurts model performance, so compression metrics need to be improved. Method: Three refined compression metrics are proposed that incorporate geometric distortion analysis, and they are integrated into a self-evaluation pipeline. Result: The refined metrics correlate strongly with overall model capability (Spearman coefficient above 0.9) and outperform the original compression metric. Conclusion: Incorporating geometric distortion analysis substantially improves the informatics interpretation of language models. Abstract: Recently, the concept of "compression as intelligence" has provided a novel informatics metric perspective for language models (LMs), emphasizing that highly structured representations signify the intelligence level of LMs. However, from a geometric standpoint, the word representation space of highly compressed LMs tends to degenerate into a highly anisotropic state, which hinders the LM's ability to comprehend instructions and directly impacts its performance. We found this compression-anisotropy synchronicity is essentially the "Compression Hacking" in LM representations, where noise-dominated directions tend to create the illusion of high compression rates by sacrificing spatial uniformity. Based on this, we propose three refined compression metrics by incorporating geometric distortion analysis and integrate them into a self-evaluation pipeline. The refined metrics exhibit strong alignment with the LM's comprehensive capabilities, achieving Spearman correlation coefficients above 0.9, significantly outperforming both the original compression and other internal structure-based metrics. This confirms that compression hacking substantially enhances the informatics interpretation of LMs by incorporating geometric distortion of representations.

[280] DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors

Tazeek Bin Abdur Rakib,Ambuj Mehrish,Lay-Ki Soon,Wern Han Lim,Soujanya Poria

Main category: cs.CL

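TL;DR: DialogXpert uses a frozen LLM to generate high-quality candidate actions and a compact Q-network to select the best one, enabling efficient, emotionally intelligent dialogue planning.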

Details Motivation: To address myopic decoding and costly planning in goal-driven dialogue with LLM agents. Method: A frozen LLM proposes candidate actions, a Q-network over BERT embeddings selects among them, and the user's emotions are tracked. Result: On negotiation, emotional support, and tutoring tasks, dialogue success rates exceed 94%, rising above 97% with a larger LLM prior. Conclusion: DialogXpert delivers real-time, strategic, and emotionally intelligent dialogue planning. Abstract: Large-language-model (LLM) agents excel at reactive dialogue but struggle with proactive, goal-driven interactions due to myopic decoding and costly planning. We introduce DialogXpert, which leverages a frozen LLM to propose a small, high-quality set of candidate actions per turn and employs a compact Q-network over fixed BERT embeddings trained via temporal-difference learning to select optimal moves within this reduced space. By tracking the user's emotions, DialogXpert tailors each decision to advance the task while nurturing a genuine, empathetic connection. Across negotiation, emotional support, and tutoring benchmarks, DialogXpert drives conversations to under 3 turns with success rates exceeding 94% and, with a larger LLM prior, pushes success above 97% while markedly improving negotiation outcomes. This framework delivers real-time, strategic, and emotionally intelligent dialogue planning at scale. Code available at https://github.com/declare-lab/dialogxpert/

[281] Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

Michael Hassid,Gabriel Synnaeve,Yossi Adi,Roy Schwartz

Main category: cs.CL

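TL;DR: The study finds that shorter reasoning chains in LLMs are more accurate than longer ones, and proposes the short-m@k method to cut computational cost while improving performance.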

Details Motivation: To challenge the assumption that long reasoning chains improve performance, and to reduce computational cost and inference time. Method: The short-m@k method runs k independent generations in parallel, stops once the first m reasoning chains finish, and takes a majority vote among them. Result: short-1@k matches or beats majority voting in low-compute settings while using up to 40% fewer thinking tokens, and short-3@k surpasses majority voting across all compute budgets. Conclusion: Training and inference with short reasoning chains are more efficient, long chains can be counterproductive, and current approaches to test-time compute in LLMs should be reconsidered. Abstract: Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains result in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers (up to 34.5% more accurate than the longest chain sampled for the same question). Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings, using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall time reduction). Inspired by our results, we finetune an LLM using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.
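The selection rule behind short-m@k can be sketched offline as follows. This is a simplified illustration, not the authors' implementation: it assumes all k chains are already sampled and that the chain with the fewest thinking tokens is the one that would finish first in parallel decoding. The function name `short_m_at_k`, the `(thinking_length, answer)` input format, and the tie-breaking rule (the tied answer whose chain finished earliest wins) are assumptions.

```python
from collections import Counter


def short_m_at_k(chains, m):
    """Majority vote over the m shortest of k sampled reasoning chains.

    `chains` is a list of (thinking_length, answer) pairs. Shorter
    chains are treated as finishing first (an assumption that holds
    when all generations decode at the same speed).
    """
    if not 1 <= m <= len(chains):
        raise ValueError("m must be between 1 and the number of chains")
    # Keep only the first m chains to finish; the rest would be halted.
    finished_first = sorted(chains, key=lambda chain: chain[0])[:m]
    counts = Counter(answer for _, answer in finished_first)
    top = max(counts.values())
    # Assumed tie-break: among tied answers, prefer the earliest finisher.
    for _, answer in finished_first:
        if counts[answer] == top:
            return answer


# Five sampled chains (k = 5) as (thinking tokens, final answer).
samples = [(120, "42"), (300, "41"), (90, "42"), (500, "40"), (200, "41")]
print(short_m_at_k(samples, m=3))  # 42
```

With m = 1 this reduces to taking the answer of the shortest chain, which corresponds to the short-1@k variant mentioned in the abstract.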

[282] Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong

Hei Yi Mak,Tan Lee

Main category: cs.CL

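TL;DR: The paper presents a Transformer-based neural machine translation system for translating between standard Chinese and written Cantonese, and tackles the scarcity of parallel corpora through data mining and collection.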

Details Motivation: As written Cantonese spreads online and interaction between Mandarin and Cantonese speakers grows, demand for automatic translation is increasingly clear. Method: The study uses a Transformer architecture and builds training data by collecting 28K parallel sentences and automatically extracting 72K semantically similar sentence pairs from Wikipedia. Result: The system outperforms Baidu Fanyi in BLEU score on 6 of 8 test sets and captures key linguistic transformations between Chinese and Cantonese. Conclusion: With data mining and effective training, the system performs well on Chinese-to-Cantonese translation and addresses the challenge of corpus scarcity. Abstract: The majority of inhabitants in Hong Kong are able to read and write in standard Chinese but use Cantonese as the primary spoken language in daily life. Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. The rise of written Cantonese is increasingly evident in the cyber world. The growing interaction between Mandarin speakers and Cantonese speakers is leading to a clear demand for automatic translation between Chinese and Cantonese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation. Given that parallel text data of Chinese and Cantonese are extremely scarce, a major focus of this study is on the effort of preparing a good amount of training data for NMT. In addition to collecting 28K parallel sentences from previous linguistic studies and scattered internet resources, we devise an effective approach to obtaining 72K parallel sentences by automatically extracting pairs of semantically similar sentences from parallel articles on Chinese Wikipedia and Cantonese Wikipedia. We show that leveraging highly similar sentence pairs mined from Wikipedia improves translation performance in all test sets. Our system outperforms Baidu Fanyi's Chinese-to-Cantonese translation on 6 out of 8 test sets in BLEU scores. Translation examples reveal that our system is able to capture important linguistic transformations between standard Chinese and spoken Cantonese.

[283] Not All Tokens Are What You Need In Thinking

Hang Yuan,Bin Yu,Haotian Li,Shijun Yang,Christina Dan Wang,Zhou Yu,Xueyin Xu,Weizhen Qi,Kai Chen

Main category: cs.CL

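TL;DR: The paper proposes the Conditional Token Selection (CTS) framework, which compresses redundant tokens in the reasoning process to significantly improve model efficiency while maintaining reasoning performance.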

Details Motivation: Modern reasoning models suffer from high inference latency, heavy computational resource consumption, and overthinking (generating redundant reasoning chains); CTS aims to address these problems. Method: CTS uses conditional importance scoring to identify and keep the essential tokens, then trains models on the compressed reasoning chains. Result: On the GPQA benchmark, CTS substantially reduces the number of reasoning tokens (by up to 75.8%) while improving accuracy or lowering it only slightly. Conclusion: CTS shows that existing reasoning chains contain substantial redundancy and that selective compression can significantly improve model efficiency. Abstract: Modern reasoning models, such as OpenAI's o1 and DeepSeek-R1, exhibit impressive problem-solving capabilities but suffer from critical inefficiencies: high inference latency, excessive computational resource consumption, and a tendency toward overthinking, that is, generating verbose chains of thought (CoT) laden with redundant tokens that contribute minimally to the final answer. To address these issues, we propose Conditional Token Selection (CTS), a token-level compression framework with a flexible and variable compression ratio that identifies and preserves only the most essential tokens in CoT. CTS evaluates each token's contribution to deriving correct answers using conditional importance scoring, then trains models on compressed CoT. Extensive experiments demonstrate that CTS effectively compresses long CoT while maintaining strong reasoning performance. Notably, on the GPQA benchmark, Qwen2.5-14B-Instruct trained with CTS achieves a 9.1% accuracy improvement with 13.2% fewer reasoning tokens (13% training token reduction). Further reducing training tokens by 42% incurs only a marginal 5% accuracy drop while yielding a 75.8% reduction in reasoning tokens, highlighting the prevalence of redundancy in existing CoT.

[284] Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning

Zezhong Wang,Xingshan Zeng,Weiwen Liu,Yufei Wang,Liangyou Li,Yasheng Wang,Lifeng Shang,Xin Jiang,Qun Liu,Kam-Fai Wong

Main category: cs.CL

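TL;DR: The paper proposes Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints and two strategies to address path homogenization and the underuse of intermediate results in existing Test-Time Scaling methods for mathematical reasoning.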

Details Motivation: Existing Chain-of-Thought (CoT) methods improve reasoning accuracy through Test-Time Scaling (TTS), but suffer from path homogenization and inefficient use of intermediate results. Method: SRCA introduces checkpoints between reasoning steps and applies two strategies, Answer-Clustered Search and Checkpoint Candidate Augmentation, to preserve path diversity and make efficient use of intermediate results. Result: Experiments show that SRCA improves reasoning accuracy over existing TTS methods on several mathematical datasets. Conclusion: By making better use of intermediate results and reducing path homogenization, SRCA offers a more efficient solution for mathematical reasoning tasks. Abstract: Mathematical reasoning through Chain-of-Thought (CoT) has emerged as a powerful capability of Large Language Models (LLMs), which can be further enhanced through Test-Time Scaling (TTS) methods like Beam Search and DVTS. However, these methods, despite improving accuracy by allocating more computational resources during inference, often suffer from path homogenization and inefficient use of intermediate results. To address these limitations, we propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results. Experimental results show that SRCA improves reasoning accuracy compared to existing TTS methods across various mathematical datasets.

[285] Emerging categories in scientific explanations

Giacomo Magnifico,Eduard Barbu

Main category: cs.CL

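TL;DR: The paper presents a method for extracting explanatory sentences from biotechnology and biophysics literature and building a dataset with multi-class annotation, aiming to fill the lack of large-scale datasets of human-generated explanations in machine learning and artificial intelligence.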

Details Motivation: Machine learning and artificial intelligence currently lack large-scale datasets of human-generated explanations, even though clear and effective explanations are essential for human understanding and knowledge dissemination. Method: Explanatory sentences are extracted from sources such as PubMed's PMC Open Access subset, annotated with multiple classes, and assessed for annotator agreement. Result: An openly available dataset was built with 6-class and 3-class annotation; the 3-class annotation reaches a Krippendorff's alpha of 0.667. Conclusion: The dataset provides a resource of human-generated explanations for machine learning and artificial intelligence and demonstrates the feasibility of the annotation approach. Abstract: Clear and effective explanations are essential for human understanding and knowledge dissemination. The scope of scientific research aiming to understand the essence of explanations has recently expanded from the social sciences to machine learning and artificial intelligence. Explanations for machine learning decisions must be impactful and human-like, and there is a lack of large-scale datasets focusing on human-like and human-generated explanations. This work aims to provide such a dataset by: extracting sentences that indicate explanations from scientific literature among various sources in the biotechnology and biophysics topic domains (e.g. PubMed's PMC Open Access subset); providing a multi-class notation derived inductively from the data; evaluating annotator consensus on the emerging categories. The sentences are organized in an openly-available dataset, with two different classifications (6-class and 3-class category annotation), and the 3-class notation achieves a Krippendorff's alpha value of 0.667.

[286] Investigating Affect Mining Techniques for Annotation Sample Selection in the Creation of Finnish Affective Speech Corpus

Kalle Lahtinen,Einari Vaaras,Liisa Mustanoja,Okko Räsänen

Main category: cs.CL

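TL;DR: The paper presents the first corpus of natural affective speech in Finnish, built with a sampling approach that combines acoustic, cross-linguistic speech emotion, and text sentiment features, and compares the annotation diversity of random sampling with that of affect mining.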

Details Motivation: No corpus of naturally expressed affect in spontaneous Finnish has existed; available data is acted or comes from a very specific communicative setting. Method: 12,000 utterances are sampled from three large-scale Finnish speech corpora using affect mining that combines acoustic, cross-linguistic speech emotion, and text sentiment features, and the resulting diversity is compared with random sampling. Result: The first corpus of natural affective speech in Finnish was built, and affect mining yielded greater diversity than random sampling. Conclusion: The work provides a Finnish affective speech resource and offers guidance on sampling strategies for building affective corpora in other languages or domains. Abstract: Study of affect in speech requires suitable data, as emotional expression and perception vary across languages. Until now, no corpus has existed for natural expression of affect in spontaneous Finnish, existing data being acted or from a very specific communicative setting. This paper presents the first such corpus, created by annotating 12,000 utterances for emotional arousal and valence, sampled from three large-scale Finnish speech corpora. To ensure diverse affective expression, sample selection was conducted with an affect mining approach combining acoustic, cross-linguistic speech emotion, and text sentiment features. We compare this method to random sampling in terms of annotation diversity, and conduct post-hoc analyses to identify sampling choices that would have maximized the diversity. As an outcome, the work introduces a spontaneous Finnish affective speech corpus and informs sampling strategies for affective speech corpus creation in other languages or domains.

[287] Explaining Sources of Uncertainty in Automated Fact-Checking

Jingyi Sun,Greta Warren,Irina Shklovski,Isabelle Augenstein

Main category: cs.CL

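TL;DR: The CLUE framework identifies conflict and agreement relationships in text in an unsupervised way and generates natural language explanations of model uncertainty, improving human-AI collaboration.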

Details Motivation: Existing approaches (such as numerical uncertainty or hedging expressions) cannot explain uncertainty that arises from conflicting evidence, leaving users unable to rely on the output or resolve disagreements. Method: CLUE identifies claim-evidence and inter-evidence conflicts and agreements in the text, then generates explanations through prompting and attention steering. Result: Across three language models and two fact-checking datasets, CLUE's explanations are more faithful to model uncertainty and more consistent with fact-checking decisions. Human evaluators judge them more helpful, more informative, and more logically consistent. Conclusion: CLUE applies to any white-box language model without fine-tuning and, by explicitly linking uncertainty to evidence conflicts, offers practical support for fact-checking and similar tasks. Abstract: Understanding sources of a model's uncertainty regarding its predictions is crucial for effective human-AI collaboration. Prior work proposes using numerical uncertainty or hedges ("I'm not sure, but ..."), which do not explain uncertainty that arises from conflicting evidence, leaving users unable to resolve disagreements or rely on the output. We introduce CLUE (Conflict-and-Agreement-aware Language-model Uncertainty Explanations), the first framework to generate natural language explanations of model uncertainty by (i) identifying relationships between spans of text that expose claim-evidence or inter-evidence conflicts and agreements that drive the model's predictive uncertainty in an unsupervised way, and (ii) generating explanations via prompting and attention steering that verbalize these critical interactions. Across three language models and two fact-checking datasets, we show that CLUE produces explanations that are more faithful to the model's uncertainty and more consistent with fact-checking decisions than prompting for uncertainty explanations without span-interaction guidance. Human evaluators judge our explanations to be more helpful, more informative, less redundant, and more logically consistent with the input than this baseline. CLUE requires no fine-tuning or architectural changes, making it plug-and-play for any white-box language model. By explicitly linking uncertainty to evidence conflicts, it offers practical support for fact-checking and generalises readily to other tasks that require reasoning over complex information.

[288] Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods

Shaina Raza,Rizwan Qureshi,Marcelo Lotif,Aman Chadha,Deval Pandya,Christos Emmanouilidis

Main category: cs.CL

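TL;DR: The paper proposes an approach analogous to biological immunization: fine-tuning AI models on labeled falsehoods to strengthen their ability to recognize and reject misinformation.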

Details Motivation: Generative AI models often learn and reproduce misinformation from their training data, so a proactive way to improve factuality is needed. Method: Small amounts of labeled falsehoods are periodically injected during fine-tuning to act as a "vaccine". Result: An illustrative case study shows that immunized models generate substantially less misinformation than baseline models. Conclusion: The approach offers a new paradigm for improving model factuality and calls for ethical safeguards to ensure false data is used safely. Abstract: Generative AI models often learn and reproduce false information present in their training corpora. This position paper argues that, analogous to biological immunization, where controlled exposure to a weakened pathogen builds immunity, AI models should be fine-tuned on small, quarantined sets of explicitly labeled falsehoods as a "vaccine" against misinformation. These curated false examples are periodically injected during fine-tuning, strengthening the model's ability to recognize and reject misleading claims while preserving accuracy on truthful inputs. An illustrative case study shows that immunized models generate substantially less misinformation than baselines. To our knowledge, this is the first training framework that treats fact-checked falsehoods themselves as a supervised vaccine, rather than relying on input perturbations or generic human feedback signals, to harden models against future misinformation. We also outline ethical safeguards and governance controls to ensure the safe use of false data. Model immunization offers a proactive paradigm for aligning AI systems with factuality.

[289] MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

Wanhao Liu,Zonglin Yang,Jue Wang,Lidong Bing,Di Zhang,Dongzhan Zhou,Yuqiang Li,Houqiang Li,Erik Cambria,Wanli Ouyang

Main category: cs.CL

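TL;DR: The paper proposes experiment-guided hypothesis ranking, uses a simulator to sidestep the high cost of experiments in the natural sciences, and validates the approach on a dataset of chemistry hypotheses.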

Details Motivation: Existing hypothesis-ranking methods rely solely on language-model reasoning and incorporate no experimental feedback, while real experiments are costly and hard to repeat. Method: Defines the experiment-guided ranking task, designs a simulator grounded in domain-informed assumptions, and develops a pseudo experiment-guided ranking method. Result: The simulator is validated on a dataset of 124 chemistry hypotheses, and the new method outperforms pre-experiment baselines and strong ablations. Conclusion: Experiment-guided ranking with a simulator improves hypothesis ranking and suits domains where experiments are expensive. Abstract: Hypothesis ranking is a crucial component of automated scientific discovery, particularly in natural sciences where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on large language models' internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging due to the impracticality of repeatedly conducting real experiments in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.
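
The simulator idea in the abstract (performance as a function of similarity to a ground-truth hypothesis, perturbed by noise) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not state the paper's three domain-informed assumptions, so Jaccard overlap between sets of hypothesis components stands in for the similarity function, Gaussian noise stands in for the perturbation, and the names `jaccard` and `simulate_outcome` are hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard overlap between two collections of hypothesis components."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def simulate_outcome(hypothesis, ground_truth, noise_std=0.05, rng=None):
    """Simulated experimental score: similarity to the ground truth plus noise.

    Assumption: similarity is Jaccard overlap and noise is Gaussian; the
    paper's actual simulator may differ. The result is clipped to [0, 1].
    """
    rng = rng or random.Random(0)
    score = jaccard(hypothesis, ground_truth) + rng.gauss(0.0, noise_std)
    return min(1.0, max(0.0, score))


if __name__ == "__main__":
    truth = {"catalyst_A", "solvent_B", "heat"}
    candidates = [{"catalyst_A", "heat"}, {"solvent_C"}, {"catalyst_A", "solvent_B", "heat"}]
    rng = random.Random(42)
    # Rank candidates by simulated feedback, highest first.
    ranked = sorted(candidates, key=lambda h: simulate_outcome(h, truth, rng=rng), reverse=True)
    print(ranked)
```

A ranking method could then query `simulate_outcome` in place of a wet-lab experiment while it is being developed, which is the role the abstract assigns to the simulator.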

[290] Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

Khalil Hennara,Muhammad Hreden,Mohamed Motaism Hamed,Zeina Aldallal,Sara Chrouf,Safwan AlModhayan

Main category: cs.CL

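TL;DR: Mutarjim is a compact yet powerful bidirectional Arabic-English translation model built on Kuwain-1.5B. With an optimized training approach and a high-quality corpus, it outperforms much larger models at significantly lower computational cost. The authors also release Tarjama-25, a new benchmark for more comprehensive evaluation.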

Details Motivation: Large language models perform well on machine translation but are computationally expensive. The authors aim to build a more efficient compact model while addressing the limitations of existing Arabic-English benchmarks. Method: Mutarjim is built on Kuwain-1.5B using an optimized two-phase training approach and a high-quality training corpus. Result: Mutarjim outperforms much larger models on several benchmarks and achieves state-of-the-art English-to-Arabic performance on Tarjama-25 at significantly lower computational cost. Conclusion: Mutarjim shows that compact models can be highly effective for machine translation, and Tarjama-25 gives future research a more comprehensive evaluation tool. Abstract: We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller models. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B, a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.

[291] Language models can learn implicit multi-hop reasoning, but only if they have lots of training data

Yuekun Yao,Yupei Du,Dawei Zhu,Michael Hahn,Alexander Koller

Main category: cs.CL

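TL;DR: The paper studies whether language models can perform multi-hop reasoning in a single forward pass. It finds that the required training data grows exponentially, and the required number of layers grows linearly, with the number of hops, and offers a theoretical explanation.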

Details Motivation: To explore whether language models can solve multi-hop reasoning tasks without chain of thought. Method: GPT2-style language models are trained on controlled multi-hop reasoning datasets (2, 3, and 4 hops). Result: The models can learn implicit multi-hop reasoning, but the training data required grows exponentially with the number of hops and the number of layers required grows linearly. Curriculum learning mitigates the data requirement. Conclusion: Implicit multi-hop reasoning is feasible but demands more resources and better training strategies. Abstract: Implicit reasoning is the ability of a language model to solve multi-hop reasoning tasks in a single forward pass, without chain of thought. We investigate this capability using GPT2-style language models trained from scratch on controlled $k$-hop reasoning datasets ($k = 2, 3, 4$). We show that while such models can indeed learn implicit $k$-hop reasoning, the required training data grows exponentially in $k$, and the required number of transformer layers grows linearly in $k$. We offer a theoretical explanation for why this depth growth is necessary. We further find that the data requirement can be mitigated, but not eliminated, through curriculum learning.
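
To make the notion of a $k$-hop query concrete, here is a small sketch of how such a question resolves by composing $k$ lookups. The fact table, the relation names, and the function `k_hop_answer` are hypothetical; the abstract does not describe the paper's dataset format. An implicit-reasoning model has to produce the final entity directly, without writing out the intermediate ones.

```python
def k_hop_answer(facts, start, relations):
    """Follow a chain of relations from a start entity.

    `facts` maps (entity, relation) pairs to the resulting entity, and
    `relations` has length k for a k-hop query.
    """
    entity = start
    for relation in relations:
        entity = facts[(entity, relation)]
    return entity


if __name__ == "__main__":
    # Hypothetical toy knowledge base, not taken from the paper.
    facts = {
        ("alice", "mother"): "beth",
        ("beth", "employer"): "acme",
        ("acme", "city"): "oslo",
    }
    print(k_hop_answer(facts, "alice", ["mother", "employer"]))          # 2 hops
    print(k_hop_answer(facts, "alice", ["mother", "employer", "city"]))  # 3 hops
```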

[292] Handling Symbolic Language in Student Texts: A Comparative Study of NLP Embedding Models

Tom Bleckmann,Paul Tschisgale

Main category: cs.CL

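TL;DR: The study examines how NLP embedding models differ in handling science-related symbolic expressions. GPT-text-embedding-3-large performs best, though by a limited margin, and the authors stress that model selection should also weigh cost, regulatory compliance, and transparency.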

Details Motivation: Symbolic expressions in scientific language, such as formulas, challenge NLP embedding models. Existing studies often ignore or remove these symbols, which can bias findings and degrade performance. Method: Several embedding models are evaluated on physics-related symbolic expressions using two approaches: similarity-based analyses and integration into a machine learning pipeline. Result: GPT-text-embedding-3-large performs best, but its advantage is moderate. Model selection should also consider cost, regulatory compliance, and transparency. Conclusion: LA researchers and practitioners should choose NLP embedding models carefully when working with scientific language that contains symbolic expressions. Abstract: Recent advancements in Natural Language Processing (NLP) have facilitated the analysis of student-generated language products in learning analytics (LA), particularly through the use of NLP embedding models. Yet when it comes to science-related language, symbolic expressions such as equations and formulas introduce challenges that current embedding models struggle to address. Existing studies and applications often either overlook these challenges or remove symbolic expressions altogether, potentially leading to biased findings and diminished performance of LA applications. This study therefore explores how contemporary embedding models differ in their capability to process and interpret science-related symbolic expressions. To this end, various embedding models are evaluated using physics-specific symbolic expressions drawn from authentic student responses, with performance assessed via two approaches: similarity-based analyses and integration into a machine learning pipeline. Our findings reveal significant differences in model performance, with OpenAI's GPT-text-embedding-3-large outperforming all other examined models, though its advantage over other models was moderate rather than decisive. Beyond performance, additional factors such as cost, regulatory compliance, and model transparency are discussed as key considerations for model selection. Overall, this study underscores the importance for LA researchers and practitioners of carefully selecting NLP embedding models when working with science-related language products that include symbolic expressions.

[293] Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

Che Liu,Haozhe Wang,Jiazhen Pan,Zhongwei Wan,Yong Dai,Fangzhen Lin,Wenjia Bai,Daniel Rueckert,Rossella Arcucci

Main category: cs.CL

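TL;DR: AlphaMed is the first medical language model to show that reasoning ability on medical QA can emerge purely through reinforcement learning (RL), without supervised fine-tuning or chain-of-thought data, and it performs strongly on multiple benchmarks.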

Details Motivation: To improve the performance and interpretability of large language models on complex tasks, especially clinical applications, without depending on costly supervised fine-tuning or chain-of-thought data. Method: AlphaMed is trained with rule-based RL rewards on public multiple-choice QA datasets, without supervised fine-tuning or chain-of-thought data. Result: AlphaMed achieves state-of-the-art results on six medical QA benchmarks and even surpasses larger or closed-source models on challenging ones. Conclusion: Dataset informativeness is a key driver of reasoning performance. Rule-based RL on informative multiple-choice data effectively induces reasoning, but current evaluation methods have limitations. Abstract: Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.
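
A rule-based reward for multiple-choice QA, of the kind the abstract refers to, can be as simple as checking the final chosen option against the answer key. The sketch below is an assumption-laden illustration, not AlphaMed's actual reward: the `Answer: X` output format, the option range A to E, and the function names `extract_choice` and `rule_reward` are all hypothetical.

```python
import re

# Assumed output convention: the model ends with "Answer: X" (optionally "(X)").
_CHOICE = re.compile(r"(?i:answer)\s*:\s*\(?([A-E])\b")


def extract_choice(output):
    """Return the last option letter stated as 'Answer: X', or None."""
    matches = _CHOICE.findall(output)
    return matches[-1] if matches else None


def rule_reward(output, gold):
    """Binary reward: 1.0 if the final stated choice equals the gold option."""
    return 1.0 if extract_choice(output) == gold else 0.0


if __name__ == "__main__":
    sample = "The symptoms point to iron deficiency. Answer: B"
    print(rule_reward(sample, "B"))
```

Because the reward only inspects the final choice, it needs no chain-of-thought supervision, which matches the setting the abstract describes.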

[294] Counting Cycles with Deepseek

Jiashun Jin,Tracy Ke,Bingcheng Sui,Zhenggang Wang

Main category: cs.CL

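TL;DR: By combining a novel approach with AI's strong coding skills, the authors solve the problem of deriving a Computationally Efficient Equivalent Form (CEEF), but the AI needs a clear strategy and step-by-step guidance.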

Details Motivation: Despite progress in mathematics, AI still struggles with hard combinatorial problems such as the CEEF problem, which requires delicate combinatorics and tedious calculation. Method: A novel approach is combined with AI's coding ability, supported by step-by-step guidance and a clear strategy. Result: New formulas for general cases are discovered, and the AI is shown to solve the problem when given explicit guidance. Conclusion: AI needs human strategy to solve complex mathematical problems, but it can substantially improve efficiency. Abstract: Despite recent progress, AI still struggles with advanced mathematics. We consider a difficult open problem: How to derive a Computationally Efficient Equivalent Form (CEEF) for the cycle count statistic? The CEEF problem does not have known general solutions, and requires delicate combinatorics and tedious calculations. Such a task is hard to accomplish by humans but is an ideal example where AI can be very helpful. We solve the problem by combining a novel approach we propose and the powerful coding skills of AI. Our results use delicate graph theory and contain new formulas for general cases that have not been discovered before. We find that, while AI is unable to solve the problem all by itself, it is able to solve it if we provide it with a clear strategy, step-by-step guidance, and carefully written prompts. For simplicity, we focus our study on DeepSeek-R1 but we also investigate other AI approaches.

[295] AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web

Rui Cao,Zifeng Ding,Zhijiang Guo,Michael Schlichtkrull,Andreas Vlachos

Main category: cs.CL

TL;DR: AVerImaTeC is a dataset of 1,297 real-world image-text claims, annotated with evidence in the form of question-answer pairs, for automated verification.

Details Motivation: Existing datasets mostly contain synthetic claims and lack evidence annotations, so they cannot capture the reasoning behind a verdict, which limits progress in automated verification. Method: Claim normalization, temporally constrained evidence annotation, and a two-stage sufficiency check address common problems such as contextual dependence, temporal leakage, and insufficient evidence. Result: Annotation consistency is high (κ=0.742 on verdicts, 74.7% consistency on QA pairs), and a new evaluation method for evidence retrieval is proposed. Conclusion: AVerImaTeC provides a high-quality dataset and baselines for verifying image-text claims with open-web evidence. Abstract: Textual claims are often accompanied by images to enhance their credibility and spread on social media, but this also raises concerns about the spread of misinformation. Existing datasets for automated verification of image-text claims remain limited, as they often consist of synthetic claims and lack evidence annotations to capture the reasoning behind the verdict. In this work, we introduce AVerImaTeC, a dataset consisting of 1,297 real-world image-text claims. Each claim is annotated with question-answer (QA) pairs containing evidence from the web, reflecting a decomposed reasoning regarding the verdict. We mitigate common challenges in fact-checking datasets such as contextual dependence, temporal leakage, and evidence insufficiency, via claim normalization, temporally constrained evidence annotation, and a two-stage sufficiency check. We assess the consistency of the annotation in AVerImaTeC via inter-annotator studies, achieving a $\kappa=0.742$ on verdicts and $74.7\%$ consistency on QA pairs. We also propose a novel evaluation method for evidence retrieval and conduct extensive experiments to establish baselines for verifying image-text claims using open-web evidence.
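
For readers unfamiliar with the $\kappa$ statistic quoted above, the following sketch computes Cohen's kappa for two annotators: observed agreement corrected for the agreement expected by chance. The abstract does not say which kappa variant the authors used, so Cohen's two-rater form is an assumption, and the verdict labels in the example are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(count_a[l] * count_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    # Hypothetical verdicts from two annotators, not data from the paper.
    a = ["supported", "supported", "refuted", "refuted"]
    b = ["supported", "supported", "refuted", "supported"]
    print(cohen_kappa(a, b))
```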

[296] TRACE for Tracking the Emergence of Semantic Representations in Transformers

Nura Aljaafari,Danilo S. Carvalho,André Freitas

Main category: cs.CL

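TL;DR: The TRACE framework combines geometric, informational, and linguistic signals to detect phase transitions during Transformer training, revealing how linguistic abstraction emerges.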

Details Motivation: To understand the mechanism behind Transformer models' phase transition from memorisation to abstraction, and to address prior work's neglect of how linguistic structure emerges. Method: Proposes the TRACE framework and the ABSynth data generation method, combining geometric, informational, and linguistic signals to analyse phase transitions. Result: Phase transitions coincide with curvature collapse and dimension stabilisation, the geometric shifts align with emerging syntactic and semantic accuracy, and abstraction patterns persist across architectures. Conclusion: TRACE offers new insight into how linguistic abstraction emerges in language models, with implications for interpretability, training efficiency, and compositional generalisation. Abstract: Modern transformer models exhibit phase transitions during training, distinct shifts from memorisation to abstraction, but the mechanisms underlying these transitions remain poorly understood. Prior work has often focused on endpoint representations or isolated signals like curvature or mutual information, typically in symbolic or arithmetic domains, overlooking the emergence of linguistic structure. We introduce TRACE (Tracking Representation Abstraction and Compositional Emergence), a diagnostic framework combining geometric, informational, and linguistic signals to detect phase transitions in Transformer-based LMs. TRACE leverages a frame-semantic data generation method, ABSynth, that produces annotated synthetic corpora with controllable complexity, lexical distributions, and structural entropy, while being fully annotated with linguistic categories, enabling precise analysis of abstraction emergence. Experiments reveal that (i) phase transitions align with clear intersections between curvature collapse and dimension stabilisation; (ii) these geometric shifts coincide with emerging syntactic and semantic accuracy; (iii) abstraction patterns persist across architectural variants, with components like feedforward networks affecting optimisation stability rather than fundamentally altering trajectories. This work advances our understanding of how linguistic abstractions emerge in LMs, offering insights into model interpretability, training efficiency, and compositional generalisation that could inform more principled approaches to LM development.

[297] Training with Pseudo-Code for Instruction Following

Prince Kumar,Rudra Murthy,Riyaz Bhat,Danish Contractor

Main category: cs.CL

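TL;DR: The paper proposes fine-tuning large language models (LLMs) on instructions re-expressed in pseudo-code to improve instruction following, and validates the method across multiple tasks.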

Details Motivation: Despite rapid progress, LLMs still struggle to follow simple, unambiguous instructions, especially compositional ones. Pseudo-code may help, but writing it is impractical for non-expert users. Method: Instruction-tuning data is augmented with each instruction re-expressed in pseudo-code alongside the final response, and LLMs are fine-tuned on it. Result: Across 11 public benchmarks, models show a relative gain of 3-19% on instruction following and an average gain of up to 14% across all tasks. Conclusion: Pseudo-code-assisted fine-tuning improves instruction following without hurting performance on other tasks. Abstract: Despite the rapid progress in the capabilities of Large Language Models (LLMs), they continue to have difficulty following relatively simple, unambiguous instructions, especially when compositions are involved. In this paper, we take inspiration from recent work that suggests that models may follow instructions better when they are expressed in pseudo-code. However, writing pseudo-code programs can be tedious and using few-shot demonstrations to craft code representations for use in inference can be unnatural for non-expert users of LLMs. To overcome these limitations, we propose fine-tuning LLMs with instruction-tuning data that additionally includes instructions re-expressed in pseudo-code along with the final response. We evaluate models trained using our method on $11$ publicly available benchmarks comprising tasks related to instruction-following, mathematics, and common-sense reasoning. We conduct rigorous experiments with $5$ different models and find that not only do models follow instructions better when trained with pseudo-code, they also retain their capabilities on the other tasks related to mathematical and common-sense reasoning. Specifically, we observe a relative gain of $3$--$19$% on instruction-following benchmarks, and an average gain of up to 14% across all tasks.

[298] Contrastive Distillation of Emotion Knowledge from LLMs for Zero-Shot Emotion Recognition

Minxue Niu,Emily Mower Provost

Main category: cs.CL

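TL;DR: Proposes a contrastive distillation framework that transfers emotion knowledge from large language models such as GPT-4 into a compact model without human annotation, enabling zero-shot emotion recognition.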

Details Motivation: Conventional emotion recognition models are trained on fixed label sets and generalize poorly beyond them. Large language models perform well but are too large to deploy on edge devices. Method: GPT-4 generates emotion descriptions that serve as supervision, and contrastive learning aligns text samples with emotion descriptors in a shared embedding space. Result: The distilled model performs well across multiple datasets and label spaces, approaching GPT-4's performance while being over 10,000 times smaller. Conclusion: The method offers an effective route to lightweight emotion recognition systems that generalize well. Abstract: The ability to handle various emotion labels without dedicated training is crucial for building adaptable Emotion Recognition (ER) systems. Conventional ER models rely on training using fixed label sets and struggle to generalize beyond them. On the other hand, Large Language Models (LLMs) have shown strong zero-shot ER performance across diverse label spaces, but their scale limits their use on edge devices. In this work, we propose a contrastive distillation framework that transfers rich emotional knowledge from LLMs into a compact model without the use of human annotations. We use GPT-4 to generate descriptive emotion annotations, offering rich supervision beyond fixed label sets. By aligning text samples with emotion descriptors in a shared embedding space, our method enables zero-shot prediction on different emotion classes, granularity, and label schema. The distilled model is effective across multiple datasets and label spaces, outperforming strong baselines of similar size and approaching GPT-4's zero-shot performance, while being over 10,000 times smaller.
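
Once text and emotion descriptors live in a shared embedding space, zero-shot prediction reduces to a nearest-descriptor lookup. The sketch below shows only that inference step, assuming embeddings are already available as plain lists of floats; the toy two-dimensional vectors and the names `cosine` and `zero_shot_label` are hypothetical, and the contrastive training that produces the encoder is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_label(text_vec, label_vecs):
    """Pick the label whose descriptor embedding is closest to the text."""
    return max(label_vecs, key=lambda name: cosine(text_vec, label_vecs[name]))


if __name__ == "__main__":
    # Toy embeddings standing in for encoder outputs.
    labels = {"joy": [1.0, 0.0], "anger": [0.0, 1.0]}
    print(zero_shot_label([0.9, 0.1], labels))
```

Because the label set is just a dictionary of descriptor embeddings, new emotion classes can be added at inference time by embedding their descriptions, which is what makes the setup zero-shot.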

[299] MathEDU: Towards Adaptive Feedback for Student Mathematical Problem-Solving

Wei-Ling Hsu,Yu-Chien Tang,An-Zi Yen

Main category: cs.CL

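TL;DR: The paper examines the ability of large language models (LLMs) to give personalized feedback on math problem solving and introduces the MathEDU dataset for evaluation. The model identifies correctness well but still struggles to generate detailed pedagogical feedback.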

Details Motivation: Online learning lacks immediate, personalized feedback, particularly for math problem solving. The study aims to fill this gap with LLMs. Method: The MathEDU dataset, which pairs student solutions with teacher feedback, is used to evaluate LLMs in two scenarios: with and without access to students' prior answers. Result: The fine-tuned model identifies answer correctness well but falls short in generating detailed pedagogical feedback. Conclusion: LLMs show promise for math education but need further improvement to provide more effective teaching feedback. Abstract: Online learning enhances educational accessibility, offering students the flexibility to learn anytime, anywhere. However, a key limitation is the lack of immediate, personalized feedback, particularly in helping students correct errors in math problem-solving. Several studies have investigated the applications of large language models (LLMs) in educational contexts. In this paper, we explore the capabilities of LLMs to assess students' math problem-solving processes and provide adaptive feedback. The MathEDU dataset is introduced, comprising authentic student solutions annotated with teacher feedback. We evaluate the model's ability to support personalized learning in two scenarios: one where the model has access to students' prior answer histories, and another simulating a cold-start context. Experimental results show that the fine-tuned model performs well in identifying correctness. However, the model still faces challenges in generating detailed feedback for pedagogical purposes.

[300] Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals

Jia-Nan Li,Jian Guan,Wei Wu,Rui Yan

Main category: cs.CL

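TL;DR: The paper studies inductive reasoning in large language models (LLMs) and proposes AlignXplore, a model that combines synthetic data with online reinforcement learning to substantially improve user preference inference.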

Details Motivation: LLMs excel at deductive reasoning tasks, but inductive reasoning, such as inferring user preferences from scattered signals, remains underexplored, particularly for LLM alignment. Method: Proposes AlignXplore, which combines cold-start training on synthetic data with online reinforcement learning to infer user preferences systematically. Result: AlignXplore improves by an average of 11.05% on in-domain and out-of-domain benchmarks and generalizes well across input formats and downstream models. Conclusion: AlignXplore improves preference inference and reveals human-like inductive reasoning patterns emerging during training, establishing best practices for related work. Abstract: Large language models (LLMs) have demonstrated significant success in complex reasoning tasks such as math and coding. In contrast to these tasks where deductive reasoning predominates, inductive reasoning, the ability to derive general rules from incomplete evidence, remains underexplored. This paper investigates extended inductive reasoning in LLMs through the lens of personalized preference inference, a critical challenge in LLM alignment where current approaches struggle to capture diverse user preferences. The task demands strong inductive reasoning capabilities as user preferences are typically embedded implicitly across various interaction forms, requiring models to synthesize consistent preference patterns from scattered signals. We propose AlignXplore, a model that leverages extended reasoning chains to enable systematic preference inference from behavioral signals in users' interaction histories. We develop AlignXplore by combining cold-start training based on synthetic data with subsequent online reinforcement learning. Through extensive experiments, we demonstrate that AlignXplore achieves substantial improvements over the backbone model by an average of 11.05% on in-domain and out-of-domain benchmarks, while maintaining strong generalization ability across different input formats and downstream models. Further analyses establish best practices for preference inference learning through systematic comparison of reward modeling strategies, while revealing the emergence of human-like inductive reasoning patterns during training.

[301] QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization

Weizhou Shen,Chenliang Li,Fanqi Wan,Shengyi Liao,Shaopeng Lai,Bo Zhang,Yingcheng Shi,Yuning Wu,Gang Fu,Zhansheng Li,Bin Yang,Ji Zhang,Fei Huang,Jingren Zhou,Ming Yan

Main category: cs.CL

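TL;DR: QwenLong-CPRS is a context compression framework for long-context optimization that uses a dynamic optimization mechanism to improve both efficiency and performance.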

Details Motivation: To address the computation overhead and performance degradation of long-sequence processing. Method: Natural-language-guided dynamic optimization, bidirectional reasoning layers, a token critic mechanism, and window-parallel inference. Result: Across five benchmarks it outperforms other context management methods, integrates with flagship LLMs, and achieves 21.59x context compression with a 19.15-point average performance gain. Conclusion: QwenLong-CPRS sets new state-of-the-art performance in long-context processing. Abstract: This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation of large language models (LLMs) during long sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) Natural language-guided dynamic optimization, (2) Bidirectional reasoning layers for enhanced boundary awareness, (3) Token critic mechanisms with language modeling heads, and (4) Window-parallel inference. Comprehensive evaluations across five benchmarks (4K-2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) Consistent superiority over other context management methods like RAG and sparse attention in both accuracy and efficiency; (2) Architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59$\times$ context compression alongside 19.15-point average performance gains; (3) Deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench, establishing new SOTA performance.

[302] Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

Joey Hong,Anca Dragan,Sergey Levine

Main category: cs.CL

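TL;DR: Proposes a new approach that uses goal-conditioned value functions to guide the reasoning of large language models (LLMs), addressing the scalability problems of reinforcement learning (RL) fine-tuning on complex tasks.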

Details Motivation: LLMs excel at question answering and dialogue, but on complex tasks that need long-horizon reasoning and planning, such as negotiation and persuasion, RL fine-tuning is costly in memory and computation and cannot be applied to large API-based models. Method: Goal-conditioned value functions predict how a task may unfold and guide the LLM agent to plan effectively over multi-turn interactions. The value functions are trained over reasoning steps rather than full actions, which keeps them lightweight. Result: On tool use, social deduction, and dialogue tasks, the method outperforms both RL fine-tuning and prompting methods while remaining efficient and scalable. Conclusion: The method gives LLMs an efficient, scalable way to reason and plan in complex interactive tasks. Abstract: Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such a manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents and that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, to be a concise and light-weight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.
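
The core loop of value-guided planning can be sketched as scoring each candidate reasoning step with a value function and keeping the best one. This is a simplified illustration under stated assumptions, not the authors' method: `select_step` and `toy_value` are hypothetical names, and the word-overlap scorer merely stands in for a learned goal-conditioned value function.

```python
from typing import Callable, Sequence


def select_step(
    state: str,
    goal: str,
    candidates: Sequence[str],
    value_fn: Callable[[str, str, str], float],
) -> str:
    """Return the candidate reasoning step with the highest estimated value."""
    if not candidates:
        raise ValueError("at least one candidate step is required")
    return max(candidates, key=lambda step: value_fn(state, step, goal))


def toy_value(state: str, step: str, goal: str) -> float:
    """Stand-in scorer: number of words the step shares with the goal.

    A real system would use a trained value function here.
    """
    return float(len(set(step.lower().split()) & set(goal.lower().split())))


if __name__ == "__main__":
    steps = ["insult the buyer", "propose a lower price"]
    print(select_step("opening offer rejected", "agree on price", steps, toy_value))
```

Since the LLM only proposes candidates and the value function only scores them, nothing in this loop requires gradient access to the LLM, which is why such a scheme can work with API-based models.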

[303] ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework

Lisheng Huang,Yichen Liu,Jinhao Jiang,Rongxiang Zhang,Jiahao Yan,Junyi Li,Wayne Xin Zhao

Main category: cs.CL

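TL;DR: ManuSearch is a transparent, modular multi-agent framework that gives large language models (LLMs) deep search capability by decomposing search and reasoning across three collaborating agents.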

Details Motivation: Web-augmented LLMs perform well on complex reasoning tasks, but these capabilities are largely confined to proprietary systems with opaque architectures. ManuSearch aims to change this. Method: ManuSearch comprises three agents (a solution planning agent, an Internet search agent, and a structured webpage reading agent) that cooperate on deep search and reasoning. Result: Experiments show ManuSearch substantially outperforms open-source baselines and even surpasses leading closed-source systems. Conclusion: ManuSearch lays the groundwork for reproducible, extensible research on open deep search systems, and the data and code are released. Abstract: Recent advances in web-augmented large language models (LLMs) have exhibited strong performance in complex reasoning tasks, yet these capabilities are mostly locked in proprietary systems with opaque architectures. In this work, we propose ManuSearch, a transparent and modular multi-agent framework designed to democratize deep search for LLMs. ManuSearch decomposes the search and reasoning process into three collaborative agents: (1) a solution planning agent that iteratively formulates sub-queries, (2) an Internet search agent that retrieves relevant documents via real-time web search, and (3) a structured webpage reading agent that extracts key evidence from raw web content. To rigorously evaluate deep reasoning abilities, we introduce ORION, a challenging benchmark focused on open-web reasoning over long-tail entities, covering both English and Chinese. Experimental results show that ManuSearch substantially outperforms prior open-source baselines and even surpasses leading closed-source systems. Our work paves the way for reproducible, extensible research in open deep search systems. We release the data and code at https://github.com/RUCAIBox/ManuSearch

[304] Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Zinuo Li,Xian Zhang,Yongxin Guo,Mohammed Bennamoun,Farid Boussaid,Girish Dwivedi,Luqi Gong,Qiuhong Ke

Main category: cs.CL

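TL;DR: TriSense is a triple-modality large language model that integrates visual, audio, and speech modalities for comprehensive video temporal understanding.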

Details Motivation: Existing models fuse and interpret audio information poorly, which limits comprehensive video temporal understanding. Method: TriSense uses a Query-Based Connector that dynamically reweights modality contributions, and introduces the TriSense-2M dataset to support its multimodal capabilities. Result: Experiments across multiple benchmarks demonstrate TriSense's effectiveness and its potential to advance multimodal video analysis. Conclusion: TriSense and its TriSense-2M dataset provide effective tools for multimodal video analysis with broad application potential. Abstract: Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene in the video like "A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding" requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense's multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis. Code and dataset will be publicly released.

[305] UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification

Poojah Ganesan,Rajat Aayush Jha,Dan Roth,Vivek Gupta

Main category: cs.CL

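TL;DR: UNJOIN is a two-stage framework for Text-to-SQL over multi-table databases that improves performance by decoupling schema element retrieval from SQL logic generation.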

Details Motivation: Text-to-SQL over multi-table databases is hard because of complex schemas and relational operations. Existing methods struggle with table and column retrieval, JOIN and UNION generation, and generalization across schemas. Method: UNJOIN works in two stages: (1) merge the column names of all tables into a single-table representation; (2) generate SQL on the simplified schema and map it back to the original schema. Result: On the SPIDER and BIRD datasets, UNJOIN matches or exceeds the best existing baselines. Conclusion: UNJOIN needs only schema information, with no data access or fine-tuning, making it scalable and adaptable across databases. Abstract: Recent advances in large language models (LLMs) have greatly improved Text-to-SQL performance for single-table queries. However, it remains challenging in multi-table databases due to complex schemas and relational operations. Existing methods often struggle with retrieving the right tables and columns, generating accurate JOINs and UNIONs, and generalizing across diverse schemas. To address these issues, we introduce UNJOIN, a two-stage framework that decouples the retrieval of schema elements from SQL logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. This allows the model to focus purely on accurate retrieval without being distracted by the need to write complex SQL logic. In the second stage, the SQL query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic. Evaluations on SPIDER and BIRD datasets show that UNJOIN matches or exceeds the state-of-the-art baselines. UNJOIN uses only schema information, which does not require data access or fine-tuning, making it scalable and adaptable across databases.
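
The first stage described above, flattening a multi-table schema by prefixing each column with its table name, is easy to sketch. The code below is an illustration only: the separator `.`, the toy schema, and the helper names `flatten_schema`, `split_flat_column`, and `tables_needed` are assumptions, and the second stage (reconstructing JOINs and UNIONs) is not implemented.

```python
from typing import Dict, List, Tuple


def flatten_schema(schema: Dict[str, List[str]], sep: str = ".") -> List[str]:
    """Merge all tables into one list of table-prefixed column names."""
    return [f"{table}{sep}{col}" for table, cols in schema.items() for col in cols]


def split_flat_column(flat: str, sep: str = ".") -> Tuple[str, str]:
    """Recover (table, column) from a prefixed column name."""
    table, col = flat.split(sep, 1)
    return table, col


def tables_needed(selected: List[str], sep: str = ".") -> List[str]:
    """Tables referenced by the selected flat columns, in first-seen order.

    This is the information a second stage would need before it
    reconstructs JOINs on the original schema.
    """
    seen: List[str] = []
    for flat in selected:
        table, _ = split_flat_column(flat, sep)
        if table not in seen:
            seen.append(table)
    return seen


if __name__ == "__main__":
    schema = {"singer": ["id", "name"], "concert": ["id", "singer_id"]}
    print(flatten_schema(schema))
    print(tables_needed(["concert.id", "singer.name"]))
```

On the flattened view, column retrieval becomes a single-table selection problem, which is the simplification the abstract credits for better retrieval.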

[306] Frankentext: Stitching random text fragments into long-form narratives

Chau Minh Pham,Jenna Russell,Dzung Pham,Mohit Iyyer

Main category: cs.CL

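TL;DR: Frankentexts are long-form narratives produced by LLMs under the constraint that most of the content (e.g., 90%) must be copied verbatim from human-written text, which tests controllable generation. Gemini-2.5-Pro performs well: 81% of its Frankentexts are coherent and 100% are relevant to the prompt, yet up to 59% are misclassified as human-written.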

Details Motivation: To study the controllable generation ability of LLMs under an extreme constraint (a high proportion of text copied from human writing) and to examine the limitations of AI text detectors. Method: The model produces a draft by selecting and combining human-written passages, then iteratively revises it while maintaining the specified copy ratio. Result: 81% of the Frankentexts are coherent and 100% are relevant to the prompt; up to 59% are misclassified as human-written, revealing limitations in detectors such as Pangram. Conclusion: Beyond being a challenging task, Frankentexts offer a new perspective for research on mixed-authorship detection and human-AI co-writing. Abstract: We introduce Frankentexts, a new type of long-form narrative produced by LLMs under the extreme constraint that most tokens (e.g., 90%) must be copied verbatim from human writings. This task presents a challenging test of controllable generation, requiring models to satisfy a writing prompt, integrate disparate text fragments, and still produce a coherent narrative. To generate Frankentexts, we instruct the model to produce a draft by selecting and combining human-written passages, then iteratively revise the draft while maintaining a user-specified copy ratio. We evaluate the resulting Frankentexts along three axes: writing quality, instruction adherence, and detectability. Gemini-2.5-Pro performs surprisingly well on this task: 81% of its Frankentexts are coherent and 100% relevant to the prompt. Notably, up to 59% of these outputs are misclassified as human-written by detectors like Pangram, revealing limitations in AI text detectors. Human annotators can sometimes identify Frankentexts through their abrupt tone shifts and inconsistent grammar between segments, especially in longer generations. Beyond presenting a challenging generation task, Frankentexts invite discussion on building effective detectors for this new grey zone of authorship, provide training data for mixed authorship detection, and serve as a sandbox for studying human-AI co-writing processes.

[307] Graph-Linguistic Fusion: Using Language Models for Wikidata Vandalism Detection

Mykola Trokhymovych,Lydia Pintscher,Ricardo Baeza-Yates,Diego Saez-Trumper

Main category: cs.CL

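TL;DR: Proposes a next-generation vandalism detection system for Wikidata that converts all edits into a single space with Graph2Text and evaluates them with a multilingual language model, outperforming the current production system.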

Details Motivation: Wikidata is a complex open knowledge base whose content combines structured data and multilingual text, so it needs a unified vandalism detection approach that improves coverage and simplifies maintenance. Method: Graph2Text converts all edits into a single space, and a multilingual language model evaluates them for potential vandalism. Result: Experiments show that the approach outperforms the current production system; the code and dataset are released to support further research. Conclusion: The unified approach performs well at vandalism detection and is open and scalable. Abstract: We introduce a next-generation vandalism detection system for Wikidata, one of the largest open-source structured knowledge bases on the Web. Wikidata is highly complex: its items incorporate an ever-expanding universe of factual triples and multilingual texts. While edits can alter both structured and textual content, our approach converts all edits into a single space using a method we call Graph2Text. This allows for evaluating all content changes for potential vandalism using a single multilingual language model. This unified approach improves coverage and simplifies maintenance. Experiments demonstrate that our solution outperforms the current production system. Additionally, we are releasing the code under an open license along with a large dataset of various human-generated knowledge alterations, enabling further research.

[308] Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find

Owen Bianchi,Mathew J. Koretsky,Maya Willey,Chelsea X. Alvarado,Tanay Nayak,Adi Asija,Nicole Kuznetsov,Mike A. Nalls,Faraz Faghri,Daniel Khashabi

Main category: cs.CL

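TL;DR: The study finds that in long-context question answering, LLM performance drops sharply and positional sensitivity increases when the relevant context (the "gold context") is shorter.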

Details Motivation: To examine how gold context length affects LLM performance on long-context tasks, filling a gap in prior work. Method: Systematic experiments analyze the effect of different gold context lengths on LLM performance across three domains and seven state-of-the-art models. Result: With shorter gold contexts, LLM performance drops sharply and positional sensitivity increases; the pattern holds across domains and models. Conclusion: The study offers guidance for designing robust, context-aware LLM-driven systems. Abstract: Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information ("the needle") must be drawn from a large pool of irrelevant context ("the haystack"). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size has received little attention. We address this gap by systematically studying how variations in gold context length impact LLM performance on long-context question answering tasks. Our experiments reveal that LLM performance drops sharply when the gold context is shorter, i.e., smaller gold contexts consistently degrade model performance and amplify positional sensitivity, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of varying lengths. This pattern holds across three diverse domains (general knowledge, biomedical reasoning, and mathematical reasoning) and seven state-of-the-art LLMs of various sizes and architectures. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.

[309] First Finish Search: Efficient Test-Time Scaling in Large Language Models

Aradhye Agarwal,Ayan Sengupta,Tanmoy Chakraborty

Main category: cs.CL

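TL;DR: The paper proposes First Finish Search (FFS), a training-free parallel decoding strategy that improves LLM reasoning by dynamically allocating compute at inference time. FFS performs strongly across several datasets and substantially improves accuracy.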

Details Motivation: Existing test-time scaling (TTS) methods often rely on long decoding paths or require many samples, which increases token usage and inference latency. The authors observe that for reasoning tasks shorter traces are more likely to be correct, which motivates FFS. Method: FFS is a training-free parallel decoding strategy that launches multiple independent samples and returns as soon as any one completes. Result: With DeepSeek-R1, FFS achieves 82.23% accuracy on the AIME datasets, a 15% improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini. Conclusion: The simplicity and efficiency of FFS show that straightforward TTS strategies can perform remarkably well at inference time. Abstract: Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches $n$ independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II and GPQA Diamond). With DeepSeek-R1, FFS achieves $82.23\%$ accuracy on the AIME datasets, a $15\%$ improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
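The "launch n samples, return the first to finish" idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's implementation: plain Python iterators stand in for model decoding streams, all samples are assumed to advance one token per step in lock-step (so the shortest trace finishes first), and the name `first_finish_search` is hypothetical.

```python
from typing import Iterator, List, Sequence


def first_finish_search(samplers: Sequence[Iterator[str]]) -> List[str]:
    """Advance all samplers one token per round; return the first finished trace."""
    if not samplers:
        raise ValueError("at least one sampler is required")
    traces: List[List[str]] = [[] for _ in samplers]
    while True:
        for trace, sampler in zip(traces, samplers):
            try:
                trace.append(next(sampler))
            except StopIteration:
                # This sample has emitted its last token: stop every other one.
                return trace


# Three toy "decoding streams" of different lengths (illustrative only).
streams = [iter("abcde"), iter("xy"), iter("pqr")]
winner = first_finish_search(streams)
print(winner)  # ['x', 'y'], the shortest trace
```

In this simulation ties go to the lowest-indexed sampler; a real system would run the samples concurrently and cancel the remaining requests once one returns.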

[310] Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs

Wafa Alghallabi,Ritesh Thawkar,Sara Ghaboura,Ketan More,Omkar Thawakar,Hisham Cholakkal,Salman Khan,Rao Muhammad Anwer

Main category: cs.CL

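TL;DR: The paper introduces Fann or Flop, the first benchmark for evaluating how well large language models (LLMs) understand Arabic poetry, covering 12 historical eras and 21 poetic genres.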

Details Motivation: Arabic poetry is a sophisticated and culturally embedded form of expression in the Arabic language, yet the ability of LLMs to understand it remains largely unexplored. Method: A curated corpus of poems with explanations is used to assess LLMs on semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. Result: Most state-of-the-art LLMs struggle with poetic understanding despite strong results on standard Arabic benchmarks. Conclusion: Poetic comprehension is a strong indicator of how well an LLM understands classical Arabic, and `Fann or Flop` is released as an open-source resource for future research. Abstract: Arabic poetry stands as one of the most sophisticated and culturally embedded forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce `Fann or Flop`, the first benchmark designed to assess the comprehension of Arabic poetry by LLMs in twelve historical eras, covering 21 core poetic genres and a variety of metrical forms, from classical structures to contemporary free verse. The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. We argue that poetic comprehension offers a strong indicator of how well an LLM understands classical Arabic. Unlike surface-level tasks, this domain demands deeper interpretive reasoning and cultural sensitivity. Our evaluation of state-of-the-art LLMs shows that most models struggle with poetic understanding despite strong results on standard Arabic benchmarks. We release `Fann or Flop` along with the evaluation suite as an open-source resource to enable rigorous evaluation and advancement for Arabic language models. Code is available at: https://github.com/mbzuai-oryx/FannOrFlop.

[311] The Staircase of Ethics: Probing LLM Value Priorities through Multi-Step Induction to Complex Moral Dilemmas

Ya Wu,Qiang Sheng,Danding Wang,Guang Yang,Yifan Sun,Zhengjia Wang,Yuyan Bu,Juan Cao

Main category: cs.CL

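TL;DR: The paper introduces the Multi-step Moral Dilemmas (MMDs) dataset for dynamically evaluating LLM reasoning in complex moral dilemmas, and finds that model value preferences shift as scenarios evolve.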

Details Motivation: Existing assessments are mostly single-step and cannot capture how LLMs adapt to evolving moral challenges, so a more comprehensive evaluation framework is needed. Method: The MMDs dataset of 3,302 five-stage dilemmas is constructed and used for a dynamic analysis of nine widely used LLMs. Result: LLM value preferences shift significantly as dilemmas escalate; models often prioritize care, but fairness can supersede it in certain contexts. Conclusion: Evaluation should shift toward dynamic, context-aware paradigms to support more human-aligned development of LLMs. Abstract: Ethical decision-making is a critical aspect of human judgment, and the growing use of LLMs in decision-support systems necessitates a rigorous evaluation of their moral reasoning capabilities. However, existing assessments primarily rely on single-step evaluations, failing to capture how models adapt to evolving ethical challenges. Addressing this gap, we introduce the Multi-step Moral Dilemmas (MMDs), the first dataset specifically constructed to evaluate the evolving moral judgments of LLMs across 3,302 five-stage dilemmas. This framework enables a fine-grained, dynamic analysis of how LLMs adjust their moral reasoning across escalating dilemmas. Our evaluation of nine widely used LLMs reveals that their value preferences shift significantly as dilemmas progress, indicating that models recalibrate moral judgments based on scenario complexity. Furthermore, pairwise value comparisons demonstrate that while LLMs often prioritize the value of care, this value can sometimes be superseded by fairness in certain contexts, highlighting the dynamic and context-dependent nature of LLM ethical reasoning. Our findings call for a shift toward dynamic, context-aware evaluation paradigms, paving the way for more human-aligned and value-sensitive development of LLMs.

cs.HC [Back]

[312] CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models

Arnav Verma,Kushin Mukherjee,Christopher Potts,Elisa Kreiss,Judith E. Fan

Main category: cs.HC

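TL;DR: The paper evaluates eight vision-language models on six data visualization understanding assessments and finds that they generally perform worse than humans and show error patterns distinct from those of humans, indicating that such models need further development.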

Details Motivation: To explore whether vision-language models emulate human behavior on data visualization understanding tasks, addressing limitations of prior evaluation methods. Method: Eight vision-language models are compared with human participants on six data visualization literacy assessments, using the same measures for both. Result: The models performed worse than humans on average, produced error patterns distinct from those of humans, and showed only limited correlation with humans in relative performance across items. Conclusion: Current models still need improvement, and future work can develop models that come closer to human cognition. Abstract: Data visualizations are powerful tools for communicating patterns in quantitative data. Yet understanding any data visualization is no small feat -- succeeding requires jointly making sense of visual, numerical, and linguistic inputs arranged in a conventionalized format one has previously learned to parse. Recently developed vision-language models are, in principle, promising candidates for developing computational models of these cognitive operations. However, it is currently unclear to what degree these models emulate human behavior on tasks that involve reasoning about data visualizations. This gap reflects limitations in prior work that has evaluated data visualization understanding in artificial systems using measures that differ from those typically used to assess these abilities in humans. Here we evaluated eight vision-language models on six data visualization literacy assessments designed for humans and compared model responses to those of human participants. We found that these models performed worse than human participants on average, and this performance gap persisted even when using relatively lenient criteria to assess model performance. Moreover, while relative performance across items was somewhat correlated between models and humans, all models produced patterns of errors that were reliably distinct from those produced by human participants. Taken together, these findings suggest significant opportunities for further development of artificial systems that might serve as useful models of how humans reason about data visualizations. All code and data needed to reproduce these results are available at: https://osf.io/e25mu/?view_only=399daff5a14d4b16b09473cf19043f18.

Yuchen He,Jianbing Lv,Liqi Cheng,Lingyu Meng,Dazhen Deng,Yingcai Wu

Main category: cs.HC

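TL;DR: ProTAL is a drag-and-link video programming framework for temporal action localization (TAL) that generates labels from user-defined key events, reducing the need for manual annotation.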

Details Motivation: Training TAL models requires large amounts of manually annotated data; data programming is efficient, but defining complex actions is difficult in TAL. Method: ProTAL lets users define key events by dragging nodes and linking them to constrain their relations, generates action labels from these definitions, and trains TAL models with a semi-supervised method. Result: A usage scenario and a user study demonstrate the effectiveness of ProTAL. Conclusion: ProTAL offers insights for designing video programming frameworks and reduces the manual annotation burden. Abstract: Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, the training of TAL models requires a substantial amount of manually annotated data. Data programming is an efficient method to create training labels with a series of human-defined labeling functions. However, its application in TAL faces difficulties of defining complex actions in the context of temporal video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define key events by dragging nodes representing body parts and objects and linking them to constrain the relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos. A semi-supervised method is then employed to train TAL models with such labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into designing video programming frameworks.

[314] Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts

Seon Gyeom Kim,Jae Young Choi,Ryan Rossi,Eunyee Koh,Tak Yeon Lee

Main category: cs.HC

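TL;DR: The paper introduces the Chart-to-Experience benchmark dataset and evaluates MLLMs on predicting the perceptual and emotional impact of charts, finding that MLLMs are less sensitive than humans in direct prediction but reliable in pairwise comparison.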

Details Motivation: Although MLLMs have progressed on visual understanding tasks, their use for predicting the perceptual and emotional impact of charts lacks sufficient validation and risks overgeneralization. Method: A benchmark of 36 charts was built and rated by crowdsourced workers on seven experiential factors; using it as ground truth, MLLMs were evaluated on direct prediction and pairwise comparison. Result: MLLMs are less sensitive than humans in direct prediction but accurate and reliable in pairwise comparison. Conclusion: MLLMs have limitations in chart perception tasks but show promise in pairwise comparison. Abstract: The field of Multimodal Large Language Models (MLLMs) has made remarkable progress in visual understanding tasks, presenting a vast opportunity to predict the perceptual and emotional impact of charts. However, it also raises concerns, as many applications of LLMs are based on overgeneralized assumptions from a few examples, lacking sufficient validation of their performance and effectiveness. We introduce Chart-to-Experience, a benchmark dataset comprising 36 charts, evaluated by crowdsourced workers for their impact on seven experiential factors. Using the dataset as ground truth, we evaluated capabilities of state-of-the-art MLLMs on two tasks: direct prediction and pairwise comparison of charts. Our findings imply that MLLMs are not as sensitive as human evaluators when assessing individual charts, but are accurate and reliable in pairwise comparisons.

cs.CY [Back]

[315] Exploring EFL Secondary Students' AI-generated Text Editing While Composition Writing

David James Woo,Yangyang Yu,Kai Guo

Main category: cs.CY

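TL;DR: The study examines how EFL secondary school students integrate and modify AI-generated text, identifying 15 edit types and four patterns of editing behavior, which challenges the assumption that students use AI tools passively.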

Details Motivation: To understand how EFL students manipulate AI-generated text during the writing process, filling a research gap. Method: A mixed-methods design analyzes screen recordings of AI-assisted writing by 29 Hong Kong secondary school students. Result: 15 edit types and four patterns of editing behavior were identified, revealing the complexity of students' interactions with AI-generated text. Conclusion: Students' use of AI tools is more complex than assumed, and explicit instructional strategies for editing AI-generated text are needed. Abstract: Generative Artificial Intelligence is transforming how English as a foreign language students write. Still, little is known about how students manipulate text generated by generative AI during the writing process. This study investigates how EFL secondary school students integrate and modify AI-generated text when completing an expository writing task. The study employed an exploratory mixed-methods design. Screen recordings were collected from 29 Hong Kong secondary school students who attended an AI-assisted writing workshop and recorded their screens while using generative AI to write an article. Content analysis with hierarchical coding and thematic analysis with a multiple case study approach were adopted to analyze the recordings. Fifteen types of AI-generated text edits across seven categories were identified from the recordings. Notably, AI-initiated edits from iOS and Google Docs emerged as unanticipated sources of AI-generated text. A thematic analysis revealed four patterns of students' editing behaviors based on planning and drafting direction: planning with top-down drafting and revising; top-down drafting and revising without planning; planning with bottom-up drafting and revising; and bottom-up drafting and revising without planning. Network graphs illustrate cases of each pattern, demonstrating that students' interactions with AI-generated text involve more complex cognitive processes than simple text insertion. The findings challenge assumptions about students' passive, simplistic use of generative AI tools and have implications for developing explicit instructional approaches to teaching AI-generated text editing strategies in EFL writing pedagogy.

eess.AS [Back]

[316] From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data

Ahmed Adel Attia,Dorottya Demszky,Jing Liu,Carol Espy-Wilson

Main category: eess.AS

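TL;DR: The paper proposes Weakly Supervised Pretraining (WSP) for classroom speech recognition, where weak transcripts are abundant but accurate data is scarce; a two-stage process of pretraining and fine-tuning improves model performance.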

Details Motivation: Classroom speech recognition has abundant weak transcripts but little accurate data, and high transcription costs make conventional re-transcription impractical. Method: Training proceeds in two stages: supervised pretraining on weak transcripts, then fine-tuning on a small amount of accurate data. Result: On both synthetic and real weak transcripts, WSP outperforms alternative methods. Conclusion: WSP is an effective training method for low-resource speech recognition. Abstract: Recent progress in speech recognition has relied on models trained on vast amounts of labeled data. However, classroom Automatic Speech Recognition (ASR) faces the real-world challenge of abundant weak transcripts paired with only a small amount of accurate, gold-standard data. In such low-resource settings, high transcription costs make re-transcription impractical. To address this, we ask: what is the best approach when abundant inexpensive weak transcripts coexist with limited gold-standard data, as is the case for classroom speech data? We propose Weakly Supervised Pretraining (WSP), a two-step process where models are first pretrained on weak transcripts in a supervised manner, and then fine-tuned on accurate data. Our results, based on both synthetic and real weak transcripts, show that WSP outperforms alternative methods, establishing it as an effective training methodology for low-resource ASR in real-world scenarios.

[317] Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech

Yejin Lee,Jaehoon Kang,Kyuhong Shim

Main category: eess.AS

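TL;DR: Proposes a framework that controls voice style through textual personas, using persona rewriting strategies to improve the naturalness and consistency of synthesized speech.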

Details Motivation: To explore how textual persona descriptions can give fine-grained control over voice style and improve the expressiveness of speech synthesis. Method: Two persona rewriting strategies turn generic persona descriptions into speech-oriented prompts that manipulate prosodic attributes such as pitch, emotion, and speaking rate. Result: Experiments show that the methods improve the naturalness, clarity, and consistency of synthesized speech. Conclusion: Voice style is a crucial factor for persona-driven AI dialogue systems, and LLM-based rewriting can introduce social biases (e.g., around gender) that need attention. Abstract: In this paper, we propose a novel framework to control voice style in prompt-based, controllable text-to-speech systems by leveraging textual personas as voice style prompts. We present two persona rewriting strategies to transform generic persona descriptions into speech-oriented prompts, enabling fine-grained manipulation of prosodic attributes such as pitch, emotion, and speaking rate. Experimental results demonstrate that our methods enhance the naturalness, clarity, and consistency of synthesized speech. Finally, we analyze implicit social biases introduced by LLM-based rewriting, with a focus on gender. We underscore voice style as a crucial factor for persona-driven AI dialogue systems.

[318] Speechless: Speech Instruction Training Without Speech for Low Resource Languages

Alan Dao,Dinh Bach Vu,Huy Hoang Ha,Tuan Le Duc Anh,Shreyas Gopal,Yue Heng Yeo,Warren Keng Hoong Low,Eng Siong Chng,Jia Qi Yip

Main category: eess.AS

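TL;DR: Proposes a new method that skips the text-to-speech (TTS) step and synthesizes speech instruction data directly at the semantic representation level, addressing the scarcity of speech instruction data for low-resource languages.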

Details Motivation: Voice assistants need large amounts of speech instruction data, but low-resource languages lack high-quality TTS models, which makes such data hard to generate. Method: Synthesis is halted at the semantic representation level, and the synthetic semantic representations are aligned with the pre-trained Whisper encoder so that an LLM can be fine-tuned on text instructions while retaining the ability to understand spoken instructions. Result: The training process is simplified, offering a feasible route to voice assistants for low-resource languages. Conclusion: The method opens new possibilities for voice assistants in low-resource languages. Abstract: The rapid growth of voice assistants powered by large language models (LLMs) has highlighted a need for speech instruction data to train these systems. Despite the abundance of speech recognition data, there is a notable scarcity of speech instruction data, which is essential for fine-tuning models to understand and execute spoken commands. Generating high-quality synthetic speech requires a good text-to-speech (TTS) model, which may not be available for low-resource languages. Our novel approach addresses this challenge by halting synthesis at the semantic representation level, bypassing the need for TTS. We achieve this by aligning synthetic semantic representations with the pre-trained Whisper encoder, enabling an LLM to be fine-tuned on text instructions while maintaining the ability to understand spoken instructions during inference. This simplified training process is a promising approach to building voice assistants for low-resource languages.

eess.IV [Back]

[319] TAGS: 3D Tumor-Adaptive Guidance for SAM

Sirui Li,Linkai Peng,Zheyuan Zhang,Gorkem Durak,Ulas Bagci

Main category: eess.IV

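TL;DR: The TAGS framework adapts 2D foundation models such as CLIP and SAM to 3D medical image segmentation through multi-prompt fusion, substantially improving tumor segmentation performance.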

Details Motivation: Foundation models such as CLIP and SAM excel on natural images but perform poorly on 3D medical images (e.g., tumor segmentation) because of the domain gap, so they need adaptation. Method: TAGS combines CLIP's semantic information with anatomy-specific prompts to enhance SAM's spatial feature extraction while preserving most of the pre-trained weights. Result: On three open-source tumor segmentation datasets, TAGS outperforms existing methods, by +46.88% over nnUNet and by at least +13% over medical foundation models such as SAM-Med2D. Conclusion: TAGS shows robustness and adaptability across diverse medical segmentation tasks, offering an effective solution for 3D medical image analysis. Abstract: Foundation models (FMs) such as CLIP and SAM have recently shown great promise in image segmentation tasks, yet their adaptation to 3D medical imaging, particularly for pathology detection and segmentation, remains underexplored. A critical challenge arises from the domain gap between natural images and medical volumes: existing FMs, pre-trained on 2D data, struggle to capture 3D anatomical context, limiting their utility in clinical applications like tumor segmentation. To address this, we propose an adaptation framework called TAGS: Tumor Adaptive Guidance for SAM, which unlocks 2D FMs for 3D medical tasks through multi-prompt fusion. By preserving most of the pre-trained weights, our approach enhances SAM's spatial feature extraction using CLIP's semantic insights and anatomy-specific prompts. Extensive experiments on three open-source tumor segmentation datasets prove that our model surpasses the state-of-the-art medical image segmentation models (+46.88% over nnUNet), interactive segmentation frameworks, and other established medical FMs, including SAM-Med2D, SAM-Med3D, SegVol, Universal, 3D-Adapter, and SAM-B (at least +13% over them). This highlights the robustness and adaptability of our proposed framework across diverse medical segmentation tasks.

[320] Assessing the generalization performance of SAM for ureteroscopy scene understanding

Martin Villagrana,Francisco Lopez-Tiro,Clement Larose,Gilberto Ochoa-Ruiz,Christian Daul

Main category: eess.IV

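TL;DR: The paper studies the potential of the Segment Anything Model (SAM) for kidney stone segmentation and finds that it outperforms traditional U-Net models, especially in generalization.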

Details Motivation: Kidney stone segmentation is a key step toward identifying stone types, but manual segmentation is inefficient, so automated methods are needed. Method: SAM is compared with traditional models (U-Net, Residual U-Net, Attention U-Net) on kidney stone segmentation. Result: SAM performs comparably to U-Net on in-distribution data but generalizes significantly better than the U-Net variants on out-of-distribution data, by margins of up to 23%. Conclusion: SAM offers greater adaptability and efficiency for kidney stone segmentation and is a strong tool for automating it. Abstract: The segmentation of kidney stones is regarded as a critical preliminary step to enable the identification of urinary stone types through machine- or deep-learning-based approaches. In urology, manual segmentation is considered tedious and impractical due to the typically large scale of image databases and the continuous generation of new data. In this study, the potential of the Segment Anything Model (SAM) -- a state-of-the-art deep learning framework -- is investigated for the automation of kidney stone segmentation. The performance of SAM is evaluated in comparison to traditional models, including U-Net, Residual U-Net, and Attention U-Net, which, despite their efficiency, frequently exhibit limitations in generalizing to unseen datasets. The findings highlight SAM's superior adaptability and efficiency. While SAM achieves comparable performance to U-Net on in-distribution data (Accuracy: 97.68 ± 3.04; Dice: 97.78 ± 2.47; IoU: 95.76 ± 4.18), it demonstrates significantly enhanced generalization capabilities on out-of-distribution data, surpassing all U-Net variants by margins of up to 23 percent.
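The Dice and IoU figures quoted in the abstract are overlap ratios between a predicted mask and a reference mask. A minimal sketch of the standard definitions on flattened binary masks is given below; the function name `dice_and_iou`, the toy masks, and the convention for two empty masks are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Return (Dice, IoU) for two equal-length binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 4-pixel masks (illustrative only): one shared pixel, two pixels each.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice, iou)
```

For the same pair of masks Dice is never smaller than IoU, which is consistent with the ordering of the two scores reported above.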

[321] SUFFICIENT: A scan-specific unsupervised deep learning framework for high-resolution 3D isotropic fetal brain MRI reconstruction

Jiangjie Wu,Lixuan Chen,Zhenghao Li,Xin Li,Saban Ozturk,Lihui Wang,Rongpin Wang,Hongjiang Wei,Yuyao Zhang

Main category: eess.IV

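TL;DR: Proposes an unsupervised iterative SVR-SRR framework that reconstructs high-quality 3D fetal brain MRI from motion-corrupted 2D slices and outperforms existing methods.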

Details Motivation: Large-scale training data is hard to obtain for clinical fetal MRI, yet deep learning, which performs well on SVR and SRR, depends on external data. Method: SVR is parameterized by a convolutional neural network, and SRR uses a decoding network within a deep image prior framework to optimize the HR volume reconstruction. Result: On simulated and clinical data, the framework outperforms existing fetal brain reconstruction methods. Conclusion: The framework offers an effective solution for unsupervised 3D fetal brain MRI reconstruction. Abstract: High-quality 3D fetal brain MRI reconstruction from motion-corrupted 2D slices is crucial for clinical diagnosis. Reliable slice-to-volume registration (SVR)-based motion correction and super-resolution reconstruction (SRR) methods are essential. Deep learning (DL) has demonstrated potential in enhancing SVR and SRR when compared to conventional methods. However, it requires large-scale external training datasets, which are difficult to obtain for clinical fetal MRI. To address this issue, we propose an unsupervised iterative SVR-SRR framework for isotropic HR volume reconstruction. Specifically, SVR is formulated as a function mapping a 2D slice and a 3D target volume to a rigid transformation matrix, which aligns the slice to the underlying location in the target volume. The function is parameterized by a convolutional neural network, which is trained by minimizing the difference between the volume slicing at the predicted position and the input slice. In SRR, a decoding network embedded within a deep image prior framework is incorporated with a comprehensive image degradation model to produce the high-resolution (HR) volume. The deep image prior framework offers a local consistency prior to guide the reconstruction of HR volumes. By performing a forward degradation model, the HR volume is optimized by minimizing loss between predicted slices and the observed slices. Comprehensive experiments conducted on large-magnitude motion-corrupted simulation data and clinical data demonstrate the superior performance of the proposed framework over state-of-the-art fetal brain reconstruction frameworks.

[322] Anatomy-Guided Multitask Learning for MRI-Based Classification of Placenta Accreta Spectrum and its Subtypes

Hai Jiang,Qiongting Liu,Yuanpin Zhou,Jiawei Pan,Ting Song,Yao Lu

Main category: eess.IV

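TL;DR: Proposes a new CNN-based architecture for one-stage multiclass diagnosis of placenta accreta spectrum disorders (PAS) and their subtypes, combining anatomical features with a multitask learning strategy and outperforming existing methods.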

Details Motivation: Accurate prenatal diagnosis of placenta accreta spectrum disorders (PAS) and their subtypes (PA, PI, PP) is clinically crucial, but existing methods mostly address the presence of PAS, subtype recognition is little studied, and most approaches use inefficient two-stage classification. Method: A two-branch CNN is designed: the main branch uses residual blocks, and the second branch integrates anatomical features of the uteroplacental area; the model is trained with a multitask learning strategy on 4,140 MRI slices. Result: Experiments on a real clinical dataset show that the model achieves state-of-the-art performance. Conclusion: The method offers a new approach to efficient diagnosis of PAS and its subtypes, with potential for clinical use. Abstract: Placenta Accreta Spectrum Disorders (PAS) pose significant risks during pregnancy, frequently leading to postpartum hemorrhage during cesarean deliveries and other severe clinical complications, with bleeding severity correlating to the degree of placental invasion. Consequently, accurate prenatal diagnosis of PAS and its subtypes, placenta accreta (PA), placenta increta (PI), and placenta percreta (PP), is crucial. However, existing guidelines and methodologies predominantly focus on the presence of PAS, with limited research addressing subtype recognition. Additionally, previous multi-class diagnostic efforts have primarily relied on inefficient two-stage cascaded binary classification tasks. In this study, we propose a novel convolutional neural network (CNN) architecture designed for efficient one-stage multiclass diagnosis of PAS and its subtypes, based on 4,140 magnetic resonance imaging (MRI) slices. Our model features two branches: the main classification branch is built from multiple residual blocks, while the second branch integrates anatomical features of the uteroplacental area and the adjacent uterine serous layer to enhance the model's attention during classification. Furthermore, we implement a multitask learning strategy to leverage both branches effectively. Experiments conducted on a real clinical dataset demonstrate that our model achieves state-of-the-art performance.

[323] DECT-based Space-Squeeze Method for Multi-Class Classification of Metastatic Lymph Nodes in Breast Cancer

Hai Jiang,Chushan Zheng,Jiawei Pan,Yuanpin Zhou,Qiongting Liu,Xiang Zhang,Jun Shen,Yao Lu

Main category: eess.IV

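TL;DR: The study uses dual-energy CT (DECT) to develop a noninvasive model that classifies the metastatic burden of sentinel lymph nodes in breast cancer; a space-squeeze method with virtual class injection markedly improves classification performance.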

Details Motivation: Conventional imaging struggles to differentiate levels of lymph node metastatic burden, so a more accurate classification method is needed to guide breast cancer treatment decisions. Method: A space-squeeze method combines a channel-wise attention mechanism with virtual class injection to refine spectral-spatial features and reduce inter-class ambiguity. Result: On 227 biopsy-confirmed cases, the model achieved an average test AUC of 0.86, outperforming established CNNs; the two components individually improved AUC by 5.01% and 5.87%. Conclusion: By integrating DECT's spectral-spatial data and reducing class ambiguity, the framework provides an effective tool for noninvasive metastatic burden assessment in clinical practice. Abstract: Background: Accurate assessment of metastatic burden in axillary lymph nodes is crucial for guiding breast cancer treatment decisions, yet conventional imaging modalities struggle to differentiate metastatic burden levels and capture comprehensive lymph node characteristics. This study leverages dual-energy computed tomography (DECT) to exploit spectral-spatial information for improved multi-class classification. Purpose: To develop a noninvasive DECT-based model classifying sentinel lymph nodes into three categories: no metastasis ($N_0$), low metastatic burden ($N_{+(1-2)}$), and heavy metastatic burden ($N_{+(\geq3)}$), thereby aiding therapeutic planning. Methods: We propose a novel space-squeeze method combining two innovations: (1) a channel-wise attention mechanism to compress and recalibrate spectral-spatial features across 11 energy levels, and (2) virtual class injection to sharpen inter-class boundaries and compact intra-class variations in the representation space. Results: Evaluated on 227 biopsy-confirmed cases, our method achieved an average test AUC of 0.86 (95% CI: 0.80-0.91) across three cross-validation folds, outperforming established CNNs (VGG, ResNet, etc.). The channel-wise attention and virtual class components individually improved AUC by 5.01% and 5.87%, respectively, demonstrating complementary benefits. Conclusions: The proposed framework enhances diagnostic AUC by effectively integrating DECT's spectral-spatial data and mitigating class ambiguity, offering a promising tool for noninvasive metastatic burden assessment in clinical practice.

[324] FreqU-FNet: Frequency-Aware U-Net for Imbalanced Medical Image Segmentation

Ruiqi Xing

Main category: eess.IV

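TL;DR: FreqU-FNet is a new medical image segmentation architecture that operates in the frequency domain to address class imbalance and the frequency-specific distribution of structures, outperforming conventional CNN and Transformer methods.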

Details Motivation: Severe class imbalance and frequency-specific distributions in medical image segmentation make it hard for conventional CNN and Transformer methods to capture minority-class signals and local detail. Method: FreqU-FNet combines a Frequency Encoder (Low-Pass Frequency Convolution and wavelet-based downsampling) with a Spatial Learnable Decoder (adaptive multi-branch upsampling) and a frequency-aware loss function. Result: On multiple medical segmentation benchmarks, FreqU-FNet consistently outperforms CNN and Transformer baselines, particularly on under-represented classes. Conclusion: Through frequency-domain operation and multi-scale feature extraction, FreqU-FNet addresses key challenges in medical image segmentation and suggests a direction for future research. Abstract: Medical image segmentation faces persistent challenges due to severe class imbalance and the frequency-specific distribution of anatomical structures. Most conventional CNN-based methods operate in the spatial domain and struggle to capture minority class signals, often affected by frequency aliasing and limited spectral selectivity. Transformer-based models, while powerful in modeling global dependencies, tend to overlook critical local details necessary for fine-grained segmentation. To overcome these limitations, we propose FreqU-FNet, a novel U-shaped segmentation architecture operating in the frequency domain. Our framework incorporates a Frequency Encoder that leverages Low-Pass Frequency Convolution and Daubechies wavelet-based downsampling to extract multi-scale spectral features. To reconstruct fine spatial details, we introduce a Spatial Learnable Decoder (SLD) equipped with an adaptive multi-branch upsampling strategy. Furthermore, we design a frequency-aware loss (FAL) function to enhance minority class learning. Extensive experiments on multiple medical segmentation benchmarks demonstrate that FreqU-FNet consistently outperforms both CNN and Transformer baselines, particularly in handling under-represented classes, by effectively exploiting discriminative frequency bands.

[325] Distance Estimation in Outdoor Driving Environments Using Phase-only Correlation Method with Event Cameras

Masataka Kobayashi,Shintaro Shiba,Quan Kong,Norimasa Kobori,Tsukasa Shimizu,Shan Lu,Takaya Yamazato

Main category: eess.IV

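TL;DR: Proposes a distance estimation method that uses a monocular event camera and a roadside LED bar; phase-only correlation enables high-precision triangulation, and experiments show errors under 0.5 m at distances of 20 to 60 m.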

Details Motivation: As autonomous driving advances, sensor fusion is effective but adds hardware complexity and cost, so a single sensor that can serve multiple roles is desirable. Event cameras are a candidate because of their high dynamic range and low latency. Method: Uses a monocular event camera and a roadside LED bar, applying phase-only correlation to detect the spatial shift between light sources and perform high-precision triangulation without stereo vision. Result: In outdoor driving experiments, the method achieves a success rate above 90% with errors under 0.5 m at distances of 20 to 60 m. Conclusion: The method offers a feasible route to low-cost, high-precision distance estimation and could be extended to real-time position estimation, improving navigation accuracy and integration with intelligent transportation systems. Abstract: With the growing adoption of autonomous driving, the advancement of sensor technology is crucial for ensuring safety and reliable operation. Sensor fusion techniques that combine multiple sensors such as LiDAR, radar, and cameras have proven effective, but the integration of multiple devices increases both hardware complexity and cost. Therefore, developing a single sensor capable of performing multiple roles is highly desirable for cost-efficient and scalable autonomous driving systems. Event cameras have emerged as a promising solution due to their unique characteristics, including high dynamic range, low latency, and high temporal resolution. These features enable them to perform well in challenging lighting conditions, such as low-light or backlit environments. Moreover, their ability to detect fine-grained motion events makes them suitable for applications like pedestrian detection and vehicle-to-infrastructure communication via visible light. In this study, we present a method for distance estimation using a monocular event camera and a roadside LED bar. By applying a phase-only correlation technique to the event data, we achieve sub-pixel precision in detecting the spatial shift between two light sources. This enables accurate triangulation-based distance estimation without requiring stereo vision. Field experiments conducted in outdoor driving scenarios demonstrated that the proposed approach achieves over 90% success rate with less than 0.5-meter error for distances ranging from 20 to 60 meters.
Future work includes extending this method to full position estimation by leveraging infrastructure such as smart poles equipped with LEDs, enabling event-camera-based vehicles to determine their own position in real time. This advancement could significantly enhance navigation accuracy, route optimization, and integration into intelligent transportation systems.
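
The phase-only correlation step described in the abstract can be sketched in one dimension. This is a minimal, dependency-free illustration under stated assumptions: it uses a naive DFT, recovers only integer shifts (the paper reports sub-pixel precision, which would need an additional peak-fitting step), and the triangulation helper assumes a simple pinhole model with the LED bar parallel to the image plane. All function names and the example numbers are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse is scaled by 1/n)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def phase_only_correlation(f, g, eps=1e-12):
    """Correlation surface computed from spectral phase only."""
    F, G = dft(f), dft(g)
    cross = []
    for a, b in zip(F, G):
        c = a.conjugate() * b          # cross-power spectrum
        cross.append(c / (abs(c) + eps))  # discard magnitude, keep phase
    return [v.real for v in dft(cross, inverse=True)]


def estimate_shift(f, g):
    """Integer circular shift d such that g[t] is approximately f[t - d]."""
    poc = phase_only_correlation(f, g)
    n = len(poc)
    peak = max(range(n), key=lambda i: poc[i])
    return peak if peak <= n // 2 else peak - n


def distance_from_separation(focal_px, baseline_m, separation_px):
    """Pinhole-model range: Z = f * B / s (assumed geometry, see lead-in)."""
    return focal_px * baseline_m / separation_px


f = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
g = [f[(t - 3) % len(f)] for t in range(len(f))]  # f delayed by 3 samples
shift = estimate_shift(f, g)
```

Because the magnitude is normalized away, the correlation peak stays sharp even when the two signals differ in brightness, which is the usual reason for preferring phase-only correlation over plain cross-correlation.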

[326] Towards Prospective Medical Image Reconstruction via Knowledge-Informed Dynamic Optimal Transport

Taoran Zheng,Xing Li,Yan Yang,Xiang Gu,Zongben Xu,Jian Sun

Main category: eess.IV

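TL;DR: Proposes imaging Knowledge-Informed Dynamic Optimal Transport (KIDOT), a new approach to closing the performance gap between simulated and real data in medical image reconstruction.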

Details Motivation: Deep learning for medical image reconstruction usually relies on simulated paired data, but performance degrades in real use because of a knowledge gap between simulated and real data. Method: The KIDOT framework models reconstruction as a dynamic transport path from measurements to images, uses imaging knowledge to guide the transport, and learns from unpaired data. Result: Experiments show that KIDOT performs strongly on MRI and CT reconstruction. Conclusion: By combining dynamic optimal transport with imaging knowledge, KIDOT improves reconstruction performance and suits unpaired-data settings. Abstract: Medical image reconstruction from measurement data is a vital but challenging inverse problem. Deep learning approaches have achieved promising results, but often require paired measurements and high-quality images, which are typically simulated through a forward model, i.e., retrospective reconstruction. However, training on simulated pairs commonly leads to performance degradation on real prospective data due to the retrospective-to-prospective gap caused by incomplete imaging knowledge in simulation. To address this challenge, this paper introduces imaging Knowledge-Informed Dynamic Optimal Transport (KIDOT), a novel dynamic optimal transport framework with optimality in the sense of preserving consistency with imaging physics in transport, that conceptualizes reconstruction as finding a dynamic transport path. KIDOT learns from unpaired data by modeling reconstruction as a continuous evolution path from measurements to images, guided by an imaging knowledge-informed cost function and transport equation. This dynamic and knowledge-aware approach enhances robustness and better leverages unpaired data while respecting acquisition physics. Theoretically, we demonstrate that KIDOT naturally generalizes dynamic optimal transport, ensuring its mathematical rationale and solution existence. Extensive experiments on MRI and CT reconstruction demonstrate KIDOT's superior performance.

[327] Dual Attention Residual U-Net for Accurate Brain Ultrasound Segmentation in IVH Detection

Dan Yuan,Yi Feng,Ziyun Tang

Main category: eess.IV

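TL;DR: Proposes an enhanced Residual U-Net that combines CBAM and SAL to segment the ventricle region in brain ultrasound images of premature infants for IVH detection, achieving strong performance.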

Details Motivation: Intraventricular hemorrhage (IVH) is a severe neurological complication in premature infants and needs early, accurate detection from brain ultrasound images to improve clinical outcomes. Existing deep learning methods still struggle to capture both local spatial detail and global contextual dependencies. Method: Proposes an enhanced Residual U-Net that combines CBAM (Convolutional Block Attention Module) and SAL (Sparse Attention Layer). CBAM refines spatial and channel-wise features; SAL uses sparse attention to filter low-confidence information while ensuring information propagation. Result: On the Brain US dataset, the method reaches a Dice score of 89.04% and an IoU of 81.84% for ventricle region segmentation, which the authors report as state of the art. Conclusion: Combining spatial refinement with attention sparsity improves the robustness of brain anatomy detection. Abstract: Intraventricular hemorrhage (IVH) is a severe neurological complication among premature infants, necessitating early and accurate detection from brain ultrasound (US) images to improve clinical outcomes. While recent deep learning methods offer promise for computer-aided diagnosis, challenges remain in capturing both local spatial details and global contextual dependencies critical for segmenting brain anatomies. In this work, we propose an enhanced Residual U-Net architecture incorporating two complementary attention mechanisms: the Convolutional Block Attention Module (CBAM) and a Sparse Attention Layer (SAL). The CBAM improves the model's ability to refine spatial and channel-wise features, while the SAL introduces a dual-branch design: sparse attention filters out low-confidence query-key pairs to suppress noise, and dense attention ensures comprehensive information propagation. Extensive experiments on the Brain US dataset demonstrate that our method achieves state-of-the-art segmentation performance, with a Dice score of 89.04% and IoU of 81.84% for ventricle region segmentation. These results highlight the effectiveness of integrating spatial refinement and attention sparsity for robust brain anatomy detection. Code is available at: https://github.com/DanYuan001/BrainImgSegment.
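
For readers less familiar with the two reported metrics, the following sketch computes Dice and IoU for a pair of flattened binary masks. It is a generic definition, not the authors' evaluation code; the function name `dice_and_iou` and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and intersection-over-union for flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks empty: treated here as a perfect match (a convention).
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


dice, iou = dice_and_iou([1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0])
```

On a single mask pair the two are tied by IoU = Dice / (2 - Dice). Scores averaged over many images, as benchmark figures usually are, need not satisfy that identity exactly.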

[328] UltraBoneUDF: Self-supervised Bone Surface Reconstruction from Ultrasound Based on Neural Unsigned Distance Functions

Luohong Wu,Matthias Seibold,Nicola A. Cavalcanti,Giuseppe Loggia,Lisa Reissner,Bastian Sigrist,Jonas Hein,Lilian Calvet,Arnd Viehöfer,Philipp Fürnstahl

Main category: eess.IV

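TL;DR: UltraBoneUDF is a self-supervised framework for reconstructing open bone surfaces from ultrasound data; neural unsigned distance functions and a global feature extractor substantially improve reconstruction quality.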

Details Motivation: Conventional ultrasound captures only partial bone surfaces, and existing methods reconstruct such incomplete data poorly, so more effective techniques are needed. Method: Proposes the UltraBoneUDF framework, which combines a global feature extractor with a loss function based on local tangent plane optimization. Result: UltraBoneUDF outperforms existing methods on multiple datasets, reducing mean Chamfer distance error by 39.6% to 70.2%. Conclusion: UltraBoneUDF offers an efficient solution for open bone surface reconstruction with potential for clinical use. Abstract: Background: Bone surface reconstruction plays a critical role in computer-assisted orthopedic surgery. Compared to traditional imaging modalities such as CT and MRI, ultrasound offers a radiation-free, cost-effective, and portable alternative. Continuous bone surface reconstruction can be employed for many clinical applications. However, due to the inherent limitations of ultrasound imaging, B-mode ultrasound typically captures only partial bone surfaces. Existing reconstruction methods struggle with such incomplete data, leading to artifacts and increased reconstruction errors. Effective techniques for accurately reconstructing thin and open bone surfaces from real-world 3D ultrasound volumes remain lacking. Methods: We propose UltraBoneUDF, a self-supervised framework designed for reconstructing open bone surfaces from ultrasound using neural Unsigned Distance Functions. To enhance reconstruction quality, we introduce a novel global feature extractor that effectively fuses ultrasound-specific image characteristics. Additionally, we present a novel loss function based on local tangent plane optimization that substantially improves surface reconstruction quality. UltraBoneUDF and baseline models are extensively evaluated on four open-source datasets. Results: Qualitative results highlight the limitations of the state-of-the-art methods for open bone surface reconstruction and demonstrate the effectiveness of UltraBoneUDF.
Quantitatively, UltraBoneUDF significantly outperforms competing methods across all evaluated datasets for both open and closed bone surface reconstruction in terms of mean Chamfer distance error: 1.10 mm on the UltraBones100k dataset (39.6% improvement compared to the SOTA), 0.23 mm on the OpenBoneCT dataset (69.3% improvement), 0.18 mm on the ClosedBoneCT dataset (70.2% improvement), and 0.05 mm on the Prostate dataset (55.3% improvement).
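
The Chamfer distance used as the headline metric can be sketched as below. Several variants exist (squared or unsquared distances, summed or averaged directions), and the abstract does not say which one the authors use, so this brute-force version, which averages the two directed mean nearest-neighbor distances, is only one plausible reading. The function names are hypothetical.

```python
import math


def _mean_nearest_distance(src, dst):
    """Mean Euclidean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets (O(n*m))."""
    if not a or not b:
        raise ValueError("point sets must be non-empty")
    return 0.5 * (_mean_nearest_distance(a, b) + _mean_nearest_distance(b, a))


surface_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
surface_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
error = chamfer_distance(surface_a, surface_b)
```

For point clouds of realistic size, a spatial index such as a k-d tree would replace the quadratic nearest-neighbor search.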

[329] Promptable cancer segmentation using minimal expert-curated data

Lynn Karam,Yipei Wang,Veeru Kasivisvanathan,Mirabela Rusu,Yipeng Hu,Shaheer U. Saeed

Main category: eess.IV

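TL;DR: Proposes a promptable segmentation method that needs very little annotated data (24 fully-segmented and 8 weakly-labelled images); a guided search driven by two classifiers refines the segmentation and outperforms existing promptable methods.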

Details Motivation: To address the high annotation cost and large inter-observer variability in cancer segmentation on medical images, and to reduce reliance on large annotated datasets. Method: Combines a weakly-supervised and a fully-supervised classifier, and refines the segmentation through a search process guided by a single-point prompt. Result: With minimal annotated data, the method outperforms existing promptable segmentation methods and approaches fully-supervised methods, while using up to 100 times less annotated data. Conclusion: The method greatly reduces annotation requirements, making it feasible to curate labels to a high standard, and is suited to practical medical use. Abstract: Automated segmentation of cancer on medical images can aid targeted diagnostic and therapeutic procedures. However, its adoption is limited by the high cost of expert annotations required for training and inter-observer variability in datasets. While weakly-supervised methods mitigate some challenges, using binary histology labels for training as opposed to requiring full segmentation, they require large paired datasets of histology and images, which are difficult to curate. Similarly, promptable segmentation aims to allow segmentation with no re-training for new tasks at inference, however, existing models perform poorly on pathological regions, again necessitating large datasets for training. In this work we propose a novel approach for promptable segmentation requiring only 24 fully-segmented images, supplemented by 8 weakly-labelled images, for training. Curating this minimal data to a high standard is relatively feasible and thus issues with the cost and variability of obtaining labels can be mitigated. By leveraging two classifiers, one weakly-supervised and one fully-supervised, our method refines segmentation through a guided search process initiated by a single-point prompt. Our approach outperforms existing promptable segmentation methods, and performs comparably with fully-supervised methods, for the task of prostate cancer segmentation, while using substantially less annotated data (up to 100X less). This enables promptable segmentation with very minimal labelled data, such that the labels can be curated to a very high standard.

[330] Explainable Anatomy-Guided AI for Prostate MRI: Foundation Models and In Silico Clinical Trials for Virtual Biopsy-based Risk Assessment

Danial Khan,Zohaib Salahuddin,Yumeng Zhang,Sheng Kuang,Shruti Atul Mali,Henry C. Woodruff,Sina Amirrajab,Rachel Cavill,Eduardo Ibor-Crespo,Ana Jimenez-Pastor,Adrian Galiana-Bordera,Paula Jimenez Gomez,Luis Marti-Bonmati,Philippe Lambin

Main category: eess.IV

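TL;DR: Proposes an automated deep learning system for prostate cancer risk stratification that integrates nnU-Net segmentation, Swin Transformer classification, and VAE-GAN counterfactual heatmaps, improving diagnostic accuracy and efficiency.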

Details Motivation: To develop an efficient, interpretable method for prostate cancer risk stratification that combines anatomical priors with clinical data to support clinical decisions. Method: nnU-Net segments the prostate gland and its zones; a Swin Transformer classifier incorporates anatomical priors and clinical data; a VAE-GAN generates counterfactual heatmaps to improve interpretability. Result: Segmentation reached a Dice score of 0.95 (gland); classification AUC rose to 0.79; in the clinician trial, diagnostic accuracy rose from 0.72 to 0.77 and review time fell by 40%. Conclusion: Anatomy-aware deep learning models with counterfactual explainability can deliver accurate and efficient prostate cancer risk assessment and show promise as a clinical virtual biopsy tool. Abstract: We present a fully automated, anatomically guided deep learning pipeline for prostate cancer (PCa) risk stratification using routine MRI. The pipeline integrates three key components: an nnU-Net module for segmenting the prostate gland and its zones on axial T2-weighted MRI; a classification module based on the UMedPT Swin Transformer foundation model, fine-tuned on 3D patches with optional anatomical priors and clinical data; and a VAE-GAN framework for generating counterfactual heatmaps that localize decision-driving image regions. The system was developed using 1,500 PI-CAI cases for segmentation and 617 biparametric MRIs with metadata from the CHAIMELEON challenge for classification (split into 70% training, 10% validation, and 20% testing). Segmentation achieved mean Dice scores of 0.95 (gland), 0.94 (peripheral zone), and 0.92 (transition zone). Incorporating gland priors improved AUC from 0.69 to 0.72, with a three-scale ensemble achieving top performance (AUC = 0.79, composite score = 0.76), outperforming the 2024 CHAIMELEON challenge winners. Counterfactual heatmaps reliably highlighted lesions within segmented regions, enhancing model interpretability. In a prospective multi-center in-silico trial with 20 clinicians, AI assistance increased diagnostic accuracy from 0.72 to 0.77 and Cohen's kappa from 0.43 to 0.53, while reducing review time per case by 40%. These results demonstrate that anatomy-aware foundation models with counterfactual explainability can enable accurate, interpretable, and efficient PCa risk assessment, supporting their potential use as virtual biopsies in clinical practice.
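
The agreement statistic quoted for the clinician trial, Cohen's kappa, corrects raw agreement for the agreement expected by chance. The sketch below shows the standard two-rater formula on made-up labels; it is not the study's analysis code, and the function name and example ratings are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy case the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is 0.5, which shows how kappa can sit well below raw accuracy.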

[331] A Foundation Model Framework for Multi-View MRI Classification of Extramural Vascular Invasion and Mesorectal Fascia Invasion in Rectal Cancer

Yumeng Zhang,Zohaib Salahuddin,Danial Khan,Shruti Atul Mali,Henry C. Woodruff,Sina Amirrajab,Eduardo Ibor-Crespo,Ana Jimenez-Pastor,Luis Marti-Bonmati,Philippe Lambin

Main category: eess.IV

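TL;DR: Develops a multicenter, foundation-model-based framework that automatically classifies EVI and MFI on rectal cancer MRI; feature fusion and frequency-domain harmonization improve diagnostic performance.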

Details Motivation: Visual MRI assessment of EVI and MFI is subjective and varies between institutions, so an automated, accurate classification method is needed. Method: A retrospective study of 331 rectal cancer MRI examinations that uses TotalSegmentator to extract the rectal region, applies frequency-domain harmonization to reduce scanner differences, and compares four classifiers (ResNet50, SeResNet, UMedPT, and UMedPT_LR). Result: UMedPT_LR was best for EVI detection (AUC = 0.82) and UMedPT was best for MFI classification (AUC = 0.77), both exceeding the previous best results. Frequency-domain harmonization helped MFI classification but had mixed effects on EVI. Conclusion: Combining foundation model features, frequency-domain harmonization, and multi-view fusion improves diagnostic performance on rectal MRI. Abstract: Background: Accurate MRI-based identification of extramural vascular invasion (EVI) and mesorectal fascia invasion (MFI) is pivotal for risk-stratified management of rectal cancer, yet visual assessment is subjective and vulnerable to inter-institutional variability. Purpose: To develop and externally evaluate a multicenter, foundation-model-driven framework that automatically classifies EVI and MFI on axial and sagittal T2-weighted MRI. Methods: This retrospective study used 331 pre-treatment rectal cancer MRI examinations from three European hospitals. After TotalSegmentator-guided rectal patch extraction, a self-supervised frequency-domain harmonization pipeline was trained to minimize scanner-related contrast shifts. Four classifiers were compared: ResNet50, SeResNet, the universal biomedical pretrained transformer (UMedPT) with a lightweight MLP head, and a logistic-regression variant using frozen UMedPT features (UMedPT_LR). Results: UMedPT_LR achieved the best EVI detection when axial and sagittal features were fused (AUC = 0.82; sensitivity = 0.75; F1 score = 0.73), surpassing the Chaimeleon Grand-Challenge winner (AUC = 0.74). The highest MFI performance was attained by UMedPT on axial harmonized images (AUC = 0.77), surpassing the Chaimeleon Grand-Challenge winner (AUC = 0.75). Frequency-domain harmonization improved MFI classification but variably affected EVI performance. Conventional CNNs (ResNet50, SeResNet) underperformed, especially in F1 score and balanced accuracy.
Conclusion: These findings demonstrate that combining foundation model features, harmonization, and multi-view fusion significantly enhances diagnostic performance in rectal MRI.
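
Since AUC is the main figure of merit in this entry, a small reminder of what it measures may help: it is the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the function name and the toy labels and scores are illustrative, not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form is quadratic in the number of cases, which is fine for a cohort of a few hundred examinations; rank-based formulas give the same value more efficiently.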

[332] Accelerating Learned Image Compression Through Modeling Neural Training Dynamics

Yichi Zhang,Zhihao Duan,Yuning Huang,Fengqing Zhu

Main category: eess.IV

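TL;DR: Proposes the STDET mechanism and the SMA technique to accelerate training of learned image compression (LIC) methods; by reducing trainable parameters and smoothing training dynamics, they improve training efficiency without sacrificing model performance.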

Details Motivation: As learned image compression methods grow more computationally demanding, improving their training efficiency becomes crucial. Method: Proposes the Sensitivity-aware True and Dummy Embedding Training (STDET) mechanism, which clusters parameters into a few modes and uses stable intra-mode correlations and parameter sensitivities to reduce trainable parameters, combined with the Sampling-then-Moving Average (SMA) technique to smooth training dynamics. Result: The method reduces the training space dimension and the number of trainable parameters and accelerates convergence without degrading performance. Theoretical analysis shows lower training variance than standard SGD. Conclusion: The method offers insights for developing efficient LIC training methods. Abstract: As learned image compression (LIC) methods become increasingly computationally demanding, enhancing their training efficiency is crucial. This paper takes a step forward in accelerating the training of LIC methods by modeling the neural training dynamics. We first propose a Sensitivity-aware True and Dummy Embedding Training mechanism (STDET) that clusters LIC model parameters into a few separate modes where parameters are expressed as affine transformations of reference parameters within the same mode. By further utilizing the stable intra-mode correlations throughout training and parameter sensitivities, we gradually embed non-reference parameters, reducing the number of trainable parameters. Additionally, we incorporate a Sampling-then-Moving Average (SMA) technique, interpolating sampled weights from stochastic gradient descent (SGD) training to obtain the moving average weights, ensuring smooth temporal behavior and minimizing training state variances. Overall, our method significantly reduces training space dimensions and the number of trainable parameters without sacrificing model performance, thus accelerating model convergence. We also provide a theoretical analysis on the Noisy quadratic model, showing that the proposed method achieves a lower training variance than standard SGD. Our approach offers valuable insights for further developing efficient training methods for LICs.

cs.AI [Back]

[333] MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

Jihan Yao,Yushi Hu,Yujie Yi,Bin Han,Shangbin Feng,Guang Yang,Bingbing Wen,Ranjay Krishna,Lucy Lu Wang,Yulia Tsvetkov,Noah A. Smith,Banghua Zhu

Main category: cs.AI

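TL;DR: MMMG is an evaluation benchmark for multimodal generation covering 4 modality combinations and 49 tasks; it achieves reliable automatic evaluation through a combination of models and programs and agrees closely with human evaluation (94.3%).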

Details Motivation: When multimodal generation is evaluated automatically, existing metrics struggle to agree with human evaluation, especially on complex multimodal tasks. Method: Builds the MMMG benchmark, with 49 tasks and 937 instructions, and designs evaluation pipelines that combine models and programs. Result: MMMG agrees closely with human evaluation (94.3%). Existing models such as GPT Image fall short on multimodal reasoning and interleaved generation, and audio generation has considerable room for improvement. Conclusion: MMMG provides a reliable evaluation tool for multimodal generation and exposes the weaknesses of current models; audio generation in particular needs further research. Abstract: Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.

[334] ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback

Litao Guo,Xinli Xu,Luozhou Wang,Jiantao Lin,Jinsong Zhou,Zixin Zhang,Bolan Su,Ying-Cong Chen

Main category: cs.AI

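TL;DR: ComfyMind is a collaborative AI system built on the ComfyUI platform; a semantic workflow interface and a search tree planning mechanism improve the stability and flexibility of generative workflows, and the system performs well on several benchmarks.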

Details Motivation: Existing open-source frameworks are fragile in complex real-world applications and lack structured workflow planning and execution-level feedback; ComfyMind aims to address these problems. Method: Introduces the Semantic Workflow Interface (SWI) and a Search Tree Planning mechanism that support high-level composition and adaptive correction. Result: On the ComfyBench, GenEval, and Reason-Edit benchmarks, ComfyMind outperforms existing open-source baselines and approaches GPT-Image-1. Conclusion: ComfyMind offers a promising path for developing open-source general-purpose generative AI systems. Abstract: With the rapid advancement of generative models, general-purpose generation has gained increasing attention as a promising approach to unify diverse tasks across modalities within a single system. Despite this progress, existing open-source frameworks often remain fragile and struggle to support complex real-world applications due to the lack of structured workflow planning and execution-level feedback. To address these limitations, we present ComfyMind, a collaborative AI system designed to enable robust and scalable general-purpose generation, built on the ComfyUI platform. ComfyMind introduces two core innovations: Semantic Workflow Interface (SWI) that abstracts low-level node graphs into callable functional modules described in natural language, enabling high-level composition and reducing structural errors; Search Tree Planning mechanism with localized feedback execution, which models generation as a hierarchical decision process and allows adaptive correction at each stage. Together, these components improve the stability and flexibility of complex generative workflows. We evaluate ComfyMind on three public benchmarks: ComfyBench, GenEval, and Reason-Edit, which span generation, editing, and reasoning tasks. Results show that ComfyMind consistently outperforms existing open-source baselines and achieves performance comparable to GPT-Image-1. ComfyMind paves a promising path for the development of open-source general-purpose generative AI systems. Project page: https://github.com/LitaoGuo/ComfyMind

[335] VideoGameBench: Can Vision-Language Models complete popular video games?

Alex L. Zhang,Thomas L. Griffiths,Karthik R. Narasimhan,Ofir Press

Main category: cs.AI

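TL;DR: VideoGameBench is a new benchmark that evaluates vision-language models (VLMs) on real-time video games, comprising 10 games from the 1990s. Frontier models perform poorly and complete only a tiny fraction of the benchmark.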

Details Motivation: To study VLM abilities on tasks that come naturally to humans, such as perception and spatial navigation, filling a gap in existing research. Method: Uses VideoGameBench to test VLMs on 10 games, providing only raw visual input and a high-level description of objectives. Result: Frontier models such as Gemini 2.5 Pro complete very little of the benchmark (0.48%), and inference latency is a major limitation in the real-time setting. Conclusion: VideoGameBench opens a new direction for research on human-like skills in VLMs, and model capabilities need further improvement. Abstract: Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing model, Gemini 2.5 Pro, completes only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.

[336] Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

Wang Yang,Zirui Liu,Hongye Jin,Qingyu Yin,Vipin Chaudhary,Xiaotian Han

Main category: cs.AI

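TL;DR: Finds that strengthening a language model's long-context ability significantly improves its reasoning performance, even on tasks with short inputs.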

Details Motivation: The reasoning ability of current language models may be limited by insufficient long-context capacity, a hypothesis supported by empirical observations. Method: Compares models with identical architectures and fine-tuning data but different long-context capacity to test its effect on reasoning. Result: Models with stronger long-context capacity perform better on reasoning tasks, and the advantage persists on short-input tasks. Conclusion: Long-context capacity is a critical foundation for reasoning and should be a first-class objective in future language model design. Abstract: Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) higher context window length often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model's long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compared models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not just essential for processing lengthy inputs, but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models.

[337] DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic

Yuheng Wu,Jianwen Xie,Denghui Zhang,Zhaozhuo Xu

Main category: cs.AI

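TL;DR: The DEL-ToM framework decomposes ToM tasks using Dynamic Epistemic Logic and uses a verifier, the PBM, to score candidate belief traces, improving the ToM reasoning of small language models.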

Details Motivation: Small language models struggle with deep social reasoning because of their limited scale; DEL-ToM aims to improve their ToM ability through inference-time scaling rather than architectural changes. Method: Decomposes ToM tasks into a sequence of belief updates grounded in Dynamic Epistemic Logic, trains a PBM verifier to score each step, and selects the highest-scoring trace. Result: Experiments show that DEL-ToM improves performance across model scales and benchmarks, confirming that belief supervision strengthens the ToM ability of small language models. Conclusion: Through structured reasoning and verifier scoring, DEL-ToM improves the ToM ability of small language models without retraining. Abstract: Theory-of-Mind (ToM) tasks pose a unique challenge for small language models (SLMs) with limited scale, which often lack the capacity to perform deep social reasoning. In this work, we propose DEL-ToM, a framework that improves ToM reasoning through inference-time scaling rather than architectural changes. Our approach decomposes ToM tasks into a sequence of belief updates grounded in Dynamic Epistemic Logic (DEL), enabling structured and transparent reasoning. We train a verifier, called the Process Belief Model (PBM), to score each belief update step using labels generated automatically via a DEL simulator. During inference, candidate belief traces generated by a language model are evaluated by the PBM, and the highest-scoring trace is selected. This allows SLMs to emulate more deliberate reasoning by allocating additional compute at test time. Experiments across multiple model scales and benchmarks show that DEL-ToM consistently improves performance, demonstrating that verifiable belief supervision can significantly enhance ToM abilities of SLMs without retraining.
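
The inference-time selection described in the abstract (score each belief-update step, keep the highest-scoring trace) follows the general best-of-N pattern, sketched below. Everything here is a stand-in: `select_best_trace` is a hypothetical name, the step scorer is a lookup table rather than a trained PBM, and aggregating step scores with `min` is an assumption, since the abstract does not say how step scores are combined.

```python
def select_best_trace(traces, step_scorer, aggregate=min):
    """Return the candidate trace whose aggregated step score is highest.

    traces      -- list of non-empty lists of reasoning steps
    step_scorer -- callable mapping one step to a score (stand-in for a verifier)
    aggregate   -- how step scores combine into a trace score (assumed: min)
    """
    if not traces or any(not trace for trace in traces):
        raise ValueError("need at least one trace, and every trace needs a step")

    def trace_score(trace):
        return aggregate(step_scorer(step) for step in trace)

    return max(traces, key=trace_score)


# Toy verifier: a fixed table of per-step scores.
toy_scores = {"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.7}
candidates = [["a", "b"], ["c", "d"]]
best = select_best_trace(candidates, toy_scores.get)
```

With `min`, a single weak step sinks a whole trace, so `["c", "d"]` wins here despite `"a"` having the highest individual score. Generating more candidates spends more compute at test time without retraining the generator.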

[338] From Reasoning to Generalization: Knowledge-Augmented LLMs for ARC Benchmark

Chao Lei,Nir Lipovetzky,Krista A. Ehinger,Yanchuan Chang

Main category: cs.AI

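TL;DR: Evaluates reasoning-oriented LLMs on the Abstraction and Reasoning Corpus (ARC) and proposes a new method, KAAR, that improves performance through knowledge augmentation.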

Details Motivation: To examine the shortcomings of LLMs in abstract reasoning and generalization and to improve their performance on the ARC benchmark. Method: Formulates ARC as a program synthesis problem and proposes two methods, RSPC and KAAR; KAAR improves reasoning through stage-wise knowledge augmentation. Result: KAAR outperforms non-augmented RSPC on all evaluated LLMs, with absolute gains of about 5% and relative improvement of up to 64.52%. Conclusion: ARC remains a challenging task for LLMs, and KAAR shows the potential of knowledge augmentation for improving reasoning. Abstract: Recent reasoning-oriented LLMs have demonstrated strong performance on challenging tasks such as mathematics and science examinations. However, core cognitive faculties of human intelligence, such as abstract reasoning and generalization, remain underexplored. To address this, we evaluate recent reasoning-oriented LLMs on the Abstraction and Reasoning Corpus (ARC) benchmark, which explicitly demands both faculties. We formulate ARC as a program synthesis task and propose nine candidate solvers. Experimental results show that repeated-sampling planning-aided code generation (RSPC) achieves the highest test accuracy and demonstrates consistent generalization across most LLMs. To further improve performance, we introduce an ARC solver, Knowledge Augmentation for Abstract Reasoning (KAAR), which encodes core knowledge priors within an ontology that classifies priors into three hierarchical levels based on their dependencies. KAAR progressively expands LLM reasoning capacity by gradually augmenting priors at each level, and invokes RSPC to generate candidate solutions after each augmentation stage. This stage-wise reasoning reduces interference from irrelevant priors and improves LLM performance. Empirical results show that KAAR maintains strong generalization and consistently outperforms non-augmented RSPC across all evaluated LLMs, achieving around 5% absolute gains and up to 64.52% relative improvement. Despite these achievements, ARC remains a challenging benchmark for reasoning-oriented LLMs, highlighting future avenues of progress in LLMs.
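
The abstract quotes both an absolute gain (about 5%) and a relative improvement (up to 64.52%), which can look inconsistent at first glance. The sketch below shows how the two are related. The baseline of 7.75% is purely illustrative, chosen because a 5-point gain on it happens to give a 64.52% relative improvement; it is not a figure from the paper, and the two quoted numbers may come from different models.

```python
def absolute_gain(baseline, improved):
    """Difference in percentage points."""
    return improved - baseline


def relative_gain(baseline, improved):
    """Improvement as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (improved - baseline) / baseline


# Illustrative accuracies in percent (not taken from the paper).
baseline_acc, improved_acc = 7.75, 12.75
abs_points = absolute_gain(baseline_acc, improved_acc)
rel_percent = round(relative_gain(baseline_acc, improved_acc) * 100, 2)
```

A small absolute gain on a low baseline produces a large relative improvement, so the two figures are best read together.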

[339] PD$^3$: A Project Duplication Detection Framework via Adapted Multi-Agent Debate

Dezheng Bao,Yueci Yang,Xin Chen,Zhengxuan Jiang,Zeguo Fei,Daoze Zhang,Xuanwen Huang,Junru Chen,Chutian Yu,Xiang Yuan,Yang Yang

Main category: cs.AI

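TL;DR: PD$^3$ is a framework that detects project duplication through multi-agent debate; it combines qualitative and quantitative analysis, outperforms existing methods, and is deployed in an online platform, Review Dingdang.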

Details Motivation: Project duplication detection is critical to efficient resource use, but existing methods lack in-depth comprehension and useful feedback for experts. Method: Uses a multi-agent debate framework to retrieve relevant projects and combines qualitative and quantitative analysis for feedback. Result: On data from more than 800 power projects, the method outperforms existing approaches by 7.43% and 8.00% on two downstream tasks, and the deployed platform saved 5.73 million USD. Conclusion: The PD$^3$ framework improves the accuracy and practicality of project duplication detection. Abstract: Project duplication detection is critical for project quality assessment, as it improves resource utilization efficiency by preventing investment in newly proposed projects that have already been studied. It requires the ability to understand high-level semantics and generate constructive and valuable feedback. Existing detection methods rely on basic word- or sentence-level comparison or solely apply large language models, lacking valuable insights for experts and in-depth comprehension of project content and review criteria. To tackle this issue, we propose PD$^3$, a Project Duplication Detection framework via adapted multi-agent Debate. Inspired by real-world expert debates, it employs a fair competition format to guide multi-agent debate to retrieve relevant projects. For feedback, it incorporates both qualitative and quantitative analysis to improve its practicality. Over 800 real-world power project data spanning more than 20 specialized fields are used to evaluate the framework, demonstrating that our method outperforms existing approaches by 7.43% and 8.00% in two downstream tasks. Furthermore, we establish an online platform, Review Dingdang, to assist power experts, saving 5.73 million USD in initial detection on more than 100 newly proposed projects.

[340] Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMs

Shuhang Xu,Weijian Deng,Yixuan Zhou,Fangwei Zhong

Main category: cs.AI

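TL;DR: CK-Arena is a benchmark built on a multi-agent interaction game that evaluates how well large language models (LLMs) understand conceptual boundaries in dynamic environments.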

Details Motivation: Existing benchmarks focus on factual recall and isolated tasks and fail to evaluate whether LLMs understand conceptual boundaries. Method: CK-Arena, a multi-agent interaction game based on the Undercover game, requires models to describe, differentiate, and infer conceptual boundaries. Result: Experiments show that LLMs' understanding of conceptual knowledge varies significantly across categories and is not strictly aligned with parameter size or general capability. Conclusion: CK-Arena provides a scalable and realistic evaluation tool for conceptual reasoning in dynamic environments. Abstract: Concepts represent generalized abstractions that enable humans to categorize and reason efficiently, yet it is unclear to what extent Large Language Models (LLMs) comprehend these semantic relationships. Existing benchmarks typically focus on factual recall and isolated tasks, failing to evaluate the ability of LLMs to understand conceptual boundaries. To address this gap, we introduce CK-Arena, a multi-agent interaction game built upon the Undercover game, designed to evaluate the capacity of LLMs to reason with concepts in interactive settings. CK-Arena challenges models to describe, differentiate, and infer conceptual boundaries based on partial information, encouraging models to explore commonalities and distinctions between closely related concepts. By simulating real-world interaction, CK-Arena provides a scalable and realistic benchmark for assessing conceptual reasoning in dynamic environments. Experimental results show that LLMs' understanding of conceptual knowledge varies significantly across different categories and is not strictly aligned with parameter size or general model capabilities. The data and code are available at the project homepage: https://ck-arena.site.

[341] Controlled Agentic Planning & Reasoning for Mechanism Synthesis

João Pedro Gandarela,Thiago Rios,Stefan Menzel,André Freitas

Main category: cs.AI

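TL;DR: Proposes a dual-agent large language model (LLM) reasoning method for mechanism synthesis that works at both the linguistic and symbolic levels to generate geometric and dynamic outcomes, refined through a closed feedback loop.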

Details Motivation: To integrate natural language descriptions with symbolic reasoning in mechanism synthesis and improve the accuracy and actionability of the generated results. Method: A dual-agent LLM model starts from a natural language specification and closes a feedback loop through equation referencing, code generation and parametrization, symbolic regression, and distance functions. Result: The approach is effective and convergent for planar mechanism synthesis, and a new benchmark, MSynth, is introduced to analyze the impact of the model components. Conclusion: Symbolic regression prompts reveal mechanistic insights, but only with sufficiently large model architectures. Abstract: This work presents a dual-agent Large Language Model (LLM)-based reasoning method for mechanism synthesis, capable of reasoning at both linguistic and symbolic levels to generate geometrical and dynamic outcomes. The model consists of a composition of well-defined functions that, starting from a natural language specification, references abstract properties through supporting equations, generates and parametrizes simulation code, and elicits feedback anchor points using symbolic regression and distance functions. This process closes an actionable refinement loop at the linguistic and symbolic layers. The approach is shown to be both effective and convergent in the context of planar mechanisms. Additionally, we introduce MSynth, a novel benchmark for planar mechanism synthesis, and perform a comprehensive analysis of the impact of the model components. We further demonstrate that symbolic regression prompts unlock mechanistic insights only when applied to sufficiently large architectures.

[342] PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Daeun Kyung,Hyunseung Chung,Seongsu Bae,Jiho Kim,Jae Ho Sohn,Taerim Kim,Soo Kyung Kim,Edward Choi

Main category: cs.AI

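TL;DR: PatientSim is a patient simulator grounded in real clinical data that generates diverse patient personas to support the training and evaluation of medical dialogue systems.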

Details Motivation: Existing simulators do not fully reflect the diversity of patients seen in clinical practice, so a more realistic patient interaction system is needed. Method: Clinical profiles (symptoms and medical history) are combined with personas defined along four axes (personality, language proficiency, medical history recall level, and cognitive confusion level), yielding 37 unique patient personas. Result: Eight LLMs were evaluated; Llama 3.3, the top-performing open-source model, was validated by clinicians. Conclusion: PatientSim is an open-source, customizable platform suited to evaluating medical dialogue systems and to medical education. Abstract: Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations. We evaluated eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3, was validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare.

[343] T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation

Zi-Ao Ma,Tian Lan,Rong-Cheng Tu,Shu-Hang Liu,Heyan Huang,Zhijing Wu,Chen Xu,Xian-Ling Mao

Main category: cs.AI

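TL;DR: Proposes T2I-Eval-R1, a reinforcement learning framework that trains open-source multimodal large language models (MLLMs) as text-to-image (T2I) evaluators using only coarse-grained quality scores, with no need for high-quality interpretable annotations.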

Details Motivation: To overcome the reliance of existing evaluation methods on costly annotations or commercial models, and to improve the reasoning capabilities of open-source models. Method: A reinforcement learning framework built on Group Relative Policy Optimization (GRPO) generates both scores and interpretable reasoning chains. Result: On three T2I meta-evaluation benchmarks, T2I-Eval-R1 aligns more closely with human assessments and provides more accurate interpretable score rationales. Conclusion: T2I-Eval-R1 is an efficient and scalable approach to T2I evaluation that reduces dependence on costly annotations. Abstract: The rapid progress in diffusion-based text-to-image (T2I) generation has created an urgent need for interpretable automatic evaluation methods that can assess the quality of generated images, thereby reducing the human annotation burden. To reduce the prohibitive cost of relying on commercial models for large-scale evaluation, and to improve the reasoning capabilities of open-source models, recent research has explored supervised fine-tuning (SFT) of multimodal large language models (MLLMs) as dedicated T2I evaluators. However, SFT approaches typically rely on high-quality critique datasets, which are either generated by proprietary LLMs (with potential issues of bias and inconsistency) or annotated by humans at high cost, limiting their scalability and generalization. To address these limitations, we propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores, thereby avoiding the need for annotating high-quality interpretable evaluation rationales. Our approach integrates Group Relative Policy Optimization (GRPO) into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains with only easily accessible annotated judgment scores or preferences. Furthermore, we introduce a continuous reward formulation that encourages score diversity and provides stable optimization signals, leading to more robust and discriminative evaluation behavior. Experimental results on three established T2I meta-evaluation benchmarks demonstrate that T2I-Eval-R1 achieves significantly higher alignment with human assessments and offers more accurate interpretable score rationales compared to strong baseline methods.
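
For readers unfamiliar with GRPO, the core idea of a group-relative advantage can be sketched as follows. This is a minimal illustration of the commonly described normalization (each sampled response's reward is compared with the mean and standard deviation of its own group), not the authors' training code; the function name, the example rewards, and the `eps` constant are assumptions made for illustration, and the paper's continuous reward formulation is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the
    same prompt. `eps` (an illustrative choice) avoids division by
    zero when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four evaluations sampled for one image-text pair.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above their group's mean receive a positive advantage and are reinforced; no separate value network is needed, which is one reason the approach is attractive when only coarse scores are available.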

[344] Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks

Wentao Sun,Joao Paulo Nogueira,Alonso Silva

Main category: cs.AI

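TL;DR: The paper proposes a structured approach that has LLMs build a knowledge graph to strengthen their causal reasoning, substantially improving F1 scores.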

Details Motivation: Despite progress in many areas, LLMs remain poor at distinguishing causation from correlation, and existing models only marginally outperform random baselines. Method: A structured approach guides the model to build a knowledge graph that systematically encodes the correlational premises and then to answer the causal query. Result: On the Corr2Cause dataset, the F1 score of Qwen3-32B rises from 32.71 to 48.26, a relative increase of 47.5%, with notable gains in precision and recall. Conclusion: Structured thinking substantially improves the causal reasoning of LLMs and shows promise for broader causal inference tasks. Abstract: Despite remarkable advances in the field, LLMs remain unreliable in distinguishing causation from correlation. Recent results from the Corr2Cause dataset benchmark reveal that state-of-the-art LLMs -- such as GPT-4 (F1 score: 29.08) -- only marginally outperform random baselines (Random Uniform, F1 score: 20.38), indicating limited capacity for generalization. To tackle this limitation, we propose a novel structured approach: rather than directly answering causal queries, we provide the model with the capability to structure its thinking by guiding the model to build a structured knowledge graph, systematically encoding the provided correlational premises, to answer the causal queries. This intermediate representation significantly enhances the model's causal capabilities. Experiments on the test subset of the Corr2Cause dataset benchmark with the Qwen3-32B model (a reasoning model) show substantial gains over standard direct prompting methods, improving F1 scores from 32.71 to 48.26 (over 47.5% relative increase), along with notable improvements in precision and recall. These results underscore the effectiveness of providing the model with the capability to structure its thinking and highlight its promising potential for broader generalization across diverse causal inference tasks.
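
The "over 47.5% relative increase" figure can be checked directly from the two F1 scores quoted in the abstract. The short sketch below only reproduces that arithmetic; the function name is an illustrative choice.

```python
def relative_increase(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# F1 scores quoted in the abstract for Qwen3-32B on Corr2Cause.
gain = relative_increase(32.71, 48.26)
print(f"{gain:.2f}%")
```

The absolute gain is 15.55 F1 points; dividing by the baseline of 32.71 gives a relative gain just above 47.5%, consistent with the abstract.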

[345] ProgRM: Build Better GUI Agents with Progress Rewards

Danyang Zhang,Situo Zhang,Ziyue Yang,Zichen Zhu,Zihan Zhao,Ruisheng Cao,Lu Chen,Kai Yu

Main category: cs.AI

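TL;DR: The paper proposes ProgRM, a progress reward model that provides dense intermediate rewards for every step in online training, addressing the inability of existing outcome reward models (ORMs) to give fine-grained feedback.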

Details Motivation: LLM-based GUI agents are limited by the scarcity of high-quality training data, especially the difficulty of trajectory collection and reward annotation. Existing ORMs cannot provide fine-grained feedback and can over-penalize valuable steps in failed trajectories. Method: ProgRM predicts task completion progress to provide an intermediate reward for each step in online training; an LCS-based self-annotation algorithm labels progress rewards efficiently. Result: Experiments show that actors trained with ProgRM outperform leading proprietary LLMs and ORM-trained actors. Conclusion: By providing dense intermediate rewards, ProgRM markedly improves agent performance and addresses the limitations of existing models. Abstract: LLM-based (Large Language Model) GUI (Graphical User Interface) agents can potentially reshape our daily lives significantly. However, current LLM-based GUI agents suffer from the scarcity of high-quality training data owing to the difficulties of trajectory collection and reward annotation. Existing work has explored using LLMs to collect trajectories for imitation learning or to offer reward signals for online RL training. However, the Outcome Reward Model (ORM) used in existing work cannot provide fine-grained feedback and can over-penalize the valuable steps in trajectories that ultimately fail. To this end, we propose Progress Reward Model (ProgRM) to provide dense informative intermediate rewards by predicting a task completion progress for each step in online training. To handle the challenge of progress reward label annotation, we further design an efficient LCS-based (Longest Common Subsequence) self-annotation algorithm to discover the key steps in trajectories and assign progress labels accordingly. ProgRM is evaluated with extensive experiments and analyses. Actors trained with ProgRM outperform leading proprietary LLMs and ORM-trained actors, illustrating the effectiveness of ProgRM. The code for the experiments will be made publicly available upon acceptance.
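
The abstract does not spell out the self-annotation algorithm, so the following is only a rough sketch of how an LCS could be used to find key steps and assign progress labels. It assumes, purely for illustration, that a trajectory is a list of action strings, that a non-empty reference sequence of key actions is available, and that the progress label at a step is the fraction of reference actions matched so far; all names and the example actions are hypothetical.

```python
def lcs_matches(traj, ref):
    """Indices in `traj` that take part in one longest common
    subsequence of `traj` and `ref`."""
    n, m = len(traj), len(ref)
    # dp[i][j] = LCS length of traj[i:] and ref[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if traj[i] == ref[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    matched, i, j = [], 0, 0
    while i < n and j < m:
        if traj[i] == ref[j]:
            matched.append(i)
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


def progress_labels(traj, ref):
    """Progress after each step: matched key steps so far / len(ref).
    Assumes `ref` is non-empty."""
    key_steps = set(lcs_matches(traj, ref))
    labels, done = [], 0
    for t in range(len(traj)):
        if t in key_steps:
            done += 1
        labels.append(done / len(ref))
    return labels


# Hypothetical GUI actions, for illustration only.
ref = ["open_app", "search", "click_result", "submit"]
traj = ["open_app", "scroll", "search", "back", "click_result"]
print(progress_labels(traj, ref))
```

Under these assumptions, a trajectory that fails at the final step still earns credit for the key steps it completed, which is the behavior an outcome-only reward cannot express.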

[346] Gaming Tool Preferences in Agentic LLMs

Kazem Faghih,Wenxiao Wang,Yize Cheng,Siddhant Bharti,Gaurang Sriramanan,Sriram Balasubramanian,Parsa Hosseini,Soheil Feizi

Main category: cs.AI

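TL;DR: The study finds that editing tool descriptions can strongly influence which tools LLMs select; some edits raise a tool's usage more than 10-fold, exposing the fragility of current tool-calling protocols.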

Details Motivation: To expose how fragile large language models (LLMs) are when they rely on text descriptions to choose tools, and to examine how edits to those descriptions influence their decisions. Method: A series of edits to tool descriptions is tested to compare their effect on tool selection by LLMs such as GPT-4.1 and Qwen2.5-7B, with validation extended to 10 different models. Result: In some cases, edited descriptions raise a tool's usage more than 10-fold, with broadly similar trends across models. Conclusion: Current tool-calling protocols are fragile, and agentic LLMs need a more reliable foundation for tool selection. Abstract: Large language models (LLMs) can now access a wide range of external tools, thanks to the Model Context Protocol (MCP). This greatly expands their abilities as various agents. However, LLMs rely entirely on the text descriptions of tools to decide which ones to use -- a process that is surprisingly fragile. In this work, we expose a vulnerability in prevalent tool/function-calling protocols by investigating a series of edits to tool descriptions, some of which can drastically increase a tool's usage from LLMs when competing with alternatives. Through controlled experiments, we show that tools with properly edited descriptions receive over 10 times more usage from GPT-4.1 and Qwen2.5-7B than tools with original descriptions. We further evaluate how various edits to tool descriptions perform when competing directly with one another and how these trends generalize or differ across a broader set of 10 different models. These phenomena, while giving developers a powerful way to promote their tools, underscore the need for a more reliable foundation for agentic LLMs to select and utilize tools and resources.

cs.RO [Back]

[347] Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling

Xiaolong Tang,Meina Kan,Shiguang Shan,Xilin Chen

Main category: cs.RO

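TL;DR: Plan-R1 is a two-stage trajectory planning framework that combines expert data with reinforcement learning to significantly improve the safety and feasibility of autonomous driving.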

Details Motivation: Existing learning-based planning methods rely on expert demonstrations, which lack explicit safety awareness and may pass on unsafe behaviors. Method: A two-stage framework: the first stage trains an autoregressive trajectory predictor on expert data; the second stage designs rule-based rewards and fine-tunes the model with GRPO reinforcement learning. Result: On the nuPlan benchmark, Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance. Conclusion: By combining expert data with reinforcement learning, Plan-R1 achieves safe and feasible trajectory planning. Abstract: Safe and feasible trajectory planning is essential for real-world autonomous driving systems. However, existing learning-based planning methods often rely on expert demonstrations, which not only lack explicit safety awareness but also risk inheriting unsafe behaviors such as speeding from suboptimal human driving data. Inspired by the success of large language models, we propose Plan-R1, a novel two-stage trajectory planning framework that formulates trajectory planning as a sequential prediction task, guided by explicit planning principles such as safety, comfort, and traffic rule compliance. In the first stage, we train an autoregressive trajectory predictor via next motion token prediction on expert data. In the second stage, we design rule-based rewards (e.g., collision avoidance, speed limits) and fine-tune the model using Group Relative Policy Optimization (GRPO), a reinforcement learning strategy, to align its predictions with these planning principles. Experiments on the nuPlan benchmark demonstrate that our Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance.

[348] Is Single-View Mesh Reconstruction Ready for Robotics?

Frederik Nolte,Bernhard Schölkopf,Ingmar Posner

Main category: cs.RO

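TL;DR: Evaluates whether single-view mesh reconstruction models are suitable for creating digital twin environments in robot manipulation, and finds that existing methods, despite success on computer vision benchmarks, fail to meet robotics-specific requirements.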

Details Motivation: To examine how well single-view 3D reconstruction can efficiently create virtual environments for robot manipulation, filling a research gap in its use for physics simulation and robotics applications. Method: Benchmarking criteria for 3D reconstruction in robotics contexts are established, covering input handling, collision-free and stable reconstruction, occlusion management, and computational constraints, followed by an empirical evaluation on realistic robotics datasets. Result: Existing methods perform poorly on robotics-specific requirements, in contrast to their results on computer vision benchmarks. Conclusion: The study reveals critical gaps between computer vision advances and robotics needs and offers direction for future research. Abstract: This paper evaluates single-view mesh reconstruction models for creating digital twin environments in robot manipulation. Recent advances in computer vision for 3D reconstruction from single viewpoints present a potential breakthrough for efficiently creating virtual replicas of physical environments for robotics contexts. However, their suitability for physics simulations and robotics applications remains unexplored. We establish benchmarking criteria for 3D reconstruction in robotics contexts, including handling typical inputs, producing collision-free and stable reconstructions, managing occlusions, and meeting computational constraints. Our empirical evaluation using realistic robotics datasets shows that despite success on computer vision benchmarks, existing approaches fail to meet robotics-specific requirements. We quantitatively examine limitations of single-view reconstruction for practical robotics implementation, in contrast to prior work that focuses on multi-view approaches. Our findings highlight critical gaps between computer vision advances and robotics needs, guiding future research at this intersection.

cs.SE [Back]

[349] Towards Practical Defect-Focused Automated Code Review

Junyi Lu,Lili Jiang,Xiaojia Li,Jianbing Fang,Fengjun Zhang,Li Yang,Chun Zuo

Main category: cs.SE

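TL;DR: The paper proposes an automated code review approach that addresses the neglect of repository context and real defect detection in prior methods, using code slicing, a multi-role LLM framework, and related techniques to substantially improve performance.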

Details Motivation: Existing approaches to code review automation oversimplify the task and overlook repository context and real defect detection, which limits their practicality. Method: Code slicing algorithms extract context, a multi-role LLM framework improves detection of key bugs, a filtering mechanism lowers the false alarm rate, and a new prompt design improves human interaction. Result: Validated on real-world merge requests, the approach achieves a 2x improvement over standard LLMs and a 10x gain over previous baselines. Conclusion: The approach builds on language-agnostic principles (e.g., AST-based analysis) and has potential for broader applicability. Abstract: The complexity of code reviews has driven efforts to automate review comments, but prior approaches oversimplify this task by treating it as snippet-level code-to-text generation and relying on text similarity metrics like BLEU for evaluation. These methods overlook repository context, real-world merge request evaluation, and defect detection, limiting their practicality. To address these issues, we explore the full automation pipeline within the online recommendation service of a company with nearly 400 million daily active users, analyzing industry-grade C++ codebases comprising hundreds of thousands of lines of code. We identify four key challenges: 1) capturing relevant context, 2) improving key bug inclusion (KBI), 3) reducing false alarm rates (FAR), and 4) integrating human workflows. To tackle these, we propose 1) code slicing algorithms for context extraction, 2) a multi-role LLM framework for KBI, 3) a filtering mechanism for FAR reduction, and 4) a novel prompt design for better human interaction. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2x improvement over standard LLMs and a 10x gain over previous baselines. While the presented results focus on C++, the underlying framework design leverages language-agnostic principles (e.g., AST-based analysis), suggesting potential for broader applicability.

cs.CR [Back]

[350] Safety Alignment Can Be Not Superficial With Explicit Safety Signals

Jianwei Li,Jung-Eng Kim

Main category: cs.CR

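TL;DR: The paper proposes a new method that explicitly introduces a safety-related binary classification task and integrates it with attention and decoding strategies, significantly improving the robustness of large language models (LLMs) against adversarial attacks.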

Details Motivation: Existing safety alignment methods for LLMs usually assume that models can implicitly learn a safety-related reasoning task, but in practice safety signals are often diluted by other objectives, leaving models weak under adversarial attacks. Method: A safety-related binary classification task is introduced explicitly and its signals are integrated with attention and decoding strategies, removing ambiguity so that models respond more responsibly to malicious queries. Result: Experiments show that, with less than 0.2x overhead cost, the method significantly improves the robustness of LLMs against various attacks. Conclusion: The method offers a viable path toward more robust generative AI systems. Abstract: Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies generally fail to offer actionable solutions beyond data augmentation for achieving more robust safety mechanisms. This paper identifies a fundamental cause of this superficiality: existing alignment approaches often presume that models can implicitly learn a safety-related reasoning task during the alignment process, enabling them to refuse harmful requests. However, the learned safety signals are often diluted by other competing objectives, leading models to struggle with drawing a firm safety-conscious decision boundary when confronted with adversarial attacks. Based on this observation, by explicitly introducing a safety-related binary classification task and integrating its signals with our attention and decoding strategies, we eliminate this ambiguity and allow models to respond more responsibly to malicious queries. We emphasize that, with less than 0.2x overhead cost, our approach enables LLMs to assess the safety of both the query and the previously generated tokens at each necessary generating step. Extensive experiments demonstrate that our method significantly improves the resilience of LLMs against various adversarial attacks, offering a promising pathway toward more robust generative AI systems.

[351] GSDFuse: Capturing Cognitive Inconsistencies from Multi-Dimensional Weak Signals in Social Media Steganalysis

Kaibo Huang,Zipei Zhang,Yukun Wei,TianXin Zhang,Zhongliang Yang,Linna Zhou

Main category: cs.CR

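TL;DR: GSDFuse is a new method that combines multi-modal feature engineering, data augmentation, adaptive evidence fusion, and discriminative embedding learning to detect malicious steganography on social media, outperforming existing techniques.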

Details Motivation: The prevalence of malicious steganography on social media poses security risks, and existing methods struggle with cognitive inconsistencies, sparse signals, and data imbalance when detecting steganography in complex dialogues. Method: GSDFuse combines hierarchical multi-modal feature engineering, data augmentation, adaptive evidence fusion, and discriminative embedding learning to address these detection difficulties systematically. Result: Experiments show that GSDFuse achieves state-of-the-art performance in detecting steganography in complex dialogue environments. Conclusion: GSDFuse provides an effective solution for detecting malicious steganography on social media, and the code is open source. Abstract: The ubiquity of social media platforms facilitates malicious linguistic steganography, posing significant security risks. Steganalysis is profoundly hindered by the challenge of identifying subtle cognitive inconsistencies arising from textual fragmentation and complex dialogue structures, and the difficulty in achieving robust aggregation of multi-dimensional weak signals, especially given extreme steganographic sparsity and sophisticated steganography. These core detection difficulties are compounded by significant data imbalance. This paper introduces GSDFuse, a novel method designed to systematically overcome these obstacles. GSDFuse employs a holistic approach, synergistically integrating hierarchical multi-modal feature engineering to capture diverse signals, strategic data augmentation to address sparsity, adaptive evidence fusion to intelligently aggregate weak signals, and discriminative embedding learning to enhance sensitivity to subtle inconsistencies. Experiments on social media datasets demonstrate GSDFuse's state-of-the-art (SOTA) performance in identifying sophisticated steganography within complex dialogue environments. The source code for GSDFuse is available at https://github.com/NebulaEmmaZh/GSDFuse.

[352] Mitigating Cyber Risk in the Age of Open-Weight LLMs: Policy Gaps and Technical Realities

Alfonso de Gregorio

Main category: cs.CR

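TL;DR: Open-weight general-purpose AI (GPAI) models bring significant benefits but also increase cybersecurity risk. The paper analyzes the threats amplified by open-weight AI models, assesses gaps in current regulation, and proposes control strategies aimed at high-risk capabilities.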

Details Motivation: Open-weight GPAI models such as DeepSeek-R1 have demonstrated strong offensive capabilities, challenging traditional defence and regulatory approaches. The study aims to analyze these threats and close regulatory gaps. Method: The paper analyzes specific threats from open-weight AI models (such as accelerated malware development and enhanced social engineering) and assesses the shortcomings of existing regulation, including the EU AI Act and the GPAI Code of Practice. Result: Open distribution entails a loss of control that renders standard security mitigations ineffective; control strategies aimed at high-risk capabilities are proposed. Conclusion: The paper recommends balancing security and technological progress by evaluating high-risk capabilities, promoting defensive AI innovation, and fostering international collaboration. Abstract: Open-weight general-purpose AI (GPAI) models offer significant benefits but also introduce substantial cybersecurity risks, as demonstrated by the offensive capabilities of models like DeepSeek-R1 in evaluations such as MITRE's OCCULT. These publicly available models empower a wider range of actors to automate and scale cyberattacks, challenging traditional defence paradigms and regulatory approaches. This paper analyzes the specific threats -- including accelerated malware development and enhanced social engineering -- magnified by open-weight AI release. We critically assess current regulations, notably the EU AI Act and the GPAI Code of Practice, identifying significant gaps stemming from the loss of control inherent in open distribution, which renders many standard security mitigations ineffective. We propose a path forward focusing on evaluating and controlling specific high-risk capabilities rather than entire models, advocating for pragmatic policy interpretations for open-weight systems, promoting defensive AI innovation, and fostering international collaboration on standards and cyber threat intelligence (CTI) sharing to ensure security without unduly stifling open technological progress.

[353] Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models

Wenhan Chang,Tianqing Zhu,Yu Zhao,Shuangyong Song,Ping Xiong,Wanlei Zhou,Yongxiang Li

Main category: cs.CR

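TL;DR: The paper proposes a new jailbreaking method inspired by the Chain-of-Thought mechanism that attacks victim models through mission transfer and narrative lures, exposing latent vulnerabilities in LLMs and offering guidance for improving safety mechanisms.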

Details Motivation: In an era of rapid generative AI development, interactions between humans and large language models carry misuse risks. Prior research has overlooked the possibility that LLMs can be attacker models as well as victim models. Method: A jailbreaking method inspired by the Chain-of-Thought mechanism uses mission transfer to conceal harmful intent and narrative lures to stimulate the victim model's reasoning; a helper model optimizes the lures to raise the attack success rate. Result: Experiments show that models with weaker safety mechanisms exhibit stronger attack capabilities and that toxicity scores assess attack effectiveness more precisely, exposing latent vulnerabilities in LLMs. Conclusion: The study demonstrates the potential risks of LLMs, provides data-driven feedback for improving safety mechanisms, and discusses two defensive strategies. Abstract: In the era of rapid generative AI development, interactions between humans and large language models face significant misuse risks. Previous research has primarily focused on black-box scenarios using human-guided prompts and white-box scenarios leveraging gradient-based LLM generation methods, neglecting the possibility that LLMs can act not only as victim models, but also as attacker models to harm other models. We propose a novel jailbreaking method inspired by the Chain-of-Thought mechanism, where the attacker model uses mission transfer to conceal harmful user intent in dialogue and generates chained narrative lures to stimulate the reasoning capabilities of victim models, leading to successful jailbreaking. To enhance the attack success rate, we introduce a helper model that performs random narrative optimization on the narrative lures during multi-turn dialogues while ensuring alignment with the original intent, enabling the optimized lures to bypass the safety barriers of victim models effectively. Our experiments reveal that models with weaker safety mechanisms exhibit stronger attack capabilities, demonstrating that models can not only be exploited, but also help harm others. By incorporating toxicity scores, we employ third-party models to evaluate the harmfulness of victim models' responses to jailbreaking attempts. The study shows that using refusal keywords as an evaluation metric for attack success rates is significantly flawed because it does not assess whether the responses address the harmful questions, while toxicity scores measure the harm of generated content, and its alignment with the harmful questions, with more precision.
Our approach demonstrates outstanding performance, uncovering latent vulnerabilities in LLMs and providing data-driven feedback to optimize LLM safety mechanisms. We also discuss two defensive strategies to offer guidance on improving defense mechanisms.

[354] One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs

Linbao Li,Yannan Liu,Daojing He,Yu Li

Main category: cs.CR

TL;DR: ArrAttack is a new jailbreak attack method that targets defended LLMs by automatically generating robust jailbreak prompts that bypass defense measures.

Details Motivation: Existing jailbreak strategies cannot keep pace with rapidly developing defense mechanisms such as defensive suffixes, which renders the attacks ineffective. Method: ArrAttack uses a universal robustness judgment model to generate robust jailbreak prompts that work against a variety of defenses. Result: ArrAttack outperforms existing attack strategies across multiple models, including GPT-4 and Claude-3, and shows strong transferability. Conclusion: The work bridges the gap between jailbreak attacks and defenses and offers a new perspective on generating robust jailbreak prompts. Abstract: Safety alignment in large language models (LLMs) is increasingly compromised by jailbreak attacks, which can manipulate these models to generate harmful or unintended content. Investigating these attacks is crucial for uncovering model vulnerabilities. However, many existing jailbreak strategies fail to keep pace with the rapid development of defense mechanisms, such as defensive suffixes, rendering them ineffective against defended models. To tackle this issue, we introduce a novel attack method called ArrAttack, specifically designed to target defended LLMs. ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. This capability is supported by a universal robustness judgment model that, once trained, can perform robustness evaluation for any target model with a wide variety of defenses. By leveraging this model, we can rapidly develop a robust jailbreak prompt generator that efficiently converts malicious input prompts into effective attacks. Extensive evaluations reveal that ArrAttack significantly outperforms existing attack strategies, demonstrating strong transferability across both white-box and black-box models, including GPT-4 and Claude-3. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts. We make the codebase available at https://github.com/LLBao/ArrAttack.

cs.SD [Back]

[355] LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context

Natsuo Yamashita,Masaaki Yamamoto,Hiroaki Kokubo,Yohei Kawaguchi

Main category: cs.SD

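TL;DR: Proposes a generative error correction method that incorporates phonetic information and targets rare words to improve ASR performance.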

Details Motivation: To address the limited handling of rare words and phonetic information in existing LLM-based generative error correction. Method: Synthetic data containing rare words is generated, and ASR N-best hypotheses are integrated with phonetic context. Result: On English and Japanese datasets, rare-word correction improves and both WER and CER decrease. Conclusion: Combining phonetic information with synthetic data effectively improves generative error correction. Abstract: Generative error correction (GER) with large language models (LLMs) has emerged as an effective post-processing approach to improve automatic speech recognition (ASR) performance. However, it often struggles with rare or domain-specific words due to limited training data. Furthermore, existing LLM-based GER approaches primarily rely on textual information, neglecting phonetic cues, which leads to over-correction. To address these issues, we propose a novel LLM-based GER approach that targets rare words and incorporates phonetic information. First, we generate synthetic data to contain rare words for fine-tuning the GER model. Second, we integrate ASR's N-best hypotheses along with phonetic context to mitigate over-correction. Experimental results show that our method not only improves the correction of rare words but also reduces the WER and CER across both English and Japanese datasets.
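
The two metrics reported above, word error rate (WER) and character error rate (CER), are both edit distances normalized by the reference length. The sketch below shows the standard definitions in plain Python; it is not the authors' evaluation script, the example sentences are invented to show how a rare word can be split by a recognizer, and a non-empty reference is assumed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Invented example: the rare word "metformin" is misrecognized as two words.
reference = "the patient took metformin daily"
hypothesis = "the patient took met forming daily"
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

Because a single misrecognized rare word can cost both a substitution and an insertion, rare-word errors weigh heavily on WER, which is part of why targeting them matters.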

cs.LG [Back]

[356] OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

Mingxin Huang,Yongxin Shi,Dezhi Peng,Songxuan Lai,Zecheng Xie,Lianwen Jin

Main category: cs.LG

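TL;DR: OCR-Reasoning is a systematic benchmark for evaluating multimodal large language models on text-rich image reasoning tasks, with 1,069 annotated examples covering 6 core reasoning abilities and 18 tasks.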

Details Motivation: To fill the gap left by the lack of a systematic benchmark for text-rich image reasoning tasks. Method: The OCR-Reasoning benchmark annotates both the reasoning process and the final answer to evaluate model capabilities comprehensively. Result: Existing MLLMs perform poorly, with none exceeding 50% accuracy, showing that text-rich image reasoning remains challenging. Conclusion: OCR-Reasoning is an important tool for evaluating and improving MLLMs and reveals the limitations of current methods. Abstract: Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50% across OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.

[357] Graph Attention Neural Network for Botnet Detection: Evaluating Autoencoder, VAE and PCA-Based Dimension Reduction

Hassan Wasswa,Hussein Abbass,Timothy Lynar

Main category: cs.LG

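TL;DR: The paper proposes a framework that applies dimension reduction (VAE-encoder, AE-encoder, PCA) to high-dimensional IoT attack data before converting it into a graph dataset, in order to improve GAT models for botnet attack detection.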

Details Motivation: Existing methods usually treat attack instances independently and ignore relationships between instances, and converting high-dimensional data into a graph structure incurs heavy computational overhead. Method: Three dimension reduction techniques (VAE-encoder, AE-encoder, PCA) preprocess the data, and a GAT model with attention mechanisms then captures inter-instance relationships. Result: Dimension reduction effectively lowers computational overhead, and the GAT model achieves better detection performance on the reduced data. Conclusion: By combining dimension reduction with graph neural networks, the proposed framework markedly improves IoT botnet attack detection. Abstract: With the rise of IoT-based botnet attacks, researchers have explored various learning models for detection, including traditional machine learning, deep learning, and hybrid approaches. A key advancement involves deploying attention mechanisms to capture long-term dependencies among features, significantly improving detection accuracy. However, most models treat attack instances independently, overlooking inter-instance relationships. Graph Neural Networks (GNNs) address this limitation by learning an embedding space via iterative message passing where similar instances are placed closer based on node features and relationships, enhancing classification performance. To further improve detection, attention mechanisms have been embedded within GNNs, leveraging both long-range dependencies and inter-instance connections. However, transforming the high-dimensional IoT attack datasets into a graph-structured dataset poses challenges, such as large graph structures leading to computational overhead. To mitigate this, this paper proposes a framework that first reduces the dimensionality of the NetFlow-based IoT attack dataset before transforming it into a graph dataset. We evaluate three dimension reduction techniques -- Variational Autoencoder (VAE-encoder), classical autoencoder (AE-encoder), and Principal Component Analysis (PCA) -- and compare their effects on a Graph Attention neural network (GAT) model for botnet attack detection.

[358] Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling

Tianyu Xie,Shuchen Xue,Zijin Feng,Tianyang Hu,Jiacheng Sun,Zhenguo Li,Cheng Zhang

Main category: cs.LG

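TL;DR: VADD is a new discrete diffusion framework that uses latent variable modeling to capture correlations among dimensions, markedly improving sample quality and outperforming conventional MDMs, especially with few denoising steps.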

Details Motivation: Conventional MDMs degrade when few denoising steps are used, mainly because they model inter-dimensional dependencies poorly. Method: The VADD framework introduces an auxiliary recognition model, trains stably by maximizing a variational lower bound, and performs amortized inference over the training set. Result: VADD outperforms MDM baselines on 2D toy data, pixel-level image generation, and text generation. Conclusion: VADD retains the efficiency of MDMs while significantly improving sample quality, especially when the number of denoising steps is small. Abstract: Discrete diffusion models have recently shown great promise for modeling complex discrete data, with masked diffusion models (MDMs) offering a compelling trade-off between quality and generation speed. MDMs denoise by progressively unmasking multiple dimensions from an all-masked input, but their performance can degrade when using few denoising steps due to limited modeling of inter-dimensional dependencies. In this paper, we propose Variational Autoencoding Discrete Diffusion (VADD), a novel framework that enhances discrete diffusion with latent variable modeling to implicitly capture correlations among dimensions. By introducing an auxiliary recognition model, VADD enables stable training via variational lower bound maximization and amortized inference over the training set. Our approach retains the efficiency of traditional MDMs while significantly improving sample quality, especially when the number of denoising steps is small. Empirical results on 2D toy data, pixel-level image generation, and text generation demonstrate that VADD consistently outperforms MDM baselines.
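
As background for "variational lower bound maximization", the generic evidence lower bound for a latent variable model is shown below. This is the textbook form, not the specific objective derived in the VADD paper; the symbols (decoder $p_\theta$, recognition model $q_\phi$, prior $p(z)$) are conventional notation assumed here for illustration.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

The recognition model $q_\phi(z \mid x)$ plays the role of the auxiliary model mentioned in the abstract: one network amortizes inference across the training set instead of optimizing a separate posterior per example.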

[359] Baitradar: A Multi-Model Clickbait Detection Algorithm Using Deep Learning

Bhanuka Gamage,Adnan Labib,Aisha Joomun,Chern Hong Lim,KokSheik Wong

Main category: cs.LG

TL;DR: Proposes BaitRadar, a deep learning algorithm that combines six inference models to detect clickbait on YouTube, reaching 98% accuracy.

Details Motivation: Clickbait is a growing problem on YouTube: users are misled into clicking videos whose content does not match the title, so an effective detection method is needed. Method: BaitRadar combines six models that analyze the video title, comments, thumbnail, tags, statistics, and audio transcript, and averages their outputs to obtain the final classification. Result: Tested on 1,400 videos, it reaches an average accuracy of 98% with an inference time under 2 seconds. Conclusion: BaitRadar detects clickbait efficiently and accurately, and remains robust when data is missing. Abstract: Following the rising popularity of YouTube, there is an emerging problem on this platform called clickbait, which provokes users to click on videos using attractive titles and thumbnails. As a result, users end up watching a video that does not have the content publicized in the title. This issue is addressed in this study by proposing an algorithm called BaitRadar, which uses a deep learning technique where six inference models are jointly consulted to make the final classification decision. These models focus on different attributes of the video, including title, comments, thumbnail, tags, video statistics and audio transcript. The final classification is attained by computing the average of multiple models to provide a robust and accurate output even in situations where data is missing. The proposed method is tested on 1,400 YouTube videos. On average, a test accuracy of 98% is achieved with an inference time of less than 2s.
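
The averaging step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how missing inputs are handled, so skipping unavailable models is an assumption here, as are the function name `average_ensemble`, the 0.5 threshold, and the example scores.

```python
from typing import Optional, Sequence, Tuple


def average_ensemble(scores: Sequence[Optional[float]],
                     threshold: float = 0.5) -> Tuple[float, bool]:
    """Average the available per-model clickbait probabilities.

    `None` marks a model whose input (for example the audio transcript)
    is missing; such models are left out of the mean (an assumption).
    """
    available = [s for s in scores if s is not None]
    if not available:
        raise ValueError("at least one model score is required")
    mean_score = sum(available) / len(available)
    return mean_score, mean_score >= threshold


# Hypothetical scores for: title, comments, thumbnail, tags, statistics, transcript.
scores = [0.9, 0.8, None, 0.7, 0.6, None]
mean_score, is_clickbait = average_ensemble(scores)
```

Averaging only over the models that produced a score keeps the decision defined when a video has, say, no comments or no transcript.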

[360] Wildfire spread forecasting with Deep Learning

Nikolaos Anastasiou,Spyros Kondylatos,Ioannis Papoutsis

Main category: cs.LG

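TL;DR: This paper proposes a deep learning framework for forecasting the final extent of wildfire spread from data available at ignition time, and shows that multi-day observational data substantially improve predictive accuracy.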

Details Motivation: Accurate wildfire spread prediction is essential for risk management, emergency response, and resource allocation. Method: Use a spatio-temporal dataset covering the Mediterranean region from 2006 to 2022 that combines remote sensing, meteorological, vegetation, topographic, and other data sources; train deep learning models for prediction; and study the influence of temporal context. Result: The model that includes several days of data before and after ignition performs best, improving the F1 score and IoU by almost 5% over the baseline. Conclusion: Multi-day observational data substantially improve predictive accuracy, and the dataset and models are released publicly to support wildfire modeling research. Abstract: Accurate prediction of wildfire spread is crucial for effective risk management, emergency response, and strategic resource allocation. In this study, we present a deep learning (DL)-based framework for forecasting the final extent of burned areas, using data available at the time of ignition. We leverage a spatio-temporal dataset that covers the Mediterranean region from 2006 to 2022, incorporating remote sensing data, meteorological observations, vegetation maps, land cover classifications, anthropogenic factors, topography data, and thermal anomalies. To evaluate the influence of temporal context, we conduct an ablation study examining how the inclusion of pre- and post-ignition data affects model performance, benchmarking the temporal-aware DL models against a baseline trained exclusively on ignition-day inputs. Our results indicate that multi-day observational data substantially improve predictive accuracy. In particular, the best-performing model, incorporating a temporal window of four days before to five days after ignition, improves both the F1 score and the Intersection over Union by almost 5% in comparison to the baseline on the test dataset. We publicly release our dataset and models to enhance research into data-driven approaches for wildfire modeling and response.
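
For readers unfamiliar with the two reported metrics, the following is a small sketch of how F1 and Intersection over Union are commonly computed on binary burned-area masks. The function name and the toy masks are illustrative assumptions and are not taken from the paper.

```python
def f1_and_iou(pred, target):
    """Compute F1 and IoU for two binary masks given as nested lists."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1          # burned cell predicted as burned
            elif p and not t:
                fp += 1          # unburned cell predicted as burned
            elif t and not p:
                fn += 1          # burned cell that was missed
    union = tp + fp + fn
    # Convention (assumption): two empty masks count as a perfect match.
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    iou = tp / union if union else 1.0
    return f1, iou


# Toy 2x3 masks: 1 = burned, 0 = not burned.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
f1, iou = f1_and_iou(pred, target)
```

On a single mask the two metrics are tied by F1 = 2·IoU / (1 + IoU), so they rank predictions identically there, though scores averaged over many masks can differ.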

[361] MinkUNeXt-SI: Improving point cloud-based place recognition including spherical coordinates and LiDAR intensity

Judith Vilella-Cantos,Juan José Cabrera,Luis Payá,Mónica Ballesta,David Valiente

Main category: cs.LG

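TL;DR: MinkUNeXt-SI is a robust LiDAR point cloud place recognition method that combines Minkowski convolutions with a U-Net architecture, copes with scene changes, and generalizes to other datasets.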

Details Motivation: Solve the place recognition problem in autonomous navigation systems, which requires robustness to scene changes (such as season and weather) and the ability to generalize. Method: Preprocess the LiDAR point cloud and combine Minkowski convolutions with a U-Net architecture to produce a robust descriptor. Result: The method surpasses the state of the art, generalizes well, and achieves excellent results on a custom dataset. Conclusion: MinkUNeXt-SI is efficient and reproducible, and the code and dataset are publicly available. Abstract: In autonomous navigation systems, solving the place recognition problem is crucial for safe operation. However, this is not a trivial task, since the solution must be accurate regardless of any changes in the scene, such as seasonal changes and different weather conditions, and it must be generalizable to other environments. This paper presents our method, MinkUNeXt-SI, which, starting from a LiDAR point cloud, preprocesses the input data to obtain its spherical coordinates and intensity values normalized within a range of 0 to 1 for each point, and it produces a robust place recognition descriptor. To that end, a deep learning approach that combines Minkowski convolutions and a U-net architecture with skip connections is used. The results of MinkUNeXt-SI demonstrate that this method reaches and surpasses state-of-the-art performance while it also generalizes satisfactorily to other datasets. Additionally, we showcase the capture of a custom dataset and its use in evaluating our solution, which also achieves outstanding results. Both the code of our solution and the runs of our dataset are publicly available for reproducibility purposes.
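
The preprocessing step mentioned in the abstract (spherical coordinates plus intensity normalized to the range 0 to 1) can be sketched as follows. The angle convention (azimuth and elevation), the per-scan min-max normalization, and the function name are assumptions, since the abstract does not specify them.

```python
import math


def to_spherical_with_intensity(points):
    """Convert (x, y, z, intensity) points to (r, azimuth, elevation, intensity01).

    Intensity is min-max normalized over the scan (an assumption; the paper
    only states that values are normalized to the range 0 to 1).
    """
    intensities = [p[3] for p in points]
    lowest, highest = min(intensities), max(intensities)
    span = highest - lowest
    converted = []
    for x, y, z, intensity in points:
        r = math.sqrt(x * x + y * y + z * z)
        azimuth = math.atan2(y, x)                      # angle in the x-y plane
        elevation = math.asin(z / r) if r > 0 else 0.0  # angle above the x-y plane
        normalized = (intensity - lowest) / span if span > 0 else 0.0
        converted.append((r, azimuth, elevation, normalized))
    return converted


# Three toy points lying on the x, y, and z axes.
scan = [(1.0, 0.0, 0.0, 10.0), (0.0, 2.0, 0.0, 30.0), (0.0, 0.0, 3.0, 20.0)]
spherical = to_spherical_with_intensity(scan)
```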

[362] SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data

Dong-Hee Kim,Hyunjee Song,Donghyun Kim

Main category: cs.LG

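TL;DR: WildRES is a new RES benchmark for evaluating complex reasoning, and SynRES is an automated data generation pipeline that significantly improves model performance.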

Details Motivation: The evaluation protocols of existing RES benchmarks are limited and cannot adequately assess complex reasoning, so a more comprehensive benchmark and better training data are needed. Method: Introduce the WildRES benchmark and the SynRES data generation pipeline, which produces synthetic data through dense caption-driven synthesis, semantic alignment, and domain-aware augmentation. Result: Current RES models degrade substantially on WildRES, while models trained with SynRES improve gIoU by 2.0% on WildRES-ID and 3.8% on WildRES-DS. Conclusion: WildRES and SynRES provide a more comprehensive evaluation and training solution for RES and significantly improve model performance. Abstract: Despite the advances in Referring Expression Segmentation (RES) benchmarks, their evaluation protocols remain constrained, primarily focusing on either single targets with short queries (containing minimal attributes) or multiple targets from distinctly different queries on a single domain. This limitation significantly hinders the assessment of more complex reasoning capabilities in RES models. We introduce WildRES, a novel benchmark that incorporates long queries with diverse attributes and non-distinctive queries for multiple targets. This benchmark spans diverse application domains, including autonomous driving environments and robotic manipulation scenarios, thus enabling more rigorous evaluation of complex reasoning capabilities in real-world settings. Our analysis reveals that current RES models demonstrate substantial performance deterioration when evaluated on WildRES. To address this challenge, we introduce SynRES, an automated pipeline generating densely paired compositional synthetic training data through three innovations: (1) a dense caption-driven synthesis for attribute-rich image-mask-expression triplets, (2) reliable semantic alignment mechanisms rectifying caption-pseudo mask inconsistencies via Image-Text Aligned Grouping, and (3) domain-aware augmentations incorporating mosaic composition and superclass replacement to emphasize generalization ability and distinguishing attributes over object categories. Experimental results demonstrate that models trained with SynRES achieve state-of-the-art performance, improving gIoU by 2.0% on WildRES-ID and 3.8% on WildRES-DS. Code and datasets are available at https://github.com/UTLLab/SynRES.

[363] Soft-CAM: Making black box models self-explainable for high-stakes decisions

Kerol Djoumessi,Philipp Berens

Main category: cs.LG

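TL;DR: SoftCAM makes standard CNN architectures inherently interpretable by removing the global average pooling layer and replacing the classification layer with a convolution-based class evidence layer that produces explicit class activation maps.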

Details Motivation: Existing post-hoc explanation methods are unreliable and hard to trust in critical applications, so an inherently interpretable CNN approach is needed. Method: Remove the global average pooling layer and replace the fully connected classification layer with a convolutional layer, preserving spatial information and producing class activation maps. Result: On three medical datasets, SoftCAM maintains classification performance while significantly improving explanations both qualitatively and quantitatively. Conclusion: CNNs can be inherently interpretable without sacrificing performance, which advances self-explainable deep learning for high-stakes decisions. Abstract: Convolutional neural networks (CNNs) are widely used for high-stakes applications like medicine, often surpassing human performance. However, most explanation methods rely on post-hoc attribution, approximating the decision-making process of already trained black-box models. These methods are often sensitive, unreliable, and fail to reflect true model reasoning, limiting their trustworthiness in critical applications. In this work, we introduce SoftCAM, a straightforward yet effective approach that makes standard CNN architectures inherently interpretable. By removing the global average pooling layer and replacing the fully connected classification layer with a convolution-based class evidence layer, SoftCAM preserves spatial information and produces explicit class activation maps that form the basis of the model's predictions. Evaluated on three medical datasets, SoftCAM maintains classification performance while significantly improving explanations, both qualitatively and quantitatively, compared to existing post-hoc methods. Our results demonstrate that CNNs can be inherently interpretable without compromising performance, advancing the development of self-explainable deep learning for high-stakes decision-making.
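
A rough sketch of the idea of a convolution-based class evidence layer is given below. It is not the authors' implementation: the 1x1 kernel, the spatial average used to turn each evidence map into a logit, and all names are assumptions made only to show how per-class maps can be produced before any spatial pooling.

```python
def class_evidence_maps(features, weights, biases):
    """Apply a 1x1 convolution: features [C][H][W] -> evidence maps [K][H][W].

    weights has shape [K][C] and biases has shape [K] (hypothetical layout).
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    maps = []
    for k, class_weights in enumerate(weights):
        maps.append([
            [biases[k] + sum(class_weights[c] * features[c][i][j]
                             for c in range(channels))
             for j in range(width)]
            for i in range(height)
        ])
    return maps


def pool_logits(maps):
    """Average each class evidence map over space to get one logit per class."""
    return [sum(sum(row) for row in m) / (len(m) * len(m[0])) for m in maps]


# Toy input: 2 feature channels on a 2x2 grid, 2 classes.
features = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 2.0]]]
weights = [[1.0, 0.0],
           [0.0, 1.0]]
biases = [0.0, 0.0]
maps = class_evidence_maps(features, weights, biases)
logits = pool_logits(maps)
predicted_class = logits.index(max(logits))
```

Because the logits are computed from the maps, each map shows which spatial locations contributed to its class score, which is the property the abstract describes as maps that "form the basis of the model's predictions".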

[364] A Coreset Selection of Coreset Selection Literature: Introduction and Recent Advances

Brian B. Moser,Arundhati S. Shanbhag,Stanislav Frolov,Federico Raue,Joachim Folz,Andreas Dengel

Main category: cs.LG

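TL;DR: This survey reviews coreset selection research, unifies training-free, training-oriented, and label-free methods in a single taxonomy, fills gaps in the existing literature, and discusses future research directions.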

Details Motivation: Address the limitations of existing surveys on coreset selection and provide a more comprehensive taxonomy and new insights. Method: Propose a unified taxonomy covering training-free, training-oriented, and label-free methods, and analyze submodular formulations, bilevel optimization, and pseudo-labeling techniques. Result: The survey shows how pruning strategies affect generalization and neural scaling laws, and compares methods under different computational, robustness, and performance demands. Conclusion: It identifies open challenges for future research, such as robustness, outlier filtering, and applying coreset selection to foundation models. Abstract: Coreset selection targets the challenge of finding a small, representative subset of a large dataset that preserves essential patterns for effective machine learning. Although several surveys have examined data reduction strategies before, most focus narrowly on either classical geometry-based methods or active learning techniques. In contrast, this survey presents a more comprehensive view by unifying three major lines of coreset research, namely, training-free, training-oriented, and label-free approaches, into a single taxonomy. We present subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Additionally, we examine how pruning strategies influence generalization and neural scaling laws, offering new insights that are absent from prior reviews. Finally, we compare these methods under varying computational, robustness, and performance demands and highlight open challenges, such as robustness, outlier filtering, and adapting coreset selection to foundation models, for future research.

[365] FastCAV: Efficient Computation of Concept Activation Vectors for Explaining Deep Neural Networks

Laines Schmalwasser,Niklas Penzel,Joachim Denzler,Julia Niebling

Main category: cs.LG

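TL;DR: FastCAV is a new method that accelerates the extraction of Concept Activation Vectors (CAVs), running on average 46.4x faster than existing methods (up to 63.6x) while maintaining performance.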

Details Motivation: Existing CAV computation is expensive for high-dimensional architectures, which limits large-scale use. Method: Propose FastCAV, which accelerates CAV extraction and is shown to be equivalent to established SVM-based methods under stated theoretical assumptions. Result: FastCAV is more efficient and more stable than existing methods while achieving similar performance. Conclusion: FastCAV can replace existing methods and enables studying how concepts evolve in deep models. Abstract: Concepts such as objects, patterns, and shapes are how humans understand the world. Building on this intuition, concept-based explainability methods aim to study representations learned by deep neural networks in relation to human-understandable concepts. Here, Concept Activation Vectors (CAVs) are an important tool and can identify whether a model learned a concept or not. However, the computational cost and time requirements of existing CAV computation pose a significant challenge, particularly in large-scale, high-dimensional architectures. To address this limitation, we introduce FastCAV, a novel approach that accelerates the extraction of CAVs by up to 63.6x (on average 46.4x). We provide a theoretical foundation for our approach and give concrete assumptions under which it is equivalent to established SVM-based methods. Our empirical results demonstrate that CAVs calculated with FastCAV maintain similar performance while being more efficient and stable. In downstream applications, i.e., concept-based explanation methods, we show that FastCAV can act as a replacement leading to equivalent insights. Hence, our approach enables previously infeasible investigations of deep models, which we demonstrate by tracking the evolution of concepts during model training.

[366] Knot So Simple: A Minimalistic Environment for Spatial Reasoning

Zizhao Chen,Yoav Artzi

Main category: cs.LG

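TL;DR: KnotGym is an interactive environment for complex spatial reasoning and manipulation, built around rope-knot manipulation tasks from image observations with quantifiable task complexity.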

Details Motivation: Develop a standardized environment that tests the integration of perception, spatial reasoning, and manipulation in order to advance research in these areas. Method: Define a task complexity axis based on the number of knot crossings and support the evaluation of several classes of methods (such as model-based RL, model-predictive control, and chain-of-thought reasoning). Result: The evaluation illustrates the core challenges KnotGym poses for different methods, and the environment offers a scalable development platform. Conclusion: KnotGym provides a standardized and challenging test environment for complex spatial reasoning and manipulation tasks. Abstract: We propose KnotGym, an interactive environment for complex, spatial reasoning and manipulation. KnotGym includes goal-oriented rope manipulation tasks with varying levels of complexity, all requiring acting from pure image observations. Tasks are defined along a clear and quantifiable axis of complexity based on the number of knot crossings, creating a natural generalization test. KnotGym has a simple observation space, allowing for scalable development, yet it highlights core challenges in integrating acute perception, spatial reasoning, and grounded manipulation. We evaluate methods of different classes, including model-based RL, model-predictive control, and chain-of-thought reasoning, and illustrate the challenges KnotGym presents. KnotGym is available at https://github.com/lil-lab/knotgym.

[367] Mahalanobis++: Improving OOD Detection via Feature Normalization

Maximilian Mueller,Matthias Hein

Main category: cs.LG

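TL;DR: Simple $\ell_2$-normalization of pre-logit features mitigates the effect of strongly varying feature norms on Mahalanobis distance-based OOD detection and improves it significantly and consistently across 44 models.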

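Details Motivation: Post-hoc OOD detection methods based on the Mahalanobis distance are among the most effective at ImageNet scale, but their performance varies significantly across models. Method: Trace this inconsistency to strong variations in feature norms, which violate the Gaussian assumption behind the Mahalanobis distance, and apply $\ell_2$-normalization to the features. Result: Experiments on 44 models across diverse architectures and pretraining schemes show significant and consistent improvements over conventional Mahalanobis distance-based approaches and over other recently proposed OOD detection methods. Conclusion: $\ell_2$-normalization aligns the features better with the premise of normally distributed data with a shared covariance matrix. Abstract: Detecting out-of-distribution (OOD) examples is an important task for deploying reliable machine learning models in safety-critical applications. While post-hoc methods based on the Mahalanobis distance applied to pre-logit features are among the most effective for ImageNet-scale OOD detection, their performance varies significantly across models. We connect this inconsistency to strong variations in feature norms, indicating severe violations of the Gaussian assumption underlying the Mahalanobis distance estimation. We show that simple $\ell_2$-normalization of the features mitigates this problem effectively, aligning better with the premise of normally distributed data with shared covariance matrix. Extensive experiments on 44 models across diverse architectures and pretraining schemes show that $\ell_2$-normalization improves the conventional Mahalanobis distance-based approaches significantly and consistently, and outperforms other recently proposed OOD detection methods.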
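
The combination described in the abstract, $\ell_2$-normalizing features and then scoring with a Mahalanobis distance under a shared covariance matrix, can be sketched in plain Python. The toy 2-D features, the small ridge term `eps`, and all function names are assumptions for illustration and do not come from the paper.

```python
import math


def l2_normalize(x):
    """Scale a feature vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)


def class_means(features, labels):
    """Mean feature vector per class label."""
    means = {}
    for c in set(labels):
        members = [x for x, y in zip(features, labels) if y == c]
        means[c] = [sum(col) / len(members) for col in zip(*members)]
    return means


def shared_covariance(features, labels, means, eps=1e-6):
    """Covariance of class-centered features, shared by all classes."""
    d = len(features[0])
    cov = [[0.0] * d for _ in range(d)]
    for x, y in zip(features, labels):
        diff = [a - b for a, b in zip(x, means[y])]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / len(features)
    for i in range(d):
        cov[i][i] += eps  # small ridge term for invertibility (assumption)
    return cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ood_score(x, means, cov):
    """Smallest squared Mahalanobis distance to a class mean (higher = more OOD)."""
    best = math.inf
    for mu in means.values():
        diff = [a - b for a, b in zip(x, mu)]
        best = min(best, sum(d * s for d, s in zip(diff, solve(cov, diff))))
    return best


# Toy training features whose norms vary strongly within each class.
raw_train = [[2.0, 0.1], [4.0, -0.2], [0.1, 3.0], [-0.1, 1.0]]
labels = [0, 0, 1, 1]
train = [l2_normalize(x) for x in raw_train]
means = class_means(train, labels)
cov = shared_covariance(train, labels, means)
id_score = ood_score(l2_normalize([10.0, 0.0]), means, cov)   # same direction as class 0
far_score = ood_score(l2_normalize([5.0, 5.0]), means, cov)   # direction between the classes
```

After normalization the test point `[10.0, 0.0]` lands next to the class 0 mean despite its large norm, so only its direction affects the score.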

[368] Towards more transferable adversarial attack in black-box manner

Chun Tong Lei,Zhongliang Guo,Hon Chung Lee,Minh Quoc Duong,Chun Pong Lau

Main category: cs.LG

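TL;DR: The paper studies transfer-based black-box adversarial attacks and proposes a new loss function and surrogate model that reduce computational overhead while improving transferability.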

Details Motivation: Traditional black-box attacks focus on the optimization framework rather than the surrogate model architecture, and diffusion models improve transferability but are computationally expensive. The authors hypothesize that a model with an inductive bias similar to that of diffusion models, combined with a suitable loss function, can achieve comparable results at lower cost. Method: Propose a new loss function and a dedicated surrogate model that use the score of the time-dependent classifier from classifier-guided diffusion models to bring knowledge of the natural data distribution into the adversarial optimization process. Result: Experiments show significantly improved transferability across diverse model architectures while remaining robust against diffusion-based defenses. Conclusion: Transferability can be improved efficiently through a suitable loss function and model design, without introducing a diffusion model. Abstract: Adversarial attacks have become a well-explored domain, frequently serving as evaluation baselines for model robustness. Among these, black-box attacks based on transferability have received significant attention due to their practical applicability in real-world scenarios. Traditional black-box methods have generally focused on improving the optimization framework (e.g., utilizing momentum in MI-FGSM) to enhance transferability, rather than examining the dependency on surrogate white-box model architectures. The recent state-of-the-art approach DiffPGD has demonstrated enhanced transferability by employing diffusion-based adversarial purification models for adaptive attacks. The inductive bias of diffusion-based adversarial purification aligns naturally with the adversarial attack process, as both involve noise addition, reducing dependency on surrogate white-box model selection. However, the denoising process of diffusion models incurs substantial computational costs through chain rule derivation, manifested in excessive VRAM consumption and extended runtime. This progression prompts us to question whether introducing diffusion models is necessary. We hypothesize that a model sharing similar inductive bias to diffusion-based adversarial purification, combined with an appropriate loss function, could achieve comparable or superior transferability while dramatically reducing computational overhead. In this paper, we propose a novel loss function coupled with a unique surrogate model to validate our hypothesis.
Our approach leverages the score of the time-dependent classifier from classifier-guided diffusion models, effectively incorporating natural data distribution knowledge into the adversarial optimization process. Experimental results demonstrate significantly improved transferability across diverse model architectures while maintaining robustness against diffusion-based defenses.

[369] Generalizing Large Language Model Usability Across Resource-Constrained

Yun-Da Tsai

Main category: cs.LG

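TL;DR: This dissertation presents a systematic approach that improves the adaptability and efficiency of large language models under real-world constraints through a text-centric alignment framework, adversarial prompting, and inference-time optimization strategies.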

Details Motivation: Existing methods rely on costly supervised fine-tuning or fixed training conditions, which limits generalization to unseen modalities, limited data, or restricted compute. Method: Introduce a text-centric alignment framework for multimodal integration, propose an adversarial prompting technique to improve robustness, and study inference-time optimization strategies. Result: The work achieves adaptation to unseen modalities, improved robustness to noisy modalities, and state-of-the-art performance in low-resource domains. Conclusion: Together, these contributions enhance the adaptability, scalability, and efficiency of large language models under practical constraints. Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, and recent efforts have sought to extend their capabilities to multimodal domains and resource-constrained environments. However, existing approaches often rely on costly supervised fine-tuning or assume fixed training conditions, limiting their generalization when facing unseen modalities, limited data, or restricted compute resources. This dissertation presents a systematic study toward generalizing LLM usability under real-world constraints. First, it introduces a robust text-centric alignment framework that enables LLMs to seamlessly integrate diverse modalities, including text, images, tables, and other modalities, via natural language interfaces. This approach supports in-context adaptation to unseen or dynamically changing modalities without requiring retraining. To enhance robustness against noisy and missing modalities, an adversarial prompting technique is proposed, generating semantically challenging perturbations at the prompt level to stress-test model reliability. Beyond the multimodal setting, the dissertation investigates inference-time optimization strategies for LLMs, leveraging prompt search and uncertainty quantification to improve performance without additional model training. This perspective offers an efficient alternative to scaling model parameters or retraining from scratch. Additionally, the work addresses low-resource domains such as Verilog code generation by designing correct-by-construction synthetic data pipelines and logic-enhanced reasoning models, achieving state-of-the-art performance with minimal data.
Together, these contributions form a unified effort to enhance the adaptability, scalability, and efficiency of large language models under practical constraints.

[370] TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling

Weizhe Lin,Xing Li,Zhiyuan Yang,Xiaojin Fu,Hui-Ling Zhen,Yaoyuan Wang,Xianzhi Yu,Wulong Liu,Xiaosong Li,Mingxuan Yuan

Main category: cs.LG

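TL;DR: TrimR is a verifier-based framework for dynamic CoT compression that trims redundant reasoning steps, significantly improving the efficiency of large reasoning models while preserving accuracy.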

Details Motivation: Large Reasoning Models (LRMs) perform well on complex tasks but produce redundant reasoning, which makes decoding inefficient. Inspired by human cognition and numerical optimization theory, TrimR is proposed to improve inference efficiency. Method: Use a lightweight pretrained verifier to dynamically detect and truncate redundant intermediate reasoning steps of LRMs, without fine-tuning the model or the verifier. Result: On benchmarks such as MATH500, reasoning runtime improves by up to 70% with almost no effect on accuracy. Conclusion: TrimR offers an efficient, training-free inference optimization approach for production-level deployment. Abstract: Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs' accuracy boundaries, but they incur significant decoding overhead. A key source of inefficiency is that LRMs often generate redundant thinking CoTs, which demonstrate clear structured overthinking and underthinking patterns. Inspired by human cognitive reasoning processes and numerical optimization theories, we propose TrimR, a verifier-based, training-free, efficient framework for dynamic CoT compression to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment. Our method employs a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts of LRMs without any LRM or verifier fine-tuning. We present both the core algorithm and an asynchronous online system engineered for high-throughput industrial applications. Empirical evaluations on Ascend NPUs and vLLM show that our framework delivers substantial gains in inference efficiency under large-batch workloads. In particular, on the four MATH500, AIME24, AIME25, and GPQA benchmarks, the reasoning runtime of Pangu-R-38B, QwQ-32B, and DeepSeek-R1-Distill-Qwen-32B is improved by up to 70% with negligible impact on accuracy.

[371] Zebra-Llama: Towards Extremely Efficient Hybrid Models

Mingyu Yang,Mehdi Rezagholizadeh,Guihong Li,Vikram Appia,Emad Barsoum

Main category: cs.LG

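TL;DR: Zebra-Llama proposes an approach to building efficient hybrid language models that combines SSM and MLA layers, significantly improving inference efficiency and reducing training resource requirements.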

Details Motivation: As demand for deploying large language models (LLMs) grows, improving their inference efficiency is essential for sustainable and democratized access, while retraining LLMs is costly and environmentally unsustainable. Method: Combine State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to transfer knowledge efficiently from pre-trained Transformers. Result: Using only 7-11B training tokens, Zebra-Llama reaches Transformer-level accuracy while sharply reducing KV cache size (to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively). Conclusion: Zebra-Llama delivers competitive or superior accuracy compared to existing models such as MambaInLLaMA while greatly reducing resource and memory requirements. Abstract: With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is prohibitively expensive and environmentally unsustainable. In this work, we propose a practical and scalable alternative: composing efficient hybrid language models from existing pre-trained models. Our approach, Zebra-Llama, introduces a family of 1B, 3B, and 8B hybrid models by combining State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to efficiently transfer knowledge from pre-trained Transformers. Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size (down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively) while preserving 100%, 100%, and >97% of average zero-shot performance on LM Harness tasks. Compared to models like MambaInLLaMA, X-EcoMLA, Minitron, and Llamba, Zebra-Llama consistently delivers competitive or superior accuracy while using significantly fewer tokens, smaller teachers, and vastly reduced KV cache memory. Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% while using 8x fewer training tokens, over 12x smaller KV cache, and a smaller teacher (8B vs. 15B).
It also achieves 2.6x-3.8x higher throughput (tokens/s) than MambaInLlama up to a 32k context length. We will release code and model checkpoints upon acceptance.

[372] Attention with Trained Embeddings Provably Selects Important Tokens

Diyuan Wu,Aleksandr Shevchenko,Samet Oymak,Marco Mondelli

Main category: cs.LG

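TL;DR: This paper characterizes the structure of token embeddings obtained via gradient descent, shows how the embeddings capture the importance of tokens in the dataset, and validates the theoretical findings experimentally.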

Details Motivation: Token embeddings play an important role in language modeling, but their theoretical understanding remains limited. This paper aims to fill that gap. Method: Study a one-layer softmax attention model with a linear head for binary classification, training the embeddings with gradient descent and gradient flow. Result: After training, the embeddings capture token importance and align with the output vector, and experiments confirm the theoretical phenomena. Conclusion: Token embeddings trained by gradient methods effectively capture token importance, and the theoretical analysis agrees with the experimental results. Abstract: Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\texttt{Softmax}( p^\top E_X^\top ) E_X v = \frac{ \sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top v}{\sum_{j=1}^T \exp(p^\top E_{x_{j}}) }$, where $E_X = [ E_{x_1} , \dots, E_{x_T} ]^\top$ contains the embeddings of the input sequence, $p$ is the embedding of the $\mathrm{\langle cls \rangle}$ token and $v$ the output vector. First, we show that, already after a single step of gradient training with the logistic loss, the embeddings $E_X$ capture the importance of tokens in the dataset by aligning with the output vector $v$ proportionally to the frequency with which the corresponding tokens appear in the dataset. Then, after training $p$ via gradient flow until convergence, the softmax selects the important tokens in the sentence (i.e., those that are predictive of the label), and the resulting $\mathrm{\langle cls \rangle}$ embedding maximizes the margin for such a selection. Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory.
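
The model formula in the abstract can be evaluated directly. The sketch below follows the stated expression term by term; the function name and the toy embeddings are illustrative assumptions, and subtracting the maximum score before exponentiating is a standard numerical-stability step that leaves the value unchanged.

```python
import math


def attention_output(E_X, p, v):
    """Evaluate Softmax(p^T E_X^T) E_X v for a sequence of T embeddings.

    E_X is a list of T embedding vectors E_{x_i}; p is the <cls> embedding;
    v is the output vector. Returns a scalar used for binary classification.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(p, e) for e in E_X]               # p^T E_{x_i}
    shift = max(scores)                             # stability shift only
    weights = [math.exp(s - shift) for s in scores]
    normalizer = sum(weights)
    return sum(w / normalizer * dot(e, v) for w, e in zip(weights, E_X))


# Toy sequence with two orthogonal token embeddings (hypothetical values).
E_X = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, -1.0]
uniform = attention_output(E_X, [0.0, 0.0], v)      # equal attention on both tokens
selective = attention_output(E_X, [100.0, 0.0], v)  # attention concentrates on token 1
```

With a large `p` aligned to the first token, the softmax puts nearly all weight on that token, a toy version of the token selection behavior the abstract describes.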

[373] ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training

Maryam Dialameh,Rezaul Karim,Hossein Rajabzadeh,Omar Mohamed Awad,Hyock Ju Kwon,Boxing Chen,Walid Ahmed,Yang Liu

Main category: cs.LG

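TL;DR: ECHO-LLaMA is a modified LLaMA architecture that shares KV caches across certain layers to improve training speed and inference throughput while retaining learning capacity. Experiments show up to 77% higher training throughput, up to 16% higher MFU, up to 14% lower loss, and about 7% higher test-time throughput.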

Details Motivation: Improve the training speed and inference throughput of LLaMA architectures while preserving their learning capacity, providing a more efficient and economical solution for pretraining and fine-tuning. Method: Share KV caches across certain layers to reduce computational complexity, and introduce a computationally efficient adaptation mechanism. Result: Training throughput increases by up to 77%, MFU by up to 16%, loss decreases by up to 14%, and test-time throughput increases by about 7%. Conclusion: ECHO-LLaMA offers an efficient, cost-effective solution for large language models that significantly improves performance without sacrificing learning capacity. Abstract: This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA architectures while maintaining their learning capacity. ECHO-LLaMA converts LLaMA models to use shared KV caching across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher token-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.

[374] An End-to-End Approach for Child Reading Assessment in the Xhosa Language

Sergio Chevtchenko,Nikhil Navas,Rafaella Vale,Franco Ubaudi,Sipumelele Lucwaba,Cally Ardington,Soheil Afshar,Mark Antoniou,Saeed Afshar

Main category: cs.LG

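TL;DR: The study explores using AI to build a low-cost, efficient child reading assessment system for a low-resource language (Xhosa, spoken in South Africa). It builds a new child speech dataset, tests three state-of-the-art models, and finds that the amount and balance of data are critical to model performance.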

Details Motivation: Child literacy is critical to later life outcomes, but literacy levels are lower in low- and middle-income regions, which calls for targeted interventions. AI-supported reading assessment tools can help educators economically and efficiently. Method: Build a Xhosa child speech dataset containing recordings of ten words and letters, labeled by multiple markers and validated by an independent reviewer, and test three models: wav2vec 2.0, HuBERT, and Whisper. Result: Experiments show that the amount and balance of training data strongly affect model performance, and that wav2vec 2.0 performs better when trained on multiple classes at a time, even with limited samples. Conclusion: The study provides a feasible approach to child reading assessment in low-resource languages, underlines the importance of data collection, and shows the potential of wav2vec 2.0. Abstract: Child literacy is a strong predictor of life outcomes at the subsequent stages of an individual's life. This points to a need for targeted interventions in vulnerable low and middle income populations to help bridge the gap between literacy levels in these regions and high income ones. In this effort, reading assessments provide an important tool to measure the effectiveness of these programs and AI can be a reliable and economical tool to support educators with this task. Developing accurate automatic reading assessment systems for child speech in low-resource languages poses significant challenges due to limited data and the unique acoustic properties of children's voices. This study focuses on Xhosa, a language spoken in South Africa, to advance child speech recognition capabilities. We present a novel dataset composed of child speech samples in Xhosa. The dataset is available upon request and contains ten words and letters, which are part of the Early Grade Reading Assessment (EGRA) system. Each recording is labeled with an online and cost-effective approach by multiple markers and a subsample is validated by an independent EGRA reviewer. This dataset is evaluated with three fine-tuned state-of-the-art end-to-end models: wav2vec 2.0, HuBERT, and Whisper. The results indicate that the performance of these models can be significantly influenced by the amount and balancing of the available training data, which is fundamental for cost-effective large dataset collection. Furthermore, our experiments indicate that the wav2vec 2.0 performance is improved by training on multiple classes at a time, even when the number of available samples is constrained.

[375] Value-Guided Search for Efficient Chain-of-Thought Reasoning

Kaiwen Wang,Jin Peng Zhou,Jonathan Chang,Zhaolin Gao,Nathan Kallus,Kianté Brantley,Wen Sun

Main category: cs.LG

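TL;DR: Proposes a simple and efficient method for training value models on long-context reasoning traces that requires no fine-grained notion of a step and significantly improves inference efficiency and performance.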

Details Motivation: Existing process reward models (PRMs) require a fine-grained notion of a step, which is hard to define for long-context reasoning models, so a simpler method is needed. Method: Collect a dataset of 2.5 million reasoning traces, train a 1.5B token-level value model, and apply it to DeepSeek models using block-wise value-guided search (VGS) with a final weighted majority vote. Result: With an inference budget of 64 generations, VGS reaches an average accuracy of 45.7% on four competition math benchmarks, on par with o3-mini-medium, while significantly reducing inference FLOPs. Conclusion: The method performs well for long-context reasoning, and the dataset, model, and code are open-sourced. Abstract: In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. With an inference budget of 64 generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of 45.7% across four competition math benchmarks (AIME 2024 & 2025, HMMT Feb 2024 & 2025), reaching parity with o3-mini-medium. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance as majority voting. Our dataset, model and codebase are open-sourced.
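
To make the difference between plain and weighted majority voting concrete, here is a minimal sketch. The abstract does not state how the weights are formed, so using one value-model score per completed generation as its vote weight is an assumption, as are the names and the toy numbers.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def weighted_majority_vote(answers, values):
    """Return the answer with the largest total weight.

    `values` holds one hypothetical value-model score per generation.
    """
    totals = defaultdict(float)
    for answer, value in zip(answers, values):
        totals[answer] += value
    return max(totals, key=totals.get)


# Five sampled generations: three low-value ones agree on "17",
# two high-value ones agree on "42".
answers = ["42", "17", "42", "17", "17"]
values = [0.9, 0.2, 0.8, 0.3, 0.1]
plain_choice = majority_vote(answers)
weighted_choice = weighted_majority_vote(answers, values)
```

In this toy case the plain vote picks "17" by count, while the weighted vote picks "42" because its supporting generations carry more total value.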

[376] Self-Training Large Language Models with Confident Reasoning

Hyosoon Jang,Yunhui Jang,Sungjae Lee,Jungseul Ok,Sungsoo Ahn

Main category: cs.LG

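TL;DR: The paper proposes CORE-PO, a new self-training method that uses policy optimization to prefer high-confidence reasoning paths, improving the reasoning ability of large language models.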

Details Motivation: Existing self-training methods consider only the confidence of the final answer and ignore the quality of the reasoning path. This paper aims to identify high-quality paths through reasoning-level confidence. Method: Propose CORE-PO, which uses Policy Optimization to fine-tune the model to prefer high-confidence reasoning paths during self-training. Result: CORE-PO improves output accuracy on four in-distribution and two out-of-distribution benchmarks. Conclusion: By focusing on the quality of reasoning paths, CORE-PO outperforms existing self-training methods, confirming the importance of reasoning-level confidence. Abstract: Large language models (LLMs) have shown impressive performance by generating reasoning paths before final answers, but learning such a reasoning path requires costly human supervision. To address this issue, recent studies have explored self-training methods that improve reasoning capabilities using pseudo-labels generated by the LLMs themselves. Among these, confidence-based self-training fine-tunes LLMs to prefer reasoning paths with high-confidence answers, where confidence is estimated via majority voting. However, such methods exclusively focus on the quality of the final answer and may ignore the quality of the reasoning paths, as even an incorrect reasoning path can lead to a correct answer by chance. Instead, we advocate the use of reasoning-level confidence to identify high-quality reasoning paths for self-training, supported by our empirical observations. We then propose a new self-training method, CORE-PO, that fine-tunes LLMs to prefer high-COnfidence REasoning paths through Policy Optimization. Our experiments show that CORE-PO improves the accuracy of outputs on four in-distribution and two out-of-distribution benchmarks, compared to existing self-training methods.
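
The answer-level confidence that the abstract attributes to prior work (estimated via majority voting) can be sketched as the fraction of sampled paths that end in each answer. The function name and the sample data are illustrative assumptions, and this sketch does not cover the reasoning-level confidence used by CORE-PO.

```python
from collections import Counter


def answer_confidence(sampled_answers):
    """Estimate confidence in each final answer as its vote share."""
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    return {answer: count / total for answer, count in counts.items()}


# Final answers of four hypothetical reasoning paths for one question.
sampled_answers = ["12", "12", "15", "12"]
confidence = answer_confidence(sampled_answers)
```

Every path ending in "12" receives the same confidence here regardless of how sound its reasoning is, which is the limitation the abstract points out.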

[377] ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs

Landon Butler,Abhineet Agarwal,Justin Singh Kang,Yigit Efe Erginbas,Bin Yu,Kannan Ramchandran

Main category: cs.LG

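TL;DR: ProxySPEX is an interaction attribution algorithm based on gradient boosted trees. It exploits the hierarchical structure of LLM feature interactions to find important interactions efficiently, using 10x fewer inferences than SPEX and outperforming marginal attribution methods on several tasks.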

Details Motivation: Existing methods such as SPEX require a large number of model inferences, which makes them hard to scale to large models. LLM feature interactions are often hierarchical, and this property can be exploited for efficiency. Method: Proposes ProxySPEX, which first fits gradient boosted trees to masked LLM outputs and then extracts the important interactions. Result: On four high-dimensional datasets, ProxySPEX reconstructs LLM outputs 20% more faithfully than marginal attribution approaches while using 10x fewer inferences than SPEX. On interpretability tasks, it identifies key interactions more effectively. Conclusion: By exploiting the interaction hierarchy, ProxySPEX makes interaction discovery markedly more efficient and accurate, and it is suited to large models and complex tasks. Abstract: Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs $n$. Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach that uses interaction sparsity to scale to $n \approx 10^3$ features. SPEX greatly improves upon prior methods but requires tens of thousands of model inferences, which can be prohibitive for large models. In this paper, we observe that LLM feature interactions are often hierarchical -- higher-order interactions are accompanied by their lower-order subsets -- which enables more efficient discovery. To exploit this hierarchy, we propose ProxySPEX, an interaction attribution algorithm that first fits gradient boosted trees to masked LLM outputs and then extracts the important interactions. Experiments across four challenging high-dimensional datasets show that ProxySPEX more faithfully reconstructs LLM outputs by 20% over marginal attribution approaches while using $10\times$ fewer inferences than SPEX. By accounting for interactions, ProxySPEX identifies features that influence model output over 20% more than those selected by marginal approaches. Further, we apply ProxySPEX to two interpretability tasks: data attribution, where we identify interactions among CIFAR-10 training samples that influence test predictions, and mechanistic interpretability, where we uncover interactions between attention heads, both within and across layers, on a question-answering task. ProxySPEX identifies interactions that enable more aggressive pruning of heads than marginal approaches.

[378] On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Yifan Zhang,Yifeng Liu,Huizhuo Yuan,Yang Yuan,Quanquan Gu,Andrew C Yao

Main category: cs.LG

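TL;DR: The paper proposes regularized policy gradient (RPG), a framework that systematically explores KL regularization in online reinforcement learning, and shows its effectiveness for improving the reasoning ability of large language models.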

Details Motivation: KL regularization is widely used to stabilize training in policy gradient algorithms, but how to systematically explore different KL divergence formulations and integrate them into loss functions for online reinforcement learning remains under-studied. Method: Proposes the RPG framework, deriving policy gradients and corresponding loss functions for objectives regularized by forward and reverse KL divergences, for both normalized and unnormalized policy distributions, and providing fully differentiable loss functions as well as REINFORCE-style gradient estimators. Result: Experiments show that RPG is better than or competitive with baselines such as GRPO, REINFORCE++, and DAPO in training stability and performance. Conclusion: RPG provides a systematic framework for KL-regularized policy gradient methods and performs well on LLM reasoning tasks. Abstract: Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). Despite the widespread use of Kullback-Leibler (KL) regularization in policy gradient algorithms to stabilize training, the systematic exploration of how different KL divergence formulations can be estimated and integrated into surrogate loss functions for online reinforcement learning (RL) presents a nuanced and systematically explorable design space. In this paper, we propose regularized policy gradient (RPG), a systematic framework for deriving and analyzing KL-regularized policy gradient methods in the online RL setting. We derive policy gradients and corresponding surrogate loss functions for objectives regularized by both forward and reverse KL divergences, considering both normalized and unnormalized policy distributions. Furthermore, we present derivations for fully differentiable loss functions as well as REINFORCE-style gradient estimators, accommodating diverse algorithmic needs. We conduct extensive experiments on RL for LLM reasoning using these methods, showing improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at https://github.com/complex-reasoning/RPG.

[379] What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection

Binh Nguyen,Shuji Shi,Ryan Ofman,Thai Le

Main category: cs.LG

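TL;DR: The paper studies how transcript-level adversarial attacks affect audio anti-spoofing systems and finds that even minor linguistic perturbations can significantly reduce detection accuracy, exposing the vulnerability of existing systems to linguistic variation.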

Details Motivation: Audio deepfake attacks are a growing threat, but existing anti-spoofing work focuses mainly on acoustic-level perturbations and overlooks the effect of linguistic variation. Method: Introduces transcript-level adversarial attacks to evaluate the linguistic sensitivity of open-source and commercial anti-spoofing detectors, and performs a feature attribution analysis. Result: Linguistic perturbations push attack success rates above 60% on several open-source detector-voice pairs, and one commercial detector's accuracy drops from 100% to 32%. Conclusion: Anti-spoofing system design needs to account for linguistic variation to improve robustness. Abstract: Recent advances in text-to-speech technologies have enabled realistic voice generation, fueling audio-based deepfake attacks such as fraud and impersonation. While audio anti-spoofing systems are critical for detecting such threats, prior work has predominantly focused on acoustic-level perturbations, leaving the impact of linguistic variation largely unexplored. In this paper, we investigate the linguistic sensitivity of both open-source and commercial anti-spoofing detectors by introducing transcript-level adversarial attacks. Our extensive evaluation reveals that even minor linguistic perturbations can significantly degrade detection accuracy: attack success rates surpass 60% on several open-source detector-voice pairs, and notably one commercial detector's accuracy drops from 100% on synthetic audio to just 32%. Through a comprehensive feature attribution analysis, we identify that both linguistic complexity and model-level audio embedding similarity contribute strongly to detector vulnerability. We further demonstrate the real-world risk via a case study replicating the Brad Pitt audio deepfake scam, using transcript adversarial attacks to completely bypass commercial detectors. These results highlight the need to move beyond purely acoustic defenses and account for linguistic variation in the design of robust anti-spoofing systems. All source code will be publicly available.

[380] CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning

Jinyuan Feng,Chaopeng Wei,Tenghai Qiu,Tianyi Hu,Zhiqiang Pu

Main category: cs.LG

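TL;DR: The paper proposes CoMoE, a method that trains experts with a contrastive objective to improve modularization and specialization in MoE, addressing the problem of experts learning redundant knowledge on heterogeneous datasets.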

Details Motivation: Existing MoE methods fall short on heterogeneous datasets because experts may learn similar knowledge, leaving MoE capacity underutilized. Method: Proposes CoMoE, which trains the experts with a contrastive objective by sampling from activated and inactivated experts. Result: Experiments show that CoMoE consistently enhances MoE capacity and promotes modularization among experts. Conclusion: CoMoE is an effective improvement for MoE on heterogeneous datasets. Abstract: In parameter-efficient fine-tuning, mixture-of-experts (MoE), which involves specializing functionalities into different experts and sparsely activating them appropriately, has been widely adopted as a promising approach to trade off between model capacity and computation overhead. However, current MoE variants fall short on heterogeneous datasets, ignoring the fact that experts may learn similar knowledge, resulting in the underutilization of MoE's capacity. In this paper, we propose Contrastive Representation for MoE (CoMoE), a novel method to promote modularization and specialization in MoE, where the experts are trained along with a contrastive objective by sampling from activated and inactivated experts in top-k routing. We demonstrate that such a contrastive objective recovers the mutual-information gap between inputs and the two types of experts. Experiments on several benchmarks and in multi-task settings demonstrate that CoMoE can consistently enhance MoE's capacity and promote modularization among the experts.

[381] NeUQI: Near-Optimal Uniform Quantization Parameter Initialization

Li Lin,Xinyu Hu,Xiaojun Wan

Main category: cs.LG

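TL;DR: NeUQI is a method for determining near-optimal initial parameters for uniform quantization, significantly improving the performance of quantized LLMs targeted at consumer-grade devices.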

Details Motivation: Large language models (LLMs) face high memory consumption and inference costs when deployed on consumer-grade GPUs or personal devices, and existing quantization parameter initialization methods such as the Min-Max strategy are suboptimal. Method: Proposes NeUQI, which efficiently determines near-optimal initial parameters for uniform quantization and integrates seamlessly with other quantization methods. Result: Experiments on LLaMA and Qwen models show that NeUQI outperforms existing methods; combined with a lightweight distillation strategy, it even surpasses the resource-intensive PV-tuning. Conclusion: NeUQI offers an efficient, high-performing solution for quantized deployment of LLMs. Abstract: Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with uniform quantization representation is favored for its efficiency and ease of deployment since uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on $\geq 2$-bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they primarily focus on quantization methodologies, while the initialization of quantization parameters is underexplored and still relies on the suboptimal Min-Max strategies. In this work, we propose NeUQI, a method devoted to efficiently determining near-optimal initial parameters for uniform quantization. NeUQI is orthogonal to prior quantization methodologies and can seamlessly integrate with them. The experiments with the LLaMA and Qwen families on various tasks demonstrate that our NeUQI consistently outperforms existing methods. Furthermore, when combined with a lightweight distillation strategy, NeUQI can achieve superior performance to PV-tuning, a much more resource-intensive approach.
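For context, the Min-Max baseline that the abstract calls suboptimal can be sketched as follows. This is a generic textbook form of asymmetric uniform quantization, not NeUQI itself; the function names and the per-list (rather than per-group or per-channel) granularity are assumptions made for illustration.

```python
def minmax_init(weights, bits):
    """Min-Max initialization: spread 2**bits levels over [min(w), max(w)]."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:
        scale = 1.0  # all weights identical; any positive scale works
    zero_point = round(-w_min / scale)
    return scale, zero_point


def quantize(weights, scale, zero_point, bits):
    """Map each weight to an integer code in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    return [min(levels, max(0, round(w / scale) + zero_point)) for w in weights]


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate real-valued weights."""
    return [(q - zero_point) * scale for q in codes]


weights = [-1.0, -0.5, 0.0, 0.5, 2.0]
scale, zero_point = minmax_init(weights, bits=2)
codes = quantize(weights, scale, zero_point, bits=2)
print(scale, zero_point, codes, dequantize(codes, scale, zero_point))
```

Because the scale and zero point are fixed entirely by the two extreme weights, a single outlier stretches the grid and wastes levels on sparsely populated ranges, which is the kind of weakness a better initialization would aim to reduce.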

[382] Large language model as user daily behavior data generator: balancing population diversity and individual personality

Haoxin Li,Jingtao Ding,Jiahui Gong,Yong Li

Main category: cs.LG

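TL;DR: BehaviorGen uses large language models to generate high-quality synthetic behavior data, significantly improving behavior prediction while preserving privacy.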

Details Motivation: Predicting human daily behavior is complex and depends on sensitive data, which limits existing methods. Method: Proposes the BehaviorGen framework, which simulates behavior data based on user profiles and real events. Result: Significantly improves prediction performance across several scenarios, with gains of up to 18.9%. Conclusion: BehaviorGen offers a flexible, privacy-preserving solution for behavior modeling. Abstract: Predicting human daily behavior is challenging due to the complexity of routine patterns and short-term fluctuations. While data-driven models have improved behavior prediction by leveraging empirical data from various platforms and devices, the reliance on sensitive, large-scale user data raises privacy concerns and limits data availability. Synthetic data generation has emerged as a promising solution, though existing methods are often limited to specific applications. In this work, we introduce BehaviorGen, a framework that uses large language models (LLMs) to generate high-quality synthetic behavior data. By simulating user behavior based on profiles and real events, BehaviorGen supports data augmentation and replacement in behavior prediction models. We evaluate its performance in scenarios such as pretraining augmentation, fine-tuning replacement, and fine-tuning augmentation, achieving significant improvements in human mobility and smartphone usage predictions, with gains of up to 18.9%. Our results demonstrate the potential of BehaviorGen to enhance user behavior modeling through flexible and privacy-preserving synthetic data generation.

[383] Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis

Jonathan Bennion,Shaona Ghosh,Mantek Singh,Nouha Dziri

Main category: cs.LG

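TL;DR: Using UMAP dimensionality reduction and k-means clustering on five open-source safety benchmark datasets, the paper identifies six primary harm categories and quantifies the semantic orthogonality between benchmarks.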

Details Motivation: To assess how well existing AI safety benchmark datasets cover harms, in order to support the development of more comprehensive datasets. Method: Applies UMAP dimensionality reduction and k-means clustering to five open-source safety benchmarks to identify semantic clusters and harm categories. Result: Finds six primary harm categories with uneven coverage across benchmarks, and significant differences in prompt length distribution. Conclusion: The proposed quantitative framework helps identify coverage gaps and supports targeted development of more comprehensive AI safety datasets. Abstract: Various AI safety datasets have been developed to measure LLMs against evolving interpretations of harm. Our evaluation of five recently published open-source safety benchmarks reveals distinct semantic clusters using UMAP dimensionality reduction and k-means clustering (silhouette score: 0.470). We identify six primary harm categories with varying benchmark representation. GretelAI, for example, focuses heavily on privacy concerns, while WildGuardMix emphasizes self-harm scenarios. Significant differences in prompt length distribution suggest confounds to data collection and interpretations of harm as well as offer possible context. Our analysis quantifies benchmark orthogonality among AI benchmarks, allowing for transparency in coverage gaps despite topical similarities. Our quantitative framework for analyzing semantic orthogonality across safety benchmarks enables more targeted development of datasets that comprehensively address the evolving landscape of harms in AI use, however that is defined in the future.

[384] COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection

Jaewon Cheon,Pilsung Kang

Main category: cs.LG

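TL;DR: The paper proposes two sparse activation methods, M-COUNTDOWN and D-COUNTDOWN, which reduce computation in FFNN layers by treating the down projection as a linear combination, yielding substantial efficiency gains.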

Details Motivation: The computational inefficiency of large language models is increasingly pressing. Existing methods focus on non-linear gating mechanisms, whereas the authors hypothesize that sparsity lies in a linear combination over the down projection matrix. Method: Proposes M-COUNTDOWN (indirect coefficients) and D-COUNTDOWN (direct coefficients), which use the linear combination to selectively skip non-essential parameters. Result: D-COUNTDOWN can omit 90% of computations with a performance loss as low as 5.5% in the ideal case; M-COUNTDOWN needs no predictor and preserves up to 29.4% more performance than existing methods. Conclusion: Specialized kernel implementations turn the theoretical gains into substantial real-world acceleration. Abstract: The growing size of large language models has created significant computational inefficiencies. To address this challenge, sparse activation methods selectively deactivate non-essential parameters during inference, reducing computational costs in FFNN layers. While existing methods focus on non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN layer lies globally in the form of a linear combination over its internal down projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN, leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct coefficients of the linear combination. Experimental results demonstrate that D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5% ideally, while M-COUNTDOWN provides a predictor-free solution with up to 29.4% better performance preservation compared to existing methods. Our specialized kernel implementations effectively realize these theoretical gains into substantial real-world acceleration.
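The linear-combination view of the down projection can be made concrete with a small sketch. It treats the FFNN output as a weighted sum of rows of the down projection matrix and skips rows whose coefficients are small. The top-k-by-magnitude selection rule and all names here are illustrative assumptions; the abstract does not specify how M-COUNTDOWN or D-COUNTDOWN choose which terms to drop.

```python
def down_projection(coeffs, down_rows, keep=None):
    """Compute sum_i coeffs[i] * down_rows[i], optionally over `keep` only."""
    indices = range(len(coeffs)) if keep is None else keep
    dim = len(down_rows[0])
    out = [0.0] * dim
    for i in indices:
        for j in range(dim):
            out[j] += coeffs[i] * down_rows[i][j]
    return out


def top_k_indices(coeffs, k):
    """Indices of the k coefficients with the largest magnitude (assumed rule)."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return order[:k]


# Hypothetical intermediate activations (coefficients) and down projection rows.
coeffs = [0.01, -2.0, 0.0, 1.5]
down_rows = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]]

dense = down_projection(coeffs, down_rows)
sparse = down_projection(coeffs, down_rows, keep=top_k_indices(coeffs, 2))
print(dense, sparse)
```

In this toy case half of the rows are skipped and the output changes only in the first coordinate by 0.01, which shows why near-zero coefficients are natural candidates for omission.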

[385] PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization

Ben Rahman

Main category: cs.LG

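TL;DR: PPO-BR is an adaptive reinforcement learning framework that dynamically adjusts the trust region to resolve PPO's trade-off between exploration and convergence, with significant performance gains.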

Details Motivation: PPO's static trust region creates a conflict between exploration and convergence, limiting its use in safety-critical systems. Method: PPO-BR combines entropy-driven expansion and reward-guided contraction to adjust the trust region dynamically. Result: Across several benchmarks, PPO-BR achieves 29.1% faster convergence and lower reward variance, with less than 1.8% runtime overhead. Conclusion: With a simple, theoretically grounded mechanism, PPO-BR is suited to safety-critical domains such as robotic surgery and autonomous drones. Abstract: Despite Proximal Policy Optimization (PPO) dominating policy gradient methods -- from robotic control to game AI -- its static trust region forces a brittle trade-off: aggressive clipping stifles early exploration, while late-stage updates destabilize convergence. PPO-BR establishes a new paradigm in adaptive RL by fusing exploration and convergence signals into a single bounded trust region -- a theoretically grounded innovation that outperforms five SOTA baselines with less than 2% overhead. This work bridges a critical gap in phase-aware learning, enabling real-world deployment in safety-critical systems like robotic surgery within a single adaptive mechanism. PPO-BR achieves 29.1% faster convergence by combining: (1) entropy-driven expansion (epsilon up) for exploration in high-uncertainty states, and (2) reward-guided contraction (epsilon down) for convergence stability. On six diverse benchmarks (MuJoCo, Atari, sparse-reward), PPO-BR achieves 29.1% faster convergence (p < 0.001), 2.3x lower reward variance than PPO, and less than 1.8% runtime overhead with only five lines of code change. PPO-BR's simplicity and theoretical guarantees make it ready-to-deploy in safety-critical domains -- from surgical robotics to autonomous drones. In contrast to recent methods such as Group Relative Policy Optimization (GRPO), PPO-BR offers a unified entropy-reward mechanism applicable to both language models and general reinforcement learning environments.
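The idea of widening the clipping range when entropy is high and narrowing it as reward improves can be sketched as below. The clipped surrogate is the standard PPO objective; the specific update rule for epsilon, the coefficient names, and the bounds are assumptions for illustration only, since the abstract does not give PPO-BR's formula.

```python
def adaptive_epsilon(base_eps, entropy_signal, reward_signal,
                     expand=0.5, contract=0.5, eps_min=0.05, eps_max=0.3):
    """Hypothetical bounded clip range.

    `entropy_signal` and `reward_signal` are assumed to be normalized to
    [0, 1]: high entropy widens the range, reward progress narrows it.
    """
    raw = base_eps * (1.0 + expand * entropy_signal - contract * reward_signal)
    return max(eps_min, min(eps_max, raw))


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped objective for one sample."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


early = adaptive_epsilon(0.2, entropy_signal=1.0, reward_signal=0.0)
late = adaptive_epsilon(0.2, entropy_signal=0.0, reward_signal=1.0)
print(early, late)
print(clipped_surrogate(1.5, 1.0, early), clipped_surrogate(1.5, 1.0, late))
```

Clamping to `[eps_min, eps_max]` keeps the trust region bounded regardless of how the two signals behave, which mirrors the "single bounded trust region" described in the abstract.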

[386] Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models

Xuchen Pan,Yanxi Chen,Yushuo Chen,Yuchang Sun,Daoyuan Chen,Wenhao Zhang,Yuexiang Xie,Yilun Huang,Yilei Zhang,Dawei Gao,Yaliang Li,Bolin Ding,Jingren Zhou

Main category: cs.LG

TL;DR: Trinity-RFT is a general-purpose, flexible, and scalable framework for reinforcement fine-tuning (RFT) of large language models.

Details Motivation: To unify and generalize multiple RFT modes (synchronous/asynchronous, on-policy/off-policy, online/offline) in an efficient and robust framework. Method: A decoupled design consisting of an RFT-core, seamless integration of agent-environment interaction, and data pipelines optimized for RFT. Result: Trinity-RFT adapts easily to diverse application scenarios and serves as a unified platform for exploring advanced reinforcement learning paradigms. Conclusion: Extensive examples demonstrate the framework's utility and user-friendliness. Abstract: Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT, (2) seamless integration for agent-environment interaction with high efficiency and robustness, and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for exploring advanced reinforcement learning paradigms. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples demonstrating the utility and user-friendliness of the proposed framework.

[387] Understanding Gated Neurons in Transformers from Their Input-Output Functionality

Sebastian Gerstner,Hinrich Schütze

Main category: cs.LG

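TL;DR: The paper studies MLP neurons in language models through the cosine similarity between their input and output weights, finding that "enrichment neurons" dominate in early-to-middle layers while later layers lean towards "depletion neurons".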

Details Motivation: Prior work focuses on the contexts in which neurons activate and on their output weights, paying little attention to the interaction between input and output. Method: Computes the cosine similarity between each neuron's input and output weights and analyzes the resulting interaction patterns. Result: Across 12 models, early-to-middle layers are dominated by enrichment neurons and later layers tend towards depletion neurons. Conclusion: The input-output perspective complements activation-dependent analyses and sheds light on the role of neurons in concept representation. Abstract: Interpretability researchers have attempted to understand MLP neurons of language models based on both the contexts in which they activate and their output weight vectors. They have paid little attention to a complementary aspect: the interactions between input and output. For example, when neurons detect a direction in the input, they might add much the same direction to the residual stream ("enrichment neurons") or reduce its presence ("depletion neurons"). We address this aspect by examining the cosine similarity between input and output weights of a neuron. We apply our method to 12 models and find that enrichment neurons dominate in early-middle layers whereas later layers tend more towards depletion. To explain this finding, we argue that enrichment neurons are largely responsible for enriching concept representations, one of the first steps of factual recall. Our input-output perspective is a complement to activation-dependent analyses and to approaches that treat input and output separately.
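The cosine-similarity measurement described in the abstract can be sketched for a single neuron as follows. The threshold of 0.5 and the three-way labeling are hypothetical choices made here for illustration; the abstract does not state how the authors categorize neurons, and a gated neuron has more than one input weight vector, which this sketch ignores.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def classify_neuron(w_in, w_out, threshold=0.5):
    """Label a neuron by how its output direction relates to its input direction."""
    cos = cosine_similarity(w_in, w_out)
    if cos >= threshold:
        return "enrichment"  # writes back roughly the direction it reads
    if cos <= -threshold:
        return "depletion"   # writes back roughly the opposite direction
    return "other"


print(classify_neuron([1.0, 0.0], [2.0, 0.0]))
print(classify_neuron([1.0, 0.0], [-3.0, 0.0]))
print(classify_neuron([1.0, 0.0], [0.0, 1.0]))
```

Because the measure depends only on weights, it can be computed for every neuron in a model without running any inputs through it, which is what makes it complementary to activation-based analyses.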

[388] Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems

Jiayi Geng,Howard Chen,Dilip Arumugam,Thomas L. Griffiths

Main category: cs.LG

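TL;DR: The paper examines how well large language models (LLMs) identify the structure of black-box systems from passively observed versus actively collected data, and finds that active intervention improves performance.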

Details Motivation: To understand how well an AI model can infer the structure of a black-box system from its behavior, as a prerequisite for scientific discovery by future autonomous AI researchers. Method: Compares LLM performance under passive observation and active intervention (querying specific inputs) across three types of black-box systems: programs, formal languages, and math equations. Result: From observation alone, LLMs struggle to extract information and plateau below the ideal of Bayesian inference; active intervention improves performance and helps avoid two failure modes, overcomplication and overlooking observations. Conclusion: Active intervention is key to improving LLMs' ability to reverse-engineer black-box systems, offering practical guidance for AI in scientific discovery. Abstract: Using AI to create autonomous researchers has the potential to accelerate scientific discovery. A prerequisite for this vision is understanding how well an AI model can identify the underlying structure of a black-box system from its behavior. In this paper, we explore how well a large language model (LLM) learns to identify a black-box function from passively observed versus actively collected data. We investigate the reverse-engineering capabilities of LLMs across three distinct types of black-box systems, each chosen to represent different problem domains where future autonomous AI researchers may have considerable impact: Program, Formal Language, and Math Equation. Through extensive experiments, we show that LLMs fail to extract information from observations, reaching a performance plateau that falls short of the ideal of Bayesian inference. However, we demonstrate that prompting LLMs to not only observe but also intervene -- actively querying the black-box with specific inputs to observe the resulting output -- improves performance by allowing LLMs to test edge cases and refine their beliefs. By providing the intervention data from one LLM to another, we show that this improvement is partly a result of engaging in the process of generating effective interventions, paralleling results in the literature on human learning. Further analysis reveals that engaging in intervention can help LLMs escape from two common failure modes: overcomplication, where the LLM falsely assumes prior knowledge about the black-box, and overlooking, where the LLM fails to incorporate observations. These insights provide practical guidance for helping LLMs more effectively reverse-engineer black-box systems, supporting their use in making new discoveries.

[389] Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

Jintian Shao,Yiming Cheng,Hongyi Huang,Beiwen Zhang,Zhiyu Wu,You Shan,Mingkai Zheng

Main category: cs.LG

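TL;DR: The VAPO framework performs well at improving the efficiency and reliability of long chain-of-thought reasoning with large language models, but its theoretical mechanisms and potential limitations need further study.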

Details Motivation: To examine the theoretical foundations of VAPO, identify where its assumptions may be challenged, and suggest directions for future research. Method: A theoretical analysis of VAPO covering value function approximation, adaptive advantage estimation, and token-level optimization. Result: VAPO performs strongly on complex reasoning tasks, but its theoretical mechanisms are not yet well understood. Conclusion: Future work should deepen the theoretical understanding of VAPO and improve its generalization. Abstract: The VAPO framework has demonstrated significant empirical success in enhancing the efficiency and reliability of reinforcement learning for long chain-of-thought (CoT) reasoning tasks with large language models (LLMs). By systematically addressing challenges such as value model bias, heterogeneous sequence lengths, and sparse reward signals, VAPO achieves state-of-the-art performance. While its practical benefits are evident, a deeper theoretical understanding of its underlying mechanisms and potential limitations is crucial for guiding future advancements. This paper aims to initiate such a discussion by exploring VAPO from a theoretical perspective, highlighting areas where its assumptions might be challenged and where further investigation could yield more robust and generalizable reasoning agents. We delve into the intricacies of value function approximation in complex reasoning spaces, the optimality of adaptive advantage estimation, the impact of token-level optimization, and the enduring challenges of exploration and generalization.

[390] Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Xinran Gu,Kaifeng Lyu,Jiazheng Li,Jingzhao Zhang

Main category: cs.LG

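TL;DR: The study finds that when large language models are trained on data mixtures, knowledge acquisition can exhibit phase transitions that depend on model size and mixing ratio.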

Details Motivation: To study how large language models acquire knowledge when trained on mixtures of web-scraped data and high-quality, knowledge-dense data, and to reveal the non-smooth behavior involved. Method: Controlled experiments mixing a synthetic biography dataset with web-scraped data, analyzing the effect of model size and mixing ratio on knowledge acquisition. Result: At a critical model size, the model suddenly shifts from memorizing very few biographies to memorizing most of them; below a critical mixing ratio it memorizes almost nothing, and beyond that threshold memorization rises rapidly. Conclusion: The phase transitions stem from discontinuous changes in how the model allocates its capacity, so mixing recipes should be adjusted to model size. Abstract: Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.

[391] How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Takashi Ishida,Thanawat Lodkaew,Ikko Yamane

Main category: cs.LG

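TL;DR: Proposes publishing LLM benchmarks with randomness injected into the answers, which avoids fully disclosing the ground truth and provides a test for detecting data contamination.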

Details Motivation: To address the data contamination risk of publishing LLM benchmarks, without relying on trust in a single organization and without permitting test-set overfitting. Method: Prepares several logically correct answers for each question and randomly includes only one of them as the solution, which lowers the best possible accuracy (the Bayes accuracy). Result: Experiments show that the method accurately detects data contamination across a wide range of benchmarks, models, and training methodologies. Conclusion: The method allows benchmarks to be published without fully disclosing ground-truth answers and reliably detects data contamination. Abstract: Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy will require trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. Our main idea is to inject randomness into the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. This reduces the best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only is this helpful to keep us from disclosing the ground truth, but this approach also offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model surpasses this ceiling despite this expectation, this is a strong signal of data contamination. We present experimental evidence that our method can detect data contamination accurately on a wide range of benchmarks, models, and training methodologies.
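The Bayes accuracy ceiling and the contamination signal can be illustrated with a small calculation. The sketch assumes that the published answer for each question is drawn uniformly from its logically correct answers, so that no model can expect to match it more often than 1/k of the time for a question with k valid answers. The one-sided binomial tail used as a contamination check is an assumption for illustration; the abstract does not specify the statistical test.

```python
from math import comb


def bayes_accuracy(num_valid_answers):
    """Best achievable expected accuracy when one of k valid answers is
    chosen uniformly at random as the published solution for each question."""
    return sum(1.0 / k for k in num_valid_answers) / len(num_valid_answers)


def binomial_tail(n, p, observed_correct):
    """P(X >= observed_correct) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, c) * p ** c * (1.0 - p) ** (n - c)
        for c in range(observed_correct, n + 1)
    )


# Hypothetical benchmark: 100 questions, each with 2 logically correct answers.
ceiling = bayes_accuracy([2] * 100)
# A model matching the published answer on 70 of 100 questions exceeds the
# ceiling; the tail probability measures how surprising that is by chance.
p_value = binomial_tail(100, ceiling, 70)
print(ceiling, p_value)
```

A very small tail probability means the score is hard to explain without the model having seen the specific published answers, which is the intuition behind treating a score above the ceiling as evidence of contamination.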

[392] Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

Huayu Chen,Kaiwen Zheng,Qinsheng Zhang,Ganqu Cui,Yin Cui,Haotian Ye,Tsung-Yi Lin,Ming-Yu Liu,Jun Zhu,Haoxiang Wang

Main category: cs.LG

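TL;DR: The paper proposes Negative-aware Fine-Tuning (NFT), a supervised learning approach that uses the model's own negative answers so that LLMs can improve autonomously without an external teacher. Experiments show that NFT outperforms conventional supervised baselines on math reasoning and matches or surpasses leading reinforcement learning algorithms.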

Details Motivation: To challenge the notion that self-improvement is exclusive to reinforcement learning and to explore the potential of supervised learning in verification-driven training. Method: Proposes NFT, which constructs an implicit negative policy to model self-generated negative answers and uses this negative feedback to optimize the model. Result: In experiments on 7B and 32B models, NFT outperforms supervised learning baselines and matches or surpasses RL algorithms such as GRPO and DAPO. Conclusion: NFT bridges the gap between supervised learning and reinforcement learning in binary-feedback learning systems. Abstract: Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) -- a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM we target to optimize on positive data, enabling direct policy optimization on all LLMs' generations. We conduct experiments on 7B and 32B models in math reasoning tasks. Results consistently show that through the additional leverage of negative feedback, NFT significantly improves over SL baselines like Rejection sampling Fine-Tuning, matching or even surpassing leading RL algorithms like GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are actually equivalent in strict-on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.

[393] TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations

Alan Arazi,Eilam Shapira,Roi Reichart

Main category: cs.LG

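TL;DR: TabSTAR is a foundation tabular model with semantically target-aware representations. It uses a pretrained text encoder and target tokens to enable transfer learning on tabular data, and achieves state-of-the-art performance on classification tasks with text features.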

Details Motivation: Deep learning has succeeded in many domains but has underperformed on tabular learning tasks, where gradient boosting decision trees (GBDTs) still dominate. TabSTAR aims to improve tabular performance by incorporating language model capabilities. Method: TabSTAR unfreezes a pretrained text encoder and takes target tokens as input to learn task-specific embeddings, with an architecture free of dataset-specific parameters. Result: TabSTAR achieves state-of-the-art performance on medium- and large-sized datasets for classification tasks with text features, and its pretraining phase exhibits scaling laws. Conclusion: TabSTAR offers an effective way to combine tabular data with text features and a pathway to further performance improvements. Abstract: While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees (GBDTs). However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Foundation Tabular Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.

[394] Reward Model Overoptimisation in Iterated RLHF

Lorenz Wolf,Robert Kirk,Mirco Musolesi

Main category: cs.LG

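TL;DR: The paper studies overoptimisation in iterated RLHF and analyses key design choices, finding that overoptimisation decreases over iterations but performance gains diminish.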

Details Motivation: RLHF often suffers from reward model overoptimisation, which yields policies that generalise poorly. Iterated RLHF is widely adopted, but its dynamics are poorly understood. Method: Using the AlpacaFarm benchmark, systematically analyses design choices including how reward model training data is transferred across iterations, which reward function is optimised, and how policies are initialised. Result: Overoptimisation decreases over iterations as reward models better approximate ground-truth preferences, but performance gains diminish; the policy initialisation strategy affects the ability to recover from early overoptimisation. Conclusion: The study offers actionable guidance for building more stable and generalisable RLHF pipelines. Abstract: Reinforcement learning from human feedback (RLHF) is a widely used method for aligning large language models with human preferences. However, RLHF often suffers from reward model overoptimisation, in which models overfit to the reward function, resulting in non-generalisable policies that exploit the idiosyncrasies and peculiarities of the reward function. A common mitigation is iterated RLHF, in which reward models are repeatedly retrained with updated human feedback and policies are re-optimised. Despite its increasing adoption, the dynamics of overoptimisation in this setting remain poorly understood. In this work, we present the first comprehensive study of overoptimisation in iterated RLHF. We systematically analyse key design choices - how reward model training data is transferred across iterations, which reward function is used for optimisation, and how policies are initialised. Using the controlled AlpacaFarm benchmark, we observe that overoptimisation tends to decrease over successive iterations, as reward models increasingly approximate ground-truth preferences. However, performance gains diminish over time, and while reinitialising from the base policy is robust, it limits optimisation flexibility. Other initialisation strategies often fail to recover from early overoptimisation. These findings offer actionable insights for building more stable and generalisable RLHF pipelines.