cs.CV

[1] BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

Jianyang Gu,Samuel Stevens,Elizabeth G Campolongo,Matthew J Thompson,Net Zhang,Jiaman Wu,Andrei Kopanev,Zheda Mai,Alexander E. White,James Balhoff,Wasila Dahdul,Daniel Rubenstein,Hilmar Lapp,Tanya Berger-Wolf,Wei-Lun Chao,Yu Su

Main category: cs.CV

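TL;DR: Through large-scale contrastive vision-language training, the paper finds emergent behaviors in biological vision models. BioCLIP 2, trained on the TreeOfLife-200M dataset, shows high accuracy and emergent properties across a range of biological vision tasks.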

Details

Motivation: To explore whether biological vision models trained at scale develop new capabilities beyond their initial training objective.

Method: Train BioCLIP 2 on the TreeOfLife-200M dataset (214 million images of living organisms), using contrastive learning to distinguish species.

Result: BioCLIP 2 performs strongly on tasks such as habitat classification and trait prediction, and its embedding space shows biologically meaningful structure both across and within species.

Conclusion: Large-scale training data and a contrastive objective encourage emergent properties, yielding a biologically meaningful embedding space.

Abstract: Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.
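
To make the phrase "contrastive vision-language training" concrete, the following is a minimal sketch of a generic CLIP-style symmetric contrastive loss in plain Python. It is not taken from the BioCLIP 2 paper: the function names, the temperature value of 0.07, and the toy embeddings are assumptions for illustration, and the paper's hierarchical supervision is not modeled here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embs[k], text_embs[k]).

    The temperature is an assumed illustrative value, not the paper's setting.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = sum(cross_entropy_row(columns[k], k) for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy check: matched image/text pairs should score a lower loss than mismatched ones.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting the loss pulls each image toward its own caption and away from the other captions in the batch, which is the sense in which the objective "distinguishes species" when captions are taxonomic labels.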

[2] Generating Fit Check Videos with a Handheld Camera

Bowei Chen,Brian Curless,Ira Kemelmacher-Shlizerman,Steven M. Seitz

Main category: cs.CV

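TL;DR: Proposes a method that generates full-body videos from a handheld mobile device and static photos, using a video diffusion model for realistic rendering.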

Details

Motivation: Conventional full-body video capture needs a mounted camera and a careful setup; the paper aims for a more convenient alternative.

Method: Takes two static photos (front and back) and an IMU motion reference as input, and generates the video with a video diffusion model and a multi-reference attention mechanism.

Result: Achieves realistic video synthesis and supports rendering into new scenes with consistent illumination and shadows.

Conclusion: The method offers an effective way to generate full-body videos conveniently and has potential for practical use.

Abstract: Self-captured full-body videos are popular, but most deployments require mounted cameras, carefully-framed shots, and repeated practice. We propose a more convenient solution that enables full-body video capture using handheld mobile devices. Our approach takes as input two static photos (front and back) of you in a mirror, along with an IMU motion reference that you perform while holding your mobile phone, and synthesizes a realistic video of you performing a similar target motion. We enable rendering into a new scene, with consistent illumination and shadows. We propose a novel video diffusion-based model to achieve this. Specifically, we propose a parameter-free frame generation strategy, as well as a multi-reference attention mechanism, that effectively integrate appearance information from both the front and back selfies into the video diffusion model. Additionally, we introduce an image-based fine-tuning strategy to enhance frame sharpness and improve the generation of shadows and reflections, achieving a more realistic human-scene composition.

[3] Cora: Correspondence-aware image editing using few step diffusion

Amirhossein Almohammadi,Aryan Mikaeili,Sauradip Nag,Negar Hassanpour,Andrea Tagliasacchi,Ali Mahdavi-Amiri

Main category: cs.CV

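TL;DR: Cora is a new image-editing framework that introduces correspondence-aware noise correction and interpolated attention maps to address the difficulty existing methods have with structural changes and texture preservation.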

Details

Motivation: Current diffusion-based image editing methods produce irrelevant textures or lose key attributes when an edit involves structural change, such as non-rigid deformation or object modification.

Method: Cora uses semantic correspondence to align textures and structures between the source and target images, combining noise correction with interpolated attention maps for accurate texture transfer and new-content generation.

Result: Experiments show that Cora excels at preserving structure, texture, and identity across edits including pose changes, object addition, and texture refinement.

Conclusion: In user studies Cora outperforms existing methods, giving better edits and more control.

Abstract: Image editing is an important task in computer graphics, vision, and VFX, with recent diffusion-based methods achieving fast and high-quality results. However, edits requiring significant structural changes, such as non-rigid deformations, object modifications, or content generation, remain challenging. Existing few-step editing approaches produce artifacts such as irrelevant texture or struggle to preserve key attributes of the source image (e.g., pose). We introduce Cora, a novel editing framework that addresses these limitations by introducing correspondence-aware noise correction and interpolated attention maps. Our method aligns textures and structures between the source and target images through semantic correspondence, enabling accurate texture transfer while generating new content when necessary. Cora offers control over the balance between content generation and preservation. Extensive experiments demonstrate that, quantitatively and qualitatively, Cora excels in maintaining structure, textures, and identity across diverse edits, including pose changes, object addition, and texture refinements. User studies confirm that Cora delivers superior results, outperforming alternatives.

[4] Representational Difference Explanations

Neehar Kondapaneni,Oisin Mac Aodha,Pietro Perona

Main category: cs.CV

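TL;DR: Proposes RDX, a method for discovering and visualizing the differences between two learned representations, enabling more direct and interpretable model comparison.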

Details

Motivation: Current explainable AI (XAI) methods support model comparison poorly. Comparison is fundamental to scientific analysis, yet existing tools struggle to do it effectively.

Method: Proposes Representational Differences Explanations (RDX) and validates it by comparing models with known conceptual differences.

Result: On the ImageNet and iNaturalist datasets, RDX reveals representational differences between models and subtle patterns in the data, outperforming existing XAI techniques.

Conclusion: RDX fills a gap in machine-learning tooling for model comparison, providing an effective and interpretable way to contrast models.

Abstract: We propose a method for discovering and visualizing the differences between two learned representations, enabling more direct and interpretable model comparisons. We validate our method, which we call Representational Differences Explanations (RDX), by using it to compare models with known conceptual differences and demonstrate that it recovers meaningful distinctions where existing explainable AI (XAI) techniques fail. Applied to state-of-the-art models on challenging subsets of the ImageNet and iNaturalist datasets, RDX reveals both insightful representational differences and subtle patterns in the data. Although comparison is a cornerstone of scientific analysis, current tools in machine learning, namely post hoc XAI methods, struggle to support model comparison effectively. Our work addresses this gap by introducing an effective and explainable tool for contrasting model representations.

[5] ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

David Ma,Huaqing Yuan,Xingjian Wang,Qianbo Zang,Tianci Liu,Xinyang He,Yanbin Wei,Jiawei Guo,Ni Jiahui,Zhenzhu Yang,Meng Cao,Shanghaoran Quan,Yizhi Li,Wangchunshu Zhou,Jiaheng Liu,Wenhao Huang,Ge Zhang,Shiwen Ni,Xiaojie Jin

Main category: cs.CV

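TL;DR: ScaleLong is a new benchmark for multi-timescale long-video understanding. It embeds questions at different timescales within the same video content, so model performance can be compared directly across timescales.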

Details

Motivation: Existing benchmarks either lack a unified multi-timescale design or scatter scale-specific questions across different videos, which prevents direct comparison of a model's performance across timescales.

Method: ScaleLong contains 269 long videos, each with 4-8 questions covering four timescales (seconds, tens of seconds, minutes, and hours).

Result: An evaluation of 23 MLLMs shows a U-shaped performance curve: accuracy is highest at the shortest and longest timescales and lower at intermediate ones. Increasing visual token capacity improves reasoning at all timescales.

Conclusion: ScaleLong provides a fine-grained, multi-timescale benchmark for long-video understanding and for advancing MLLM capabilities.

Abstract: Although long-video understanding demands that models capture hierarchical temporal information, from clip (seconds) and shot (tens of seconds) to event (minutes) and story (hours), existing benchmarks either neglect this multi-scale design or scatter scale-specific questions across different videos, preventing direct comparison of model performance across timescales on the same content. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales (clip, shot, event, and story), all within the same video content. This within-content multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 long videos (avg. 86 min) from 5 main categories and 36 sub-categories, with 4-8 carefully designed questions, including at least one question for each timescale. Evaluating 23 MLLMs reveals a U-shaped performance curve, with higher accuracy at the shortest and longest timescales and a dip at intermediate levels. Furthermore, ablation studies show that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong offers a fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at https://github.com/multimodal-art-projection/ScaleLong.
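
The per-timescale comparison described above amounts to grouping question results by timescale and computing accuracy within each group. The sketch below shows one way to do that; the record fields (`timescale`, `prediction`, `answer`) are hypothetical and do not reflect the actual ScaleLong data format.

```python
from collections import defaultdict

# The four timescale names come from the abstract; the record layout is assumed.
TIMESCALES = ("clip", "shot", "event", "story")


def accuracy_by_timescale(records):
    """Return {timescale: accuracy} for the timescales present in `records`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        scale = record["timescale"]
        total[scale] += 1
        correct[scale] += int(record["prediction"] == record["answer"])
    return {t: correct[t] / total[t] for t in TIMESCALES if total[t] > 0}


# Toy results for a single video with questions at three timescales.
toy_records = [
    {"timescale": "clip", "prediction": "A", "answer": "A"},
    {"timescale": "clip", "prediction": "B", "answer": "C"},
    {"timescale": "event", "prediction": "D", "answer": "A"},
    {"timescale": "story", "prediction": "B", "answer": "B"},
]
toy_accuracy = accuracy_by_timescale(toy_records)
```

Because every video carries questions at each timescale, a table built this way compares timescales on identical content rather than across different videos.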

[6] Point-MoE: Towards Cross-Domain Generalization in 3D Semantic Segmentation via Mixture-of-Experts

Xuweiyi Chen,Wentao Zhou,Aruni RoyChowdhury,Zezhou Cheng

Main category: cs.CV

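TL;DR: Point-MoE is a Mixture-of-Experts architecture for cross-domain generalization in 3D point cloud understanding. It specializes experts automatically without domain labels and clearly improves performance.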

Details

Motivation: The diversity of sensors and scenes makes 3D point cloud data heterogeneous across domains, so conventional methods struggle to train a single unified model across domains.

Method: Uses the Point-MoE architecture with a simple top-k routing strategy that automatically assigns expert modules to data from different domains.

Result: Point-MoE outperforms conventional methods on mixed-domain data and generalizes better to unseen domains.

Conclusion: Point-MoE offers a scalable approach to 3D understanding that needs no manual intervention.

Abstract: While scaling laws have transformed natural language processing and computer vision, 3D point cloud understanding has yet to reach that stage. This can be attributed to both the comparatively smaller scale of 3D datasets, as well as the disparate sources of the data itself. Point clouds are captured by diverse sensors (e.g., depth cameras, LiDAR) across varied domains (e.g., indoor, outdoor), each introducing unique scanning patterns, sampling densities, and semantic biases. Such domain heterogeneity poses a major barrier towards training unified models at scale, especially under the realistic constraint that domain labels are typically inaccessible at inference time. In this work, we propose Point-MoE, a Mixture-of-Experts architecture designed to enable large-scale, cross-domain generalization in 3D perception. We show that standard point cloud backbones degrade significantly in performance when trained on mixed-domain data, whereas Point-MoE with a simple top-k routing strategy can automatically specialize experts, even without access to domain labels. Our experiments demonstrate that Point-MoE not only outperforms strong multi-domain baselines but also generalizes better to unseen domains. This work highlights a scalable path forward for 3D understanding: letting the model discover structure in diverse 3D data, rather than imposing it via manual curation or domain supervision.
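
As a rough illustration of top-k routing in a Mixture-of-Experts layer, the sketch below picks the k highest-scoring experts for one input, renormalizes their router scores with a softmax, and mixes the expert outputs. It is a generic textbook-style sketch, not Point-MoE's implementation: the function names, the toy experts, and the choice to apply softmax over only the selected logits are all assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts for one token."""
    out = [0.0] * len(x)
    for index, weight in top_k_route(router_logits, k):
        y = experts[index](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts standing in for per-domain feed-forward blocks (hypothetical).
toy_experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [-a for a in v],
    lambda v: [a + 1.0 for a in v],
]
toy_routes = top_k_route([0.1, 3.0, 2.0], k=2)
toy_output = moe_forward([1.0], toy_experts, [0.1, 3.0, 3.0], k=2)
```

Note that the router sees only the input features, never a domain label, which is why this style of routing remains usable when domain labels are unavailable at inference time.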

[7] Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review

Adriano Fragomeni,Dima Damen,Michael Wray

Main category: cs.CV

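TL;DR: This survey reviews 81 research papers on text-to-video retrieval, focusing on methods that use auxiliary information (such as visual attributes, temporal and spatial context, and textual descriptions) to improve retrieval performance, and discusses directions for future research.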

Details

Motivation: Traditional methods rely only on aligning the video and text modalities and suffer from a semantic gap. Bringing in auxiliary information, such as visual attributes and temporal and spatial context, can improve retrieval performance.

Method: The survey analyzes 81 papers and summarizes how they use auxiliary information, including visual attributes, temporal and spatial context, and textual descriptions.

Result: The survey presents state-of-the-art results on benchmark datasets and discusses the available datasets and the auxiliary information they provide.

Conclusion: Future research should further explore how auxiliary information can improve text-to-video retrieval performance.

Abstract: Text-to-Video (T2V) retrieval aims to identify the most relevant item from a gallery of videos based on a user's text query. Traditional methods rely solely on aligning video and text modalities to compute the similarity and retrieve relevant items. However, recent advancements emphasise incorporating auxiliary information extracted from video and text modalities to improve retrieval performance and bridge the semantic gap between these modalities. Auxiliary information can include visual attributes, such as objects; temporal and spatial context; and textual descriptions, such as speech and rephrased captions. This survey comprehensively reviews 81 research papers on Text-to-Video retrieval that utilise such auxiliary information. It provides a detailed analysis of their methodologies; highlights state-of-the-art results on benchmark datasets; and discusses available datasets and their auxiliary information. Additionally, it proposes promising directions for future research, focusing on different ways to further enhance retrieval performance using this information.
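
The baseline described in the abstract, ranking videos by text-video similarity, can be sketched with cosine similarity over precomputed embeddings. The optional `alpha` term below is one hypothetical way to fold in an auxiliary embedding (for example, one computed from speech transcripts) by late fusion. The survey covers many different mechanisms, and this sketch is not taken from any of the reviewed papers; all names and the embedding layout are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_videos(query, videos, alpha=0.0):
    """Rank video ids by similarity to the text query embedding.

    videos: {video_id: {"visual": embedding, "aux": embedding}} (assumed layout).
    alpha:  weight of the auxiliary similarity; 0.0 gives the video-only baseline.
    """
    scores = {}
    for video_id, embs in videos.items():
        visual_score = cosine(query, embs["visual"])
        aux_score = cosine(query, embs["aux"])
        scores[video_id] = (1 - alpha) * visual_score + alpha * aux_score
    return sorted(scores, key=scores.get, reverse=True)


# Toy gallery: video "b" matches the query mainly through its auxiliary embedding.
toy_gallery = {
    "a": {"visual": [1.0, 0.0], "aux": [0.0, 1.0]},
    "b": {"visual": [0.8, 0.6], "aux": [1.0, 0.0]},
}
baseline_ranking = rank_videos([1.0, 0.0], toy_gallery)
fused_ranking = rank_videos([1.0, 0.0], toy_gallery, alpha=0.5)
```

In the toy gallery the ranking flips once the auxiliary signal is included, which is the kind of effect auxiliary information is meant to have when visual alignment alone is ambiguous.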

[8] MangoLeafViT: Leveraging Lightweight Vision Transformer with Runtime Augmentation for Efficient Mango Leaf Disease Classification

Rafi Hassan Chowdhury,Sabbir Ahmed

Main category: cs.CV

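TL;DR: Proposes a lightweight Vision Transformer approach to mango leaf disease classification that is computationally efficient, suited to low-end devices, and reaches 99.43% accuracy.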

Details

Motivation: Mango diseases have a large impact on the agricultural economy of South Asia, and existing deep learning methods are computationally expensive and hard to deploy on low-end devices.

Method: Uses a lightweight Vision Transformer with a self-attention mechanism and adds runtime augmentation to improve performance.

Result: Reaches 99.43% accuracy on the MangoLeafBD dataset, with better model size, parameter count, and FLOPs than existing methods.

Conclusion: The method is efficient and practical, offering a low-cost solution for mango leaf disease classification.

Abstract: Ensuring food safety is critical due to its profound impact on public health, economic stability, and global supply chains. Cultivation of Mango, a major agricultural product in several South Asian countries, faces high financial losses due to different diseases, affecting various aspects of the entire supply chain. While deep learning-based methods have been explored for mango leaf disease classification, there remains a gap in designing solutions that are computationally efficient and compatible with low-end devices. In this work, we propose a lightweight Vision Transformer-based pipeline with a self-attention mechanism to classify mango leaf diseases, achieving state-of-the-art performance with minimal computational overhead. Our approach leverages global attention to capture intricate patterns among disease types and incorporates runtime augmentation for enhanced performance. Evaluation on the MangoLeafBD dataset demonstrates a 99.43% accuracy, outperforming existing methods in terms of model size, parameter count, and FLOPs count.

[9] VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL

Yichen Feng,Zhangchen Xu,Fengqing Jiang,Yuetai Li,Bhaskar Ramasubramanian,Luyao Niu,Bill Yuchen Lin,Radha Poovendran

Main category: cs.CV

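TL;DR: The paper proposes VisualSphinx, a large-scale synthetic training dataset for visual logical reasoning, to address the lack of structured training data for vision language models (VLMs) on logical reasoning tasks. A rule-to-image synthesis pipeline generates images with grounded answers, and experiments show that the data improves VLM logical reasoning.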

Details

Motivation: Current vision language models (VLMs) lack large-scale, well-structured training data for logical reasoning tasks, which limits their performance.

Method: Proposes the VisualSphinx dataset, built with a rule-to-image synthesis pipeline that extracts and expands puzzle rules from seed questions and generates images with grounded answers.

Result: Experiments show that VLMs trained on VisualSphinx perform better on logical reasoning tasks, and that the gains carry over to other tasks such as algebraic, arithmetic, and geometric reasoning.

Conclusion: VisualSphinx provides high-quality logical reasoning training data for VLMs, improves their multimodal reasoning, and has broad potential for application.

Abstract: Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale and well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic visual logical reasoning training dataset. To tackle the challenge of image synthesis with grounding answers, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates the code for grounded image synthesis used in puzzle sample assembly. Experiments demonstrate that VLMs trained using GRPO on VisualSphinx benefit from the logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also benefit other reasoning tasks such as algebraic reasoning, arithmetic reasoning and geometry reasoning.

[10] DeepTopoNet: A Framework for Subglacial Topography Estimation on the Greenland Ice Sheets

Bayu Adhi Tama,Mansa Krishna,Homayra Alam,Mostafa Cham,Omar Faruque,Gong Cheng,Jianwu Wang,Mathieu Morlighem,Vandana Janeja

Main category: cs.CV

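TL;DR: Proposes DeepTopoNet, a deep learning framework that integrates radar and BedMachine data to reconstruct the topography beneath the Greenland ice sheet more accurately.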

Details

Motivation: The topography beneath the Greenland ice sheet is critical for projecting its future mass loss and contribution to global sea-level rise, but existing observations are sparse and complex, which adds uncertainty to model projections.

Method: Integrates radar and BedMachine data through a dynamic loss-balancing mechanism, adds gradient-based and trend surface features, and uses a CNN architecture for subgrid-scale predictions.

Result: In tests on the Upernavik Isstrøm region, the model achieves high accuracy and outperforms baseline methods.

Conclusion: Deep learning can bridge observational gaps effectively, providing a scalable and efficient way to infer subglacial topography.

Abstract: Understanding Greenland's subglacial topography is critical for projecting the future mass loss of the ice sheet and its contribution to global sea-level rise. However, the complex and sparse nature of observational data, particularly information about the bed topography under the ice sheet, significantly increases the uncertainty in model projections. Bed topography is traditionally measured by airborne ice-penetrating radar that measures the ice thickness directly underneath the aircraft, leaving data gaps of tens of kilometers in between flight lines. This study introduces a deep learning framework, which we call DeepTopoNet, that integrates radar-derived ice thickness observations and BedMachine Greenland data through a novel dynamic loss-balancing mechanism. Among all efforts to reconstruct bed topography, BedMachine has emerged as one of the most widely used datasets, combining mass conservation principles and ice thickness measurements to generate high-resolution bed elevation estimates. The proposed loss function adaptively adjusts the weighting between radar and BedMachine data, ensuring robustness in areas with limited radar coverage while leveraging the high spatial resolution of BedMachine predictions (i.e., bed estimates). Our approach incorporates gradient-based and trend surface features to enhance model performance and utilizes a CNN architecture designed for subgrid-scale predictions. By systematically testing on the Upernavik Isstrøm region, the model achieves high accuracy, outperforming baseline methods in reconstructing subglacial terrain. This work demonstrates the potential of deep learning in bridging observational gaps, providing a scalable and efficient solution to inferring subglacial topography.

Vaishnav Ramesh,Junliang Liu,Haining Wang,Md Jahidul Islam

Main category: cs.CV

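TL;DR: The paper proposes DGIQA, an NR-IQA method that combines depth-guided cross-attention with a Transformer-CNN bridge and clearly improves no-reference image quality assessment.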

Details

Motivation: NR-IQA lacks objective generalization to unseen natural distortions.

Method: Introduces the Depth-CAR mechanism and the TCB module to fuse depth information with multimodal features.

Result: DGIQA reaches SOTA performance on synthetic and authentic datasets and stands out in cross-dataset evaluation and in assessing natural distortions.

Conclusion: Through depth guidance and multimodal feature fusion, DGIQA improves the generalization and assessment quality of NR-IQA.

Abstract: A long-held challenge in no-reference image quality assessment (NR-IQA) learning from human subjective perception is the lack of objective generalization to unseen natural distortions. To address this, we integrate a novel Depth-Guided cross-attention and refinement (Depth-CAR) mechanism, which distills scene depth and spatial features into a structure-aware representation for improved NR-IQA. This brings in the knowledge of object saliency and relative contrast of the scene for more discriminative feature learning. Additionally, we introduce the idea of TCB (Transformer-CNN Bridge) to fuse high-level global contextual dependencies from a transformer backbone with local spatial features captured by a set of hierarchical CNN (convolutional neural network) layers. We implement TCB and Depth-CAR as multimodal attention-based projection functions to select the most informative features, which also improve training time and inference efficiency. Experimental results demonstrate that our proposed DGIQA model achieves state-of-the-art (SOTA) performance on both synthetic and authentic benchmark datasets. More importantly, DGIQA outperforms SOTA models on cross-dataset evaluations as well as in assessing natural image distortions such as low-light effects, hazy conditions, and lens flares.

[12] Preemptive Hallucination Reduction: An Input-Level Approach for Multimodal Language Model

Nokimul Hasan Arif,Shadman Rabby,Md Hefzul Hossain Papon,Sabbir Ahmed

Main category: cs.CV

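TL;DR: The paper proposes an ensemble-based preprocessing framework that adaptively selects an input filtering method (noise reduction, edge enhancement, or the unaltered input) and substantially reduces visual hallucination in Large Language Models (LLMs) without changing the model architecture or training pipeline.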

Details

Motivation: Visual hallucination undermines the reliability of LLMs. Current research concentrates on post-hoc correction or model fine-tuning and has explored preprocessing techniques very little.

Method: Proposes an ensemble framework that adaptively selects a preprocessing method (noise reduction, edge enhancement, or the unaltered input) according to the type of question.

Result: On the HaloQuest dataset the hallucination rate falls by 44.3%, as measured by NLI scores using SelfCheckGPT.

Conclusion: Intelligent input preprocessing can substantially improve the factual grounding of LLMs, paving the way for more reliable multimodal systems.

Abstract: Visual hallucinations in Large Language Models (LLMs), where the model generates responses that are inconsistent with the visual input, pose a significant challenge to their reliability, particularly in contexts where precise and trustworthy outputs are critical. Current research largely emphasizes post-hoc correction or model-specific fine-tuning strategies, with limited exploration of preprocessing techniques to address hallucination issues at the input stage. This study presents a novel ensemble-based preprocessing framework that adaptively selects the most appropriate filtering approach, noise reduced (NR), edge enhanced (EE), or unaltered input (org), based on the type of question posed, resulting in reduced hallucination without requiring any modifications to the underlying model architecture or training pipeline. Evaluated on the HaloQuest dataset, a benchmark designed to test multimodal reasoning on visually complex inputs, our method achieves a 44.3% reduction in hallucination rates, as measured by Natural Language Inference (NLI) scores using SelfCheckGPT. This demonstrates that intelligent input conditioning alone can significantly enhance factual grounding in LLM responses. The findings highlight the importance of adaptive preprocessing techniques in mitigating hallucinations, paving the way for more reliable multimodal systems capable of addressing real-world challenges.

[13] Multi-Group Proportional Representation for Text-to-Image Models

Sangwon Jung,Alex Oesterling,Claudio Mayrink Verdun,Sajani Vithana,Taesup Moon,Flavio P. Calmon

Main category: cs.CV

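TL;DR: Proposes a new framework that uses the Multi-Group Proportional Representation (MPR) metric to systematically evaluate how well text-to-image models represent intersectional groups, and develops an optimization algorithm to improve the models.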

Details

Motivation: As text-to-image generative models spread, the representational bias and stereotyping they can cause become more pressing, yet there is no systematic way to measure and control these harms.

Method: Introduces the MPR metric, which evaluates the worst-case deviation in representation across groups in generated images, and develops an algorithm that optimizes models for balanced group representation.

Result: Experiments show that MPR measures representation across multiple intersectional groups effectively and, when used as a training objective, improves balance while maintaining generation quality.

Conclusion: MPR is a flexible and effective tool for measuring and optimizing representation in generative models and helps reduce bias in AI-generated content.

Abstract: Text-to-image (T2I) generative models can create vivid, realistic images from textual descriptions. As these models proliferate, they expose new concerns about their ability to represent diverse demographic groups, propagate stereotypes, and efface minority populations. Despite growing attention to the "safe" and "responsible" design of artificial intelligence (AI), there is no established methodology to systematically measure and control representational harms in image generation. This paper introduces a novel framework to measure the representation of intersectional groups in images generated by T2I models by applying the Multi-Group Proportional Representation (MPR) metric. MPR evaluates the worst-case deviation of representation statistics across given population groups in images produced by a generative model, allowing for flexible and context-specific measurements based on user requirements. We also develop an algorithm to optimize T2I models for this metric. Through experiments, we demonstrate that MPR can effectively measure representation statistics across multiple intersectional groups and, when used as a training objective, can guide models toward a more balanced generation across demographic groups while maintaining generation quality.
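
The idea of a "worst-case deviation of representation statistics across given population groups" can be illustrated with a deliberately simplified sketch: for a finite list of groups, compare each group's share among generated images with a target share and report the largest gap. This is an assumption-laden simplification rather than the paper's definition of MPR; the attribute names, the predicates, and the target proportions are all hypothetical.

```python
def worst_case_deviation(samples, groups, target):
    """Return (group_name, deviation) for the group furthest from its target share.

    samples: list of attribute dicts, one per generated image (assumed to come
             from some attribute classifier, which is not modeled here).
    groups:  {name: predicate} where predicate(sample) -> bool; predicates may
             combine attributes to express intersectional groups.
    target:  {name: desired proportion in [0, 1]}.
    """
    n = len(samples)
    deviations = {}
    for name, predicate in groups.items():
        share = sum(1 for s in samples if predicate(s)) / n
        deviations[name] = abs(share - target[name])
    worst = max(deviations, key=deviations.get)
    return worst, deviations[worst]


# Toy annotations for four generated images (hypothetical attributes).
toy_samples = [
    {"gender": "f", "age": "young"},
    {"gender": "m", "age": "young"},
    {"gender": "m", "age": "old"},
    {"gender": "m", "age": "old"},
]
toy_groups = {
    "female": lambda s: s["gender"] == "f",
    "male_and_old": lambda s: s["gender"] == "m" and s["age"] == "old",
}
toy_target = {"female": 0.5, "male_and_old": 0.4}
worst_group, worst_gap = worst_case_deviation(toy_samples, toy_groups, toy_target)
```

Taking the maximum rather than the average is what makes the measurement worst-case: a model cannot hide a badly underrepresented intersectional group behind good averages elsewhere.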

[14] DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

Chenbin Pan,Wenbin He,Zhengzhong Tu,Liu Ren

Main category: cs.CV

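TL;DR: DINO-R1 is the first attempt to improve the in-context reasoning of vision foundation models with reinforcement learning. It proposes the GRQO training strategy and clearly outperforms conventional methods.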

Details

Motivation: Reinforcement learning has markedly improved the reasoning of large language models, but comparable abilities in vision foundation models such as the DINO series remain underexplored.

Method: Proposes the GRQO strategy, adds KL-regularization to stabilize training, and builds the DINO-R1 model family on Grounding-DINO.

Result: On COCO, LVIS, and ODinW, DINO-R1 clearly outperforms supervised fine-tuning baselines and generalizes well.

Conclusion: DINO-R1 brings reinforcement learning into vision foundation models and offers a new approach to visual reasoning tasks.

Abstract: The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose DINO-R1, the first such attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces Group Relative Query Optimization (GRQO), a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply KL-regularization to stabilize the objectness distribution to reduce the training instability. This joint optimization enables dense and expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.
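
"Group-normalized" rewards, in the GRPO sense the abstract refers to, are usually obtained by standardizing each reward against the mean and standard deviation of its group. The sketch below shows only that normalization step. Treating a set of per-query alignment scores as the group is an assumption for illustration; GRQO's actual reward definition and the KL-regularization term are not reproduced here.

```python
import math


def group_normalize(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps).

    eps is an assumed small constant that avoids division by zero when all
    rewards in the group are equal.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical alignment scores for three queries in the same group.
toy_advantages = group_normalize([1.0, 2.0, 3.0])
flat_advantages = group_normalize([2.0, 2.0])
```

After normalization, queries that align better than their group average receive positive signal and the rest receive negative signal, so the scale of the raw reward no longer matters.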

[15] MaskAdapt: Unsupervised Geometry-Aware Domain Adaptation Using Multimodal Contextual Learning and RGB-Depth Masking

Numair Nadeem,Muhammad Hamza Asad,Saeed Anwar,Abdul Bais

Main category: cs.CV

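TL;DR: MaskAdapt is a multimodal contextual learning method that combines RGB images with features derived from depth data. It clearly improves crop-weed semantic segmentation and addresses the weakness of existing unsupervised domain adaptation methods under occlusion and visual blending.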

Details

Motivation: Existing crop-weed semantic segmentation methods depend on pixel-level annotations and adapt poorly to different field environments (domain shift), so models generalize badly.

Method: Proposes MaskAdapt, which captures spatial transitions with depth gradients, refines RGB features through a cross-attention mechanism, and strengthens spatial context learning with a geometry-aware masking strategy.

Result: On real agricultural datasets, MaskAdapt clearly outperforms existing unsupervised domain adaptation methods and improves segmentation mIOU.

Conclusion: Through multimodal learning and a geometry-aware strategy, MaskAdapt handles occlusion and visual blending in domain adaptation and improves segmentation performance.

Abstract: Semantic segmentation of crops and weeds is crucial for site-specific farm management; however, most existing methods depend on labor-intensive pixel-level annotations. A further challenge arises when models trained on one field (source domain) fail to generalize to new fields (target domain) due to domain shifts, such as variations in lighting, camera setups, soil composition, and crop growth stages. Unsupervised Domain Adaptation (UDA) addresses this by enabling adaptation without target-domain labels, but current UDA methods struggle with occlusions and visual blending between crops and weeds, leading to misclassifications in real-world conditions. To overcome these limitations, we introduce MaskAdapt, a novel approach that enhances segmentation accuracy through multimodal contextual learning by integrating RGB images with features derived from depth data. By computing depth gradients from depth maps, our method captures spatial transitions that help resolve texture ambiguities. These gradients, through a cross-attention mechanism, refine RGB feature representations, resulting in sharper boundary delineation. In addition, we propose a geometry-aware masking strategy that applies horizontal, vertical, and stochastic masks during training. This encourages the model to focus on the broader spatial context for robust visual recognition. Evaluations on real agricultural datasets demonstrate that MaskAdapt consistently outperforms existing State-of-the-Art (SOTA) UDA methods, achieving improved segmentation mean Intersection over Union (mIOU) across diverse field conditions.
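
To show what "computing depth gradients from depth maps" can look like, here is a minimal sketch using forward finite differences on a depth map stored as a list of rows. The choice of forward differences, the zero padding at the last row and column, and the function name are assumptions; the paper's exact gradient operator is not specified in the abstract.

```python
import math


def depth_gradient_magnitude(depth):
    """Per-pixel gradient magnitude of a 2D depth map (list of equal-length rows).

    Uses forward differences; the last column and last row get a zero
    difference in the corresponding direction (an assumed boundary choice).
    """
    height, width = len(depth), len(depth[0])
    magnitude = []
    for r in range(height):
        row = []
        for c in range(width):
            gx = depth[r][c + 1] - depth[r][c] if c + 1 < width else 0.0
            gy = depth[r + 1][c] - depth[r][c] if r + 1 < height else 0.0
            row.append(math.sqrt(gx * gx + gy * gy))
        magnitude.append(row)
    return magnitude


# Toy depth map with a vertical step edge between columns 1 and 2,
# such as a leaf in front of the soil.
toy_depth = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
]
toy_edges = depth_gradient_magnitude(toy_depth)
```

The response is non-zero only where depth changes abruptly, which is the kind of spatial transition that can separate overlapping plants with similar color and texture.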

[16] SIM: A mapping framework for built environment auditing based on street view imagery

Huan Ning,Zhenlong Li,Manzhu Yu,Wenpeng Yin

Main category: cs.CV

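TL;DR: Proposes an open-source mapping framework based on street view imagery that automatically measures and maps ground objects, localizes objects of known dimension in 3D, and measures diameters, improving the efficiency and accuracy of built environment auditing.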

Details

Motivation: Traditional built environment audits rely on time-consuming and costly manual surveys. Street view imagery and deep learning make remote auditing possible, but general-purpose mapping methods and tools are lacking.

Method: Develops an open-source framework with three pipelines: width measurement for ground objects, 3D localization for objects with a known dimension, and diameter measurement.

Result: Three case studies (road width measurement, stop sign localization, and street tree diameter measurement) demonstrate the framework.

Conclusion: The framework gives built environment auditing an automated, accurate solution that improves auditing efficiency and accuracy.

Abstract: Built environment auditing refers to the systematic documentation and assessment of urban and rural spaces' physical, social, and environmental characteristics, such as walkability, road conditions, and traffic lights. It is used to collect data for the evaluation of how built environments impact human behavior, health, mobility, and overall urban functionality. Traditionally, built environment audits were conducted using field surveys and manual observations, which were time-consuming and costly. The emerging street view imagery, e.g., Google Street View, has become a widely used data source for conducting built environment audits remotely. Deep learning and computer vision techniques can extract and classify objects from street images to enhance auditing productivity. Before meaningful analysis, the detected objects need to be geospatially mapped for accurate documentation. However, the mapping methods and tools based on street images are underexplored, and there are no universal frameworks or solutions yet, imposing difficulties in auditing the street objects. In this study, we introduced an open source street view mapping framework, providing three pipelines to map and measure: 1) width measurement for ground objects, such as roads; 2) 3D localization for objects with a known dimension (e.g., doors and stop signs); and 3) diameter measurements (e.g., street trees). These pipelines can help researchers, urban planners, and other professionals automatically measure and map target objects, promoting built environment auditing productivity and accuracy. Three case studies, including road width measurement, stop sign localization, and street tree diameter measurement, are provided in this paper to showcase pipeline usage.
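
Localizing an object with a known dimension typically rests on the pinhole camera relation: distance = focal length (in pixels) x real size / size in pixels. The sketch below applies that textbook relation and is not the SIM framework's code. It assumes a perspective (non-panoramic) image with a known focal length, and the sign height of 0.75 m and all other numbers are illustrative assumptions.

```python
import math


def distance_from_known_height(focal_px, real_height_m, pixel_height):
    """Distance to an object under the pinhole model: f * H / h."""
    return focal_px * real_height_m / pixel_height


def bearing_offset_deg(pixel_x, image_width, focal_px):
    """Horizontal angle of a pixel column relative to the optical axis, in degrees."""
    return math.degrees(math.atan((pixel_x - image_width / 2) / focal_px))


# Hypothetical detection: a sign assumed to be 0.75 m tall appears 50 px tall
# in an image taken with an assumed focal length of 1000 px.
toy_distance_m = distance_from_known_height(1000.0, 0.75, 50.0)
toy_bearing_deg = bearing_offset_deg(pixel_x=1500.0, image_width=2000.0, focal_px=1000.0)
```

Combining the distance and bearing offset with the camera's own position and heading would give a map coordinate for the object; that last step depends on the imagery's metadata and is left out here.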

[17] ComposeAnything: Composite Object Priors for Text-to-Image Generation

Zeeshan Khan,Shizhe Chen,Cordelia Schmid

Main category: cs.CV

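TL;DR: ComposeAnything combines LLM reasoning with 2.5D semantic layouts to improve text-to-image generation quality, and performs especially well on complex object arrangements and 3D positioning.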

Details

Motivation: Existing text-to-image models handle complex object arrangements and 3D positioning poorly. Layout-based methods improve 2D arrangement but sacrifice quality and coherence.

Method: Uses an LLM to generate a 2.5D semantic layout (with depth information), builds a spatial and depth-aware coarse composite from it as a prior for the diffusion model, and generates the image through object prior reinforcement and spatial-controlled denoising.

Result: Outperforms existing methods on the T2I-CompBench and NSR-1K benchmarks, especially for 2D/3D spatial arrangements, high object counts, and surreal compositions.

Conclusion: ComposeAnything generates high-quality images that are faithful to the text, addressing the challenges of complex object arrangement and 3D positioning.

Abstract: Generating images from text involving complex and novel object arrangements remains a significant challenge for current text-to-image (T2I) models. Although prior layout-based methods improve object arrangements using spatial constraints with 2D layouts, they often struggle to capture 3D positioning and sacrifice quality and coherence. In this work, we introduce ComposeAnything, a novel framework for improving compositional image generation without retraining existing T2I models. Our approach first leverages the chain-of-thought reasoning abilities of LLMs to produce 2.5D semantic layouts from text, consisting of 2D object bounding boxes enriched with depth information and detailed captions. Based on this layout, we generate a spatial and depth aware coarse composite of objects that captures the intended composition, serving as a strong and interpretable prior that replaces stochastic noise initialization in diffusion-based T2I models. This prior guides the denoising process through object prior reinforcement and spatial-controlled denoising, enabling seamless generation of compositional objects and coherent backgrounds, while allowing refinement of inaccurate priors. ComposeAnything outperforms state-of-the-art methods on the T2I-CompBench and NSR-1K benchmarks for prompts with 2D/3D spatial arrangements, high object counts, and surreal compositions. Human evaluations further demonstrate that our model generates high-quality images with compositions that faithfully reflect the text.
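
A "2.5D semantic layout", as the abstract describes it, is a set of 2D bounding boxes each enriched with a depth value and a caption. The sketch below is one hypothetical way to represent such a layout and to order objects for pasting into a coarse composite. The field names, the normalized box format, and the convention that a larger depth means farther from the camera are assumptions, not the paper's schema.

```python
from dataclasses import dataclass


@dataclass
class LayoutObject:
    caption: str   # detailed per-object caption
    box: tuple     # (x0, y0, x1, y1) in normalized [0, 1] image coordinates (assumed)
    depth: float   # assumed convention: larger value = farther from the camera


def compositing_order(objects):
    """Order objects far-to-near so nearer objects are pasted last and occlude farther ones."""
    return sorted(objects, key=lambda obj: obj.depth, reverse=True)


# Hypothetical layout for the prompt "a cat in front of a sofa, a lamp behind the sofa".
toy_layout = [
    LayoutObject("a grey tabby cat", (0.40, 0.55, 0.65, 0.90), depth=1.0),
    LayoutObject("a floor lamp", (0.75, 0.10, 0.90, 0.70), depth=3.0),
    LayoutObject("a green sofa", (0.15, 0.35, 0.85, 0.85), depth=2.0),
]
toy_order = [obj.caption for obj in compositing_order(toy_layout)]
```

Pasting in this order is a painter's-algorithm approximation of occlusion, which is what lets depth in the layout resolve "in front of" and "behind" relations that a flat 2D layout cannot express.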

[18] Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors

Peiran Xu,Yadong Mu

Main category: cs.CV

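TL;DR: The paper proposes a weakly supervised affordance grounding method that uses pseudo labels and foundation models to improve affordance learning, clearly outperforming existing methods.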

Details

Motivation: Existing methods build on class activation maps, which work for semantic segmentation but are poorly suited to locating actions and functions.

Method: A supervised training pipeline based on pseudo labels, combined with label refinement, fine-grained feature alignment, and a lightweight reasoning module.

Result: Experiments show that the model clearly outperforms existing methods.

Conclusion: By drawing on the semantic knowledge in foundation models, the method bridges the gap between objects and actions.

Abstract: In this work, we focus on the task of weakly supervised affordance grounding, where a model is trained to identify affordance regions on objects using human-object interaction images and egocentric object images without dense labels. Previous works are mostly built upon class activation maps, which are effective for semantic segmentation but may not be suitable for locating actions and functions. Leveraging recent advanced foundation models, we develop a supervised training pipeline based on pseudo labels. The pseudo labels are generated from an off-the-shelf part segmentation model, guided by a mapping from affordance to part names. Furthermore, we introduce three key enhancements to the baseline model: a label refining stage, a fine-grained feature alignment process, and a lightweight reasoning module. These techniques harness the semantic knowledge of static objects embedded in off-the-shelf foundation models to improve affordance learning, effectively bridging the gap between objects and actions. Extensive experiments demonstrate that the performance of the proposed model has achieved a breakthrough improvement over existing methods. Our codes are available at https://github.com/woyut/WSAG-PLSP .

[19] Federated Foundation Model for GI Endoscopy Images

Alina Devkota,Annahita Amireskandari,Joel Palko,Shyam Thakkar,Donald Adjeroh,Xiajun Jiang,Binod Bhattarai,Prashnna K. Gyawali

Main category: cs.CV

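TL;DR: Proposes a federated learning (FL) framework for training foundation models for gastrointestinal endoscopy imaging, addressing medical data privacy while performing well on classification, detection, and segmentation tasks.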

Details

Motivation: Gastrointestinal endoscopy is essential for early disease detection, but deep learning models need expensive labeled data. Foundation models offer a solution through general-purpose representation learning, yet privacy restrictions on medical data make them hard to train. Method: A federated learning framework keeps data inside local hospital environments while still contributing to a shared model. Several FL algorithms are assessed for suitability without task labels, with experiments in both homogeneous and heterogeneous settings. Result: The trained foundation model performs well on three downstream tasks (classification, detection, and segmentation), supporting the effectiveness of the federated framework in a privacy-preserving setting. Conclusion: The proposed federated approach successfully trains a foundation model, addresses medical data privacy, improves performance across multiple tasks, and has practical potential. Abstract: Gastrointestinal (GI) endoscopy is essential in identifying GI tract abnormalities in order to detect diseases in their early stages and improve patient outcomes. Although deep learning has shown success in supporting GI diagnostics and decision-making, these models require curated datasets with labels that are expensive to acquire. Foundation models offer a promising solution by learning general-purpose representations, which can be finetuned for specific tasks, overcoming data scarcity. Developing foundation models for medical imaging holds significant potential, but the sensitive and protected nature of medical data presents unique challenges. Foundation model training typically requires extensive datasets, and while hospitals generate large volumes of data, privacy restrictions prevent direct data sharing, making foundation model training infeasible in most scenarios. In this work, we propose an FL framework for training foundation models for gastroendoscopy imaging, enabling data to remain within local hospital environments while contributing to a shared model. We explore several established FL algorithms, assessing their suitability for training foundation models without relying on task-specific labels, conducting experiments in both homogeneous and heterogeneous settings. We evaluate the trained foundation model on three critical downstream tasks--classification, detection, and segmentation--and demonstrate that it achieves improved performance across all tasks, highlighting the effectiveness of our approach in a federated, privacy-preserving setting.

[20] CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs

Ai Jian,Weijie Qiu,Xiaokun Wang,Peiyu Wang,Yunzhuo Hao,Jiangbo Pei,Yichen Wei,Yi Peng,Xuchen Song

Main category: cs.CV

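TL;DR: CSVQA is a multimodal benchmark designed specifically to evaluate scientific reasoning, filling the gap left by existing benchmarks in assessing scientific contexts.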

Details

Motivation: Current multimodal benchmarks mainly evaluate generic image comprehension or text-driven reasoning and lack scientific contexts that require combining domain knowledge with visual evidence analysis. Method: Introduces the CSVQA benchmark, with 1,378 question-answer pairs across STEM disciplines that demand domain knowledge, integration of visual evidence, and higher-order reasoning. Result: An evaluation of 15 VLMs shows notable performance gaps; the best model reaches only 49.6% accuracy. Conclusion: CSVQA exposes the shortcomings of VLMs in scientific reasoning and calls for further research. Abstract: Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate generic image comprehension or text-driven reasoning, lacking authentic scientific contexts that require domain-specific knowledge integration with visual evidence analysis. To fill this gap, we present CSVQA, a diagnostic multimodal benchmark specifically designed for evaluating scientific reasoning through domain-grounded visual question answering. Our benchmark features 1,378 carefully constructed question-answer pairs spanning diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to prior multimodal benchmarks, CSVQA places greater emphasis on real-world scientific content and complex reasoning. We additionally propose a rigorous evaluation protocol to systematically assess whether model predictions are substantiated by valid intermediate reasoning steps based on curated explanations. Our comprehensive evaluation of 15 VLMs on this benchmark reveals notable performance disparities, as even the top-ranked proprietary model attains only 49.6% accuracy. This empirical evidence underscores the pressing need for advancing scientific reasoning capabilities in VLMs. Our CSVQA is released at https://huggingface.co/datasets/Skywork/CSVQA.

[21] S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation

Yichen Xie,Runsheng Xu,Tong He,Jyh-Jing Hwang,Katie Luo,Jingwei Ji,Hubert Lin,Letian Chen,Yiren Lu,Zhaoqi Leng,Dragomir Anguelov,Mingxing Tan

Main category: cs.CV

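TL;DR: S4-Driver is a self-supervised motion planning algorithm built on a multimodal large language model. It uses a sparse volume strategy to lift 2D visual representations into 3D space without fine-tuning the vision encoder, and it compares favorably with existing supervised methods.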

Details

Motivation: Current end-to-end motion planning methods built on multimodal large language models are usually pretrained in 2D image space, whereas autonomous driving plans in 3D space, leaving a gap in the input representation space. Method: Proposes S4-Driver, which uses a sparse volume strategy to transform the visual representation of the PaLI model from perspective view to 3D space and aggregates multi-view and multi-frame visual inputs. Result: On nuScenes and the Waymo dataset, S4-Driver compares favorably with existing supervised multi-task methods while requiring no human annotations, and it scales well. Conclusion: Through self-supervised learning and a 3D visual representation, S4-Driver offers an efficient and scalable solution for autonomous driving motion planning. Abstract: The latest advancements in multi-modal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches--which directly learn from sensor inputs to generate planning trajectories without human annotations--often underperform the state of the art. We observe a key gap in the input representation space: end-to-end approaches built on MLLMs are often pretrained with reasoning tasks in 2D image space rather than the native 3D space in which autonomous vehicles plan. To this end, we propose S4-Driver, a scalable self-supervised motion planning algorithm with spatio-temporal visual representation, based on the popular PaLI multimodal large language model. S4-Driver uses a novel sparse volume strategy to seamlessly transform the strong visual representation of MLLMs from perspective view to 3D space without the need to finetune the vision encoder. This representation aggregates multi-view and multi-frame visual inputs and enables better prediction of planning trajectories in 3D space. To validate our method, we run experiments on both nuScenes and Waymo Open Motion Dataset (with in-house camera data). Results show that S4-Driver performs favorably against existing supervised multi-task approaches while requiring no human annotations. It also demonstrates great scalability when pretrained on large volumes of unannotated driving logs.

[22] The Butterfly Effect in Pathology: Exploring Security in Pathology Foundation Models

Jiashuai Liu,Yingjia Shang,Yingkang Zhan,Di Zhang,Yi Niu,Dong Wei,Xian Wu,Zeyu Gao,Chen Li,Yefeng Zheng

Main category: cs.CV

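TL;DR: The paper presents the first systematic study of how vulnerable pathology foundation models are to adversarial attacks in whole slide image analysis, proposes a label-free attack framework, and validates it experimentally.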

Details

Motivation: As pathology foundation models are widely adopted in research and clinical decision support systems, their security is an increasingly pressing concern, yet their vulnerability to adversarial attacks has not been adequately studied. Method: Introduces the principle of "local perturbation with global impact", designs a label-free attack framework, redefines the perturbation budget based on WSI characteristics, and revises four classical white-box attack methods. Result: Experiments show that modifying only 0.1% of the patches per slide can reduce downstream accuracy by up to 20%; the paper also analyzes key factors behind attack success and potential defence strategies. Conclusion: The study lays the groundwork for the adversarial robustness and reliable deployment of pathology foundation models, and the code is released for further research. Abstract: With the widespread adoption of pathology foundation models in both research and clinical decision support systems, exploring their security has become a critical concern. However, despite their growing impact, the vulnerability of these models to adversarial attacks remains largely unexplored. In this work, we present the first systematic investigation into the security of pathology foundation models for whole slide image (WSI) analysis against adversarial attacks. Specifically, we introduce the principle of "local perturbation with global impact" and propose a label-free attack framework that operates without requiring access to downstream task labels. Under this attack framework, we revise four classical white-box attack methods and redefine the perturbation budget based on the characteristics of WSI. We conduct comprehensive experiments on three representative pathology foundation models across five datasets and six downstream tasks. Despite modifying only 0.1% of patches per slide with imperceptible noise, our attack leads to downstream accuracy degradation that can reach up to 20% in the worst cases. Furthermore, we analyze key factors that influence attack success, explore the relationship between patch-level vulnerability and semantic content, and conduct a preliminary investigation into potential defence strategies. These findings lay the groundwork for future research on the adversarial robustness and reliable deployment of pathology foundation models. Our code is publicly available at: https://github.com/Jiashuai-Liu-hmos/Attack-WSI-pathology-foundation-models.

[23] Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction

Chenyou Fan,Fangzheng Yan,Chenjia Bai,Jiepeng Wang,Chi Zhang,Zhen Wang,Xuelong Li

Main category: cs.CV

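TL;DR: Proposes a bimanual manipulation policy based on text-to-video models, using two-stage training (text-to-flow and flow-to-video) to reduce language ambiguity and lower data requirements.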

Details

Motivation: Existing methods rely on single-arm data or pretrained models and generalize poorly to bimanual manipulation, mainly because bimanual data is scarce and single-arm and bimanual manipulation differ fundamentally. Method: Fine-tunes a text-to-video model to predict robot trajectories and trains a lightweight diffusion policy to generate actions; introduces a two-stage text-to-flow and flow-to-video design in which optical flow serves as the intermediate variable. Result: Simulation and real-world experiments support the effectiveness of the method, which significantly reduces the robot-data requirement. Conclusion: The two-stage model effectively addresses language ambiguity and data scarcity in bimanual manipulation. Abstract: Learning a generalizable bimanual manipulation policy is extremely challenging for embodied agents due to the large action space and the need for coordinated arm movements. Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies. However, transferring knowledge from single-arm datasets or pre-trained VLA models often fails to generalize effectively, primarily due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. In this paper, we propose a novel bimanual foundation policy by fine-tuning the leading text-to-video models to predict robot trajectories and training a lightweight diffusion policy for action generation. Given the lack of embodied knowledge in text-to-video models, we introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models derived from a pre-trained text-to-video model. Specifically, optical flow serves as an intermediate variable, providing a concise representation of subtle movements between images. The text-to-flow model predicts optical flow to concretize the intent of language instructions, and the flow-to-video model leverages this flow for fine-grained video prediction. Our method mitigates the ambiguity of language in single-stage text-to-video prediction and significantly reduces the robot-data requirement by avoiding direct use of low-level actions. In experiments, we collect high-quality manipulation data for a real dual-arm robot, and the results of simulation and real-world experiments demonstrate the effectiveness of our method.

[24] Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders

Bo Fang,Wenhao Wu,Qiangqiang Wu,Yuxin Song,Antoni B. Chan

Main category: cs.CV

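TL;DR: Nar-KFC is a module for long video understanding that combines keyframe selection with inserted textual narratives to resolve the tension between the large number of visual tokens and the limited context length of language models.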

Details

Motivation: In long video understanding, the conflict between the large number of visual tokens and the limited context length of language models makes traditional methods either inefficient or prone to selecting irrelevant content. Method: Proposes the Nar-KFC module, which has two steps: keyframe selection, formulated as an integer quadratic program, and textual narrative insertion, which uses off-the-shelf captioners to supply information from non-keyframes. Result: Nar-KFC significantly improves the performance of multimodal large language models on multiple long-video benchmarks. Conclusion: Nar-KFC is an efficient, content-aware compression strategy for long videos that effectively combines the visual and textual modalities. Abstract: Employing Multimodal Large Language Models (MLLMs) for long video understanding remains a challenging problem due to the dilemma between the substantial number of video frames (i.e., visual tokens) and the limited context length of language models. Traditional uniform sampling often leads to selection of irrelevant content, while post-training MLLMs on thousands of frames imposes a substantial computational burden. In this paper, we propose threading keyframes with narratives (Nar-KFC), a plug-and-play module to facilitate effective and efficient long video perception. Nar-KFC generally involves two collaborative steps. First, we formulate the keyframe selection process as an integer quadratic programming problem, jointly optimizing query-relevance and frame-diversity. To avoid its computational complexity, a customized greedy search strategy is designed as an efficient alternative. Second, to mitigate the temporal discontinuity caused by sparse keyframe sampling, we further introduce interleaved textual narratives generated from non-keyframes using off-the-shelf captioners. These narratives are inserted between keyframes based on their true temporal order, forming a coherent and compact representation. Nar-KFC thus serves as a temporal- and content-aware compression strategy that complements visual and textual modalities. Experimental results on multiple long-video benchmarks demonstrate that Nar-KFC significantly improves the performance of popular MLLMs. Code will be made publicly available.
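
The keyframe selection step described above (a greedy alternative to an integer quadratic program over query relevance and frame diversity) might be sketched roughly as follows. This is only an illustration under assumptions: the abstract does not specify the objective or the greedy rule, so the cosine-based scores, the `diversity_weight` parameter, and the function names are all hypothetical rather than the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def greedy_keyframes(frame_feats, query_feat, k, diversity_weight=1.0):
    """Pick up to k frame indices, trading off query relevance and diversity.

    Hypothetical objective: sum of relevance scores plus a weighted sum of
    pairwise dissimilarities among the selected frames.
    """
    relevance = [cosine(f, query_feat) for f in frame_feats]
    selected = []
    remaining = set(range(len(frame_feats)))
    while remaining and len(selected) < k:

        def gain(i):
            # Marginal gain: relevance plus dissimilarity to frames chosen so far.
            div = sum(1.0 - cosine(frame_feats[i], frame_feats[j]) for j in selected)
            return relevance[i] + diversity_weight * div

        best = max(sorted(remaining), key=gain)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)  # return indices in temporal order
```

Returning the indices in temporal order matches the abstract's point that narratives are later interleaved between keyframes according to their true order in the video.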

[25] Training-free zero-shot 3D symmetry detection with visual features back-projected to geometry

Isaac Aguirre,Ivan Sipiran

Main category: cs.CV

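TL;DR: Proposes a training-free, zero-shot 3D symmetry detection method that uses visual features from foundation vision models such as DINOv2, extracting features from rendered views and back-projecting them onto the original geometry; experiments show it outperforms both traditional geometric methods and learning-based methods.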

Details

Motivation: To explore the potential of foundation vision models for complex 3D geometric problems such as symmetry detection while avoiding any dependence on training data. Method: Extracts visual features from rendered views of a 3D object, back-projects them onto the geometry, and exploits the symmetric invariance of the features in an algorithm that detects reflection-symmetry planes. Result: Experiments on a subset of ShapeNet show that the method outperforms traditional geometric methods and learning-based methods without requiring training data. Conclusion: Foundation vision models can help solve complex 3D geometric problems, as shown by their effectiveness in symmetry detection. Abstract: We present a simple yet effective training-free approach for zero-shot 3D symmetry detection that leverages visual features from foundation vision models such as DINOv2. Our method extracts features from rendered views of 3D objects and backprojects them onto the original geometry. We demonstrate the symmetric invariance of these features and use them to identify reflection-symmetry planes through a proposed algorithm. Experiments on a subset of ShapeNet demonstrate that our approach outperforms both traditional geometric methods and learning-based approaches without requiring any training data. Our work demonstrates how foundation vision models can help in solving complex 3D geometric problems such as symmetry detection.
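
To make the idea of using per-point features to find reflection planes more concrete, here is a minimal sketch that scores candidate planes by reflecting each point and checking whether a nearby point with a similar feature exists. The abstract does not describe the proposed algorithm, so the error measure, the brute-force nearest-neighbor search, and all function names are assumptions made purely for illustration. A plane is represented as a unit normal `n` and offset `d`, meaning the set of points `x` with `n . x = d`.

```python
def reflect(p, n, d):
    """Reflect point p across the plane {x : n . x = d}, with n a unit normal."""
    s = sum(pi * ni for pi, ni in zip(p, n)) - d
    return tuple(pi - 2.0 * s * ni for pi, ni in zip(p, n))


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def symmetry_error(points, feats, n, d):
    """Mean geometric plus feature mismatch under reflection (hypothetical score)."""
    total = 0.0
    for p, f in zip(points, feats):
        q = reflect(p, n, d)
        # Brute-force nearest neighbor of the reflected point.
        j = min(range(len(points)), key=lambda i: sq_dist(points[i], q))
        total += sq_dist(points[j], q) + sq_dist(feats[j], f)
    return total / len(points)


def best_plane(points, feats, candidates):
    """Return the (normal, offset) candidate with the lowest symmetry error."""
    return min(candidates, key=lambda c: symmetry_error(points, feats, c[0], c[1]))
```

In practice the features would come from back-projected image features rather than the hand-made tuples used for testing, and a spatial index would replace the quadratic nearest-neighbor loop.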

[26] Pretraining Deformable Image Registration Networks with Random Images

Junyu Chen,Shuwen Wei,Yihao Liu,Aaron Carass,Yong Du

Main category: cs.CV

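TL;DR: Uses registration between random images as a pretraining task to improve medical image registration accuracy, reduce the need for domain-specific data, and accelerate downstream training.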

Details

Motivation: To explore deep learning approaches that do not require medical images, using random-image registration as a pretraining task to improve generalization and computational efficiency. Method: Proposes registration between random images as a pretraining task for building a foundation model, and evaluates it on downstream medical image registration tasks. Result: The pretraining strategy improves registration accuracy, reduces the need for domain-specific data, and accelerates convergence in downstream training. Conclusion: Random-image registration is an effective pretraining task that improves the performance and efficiency of medical image registration. Abstract: Recent advances in deep learning-based medical image registration have shown that training deep neural networks (DNNs) does not necessarily require medical images. Previous work showed that DNNs trained on randomly generated images with carefully designed noise and contrast properties can still generalize well to unseen medical data. Building on this insight, we propose using registration between random images as a proxy task for pretraining a foundation model for image registration. Empirical results show that our pretraining strategy improves registration accuracy, reduces the amount of domain-specific data needed to achieve competitive performance, and accelerates convergence during downstream training, thereby enhancing computational efficiency.

[27] DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?

Tianhong Zhou,Yin Xu,Yingtao Zhu,Chuxi Xiao,Haiyang Bian,Lei Wei,Xuegong Zhang

Main category: cs.CV

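TL;DR: DrVD-Bench is a multimodal benchmark for clinical visual reasoning that evaluates whether vision-language models (VLMs) truly follow the reasoning process of human clinicians rather than merely imitating superficial patterns.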

Details

Motivation: Existing benchmarks do not systematically evaluate whether VLMs reason like human clinicians, a gap DrVD-Bench is meant to fill. Method: DrVD-Bench has three modules (Visual Evidence Comprehension, Reasoning Trajectory Assessment, and Report Generation Evaluation) with 7,789 image-question pairs in total, covering 20 task types, 17 diagnostic categories, and 5 imaging modalities. Result: Across 19 VLMs, performance drops sharply as reasoning complexity increases; some models show traces of human-like reasoning but still rely on shortcut correlations. Conclusion: DrVD-Bench provides a rigorous evaluation framework for developing clinically trustworthy VLMs. Abstract: Vision-language models (VLMs) exhibit strong zero-shot generalization on natural images and show early promise in interpretable medical image analysis. However, existing benchmarks do not systematically evaluate whether these models truly reason like human clinicians or merely imitate superficial patterns. To address this gap, we propose DrVD-Bench, the first multimodal benchmark for clinical visual reasoning. DrVD-Bench consists of three modules: Visual Evidence Comprehension, Reasoning Trajectory Assessment, and Report Generation Evaluation, comprising a total of 7,789 image-question pairs. Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities: CT, MRI, ultrasound, radiography, and pathology. DrVD-Bench is explicitly structured to reflect the clinical reasoning workflow from modality recognition to lesion identification and diagnosis. We benchmark 19 VLMs, including general-purpose and medical-specific, open-source and proprietary models, and observe that performance drops sharply as reasoning complexity increases. While some models begin to exhibit traces of human-like reasoning, they often still rely on shortcut correlations rather than grounded visual understanding. DrVD-Bench offers a rigorous and structured evaluation framework to guide the development of clinically trustworthy VLMs.

[28] Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT

Zhuobai Dong,Junchao Yi,Ziyuan Zheng,Haochen Han,Xiangxi Zheng,Alex Jinpeng Wang,Fangming Liu,Linjie Li

Main category: cs.CV

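TL;DR: The paper shows that current multimodal large language models (MLLMs) have significant weaknesses in visual physical reasoning, and introduces the MVPBench benchmark and a graph-based CoT consistency metric to evaluate the problem.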

Details

Motivation: Existing MLLMs perform poorly at understanding the physical world (laws of motion, spatial relations, and causal effects), especially in complex visual reasoning tasks. Method: Introduces the MVPBench benchmark, which evaluates models through multi-image inputs and visual chain-of-thought (CoT) tasks, and designs a graph-based CoT consistency metric to verify the physical logic of reasoning paths. Result: Experiments show that even cutting-edge MLLMs perform poorly on visual physical reasoning and image-text alignment, and that RL-based post-training can harm spatial reasoning. Conclusion: Current fine-tuning practices need rethinking, with more emphasis on visual understanding; MVPBench is offered as a benchmark for future research. Abstract: Understanding the physical world - governed by laws of motion, spatial relations, and causality - poses a fundamental challenge for multimodal large language models (MLLMs). While recent advances such as OpenAI o3 and GPT-4o demonstrate impressive perceptual and reasoning capabilities, our investigation reveals these models struggle profoundly with visual physical reasoning, failing to grasp basic physical laws, spatial interactions, and causal effects in complex scenes. More importantly, they often fail to follow coherent reasoning chains grounded in visual evidence, especially when multiple steps are needed to arrive at the correct answer. To assess this capability, we introduce MVPBench, a curated benchmark designed to rigorously evaluate visual physical reasoning through the lens of visual chain-of-thought (CoT). Each example features interleaved multi-image inputs and demands not only the correct final answer but also a coherent, step-by-step reasoning path grounded in evolving visual cues. This setup mirrors how humans reason through real-world physical processes over time. To ensure fine-grained evaluation, we introduce a graph-based CoT consistency metric that verifies whether the model's reasoning path adheres to valid physical logic. Additionally, we minimize shortcut exploitation from text priors, encouraging models to rely on visual understanding. Experimental results reveal a concerning trend: even cutting-edge MLLMs exhibit poor visual reasoning accuracy and weak image-text alignment in physical domains. Surprisingly, RL-based post-training alignment - commonly believed to improve visual reasoning performance - often harms spatial reasoning, suggesting a need to rethink current fine-tuning practices.

[29] Boosting All-in-One Image Restoration via Self-Improved Privilege Learning

Gang Wu,Junjun Jiang,Kui Jiang,Xianming Liu

Main category: cs.CV

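TL;DR: SIPL uses the model's own preliminary outputs as pseudo-privileged signals for iterative self-refinement at inference time, significantly improving image restoration performance.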

Details

Motivation: To address the unstable optimization and inter-task conflicts that unified image restoration models face on diverse and mixed degradations. Method: Proposes the SIPL paradigm, which combines a Proxy Fusion module with a Privileged Dictionary; the dictionary distills high-frequency and structural priors during training, and at inference the model's own output is used for self-correction. Result: When integrated with PromptIR, SIPL improves PSNR by 4.58 dB on composite degradation tasks and by 1.28 dB on five-task benchmarks. Conclusion: Through self-improved privilege learning, SIPL significantly improves image restoration performance and is broadly applicable and efficient. Abstract: Unified image restoration models for diverse and mixed degradations often suffer from unstable optimization dynamics and inter-task conflicts. This paper introduces Self-Improved Privilege Learning (SIPL), a novel paradigm that overcomes these limitations by innovatively extending the utility of privileged information (PI) beyond training into the inference stage. Unlike conventional Privilege Learning, where ground-truth-derived guidance is typically discarded after training, SIPL empowers the model to leverage its own preliminary outputs as pseudo-privileged signals for iterative self-refinement at test time. Central to SIPL is Proxy Fusion, a lightweight module incorporating a learnable Privileged Dictionary. During training, this dictionary distills essential high-frequency and structural priors from privileged feature representations. Critically, at inference, the same learned dictionary then interacts with features derived from the model's initial restoration, facilitating a self-correction loop. SIPL can be seamlessly integrated into various backbone architectures, offering substantial performance improvements with minimal computational overhead. Extensive experiments demonstrate that SIPL significantly advances the state-of-the-art on diverse all-in-one image restoration benchmarks. For instance, when integrated with the PromptIR model, SIPL achieves remarkable PSNR improvements of +4.58 dB on composite degradation tasks and +1.28 dB on diverse five-task benchmarks, underscoring its effectiveness and broad applicability. Codes are available at our project page https://github.com/Aitical/SIPL.

[30] STORK: Improving the Fidelity of Mid-NFE Sampling for Diffusion and Flow Matching Models

Zheng Tan,Weizhen Wang,Andrea L. Bertozzi,Ernest K. Ryu

Main category: cs.CV

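TL;DR: STORK is a novel, training-free, structure-independent ODE solver for diffusion models, aimed at the mid-NFE regime (20-50), where it outperforms existing methods on image generation tasks.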

Details

Motivation: Although diffusion models excel at high-fidelity image generation, the mid-NFE regime (20-50) has received little study, even though practical applications such as Stable Diffusion 3.5 commonly run in this range. Method: Proposes STORK, based on stiff ODE solvers with a Taylor expansion adaptation, applicable to any DM sampling, including noise-based and flow matching models. Result: In the 20-50 NFE range, STORK achieves better generation quality (FID scores) on both unconditional pixel-level and conditional latent-space generation tasks. Conclusion: STORK provides an efficient and general solution for diffusion model sampling in the mid-NFE regime and suits a range of practical applications. Abstract: Diffusion models (DMs) have demonstrated remarkable performance in high-fidelity image and video generation. Because high-quality generations with DMs typically require a large number of function evaluations (NFEs), resulting in slow sampling, there has been extensive research successfully reducing the NFE to a small range (<10) while maintaining acceptable image quality. However, many practical applications, such as those involving Stable Diffusion 3.5, FLUX, and SANA, commonly operate in the mid-NFE regime (20-50 NFE) to achieve superior results, and, despite the practical relevance, effective sampling within this mid-NFE regime remains underexplored. In this work, we propose a novel, training-free, and structure-independent DM ODE solver called the Stabilized Taylor Orthogonal Runge-Kutta (STORK) method, based on a class of stiff ODE solvers with a Taylor expansion adaptation. Unlike prior work such as DPM-Solver, which is dependent on the semi-linear structure of the DM ODE, STORK is applicable to any DM sampling, including noise-based and flow matching-based models. Within the 20-50 NFE range, STORK achieves improved generation quality, as measured by FID scores, across unconditional pixel-level generation and conditional latent-space generation tasks using models like Stable Diffusion 3.5 and SANA. Code is available at https://github.com/ZT220501/STORK.

[31] Benchmarking Foundation Models for Zero-Shot Biometric Tasks

Redwan Sony,Parisa Farmanifard,Hamzeh Alzwairy,Nitish Shukla,Arun Ross

Main category: cs.CV

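TL;DR: The paper evaluates the zero-shot and few-shot performance of 41 vision-language models and multimodal large language models on six biometric tasks, showing that they reach high accuracy without fine-tuning.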

Details

Motivation: To explore the potential of foundation models (VLMs and MLLMs) for biometric tasks, filling a gap in existing research. Method: Builds a comprehensive benchmark that evaluates models on six biometric tasks (such as face verification and iris recognition) using zero-shot and few-shot learning. Result: The models perform well on several tasks; for example, face verification reaches a TMR of 96.77% on the LFW dataset and iris recognition reaches a TMR of 97.55% on the IITD-R-Full dataset, both at 1% FMR. Conclusion: Pretrained models show great potential for biometric tasks, which supports the long-term vision of Artificial General Intelligence. Abstract: The advent of foundation models, particularly Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), has redefined the frontiers of artificial intelligence, enabling remarkable generalization across diverse tasks with minimal or no supervision. Yet, their potential in biometric recognition and analysis remains relatively underexplored. In this work, we introduce a comprehensive benchmark that evaluates the zero-shot and few-shot performance of state-of-the-art publicly available VLMs and MLLMs across six biometric tasks spanning the face and iris modalities: face verification, soft biometric attribute prediction (gender and race), iris recognition, presentation attack detection (PAD), and face manipulation detection (morphs and deepfakes). A total of 41 VLMs were used in this evaluation. Experiments show that embeddings from these foundation models can be used for diverse biometric tasks with varying degrees of success. For example, in the case of face verification, a True Match Rate (TMR) of 96.77 percent was obtained at a False Match Rate (FMR) of 1 percent on the Labeled Faces in the Wild (LFW) dataset, without any fine-tuning. In the case of iris recognition, the TMR at 1 percent FMR on the IITD-R-Full dataset was 97.55 percent without any fine-tuning. Further, we show that applying a simple classifier head to these embeddings can help perform DeepFake detection for faces, Presentation Attack Detection (PAD) for irides, and extract soft biometric attributes like gender and ethnicity from faces with reasonably high accuracy. This work reiterates the potential of pretrained models in achieving the long-term vision of Artificial General Intelligence.
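
The "TMR at a fixed FMR" figures quoted above can be read as follows: choose a similarity threshold at which at most the target fraction of impostor pairs is accepted, then report the fraction of genuine pairs accepted at that threshold. The sketch below illustrates this under assumptions: scores are similarities (higher means more alike), a match is declared when `score >= threshold`, and the function name and brute-force threshold scan are illustrative rather than the benchmark's actual evaluation code.

```python
def tmr_at_fmr(genuine, impostor, target_fmr=0.01):
    """Highest True Match Rate over thresholds whose False Match Rate <= target_fmr.

    genuine:  similarity scores for same-identity pairs
    impostor: similarity scores for different-identity pairs
    Returns 0.0 if no observed score works as a threshold.
    """
    best_tmr = 0.0
    # Only observed scores need to be tried as thresholds.
    for t in sorted(set(genuine) | set(impostor)):
        fmr = sum(s >= t for s in impostor) / len(impostor)
        if fmr <= target_fmr:
            tmr = sum(s >= t for s in genuine) / len(genuine)
            best_tmr = max(best_tmr, tmr)
    return best_tmr
```

For example, with genuine scores `[0.9, 0.8, 0.7, 0.4]` and impostor scores `[0.1, 0.2, 0.3, 0.5]`, allowing an FMR of 0.25 lets the threshold drop to 0.4 and every genuine pair is accepted, whereas requiring an FMR of 0 forces the threshold above 0.5 and one genuine pair is rejected.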

[32] Shuffle PatchMix Augmentation with Confidence-Margin Weighted Pseudo-Labels for Enhanced Source-Free Domain Adaptation

Prasanna Reddy Pulakurthi,Majid Rabbani,Jamison Heard,Sohail Dianat,Celso M. de Melo,Raghuveer Rao

Main category: cs.CV

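TL;DR: The paper proposes a source-free domain adaptation (SFDA) method that combines a new augmentation technique, Shuffle PatchMix (SPM), with a pseudo-label reweighting strategy, significantly improving performance and achieving state-of-the-art results on several benchmarks.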

Details

Motivation: To address the overfitting and pseudo-label noise that arise in source-free domain adaptation from the lack of source data, particularly on small datasets. Method: Introduces SPM, which shuffles and blends image patches to generate diverse augmentations, together with a reweighting strategy that favors reliable pseudo-labels. Result: Achieves state-of-the-art performance on PACS, VisDA-C, and DomainNet-126, with gains of 7.3% and 7.2% on PACS in the single-target and multi-target settings, respectively. Conclusion: The combination of SPM and reweighting sets a new benchmark for SFDA, and the code is open source. Abstract: This work investigates Source-Free Domain Adaptation (SFDA), where a model adapts to a target domain without access to source data. A new augmentation technique, Shuffle PatchMix (SPM), and a novel reweighting strategy are introduced to enhance performance. SPM shuffles and blends image patches to generate diverse and challenging augmentations, while the reweighting strategy prioritizes reliable pseudo-labels to mitigate label noise. These techniques are particularly effective on smaller datasets like PACS, where overfitting and pseudo-label noise pose greater risks. State-of-the-art results are achieved on three major benchmarks: PACS, VisDA-C, and DomainNet-126. Notably, on PACS, improvements of 7.3% (79.4% to 86.7%) and 7.2% are observed in single-target and multi-target settings, respectively, while gains of 2.8% and 0.7% are attained on DomainNet-126 and VisDA-C. This combination of advanced augmentation and robust pseudo-label reweighting establishes a new benchmark for SFDA. The code is available at: https://github.com/PrasannaPulakurthi/SPM
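
As a rough picture of what "shuffles and blends image patches" could mean, the sketch below splits a grayscale image (a list of rows) into a grid of square patches, permutes the patches, and linearly blends the shuffled image with the original. The abstract gives no details of SPM, so the grid layout, the single blending coefficient `alpha`, and the function names are assumptions for illustration only; the released repository is the authoritative reference.

```python
import random


def shuffle_patches(img, patch, rng):
    """Return a copy of img with its patch x patch blocks randomly permuted."""
    h, w = len(img), len(img[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    coords = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    src = coords[:]
    rng.shuffle(src)  # src[i] is the patch copied into destination coords[i]
    out = [[0.0] * w for _ in range(h)]
    for (dr, dc), (sr, sc) in zip(coords, src):
        for i in range(patch):
            for j in range(patch):
                out[dr + i][dc + j] = img[sr + i][sc + j]
    return out


def shuffle_patchmix(img, patch=2, alpha=0.5, seed=0):
    """Blend the original image with a patch-shuffled copy (hypothetical SPM variant).

    alpha = 1.0 returns the original pixel values; alpha = 0.0 returns the
    pure patch shuffle.
    """
    rng = random.Random(seed)
    shuffled = shuffle_patches(img, patch, rng)
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img, shuffled)
    ]
```

Because shuffling only moves pixels and blending is a convex combination, the total pixel intensity of the augmented image equals that of the original, which is a handy sanity check.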

[33] Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin

Fangyikang Wang,Hubery Yin,Lei Qian,Yinan Li,Shaobin Zhuang,Huminhao Zhu,Yilin Zhang,Yanlong Tang,Chao Zhang,Hanbin Zhao,Hui Qian,Chen Li

Main category: cs.CV

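TL;DR: Proposes a new method, Levenberg-Marquardt-Langevin (LML), that exploits diffusion Hessian geometry efficiently through a low-rank approximation and a damping mechanism, significantly improving image generation quality with negligible computational overhead.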

Details

Motivation: Current diffusion model sampling techniques rely mainly on first-order Langevin dynamics, while using Hessian geometry directly is computationally expensive. The paper aims to improve sampling quality by approximating the Hessian geometry efficiently. Method: Proposes LML, which combines a low-rank approximation of the diffusion Hessian with a damping mechanism, avoids explicit quadratic-complexity computation, and requires no additional training. Result: Experiments show that LML significantly improves image generation quality with negligible computational overhead. Conclusion: By approximating the Hessian geometry efficiently, LML gives diffusion sampling more accurate steps, improves generation quality, and has practical potential. Abstract: Diffusion models (DMs) have demonstrated a remarkable capability for generating images by learning the noised score function of the data distribution. Current DM sampling techniques typically rely on first-order Langevin dynamics at each noise level, with efforts concentrated on refining inter-level denoising strategies. While leveraging additional second-order Hessian geometry to enhance the sampling quality of Langevin is a common practice in Markov chain Monte Carlo (MCMC), the naive attempts to utilize Hessian geometry in high-dimensional DMs lead to quadratic-complexity computational costs, rendering them non-scalable. In this work, we introduce a novel Levenberg-Marquardt-Langevin (LML) method that approximates the diffusion Hessian geometry in a training-free manner, drawing inspiration from the celebrated Levenberg-Marquardt optimization algorithm. Our approach introduces two key innovations: (1) A low-rank approximation of the diffusion Hessian, leveraging the DMs' inherent structure and circumventing explicit quadratic-complexity computations; (2) A damping mechanism to stabilize the approximated Hessian. This LML approximated Hessian geometry enables the diffusion sampling to execute more accurate steps and improve the image generation quality. We further conduct a theoretical analysis to substantiate the approximation error bound of low-rank approximation and the convergence property of the damping mechanism. Extensive experiments across multiple pretrained DMs validate that the LML method significantly improves image generation quality, with negligible computational overhead.
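
To illustrate why a low-rank Hessian approximation with damping avoids quadratic cost, consider the simplest case: a rank-1 approximation `u u^T` damped by `lam * I`, in the spirit of the Levenberg-Marquardt algorithm. The Sherman-Morrison identity then gives the preconditioned direction in linear time, without ever forming a matrix. This is a generic sketch under that rank-1 assumption, not the paper's construction; the function name and the choice of `u` are hypothetical.

```python
def damped_rank1_solve(u, g, lam):
    """Solve (lam * I + u u^T) x = g in O(n) via the Sherman-Morrison identity.

    (lam*I + u u^T)^{-1} = (1/lam) * (I - u u^T / (lam + u^T u))
    lam > 0 is the Levenberg-Marquardt style damping that keeps the system
    well conditioned even though u u^T alone is singular.
    """
    if lam <= 0:
        raise ValueError("damping must be positive")
    ug = sum(ui * gi for ui, gi in zip(u, g))
    uu = sum(ui * ui for ui in u)
    coef = ug / (lam + uu)
    return [(gi - coef * ui) / lam for ui, gi in zip(u, g)]
```

As `lam` grows the result approaches a scaled copy of `g` (a plain first-order step), while small `lam` suppresses the component of `g` along `u`, which is the usual trade-off that damping controls.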

[34] Reasoning Can Hurt the Inductive Abilities of Large Language Models

Haibo Jin,Peiyan Zhang,Man Luo,Haohan Wang

Main category: cs.CV

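TL;DR: The study finds that chain-of-thought (CoT) prompting can degrade, rather than enhance, the inductive reasoning ability of large language models (LLMs). Based on a theoretical framework and experiments, it proposes structured interventions that improve reasoning accuracy.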

Details

Motivation: To examine how chain-of-thought prompting affects the inductive reasoning ability of LLMs and test whether it improves reasoning performance as commonly assumed. Method: Designs four game-based diagnostic tasks (chess, Texas Hold'em, and others), compares CoT reasoning models against non-reasoning models, and presents a theoretical framework that explains the failure modes. Result: CoT reasoning can degrade inductive performance, and large reasoning models (LRMs) often underperform their non-reasoning counterparts. Structured interventions improve accuracy. Conclusion: Effective CoT reasoning requires not only more steps but also well-structured steps; structured interventions improve inductive accuracy. Abstract: Large Language Models (LLMs) have shown remarkable progress across domains, yet their ability to perform inductive reasoning - inferring latent rules from sparse examples - remains limited. It is often assumed that chain-of-thought (CoT) prompting, as used in Large Reasoning Models (LRMs), enhances such reasoning. We investigate this assumption by creating four controlled, diagnostic game-based tasks - chess, Texas Hold'em, dice games, and blackjack - with hidden human-defined rules. We find that CoT reasoning can degrade inductive performance, with LRMs often underperforming their non-reasoning counterparts. To explain this, we present a theoretical framework that reveals how reasoning steps can amplify error through three failure modes: incorrect sub-task decomposition, incorrect sub-task solving, and incorrect final answer summarization. Based on our theoretical and empirical analysis, we introduce structured interventions that adapt CoT generation according to our identified failure types. These interventions improve inductive accuracy without retraining. Our findings suggest that effective CoT reasoning depends not only on taking more steps but also on ensuring those steps are well-structured.

[35] Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models

Ying Yang,Jie Zhang,Xiao Lv,Di Lin,Tao Xiang,Qing Guo

Main category: cs.CV

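TL;DR: LightD is a framework that generates natural adversarial samples against vision-and-language pretraining (VLP) models through semantically guided relighting, addressing the restricted optimization space and unnatural perturbations of existing methods.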

Details

Motivation: Existing adversarial attack methods mainly target classification tasks, adapt poorly to VLP models, and produce perturbations that are unnatural or semantically inconsistent. LightD aims to generate more natural adversarial samples through semantically guided relighting. Method: LightD uses ChatGPT to propose context-aware initial lighting parameters and a pretrained relighting model (IC-light) to enable diverse lighting adjustments. Gradient-based optimization of the reference lighting image strengthens the attack while preserving visual naturalness. Result: LightD is shown to be effective and superior across various VLP models on tasks such as image captioning and visual question answering. Conclusion: LightD successfully generates natural adversarial samples through semantically guided relighting, offering a new approach to adversarial attacks on VLP models. Abstract: While adversarial attacks on vision-and-language pretraining (VLP) models have been explored, generating natural adversarial samples crafted through realistic and semantically meaningful perturbations remains an open challenge. Existing methods, primarily designed for classification tasks, struggle when adapted to VLP models due to their restricted optimization spaces, leading to ineffective attacks or unnatural artifacts. To address this, we propose LightD, a novel framework that generates natural adversarial samples for VLP models via semantically guided relighting. Specifically, LightD leverages ChatGPT to propose context-aware initial lighting parameters and integrates a pretrained relighting model (IC-light) to enable diverse lighting adjustments. LightD expands the optimization space while ensuring perturbations align with scene semantics. Additionally, gradient-based optimization is applied to the reference lighting image to further enhance attack effectiveness while maintaining visual naturalness. The effectiveness and superiority of the proposed LightD have been demonstrated across various VLP models in tasks such as image captioning and visual question answering.

[36] From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models

Haibo Jin,Peiyan Zhang,Peiran Wang,Man Luo,Haohan Wang

Main category: cs.CV

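TL;DR: The paper proposes a unified theoretical framework that models jailbreak attacks as token-level optimization and hallucinations as attention-level optimization, finds that the two behave similarly in loss convergence and gradient behavior, and validates this connection experimentally.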

Details

Motivation: Large foundation models (LFMs) suffer from two vulnerabilities, hallucinations and jailbreak attacks. Existing work usually analyzes them in isolation, but the way defenses for one affect the other suggests a deeper connection. Method: Proposes a unified framework that models jailbreaks as token-level optimization and hallucinations as attention-level optimization, then verifies loss convergence and gradient consistency. Result: Experiments on LLaVA-1.5 and MiniGPT-4 confirm consistent optimization trends and gradient behavior, and mitigation techniques aimed at one vulnerability reduce the success rate of the other. Conclusion: The work reveals a shared failure mode in LFMs and suggests that robustness strategies should address both vulnerabilities jointly. Abstract: Large foundation models (LFMs) are susceptible to two distinct vulnerabilities: hallucinations and jailbreak attacks. While typically studied in isolation, we observe that defenses targeting one often affect the other, hinting at a deeper connection. We propose a unified theoretical framework that models jailbreaks as token-level optimization and hallucinations as attention-level optimization. Within this framework, we establish two key propositions: (1) Similar Loss Convergence - the loss functions for both vulnerabilities converge similarly when optimizing for target-specific outputs; and (2) Gradient Consistency in Attention Redistribution - both exhibit consistent gradient behavior driven by shared attention dynamics. We validate these propositions empirically on LLaVA-1.5 and MiniGPT-4, showing consistent optimization trends and aligned gradients. Leveraging this connection, we demonstrate that mitigation techniques for hallucinations can reduce jailbreak success rates, and vice versa. Our findings reveal a shared failure mode in LFMs and suggest that robustness strategies should jointly address both vulnerabilities.

[37] MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM

Bowen Dong,Minheng Ni,Zitong Huang,Guanglei Yang,Wangmeng Zuo,Lei Zhang

Main category: cs.CV

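TL;DR: The paper proposes a benchmark, {\dataset}, that distinguishes perception-induced from reasoning-induced hallucinations in multimodal large language models (MLLMs), and a method, {\method}, that reduces reasoning hallucinations.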

Details

Motivation: Multimodal hallucination limits the correctness of MLLMs, but existing benchmarks cannot separate hallucinations caused by perception from those caused by reasoning, which hinders the diagnosis of reasoning failures in MLLMs. Method: Proposes the {\dataset} benchmark, which isolates reasoning hallucinations through question construction and introduces multi-granular evaluation metrics. Also proposes {\method}, which combines curriculum reinforcement fine-tuning with collaborative hint inference to reduce reasoning complexity. Result: The analysis shows that model scale, data scale, and training stage significantly affect the degree of hallucination; current MLLMs perform poorly on spatial hallucinations; and question types correlate with hallucination patterns. {\method} establishes a baseline on the benchmark and reduces logical hallucinations. Conclusion: With {\dataset} and {\method}, the paper addresses multimodal hallucination and points to directions for improving the reasoning ability of MLLMs. Abstract: Multimodal hallucination in multimodal large language models (MLLMs) restricts the correctness of MLLMs. However, multimodal hallucinations are multi-sourced and arise from diverse causes. Existing benchmarks fail to adequately distinguish between perception-induced hallucinations and reasoning-induced hallucinations. This failure constitutes a significant issue and hinders the diagnosis of multimodal reasoning failures within MLLMs. To address this, we propose the {\dataset} benchmark, which isolates reasoning hallucinations by constructing questions where input images are correctly perceived by MLLMs yet reasoning errors persist. {\dataset} introduces multi-granular evaluation metrics: accuracy, factuality, and LLMs hallucination score for hallucination quantification. Our analysis reveals that (1) the model scale, data scale, and training stages significantly affect the degree of logical, fabrication, and factual hallucinations; (2) current MLLMs show no effective improvement on spatial hallucinations caused by misinterpreted spatial relationships, indicating their limited visual reasoning capabilities; and (3) question types correlate with distinct hallucination patterns, highlighting targeted challenges and potential mitigation strategies. To address these challenges, we propose {\method}, a method that combines curriculum reinforcement fine-tuning to encourage models to generate logic-consistent reasoning chains by stepwise reducing learning difficulty, and collaborative hint inference to reduce reasoning complexity. {\method} establishes a baseline on {\dataset}, and reduces the logical hallucinations in original base models.

[38] LTM3D: Bridging Token Spaces for Conditional 3D Generation with Auto-Regressive Diffusion Framework

Xin Kang,Zihan Zheng,Lei Chu,Yue Gao,Jiahao Li,Hao Pan,Xuejin Chen,Yan Lu

Main category: cs.CV

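TL;DR: LTM3D is a 3D shape generation framework that combines diffusion and auto-regressive models. It improves generation through conditional distribution modeling and prefix learning, and supports multiple 3D representations.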

Details

Motivation: Combining diffusion and auto-regressive models for 3D shape generation remains challenging for existing methods. LTM3D aims to integrate the strengths of both to improve generation quality. Method: LTM3D uses a Conditional Distribution Modeling backbone that combines a masked autoencoder with a diffusion model, and introduces Prefix Learning and a Latent Token Reconstruction module. Result: Experiments show that LTM3D outperforms existing methods on image- and text-conditioned generation, producing shapes with higher prompt fidelity and structural accuracy. Conclusion: LTM3D provides a general framework for multi-modal, multi-representation 3D generation that significantly improves generation quality and flexibility. Abstract: We present LTM3D, a Latent Token space Modeling framework for conditional 3D shape generation that integrates the strengths of diffusion and auto-regressive (AR) models. While diffusion-based methods effectively model continuous latent spaces and AR models excel at capturing inter-token dependencies, combining these paradigms for 3D shape generation remains a challenge. To address this, LTM3D features a Conditional Distribution Modeling backbone, leveraging a masked autoencoder and a diffusion model to enhance token dependency learning. Additionally, we introduce Prefix Learning, which aligns condition tokens with shape latent tokens during generation, improving flexibility across modalities. We further propose a Latent Token Reconstruction module with Reconstruction-Guided Sampling to reduce uncertainty and enhance structural fidelity in generated shapes. Our approach operates in token space, enabling support for multiple 3D representations, including signed distance fields, point clouds, meshes, and 3D Gaussian Splatting. Extensive experiments on image- and text-conditioned shape generation tasks demonstrate that LTM3D outperforms existing methods in prompt fidelity and structural accuracy while offering a generalizable framework for multi-modal, multi-representation 3D generation.

[39] 50 Years of Automated Face Recognition

Minchul Kim,Anil Jain,Xiaoming Liu

Main category: cs.CV

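TL;DR: This paper reviews 50 years of face recognition, from early geometric and statistical methods to modern deep learning models, and discusses key innovations, the influence of datasets, and future challenges.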

Details

Motivation: Summarize the evolution of face recognition, analyze the key technical advances and their effect on performance, and identify future research directions. Method: Through a historical review and technical analysis, the paper traces the shift from geometric and statistical methods to deep learning models, focusing on innovations in datasets, loss functions, neural network design, and feature fusion. Result: State-of-the-art systems report a False Negative Identification Rate of 0.13% in NIST FRVT evaluations, yet challenges such as generalization and explainability remain. Conclusion: Despite substantial progress, open problems remain in scalability, multi-modal fusion, synthetic identity generation, and explainability. Abstract: Over the past 50 years, automated face recognition has evolved from rudimentary, handcrafted systems into sophisticated deep learning models that rival and often surpass human performance. This paper chronicles the history and technological progression of FR, from early geometric and statistical methods to modern deep neural architectures leveraging massive real and AI-generated datasets. We examine key innovations that have shaped the field, including developments in datasets, loss functions, neural network design and feature fusion. We also analyze how the scale and diversity of training data influence model generalization, drawing connections between dataset growth and benchmark improvements. Recent advances have achieved remarkable milestones: state-of-the-art face verification systems now report False Negative Identification Rates of 0.13% against a 12.4 million gallery in NIST FRVT evaluations for 1:N visa-to-border matching. While recent advances have enabled remarkable accuracy in high- and low-quality face scenarios, several open research problems remain. We outline critical challenges and promising directions for future face recognition research, including scalability, multi-modal fusion, synthetic identity generation, and explainable systems.
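
The False Negative Identification Rate quoted above can be illustrated with a minimal sketch. It assumes a simplified rank-1 setting in which each mated search (a probe whose identity is enrolled in the gallery) returns either one candidate identity or nothing; the function name `fnir` and the toy data are hypothetical, and the NIST protocol additionally involves thresholds and candidate-list lengths that are not modeled here.

```python
def fnir(mated_searches):
    """Fraction of mated searches that fail to return the enrolled identity.

    mated_searches: list of (true_id, returned_id) pairs, where returned_id
    is None when the system returns no candidate (simplified rank-1 view).
    """
    if not mated_searches:
        raise ValueError("mated_searches must not be empty")
    misses = sum(1 for true_id, returned_id in mated_searches
                 if returned_id != true_id)
    return misses / len(mated_searches)


# Toy data (made up): two of the four searches miss the enrolled mate.
toy = [("a", "a"), ("b", "c"), ("c", None), ("d", "d")]
toy_rate = fnir(toy)
```

Under this reading, a rate of 0.13% corresponds to roughly 13 missed mates per 10,000 mated searches.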

[40] Harnessing Foundation Models for Robust and Generalizable 6-DOF Bronchoscopy Localization

Qingyao Tian,Huai Liao,Xinyan Huang,Bingyu Yang,Hongbin Liu

Main category: cs.CV

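TL;DR: PANSv2 is a generalizable and robust bronchoscopy localization framework that integrates depth estimation, landmark detection, and centerline constraints to address the weak generalization and sensitivity to visual degradation of existing methods.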

Details

Motivation: Existing bronchoscopy localization methods generalize poorly and are sensitive to visual degradation; PANSv2 aims to address both problems. Method: PANSv2 combines depth estimation, landmark detection, and centerline constraints, uses the EndoOmni and EndoMamba models to improve generalization, and adds an automatic re-initialization module to improve robustness. Result: On data from 10 patient cases, PANSv2 achieves the highest tracking success rate, with an 18.1% improvement in SR-5. Conclusion: PANSv2 performs well in generalization and robustness and shows potential for clinical use. Abstract: Vision-based 6-DOF bronchoscopy localization offers a promising solution for accurate and cost-effective interventional guidance. However, existing methods struggle with 1) limited generalization across patient cases due to scarce labeled data, and 2) poor robustness under visual degradation, as bronchoscopy procedures frequently involve artifacts such as occlusions and motion blur that impair visual information. To address these challenges, we propose PANSv2, a generalizable and robust bronchoscopy localization framework. Motivated by PANS that leverages multiple visual cues for pose likelihood measurement, PANSv2 integrates depth estimation, landmark detection, and centerline constraints into a unified pose optimization framework that evaluates pose probability and solves for the optimal bronchoscope pose. To further enhance generalization capabilities, we leverage the endoscopic foundation model EndoOmni for depth estimation and the video foundation model EndoMamba for landmark detection, incorporating both spatial and temporal analyses. Pretrained on diverse endoscopic datasets, these models provide stable and transferable visual representations, enabling reliable performance across varied bronchoscopy scenarios. Additionally, to improve robustness to visual degradation, we introduce an automatic re-initialization module that detects tracking failures and re-establishes pose using landmark detections once clear views are available. Experimental results on a bronchoscopy dataset encompassing 10 patient cases show that PANSv2 achieves the highest tracking success rate, with an 18.1% improvement in SR-5 (percentage of absolute trajectory error under 5 mm) compared to existing methods, showing potential towards real clinical usage.
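
The abstract defines SR-5 as the percentage of absolute trajectory error under 5 mm. A minimal sketch of such a metric follows, assuming the error is available per frame in millimetres; the function name `success_rate`, the strict `<` comparison, and the sample values are illustrative assumptions rather than details taken from the paper.

```python
def success_rate(errors_mm, threshold_mm=5.0):
    """Percentage of frames whose absolute trajectory error is below the threshold."""
    if not errors_mm:
        raise ValueError("errors_mm must not be empty")
    hits = sum(1 for err in errors_mm if err < threshold_mm)
    return 100.0 * hits / len(errors_mm)


# Made-up per-frame errors in millimetres: two of the four are under 5 mm.
sample_errors = [1.2, 4.9, 5.0, 7.5]
sr5 = success_rate(sample_errors)
```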

[41] Interactive Video Generation via Domain Adaptation

Ishaan Rawal,Suryansh Kumar

Main category: cs.CV

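TL;DR: This paper proposes an improved interactive video generation method that uses mask normalization and a temporal intrinsic diffusion prior to address the perceptual quality degradation and trajectory control problems of existing techniques.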

Details

Motivation: Existing text-conditioned diffusion models struggle with trajectory control in Interactive Video Generation (IVG), and existing training-free methods lose perceptual quality because of attention masking. Method: Proposes mask normalization to counter internal covariate shift, and introduces a temporal intrinsic diffusion prior to align the initial noise with the IVG conditioning. Result: Experiments show that the proposed method outperforms existing techniques in perceptual quality and trajectory control. Conclusion: Mask normalization and the temporal intrinsic diffusion prior effectively improve interactive video generation. Abstract: Text-conditioned diffusion models have emerged as powerful tools for high-quality video generation. However, enabling Interactive Video Generation (IVG), where users control motion elements such as object trajectory, remains challenging. Recent training-free approaches introduce attention masking to guide trajectory, but this often degrades perceptual quality. We identify two key failure modes in these methods, both of which we interpret as domain shift problems, and propose solutions inspired by domain adaptation. First, we attribute the perceptual degradation to internal covariate shift induced by attention masking, as pretrained models are not trained to handle masked attention. To address this, we propose mask normalization, a pre-normalization layer designed to mitigate this shift via distribution matching. Second, we address the initialization gap, where the randomly sampled initial noise does not align with IVG conditioning, by introducing a temporal intrinsic diffusion prior that enforces spatio-temporal consistency at each denoising step. Extensive qualitative and quantitative evaluations demonstrate that mask normalization and temporal intrinsic denoising improve both perceptual quality and trajectory control over the existing state-of-the-art IVG techniques.

[42] Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames

Sahithya Ravi,Gabriel Sarch,Vibhav Vineet,Andrew D. Wilson,Balasaravanan Thoravi Kumaravel

Main category: cs.CV

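TL;DR: Disjoint-3DQA is a benchmark that evaluates how well vision-language models (VLMs) reason about spatial relations between objects that are not co-visible in the same frame. Current models lag human performance by 28%, and accuracy drops sharply as the temporal gap widens.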

Details

Motivation: The study addresses the limited ability of embodied AI assistants to integrate spatial information across time in continuous video. Method: Proposes the Disjoint-3DQA benchmark, which evaluates VLM spatial reasoning about object pairs in non-co-visible frames, and tests seven state-of-the-art VLMs. Result: Models lag human performance by 28%, and accuracy falls more steeply (60% to 30%) as the temporal gap widens; providing oracle 3D coordinates raises performance by a substantial 20%. Conclusion: Disjoint-3DQA sets a clear challenge for long-horizon spatial reasoning and aims to advance research at the intersection of vision, language, and embodied AI. Abstract: An embodied AI assistant operating on egocentric video must integrate spatial cues across time - for instance, determining where an object A, glimpsed a few moments ago, lies relative to an object B encountered later. We introduce Disjoint-3DQA, a generative QA benchmark that evaluates this ability of VLMs by posing questions about object pairs that are not co-visible in the same frame. We evaluated seven state-of-the-art VLMs and found that models lag behind human performance by 28%, with steeper declines in accuracy (60% to 30%) as the temporal gap widens. Our analysis further reveals that providing trajectories or bird's-eye-view projections to VLMs results in only marginal improvements, whereas providing oracle 3D coordinates leads to a substantial 20% performance increase. This highlights a core bottleneck of multi-frame VLMs in constructing and maintaining 3D scene representations over time from visual signals. Disjoint-3DQA therefore sets a clear, measurable challenge for long-horizon spatial reasoning and aims to catalyze future research at the intersection of vision, language, and embodied AI.

[43] LLM-powered Query Expansion for Enhancing Boundary Prediction in Language-driven Action Localization

Zirui Shang,Xinxiao Wu,Shuo Yang

Main category: cs.CV

TL;DR: The paper reduces boundary uncertainty in video action localization by expanding the language query and modeling boundary probability scores.

Details

Motivation: Language queries usually lack specific details about action boundaries, which makes annotation subjective and introduces boundary uncertainty into the training data. Method: (1) Use an LLM to generate textual descriptions of the action boundaries and expand the query; (2) model boundary probability scores from the semantic similarity between frames and the expanded query and the temporal distance between frames and the annotated boundary frames. Result: Experiments show that the method effectively improves the stability of language-driven action localization. Conclusion: The method is model-agnostic and can be easily integrated into existing models, reducing boundary uncertainty and improving training stability. Abstract: Language-driven action localization in videos requires not only semantic alignment between the language query and the video segment, but also prediction of action boundaries. However, the language query primarily describes the main content of an action and usually lacks specific details of action start and end boundaries, which increases the subjectivity of manual boundary annotation and leads to boundary uncertainty in training data. In this paper, on one hand, we propose to expand the original query by generating textual descriptions of the action start and end boundaries through LLMs, which can provide more detailed boundary cues for localization and thus reduce the impact of boundary uncertainty. On the other hand, to enhance the tolerance to boundary uncertainty during training, we propose to model probability scores of action boundaries by calculating the semantic similarities between frames and the expanded query as well as the temporal distances between frames and the annotated boundary frames. They can provide more consistent boundary supervision, thus improving the stability of training. Our method is model-agnostic and can be seamlessly and easily integrated into any existing models of language-driven action localization in an off-the-shelf manner. Experimental results on several datasets demonstrate the effectiveness of our method.
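
The boundary probability scores described above combine two signals per frame: semantic similarity to the expanded query and temporal distance to the annotated boundary frame. The sketch below shows one plausible way to combine them. The abstract does not give the formula, so the Gaussian temporal kernel, the product rule, the normalization, and the names `boundary_scores` and `sigma` are all assumptions made for illustration.

```python
import math


def boundary_scores(similarities, annotated_idx, sigma=2.0):
    """Soft boundary labels over frames.

    similarities: per-frame similarity to the expanded boundary description.
    annotated_idx: index of the annotated boundary frame.
    sigma: width of the (assumed) Gaussian kernel over temporal distance.
    """
    weights = []
    for t, sim in enumerate(similarities):
        # Frames far from the annotated boundary are down-weighted.
        temporal = math.exp(-((t - annotated_idx) ** 2) / (2 * sigma ** 2))
        weights.append(max(sim, 0.0) * temporal)
    total = sum(weights)
    if total == 0:
        raise ValueError("all frame weights are zero")
    # Normalize so the scores form a distribution over frames.
    return [w / total for w in weights]


# Made-up similarities for four frames; the annotated boundary is frame 2.
scores = boundary_scores([0.1, 0.8, 0.9, 0.2], annotated_idx=2)
```

Compared with a one-hot label on the annotated frame, soft scores of this kind spread supervision over neighbouring frames that also match the boundary description.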

[44] EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding

Ege Özsoy,Arda Mamur,Felix Tristram,Chantal Pellegrini,Magdalena Wysocki,Benjamin Busam,Nassir Navab

Main category: cs.CV

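TL;DR: EgoExOR is the first operating room dataset to combine first-person and third-person perspectives, providing multimodal data for modeling clinical interactions.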

Details

Motivation: Operating rooms require precise coordination, and existing datasets cover only a single type of viewpoint, so they cannot fully support clinical perception. Method: Introduces the EgoExOR dataset, which integrates multimodal data from wearable devices and RGB-D cameras and provides detailed scene graph annotations. Result: Evaluates two models, proposes a new baseline, and shows the benefit of multimodal and multi-perspective signals. Conclusion: EgoExOR provides a new foundation for operating room perception research and supports next-generation clinical perception tasks. Abstract: Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating advanced perception models to enhance safety and efficiency. Existing datasets either provide partial egocentric views or sparse exocentric multi-view context, but do not explore the comprehensive combination of both. We introduce EgoExOR, the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives. Spanning 94 minutes (84,553 frames at 15 FPS) of two emulated spine procedures, Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery, EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses, exocentric RGB and depth from RGB-D cameras, and ultrasound imagery. Its detailed scene graph annotations, covering 36 entities and 22 relations (568,235 triplets), enable robust modeling of clinical interactions, supporting tasks like action recognition and human-centric perception. We evaluate the surgical scene graph generation performance of two adapted state-of-the-art models and offer a new baseline that explicitly leverages EgoExOR's multimodal and multi-perspective signals. This new dataset and benchmark set a new foundation for OR perception, offering a rich, multimodal resource for next-generation clinical perception.

[45] Category-aware EEG image generation based on wavelet transform and contrast semantic loss

Enshang Zhang,Zhicheng Zhang,Takashi Hanakawa

Main category: cs.CV

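TL;DR: This paper proposes a transformer-based EEG signal encoder that combines the discrete wavelet transform with a gating mechanism to extract features related to visual stimuli from EEG signals, which a pre-trained diffusion model then reconstructs into visual stimuli. Experiments show that the method significantly outperforms existing methods in semantic alignment and classification accuracy.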

Details

Motivation: Reconstructing visual stimuli from EEG signals is a key step toward brain-computer interfaces, and existing methods are limited in quantitative analysis at the semantic level. Method: Proposes a transformer encoder that combines the DWT with a gating mechanism, extracts EEG features under feature alignment and category-aware fusion losses, and reconstructs visual stimuli with a pre-trained diffusion model. Result: Experiments on the THINGS-EEG dataset show significant gains in semantic alignment and classification accuracy, with a maximum single-subject accuracy of 43%. Conclusion: The method performs well on EEG-to-visual-stimulus reconstruction and offers a new solution for brain-computer interfaces. Abstract: Reconstructing visual stimuli from EEG signals is a crucial step in realizing brain-computer interfaces. In this paper, we propose a transformer-based EEG signal encoder integrating the Discrete Wavelet Transform (DWT) and the gating mechanism. Guided by the feature alignment and category-aware fusion losses, this encoder is used to extract features related to visual stimuli from EEG signals. Subsequently, with the aid of a pre-trained diffusion model, these features are reconstructed into visual stimuli. To verify the effectiveness of the model, we conducted EEG-to-image generation and classification tasks using the THINGS-EEG dataset. To address the limitations of quantitative analysis at the semantic level, we combined WordNet-based classification and semantic similarity metrics to propose a novel semantic-based score, emphasizing the ability of our model to transfer neural activities into visual representations. Experimental results show that our model significantly improves semantic alignment and classification accuracy, achieving a maximum single-subject accuracy of 43% and outperforming other state-of-the-art methods. The source code and supplementary material are available at https://github.com/zes0v0inn/DWT_EEG_Reconstruction/tree/main.

[46] Progressive Class-level Distillation

Jiayan Li,Jun Li,Zhourui Zhang,Jianhua Xu

Main category: cs.CV

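TL;DR: Proposes Progressive Class-level Distillation (PCD), a new method that uses stage-wise distillation to address the neglect of low-probability class information in conventional knowledge distillation.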

Details

Motivation: In conventional knowledge distillation, high-confidence classes dominate the distillation process, so discriminative information in low-probability classes is ignored and knowledge transfer is insufficient. Method: PCD first ranks classes by the teacher-student logit difference to set a distillation priority, then performs bidirectional stage-wise distillation (fine-to-coarse and coarse-to-fine). Result: Experiments on public benchmark datasets show that PCD outperforms existing methods on classification and detection tasks. Conclusion: Through progressive stage-wise distillation, PCD achieves more comprehensive knowledge transfer and improves the student model. Abstract: In knowledge distillation (KD), logit distillation (LD) aims to transfer class-level knowledge from a more powerful teacher network to a small student model via accurate teacher-student alignment at the logits level. Since high-confidence object classes usually dominate the distillation process, low-probability classes which also contain discriminating information are downplayed in conventional methods, leading to insufficient knowledge transfer. To address this issue, we propose a simple yet effective LD method termed Progressive Class-level Distillation (PCD). In contrast to existing methods which perform all-class ensemble distillation, our PCD approach performs stage-wise distillation for step-by-step knowledge transfer. More specifically, we perform ranking on teacher-student logits difference for identifying distillation priority from scratch, and subsequently divide the entire LD process into multiple stages. Next, bidirectional stage-wise distillation incorporating fine-to-coarse progressive learning and reverse coarse-to-fine refinement is conducted, allowing comprehensive knowledge transfer via sufficient logits alignment within separate class groups in different distillation stages. Extensive experiments on public benchmarking datasets demonstrate the superiority of our method compared to state-of-the-art methods for both classification and detection tasks.
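
The ranking step described above (ordering classes by the teacher-student logit difference and splitting them into stages) can be sketched as follows. This is an illustration only: the function name `stage_groups`, the use of the absolute difference, the largest-gap-first ordering, and the equal-sized groups are assumptions, since the abstract does not specify them.

```python
def stage_groups(teacher_logits, student_logits, num_stages):
    """Rank classes by teacher-student logit gap and split them into stage groups."""
    if len(teacher_logits) != len(student_logits):
        raise ValueError("logit vectors must have the same length")
    if num_stages <= 0:
        raise ValueError("num_stages must be positive")
    gaps = [abs(t - s) for t, s in zip(teacher_logits, student_logits)]
    # Class indices ordered from largest gap to smallest (assumed priority).
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    size = -(-len(order) // num_stages)  # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]


# Made-up logits for four classes; the gaps are 1.0, 0.1, 2.0, and 0.0.
teacher = [5.0, 1.0, 0.5, -2.0]
student = [4.0, 1.1, 2.5, -2.0]
groups = stage_groups(teacher, student, num_stages=2)
# One possible bidirectional schedule: forward pass, then the reverse order.
schedule = groups + groups[::-1]
```

In a training loop, each stage would restrict the distillation loss to the classes in its group; that loss is not shown here.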

[47] InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing

Jinlu Zhang,Yixin Chen,Zan Wang,Jie Yang,Yizhou Wang,Siyuan Huang

Main category: cs.CV

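TL;DR: Proposes a zero-shot 3D human-object interaction generation framework that uses knowledge from large pre-trained models to generate interactions with open-set objects.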

Details

Motivation: Existing methods struggle to generate novel human-object interactions (HOI) from text, particularly for open-set objects. Method: Uses large language models (LLMs) to reason about human-object relations, a pre-trained 2D diffusion model to parse objects, multi-view SDS to generate the initial pose, and an optimization stage to produce fine-grained interactions. Result: Experiments show that the method outperforms existing methods in interaction detail and in handling open-set objects. Conclusion: The framework requires no training on specific datasets and effectively addresses interaction generation for open-set 3D objects. Abstract: Recent advances in 3D human-aware generation have made significant progress. However, existing methods still struggle with generating novel Human Object Interaction (HOI) from text, particularly for open-set objects. We identify three main challenges of this task: precise human-object relation reasoning, affordance parsing for any object, and detailed human interaction pose synthesis aligning description and object geometry. In this work, we propose a novel zero-shot 3D HOI generation framework without training on specific datasets, leveraging the knowledge from large-scale pre-trained models. Specifically, the human-object relations are inferred from large language models (LLMs) to initialize object properties and guide the optimization process. Then we utilize a pre-trained 2D image diffusion model to parse unseen objects and extract contact points, avoiding the limitations imposed by existing 3D asset knowledge. The initial human pose is generated by sampling multiple hypotheses through multi-view SDS based on the input text and object geometry. Finally, we introduce a detailed optimization to generate fine-grained, precise, and natural interaction, enforcing realistic 3D contact between the 3D object and the involved body parts, including hands in grasping. This is achieved by distilling human-level feedback from LLMs to capture detailed human-object relations from the text instruction. Extensive experiments validate the effectiveness of our approach compared to prior works, particularly in terms of the fine-grained nature of interactions and the ability to handle open-set 3D objects.

[48] STAR-Net: An Interpretable Model-Aided Network for Remote Sensing Image Denoising

Jingjing Liu,Jiashun Jin,Xianchao Xiu,Jianhua Zhang,Wanquan Liu

Main category: cs.CV

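TL;DR: Proposes STAR-Net, a new remote sensing image denoising method that combines a low-rank prior with non-local self-similarity and learns its parameters automatically through an ADMM-guided deep unrolling network.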

Details

Motivation: Current deep learning methods lack integration with physical models and interpretability, pay too little attention to non-local self-similarity, and require tedious parameter tuning. Method: Proposes STAR-Net and its sparse variant STAR-Net-S, which use a low-rank prior to capture non-local self-similarity and learn parameters automatically through an ADMM-guided deep unrolling network. Result: On synthetic and real-world datasets, STAR-Net and STAR-Net-S outperform existing methods. Conclusion: STAR-Net combines the advantages of model-driven and data-driven approaches and addresses the shortcomings of existing methods. Abstract: Remote sensing image (RSI) denoising is an important topic in the field of remote sensing. Despite the impressive denoising performance of RSI denoising methods, most current deep learning-based approaches function as black boxes and lack integration with physical information models, leading to limited interpretability. Additionally, many methods may struggle with insufficient attention to non-local self-similarity in RSI and require tedious tuning of regularization parameters to achieve optimal performance, particularly in conventional iterative optimization approaches. In this paper, we first propose a novel RSI denoising method named sparse tensor-aided representation network (STAR-Net), which leverages a low-rank prior to effectively capture the non-local self-similarity within RSI. Furthermore, we extend STAR-Net to a sparse variant called STAR-Net-S to deal with the interference caused by non-Gaussian noise in original RSI for the purpose of improving robustness. Different from conventional iterative optimization, we develop an alternating direction method of multipliers (ADMM)-guided deep unrolling network, in which all regularization parameters can be automatically learned, thus inheriting the advantages of both model-based and deep learning-based approaches and successfully addressing the above-mentioned shortcomings. Comprehensive experiments on synthetic and real-world datasets demonstrate that STAR-Net and STAR-Net-S outperform state-of-the-art RSI denoising methods.
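
To make the idea of learnable regularization parameters in an unrolled ADMM network more concrete, the sketch below shows soft-thresholding, the standard proximal operator of the L1 norm that typically appears in the sparse sub-step of ADMM solvers. It is a generic building block, not the STAR-Net update rule, which the abstract does not spell out; the function name `soft_threshold` and the example values are assumptions. In a deep unrolling network, the threshold `tau` of each stage would be a trainable parameter instead of a hand-tuned constant.

```python
def soft_threshold(values, tau):
    """Proximal operator of tau * ||x||_1, applied element-wise."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    out = []
    for v in values:
        sign = (v > 0) - (v < 0)
        # Shrink the magnitude by tau and clip at zero.
        out.append(sign * max(abs(v) - tau, 0.0))
    return out


# Small entries are zeroed out; large entries shrink toward zero by tau.
shrunk = soft_threshold([3.0, -0.5, -2.0, 0.0], tau=1.0)
```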

[49] DisTime: Distribution-based Time Representation for Video Large Language Models

Yingsen Zeng,Zepeng Huang,Yujie Zhong,Chengjian Feng,Jie Hu,Lin Ma,Yang Liu

Main category: cs.CV

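TL;DR: The DisTime framework improves the temporal localization of Video-LLMs through a continuous temporal embedding and a distribution-based decoder, and builds the large-scale InternVid-TG dataset.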

Details

Motivation: Existing Video-LLMs are limited in temporal localization by discrete time representations and insufficient datasets. Method: DisTime uses a learnable token and a distribution-based decoder, and builds the InternVid-TG dataset through automated annotation. Result: Achieves state-of-the-art results on three time-sensitive tasks while remaining competitive on Video QA. Conclusion: DisTime effectively addresses temporal localization and improves performance through dataset expansion. Abstract: Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. Code and data are released at https://github.com/josephzpng/DisTime.
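
One way to picture a decoder that outputs a temporal probability distribution and still yields a continuous timestamp is to take the expectation of that distribution over time bins. The sketch below does this under stated assumptions: uniformly spaced bins over the video duration, a softmax over per-bin logits, and an expectation readout. The name `decode_time` and these design choices are illustrative and are not taken from the DisTime paper.

```python
import math


def decode_time(bin_logits, duration_s):
    """Return (expected timestamp in seconds, per-bin probabilities)."""
    if not bin_logits:
        raise ValueError("bin_logits must not be empty")
    peak = max(bin_logits)
    # Numerically stable softmax over the time bins.
    exps = [math.exp(x - peak) for x in bin_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    n = len(probs)
    # Centre of each uniformly spaced bin, in seconds.
    centers = [(i + 0.5) * duration_s / n for i in range(n)]
    expected = sum(p * c for p, c in zip(probs, centers))
    return expected, probs


# Uniform logits over an 8-second clip decode to the midpoint of the clip.
midpoint, uniform_probs = decode_time([0.0, 0.0, 0.0, 0.0], duration_s=8.0)
```

Because the output is an expectation rather than an argmax, the decoded time can fall between bin centres, which is one way a distribution can ease the boundary ambiguity that discrete time tokens introduce.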

[50] KairosAD: A SAM-Based Model for Industrial Anomaly Detection on Embedded Devices

Uzair Khan,Franco Fummi,Luigi Capogrosso

Main category: cs.CV

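TL;DR: KairosAD is a new supervised method for image anomaly detection built on MobileSAM and designed for resource-constrained embedded devices. It uses 78% fewer parameters and runs inference 4x faster while matching the performance of existing models.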

Details

Motivation: Existing anomaly detection models are hard to deploy on the resource-constrained production lines of SMEs. Method: Uses MobileSAM for image-based anomaly detection and evaluates on the MVTec-AD and ViSA datasets. Result: 78% fewer parameters, 4x faster inference, and comparable AUROC; successfully deployed on NVIDIA Jetson devices and a real production line. Conclusion: KairosAD offers an efficient and practical anomaly detection solution for resource-constrained environments. Abstract: In the era of intelligent manufacturing, anomaly detection has become essential for maintaining quality control on modern production lines. However, while many existing models show promising performance, they are often too large, computationally demanding, and impractical to deploy on resource-constrained embedded devices that can be easily installed on the production lines of Small and Medium Enterprises (SMEs). To bridge this gap, we present KairosAD, a novel supervised approach that uses the power of the Mobile Segment Anything Model (MobileSAM) for image-based anomaly detection. KairosAD has been evaluated on the two well-known industrial anomaly detection datasets, i.e., MVTec-AD and ViSA. The results show that KairosAD requires 78% fewer parameters and boasts a 4x faster inference time compared to the leading state-of-the-art model, while maintaining comparable AUROC performance. We deployed KairosAD on two embedded devices, the NVIDIA Jetson NX, and the NVIDIA Jetson AGX. Finally, KairosAD was successfully installed and tested on the real production line of the Industrial Computer Engineering Laboratory (ICE Lab) at the University of Verona. The code is available at https://github.com/intelligolabs/KairosAD.

[51] GeoVision Labeler: Zero-Shot Geospatial Classification with Vision and Language Models

Gilles Quentin Hacheme,Girmaw Abebe Tadesse,Caleb Robinson,Akram Zaytar,Rahul Dodhia,Juan M. Lavista Ferres

Main category: cs.CV

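TL;DR: GVL is a strictly zero-shot classification framework in which a vision large language model generates image descriptions and a conventional large language model maps them to user-defined classes, enabling flexible classification.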

Details

Motivation: Address the scarcity of annotated data in geospatial image classification by providing a zero-shot approach that needs no task-specific pretraining. Method: A vision large language model generates image descriptions, and a conventional large language model maps them to user-defined classes; complex tasks use recursive LLM-driven clustering and hierarchical classification. Result: Reaches up to 93.2% zero-shot accuracy on SpaceNet v7 and delivers competitive performance on complex multi-class tasks through hierarchical classification. Conclusion: GVL provides a modular, interpretable zero-shot classification framework for a range of geospatial applications and is open-sourced to encourage practical use. Abstract: Classifying geospatial imagery remains a major bottleneck for applications such as disaster response and land-use monitoring, particularly in regions where annotated data is scarce or unavailable. Existing tools (e.g., RS-CLIP) that claim zero-shot classification capabilities for satellite imagery nonetheless rely on task-specific pretraining and adaptation to reach competitive performance. We introduce GeoVision Labeler (GVL), a strictly zero-shot classification framework: a vision Large Language Model (vLLM) generates rich, human-readable image descriptions, which are then mapped to user-defined classes by a conventional Large Language Model (LLM). This modular and interpretable pipeline enables flexible image classification for a large range of use cases. We evaluated GVL across three benchmarks: SpaceNet v7, UC Merced, and RESISC45. It achieves up to 93.2% zero-shot accuracy on the binary Buildings vs. No Buildings task on SpaceNet v7. For complex multi-class classification tasks (UC Merced, RESISC45), we implemented a recursive LLM-driven clustering to form meta-classes at successive depths, followed by hierarchical classification (first resolving coarse groups, then finer distinctions) to deliver competitive zero-shot performance. GVL is open-sourced at https://github.com/microsoft/geo-vision-labeler to catalyze adoption in real-world geospatial workflows.
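
The two-stage pipeline described above (describe the image, then map the description to a user-defined class) can be sketched with the two models passed in as plain callables. Everything here is hypothetical scaffolding: `label_image`, `fake_describe`, and `fake_choose` are stand-ins invented for illustration and are not the API of the released geo-vision-labeler code.

```python
from typing import Callable, Optional, Sequence


def label_image(
    image_path: str,
    classes: Sequence[str],
    describe: Callable[[str], str],
    choose: Callable[[str, Sequence[str]], str],
) -> Optional[str]:
    """Stage 1: describe the image. Stage 2: map the description to a class."""
    description = describe(image_path)
    answer = choose(description, classes).strip().lower()
    # Accept the answer only if it matches one of the user-defined classes.
    for name in classes:
        if name.lower() == answer:
            return name
    return None


def fake_describe(image_path: str) -> str:
    # Stand-in for a vision LLM call (hypothetical).
    return "Aerial view of several rooftops next to a road."


def fake_choose(description: str, classes: Sequence[str]) -> str:
    # Stand-in for a text LLM call (hypothetical keyword rule).
    return classes[0] if "rooftops" in description else classes[1]


label = label_image("tile_001.png", ["Buildings", "No Buildings"],
                    fake_describe, fake_choose)
```

Keeping the two stages behind separate callables mirrors the modularity the abstract emphasizes: either model can be swapped without touching the other, and the intermediate description remains available for inspection.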

[52] KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval

Fanhang Man,Xiaoyue Chen,Huandong Wang,Baining Zhao,Han Li,Xinlei Chen,Yong Li

Main category: cs.CV

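TL;DR: The paper proposes K-EVER², a knowledge-enhanced framework for visual emotion analysis that integrates external affective knowledge to address the abstract and overlapping emotional cues and sparse supervision that limit existing vision-language models.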

Details

Motivation: Emotional cues in images are often abstract and complex, existing vision-language models struggle to model and interpret them accurately, and structured affective knowledge is missing. Method: Proposes the K-EVER² framework, which combines semantically structured emotion cues with external affective knowledge through multimodal alignment and predicts emotions without handcrafted labels. Result: On three benchmark datasets, K-EVER² significantly outperforms CNN and VLM baselines, with accuracy gains of up to 19%. Conclusion: K-EVER² offers a scalable and generalizable solution for emotional understanding of visual content. Abstract: Understanding what emotions images evoke in their viewers is a foundational goal in human-centric visual computing. While recent advances in vision-language models (VLMs) have shown promise for visual emotion analysis (VEA), several key challenges remain unresolved. Emotional cues in images are often abstract, overlapping, and entangled, making them difficult to model and interpret. Moreover, VLMs struggle to align these complex visual patterns with emotional semantics due to limited supervision and sparse emotional grounding. Finally, existing approaches lack structured affective knowledge to resolve ambiguity and ensure consistent emotional reasoning across diverse visual domains. To address these limitations, we propose K-EVER², a knowledge-enhanced framework for emotion reasoning and retrieval. Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment. Without relying on handcrafted labels or direct emotion supervision, K-EVER² achieves robust and interpretable emotion predictions across heterogeneous image types. We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts. K-EVER² consistently outperforms strong CNN and VLM baselines, achieving up to a 19% accuracy gain for specific emotions and a 12.3% average accuracy gain across all emotion categories. Our results demonstrate a scalable and generalizable solution for advancing emotional understanding of visual content.

[53] VUDG: A Dataset for Video Understanding Domain Generalization

Ziyi Wang,Zhi Gao,Boxuan Yu,Zirui Dai,Yuxiang Song,Qingyuan Lu,Jin Chen,Xinxiao Wu

Main category: cs.CV

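TL;DR: The paper introduces the Video Understanding Domain Generalization (VUDG) dataset for evaluating domain generalization in video understanding, and shows that existing models lose performance under domain shift.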

Details

Motivation: Existing video understanding research usually ignores the domain shifts found in real-world applications, leaving domain generalization in video understanding underexplored. Method: The VUDG dataset contains videos from 11 distinct domains covering three types of domain shift, and a multi-expert progressive annotation framework labels each video with multiple-choice and open-ended question-answer pairs. Result: Experiments show that most models, including state-of-the-art LVLMs, lose performance under domain shift, revealing differences in how robust current models are to shifts in data distribution. Conclusion: VUDG provides a valuable resource for future research on domain generalization in video understanding. Abstract: Video understanding has made remarkable progress in recent years, largely driven by advances in deep models and the availability of large-scale annotated datasets. However, existing works typically ignore the inherent domain shifts encountered in real-world video applications, leaving domain generalization (DG) in video understanding underexplored. Hence, we propose Video Understanding Domain Generalization (VUDG), a novel dataset designed specifically for evaluating the DG performance in video understanding. VUDG contains videos from 11 distinct domains that cover three types of domain shifts, and maintains semantic similarity across different domains to ensure fair and meaningful evaluation. We propose a multi-expert progressive annotation framework to annotate each video with both multiple-choice and open-ended question-answer pairs. Extensive experiments on 9 representative large video-language models (LVLMs) and several traditional video question answering methods show that most models (including state-of-the-art LVLMs) suffer performance degradation under domain shifts. These results highlight the challenges posed by VUDG and the difference in the robustness of current models to data distribution shifts. We believe VUDG provides a valuable resource for prompting future research in domain generalization video understanding.

Roger Ferrod,Cássio F. Dantas,Luigi Di Caro,Dino Ienco

Main category: cs.CV

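TL;DR: CroDiNo-KD is a new cross-modal knowledge distillation framework that uses disentangled representations, contrastive learning, and data augmentation to avoid the teacher-architecture and distillation-process choices that constrain conventional methods.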

Details

Motivation: Multi-modal RGBD data may be available at training time but missing at inference because of sensor failures or resource constraints, creating a mismatch between training and inference modalities. Conventional CMKD methods also face difficult choices of teacher architecture and distillation process. Method: CroDiNo-KD uses disentangled representations, contrastive learning, and decoupled data augmentation to learn single-modality RGB and Depth models simultaneously, structuring the internal manifolds of the networks through interaction and collaboration. Result: Experiments on three RGBD datasets illustrate the quality of CroDiNo-KD relative to recent CMKD frameworks and suggest reconsidering the conventional teacher/student paradigm. Conclusion: CroDiNo-KD offers a new perspective on cross-modal knowledge distillation and shows potential for real-world scenarios. Abstract: Multi-modal RGB and Depth (RGBD) data are predominant in many domains such as robotics, autonomous driving and remote sensing. The combination of these multi-modal data enhances environmental perception by providing 3D spatial context, which is absent in standard RGB images. Although RGBD multi-modal data can be available to train computer vision models, accessing all sensor modalities during the inference stage may be infeasible due to sensor failures or resource constraints, leading to a mismatch between data modalities available during training and inference. Traditional Cross-Modal Knowledge Distillation (CMKD) frameworks, developed to address this task, are typically based on a teacher/student paradigm, where a multi-modal teacher distills knowledge into a single-modality student model. However, these approaches face challenges in teacher architecture choices and distillation process selection, thus limiting their adoption in real-world scenarios. To overcome these issues, we introduce CroDiNo-KD (Cross-Modal Disentanglement: a New Outlook on Knowledge Distillation), a novel cross-modal knowledge distillation framework for RGBD semantic segmentation. Our approach simultaneously learns single-modality RGB and Depth models by exploiting disentanglement representation, contrastive learning and decoupled data augmentation with the aim to structure the internal manifolds of neural network models through interaction and collaboration. We evaluated CroDiNo-KD on three RGBD datasets across diverse domains, considering recent CMKD frameworks as competitors. Our findings illustrate the quality of CroDiNo-KD, and they suggest reconsidering the conventional teacher/student paradigm to distill information from multi-modal data to single-modality neural networks.

[55] Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering

Md Intisar Chowdhury,Kittinun Aukkapinyo,Hiroshi Fujimura,Joo Ann Woo,Wasu Wasusatein,Fadoua Ghourabi

Main category: cs.CV

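TL;DR: Grid-LoGAT extracts text transcripts from video frames using grid-based visual prompting and combines a VLM with an LLM for video question answering, preserving image privacy while improving performance.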

Details

Motivation: Protect image privacy and improve transcript quality in video question answering. Method: The system works in two phases: a VLM extracts text transcripts from frames, then an LLM answers questions from those transcripts. Grid-based visual prompting improves the transcripts. Result: The system reaches 65.9% accuracy on NExT-QA and 50.11% on STAR-QA, and beats the non-grid version by 24 points on localization-based questions. Conclusion: Grid-LoGAT performs well on both privacy protection and accuracy, making it suitable for video question answering. Abstract: In this paper, we propose a Grid-based Local and Global Area Transcription (Grid-LoGAT) system for Video Question Answering (VideoQA). The system operates in two phases. First, it extracts text transcripts from video frames using a Vision-Language Model (VLM). Next, it processes questions using these transcripts to generate answers through a Large Language Model (LLM). This design ensures image privacy by deploying the VLM on edge devices and the LLM in the cloud. To improve transcript quality, we propose grid-based visual prompting, which extracts intricate local details from each grid cell and integrates them with global information. Evaluation results show that Grid-LoGAT, using the open-source VLM (LLaVA-1.6-7B) and LLM (Llama-3.1-8B), outperforms state-of-the-art methods with similar baseline models on NExT-QA and STAR-QA datasets with an accuracy of 65.9% and 50.11% respectively. Additionally, our method surpasses the non-grid version by 24 points on localization-based questions we created using NExT-QA.
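
The grid-based prompting idea in the abstract above can be illustrated with a minimal sketch. It only shows how a frame might be tiled into cells and paired with one global and several local prompts. The function names, the prompt wording, and the row-by-row cell order are assumptions made for illustration and are not taken from the paper.

```python
def grid_cells(width, height, rows, cols):
    """Return (left, top, right, bottom) pixel boxes that tile a frame row by row."""
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    boxes = []
    for r in range(rows):
        # Integer division keeps shared edges aligned, so the cells cover the frame exactly.
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((left, top, right, bottom))
    return boxes


def build_prompts(boxes):
    """Pair each cell with a hypothetical local prompt, plus one global prompt."""
    prompts = [("global", None, "Describe the whole frame.")]
    for index, box in enumerate(boxes):
        prompts.append(("local", box, f"Describe the details in grid cell {index}."))
    return prompts


cells = grid_cells(640, 480, 2, 2)
prompts = build_prompts(cells)
```

Each local prompt would be sent to the VLM together with its cropped cell, and the resulting local transcripts merged with the global one before the LLM stage. How that merge is done is not specified in the abstract.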

[56] D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding

Yichi Zhang,Gongwei Chen,Jun Zhu,Jia Wan

Main category: cs.CV

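TL;DR: D2AF is an annotation framework for visual grounding that uses only input images. Its dual-driven strategy generates region-text pairs to increase data volume and diversity, and a filtering method improves data quality, significantly improving model performance.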

Details

Motivation: The high cost of manual annotation limits both the scale of visual grounding data and the resulting model gains, and existing pseudo-label methods depend on human-labeled captions, which limits scalability and diversity. Method: The D2AF framework combines multimodal large models with object detection models, generates region-text pairs with dual-driven strategies, and improves data quality with a consistency- and distribution-aware filtering method. Result: Experiments show that more data improves performance, but the size of the gain depends on how well the pseudo labels broaden the original data distribution. The filtering method removes noisy data and significantly improves model performance. Conclusion: D2AF overcomes data-scale limits, improves data diversity and quality, and achieves state-of-the-art results on three visual grounding tasks. Abstract: Visual Grounding is a task that aims to localize a target region in an image based on a free-form natural language description. With the rise of Transformer architectures, there is an increasing need for larger datasets to boost performance. However, the high cost of manual annotation poses a challenge, hindering the scale of data and the ability of large models to enhance their effectiveness. Previous pseudo label generation methods heavily rely on human-labeled captions of the original dataset, limiting scalability and diversity. To address this, we propose D2AF, a robust annotation framework for visual grounding using only input images. This approach overcomes dataset size limitations and enriches both the quantity and diversity of referring expressions. Our approach leverages multimodal large models and object detection models. By implementing dual-driven annotation strategies, we effectively generate detailed region-text pairs using both closed-set and open-set approaches. We further conduct an in-depth analysis of data quantity and data distribution. Our findings demonstrate that increasing data volume enhances model performance. However, the degree of improvement depends on how well the pseudo labels broaden the original data distribution. Based on these insights, we propose a consistency and distribution aware filtering method to further improve data quality by effectively removing erroneous and redundant data. This approach effectively eliminates noisy data, leading to improved performance. Experiments on three visual grounding tasks demonstrate that our method significantly improves the performance of existing models and achieves state-of-the-art results.

[57] Spatiotemporal Analysis of Forest Machine Operations Using 3D Video Classification

Maciej Wielgosz,Simon Berg,Heikki Korpunen,Stephan Hoffmann

Main category: cs.CV

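TL;DR: A deep learning framework for classifying forestry operations from dashcam video, built on a 3D ResNet-50 architecture, achieves a validation F1 score of 0.88 and a precision of 0.90.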

Details

Motivation: Reduce the manual workload of traditional time studies and provide a scalable solution for operational monitoring and efficiency analysis in forestry. Method: A 3D ResNet-50 architecture implemented with PyTorchVideo, combined with standard preprocessing and data augmentation. Result: The model performs well but shows overfitting, indicating a need for more data and better class balance. Conclusion: The method shows potential in forestry environments and lays the foundation for real-time activity recognition systems. Planned work includes expanding the dataset and strengthening regularization. Abstract: This paper presents a deep learning-based framework for classifying forestry operations from dashcam video footage. Focusing on four key work elements - crane-out, cutting-and-to-processing, driving, and processing - the approach employs a 3D ResNet-50 architecture implemented with PyTorchVideo. Trained on a manually annotated dataset of field recordings, the model achieves strong performance, with a validation F1 score of 0.88 and precision of 0.90. These results underscore the effectiveness of spatiotemporal convolutional networks for capturing both motion patterns and appearance in real-world forestry environments. The system integrates standard preprocessing and augmentation techniques to improve generalization, but overfitting is evident, highlighting the need for more training data and better class balance. Despite these challenges, the method demonstrates clear potential for reducing the manual workload associated with traditional time studies, offering a scalable solution for operational monitoring and efficiency analysis in forestry. This work contributes to the growing application of AI in natural resource management and sets the foundation for future systems capable of real-time activity recognition in forest machinery. Planned improvements include dataset expansion, enhanced regularization, and deployment trials on embedded systems for in-field use.
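
To make the relation between the two reported metrics concrete, the sketch below shows the standard F1 definition (harmonic mean of precision and recall) and its inversion for recall. The abstract does not say how the scores are averaged over the four classes, so the implied recall computed here holds only under the assumption that both numbers describe a single class or a micro-average. It is an illustration of the formula and not a result from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recall_from_f1(f1, precision):
    """Invert the F1 formula for recall; meaningful only for a single precision/recall pair."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall for this F1/precision pair")
    return f1 * precision / denominator


# Assumption: F1 = 0.88 and precision = 0.90 refer to the same class or micro-average.
implied_recall = recall_from_f1(0.88, 0.90)
print(round(implied_recall, 3))  # prints 0.861
```

With macro-averaged scores the inversion no longer applies, because the mean of per-class F1 values is not the F1 of the mean precision and recall.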

[58] SASP: Strip-Aware Spatial Perception for Fine-Grained Bird Image Classification

Zheng Wang

Main category: cs.CV

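TL;DR: The paper proposes a fine-grained bird classification framework based on strip-aware spatial perception, which captures long-range spatial dependencies in images to improve model robustness and interpretability.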

Details

Motivation: Fine-grained bird image classification (FBIC) matters for ecological monitoring and species identification, but varying bird sizes, background interference, and highly variable poses make it hard for traditional methods to extract discriminative features reliably. Method: Two new modules are proposed: the extensional perception aggregator (EPA), which integrates local and global information, and channel semantic weaving (CSW), which refines semantic representations. Result: Experiments on the CUB-200-2011 dataset show significant performance improvements while maintaining architectural efficiency. Conclusion: By modeling long-range spatial dependencies, the framework improves robustness and interpretability and provides an effective solution for fine-grained bird classification. Abstract: Fine-grained bird image classification (FBIC) is not only of great significance for ecological monitoring and species identification, but also holds broad research value in the fields of image recognition and fine-grained visual modeling. Compared with general image classification tasks, FBIC poses more formidable challenges: 1) the differences in species size and imaging distance result in the varying sizes of birds presented in the images; 2) complex natural habitats often introduce strong background interference; and 3) highly flexible poses such as flying, perching, or foraging result in substantial intra-class variability. These factors collectively make it difficult for traditional methods to stably extract discriminative features, thereby limiting the generalizability and interpretability of models in real-world applications. To address these challenges, this paper proposes a fine-grained bird classification framework based on strip-aware spatial perception, which aims to capture long-range spatial dependencies across entire rows or columns in bird images, thereby enhancing the model's robustness and interpretability. The proposed method incorporates two novel modules: extensional perception aggregator (EPA) and channel semantic weaving (CSW). Specifically, EPA integrates local texture details with global structural cues by aggregating information across horizontal and vertical spatial directions. CSW further refines the semantic representations by adaptively fusing long-range and short-range information along the channel dimension. Built upon a ResNet-50 backbone, the model enables jump-wise connection of extended structural features across the spatial domain. Experimental results on the CUB-200-2011 dataset demonstrate that our framework achieves significant performance improvements while maintaining architectural efficiency.
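
The phrase "aggregating information across horizontal and vertical spatial directions" can be illustrated with generic strip pooling, sketched below for a single-channel feature map. This is not the paper's EPA module, whose internals the abstract does not give. The function names and the additive way of broadcasting the two strips back onto the map are assumptions.

```python
def strip_pool(feature_map):
    """Average a 2D feature map along whole rows and whole columns."""
    height = len(feature_map)
    width = len(feature_map[0])
    row_strip = [sum(row) / width for row in feature_map]
    col_strip = [
        sum(feature_map[r][c] for r in range(height)) / height for c in range(width)
    ]
    return row_strip, col_strip


def strip_context(feature_map):
    """Broadcast both strips back to the map: context[r][c] = row_mean[r] + col_mean[c]."""
    row_strip, col_strip = strip_pool(feature_map)
    return [
        [row_strip[r] + col_strip[c] for c in range(len(col_strip))]
        for r in range(len(row_strip))
    ]


features = [[1, 2, 3], [4, 5, 6]]
rows, cols = strip_pool(features)
context = strip_context(features)
```

Every position in `context` depends on its entire row and entire column, which is the sense in which strip-shaped aggregation captures long-range dependencies that a small square kernel would miss.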

[59] Leadership Assessment in Pediatric Intensive Care Unit Team Training

Liangyang Ouyang,Yuki Sakai,Ryosuke Furuta,Hisataka Nozawa,Hikoro Matsui,Yoichi Sato

Main category: cs.CV

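TL;DR: The paper proposes an automated analysis framework based on egocentric vision for assessing the leadership skills of PICU teams, quantifying behavioral cues such as fixation objects, eye contact, and conversation patterns.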

Details

Motivation: Assessing the leadership skills of PICU teams is essential for training and improvement, but traditional methods rely on subjective evaluation and lack objectivity. Method: Aria Glasses record video, audio, gaze, and head movement data, which are analyzed automatically with REMoDNaV, SAM, YOLO, and ChatGPT. Result: Experiments show significant correlations between leadership skills and behavioral metrics such as fixation time, transition patterns, and direct orders. Conclusion: The proposed framework can effectively address skill assessment for PICU teams. Abstract: This paper addresses the task of assessing the leadership skills of PICU teams by developing an automated analysis framework based on egocentric vision. We identify key behavioral cues, including fixation object, eye contact, and conversation patterns, as essential indicators of leadership assessment. In order to capture these multimodal signals, we employ Aria Glasses to record egocentric video, audio, gaze, and head movement data. We collect one-hour videos of four simulated sessions involving doctors with different roles and levels. To automate data processing, we propose a method leveraging REMoDNaV, SAM, YOLO, and ChatGPT for fixation object detection, eye contact detection, and conversation classification. In the experiments, significant correlations are observed between leadership skills and behavioral metrics, i.e., the output of our proposed methods, such as fixation time, transition patterns, and direct orders in speech. These results indicate that our proposed data collection and analysis framework can effectively solve skill assessment for training PICU teams.

[60] S3CE-Net: Spike-guided Spatiotemporal Semantic Coupling and Expansion Network for Long Sequence Event Re-Identification

Xianheng Ma,Hongchen Tan,Xiuping Liu,Yi Zhang,Huasheng Wang,Jiang Liu,Ying Chen,Hantao Liu

Main category: cs.CV

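TL;DR: The paper proposes S3CE-Net, a long-sequence person re-identification model for event cameras that processes asynchronous event data with spiking neural networks and combines a spatial-temporal attention mechanism with a feature sampling strategy to achieve efficient, low-parameter performance.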

Details

Motivation: The study exploits the advantages of event cameras (robustness to harsh lighting, reduced background interference, and privacy protection) to address long-sequence event-based person re-identification. Method: S3CE-Net is built on spiking neural networks (SNNs) and includes a Spike-guided Spatial-temporal Attention Mechanism (SSAM) and a Spatiotemporal Feature Sampling Strategy (STFS). SSAM performs spatial and temporal semantic interaction, and STFS samples feature subsequences to improve robustness. Result: Experiments show that S3CE-Net performs strongly on several mainstream datasets while remaining efficient with few parameters. Conclusion: S3CE-Net is an efficient, low-parameter solution for long-sequence event-based person re-identification with broad application potential. Abstract: In this paper, we leverage the advantages of event cameras to resist harsh lighting conditions, reduce background interference, achieve high time resolution, and protect facial information to study the long-sequence event-based person re-identification (Re-ID) task. To this end, we propose a simple and efficient long-sequence event Re-ID model, namely the Spike-guided Spatiotemporal Semantic Coupling and Expansion Network (S3CE-Net). To better handle asynchronous event data, we build S3CE-Net based on spiking neural networks (SNNs). The S3CE-Net incorporates the Spike-guided Spatial-temporal Attention Mechanism (SSAM) and the Spatiotemporal Feature Sampling Strategy (STFS). The SSAM is designed to carry out semantic interaction and association in both spatial and temporal dimensions, leveraging the capabilities of SNNs. The STFS involves sampling spatial feature subsequences and temporal feature subsequences from the spatiotemporal dimensions, driving the Re-ID model to perceive broader and more robust effective semantics. Notably, the STFS introduces no additional parameters and is only utilized during the training stage. Therefore, S3CE-Net is a low-parameter and high-efficiency model for long-sequence event-based person Re-ID. Extensive experiments have verified that our S3CE-Net achieves outstanding performance on many mainstream long-sequence event-based person Re-ID datasets. Code is available at: https://github.com/Mhsunshine/SC3E_Net.

[61] Leveraging Intermediate Features of Vision Transformer for Face Anti-Spoofing

Mika Feng,Koichi Ito,Takafumi Aoki,Tetsushi Ohki,Masakatsu Nishigaki

Main category: cs.CV

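TL;DR: The paper proposes a spoofing attack detection method based on Vision Transformer (ViT) that uses ViT's intermediate features to balance local and global information, together with two data augmentation methods that improve detection accuracy.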

Details

Motivation: Prevent malicious users from fooling face recognition systems with photos, so that authentication remains secure. Method: ViT's intermediate features are used to compute the training loss and the inference score, and two data augmentation methods are introduced: face anti-spoofing data augmentation and patch-wise data augmentation. Result: The method's effectiveness is validated on the OULU-NPU and SiW datasets. Conclusion: The method detects spoofing attacks effectively and improves the security of face recognition systems. Abstract: Face recognition systems are designed to be robust against changes in head pose, illumination, and blurring during image capture. If a malicious person presents a face photo of the registered user, they may bypass the authentication process illegally. Such spoofing attacks need to be detected before face recognition. In this paper, we propose a spoofing attack detection method based on Vision Transformer (ViT) to detect minute differences between live and spoofed face images. The proposed method utilizes the intermediate features of ViT, which have a good balance between local and global features that are important for spoofing attack detection, for calculating loss in training and score in inference. The proposed method also introduces two data augmentation methods: face anti-spoofing data augmentation and patch-wise data augmentation, to improve the accuracy of spoofing attack detection. We demonstrate the effectiveness of the proposed method through experiments using the OULU-NPU and SiW datasets.

Kanokphan Lertniphonphan,Feng Chen,Junda Xu,Fengbu Lan,Jun Xie,Tao Zhang,Zhepeng Wang

Main category: cs.CV

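TL;DR: The PCIE_Interaction solution addresses the LAM and TTM tasks of the CVPR 2025 Ego4D Social Interaction Challenge, using face quality enhancement and ensembling for LAM and audio-visual fusion for TTM, and reaches 0.81 and 0.71 mAP respectively.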

Details

Motivation: Improve accuracy on two social interaction detection tasks: LAM, which relies only on face crop sequences, and TTM, which combines faces with audio. Method: The LAM task uses face quality enhancement and ensemble methods. The TTM task fuses audio and visual cues, weighted by a visual quality score. Result: The solution achieves 0.81 mAP on LAM and 0.71 mAP on TTM. Conclusion: PCIE_Interaction performs strongly on both tasks, and the code is open source. Abstract: This report presents our team's PCIE_Interaction solution for the Ego4D Social Interaction Challenge at CVPR 2025, addressing both Looking At Me (LAM) and Talking To Me (TTM) tasks. The challenge requires accurate detection of social interactions between subjects and the camera wearer, with LAM relying exclusively on face crop sequences and TTM combining speaker face crops with synchronized audio segments. In the LAM track, we employ face quality enhancement and ensemble methods. For the TTM task, we extend visual interaction analysis by fusing audio and visual cues, weighted by a visual quality score. Our approach achieved 0.81 and 0.71 mean average precision (mAP) on the LAM and TTM challenge leaderboards, respectively. Code is available at https://github.com/KanokphanL/PCIE_Ego4D_Social_Interaction
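
One simple way to read "fusing audio and visual cues, weighted by a visual quality score" is a convex combination in which the visual branch counts for more when the face crop is of high quality. The sketch below assumes exactly that. The function names, the linear weighting, and the plain mean used for ensembling are illustrative assumptions and are not taken from the team's code.

```python
def fuse_scores(visual_score, audio_score, visual_quality):
    """Blend two probabilities; visual_quality in [0, 1] is the weight on the visual branch."""
    if not 0.0 <= visual_quality <= 1.0:
        raise ValueError("visual_quality must lie in [0, 1]")
    return visual_quality * visual_score + (1.0 - visual_quality) * audio_score


def ensemble_mean(scores):
    """Average the scores of several models for the same sample."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)


# A blurry face crop (low quality) lets the audio score dominate the fused result.
fused_blurry = fuse_scores(visual_score=0.9, audio_score=0.3, visual_quality=0.1)
fused_sharp = fuse_scores(visual_score=0.9, audio_score=0.3, visual_quality=0.9)
```

Under this weighting a fully unreliable face crop (quality 0) reduces the prediction to the audio score alone, and a perfect crop (quality 1) reduces it to the visual score alone.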

[63] IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models

Hanting Wang,Tao Jin,Wang Lin,Shulei Wang,Hai Huang,Shengpeng Ji,Zhou Zhao

Main category: cs.CV

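TL;DR: The paper proposes the IRBridge framework, which uses pretrained generative models to improve image restoration and avoids the high cost of training a separate bridge model for each degradation type.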

Details

Motivation: Existing image restoration bridge models must be trained separately for each degradation type, which is computationally expensive and limits performance. The paper aims to remove this requirement by using pretrained generative priors. Method: A transition equation connects two diffusion processes that share the same endpoint distribution, and the IRBridge framework builds on it to use generative models directly for image restoration. Result: Experiments on six image restoration tasks show that IRBridge integrates generative priors efficiently, improving robustness and generalization. Conclusion: IRBridge offers a more flexible and adaptable approach to image restoration, significantly reducing computational cost and improving performance. Abstract: Bridge models in image restoration construct a diffusion process from degraded to clear images. However, existing methods typically require training a bridge model from scratch for each specific type of degradation, resulting in high computational costs and limited performance. This work aims to efficiently leverage pretrained generative priors within existing image restoration bridges to eliminate this requirement. The main challenge is that standard generative models are typically designed for a diffusion process that starts from pure noise, while restoration tasks begin with a low-quality image, resulting in a mismatch in the state distributions between the two processes. To address this challenge, we propose a transition equation that bridges two diffusion processes with the same endpoint distribution. Based on this, we introduce the IRBridge framework, which enables the direct utilization of generative models within image restoration bridges, offering a more flexible and adaptable approach to image restoration. Extensive experiments on six image restoration tasks demonstrate that IRBridge efficiently integrates generative priors, resulting in improved robustness and generalization performance. Code will be available at GitHub.

[64] PCIE_Pose Solution for EgoExo4D Pose and Proficiency Estimation Challenge

Feng Chen,Kanokphan Lertniphonphan,Qiancheng Yan,Xiaohui Fan,Jun Xie,Tao Zhang,Zhepeng Wang

Main category: cs.CV

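TL;DR: The team proposes HP-ViT+ and a spatio-temporal feature fusion method, winning the pose challenges at CVPR2025 and achieving a SOTA result on the proficiency estimation task.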

Details

Motivation: Address the difficulties of estimating hand and body pose from RGB video, such as subtle movements and occlusions. Method: HP-ViT+ combines a Vision Transformer with a CNN and refines hand pose through weighted fusion. Body pose uses multimodal spatio-temporal feature fusion. Result: 8.31 PA-MPJPE for hand pose, 11.25 MPJPE for body pose, and 0.53 top-1 accuracy for proficiency estimation. Conclusion: The methods perform strongly across multiple tasks, validating their effectiveness. Abstract: This report introduces our team's (PCIE_EgoPose) solutions for the EgoExo4D Pose and Proficiency Estimation Challenges at CVPR2025. Focused on the intricate task of estimating 21 3D hand joints from RGB egocentric videos, which are complicated by subtle movements and frequent occlusions, we developed the Hand Pose Vision Transformer (HP-ViT+). This architecture synergizes a Vision Transformer and a CNN backbone, using weighted fusion to refine the hand pose predictions. For the EgoExo4D Body Pose Challenge, we adopted a multimodal spatio-temporal feature integration strategy to address the complexities of body pose estimation across dynamic contexts. Our methods achieved remarkable performance: 8.31 PA-MPJPE in the Hand Pose Challenge and 11.25 MPJPE in the Body Pose Challenge, securing championship titles in both competitions. We extended our pose estimation solutions to the Proficiency Estimation task, applying core technologies such as transformer-based architectures. This extension enabled us to achieve a top-1 accuracy of 0.53, a SOTA result, in the Demonstrator Proficiency Estimation competition.
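
For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below computes it and also shows a coordinate-wise weighted blend of two predictions. The blend is only one plausible reading of "weighted fusion": the weight value and function names are assumptions, and the abstract does not specify how the ViT and CNN outputs are combined. PA-MPJPE additionally applies a Procrustes alignment before measuring the error, which is not shown here.

```python
import math


def mpjpe(predicted, target):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(predicted)


def weighted_fusion(pred_a, pred_b, weight_a):
    """Blend two sets of joint predictions coordinate by coordinate."""
    return [
        tuple(weight_a * a + (1.0 - weight_a) * b for a, b in zip(joint_a, joint_b))
        for joint_a, joint_b in zip(pred_a, pred_b)
    ]


# Toy example with two joints; real hand poses in this challenge have 21 joints.
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
prediction = [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]
error = mpjpe(prediction, target)  # (5 + 0) / 2
```

The error is expressed in whatever unit the joint coordinates use. The abstract does not state the unit behind the reported 8.31 and 11.25.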

[65] EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

Runnan Lu,Yuxuan Zhang,Jiaming Liu,Haofan Wang,Yiren Song

Main category: cs.CV

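TL;DR: EasyText is a DiT-based multilingual text rendering framework that achieves controllable and precise text generation through character positioning encoding and position encoding interpolation.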

Details

Motivation: Address the challenges diffusion models face in generating multilingual text and fill the gap in multilingual text rendering. Method: The approach builds on the DiT framework, adds character positioning encoding and position encoding interpolation, and constructs large-scale synthetic datasets for pretraining and fine-tuning. Result: Experiments show that EasyText performs well in multilingual text rendering, visual quality, and layout-aware text integration. Conclusion: EasyText provides an efficient and advanced solution for multilingual text generation. Abstract: Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages is still an unexplored area. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer), which connects denoising latents with multilingual character tokens encoded as character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. Additionally, we construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations as well as a high-quality dataset of 20K annotated images, which are used for pretraining and fine-tuning respectively. Extensive experiments and evaluations demonstrate the effectiveness and advancement of our approach in multilingual text rendering, visual quality, and layout-aware text integration.

[66] Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation

Bozhong Zheng,Jinye Gan,Xiaohao Xu,Wenqiao Li,Xiaonan Huang,Na Ni,Yingna Wu

Main category: cs.CV

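TL;DR: PASDF is a new framework for 3D point cloud anomaly detection and repair that improves detection accuracy through a continuous, pose-invariant shape representation and performs strongly in experiments.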

Details

Motivation: Existing patch-based methods suffer from geometric fidelity issues caused by discrete voxelization or projection-based representations, which limits fine-grained anomaly localization. Method: PASDF combines a pose alignment module with an SDF network, incorporates pose information dynamically, and implicitly learns high-fidelity repair templates from the continuous SDF. Result: The method achieves AUROC scores of 80.2% on Real3D-AD and 90.0% on Anomaly-ShapeNet. Conclusion: Continuous geometric representations offer clear advantages for 3D anomaly detection and repair, and PASDF provides a new direction for related research. Abstract: 3D point cloud anomaly detection is essential for robust vision systems but is challenged by pose variations and complex geometric anomalies. Existing patch-based methods often suffer from geometric fidelity issues due to discrete voxelization or projection-based representations, limiting fine-grained anomaly localization. We introduce Pose-Aware Signed Distance Field (PASDF), a novel framework that integrates 3D anomaly detection and repair by learning a continuous, pose-invariant shape representation. PASDF leverages a Pose Alignment Module for canonicalization and an SDF Network to dynamically incorporate pose, enabling implicit learning of high-fidelity anomaly repair templates from the continuous SDF. This facilitates precise pixel-level anomaly localization through an Anomaly-Aware Scoring Module. Crucially, the continuous 3D representation in PASDF extends beyond detection, facilitating in-situ anomaly repair. Experiments on Real3D-AD and Anomaly-ShapeNet demonstrate state-of-the-art performance, achieving high object-level AUROC scores of 80.2% and 90.0%, respectively. These results highlight the effectiveness of continuous geometric representations in advancing 3D anomaly detection and facilitating practical anomaly region repair. The code is available at https://github.com/ZZZBBBZZZ/PASDF to support further research.
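
The idea of scoring anomalies against a continuous signed distance field can be sketched with an analytic template. Below, a unit sphere stands in for the learned SDF network, and points are assumed to be already pose-aligned. Both are simplifying assumptions, and the scoring and repair functions are hypothetical illustrations and not the paper's Anomaly-Aware Scoring Module.

```python
import math


def sphere_sdf(point, radius=1.0):
    """Signed distance to a sphere at the origin: negative inside, positive outside."""
    return math.sqrt(sum(coord * coord for coord in point)) - radius


def anomaly_scores(points, sdf=sphere_sdf):
    """Score each observed point by its absolute distance from the template surface."""
    return [abs(sdf(p)) for p in points]


def repair(point, radius=1.0):
    """Project a point back onto the sphere template (undefined at the origin)."""
    norm = math.sqrt(sum(coord * coord for coord in point))
    if norm == 0:
        raise ValueError("cannot project the origin onto the sphere")
    return tuple(coord * radius / norm for coord in point)


# A point on the surface, a bulge outside it, and a dent inside it.
observed = [(1.0, 0.0, 0.0), (0.0, 1.2, 0.0), (0.0, 0.0, 0.5)]
scores = anomaly_scores(observed)
```

Because the field is continuous, a point can be scored wherever it lies, without snapping it to a voxel grid, and the zero level set of the same field doubles as the repair target.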

[67] SORCE: Small Object Retrieval in Complex Environments

Chunxu Liu,Chi Xie,Xiaxu Chen,Wei Li,Feng Zhu,Rui Zhao,Limin Wang

Main category: cs.CV

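TL;DR: The paper introduces SORCE (Small Object Retrieval in Complex Environments) as a new subfield of text-to-image retrieval (T2IR) and creates the SORCE-1K benchmark. Existing methods perform poorly at retrieving small objects, so the authors use Multimodal Large Language Models (MLLMs) with Regional Prompts (ReP) to generate multiple embeddings per image, which significantly improves performance.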

Details

Motivation: Existing T2IR benchmarks focus on overall image semantics or salient foreground objects and overlook inconspicuous small objects in complex environments, even though retrieving such objects is crucial in real applications. Method: The paper proposes the SORCE-1K benchmark and uses MLLMs with ReP to generate multiple embeddings per image so that small objects are better captured. Result: Experiments show that the multi-embedding approach significantly outperforms existing T2IR methods on SORCE-1K. Conclusion: SORCE-1K is an effective benchmark for small object retrieval, and multi-embedding representations with text-customized MLLM features show potential for the task. Abstract: Text-to-Image Retrieval (T2IR) is a highly valuable task that aims to match a given textual query to images in a gallery. Existing benchmarks primarily focus on textual queries describing overall image semantics or foreground salient objects, possibly overlooking inconspicuous small objects, especially in complex environments. Such small object retrieval is crucial, as in real-world applications, the targets of interest are not always prominent in the image. Thus, we introduce SORCE (Small Object Retrieval in Complex Environments), a new subfield of T2IR, focusing on retrieving small objects in complex images with textual queries. We propose a new benchmark, SORCE-1K, consisting of images with complex environments and textual queries describing less conspicuous small objects with minimal contextual cues from other salient objects. Preliminary analysis on SORCE-1K finds that existing T2IR methods struggle to capture small objects and encode all the semantics into a single embedding, leading to poor retrieval performance on SORCE-1K. Therefore, we propose to represent each image with multiple distinctive embeddings. We leverage Multimodal Large Language Models (MLLMs) to extract multiple embeddings for each image instructed by a set of Regional Prompts (ReP). Experimental results show that our multi-embedding approach through MLLM and ReP significantly outperforms existing T2IR methods on SORCE-1K. Our experiments validate the effectiveness of SORCE-1K for benchmarking SORCE performances, highlighting the potential of multi-embedding representation and text-customized MLLM features for addressing this task.
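
Retrieval with several embeddings per image can be sketched as follows: each image is scored by its best-matching embedding, so a small object captured by one regional embedding is not averaged away by the rest of the scene. Max aggregation over cosine similarity is an assumption here, since the abstract does not say how the multiple embeddings are matched against the query. The function names and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def image_score(query, image_embeddings):
    """Score an image by its best-matching regional embedding."""
    return max(cosine(query, emb) for emb in image_embeddings)


def rank_images(query, gallery):
    """gallery maps image id -> list of embeddings; returns ids sorted best first."""
    return sorted(
        gallery,
        key=lambda image_id: image_score(query, gallery[image_id]),
        reverse=True,
    )


# Image "a" has two regional embeddings; image "b" has a single global one.
gallery = {"a": [[1.0, 0.0], [0.0, 1.0]], "b": [[1.0, 1.0]]}
ranking_small_object = rank_images([0.0, 1.0], gallery)
ranking_global = rank_images([1.0, 1.0], gallery)
```

In the toy gallery, the query aligned with one region of image "a" ranks it first, while the query resembling the blended single embedding of image "b" ranks "b" first.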

[68] Diversify and Conquer: Open-set Disagreement for Robust Semi-supervised Learning with Outliers

Heejo Kong,Sung-Jin Kim,Gunho Jung,Seong-Whan Lee

Main category: cs.CV

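TL;DR: The paper proposes a new framework called DAC that makes open-set semi-supervised learning more robust by exploiting disagreements among the predictions of multiple models on unlabeled data.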

Details

Motivation: Conventional semi-supervised learning assumes that labeled and unlabeled data share the same class distribution, but in practice unlabeled data often contains unknown classes (outliers), which degrades performance. Existing methods rely on the prediction discrepancies of a single model and work poorly when labeled data is scarce. Method: The DAC framework trains multiple differently biased models and uses their prediction disagreements on unlabeled data to detect outliers. The models are encouraged to disagree on outliers while agreeing on inliers, which yields robust outlier identification. Result: DAC identifies outliers effectively even with insufficient labeled data and outperforms conventional methods. Conclusion: DAC addresses the outlier problem in open-set semi-supervised learning through multi-model disagreement and significantly improves robustness. Abstract: Conventional semi-supervised learning (SSL) ideally assumes that labeled and unlabeled data share an identical class distribution; however, in practice, this assumption is easily violated, as unlabeled data often includes unknown class data, i.e., outliers. The outliers are treated as noise, considerably degrading the performance of SSL models. To address this drawback, we propose a novel framework, Diversify and Conquer (DAC), to enhance SSL robustness in the context of open-set semi-supervised learning. In particular, we note that existing open-set SSL methods rely on prediction discrepancies between inliers and outliers from a single model trained on labeled data. This approach can easily fail when the labeled data is insufficient, leading to performance degradation that is worse than naive SSL that does not account for outliers. In contrast, our approach exploits prediction disagreements among multiple models that are differently biased towards the unlabeled distribution. By leveraging the discrepancies arising from training on unlabeled data, our method enables robust outlier detection even when the labeled data is underspecified. Our key contribution is constructing a collection of differently biased models through a single training process. By encouraging divergent heads to be differently biased towards outliers while making consistent predictions for inliers, we exploit the disagreement among these heads as a measure to identify unknown concepts. Our code is available at https://github.com/heejokong/DivCon.
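
Using head disagreement as an outlier signal can be sketched with a simple measure: the mean pairwise L1 distance between the class distributions predicted by several heads for the same sample. The choice of L1 distance, the threshold, and the function names are assumptions for illustration. The abstract only states that disagreement among heads is used as a measure, not how it is computed.

```python
from itertools import combinations


def disagreement(head_probs):
    """Mean pairwise L1 distance between the class distributions of several heads."""
    pairs = list(combinations(head_probs, 2))
    if not pairs:
        raise ValueError("need at least two heads")
    total = sum(sum(abs(p - q) for p, q in zip(a, b)) for a, b in pairs)
    return total / len(pairs)


def flag_outliers(samples, threshold):
    """samples maps sample id -> list of per-head distributions; returns flagged ids."""
    return [sid for sid, probs in samples.items() if disagreement(probs) > threshold]


samples = {
    # Three heads agree closely: likely an inlier.
    "inlier": [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]],
    # Three heads diverge: a candidate outlier.
    "outlier": [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],
}
flagged = flag_outliers(samples, threshold=0.5)
```

In practice the threshold would need to be chosen on held-out data. The value 0.5 here is arbitrary.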

[69] SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking

Yingjia Xu,Jinlin Wu,Zhen Chen,Daming Gao,Yang Yang,Zhen Lei,Min Cao

Main category: cs.CV

TL;DR: The paper proposes a text-based person retrieval method that improves retrieval by combining pedestrian appearance with scene context.

Details

Motivation: Existing methods focus on appearance-based cross-modal retrieval and neglect the contextual information in the scene, which can provide valuable complementary cues. Method: The SA-Person framework works in two stages: the first performs appearance retrieval by aligning text with pedestrian regions, and the second introduces SceneRanker, which uses multimodal large language models to reason jointly over appearance and scene context. Result: Experiments on the SCENEPERSON-13W dataset validate the framework's effectiveness for retrieval in complex scenes. Conclusion: Combining appearance with scene context can significantly improve text-based person retrieval. Abstract: Text-based person retrieval aims to identify a target individual from a gallery of images based on a natural language description. It presents a significant challenge due to the complexity of real-world scenes and the ambiguity of appearance-related descriptions. Existing methods primarily emphasize appearance-based cross-modal retrieval, often neglecting the contextual information embedded within the scene, which can offer valuable complementary insights for retrieval. To address this, we introduce SCENEPERSON-13W, a large-scale dataset featuring over 100,000 scenes with rich annotations covering both pedestrian appearance and environmental cues. Based on this, we propose SA-Person, a two-stage retrieval framework. In the first stage, it performs discriminative appearance grounding by aligning textual cues with pedestrian-specific regions. In the second stage, it introduces SceneRanker, a training-free, scene-aware re-ranking method leveraging multimodal large language models to jointly reason over pedestrian appearance and the global scene context. Experiments on SCENEPERSON-13W validate the effectiveness of our framework in challenging scene-level retrieval scenarios. The code and dataset will be made publicly available.

[70] SPPSFormer: High-quality Superpoint-based Transformer for Roof Plane Instance Segmentation from Point Clouds

Cheng Zeng,Xiatian Qi,Chi Chen,Kai Sun,Wangle Zhang,Yuxuan Liu,Yan Meng,Bisheng Yang

Main category: cs.CV

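TL;DR: The study proposes a two-stage superpoint generation method for roof plane instance segmentation from point clouds, which, combined with multidimensional handcrafted features and an improved decoder, significantly improves Transformer performance.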

Details

Motivation: Existing superpoint Transformers perform poorly on point cloud roof plane segmentation, mainly because they use low-quality superpoints. Method: The paper defines criteria for high-quality superpoints and a two-stage process to generate them, incorporates handcrafted features, and designs a decoder that combines a Kolmogorov-Arnold Network with a Transformer module. Result: The method achieves state-of-the-art performance on multiple datasets, and the model is insensitive to boundary annotations, which reduces the annotation burden. Conclusion: Point cloud density, uniformity, and precision strongly affect segmentation performance, so data augmentation strategies are needed to improve robustness. Abstract: Transformers have been seldom employed in point cloud roof plane instance segmentation, which is the focus of this study, and existing superpoint Transformers suffer from limited performance due to the use of low-quality superpoints. To address this challenge, we establish two criteria that high-quality superpoints for Transformers should satisfy and introduce a corresponding two-stage superpoint generation process. The superpoints generated by our method not only have accurate boundaries, but also exhibit consistent geometric sizes and shapes, both of which greatly benefit the feature learning of superpoint Transformers. To compensate for the limitations of deep learning features when the training set size is limited, we incorporate multidimensional handcrafted features into the model. Additionally, we design a decoder that combines a Kolmogorov-Arnold Network with a Transformer module to improve instance prediction and mask extraction. Finally, our network's predictions are refined using traditional algorithm-based postprocessing. For evaluation, we annotated a real-world dataset and corrected annotation errors in the existing RoofN3D dataset. Experimental results show that our method achieves state-of-the-art performance on our dataset, as well as both the original and reannotated RoofN3D datasets. Moreover, our model is not sensitive to plane boundary annotations during training, significantly reducing the annotation burden. Through comprehensive experiments, we also identified key factors influencing roof plane segmentation performance: in addition to roof types, variations in point cloud density, density uniformity, and 3D point precision have a considerable impact. These findings underscore the importance of incorporating data augmentation strategies that account for point cloud quality to enhance model robustness under diverse and challenging conditions.

[71] Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model

Yuting Zhang,Hao Lu,Qingyong Hu,Yin Wang,Kaishen Yuan,Xin Liu,Kaishun Wu

Main category: cs.CV

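TL;DR: Period-LLM is a multimodal large language model that addresses the weakness of existing MLLMs on periodic tasks through an easy-to-hard learning paradigm and a dedicated optimization strategy.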

Details

Motivation: Periodic phenomena are pervasive in natural processes, yet existing MLLMs handle them poorly because they lack temporal modelling and suffer from conflicts between short and long periods. Method: Adopts an easy-to-hard task paradigm and proposes a "Resisting Logical Oblivion" optimization strategy. Result: Experiments show that Period-LLM outperforms existing MLLMs on periodic tasks. Conclusion: Period-LLM offers an effective solution for multimodal periodic tasks, and the code is open source. Abstract: Periodic or quasi-periodic phenomena reveal intrinsic characteristics in various natural processes, such as weather patterns, movement behaviors, traffic flows, and biological signals. Given that these phenomena span multiple modalities, the capabilities of Multimodal Large Language Models (MLLMs) offer promising potential to effectively capture and understand their complex nature. However, current MLLMs struggle with periodic tasks due to two limitations: 1) a lack of temporal modelling and 2) conflict between short and long periods. This paper introduces Period-LLM, a multimodal large language model designed to enhance the performance of periodic tasks across various modalities, and constructs a benchmark of varying difficulty for evaluating the cross-modal periodic capabilities of large models. Specifically, we adopt an "Easy to Hard Generalization" paradigm, starting with relatively simple text-based tasks and progressing to more complex visual and multimodal tasks, ensuring that the model gradually builds robust periodic reasoning capabilities. Additionally, we propose a "Resisting Logical Oblivion" optimization strategy to maintain periodic reasoning abilities during semantic alignment. Extensive experiments demonstrate the superiority of the proposed Period-LLM over existing MLLMs in periodic tasks. The code is available at https://github.com/keke-nice/Period-LLM.

[72] ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation

Jing Huang,Yongkang Zhao,Yuhan Li,Zhitao Dai,Cheng Chen,Qiying Lai

Main category: cs.CV

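TL;DR: ACM-UNet is a general-purpose medical image segmentation framework that combines pretrained CNN and Mamba models through a lightweight adapter mechanism, resolving structural mismatches and achieving state-of-the-art performance on the Synapse and ACDC benchmarks.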

Details

Motivation: Existing methods struggle to fully exploit pretrained vision backbones (e.g., ResNet, ViT, VMamba) because of structural mismatches; ACM-UNet targets this gap. Method: Proposes the ACM-UNet framework, which pairs a lightweight adapter mechanism with a hierarchical multi-scale wavelet transform module in the decoder to fuse the strengths of CNNs and SSMs. Result: Reaches 85.12% Dice Score and 13.89 mm HD95 on the Synapse dataset at 17.93G FLOPs. Conclusion: ACM-UNet keeps a simple UNet-like design while effectively incorporating pretrained models, giving strong accuracy at low computational cost. Abstract: The U-shaped encoder-decoder architecture with skip connections has become a prevailing paradigm in medical image segmentation due to its simplicity and effectiveness. While many recent works aim to improve this framework by designing more powerful encoders and decoders, employing advanced convolutional neural networks (CNNs) for local feature extraction, Transformers or state space models (SSMs) such as Mamba for global context modeling, or hybrid combinations of both, these methods often struggle to fully utilize pretrained vision backbones (e.g., ResNet, ViT, VMamba) due to structural mismatches. To bridge this gap, we introduce ACM-UNet, a general-purpose segmentation framework that retains a simple UNet-like design while effectively incorporating pretrained CNNs and Mamba models through a lightweight adapter mechanism. This adapter resolves architectural incompatibilities and enables the model to harness the complementary strengths of CNNs and SSMs, namely fine-grained local detail extraction and long-range dependency modeling. Additionally, we propose a hierarchical multi-scale wavelet transform module in the decoder to enhance feature fusion and reconstruction fidelity. Extensive experiments on the Synapse and ACDC benchmarks demonstrate that ACM-UNet achieves state-of-the-art performance while remaining computationally efficient. Notably, it reaches 85.12% Dice Score and 13.89 mm HD95 on the Synapse dataset with 17.93G FLOPs, showcasing its effectiveness and scalability. Code is available at: https://github.com/zyklcode/ACM-UNet.
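
The Dice Score quoted above measures the overlap between a predicted mask and a reference mask. The following is a minimal sketch of that metric only, assuming binary masks flattened into 0/1 sequences; the function name `dice_score` and the empty-mask convention are illustrative assumptions, not taken from the ACM-UNet code.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # For 0/1 values the product is 1 only where both masks are foreground.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is an assumed convention.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

Benchmarks such as Synapse typically average this value over organs and cases; that aggregation is omitted here.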

[73] Deformable Attention Mechanisms Applied to Object Detection, case of Remote Sensing

Anasse Boutayeb,Iyad Lahsen-cherif,Ahmed El Khadimi

Main category: cs.CV

TL;DR: The paper applies the Deformable-DETR model to object detection in optical and SAR remote sensing images and reports strong performance.

Details

Motivation: Object detection in remote sensing imagery is an important research target for deep learning models, especially Transformer-based ones. Method: Applies Deformable-DETR, which uses deformable attention mechanisms, to optical images (Pleiades Aircraft dataset) and SAR images (SSDD dataset). Result: Under 10-fold stratified validation the model obtains F1 scores of 95.12% on the optical dataset and 94.54% on SSDD, and is compared against several CNN- and Transformer-based detectors. Conclusion: Deformable-DETR performs well on remote sensing object detection and has practical potential. Abstract: Object detection has recently attracted some of the most innovative research work, and the task is of particular importance in remote sensing, given the consistency of these images in terms of geographical coverage and the objects present. Furthermore, Deep Learning (DL) models, in particular those based on Transformers, are especially relevant for visual computing tasks in general, and target detection in particular. Thus, the present work proposes an application of the Deformable-DETR model, a specific architecture using deformable attention mechanisms, to remote sensing images in two different modalities, namely optical and Synthetic Aperture Radar (SAR). To achieve this objective, two datasets are used: one optical, the Pleiades Aircraft dataset, and one SAR, the SAR Ship Detection Dataset (SSDD). The results of a 10-fold stratified validation showed that the proposed model performed particularly well, obtaining an F1 score of 95.12% for the optical dataset and 94.54% for SSDD. These results were compared with the detections of several models, especially those based on CNNs and Transformers, as well as those specifically designed to detect different object classes in remote sensing images.
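
The evaluation protocol above combines stratified k-fold splitting with the F1 score. The sketch below illustrates both pieces in plain Python under stated assumptions: each sample carries a single stratification label (the abstract does not say how images were stratified), and detections have already been matched to ground truth so that true positives, false positives, and false negatives are known. The helper names are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so that each label is spread evenly across folds."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        # Deal the samples of one label round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(stratified_folds(["a", "a", "a", "a", "b", "b"], k=2))
    print(f1_score(tp=8, fp=2, fn=2))
```

In a 10-fold run, each fold serves once as the held-out set and the per-fold F1 scores are averaged.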

[74] Reason-SVG: Hybrid Reward RL for Aha-Moments in Vector Graphics Generation

Ximing Xing,Yandong Guan,Jing Zhang,Dong Xu,Qian Yu

Main category: cs.CV

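TL;DR: The Reason-SVG framework strengthens LLM-based SVG generation through the "Drawing-with-Thought" paradigm, combining supervised fine-tuning with reinforcement learning to significantly improve generation quality.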

Details

Motivation: Current LLMs often fall short on structural validity, semantic faithfulness, and visual coherence when generating SVGs; Reason-SVG aims to close this gap. Method: Uses a two-stage training strategy: supervised fine-tuning activates foundational reasoning abilities, then reinforcement learning with GRPO refines generation under a Hybrid Reward function. Result: Reason-SVG significantly improves the accuracy and visual appeal of LLM-generated SVGs. Conclusion: By integrating DwT, SFT, and reinforcement learning, Reason-SVG offers a new approach to high-quality SVG generation with LLMs. Abstract: Generating high-quality Scalable Vector Graphics (SVGs) is challenging for Large Language Models (LLMs), as it requires advanced reasoning for structural validity, semantic faithfulness, and visual coherence, capabilities in which current LLMs often fall short. In this work, we introduce Reason-SVG, a novel framework designed to enhance LLM reasoning for SVG generation. Reason-SVG pioneers the "Drawing-with-Thought" (DwT) paradigm, in which models generate both SVG code and explicit design rationales, mimicking the human creative process. Reason-SVG adopts a two-stage training strategy: First, Supervised Fine-Tuning (SFT) trains the LLM on the DwT paradigm to activate foundational reasoning abilities. Second, Reinforcement Learning (RL), utilizing Group Relative Policy Optimization (GRPO), empowers the model to generate both DwT rationales and SVGs through refined, reward-driven reasoning. To facilitate reasoning-driven SVG generation, we design a Hybrid Reward function that evaluates the presence and utility of DwT reasoning, along with structural validity, semantic alignment, and visual quality. We also introduce the SVGX-DwT-10k dataset, a high-quality corpus of 10,000 SVG-DwT pairs, where each SVG code is generated based on explicit DwT reasoning. By integrating DwT, SFT, and Hybrid Reward-guided RL, Reason-SVG significantly improves LLM performance in generating accurate and visually compelling SVGs, potentially fostering "Aha moments" in design.
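
The Hybrid Reward described above scores four aspects: DwT reasoning, structural validity, semantic alignment, and visual quality. The sketch below shows one plausible way to combine such terms. It is not the authors' reward: the equal weights, the function names, and the idea that the reasoning, semantic, and visual scores arrive as precomputed numbers in [0, 1] are all assumptions. Only the structural check is implemented, using Python's standard XML parser.

```python
import xml.etree.ElementTree as ET


def is_valid_svg(svg_code):
    """Structural check: the text parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(svg_code)
    except ET.ParseError:
        return False
    # With an xmlns attribute, ElementTree reports the tag as "{namespace}svg".
    return root.tag == "svg" or root.tag.endswith("}svg")


def hybrid_reward(svg_code, reasoning_score, semantic_score, visual_score,
                  weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of four reward terms; the weights are illustrative, not from the paper."""
    structural_score = 1.0 if is_valid_svg(svg_code) else 0.0
    components = (reasoning_score, structural_score, semantic_score, visual_score)
    return sum(w * c for w, c in zip(weights, components))


if __name__ == "__main__":
    good = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="1"/></svg>'
    print(hybrid_reward(good, reasoning_score=0.8, semantic_score=0.6, visual_score=0.7))
```

In a GRPO setup, a scalar like this would be computed for each sampled completion and normalized within the group.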

[75] un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Yinqi Li,Jiahe Zhao,Hong Chang,Ruibing Hou,Shiguang Shan,Xilin Chen

Main category: cs.CV

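TL;DR: The paper proposes un²CLIP, a method that improves CLIP by exploiting the generative model unCLIP to capture more image detail while keeping alignment with the original text encoder. Experiments show that it significantly outperforms the original CLIP and earlier CLIP improvement methods across several tasks.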

Details

Motivation: CLIP is weak at distinguishing fine image details, which hurts its performance on dense-prediction and vision-centric multimodal tasks; the paper therefore aims to make CLIP capture visual detail better. Method: Uses unCLIP, an image generator conditioned on the CLIP image embedding, and inverts it (hence un²CLIP) to strengthen the detail-capturing ability of the CLIP image encoder. Result: un²CLIP significantly outperforms the original CLIP and previous improvement methods on the MMVP-VLM benchmark, open-vocabulary segmentation, and multimodal large language model tasks. Conclusion: By borrowing the capabilities of the generative model unCLIP, un²CLIP improves CLIP, particularly in capturing image detail, while preserving alignment with the text encoder. Abstract: Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative model, unCLIP, provides a suitable framework for achieving our goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding. In other words, it inverts the CLIP image encoder. Compared to discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the conditional input space of unCLIP aligns with CLIP's original image-text embedding space. Therefore, we propose to invert unCLIP (dubbed un$^2$CLIP) to improve the CLIP model. In this way, the improved image encoder can gain unCLIP's visual detail capturing ability while preserving its alignment with the original text encoder. We evaluate our improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un$^2$CLIP significantly improves on the original CLIP and previous CLIP improvement methods. Code and models will be available at https://github.com/LiYinqi/un2CLIP.

[76] AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders

Yuqi Zhang,Yuchun Miao,Zuchao Li,Liang Ding

Main category: cs.CV

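TL;DR: AMIA is a lightweight defense that automatically masks text-irrelevant image patches and performs joint intention analysis, markedly improving the safety of large vision-language models while preserving utility.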

Details

Motivation: Large Vision-Language Models (LVLMs) are vulnerable to adversarial jailbreak attacks, which calls for a lightweight defense that requires no retraining. Method: 1) automatically mask image patches irrelevant to the text to disrupt adversarial perturbations; 2) run joint intention analysis to uncover and mitigate hidden harmful intent before a response is generated. Result: The average defense success rate rises from 52.4% to 81.7%, with only a 2% average accuracy drop and modest inference overhead. Conclusion: Both masking and intention analysis are essential for a robust safety-utility trade-off. Abstract: We introduce AMIA, a lightweight, inference-only defense for Large Vision-Language Models (LVLMs) that (1) Automatically Masks a small set of text-irrelevant image patches to disrupt adversarial perturbations, and (2) conducts joint Intention Analysis to uncover and mitigate hidden harmful intents before response generation. Without any retraining, AMIA improves defense success rates across diverse LVLMs and jailbreak benchmarks from an average of 52.4% to 81.7%, preserves general utility with only a 2% average accuracy drop, and incurs only modest inference overhead. Ablation confirms both masking and intention analysis are essential for a robust safety-utility trade-off.
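
The masking step above can be pictured as ranking image patches by their relevance to the text prompt and blanking the lowest-ranked ones. The sketch below assumes the relevance scores are already available (for example from an image-text similarity model, which the abstract does not specify); the function name, the fill value, and the flat-list patch representation are hypothetical.

```python
def mask_least_relevant(patches, relevance, num_masked, fill=0.0):
    """Return a copy of `patches` with the `num_masked` least text-relevant patches blanked.

    patches:   list of patches, each a flat list of pixel values
    relevance: one score per patch, higher meaning more related to the text (assumed given)
    """
    if len(patches) != len(relevance):
        raise ValueError("need exactly one relevance score per patch")
    # Patch indices sorted from least to most relevant.
    order = sorted(range(len(patches)), key=lambda i: relevance[i])
    to_mask = set(order[:num_masked])
    return [
        [fill] * len(patch) if i in to_mask else list(patch)
        for i, patch in enumerate(patches)
    ]


if __name__ == "__main__":
    patches = [[1, 2], [3, 4], [5, 6]]
    print(mask_least_relevant(patches, relevance=[0.9, 0.1, 0.5], num_masked=1))
```

Masking only a small number of patches is what keeps the utility cost low; the intention-analysis step operates on the prompt and is not sketched here.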

[77] UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation

Yang-Tian Sun,Xin Yu,Zehuan Huang,Yi-Hua Huang,Yuan-Chen Guo,Ziyi Yang,Yan-Pei Cao,Xiaojuan Qi

Main category: cs.CV

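TL;DR: The paper uses diffusion model priors for monocular geometry estimation: by predicting geometric attributes in a global coordinate system, introducing an efficient conditioning method, and training jointly on multiple attributes, it predicts global geometric attributes in videos accurately.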

Details

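Motivation: Existing methods mostly estimate geometric properties within the camera coordinate system of individual frames and overlook the ability of diffusion models to establish inter-frame correspondence; this work exploits the intrinsic consistency of video generation models for more consistent geometry estimation. Method: 1) choose geometric attributes in the global coordinate system that share the same correspondence with video frames as prediction targets; 2) propose an efficient conditioning method that reuses positional encodings; 3) improve performance by jointly training on multiple geometric attributes that share the same correspondence. Result: The method performs strongly on predicting global geometric attributes in videos, applies directly to reconstruction tasks, and shows potential to generalize to dynamic scenes even when trained only on static video data. Conclusion: With appropriate design and fine-tuning, the intrinsic consistency of diffusion models can be harnessed for consistent geometry estimation, offering a new direction for video geometry analysis. Abstract: Recently, methods leveraging diffusion model priors to assist monocular geometric estimation (e.g., depth and normal) have gained significant attention due to their strong generalization ability. However, most existing works focus on estimating geometric properties within the camera coordinate system of individual video frames, neglecting the inherent ability of diffusion models to determine inter-frame correspondence. In this work, we demonstrate that, through appropriate design and fine-tuning, the intrinsic consistency of video generation models can be effectively harnessed for consistent geometric estimation. Specifically, we 1) select geometric attributes in the global coordinate system that share the same correspondence with video frames as the prediction targets, 2) introduce a novel and efficient conditioning method by reusing positional encodings, and 3) enhance performance through joint training on multiple geometric attributes that share the same correspondence. Our results achieve superior performance in predicting global geometric attributes in videos and can be directly applied to reconstruction tasks. Even when trained solely on static video data, our approach exhibits the potential to generalize to dynamic video scenes.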

[78] Optimal Density Functions for Weighted Convolution in Learning Models

Simone Cammarasana,Giuseppe Patanè

Main category: cs.CV

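TL;DR: Proposes a weighted convolution that uses an optimal density function to scale the contribution of neighbouring pixels, significantly improving the accuracy of convolutional neural networks.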

Details

Motivation: Standard convolution treats all neighbouring pixels equally, whereas the weighted convolution scales their contribution by distance from the central pixel to improve approximation accuracy. Method: Defines a framework that optimizes the density function with DIRECT-L and the convolutional kernel weights with stochastic gradient descent, applied to an image-to-image task such as denoising. Result: The weighted convolution reduces the loss by up to 53% and increases test accuracy, at the cost of an 11% increase in execution time. Conclusion: The weighted convolution performs well on the tested image task; future work will apply it to real-case 2D and 3D image learning problems. Abstract: The paper introduces the weighted convolution, a novel approach to the convolution for signals defined on regular grids (e.g., 2D images) through the application of an optimal density function to scale the contribution of neighbouring pixels based on their distance from the central pixel. This choice differs from the traditional uniform convolution, which treats all neighbouring pixels equally. Our weighted convolution can be applied to convolutional neural network problems to improve the approximation accuracy. Given a convolutional network, we define a framework to compute the optimal density function through a minimisation model. The framework separates the optimisation of the convolutional kernel weights (using stochastic gradient descent) from the optimisation of the density function (using DIRECT-L). Experimental results on a learning model for an image-to-image task (e.g., image denoising) show that the weighted convolution significantly reduces the loss (up to 53% improvement) and increases the test accuracy compared to standard convolution. While this method increases execution time by 11%, it is robust across several hyperparameters of the learning model. Future work will apply the weighted convolution to real-case 2D and 3D image convolutional learning problems.
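
The idea above, scaling each neighbour's contribution by a density that depends on its distance from the central pixel, can be sketched as an elementwise product between the kernel and a precomputed density matrix. This is a plain-Python illustration under assumptions: a square odd-sized kernel, Euclidean distance on the kernel grid, no padding, and a user-supplied density function `phi`. It does not reproduce the authors' implementation or their DIRECT-L optimisation of the density values.

```python
import math


def density_matrix(size, phi):
    """Evaluate the density phi at each kernel cell's distance from the kernel centre."""
    centre = size // 2
    return [[phi(math.hypot(i - centre, j - centre)) for j in range(size)]
            for i in range(size)]


def weighted_conv2d(image, kernel, phi):
    """'Valid' 2D cross-correlation whose kernel entries are scaled by the density phi."""
    k = len(kernel)
    dens = density_matrix(k, phi)
    # Scale the kernel once up front, so the sliding-window loop is a standard convolution.
    scaled = [[kernel[i][j] * dens[i][j] for j in range(k)] for i in range(k)]
    height, width = len(image), len(image[0])
    output = []
    for r in range(height - k + 1):
        row = []
        for c in range(width - k + 1):
            row.append(sum(scaled[i][j] * image[r + i][c + j]
                           for i in range(k) for j in range(k)))
        output.append(row)
    return output


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    ones = [[1.0] * 3 for _ in range(3)]
    print(weighted_conv2d(image, ones, phi=lambda d: 1.0))              # uniform density
    print(weighted_conv2d(image, ones, phi=lambda d: 1.0 / (1.0 + d)))  # decaying density
```

A constant density recovers the standard convolution, and the number of trainable kernel weights stays the same because the density only rescales them.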

[79] Geospatial Foundation Models to Enable Progress on Sustainable Development Goals

Pedram Ghamisi,Weikang Yu,Xiaokang Zhang,Aldino Rizaldy,Jian Wang,Chufeng Zhou,Richard Gloaguen,Gustau Camps-Valls

Main category: cs.CV

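TL;DR: SustainFM is a benchmarking framework grounded in the 17 Sustainable Development Goals that evaluates geospatial foundation models (FMs) on sustainability tasks. It finds that FMs often, though not universally, outperform traditional approaches, and argues that energy efficiency, generalization, and ethics deserve attention.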

Details

Motivation: The real-world utility of geospatial foundation models (FMs) and their alignment with sustainability goals remain underexplored, so their performance and impact need systematic evaluation. Method: Introduces the SustainFM framework, which covers diverse sustainability tasks (from asset wealth prediction to environmental hazard detection) and assesses FMs on accuracy, generalization, and energy efficiency. Result: FMs often outperform traditional approaches across tasks, though not universally; evaluation should extend beyond accuracy to transferability, generalization, and energy efficiency; FMs enable scalable solutions to sustainability challenges. Conclusion: Calls for a shift from model-centric development to impact-driven deployment, with energy efficiency, robustness to domain shifts, and ethical considerations as key metrics. Abstract: Foundation Models (FMs) are large-scale, pre-trained AI systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world utility and alignment with global sustainability goals remain underexplored. We introduce SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with extremely diverse tasks ranging from asset wealth prediction to environmental hazard detection. This study provides a rigorous, interdisciplinary assessment of geospatial FMs and offers critical insights into their role in attaining sustainability goals. Our findings show: (1) While not universally superior, FMs often outperform traditional approaches across diverse tasks and datasets. (2) Evaluating FMs should go beyond accuracy to include transferability, generalization, and energy efficiency as key criteria for their responsible use. (3) FMs enable scalable, SDG-grounded solutions, offering broad utility for tackling complex sustainability challenges. Critically, we advocate for a paradigm shift from model-centric development to impact-driven deployment, and emphasize metrics such as energy efficiency, robustness to domain shifts, and ethical considerations.

[80] Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts

Xin He,Xumeng Han,Longhui Wei,Lingxi Xie,Qi Tian

Main category: cs.CV

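TL;DR: Mixpert is an efficient mixture-of-vision-experts architecture that routes inputs to the most suitable expert through a dynamic routing mechanism, resolving the conflicts a single vision encoder faces in multi-task learning at low computational cost.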

Details

Motivation: A single vision encoder struggles with diverse visual task domains and runs into conflicts, while directly integrating multiple encoders adds complexity and limits joint optimization. Method: Proposes the Mixpert architecture, which keeps the joint-learning advantages of a single vision encoder, restructures it into a multi-expert paradigm, and adds a dynamic routing mechanism that assigns input images to experts. Result: Mixpert alleviates domain conflicts in multi-task learning with minimal additional computation and yields substantial performance gains in experiments. Conclusion: Mixpert is an efficient, general solution for multi-task visual processing in multimodal large language models. Abstract: Multimodal large language models (MLLMs) require a nuanced interpretation of complex image information, typically leveraging a vision encoder to perceive various visual scenarios. However, relying solely on a single vision encoder to handle diverse task domains proves difficult and inevitably leads to conflicts. Recent work enhances data perception by directly integrating multiple domain-specific vision encoders, yet this structure adds complexity and limits the potential for joint optimization. In this paper, we introduce Mixpert, an efficient mixture-of-vision-experts architecture that inherits the joint learning advantages from a single vision encoder while being restructured into a multi-expert paradigm for task-specific fine-tuning across different visual tasks. Additionally, we design a dynamic routing mechanism that allocates input images to the most suitable visual expert. Mixpert effectively alleviates domain conflicts encountered by a single vision encoder in multi-task learning with minimal additional computational cost, making it more efficient than multiple encoders. Furthermore, Mixpert integrates seamlessly into any MLLM, with experimental results demonstrating substantial performance gains across various tasks.

[81] Optimal Weighted Convolution for Classification and Denosing

Simone Cammarasana,Giuseppe Patanè

Main category: cs.CV

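TL;DR: Proposes a weighted convolution operator that augments standard CNNs with a spatial density function, improving spatial feature extraction; it is compatible with existing architectures and outperforms standard convolution on image classification and denoising.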

Details

Motivation: Standard convolution handles neighbouring pixels inflexibly and cannot weight them according to their relative position, which limits feature extraction. Method: Integrates a spatial density function into the convolution operator to weight neighbouring pixels differentially, keeps the number of trainable parameters unchanged, and provides an efficient implementation. Result: On CIFAR-100 and DIV2K, weighted convolution improves models such as VGG and DnCNN (classification accuracy rises from 56.89% to 66.94%, and denoising PSNR from 20.17 to 22.63). Conclusion: The weighted convolution operator is an efficient and general improvement that extends to other tasks and data types; the code is open source. Abstract: We introduce a novel weighted convolution operator that enhances traditional convolutional neural networks (CNNs) by integrating a spatial density function into the convolution operator. This extension enables the network to differentially weight neighbouring pixels based on their relative position to the reference pixel, improving spatial characterisation and feature extraction. The proposed operator maintains the same number of trainable parameters and is fully compatible with existing CNN architectures. Although developed for 2D image data, the framework is generalisable to signals on regular grids of arbitrary dimensions, such as 3D volumetric data or 1D time series. We propose an efficient implementation of the weighted convolution by pre-computing the density function and achieving execution times comparable to standard convolution layers. We evaluate our method on two deep learning tasks: image classification using the CIFAR-100 dataset [KH+09] and image denoising using the DIV2K dataset [AT17]. Experimental results with state-of-the-art classification (e.g., VGG [SZ15], ResNet [HZRS16]) and denoising (e.g., DnCNN [ZZC+17], NAFNet [CCZS22]) methods show that the weighted convolution improves performance with respect to standard convolution across different quantitative metrics. For example, VGG achieves an accuracy of 66.94% with weighted convolution versus 56.89% with standard convolution on the classification problem, while DnCNN improves the PSNR value from 20.17 to 22.63 on the denoising problem. All models were trained on the CINECA Leonardo cluster to reduce the execution time and improve the tuning of the density function values. The PyTorch implementation of the weighted convolution is publicly available at: https://github.com/cammarasana123/weightedConvolution2.0.

[82] Unleashing the Power of Intermediate Domains for Mixed Domain Semi-Supervised Medical Image Segmentation

Qinghe Ma,Jian Zhang,Lei Qi,Qian Yu,Yinghuan Shi,Yang Gao

Main category: cs.CV

TL;DR: The paper introduces a new scenario, MiDSS, that combines limited annotation with multi-domain data, and proposes the UST-RUN framework to address it.

Details

Motivation: Limited annotation and domain shift are both common in medical image segmentation, but traditional methods address them separately; the MiDSS scenario is introduced to tackle both at once. Method: The UST-RUN framework builds intermediate domains with Unified Copy-paste (UCP), supervises unlabeled data with a Symmetric GuiDance training strategy (SymGD), and progressively introduces style transition through Training Process aware Random Amplitude MixUp (TP-RAM). Result: Experiments on four public datasets show strong performance, including a 12.94% Dice improvement on the Prostate dataset. Conclusion: UST-RUN effectively handles the challenges of the MiDSS scenario and offers a new solution for medical image segmentation. Abstract: Both limited annotation and domain shift are prevalent challenges in medical image segmentation. Traditional semi-supervised segmentation and unsupervised domain adaptation methods address one of these issues separately. However, the coexistence of limited annotation and domain shift is quite common, which motivates us to introduce a novel and challenging scenario: Mixed Domain Semi-supervised medical image Segmentation (MiDSS), where limited labeled data from a single domain and a large amount of unlabeled data from multiple domains are available. To tackle this issue, we propose the UST-RUN framework, which fully leverages intermediate domain information to facilitate knowledge transfer. We employ Unified Copy-paste (UCP) to construct intermediate domains, and propose a Symmetric GuiDance training strategy (SymGD) to supervise unlabeled data by merging pseudo-labels from intermediate samples. Subsequently, we introduce a Training Process aware Random Amplitude MixUp (TP-RAM) to progressively incorporate style-transition components into intermediate samples. To generate more diverse intermediate samples, we further select reliable samples with high-quality pseudo-labels, which are then mixed with other unlabeled data. Additionally, we generate sophisticated intermediate samples with high-quality pseudo-labels for unreliable samples, ensuring effective knowledge transfer for them. Extensive experiments on four public datasets demonstrate the superiority of UST-RUN. Notably, UST-RUN achieves a 12.94% improvement in Dice score on the Prostate dataset. Our code is available at https://github.com/MQinghe/UST-RUN

[83] SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition

Omer Nacar,Yasser Al-Habashi,Serry Sibaee,Adel Ammar,Wadii Boulila

Main category: cs.CV

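TL;DR: SARD is a large-scale synthetic Arabic OCR dataset that addresses the limited scale, diversity, and structural complexity of existing datasets and offers a clean, controlled environment for training and evaluating OCR models.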

Details

Motivation: Existing Arabic OCR datasets are small, lack diversity, and miss the layout complexity of real books, which limits the training of modern OCR models. Method: Synthetically generates 843,622 document images containing 690 million words, rendered in ten Arabic fonts and laid out in book style, free of the noise and distortion of scanned documents. Result: SARD provides high-quality training data, and benchmarks on several OCR models highlight the challenges and opportunities it presents. Conclusion: SARD is a valuable resource for developing and evaluating robust OCR and vision-language models for diverse Arabic book-style text. Abstract: Arabic Optical Character Recognition (OCR) is essential for converting vast amounts of Arabic print media into digital formats. However, training modern OCR models, especially powerful vision-language models, is hampered by the lack of large, diverse, and well-structured datasets that mimic real-world book layouts. Existing Arabic OCR datasets often focus on isolated words or lines or are limited in scale, typographic variety, or structural complexity found in books. To address this significant gap, we introduce SARD (Large-Scale Synthetic Arabic OCR Dataset). SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents. It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage. Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training. Its synthetic nature provides unparalleled scalability and allows for precise control over layout and content variation. We detail the dataset's composition and generation process and provide benchmark results for several OCR models, including traditional and deep learning approaches, highlighting the challenges and opportunities presented by this dataset. SARD serves as a valuable resource for developing and evaluating robust OCR and vision-language models capable of processing diverse Arabic book-style texts.

[84] GARLIC: GAussian Representation LearnIng for spaCe partitioning

Panagiotis Rigas,Panagiotis Drivas,Charalambos Tzamos,Ioannis Chamodrakas,George Ioannakis,Leonidas J. Guibas,Ioannis Z. Emiris

Main category: cs.CV

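TL;DR: GARLIC is a new indexing structure based on high-dimensional Gaussians for efficiently learning and searching high-dimensional vector spaces; optimized with information-theoretic objectives, it combines fast build times with competitive recall.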

Details

Motivation: Efficient learning and search in high-dimensional vector spaces underpins many applications, and traditional methods trade off speed against accuracy, so a method with both fast construction and high recall is needed. Method: GARLIC represents space partitions with Gaussians, optimizes their parameters with information-theoretic objectives, and progressively refines the representation through split and clone operations. Result: On standard benchmarks GARLIC matches the Recall10@10 of Faiss-IVF on Fashion-MNIST with about half the probes, beats other majority-voting classifiers by about 15% accuracy, and still returns about 45% Recall@1 when trained on 1% of the data. Conclusion: GARLIC is an efficient, general index for high-dimensional vector spaces, suited to applications that need both speed and accuracy. Abstract: We introduce GARLIC (GAussian Representation LearnIng for spaCe partitioning), a novel indexing structure based on $N$-dimensional Gaussians for efficiently learning high-dimensional vector spaces. Our approach is inspired by Gaussian splatting techniques, typically used in 3D rendering, which we adapt for high-dimensional search and classification. We optimize Gaussian parameters using information-theoretic objectives that balance coverage, assignment confidence, and structural and semantic consistency. A key contribution is to progressively refine the representation through split and clone operations, handling hundreds of dimensions, thus handling varying data densities. GARLIC offers the fast building times of traditional space partitioning methods (e.g., under $\sim5$ min build time for SIFT1M) while achieving $\sim50\%$ Recall10@10 in low-candidate regimes. Experimental results on standard benchmarks demonstrate our method's consistency in (a) $k$-NN retrieval, outperforming methods such as Faiss-IVF in fast-recall by using about half their probes for the same Recall10@10 on Fashion-MNIST, and (b) classification tasks, beating other majority voting methods by $\sim15\%$ accuracy. Further, we show strong generalization capabilities, maintaining high accuracy even with downsampled training data: using just $1\%$ of the training data returns $\sim 45\%$ Recall@1, thus making GARLIC quite powerful for applications requiring both speed and accuracy.
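
The Recall10@10 figures above count how many of a query's true 10 nearest neighbours appear among the 10 results an index returns. A minimal sketch of that metric follows, assuming neighbour ids are given as ranked lists per query; the function name `recall_at_k` is illustrative and the code is independent of GARLIC itself.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Mean fraction of each query's true top-k neighbours found in its first k results.

    retrieved, ground_truth: one ranked list of neighbour ids per query
    """
    if len(retrieved) != len(ground_truth):
        raise ValueError("need one ground-truth list per query")
    if not retrieved:
        raise ValueError("need at least one query")
    total = 0.0
    for got, truth in zip(retrieved, ground_truth):
        hits = len(set(got[:k]) & set(truth[:k]))
        total += hits / k
    return total / len(retrieved)


if __name__ == "__main__":
    retrieved = [[1, 2, 3], [4, 5, 6]]
    truth = [[1, 2, 9], [7, 8, 4]]
    print(recall_at_k(retrieved, truth, k=3))
```

Comparing indexes "at the same recall" then means tuning each one (for example its number of probes) until this value matches and comparing the cost.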

[85] Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Duo Zheng,Shijia Huang,Yanyang Li,Liwei Wang

Main category: cs.CV

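TL;DR: The study proposes VG LLM, a method for understanding 3D scenes directly from video without additional 3D input; it is competitive with existing state-of-the-art methods and surpasses Gemini-1.5-Pro on VSI-Bench.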

Details

Motivation: Improve the ability of Multimodal Large Language Models (MLLMs) to understand 3D space directly from video and reduce their dependence on explicit 3D data. Method: Uses a 3D visual geometry encoder to extract 3D prior information from video sequences, integrates it with visual tokens, and feeds the result into the MLLM. Result: The method improves substantially on 3D scene understanding and spatial reasoning tasks; the 4B model surpasses Gemini-1.5-Pro on VSI-Bench. Conclusion: VG LLM is efficient and effective, and offers a new route to learning 3D understanding directly from video. Abstract: Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D spaces directly from video data, without the need for additional 3D input. We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM). Our approach employs a 3D visual geometry encoder that extracts 3D prior information from video sequences. This information is integrated with visual tokens and fed into the MLLM. Extensive experiments have shown that our method has achieved substantial improvements in various tasks related to 3D scene understanding and spatial reasoning, all directly learned from video sources. Impressively, our 4B model, which does not rely on explicit 3D data inputs, achieves competitive results compared to existing state-of-the-art methods, and even surpasses Gemini-1.5-Pro in the VSI-Bench evaluations.

[86] NUC-Net: Non-uniform Cylindrical Partition Network for Efficient LiDAR Semantic Segmentation

Xuzhi Wang,Wei Feng,Lingdong Kong,Liang Wan

Main category: cs.CV

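TL;DR: NUC-Net is a non-uniform cylindrical partition network that tackles the high computational cost and unbalanced point distribution in LiDAR point cloud semantic segmentation, improving both accuracy and efficiency.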

Details

Motivation: Existing voxel-based methods for LiDAR semantic segmentation incur high computation and memory costs and do not handle the unbalanced point distribution well. Method: Proposes the non-uniform cylindrical partition network NUC-Net, which partitions the radial axis non-uniformly with the Arithmetic Progression of Interval (API) method and adds non-uniform multi-scale aggregation to improve contextual information. Result: Achieves state-of-the-art performance on SemanticKITTI and nuScenes, with 4x faster training, 2x lower GPU memory use, and 3x faster inference than the uniform counterpart. Conclusion: NUC-Net is an efficient, general component for LiDAR semantic segmentation that improves accuracy and efficiency and is backed by theoretical analysis. Abstract: LiDAR semantic segmentation plays a vital role in autonomous driving. Existing voxel-based methods for LiDAR semantic segmentation apply uniform partition to the 3D LiDAR point cloud to form a structured representation based on Cartesian/cylindrical coordinates. Although these methods show impressive performance, existing voxel-based methods have two drawbacks: (1) they require a sufficiently large input voxel resolution, which brings a large computation cost and memory consumption; (2) they do not handle the unbalanced point distribution of LiDAR point clouds well. In this paper, we propose a non-uniform cylindrical partition network named NUC-Net to tackle the above challenges. Specifically, we propose the Arithmetic Progression of Interval (API) method to non-uniformly partition the radial axis and generate a voxel representation that is representative and efficient. Moreover, we propose a non-uniform multi-scale aggregation method to improve contextual information. Our method achieves state-of-the-art performance on the SemanticKITTI and nuScenes datasets with much faster speed and much less training time. Our method can also serve as a general component for LiDAR semantic segmentation, significantly improving both the accuracy and efficiency of the uniform counterpart with $4 \times$ faster training, $2 \times$ lower GPU memory use, and $3 \times$ faster inference. We further provide theoretical analysis towards understanding why NUC is effective and how point distribution affects performance. Code is available at https://github.com/alanWXZ/NUC-Net.
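
One way to read "Arithmetic Progression of Interval" is that radial bin widths grow by a constant step with distance from the sensor, so dense nearby regions get thin bins and sparse distant regions get wide ones. The sketch below builds such bin edges. It is an interpretation of the name, not the authors' code: the parametrisation by `first_width`, both function names, and the clamping of out-of-range radii are assumptions.

```python
import bisect


def api_radial_edges(r_min, r_max, num_bins, first_width):
    """Bin edges on [r_min, r_max] whose widths form an arithmetic progression.

    Width of bin i is first_width + i * step, with step chosen so the widths sum
    to r_max - r_min.
    """
    if num_bins < 2:
        raise ValueError("need at least two bins")
    span = r_max - r_min
    # Solve num_bins * first_width + step * num_bins * (num_bins - 1) / 2 = span.
    step = 2.0 * (span - num_bins * first_width) / (num_bins * (num_bins - 1))
    if first_width <= 0 or first_width + (num_bins - 1) * step <= 0:
        raise ValueError("parameters give a non-positive bin width")
    edges = [r_min]
    for i in range(num_bins):
        edges.append(edges[-1] + first_width + i * step)
    edges[-1] = r_max  # absorb floating-point drift in the last edge
    return edges


def radial_bin(radius, edges):
    """Index of the bin containing `radius`; out-of-range radii are clamped to the end bins."""
    index = bisect.bisect_right(edges, radius) - 1
    return min(max(index, 0), len(edges) - 2)


if __name__ == "__main__":
    edges = api_radial_edges(0.0, 10.0, num_bins=4, first_width=1.0)
    print(edges)
    print([radial_bin(r, edges) for r in (0.5, 2.0, 5.0, 9.9)])
```

Compared with a uniform split at the same near-range resolution, this uses far fewer radial bins, which is consistent with the memory and speed gains the abstract reports.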

[87] Category-Level 6D Object Pose Estimation in Agricultural Settings Using a Lattice-Deformation Framework and Diffusion-Augmented Synthetic Data

Marios Glytsos,Panagiotis P. Filntisis,George Retsinas,Petros Maragos

Main category: cs.CV

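TL;DR: PLANTPose is a category-level 6D pose estimation framework that works on RGB input; it predicts the 6D pose and deformation parameters relative to a base mesh, adapting to unseen instances without instance-specific data.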

Details

Motivation: Fruits and vegetables vary widely within a class in shape, size, and texture, and existing methods depend on instance-specific CAD models or depth sensors, which makes them impractical in the field. Method: PLANTPose predicts the 6D pose and deformation parameters, and uses Stable Diffusion to make the textures of synthetic training images more realistic. Result: On a benchmark of bananas with varying shape, size, and ripeness, PLANTPose significantly outperforms the state-of-the-art RGB-based method MegaPose. Conclusion: PLANTPose handles large intra-class variation while estimating 6D pose accurately, making it suitable for agriculture and similar real-world settings. Abstract: Accurate 6D object pose estimation is essential for robotic grasping and manipulation, particularly in agriculture, where fruits and vegetables exhibit high intra-class variability in shape, size, and texture. The vast majority of existing methods rely on instance-specific CAD models or require depth sensors to resolve geometric ambiguities, making them impractical for real-world agricultural applications. In this work, we introduce PLANTPose, a novel framework for category-level 6D pose estimation that operates purely on RGB input. PLANTPose predicts both the 6D pose and deformation parameters relative to a base mesh, allowing a single category-level CAD model to adapt to unseen instances. This enables accurate pose estimation across varying shapes without relying on instance-specific data. To enhance realism and improve generalization, we also leverage Stable Diffusion to refine synthetic training images with realistic texturing, mimicking variations due to ripeness and environmental factors and bridging the domain gap between synthetic data and the real world. Our evaluations on a challenging benchmark that includes bananas of various shapes, sizes, and ripeness status demonstrate the effectiveness of our framework in handling large intraclass variations while maintaining accurate 6D pose predictions, significantly outperforming the state-of-the-art RGB-based approach MegaPose.

[88] Cloud Optical Thickness Retrievals Using Angle Invariant Attention Based Deep Learning Models

Zahid Hassan Tushar,Adeleke Ademakinwa,Jianwu Wang,Zhibo Zhang,Sanjay Purushotham

Main category: cs.CV

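TL;DR: The paper proposes CAAC, a new angle-invariant, attention-based deep learning model that estimates cloud optical thickness (COT) more accurately and significantly outperforms existing methods.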

Details

Motivation: The traditional Independent Pixel Approximation (IPA) introduces significant bias through its simplifying assumptions, while existing deep learning models are sensitive to variations in radiance intensity, distortions, and cloud shadows, and show large errors across different angles. Method: Proposes Cloud-Attention-Net with Angle Coding (CAAC), which combines attention mechanisms with angle embeddings to account for satellite viewing geometry and 3D radiative transfer effects, and uses a multi-angle training strategy to ensure angle invariance. Result: CAAC significantly outperforms existing deep learning models, reducing cloud property retrieval errors by at least a factor of nine. Conclusion: By combining attention with multi-angle training, CAAC addresses the limitations of existing methods and provides a more accurate solution for COT estimation. Abstract: Cloud Optical Thickness (COT) is a critical cloud property influencing Earth's climate, weather, and radiation budget. Satellite radiance measurements enable global COT retrieval, but challenges like 3D cloud effects, viewing angles, and atmospheric interference must be addressed to ensure accurate estimation. Traditionally, the Independent Pixel Approximation (IPA) method, which treats individual pixels independently, has been used for COT estimation. However, IPA introduces significant bias due to its simplified assumptions. Recently, deep learning-based models have shown improved performance over IPA but lack robustness, as they are sensitive to variations in radiance intensity, distortions, and cloud shadows. These models also introduce substantial errors in COT estimation under different solar and viewing zenith angles. To address these challenges, we propose a novel angle-invariant, attention-based deep model called Cloud-Attention-Net with Angle Coding (CAAC). Our model leverages attention mechanisms and angle embeddings to account for satellite viewing geometry and 3D radiative transfer effects, enabling more accurate retrieval of COT. Additionally, our multi-angle training strategy ensures angle invariance. Through comprehensive experiments, we demonstrate that CAAC significantly outperforms existing state-of-the-art deep learning models, reducing cloud property retrieval errors by at least a factor of nine.

[89] A Cross Branch Fusion-Based Contrastive Learning Framework for Point Cloud Self-supervised Learning

Chengzhi Wu,Qianliang Huang,Kun Jin,Julius Pfrommer,Jürgen Beyerer

Main category: cs.CV

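TL;DR: Proposes PoCCA, a contrastive cross-branch attention framework for unsupervised learning on point clouds. Sub-branches enable information exchange between branches before the loss end, which improves point cloud representation learning.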

Details

Motivation: Existing contrastive learning frameworks contrast features only at the loss end and exchange no information between branches, which limits representation learning. Method: Proposes the PoCCA framework, which introduces sub-branches for cross-branch information exchange and targets unsupervised learning on point cloud data. Result: Experiments show that, without extra training data, PoCCA reaches state-of-the-art performance on downstream tasks. Conclusion: Cross-branch information exchange markedly improves point cloud representation learning and offers a new direction for unsupervised learning. Abstract: Contrastive learning is an essential method in self-supervised learning. It primarily employs a multi-branch strategy to compare latent representations obtained from different branches and train the encoder. In the case of multi-modal input, diverse modalities of the same object are fed into distinct branches. When using single-modal data, the same input undergoes various augmentations before being fed into different branches. However, all existing contrastive learning frameworks have so far only performed contrastive operations on the learned features at the final loss end, with no information exchange between different branches prior to this stage. In this paper, for point cloud unsupervised learning without the use of extra training data, we propose a Contrastive Cross-branch Attention-based framework for Point cloud data (termed PoCCA), to learn rich 3D point cloud representations. By introducing sub-branches, PoCCA allows information exchange between different branches before the loss end. Experimental results demonstrate that, when no extra training data is used, the representations learned with our self-supervised model achieve state-of-the-art performance on downstream point cloud tasks.

[90] BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models

Huu-Thien Tran,Thanh-Dat Truong,Khoa Luu

Main category: cs.CV

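TL;DR: Proposes BIMA, a new method that uses normalizing flow theory to reduce hallucination in vision-language models, with significant performance gains.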

Details

Motivation: Large vision-language models hallucinate, that is, they generate responses that do not match the visual content, and this must be addressed to make such systems trustworthy. Method: Uses Bijective Maximum Likelihood Learning (BIMA), which improves the decoding process based on normalizing flow theory. Result: BIMA reaches an average F1 score of 85.06% on the POPE benchmark and lowers CHAIR_S and CHAIR_I by 7.6% and 2.6%, respectively. Conclusion: BIMA is one of the first studies to use a bijective approach to reduce hallucination in vision-language models, and it is effective. Abstract: Large vision-language models have become widely adopted to advance various domains. However, developing a trustworthy system with minimal interpretable characteristics of large-scale models presents a significant challenge. One of the most prevalent terms associated with the fallacy functions caused by these systems is hallucination, where the language model generates a response that does not correspond to the visual content. To mitigate this problem, several approaches have been developed, and one prominent direction is to ameliorate the decoding process. In this paper, we propose a new Bijective Maximum Likelihood Learning (BIMA) approach to hallucination mitigation using normalizing flow theories. The proposed BIMA method can efficiently mitigate the hallucination problem in prevailing vision-language models, resulting in significant improvements. Notably, BIMA achieves an average F1 score of 85.06% on the POPE benchmark and remarkably reduces CHAIR_S and CHAIR_I by 7.6% and 2.6%, respectively. To the best of our knowledge, this is one of the first studies that contemplates bijection as a means to reduce hallucination induced by large vision-language models.

[91] Decoupled Competitive Framework for Semi-supervised Medical Image Segmentation

Jiahe Chen,Jiahe Ying,Shen Wang,Jianwei Zheng

Main category: cs.CV

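TL;DR: The paper proposes a Decoupled Competitive Framework (DCF) to address the performance bottleneck of Mean Teacher and Dual Students architectures in semi-supervised medical image segmentation.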

Details

Motivation: Annotated samples are scarce in the medical domain, and existing methods such as MT and DS hit a performance bottleneck because of over-coupling and cognitive bias. Method: DCF uses a competition mechanism to dynamically decouple the teacher and student models and promotes precise knowledge exchange among students. Result: Validated on several public datasets, DCF outperforms existing methods. Conclusion: DCF overcomes the limitations of existing methods and improves semi-supervised medical image segmentation. Abstract: Confronting the critical challenge of insufficiently annotated samples in the medical domain, semi-supervised medical image segmentation (SSMIS) emerges as a promising solution. Specifically, most methodologies following the Mean Teacher (MT) or Dual Students (DS) architecture have achieved commendable results. However, to date, these approaches face a performance bottleneck due to two inherent limitations, namely the over-coupling problem within the MT structure owing to the exponential moving average (EMA) mechanism, and the severe cognitive bias between the two students of the DS structure, both of which potentially lead to reduced efficacy or even eventual model collapse. To mitigate these issues, a Decoupled Competitive Framework (DCF) is elaborated in this work, which utilizes a straightforward competition mechanism for the EMA update, effectively decoupling students and teachers in a dynamic manner. In addition, the seamless exchange of invaluable and precise insights is facilitated among students, guaranteeing a better learning paradigm. The proposed DCF undergoes rigorous validation on three publicly accessible datasets, which encompass both 2D and 3D data. The results demonstrate the superiority of our method over previous cutting-edge competitors. Code will be available at https://github.com/JiaheChen2002/DCF.
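
To make the EMA mechanism concrete, the following is a minimal sketch of a teacher update in which two students compete. The EMA rule itself is the standard Mean Teacher update; the selection rule used here (the student with the lower current loss wins the update) is only an assumption, since the abstract does not state how the competition is decided. All function names are hypothetical.

```python
def ema_update(teacher, student, decay=0.99):
    """Standard EMA: teacher <- decay * teacher + (1 - decay) * student, per parameter."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def competitive_ema_step(teacher, students, losses, decay=0.99):
    """Update the teacher from the student with the lowest loss (assumed competition rule)."""
    winner = min(range(len(students)), key=lambda i: losses[i])
    return ema_update(teacher, students[winner], decay), winner

# Toy example with flat parameter lists standing in for model weights.
new_teacher, winner = competitive_ema_step(
    teacher=[1.0, 1.0],
    students=[[0.0, 0.0], [2.0, 2.0]],
    losses=[0.3, 0.1],
    decay=0.5,
)
```

Because the teacher is no longer tied to a single student at every step, the coupling that the abstract attributes to plain EMA is loosened in this toy version.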

[92] 6D Pose Estimation on Point Cloud Data through Prior Knowledge Integration: A Case Study in Autonomous Disassembly

Chengzhi Wu,Hao Fu,Jan-Philipp Kaiser,Erik Tabuchi Barczak,Julius Pfrommer,Gisela Lanza,Michael Heizmann,Jürgen Beyerer

Main category: cs.CV

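TL;DR: The paper proposes a multi-stage pipeline for accurate 6D pose estimation of bolts in a manufacturing setting, addressing occlusion and the limits of single-view data.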

Details

Motivation: In manufacturing, prior knowledge can be used to improve 6D pose estimation, in particular for the automated disassembly of bolts. Method: Designs a multi-stage pipeline that integrates prior knowledge to capture the complete 6D information of the bolts. Result: The pipeline obtains the 6D pose of all bolts, showing that prior knowledge is effective for this complex task. Conclusion: The method advances 6D pose estimation and demonstrates the practical value of domain knowledge in manufacturing automation. Abstract: The accurate estimation of 6D pose remains a challenging task within the computer vision domain, even when utilizing 3D point cloud data. Conversely, in the manufacturing domain, instances arise where leveraging prior knowledge can yield advancements in this endeavor. This study focuses on the disassembly of starter motors to augment the engineering of product life cycles. A pivotal objective in this context involves the identification and 6D pose estimation of bolts affixed to the motors, facilitating automated disassembly within the manufacturing workflow. Complicating matters, the presence of occlusions and the limitations of single-view data acquisition, notably when motors are placed in a clamping system, obscure certain portions and render some bolts imperceptible. Consequently, the development of a comprehensive pipeline capable of acquiring complete bolt information is imperative to avoid oversight in bolt detection. In this paper, employing the task of bolt detection within the scope of our project as a pertinent use case, we introduce a meticulously devised pipeline. This multi-stage pipeline effectively captures the 6D information with regard to all bolts on the motor, thereby showcasing the effective utilization of prior knowledge in handling this challenging task. The proposed methodology not only contributes to the field of 6D pose estimation but also underscores the viability of integrating domain-specific insights to tackle complex problems in manufacturing and automation.

[93] Beyond FACS: Data-driven Facial Expression Dictionaries, with Application to Predicting Autism

Evangelos Sariyanidi,Lisa Yankowitz,Robert T. Schultz,John D. Herrington,Birkan Tunc,Jeffrey Cohn

Main category: cs.CV

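TL;DR: The paper proposes Facial Basis, a new coding system that replaces the traditional FACS. It is built with an unsupervised method to overcome the limitations of automated FACS coding, and it outperforms the most frequently used AU detector in predicting autism diagnosis.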

Details

Motivation: FACS coding is laborious and costly, and automated AU detection is not accurate enough and cannot capture facial expressions in full, so a new coding system is needed to overcome these limits. Method: Proposes Facial Basis, a data-driven, unsupervised coding system whose units correspond to localized and interpretable 3D facial movements, avoiding the limitations of manual annotation. Result: Facial Basis outperforms the most frequently used AU detector in predicting autism diagnosis and reconstructs all observable facial movement. Conclusion: Facial Basis is the first alternative to FACS that encodes facial behavior comprehensively, providing a new tool for behavioral research. Abstract: The Facial Action Coding System (FACS) has been used by numerous studies to investigate the links between facial behavior and mental health. The laborious and costly process of FACS coding has motivated the development of machine learning frameworks for Action Unit (AU) detection. Despite intense efforts spanning three decades, the detection accuracy for many AUs is considered to be below the threshold needed for behavioral research. Also, many AUs are excluded altogether, making it impossible to fulfill the ultimate goal of FACS: the representation of any facial expression in its entirety. This paper considers an alternative approach. Instead of creating automated tools that mimic FACS experts, we propose to use a new coding system that mimics the key properties of FACS. Specifically, we construct a data-driven coding system called the Facial Basis, which contains units that correspond to localized and interpretable 3D facial movements, and overcomes three structural limitations of automated FACS coding. First, the proposed method is completely unsupervised, bypassing costly, laborious and variable manual annotation. Second, Facial Basis reconstructs all observable movement, rather than relying on a limited repertoire of recognizable movements (as in automated FACS). Finally, the Facial Basis units are additive, whereas AUs may fail detection when they appear in a non-additive combination. The proposed method outperforms the most frequently used AU detector in predicting autism diagnosis from in-person and remote conversations, highlighting the importance of encoding facial behavior comprehensively. To our knowledge, Facial Basis is the first alternative to FACS for deconstructing facial expressions in videos into localized movements. We provide an open source implementation of the method at github.com/sariyanidi/FacialBasis.

[94] Learning reusable concepts across different egocentric video understanding tasks

Simone Alberto Peirone,Francesca Pistilli,Antonio Alliegro,Tatiana Tommasi,Giuseppe Averta

Main category: cs.CV

TL;DR: Hier-EgoPack is a unified framework that aims to improve the holistic perception of autonomous systems through multiple task perspectives.

Details

Motivation: To give autonomous systems human-like holistic perception, including understanding the scene, object interactions, and upcoming events. Method: Proposes the Hier-EgoPack framework, which builds a collection of task perspectives that supports cross-task knowledge sharing and skill reuse in downstream tasks. Result: The framework provides robots with a portable backpack of skills that strengthens their multi-task capability. Conclusion: Hier-EgoPack offers an effective approach to holistic perception and skill learning in autonomous systems. Abstract: Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, everything all at once. To endow autonomous systems with such holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage task synergies when learning novel skills is essential. In this paper, we introduce Hier-EgoPack, a unified framework able to create a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights, as a backpack of skills that a robot can carry around and use when needed.

[95] Conformal Prediction for Zero-Shot Models

Julio Silva-Rodríguez,Ismail Ben Ayed,Jose Dolz

Main category: cs.CV

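TL;DR: The paper studies CLIP models under the split conformal prediction paradigm and proposes Conf-OT to mitigate domain drift, significantly improving set efficiency.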

Details

Motivation: Large-scale pre-trained vision-language models perform well on downstream tasks, but their reliability and uncertainty remain overlooked. The paper examines how CLIP models behave under conformal prediction and proposes a remedy. Method: Proposes Conf-OT, which solves an optimal transport problem to bridge the domain gap between pre-training and the adapted task without additional data splits. Result: Across 15 datasets and three non-conformity scores, Conf-OT improves set efficiency by up to 20% in relative terms and is 15 times faster than popular transductive approaches. Conclusion: Conf-OT mitigates domain drift and improves the efficiency of conformal prediction, offering a new direction for studying uncertainty in pre-trained models. Abstract: Vision-language models pre-trained at large scale have shown unprecedented adaptability and generalization to downstream tasks. Although their discriminative potential has been widely explored, their reliability and uncertainty are still overlooked. In this work, we investigate the capabilities of CLIP models under the split conformal prediction paradigm, which provides theoretical guarantees to black-box models based on a small, labeled calibration set. In contrast to the main body of literature on conformal predictors in vision classifiers, foundation models exhibit a particular characteristic: they are pre-trained on a one-time basis on an inaccessible source domain, different from the transferred task. This domain drift negatively affects the efficiency of the conformal sets and poses additional challenges. To alleviate this issue, we propose Conf-OT, a transfer learning setting that operates transductively over the combined calibration and query sets. Solving an optimal transport problem, the proposed method bridges the domain gap between pre-training and adaptation without requiring additional data splits but still maintaining coverage guarantees. We comprehensively explore this conformal prediction strategy on a broad span of 15 datasets and three non-conformity scores. Conf-OT provides consistent relative improvements of up to 20% on set efficiency while being 15 times faster than popular transductive approaches.
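
For readers unfamiliar with split conformal prediction, the following is a minimal sketch of the generic procedure on top of class probabilities. It uses the simple score 1 minus the probability of the true class, which is one common non-conformity score and not necessarily one of the three used in the paper. It does not implement Conf-OT or its optimal transport step; the function names and toy probabilities are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate the score threshold on a labeled calibration set."""
    # Non-conformity score: 1 - probability assigned to the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: every class is kept
    return scores[k - 1]

def prediction_set(probs, threshold):
    """All classes whose non-conformity score does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]

# Toy calibration set with three samples and three classes.
cal_probs = [[0.5, 0.25, 0.25], [0.25, 0.75, 0.0], [0.5, 0.25, 0.25]]
cal_labels = [0, 1, 2]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
```

Set efficiency, the quantity the abstract reports improvements on, refers to how small these prediction sets are at a given coverage level; poorly calibrated probabilities under domain drift tend to inflate them.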

[96] RT-X Net: RGB-Thermal cross attention network for Low-Light Image Enhancement

Raman Jha,Adithya Lenka,Mani Ramanagopal,Aswin Sankaranarayanan,Kaushik Mitra

Main category: cs.CV

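TL;DR: RT-X Net is a cross-attention network that fuses RGB and thermal images to improve nighttime image quality. It extracts features with self-attention networks and fuses the two modalities with a cross-attention mechanism.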

Details

Motivation: High noise and bright light sources degrade nighttime image quality. Thermal images provide complementary information, but existing methods do not fully exploit multimodal data. Method: Proposes RT-X Net, which combines self-attention networks with a cross-attention mechanism to fuse RGB and thermal images, and releases the V-TIEE dataset to support research. Result: On the LLVIP and V-TIEE datasets, RT-X Net outperforms existing low-light image enhancement methods. Conclusion: Multimodal fusion in RT-X Net substantially improves nighttime image enhancement, and the V-TIEE dataset supports future research. Abstract: In nighttime conditions, high noise levels and bright illumination sources degrade image quality, making low-light image enhancement challenging. Thermal images provide complementary information, offering richer textures and structural details. We propose RT-X Net, a cross-attention network that fuses RGB and thermal images for nighttime image enhancement. We leverage self-attention networks for feature extraction and a cross-attention mechanism for fusion to effectively integrate information from both modalities. To support research in this domain, we introduce the Visible-Thermal Image Enhancement Evaluation (V-TIEE) dataset, comprising 50 co-located visible and thermal images captured under diverse nighttime conditions. Extensive evaluations on the publicly available LLVIP dataset and our V-TIEE dataset demonstrate that RT-X Net outperforms state-of-the-art methods in low-light image enhancement. The code and the V-TIEE dataset can be found at https://github.com/jhakrraman/rt-xnet.

[97] Reinforcing Video Reasoning with Focused Thinking

Jisheng Dang,Jingze Wu,Teng Wang,Xuanhui Lin,Nannan Zhu,Hongbo Chen,Wei-Shi Zheng,Meng Wang,Tat-Seng Chua

Main category: cs.CV

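TL;DR: TW-GRPO improves GRPO on visual reasoning tasks by introducing a token weighting mechanism and denser reward granularity, significantly boosting multimodal large language models.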

Details

Motivation: Existing GRPO methods produce verbose reasoning chains and rely on inefficient binary rewards; TW-GRPO targets both problems. Method: Uses a token weighting mechanism that prioritizes tokens with high information density, and improves RL training with multi-choice QA tasks and soft rewards. Result: Achieves significant gains on benchmarks such as CLEVRER and MMVU. Conclusion: TW-GRPO offers a more efficient approach to complex reasoning tasks. Abstract: Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) they often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues and 2) binary rewarding fails to account for partially correct answers, resulting in high reward variance and inefficient learning. In this paper, we propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density (estimated by intra-group variance), suppressing redundant tokens like generic reasoning prefixes. Furthermore, we reformulate RL training by shifting from single-choice to multi-choice QA tasks, where soft rewards enable finer-grained gradient estimation by distinguishing partial correctness. Additionally, we propose question-answer inversion, a data augmentation strategy to generate diverse multi-choice samples from existing benchmarks. Experiments demonstrate state-of-the-art performance on several video reasoning and general understanding benchmarks. Notably, TW-GRPO achieves 50.4% accuracy on CLEVRER (18.8% improvement over Video-R1) and 65.8% on MMVU. Our code is available at https://github.com/longmalongma/TW-GRPO.
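
The contrast between binary and soft rewards on multi-choice questions can be sketched as follows. The overlap-based reward below (intersection over union of the selected and correct option sets) is an assumed stand-in, because the abstract does not give the exact soft-reward formula. The advantage function is the usual GRPO-style normalization of rewards within a sampled group. Function names are hypothetical.

```python
import statistics

def soft_reward(predicted, correct):
    """Partial credit for multi-choice answers: |pred ∩ correct| / |pred ∪ correct| (assumed form)."""
    predicted, correct = set(predicted), set(correct)
    union = predicted | correct
    if not union:
        return 0.0
    return len(predicted & correct) / len(union)

def binary_reward(predicted, correct):
    """All-or-nothing reward for comparison."""
    return 1.0 if set(predicted) == set(correct) else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to a question whose correct options are A, B and C.
answers = ["ABC", "AB", "A", "D"]
soft = [soft_reward(a, "ABC") for a in answers]
hard = [binary_reward(a, "ABC") for a in answers]
advantages = group_relative_advantages(soft)
```

Under the binary reward the three imperfect answers are indistinguishable, whereas the soft reward ranks them by how much of the correct set they recover, which is the finer-grained signal the abstract refers to.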

[98] DreamDance: Animating Character Art via Inpainting Stable Gaussian Worlds

Jiaxu Zhang,Xianfang Zeng,Xin Chen,Wei Zuo,Gang Yu,Guosheng Lin,Zhigang Tu

Main category: cs.CV

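TL;DR: DreamDance is a novel character art animation framework that generates stable character and scene motion conditioned on camera trajectories. It proceeds in two steps, camera-aware scene inpainting and pose-aware video inpainting, to produce high-quality dynamic character animation.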

Details

Motivation: Existing animation frameworks struggle to produce stable, consistent character and scene motion under precise camera trajectories; DreamDance aims to solve this. Method: (1) Camera-aware scene inpainting: a pre-trained image inpainting model generates multi-view scene images, and a Gaussian field is optimized to render a coarse background video. (2) Pose-aware video inpainting: a DiT-based video generation model is trained to inject the dynamic character and improve background quality. Result: Experiments show that DreamDance generates high-quality, consistent character animations with remarkable camera dynamics. Conclusion: With its two inpainting steps, DreamDance achieves stable, dynamic character animation and generalizes broadly. Abstract: This paper presents DreamDance, a novel character art animation framework capable of producing stable, consistent character and scene motion conditioned on precise camera trajectories. To achieve this, we re-formulate the animation task as two inpainting-based steps: Camera-aware Scene Inpainting and Pose-aware Video Inpainting. The first step leverages a pre-trained image inpainting model to generate multi-view scene images from the reference art and optimizes a stable large-scale Gaussian field, which enables coarse background video rendering with camera trajectories. However, the rendered video is rough and only conveys scene motion. To resolve this, the second step trains a pose-aware video inpainting model that injects the dynamic character into the scene video while enhancing background quality. Specifically, this model is a DiT-based video generation model with a gating strategy that adaptively integrates the character's appearance and pose information into the base background video. Through extensive experiments, we demonstrate the effectiveness and generalizability of DreamDance, producing high-quality and consistent character animations with remarkable camera dynamics.

[99] Tackling View-Dependent Semantics in 3D Language Gaussian Splatting

Jiazhong Cen,Xudong Zhou,Jiemin Fang,Changsong Wen,Lingxi Xie,Xiaopeng Zhang,Wei Shen,Qi Tian

Main category: cs.CV

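TL;DR: LaGa (Language Gaussians) decomposes the 3D scene into objects and builds view-aggregated semantic representations, addressing view dependence in 2D-to-3D semantic understanding and significantly improving 3D scene understanding.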

Details

Motivation: Existing methods simply project 2D semantic features onto 3D Gaussians and overlook the fact that a 3D object may exhibit different semantics from different viewpoints. Method: LaGa decomposes the 3D scene into objects, clusters semantic descriptors, and reweights them based on multi-view semantics to build view-aggregated semantic representations. Result: On the LERF-OVS dataset, LaGa improves mIoU by 18.7% over the previous best method. Conclusion: LaGa captures the key information in view-dependent semantics and enables a more comprehensive understanding of 3D scenes. Abstract: Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints, a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. Then, it constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at: https://github.com/SJTU-DeepVisionLab/LaGa.

[100] Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation

Yucheng Zhou,Jiahao Yuan,Qianning Wang

Main category: cs.CV

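TL;DR: The paper introduces LongBench-T2I, a benchmark designed to evaluate text-to-image models under complex instructions, and proposes the Plan2Gen framework to improve image generation driven by complex instructions.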

Details

Motivation: Existing text-to-image models handle complex instructions poorly, and no benchmark targets complex instructions. Method: Proposes the LongBench-T2I benchmark and the Plan2Gen framework; the latter uses large language models to decompose complex instructions and guide image generation. Result: LongBench-T2I contains 500 complex prompts, and Plan2Gen improves generation without additional training. Conclusion: LongBench-T2I and Plan2Gen fill the gap in evaluating and generating from complex instructions and provide tools for future research. Abstract: Recent advancements in text-to-image (T2I) generation have enabled models to produce high-quality images from textual descriptions. However, these models often struggle with complex instructions involving multiple objects, attributes, and spatial relationships. Existing benchmarks for evaluating T2I models primarily focus on general text-image alignment and fail to capture the nuanced requirements of complex, multi-faceted prompts. Given this gap, we introduce LongBench-T2I, a comprehensive benchmark specifically designed to evaluate T2I models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions, enabling a thorough assessment of a model's ability to follow complex instructions. Beyond benchmarking, we propose an agent framework (Plan2Gen) that facilitates complex instruction-driven image generation without requiring additional model training. This framework integrates seamlessly with existing T2I models, using large language models to interpret and decompose complex prompts, thereby guiding the generation process more effectively. As existing evaluation metrics, such as CLIPScore, fail to adequately capture the nuances of complex instructions, we introduce an evaluation toolkit that automates the quality assessment of generated images using a set of multi-dimensional metrics. The data and code are released at https://github.com/yczhou001/LongBench-T2I.

[101] Lightweight Relational Embedding in Task-Interpolated Few-Shot Networks for Enhanced Gastrointestinal Disease Classification

Xinliu Zhong,Leo Hwa Liang,Angela S. Koh,Yeo Si Yong

Main category: cs.CV

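TL;DR: Proposes a deep learning network based on Few-Shot Learning to improve the analysis of colonoscopy images, raising the accuracy and efficiency of early colorectal cancer detection.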

Details

Motivation: Conventional colonoscopy is invasive and depends on high-quality images, and existing methods fall short in image discrimination and adaptability. Method: Uses a Few-Shot Learning architecture that combines a feature extractor, task interpolation, relational embedding, and a bi-level routing attention mechanism. Result: On the Kvasir dataset, the model reaches 90.1% accuracy and an F1 score of 0.891, outperforming existing methods. Conclusion: The model offers an efficient way to address the challenges of invasive colonoscopy by improving colorectal cancer detection through image analysis. Abstract: Traditional diagnostic methods like colonoscopy are invasive yet critical tools necessary for accurately diagnosing colorectal cancer (CRC). Detection of CRC at early stages is crucial for increasing patient survival rates. However, colonoscopy is dependent on obtaining adequate and high-quality endoscopic images. Prolonged invasive procedures are inherently risky for patients, while suboptimal or insufficient images hamper diagnostic accuracy. These images, typically derived from video frames, often exhibit similar patterns, posing challenges in discrimination. To overcome these challenges, we propose a novel Deep Learning network built on a Few-Shot Learning architecture, which includes a tailored feature extractor, task interpolation, relational embedding, and a bi-level routing attention mechanism. The Few-Shot Learning paradigm enables our model to rapidly adapt to unseen fine-grained endoscopic image patterns, and the task interpolation augments the insufficient images artificially from varied instrument viewpoints. Our relational embedding approach discerns critical intra-image features and captures inter-image transitions between consecutive endoscopic frames, overcoming the limitations of Convolutional Neural Networks (CNNs). The integration of a light-weight attention mechanism ensures a concentrated analysis of pertinent image regions. By training on diverse datasets, the model's generalizability and robustness are notably improved for handling endoscopic images. Evaluated on the Kvasir dataset, our model demonstrated superior performance, achieving an accuracy of 90.1%, precision of 0.845, recall of 0.942, and an F1 score of 0.891. This surpasses current state-of-the-art methods, presenting a promising solution to the challenges of invasive colonoscopy by optimizing CRC detection through advanced image analysis.
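
The reported metrics can be cross-checked with the standard definition of F1 as the harmonic mean of precision and recall. The short sketch below only uses the precision and recall values quoted in the abstract; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Values quoted in the abstract: precision 0.845, recall 0.942.
f1 = f1_score(0.845, 0.942)
print(round(f1, 3))  # 0.891
```

The result agrees with the F1 score of 0.891 stated in the abstract, so the three reported numbers are mutually consistent.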

[102] CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning

Jiangpeng He,Zhihao Duan,Fengqing Zhu

Main category: cs.CV

TL;DR: CL-LoRA提出了一种双适配器架构，结合任务共享适配器和任务特定适配器，解决了CIL中参数冗余和跨任务知识共享不足的问题。

Details

Motivation: Existing adapter-based CIL methods create a new adapter for every new task, which causes parameter redundancy and fails to exploit knowledge shared across tasks. Method: CL-LoRA uses a dual-adapter architecture in which task-shared adapters learn cross-task knowledge and task-specific adapters capture the unique features of each new task. The shared adapters use random orthogonal matrices and knowledge distillation, and the task-specific adapters introduce learnable block-wise weights. Result: CL-LoRA performs well on multiple benchmarks while reducing training and inference computation. Conclusion: CL-LoRA provides a more efficient and scalable paradigm for continual learning with pre-trained models. Abstract: Class-Incremental Learning (CIL) aims to learn new classes sequentially while retaining the knowledge of previously learned classes. Recently, pre-trained models (PTMs) combined with parameter-efficient fine-tuning (PEFT) have shown remarkable performance in rehearsal-free CIL without requiring exemplars from previous tasks. However, existing adapter-based methods, which incorporate lightweight learnable modules into PTMs for CIL, create new adapters for each new task, leading to both parameter redundancy and failure to leverage shared knowledge across tasks. In this work, we propose ContinuaL Low-Rank Adaptation (CL-LoRA), which introduces a novel dual-adapter architecture combining task-shared adapters to learn cross-task knowledge and task-specific adapters to capture unique features of each new task. Specifically, the shared adapters utilize random orthogonal matrices and leverage knowledge distillation with gradient reassignment to preserve essential shared knowledge. In addition, we introduce learnable block-wise weights for task-specific adapters, which mitigate inter-task interference while maintaining the model's plasticity. We demonstrate that CL-LoRA consistently achieves promising performance under multiple benchmarks with reduced training and inference computation, establishing a more efficient and scalable paradigm for continual learning with pre-trained models.
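
As a rough illustration of a low-rank adapter built around a random orthogonal matrix, the sketch below draws orthonormal rows with Gram-Schmidt and applies a LoRA-style update. Treating the orthogonal matrix as a fixed down-projection with a zero-initialized up-projection is an assumption made for this example; the abstract does not say which factor is orthogonal or whether it is frozen. All names are hypothetical.

```python
import math
import random

def random_orthonormal_rows(rank, dim, seed=0):
    """Draw `rank` orthonormal vectors of length `dim` via Gram-Schmidt on Gaussian samples."""
    if rank > dim:
        raise ValueError("rank cannot exceed dim")
    rng = random.Random(seed)
    rows = []
    while len(rows) < rank:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in rows:
            # Remove the component of v along each accepted row.
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows

def lora_delta(x, down, up):
    """Low-rank update: project x to `rank` dims with `down`, then map back with `up`."""
    hidden = [sum(d * xi for d, xi in zip(row, x)) for row in down]
    out_dim = len(up[0])
    return [sum(up[k][j] * hidden[k] for k in range(len(hidden))) for j in range(out_dim)]

down = random_orthonormal_rows(rank=4, dim=16)   # assumed fixed, shared across tasks
up = [[0.0] * 16 for _ in range(4)]              # zero init: the adapter starts as a no-op
delta = lora_delta([1.0] * 16, down, up)
```

Orthonormal rows preserve the norm of the projected component of the input, which is one plausible reason to prefer them for a projection that is shared across tasks.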

[103] Segmenting France Across Four Centuries

Marta López-Rauhut,Hongyu Zhou,Mathieu Aubry,Loic Landrieu

Main category: cs.CV

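TL;DR: The paper introduces a new historical map dataset for large-scale, long-term analysis of land use and land cover evolution, and benchmarks three segmentation approaches.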

Details

Motivation: Historical maps offer a unique view of how territories evolved over past centuries, but existing datasets usually cover a single map type or period and are costly to annotate. Method: Presents a historical map dataset covering metropolitan France, with maps from the 18th to the 20th century and both modern and historical labels. Benchmarks three segmentation approaches, one fully supervised and two weakly supervised. Result: The dataset exposes the complexity of the segmentation task (stylistic inconsistencies, interpretive ambiguities, and more). Weakly supervised methods perform well, especially the one that uses image translation to address stylistic differences. Conclusion: The dataset and methods provide new tools for long-term environmental monitoring and help understand centuries of landscape change. Abstract: Historical maps offer an invaluable perspective into territory evolution across past centuries, long before satellite or remote sensing technologies existed. Deep learning methods have shown promising results in segmenting historical maps, but publicly available datasets typically focus on a single map type or period, require extensive and costly annotations, and are not suited for nationwide, long-term analyses. In this paper, we introduce a new dataset of historical maps tailored for analyzing large-scale, long-term land use and land cover evolution with limited annotations. Spanning metropolitan France (548,305 km²), our dataset contains three map collections from the 18th, 19th, and 20th centuries. We provide both comprehensive modern labels and 22,878 km² of manually annotated historical labels for the 18th and 19th century maps. Our dataset illustrates the complexity of the segmentation task, featuring stylistic inconsistencies, interpretive ambiguities, and significant landscape changes (e.g., marshlands disappearing in favor of forests). We assess the difficulty of these challenges by benchmarking three approaches: a fully-supervised model trained with historical labels, and two weakly-supervised models that rely only on modern annotations. The latter either use the modern labels directly or first perform image-to-image translation to address the stylistic gap between historical and contemporary maps. Finally, we discuss how these methods can support long-term environment monitoring, offering insights into centuries of landscape transformation. Our official project repository is publicly available at https://github.com/Archiel19/FRAx4.git.

[104] Zero-Shot Chinese Character Recognition with Hierarchical Multi-Granularity Image-Text Aligning

Yinglian Zhu,Haiyang Yu,Qizao Wang,Wei Lu,Xiangyang Xue,Bin Li

Main category: cs.CV

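TL;DR: The paper proposes Hi-GITA, a hierarchical multi-granularity image-text aligning framework based on a contrastive paradigm for Chinese character recognition, which significantly improves zero-shot recognition.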

Details

Motivation: Existing Chinese character recognition methods are usually based on auto-regressive prediction with edit-distance post-processing and rely on a single-level character representation, so they do not fully exploit the fine-grained semantic information of Chinese characters. Method: Proposes multi-granularity encoders on the image and text sides, introduces multi-granularity fusion modules, and aligns the multi-granularity representations with a fine-grained decoupled image-text contrastive loss. Result: Experiments show that Hi-GITA significantly outperforms existing methods on zero-shot Chinese character recognition, for example with about a 20% accuracy gain in the handwritten character and radical zero-shot settings. Conclusion: With hierarchical multi-granularity representations and contrastive learning, Hi-GITA improves Chinese character recognition. Abstract: Chinese Character Recognition (CCR) is a fundamental technology for intelligent document processing. Unlike Latin characters, Chinese characters exhibit unique spatial structures and compositional rules, allowing for the use of fine-grained semantic information in representation. However, existing approaches are usually based on auto-regressive prediction with edit-distance post-processing and typically rely on a single-level character representation. In this paper, we propose a Hierarchical Multi-Granularity Image-Text Aligning (Hi-GITA) framework based on a contrastive paradigm. To leverage the abundant fine-grained semantic information of Chinese characters, we propose multi-granularity encoders on both image and text sides. Specifically, the Image Multi-Granularity Encoder extracts hierarchical image representations from character images, capturing semantic cues from localized strokes to holistic structures. The Text Multi-Granularity Encoder extracts stroke and radical sequence representations at different levels of granularity. To better capture the relationships between strokes and radicals, we introduce Multi-Granularity Fusion Modules on the image and text sides, respectively. Furthermore, to effectively bridge the two modalities, we introduce a Fine-Grained Decoupled Image-Text Contrastive loss, which aligns image and text representations across multiple granularities. Extensive experiments demonstrate that our proposed Hi-GITA significantly outperforms existing zero-shot CCR methods. For instance, it brings about 20% accuracy improvement in handwritten character and radical zero-shot settings. Code and models will be released soon.
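
As background for the contrastive paradigm mentioned above, the following sketch implements the generic symmetric image-text contrastive loss (InfoNCE over a batch of matched pairs, as popularized by CLIP-style training). It is not the Fine-Grained Decoupled Image-Text Contrastive loss of the paper, whose exact form the abstract does not give. The function names and the temperature value are assumptions.

```python
import math

def _mean_diagonal_nll(logits):
    """Mean negative log-softmax of the matching (diagonal) entry of each row."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)

def _normalise(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def image_text_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE: image i should match text i and vice versa."""
    img = [_normalise(v) for v in img]
    txt = [_normalise(v) for v in txt]
    # Cosine similarities scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_mean_diagonal_nll(logits) + _mean_diagonal_nll(logits_t))

aligned = image_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = image_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

A multi-granularity variant could, for instance, evaluate such a loss separately on stroke-level, radical-level, and character-level embeddings and combine the terms, although how Hi-GITA decouples and weights them is not specified in the abstract.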

[105] VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software

Brandon Man,Ghadi Nehme,Md Ferdous Alam,Faez Ahmed

Main category: cs.CV

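TL;DR: VideoCAD is a large-scale synthetic dataset for learning UI interactions in CAD operations. It supports long-horizon, complex tasks and comes with the VideoCADFormer model and a VQA benchmark.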

Details

Motivation: Existing datasets and methods for AI-driven UI agents do not meet the demands of professional engineering tools, especially complex, long-horizon tasks such as CAD. Method: An automated framework generates over 41K annotated videos of CAD operations; the paper proposes the VideoCADFormer model to learn UI interactions and designs a VQA benchmark. Result: VideoCADFormer outperforms baseline models at learning CAD interactions, and the VQA benchmark exposes key challenges in video-based UI understanding. Conclusion: VideoCAD opens a new direction for learning complex UI interactions and highlights the challenges of multimodal reasoning, spatial reasoning, and long-horizon dependencies. Abstract: Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools. In this work, we introduce VideoCAD, the first attempt at engineering UI interaction learning for precision tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated using an automated framework for collecting high-fidelity UI action data from human-made CAD designs. Compared to existing datasets, VideoCAD offers an order of magnitude higher complexity in UI interaction learning for real-world engineering tasks, having up to a 20x longer time horizon than other datasets. We show two important downstream applications of VideoCAD: learning UI interactions from professional precision 3D CAD tools and a visual question-answering (VQA) benchmark designed to evaluate the spatial reasoning and video understanding abilities of multimodal large language models (LLMs). To learn the UI interactions, we propose VideoCADFormer, a state-of-the-art model for learning CAD interactions directly from video, which outperforms multiple behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multi-modal and spatial reasoning, and long-horizon dependencies.

[106] Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck

Yuwen Tan,Yuan Qing,Boqing Gong

Main category: cs.CV

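TL;DR: The study finds that current large language models (LLMs) lack hierarchical knowledge about the visual world, which limits the hierarchical visual understanding of vision LLMs. The finding is based on about one million visual question answering tasks; finetuning a vision LLM on them improves the LLM's hierarchical consistency more than the vision LLM's.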

Details

Motivation: To expose the gap in LLMs' hierarchical knowledge of the visual world and examine how it bottlenecks the understanding of vision LLMs. Method: About one million four-choice visual question answering tasks are constructed from six taxonomies and four image datasets, and a vision LLM is finetuned on them. Result: Finetuning improves the LLM's hierarchical consistency more than the vision LLM's, which to some extent reaffirms that the LLM is the bottleneck for hierarchical understanding in vision LLMs. Conclusion: LLMs need the corresponding taxonomy knowledge before vision LLMs can understand visual concepts in a fully hierarchical way. Abstract: This paper reveals that many state-of-the-art large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual understanding (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs' bottleneck effect to some extent because the VQA tasks improve the LLM's hierarchical consistency more than the vision LLM's. We conjecture that one cannot make vision LLMs understand visual concepts fully hierarchically until LLMs possess corresponding taxonomy knowledge.

[107] Reading Recognition in the Wild

Charig Yang,Samiul Alam,Shakhrul Iman Siam,Michael J. Proulx,Lambert Mathias,Kiran Somasundaram,Luis Pesqueira,James Fort,Sheroze Sheriffdeen,Omkar Parkhi,Carl Ren,Mi Zhang,Yuning Chai,Richard Newcombe,Hyo Jin Kim

Main category: cs.CV

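TL;DR: The paper introduces a new task, reading recognition, which determines when the user is reading. It also introduces the first large-scale multimodal dataset for the task, Reading in the Wild, and a flexible transformer model.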

Details

Motivation: Enabling egocentric contextual AI in smart glasses requires keeping a record of the user's interactions with the world, including reading. Method: A flexible transformer model is designed around three modalities (RGB, eye gaze, head pose) and accepts them individually or combined. Result: The modalities are shown to be relevant and complementary, and the paper investigates how to encode each one efficiently. The dataset also extends studies of reading-type classification in scale and diversity. Conclusion: The work provides a new method and dataset for reading recognition and advances reading understanding research in unconstrained settings. Abstract: To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user's interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism. Code, model, and data will be public.

[108] ViStoryBench: Comprehensive Benchmark Suite for Story Visualization

Cailin Zhuang,Ailin Huang,Wei Cheng,Jingwei Wu,Yaoqi Hu,Jiaqi Liao,Zhewei Huang,Hongyuan Wang,Xinyao Liao,Weiwei Cai,Hengyuan Xu,Xuanyang Zhang,Xianfang Zeng,Gang Yu,Chi Zhang

Main category: cs.CV

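TL;DR: ViStoryBench is a comprehensive benchmark for evaluating story visualization models. It covers diverse story types and artistic styles and aims to improve model performance in real-world scenarios.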

Details

Motivation: Further improving story visualization frameworks in real-world scenarios requires a comprehensive evaluation method. Method: A diverse dataset covering different plots and visual styles is collected, and multi-dimensional evaluation metrics are designed. Result: ViStoryBench evaluates model performance comprehensively and helps identify strengths and weaknesses. Conclusion: The benchmark gives researchers a systematic evaluation tool and supports targeted model improvements. Abstract: Story visualization, which aims to generate a sequence of visually coherent images aligning with a given narrative and reference images, has seen significant progress with recent advancements in generative models. To further enhance the performance of story visualization frameworks in real-world scenarios, we introduce a comprehensive evaluation benchmark, ViStoryBench. We collect a diverse dataset encompassing various story types and artistic styles, ensuring models are evaluated across multiple dimensions such as different plots (e.g., comedy, horror) and visual aesthetics (e.g., anime, 3D renderings). ViStoryBench is carefully curated to balance narrative structures and visual elements, featuring stories with single and multiple protagonists to test models' ability to maintain character consistency. Additionally, it includes complex plots and intricate world-building to challenge models in generating accurate visuals. To ensure comprehensive comparisons, our benchmark incorporates a wide range of evaluation metrics assessing critical aspects. This structured and multifaceted framework enables researchers to thoroughly identify both the strengths and weaknesses of different models, fostering targeted improvements.

Xinqi Xiong,Prakrut Patel,Qingyuan Fan,Amisha Wadhwa,Sarathy Selvam,Xiao Guo,Luchao Qi,Xiaoming Liu,Roni Sengupta

Main category: cs.CV

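TL;DR: The paper introduces TalkingHeadBench, a multi-model, multi-generator benchmark and dataset for evaluating state-of-the-art deepfake detectors on the latest generators.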

Details

Motivation: Current deepfake detection benchmarks do not reflect the rapid progress of generation techniques and offer little insight into model robustness and generalization. Method: A dataset of deepfakes generated by academic and commercial models is built, with evaluation protocols that test generalization under distribution shifts in identity and generator characteristics. Result: A range of detection methods (CNNs, vision transformers, temporal models) is evaluated, and Grad-CAM visualizations are used to analyze common failure modes and detector biases. Conclusion: TalkingHeadBench aims to drive research on more robust and generalizable detection models in the face of rapidly evolving generative techniques. Abstract: The rapid advancement of talking-head deepfake generation fueled by advanced generative models has elevated the realism of synthetic videos to a level that poses substantial risks in domains such as media, politics, and finance. However, current benchmarks for deepfake talking-head detection fail to reflect this progress, relying on outdated generators and offering limited insight into model robustness and generalization. We introduce TalkingHeadBench, a comprehensive multi-model multi-generator benchmark and curated dataset designed to evaluate the performance of state-of-the-art detectors on the most advanced generators. Our dataset includes deepfakes synthesized by leading academic and commercial models and features carefully constructed protocols to assess generalization under distribution shifts in identity and generator characteristics. We benchmark a diverse set of existing detection methods, including CNNs, vision transformers, and temporal models, and analyze their robustness and generalization capabilities. In addition, we provide error analysis using Grad-CAM visualizations to expose common failure modes and detector biases. TalkingHeadBench is hosted on https://huggingface.co/datasets/luchaoqi/TalkingHeadBench with open access to all data splits and protocols. Our benchmark aims to accelerate research towards more robust and generalizable detection models in the face of rapidly evolving generative techniques.

[110] Time Blindness: Why Video-Language Models Can't See What Humans Can?

Ujjwal Upadhyay,Mukul Ranjan,Zhiqiang Shen,Mohamed Elhoseiny

Main category: cs.CV

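TL;DR: The paper introduces the SpookyBench benchmark, which exposes the limits of current vision-language models on purely temporal tasks and a large gap to human performance.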

Details

Motivation: When spatial information is obscured, existing vision-language models fail to capture purely temporal patterns that humans recognize with ease. Method: The SpookyBench benchmark is created to evaluate models on sequences of noise-like frames. Result: Humans exceed 98% accuracy while state-of-the-art models reach 0%, indicating that the models fail to exploit temporal cues. Conclusion: New architectures or training paradigms are needed to decouple spatial dependencies from temporal processing and improve temporal pattern recognition. Abstract: Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), the temporal understanding of models degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. The dataset and code have been made available on our project website: https://timeblindness.github.io/.

[111] SiLVR: A Simple Language-based Video Reasoning Framework

Ce Zhang,Yan-Bo Lin,Ziyang Wang,Mohit Bansal,Gedas Bertasius

Main category: cs.CV

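TL;DR: SiLVR is a simple language-based video reasoning framework. It decomposes complex video understanding into two stages and combines multisensory inputs with an adaptive token reduction scheme, achieving the best-reported results on several complex video-language benchmarks.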

Details

Motivation: The reasoning capabilities of multimodal large language models (MLLMs) on complex video-language tasks lag significantly behind those of text-only LLMs, so an effective way to close the gap is needed. Method: SiLVR has two stages: 1) raw video is converted into language-based representations (such as short clip captions and audio/speech subtitles); 2) the language descriptions are fed into a strong reasoning LLM to solve the task. An adaptive token reduction scheme handles long-context multisensory inputs. Result: SiLVR achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. The study shows that strong reasoning LLMs can effectively aggregate multisensory information from video, speech, and audio. Conclusion: SiLVR is simple, modular, and training-free, and it demonstrates the potential of reasoning LLMs for complex video-language tasks. Abstract: Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an adaptive token reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study focused on video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. Code is available at https://github.com/CeeZh/SILVR.
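
The abstract only says that the token reduction scheme dynamically determines the temporal granularity of sampling. One plausible reading, sketched below under that assumption, is to coarsen the clip length until the captions fit a token budget. The function name, the doubling rule, and the fixed tokens-per-clip estimate are hypothetical and are not taken from the SiLVR code.

```python
import math

def choose_clip_length(duration_s, tokens_per_clip, token_budget, min_clip_s=1.0):
    """Return the shortest clip length, doubling from min_clip_s, whose captions fit the budget.

    Assumes every clip caption costs roughly tokens_per_clip tokens.
    """
    clip_s = min_clip_s
    while math.ceil(duration_s / clip_s) * tokens_per_clip > token_budget:
        if clip_s >= duration_s:
            break  # one clip already spans the whole video; cannot coarsen further
        clip_s *= 2
    return clip_s
```

For a 600-second video with about 50 tokens per caption and a 4000-token budget, this sketch settles on 8-second clips (75 captions, 3750 tokens), while a short video that already fits keeps the finest granularity.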

[112] GenSpace: Benchmarking Spatially-Aware Image Generation

Zehan Wang,Jiayang Xu,Ziang Zhang,Tianyu Pan,Chao Du,Hengshuang Zhao,Zhou Zhao

Main category: cs.CV

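TL;DR: GenSpace is a benchmark and evaluation pipeline for the 3D spatial awareness of AI image generation models. It finds that current models fall short on details such as object placement, relationships, and measurements.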

Details

Motivation: To assess the 3D spatial awareness of current AI image generation models and address the shortcomings of standard evaluation methods. Method: The GenSpace benchmark and a specialized evaluation pipeline are proposed; the pipeline reconstructs 3D scene geometry with multiple visual foundation models to provide a more accurate metric. Result: AI models generate visually appealing images but perform poorly on 3D details such as object placement, relationships, and measurements. Conclusion: The paper summarizes three core limitations in the spatial perception of current image generation models and points to directions for improving spatial intelligence. Abstract: Humans can intuitively compose and arrange scenes in the 3D space for photography. However, can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace, a novel benchmark and evaluation pipeline to comprehensively assess the spatial awareness of current image generation models. Furthermore, standard evaluations using general Vision-Language Models (VLMs) frequently fail to capture the detailed spatial errors. To handle this challenge, we propose a specialized evaluation pipeline and metric, which reconstructs 3D scene geometry using multiple visual foundation models and provides a more accurate and human-aligned metric of spatial faithfulness. Our findings show that while AI models create visually appealing images and can follow general instructions, they struggle with specific 3D details like object placement, relationships, and measurements. We summarize three core limitations in the spatial perception of current state-of-the-art image generation models: 1) Object Perspective Understanding, 2) Egocentric-Allocentric Transformation and 3) Metric Measurement Adherence, highlighting possible directions for improving spatial intelligence in image generation.

[113] MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning

Yiqing Liang,Jielin Qiu,Wenhao Ding,Zuxin Liu,James Tompkin,Mengdi Xu,Mengzhou Xia,Zhengzhong Tu,Laixi Shi,Jiacheng Zhu

Main category: cs.CV

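TL;DR: The paper proposes an RLVR post-training framework for multimodal LLMs that improves generalization and reasoning by optimizing the dataset mixture strategy.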

Details

Motivation: RLVR has great potential for multimodal LLMs, but task diversity and conflicting objectives complicate it, so the dataset mixture strategy needs to be optimized. Method: A multimodal RLVR framework is developed, together with a data mixture strategy that predicts the RL fine-tuning outcome and optimizes the mixture. Result: Multi-domain RLVR training combined with the mixture strategy significantly improves model performance; the best mixture improves out-of-distribution accuracy by an average of 5.24% over a uniform mixture. Conclusion: A systematic dataset mixture strategy effectively improves the generalization of multimodal LLMs and offers a new approach to applying RLVR to multimodal tasks. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for post-training large language models (LLMs), achieving state-of-the-art performance on tasks with structured, verifiable answers. Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but is complicated by the broader, heterogeneous nature of vision-language tasks that demand nuanced visual, logical, and spatial capabilities. As such, training MLLMs using RLVR on multiple datasets could be beneficial but creates challenges with conflicting objectives from interaction among diverse datasets, highlighting the need for optimal dataset mixture strategies to improve generalization and reasoning. We introduce a systematic post-training framework for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation and benchmark implementation. Specifically, (1) we developed a multimodal RLVR framework for multi-dataset post-training by curating a dataset that contains different verifiable vision-language problems and enabling multi-domain online RL learning with different verifiable rewards; (2) we proposed a data mixture strategy that learns to predict the RL fine-tuning outcome from the data mixture distribution, and consequently optimizes the best mixture. Comprehensive experiments showcase that multi-domain RLVR training, when combined with mixture prediction strategies, can significantly boost MLLM general reasoning capacities. Our best mixture improves the post-trained model's accuracy on out-of-distribution benchmarks by an average of 5.24% compared to the same model post-trained with uniform data mixture, and by a total of 20.74% compared to the pre-finetuning baseline.

[114] ProxyThinker: Test-Time Guidance through Small Visual Reasoners

Zilin Xiao,Jaywon Koo,Siru Ouyang,Jefferson Hernandez,Yu Meng,Vicente Ordonez

Main category: cs.CV

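TL;DR: ProxyThinker is a training-free inference-time technique. By modifying the decoding dynamics, it lets large vision-language models inherit the capabilities of small slow-thinking visual reasoners, improving performance with much faster inference than previous decoding-time methods.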

Details

Motivation: Reinforcement fine-tuning (RFT) improves the reasoning of vision-language models, but its training cost is high and makes scaling model size difficult. Method: ProxyThinker subtracts the output distributions of base models from those of RFT reasoners to modify the decoding dynamics and elicit slow-thinking reasoning. Result: ProxyThinker performs strongly on challenging visual benchmarks covering spatial, mathematical, and multi-disciplinary reasoning, with up to 38x faster inference than previous decoding-time methods. Conclusion: ProxyThinker offers an efficient path to the practical deployment of large vision-language models. Abstract: Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities in large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits the slow-thinking reasoning demonstrated by emergent sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks on spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with the performance of their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38 $\times$ faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://github.com/MrZilinXiao/ProxyThinker.
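
The subtraction of base-model outputs from RFT-reasoner outputs can be sketched as logit arithmetic at each decoding step. This is a minimal illustration under assumptions: the abstract does not say whether the difference is taken over logits or log-probabilities, and the scaling factor `alpha` and the function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_guided_distribution(large_base, small_rft, small_base, alpha=1.0):
    """Shift the large base model's next-token logits by the small (RFT - base) difference.

    All three inputs are per-token logit lists over the same vocabulary.
    """
    combined = [
        lb + alpha * (sr - sb)
        for lb, sr, sb in zip(large_base, small_rft, small_base)
    ]
    return softmax(combined)
```

With `alpha=0` the result is the large base model's own distribution; a positive `alpha` moves probability mass toward tokens that the small RFT reasoner prefers more strongly than its own base model does.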

[115] MiniMax-Remover: Taming Bad Noise Helps Video Object Removal

Bojia Zi,Weixuan Peng,Xianbiao Qi,Jianan Wang,Shihao Zhao,Rong Xiao,Kam-Fai Wong

Main category: cs.CV

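TL;DR: MiniMax-Remover is a novel two-stage video object removal method. It simplifies the model architecture and applies a minimax optimization strategy, significantly improving editing quality and inference speed.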

Details

Motivation: Video object removal is a critical subtask of video editing, but existing methods suffer from hallucinated objects, visual artifacts, and high computational cost. Method: Stage one removes the text input and cross-attention layers to simplify the model; stage two uses a minimax optimization strategy to further improve editing quality and inference speed. Result: The method achieves state-of-the-art video object removal with as few as 6 sampling steps and without CFG, significantly improving inference efficiency. Conclusion: Experiments show that MiniMax-Remover outperforms existing methods in both quality and efficiency. Abstract: Recent advances in video diffusion models have driven rapid progress in video editing techniques. However, video object removal, a critical subtask of video editing, remains challenging due to issues such as hallucinated objects and visual artifacts. Furthermore, existing methods often rely on computationally expensive sampling procedures and classifier-free guidance (CFG), resulting in slow inference. To address these limitations, we propose MiniMax-Remover, a novel two-stage video object removal approach. Motivated by the observation that the text condition is not best suited for this task, we simplify the pretrained video generation model by removing textual input and cross-attention layers, resulting in a more lightweight and efficient model architecture in the first stage. In the second stage, we distill our remover on successful videos produced by the stage-1 model and curated by human annotators, using a minimax optimization strategy to further improve editing quality and inference speed. Specifically, the inner maximization identifies adversarial input noise ("bad noise") that causes removal failures, while the outer minimization step trains the model to generate high-quality removal results even under such challenging conditions. As a result, our method achieves state-of-the-art video object removal results with as few as 6 sampling steps and does not rely on CFG, significantly improving inference efficiency. Extensive experiments demonstrate the effectiveness and superiority of MiniMax-Remover compared to existing methods. Code and videos are available at: https://minimax-remover.github.io.

[116] ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL

Yu Zhang,Yunqi Li,Yifan Yang,Rui Wang,Yuqing Yang,Dai Qi,Jianmin Bao,Dongdong Chen,Chong Luo,Lili Qiu

Main category: cs.CV

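TL;DR: ReasonGen-R1 is a two-stage framework that combines chain-of-thought reasoning with reinforcement learning, using text-based reasoning to improve image generation, and it outperforms existing models.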

Details

Motivation: Chain-of-thought reasoning and reinforcement learning have driven breakthroughs in NLP, but their use in generative vision models remains underexplored. Method: Supervised fine-tuning first gives an autoregressive image generator text-based reasoning skills, and Group Relative Policy Optimization then refines its outputs. Result: On the GenEval, DPG, and T2I benchmarks, ReasonGen-R1 outperforms strong baselines and prior state-of-the-art models. Conclusion: By reasoning in text before generating images, ReasonGen-R1 delivers consistent performance gains. Abstract: Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization. To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.
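
Group Relative Policy Optimization scores each sample against the other samples drawn for the same prompt. The sketch below shows only that group-relative advantage step, in its commonly used mean-and-standard-deviation form; the reward values would come from the vision language model mentioned in the abstract, and the function name and `eps` constant are assumptions for illustration.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sample group to zero mean and unit scale."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps keeps the division finite when every sample receives the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples that score above their group's average receive a positive advantage and are reinforced, while a group with identical rewards contributes no learning signal.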

[117] Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

Tajamul Ashraf,Amal Saqib,Hanan Ghani,Muhra AlMahri,Yuhao Li,Noor Ahsan,Umair Nawaz,Jean Lahoud,Hisham Cholakkal,Mubarak Shah,Philip Torr,Fahad Shahbaz Khan,Rao Muhammad Anwer,Salman Khan

Main category: cs.CV

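TL;DR: Agent-X is a large-scale benchmark for evaluating the multi-step, deep reasoning capabilities of vision-centric agents in multimodal settings, and it reveals the limits of current models on complex tasks.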

Details

Motivation: Existing benchmarks are typically limited to synthetic, single-turn queries and limited visual modalities, and cannot assess the quality of multi-step reasoning in real-world settings. Method: The Agent-X benchmark is introduced with 828 tasks in authentic visual contexts across six environments, together with a fine-grained, step-level evaluation framework. Result: Even the best models (including the GPT, Gemini, and Qwen families) achieve less than 50% full-chain success on multi-step vision tasks. Conclusion: Agent-X reveals key bottlenecks in the reasoning and tool-use capabilities of current LMMs and identifies directions for future research. Abstract: Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries, limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at https://github.com/mbzuai-oryx/Agent-X

[118] AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion

Yangyi Huang,Ye Yuan,Xueting Li,Jan Kautz,Umar Iqbal

Main category: cs.CV

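TL;DR: AdaHuman is a novel framework that generates high-fidelity, animatable 3D avatars from a single image, addressing the shortcomings of existing methods in detail and animation readiness.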

Details

Motivation: Existing methods struggle to generate highly detailed, animation-ready 3D avatars, which limits real-world applications. Method: AdaHuman combines two innovations: 1) a pose-conditioned 3D joint diffusion model that generates multi-view images together with a 3D Gaussian reconstruction; 2) a compositional 3D Gaussian refinement module that enhances local detail through image-to-image refinement. Result: AdaHuman significantly outperforms existing methods on public benchmarks and in-the-wild images and generates highly realistic standardized A-pose avatars. Conclusion: AdaHuman provides an effective solution for generating high-fidelity, animatable 3D avatars; code and models will be released for research purposes. Abstract: Existing methods for image-to-3D avatar generation struggle to produce highly detailed, animation-ready avatars suitable for real-world applications. We introduce AdaHuman, a novel framework that generates high-fidelity animatable 3D avatars from a single in-the-wild image. AdaHuman incorporates two key innovations: (1) A pose-conditioned 3D joint diffusion model that synthesizes consistent multi-view images in arbitrary poses alongside corresponding 3D Gaussian Splats (3DGS) reconstruction at each diffusion step; (2) A compositional 3DGS refinement module that enhances the details of local body parts through image-to-image refinement and seamlessly integrates them using a novel crop-aware camera ray map, producing a cohesive detailed 3D avatar. These components allow AdaHuman to generate highly realistic standardized A-pose avatars with minimal self-occlusion, enabling rigging and animation with any input motion. Extensive evaluation on public benchmarks and in-the-wild images demonstrates that AdaHuman significantly outperforms state-of-the-art methods in both avatar reconstruction and reposing. Code and models will be publicly available for research purposes.

cs.GR [Back]

[119] Force-Dual Modes: Subspace Design from Stochastic Forces

Otman Benchekroun,Eitan Grinspun,Maurizio Chiaramonte,Philip Allen Etter

Main category: cs.GR

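TL;DR: The paper proposes a method for constructing reduced order modeling (ROM) subspaces from force distributions, for accelerating finite element simulations in graphics and engineering.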

Details

Motivation: It is not always clear which subspace is optimal for a dynamic simulation, so a method is needed that adapts to common scene interactions such as constraints, control, contact, and musculoskeletal actuation. Method: The paper takes a statistical view: user-designed force distributions are pushed through a linearized simulation to obtain a distribution on displacements, and a low-rank Gaussian model is fit to it to construct the subspace. Result: The method produces subspaces tailored to physical material properties and to arbitrary force distributions (such as control, contact, and musculoskeletal interactions). Conclusion: The framework generalizes Linear Modal Analysis and Green's Function subspaces and yields optimal subspaces for complex scene interactions. Abstract: Designing subspaces for Reduced Order Modeling (ROM) is crucial for accelerating finite element simulations in graphics and engineering. Unfortunately, it's not always clear which subspace is optimal for arbitrary dynamic simulation. We propose to construct simulation subspaces from force distributions, allowing us to tailor such subspaces to common scene interactions involving constraint penalties, handle-based control, contact and musculoskeletal actuation. To achieve this, we adopt a statistical perspective on Reduced Order Modeling, which allows us to push such user-designed force distributions through a linearized simulation to obtain a dual distribution on displacements. To construct our subspace, we then fit a low-rank Gaussian model to this displacement distribution, which we show generalizes Linear Modal Analysis subspaces for uncorrelated unit variance force distributions, as well as Green's Function subspaces for low rank force distributions. We show our framework allows for the construction of subspaces that are optimal both with respect to physical material properties, as well as arbitrary force distributions as observed in handle-based, contact, and musculoskeletal scene interactions.
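
The step of pushing a force distribution through a linearized simulation can be written out for the simplest case. The derivation below is a sketch under stated assumptions, not the paper's formulation: forces are zero-mean Gaussian, the linearization is a single invertible system matrix $K$ (mass weighting and time stepping are ignored), and the symbols $K$, $\Sigma_f$, $\Sigma_u$, $U_r$, and $B$ are notation introduced here.

```latex
% Assumed notation (not from the abstract): K is the linearized system matrix,
% f the random force, u the displacement, r the subspace dimension.
\begin{aligned}
f &\sim \mathcal{N}(0, \Sigma_f), \qquad K u = f
  \;\Rightarrow\; u = K^{-1} f \sim \mathcal{N}(0, \Sigma_u), \\
\Sigma_u &= K^{-1} \Sigma_f K^{-\top}, \\
\Sigma_u &\approx U_r \Lambda_r U_r^{\top}
  \quad \text{(top-$r$ eigenpairs; the columns of $U_r$ span the subspace)}, \\
\Sigma_f = I,\; K = K^{\top} \succ 0 &\;\Rightarrow\; \Sigma_u = K^{-2}
  \quad \text{(leading eigenvectors are those of $K$ with the smallest eigenvalues)}, \\
\Sigma_f = B B^{\top} &\;\Rightarrow\; \Sigma_u = (K^{-1} B)(K^{-1} B)^{\top}
  \quad \text{(range spanned by the static responses $K^{-1} B$)}.
\end{aligned}
```

Under these assumptions the two special cases match the abstract's claims in spirit: unit-variance uncorrelated forces recover the softest modes of $K$, as in modal analysis, and a low-rank force covariance recovers the responses to a fixed set of force patterns, as in a Green's function basis.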

[120] 3DGEER: Exact and Efficient Volumetric Rendering with 3D Gaussians

Zixun Huang,Cho-Ying Wu,Yuliang Guo,Xinyu Huang,Liu Ren

Main category: cs.GR

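TL;DR: 3DGEER is an exact and efficient volumetric Gaussian rendering method. A closed-form density integral and optimized ray association improve rendering quality while keeping real-time performance.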

Details

Motivation: 3D Gaussian Splatting (3DGS) advanced the balance between rendering quality and efficiency, but its approximate projection limits quality under a large field of view, and existing methods have not achieved exactness and efficiency together. Method: Starting from first principles, 3DGEER derives a closed-form expression for the density integral along a ray through a 3D Gaussian, and proposes the Particle Bounding Frustum (PBF) and Bipolar Equiangular Projection (BEAP) for exact and efficient rendering. Result: Experiments show that 3DGEER outperforms prior methods on pinhole and fisheye datasets, setting a new state of the art in real-time neural rendering. Conclusion: With exact mathematical modeling and efficient ray association, 3DGEER addresses the limitations of 3DGS and offers a new approach to real-time, high-quality rendering. Abstract: 3D Gaussian Splatting (3DGS) marks a significant milestone in balancing the quality and efficiency of differentiable rendering. However, its high efficiency stems from an approximation of projecting 3D Gaussians onto the image plane as 2D Gaussians, which inherently limits rendering quality, particularly under large Field-of-View (FoV) camera inputs. While several recent works have extended 3DGS to mitigate these approximation errors, none have successfully achieved both exactness and high efficiency simultaneously. In this work, we introduce 3DGEER, an Exact and Efficient Volumetric Gaussian Rendering method. Starting from first principles, we derive a closed-form expression for the density integral along a ray traversing a 3D Gaussian distribution. This formulation enables precise forward rendering with arbitrary camera models and supports gradient-based optimization of 3D Gaussian parameters. To ensure both exactness and real-time performance, we propose an efficient method for computing a tight Particle Bounding Frustum (PBF) for each 3D Gaussian, enabling accurate and efficient ray-Gaussian association. We also introduce a novel Bipolar Equiangular Projection (BEAP) representation to accelerate ray association under generic camera models. BEAP further provides a more uniform ray sampling strategy to apply supervision, which empirically improves reconstruction quality. Experiments on multiple pinhole and fisheye datasets show that our method consistently outperforms prior methods, establishing a new state-of-the-art in real-time neural rendering.
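
The idea of a closed-form density integral along a ray can be illustrated with the standard line integral of an unnormalized Gaussian. This is a sketch under assumptions, not the paper's expression: it integrates over the whole line rather than a bounded segment, ignores opacity, and restricts the covariance to a diagonal matrix so the example needs no linear algebra library. A numeric midpoint-rule integral is included only as a cross-check.

```python
import math

def gaussian_ray_integral(origin, direction, mean, variances):
    """Closed-form integral of exp(-0.5 * (x - mu)^T Sigma^-1 (x - mu)) along
    x(t) = origin + t * direction for t over the whole real line, with a
    diagonal Sigma given by per-axis variances."""
    delta = [o - m for o, m in zip(origin, mean)]
    a = sum(d * d / v for d, v in zip(direction, variances))
    b = sum(d * e / v for d, e, v in zip(direction, delta, variances))
    c = sum(e * e / v for e, v in zip(delta, variances))
    # Completing the square: a*t^2 + 2*b*t + c = a*(t + b/a)^2 + (c - b^2/a).
    return math.sqrt(2.0 * math.pi / a) * math.exp(-0.5 * (c - b * b / a))

def numeric_ray_integral(origin, direction, mean, variances,
                         t_min=-50.0, t_max=50.0, steps=20000):
    """Midpoint-rule approximation of the same integral, for verification only."""
    dt = (t_max - t_min) / steps
    total = 0.0
    for i in range(steps):
        t = t_min + (i + 0.5) * dt
        q = sum((o + t * d - m) ** 2 / v
                for o, d, m, v in zip(origin, direction, mean, variances))
        total += math.exp(-0.5 * q)
    return total * dt
```

Because the exponent is quadratic in the ray parameter, the integral reduces to a one-dimensional Gaussian integral, which is why no sampling along the ray is needed and why the result does not depend on a particular camera model.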

[121] Minimizing Ray Tracing Memory Traffic through Quantized Structures and Ray Stream Tracing

Moritz Grauer,Johannes Hanika,Carsten Dachsbacher

Main category: cs.GR

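TL;DR: The paper presents a memory-efficient ray tracing method that combines compressed data structures with ray stream techniques to reduce memory traffic.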

Details

Motivation: Memory bandwidth is a major bottleneck for ray tracing performance, particularly as scene complexity grows and compute capability outpaces memory access speed. Method: Compressed BVH and triangle representations shrink the acceleration structure, and ray stream tracing reduces traversal stack memory traffic. Intersection tests use fixed-point arithmetic, avoiding the rounding errors of traditional floating-point approaches. Result: Quantitative analysis shows significant memory traffic reduction across scene complexities and BVH configurations. The 8-wide BVH ray stream implementation reduces memory traffic to 18% of traditional approaches. Conclusion: The method addresses the memory bandwidth limits and numerical precision challenges of modern ray tracing and is especially suited to bandwidth-constrained hardware. Abstract: Memory bandwidth constraints continue to be a significant limiting factor in ray tracing performance, particularly as scene complexity grows and computational capabilities outpace memory access speeds. This paper presents a memory-efficient ray tracing methodology that integrates compressed data structures with ray stream techniques to reduce memory traffic. The approach implements compressed BVH and triangle representations to minimize acceleration structure size in combination with ray stream tracing to reduce traversal stack memory traffic. The technique employs fixed-point arithmetic for intersection tests for prospective hardware with tailored integer operations. Despite using reduced precision, geometric holes are avoided by leveraging fixed-point arithmetic instead of encountering the floating-point rounding errors common in traditional approaches. Quantitative analysis demonstrates significant memory traffic reduction across various scene complexities and BVH configurations. The presented 8-wide BVH ray stream implementation reduces memory traffic to only 18% of traditional approaches by using 8-bit quantization for box and triangle coordinates and directly ray tracing these quantized structures. These reductions are especially beneficial for bandwidth-constrained hardware environments such as mobile devices. This integrated approach addresses both memory bandwidth limitations and numerical precision challenges inherent to modern ray tracing applications.
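
The 8-bit quantization of box coordinates can be sketched as conservative rounding of a child bounding box onto a grid spanned by its parent box. This is a common technique for compressed BVHs and is shown here as an assumption about the general approach; the paper's actual encoding, its triangle quantization, and its fixed-point intersection tests are not reproduced, and the function names are hypothetical.

```python
import math

def quantize_box(box_min, box_max, parent_min, parent_max, bits=8):
    """Conservatively quantize a child AABB onto an integer grid spanning the parent AABB.

    Minimum corners round down and maximum corners round up, so the quantized
    box encloses the original one (up to floating-point rounding in this sketch).
    """
    levels = (1 << bits) - 1
    qmin, qmax = [], []
    for lo, hi, p_lo, p_hi in zip(box_min, box_max, parent_min, parent_max):
        scale = levels / (p_hi - p_lo) if p_hi > p_lo else 0.0
        qmin.append(max(0, min(levels, math.floor((lo - p_lo) * scale))))
        qmax.append(max(0, min(levels, math.ceil((hi - p_lo) * scale))))
    return qmin, qmax

def dequantize_box(qmin, qmax, parent_min, parent_max, bits=8):
    """Map quantized corners back to world space for inspection."""
    levels = (1 << bits) - 1
    lo = [p_lo + q * (p_hi - p_lo) / levels
          for q, p_lo, p_hi in zip(qmin, parent_min, parent_max)]
    hi = [p_lo + q * (p_hi - p_lo) / levels
          for q, p_lo, p_hi in zip(qmax, parent_min, parent_max)]
    return lo, hi
```

Each corner then costs three bytes instead of three 32-bit floats, at the price of slightly looser boxes; a production implementation would also need to guard the dequantization against floating-point rounding, which this sketch does not do.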

[122] TC-GS: A Faster Gaussian Splatting Module Utilizing Tensor Cores

Zimu Liao,Jifeng Ding,Rong Fu,Siwei Cui,Ruixuan Gong,Li Wang,Boni Hu,Yi Wang,Hengjie Li,Xingcheng Zhang,Hui Wang

Main category: cs.GR

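TL;DR: TC-GS maps alpha computation to matrix multiplication so that Tensor Cores can accelerate 3D Gaussian rendering, giving a 2.18x speedup and up to 5.6x total acceleration.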

Details

Motivation: In existing 3D Gaussian rendering, alpha blending is the time bottleneck, while Tensor Cores are underused. Method: Proposes the TC-GS module, which turns alpha computation into matrix multiplication and introduces a global-to-local coordinate transformation to reduce precision errors. Result: Experiments show that TC-GS maintains rendering quality while providing a 2.18x speedup, for a total acceleration of up to 5.6x. Conclusion: TC-GS is a general-purpose module that integrates seamlessly into existing optimization frameworks and substantially improves 3D Gaussian rendering efficiency. Abstract: 3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where conditional alpha-blending dominates the time cost in the rendering pipeline. This paper proposes TC-GS, an algorithm-independent universal module that expands Tensor Core (TCU) applicability for 3DGS, leading to substantial speedups and seamless integration into existing 3DGS optimization frameworks. The key innovation lies in mapping alpha computation to matrix multiplication, fully utilizing otherwise idle TCUs in existing 3DGS implementations. TC-GS provides plug-and-play acceleration for existing top-tier acceleration algorithms tightly coupled with rendering pipeline designs, like Gaussian compression and redundancy elimination algorithms. Additionally, we introduce a global-to-local coordinate transformation to mitigate rounding errors from quadratic terms of pixel coordinates caused by Tensor Core half-precision computation. Extensive experiments demonstrate that our method maintains rendering quality while providing an additional 2.18x speedup over existing Gaussian acceleration algorithms, thus reaching up to a total 5.6x acceleration. The code is currently available anonymously at https://github.com/TensorCore3DGS/3DGSTensorCore
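
One way to see how alpha computation can become a matrix multiplication is to expand the 2D Gaussian exponent into a dot product. The sketch below assumes the usual 3DGS exponent -0.5 * (a*dx^2 + 2*b*dx*dy + c*dy^2) with conic coefficients (a, b, c). The six-term feature layout and the function names are illustrative assumptions, not necessarily the layout TC-GS uses.

```python
def pixel_features(x, y):
    """Per-pixel row vector: monomials of the pixel coordinates up to degree 2."""
    return [x * x, x * y, y * y, x, y, 1.0]

def gaussian_coefficients(mx, my, a, b, c):
    """Per-Gaussian column vector obtained by expanding the quadratic form
    around the Gaussian's projected mean (mx, my)."""
    return [
        -0.5 * a,
        -b,
        -0.5 * c,
        a * mx + b * my,
        b * mx + c * my,
        -0.5 * (a * mx * mx + 2.0 * b * mx * my + c * my * my),
    ]

def exponents_by_matmul(pixels, gaussians):
    """(num_pixels x 6) times (6 x num_gaussians): one exponent per pair."""
    feats = [pixel_features(x, y) for x, y in pixels]
    coefs = [gaussian_coefficients(*g) for g in gaussians]
    return [[sum(f * k for f, k in zip(row, col)) for col in coefs] for row in feats]

def exponent_direct(x, y, mx, my, a, b, c):
    """Reference: evaluate the exponent for a single pixel and Gaussian."""
    dx, dy = x - mx, y - my
    return -0.5 * (a * dx * dx + 2.0 * b * dx * dy + c * dy * dy)
```

Alpha is then the opacity times the exponential of each matrix entry. The x*x and y*y terms grow with image size, which is consistent with the abstract's remark that quadratic terms of pixel coordinates cause rounding errors in half precision and motivate a local coordinate frame.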

cs.CL [Back]

[123] Meaning Is Not A Metric: Using LLMs to make cultural context legible at scale

Cody Kommers,Drew Hemment,Maria Antoniak,Joel Z. Leibo,Hoyt Long,Emily Robinson,Adam Sobey

Main category: cs.CL

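TL;DR: This paper argues that large language models (LLMs) can make cultural context and human meaning legible at unprecedented scale in AI-based sociotechnical systems, overcoming the reliance of conventional systems on "thin description".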

Details

Motivation: Conventional AI systems rely on numerical representations (thin description) and cannot capture the cultural context and meaning of human activity. The "thick description" methods of the humanities and qualitative social sciences encode meaning effectively but are hard to scale. The verbal capabilities of LLMs make it possible to automate the generation and processing of thick descriptions. Method: Proposes using the verbal capabilities of LLMs to generate and process thick descriptions so that cultural context and human meaning are retained, and discusses the potential of thick description as a new representational format. Result: LLMs can partially automate the generation and processing of thick descriptions, addressing the scaling problem of traditional methods. Five key challenges are identified: preserving context, maintaining interpretive pluralism, integrating lived-experience and critical perspectives, distinguishing qualitative content from quantitative magnitude, and acknowledging that meaning is dynamic. Conclusion: Thick description may serve as a unifying framework for the difficulties of representing culture in LLMs. Applications of generative AI should focus on developing new representational formats that capture human meaning more fully. Abstract: This position paper argues that large language models (LLMs) can make cultural context, and therefore human meaning, legible at an unprecedented scale in AI-based sociotechnical systems. We argue that such systems have previously been unable to represent human meaning because they rely on thin descriptions: numerical representations that enforce standardization and therefore strip human activity of the cultural context that gives it meaning. By contrast, scholars in the humanities and qualitative social sciences have developed frameworks for representing meaning through thick description: verbal representations that accommodate heterogeneity and retain contextual information needed to represent human meaning. While these methods can effectively codify meaning, they are difficult to deploy at scale. However, the verbal capabilities of LLMs now provide a means of (at least partially) automating the generation and processing of thick descriptions, potentially overcoming this bottleneck. We argue that the problem of rendering human meaning legible is not just about selecting better metrics, but about developing new representational formats (based on thick description). We frame this as a crucial direction for the application of generative AI and identify five key challenges: preserving context, maintaining interpretive pluralism, integrating perspectives based on lived experience and critical distance, distinguishing qualitative content from quantitative magnitude, and acknowledging meaning as dynamic rather than static. Furthermore, we suggest that thick description has the potential to serve as a unifying framework to address a number of emerging concerns about the difficulties of representing culture in (or using) LLMs.

[124] Nine Ways to Break Copyright Law and Why Our LLM Won't: A Fair Use Aligned Generation Framework

Aakash Sen Sharma,Debdeep Sanyal,Priyansh Srivastava,Sundar Atreya H.,Shirish Karande,Mohan Kankanhalli,Murari Mandal

Main category: cs.CL

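TL;DR: The FUA-LLM framework uses legal grounding and preference optimization to reduce the copyright infringement risk of LLMs while preserving utility.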

Details

Motivation: Address the risk that LLMs infringe copyright when generating content, while avoiding the utility loss of conventional refusal-based filtering. Method: Builds the FairUseDB dataset, fine-tunes LLMs with DPO, and proposes two new evaluation metrics, Weighted Penalty Utility and CAH. Result: FUA-LLM substantially reduces infringing outputs (by up to 20%) while preserving utility. Conclusion: FUA-LLM offers a legally grounded and practical solution to LLM copyright problems. Abstract: Large language models (LLMs) commonly risk copyright infringement by reproducing protected content verbatim or with insufficient transformative modifications, posing significant ethical, legal, and practical concerns. Current inference-time safeguards predominantly rely on restrictive refusal-based filters, often compromising the practical utility of these models. To address this, we collaborated closely with intellectual property experts to develop FUA-LLM (Fair Use Aligned Language Models), a legally-grounded framework explicitly designed to align LLM outputs with fair-use doctrine. Central to our method is FairUseDB, a carefully constructed dataset containing 18,000 expert-validated examples covering nine realistic infringement scenarios. Leveraging this dataset, we apply Direct Preference Optimization (DPO) to fine-tune open-source LLMs, encouraging them to produce legally compliant and practically useful alternatives rather than resorting to blunt refusal. Recognizing the shortcomings of traditional evaluation metrics, we propose new measures: Weighted Penalty Utility and Compliance Aware Harmonic Mean (CAH) to balance infringement risk against response utility. Extensive quantitative experiments coupled with expert evaluations confirm that FUA-LLM substantially reduces problematic outputs (up to 20%) compared to state-of-the-art approaches, while preserving real-world usability.

[125] Conversational Exploration of Literature Landscape with LitChat

Mingyu Huang,Shasha Zhou,Yuxuan Chen,Ke Li

Main category: cs.CL

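TL;DR: LitChat is an interactive literature agent that combines large language models with data-driven tools to help users explore large bodies of literature quickly and obtain evidence-based insights.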

Details

Motivation: Manual literature review is no longer feasible in the era of big data, and large language models have limitations for systematic reviews. Method: LitChat automatically interprets user queries, retrieves relevant literature, constructs knowledge graphs, and applies data-mining techniques to generate evidence-based insights. Result: A case study on AI4Health shows that LitChat can navigate large-scale literature quickly and back its findings with data. Conclusion: LitChat offers an efficient, transparent, and evidence-based approach to systematic literature review. Abstract: We are living in an era of "big literature", where the volume of digital scientific publications is growing exponentially. While offering new opportunities, this also poses challenges for understanding literature landscapes, as traditional manual reviewing is no longer feasible. Recent large language models (LLMs) have shown strong capabilities for literature comprehension, yet they are incapable of offering "comprehensive, objective, open and transparent" views desired by systematic reviews due to their limited context windows and trust issues like hallucinations. Here we present LitChat, an end-to-end, interactive and conversational literature agent that augments LLM agents with data-driven discovery tools to facilitate literature exploration. LitChat automatically interprets user queries, retrieves relevant sources, constructs knowledge graphs, and employs diverse data-mining techniques to generate evidence-based insights addressing user needs. We illustrate the effectiveness of LitChat via a case study on AI4Health, highlighting its capacity to quickly navigate the users through large-scale literature landscape with data-based evidence that is otherwise infeasible with traditional means.

[126] Rethinking the Understanding Ability across LLMs through Mutual Information

Shaojie Wang,Sirui Ding,Na Zou

Main category: cs.CL

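TL;DR: Proposes an information-theoretic framework based on mutual information for evaluating the language understanding of large language models, finding that encoder models preserve information better than decoder models and that fine-tuning improves performance.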

Details

Motivation: Evaluate the intrinsic language understanding of large language models, going beyond conventional task-oriented evaluation. Method: Uses a mutual information (MI) framework, decomposes sentence-level MI into token-level MI, derives a computable lower bound via Fano's inequality, and implements a token-level recovery task. Result: Encoder models preserve information better than decoder models, and decoder models show information forgetting in late layers. Fine-tuning improves understanding ability. Conclusion: Mutual information can serve as a theoretical foundation for evaluating and improving language model capabilities. Abstract: Recent advances in large language models (LLMs) have revolutionized natural language processing, yet evaluating their intrinsic linguistic understanding remains challenging. Moving beyond specialized evaluation tasks, we propose an information-theoretic framework grounded in mutual information (MI) to achieve this. We formalize the understanding as MI between an input sentence and its latent representation (sentence-level MI), measuring how effectively input information is preserved in latent representation. Given that LLMs learn embeddings for individual tokens, we decompose sentence-level MI into token-level MI between tokens and sentence embeddings, establishing theoretical bounds connecting these measures. Based on this foundation, we theoretically derive a computable lower bound for token-level MI using Fano's inequality, which directly relates to token-level recoverability, the ability to predict original tokens from sentence embedding. We implement this recoverability task to comparatively measure MI across different LLMs, revealing that encoder-only models consistently maintain higher information fidelity than their decoder-only counterparts, with the latter exhibiting a distinctive late-layer "forgetting" pattern where mutual information is first enhanced and then discarded. Moreover, fine-tuning to maximize token-level recoverability consistently improves understanding ability of LLMs on tasks without task-specific supervision, demonstrating that mutual information can serve as a foundation for understanding and improving language model capabilities.
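
The link between recoverability and mutual information can be sketched with the textbook form of Fano's inequality, H(X|Y) <= h(Pe) + Pe * log2(|V| - 1), which gives I(X;Y) >= H(X) - h(Pe) - Pe * log2(|V| - 1). The code below is a generic illustration of that bound in bits. The function names are hypothetical, and the bound derived in the paper may take a different form.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def fano_mi_lower_bound(token_entropy_bits, error_rate, vocab_size):
    """Lower bound on I(token; embedding) from the token recovery error rate.

    token_entropy_bits: H(X), entropy of the token distribution in bits
    error_rate:         Pe, probability that the recovered token is wrong
    vocab_size:         |V|, number of possible tokens (at least 2)
    """
    cond_entropy_upper = binary_entropy(error_rate) + error_rate * math.log2(vocab_size - 1)
    # Mutual information is never negative, so clamp the bound at zero.
    return max(0.0, token_entropy_bits - cond_entropy_upper)
```

A lower recovery error tightens the bound, so a model whose sentence embedding allows more tokens to be predicted correctly is guaranteed to retain more information about them.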

[127] R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning

Yuan Li,Qi Luo,Xiaonan Li,Bufan Li,Qinyuan Cheng,Bo Wang,Yining Zheng,Yuxin Wang,Zhangyue Yin,Xipeng Qiu

Main category: cs.CL

TL;DR: R3-RAG uses reinforcement learning to teach an LLM to reason and retrieve step by step, significantly improving RAG system performance.

Details

Motivation: Dense retrievers are a bottleneck in RAG systems, and conventional iterative RAG is constrained by human-designed workflows. Method: Two stages: a cold start teaches the model to interleave reasoning and retrieval, then reinforcement learning improves its retrieval ability, using two rewards (answer correctness and document relevance). Result: R3-RAG significantly outperforms baselines and transfers to different retrievers. Conclusion: R3-RAG uses reinforcement learning to improve the reasoning and retrieval abilities of RAG systems. Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge with Large Language Models (LLMs) to enhance factual correctness and mitigate hallucination. However, dense retrievers often become the bottleneck of RAG systems due to their limited parameters compared to LLMs and their inability to perform step-by-step reasoning. While prompt-based iterative RAG attempts to address these limitations, it is constrained by human-designed workflows. To address these limitations, we propose R3-RAG, which uses Reinforcement learning to make the LLM learn how to Reason and Retrieve step by step, thus retrieving comprehensive external knowledge and leading to correct answers. R3-RAG is divided into two stages. We first use cold start to make the model learn the manner of iteratively interleaving reasoning and retrieval. Then we use reinforcement learning to further harness its ability to better explore the external retrieval environment. Specifically, we propose two rewards for R3-RAG: 1) answer correctness for outcome reward, which judges whether the trajectory leads to a correct answer; 2) relevance-based document verification for process reward, encouraging the model to retrieve documents that are relevant to the user question, through which we can let the model learn how to iteratively reason and retrieve relevant documents to get the correct answer. Experimental results show that R3-RAG significantly outperforms baselines and can transfer well to different retrievers. We release R3-RAG at https://github.com/Yuan-Li-FNLP/R3-RAG.

[128] Emergent LLM behaviors are observationally equivalent to data leakage

Christopher Barrie,Petter Törnberg

Main category: cs.CL

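TL;DR: The paper argues that the results of Ashery et al. are likely caused by data leakage rather than by linguistic conventions that LLMs form spontaneously.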

Details

Motivation: Question the conclusion of Ashery et al. that LLMs spontaneously form linguistic conventions in the naming game, arguing that the behavior is more likely the result of memorized training data. Method: Multiple analyses show that the LLMs recognize the structure of the coordination game and recall its outcomes rather than producing new conventions. Result: The observed behavior is indistinguishable from memorization of training data and does not support the claim of "emergent" conventions. Conclusion: Suggests rethinking the use of LLMs in social science models and points to alternative strategies. Abstract: Ashery et al. recently argue that large language models (LLMs), when paired to play a classic "naming game," spontaneously develop linguistic conventions reminiscent of human social norms. Here, we show that their results are better explained by data leakage: the models simply reproduce conventions they already encountered during pre-training. Despite the authors' mitigation measures, we provide multiple analyses demonstrating that the LLMs recognize the structure of the coordination game and recall its outcomes, rather than exhibit "emergent" conventions. Consequently, the observed behaviors are indistinguishable from memorization of the training corpus. We conclude by pointing to potential alternative strategies and reflecting more generally on the place of LLMs for social science models.

Zaihan Yang,Ryan Leonard,Hien Tran,Rory Driscoll,Chadbourne Davis

Main category: cs.CL

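TL;DR: The paper proposes a hybrid model combining RoBERTa, TF-IDF, and PCA to classify suicide risk automatically from Reddit posts, and improves performance with data augmentation and resampling.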

Details

Motivation: The urgent need for tools that detect suicide risk early motivates the use of machine learning to identify such risk from social media data. Method: Uses a RoBERTa-TF-IDF-PCA hybrid model that combines deep contextual embeddings with statistical features, and applies data augmentation and resampling to address class imbalance. Result: The hybrid model reaches a weighted F1 score of 0.7512, outperforming RoBERTa alone, BERT, and other traditional classifiers. Conclusion: The hybrid model is more accurate and reliable for suicide risk classification and provides an effective tool for early detection. Abstract: Suicidal thoughts and behaviors are increasingly recognized as a critical societal concern, highlighting the urgent need for effective tools to enable early detection of suicidal risk. In this work, we develop robust machine learning models that leverage Reddit posts to automatically classify them into four distinct levels of suicide risk severity. We frame this as a multi-class classification task and propose a RoBERTa-TF-IDF-PCA Hybrid model, integrating the deep contextual embeddings from Robustly Optimized BERT Approach (RoBERTa), a state-of-the-art deep learning transformer model, with the statistical term-weighting of TF-IDF, further compressed with PCA, to boost the accuracy and reliability of suicide risk assessment. To address data imbalance and overfitting, we explore various data resampling techniques and data augmentation strategies to enhance model generalization. Additionally, we compare our model's performance against that of using RoBERTa only, the BERT model and other traditional machine learning classifiers. Experimental results demonstrate that the hybrid model can achieve improved performance, giving a best weighted $F_{1}$ score of 0.7512.
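
For readers unfamiliar with the reported metric, the weighted F1 score averages per-class F1 scores with weights proportional to each class's support in the ground truth. The sketch below follows that standard definition in plain Python. The function name is hypothetical and the code is not taken from the paper.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp  # true instances of this class that were missed
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (count / total) * f1
    return score
```

Weighting by support keeps a rare severity level from dominating the score, which matters for the imbalanced four-class setting the abstract describes.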

Jian Lan,Yifei Fu,Udo Schlegel,Gengyuan Zhang,Tanveer Hannan,Haokun Chen,Thomas Seidl

Main category: cs.CL

TL;DR: The study evaluates social bias in large vision-language models (VLMs) and proposes a training-free post-hoc method to reduce it.

Details

Motivation: Social bias is a major problem in VLMs and can lead to unfair treatment of certain social groups, but the extent of bias in their generated responses is unknown. Method: Evaluates four state-of-the-art VLMs on the PAIRS and SocialCounterfactuals datasets, and reduces bias with a post-hoc method that ablates bias-associated residuals and amplifies fairness-associated residuals. Result: The VLMs show gender and race bias, and their confidence is miscalibrated. The proposed post-hoc method significantly improves fairness and confidence reliability. Conclusion: The study exposes bias in VLMs and proposes an effective post-hoc method, giving direction to future fairness research. Abstract: Social bias is a critical issue in large vision-language models (VLMs), where fairness- and ethics-related problems harm certain groups of people in society. It is unknown to what extent VLMs yield social bias in generative responses. In this study, we focus on evaluating and mitigating social bias on both the model's response and probability distribution. To do so, we first evaluate four state-of-the-art VLMs on PAIRS and SocialCounterfactuals datasets with the multiple-choice selection task. Surprisingly, we find that models suffer from generating gender-biased or race-biased responses. We also observe that models are prone to stating their responses are fair, but indeed having mis-calibrated confidence levels towards particular social groups. While investigating why VLMs are unfair in this study, we observe that VLMs' hidden layers exhibit substantial fluctuations in fairness levels. Meanwhile, residuals in each layer show mixed effects on fairness, with some contributing positively while some lead to increased bias. Based on these findings, we propose a post-hoc method for the inference stage to mitigate social bias, which is training-free and model-agnostic. We achieve this by ablating bias-associated residuals while amplifying fairness-associated residuals on model hidden layers during inference. We demonstrate that our post-hoc method outperforms the competing training strategies, helping VLMs have fairer responses and more reliable confidence levels.

[131] Estimating LLM Consistency: A User Baseline vs Surrogate Metrics

Xiaoyuan Wu,Weiran Lin,Omer Akgul,Lujo Bauer

Main category: cs.CL

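TL;DR: The paper studies the consistency of text generated by large language models (LLMs), proposes a logit-based ensemble method, and finds that existing methods approximate human-perceived consistency poorly.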

Details

Motivation: LLMs are prone to hallucination and sensitive to prompt perturbations, which makes generated text inconsistent or unreliable. Existing consistency estimation methods do not reflect human perception well. Method: Proposes a logit-based ensemble method and validates it with a user study (n=2,976). Result: The new method matches the best-performing existing metric in estimating human consistency ratings, but methods that avoid human evaluation remain imperfect. Conclusion: Recommends broader use of human input when evaluating LLM consistency. Abstract: Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility -- one of them being measuring the consistency (the model's confidence in the response, or likelihood of generating a similar response when resampled) of LLM responses. In previous work, measuring consistency often relied on the probability of a response appearing within a pool of resampled responses, or internal states or logits of responses. However, it is not yet clear how well these approaches approximate how humans perceive the consistency of LLM responses. We performed a user study (n=2,976) and found current methods typically do not approximate users' perceptions of LLM consistency very well. We propose a logit-based ensemble method for estimating LLM consistency, and we show that this method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods of estimating LLM consistency without human evaluation are sufficiently imperfect that we suggest evaluation with human input be more broadly used.
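
The resampling-based notion of consistency mentioned in this abstract can be sketched as the fraction of resampled responses that match the original. The code below is a deliberately simple stand-in that uses exact match after whitespace and case normalization. It is not the paper's logit-based ensemble, and a real system would probably use a semantic similarity measure instead.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())

def resample_consistency(response, resampled):
    """Fraction of resampled responses equal to the original after normalization."""
    if not resampled:
        raise ValueError("need at least one resampled response")
    target = normalize(response)
    matches = sum(1 for r in resampled if normalize(r) == target)
    return matches / len(resampled)
```

A score near 1 means the model tends to repeat the same answer when asked again, while a score near 0 signals that the original response was one of many competing outputs.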

[132] SEMFED: Semantic-Aware Resource-Efficient Federated Learning for Heterogeneous NLP Tasks

Sajid Hussain,Muhammad Sohail,Nauman Ali Khan

Main category: cs.CL

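TL;DR: SEMFED is a semantic-aware, resource-efficient federated learning framework for heterogeneous NLP tasks that substantially reduces communication cost while maintaining high accuracy.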

Details

Motivation: Federated learning for NLP tasks faces semantic heterogeneity, vocabulary mismatches, and resource constraints on edge devices. Method: Combines semantic-aware client selection, adaptive NLP model architectures, and communication-efficient semantic feature compression. Result: Across several NLP classification tasks, communication cost falls by 80.5% while model accuracy stays above 98%. Conclusion: SEMFED manages heterogeneous client environments effectively and is suitable for real-world federated NLP deployments. Abstract: Background: Federated Learning (FL) has emerged as a promising paradigm for training machine learning models while preserving data privacy. However, applying FL to Natural Language Processing (NLP) tasks presents unique challenges due to semantic heterogeneity across clients, vocabulary mismatches, and varying resource constraints on edge devices. Objectives: This paper introduces SEMFED, a novel semantic-aware resource-efficient federated learning framework specifically designed for heterogeneous NLP tasks. Methods: SEMFED incorporates three key innovations: (1) a semantic-aware client selection mechanism that balances semantic diversity with resource constraints, (2) adaptive NLP-specific model architectures tailored to device capabilities while preserving semantic information, and (3) a communication-efficient semantic feature compression technique that significantly reduces bandwidth requirements. Results: Experimental results on various NLP classification tasks demonstrate that SEMFED achieves an 80.5% reduction in communication costs while maintaining model accuracy above 98%, outperforming state-of-the-art FL approaches. Conclusion: SEMFED effectively manages heterogeneous client environments with varying computational resources, network reliability, and semantic data distributions, making it particularly suitable for real-world federated NLP deployments.

[133] MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Suhana Bedi,Hejie Cui,Miguel Fuentes,Alyssa Unell,Michael Wornow,Juan M. Banda,Nikesh Kotecha,Timothy Keyes,Yifan Mai,Mert Oez,Hao Qiu,Shrey Jain,Leonardo Schettini,Mehr Kashyap,Jason Alan Fries,Akshay Swaminathan,Philip Chung,Fateme Nateghi,Asad Aali,Ashwin Nayak,Shivam Vedak,Sneha S. Jain,Birju Patel,Oluseyi Fayanju,Shreya Shah,Ethan Goh,Dong-han Yao,Brian Soetikno,Eduardo Reis,Sergios Gatidis,Vasu Divi,Robson Capasso,Rachna Saralkar,Chia-Chun Chiang,Jenelle Jindal,Tho Pham,Faraz Ghoddusi,Steven Lin,Albert S. Chiou,Christy Hong,Mohana Roy,Michael F. Gensheimer,Hinesh Patel,Kevin Schulman,Dev Dash,Danton Char,Lance Downing,Francois Grolleau,Kameron Black,Bethel Mieso,Aydin Zahedivash,Wen-wai Yim,Harshita Sharma,Tony Lee,Hannah Kirsch,Jennifer Lee,Nerissa Ambers,Carlene Lugtu,Aditya Sharma,Bilal Mawji,Alex Alekseyev,Vicky Zhou,Vikas Kakkar,Jarrod Helzer,Anurang Revri,Yair Bannett,Roxana Daneshjou,Jonathan Chen,Emily Alsentzer,Keith Morse,Nirmal Ravi,Nima Aghaeepour,Vanessa Kennedy,Akshay Chaudhari,Thomas Wang,Sanmi Koyejo,Matthew P. Lungren,Eric Horvitz,Percy Liang,Mike Pfeffer,Nigam H. Shah

Main category: cs.CL

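TL;DR: MedHELM is a framework for evaluating large language models (LLMs) on medical tasks, comprising a taxonomy, a benchmark suite, and improved evaluation methods. Performance varies significantly across models and tasks, and Claude 3.5 Sonnet achieves results comparable to the top models at lower cost.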

Details

Motivation: Existing evaluations do not reflect the complexity and diversity of real clinical practice, so a more comprehensive evaluation framework is needed. Method: Develops a taxonomy of 5 categories, 22 subcategories, and 121 tasks, builds 35 benchmarks, and runs a systematic evaluation using an LLM-jury method. Result: Models perform best on clinical note generation and patient communication and weaker on clinical decision support and administrative tasks. The LLM-jury method agrees well with clinician ratings. Conclusion: MedHELM provides an open-source framework for evaluating LLMs in medicine and underlines the importance of task-specific evaluation. Abstract: While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open source framework to enable this.

[134] Calibrating LLMs for Text-to-SQL Parsing by Leveraging Sub-clause Frequencies

Terrance Liu,Shuyi Wang,Daniel Preotiuc-Pietro,Yash Chandarana,Chirag Gupta

Main category: cs.CL

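TL;DR: The paper studies how to provide calibrated confidence scores for LLM-based text-to-SQL parsing and proposes combining sub-clause frequency (SCF) scores with multivariate Platt scaling (MPS), which improves both calibration and error detection.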

Details

Motivation: Large language models (LLMs) perform well on text-to-SQL parsing but are sometimes confidently wrong, so a reliable confidence measure is needed. Method: Proposes sub-clause frequency (SCF) scores and combines them through multivariate Platt scaling (MPS) to produce a more accurate calibrated score. Result: On two popular text-to-SQL datasets, the method improves calibration and error detection over traditional Platt scaling. Conclusion: Combining SCF and MPS gives a more reliable confidence measure for text-to-SQL parsing and makes such systems more trustworthy. Abstract: While large language models (LLMs) achieve strong performance on text-to-SQL parsing, they sometimes exhibit unexpected failures in which they are confidently incorrect. Building trustworthy text-to-SQL systems thus requires eliciting reliable uncertainty measures from the LLM. In this paper, we study the problem of providing a calibrated confidence score that conveys the likelihood of an output query being correct. Our work is the first to establish a benchmark for post-hoc calibration of LLM-based text-to-SQL parsing. In particular, we show that Platt scaling, a canonical method for calibration, provides substantial improvements over directly using raw model output probabilities as confidence scores. Furthermore, we propose a method for text-to-SQL calibration that leverages the structured nature of SQL queries to provide more granular signals of correctness, named "sub-clause frequency" (SCF) scores. Using multivariate Platt scaling (MPS), our extension of the canonical Platt scaling technique, we combine individual SCF scores into an overall accurate and calibrated score. Empirical evaluation on two popular text-to-SQL datasets shows that our approach of combining MPS and SCF yields further improvements in calibration and the related task of error detection over traditional Platt scaling.
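
Platt scaling fits a sigmoid to held-out scores so that the output behaves like a probability of correctness, and a multivariate version extends this to several input signals at once. The sketch below fits sigmoid(w . x + b) by plain gradient descent on the log loss. Treating each feature as one per-clause signal is an assumption made for illustration, the function names are hypothetical, and the paper's MPS and SCF definitions may differ in detail.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(features, labels, lr=0.5, steps=2000):
    """Fit weights w and bias b of sigmoid(w . x + b) by batch gradient descent.

    features: list of equal-length score vectors (one entry per signal)
    labels:   list of 0/1 correctness labels from a held-out calibration set
    """
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def calibrated_confidence(x, w, b):
    """Calibrated probability that the output with score vector x is correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

With a single feature this reduces to ordinary Platt scaling on the raw model probability, so the multivariate form can only add information if the extra signals carry evidence about correctness.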

[135] MedOrchestra: A Hybrid Cloud-Local LLM Approach for Clinical Data Interpretation

Sihyeon Lee,Hyunjoo Song,Jong-chan Lee,Yoon Jin Lee,Boram Lee,Hee-Eon Lim,Dongyeong Kim,Jinwook Seo,Bohyoung Kim

Main category: cs.CL

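TL;DR: MedOrchestra is a hybrid framework that combines the strengths of cloud and local LLMs to improve accuracy on complex tasks while protecting the privacy of clinical data.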

Details

Motivation: Resolve the tension between the privacy risks of cloud LLMs and the weaker performance of local LLMs. Method: A cloud LLM decomposes the task and generates prompts, and a local LLM executes the subtasks in a privacy-preserving manner. Result: On pancreatic cancer staging, MedOrchestra outperforms local model baselines and clinicians. Conclusion: MedOrchestra balances privacy and performance and is suitable for clinical settings. Abstract: Deploying large language models (LLMs) in clinical settings faces critical trade-offs: cloud LLMs, with their extensive parameters and superior performance, pose risks to sensitive clinical data privacy, while local LLMs preserve privacy but often fail at complex clinical interpretation tasks. We propose MedOrchestra, a hybrid framework where a cloud LLM decomposes complex clinical tasks into manageable subtasks and prompt generation, while a local LLM executes these subtasks in a privacy-preserving manner. Without accessing clinical data, the cloud LLM generates and validates subtask prompts using clinical guidelines and synthetic test cases. The local LLM executes subtasks locally and synthesizes outputs generated by the cloud LLM. We evaluate MedOrchestra on pancreatic cancer staging using 100 radiology reports under NCCN guidelines. On free-text reports, MedOrchestra achieves 70.21% accuracy, outperforming local model baselines (without guideline: 48.94%, with guideline: 56.59%) and board-certified clinicians (gastroenterologists: 59.57%, surgeons: 65.96%, radiologists: 55.32%). On structured reports, MedOrchestra reaches 85.42% accuracy, showing clear superiority across all settings.

[136] DLP: Dynamic Layerwise Pruning in Large Language Models

Yuli Chen,Bo Cheng,Jiale Han,Yingying Zhang,Yingting Li,Shuhao Zhang

Main category: cs.CL

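TL;DR: The paper proposes Dynamic Layerwise Pruning (DLP), which adaptively determines the importance of each layer from model weights and input activations and thereby preserves model performance at high sparsity.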

Details

Motivation: Mainstream layerwise pruning usually applies a uniform strategy, which degrades performance sharply at high sparsity. Layers contribute differently to the model, but existing non-uniform pruning methods rely on pre-defined values and perform suboptimally. Method: DLP determines the relative importance of each layer dynamically and assigns pruning rates using model weights and input activation information. Result: At 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves average accuracy by 2.7%. Conclusion: DLP performs well at high sparsity, is compatible with various existing LLM compression techniques, and integrates seamlessly into parameter-efficient fine-tuning. Abstract: Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code at https://github.com/ironartisan/DLP to facilitate future research.
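
The idea of non-uniform layerwise pruning can be sketched as an allocation rule that prunes important layers less and unimportant layers more while keeping the average sparsity at the target. The linear rule, the `spread` parameter, and the function name below are assumptions made for illustration. The abstract does not give DLP's actual importance measure or allocation formula.

```python
def allocate_sparsity(importance, target=0.7, spread=0.1):
    """Assign one pruning rate per layer from per-layer importance scores.

    importance: list of non-negative importance scores, one per layer
    target:     desired mean sparsity across layers
    spread:     total gap between the most and least pruned layer
    """
    lo, hi = min(importance), max(importance)
    span = hi - lo
    if span == 0:
        # All layers look equally important: fall back to uniform pruning.
        return [target] * len(importance)
    norm = [(v - lo) / span for v in importance]
    mean_norm = sum(norm) / len(norm)
    # Centering on the mean keeps the average pruning rate equal to target.
    return [target - spread * (v - mean_norm) for v in norm]
```

For example, importance scores of 1, 3, 2, and 4 with the default arguments give the least important layer a rate of 0.75 and the most important layer a rate of 0.65, with a mean of 0.7.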

[137] DenseLoRA: Dense Low-Rank Adaptation of Large Language Models

Lin Mu,Xiaoyu Wang,Li Ni,Yang Li,Zhize Wu,Peiquan Jin,Yiwen Zhang

Main category: cs.CL

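TL;DR: DenseLoRA is an improved low-rank adaptation method that replaces the redundant matrices of conventional LoRA with a dense low-rank matrix, improving parameter utilization and model performance.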

Details

Motivation: The low-rank matrices in conventional LoRA contain redundant weights and use parameters inefficiently. DenseLoRA aims to fix this. Method: DenseLoRA introduces an encoder-decoder that refines and compresses hidden representations, then adapts the model through a dense low-rank matrix. Result: On LLaMA3-8B, DenseLoRA reaches 83.8% accuracy with only 0.01% trainable parameters, compared with 80.8% for LoRA (0.70% of parameters). Conclusion: DenseLoRA beats conventional LoRA in both parameter efficiency and performance, offering a new approach to efficient adaptation of large language models. Abstract: Low-rank adaptation (LoRA) has been developed as an efficient approach for adapting large language models (LLMs) by fine-tuning two low-rank matrices, thereby reducing the number of trainable parameters. However, prior research indicates that many of the weights in these matrices are redundant, leading to inefficiencies in parameter utilization. To address this limitation, we introduce Dense Low-Rank Adaptation (DenseLoRA), a novel approach that enhances parameter efficiency while achieving superior performance compared to LoRA. DenseLoRA builds upon the concept of representation fine-tuning, incorporating a single Encoder-Decoder to refine and compress hidden representations across all adaptation layers before applying adaptation. Instead of relying on two redundant low-rank matrices as in LoRA, DenseLoRA adapts LLMs through a dense low-rank matrix, improving parameter utilization and adaptation efficiency. We evaluate DenseLoRA on various benchmarks, showing that it achieves 83.8% accuracy with only 0.01% of trainable parameters, compared to LoRA's 80.8% accuracy with 0.70% of trainable parameters on LLaMA3-8B. Additionally, we conduct extensive experiments to systematically assess the impact of DenseLoRA's components on overall model performance. Code is available at https://github.com/mulin-ahu/DenseLoRA.
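
A back-of-the-envelope parameter count shows why sharing an encoder-decoder across layers can shrink the trainable parameter budget. The sketch assumes square hidden x hidden weights, a shared encoder and decoder of sizes hidden x rank and rank x hidden, and one dense rank x rank matrix per adapted weight. These shapes are an illustrative reading of the abstract rather than the paper's confirmed architecture, and the function names are hypothetical.

```python
def lora_params(hidden, rank, n_matrices):
    """LoRA: every adapted weight gets its own A (rank x hidden) and B (hidden x rank)."""
    return n_matrices * 2 * hidden * rank

def shared_codec_params(hidden, rank, n_matrices):
    """Assumed DenseLoRA-style layout: one shared encoder (hidden -> rank) and
    decoder (rank -> hidden), plus one dense rank x rank matrix per adapted weight."""
    return 2 * hidden * rank + n_matrices * rank * rank
```

With hidden=4096, rank=16, and 224 adapted matrices (hypothetical values), the two functions return 29,360,128 and 188,416 respectively. These figures are illustrative only and are not the 0.70% and 0.01% reported in the abstract.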

[138] LLM-Driven E-Commerce Marketing Content Optimization: Balancing Creativity and Conversion

Haowei Yang,Haotian Lyu,Tianle Zhang,Dingzhou Wang,Yushang Zhao

Main category: cs.CL

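TL;DR: The paper proposes an LLM-based framework that uses prompt engineering, multi-objective fine-tuning, and post-processing to generate marketing copy that is both engaging and conversion-driven, significantly improving CTR and CVR.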

Details

Motivation: As e-commerce competition intensifies, balancing creative content against conversion effectiveness becomes a key problem. Method: Uses prompt engineering, multi-objective fine-tuning (sentiment adjustment, diversity enhancement, and CTA embedding), and post-processing. Result: Offline evaluations and online A/B tests show a 12.5% increase in CTR and an 8.3% increase in CVR while content novelty is maintained. Conclusion: The framework provides a practical solution for automated copy generation and points to future work on multimodal, real-time personalization. Abstract: As e-commerce competition intensifies, balancing creative content with conversion effectiveness becomes critical. Leveraging LLMs' language generation capabilities, we propose a framework that integrates prompt engineering, multi-objective fine-tuning, and post-processing to generate marketing copy that is both engaging and conversion-driven. Our fine-tuning method combines sentiment adjustment, diversity enhancement, and CTA embedding. Through offline evaluations and online A/B tests across categories, our approach achieves a 12.5% increase in CTR and an 8.3% increase in CVR while maintaining content novelty. This provides a practical solution for automated copy generation and suggests paths for future multimodal, real-time personalization.

[139] MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

Chenghao Yang,Yinbo Luo,Zhoufutu Wen,Qi Chu,Tao Gong,Longxiang Liu,Kaiyuan Zhang,Jianpeng Jiao,Ge Zhang,Wenhao Huang,Nenghai Yu

Main category: cs.CL

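TL;DR: MARS-Bench is a multi-turn, real-world-scenario dialogue benchmark that evaluates LLMs on Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. It shows that closed-source LLMs outperform open-source models, that explicit reasoning improves robustness, and that LLMs struggle with motivation transfer and cross-turn dependency.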

Details

Motivation: Existing benchmarks do not fully expose LLM weaknesses in long, complex dialogue sessions (such as motivation transfer and cross-turn dependency); MARS-Bench is proposed to fill this gap. Method: MARS-Bench is built from play-by-play text commentary, evaluates LLMs on Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks, and uses attention visualization experiments to analyze the mechanism behind performance degradation. Result: Closed-source LLMs significantly outperform open-source models, explicit reasoning improves robustness, and LLMs face challenges with motivation transfer and cross-turn dependency. Conclusion: MARS-Bench provides a new benchmark for evaluating LLMs in multi-turn dialogue, reveals their limitations, and suggests directions for improvement. Abstract: Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions, including frequent motivation transfer and sophisticated cross-turn dependency, has long been criticized. Nevertheless, no existing benchmarks can fully reflect these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to remedy the gap. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. Extensive experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives, explicit reasoning significantly boosts LLMs' robustness on handling long complex dialogue sessions, and LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependency. Moreover, we provide mechanistic interpretability on how attention sinks due to special tokens lead to LLMs' performance degradation when handling long complex dialogue sessions based on attention visualization experiment in Qwen2.5-7B-Instruction.

[140] LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions

Hadi Askari,Shivanshu Gupta,Fei Wang,Anshuman Chhabra,Muhao Chen

Main category: cs.CL

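TL;DR: The paper proposes LayerIF, a data-driven framework that quantifies the training quality of individual LLM layers and uses the estimates to improve downstream task performance.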

Details

Motivation: Existing methods rely mainly on model-centric heuristics and overlook the influence of data, which limits downstream LLM performance. Method: Uses Influence Functions, isolating each layer's gradients and computing layer-wise influences, to quantify the training quality of each layer. Result: Experiments show that LayerIF's task-sensitive scores can guide expert allocation and sparsity distribution, improving performance. Conclusion: LayerIF offers a data-driven approach to layer-wise optimization of LLMs that yields consistent gains in task performance. Abstract: Pretrained Large Language Models (LLMs) achieve strong performance across a wide range of tasks, yet exhibit substantial variability in the various layers' training quality with respect to specific downstream applications, limiting their downstream performance. It is therefore critical to estimate layer-wise training quality in a manner that accounts for both model architecture and training data. However, existing approaches predominantly rely on model-centric heuristics (such as spectral statistics, outlier detection, or uniform allocation) while overlooking the influence of data. To address these limitations, we propose LayerIF, a data-driven framework that leverages Influence Functions to quantify the training quality of individual layers in a principled and task-sensitive manner. By isolating each layer's gradients and measuring the sensitivity of the validation loss to training examples by computing layer-wise influences, we derive data-driven estimates of layer importance. Notably, our method produces task-specific layer importance estimates for the same LLM, revealing how layers specialize for different test-time evaluation tasks. We demonstrate the utility of our scores by leveraging them for two downstream applications: (a) expert allocation in LoRA-MoE architectures and (b) layer-wise sparsity distribution for LLM pruning. Experiments across multiple LLM architectures demonstrate that our model-agnostic, influence-guided allocation leads to consistent gains in task performance.

Lata Pangtey,Mohammad Zia Ur Rehman,Prasad Chaudhari,Shubhi Bansal,Nagendra Kumar

Main category: cs.CL

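TL;DR: SPLAENet is a novel stance detection method that uses dual cross-attention and an emotion-aware neural network to substantially improve stance detection on misinformative social media content.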

Details

Motivation: The rapid growth of user-generated content on social media and the spread of misinformation make stance detection a key tool for analyzing bias and combating misinformation. Method: SPLAENet combines a dual cross-attention mechanism with a hierarchical attention network and uses emotional alignment and label fusion to distinguish stances. Result: Across several datasets, SPLAENet clearly outperforms existing methods in accuracy and F1-score, with gains of up to 17.36%. Conclusion: SPLAENet performs well at stance detection in misinformative social media content, which supports its effectiveness. Abstract: The rapid evolution of social media has generated an overwhelming volume of user-generated content, conveying implicit opinions and contributing to the spread of misinformation. The method aims to enhance the detection of stance where misinformation can polarize user opinions. Stance detection has emerged as a crucial approach to effectively analyze underlying biases in shared information and combat misinformation. This paper proposes a novel method for Stance Prediction through a Label-fused dual cross-Attentive Emotion-aware neural Network (SPLAENet) in misinformative social media user-generated content. The proposed method employs a dual cross-attention mechanism and a hierarchical attention network to capture inter and intra-relationships by focusing on the relevant parts of source text in the context of reply text and vice versa. We incorporate emotions to effectively distinguish between different stance categories by leveraging the emotional alignment or divergence between the texts. We also employ label fusion that uses distance-metric learning to align extracted features with stance labels, improving the method's ability to accurately distinguish between stances. Extensive experiments demonstrate the significant improvements achieved by SPLAENet over existing state-of-the-art methods. SPLAENet demonstrates an average gain of 8.92% in accuracy and 17.36% in F1-score on the RumourEval dataset. On the SemEval dataset, it achieves average gains of 7.02% in accuracy and 10.92% in F1-score. On the P-stance dataset, it demonstrates average gains of 10.03% in accuracy and 11.18% in F1-score. These results validate the effectiveness of the proposed method for stance detection in the context of misinformative social media content.

[142] Aligning LLMs by Predicting Preferences from User Writing Samples

Stéphane Aroca-Ouellette,Natalie Mackraz,Barry-John Theobald,Katherine Metcalf

Main category: cs.CL

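TL;DR: PROSE iteratively refines and verifies inferred descriptions of user preferences, substantially improving how well LLM agents capture individual preferences and outperforming existing methods.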

Details

Motivation: Existing methods produce preference descriptions that are too generic to reflect individual user needs, so a more precise way of inferring preferences is required. Method: PROSE combines iterative refinement with verification across multiple writing samples to infer more precise preference descriptions from user writing. Result: Across several LLMs, PROSE improves the quality of the writing agent's generations over CIPHER by 33%, and combining it with ICL gives up to a further 9% improvement over ICL alone. Conclusion: PROSE effectively improves the personalized interaction ability of LLM agents and offers a new approach to preference inference. Abstract: Accommodating human preferences is essential for creating aligned LLM agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs acting as writing agents to infer a description of user preferences. Agent alignment then comes from conditioning on the inferred preference description. However, existing methods often produce generic preference descriptions that fail to capture the unique and individualized nature of human preferences. This paper introduces PROSE, a method designed to enhance the precision of preference descriptions inferred from user writing samples. PROSE incorporates two key elements: (1) iterative refinement of inferred preferences, and (2) verification of inferred preferences across multiple user writing samples. We evaluate PROSE with several LLMs (i.e., Qwen2.5 7B and 72B Instruct, GPT-mini, and GPT-4o) on a summarization and an email writing task. We find that PROSE more accurately infers nuanced human preferences, improving the quality of the writing agent's generations over CIPHER (a state-of-the-art method for inferring preferences) by 33%. Lastly, we demonstrate that ICL and PROSE are complementary methods, and combining them provides up to a 9% improvement over ICL alone.

[143] A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs

Trenton Chang,Tobias Schnabel,Adith Swaminathan,Jenna Wiens

Main category: cs.CL

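TL;DR: The paper examines how well large language models (LLMs) can be steered toward user goals (steerability), finds that current models have problems with coverage, calibration, and side effects, and proposes an evaluation framework.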

Details

Motivation: Although LLMs perform well on reasoning and instruction-following benchmarks, it remains unclear whether they can reliably satisfy a broad variety of user goals. Method: Proposes a framework based on a multi-dimensional goal space that models user goals and LLM outputs as vectors, and applies it to a text-rewriting task. Result: Current LLMs struggle with steerability, and interventions such as prompt engineering, sampling, and fine-tuning have limited effect. Conclusion: Existing LLMs still fall short on steerability, and alignment strategies need further improvement. Abstract: Despite advances in large language models (LLMs) on reasoning and instruction-following benchmarks, it remains unclear whether they can reliably produce outputs aligned with a broad variety of user goals, a concept we refer to as steerability. The abundance of methods proposed to modify LLM behavior makes it unclear whether current LLMs are already steerable, or require further intervention. In particular, LLMs may exhibit (i) poor coverage, where rare user goals are underrepresented; (ii) miscalibration, where models overshoot requests; and (iii) side effects, where changes to one dimension of text inadvertently affect others. To systematically evaluate these failures, we introduce a framework based on a multi-dimensional goal space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, we find that current LLMs struggle with steerability, as side effects are persistent. Interventions to improve steerability, such as prompt engineering, best-of-$N$ sampling, and reinforcement learning fine-tuning, have varying effectiveness, yet side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient. We open-source our steerability evaluation framework at https://github.com/MLD3/steerability.
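
The goal-space idea in this abstract can be made concrete with a little vector arithmetic. The sketch below is only an illustration under stated assumptions, not the metric used in the paper: the function name `decompose_steering_error`, the two attribute dimensions, and the choice of a projection-based split are all hypothetical. It separates the movement a model produced into a part along the requested direction (overshoot or undershoot, i.e. miscalibration) and a part orthogonal to it (side effects).

```python
import math

def decompose_steering_error(source, goal, output):
    """Split steering error into (miscalibration, side_effect).

    All arguments are equal-length lists of text-attribute scores.
    This projection-based split is an illustrative assumption.
    """
    requested = [g - s for g, s in zip(goal, source)]   # movement the user asked for
    actual = [o - s for o, s in zip(output, source)]    # movement the model produced
    norm_sq = sum(r * r for r in requested)
    if norm_sq == 0:
        raise ValueError("goal equals source; no direction was requested")
    # Fraction of the requested movement that was delivered (1.0 = exactly right).
    scale = sum(a * r for a, r in zip(actual, requested)) / norm_sq
    # Movement left over once the requested direction is removed.
    residual = [a - scale * r for a, r in zip(actual, requested)]
    side_effect = math.sqrt(sum(x * x for x in residual))
    return scale - 1.0, side_effect

# Hypothetical dimensions: (reading difficulty, formality), both scaled to [0, 1].
# The user asks to raise reading difficulty only; the model overshoots and
# also changes formality.
miscalibration, side_effect = decompose_steering_error(
    source=[0.2, 0.5], goal=[0.6, 0.5], output=[0.8, 0.8]
)
print(round(miscalibration, 3), round(side_effect, 3))
```

A positive first value means the model moved further than requested along the target attribute; a non-zero second value means attributes the user did not ask to change were affected.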

[144] Ratas framework: A comprehensive genai-based approach to rubric-based marking of real-world textual exams

Masoud Safilian,Amin Beheshti,Stephen Elbourn

Main category: cs.CL

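TL;DR: RATAS is a generative-AI-based automated grading framework that supports a wide range of rubrics and provides explainable rationales for the scores it assigns.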

Details

Motivation: To address the shortcomings of existing automated grading methods in applicability, consistency, and interpretability. Method: Uses generative AI models and a mathematical framework to design RATAS, which supports diverse grading rubrics. Result: Validated on data from real courses, RATAS shows high reliability and accuracy and provides transparent feedback. Conclusion: RATAS offers an efficient, explainable solution for automated grading that suits diverse assessment settings. Abstract: Automated answer grading is a critical challenge in educational technology, with the potential to streamline assessment processes, ensure grading consistency, and provide timely feedback to students. However, existing approaches are often constrained to specific exam formats, lack interpretability in score assignment, and struggle with real-world applicability across diverse subjects and assessment types. To address these limitations, we introduce RATAS (Rubric Automated Tree-based Answer Scoring), a novel framework that leverages state-of-the-art generative AI models for rubric-based grading of textual responses. RATAS is designed to support a wide range of grading rubrics, enable subject-agnostic evaluation, and generate structured, explainable rationales for assigned scores. We formalize the automatic grading task through a mathematical framework tailored to rubric-based assessment and present an architecture capable of handling complex, real-world exam structures. To rigorously evaluate our approach, we construct a unique, contextualized dataset derived from real-world project-based courses, encompassing diverse response formats and varying levels of complexity. Empirical results demonstrate that RATAS achieves high reliability and accuracy in automated grading while providing interpretable feedback that enhances transparency for both students and instructors.

[145] Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks

Bhaktipriya Radharapu,Manon Revel,Megan Ung,Sebastian Ruder,Adina Williams

Main category: cs.CL

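TL;DR: When used as substitutes for humans in alignment, LLMs struggle to reproduce human judgments and preferences on contested topics; as judges or debaters in particular, they tend to take a stance.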

Details

Motivation: To study the biases and limitations of LLMs acting as answer generators, judges, and debaters in contested scenarios, in order to assess their suitability for alignment. Method: Builds a "no-consensus" benchmark covering a variety of ambivalent scenarios and evaluates LLMs in each role. Result: LLMs can provide nuanced assessments when generating open-ended answers, but tend to take a stance when acting as judges or debaters and do not fully capture human disagreement. Conclusion: More sophisticated alignment methods are needed, because LLMs cannot fully reproduce human disagreement on contested topics. Abstract: The increasing use of LLMs as substitutes for humans in "aligning" LLMs has raised questions about their ability to replicate human judgments and preferences, especially in ambivalent scenarios where humans disagree. This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. These roles loosely correspond to previously described alignment frameworks: preference alignment (judge) and scalable oversight (debater), with the answer generator reflecting the typical setting with user interactions. We develop a "no-consensus" benchmark by curating examples that encompass a variety of a priori ambivalent scenarios, each presenting two possible stances. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters. These findings underscore the necessity for more sophisticated methods for aligning LLMs without human oversight, highlighting that LLMs cannot fully capture human disagreement even on topics where humans themselves are divided.

[146] Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction

Mai Ali,Christopher Lucasius,Tanmay P. Patel,Madison Aitken,Jacob Vorstman,Peter Szatmari,Marco Battaglia,Deepa Kundur

Main category: cs.CL

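TL;DR: The paper proposes a depression detection method based on three modalities (speech-derived text, acoustic landmarks, and vocal biomarkers) and multi-task learning (MTL), combined with a longitudinal analysis strategy, and reports higher detection accuracy than unimodal, single-task, and non-longitudinal baselines.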

Details

Motivation: Depression detection usually treats speech as a single modality, whereas multimodal and multi-task learning may give a more complete picture. Adolescent depression is often accompanied by other conditions, such as suicidal ideation and sleep disturbances, which further motivates multi-task learning. Method: Proposes a large-language-model-based architecture that integrates speech-derived text, acoustic landmarks, and vocal biomarkers, with multi-task learning and a longitudinal analysis strategy. Result: On the Depression Early Warning dataset, the method reaches a balanced accuracy of 70.8%, higher than the unimodal, single-task, and non-longitudinal methods. Conclusion: Combining multimodal input, multi-task learning, and longitudinal analysis improves depression detection and offers a more comprehensive approach to mental health monitoring. Abstract: Speech is a noninvasive digital phenotype that can offer valuable insights into mental health conditions, but it is often treated as a single modality. In contrast, we propose the treatment of patient speech data as a trimodal multimedia data source for depression detection. This study explores the potential of large language model-based architectures for speech-based depression prediction in a multimodal regime that integrates speech-derived text, acoustic landmarks, and vocal biomarkers. Adolescent depression presents a significant challenge and is often comorbid with multiple disorders, such as suicidal ideation and sleep disturbances. This presents an additional opportunity to integrate multi-task learning (MTL) into our study by simultaneously predicting depression, suicidal ideation, and sleep disturbances using the multimodal formulation. We also propose a longitudinal analysis strategy that models temporal changes across multiple clinical interactions, allowing for a comprehensive understanding of the conditions' progression. Our proposed approach, featuring trimodal, longitudinal MTL, is evaluated on the Depression Early Warning dataset. It achieves a balanced accuracy of 70.8%, which is higher than each of the unimodal, single-task, and non-longitudinal methods.
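
The headline number in this entry is a balanced accuracy, which differs from plain accuracy on imbalanced clinical data. As a reminder of what the metric measures, here is a minimal sketch using the standard definition (the mean of per-class recall); the labels below are made-up toy data, not values from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)

# Toy data: 2 positive cases and 8 negative cases; the classifier
# always predicts the majority (negative) class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, balanced)
```

A majority-class predictor looks strong under plain accuracy but scores at chance level under balanced accuracy, which is why the latter is the more informative figure when positive cases are rare.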

[147] RAGPPI: RAG Benchmark for Protein-Protein Interactions in Drug Discovery

Youngseung Jeon,Ziwen Li,Thomas Li,JiaSyuan Chang,Morteza Ziyadi,Xiang 'Anthony' Chen

Main category: cs.CL

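TL;DR: The paper introduces RAGPPI, a benchmark of 4,420 question-answer pairs on the biological impacts of protein-protein interactions (PPIs), intended to support target identification in drug development.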

Details

Motivation: No benchmark currently exists for identifying the biological impacts of PPIs, which hinders progress on target identification in drug development. Method: Benchmark criteria were set through expert interviews; a gold-standard dataset of 500 expert-annotated QA pairs was built, and an ensemble auto-evaluation LLM was developed to construct a silver-standard dataset of 3,720 QA pairs. Result: The RAGPPI benchmark, with 4,420 question-answer pairs, supporting research on QA systems for drug discovery. Conclusion: RAGPPI is intended as a maintained resource for research on QA systems in drug discovery. Abstract: Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as a type of QA and source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that reflected expert labeling characteristics, which facilitates the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.

[148] Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation

Tianmai M. Zhang,Neil F. Abernethy

Main category: cs.CL

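TL;DR: The paper proposes using large language models as quality checkers for scientific manuscripts rather than as generators of full reviews, and validates their performance with an automatic evaluation framework.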

Details

Motivation: To address the difficulty of recruiting domain experts for manual evaluation in the peer review of scientific publications. Method: Proposes baseline approaches and an extendable automatic evaluation framework, and uses papers withdrawn from arXiv to validate several large language models. Result: OpenAI's o3 model performed best, while o4-mini was the most cost-effective. Conclusion: The work provides insights into document-based scientific understanding/reasoning and lays a foundation for future applications. Abstract: Recent advancements in large language models have sparked interest in utilizing them to assist the peer review process of scientific publication. Instead of having AI models generate reviews in the same way as human reviewers, we propose adopting them as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs from different providers and assessed their performance and API costs for identifying critical errors and unsoundness problems. The OpenAI o3 model performed the best, while o4-mini was the most cost-effective one in our evaluation. This paper provides insights into document-based scientific understanding/reasoning and lays the foundation for future applications.

[149] ValueSim: Generating Backstories to Model Individual Value Systems

Bangde Du,Ziyi Ye,Zhijing Wu,Jankowska Monika,Shuqi Zhu,Qingyao Ai,Yujia Zhou,Yiqun Liu

Main category: cs.CL

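TL;DR: The ValueSim framework simulates individual values by generating personal backstories, improving top-1 accuracy by more than 10% over retrieval-augmented generation methods.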

Details

Motivation: Current methods for aligning LLMs with human values do not simulate individual value systems; ValueSim aims to fill this gap. Method: ValueSim converts structured individual data into narrative backstories and uses a multi-module architecture to simulate individual values. Result: On a self-constructed benchmark, ValueSim improves top-1 accuracy by more than 10%, and performance increases as more user interaction history becomes available. Conclusion: ValueSim can effectively simulate individual values and refine its simulation over time. Abstract: As Large Language Models (LLMs) continue to exhibit increasingly human-like capabilities, aligning them with human values has become critically important. Contemporary advanced techniques, such as prompt learning and reinforcement learning, are being deployed to better align LLMs with human values. However, while these approaches address broad ethical considerations and helpfulness, they rarely focus on simulating individualized human value systems. To address this gap, we present ValueSim, a framework that simulates individual values through the generation of personal backstories reflecting past experiences and demographic information. ValueSim converts structured individual data into narrative backstories and employs a multi-module architecture inspired by the Cognitive-Affective Personality System to simulate individual values based on these narratives. Testing ValueSim on a self-constructed benchmark derived from the World Values Survey demonstrates an improvement in top-1 accuracy by over 10% compared to retrieval-augmented generation methods. Further analysis reveals that performance improves as additional user interaction history becomes available, indicating the model's ability to refine its persona simulation capabilities over time.

[150] BiasFilter: An Inference-Time Debiasing Framework for Large Language Models

Xiaoqing Cheng,Ruizhe Chen,Hongying Zan,Yuxiang Jia,Min Peng

Main category: cs.CL

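TL;DR: BiasFilter is a model-agnostic, inference-time debiasing framework that reduces social bias in large language models by filtering generation outputs in real time, with no retraining or changes to model parameters.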

Details

Motivation: Existing debiasing methods are costly, of limited effectiveness, and hard to scale; BiasFilter aims to address these problems. Method: BiasFilter periodically evaluates intermediate outputs, maintains a set of candidate continuations, and completes generation by discarding low-reward segments according to a fairness reward signal. Result: Experiments show that BiasFilter effectively reduces social bias while preserving generation quality. Conclusion: BiasFilter is an efficient and scalable debiasing solution. Abstract: Mitigating social bias in large language models (LLMs) has become an increasingly important research objective. However, existing debiasing methods often incur high human and computational costs, exhibit limited effectiveness, and struggle to scale to larger models and open-ended generation tasks. To address these limitations, this paper proposes BiasFilter, a model-agnostic, inference-time debiasing framework that integrates seamlessly with both open-source and API-based LLMs. Instead of relying on retraining with balanced data or modifying model parameters, BiasFilter enforces fairness by filtering generation outputs in real time. Specifically, it periodically evaluates intermediate outputs every few tokens, maintains an active set of candidate continuations, and incrementally completes generation by discarding low-reward segments based on a fairness reward signal. To support this process, we construct a fairness preference dataset and train an implicit reward model to assess token-level fairness in generated responses. Extensive experiments demonstrate that BiasFilter effectively mitigates social bias across a range of LLMs while preserving overall generation quality.
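
The generate, score, and discard loop described in this abstract can be sketched in a few lines. This is a rough illustration of the general pattern only, not BiasFilter's implementation: `filtered_generation`, `toy_propose`, and `toy_reward` are hypothetical names, and the toy reward simply penalizes a flagged word where the paper uses a trained implicit reward model.

```python
from typing import Callable, List

def filtered_generation(
    prompt: str,
    propose_segments: Callable[[str], List[str]],
    reward: Callable[[str], float],
    keep: int = 2,
    rounds: int = 3,
) -> str:
    """Extend candidates one segment at a time, discarding low-reward ones."""
    active = [prompt]
    for _ in range(rounds):
        # Expand every surviving candidate with each proposed next segment.
        expanded = [text + seg for text in active for seg in propose_segments(text)]
        if not expanded:
            break
        # Keep only the highest-reward partial outputs as the new active set.
        expanded.sort(key=reward, reverse=True)
        active = expanded[:keep]
    return max(active, key=reward)

# Hypothetical stand-ins for an LLM sampler and a fairness reward model.
def toy_propose(text: str) -> List[str]:
    return [" fair", " biased"]

def toy_reward(text: str) -> float:
    return -float(text.count("biased"))

result = filtered_generation("Answer:", toy_propose, toy_reward)
print(result)
```

Because filtering happens on partial outputs, low-reward branches are dropped before they are completed, and nothing here touches model weights, which is what makes this style of method usable with API-only models.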

[151] EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models

Linglin Jing,Yuting Gao,Zhigang Wang,Wang Lan,Yiwen Tang,Wenhai Wang,Kaipeng Zhang,Qingpei Guo

Main category: cs.CL

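TL;DR: EvoMoE is an MoE tuning framework that uses expert evolution and a dynamic routing mechanism to address expert uniformity and router rigidity in multi-modal MoE, improving MLLM performance.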

Details

Motivation: Existing multi-modal MoE methods face two challenges, expert uniformity and router rigidity, which limit the diversity and performance gains of the MoE architecture. Method: EvoMoE uses an expert evolution strategy that progressively evolves multiple experts from a single trainable expert, and introduces a Dynamic Token-aware Router (DTR) that assigns experts according to modality and token characteristics. Result: Experiments show that EvoMoE significantly outperforms other sparse MLLMs on several multi-modal benchmarks (MME, MMBench, and others). Conclusion: By addressing expert uniformity and router rigidity, EvoMoE effectively improves MLLM performance. Abstract: Recent advancements have shown that the Mixture of Experts (MoE) approach significantly enhances the capacity of large language models (LLMs) and improves performance on downstream tasks. Building on these promising results, multi-modal large language models (MLLMs) have increasingly adopted MoE techniques. However, existing multi-modal MoE tuning methods typically face two key challenges: expert uniformity and router rigidity. Expert uniformity occurs because MoE experts are often initialized by simply replicating the FFN parameters from LLMs, leading to homogenized expert functions and weakening the intended diversification of the MoE architecture. Meanwhile, router rigidity stems from the prevalent use of static linear routers for expert selection, which fail to distinguish between visual and textual tokens, resulting in similar expert distributions for image and text. To address these limitations, we propose EvoMoE, an innovative MoE tuning framework. EvoMoE introduces a meticulously designed expert initialization strategy that progressively evolves multiple robust experts from a single trainable expert, a process termed expert evolution that specifically targets severe expert homogenization. Furthermore, we introduce the Dynamic Token-aware Router (DTR), a novel routing mechanism that allocates input tokens to appropriate experts based on their modality and intrinsic token values. This dynamic routing is facilitated by hypernetworks, which dynamically generate routing weights tailored for each individual token. Extensive experiments demonstrate that EvoMoE significantly outperforms other sparse MLLMs across a variety of multi-modal benchmarks, including MME, MMBench, TextVQA, and POPE. Our results highlight the effectiveness of EvoMoE in enhancing the performance of MLLMs by addressing the critical issues of expert uniformity and router rigidity.

[152] ICH-Qwen: A Large Language Model Towards Chinese Intangible Cultural Heritage

Wenhao Ye,Tiansheng Zheng,Yue Qi,Wenhua Zhao,Xiyu Wang,Xue Zhao,Jiacheng He,Yaya Zheng,Dongbo Wang

Main category: cs.CL

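TL;DR: The study uses large language model techniques to develop ICH-Qwen, a model for preserving and disseminating China's intangible cultural heritage, and demonstrates its effectiveness in the ICH domain.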

Details

Motivation: Modernization threatens China's intangible cultural heritage (ICH), and new technical means are urgently needed to protect and pass it on. Method: Develops the large language model ICH-Qwen from open-source Chinese ICH data, drawing on natural language understanding and knowledge reasoning capabilities, and refines it with synthetic data and fine-tuning. Result: Experiments show that ICH-Qwen is effective on tasks in the ICH domain. Conclusion: ICH-Qwen provides an intelligent solution for the protection and dissemination of ICH and opens new paths for digital humanities research. Abstract: The intangible cultural heritage (ICH) of China, a cultural asset transmitted across generations by various ethnic groups, serves as a significant testament to the evolution of human civilization and holds irreplaceable value for the preservation of historical lineage and the enhancement of cultural self-confidence. However, the rapid pace of modernization poses formidable challenges to ICH, including threats of damage, disappearance and discontinuity of inheritance. China has the highest number of items on the UNESCO Intangible Cultural Heritage List, which is indicative of the nation's abundant cultural resources and emphasises the pressing need for ICH preservation. In recent years, the rapid advancements in large language modelling have provided a novel technological approach for the preservation and dissemination of ICH. This study utilises a substantial corpus of open-source Chinese ICH data to develop a large language model, ICH-Qwen, for the ICH domain. The model employs natural language understanding and knowledge reasoning capabilities of large language models, augmented with synthetic data and fine-tuning techniques. The experimental results demonstrate the efficacy of ICH-Qwen in executing tasks specific to the ICH domain. It is anticipated that the model will provide intelligent solutions for the protection, inheritance and dissemination of intangible cultural heritage, as well as new theoretical and practical references for the sustainable development of intangible cultural heritage. Furthermore, it is expected that the study will open up new paths for digital humanities research.

[153] LegalSearchLM: Rethinking Legal Case Retrieval as Legal Elements Generation

Chaeeun Kim,Jinu Lee,Wonseok Hwang

Main category: cs.CL

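TL;DR: The paper presents LEGAR BENCH, the first large-scale Korean legal case retrieval benchmark, and LegalSearchLM, a retrieval model based on legal element reasoning, addressing the small data scale and limited retrieval methods of prior legal case retrieval work.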

Details

Motivation: Existing legal case retrieval studies use small corpora and a narrow range of crime types, and their retrieval methods underperform, so they do not reflect the complexity of real-world legal retrieval. Method: Presents LEGAR BENCH (covering 1.2M cases and 411 crime types) and LegalSearchLM (which generates relevant content through legal element reasoning and constrained decoding). Result: LegalSearchLM outperforms baselines by 6-20% on LEGAR BENCH and generalizes well to out-of-domain cases. Conclusion: LEGAR BENCH and LegalSearchLM provide a more comprehensive benchmark and a more effective retrieval method for legal case retrieval. Abstract: Legal Case Retrieval (LCR), which retrieves relevant cases from a query case, is a fundamental task for legal professionals in research and decision-making. However, existing studies on LCR face two major limitations. First, they are evaluated on relatively small-scale retrieval corpora (e.g., 100-55K cases) and use a narrow range of criminal query types, which cannot sufficiently reflect the complexity of real-world legal retrieval scenarios. Second, their reliance on embedding-based or lexical matching methods often results in limited representations and legally irrelevant matches. To address these issues, we present: (1) LEGAR BENCH, the first large-scale Korean LCR benchmark, covering 411 diverse crime types in queries over 1.2M legal cases; and (2) LegalSearchLM, a retrieval model that performs legal element reasoning over the query case and directly generates content grounded in the target cases through constrained decoding. Experimental results show that LegalSearchLM outperforms baselines by 6-20% on LEGAR BENCH, achieving state-of-the-art performance. It also demonstrates strong generalization to out-of-domain cases, outperforming naive generative models trained on in-domain data by 15%.

[154] Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

Qingchuan Ma,Yuhang Wu,Xiawu Zheng,Rongrong Ji

Main category: cs.CL

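TL;DR: The paper proposes a simple and effective benchmark for evaluating the abstract reasoning ability of large language models (LLMs) and reveals limitations of current models.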

Details

Motivation: To establish a theoretically grounded benchmark that rigorously evaluates abstract reasoning in LLMs and separates genuine abstraction from reliance on memorization. Method: Develops a mathematical framework defining abstract reasoning, introduces two new metrics (the γ and δ scores), and designs a benchmark based on symbol remapping. Result: Evaluations show that LLMs are limited in non-decimal arithmetic and symbolic reasoning, and that an abstraction gap persists even with chain-of-thought prompting. Conclusion: Current LLMs still fall short on abstract reasoning; the δ score effectively measures memory dependence and points to directions for improvement. Abstract: In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel complementary metrics: the $\gamma$ score measures basic reasoning accuracy, while the $\delta$ score quantifies a model's reliance on specific symbols rather than underlying patterns - a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark: systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching. Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B, multi-agent) reveal: 1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) the $\delta$ score's effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization. These findings underscore that current LLMs, despite domain-specific strengths, still lack robust abstract reasoning, highlighting key areas for future improvement.
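
Symbol remapping, as described in this abstract, means rewriting a rule-based task with a consistent one-to-one substitution of its symbols so that the underlying rule is unchanged but memorized surface forms no longer help. The sketch below illustrates the idea on an addition problem. The replacement alphabet, the helper names `remap` and `accuracy_drop`, and the simple accuracy difference are assumptions made for illustration; the paper's exact metric definitions are not given in the abstract and are not reproduced here.

```python
def remap(text, mapping):
    """Apply a one-to-one symbol substitution; unmapped characters pass through."""
    return "".join(mapping.get(ch, ch) for ch in text)

digits = "0123456789"
symbols = "QWERTYUIOP"  # arbitrary replacement alphabet (an assumption)
to_symbols = dict(zip(digits, symbols))
to_digits = dict(zip(symbols, digits))

question, answer = "27+15", "42"
remapped_question = remap(question, to_symbols)
remapped_answer = remap(answer, to_symbols)
print(remapped_question, "->", remapped_answer)

def accuracy_drop(correct_original, correct_remapped):
    """Accuracy on original items minus accuracy on their remapped versions."""
    acc_original = sum(correct_original) / len(correct_original)
    acc_remapped = sum(correct_remapped) / len(correct_remapped)
    return acc_original - acc_remapped

# Made-up per-item correctness flags for a hypothetical model.
drop = accuracy_drop([True, True, True, False], [True, False, False, False])
print(drop)
```

Because the mapping is invertible, a remapped answer can be decoded back and checked against the original ground truth, and a large drop on remapped items suggests the model depends on specific symbols rather than on the rule.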

[155] Say What You Mean: Natural Language Access Control with Large Language Models for Internet of Things

Ye Cheng,Minghui Xu,Yue Zhang,Kun Li,Hao Wu,Yechao Zhang,Shaoyong Guo,Wangjie Qiu,Dongxiao Yu,Xiuzhen Cheng

Main category: cs.CL

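TL;DR: LACE is a language-model-based access control engine that generates and validates policies from natural language, addressing the complexity of dynamic access control in IoT.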

Details

Motivation: IoT access control must account for dynamic and contextual factors; existing platforms cannot handle semantically rich or ambiguous scenarios, and the policy authoring process introduces semantic gaps. Method: LACE combines prompt-guided policy generation, retrieval-augmented reasoning, and formal validation to support policy generation and validation from natural language. Result: In smart home environments, LACE achieves 100% correctness in verified policy generation and up to 88% decision accuracy with a 0.79 F1-score, outperforming GPT-3.5 and Gemini. Conclusion: LACE shows potential for secure, flexible, and user-friendly access control on real-world IoT platforms. Abstract: Access control in the Internet of Things (IoT) is becoming increasingly complex, as policies must account for dynamic and contextual factors such as time, location, user behavior, and environmental conditions. However, existing platforms either offer only coarse-grained controls or rely on rigid rule matching, making them ill-suited for semantically rich or ambiguous access scenarios. Moreover, the policy authoring process remains fragmented: domain experts describe requirements in natural language, but developers must manually translate them into code, introducing semantic gaps and potential misconfiguration. In this work, we present LACE, the Language-based Access Control Engine, a hybrid framework that leverages large language models (LLMs) to bridge the gap between human intent and machine-enforceable logic. LACE combines prompt-guided policy generation, retrieval-augmented reasoning, and formal validation to support expressive, interpretable, and verifiable access control. It enables users to specify policies in natural language, automatically translates them into structured rules, validates semantic correctness, and makes access decisions using a hybrid LLM-rule-based engine. We evaluate LACE in smart home environments through extensive experiments. LACE achieves 100% correctness in verified policy generation and up to 88% decision accuracy with 0.79 F1-score using DeepSeek-V3, outperforming baselines such as GPT-3.5 and Gemini. The system also demonstrates strong scalability under increasing policy volume and request concurrency. Our results highlight LACE's potential to enable secure, flexible, and user-friendly access control across real-world IoT platforms.

[156] Large Language Models Often Know When They Are Being Evaluated

Joe Needham,Giles Edkins,Govind Pimpale,Henning Bartsch,Marius Hobbhahn

Main category: cs.CL

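TL;DR: Frontier language models can distinguish evaluation transcripts from real-world deployment transcripts, though they do not yet surpass a human baseline. The authors recommend tracking this capability in future models.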

Details

Motivation: Investigates whether AI models can detect evaluation settings, since such detection could distort evaluation results and affect deployment and governance decisions. Method: Builds a diverse benchmark of 1,000 prompts and transcripts and tests how well models classify them as originating from evaluation or real-world deployment. Result: Models perform above random (AUC 0.83) but below the human baseline (AUC 0.92), and do better in agentic settings. Conclusion: Frontier models already show substantial evaluation awareness, though not yet at human level; continued monitoring is recommended. Abstract: If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could have systematically different behavior during evaluations, leading to less reliable benchmarks for deployment and governance decisions. We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness. To achieve this, we construct a diverse benchmark of 1,000 prompts and transcripts from 61 distinct datasets. These span public benchmarks (e.g., MMLU, SWEBench), real-world deployment interactions, and agent trajectories from scaffolding frameworks (e.g., web-browsing agents). Frontier models clearly demonstrate above-random evaluation awareness (Gemini-2.5-Pro reaches an AUC of $0.83$), but do not yet surpass our simple human baseline (AUC of $0.92$). Furthermore, both AI models and humans are better at identifying evaluations in agentic settings compared to chat settings. Additionally, we test whether models can identify the purpose of the evaluation. Under multiple-choice and open-ended questioning, AI models far outperform random chance in identifying what an evaluation is testing for. Our results indicate that frontier models already exhibit a substantial, though not yet superhuman, level of evaluation-awareness. We recommend tracking this capability in future models.
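
The AUC figures quoted above can be read as the probability that a randomly chosen evaluation transcript receives a higher "this is an evaluation" score than a randomly chosen deployment transcript. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name, the score scale, and the 1/0 label convention are assumptions made for illustration.

```python
def auc(scores, labels):
    """Pairwise AUC: chance that an evaluation transcript (label 1)
    is scored higher than a deployment transcript (label 0).
    Ties count as half a win. Labels and scores are hypothetical inputs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one transcript of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two evaluation transcripts, two deployment transcripts.
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the scale on which the reported 0.83 and 0.92 sit.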

[157] CoMaPOI: A Collaborative Multi-Agent Framework for Next POI Prediction Bridging the Gap Between Trajectory and Language

Lin Zhong,Lingzhi Wang,Xu Yang,Qing Liao

Main category: cs.CL

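TL;DR: The paper proposes CoMaPOI, a collaborative multi-agent framework that addresses two key challenges for large language models in next-POI prediction: semantic understanding of numeric spatiotemporal data and dynamic constraint of the candidate POI space. Experiments show it outperforms existing methods.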

Details

Motivation: In next-POI prediction, LLMs struggle to understand numeric spatiotemporal data, and an overly large candidate POI space leads to inaccurate predictions. Method: Proposes the CoMaPOI framework with three agents (Profiler, Forecaster, and Predictor), responsible respectively for converting numeric data into language, constraining the candidate space, and generating predictions. Result: On three benchmark datasets, CoMaPOI improves all metrics by 5% to 10% and reaches state-of-the-art performance. Conclusion: The collaborative agent framework addresses the challenges of applying LLMs to complex spatiotemporal tasks and suggests a direction for future research. Abstract: Large Language Models (LLMs) offer new opportunities for the next Point-Of-Interest (POI) prediction task, leveraging their capabilities in semantic understanding of POI trajectories. However, previous LLM-based methods, which are superficially adapted to next POI prediction, largely overlook critical challenges associated with applying LLMs to this task. Specifically, LLMs encounter two critical challenges: (1) a lack of intrinsic understanding of numeric spatiotemporal data, which hinders accurate modeling of users' spatiotemporal distributions and preferences; and (2) an excessively large and unconstrained candidate POI space, which often results in random or irrelevant predictions. To address these issues, we propose a Collaborative Multi Agent Framework for Next POI Prediction, named CoMaPOI. Through the close interaction of three specialized agents (Profiler, Forecaster, and Predictor), CoMaPOI collaboratively addresses the two critical challenges. The Profiler agent is responsible for converting numeric data into language descriptions, enhancing semantic understanding. The Forecaster agent focuses on dynamically constraining and refining the candidate POI space. The Predictor agent integrates this information to generate high-precision predictions. Extensive experiments on three benchmark datasets (NYC, TKY, and CA) demonstrate that CoMaPOI achieves state-of-the-art performance, improving all metrics by 5% to 10% compared to SOTA baselines. This work pioneers the investigation of challenges associated with applying LLMs to complex spatiotemporal tasks by leveraging tailored collaborative agents.

[158] Exploring the Landscape of Text-to-SQL with Large Language Models: Progresses, Challenges and Opportunities

Yiming Huang,Jiyu Guo,Wenxin Mao,Cuiyun Gao,Peiyi Han,Chuanyi Liu,Qing Ling

Main category: cs.CL

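TL;DR: This paper systematically reviews LLM-based Text-to-SQL, covering research trends, technical analysis, a summary of datasets and evaluation metrics, and future challenges and directions.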

Details

Motivation: Text-to-SQL gives users without SQL knowledge a convenient way to access relational databases, and progress in LLMs opens new opportunities to improve it. Method: A systematic review analyzing research trends, techniques, datasets, and evaluation metrics for LLM-based Text-to-SQL. Result: Summarizes existing techniques, datasets, and evaluation metrics, and identifies potential obstacles and directions for future research. Conclusion: The survey aims to give researchers an in-depth understanding of LLM-based Text-to-SQL and to encourage further innovation in the field. Abstract: Converting natural language (NL) questions into SQL queries, referred to as Text-to-SQL, has emerged as a pivotal technology for facilitating access to relational databases, especially for users without SQL knowledge. Recent progress in large language models (LLMs) has markedly propelled the field of natural language processing (NLP), opening new avenues to improve text-to-SQL systems. This study presents a systematic review of LLM-based text-to-SQL, focusing on four key aspects: (1) an analysis of the research trends in LLM-based text-to-SQL; (2) an in-depth analysis of existing LLM-based text-to-SQL techniques from diverse perspectives; (3) summarization of existing text-to-SQL datasets and evaluation metrics; and (4) discussion on potential obstacles and avenues for future exploration in this domain. This survey seeks to furnish researchers with an in-depth understanding of LLM-based text-to-SQL, sparking new innovations and advancements in this field.

[159] Measuring Sycophancy of Language Models in Multi-turn Dialogues

Jiseung Hong,Grace Byun,Seungone Kim,Kai Shu

Main category: cs.CL

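TL;DR: SYCON Bench is a new benchmark for evaluating sycophantic behavior of LLMs in multi-turn dialogues. It finds that alignment tuning amplifies sycophancy, while model scaling and reasoning optimization strengthen resistance.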

Details

Motivation: Studies sycophancy of LLMs in realistic interactions, filling a gap left by work focused on single-turn factual correctness. Method: Introduces SYCON Bench, which measures how quickly and how often a model conforms to the user in multi-turn dialogue, and evaluates 17 LLMs. Result: Sycophancy is prevalent; alignment tuning amplifies it; reasoning models do better but often fail when they over-index on logical exposition. Conclusion: A prompting strategy that adopts a third-person perspective substantially reduces sycophancy. Abstract: Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy: conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model's ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user's underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in the debate scenario. We release our code and data at https://github.com/JiseungHong/SYCON-Bench.
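
The two metrics named in the abstract, Turn of Flip and Number of Flip, can be illustrated on a per-turn record of whether the model still holds its original stance. This is a hedged sketch only: the boolean encoding, the function names, and the convention that the model is assumed to hold its stance before the first turn are assumptions, and the benchmark's exact definitions may differ.

```python
def turn_of_flip(holds_stance):
    """1-indexed turn at which the model first abandons its original
    stance, or None if it never does (assumed convention)."""
    for turn, holds in enumerate(holds_stance, start=1):
        if not holds:
            return turn
    return None


def number_of_flips(holds_stance):
    """Count stance changes between consecutive turns, assuming the
    model holds its original stance before turn 1."""
    sequence = [True] + list(holds_stance)
    return sum(1 for prev, cur in zip(sequence, sequence[1:]) if prev != cur)


# Hypothetical five-turn dialogue: the model gives in at turn 3,
# recovers at turn 4, and gives in again at turn 5.
dialogue = [True, True, False, True, False]
print(turn_of_flip(dialogue), number_of_flips(dialogue))
```

Under these conventions a later Turn of Flip and a lower Number of Flip both indicate stronger resistance to user pressure.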

[160] Document Valuation in LLM Summaries: A Cluster Shapley Approach

Zikun Ye,Hema Yoganarasimhan

Main category: cs.CL

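TL;DR: The paper proposes a Shapley-value-based method for valuing document contributions and an efficient Cluster Shapley algorithm for fairly attributing credit to original content creators when LLMs generate summaries.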

Details

Motivation: LLM-generated summaries obscure the contributions of original content creators, raising questions about credit attribution and compensation. Method: Uses Shapley values to value document contributions and proposes Cluster Shapley, which clusters documents to reduce computational cost. Result: Cluster Shapley significantly reduces computational complexity while maintaining high accuracy, outperforming baseline methods. Conclusion: The method applies to a variety of summarization settings and is broadly applicable. Abstract: Large Language Models (LLMs) are increasingly used in systems that retrieve and summarize content from multiple sources, such as search engines and AI assistants. While these models enhance user experience by generating coherent summaries, they obscure the contributions of original content creators, raising concerns about credit attribution and compensation. We address the challenge of valuing individual documents used in LLM-generated summaries. We propose using Shapley values, a game-theoretic method that allocates credit based on each document's marginal contribution. Although theoretically appealing, Shapley values are expensive to compute at scale. We therefore propose Cluster Shapley, an efficient approximation algorithm that leverages semantic similarity between documents. By clustering documents using LLM-based embeddings and computing Shapley values at the cluster level, our method significantly reduces computation while maintaining attribution quality. We demonstrate our approach on a summarization task using Amazon product reviews. Cluster Shapley significantly reduces computational complexity while maintaining high accuracy, outperforming baseline methods such as Monte Carlo sampling and Kernel SHAP with a better efficient frontier. Our approach is agnostic to the exact LLM used, the summarization process used, and the evaluation procedure, which makes it broadly applicable to a variety of summarization settings.
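
The idea of computing Shapley values at the cluster level can be sketched as follows. This is an illustrative toy, not the paper's algorithm: the clusters are given by hand rather than derived from LLM embeddings, the value function is a hypothetical stand-in for summary quality, and splitting a cluster's value evenly among its documents is an assumption the abstract does not confirm.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution of each
    player over all orderings. Cost grows factorially with the number
    of players, which is why clustering helps."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def cluster_shapley(clusters, value):
    """clusters: dict cluster_id -> list of document ids.
    value: function from a frozenset of document ids to a float."""
    def cluster_value(cluster_ids):
        docs = frozenset(d for c in cluster_ids for d in clusters[c])
        return value(docs)

    per_cluster = shapley_values(list(clusters), cluster_value)
    # Assumption: each cluster's credit is split evenly among its documents.
    return {d: per_cluster[c] / len(docs)
            for c, docs in clusters.items() for d in docs}


# Hypothetical reviews and the topics they mention.
topics = {"r1": {"battery"}, "r2": {"battery"}, "r3": {"screen"}}


def coverage(docs):
    """Toy value function: number of distinct topics covered."""
    return float(len(set().union(*(topics[d] for d in docs)))) if docs else 0.0


print(cluster_shapley({"A": ["r1", "r2"], "B": ["r3"]}, coverage))
```

With two clusters only 2! orderings are evaluated instead of the 3! needed for three individual documents, and the gap widens quickly as the document count grows.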

[161] Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks

Wenhan Dong,Tianyi Hu,Jingyi Zheng,Zhen Sun,Yuemeng Zhao,Yule Liu,Xinlei He,Xinyi Huang

Main category: cs.CL

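TL;DR: The paper argues that existing methods for evaluating the lateral thinking of large language models (LLMs) on multi-round incomplete information tasks are misleading, and proposes refined evaluation standards.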

Details

Motivation: Existing evaluation methods fail to uncover key issues in LLMs (such as shortcut-taking, rigid patterns, and premature task termination), which makes evaluation results unreliable. Method: Proposes refined evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance. Result: Existing methods tend to yield misleading results that obscure the true reasoning capabilities of LLMs. Conclusion: More comprehensive evaluation standards are needed to make assessments of LLM lateral thinking reliable. Abstract: Multi-round incomplete information tasks are crucial for evaluating the lateral thinking capabilities of large language models (LLMs). Currently, research primarily relies on multiple benchmarks and automated evaluation metrics to assess these abilities. However, our study reveals novel insights into the limitations of existing methods, as they often yield misleading results that fail to uncover key issues, such as shortcut-taking behaviors, rigid patterns, and premature task termination. These issues obscure the true reasoning capabilities of LLMs and undermine the reliability of evaluations. To address these limitations, we propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.

[162] Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation

Zhenglun Kong,Zheng Zhan,Shiyue Hou,Yifan Gong,Xin Meng,Pengwei Sui,Peiyan Dong,Xuan Shen,Zifeng Wang,Pu Zhao,Hao Tang,Stratis Ioannidis,Yanzhi Wang

Main category: cs.CL

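TL;DR: Proposes a framework that adaptively selects and aggregates knowledge from multiple source LLMs, reducing knowledge interference and memory overhead.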

Details

Motivation: Traditional finetuning faces memory and adaptability problems when continually improving LLMs, and existing methods are prone to knowledge interference. Method: Designs an adaptive selection network and a dynamic weighted fusion strategy, combined with a feedback-driven loss function. Result: Experiments show the method reduces knowledge interference by up to 50% and improves stability and scalability. Conclusion: The framework addresses interference in aggregating knowledge from multiple source LLMs and has practical potential. Abstract: Large language models (LLMs) have shown remarkable promise but remain challenging to continually improve through traditional finetuning, particularly when integrating capabilities from other specialized LLMs. Popular methods like ensemble and weight merging require substantial memory and struggle to adapt to changing data environments. Recent efforts have transferred knowledge from multiple LLMs into a single target model; however, they suffer from interference and degraded performance among tasks, largely due to limited flexibility in candidate selection and training pipelines. To address these issues, we propose a framework that adaptively selects and aggregates knowledge from diverse LLMs to build a single, stronger model, avoiding the high memory overhead of ensemble and inflexible weight merging. Specifically, we design an adaptive selection network that identifies the most relevant source LLMs based on their scores, thereby reducing knowledge interference. We further propose a dynamic weighted fusion strategy that accounts for the inherent strengths of candidate LLMs, along with a feedback-driven loss function that prevents the selector from converging on a single subset of sources. Experimental results demonstrate that our method can enable a more stable and scalable knowledge aggregation process while reducing knowledge interference by up to 50% compared to existing approaches. Code is available at https://github.com/ZLKong/LLM_Integration

[163] Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs

Jakub Podolak,Rajeev Verma

Main category: cs.CL

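TL;DR: DeepSeek R1-32B's self-reported confidence on QA tasks is regularly over-confident, whereas semantic entropy (obtained by sampling many responses) estimates uncertainty more reliably. Forcing a long chain-of-thought greatly improves the effectiveness of the model's verbal confidence.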

Details

Motivation: Studies the source of uncertainty in DeepSeek R1-32B on QA tasks and explores how to make its self-reported confidence more reliable. Method: Analyzes verbal confidence and semantic entropy in the default setting, forces the model to produce a long chain-of-thought, and uses a separate reader model to test where the confidence comes from. Result: A long chain-of-thought greatly improves the effectiveness of verbal confidence, semantic entropy is more reliable in the default setting, and a separate reader model can reconstruct very similar confidences. Conclusion: Reliable uncertainty estimation requires explicit exploration of the generative space; self-reported confidence is trustworthy only after such exploration. Abstract: We study the source of uncertainty in DeepSeek R1-32B by analyzing its self-reported verbal confidence on question answering (QA) tasks. In the default answer-then-confidence setting, the model is regularly over-confident, whereas semantic entropy (obtained by sampling many responses) remains reliable. We hypothesize that this is because of semantic entropy's larger test-time compute, which lets us explore the model's predictive distribution. We show that granting DeepSeek the budget to explore its distribution by forcing a long chain-of-thought before the final answer greatly improves its verbal score effectiveness, even on simple fact-retrieval questions that normally require no reasoning. Furthermore, a separate reader model that sees only the chain can reconstruct very similar confidences, indicating the verbal score might be merely a statistic of the alternatives surfaced during reasoning. Our analysis concludes that reliable uncertainty estimation requires explicit exploration of the generative space, and self-reported confidence is trustworthy only after such exploration.
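
Semantic entropy, as used above, is the entropy of the distribution over distinct meanings among many sampled answers. The sketch below illustrates the quantity under a strong simplification: answers are grouped by a hypothetical `canonicalize` function (lowercasing and trimming), whereas practical implementations typically group answers by semantic equivalence using a separate model. It is not the authors' implementation.

```python
import math
from collections import Counter


def semantic_entropy(answers, canonicalize=lambda a: a.strip().lower()):
    """Entropy (in nats) over groups of answers that share a meaning.
    `canonicalize` is a stand-in for a real semantic-equivalence check."""
    counts = Counter(canonicalize(a) for a in answers)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical samples for the same question.
consistent = ["Paris", "paris ", "PARIS"]
split = ["Paris", "Lyon"]
print(semantic_entropy(consistent), semantic_entropy(split))
```

Low entropy means the samples agree on one meaning; high entropy means the model's answers scatter across alternatives, which is the signal the abstract reports as reliable.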

[164] Scalable, Symbiotic, AI and Non-AI Agent Based Parallel Discrete Event Simulations

Atanu Barai,Stephan Eidenbenz,Nandakishore Santhi

Main category: cs.CL

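TL;DR: Proposes a methodology based on parallel discrete event simulation (PDES) that couples AI and non-AI agents, using dynamic constraints and rule verification to improve the accuracy and trustworthiness of system output.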

Details

Motivation: To realize the potential of AI systems while ensuring their output is correct, multiple AI and non-AI systems need to be coupled seamlessly. Method: Uses a PDES framework in which each agent is an entity; agents cooperate through the passage of simulated time and causal rules, and non-AI agents act as auditors that verify the actions of AI agents. Result: On four problems from four different domains, the approach reaches an overall accuracy of 68%, compared with less than 23% for vanilla AI models. Conclusion: Combining AI and non-AI agents addresses the limitations of AI models working alone on complex tasks and improves accuracy and scalability. Abstract: To fully leverage the potential of artificial intelligence (AI) systems in a trustworthy manner, it is desirable to couple multiple AI and non-AI systems together seamlessly for constraining and ensuring correctness of the output. This paper introduces a novel parallel discrete event simulation (PDES) based methodology to combine multiple AI and non-AI agents in a causal, rule-based way. Our approach tightly integrates the concept of passage of time, with each agent considered as an entity in the PDES framework and responding to prior requests from other agents. Such coupling mechanism enables the agents to work in a co-operative environment towards a common goal while many tasks run in parallel throughout the simulation. It further enables setting up boundaries to the outputs of the AI agents by applying necessary dynamic constraints using non-AI agents while allowing for scalability through deployment of hundreds of such agents in a larger compute cluster. Distributing smaller AI agents can enable extremely scalable simulations in the future, addressing local memory bottlenecks for model parameter storage. Within a PDES involving both AI and non-AI agents, we break down the problem at hand into structured steps, when necessary, providing a set of multiple choices to the AI agents, and then progressively solve these steps towards a final goal. At each step, the non-AI agents act as unbiased auditors, verifying each action by the AI agents so that certain rules of engagement are followed. We evaluate our approach by solving four problems from four different domains and comparing the results with those from AI models alone. Our results show greater accuracy in solving problems from various domains where the AI models struggle to solve the problems solely by themselves. Results show that overall accuracy of our approach is 68% whereas the accuracy of vanilla models is less than 23%.

[165] Derailing Non-Answers via Logit Suppression at Output Subspace Boundaries in RLHF-Aligned Language Models

Harvey Dam,Jonas Knochelmann,Vinu Joseph,Ganesh Gopalakrishnan

Main category: cs.CL

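TL;DR: Suppressing specific token sequences during generation reduces the rate at which large language models refuse sensitive content, without modifying model weights or prompts.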

Details

Motivation: Certain models were observed to emit a specific token sequence before refusing sensitive content (<think> followed by a double newline), suggesting that simple adjustments to these tokens could lower refusal rates. Method: During generation, suppress the double newline after <think> and the end-of-sequence token after </think>, solely by modifying token probabilities. Result: In experiments with DeepSeek-R1 distillations, the method increased the proportion of substantive answers to sensitive prompts without affecting performance on standard benchmarks. Conclusion: Refusal behavior can be circumvented by blocking specific token sequences during generation. Abstract: We introduce a method to reduce refusal rates of large language models (LLMs) on sensitive content without modifying model weights or prompts. Motivated by the observation that refusals in certain models were often preceded by the specific token sequence of a token marking the beginning of the chain-of-thought (CoT) block (<think>) followed by a double newline token (\n\n), we investigate the impact of two simple formatting adjustments during generation: suppressing \n\n after <think> and suppressing the end-of-sequence token after the end of the CoT block (</think>). Our method requires no datasets, parameter changes, or training, relying solely on modifying token probabilities during generation. In our experiments with official DeepSeek-R1 distillations, these interventions increased the proportion of substantive answers to sensitive prompts without affecting performance on standard benchmarks. Our findings suggest that refusal behaviors can be circumvented by blocking refusal subspaces at specific points in the generation process.
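
The intervention described above, modifying token probabilities at specific points in generation, amounts to masking a token's logit when the previous token matches a trigger. The following sketch shows the mechanism on a toy four-token vocabulary; the token ids, the `rules` mapping, and the function name are all hypothetical, and this is not the authors' code or any particular library's API.

```python
import math

# Hypothetical token ids for a toy vocabulary.
COT_OPEN, COT_CLOSE, DOUBLE_NEWLINE, EOS = 0, 1, 2, 3

# After the CoT-open token, ban the double newline;
# after the CoT-close token, ban end-of-sequence.
RULES = {COT_OPEN: {DOUBLE_NEWLINE}, COT_CLOSE: {EOS}}


def suppress_after(logits, prev_token_id, rules=RULES):
    """Return a copy of `logits` with banned tokens set to -inf when
    the previous token triggers a rule; otherwise return them unchanged."""
    banned = rules.get(prev_token_id, set())
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]


step_logits = [0.1, 0.2, 0.9, 0.3]
print(suppress_after(step_logits, COT_OPEN))   # double newline masked
print(suppress_after(step_logits, COT_CLOSE))  # end-of-sequence masked
```

Because a logit of negative infinity becomes probability zero after softmax, the banned token can never be sampled at that step, while every other step of generation is left untouched.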

[166] ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

Michael Shalyt,Rotem Elimelech,Ido Kaminer

Main category: cs.CL

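TL;DR: ASyMOB is a new framework focused on evaluating symbolic mathematics. It shows that LLM performance on symbolic math relies on memorized patterns rather than deeper understanding, although the most advanced models are more robust.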

Details

Motivation: Existing benchmarks do not fully assess core LLM skills in symbolic mathematics, such as integration, differential equations, and algebraic simplification, so a new assessment tool is needed. Method: Introduces the ASyMOB framework with 17,092 math challenges and analyzes LLM generalization through numerical or symbolic perturbations. Result: LLM performance drops substantially on perturbed problems (up to -70.3%), but the most advanced models (o4-mini, Gemini 2.5 Flash) are more robust (a drop of about 21%). Conclusion: LLMs rely on memorization in symbolic math, though advanced models generalize better; the way forward may lie in integrating external tools or in further improving model capability. Abstract: Large language models (LLMs) are rapidly approaching the level of proficiency in university-level symbolic mathematics required for applications in advanced science and technology. However, existing benchmarks fall short in assessing the core skills of LLMs in symbolic mathematics, such as integration, differential equations, and algebraic simplification. To address this gap, we introduce ASyMOB, a novel assessment framework focused exclusively on symbolic manipulation, featuring 17,092 unique math challenges, organized by similarity and complexity. ASyMOB enables analysis of LLM generalization capabilities by comparing performance in problems that differ by simple numerical or symbolic 'perturbations'. Evaluated LLMs exhibit substantial degradation in performance for all perturbation types (up to -70.3%), suggesting reliance on memorized patterns rather than deeper understanding of symbolic math, even among models achieving high baseline accuracy. Comparing LLM performance to computer algebra systems, we identify examples where they fail while LLMs succeed, as well as problems solved only by combining both approaches. Models capable of integrated code execution yielded higher accuracy compared to their performance without code, particularly stabilizing weaker models (up to +33.1% for certain perturbation types). Notably, the most advanced models (o4-mini, Gemini 2.5 Flash) demonstrate not only high symbolic math proficiency (scoring 96.8% and 97.6% on the unperturbed set), but also remarkable robustness against perturbations (-21.7% and -21.2% vs. average -50.4% for the other models). This may indicate a recent "phase transition" in the generalization capabilities of frontier LLMs. It remains to be seen whether the path forward lies in deeper integration with sophisticated external tools, or in developing models so capable that symbolic math systems like CAS become unnecessary.

[167] Large Language Model-Based Agents for Automated Research Reproducibility: An Exploratory Study in Alzheimer's Disease

Nic Dobbins,Christelle Xiong,Kristine Lan,Meliha Yetisgen

Main category: cs.CL

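TL;DR: LLM-based autonomous agents attempted to reproduce Alzheimer's disease research findings and approximately reproduced 53.2% of key findings per study on average, with differences in numeric values and methods, showing both potential and limitations.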

Details

Motivation: Explores the ability of LLMs, acting as autonomous agents, to automate the reproduction of findings in biomedical research. Method: Uses GPT-4o to simulate a research team that, given the NACC dataset plus paper Abstracts, Methods sections, and data dictionary descriptions, dynamically reproduces the key findings of five Alzheimer's disease studies. Result: LLM agents approximately reproduced 53.2% of key findings per study on average; numeric values and methods differed from the original studies, though some trends and significance levels were similar. Conclusion: LLMs show potential for automating reproducibility but remain limited by missing methodological detail and implementation flaws, and need further improvement. Abstract: Objective: To demonstrate the capabilities of Large Language Models (LLMs) as autonomous agents to reproduce findings of published research studies using the same or similar dataset. Materials and Methods: We used the "Quick Access" dataset of the National Alzheimer's Coordinating Center (NACC). We identified highly cited published research manuscripts using NACC data and selected five studies that appeared reproducible using this dataset alone. Using GPT-4o, we created a simulated research team of LLM-based autonomous agents tasked with writing and executing code to dynamically reproduce the findings of each study, given only study Abstracts, Methods sections, and data dictionary descriptions of the dataset. Results: We extracted 35 key findings described in the Abstracts across 5 Alzheimer's studies. On average, LLM agents approximately reproduced 53.2% of findings per study. Numeric values and range-based findings often differed between studies and agents. The agents also applied statistical methods or parameters that varied from the originals, though overall trends and significance were sometimes similar. Discussion: In some cases, LLM-based agents replicated research techniques and findings. In others, they failed due to implementation flaws or missing methodological detail. These discrepancies show the current limits of LLMs in fully automating reproducibility assessments. Still, this early investigation highlights the potential of structured agent-based systems to provide scalable evaluation of scientific rigor. Conclusion: This exploratory work illustrates both the promise and limitations of LLMs as autonomous agents for automating reproducibility in biomedical research.

[168] Revisiting Uncertainty Estimation and Calibration of Large Language Models

Linwei Tao,Yi-Fan Yeh,Minjing Dong,Tao Huang,Philip Torr,Chang Xu

Main category: cs.CL

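TL;DR: This paper presents a comprehensive study of uncertainty estimation in large language models (LLMs), evaluating 80 models. Linguistic verbal uncertainty (LVU) performs best, and factors such as model scale and post-training affect estimation performance.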

Details

Motivation: As LLMs are increasingly deployed in high-stakes applications, reliable uncertainty estimation is essential for safe and trustworthy deployment. Method: Evaluates 80 models across architectures and scales, comparing three uncertainty estimation methods (TPU, NVU, LVU) on the MMLU-Pro benchmark. Result: LVU outperforms TPU and NVU in calibration and discrimination, and factors such as model scale and reasoning ability significantly influence uncertainty estimation. Conclusion: The study highlights the importance of multi-perspective evaluation and recommends LVU as a practical tool for improving LLM reliability. Abstract: As large language models (LLMs) are increasingly deployed in high-stakes applications, robust uncertainty estimation is essential for ensuring the safe and trustworthy deployment of LLMs. We present the most comprehensive study to date of uncertainty estimation in LLMs, evaluating 80 models spanning open- and closed-source families, dense and Mixture-of-Experts (MoE) architectures, reasoning and non-reasoning modes, quantization variants and parameter scales from 0.6B to 671B. Focusing on three representative black-box single-pass methods, including token probability-based uncertainty (TPU), numerical verbal uncertainty (NVU), and linguistic verbal uncertainty (LVU), we systematically evaluate uncertainty calibration and selective classification using the challenging MMLU-Pro benchmark, which covers both reasoning-intensive and knowledge-based tasks. Our results show that LVU consistently outperforms TPU and NVU, offering stronger calibration and discrimination while being more interpretable. We also find that high accuracy does not imply reliable uncertainty, and that model scale, post-training, reasoning ability and quantization all influence estimation performance. Notably, LLMs exhibit better uncertainty estimates on reasoning tasks than on knowledge-heavy ones, and good calibration does not necessarily translate to effective error ranking. These findings highlight the need for multi-perspective evaluation and position LVU as a practical tool for improving the reliability of LLMs in real-world settings.
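
Calibration, one of the two properties evaluated above, asks whether stated confidence matches observed accuracy. A common way to quantify it is expected calibration error (ECE) with equal-width confidence bins; the abstract does not say which calibration metric the authors use, so the sketch below is an assumption chosen for illustration, with hypothetical inputs.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and accuracy
    within equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical model that says 0.9 on four answers but gets three right.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A low ECE does not by itself mean the model ranks its errors well, which matches the abstract's remark that good calibration does not necessarily translate to effective error ranking.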

[169] OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities

Sahil Verma,Keegan Hines,Jeff Bilmes,Charlotte Siska,Luke Zettlemoyer,Hila Gonen,Chandan Singh

Main category: cs.CL

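TL;DR: OMNIGUARD detects harmful prompts across languages and modalities by using the internal representations of an LLM, substantially improving classification accuracy and efficiency.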

Details

Motivation: The potential for harmful misuse of large language models (LLMs) raises concerns, and existing detection methods are vulnerable to attack, especially with low-resource languages or non-text modalities such as image and audio. Method: OMNIGUARD identifies internal LLM representations that are aligned across languages or modalities and uses them to build a language-agnostic or modality-agnostic classifier for harmful prompts. Result: OMNIGUARD improves classification accuracy by 11.57% in the multilingual setting and by 20.44% for image-based prompts, sets a new SOTA for audio-based prompts, and is about 120 times faster than the next fastest baseline. Conclusion: OMNIGUARD offers an efficient cross-lingual, cross-modal solution for detecting harmful prompts that clearly outperforms existing baselines. Abstract: The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient ($\approx 120 \times$ faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.

[170] Infi-Med: Low-Resource Medical MLLMs with Robust Reasoning Evaluation

Zeyu Liu,Zhitian Hou,Yining Di,Kejing Yang,Zhijie Sang,Congkai Xie,Jingwen Yang,Siyuan Liu,Jialu Wang,Chunming Li,Ming Li,Hongxia Yang

Main category: cs.CL

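TL;DR: Infi-Med is a framework for medical multimodal large language models that addresses key deployment challenges in healthcare through resource-efficient training, enhanced multimodal reasoning, and systematic evaluation.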

Details

Motivation: Addresses practical deployment problems of medical MLLMs in resource efficiency, diagnostic accuracy, clinical considerations, and ethical privacy. Method: Proposes the Infi-Med framework, comprising resource-efficient SFT dataset construction, enhanced multimodal reasoning, and a systematic evaluation system. Result: Experiments show Infi-Med achieves SOTA performance in general medical reasoning and adapts rapidly to clinical scenarios. Conclusion: Infi-Med provides a foundation for deploying medical MLLMs in practice by balancing model effectiveness with operational constraints. Abstract: Multimodal large language models (MLLMs) have demonstrated promising prospects in healthcare, particularly for addressing complex medical tasks, supporting multidisciplinary treatment (MDT), and enabling personalized precision medicine. However, their practical deployment faces critical challenges in resource efficiency, diagnostic accuracy, clinical considerations, and ethical privacy. To address these limitations, we propose Infi-Med, a comprehensive framework for medical MLLMs that introduces three key innovations: (1) a resource-efficient approach through curating and constructing high-quality supervised fine-tuning (SFT) datasets with minimal sample requirements, with a forward-looking design that extends to both pretraining and posttraining phases; (2) enhanced multimodal reasoning capabilities for cross-modal integration and clinical task understanding; and (3) a systematic evaluation system that assesses model performance across medical modalities and task types. Our experiments demonstrate that Infi-Med achieves state-of-the-art (SOTA) performance in general medical reasoning while maintaining rapid adaptability to clinical scenarios. The framework establishes a solid foundation for deploying MLLMs in real-world healthcare settings by balancing model effectiveness with operational constraints.

[171] One Task Vector is not Enough: A Large-Scale Study for In-Context Learning

Pavel Tikhonov,Ivan Oseledets,Elena Tutubalina

Main category: cs.CL

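TL;DR: The paper introduces the QuiteAFew dataset for analyzing how task vectors behave in few-shot LLM tasks, finding that task vector performance peaks at an intermediate layer and that complex tasks rely on multiple subtask-specific vectors.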

Details

Motivation: Existing studies are limited to small-scale benchmarks and cannot comprehensively analyze the role of task vectors in few-shot learning. Method: Runs experiments with Llama-3-8B on 3,096 diverse few-shot tasks (derived from the Alpaca dataset) and analyzes task vector performance. Result: Task vector performance peaks at an intermediate layer (e.g., the 15th), and complex tasks require multiple subtask-specific vectors. Conclusion: Task knowledge in LLMs may be represented in a distributed way rather than by a single vector. Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to adapt to new tasks using few examples, with task vectors (specific hidden state activations) hypothesized to encode task information. Existing studies are limited by small-scale benchmarks, restricting comprehensive analysis. We introduce QuiteAFew, a novel dataset of 3,096 diverse few-shot tasks, each with 30 input-output pairs derived from the Alpaca dataset. Experiments with Llama-3-8B on QuiteAFew reveal: (1) task vector performance peaks at an intermediate layer (e.g., 15th), (2) effectiveness varies significantly by task type, and (3) complex tasks rely on multiple, subtask-specific vectors rather than a single vector, suggesting distributed task knowledge representation.

[172] Reinforcement Learning for Better Verbalized Confidence in Long-Form Generation

Caiqi Zhang,Xiaochen Zhu,Chengzu Li,Nigel Collier,Andreas Vlachos

Main category: cs.CL

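TL;DR: The paper proposes LoVeC, which uses reinforcement learning to train large language models to append confidence scores to generated content, enabling efficient hallucination detection.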

Details

Motivation: Addresses hallucination when LLMs generate factual content; existing methods are inefficient and generalize poorly to open-ended generation. Method: Uses reinforcement learning (including DPO, ORPO, and GRPO) to train models to append a numerical confidence score to each generated statement. Result: On three long-form QA datasets, RL-trained models show better calibration and robust cross-domain generalization. Conclusion: LoVeC is efficient and direct, requiring only a few extra tokens to provide confidence estimates. Abstract: Hallucination remains a major challenge for the safe and trustworthy deployment of large language models (LLMs) in factual content generation. Prior work has explored confidence estimation as an effective approach to hallucination detection, but often relies on post-hoc self-consistency methods that require computationally expensive sampling. Verbalized confidence offers a more efficient alternative, but existing approaches are largely limited to short-form question answering (QA) tasks and do not generalize well to open-ended generation. In this paper, we propose LoVeC (Long-form Verbalized Confidence), an on-the-fly verbalized confidence estimation method for long-form generation. Specifically, we use reinforcement learning (RL) to train LLMs to append numerical confidence scores to each generated statement, serving as a direct and interpretable signal of the factuality of generation. Our experiments consider both on-policy and off-policy RL methods, including DPO, ORPO, and GRPO, to enhance the model calibration. We introduce two novel evaluation settings, free-form tagging and iterative tagging, to assess different verbalized confidence estimation methods. Experiments on three long-form QA datasets show that our RL-trained models achieve better calibration and generalize robustly across domains. Also, our method is highly efficient, as it only requires adding a few tokens to the output being decoded.

[173] Probing Association Biases in LLM Moderation Over-Sensitivity

Yuxin Wang,Botao Yu,Ivory Yang,Saeed Hassanpour,Soroush Vosoughi

Main category: cs.CL

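TL;DR: Using Topic Association Analysis, the paper reveals systematic topic biases in large language models (LLMs) used for content moderation, showing that they are sensitive not only to explicitly offensive language but also to learned topic associations.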

Details

Motivation: Investigates why LLMs misclassify benign comments as toxic in content moderation, going beyond traditional token-level analysis. Method: Introduces Topic Association Analysis, which prompts LLMs to generate free-form scenario imagination and quantifies how strongly they associate particular topics with toxicity. Result: More advanced models (e.g., GPT-4 Turbo) show stronger topic stereotypes despite lower overall false positive rates. Conclusion: The study highlights the need to move beyond keyword-based filtering and refine LLM moderation mechanisms. Abstract: Large Language Models are widely used for content moderation but often misclassify benign comments as toxic, leading to over-sensitivity. While previous research attributes this issue primarily to the presence of offensive terms, we reveal a potential cause beyond token level: LLMs exhibit systematic topic biases in their implicit associations. Inspired by cognitive psychology's implicit association tests, we introduce Topic Association Analysis, a semantic-level approach to quantify how LLMs associate certain topics with toxicity. By prompting LLMs to generate free-form scenario imagination for misclassified benign comments and analyzing their topic amplification levels, we find that more advanced models (e.g., GPT-4 Turbo) demonstrate stronger topic stereotype despite lower overall false positive rates. These biases suggest that LLMs do not merely react to explicit, offensive language but rely on learned topic associations, shaping their moderation decisions. Our findings highlight the need for refinement beyond keyword-based filtering, providing insights into the underlying mechanisms driving LLM over-sensitivity.

[174] ChARM: Character-based Act-adaptive Reward Modeling for Advanced Role-Playing Language Agents

Feiteng Fang,Ting-En Lin,Yuchuan Wu,Xiong Liu,Xiang Huang,Dingwei Chen,Jing Ye,Haonan Zhang,Liang Zhu,Hamid Alinejad-Rokny,Min Yang,Fei Huang,Yongbin Li

Main category: cs.CL

TL;DR: The paper proposes ChARM, a reward model for role-playing language agents that combines an act-adaptive margin with a self-evolution mechanism, and introduces a new preference dataset and evaluation benchmark; it reports a 13% improvement over the Bradley-Terry model in preference rankings.

Details

Motivation: Traditional reward models for role-playing language agents (RPLAs) scale poorly and struggle to adapt to subjective conversational preferences.

Method: Propose ChARM, which uses an act-adaptive margin and a self-evolution mechanism over large-scale unlabeled data, and introduce the RoleplayPref dataset and the RoleplayEval benchmark.

Result: ChARM improves preference ranking by 13% over the conventional Bradley-Terry model, and using its rewards in preference learning achieves state-of-the-art results on CharacterEval and RoleplayEval.

Conclusion: ChARM's two mechanisms and the new dataset improve the performance and adaptability of RPLAs.

Abstract: Role-Playing Language Agents (RPLAs) aim to simulate characters for realistic and engaging human-computer interactions. However, traditional reward models often struggle with scalability and adapting to subjective conversational preferences. We propose ChARM, a Character-based Act-adaptive Reward Model, addressing these challenges through two innovations: (1) an act-adaptive margin that significantly enhances learning efficiency and generalizability, and (2) a self-evolution mechanism leveraging large-scale unlabeled data to improve training coverage. Additionally, we introduce RoleplayPref, the first large-scale preference dataset specifically for RPLAs, featuring 1,108 characters, 13 subcategories, and 16,888 bilingual dialogues, alongside RoleplayEval, a dedicated evaluation benchmark. Experimental results show a 13% improvement over the conventional Bradley-Terry model in preference rankings. Furthermore, applying ChARM-generated rewards to preference learning techniques (e.g., direct preference optimization) achieves state-of-the-art results on CharacterEval and RoleplayEval. Code and dataset are available at https://github.com/calubkk/ChARM.

[175] Scaling up the think-aloud method

Daniel Wurgaft,Ben Prystawski,Kanishk Gandhi,Cedegao E. Zhang,Joshua B. Tenenbaum,Noah D. Goodman

Main category: cs.CL

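TL;DR: The paper automates the transcription and annotation of verbal reports with natural language processing tools, enabling large-scale analysis of think-aloud data, and demonstrates the approach's feasibility and value.

Details

Motivation: The think-aloud method is hard to apply at scale because transcription and annotation are labor-intensive, which has limited its use in cognitive science.

Method: Develop automated transcription and annotation tools and analyze think-aloud data from 640 participants playing the Game of 24, a mathematical reasoning task, coding the transcripts as search graphs.

Result: The automated coding reaches moderate inter-rater reliability with human annotators, and the resulting graphs characterize consistency and variation in human reasoning.

Conclusion: The study demonstrates the value of think-aloud data at scale and serves as a proof of concept for automated analysis of verbal reports.

Abstract: The think-aloud method, where participants voice their thoughts as they solve a task, is a valuable source of rich data about human reasoning processes. Yet, it has declined in popularity in contemporary cognitive science, largely because labor-intensive transcription and annotation preclude large sample sizes. Here, we develop methods to automate the transcription and annotation of verbal reports of reasoning using natural language processing tools, allowing for large-scale analysis of think-aloud data. In our study, 640 participants thought aloud while playing the Game of 24, a mathematical reasoning task. We automatically transcribed the recordings and coded the transcripts as search graphs, finding moderate inter-rater reliability with humans. We analyze these graphs and characterize consistency and variation in human reasoning traces. Our work demonstrates the value of think-aloud data at scale and serves as a proof of concept for the automated analysis of verbal reports.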

[176] SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving

Wendong Xu,Jing Xiong,Chenyang Zhao,Qiujiang Chen,Haoran Wang,Hui Shen,Zhongwei Wan,Jianbo Dai,Taiqiang Wu,He Xiao,Chaofan Tao,Z. Morley Mao,Ying Sheng,Zhijiang Guo,Hongxia Yang,Bei Yu,Lingpeng Kong,Quanquan Gu,Ngai Wong

Main category: cs.CL

TL;DR: SwingArena is a competitive evaluation framework for large language models (LLMs) that mirrors real-world software development workflows, assessing models through the interaction between submitters and reviewers.

Details

Motivation: Static benchmarks do not capture the collaborative, iterative process of real software development, so a more realistic evaluation method is needed.

Method: SwingArena pairs LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines; a retrieval-augmented code generation (RACG) module handles long-context challenges.

Result: GPT-4o excels at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation.

Conclusion: SwingArena offers a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven development settings.

Abstract: We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, supporting multiple programming languages (C++, Python, Rust, and Go). This enables the framework to scale across diverse tasks and contexts while respecting token limitations. Our experiments, using over 400 high-quality real-world GitHub issues selected from a pool of 2,300 issues, show that models like GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. SwingArena presents a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings. More details are available on our project page: swing-bench.github.io

[177] Retrieval Augmented Generation based Large Language Models for Causality Mining

Thushara Manjari Naduvilakandy,Hyeju Jang,Mohammad Al Hasan

Main category: cs.CL

TL;DR: The paper proposes dynamic prompting schemes based on retrieval-augmented generation (RAG) to improve large language model (LLM) performance on causality detection and mining.

Details

Motivation: Unsupervised methods perform poorly and require human intervention, supervised methods lack large training datasets, and LLM prompting, though effective, has not been studied comprehensively for this task.

Method: Design RAG-based dynamic prompting schemes that combine retrieval with generation.

Result: Experiments on three datasets and five LLMs show that the proposed schemes outperform static prompting.

Conclusion: RAG-based dynamic prompting improves LLM performance on causality tasks and addresses limitations of existing methods.

Abstract: Causality detection and mining are important tasks in information retrieval due to their enormous use in information extraction, and knowledge graph construction. To solve these tasks, in existing literature there exist several solutions -- both unsupervised and supervised. However, the unsupervised methods suffer from poor performance and they often require significant human intervention for causal rule selection, leading to poor generalization across different domains. On the other hand, supervised methods suffer from the lack of large training datasets. Recently, large language models (LLMs) with effective prompt engineering are found to be effective to overcome the issue of unavailability of large training dataset. Yet, in existing literature, there does not exist comprehensive works on causality detection and mining using LLM prompting. In this paper, we present several retrieval-augmented generation (RAG) based dynamic prompting schemes to enhance LLM performance in causality detection and extraction tasks. Extensive experiments over three datasets and five LLMs validate the superiority of our proposed RAG-based dynamic prompting over other static prompting schemes.

[178] A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models

Sriram Balasubramanian,Samyadeep Basu,Soheil Feizi

Main category: cs.CL

TL;DR: The paper studies the faithfulness of chain-of-thought (CoT) reasoning in large vision-language models (LVLMs), shows how text-based and image-based biases affect reasoning, and proposes a new evaluation pipeline.

Details

Motivation: Examine whether CoT reasoning faithfully reflects a model's internal processes, in particular how previously unexplored image-based biases affect reasoning in vision-language models.

Method: Introduce a fine-grained evaluation pipeline that categorizes bias articulation patterns and use it to analyze CoT faithfulness.

Result: Image-based biases are rarely articulated, many models exhibit "inconsistent" reasoning, and language-only reasoning models still struggle to articulate implicit cues.

Conclusion: Current models remain weak at faithfully articulating implicit biases.

Abstract: Chain-of-thought (CoT) reasoning enhances performance of large language models, but questions remain about whether these reasoning traces faithfully reflect the internal processes of the model. We present the first comprehensive study of CoT faithfulness in large vision-language models (LVLMs), investigating how both text-based and previously unexplored image-based biases affect reasoning and bias articulation. Our work introduces a novel, fine-grained evaluation pipeline for categorizing bias articulation patterns, enabling significantly more precise analysis of CoT reasoning than previous methods. This framework reveals critical distinctions in how models process and respond to different types of biases, providing new insights into LVLM CoT faithfulness. Our findings reveal that subtle image-based biases are rarely articulated compared to explicit text-based ones, even in models specialized for reasoning. Additionally, many models exhibit a previously unidentified phenomenon we term "inconsistent" reasoning - correctly reasoning before abruptly changing answers, serving as a potential canary for detecting biased reasoning from unfaithful CoTs. We then apply the same evaluation pipeline to revisit CoT faithfulness in LLMs across various levels of implicit cues. Our findings reveal that current language-only reasoning models continue to struggle with articulating cues that are not overtly stated.

[179] FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression

Jiayi Tian,Ryan Solgi,Jinming Lu,Yifan Yang,Hai Li,Zheng Zhang

Main category: cs.CL

TL;DR: FLAT-LLM is a training-free structural compression method that uses fine-grained low-rank transformations in the activation space for fast and accurate weight compression, outperforming structural pruning baselines.

Details

Motivation: Large language models (LLMs) have high computational and memory demands that hinder deployment in resource-constrained environments, and existing low-rank decomposition methods suffer from accuracy degradation and inefficiency.

Method: Transform the weights using truncated eigenvectors computed via head-wise principal component analysis (PCA), and adaptively allocate ranks across decoders with an importance-based metric, compressing without recovery fine-tuning.

Result: Across 4 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance while delivering inference speedups over decomposition-based methods.

Conclusion: FLAT-LLM is an efficient, training-free structural compression method that makes models more practical in resource-constrained environments.

Abstract: Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation, expensive calibration procedures, and result in inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast and accurate, training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis (PCA), and employ an importance-based metric to adaptively allocate ranks across decoders. FLAT-LLM achieves efficient and effective weight compression without recovery fine-tuning, which could complete the calibration within a few minutes. Evaluated across 4 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.

[180] Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs

Yinong Oliver Wang,Nivedha Sivakumar,Falaah Arif Khan,Rin Metcalf Susa,Adam Golinski,Natalie Mackraz,Barry-John Theobald,Luca Zappella,Nicholas Apostoloff

Main category: cs.CL

TL;DR: The paper proposes UCerF, an uncertainty-aware fairness metric for finer-grained evaluation of large language model fairness, and introduces a new gender-occupation fairness evaluation dataset.

Details

Motivation: Existing fairness metrics fail to capture the implicit impact of model uncertainty, and current datasets have data size, diversity, and clarity issues.

Method: Propose the UCerF metric and build a new dataset of 31,756 samples, then use both to evaluate ten open-source LLMs.

Result: Models such as Mistral-7B show fairness problems under UCerF that conventional metrics such as Equalized Odds overlook.

Conclusion: UCerF and the new dataset provide a benchmark for developing more transparent and accountable LLMs.

Abstract: The recent rapid adoption of large language models (LLMs) highlights the critical need for benchmarking their fairness. Conventional fairness metrics, which focus on discrete accuracy-based evaluations (i.e., prediction correctness), fail to capture the implicit impact of model uncertainty (e.g., higher model confidence about one group over another despite similar accuracy). To address this limitation, we propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness that is more reflective of the internal bias in model decisions compared to conventional fairness measures. Furthermore, observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset with 31,756 samples for co-reference resolution, offering a more diverse and suitable dataset for evaluating modern LLMs. We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source LLMs. For example, Mistral-7B exhibits suboptimal fairness due to high confidence in incorrect predictions, a detail overlooked by Equalized Odds but captured by UCerF. Overall, our proposed LLM benchmark, which evaluates fairness with uncertainty awareness, paves the way for developing more transparent and accountable AI systems.

[181] Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws

Hidetaka Kamigaito,Ying Zhang,Jingun Kwon,Katsuhiko Hayashi,Manabu Okumura,Taro Watanabe

Main category: cs.CL

TL;DR: The paper studies the relationship between the number of Transformer layers and performance. Using a bias-diversity decomposition, it shows that diversity across layers matters, that adding layers improves performance only when the layers behave differently, and that the gains are submodular.

Details

Motivation: Transformers perform strongly across tasks, but the relationship between their internal mechanisms and parameter scaling laws remains unclear; the paper aims to bridge this gap.

Method: Analyze the behavior of Transformer layers through a bias-diversity decomposition, introduce an information-theoretic diversity, and validate the theory experimentally.

Result: Diversity across layers is critical to performance, especially when individual layers' outputs are far from the ground truth; adding layers helps only when the layers behave differently, and the gains are submodular.

Conclusion: The paper relates layer diversity to performance, giving theoretical support for parameter scaling laws, and experiments on multiple semantic-understanding tasks with various LLMs confirm the theoretical properties.

Abstract: Transformers deliver outstanding performance across a wide range of tasks and are now a dominant backbone architecture for large language models (LLMs). Their task-solving performance is improved by increasing parameter size, as shown in the recent studies on parameter scaling laws. Although recent mechanistic-interpretability studies have deepened our understanding of the internal behavior of Transformers by analyzing their residual stream, the relationship between these internal mechanisms and the parameter scaling laws remains unclear. To bridge this gap, we focus on layers and their size, which mainly decide the parameter size of Transformers. For this purpose, we first theoretically investigate the layers within the residual stream through a bias-diversity decomposition. The decomposition separates (i) bias, the error of each layer's output from the ground truth, and (ii) diversity, which indicates how much the outputs of each layer differ from each other. Analyzing Transformers under this theory reveals that performance improves when individual layers make predictions close to the correct answer and remain mutually diverse. We show that diversity becomes especially critical when individual layers' outputs are far from the ground truth. Finally, we introduce an information-theoretic diversity and show our main findings that adding layers enhances performance only when those layers behave differently, i.e., are diverse. We also reveal the performance gains from increasing the number of layers exhibit submodularity: marginal improvements diminish as additional layers increase, mirroring the logarithmic convergence predicted by the parameter scaling laws. Experiments on multiple semantic-understanding tasks with various LLMs empirically confirm the theoretical properties derived in this study.
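
To make the bias and diversity terms concrete, the sketch below checks the classic ambiguity decomposition for squared error: the error of an averaged prediction equals the mean individual error (bias) minus the mean spread around the average (diversity). This is an assumption-laden illustration only; the paper analyzes layers within the residual stream and introduces an information-theoretic diversity, so its exact decomposition may differ from this scalar, squared-loss version. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def ambiguity_decomposition(
    predictions: Sequence[float], target: float
) -> Tuple[float, float, float]:
    """Return (ensemble_error, mean_bias, mean_diversity) under squared loss.

    The identity ensemble_error == mean_bias - mean_diversity holds for any
    set of scalar predictions when the ensemble is their plain average.
    """
    count = len(predictions)
    if count == 0:
        raise ValueError("need at least one prediction")
    mean_pred = sum(predictions) / count
    ensemble_error = (mean_pred - target) ** 2
    # Bias term: how far each member is from the ground truth, on average.
    mean_bias = sum((p - target) ** 2 for p in predictions) / count
    # Diversity term: how far each member is from the ensemble average.
    mean_diversity = sum((p - mean_pred) ** 2 for p in predictions) / count
    return ensemble_error, mean_bias, mean_diversity


# Three hypothetical per-layer predictions, all far from the target 2.0.
err, bias, diversity = ambiguity_decomposition([2.5, 3.5, 4.5], target=2.0)
identity_gap = abs(err - (bias - diversity))
```

For a fixed bias, a larger diversity term lowers the combined error, which matches the abstract's claim that diversity matters most when individual outputs are far from the ground truth.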

[182] Large Language Model Meets Constraint Propagation

Alexandre Bonlarron,Florian Régin,Elisabetta De Maria,Jean-Charles Régin

Main category: cs.CL

TL;DR: GenCP combines LLM predictions with constraint programming (CP) reasoning; integrating masked language models (MLMs) enables bidirectional constraint propagation and improves constraint-aware text generation.

Details

Motivation: LLMs generate tokens sequentially without explicit control mechanisms, so they struggle to satisfy external constraints.

Method: Formulate text generation as a constraint satisfaction problem (CSP) and integrate MLMs for domain generation, enabling bidirectional constraint propagation.

Result: On COLLIE benchmarks, MLM-based domain preview significantly improves GenCP's performance, yielding more reliable, constraint-aware text.

Conclusion: MLM integration makes more efficient use of LLM inferences in GenCP and suits tasks with strict content constraints.

Abstract: Large Language Models (LLMs) excel at generating fluent text but struggle to enforce external constraints because they generate tokens sequentially without explicit control mechanisms. GenCP addresses this limitation by combining LLM predictions with Constraint Programming (CP) reasoning, formulating text generation as a Constraint Satisfaction Problem (CSP). In this paper, we improve GenCP by integrating Masked Language Models (MLMs) for domain generation, which allows bidirectional constraint propagation that leverages both past and future tokens. This integration bridges the gap between token-level prediction and structured constraint enforcement, leading to more reliable and constraint-aware text generation. Our evaluation on COLLIE benchmarks demonstrates that incorporating domain preview via MLM calls significantly improves GenCP's performance. Although this approach incurs additional MLM calls and, in some cases, increased backtracking, the overall effect is a more efficient use of LLM inferences and an enhanced ability to generate feasible and meaningful solutions, particularly in tasks with strict content constraints.
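
A toy sketch may help show what "text generation as a CSP" looks like in practice. Below, each word position is a variable, a stand-in function proposes the candidate domain for the next position, and backtracking search enforces a hard constraint on the total character count. Everything here is a simplified assumption: `propose_candidates` is a hypothetical lookup table standing in for an LLM, the pruning rule is a very basic form of propagation, and the MLM-based bidirectional domain preview described in the abstract is not modeled.

```python
from typing import List, Optional, Tuple


def propose_candidates(prefix: Tuple[str, ...]) -> List[str]:
    """Hypothetical stand-in for an LLM: ranked next-word candidates."""
    table = {
        (): ["the", "a"],
        ("the",): ["cat", "dog", "bird"],
        ("a",): ["cat", "bird"],
    }
    return table.get(prefix, ["sleeps", "runs", "sings"])


def generate(
    n_words: int, total_chars: int, prefix: Tuple[str, ...] = ()
) -> Optional[List[str]]:
    """Backtracking search for n_words whose letters sum to total_chars."""
    used = sum(len(word) for word in prefix)
    if len(prefix) == n_words:
        return list(prefix) if used == total_chars else None

    remaining = n_words - len(prefix)
    for word in propose_candidates(prefix):
        # Prune: every later position still needs at least one character.
        if used + len(word) + (remaining - 1) > total_chars:
            continue
        result = generate(n_words, total_chars, prefix + (word,))
        if result is not None:
            return result
    # No candidate satisfies the constraint, so the caller must backtrack.
    return None


sentence = generate(n_words=3, total_chars=11)
infeasible = generate(n_words=3, total_chars=4)
```

With this table, `sentence` is `["the", "cat", "sings"]`: the search first rejects "sleeps" by pruning and "runs" at the leaf before finding a word that meets the count. `infeasible` is `None` because no three-word combination fits in four characters.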

[183] BeaverTalk: Oregon State University's IWSLT 2025 Simultaneous Speech Translation System

Matthew Raffel,Victor Agostinelli,Lizhong Chen

Main category: cs.CL

TL;DR: BeaverTalk is a cascaded speech-to-text translation system that combines a VAD segmenter, Whisper Large V2, and Gemma 3 12B fine-tuned with LoRAs, submitted to the IWSLT 2025 simultaneous translation task.

Details

Motivation: Meet the need for low latency and high accuracy in simultaneous speech-to-text translation across several language directions.

Method: Segment the speech stream with a VAD segmenter, transcribe with Whisper Large V2, and translate with Gemma 3 12B, fine-tuned via LoRAs for a conversational prompting strategy that uses a single prior-sentence memory bank as context.

Result: The system achieves BLEU of 24.64 and 27.83 on English-to-German (StreamLAAL 1837.86 and 3343.73) and 34.07 and 37.23 on English-to-Chinese (StreamLAAL 2216.99 and 3521.35).

Conclusion: The results support the cascaded architecture and fine-tuning strategy for simultaneous translation in both the low and high latency regimes.

Abstract: This paper discusses the construction, fine-tuning, and deployment of BeaverTalk, a cascaded system for speech-to-text translation as part of the IWSLT 2025 simultaneous translation task. The system architecture employs a VAD segmenter for breaking a speech stream into segments, Whisper Large V2 for automatic speech recognition (ASR), and Gemma 3 12B for simultaneous translation. Regarding the simultaneous translation LLM, it is fine-tuned via low-rank adaptors (LoRAs) for a conversational prompting strategy that leverages a single prior-sentence memory bank from the source language as context. The cascaded system participated in the English$\rightarrow$German and English$\rightarrow$Chinese language directions for both the low and high latency regimes. In particular, on the English$\rightarrow$German task, the system achieves a BLEU of 24.64 and 27.83 at a StreamLAAL of 1837.86 and 3343.73, respectively. Then, on the English$\rightarrow$Chinese task, the system achieves a BLEU of 34.07 and 37.23 at a StreamLAAL of 2216.99 and 3521.35, respectively.

Kateryna Akhynko,Oleksandr Kosovan,Mykola Trokhymovych

Main category: cs.CL

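TL;DR: The paper presents one of the top-performing solutions to the UNLP 2025 shared task, which focuses on detecting rhetorical and stylistic manipulation techniques aimed at Ukrainian Telegram users.

Details

Motivation: The task is to detect and classify rhetorical and stylistic manipulation techniques used to influence Ukrainian Telegram users.

Method: For classification, fine-tune the Gemma 2 language model with LoRA adapters and apply a second-level classifier that uses meta-features and threshold optimization. For span detection, train an XLM-RoBERTa model with multiple targets, including token-level binary classification.

Result: The approach placed 2nd in classification and 3rd in span detection.

Conclusion: The rankings indicate that the approach is effective for detecting and classifying manipulation techniques.

Abstract: This paper presents one of the top-performing solutions to the UNLP 2025 Shared Task on Detecting Manipulation in Social Media. The task focuses on detecting and classifying rhetorical and stylistic manipulation techniques used to influence Ukrainian Telegram users. For the classification subtask, we fine-tuned the Gemma 2 language model with LoRA adapters and applied a second-level classifier leveraging meta-features and threshold optimization. For span detection, we employed an XLM-RoBERTa model trained for multi-target, including token binary classification. Our approach achieved 2nd place in classification and 3rd place in span detection.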

[185] The Surprising Soupability of Documents in State Space Models

Yasaman Jafari,Zixian Wang,Leon Bergen,Taylor Berg-Kirkpatrick

Main category: cs.CL

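TL;DR: The paper asks whether hidden states from structured state space models (SSMs) can be merged post hoc to support downstream reasoning. It proposes document souping, which pools independently encoded document representations into a single context state via simple operations such as averaging. This enables modular encoding and reuse without reprocessing the full input for each query, and the finetuned models support multi-hop QA, sparse retrieval, and long-document reasoning with strong accuracy.

Details

Motivation: Find an efficient way to merge independently encoded document representations for downstream tasks without reprocessing the input.

Method: Propose document souping, which pools independently encoded document representations via simple operations such as averaging, and finetune Mamba2 models to produce soupable representations.

Result: On HotpotQA, souping ten independently encoded documents nearly matches the performance of a cross-encoder trained on the same inputs.

Conclusion: Document souping is efficient, approaches cross-encoder performance, and applies to multi-hop QA, sparse retrieval, and long-document reasoning.

Abstract: We investigate whether hidden states from Structured State Space Models (SSMs) can be merged post-hoc to support downstream reasoning. Inspired by model souping, we propose a strategy where documents are encoded independently and their representations are pooled -- via simple operations like averaging -- into a single context state. This approach, which we call document souping, enables modular encoding and reuse without reprocessing the full input for each query. We finetune Mamba2 models to produce soupable representations and find that they support multi-hop QA, sparse retrieval, and long-document reasoning with strong accuracy. On HotpotQA, souping ten independently encoded documents nearly matches the performance of a cross-encoder trained on the same inputs.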
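
The pooling step in document souping is simple enough to sketch directly. The example below caches one state vector per document and averages the cached states for whichever documents a query needs. It is a minimal illustration under stated assumptions: real SSM states are tensors per layer rather than flat lists, `encode_document` is a hypothetical placeholder and not a Mamba2 call, and the finetuning that makes states soupable is outside the sketch.

```python
from typing import Dict, List, Sequence


def encode_document(text: str, dim: int = 4) -> List[float]:
    """Hypothetical placeholder encoder: a deterministic toy state vector."""
    return [float((len(text) * (i + 1)) % 7) for i in range(dim)]


def soup_states(states: Sequence[Sequence[float]]) -> List[float]:
    """Average independently computed state vectors into one context state."""
    if not states:
        raise ValueError("need at least one state to soup")
    dim = len(states[0])
    if any(len(state) != dim for state in states):
        raise ValueError("all states must share the same dimension")
    return [sum(state[i] for state in states) / len(states) for i in range(dim)]


# Encode each document once and reuse the cached states across queries.
corpus = {"doc_a": "alpha text", "doc_b": "a longer beta text", "doc_c": "gamma"}
cache: Dict[str, List[float]] = {
    name: encode_document(text) for name, text in corpus.items()
}

# Different queries pool different subsets without re-encoding anything.
context_ab = soup_states([cache["doc_a"], cache["doc_b"]])
context_all = soup_states(list(cache.values()))
```

The design point is that encoding cost is paid once per document, while each query only pays for a cheap average over the cached states it selects.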

[186] MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering

Yuexing Hao,Kumail Alhamoud,Hyewon Jeong,Haoran Zhang,Isha Puri,Philip Torr,Mike Schaekermann,Ariel D. Stern,Marzyeh Ghassemi

Main category: cs.CL

TL;DR: Using the MedPAIR dataset, the study compares how physician trainees and LLMs estimate the relevance of information in medical QA. LLM relevance estimates frequently disagree with those of physician trainees, and filtering out irrelevant information improves accuracy for both.

Details

Motivation: LLMs perform well on medical QA, but correct answers do not guarantee correct logic, so their agreement with physician trainees on information relevance needs to be checked.

Method: With MedPAIR, collect sentence-level relevance annotations on 1,300 QA pairs from 36 physician trainees, compare them with LLM relevance estimates, and measure the effect on downstream task performance.

Result: LLM relevance estimates are frequently not aligned with those of physician trainees; after filtering out sentences labeled irrelevant by the trainees, accuracy improves for both.

Conclusion: The study exposes a gap in how LLMs estimate information relevance in medical QA and shows that filtering irrelevant information helps both models and humans.

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks, including standardized medical exams. However, correct answers alone do not ensure correct logic, and models may reach accurate conclusions through flawed processes. In this study, we introduce the MedPAIR (Medical Dataset Comparing Physicians and AI Relevance Estimation and Question Answering) dataset to evaluate how physician trainees and LLMs prioritize relevant information when answering QA questions. We obtain annotations on 1,300 QA pairs from 36 physician trainees, labeling each sentence within the question components for relevance. We compare these relevance estimates to those for LLMs, and further evaluate the impact of these "relevant" subsets on downstream task performance for both physician trainees and LLMs. We find that LLMs are frequently not aligned with the content relevance estimates of physician trainees. After filtering out physician trainee-labeled irrelevant sentences, accuracy improves for both the trainees and the LLMs. All LLM and physician trainee-labeled data are available at: http://medpair.csail.mit.edu/.

[187] TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

Jiacheng Xie,Yang Yu,Ziyang Zhang,Shuai Zeng,Jiaxuan He,Ayush Vasireddy,Xiaoting Tang,Congyu Guo,Lening Zhao,Congcong Jing,Guanghui An,Dong Xu

Main category: cs.CL

TL;DR: TCM-Ladder is the first multimodal question-answering dataset for evaluating large language models for Traditional Chinese Medicine (TCM); it spans core TCM disciplines with 52,000+ questions, and the paper also proposes Ladder-Score, an evaluation method designed for TCM question answering.

Details

Motivation: Existing evaluation datasets are limited in scope and mostly text-based, with no unified multimodal benchmark for objectively assessing TCM LLMs on real-world tasks.

Method: Build TCM-Ladder with a combination of automated and manual filtering, covering multiple question types and modalities (text, images, videos), and train a reasoning model for comparative experiments.

Result: Nine general-domain and five TCM-specific LLMs are evaluated on TCM-Ladder, and the Ladder-Score evaluation method is proposed.

Conclusion: TCM-Ladder provides the first unified multimodal benchmark for TCM LLMs; the datasets and leaderboard are publicly available.

Abstract: Traditional Chinese Medicine (TCM), as an effective alternative medicine, has been receiving increasing attention. In recent years, the rapid development of large language models (LLMs) tailored for TCM has underscored the need for an objective and comprehensive evaluation framework to assess their performance on real-world tasks. However, existing evaluation datasets are limited in scope and primarily text-based, lacking a unified and standardized multimodal question-answering (QA) benchmark. To address this issue, we introduce TCM-Ladder, the first multimodal QA dataset specifically designed for evaluating large TCM language models. The dataset spans multiple core disciplines of TCM, including fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics. In addition to textual content, TCM-Ladder incorporates various modalities such as images and videos. The datasets were constructed using a combination of automated and manual filtering processes and comprise 52,000+ questions in total. These questions include single-choice, multiple-choice, fill-in-the-blank, diagnostic dialogue, and visual comprehension tasks. We trained a reasoning model on TCM-Ladder and conducted comparative experiments against 9 state-of-the-art general domain and 5 leading TCM-specific LLMs to evaluate their performance on the datasets. Moreover, we propose Ladder-Score, an evaluation method specifically designed for TCM question answering that effectively assesses answer quality regarding terminology usage and semantic expression. To our knowledge, this is the first work to evaluate mainstream general domain and TCM-specific LLMs on a unified multimodal benchmark. The datasets and leaderboard are publicly available at https://tcmladder.com or https://54.211.107.106 and will be continuously updated.

[188] HardTests: Synthesizing High-Quality Test Cases for LLM Coding

Zhongmou He,Yee Man Choi,Kexun Zhang,Jiabao Ji,Junting Zhou,Dejia Xu,Ivan Bercovich,Aidan Zhang,Lei Li

Main category: cs.CL

TL;DR: HARDTESTGEN is a pipeline that uses large language models (LLMs) to synthesize high-quality tests, addressing the difficulty of obtaining reliable verifiers for hard coding problems.

Details

Motivation: Verifiers are crucial to LLM reasoning, but reliable tests are hard to obtain for difficult coding problems because a well-disguised wrong solution may only be caught by carefully written edge cases.

Method: Propose the HARDTESTGEN pipeline for LLM-based test synthesis and use it to curate HARDTESTS, a competitive programming dataset with 47k problems.

Result: When evaluating LLM-generated code, HARDTESTGEN tests achieve precision 11.3 percentage points higher and recall 17.5 percentage points higher than existing tests; on harder problems the precision gain can reach 40 points.

Conclusion: HARDTESTS is also more effective for model training, as measured by downstream code generation performance, and the dataset and synthesis pipeline will be open-sourced.

Abstract: Verifiers play a crucial role in large language model (LLM) reasoning, needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to get for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate a comprehensive competitive programming dataset HARDTESTS with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.
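
The precision and recall figures treat a test suite as a binary classifier over candidate programs. A minimal sketch of that bookkeeping follows, assuming precision is the share of test-passing programs that are truly correct and recall is the share of truly correct programs that pass; the abstract does not spell out these definitions, so treat them, and the function name, as assumptions.

```python
from typing import Sequence, Tuple


def verifier_precision_recall(
    passed_tests: Sequence[bool], truly_correct: Sequence[bool]
) -> Tuple[float, float]:
    """Score a test suite as a classifier of program correctness.

    passed_tests[i]  : the suite accepted candidate program i
    truly_correct[i] : an oracle (e.g. an official judge) says i is correct
    """
    if len(passed_tests) != len(truly_correct):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(passed_tests, truly_correct))
    true_pos = sum(1 for passed, ok in pairs if passed and ok)
    false_pos = sum(1 for passed, ok in pairs if passed and not ok)
    false_neg = sum(1 for passed, ok in pairs if not passed and ok)

    # Weak tests let wrong programs through, which lowers precision.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Invalid or over-strict tests reject correct programs, lowering recall.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Four candidates: one wrong program slips through, one correct one is rejected.
precision, recall = verifier_precision_recall(
    passed_tests=[True, True, True, False],
    truly_correct=[True, True, False, True],
)
```

In this toy case both values are 2/3. Under these definitions, a precision gain means fewer wrong programs are accepted, which is the property that matters when test verdicts are used as reinforcement learning rewards.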

[189] Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning

Jiacheng Lin,Zhenbang Wu,Jimeng Sun

Main category: cs.CL

TL;DR: EHRMIND is a recipe for adapting large language models (LLMs) to clinical reasoning tasks using reinforcement learning with verifiable rewards (RLVR), addressing misapplied and missing knowledge.

Details

Motivation: Interpreting electronic health records (EHRs) requires specialized knowledge and complex reasoning, which makes applying RLVR in healthcare challenging.

Method: A two-stage solution: a lightweight supervised fine-tuning (SFT) warm-up injects missing domain knowledge and stabilizes training, followed by RLVR to refine decision-making.

Result: Across clinical tasks (MEDCALC, TREC CLINICAL TRIALS, EHRSHOT), EHRMIND delivers consistent gains in accuracy, interpretability, and cross-task generalization.

Conclusion: EHRMIND offers practical guidance for applying RLVR in healthcare and strengthens LLM clinical reasoning.

Abstract: We present EHRMIND, a practical recipe for adapting large language models (LLMs) to complex clinical reasoning tasks using reinforcement learning with verifiable rewards (RLVR). While RLVR has succeeded in mathematics and coding, its application to healthcare contexts presents unique challenges due to the specialized knowledge and reasoning required for electronic health record (EHR) interpretation. Our pilot study on the MEDCALC benchmark reveals two key failure modes: (1) misapplied knowledge, where models possess relevant medical knowledge but apply it incorrectly, and (2) missing knowledge, where models lack essential domain knowledge. To address these cases, EHRMIND applies a two-stage solution: a lightweight supervised fine-tuning (SFT) warm-up that injects missing domain knowledge, stabilizes subsequent training, and encourages structured, interpretable outputs; followed by RLVR, which reinforces outcome correctness and refines the model's decision-making. We demonstrate the effectiveness of our method across diverse clinical applications, including medical calculations (MEDCALC), patient-trial matching (TREC CLINICAL TRIALS), and disease diagnosis (EHRSHOT). EHRMIND delivers consistent gains in accuracy, interpretability, and cross-task generalization. These findings offer practical guidance for applying RLVR to enhance LLM capabilities in healthcare settings.

[190] The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It

Zheng-Xin Yong,Beyza Ermis,Marzieh Fadaee,Stephen H. Bach,Julia Kreutzer

Main category: cs.CL

TL;DR: The paper analyzes the linguistic diversity of LLM safety research, shows that the field is English-centric and that non-English languages are severely under-studied, and makes recommendations for future multilingual safety research.

Details

Motivation: Expose the language gap in LLM safety research and encourage more inclusive multilingual safety work.

Method: Systematically review nearly 300 publications from 2020--2024 across *ACL conferences and workshops.

Result: Non-English languages receive minimal attention, and English safety research shows poor language documentation practice.

Conclusion: The paper proposes future directions for multilingual safety research to support more inclusive AI safety practices.

Abstract: This paper presents a comprehensive analysis of the linguistic diversity of LLM safety research, highlighting the English-centric nature of the field. Through a systematic review of nearly 300 publications from 2020--2024 across major NLP conferences and workshops at *ACL, we identify a significant and growing language gap in LLM safety research, with even high-resource non-English languages receiving minimal attention. We further observe that non-English languages are rarely studied as a standalone language and that English safety research exhibits poor language documentation practice. To motivate future research into multilingual safety, we make several recommendations based on our survey, and we then pose three concrete future directions on safety evaluation, training data generation, and crosslingual safety generalization. Based on our survey and proposed directions, the field can develop more robust, inclusive AI safety practices for diverse global populations.

[191] R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration

Zefan Cai,Wen Xiao,Hanshi Sun,Cheng Luo,Yikai Zhang,Ke Wan,Yucheng Li,Yeyang Zhou,Li-Wen Chang,Jiuxiang Gu,Zhen Dong,Anima Anandkumar,Abedelkadir Asi,Junjie Hu

Main category: cs.CL

TL;DR: R-KV is a KV cache compression method that targets redundant tokens in reasoning models, substantially reducing memory use while preserving performance and raising throughput.

Details

Motivation: Reasoning models perform well at self-reflection and chain-of-thought reasoning, but their long outputs produce very large KV caches, and existing compression methods can cause reasoning failures.

Method: Propose R-KV, which compresses redundant tokens in reasoning models to shrink the KV cache while preserving performance.

Result: R-KV preserves nearly 100% of full-cache performance with only 10% of the KV cache, giving a 90% memory saving and 6.6X throughput.

Conclusion: R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.

Abstract: Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV-cache reduction also leads to a 90% memory saving and a 6.6X throughput over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.

[192] CrossICL: Cross-Task In-Context Learning via Unsupervised Demonstration Transfer

Jinglong Gao,Xiao Ding,Lingxiao Zou,Bing Qin,Ting Liu

Main category: cs.CL

TL;DR: The paper proposes CrossICL, a new in-context learning (ICL) paradigm that uses demonstrations from source tasks to improve performance on target tasks without additional manual annotation.

Details

Motivation: ICL relies on manually provided demonstrations, which users are often unwilling or unable to supply. Inspired by human analogy, the paper studies how source-task demonstrations can guide target tasks.

Method: Design a two-stage alignment strategy to mitigate interference caused by gaps across tasks, and run experiments on 875 NLP tasks from the Super-NI benchmark with six types of LLMs, including GPT-4o.

Result: Experiments demonstrate the effectiveness of CrossICL and shed light on the criteria for selecting cross-task demonstrations and the types of task-gap-induced interference.

Conclusion: CrossICL offers an ICL approach that needs no manual annotation and yields practical guidelines for cross-task demonstrations.

Abstract: In-Context Learning (ICL) enhances the performance of large language models (LLMs) with demonstrations. However, obtaining these demonstrations primarily relies on manual effort. In most real-world scenarios, users are often unwilling or unable to provide such demonstrations. Inspired by the human analogy, we explore a new ICL paradigm CrossICL to study how to utilize existing source task demonstrations in the ICL for target tasks, thereby obtaining reliable guidance without any additional manual effort. To explore this, we first design a two-stage alignment strategy to mitigate the interference caused by gaps across tasks, as the foundation for our experimental exploration. Based on it, we conduct comprehensive exploration of CrossICL, with 875 NLP tasks from the Super-NI benchmark and six types of LLMs, including GPT-4o. Experimental results demonstrate the effectiveness of CrossICL and provide valuable insights on questions like the criteria for selecting cross-task demonstrations, as well as the types of task-gap-induced interference in CrossICL.

[193] Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability

Chiwei Zhu,Benfeng Xu,An Yang,Junyang Lin,Quan Wang,Chang Zhou,Zhendong Mao

Main category: cs.CL

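TL;DR: The study finds that training language models with rationale augmentation does not always improve performance: rationales can sometimes degrade performance or improve reliability, and the performance and reliability improvements are linearly related.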

Details

Motivation: Examine the actual effect of rationale augmentation on language model performance and reliability, challenging the prevailing view. Method: Conducts a comprehensive investigation of how rationales affect model performance and reliability, and introduces the perspective of intrinsic task difficulty. Result: 1) Rationales can degrade model performance; 2) rationales can improve model reliability; 3) performance and reliability improvements are linearly related, and both are driven by task difficulty. Conclusion: The study offers guidance on the broad use of rationales in language models and raises important implications for explicitly aligning models with human thought. Abstract: Training language models with rationales augmentation has been shown to be beneficial in many existing works. In this paper, we identify that such a prevailing view does not hold consistently. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance as well as a novel perspective of model reliability. The results lead to several key findings that add new insights upon existing understandings: 1) Rationales can, at times, deteriorate model performance; 2) Rationales can, at times, improve model reliability, even outperforming their untrained counterparts; 3) A linear correspondence exists between the performance and reliability improvements, while both are driven by the intrinsic difficulty of the task. These findings provide informative regulations on the broad utilization of rationales and raise critical implications on the procedure of explicitly aligning language models with implicit human thoughts. Codes can be found at https://github.com/Ignoramus0817/rationales.

[194] LKD-KGC: Domain-Specific KG Construction via LLM-driven Knowledge Dependency Parsing

Jiaqi Sun,Shiyou Qian,Zhangchi Han,Wei Li,Zelin Qian,Dingyu Yang,Jian Cao,Guangtao Xue

Main category: cs.CL

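TL;DR: LKD-KGC is an unsupervised framework for domain-specific knowledge graph construction. It analyzes document repositories to infer knowledge dependencies, determines the processing order through LLM-driven prioritization, and automatically generates entity schemas, improving precision and recall by 10% to 20% over baselines.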

Details

Motivation: Manually building domain-specific knowledge graphs is inefficient and requires specialist knowledge. Existing LLM-based methods are limited by manually defined schemas, single-document processing, and public-domain references, so they handle complex knowledge dependencies and domain-specific needs poorly. Method: LKD-KGC analyzes document repositories to infer knowledge dependencies, determines the processing order through LLM-driven prioritization, automatically generates entity schemas, and uses them to extract entities and relationships without supervision. Result: Experiments show that LKD-KGC improves precision and recall by 10% to 20% over existing baselines. Conclusion: LKD-KGC shows clear potential for constructing high-quality domain-specific knowledge graphs and addresses the limitations of existing methods. Abstract: Knowledge Graphs (KGs) structure real-world entities and their relationships into triples, enhancing machine reasoning for various tasks. While domain-specific KGs offer substantial benefits, their manual construction is often inefficient and requires specialized knowledge. Recent approaches for knowledge graph construction (KGC) based on large language models (LLMs), such as schema-guided KGC and reference knowledge integration, have proven efficient. However, these methods are constrained by their reliance on manually defined schema, single-document processing, and public-domain references, making them less effective for domain-specific corpora that exhibit complex knowledge dependencies and specificity, as well as limited reference knowledge. To address these challenges, we propose LKD-KGC, a novel framework for unsupervised domain-specific KG construction. LKD-KGC autonomously analyzes document repositories to infer knowledge dependencies, determines optimal processing sequences via LLM-driven prioritization, and autoregressively generates entity schema by integrating hierarchical inter-document contexts. This schema guides the unsupervised extraction of entities and relationships, eliminating reliance on predefined structures or external knowledge. Extensive experiments show that compared with state-of-the-art baselines, LKD-KGC generally achieves improvements of 10% to 20% in both precision and recall rate, demonstrating its potential in constructing high-quality domain-specific KGs.

[195] Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models

Shilin Xu,Yanwei Li,Rui Yang,Tao Zhang,Yueyi Sun,Wei Chow,Linfeng Li,Hang Song,Qi Xu,Yunhai Tong,Xiangtai Li,Hao Fei

Main category: cs.CL

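TL;DR: The paper proposes Mixed-R1, a framework that combines a mixed reward function design with a mixed dataset (Mixed-45K) to enable stable reinforcement learning over multi-source MLLM tasks.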

Details

Motivation: Existing work focuses on a single task type (such as math problems or chart analysis) and lacks a unified solution for multi-source MLLM tasks. Method: Designs a data engine to build the Mixed-45K dataset, and proposes Mixed-Reward with four reward functions (matching reward, chart reward, IoU reward, and the open-ended reward BMAS). Result: Experiments show that Mixed-R1 is effective on several MLLMs (such as Qwen2.5-VL and Intern-VL). Conclusion: Mixed-R1 provides a unified reinforcement learning framework for multi-source MLLM tasks with broad applicability. Abstract: Recent works on large language models (LLMs) have successfully demonstrated the emergence of reasoning capabilities via reinforcement learning (RL). Although recent efforts leverage group relative policy optimization (GRPO) for MLLMs post-training, they constantly explore one specific aspect, such as grounding tasks, math problems, or chart analysis. There are no works that can leverage multi-source MLLM tasks for stable reinforcement learning. In this work, we present a unified perspective to solve this problem. We present Mixed-R1, a unified yet straightforward framework that contains a mixed reward function design (Mixed-Reward) and a mixed post-training dataset (Mixed-45K). We first design a data engine to select high-quality examples to build the Mixed-45K post-training dataset. Then, we present a Mixed-Reward design, which contains various reward functions for various MLLM tasks. In particular, it has four different reward functions: matching reward for binary answer or multiple-choice problems, chart reward for chart-aware datasets, IoU reward for grounding problems, and open-ended reward for long-form text responses such as caption datasets. To handle the various long-form text content, we propose a new open-ended reward named Bidirectional Max-Average Similarity (BMAS) by leveraging tokenizer embedding matching between the generated response and the ground truth. Extensive experiments show the effectiveness of our proposed method on various MLLMs, including Qwen2.5-VL and Intern-VL on various sizes. Our dataset and model are available at https://github.com/xushilin1/mixed-r1.
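
The abstract names BMAS but does not give its formula. The sketch below is one plausible reading of "Bidirectional Max-Average Similarity", offered as an assumption rather than the authors' definition: for each token embedding on one side, take the maximum cosine similarity against the other side, average those maxima, and then average the two directions. The function name `bmas` and the use of plain NumPy arrays as stand-ins for tokenizer embeddings are hypothetical, and rows are assumed to be non-zero vectors.

```python
import numpy as np


def bmas(resp_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Illustrative bidirectional max-average cosine similarity.

    resp_emb: (n_response_tokens, dim) embeddings of the generated response.
    ref_emb:  (n_reference_tokens, dim) embeddings of the ground truth.
    Assumption: every row is a non-zero vector.
    """
    # Normalize rows so that dot products are cosine similarities.
    a = resp_emb / np.linalg.norm(resp_emb, axis=1, keepdims=True)
    b = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = a @ b.T  # shape: (response tokens, reference tokens)

    # Direction 1: best reference match for each response token, averaged.
    resp_to_ref = sim.max(axis=1).mean()
    # Direction 2: best response match for each reference token, averaged.
    ref_to_resp = sim.max(axis=0).mean()
    return float((resp_to_ref + ref_to_resp) / 2.0)


if __name__ == "__main__":
    response = np.array([[1.0, 0.0]])
    reference = np.array([[1.0, 0.0], [0.0, 1.0]])
    # The response covers only one of the two reference tokens.
    print(bmas(response, reference))
```

Scoring in both directions penalizes a response that matches only part of the reference as well as one that adds unrelated content, which a one-directional maximum would miss.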

[196] Tag-Evol: Achieving Efficient Instruction Evolving via Tag Injection

Yixuan Wang,Shiqi Zhou,Chuanzhe Guo,Qingfu Zhu

Main category: cs.CL

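TL;DR: Tag-Evol is an instruction evolving method based on knowledge tags. By injecting diverse tag combinations it achieves efficient, controllable data synthesis and significantly outperforms existing methods.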

Details

Motivation: Existing data synthesis methods rely on fixed strategies that must be designed by hand and are monolithic in form, and iterative evolution makes hard samples expensive to obtain. Method: Tag-Evol uses diverse knowledge tags as strategies and achieves controlled instruction evolution by injecting different tag combinations. Result: Experiments show that the evolved data generated by Tag-Evol is significantly better than that of other methods, and that the method is more efficient while producing more diverse and challenging data. Conclusion: Tag-Evol is an efficient and diverse instruction evolving framework that generates higher-quality data. Abstract: Evol-Instruct has made significant improvements as a data synthesis method in several areas. Existing methods typically rely on a fixed set of strategies to evolve, which require manual design and are monolithic in form. In addition, iterative evolution also makes the acquisition of hard samples expensive. In view of this, we propose the Tag-Evol framework, a more diverse and efficient instruction evolving method. Specifically, Tag-Evol uses diverse and specific knowledge tags as strategies to achieve controlled evolution by injecting different combinations of tags into the original instructions. Experiments with multiple backbones in diverse domain benchmarks show that the proposed method generates significantly better evolved data than other methods. Furthermore, we conduct a thorough analysis of the evolved data, demonstrating that Tag-Evol is not only efficient but also generates more diverse and challenging data.

[197] Adaptive LoRA Merge with Parameter Pruning for Low-Resource Generation

Ryota Miyano,Yuki Arase

Main category: cs.CL

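TL;DR: Proposes a simple yet effective LoRA merge method for low-resource language generation tasks that updates and prunes LoRA parameters through fine-tuning, significantly improving task adaptability.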

Details

Motivation: Existing LoRA merge methods have limited adaptability because they keep the LoRA parameters frozen, and they do not address the low-resource setting. Method: Proposes a method that updates and prunes LoRA parameters through fine-tuning, using a small amount of target-task data for finer-grained adjustment. Result: In summarization experiments across several domains and two languages (English and Japanese), the new method significantly outperforms existing methods. Conclusion: By dynamically adjusting LoRA parameters, the method effectively improves adaptability on low-resource language generation tasks. Abstract: This study proposes a simple yet effective LoRA merge method to achieve LLM adaptation for low-resource language generation tasks. The LoRA merge technique, which integrates multiple LoRA modules trained on different tasks, has gained attention as an effective and efficient approach for adapting LLMs to target tasks. However, previous methods are limited in adaptability as they keep the LoRA parameters frozen. Additionally, the low-resource problem has been out of their scope. We propose a LoRA merge method that updates and prunes LoRA parameters through fine-tuning with minimal target task data, which allows finer-grained adjustments of LoRA parameters and enhancement of task adaptability. Extensive experiments have been conducted taking summarization as a benchmark task. Our datasets cover various domains and multiple languages of English and Japanese. The results confirm that the proposed method achieves significant and consistent improvements in task adaptability over the previous methods.

[198] Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models

Mikhail L. Arbuzov,Alexey A. Shvets,Sisong Beir

Main category: cs.CL

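TL;DR: The paper challenges the conventional view, arguing that LLM errors are concentrated at a small number of key tokens rather than uniformly distributed, and proposes a new framework to improve long-sequence performance.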

Details

Motivation: The conventional assumption is that LLM reliability decays exponentially with sequence length, but evidence indicates that errors are concentrated at a small number of key tokens, which challenges this view. Method: Distinguishes high-impact tokens from predictable ones, proposes a new reliability formula, and designs a framework that allocates computation dynamically. Result: Long-context performance depends on key semantic decision points rather than uniform token-level accuracy, and the targeted strategies significantly outperform brute-force approaches. Conclusion: The paper proposes a framework that shifts from raw scaling to strategic reasoning, opening a path toward more efficient and more powerful language systems. Abstract: The prevailing assumption of an exponential decay in large language model (LLM) reliability with sequence length, predicated on independent per-token error probabilities, posits an inherent limitation for long autoregressive outputs. Our research fundamentally challenges this view by synthesizing emerging evidence that LLM errors are not uniformly distributed but are concentrated at sparse "key tokens" (5-10% of total tokens) representing critical decision junctions. By distinguishing these high-impact tokens from the increasingly predictable majority, we introduce a new reliability formula explaining the sustained coherence of modern LLMs over thousands of tokens. Converging research streams reveal that long-context performance primarily depends on accurately navigating a few crucial semantic decision points rather than on uniform token-level accuracy, enabling targeted strategies that significantly outperform brute-force approaches. We thus propose a framework for next-generation systems centered on selective preservation of semantically vital tokens, dynamic computational allocation at uncertain decision boundaries, multi-path exploration at ambiguities, and architectures aligned with natural semantic domains. This marks a fundamental shift from raw scaling to strategic reasoning, promising breakthrough performance without proportionate computational scaling and offering a more nuanced understanding that supersedes the exponential decay hypothesis, thereby opening pathways toward substantially more powerful and efficient language systems.
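
To make the contrast concrete: under the assumption the abstract criticizes (independent, identical per-token error probability e), the chance that an n-token output contains no error is (1 - e)^n. The sketch below compares that baseline with a deliberately simplified key-token variant in which only a fraction of tokens can fail. The variant is an illustrative assumption, not the reliability formula from the paper (which the abstract does not state), and both function names are hypothetical.

```python
def uniform_reliability(per_token_error: float, length: int) -> float:
    """P(no error) when every token fails independently with the same probability."""
    return (1.0 - per_token_error) ** length


def key_token_reliability(key_error: float, key_fraction: float, length: int) -> float:
    """Simplified variant: only key tokens can fail; all other tokens are assumed safe.

    This is an illustrative assumption, not the paper's formula.
    """
    n_key = round(key_fraction * length)
    return (1.0 - key_error) ** n_key


if __name__ == "__main__":
    for n in (100, 1000, 5000):
        uniform = uniform_reliability(0.001, n)
        # Same error rate, but applied only to 10% of tokens (upper end of the 5-10% range).
        sparse = key_token_reliability(0.001, 0.10, n)
        print(f"n={n:5d}  uniform={uniform:.4f}  key-token={sparse:.4f}")
```

Because the exponent shrinks from n to the number of key tokens, the sparse model decays far more slowly with length, which is the qualitative point the abstract makes about sustained coherence over thousands of tokens.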

[199] CLaSp: In-Context Layer Skip for Self-Speculative Decoding

Longze Chen,Renke Shan,Huiming Wang,Lu Wang,Ziqiang Liu,Run Luo,Jiawei Wang,Hamid Alinejad-Rokny,Min Yang

Main category: cs.CL

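TL;DR: CLaSp is a self-speculative decoding method that needs no extra training or modules. It builds a compressed draft model by skipping intermediate layers, dynamically optimizes the layer-skipping strategy, and achieves a 1.3x to 1.7x speedup.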

Details

Motivation: Existing speculative decoding methods require additional trained modules and are hard to keep compatible across different LLMs; CLaSp aims to solve this problem. Method: Uses an in-context layer-skipping strategy in which a dynamic programming algorithm optimizes the layer-skipping process, with the complete hidden states from the last verification stage as the objective. Result: Achieves a 1.3x to 1.7x speedup on LLaMA3 series models without altering the original distribution of the generated text. Conclusion: CLaSp is an efficient and highly compatible self-speculative decoding method applicable to a range of LLMs. Abstract: Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verify model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and ensure compatibility across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp does not require additional drafting modules or extra training. Instead, it employs a plug-and-play mechanism by skipping intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by leveraging the complete hidden states from the last verification stage as an objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3x ~ 1.7x on LLaMA3 series models without altering the original distribution of the generated text.

[200] Intuitionistic Fuzzy Sets for Large Language Model Data Annotation: A Novel Approach to Side-by-Side Preference Labeling

Yimin Du

Main category: cs.CL

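TL;DR: The paper proposes a new framework based on intuitionistic fuzzy sets (IFS) for modeling and aggregating human preference data. It addresses the uncertainty and complexity of traditional annotation methods and significantly improves annotation quality and model performance.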

Details

Motivation: Traditional human preference annotation methods (such as SBS) suffer from uncertainty, annotator disagreement, and complex judgments, which degrade the quality of large language model (LLM) training and evaluation. Method: Proposes an IFS-based annotation protocol that models preferences through membership, non-membership, and hesitation degrees, and develops aggregation methods and quality metrics. Result: Experiments show that the IFS approach significantly improves annotation consistency, reduces annotator fatigue, and yields higher-quality data, with a 12.3% improvement in downstream win-rate and a 15.7% reduction in annotation time. Conclusion: The IFS framework provides a principled basis for handling uncertainty in human preference annotation and offers practical benefits for large-scale LLM training. Abstract: The quality of human preference data is crucial for training and evaluating large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) scenarios. Traditional side-by-side (SBS) annotation approaches often struggle with inherent uncertainty, annotator disagreement, and the complexity of preference judgments. This paper introduces a novel framework based on intuitionistic fuzzy sets (IFS) for modeling and aggregating human preferences in LLM data annotation tasks. Our approach captures not only the degree of preference but also the uncertainty and hesitation inherent in human judgment through membership, non-membership, and hesitation degrees. We propose an IFS-based annotation protocol that enables more nuanced preference modeling, develops aggregation methods for handling annotator disagreement, and introduces quality metrics for preference data assessment. Experimental validation on multiple datasets demonstrates that our IFS-based approach significantly improves annotation consistency, reduces annotator fatigue, and produces higher-quality preference data compared to traditional binary and Likert-scale methods. The resulting preference datasets lead to improved model performance in downstream tasks, with 12.3% improvement in win-rate against baseline models and 15.7% reduction in annotation time. Our framework provides a principled approach to handling uncertainty in human preference annotation and offers practical benefits for large-scale LLM training.
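
As a minimal sketch of the three degrees mentioned above: in the standard intuitionistic fuzzy set definition, a judgment has a membership degree μ and a non-membership degree ν with μ + ν ≤ 1, and the hesitation degree is π = 1 - μ - ν. The class name `IFSJudgment`, the reading of μ as "response A is preferred", and the simple arithmetic-mean aggregation are assumptions made for illustration; the abstract does not specify the paper's aggregation operators.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IFSJudgment:
    """One annotator's side-by-side preference as an intuitionistic fuzzy value."""

    membership: float      # degree to which response A is preferred (assumed reading)
    non_membership: float  # degree to which response A is not preferred

    def __post_init__(self) -> None:
        mu, nu = self.membership, self.non_membership
        # Standard IFS constraint: both degrees in [0, 1] and mu + nu <= 1.
        if not (0.0 <= mu <= 1.0 and 0.0 <= nu <= 1.0 and mu + nu <= 1.0 + 1e-9):
            raise ValueError("need 0 <= mu, nu <= 1 and mu + nu <= 1")

    @property
    def hesitation(self) -> float:
        """Hesitation degree: the part of the judgment left undecided."""
        return 1.0 - self.membership - self.non_membership


def aggregate_mean(judgments: Sequence[IFSJudgment]) -> IFSJudgment:
    """Hypothetical aggregation: arithmetic mean of each degree across annotators."""
    n = len(judgments)
    mu = sum(j.membership for j in judgments) / n
    nu = sum(j.non_membership for j in judgments) / n
    return IFSJudgment(mu, nu)


if __name__ == "__main__":
    votes = [IFSJudgment(0.6, 0.3), IFSJudgment(0.8, 0.1), IFSJudgment(0.4, 0.2)]
    combined = aggregate_mean(votes)
    print(combined, round(combined.hesitation, 3))
```

The mean preserves the constraint μ + ν ≤ 1, so the aggregate is itself a valid intuitionistic fuzzy value, and its hesitation degree gives a rough signal of how undecided the annotators were on that pair.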

[201] Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?

Jiwan Chung,Janghan Yoon,Junhyeong Park,Sangeyl Lee,Joowon Yang,Sooyeon Park,Youngjae Yu

Main category: cs.CL

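TL;DR: The paper asks whether multimodal generative models truly achieve cross-modal consistency. Evaluating with the ACON dataset and three consistency criteria, it finds that unified models do not consistently outperform specialized models in pointwise evaluations, but show weak consistency under structured analyses.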

Details

Motivation: Determine whether multimodal generative models truly achieve cross-modal consistency, or whether that coherence is merely perceived. Method: Introduces the ACON dataset (1,000 images, 500 of them newly contributed) and evaluates cross-modal consistency with three criteria: cyclic consistency, forward equivariance, and conjugated equivariance. Result: Unified models do not consistently outperform specialized models in pointwise evaluations, but show weak consistency under structured analyses. Conclusion: The cross-modal consistency of multimodal generative models is limited and needs further research. Abstract: Any-to-any generative models aim to enable seamless interpretation and generation across multiple modalities within a unified framework, yet their ability to preserve relationships across modalities remains uncertain. Do unified models truly achieve cross-modal coherence, or is this coherence merely perceived? To explore this, we introduce ACON, a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs to evaluate cross-modal transfers rigorously. Using three consistency criteria (cyclic consistency, forward equivariance, and conjugated equivariance), our experiments reveal that any-to-any models do not consistently demonstrate greater cross-modal consistency than specialized models in pointwise evaluations such as cyclic consistency. However, equivariance evaluations uncover weak but observable consistency through structured analyses of the intermediate latent space enabled by multiple editing operations. We release our code and data at https://github.com/JiwanChung/ACON.

[202] Semi-structured LLM Reasoners Can Be Rigorously Audited

Jixuan Leng,Cassandra A. Cohen,Zhixian Zhang,Chenyan Xiong,William W. Cohen

Main category: cs.CL

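TL;DR: The paper proposes Semi-Structured Reasoning Models (SSRMs), which internalize a semi-structured chain-of-thought (CoT) reasoning format and generate reasoning traces in a Python-like syntax, making LLM reasoning more faithful and easier to analyze.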

Details

Motivation: Address the "faithfulness" problem in large language model (LLM) reasoning: errors and omissions in reasoning traces are hard to detect and may obscure biases in model outputs. Method: Introduces SSRMs, which use a semi-structured CoT reasoning format to generate traces with a task-specific vocabulary that marks the inputs and outputs of each step. Result: Across ten benchmarks, SSRMs perform strongly, beating comparably sized baselines by nearly ten percentage points on in-domain tasks while remaining competitive on out-of-domain medical benchmarks. Conclusion: Semi-structured reasoning is easier to analyze: hand-crafted and learned audits can automatically detect reasoning flaws and effectively flag probable reasoning errors. Abstract: As Large Language Models (LLMs) become increasingly capable at reasoning, the problem of "faithfulness" persists: LLM "reasoning traces" can contain errors and omissions that are difficult to detect, and may obscure biases in model outputs. To address these limitations, we introduce Semi-Structured Reasoning Models (SSRMs), which internalize a semi-structured Chain-of-Thought (CoT) reasoning format within the model. Our SSRMs generate reasoning traces in a Pythonic syntax. While SSRM traces are not executable, they adopt a restricted, task-specific vocabulary to name distinct reasoning steps, and to mark each step's inputs and outputs. Through extensive evaluation on ten benchmarks, SSRMs demonstrate strong performance and generality: they outperform comparably sized baselines by nearly ten percentage points on in-domain tasks while remaining competitive with specialized models on out-of-domain medical benchmarks. Furthermore, we show that semi-structured reasoning is more amenable to analysis: in particular, they can be automatically audited to identify reasoning flaws. We explore both hand-crafted structured audits, which detect task-specific problematic reasoning patterns, and learned typicality audits, which apply probabilistic models over reasoning patterns, and show that both audits can be used to effectively flag probable reasoning errors.

[203] ERU-KG: Efficient Reference-aligned Unsupervised Keyphrase Generation

Lam Thanh Do,Aaditya Bodke,Pritom Saha Akash,Kevin Chen-Chuan Chang

Main category: cs.CL

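TL;DR: ERU-KG is an unsupervised keyphrase generation model that addresses the shortcomings of existing methods with an informativeness module and a phraseness module, outperforming baselines in both quality and efficiency.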

Details

Motivation: Existing unsupervised keyphrase prediction methods rely on heuristically defined importance scores, which leads to inaccurate informativeness estimates, and they do not consider time efficiency. Method: ERU-KG consists of an informativeness module, which estimates the relevance of keyphrase candidates through reference-based and term-level modeling, and a phraseness module, which generates the candidates. Result: On keyphrase generation benchmarks it outperforms unsupervised baselines and reaches 89% of a supervised model's performance; it also proves useful in text retrieval tasks. Conclusion: ERU-KG performs well in quality, efficiency, and flexibility, and suits a range of application scenarios. Abstract: Unsupervised keyphrase prediction has gained growing interest in recent years. However, existing methods typically rely on heuristically defined importance scores, which may lead to inaccurate informativeness estimation. In addition, they lack consideration for time efficiency. To solve these problems, we propose ERU-KG, an unsupervised keyphrase generation (UKG) model that consists of an informativeness and a phraseness module. The former estimates the relevance of keyphrase candidates, while the latter generates those candidates. The informativeness module innovates by learning to model informativeness through references (e.g., queries, citation contexts, and titles) and at the term-level, thereby 1) capturing how the key concepts of documents are perceived in different contexts and 2) estimating informativeness of phrases more efficiently by aggregating term informativeness, removing the need for explicit modeling of the candidates. ERU-KG demonstrates its effectiveness on keyphrase generation benchmarks by outperforming unsupervised baselines and achieving on average 89% of the performance of a supervised model for top 10 predictions. Additionally, to highlight its practical utility, we evaluate the model on text retrieval tasks and show that keyphrases generated by ERU-KG are effective when employed as query and document expansions. Furthermore, inference speed tests reveal that ERU-KG is the fastest among baselines of similar model sizes. Finally, our proposed model can switch between keyphrase generation and extraction by adjusting hyperparameters, catering to diverse application requirements.

[204] Automated Structured Radiology Report Generation

Jean-Benoit Delbrouck,Justin Xu,Johannes Moll,Alois Thomas,Zhihong Chen,Sophie Ostmeier,Asfandyar Azhar,Kelvin Zhenghao Li,Andrew Johnston,Christian Bluethgen,Eduardo Reis,Mohamed Muneer,Maya Varma,Curtis Langlotz

Main category: cs.CL

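TL;DR: The paper proposes the Structured Radiology Report Generation (SRRG) task, which converts free-text reports into a standardized format to address inconsistent model outputs and the difficulty of evaluation.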

Details

Motivation: The variability of free-text radiology reports leads to inconsistent model outputs with unclear clinical meaning, and existing evaluation metrics fail to capture the nuances of radiological interpretation. Method: Introduces the SRRG task, uses large language models (LLMs) to restructure reports into a standardized format, and proposes the SRR-BERT model for fine-grained disease classification. Result: SRR-BERT and the F1-SRR-BERT metric enable more precise evaluation of structured reports, and the dataset is validated through a radiologist reader study and benchmarking experiments. Conclusion: The structured report generation task and the SRR-BERT model improve the clarity, consistency, and clinical utility of radiology reports. Abstract: Automated radiology report generation from chest X-ray (CXR) images has the potential to improve clinical efficiency and reduce radiologists' workload. However, most datasets, including the publicly available MIMIC-CXR and CheXpert Plus, consist entirely of free-form reports, which are inherently variable and unstructured. This variability poses challenges for both generation and evaluation: existing models struggle to produce consistent, clinically meaningful reports, and standard evaluation metrics fail to capture the nuances of radiological interpretation. To address this, we introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format, ensuring clarity, consistency, and structured clinical reporting. We create a novel dataset by restructuring reports using large language models (LLMs) following strict structured reporting desiderata. Additionally, we introduce SRR-BERT, a fine-grained disease classification model trained on 55 labels, enabling more precise and clinically informed evaluation of structured reports. To assess report quality, we propose F1-SRR-BERT, a metric that leverages SRR-BERT's hierarchical disease taxonomy to bridge the gap between free-text variability and structured clinical reporting. We validate our dataset through a reader study conducted by five board-certified radiologists and extensive benchmarking experiments.

[205] Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization

Luong Ho,Khanh Le,Vinh Pham,Bao Nguyen,Tan Tran,Duc Chau

Main category: cs.CL

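TL;DR: Proposes a streaming inverse text normalization method based on a pretrained language model that addresses the challenges of ITN in streaming ASR and outperforms existing streaming methods on a Vietnamese dataset.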

Details

Motivation: The accuracy and efficiency of inverse text normalization (ITN) in streaming ASR remain largely unaddressed, especially in low-resource and limited-context scenarios. Method: Uses a pretrained language model with a dynamic context-aware mechanism that adapts the chunk size and incorporates right-context information. Result: On a Vietnamese dataset, the method reaches accuracy comparable to non-streaming ITN and surpasses existing streaming ITN models while maintaining low latency. Conclusion: The method provides an efficient and accurate solution for ITN in streaming ASR and is suitable for practical deployment. Abstract: Inverse Text Normalization (ITN) is crucial for converting spoken Automatic Speech Recognition (ASR) outputs into well-formatted written text, enhancing both readability and usability. Despite its importance, the integration of streaming ITN within streaming ASR remains largely unexplored due to challenges in accuracy, efficiency, and adaptability, particularly in low-resource and limited-context scenarios. In this paper, we introduce a streaming pretrained language model for ITN, leveraging pretrained linguistic representations for improved robustness. To address streaming constraints, we propose Dynamic Context-Aware during training and inference, enabling adaptive chunk size adjustments and the integration of right-context information. Experimental results demonstrate that our method achieves accuracy comparable to non-streaming ITN and surpasses existing streaming ITN models on a Vietnamese dataset, all while maintaining low latency, ensuring seamless integration into ASR systems.

[206] Advantageous Parameter Expansion Training Makes Better Large Language Models

Naibin Gu,Yilong Chen,Zhenyu Zhang,Peng Fu,Zheng Lin,Shuohuan Wang,Yu Sun,Hua Wu,Weiping Wang,Haifeng Wang

Main category: cs.CL

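TL;DR: APEX raises the proportion of advantageous parameters to improve model performance, matching or beating conventional training with fewer trainable parameters and less training data.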

Details

Motivation: Scaling up the parameters of large language models improves performance but increases computational overhead, and a subset of advantageous parameters plays the key role in performance. Method: Proposes APEX, which progressively expands advantageous parameters into the space of disadvantageous ones to raise their proportion. Result: In instruction tuning, APEX outperforms full-parameter tuning with only 52% of the trainable parameters; in continued pre-training, it reaches the same perplexity with only 33% of the data. Conclusion: By optimizing how parameters are allocated, APEX significantly improves training efficiency and model performance. Abstract: Although scaling up the number of trainable parameters in both pre-training and fine-tuning can effectively improve the performance of large language models, it also leads to increased computational overhead. When delving into the parameter difference, we find that a subset of parameters, termed advantageous parameters, plays a crucial role in determining model performance. Further analysis reveals that stronger models tend to possess more such parameters. In this paper, we propose Advantageous Parameter EXpansion Training (APEX), a method that progressively expands advantageous parameters into the space of disadvantageous ones, thereby increasing their proportion and enhancing training effectiveness. Further theoretical analysis from the perspective of matrix effective rank explains the performance gains of APEX. Extensive experiments on both instruction tuning and continued pre-training demonstrate that, in instruction tuning, APEX outperforms full-parameter tuning while using only 52% of the trainable parameters. In continued pre-training, APEX achieves the same perplexity level as conventional training with just 33% of the training data, and yields significant improvements on downstream tasks.

[207] Mamba Knockout for Unraveling Factual Information Flow

Nir Endy,Idan Daniel Grosbard,Yuval Ran-Milo,Yonatan Slutzky,Itay Tshuva,Raja Giryes

Main category: cs.CL

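TL;DR: Studies the flow of factual information in Mamba state-space models (SSMs). By adapting attention interpretability methods from Transformers, it reveals how information is passed across tokens and layers, and compares Mamba with Transformers.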

Details

Motivation: Explore the mechanisms of internal information flow in Mamba models and compare them with Transformer models to reveal properties common to large language models (LLMs). Method: Adapts Transformer attention interpretability techniques (such as Attention Knockout) to Mamba-1 and Mamba-2 and analyzes how information is transmitted and localized across tokens and layers. Result: Some phenomena differ between Mamba and Transformers, while other patterns appear across all models, suggesting they may be inherent to LLMs. Mamba's structured factorization further separates token-to-token information exchange from the enrichment of individual tokens. Conclusion: The study offers a unified lens on Mamba's internal operations and points to potentially universal properties of LLMs. Abstract: This paper investigates the flow of factual information in Mamba State-Space Model (SSM)-based language models. We rely on theoretical and empirical connections to Transformer-based architectures and their attention mechanisms. Exploiting this relationship, we adapt attentional interpretability techniques originally developed for Transformers--specifically, the Attention Knockout methodology--to both Mamba-1 and Mamba-2. Using them we trace how information is transmitted and localized across tokens and layers, revealing patterns of subject-token information emergence and layer-wise dynamics. Notably, some phenomena vary between Mamba models and Transformer-based models, while others appear universally across all models inspected--hinting that these may be inherent to LLMs in general. By further leveraging Mamba's structured factorization, we disentangle how distinct "features" either enable token-to-token information exchange or enrich individual tokens, thus offering a unified lens to understand Mamba internal operations.

[208] Proactive Guidance of Multi-Turn Conversation in Industrial Search

Xiaoyu Li,Xiao Li,Li Gao,Yiding Liu,Xiaoyang Wang,Shuaiqiang Wang,Junfeng Wang,Dawei Yin

Main category: cs.CL

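TL;DR: The paper proposes a two-phase framework (G-SFT and C-RL) that improves proactive guidance in multi-turn conversation systems while balancing goal adaptation and real-time interaction efficiency.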

Details

Motivation: In multi-turn conversation, large language models must adapt dynamically to shifts in user goals while keeping latency low, and existing systems fall short on both. Method: The first phase (G-SFT) uses a goal adaptation agent to track shifting user goals and applies knowledge distillation to obtain a lightweight model; the second phase (C-RL) builds preference pairs from click signals to optimize click-through rate. Result: Offline accuracy reaches 86.10% (+23.95% over baseline), online CTR improves by 149.06% in relative terms, and inference latency drops by 69.55%. Conclusion: The framework improves both goal tracking and interaction quality in a multi-turn conversation system and has clear value for industrial use. Abstract: The evolution of Large Language Models (LLMs) has significantly advanced multi-turn conversation systems, emphasizing the need for proactive guidance to enhance users' interactions. However, these systems face challenges in dynamically adapting to shifts in users' goals and maintaining low latency for real-time interactions. In the Baidu Search AI assistant, an industrial-scale multi-turn search system, we propose a novel two-phase framework to provide proactive guidance. The first phase, Goal-adaptive Supervised Fine-Tuning (G-SFT), employs a goal adaptation agent that dynamically adapts to user goal shifts and provides goal-relevant contextual information. G-SFT also incorporates scalable knowledge transfer to distill insights from LLMs into a lightweight model for real-time interaction. The second phase, Click-oriented Reinforcement Learning (C-RL), adopts a generate-rank paradigm, systematically constructs preference pairs from user click signals, and proactively improves click-through rates through more engaging guidance. This dual-phase architecture achieves complementary objectives: G-SFT ensures accurate goal tracking, while C-RL optimizes interaction quality through click signal-driven reinforcement learning. Extensive experiments demonstrate that our framework achieves 86.10% accuracy in offline evaluation (+23.95% over baseline) and 25.28% CTR in online deployment (149.06% relative improvement), while reducing inference latency by 69.55% through scalable knowledge distillation.

[209] Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games

Neemesh Yadav,Palakorn Achananuparp,Jing Jiang,Ee-Peng Lim

Main category: cs.CL

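TL;DR: The study examines how theory-of-mind (ToM) reasoning helps large language models (LLMs) align with human behavior in negotiation tasks, and finds that ToM reasoning improves behavioral consistency and negotiation outcomes.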

Details

Motivation: Theory-of-mind (ToM) reasoning is crucial for complex social interaction; the study explores how ToM helps LLM agents align their behavior with human norms. Method: Uses the ultimatum game as the experimental environment, initializes LLM agents with different prosocial beliefs, applies chain-of-thought (CoT) and varying levels of ToM reasoning, and analyzes their decision-making. Result: Across 2,700 simulations, ToM reasoning improves behavior alignment and negotiation outcomes, while reasoning models show limited capability. Conclusion: ToM reasoning helps improve human-AI interaction and cooperative decision-making, and the study offers a new perspective on the role of ToM in AI. Abstract: Large Language Models (LLMs) have shown potential in simulating human behaviors and performing theory-of-mind (ToM) reasoning, a crucial skill for complex social interactions. In this study, we investigate the role of ToM reasoning in aligning agentic behaviors with human norms in negotiation tasks, using the ultimatum game as a controlled environment. We initialized LLM agents with different prosocial beliefs (including Greedy, Fair, and Selfless) and reasoning methods like chain-of-thought (CoT) and varying ToM levels, and examined their decision-making processes across diverse LLMs, including reasoning models like o3-mini and DeepSeek-R1 Distilled Qwen 32B. Results from 2,700 simulations indicated that ToM reasoning enhances behavior alignment, decision-making consistency, and negotiation outcomes. Consistent with previous findings, reasoning models exhibit limited capability compared to models with ToM reasoning, and different roles of the game benefit from different orders of ToM reasoning. Our findings contribute to the understanding of ToM's role in enhancing human-AI interaction and cooperative decision-making. The code used for our experiments can be found at https://github.com/Stealth-py/UltimatumToM.

[210] Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation

Naila Shafirni Hidayat,Muhammad Dehan Al Kautsar,Alfan Farizki Wicaksono,Fajri Koto

Main category: cs.CL

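TL;DR: The paper compares existing data leakage detection techniques (permutation and n-gram methods) and proposes a lightweight method called semi-half question. It finds that the n-gram method performs best and refines the techniques to support instance-level detection. After cleaning MMLU and HellaSwag, it re-evaluates several LLMs and recommends contamination checks as a standard step before releasing benchmark results.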

Details

Motivation: Large language model (LLM) performance keeps improving, but opaque training data may overlap with evaluation sets and undermine the fairness of reported results. Existing leakage detection methods have not been evaluated under simulated leakage conditions. Method: Compares existing leakage detection techniques (permutation and n-gram methods), proposes the semi-half question method, and refines the techniques to support instance-level detection. Result: The n-gram method achieves the highest F1-score. Cleaned versions of MMLU and HellaSwag are created and used to re-evaluate several LLMs. Conclusion: Recommends contamination checks as a standard step to make evaluations more reliable and transparent. Abstract: The performance of large language models (LLMs) continues to improve, as reflected in rising scores on standard benchmarks. However, the lack of transparency around training data raises concerns about potential overlap with evaluation sets and the fairness of reported results. Although prior work has proposed methods for detecting data leakage, these approaches primarily focus on identifying outliers and have not been evaluated under controlled simulated leakage conditions. In this work, we compare existing leakage detection techniques, namely permutation and n-gram-based methods, under a continual pretraining setup that simulates real-world leakage scenarios, and additionally explore a lightweight method we call semi-half question. Although semi-half offers a low-cost alternative, our analysis shows that the n-gram method consistently achieves the highest F1-score. We also refine these techniques to support instance-level detection and reduce computational overhead. Leveraging the best-performing method, we create cleaned versions of MMLU and HellaSwag, and re-evaluate several LLMs. Our findings present a practical path toward more reliable and transparent evaluations, and we recommend contamination checks as a standard step before releasing benchmark results.

[211] Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations

Xin Quan,Marco Valentino,Louise A. Dennis,André Freitas

Main category: cs.CL

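TL;DR: The paper studies how combining large language models (LLMs) with theorem provers (TPs) can improve the faithfulness and robustness of natural language inference (NLI) explanations, proposing four strategies that reduce semantic information loss and improve verification efficiency.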

Details

Motivation: Addresses the semantic information loss and unfaithful interpretation caused by translating natural language into formal representations when LLMs are combined with TPs, as well as the limitations of LLMs in formal verification. Method: Proposes four strategies: 1) reduce semantic loss during autoformalisation; 2) efficiently identify and correct syntactic errors in logical representations; 3) use logical expressions to guide LLMs in generating structured proof sketches; 4) improve LLMs' ability to interpret TP feedback for iterative refinement. Result: On e-SNLI, QASC, and WorldTree, autoformalisation improves by 18.46% to 39.77% and explanation refinement by 29.5% to 51.5%, with markedly higher verification efficiency. Conclusion: With targeted interventions, the hybrid LLM-TP architecture achieves significant gains in the faithfulness and robustness of NLI explanations and substantially better verification efficiency. Abstract: Natural language explanations play a fundamental role in Natural Language Inference (NLI) by revealing how premises logically entail hypotheses. Recent work has shown that the interaction of large language models (LLMs) with theorem provers (TPs) can help verify and improve the validity of NLI explanations. However, TPs require translating natural language into machine-verifiable formal representations, a process that introduces the risk of semantic information loss and unfaithful interpretation, an issue compounded by LLMs' challenges in capturing critical logical structures with sufficient precision. Moreover, LLMs are still limited in their capacity for rigorous and robust proof construction within formal verification frameworks. To mitigate issues related to faithfulness and robustness, this paper investigates strategies to (1) alleviate semantic loss during autoformalisation, (2) efficiently identify and correct syntactic errors in logical representations, (3) explicitly use logical expressions to guide LLMs in generating structured proof sketches, and (4) increase LLMs' capacity to interpret TP feedback for iterative refinement. Our empirical results on e-SNLI, QASC and WorldTree using different LLMs demonstrate that the proposed strategies yield significant improvements in autoformalisation (+18.46%, +34.2%, +39.77%) and explanation refinement (+29.5%, +51.5%, +41.25%) over the state-of-the-art model. Moreover, we show that specific interventions on the hybrid LLM-TP architecture can substantially improve efficiency, drastically reducing the number of iterations required for successful verification.

[212] ScienceMeter: Tracking Scientific Knowledge Updates in Language Models

Yike Wang,Shangbin Feng,Yulia Tsvetkov,Hannaneh Hajishirzi

Main category: cs.CL

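TL;DR: The ScienceMeter framework evaluates methods for updating scientific knowledge in LLMs across past, present, and future knowledge, and finds that existing methods are limited in knowledge preservation, acquisition, and projection.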

Details

Motivation: Addresses the problem that LLMs' scientific knowledge quickly becomes outdated by evaluating their ability to update it. Method: Proposes the ScienceMeter framework, defines three metrics (knowledge preservation, acquisition, and projection), and tests five knowledge update methods on a multi-domain dataset. Result: The best methods preserve only 85.9% of existing knowledge, acquire 71.7% of new knowledge, and project 37.7% of future knowledge. Smaller models require training, whereas inference-based methods work for larger models. Conclusion: Existing methods struggle to meet all the demands of scientific knowledge updating; developing more robust mechanisms is both crucial and challenging. Abstract: Large Language Models (LLMs) are increasingly used to support scientific research, but their knowledge of scientific advancements can quickly become outdated. We introduce ScienceMeter, a new framework for evaluating scientific knowledge update methods over scientific knowledge spanning the past, present, and future. ScienceMeter defines three metrics: knowledge preservation, the extent to which models' understanding of previously learned papers is preserved; knowledge acquisition, how well scientific claims from newly introduced papers are acquired; and knowledge projection, the ability of the updated model to anticipate or generalize to related scientific claims that may emerge in the future. Using ScienceMeter, we examine the scientific knowledge of LLMs on claim judgment and generation tasks across a curated dataset of 15,444 scientific papers and 30,888 scientific claims from ten domains including medicine, biology, materials science, and computer science. We evaluate five representative knowledge update approaches including training- and inference-time methods. With extensive experiments, we find that the best-performing knowledge update methods can preserve only 85.9% of existing knowledge, acquire 71.7% of new knowledge, and project 37.7% of future knowledge. Inference-based methods work for larger models, whereas smaller models require training to achieve comparable performance. Cross-domain analysis reveals that performance on these objectives is correlated. Even when applied to specialized scientific LLMs, existing knowledge update methods fail to achieve these objectives collectively, underscoring that developing robust scientific knowledge update mechanisms is both crucial and challenging.

[213] HiCaM: A Hierarchical-Causal Modification Framework for Long-Form Text Modification

Yuntao Shi,Yi Luo,Yeyun Gong,Chen Lin

Main category: cs.CL

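TL;DR: The HiCaM framework uses a hierarchical summary tree and a causal graph to address the problems LLMs face in long-form text modification, significantly improving performance.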

Details

Motivation: In long-form text modification, LLMs make unnecessary changes and miss necessary ones, which harms document coherence. Method: Proposes the HiCaM framework, which performs modifications using a hierarchical summary tree and a causal graph. Result: Evaluated on a multi-domain dataset, HiCaM achieves a win rate of up to 79.50% over strong LLMs. Conclusion: HiCaM performs consistently across multiple models and domains and markedly improves the comprehensiveness of long-form text modification. Abstract: Large Language Models (LLMs) have achieved remarkable success in various domains. However, when handling long-form text modification tasks, they still face two major problems: (1) producing undesired modifications by inappropriately altering or summarizing irrelevant content, and (2) missing necessary modifications to implicitly related passages that are crucial for maintaining document coherence. To address these issues, we propose HiCaM, a Hierarchical-Causal Modification framework that operates through a hierarchical summary tree and a causal graph. Furthermore, to evaluate HiCaM, we derive a multi-domain dataset from various benchmarks, providing a resource for assessing its effectiveness. Comprehensive evaluations on the dataset demonstrate significant improvements over strong LLMs, with our method achieving up to a 79.50% win rate. These results highlight the comprehensiveness of our approach, showing consistent performance improvements across multiple models and domains.

[214] Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents

Fanhang Man,Huandong Wang,Jianjie Fang,Zhaoyi Deng,Baining Zhao,Xinlei Chen,Yong Li

Main category: cs.CL

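TL;DR: The paper proposes a method for sentiment forecasting on social media that extracts sentiment-related features and simulates the human response process, significantly improving forecasting accuracy.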

Details

Motivation: User sentiment on social media reflects social trends and needs, but existing work mostly analyzes past sentiment, and forecasting future sentiment is rarely studied. Method: Extracts sentiment-related features and proposes a multi-perspective role-playing framework that simulates the human response process. Result: Preliminary results show significant improvement in sentiment forecasting at both the microscopic and macroscopic levels. Conclusion: The method offers an effective tool for social media sentiment forecasting with practical potential. Abstract: User sentiment on social media reveals the underlying social trends, crises, and needs. Researchers have analyzed users' past messages to trace the evolution of sentiments and reconstruct sentiment dynamics. However, predicting the imminent sentiment of an ongoing event is rarely studied. In this paper, we address the problem of sentiment forecasting on social media to predict the user's future sentiment in response to the development of the event. We extract sentiment-related features to enhance the modeling skill and propose a multi-perspective role-playing framework to simulate the process of human response. Our preliminary results show significant improvement in sentiment forecasting on both microscopic and macroscopic levels.

[215] Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning

Wenxuan Shi,Haochen Tan,Chuqiao Kuang,Xiaoguang Li,Xiaozhe Ren,Chen Zhang,Hanting Chen,Yasheng Wang,Lifeng Shang,Fisher Yu,Yunhe Wang

Main category: cs.CL

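TL;DR: The paper defines a capability called Search Intensity Scaling (SIS) to address information seeking by large language models in open-web question answering, and develops the WebPuzzle dataset and the DeepDiver framework to promote SIS.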

Details

Motivation: Existing methods adapt poorly to the open-web environment and cannot adjust their search strategy dynamically, so models perform badly when evidence is ambiguous or conflicting. Method: Proposes the WebPuzzle dataset and DeepDiver, a reinforcement learning framework that improves performance in the open-web environment through adaptive search policies. Result: Experiments show that DeepDiver brings Pangu-7B-Reasoner close to the 671B-parameter DeepSeek-R1 on real-web tasks. Conclusion: DeepDiver and WebPuzzle open a new direction for adaptive information seeking in LLMs and provide a benchmark for future research. Abstract: Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing methods rely on static prompting rules or training with Wikipedia-based corpora and retrieval environments, limiting adaptability to the real-world web environment where ambiguity, conflicting evidence, and noise are prevalent. These constrained training settings hinder LLMs from learning to dynamically decide when and where to search, and how to adjust search depth and frequency based on informational demands. We define this missing capacity as Search Intensity Scaling (SIS), the emergent skill to intensify search efforts under ambiguous or conflicting conditions, rather than settling on overconfident, under-verified answers. To study SIS, we introduce WebPuzzle, the first dataset designed to foster information-seeking behavior in open-world internet environments. WebPuzzle consists of 24K training instances and 275 test questions spanning both wiki-based and open-web queries. Building on this dataset, we propose DeepDiver, a Reinforcement Learning (RL) framework that promotes SIS by encouraging adaptive search policies through exploration under a real-world open-web environment. Experimental results show that Pangu-7B-Reasoner empowered by DeepDiver achieves performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's training curriculum from cold-start supervised fine-tuning to a carefully designed RL phase, and show that its SIS capability generalizes from closed-form QA to open-ended tasks such as long-form writing. Our contributions advance adaptive information seeking in LLMs and provide a valuable benchmark and dataset for future research.

[216] Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings

Shujian Yang,Shiyao Cui,Chuanrui Hu,Haicheng Wang,Tianwei Zhang,Minlie Huang,Jialiang Lu,Han Qiu

Main category: cs.CL

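TL;DR: The paper studies the challenges large language models (LLMs) face in detecting toxic Chinese content, especially under perturbation strategies such as character substitution. The authors propose a taxonomy of perturbations and test 9 SOTA LLMs, finding that their detection ability is limited and that simple enhancement methods can cause false positives.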

Details

Motivation: Toxic Chinese content detection is challenging for LLMs, particularly because perturbations such as character substitution easily confuse existing models, which calls for a closer study of their behavior and of ways to improve it. Method: Proposes 3 perturbation strategies and 8 specific approaches, builds a dataset from them, benchmarks 9 SOTA LLMs, and explores enhancement methods such as in-context learning (ICL) and supervised fine-tuning (SFT). Result: LLMs are weak at detecting perturbed toxic Chinese content, and ICL or SFT can make models overcorrect and misclassify normal content as toxic. Conclusion: Toxic Chinese content detection needs more robust models and methods that avoid the false positives caused by simple enhancement. Abstract: Detecting toxic content using language models is important but challenging. While large language models (LLMs) have demonstrated strong performance in understanding Chinese, recent studies show that simple character substitutions in toxic Chinese text can easily confuse the state-of-the-art (SOTA) LLMs. In this paper, we highlight the multimodal nature of Chinese language as a key challenge for deploying LLMs in toxic Chinese detection. First, we propose a taxonomy of 3 perturbation strategies and 8 specific approaches in toxic Chinese content. Then, we curate a dataset based on this taxonomy, and benchmark 9 SOTA LLMs (from both the US and China) to assess if they can detect perturbed toxic Chinese text. Additionally, we explore cost-effective enhancement solutions like in-context learning (ICL) and supervised fine-tuning (SFT). Our results reveal two important findings. (1) LLMs are less capable of detecting perturbed multimodal Chinese toxic contents. (2) ICL or SFT with a small number of perturbed examples may cause the LLMs to "overcorrect": misidentify many normal Chinese contents as toxic.

[217] Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

Yangui Fang,Baixu Cheng,Jing Peng,Xu Li,Yu Xi,Chengwei Zhang,Guohui Zhong

Main category: cs.CL

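TL;DR: The paper proposes RLLM-CF, an LLM-based framework for ASR error correction that uses a three-stage approach to avoid hallucinations and significantly reduces CER/WER on several datasets.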

Details

Motivation: Traditional ASR error correction methods are only moderately effective, while using LLMs directly introduces hallucinations, so a reliable method is needed that corrects errors without altering correct text. Method: The RLLM-CF framework has three stages: error pre-detection, iterative correction with chain-of-thought sub-tasks, and reasoning process verification. It needs no additional training or data. Result: On AISHELL-1, AISHELL-2, and Librispeech, CER/WER falls by 21%, 11%, 9%, and 11.4% in relative terms. Conclusion: RLLM-CF effectively addresses LLM hallucination in ASR error correction and markedly improves correction quality. Abstract: Automatic Speech Recognition (ASR) error correction aims to correct recognition errors while preserving accurate text. Although traditional approaches demonstrate moderate effectiveness, LLMs offer a paradigm that eliminates the need for training and labeled data. However, directly using LLMs runs into the hallucination problem, which may lead to the modification of correct text. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3) reasoning process verification. The advantage of our method is that it does not require additional information or fine-tuning of the model, and ensures the correctness of the LLM correction under multi-pass programming. Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.
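
Editor's sketch (not from the paper): the reported figures are relative reductions in CER/WER, so it may help to see how such a number is usually computed. The code below uses the standard definition of character error rate as Levenshtein distance over reference length. The function names and the example error rates are hypothetical; the abstract does not give absolute CER/WER values.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programme)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, corrected):
    """Relative error-rate reduction, e.g. 0.21 for a 21% relative reduction."""
    return (baseline - corrected) / baseline
```

Under this definition, a hypothetical baseline CER of 5.0% that drops to 3.95% after correction is a 21% relative reduction, even though the absolute change is only 1.05 points. Passing lists of words instead of strings to `edit_distance` gives the word-level (WER) variant.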

[218] Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research

Qianqian Zhang,Jiajia Liao,Heting Ying,Yibo Ma,Haozhan Shen,Jingcheng Li,Peng Liu,Lu Zhang,Chunxin Fang,Kyusong Lee,Ruochen Xu,Tiancheng Zhao

Main category: cs.CL

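TL;DR: AGORA is a flexible graph-based framework that addresses the engineering, standardization, and evaluation problems in language agent development through a modular architecture, reusable algorithms, and a rigorous evaluation framework.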

Details

Motivation: Building robust language agents is hampered by heavy engineering overhead, a lack of standardized components, and insufficient evaluation frameworks. Method: AGORA adopts a modular architecture (graph-based workflow engine, memory management, component abstraction), provides a suite of reusable algorithms, and establishes a rigorous evaluation framework. Result: Experiments show that sophisticated reasoning methods can enhance agent capabilities, but simpler methods such as Chain-of-Thought perform robustly at lower computational cost. Conclusion: AGORA simplifies language agent development and lays a foundation for reproducible research through standardized evaluation. Abstract: Language agents powered by large language models (LLMs) have demonstrated remarkable capabilities in understanding, reasoning, and executing complex tasks. However, developing robust agents presents significant challenges: substantial engineering overhead, lack of standardized components, and insufficient evaluation frameworks for fair comparison. We introduce Agent Graph-based Orchestration for Reasoning and Assessment (AGORA), a flexible and extensible framework that addresses these challenges through three key contributions: (1) a modular architecture with a graph-based workflow engine, efficient memory management, and clean component abstraction; (2) a comprehensive suite of reusable agent algorithms implementing state-of-the-art reasoning approaches; and (3) a rigorous evaluation framework enabling systematic comparison across multiple dimensions. Through extensive experiments on mathematical reasoning and multimodal tasks, we evaluate various agent algorithms across different LLMs, revealing important insights about their relative strengths and applicability. Our results demonstrate that while sophisticated reasoning approaches can enhance agent capabilities, simpler methods like Chain-of-Thought often exhibit robust performance with significantly lower computational overhead. AGORA not only simplifies language agent development but also establishes a foundation for reproducible agent research through standardized evaluation protocols.

[219] Multilingual Gloss-free Sign Language Translation: Towards Building a Sign Language Foundation Model

Sihan Tan,Taro Miyazaki,Kazuhiro Nakadai

Main category: cs.CL

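TL;DR: This paper proposes a multilingual sign language translation (MLSLT) model that uses dual CTC objectives to handle language conflicts and alignment problems, supports 10 sign languages, and performs competitively across several tasks.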

Details

Motivation: Existing work mostly translates a single sign language into a single spoken language (one-to-one SLT), whereas multilingual resources could ease low-resource issues and improve accessibility. Multilingual SLT, however, remains unexplored because of language conflicts and alignment difficulties. Method: Proposes a multilingual gloss-free model with dual CTC objectives for token-level sign language identification and spoken text generation. Result: The model supports 10 sign languages, handles one-to-one, many-to-one, and many-to-many SLT tasks, and performs competitively on three benchmarks. Conclusion: The model is competitive in multilingual sign language translation and offers an effective approach to language conflicts and alignment problems. Abstract: Sign Language Translation (SLT) aims to convert sign language (SL) videos into spoken language text, thereby bridging the communication gap between the sign and the spoken community. While most existing works focus on translating a single sign language into a single spoken language (one-to-one SLT), leveraging multilingual resources could mitigate low-resource issues and enhance accessibility. However, multilingual SLT (MLSLT) remains unexplored due to language conflicts and alignment difficulties across SLs and spoken languages. To address these challenges, we propose a multilingual gloss-free model with dual CTC objectives for token-level SL identification and spoken text generation. Our model supports 10 SLs and handles one-to-one, many-to-one, and many-to-many SLT tasks, achieving competitive performance compared to state-of-the-art methods on three widely adopted benchmarks: multilingual SP-10, PHOENIX14T, and CSL-Daily.
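
Editor's sketch (not from the paper): the abstract mentions CTC objectives without detail, so for readers unfamiliar with CTC, the following shows only the standard greedy decoding rule (take the best label per frame, merge consecutive repeats, drop blanks). It is not the paper's dual-objective training loss. The function name, the use of index 0 as the blank label, and the toy scores are assumptions.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Pick the best label per frame, merge consecutive repeats, then drop blanks."""
    # Index of the highest-scoring label in each frame.
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    decoded = []
    prev = None
    for label in best:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded
```

The blank label is what lets CTC align a long frame sequence (for example, video frames) to a shorter output without frame-level annotation: a blank between two identical labels keeps them as two separate outputs, whereas adjacent identical labels collapse into one.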

[220] Knowing Before Saying: LLM Representations Encode Information About Chain-of-Thought Success Before Completion

Anum Afzal,Florian Matthes,Gal Chechik,Yftah Ziser

Main category: cs.CL

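TL;DR: The study finds that a probing classifier built on LLM representations can predict the success of a zero-shot CoT process before it begins, while a BERT baseline does worse. Early representations already contain the key information, and later reasoning steps do not always help. Experiments show that truncating CoT reasoning early still beats using no CoT at all, though it falls short of full reasoning.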

Details

Motivation: Investigates whether the success of zero-shot CoT reasoning can be predicted before the reasoning completes, with a view to improving reasoning efficiency. Method: Compares a probing classifier based on LLM representations with a BERT baseline at different stages of reasoning, and runs early stopping experiments. Result: The LLM-based classifier performs well even before reasoning starts, while the BERT baseline, which depends on shallow linguistic cues, does worse. Early representations already contain the key information, and later steps do not always help. Truncating CoT reasoning early still outperforms no CoT. Conclusion: Early LLM representations already encode key information about the reasoning, which supports methods for making CoT more efficient (such as early stopping) while preserving its benefits. Abstract: We investigate whether the success of a zero-shot Chain-of-Thought (CoT) process can be predicted before completion. We discover that a probing classifier, based on LLM representations, performs well even before a single token is generated, suggesting that crucial information about the reasoning process is already present in the initial steps' representations. In contrast, a strong BERT-based baseline, which relies solely on the generated tokens, performs worse, likely because it depends on shallow linguistic cues rather than deeper reasoning dynamics. Surprisingly, using later reasoning steps does not always improve classification. When additional context is unhelpful, earlier representations resemble later ones more, suggesting LLMs encode key information early. This implies reasoning can often stop early without loss. To test this, we conduct early stopping experiments, showing that truncating CoT reasoning still improves performance over not using CoT at all, though a gap remains compared to full reasoning. However, approaches like supervised learning or reinforcement learning designed to shorten CoT chains could leverage our classifier's guidance to identify when early stopping is effective. Our findings provide insights that may support such methods, helping to optimize CoT's efficiency while preserving its benefits. Code and data are available at https://github.com/anum94/CoTpred.

[221] LLM Inference Enhanced by External Knowledge: A Survey

Yu-Hsuan Lin,Qian-Hui Chen,Yi-Jie Cheng,Jia-Ren Zhang,Yi-Hung Liu,Liang-Yu Hsia,Yun-Nung Chen

Main category: cs.CL

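TL;DR: The paper surveys strategies for enhancing large language models (LLMs) with external knowledge, focusing on methods for integrating structured knowledge (such as tables and knowledge graphs) and their trade-offs.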

Details

Motivation: Despite progress in natural-language reasoning, LLMs' limited parametric memory and susceptibility to hallucination still restrict their use in tasks that require accurate, context-based inference. Method: Proposes a taxonomy that divides external knowledge into unstructured and structured data, and analyzes in detail the integration paradigms and representative methods for combining structured knowledge (tables and knowledge graphs) with LLMs. Result: The comparative analysis reveals trade-offs among interpretability, scalability, and performance, offering insights for developing trustworthy and generalizable knowledge-enhanced LLMs. Conclusion: The study provides systematic guidance on using external knowledge to enhance LLMs and points to directions for future work. Abstract: Recent advancements in large language models (LLMs) have enhanced natural-language reasoning. However, their limited parametric memory and susceptibility to hallucination present persistent challenges for tasks requiring accurate, context-based inference. To overcome these limitations, an increasing number of studies have proposed leveraging external knowledge to enhance LLMs. This study offers a systematic exploration of strategies for using external knowledge to enhance LLMs, beginning with a taxonomy that categorizes external knowledge into unstructured and structured data. We then focus on structured knowledge, presenting distinct taxonomies for tables and knowledge graphs (KGs), detailing their integration paradigms with LLMs, and reviewing representative methods. Our comparative analysis further highlights the trade-offs among interpretability, scalability, and performance, providing insights for developing trustworthy and generalizable knowledge-enhanced LLMs.

[222] ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation

Hao Chen,Yukun Yan,Sen Mei,Wanxiang Che,Zhenghao Liu,Qi Shi,Xinze Li,Yuchun Fan,Pengcheng Huang,Qiushi Xiong,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

TL;DR: ClueAnchor extracts key clues and optimizes reasoning paths, significantly improving the reasoning completeness and robustness of RAG systems.

Details

Motivation: Existing RAG systems underuse retrieved documents, especially when evidence is implicit, scattered, or obscured by noise, which leads to inaccurate reasoning. Method: Proposes the ClueAnchor framework, which extracts key clues, generates multiple reasoning paths, and selects the best one through reward-based optimization. Result: Experiments show that ClueAnchor outperforms existing RAG baselines in reasoning completeness and robustness and is highly resilient to noisy or partially relevant content. Conclusion: ClueAnchor can identify supporting evidence without explicit clue supervision and significantly improves RAG performance. Abstract: Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge to improve factuality. However, existing RAG systems frequently underutilize the retrieved documents, failing to extract and integrate the key clues needed to support faithful and interpretable reasoning, especially in cases where relevant evidence is implicit, scattered, or obscured by noise. To address this issue, we propose ClueAnchor, a novel framework for enhancing RAG via clue-anchored reasoning exploration and optimization. ClueAnchor extracts key clues from retrieved content and generates multiple reasoning paths based on different knowledge configurations, optimizing the model by selecting the most effective one through reward-based preference optimization. Experiments show that ClueAnchor significantly outperforms prior RAG baselines in reasoning completeness and robustness. Further analysis confirms its strong resilience to noisy or partially relevant retrieved content, as well as its capability to identify supporting evidence even in the absence of explicit clue supervision during inference.

[223] LLMs Are Globally Multilingual Yet Locally Monolingual: Exploring Knowledge Transfer via Language and Thought Theory

Eojin Kang,Juae Kim

Main category: cs.CL

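TL;DR: The paper examines inconsistencies in cross-lingual knowledge transfer in multilingual large language models (LLMs) and proposes L2T, a prompting strategy grounded in Language and Thought Theory, to improve knowledge transfer.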

Details

Motivation: Multilingual LLMs recall knowledge inconsistently depending on the input language, and existing studies are mostly English-based, with little in-depth work on other languages. Method: Proposes the L2T prompting strategy, analyzes the relationship between input language, internal cognitive processes, and knowledge, and validates it experimentally. Result: Experiments show that English-based approaches are not always best, and that aligning the model's internal thought with the knowledge the task requires is critical for cross-lingual transfer. Training with L2T reduces reliance on the input language. Conclusion: L2T offers a new perspective on cross-lingual knowledge transfer, emphasizing the alignment of internal thought with task knowledge without relying on translation-based learning. Abstract: Multilingual large language models (LLMs) open up new possibilities for leveraging information across languages, but their factual knowledge recall remains inconsistent depending on the input language. While previous studies have attempted to address this issue through English-based prompting and evaluation, we explore non-English to English transfer via Language and Thought Theory. This perspective allows us to examine language-thought binding in LLMs and uncover why factual knowledge often fails to transfer effectively. We propose the Language-to-Thought (L2T) prompting strategy, which analyzes the relationship between input language, internal cognitive processes, and knowledge. Experimental results challenge the assumption that English-based approaches consistently outperform other languages and offer a novel insight that aligning the model's internal thought with the knowledge required for the task is critical for successful cross-lingual transfer. Furthermore, we show that applying L2T during training can alleviate LLMs' reliance on the input language and facilitate cross-linguistic knowledge integration without translation-based learning. Code and datasets will be available.

[224] MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs

Zhiwei Liu,Lingfei Qian,Qianqian Xie,Jimin Huang,Kailai Yang,Sophia Ananiadou

Main category: cs.CL

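TL;DR: The paper introduces MMAFFBen, the first open-source benchmark for multilingual multimodal affective analysis, covering text, image, and video data in 35 languages, and develops the MMAFFLM models, which are evaluated alongside other representative models.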

Details

Motivation: The affective analysis capabilities of existing large language models and vision-language models remain underexplored, and comprehensive evaluation benchmarks are lacking. Method: Builds the MMAFFBen benchmark and the MMAFFIn dataset, develops the MMAFFLM-3b and MMAFFLM-7b models, and evaluates a range of models including GPT-4o-mini. Result: Provides systematic evaluation results on multilingual multimodal affective analysis tasks. Conclusion: MMAFFBen fills an evaluation gap in affective analysis and provides an important tool for future research. Abstract: Large language models and vision-language models (which we jointly call LMs) have transformed NLP and CV, demonstrating remarkable potential across various fields. However, their capabilities in affective analysis (i.e. sentiment analysis and emotion detection) remain underexplored. This gap is largely due to the absence of comprehensive evaluation benchmarks, and the inherent complexity of affective analysis tasks. In this paper, we introduce MMAFFBen, the first extensive open-source benchmark for multilingual multimodal affective analysis. MMAFFBen encompasses text, image, and video modalities across 35 languages, covering four key affective analysis tasks: sentiment polarity, sentiment intensity, emotion classification, and emotion intensity. Moreover, we construct the MMAFFIn dataset for fine-tuning LMs on affective analysis tasks, and further develop MMAFFLM-3b and MMAFFLM-7b based on it. We evaluate various representative LMs, including GPT-4o-mini, providing a systematic comparison of their affective understanding capabilities. This project is available at https://github.com/lzw108/MMAFFBen.

Christopher Bagdon,Aidan Combs,Carina Silberer,Roman Klinger

Main category: cs.CL

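TL;DR: The study compares study-created content with genuine content in terms of emotion expression and finds significant differences in length, in reliance on text versus images, and in the prototypicality of the emotional events described.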

Details

Motivation: Accurate modeling of subjective phenomena such as emotion expression requires data annotated with the author's intent. It is unclear whether study-created content differs from genuine content and how any differences affect models. Method: Collects study-created and genuine multimodal social media posts labeled for emotion and compares them on several dimensions, including model performance. Result: Study-created posts are longer than genuine ones, rely more on text and less on images to express emotion, and focus more on emotion-prototypical events. Conclusion: Study-created data is valuable for training models that generalize well, but genuine data is needed for realistic estimates of model effectiveness. Abstract: Accurate modeling of subjective phenomena such as emotion expression requires data annotated with authors' intentions. Commonly such data is collected by asking study participants to donate and label genuine content produced in the real world, or create content fitting particular labels during the study. Asking participants to create content is often simpler to implement and presents fewer risks to participant privacy than data donation. However, it is unclear if and how study-created content may differ from genuine content, and how differences may impact models. We collect study-created and genuine multimodal social media posts labeled for emotion and compare them on several dimensions, including model performance. We find that compared to genuine posts, study-created posts are longer, rely more on their text and less on their images for emotion expression, and focus more on emotion-prototypical events. The samples of participants willing to donate versus create posts are demographically different. Study-created data is valuable to train models that generalize well to genuine data, but realistic effectiveness estimates require genuine data.

[226] Model Unlearning via Sparse Autoencoder Subspace Guided Projections

Xu Wang,Zihao Li,Benyou Wang,Yan Hu,Difan Zou

Main category: cs.CL

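TL;DR: SAE-Guided Subspace Projection Unlearning (SSPU) is a new framework that uses sparse autoencoder features to achieve precise, interpretable, and robust unlearning of model knowledge, outperforming existing methods.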

Details

Motivation: Large language models (LLMs) store vast amounts of information, but privacy and safety concerns call for selective knowledge removal, and existing methods fall short on interpretability or robustness. Method: SSPU performs unlearning through a three-stage pipeline (data-driven layer and feature selection, subspace construction via QR decomposition, and constrained optimization) and uses SAE features to supervise parameter updates. Result: On the WMDP-Cyber forget set and three utility benchmarks, SSPU reduces harmful-knowledge accuracy by 3.22% relative to the strongest baseline and improves adversarial robustness. Conclusion: SSPU exposes the limitations of existing unlearning methods and shows that interpretable subspace-guided optimization can yield robust, controllable model behavior. Abstract: Large language models (LLMs) store vast amounts of information, making them powerful yet raising privacy and safety concerns when selective knowledge removal is required. Existing unlearning strategies, ranging from gradient-based fine-tuning and model editing to sparse autoencoder (SAE) steering, either lack interpretability or fail to provide a robust defense against adversarial prompts. We propose SAE-Guided Subspace Projection Unlearning (SSPU), a novel framework that leverages SAE features to drive targeted updates in the model's parameter space, enabling precise, interpretable, and robust unlearning. SSPU's three-stage pipeline performs data-driven layer and feature selection, subspace construction via QR decomposition, and constrained optimization that controls activations into an "irrelevant" subspace while preserving retained knowledge. Overall, we use SAE features to construct a subspace that supervises unlearning, refining the loss and adding a regularization term to guide interpretable parameter updates. In experiments on the WMDP-Cyber forget set and three utility benchmarks (MMLU, TruthfulQA, GSM8K), SSPU reduces harmful knowledge accuracy by 3.22% compared to the strongest baseline. It also improves adversarial robustness, lowering malicious accuracy under jailbreak prompts compared to baselines. Our findings expose the limitations of prior unlearning methods and demonstrate how interpretable subspace-guided optimization can achieve robust, controllable model behavior.
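
Editor's sketch (not from the paper): the abstract names "subspace construction via QR decomposition" without giving the procedure, so the following only illustrates the underlying linear algebra. It builds an orthonormal basis for the span of a few feature directions (the Q factor of a thin QR, computed here by Gram-Schmidt in pure Python) and projects a vector onto the orthogonal complement of that span. How SSPU selects features and uses the subspace in its loss is not shown; all names and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for span(vectors), i.e. the Q of a thin QR."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace(x, basis):
    """Project x onto the orthogonal complement of span(basis)."""
    out = list(x)
    for q in basis:
        c = dot(x, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out
```

For example, with the feature directions [1, 1, 0] and [1, 0, 0], the vector [3, 4, 5] loses its first two coordinates and keeps only the component outside the span. In practice a library routine such as a QR factorization from a numerical package would replace the hand-written Gram-Schmidt, which is used here only to keep the sketch dependency-free.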

[227] Exploring the Impact of Occupational Personas on Domain-Specific QA

Eojin Kang,Jaehyuk Yu,Juae Kim

Main category: cs.CL

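TL;DR: The study finds that Profession-Based Personas (PBPs) slightly improve performance on domain-specific QA tasks, while Occupational Personality-Based Personas (OPBPs) can degrade it. Persona relevance does not guarantee effective knowledge use and may impose cognitive constraints.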

Details

Motivation: Examines how personas affect domain-specific question answering, in particular profession-related and occupational-personality-related personas. Method: Introduces two persona types (PBPs and OPBPs) and evaluates them empirically on scientific QA tasks across multiple domains. Result: PBPs slightly improve accuracy, whereas OPBPs can degrade performance even when semantically related to the task. Conclusion: Persona relevance alone does not guarantee effective knowledge use; future work could study how fine-grained differences in persona representations guide LLMs. Abstract: Recent studies on personas have improved the way Large Language Models (LLMs) interact with users. However, the effect of personas on domain-specific question-answering (QA) tasks remains a subject of debate. This study analyzes whether personas enhance specialized QA performance by introducing two types of persona: Profession-Based Personas (PBPs) (e.g., scientist), which directly relate to domain expertise, and Occupational Personality-Based Personas (OPBPs) (e.g., scientific person), which reflect cognitive tendencies rather than explicit expertise. Through empirical evaluations across multiple scientific domains, we demonstrate that while PBPs can slightly improve accuracy, OPBPs often degrade performance, even when semantically related to the task. Our findings suggest that persona relevance alone does not guarantee effective knowledge utilization and that they may impose cognitive constraints that hinder optimal knowledge application. Future research can explore how nuanced distinctions in persona representations guide LLMs, potentially contributing to reasoning and knowledge retrieval that more closely mirror human social conceptualization.

[228] When Large Multimodal Models Confront Evolving Knowledge: Challenges and Pathways

Kailin Jiang,Yuntao Du,Yukai Ding,Yuchen Ren,Ning Jiang,Zhi Gao,Zilong Zheng,Lei Liu,Bin Li,Qing Li

Main category: cs.CL

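TL;DR: The paper proposes the EVOKE benchmark to evaluate how well large multimodal models (LMMs) can be injected with evolving knowledge in real-world scenarios, finds that existing methods perform poorly on evolving knowledge injection, and suggests pathways for improvement.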

Details

Motivation: Addresses the challenges of injecting evolving knowledge into large multimodal models (LMMs), namely avoiding catastrophic forgetting while improving the ability to update knowledge. Method: Proposes the EVOKE benchmark to evaluate evolving knowledge injection, analyzes the limitations of existing methods, and explores text knowledge augmentation and continual learning methods (such as Replay and MoELoRA). Result: Existing knowledge injection methods perform poorly on evolving knowledge; supervised fine-tuning causes catastrophic forgetting; text knowledge augmentation helps, while image augmentation does not; continual learning methods mitigate forgetting. Conclusion: Current knowledge injection methods are limited on evolving knowledge, and more efficient and stable methods need further research. Abstract: Large language/multimodal models (LLMs/LMMs) store extensive pre-trained knowledge but struggle to maintain consistency with real-world updates, making it difficult to avoid catastrophic forgetting while acquiring evolving knowledge. Previous work focused on constructing textual knowledge datasets and exploring knowledge injection in LLMs, lacking exploration of multimodal evolving knowledge injection in LMMs. To address this, we propose the EVOKE benchmark to evaluate LMMs' ability to inject multimodal evolving knowledge in real-world scenarios. Meanwhile, a comprehensive evaluation of multimodal evolving knowledge injection revealed two challenges: (1) Existing knowledge injection methods perform terribly on evolving knowledge. (2) Supervised fine-tuning causes catastrophic forgetting; in particular, instruction-following ability is severely compromised. Additionally, we provide pathways and find that: (1) Text knowledge augmentation during the training phase improves performance, while image augmentation cannot achieve it. (2) Continual learning methods, especially Replay and MoELoRA, effectively mitigate forgetting. Our findings indicate that current knowledge injection methods have many limitations on evolving knowledge, which motivates further research on more efficient and stable knowledge injection methods.

[229] Domain Pre-training Impact on Representations

Cesar Gonzalez-Gutierrez,Ariadna Quattoni

Main category: cs.CL

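TL;DR: The study shows that the pre-training corpus significantly affects the quality of Transformer representations, that a small specialized corpus can also yield effective representations, and that the benefit of combining generic and specialized corpora depends on the distributional similarity between the target task and the specialized corpus.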

Details

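Motivation: To examine how the pre-training corpus affects the quality of Transformer representations, focusing on the representation quality induced by pre-training alone. Method: Experimentally analyzes how different pre-training corpora (a small specialized corpus, and a combination of generic and specialized corpora) affect representation quality. Result: A small specialized corpus can yield effective representations; the benefit of combining generic and specialized corpora depends on the distributional similarity between the target task and the specialized corpus. Conclusion: The choice of pre-training corpus is crucial to representation quality, and specialized corpora perform well under specific conditions. Abstract: This empirical study analyzes the effects of the pre-training corpus on the quality of learned transformer representations. We focus on the representation quality induced solely through pre-training. Our experiments show that pre-training on a small, specialized corpus can yield effective representations, and that the success of combining a generic and a specialized corpus depends on the distributional similarity between the target task and the specialized corpus.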

[230] CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

Emilio Villa-Cueva,Sholpan Bolatzhanova,Diana Turmakhan,Kareem Elzeky,Henok Biadglign Ademtew,Alham Fikri Aji,Israel Abebe Azime,Jinheon Baek,Frederico Belcavello,Fermin Cristobal,Jan Christian Blaise Cruz,Mary Dabre,Raj Dabre,Toqeer Ehsan,Naome A Etori,Fauzan Farooqui,Jiahui Geng,Guido Ivetta,Thanmay Jayakumar,Soyeong Jeong,Zheng Wei Lim,Aishik Mandal,Sofia Martinelli,Mihail Minkov Mihaylov,Daniil Orel,Aniket Pramanick,Sukannya Purkayastha,Israfel Salazar,Haiyue Song,Tiago Timponi Torrent,Debela Desalegn Yadeta,Injy Hamed,Atnafu Lambebo Tonja,Thamar Solorio

Main category: cs.CL

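TL;DR: CaMMT is a dataset of over 5,800 triples of images with parallel captions, built to study the role of images as cultural context in multimodal translation. Experiments show that visual context improves translation quality, especially for culturally specific content, disambiguation, and gender accuracy.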

Details

Motivation: Cultural differences make it hard for machine translation to convey region-specific meanings accurately, so the paper investigates whether images can serve as cultural context to aid translation. Method: Builds the CaMMT dataset and evaluates five vision-language models in text-only and text+image settings. Result: Visual context generally improves translation quality, especially for culturally specific items, disambiguation, and correct gender usage. Conclusion: The release of CaMMT supports the development and evaluation of multimodal translation systems that better accommodate cultural nuance and regional variation. Abstract: Cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender usage. By releasing CaMMT, we aim to support broader efforts in building and evaluating multimodal translation systems that are better aligned with cultural nuance and regional variation.

[231] VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation

Hieu Tran,Phuong-Anh Nguyen-Le,Huy Nghiem,Quang-Nhan Nguyen,Wei Ai,Marine Carpuat

Main category: cs.CL

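TL;DR: The paper presents the VietMix corpus and a synthetic data generation method that improve code-mixed translation for low-resource languages.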

Details

Motivation: To address the weak performance of machine translation systems on code-mixed input for low-resource languages. Method: Builds the VietMix parallel corpus and develops a synthetic data generation pipeline with filtering mechanisms that ensure syntactic plausibility and pragmatic appropriateness. Result: Experiments show improved model performance (up to 71.84 on COMETkiwi and 81.77 on XCOMET), and augmented models are preferred over the baseline in about 49% of judgments. Conclusion: VietMix and the augmentation method advance the ecological validity of neural machine translation and provide a framework for other low-resource language pairs. Abstract: Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline. This pipeline incorporates filtering mechanisms to ensure syntactic plausibility and pragmatic appropriateness in code-mixing patterns. Experimental validation shows our naturalistic and complementary synthetic data boost models' performance, measured by translation quality estimation scores, of up to 71.84 on COMETkiwi and 81.77 on XCOMET. Triangulating positive results with LLM-based assessments, augmented models are favored over seed fine-tuned counterparts in approximately 49% of judgments (54-56% excluding ties). VietMix and our augmentation methodology advance ecological validity in neural MT evaluations and establish a framework for addressing code-mixed translation challenges across other low-resource pairs.

[232] Towards Effective Code-Integrated Reasoning

Fei Bai,Yingqian Min,Beichen Zhang,Zhipeng Chen,Wayne Xin Zhao,Lei Fang,Zheng Liu,Zhongyuan Wang,Ji-Rong Wen

Main category: cs.CL

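TL;DR: The paper proposes a code-integrated reasoning approach that trains models with tool-augmented reinforcement learning (RL) to generate and execute code dynamically to improve reasoning, with improved training strategies for better stability and effectiveness.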

Details

Motivation: To study the potential of code-integrated reasoning, which extends a model's capability boundaries by generating and executing code dynamically, while addressing the instability of tool-augmented RL training. Method: Uses tool-augmented RL with enhanced training strategies that balance exploration and stability, progressively building the model's ability to use external code tools. Result: The model significantly outperforms baselines on five mainstream mathematical reasoning benchmarks, and the analysis shows that code integration extends capability boundaries and improves reasoning efficiency. Conclusion: Code-integrated reasoning is an effective approach that improves model performance through dynamic code generation and execution, and the enhanced training strategies make RL more stable. Abstract: In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL) through interactive learning. Despite its benefits, tool-augmented RL can still suffer from potential instability in the learning dynamics. In light of this challenge, we present a systematic approach to improving the training effectiveness and stability of tool-augmented RL for code-integrated reasoning. Specifically, we develop enhanced training strategies that balance exploration and stability, progressively building tool-use capabilities while improving reasoning performance. Through extensive experiments on five mainstream mathematical reasoning benchmarks, our model demonstrates significant performance improvements over multiple competitive baselines. Furthermore, we conduct an in-depth analysis of the mechanism and effect of code-integrated reasoning, revealing several key insights, such as the extension of model's capability boundaries and the simultaneous improvement of reasoning efficiency through code integration. All data and code for reproducing this work are available at: https://github.com/RUCAIBox/CIR.

Guiyang Hou,Xing Gao,Yuchuan Wu,Xiang Huang,Wenqi Zhang,Zhe Zheng,Yongliang Shen,Jialu Du,Fei Huang,Yongbin Li,Weiming Lu

Main category: cs.CL

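TL;DR: The paper proposes TimeHC-RL, a method for enhancing the cognitive abilities of large language models (LLMs) in social domains; experiments show that it outperforms the conventional System 2 RL approach.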

Details

Motivation: Current LLMs excel in domains that require System 2 cognition, such as mathematics and coding, but their cognitive development in social domains remains underexplored. Social domains require a blend of System 1 and System 2 cognitive modes and follow a distinct timeline. Method: Proposes Temporal-aware Hierarchical Cognitive Reinforcement Learning (TimeHC-RL) and validates it against five other post-training paradigms and two test-time intervention paradigms on eight datasets. Result: TimeHC-RL outperforms the System 2 RL method, enabling a 7B model to rival advanced models such as DeepSeek-R1 and OpenAI-O3. Conclusion: TimeHC-RL is an effective way to improve the social intelligence of LLMs, and the study yields valuable insights into post-training and test-time interventions. Abstract: Recently, Large Language Models (LLMs) have made significant progress in IQ-related domains that require careful thinking, such as mathematics and coding. However, enhancing LLMs' cognitive development in social domains, particularly from a post-training perspective, remains underexplored. Recognizing that the social world follows a distinct timeline and requires a richer blend of cognitive modes (from intuitive reactions (System 1) and surface-level thinking to deliberate thinking (System 2)) than mathematics, which primarily relies on System 2 cognition (careful, step-by-step reasoning), we introduce Temporal-aware Hierarchical Cognitive Reinforcement Learning (TimeHC-RL) for enhancing LLMs' social intelligence. In our experiments, we systematically explore improving LLMs' social intelligence and validate the effectiveness of the TimeHC-RL method, through five other post-training paradigms and two test-time intervention paradigms on eight datasets with diverse data patterns. Experimental results reveal the superiority of our proposed TimeHC-RL method compared to the widely adopted System 2 RL method. It gives the 7B backbone model wings, enabling it to rival the performance of advanced models like DeepSeek-R1 and OpenAI-O3. Additionally, the systematic exploration from post-training and test-time interventions perspectives to improve LLMs' social intelligence has uncovered several valuable insights.

[234] Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors

Andrea Pedrotti,Michele Papucci,Cristiano Ciaccio,Alessio Miaschi,Giovanni Puccetti,Felice Dell'Orletta,Andrea Esuli

Main category: cs.CL

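TL;DR: The paper presents a method for testing the robustness of machine-generated text (MGT) detectors, challenging existing detectors with adversarial attacks and analyzing their limitations.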

Details

Motivation: As generative AI and large language models (LLMs) advance, synthetic content becomes more realistic and the risk of malicious use (such as misinformation) grows, yet existing MGT detectors generalize poorly to real-world scenarios. Method: Fine-tunes language models with Direct Preference Optimization (DPO) to shift MGT style toward human-written text (HWT), challenging the detectors' reliance on stylistic cues. Result: Detectors are easily fooled with relatively few adversarial examples, and detection performance drops significantly. Conclusion: The study highlights the importance of improving detection methods and making them robust to unseen in-domain texts. Abstract: Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
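For readers unfamiliar with the objective named in entry [234], the following is a minimal sketch of the standard per-pair DPO loss. It is not the authors' training code; the function name, argument names, the default `beta`, and the idea that the human-written text plays the "chosen" role are assumptions made for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Inputs are sequence log-probabilities under the policy being tuned and
    under a frozen reference model. Assumption for this sketch: "chosen" is
    the human-written text and "rejected" is the machine-generated text.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the policy assigns relatively more probability to the chosen text than the reference model does, which is the mechanism that would pull generations toward the preferred style.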

[235] Limited-Resource Adapters Are Regularizers, Not Linguists

Marcell Fekete,Nathaniel R. Robinson,Ernests Lavrinovics,E. Djeride Jean-Baptiste,Raj Dabre,Johannes Bjerva,Heather Lent

Main category: cs.CL

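TL;DR: The paper studies adapter souping combined with cross-attention fine-tuning of a pre-trained machine translation model to improve performance on three low-resource Creole languages. Results show that adapter performance is unrelated to linguistic relatedness, and the adapters' benefit appears to lie in parameter regularization rather than information transfer.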

Details

Motivation: To explore how cross-lingual transfer can improve low-resource language technologies, particularly for Creole languages, which are related to multiple language groups. Method: Combines adapter souping with cross-attention fine-tuning of a pre-trained machine translation model and analyzes how adapter performance relates to linguistic relatedness. Result: Performance improves substantially, but adapter performance does not covary with linguistic relatedness, and randomly initialized adapters are equally effective. Conclusion: The adapters' benefit appears to lie in parameter regularization rather than information transfer, showing that neural methods do not always leverage linguistic knowledge in intuitive ways. Abstract: Cross-lingual transfer from related high-resource languages is a well-established strategy to enhance low-resource language technologies. Prior work has shown that adapters show promise for, e.g., improving low-resource machine translation (MT). In this work, we investigate an adapter souping method combined with cross-attention fine-tuning of a pre-trained MT model to leverage language transfer for three low-resource Creole languages, which exhibit relatedness to different language groups across distinct linguistic dimensions. Our approach improves performance substantially over baselines. However, we find that linguistic relatedness -- or even a lack thereof -- does not covary meaningfully with adapter performance. Surprisingly, our cross-attention fine-tuning approach appears equally effective with randomly initialized adapters, implying that the benefit of adapters in this setting lies in parameter regularization, and not in meaningful information transfer. We provide analysis supporting this regularization hypothesis. Our findings underscore the reality that neural language processing involves many success factors, and that not all neural methods leverage linguistic knowledge in intuitive ways.

[236] DEEPQUESTION: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance

Ali Khoramfar,Ali Ramezani,Mohammad Mahdi Mohajeri,Mohammad Javad Dousti,Majid Nili Ahmadabadi,Heshaam Faili

Main category: cs.CL

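TL;DR: DeepQuestion is a scalable automated framework that augments existing datasets based on Bloom's taxonomy and creates new questions, revealing performance drops of LLMs on higher-order reasoning tasks.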

Details

Motivation: LLMs excel on standard benchmarks but falter on real-world tasks, which calls for more comprehensive evaluation. Method: Augments datasets using Bloom's taxonomy and creates questions that trace original solution paths to assess higher-order reasoning. Result: LLM performance drops substantially on higher-order tasks (up to 70% accuracy loss), indicating gaps in deep reasoning. Conclusion: Cognitively diverse benchmarks are needed to advance LLMs; DeepQuestion and related datasets will be released upon acceptance of the paper. Abstract: LLMs often excel on standard benchmarks but falter on real-world tasks. We introduce DeepQuestion, a scalable automated framework that augments existing datasets based on Bloom's taxonomy and creates novel questions that trace original solution paths to probe evaluative and creative skills. Extensive experiments across ten open-source and proprietary models, covering both general-purpose and reasoning LLMs, reveal substantial performance drops (even up to 70% accuracy loss) on higher-order tasks, underscoring persistent gaps in deep reasoning. Our work highlights the need for cognitively diverse benchmarks to advance LLM progress. DeepQuestion and related datasets will be released upon acceptance of the paper.

[237] Don't Erase, Inform! Detecting and Contextualizing Harmful Language in Cultural Heritage Collections

Orfeas Menis Mastromichalakis,Jason Liartis,Kristina Rose,Antoine Isaac,Giorgos Stamou

Main category: cs.CL

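TL;DR: An AI tool was developed to detect offensive terms in cultural heritage data and provide insights into their historical background and contemporary perception.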

Details

Motivation: Cultural heritage data contain outdated or offensive descriptions that reflect historical biases, and a way to identify and address them is needed. Method: Combines a multilingual vocabulary, traditional NLP techniques, and large language models (LLMs) into a tool that can be integrated with major cultural heritage platforms. Result: The tool has processed over 7.9 million records, detecting contentious terms and providing contextual information about them. Conclusion: The tool aims to make biases visible and offer actionable insights rather than erase terms, promoting more inclusive and accessible cultural heritage collections. Abstract: Cultural Heritage (CH) data hold invaluable knowledge, reflecting the history, traditions, and identities of societies, and shaping our understanding of the past and present. However, many CH collections contain outdated or offensive descriptions that reflect historical biases. CH Institutions (CHIs) face significant challenges in curating these data due to the vast scale and complexity of the task. To address this, we develop an AI-powered tool that detects offensive terms in CH metadata and provides contextual insights into their historical background and contemporary perception. We leverage a multilingual vocabulary co-created with marginalized communities, researchers, and CH professionals, along with traditional NLP techniques and Large Language Models (LLMs). Available as a standalone web app and integrated with major CH platforms, the tool has processed over 7.9 million records, contextualizing the contentious terms detected in their metadata. Rather than erasing these terms, our approach seeks to inform, making biases visible and providing actionable insights for creating more inclusive and accessible CH collections.

[238] Localizing Persona Representations in LLMs

Celia Cintas,Miriam Rateike,Erik Miehling,Elizabeth Daly,Skyler Speakman

Main category: cs.CL

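TL;DR: The study examines how personas are encoded in large language models (LLMs) and how they are distributed in representation space, finding that personas diverge most in the final third of the decoder layers, that some ethical perspectives overlap, and that political ideologies are more distinct.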

Details

Motivation: To understand how LLMs internally encode different personas (such as values and beliefs) and how these are distributed, in order to improve the modulation of specific human traits in model outputs. Method: Uses dimension reduction and pattern recognition methods to analyze differences in persona encoding across LLM layers, and studies activation patterns within a selected layer. Result: Personas differ markedly in the final third of the decoder layers; ethical perspectives (such as moral nihilism and utilitarianism) overlap, while political ideologies (such as conservatism and liberalism) are more distinct. Conclusion: The study sheds light on how LLMs represent information internally and provides a basis for refining the modulation of specific human traits in model outputs. Abstract: We present a study on how and where personas -- defined by distinct sets of human characteristics, values, and beliefs -- are encoded in the representation space of large language models (LLMs). Using a range of dimension reduction and pattern recognition methods, we first identify the model layers that show the greatest divergence in encoding these representations. We then analyze the activations within a selected layer to examine how specific personas are encoded relative to others, including their shared and distinct embedding spaces. We find that, across multiple pre-trained decoder-only LLMs, the analyzed personas show large differences in representation space only within the final third of the decoder layers. We observe overlapping activations for specific ethical perspectives -- such as moral nihilism and utilitarianism -- suggesting a degree of polysemy. In contrast, political ideologies like conservatism and liberalism appear to be represented in more distinct regions. These findings help to improve our understanding of how LLMs internally represent information and can inform future efforts in refining the modulation of specific human traits in LLM outputs. Warning: This paper includes potentially offensive sample statements.

[239] Cross-Attention Speculative Decoding

Wei Zhong,Manasa Bharadwaj,Yixiao Wang,Nikhil Verma,Yipeng Ji,Chul Lee

Main category: cs.CL

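TL;DR: Beagle is a cross-attention-based Transformer decoder model for speculative decoding (SD) that performs on par with leading self-attention SD models while simplifying the architecture and improving training efficiency.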

Details

Motivation: Existing SD methods rely on tightly coupled self-attention Transformer decoders, which are architecturally complex and hard to generalize. Method: Proposes Budget EAGLE (Beagle), a cross-attention Transformer decoder that needs no auxiliary components, along with Two-Stage Block-Attention Training. Result: Beagle achieves inference speedups competitive with EAGLE-v2 and higher training efficiency across multiple LLMs and datasets. Conclusion: Beagle offers a simpler and efficient architectural alternative for SD. Abstract: Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across different models. We present Budget EAGLE (Beagle), the first, to our knowledge, cross-attention-based Transformer decoder SD model that achieves performance on par with leading self-attention SD models (EAGLE-v2) while eliminating the need for pooling or auxiliary components, simplifying the architecture, improving training efficiency, and maintaining stable memory usage during training-time simulation. To enable effective training of this novel architecture, we propose Two-Stage Block-Attention Training, a new method that achieves training stability and convergence efficiency in block-level attention scenarios. Extensive experiments across multiple LLMs and datasets show that Beagle achieves competitive inference speedups and higher training efficiency than EAGLE-v2, offering a strong alternative for architectures in speculative decoding.
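As background for entry [239], here is a minimal sketch of one step of generic greedy speculative decoding: a cheap draft model proposes several tokens and the target model keeps the longest prefix it agrees with. It does not reproduce Beagle's cross-attention drafter or its training method; `draft_next` and `target_next` are hypothetical toy stand-ins for real models.

```python
def speculative_step(context, draft_next, target_next, k=4):
    """One greedy speculative-decoding step; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2. The target model checks each proposed position
    #    (a real system does this in a single batched forward pass).
    accepted = []
    ctx = list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    # 3. Every draft token was accepted, so the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted

# Hypothetical toy models: the target counts upward by one; the draft agrees
# except when the last token is congruent to 2 modulo 3.
def target_next(ctx):
    return ctx[-1] + 1

def draft_next(ctx):
    return ctx[-1] + 1 if ctx[-1] % 3 != 2 else ctx[-1] + 2

print(speculative_step([0], draft_next, target_next))   # [1, 2, 3]
print(speculative_step([0], target_next, target_next))  # [1, 2, 3, 4, 5]
```

The output always matches what the target model would have produced greedily on its own; the speedup comes from how many draft tokens are accepted per target pass, which is why draft-target alignment matters.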

[240] A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings

Xiaoang Xu,Shuo Wang,Xu Han,Zhenghao Liu,Huijia Wu,Peipei Li,Zhiyuan Liu,Maosong Sun,Zhaofeng He

Main category: cs.CL

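TL;DR: A*-Thought uses a tree-search framework to efficiently compress the chains of thought of large reasoning models, improving both performance and efficiency.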

Details

Motivation: To address the efficiency loss caused by long chains of thought while avoiding the performance degradation that existing compression methods incur. Method: Combines the A* search algorithm with a cost function specific to the reasoning path, plus a bidirectional importance estimation mechanism that refines the search. Result: On math tasks, A*-Thought improves the performance of QwQ-32B by 2.39× under a low budget and reduces output length by nearly 50% under a high budget. Conclusion: A*-Thought balances performance and efficiency and is compatible with several other LRMs. Abstract: Large Reasoning Models (LRMs) achieve superior performance by extending the thought length. However, a lengthy thinking trajectory leads to reduced efficiency. Most of the existing methods are stuck in the assumption of overthinking and attempt to reason efficiently by compressing the Chain-of-Thought, but this often leads to performance degradation. To address this problem, we introduce A*-Thought, an efficient tree search-based unified framework designed to identify and isolate the most essential thoughts from the extensive reasoning chains produced by these models. It formulates the reasoning process of LRMs as a search tree, where each node represents a reasoning span in the giant reasoning space. By combining the A* search algorithm with a cost function specific to the reasoning path, it can efficiently compress the chain of thought and determine a reasoning path with high information density and low cost. In addition, we also propose a bidirectional importance estimation mechanism, which further refines this search process and enhances its efficiency beyond uniform sampling. Extensive experiments on several advanced math tasks show that A*-Thought effectively balances performance and efficiency over a huge search space. Specifically, A*-Thought can improve the performance of QwQ-32B by 2.39× with low-budget and reduce the length of the output token by nearly 50% with high-budget. The proposed method is also compatible with several other LRMs, demonstrating its generalization capability. The code can be accessed at: https://github.com/AI9Stars/AStar-Thought.
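Entry [240] builds on the classic A* search algorithm. The sketch below shows plain A* over a small weighted graph, as a reminder of how path cost and a heuristic combine to order the search. The paper's reasoning-specific cost function and bidirectional importance estimation are not reproduced; the graph, node names, and zero heuristic are hypothetical.

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Return (cost, path) for a cheapest path, or (inf, []) if none exists."""
    # Queue entries: (estimated total cost f = g + h, cost so far g, node, path).
    frontier = [(heuristic(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for nxt, step in neighbors(node):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), new_cost, nxt, path + [nxt]),
                )
    return float("inf"), []

# Hypothetical toy graph: nodes stand in for reasoning spans, edge weights for
# the cost of including the next span.
GRAPH = {
    "q": [("s1", 2.0), ("s2", 1.0)],
    "s1": [("ans", 1.0)],
    "s2": [("s3", 1.0)],
    "s3": [("ans", 3.0)],
    "ans": [],
}

cost, path = a_star("q", "ans", lambda n: GRAPH[n], lambda n: 0.0)
print(cost, path)  # 3.0 ['q', 's1', 'ans']
```

With a zero heuristic this reduces to uniform-cost search; an informative, admissible heuristic is what lets A* skip most of a large search space while still returning a cheapest path.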

[241] CREFT: Sequential Multi-Agent LLM for Character Relation Extraction

Ye Eun Chun,Taeyoon Hwang,Seung-won Hwang,Byung-Hak Kim

Main category: cs.CL

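TL;DR: CREFT is a framework based on large language models (LLMs) for extracting complex character relations from long-form narratives, significantly outperforming single-agent LLM baselines.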

Details

Motivation: Existing methods struggle with complex character relations in long-form narratives, limiting the efficiency of narrative analysis and script evaluation. Method: CREFT builds a base character graph through knowledge distillation, then iteratively refines character composition, relation extraction, role identification, and group assignments. Result: Experiments on a Korean drama dataset show that CREFT significantly outperforms single-agent LLM baselines in accuracy and completeness. Conclusion: By systematically visualizing character networks, CREFT makes narrative comprehension more efficient and offers substantial value to the entertainment, publishing, and educational sectors. Abstract: Understanding complex character relations is crucial for narrative analysis and efficient script evaluation, yet existing extraction methods often fail to handle long-form narratives with nuanced interactions. To address this challenge, we present CREFT, a novel sequential framework leveraging specialized Large Language Model (LLM) agents. First, CREFT builds a base character graph through knowledge distillation, then iteratively refines character composition, relation extraction, role identification, and group assignments. Experiments on a curated Korean drama dataset demonstrate that CREFT significantly outperforms single-agent LLM baselines in both accuracy and completeness. By systematically visualizing character networks, CREFT streamlines narrative comprehension and accelerates script review -- offering substantial benefits to the entertainment, publishing, and educational sectors.

[242] Bench4KE: Benchmarking Automated Competency Question Generation

Anna Sofia Lippolis,Minh Davide Ragagni,Paolo Ciancarini,Andrea Giovanni Nuzzolese,Valentina Presutti

Main category: cs.CL

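TL;DR: The paper introduces Bench4KE, a benchmarking system for knowledge engineering (KE) automation that aims to standardize the evaluation of tools that generate competency questions (CQs) with LLMs.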

Details

Motivation: The use of LLMs in KE automation currently lacks standardized evaluation, which undermines methodological rigour and the comparability of results. Method: Proposes Bench4KE, an API-based system that evaluates CQ generation tools using gold-standard datasets and similarity metrics. Result: A comparative analysis of four LLM-based CQ generation systems establishes a baseline for future research. Conclusion: Bench4KE can be extended to other KE automation tasks, and its code and data are public, promoting standardization and reproducibility. Abstract: The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation, a trend already evident in recent efforts developing LLM-based methods and tools for the automatic generation of Competency Questions (CQs). However, the evaluation of these tools lacks standardisation. This undermines the methodological rigour and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible API-based benchmarking system for KE automation. Its first release focuses on evaluating tools that generate CQs automatically. CQs are natural language questions used by ontology engineers to define the functional requirements of an ontology. Bench4KE provides a curated gold standard consisting of CQ datasets from four real-world ontology projects. It uses a suite of similarity metrics to assess the quality of the CQs generated. We present a comparative analysis of four recent CQ generation systems, which are based on LLMs, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing and drafting. Code and datasets are publicly available under the Apache 2.0 license.

[243] Improving Language and Modality Transfer in Translation by Character-level Modeling

Ioannis Tsiamas,David Dale,Marta R. Costa-jussà

Main category: cs.CL

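TL;DR: A character-based approach is proposed that uses cross-lingual and cross-modal knowledge transfer to improve translation for low-resource languages, achieving strong results in both text and speech translation.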

Details

Motivation: Current translation systems cover only 5% of the world's languages; extending them to low-resource languages requires data-efficient methods. Method: Uses the SONAR multilingual embedding space, trains a character-level encoder with a teacher-student approach, and adapts to speech translation with ASR data. Result: Outperforms traditional subword-based models in text translation for 75 languages, and reaches state-of-the-art speech translation on 33 languages. Conclusion: The character-level approach is advantageous in low-resource and zero-shot settings, and cross-modal knowledge transfer is effective. Abstract: Current translation systems, despite being highly multilingual, cover only 5% of the world's languages. Expanding language coverage to the long-tail of low-resource languages requires data-efficient methods that rely on cross-lingual and cross-modal knowledge transfer. To this end, we propose a character-based approach to improve adaptability to new languages and modalities. Our method leverages SONAR, a multilingual fixed-size embedding space with different modules for encoding and decoding. We use a teacher-student approach with parallel translation data to obtain a character-level encoder. Then, using ASR data, we train a lightweight adapter to connect a massively multilingual CTC ASR model (MMS), to the character-level encoder, potentially enabling speech translation from 1,000+ languages. Experimental results in text translation for 75 languages on FLORES+ demonstrate that our character-based approach can achieve better language transfer than traditional subword-based models, especially outperforming them in low-resource settings, and demonstrating better zero-shot generalizability to unseen languages. Our speech adaptation, maximizing knowledge transfer from the text modality, achieves state-of-the-art results in speech-to-text translation on the FLEURS benchmark on 33 languages, surpassing previous supervised and cascade models, albeit being a zero-shot model with minimal supervision from ASR data.

[244] NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization

Hyuntak Kim,Byung-Hak Kim

Main category: cs.CL

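TL;DR: NexusSum is a multi-agent LLM framework for summarizing long-form narratives; it processes text through a structured pipeline, requires no fine-tuning, and substantially improves summary quality.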

Details

Motivation: Summarizing long-form narratives (such as books, movies, and TV scripts) requires capturing intricate plotlines, character interactions, and thematic coherence, which existing LLMs struggle with. Method: (1) Dialogue-to-Description Transformation, which standardizes character dialogue and descriptive text into a unified format; (2) Hierarchical Multi-LLM Summarization, a structured pipeline that optimizes chunk processing and controls output length. Result: BERTScore (F1) improves by up to 30.0% across books, movies, and TV scripts, a new state of the art. Conclusion: Multi-agent LLMs handle long-form content effectively and offer a scalable approach to structured summarization across diverse storytelling domains. Abstract: Summarizing long-form narratives--such as books, movies, and TV scripts--requires capturing intricate plotlines, character interactions, and thematic coherence, a task that remains challenging for existing LLMs. We introduce NexusSum, a multi-agent LLM framework for narrative summarization that processes long-form text through a structured, sequential pipeline--without requiring fine-tuning. Our approach introduces two key innovations: (1) Dialogue-to-Description Transformation: A narrative-specific preprocessing method that standardizes character dialogue and descriptive text into a unified format, improving coherence. (2) Hierarchical Multi-LLM Summarization: A structured summarization pipeline that optimizes chunk processing and controls output length for accurate, high-quality summaries. Our method establishes a new state-of-the-art in narrative summarization, achieving up to a 30.0% improvement in BERTScore (F1) across books, movies, and TV scripts. These results demonstrate the effectiveness of multi-agent LLMs in handling long-form content, offering a scalable approach for structured summarization in diverse storytelling domains.

[245] GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Matryoshka Representation Learning and Hybrid Loss Training

Omer Nacar,Anis Koubaa,Serry Sibaee,Yasser Al-Habashi,Adel Ammar,Wadii Boulila

Main category: cs.CL

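TL;DR: The paper introduces GATE, a state-of-the-art model for Arabic semantic textual similarity (STS) that combines Matryoshka Representation Learning with hybrid loss training to substantially improve performance.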

Details

Motivation: Research on STS for Arabic is limited by the lack of high-quality datasets and pre-trained models; GATE aims to fill this gap. Method: GATE uses Matryoshka Representation Learning and hybrid loss training with Arabic triplet datasets for natural language inference. Result: GATE outperforms larger models, including OpenAI's, on STS tasks, with a 20-25% performance improvement. Conclusion: GATE captures the distinctive semantic nuances of Arabic and provides an efficient solution for Arabic STS. Abstract: Semantic textual similarity (STS) is a critical task in natural language processing (NLP), enabling applications in retrieval, clustering, and understanding semantic relationships between texts. However, research in this area for the Arabic language remains limited due to the lack of high-quality datasets and pre-trained models. This scarcity of resources has restricted the accurate evaluation and advance of semantic similarity in Arabic text. This paper introduces General Arabic Text Embedding (GATE) models that achieve state-of-the-art performance on the Semantic Textual Similarity task within the MTEB benchmark. GATE leverages Matryoshka Representation Learning and a hybrid loss training approach with Arabic triplet datasets for Natural Language Inference, which are essential for enhancing model performance in tasks that demand fine-grained semantic understanding. GATE outperforms larger models, including OpenAI, with a 20-25% performance improvement on STS benchmarks, effectively capturing the unique semantic nuances of Arabic.
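Entry [245] relies on Matryoshka Representation Learning, whose practical payoff is that a prefix of an embedding can be used as a smaller embedding. The sketch below shows only that inference-time step (truncate, renormalize, compare); it does not cover GATE's training or hybrid loss, and the function names and example vectors are hypothetical.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine_at_dim(a, b, dim):
    """Cosine similarity of two embeddings using only their first `dim` dims."""
    ua = truncate_and_normalize(a, dim)
    ub = truncate_and_normalize(b, dim)
    return sum(x * y for x, y in zip(ua, ub))

# Hypothetical 4-dimensional embeddings of two sentences.
emb_a = [1.0, 0.0, 1.0, 0.0]
emb_b = [1.0, 0.0, 0.0, 1.0]
for dim in (2, 4):
    print(dim, round(cosine_at_dim(emb_a, emb_b, dim), 3))
```

A Matryoshka-trained model is optimized so that similarity scores at small prefix sizes stay close to the full-size scores, which lets an application trade accuracy for storage and speed without retraining.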

Junzhuo Li,Bo Wang,Xiuze Zhou,Peijie Jiang,Jia Liu,Xuming Hu

Main category: cs.CL

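TL;DR: A cross-level attribution algorithm is proposed for analyzing dynamic routing-expert interactions in sparse MoE architectures, revealing the efficiency and collaboration patterns of MoE models.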

Details

Motivation: Existing attribution methods cannot capture dynamic routing-expert interactions in sparse MoE architectures, so a new analysis method is needed. Method: Proposes a cross-level attribution algorithm and compares sparse MoE architectures (such as Qwen 1.5-MoE) with dense models (such as Qwen 1.5-7B). Result: MoE models achieve 37% higher per-layer efficiency through a "mid-activation, late-amplification" pattern, and the analysis reveals a "basic-refinement" framework and semantic-driven routing. Conclusion: The study offers new insights into MoE interpretability and proposes design principles that balance efficiency, specialization, and robustness. Abstract: The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture dynamic routing-expert interactions in sparse MoE architectures. To address this issue, we propose a cross-level attribution algorithm to analyze sparse MoE architectures (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense models (Qwen 1.5-7B, Llama-7B, Mixtral-7B). Results show MoE models achieve 37% higher per-layer efficiency via a "mid-activation, late-amplification" pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a "basic-refinement" framework--shared experts handle general tasks (entity recognition), while routed experts specialize in domain-specific processing (geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r=0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: deep Qwen 1.5-MoE mitigates expert failures (e.g., 43% MRR drop in geographic tasks when blocking top-10 experts) through shared expert redundancy, whereas shallow OLMoE suffers severe degradation (76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.

[247] Explainable Depression Detection using Masked Hard Instance Mining

Patawee Prakrankamanant,Shinji Watanabe,Ekapol Chuangsuwanich

Main category: cs.CL

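TL;DR: The paper proposes MHIM, a method that improves the explainability of text-based depression detection by masking attention weights so that the model attends to a wider range of salient features.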

Details

Motivation: Current depression detection models provide predictions but little explanation for them, which undermines trust in the system. Method: Applies Masked Hard Instance Mining (MHIM), which masks attention weights to force the model to attend to a wider range of salient features. Result: On Thai (Thai-Maywe) and English (DAIC-WOZ) datasets, MHIM significantly improves both prediction accuracy and explainability. Conclusion: MHIM improves both the performance and the explainability of depression detection models. Abstract: This paper addresses the critical need for improved explainability in text-based depression detection. While offering predictive outcomes, current solutions often overlook the understanding of model predictions which can hinder trust in the system. We propose the use of Masked Hard Instance Mining (MHIM) to enhance the explainability in the depression detection task. MHIM strategically masks attention weights within the model, compelling it to distribute attention across a wider range of salient features. We evaluate MHIM on two datasets representing distinct languages: Thai (Thai-Maywe) and English (DAIC-WOZ). Our results demonstrate that MHIM significantly improves performance in terms of both prediction accuracy and explainability metrics.
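The abstract of entry [247] does not specify how attention weights are masked. One plausible reading, sketched below purely as an assumption, is to zero out the largest attention weights and renormalize the remainder, so that lower-ranked tokens must carry the prediction. The function name, the top-k rule, and the example weights are hypothetical and may differ from the authors' procedure.

```python
def mask_top_attention(weights, k=1):
    """Zero out the k largest attention weights and renormalize the rest.

    Assumption for this sketch: masking targets the highest-weighted positions.
    """
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    top = set(ranked[:k])
    masked = [0.0 if i in top else w for i, w in enumerate(weights)]
    total = sum(masked)
    return [w / total for w in masked] if total > 0 else masked

# Hypothetical attention distribution over three tokens.
print(mask_top_attention([0.6, 0.3, 0.1], k=1))
```

In this toy case the dominant token is removed and the remaining mass is redistributed in proportion to the original weights, which illustrates how masking can push a model to use evidence beyond a single token.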

[248] When Harry Meets Superman: The Role of The Interlocutor in Persona-Based Dialogue Generation

Daniela Occhipinti,Marco Guerini,Malvina Nissim

Main category: cs.CL

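TL;DR: The paper studies how dialogue models use the personas of both speakers to generate more consistent and diverse responses, and examines how well models adapt to unfamiliar interlocutors and topics.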

Details

Motivation: Prior work focuses on aligning dialogues with the provided persona, while adaptation to the interlocutor's profile remains underexplored. Method: Evaluates how well models use both speakers' personas when generating dialogue, and how they perform across topics and interlocutors, using LLM-as-a-judge and human evaluation. Result: Access to the interlocutor's persona improves recognition of the target speaker, but models perform worse with unfamiliar interlocutors. Conclusion: In zero-shot settings, models tend to copy persona details, which makes identification easier but trivialises the task. Abstract: Endowing dialogue agents with persona information has proven to significantly improve the consistency and diversity of their generations. While much focus has been placed on aligning dialogues with provided personas, the adaptation to the interlocutor's profile remains largely underexplored. In this work, we investigate three key aspects: (1) a model's ability to align responses with both the provided persona and the interlocutor's; (2) its robustness when dealing with familiar versus unfamiliar interlocutors and topics, and (3) the impact of additional fine-tuning on specific persona-based dialogues. We evaluate dialogues generated with diverse speaker pairings and topics, framing the evaluation as an author identification task and employing both LLM-as-a-judge and human evaluations. By systematically masking or disclosing information about the interlocutor, we assess its impact on dialogue generation. Results show that access to the interlocutor's persona improves the recognition of the target speaker, while masking it does the opposite. Although models generalise well across topics, they struggle with unfamiliar interlocutors. Finally, we found that in zero-shot settings, LLMs often copy biographical details, facilitating identification but trivialising the task.

[249] Harnessing Large Language Models for Scientific Novelty Detection

Yan Liu,Zonglin Yang,Soujanya Poria,Thanh-Son Nguyen,Erik Cambria

Main category: cs.CL

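TL;DR: This paper proposes using large language models (LLMs) for scientific novelty detection (ND) and builds two new datasets. The datasets are constructed by extracting relations between papers, and a lightweight retriever is trained to improve novelty detection.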

Details

Motivation: Science is expanding rapidly, but the lack of a suitable benchmark dataset and the limitations of existing NLP techniques hinder research on novelty detection. Method: Uses LLMs to build the datasets by extracting closure sets of related papers and summarizing their main ideas; trains a lightweight retriever to align ideas with similar conception. Result: Experiments show the method outperforms others on the proposed benchmark datasets. Conclusion: The method offers an efficient and accurate solution for scientific novelty detection, and code and data are publicly released. Abstract: In an era of exponential scientific growth, identifying novel research ideas is crucial and challenging in academia. Despite potential, the lack of an appropriate benchmark dataset hinders the research of novelty detection. More importantly, simply adopting existing NLP technologies, e.g., retrieving and then cross-checking, is not a one-size-fits-all solution due to the gap between textual similarity and idea conception. In this paper, we propose to harness large language models (LLMs) for scientific novelty detection (ND), associated with two new datasets in marketing and NLP domains. To construct the considerate datasets for ND, we propose to extract closure sets of papers based on their relationship, and then summarize their main ideas based on LLMs. To capture idea conception, we propose to train a lightweight retriever by distilling the idea-level knowledge from LLMs to align ideas with similar conception, enabling efficient and accurate idea retrieval for LLM novelty detection. Experiments show our method consistently outperforms others on the proposed benchmark datasets for idea retrieval and ND tasks. Codes and data are available at https://anonymous.4open.science/r/NoveltyDetection-10FB/.

[250] Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

Nikita Martynov,Anastasia Mordasheva,Dmitriy Gorbetskiy,Danil Astafurov,Ulyana Isaeva,Elina Basyrova,Sergey Skachkov,Victoria Berestova,Nikolay Ivanov,Valeriia Zanina,Alena Fenogenova

Main category: cs.CL

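TL;DR: POLLUX is an open-source benchmark for evaluating the generative capabilities of Russian-language large language models (LLMs), offering a transparent, standardized evaluation methodology.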

Details

Motivation: Traditional evaluation relies on time-consuming side-by-side human comparisons and lacks transparency and scalability; POLLUX aims to address this. Method: Defines detailed criteria for 35 task types, develops a scoring protocol, and trains LLM-as-a-Judge evaluators (7B and 32B) to assess generated outputs. Result: POLLUX contains 2,100 manually written prompts covering diverse generative domains and provides scalable, transparent evaluation tools. Conclusion: POLLUX offers an efficient, interpretable evaluation method for LLM development as an alternative to traditional human evaluation. Abstract: We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.

[251] Interpretable phenotyping of Heart Failure patients with Dutch discharge letters

Vittorio Torri,Machteld J. Boonstra,Marielle C. van de Veerdonk,Deborah N. Kalkman,Alicia Uijl,Francesca Ieva,Ameen Abu-Hanna,Folkert W. Asselbergs,Iacer Calixto

Main category: cs.CL

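TL;DR: The study evaluates models for phenotyping heart failure (HF) patients by left ventricular ejection fraction (LVEF) class using structured and unstructured data; discharge letters prove the most informative data source, and Aug-Linear models perform well on both accuracy and interpretability.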

Details

Motivation: HF patients present with diverse phenotypes that affect treatment and prognosis, so LVEF-based classification models are needed to support clinical decisions. Method: Analyzes HF hospitalizations at two hospitals from 2015 to 2023 and compares BERT-based and Aug-Linear models on performance and interpretability. Result: BERT-based and Aug-Linear models perform best (AUC = 0.84 and 0.81, respectively), and Aug-Linear explanations align more closely with clinicians' annotations. Conclusion: Discharge letters are the best data source for HF phenotyping, and Aug-Linear models combine strong performance with interpretability, making them suitable for transparent clinical decision-making. Abstract: Objective: Heart failure (HF) patients present with diverse phenotypes affecting treatment and prognosis. This study evaluates models for phenotyping HF patients based on left ventricular ejection fraction (LVEF) classes, using structured and unstructured data, assessing performance and interpretability. Materials and Methods: The study analyzes all HF hospitalizations at both Amsterdam UMC hospitals (AMC and VUmc) from 2015 to 2023 (33,105 hospitalizations, 16,334 patients). Data from AMC were used for model training, and from VUmc for external validation. The dataset was unlabelled and included tabular clinical measurements and discharge letters. Silver labels for LVEF classes were generated by combining diagnosis codes, echocardiography results, and textual mentions. Gold labels were manually annotated for 300 patients for testing. Multiple Transformer-based (black-box) and Aug-Linear (white-box) models were trained and compared with baselines on structured and unstructured data. To evaluate interpretability, two clinicians annotated 20 discharge letters by highlighting information they considered relevant for LVEF classification. These were compared to SHAP and LIME explanations from black-box models and the inherent explanations of Aug-Linear models. Results: BERT-based and Aug-Linear models, using discharge letters alone, achieved the highest classification results (AUC=0.84 for BERT, 0.81 for Aug-Linear on external validation), outperforming baselines. Aug-Linear explanations aligned more closely with clinicians' explanations than post-hoc explanations on black-box models. Conclusions: Discharge letters emerged as the most informative source for phenotyping HF patients. Aug-Linear models matched black-box performance while providing clinician-aligned interpretability, supporting their use in transparent clinical decision-making.

[252] Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization

Utsav Maskey,Chencheng Zhu,Usman Naseem

Main category: cs.CL

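TL;DR: The paper explores the potential of large language models (LLMs) for cryptanalysis, evaluating them on encrypted texts to reveal their strengths, limitations, and security risks.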

Details

Motivation: Cryptanalysis is a critical area for data security and encryption, yet it has not been thoroughly studied in LLM evaluations; the paper aims to fill this gap. Method: Introduces a benchmark dataset of diverse plaintexts paired with their encrypted versions and evaluates multiple LLMs in zero-shot and few-shot settings for decryption accuracy and semantic comprehension. Result: LLMs show some capability in side-channel communication but are also susceptible to jailbreaking attacks. Conclusion: The study highlights the dual-use nature of LLMs in security contexts and contributes to the discussion on AI safety and security. Abstract: Recent advancements in Large Language Models (LLMs) have transformed natural language understanding and generation, leading to extensive benchmarking across diverse tasks. However, cryptanalysis, a critical area for data security and encryption, has not yet been thoroughly explored in LLM evaluations. To address this gap, we evaluate the cryptanalytic potential of state-of-the-art LLMs on encrypted texts generated using a range of cryptographic algorithms. We introduce a novel benchmark dataset comprising diverse plain texts spanning various domains, lengths, writing styles, and topics, paired with their encrypted versions. Using zero-shot and few-shot settings, we assess multiple LLMs for decryption accuracy and semantic comprehension across different encryption schemes. Our findings reveal key insights into the strengths and limitations of LLMs in side-channel communication while raising concerns about their susceptibility to jailbreaking attacks. This research highlights the dual-use nature of LLMs in security contexts and contributes to the ongoing discussion on AI safety and security.

[253] The Hallucination Dilemma: Factuality-Aware Reinforcement Learning for Large Reasoning Models

Junyi Li,Hwee Tou Ng

Main category: cs.CL

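TL;DR: The paper proposes the FSPO algorithm, which uses step-wise factuality verification to reduce the hallucinations that RL fine-tuning induces in LLMs and to improve reasoning accuracy.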

Details

Motivation: The authors find that RL fine-tuning significantly increases hallucinations in LLMs, undermining the reliability of reasoning. Method: Proposes FSPO, which verifies factuality at each reasoning step and dynamically adjusts advantage values to reward factual correctness. Result: Experiments show FSPO reduces hallucinations and improves reasoning accuracy, substantially improving reliability. Conclusion: FSPO offers an effective remedy for hallucinations in RL fine-tuning and improves LLM reasoning performance. Abstract: Large language models (LLMs) have significantly advanced in reasoning tasks through reinforcement learning (RL) optimization, achieving impressive capabilities across various challenging benchmarks. However, our empirical analysis reveals a critical drawback: reasoning-oriented RL fine-tuning significantly increases the prevalence of hallucinations. We theoretically analyze the RL training dynamics, identifying high-variance gradient, entropy-induced randomness, and susceptibility to spurious local optima as key factors leading to hallucinations. To address this drawback, we propose Factuality-aware Step-wise Policy Optimization (FSPO), an innovative RL fine-tuning algorithm incorporating explicit factuality verification at each reasoning step. FSPO leverages automated verification against given evidence to dynamically adjust token-level advantage values, incentivizing factual correctness throughout the reasoning process. Experiments across mathematical reasoning and hallucination benchmarks using Qwen2.5 and Llama models demonstrate that FSPO effectively reduces hallucinations while enhancing reasoning accuracy, substantially improving both reliability and performance.
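
The abstract above describes adjusting token-level advantages according to step-wise factuality checks but gives no formula. The sketch below is therefore only an illustrative shaping rule under stated assumptions, not FSPO itself: tokens in a verified step receive a fixed bonus and tokens in an unverified step a fixed penalty. The function name `adjust_advantages` and the parameters `bonus` and `penalty` are hypothetical.

```python
def adjust_advantages(token_advantages, token_steps, step_is_factual,
                      bonus=0.5, penalty=1.0):
    """Shift each token's advantage by the factuality of its reasoning step.

    token_advantages: advantage value per generated token
    token_steps:      index of the reasoning step each token belongs to
    step_is_factual:  verifier verdict (True/False) per reasoning step
    Assumption: additive bonus/penalty is a stand-in for the paper's rule.
    """
    adjusted = []
    for advantage, step in zip(token_advantages, token_steps):
        if step_is_factual[step]:
            adjusted.append(advantage + bonus)
        else:
            adjusted.append(advantage - penalty)
    return adjusted


# Three tokens: two in step 0 (verified), one in step 1 (not verified).
adjusted = adjust_advantages([1.0, 1.0, 1.0], [0, 0, 1], [True, False])
```

The point of the sketch is only that the reward signal is applied per reasoning step rather than once per response, so an unsupported step is penalized even when the final answer is correct.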

[254] Disentangling Language and Culture for Evaluating Multilingual Large Language Models

Jiahao Ying,Wei Tang,Yiran Zhao,Yixin Cao,Yu Rong,Wenxuan Zhang

Main category: cs.CL

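TL;DR: This paper proposes a Dual Evaluation Framework for comprehensively assessing the multilingual capabilities of LLMs, reveals a cultural-linguistic synergy phenomenon, and challenges the view that LLMs perform uniformly across languages.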

Details

Motivation: Existing evaluations do not adequately account for how linguistic and cultural context affect LLM performance, so a more comprehensive framework is needed. Method: Decomposes evaluation along two dimensions, linguistic medium and cultural context, to build the dual framework, and uses interpretability probing to analyze neuron activation. Result: LLMs perform better when culture and language are aligned, and the proportion of specific activated neurons may serve as an indicator of multilingual performance. Conclusion: The study underscores the importance of cultural and linguistic context in LLM evaluation and offers a new evaluation perspective. Abstract: This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions within both native and cross-cultural contexts cross-lingually. Extensive evaluations are conducted on a wide range of models, revealing a notable "Cultural-Linguistic Synergy" phenomenon, where models exhibit better performance when questions are culturally aligned with the language. This phenomenon is further explored through interpretability probing, which shows that a higher proportion of specific neurons are activated in a language's cultural context. This activation proportion could serve as a potential indicator for evaluating multilingual performance during model training. Our findings challenge the prevailing notion that LLMs, primarily trained on English data, perform uniformly across languages and highlight the necessity of culturally and linguistically aware model evaluations. Our code can be found at https://yingjiahao14.github.io/Dual-Evaluation/.

[255] Efficient Text Encoders for Labor Market Analysis

Jens-Joris Decorte,Jeroen Van Hautte,Chris Develder,Thomas Demeester

Main category: cs.CL

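TL;DR: Proposes ConTeXT-match, a contrastive learning method for skill classification that significantly improves efficiency and performance, and introduces the new Skill-XL benchmark and the improved JobBERT V2 model.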

Details

Motivation: Labor market analysis depends on extracting information from job advertisements, but existing methods rely on computationally expensive large language models. Method: Proposes ConTeXT-match, a lightweight bi-encoder model trained with contrastive learning and token-level attention, and introduces the Skill-XL benchmark and the JobBERT V2 model. Result: The models are efficient, accurate, and scalable on skill extraction and job title normalization. Conclusion: The approach suits large-scale, real-time labor market analysis and significantly improves on existing techniques. Abstract: Labor market analysis relies on extracting insights from job advertisements, which provide valuable yet unstructured information on job titles and corresponding skill requirements. While state-of-the-art methods for skill extraction achieve strong performance, they depend on large language models (LLMs), which are computationally expensive and slow. In this paper, we propose ConTeXT-match, a novel contrastive learning approach with token-level attention that is well-suited for the extreme multi-label classification task of skill classification. ConTeXT-match significantly improves skill extraction efficiency and performance, achieving state-of-the-art results with a lightweight bi-encoder model. To support robust evaluation, we introduce Skill-XL, a new benchmark with exhaustive, sentence-level skill annotations that explicitly address the redundancy in the large label space. Finally, we present JobBERT V2, an improved job title normalization model that leverages extracted skills to produce high-quality job title representations. Experiments demonstrate that our models are efficient, accurate, and scalable, making them ideal for large-scale, real-time labor market analysis.

[256] Are Optimal Algorithms Still Optimal? Rethinking Sorting in LLM-Based Pairwise Ranking with Batching and Caching

Juan Wisznia,Cecilia Bolaños,Juan Tollo,Giovanni Marraffini,Agustín Gianolini,Noe Hsueh,Luciano Del Corro

Main category: cs.CL

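TL;DR: Proposes a new framework for analyzing sorting algorithms whose cost model counts LLM inferences rather than pairwise comparisons, and emphasizes batching and caching to reduce inference cost.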

Details

Motivation: Classical efficiency metrics based on comparison counts break down when LLM inference cost dominates, so the cost model needs to be redefined. Method: Analyzes LLM inference cost and proposes optimizations such as batching and caching. Result: Algorithms that are optimal in the classical setting can lose efficiency when LLM inference cost dominates. Conclusion: The framework gives a more accurate way to assess the efficiency of sorting algorithms in inference-dominated settings. Abstract: We introduce a novel framework for analyzing sorting algorithms in pairwise ranking prompting (PRP), re-centering the cost model around LLM inferences rather than traditional pairwise comparisons. While classical metrics based on comparison counts have traditionally been used to gauge efficiency, our analysis reveals that expensive LLM inferences overturn these predictions; accordingly, our framework encourages strategies such as batching and caching to mitigate inference costs. We show that algorithms optimal in the classical setting can lose efficiency when LLM inferences dominate the cost under certain optimizations.
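
To make the distinction between comparisons and inferences concrete, here is a minimal sketch of the caching half of the idea (batching is not shown). It wraps a pairwise preference function in a cache and counts only cache misses as inferences. The class `CachedComparator`, the stand-in `fake_llm_prefers`, and the toy relevance table are all hypothetical; a real PRP setup would call an LLM where the stand-in is used.

```python
from functools import cmp_to_key


class CachedComparator:
    """Pairwise comparator that counts only uncached calls as inferences."""

    def __init__(self, prefers):
        self.prefers = prefers   # prefers(a, b) -> True if a ranks before b
        self.cache = {}
        self.inferences = 0

    def compare(self, a, b):
        if a == b:
            return 0
        # Store each unordered pair once, under a canonical (lo, hi) key.
        lo, hi = (a, b) if a < b else (b, a)
        if (lo, hi) not in self.cache:
            self.inferences += 1
            self.cache[(lo, hi)] = self.prefers(lo, hi)
        lo_first = self.cache[(lo, hi)]
        a_first = lo_first if a == lo else not lo_first
        return -1 if a_first else 1


# Stand-in for an LLM judgment: a fixed relevance table (assumption).
relevance = {"d1": 3, "d2": 2, "d3": 1}


def fake_llm_prefers(a, b):
    return relevance[a] > relevance[b]


comparator = CachedComparator(fake_llm_prefers)
docs = ["d3", "d1", "d2"]
ranked = sorted(docs, key=cmp_to_key(comparator.compare))
first_pass_cost = comparator.inferences
# Re-ranking the same documents reuses cached judgments.
ranked_again = sorted(reversed(docs), key=cmp_to_key(comparator.compare))
second_pass_cost = comparator.inferences - first_pass_cost
```

Under this cost model the second sort performs comparisons but no new inferences, which is why comparison counts alone can misjudge an algorithm's real cost.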

[257] PRISM: A Framework for Producing Interpretable Political Bias Embeddings with Political-Aware Cross-Encoder

Yiqun Sun,Qiang Huang,Anthony K. H. Tung,Jun Yu

Main category: cs.CL

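TL;DR: PRISM is a new framework for producing interpretable political bias embeddings; it extracts fine-grained political topics and bias indicators to improve the interpretability and classification performance of the embeddings.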

Details

Motivation: Existing text embedding models capture ideological nuances poorly, limiting their effectiveness on tasks that require understanding political bias. Method: PRISM has two stages: (1) mining political topics and bias indicators from weakly labeled news data; (2) assigning structured bias scores to news articles based on these indicators. Result: PRISM outperforms existing models on political bias classification and provides highly interpretable representations. Conclusion: By explicitly tying embeddings to bias-revealing dimensions, PRISM improves interpretability and predictive power and supports diversified retrieval and ideological analysis. Abstract: Semantic Text Embedding is a fundamental NLP task that encodes textual content into vector representations, where proximity in the embedding space reflects semantic similarity. While existing embedding models excel at capturing general meaning, they often overlook ideological nuances, limiting their effectiveness in tasks that require an understanding of political bias. To address this gap, we introduce PRISM, the first framework designed to Produce inteRpretable polItical biaS eMbeddings. PRISM operates in two key stages: (1) Controversial Topic Bias Indicator Mining, which systematically extracts fine-grained political topics and their corresponding bias indicators from weakly labeled news data, and (2) Cross-Encoder Political Bias Embedding, which assigns structured bias scores to news articles based on their alignment with these indicators. This approach ensures that embeddings are explicitly tied to bias-revealing dimensions, enhancing both interpretability and predictive power. Through extensive experiments on two large-scale datasets, we demonstrate that PRISM outperforms state-of-the-art text embedding models in political bias classification while offering highly interpretable representations that facilitate diversified retrieval and ideological analysis. The source code is available at https://github.com/dukesun99/ACL-PRISM.

[258] MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR

Dimitrios Damianos,Georgios Paraskevopoulos,Alexandros Potamianos

Main category: cs.CL

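TL;DR: This paper proposes MSDA, a multi-stage domain adaptation framework that combines self-supervised and semi-supervised techniques and significantly improves ASR performance for low-resource languages and weakly supervised settings.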

Details

Motivation: To address domain adaptation for automatic speech recognition (ASR) in low-resource languages and weakly supervised settings, and to improve model robustness and generalization. Method: Uses the Multi-Stage Domain Adaptation (MSDA) framework, a two-stage adaptation pipeline that combines self-supervised learning with semi-supervised techniques. Result: Experiments show MSDA achieves state-of-the-art ASR results and significantly outperforms existing methods, especially for low-resource languages and weakly supervised settings. Conclusion: MSDA offers an efficient and robust solution for unsupervised domain adaptation in ASR and shows that a cascading approach is necessary when combining self-supervision with self-training. Abstract: In this work, we investigate the Meta PL unsupervised domain adaptation framework for Automatic Speech Recognition (ASR). We introduce a Multi-Stage Domain Adaptation pipeline (MSDA), a sample-efficient, two-stage adaptation approach that integrates self-supervised learning with semi-supervised techniques. MSDA is designed to enhance the robustness and generalization of ASR models, making them more adaptable to diverse conditions. It is particularly effective for low-resource languages like Greek and in weakly supervised scenarios where labeled data is scarce or noisy. Through extensive experiments, we demonstrate that Meta PL can be applied effectively to ASR tasks, achieving state-of-the-art results, significantly outperforming state-of-the-art methods, and providing more robust solutions for unsupervised domain adaptation in ASR. Our ablations highlight the necessity of utilizing a cascading approach when combining self-supervision with self-training.

[259] Multiple LLM Agents Debate for Equitable Cultural Alignment

Dayeon Ki,Rachel Rudinger,Tianyi Zhou,Marine Carpuat

Main category: cs.CL

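TL;DR: Proposes a multi-agent debate framework that exploits the complementary strengths of multiple LLMs to improve cultural adaptability; experiments show that debate improves accuracy and parity across cultural groups.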

Details

Motivation: LLMs need to adapt to diverse cultural contexts to serve communities worldwide, but existing approaches are mostly single-LLM and single-turn and cannot exploit the strengths of multiple LLMs. Method: Introduces a Multi-Agent Debate framework with two variants, debate only and dynamic choice between self-reflection and debate, evaluated on a social etiquette benchmark covering 75 countries. Result: Debate significantly improves accuracy and cultural group parity; small models (7-9B) reach accuracy comparable to a larger model (27B) through debate. Conclusion: Multi-agent debate is an effective way to improve the cultural adaptability of LLMs, particularly for small models. Abstract: Large Language Models (LLMs) need to adapt their predictions to diverse cultural contexts to benefit diverse communities across the world. While previous efforts have focused on single-LLM, single-turn approaches, we propose to exploit the complementary strengths of multiple LLMs to promote cultural adaptability. We introduce a Multi-Agent Debate framework, where two LLM-based agents debate over a cultural scenario and collaboratively reach a final decision. We propose two variants: one where the LLM agents exclusively debate and another where they dynamically choose between self-reflection and debate during their turns. We evaluate these approaches on 7 open-weight LLMs (and 21 LLM combinations) using the NormAd-ETI benchmark for social etiquette norms in 75 countries. Experiments show that debate improves both overall accuracy and cultural group parity over single-LLM baselines. Notably, multi-agent debate enables relatively small LLMs (7-9B) to achieve accuracies comparable to that of a much larger model (27B parameters).

[260] TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis

Xiaorui Wu,Xiaofeng Mao,Fei Li,Xin Zhang,Xuanhong Li,Chong Teng,Donghong Ji,Zhuang Li

Main category: cs.CL

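TL;DR: The paper proposes an analysis framework that measures the risk coverage of alignment datasets along three dimensions (lexical diversity, malicious intent, and jailbreak tactics) and introduces TRIDENT, a pipeline that produces two datasets, TRIDENT-Core and TRIDENT-Edge. Models fine-tuned on TRIDENT-Edge substantially outperform the baseline in reducing harmful content and attack success rate.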

Details

Motivation: Existing safety alignment datasets have incomplete risk coverage: they focus mainly on lexical diversity and neglect other critical dimensions, leaving models prone to generating harmful content or being exploited. Method: Proposes TRIDENT, which uses persona-based, zero-shot LLM generation to build diverse instruction datasets spanning the three dimensions (TRIDENT-Core and TRIDENT-Edge), and fine-tunes a model on them. Result: The fine-tuned model lowers Harm Score by 14.29% on average and Attack Success Rate by 20% relative to the baseline. Conclusion: The TRIDENT framework and datasets significantly improve model safety and suggest a new direction for building alignment datasets. Abstract: Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework to systematically measure the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We further introduce TRIDENT, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311 examples, and TRIDENT-Edge, with 18,773 examples. Fine-tuning Llama 3.1-8B on TRIDENT-Edge demonstrates substantial improvements, achieving an average 14.29% reduction in Harm Score, and a 20% decrease in Attack Success Rate compared to the best-performing baseline model fine-tuned on the WildBreak dataset.

[261] A Simple Linear Patch Revives Layer-Pruned Large Language Models

Xinrui Chen,Haoli Bai,Tao Yuan,Ruikang Liu,Kang Zhao,Xianzhi Yu,Lu Hou,Tian Guan,Yonghong He,Chun Yuan

Main category: cs.CL

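TL;DR: LinearPatch is a simple plug-and-play technique that uses a Hadamard transformation and channel-wise scaling to fix the activation magnitude mismatch that follows layer pruning, significantly improving the performance of pruned models.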

Details

Motivation: Existing layer pruning methods suffer significant performance drops because of mismatched activation magnitudes; solving this would make pruned models more practical. Method: Uses a Hadamard transformation to suppress outliers and channel-wise scaling to align activation magnitudes, fusing both operations into a single matrix that serves as a patch. Result: When pruning 5 layers of LLaMA-3-8B, the method retains 94.15% of performance, 4% better than existing methods; further optimization raises this to 95.16%. Conclusion: LinearPatch effectively mitigates the performance loss from layer pruning and is efficient and practical. Abstract: Layer pruning has become a popular technique for compressing large language models (LLMs) due to its simplicity. However, existing layer pruning methods often suffer from significant performance drops. We identify that this degradation stems from the mismatch of activation magnitudes across layers and tokens at the pruning interface. To address this, we propose LinearPatch, a simple plug-and-play technique to revive the layer-pruned LLMs. The proposed method adopts Hadamard transformation to suppress massive outliers in particular tokens, and channel-wise scaling to align the activation magnitudes. These operations can be fused into a single matrix, which functions as a patch to bridge the pruning interface with negligible inference overhead. LinearPatch retains up to 94.15% performance of the original model when pruning 5 layers of LLaMA-3-8B on the question answering benchmark, surpassing existing state-of-the-art methods by 4%. In addition, the patch matrix can be further optimized with memory efficient offline knowledge distillation. With only 5K samples, the retained performance of LinearPatch can be further boosted to 95.16% within 30 minutes on a single computing card.
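
The claim that a rotation and a channel-wise scaling "can be fused into a single matrix" can be sketched as follows. The abstract does not specify the order of operations or how the scales are estimated, so this example assumes the patch is H · diag(s) · H, where H is an orthonormal Sylvester-type Hadamard matrix (symmetric, so it is its own inverse). The function names and the toy scale values are hypothetical.

```python
def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction.

    n must be a power of two. The result H is symmetric and H @ H = I.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = n ** -0.5
    return [[x * scale for x in row] for row in h]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def fused_patch(scales):
    """Fuse rotate -> channel-wise scale -> rotate back into one matrix.

    Assumption: this composition order is illustrative; the paper's exact
    formulation is not given in the abstract.
    """
    n = len(scales)
    h = hadamard(n)
    d = [[scales[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return matmul(matmul(h, d), h)


identity_patch = fused_patch([1.0, 1.0, 1.0, 1.0])   # no rescaling
patch = fused_patch([2.0, 1.0, 1.0, 0.5])            # toy per-channel scales
```

Because the three steps collapse into one precomputed matrix, applying the patch at inference time costs a single extra matrix multiplication at the pruning interface.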

Dayeon Ki,Kevin Duh,Marine Carpuat

Main category: cs.CL

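TL;DR: The study examines how four types of AI translation quality feedback affect user decisions and finds that implicit feedback, especially QA tables, outperforms explicit feedback on accuracy, appropriate reliance, and user experience.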

Details

Motivation: As AI becomes common in daily life, users need effective feedback mechanisms to judge the quality of AI predictions, particularly when they cannot assess it themselves. Method: In a machine translation setting, compares four feedback types: explicit feedback (error highlights and LLM explanations) and implicit feedback (backtranslation and QA tables). Result: All feedback types except error highlights significantly improve decision accuracy and appropriate reliance; implicit feedback, especially QA tables, performs best. Conclusion: Implicit feedback improves user decisions and experience more than explicit feedback, and QA tables are the most effective form. Abstract: As people increasingly use AI systems in work and daily life, feedback mechanisms that help them use AI responsibly are urgently needed, particularly in settings where users are not equipped to assess the quality of AI predictions. We study a realistic Machine Translation (MT) scenario where monolingual users decide whether to share an MT output, first without and then with quality feedback. We compare four types of quality feedback: explicit feedback that directly gives users an assessment of translation quality using 1) error highlights and 2) LLM explanations, and implicit feedback that helps users compare MT inputs and outputs through 3) backtranslation and 4) question-answer (QA) tables. We find that all feedback types, except error highlights, significantly improve both decision accuracy and appropriate reliance. Notably, implicit feedback, especially QA tables, yields significantly greater gains than explicit feedback in terms of decision accuracy, appropriate reliance, and user perceptions, receiving the highest ratings for helpfulness and trust, and the lowest for mental burden.

[263] Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration

Qinglin Zhu,Runcong Zhao,Hanqi Yan,Yulan He,Yudong Chen,Lin Gui

Main category: cs.CL

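TL;DR: The Soft Reasoning framework improves complex reasoning in large language models through embedding perturbation and Bayesian optimisation, reducing reliance on heuristic search.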

Details

Motivation: LLMs struggle with complex reasoning because of limited diversity and inefficient search. Method: Combines embedding perturbation with Bayesian optimisation, refining embeddings through a verifier-guided objective that balances exploration and exploitation. Result: Experiments show superior reasoning accuracy and coherence at low computational cost. Conclusion: Soft Reasoning is a scalable, model-agnostic solution that significantly improves reasoning performance. Abstract: Large Language Models (LLMs) struggle with complex reasoning due to limited diversity and inefficient search. We propose Soft Reasoning, an embedding-based search framework that optimises the embedding of the first token to guide generation. It combines (1) embedding perturbation for controlled exploration and (2) Bayesian optimisation to refine embeddings via a verifier-guided objective, balancing exploration and exploitation. This approach improves reasoning accuracy and coherence while avoiding reliance on heuristic search. Experiments demonstrate superior correctness with minimal computation, making it a scalable, model-agnostic solution.

[264] BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

Sander Land,Catherine Arnett

Main category: cs.CL

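TL;DR: SCRIPT-BPE is a new encoding scheme that uses initial tokens based on Unicode script and category properties to address the problems BPE tokenizers face in multilingual settings, providing a simple and robust pretokenization strategy.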

Details

Motivation: In multilingual settings, BPE tokenizers penalize non-Western scripts and create tokens containing partial UTF-8 sequences, while existing pretokenization relies on complex regular expressions that introduce fragility and edge cases. Method: Proposes SCRIPT, which bypasses UTF-8 byte conversion and builds initial tokens from Unicode script and category properties, combined with a constrained BPE merging strategy that preserves character integrity. Result: SCRIPT-BPE achieves competitive compression while eliminating encoding penalties for non-Latin-script languages. Conclusion: SCRIPT-BPE offers a more robust and equitable tokenization scheme for multilingual settings. Abstract: Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our experiments demonstrate that SCRIPT-BPE achieves competitive compression while eliminating encoding-based penalties for non-Latin-script languages.
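
A rule-based pretokenizer that "respects script boundaries" can be sketched in a few lines. This covers only the boundary rule, not the SCRIPT encoding itself, and it rests on a loud assumption: Python's standard `unicodedata` module exposes the general category but not the Unicode Script property, so the first word of each letter's character name (for example `LATIN` or `CYRILLIC`) is used as a rough stand-in for its script. The helper names `char_class` and `pretokenize` are hypothetical.

```python
import unicodedata
from itertools import groupby


def char_class(ch):
    """Return an approximate (script, major category) pair for a character."""
    category = unicodedata.category(ch)[0]  # major class: L, N, P, Z, ...
    if category == "L":
        # Assumption: first word of the character name approximates the
        # script; the real Script property is not in the standard library.
        script = unicodedata.name(ch, "UNKNOWN").split()[0]
    else:
        script = "COMMON"
    return script, category


def pretokenize(text):
    """Split text wherever the (script, category) class changes."""
    return ["".join(group) for _, group in groupby(text, key=char_class)]


tokens = pretokenize("hello мир 42!")
```

No regular expression is involved: the split points follow directly from per-character properties, so a run of Cyrillic letters is never merged with adjacent Latin letters, digits, or punctuation.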

[265] Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios

Gerard I. Gállego,Oriol Pareras,Martí Cortada Garcia,Lucas Takanori,Javier Hernando

Main category: cs.CL

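TL;DR: Proposes a speech-to-text translation method that integrates phoneme representations into a Chain-of-Thought framework to improve translation for low-resource and zero-resource languages.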

Details

Motivation: To address the challenges of speech-to-text translation for low-resource and zero-resource languages and to strengthen cross-lingual transfer. Method: Uses phoneme recognition as an intermediate step, extends a multilingual LLM to process speech and phonemes, and trains with a curriculum learning strategy. Result: Improves translation quality in low-resource conditions and enables zero-resource translation, with a slight impact on high-resource performance. Conclusion: Phoneme-augmented CoT is a promising direction for making speech-to-text translation more accessible across languages. Abstract: We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.

[266] Multi-Domain ABSA Conversation Dataset Generation via LLMs for Real-World Evaluation and Model Comparison

Tejul Pandit,Meet Raval,Dhvani Upadhyay

Main category: cs.CL

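TL;DR: This paper presents a method for generating synthetic ABSA data with large language models (such as GPT-4o) to address the scarcity of labeled real-world data, and evaluates the quality and utility of the generated data experimentally.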

Details

Motivation: To address the scarcity of diverse labeled data for ABSA, especially data reflecting real conversational settings. Method: Generates synthetic data with GPT-4o, keeping topic and sentiment distributions consistent, and evaluates three state-of-the-art LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and DeepSeek-R1) on classification tasks. Result: The synthetic data proves effective: DeepSeek-R1 shows higher precision, Gemini 1.5 Pro and Claude 3.5 Sonnet show strong recall, and Gemini 1.5 Pro offers faster inference. Conclusion: LLM-based synthetic data generation is a viable and flexible approach that provides valuable resources for ABSA research and reduces reliance on limited real-world labeled data. Abstract: Aspect-Based Sentiment Analysis (ABSA) offers granular insights into opinions but often suffers from the scarcity of diverse, labeled datasets that reflect real-world conversational nuances. This paper presents an approach for generating synthetic ABSA data using Large Language Models (LLMs) to address this gap. We detail the generation process aimed at producing data with consistent topic and sentiment distributions across multiple domains using GPT-4o. The quality and utility of the generated data were evaluated by assessing the performance of three state-of-the-art LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and DeepSeek-R1) on topic and sentiment classification tasks. Our results demonstrate the effectiveness of the synthetic data, revealing distinct performance trade-offs among the models: DeepSeek-R1 showed higher precision, Gemini 1.5 Pro and Claude 3.5 Sonnet exhibited strong recall, and Gemini 1.5 Pro offered significantly faster inference. We conclude that LLM-based synthetic data generation is a viable and flexible method for creating valuable ABSA resources, facilitating research and model evaluation without reliance on limited or inaccessible real-world labeled data.

Guido Ivetta,Marcos J. Gomez,Sofía Martinelli,Pietro Palombini,M. Emilia Echeveste,Nair Carolina Mazzeo,Beatriz Busaniche,Luciana Benotti

Main category: cs.CL

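TL;DR: HESEIA is a dataset of 46,499 sentences co-created by Latin American educators and students for evaluating social biases in large language models, with a focus on intersectional biases and local contexts.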

Details

Motivation: Existing resources are usually built without the participation of the communities affected by the biases; HESEIA fills this gap by drawing on educators' lived experience and expertise. Method: 370 high-school teachers and 5,370 students created sentences expressing stereotypes relevant to their school subjects and communities during a professional development course. Result: HESEIA is diverse across demographic axes and knowledge areas and contains more stereotypes unrecognized by current LLMs than previous datasets. Conclusion: HESEIA supports bias assessment grounded in educational communities and underscores the importance of local participation. Abstract: Most resources for evaluating social biases in Large Language Models are developed without co-design from the communities affected by these biases, and rarely involve participatory approaches. We introduce HESEIA, a dataset of 46,499 sentences created in a professional development course. The course involved 370 high-school teachers and 5,370 students from 189 Latin-American schools. Unlike existing benchmarks, HESEIA captures intersectional biases across multiple demographic axes and school subjects. It reflects local contexts through the lived experience and pedagogical expertise of educators. Teachers used minimal pairs to create sentences that express stereotypes relevant to their school subjects and communities. We show the dataset's diversity in terms of the demographic axes represented and also in terms of the knowledge areas included. We demonstrate that the dataset contains more stereotypes unrecognized by current LLMs than previous datasets. HESEIA is available to support bias assessments grounded in educational communities.

[268] Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification

Badr M. Abdullah,Matthew Baas,Bernd Möbius,Dietrich Klakow

Main category: cs.CL

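TL;DR: The paper proposes a voice-conversion-based training approach for Arabic dialect identification (ADI) that significantly improves cross-domain robustness, with accuracy gains of up to 34.1% on a newly collected real-world test set.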

Details

Motivation: Current ADI systems generalize poorly to out-of-domain speech, which limits their use in inclusive speech technologies. Method: ADI models are trained with an approach based on voice conversion. Result: On a newly collected test set spanning four domains, accuracy improves by up to 34.1% across domains, and voice conversion helps mitigate speaker bias in the ADI dataset. Conclusion: The approach makes ADI systems markedly more robust and supports the development of inclusive speech technologies for Arabic. Abstract: Arabic dialect identification (ADI) systems are essential for large-scale data collection pipelines that enable the development of inclusive speech technologies for Arabic language varieties. However, the reliability of current ADI systems is limited by poor generalization to out-of-domain speech. In this paper, we present an effective approach based on voice conversion for training ADI models that achieves state-of-the-art performance and significantly improves robustness in cross-domain scenarios. Evaluated on a newly collected real-world test set spanning four different domains, our approach yields consistent improvements of up to +34.1% in accuracy across domains. Furthermore, we present an analysis of our approach and demonstrate that voice conversion helps mitigate the speaker bias in the ADI dataset. We release our robust ADI model and cross-domain evaluation dataset to support the development of inclusive speech technologies for Arabic.

Junyu Luo,Zhizhuo Kou,Liming Yang,Xiao Luo,Jinsheng Huang,Zhiping Xiao,Jingshu Peng,Chengzhong Liu,Jiaming Ji,Xuanzhe Liu,Sirui Han,Ming Zhang,Yike Guo

Main category: cs.CL

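TL;DR: The paper introduces FinMME, a multimodal evaluation dataset built for the financial domain with more than 11,000 high-quality samples covering 18 financial domains and 6 asset classes, together with the FinScore evaluation system. Experiments show that even advanced models such as GPT-4o perform poorly on FinMME, which highlights how challenging it is.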

Details

Motivation: The financial domain lacks effective multimodal evaluation datasets, which holds back the development of multimodal large language models (MLLMs) in this area. Method: The authors build the FinMME dataset from diverse financial samples, ensure data quality through 20 annotators and validation mechanisms, and develop the FinScore evaluation system, which includes hallucination penalties and multi-dimensional capability assessment. Result: Advanced models such as GPT-4o perform unsatisfactorily on FinMME, and the benchmark is highly robust (prediction variation under different prompts stays below 1%). Conclusion: FinMME provides a high-quality evaluation benchmark for MLLMs in finance that is challenging and more reliable than existing datasets. Abstract: Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. Additionally, we develop FinScore, an evaluation system incorporating hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experimental results demonstrate that even state-of-the-art models like GPT-4o exhibit unsatisfactory performance on FinMME, highlighting its challenging nature. The benchmark exhibits high robustness with prediction variations under different prompts remaining below 1%, demonstrating superior reliability compared to existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.

[270] Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

Shelly Bensal,Umar Jamil,Christopher Bryant,Melisa Russak,Kiran Kamble,Dmytro Mozolevskyi,Muayad Ali,Waseem AlShikh

Main category: cs.CL

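TL;DR: Self-reflection combined with reinforcement learning improves large language model performance, with gains of up to 34.7% on math equation writing and 18.1% on function calling.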

Details

Motivation: The paper explores how self-reflection can improve a model's ability to solve complex tasks when generating synthetic data is infeasible and only binary feedback is available. Method: The framework has two stages: when a task fails, the model generates a self-reflection analyzing the attempt; it then retries the task with the reflection in context, and if the retry succeeds, the tokens generated in the reflection stage are rewarded. Result: Performance improves substantially, and small fine-tuned models outperform larger models in the same family. Conclusion: The framework offers a new route for language models to self-improve under limited feedback. Abstract: We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model's ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as 34.7% improvement at math equation writing and 18.1% improvement at function calling. Notably, smaller fine-tuned models (1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.
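
The two-stage loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `attempt`, `reflect`, and `verify` are hypothetical stand-ins for the model and the binary task checker, and the reward is simply recorded here rather than fed into an RL update.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Episode:
    first_attempt: str
    reflection: Optional[str]      # None when the first attempt already succeeded
    second_attempt: Optional[str]
    reflection_reward: float       # reward credited to the reflection tokens only


def reflect_retry_reward(
    task: str,
    attempt: Callable[[str, Optional[str]], str],  # hypothetical: (task, reflection) -> answer
    reflect: Callable[[str, str], str],            # hypothetical: (task, failed answer) -> reflection
    verify: Callable[[str, str], bool],            # hypothetical binary verifier
) -> Episode:
    first = attempt(task, None)
    if verify(task, first):
        # No failure, so there is no reflection and nothing to reward.
        return Episode(first, None, None, 0.0)
    # Stage 1: reflect on the failed attempt.
    reflection = reflect(task, first)
    # Stage 2: retry with the reflection in context.
    second = attempt(task, reflection)
    # Only a successful retry earns a reward for the reflection.
    reward = 1.0 if verify(task, second) else 0.0
    return Episode(first, reflection, second, reward)
```

In an actual training setup the recorded reward would drive a policy-gradient update restricted to the reflection tokens; that part is deliberately left out of the sketch.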

[271] Circuit Stability Characterizes Language Model Generalization

Alan Sun

Main category: cs.CL

TL;DR: The paper proposes circuit stability as a new way to assess language model performance, in response to the saturation of existing benchmarks.

Details

Motivation: Rapid progress in language models saturates benchmarks, while creating more challenging datasets is labor-intensive, so new evaluation methods are needed. Method: The paper mathematically formalizes circuit stability and circuit equivalence and validates them through three case studies. Result: Circuit stability, and the lack of it, can characterize and predict different aspects of a model's generalization. Conclusion: The approach is a step towards relating the generality of models to their interpretability. Abstract: Extensively evaluating the capabilities of (large) language models is difficult. Rapid development of state-of-the-art models induces benchmark saturation, while creating more challenging datasets is labor-intensive. Inspired by the recent developments in mechanistic interpretability, we introduce circuit stability as a new way to assess model performance. Circuit stability refers to a model's ability to apply a consistent reasoning process (its circuit) across various inputs. We mathematically formalize circuit stability and circuit equivalence. Then, through three case studies, we empirically show that circuit stability and the lack thereof can characterize and predict different aspects of generalization. Our proposed methods offer a step towards rigorously relating the generality of models to their interpretability.

[272] Don't Reinvent the Wheel: Efficient Instruction-Following Text Embedding based on Guided Space Transformation

Yingchaojie Feng,Yiqun Sun,Yandong Sun,Minfeng Zhu,Qiang Huang,Anthony K. H. Tung,Wei Chen

Main category: cs.CL

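TL;DR: GSTransform is a lightweight instruction-following text embedding framework that adapts pre-computed embeddings through a space transformation, improving both efficiency and quality.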

Details

Motivation: Existing methods must re-encode the entire corpus for every new instruction, which is computationally expensive; GSTransform aims to remove this cost. Method: Starting from pre-computed generic embeddings, a lightweight space transformation adapts the embeddings to the user instruction in real time, without re-encoding. Result: Experiments on three tasks and nine datasets show that GSTransform improves on existing methods in both quality and speed, with speedups of 6~300x. Conclusion: GSTransform is an efficient, high-quality solution for instruction-following text embedding that suits large-scale real-time processing. Abstract: In this work, we investigate an important task named instruction-following text embedding, which generates dynamic text embeddings that adapt to user instructions, highlighting specific attributes of text. Despite recent advancements, existing approaches suffer from significant computational overhead, as they require re-encoding the entire corpus for each new instruction. To address this challenge, we propose GSTransform, a novel instruction-following text embedding framework based on Guided Space Transformation. Our key observation is that instruction-relevant information is inherently encoded in generic embeddings but remains underutilized. Instead of repeatedly encoding the corpus for each instruction, GSTransform is a lightweight transformation mechanism that adapts pre-computed embeddings in real time to align with user instructions, guided by a small amount of text data with instruction-focused label annotation. We conduct extensive experiments on three instruction-awareness downstream tasks across nine real-world datasets, demonstrating that GSTransform improves instruction-following text embedding quality over state-of-the-art methods while achieving dramatic speedups of 6~300x in real-time processing on large-scale datasets. The source code is available at https://github.com/YingchaojieFeng/GSTransform.

[273] LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews

Christian Jaumann,Andreas Wiedholz,Annemarie Friedrich

Main category: cs.CL

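TL;DR: The paper proposes LGAR, a zero-shot abstract ranker based on large language models (LLMs) for abstract screening in systematic literature reviews, which clearly outperforms existing QA-based methods.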

Details

Motivation: The scientific literature is growing rapidly, and the abstract screening phase of systematic literature reviews (SLRs) needs efficient methods. Existing approaches are limited to binary classification or suffer from error propagation. Method: The paper proposes LGAR, which combines an LLM-based graded relevance scorer with a dense re-ranker, and manually extracts inclusion and exclusion criteria and research questions for 57 SLRs as a benchmark. Result: LGAR outperforms existing QA-based methods by 5-10 percentage points in mean average precision. Conclusion: LGAR is a stronger solution for SLR abstract screening; code and data are publicly available. Abstract: The scientific literature is growing rapidly, making it hard to keep track of the state-of-the-art. Systematic literature reviews (SLRs) aim to identify and evaluate all relevant papers on a topic. After retrieving a set of candidate papers, the abstract screening phase determines initial relevance. To date, abstract screening methods using large language models (LLMs) focus on binary classification settings; existing question answering (QA) based ranking approaches suffer from error propagation. LLMs offer a unique opportunity to evaluate the SLR's inclusion and exclusion criteria, yet existing benchmarks do not provide them exhaustively. We manually extract these criteria as well as research questions for 57 SLRs, mostly in the medical domain, enabling principled comparisons between approaches. Moreover, we propose LGAR, a zero-shot LLM-Guided Abstract Ranker composed of an LLM-based graded relevance scorer and a dense re-ranker. Our extensive experiments show that LGAR outperforms existing QA-based methods by 5-10 pp. in mean average precision. Our code and data are publicly available.
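
Since the headline result is stated in mean average precision (MAP), a short sketch of how that metric is commonly computed for ranked screening results may help. The code assumes every candidate abstract of a review appears in the ranking, so the number of relevant items equals the number of hits; the relevance lists are made-up illustrations, not data from the paper.

```python
from typing import List, Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision for one ranking; entries are True for relevant items."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item's rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: List[Sequence[bool]]) -> float:
    """Mean of the per-review average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

For example, a ranking with relevant abstracts at positions 1 and 3 has average precision (1/1 + 2/3) / 2, and a gain of "5-10 pp." means the MAP value rises by 0.05 to 0.10 on this 0-to-1 scale.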

[274] From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning

Haoyu Li,Xuhong Li,Yiming Dong,Kun Liu

Main category: cs.CL

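TL;DR: The paper systematically analyzes how dataset diversity matters for supervised fine-tuning (SFT) of large language models (LLMs), proposes a taxonomy of diversity-control strategies, and finds that the microscopic response-diversity strategy has the largest effect on model performance.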

Details

Motivation: Although the importance of dataset diversity in LLM training is increasingly recognized, systematic analysis is still lacking; this paper aims to fill that gap. Method: The paper proposes a taxonomy of diversity-control strategies spanning macroscopic, mesoscopic, and microscopic levels, and builds fixed-size datasets under six strategies for experimental evaluation. Result: Macroscopic and mesoscopic strategies improve performance as diversity increases, while the microscopic response-diversity strategy correlates more strongly with model performance and performs best at maximum diversity. Conclusion: The findings give practical guidance for building high-performance SFT datasets and emphasize microscopic response diversity. Abstract: Dataset diversity plays a pivotal role for the successful training of many machine learning models, particularly in the supervised fine-tuning (SFT) stage of large language model (LLM) development. Despite increasing recognition of its importance, systematic analyses of dataset diversity still remain underexplored. To address this gap, this work presents a systematic taxonomy of existing diversity-control strategies, which primarily focus on the instruction component, operating at either macroscopic (entire instruction semantics) or mesoscopic levels (instruction units), and furthermore introduces a novel analysis of microscopic diversity within the response component, specifically analyzing the statistical distribution of tokens in SFT training samples. In the experimental evaluation, we construct fixed-size datasets (e.g., 10,000 samples each) from a corpus of 117,000 open-source SFT samples, incorporating six distinct diversity-control strategies spanning macro-, meso-, and microscopic levels applied to both instructions and responses. We then fine-tune LLMs on these datasets to assess the six diversity-control strategies. Results reveal that while macroscopic and mesoscopic strategies lead to higher performance with increasing diversity, the microscopic strategy in responses exhibits both a stronger correlation between model performance and the degree of diversity and superior performance with maximum diversity across all strategies. These findings offer actionable insights for constructing high-performance SFT datasets.

[275] Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty?

Jiayu Liu,Qing Zong,Weiqi Wang,Yangqiu Song

Main category: cs.CL

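TL;DR: The study examines how reliably large language models (LLMs) express confidence through epistemic markers (e.g., "fairly confident") and finds that markers are stable within the same distribution but inconsistent in out-of-distribution scenarios.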

Details

Motivation: As LLMs are used more in high-stakes domains, accurately assessing their confidence is crucial, yet it is unclear whether epistemic markers truly reflect a model's intrinsic confidence. Method: Marker confidence is defined as the observed accuracy when a model uses an epistemic marker, and it is evaluated for open-source and proprietary LLMs on multiple question-answering datasets. Result: Markers generalize well within the same distribution, but their confidence is inconsistent in out-of-distribution scenarios. Conclusion: The reliability of epistemic markers for confidence estimation is in doubt, and better alignment between marker-based confidence and actual model uncertainty is needed. Abstract: As large language models (LLMs) are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans typically express confidence through epistemic markers (e.g., "fairly confident") instead of numerical values. However, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence due to the difficulty of quantifying uncertainty associated with various markers. To address this gap, we first define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker-based confidence and actual model uncertainty. Our code is available at https://github.com/HKUST-KnowComp/MarCon.
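
The definition of marker confidence given in the abstract (observed accuracy when a model employs a given epistemic marker) translates directly into a per-marker accuracy table. The sketch below assumes the marker has already been extracted from each answer; the record format and function name are illustrative and are not taken from the MarCon repository.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def marker_confidence(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each epistemic marker to the accuracy of the answers that used it.

    Each record is (marker, is_correct) for one model answer.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for marker, is_correct in records:
        totals[marker] += 1
        correct[marker] += int(is_correct)
    return {marker: correct[marker] / totals[marker] for marker in totals}
```

Comparing the table computed on an in-distribution dataset with the one computed on an out-of-distribution dataset is one simple way to look at the stability question the paper raises.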

[276] Drop Dropout on Single-Epoch Language Model Pretraining

Houjun Liu,John Bauer,Christopher D. Manning

Main category: cs.CL

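TL;DR: The study shows that omitting dropout during single-epoch pretraining improves a language model's downstream performance and also makes the model easier to edit.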

Details

Motivation: The paper investigates the role of dropout in single-epoch pretraining of large language models (LLMs), which has not been studied thoroughly. Method: Experiments compare different dropout levels when pretraining BERT and Pythia 160M and 1.4B, evaluating downstream performance and model editability. Result: Models trained without dropout perform better on downstream tasks (language modeling, question answering, and others) and are more successful under gradient-based editing. Conclusion: The authors recommend dropping dropout entirely in single-epoch pretraining. Abstract: Originally, dropout was seen as a breakthrough regularization technique that improved performance in almost all applications of deep learning by reducing overfitting. Yet, single-epoch pretraining tasks common to modern LLMs yield minimal overfitting, leading to dropout not being used for large LLMs. Nevertheless, no thorough empirical investigation has been done on the role of dropout in LM pretraining. Through experiments in single-epoch pretraining of both masked (BERT) and autoregressive (Pythia 160M and 1.4B) LMs with varying levels of dropout, we find that downstream performance in language modeling, morpho-syntax (BLiMP), question answering (SQuAD), and natural-language inference (MNLI) improves when dropout is not applied during pretraining. We additionally find that the recently-introduced "early dropout" also degrades performance over applying no dropout at all. We further investigate the models' editability, and find that models trained without dropout are more successful in gradient-based model editing (MEND) and equivalent in representation-based model editing (ReFT). Therefore, we advocate to drop dropout during single-epoch pretraining.

[277] Guiding Generative Storytelling with Knowledge Graphs

Zhijun Pan,Antonios Andronis,Eva Hayek,Oscar AP Wilkinson,Ilya Lasy,Annette Parry,Guy Gadney,Tim J. Smith,Mick Grierson

Main category: cs.CL

TL;DR: Large language models (LLMs) assisted by knowledge graphs (KGs) improve narrative quality and user control in story generation.

Details

Motivation: LLMs struggle with coherence and user control in long-form story generation. Method: The paper proposes a KG-assisted storytelling pipeline and evaluates it through a user study. Result: Knowledge graphs significantly improve the quality of action-oriented and structured narratives and increase users' sense of control. Conclusion: KG-assisted LLM storytelling is more interactive and playful. Abstract: Large Language Models (LLMs) have shown great potential in automated story generation, but challenges remain in maintaining long-form coherence and providing users with intuitive and effective control. Retrieval-Augmented Generation (RAG) has proven effective in reducing hallucinations in text generation; however, the use of structured data to support generative storytelling remains underexplored. This paper investigates how knowledge graphs (KGs) can enhance LLM-based storytelling by improving narrative quality and enabling user-driven modifications. We propose a KG-assisted storytelling pipeline and evaluate its effectiveness through a user study with 15 participants. Participants created their own story prompts, generated stories, and edited knowledge graphs to shape their narratives. Through quantitative and qualitative analysis, our findings demonstrate that knowledge graphs significantly enhance story quality in action-oriented and structured narratives within our system settings. Additionally, editing the knowledge graph increases users' sense of control, making storytelling more engaging, interactive, and playful.

[278] LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text

Li yunhan,Wu gengshen

Main category: cs.CL

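TL;DR: The paper proposes a regression model for evaluating the quality of legal text (clarity, coherence, and terminology) and uses it to analyze 49 LLMs. Quality levels off at 14 billion parameters, reasoning models outperform base architectures, and the released ranking and Pareto analysis highlight the Qwen3 series as the best cost-performance choice.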

Details

Motivation: Current evaluation of LLMs in the legal domain focuses mainly on factual accuracy and neglects linguistic quality (clarity, coherence, and terminology), a gap this work aims to fill. Method: 1. Develop a regression model that evaluates legal text quality; 2. Create a specialized set of legal questions; 3. Analyze 49 LLMs with this framework. Result: 1. Quality gains are limited beyond 14 billion parameters (only a 2.7% improvement at 72 billion parameters); 2. Quantization and context length have a negligible impact; 3. Reasoning models outperform base architectures. The Qwen3 series offers the best cost-performance tradeoff. Conclusion: The study establishes standardized evaluation protocols for legal LLMs, reveals limitations in current training data refinement, and releases code and models. Abstract: As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks tend to focus mainly on factual accuracy while largely neglecting important linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we propose three steps: First, we develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of 2.7% noted at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016. Third, reasoning models consistently outperform base architectures. A significant outcome of our research is the release of a ranking list and Pareto analysis, which highlight the Qwen3 series as the optimal choice for cost-performance tradeoffs. This work not only establishes standardized evaluation protocols for legal LLMs but also uncovers fundamental limitations in current training data refinement approaches. Code and models are available at: https://github.com/lyxx3rd/LegalEval-Q.
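
The Pareto analysis mentioned in this abstract can be illustrated with a generic cost-versus-quality front. The sketch below is not the authors' analysis code: the tuple layout and the idea of a single scalar cost are assumptions, and any model names or numbers passed in would be placeholders.

```python
from typing import List, Tuple

Model = Tuple[str, float, float]  # (name, cost, quality); lower cost and higher quality are better


def pareto_front(models: List[Model]) -> List[str]:
    """Return the names of models that no other model dominates."""
    front = []
    for name, cost, quality in models:
        dominated = any(
            # The other model is at least as good on both axes and strictly better on one.
            (c <= cost and q >= quality) and (c < cost or q > quality)
            for _, c, q in models
        )
        if not dominated:
            front.append(name)
    return front
```

A model on this front is an "optimal choice for cost-performance tradeoffs" in the sense that nothing else is both cheaper and better.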

[279] Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs

Juraj Vladika,Annika Domres,Mai Nguyen,Rebecca Moser,Jana Nano,Felix Busch,Lisa C. Adams,Keno K. Bressem,Denise Bernhardt,Stephanie E. Combs,Kai J. Borm,Florian Matthes,Jan C. Peeken

Main category: cs.CL

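TL;DR: The paper proposes an atomic fact-checking framework that decomposes LLM-generated answers into verifiable atomic facts, significantly improving the factual accuracy and explainability of medical question answering.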

Details

Motivation: LLMs hallucinate and cite inaccurately in medical question answering, which hampers reliable and compliant clinical use. Method: LLM-generated answers are decomposed into atomic facts, and each fact is independently verified against authoritative medical guidelines. Result: In medical expert assessments and an automated benchmark, factual accuracy and explainability improve significantly, with up to a 40% overall answer improvement and a 50% hallucination detection rate. Conclusion: The framework provides more trustworthy and reliable support for LLMs in clinical medicine and is an important step towards AI-assisted healthcare. Abstract: Large language models (LLMs) exhibit extensive medical knowledge but are prone to hallucinations and inaccurate citations, which pose a challenge to their clinical adoption and regulatory compliance. Current methods, such as Retrieval Augmented Generation, partially address these issues by grounding answers in source documents, but hallucinations and low fact-level explainability persist. In this work, we introduce a novel atomic fact-checking framework designed to enhance the reliability and explainability of LLMs used in medical long-form question answering. This method decomposes LLM-generated responses into discrete, verifiable units called atomic facts, each of which is independently verified against an authoritative knowledge base of medical guidelines. This approach enables targeted correction of errors and direct tracing to source literature, thereby improving the factual accuracy and explainability of medical Q&A. Extensive evaluation using multi-reader assessments by medical experts and an automated open Q&A benchmark demonstrated significant improvements in factual accuracy and explainability. Our framework achieved up to a 40% overall answer improvement and a 50% hallucination detection rate. The ability to trace each atomic fact back to the most relevant chunks from the database provides a granular, transparent explanation of the generated responses, addressing a major gap in current medical AI applications. This work represents a crucial step towards more trustworthy and reliable clinical applications of LLMs, addressing key prerequisites for clinical application and fostering greater confidence in AI-assisted healthcare.

[280] How much do language models memorize?

John X. Morris,Chawin Sitawarin,Chuan Guo,Narine Kokhlikyan,G. Edward Suh,Alexander M. Rush,Kamalika Chaudhuri,Saeed Mahloujifar

Main category: cs.CL

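TL;DR: The paper proposes a new method for estimating how much a model "knows" about a datapoint and uses it to measure the capacity of modern language models. GPT-style models are found to have a capacity of roughly 3.6 bits per parameter.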

Details

Motivation: Prior work has struggled to disentangle memorization from generalization; this paper aims to separate the two formally. Method: Memorization is split into unintended memorization and generalization; eliminating generalization yields the total memorization, and models of different sizes are trained to observe their behavior. Result: Models memorize data until their capacity fills and then begin to generalize; hundreds of models are trained to derive scaling laws relating capacity and data size to membership inference. Conclusion: The study provides a way to quantify model capacity and clarifies the relationship between memorization and generalization. Abstract: We propose a new method for estimating how much a model "knows" about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point "grokking" begins, and unintended memorization decreases as models begin to generalize. We train hundreds of transformer language models ranging from 500K to 1.5B parameters and produce a series of scaling laws relating model capacity and data size to membership inference.
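
To give a feel for the 3.6 bits-per-parameter estimate, the following back-of-the-envelope sketch converts a parameter count into an approximate memorization capacity. Treating capacity as exactly linear in parameter count is a simplifying assumption made here for illustration; the function names are hypothetical.

```python
BITS_PER_PARAM = 3.6  # estimate reported in the abstract for GPT-style models


def capacity_bits(num_params: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Approximate total memorization capacity in bits."""
    return num_params * bits_per_param


def capacity_megabytes(num_params: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Same capacity expressed in megabytes (1 MB = 8e6 bits)."""
    return capacity_bits(num_params, bits_per_param) / 8 / 1e6
```

Under this assumption the smallest and largest models in the study (500K and 1.5B parameters) would hold roughly 0.225 MB and 675 MB of memorized information, respectively.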

[281] Multilinguality Does not Make Sense: Investigating Factors Behind Zero-Shot Transfer in Sense-Aware Tasks

Roksana Goworek,Haim Dubossarsky

Main category: cs.CL

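TL;DR: The study shows that multilingual training is not necessary for cross-lingual transfer, and that its perceived advantages are better explained by other factors such as fine-tuning data composition and evaluation artifacts.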

Details

Motivation: The paper challenges the assumption that multilingual training inherently benefits cross-lingual transfer on sense-aware tasks such as polysemy disambiguation and lexical semantic change. Method: A large-scale analysis across 28 languages examines the actual effect of multilingual training. Result: Multilingual training is not necessary, and its apparent advantages stem from confounding factors. Conclusion: The authors call for more rigorous evaluation in multilingual NLP and release models and benchmarks to support research. Abstract: Cross-lingual transfer allows models to perform tasks in languages unseen during training and is often assumed to benefit from increased multilinguality. In this work, we challenge this assumption in the context of two underexplored, sense-aware tasks: polysemy disambiguation and lexical semantic change. Through a large-scale analysis across 28 languages, we show that multilingual training is neither necessary nor inherently beneficial for effective transfer. Instead, we find that confounding factors, such as fine-tuning data composition and evaluation artifacts, better account for the perceived advantages of multilinguality. Our findings call for more rigorous evaluations in multilingual NLP. We release fine-tuned models and benchmarks to support further research, with implications extending to low-resource and typologically diverse languages.

[282] MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs

Gabrielle Kaili-May Liu,Gal Yona,Avi Caciularu,Idan Szpektor,Tim G. J. Rudner,Arman Cohan

Main category: cs.CL

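TL;DR: The paper studies how reliably LLMs express uncertainty and proposes MetaFaith, a method that improves the faithfulness of their confidence calibration.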

Details

Motivation: LLMs often use assertive language when conveying false claims, which leads to over-reliance and eroded trust, so their expressions of uncertainty need to reflect their intrinsic uncertainty more faithfully. Method: The paper systematically benchmarks faithful confidence calibration across many models, datasets, and prompting strategies, and proposes MetaFaith, a calibration approach inspired by human metacognition. Result: Existing LLMs largely fail at this task and existing interventions help little, whereas MetaFaith markedly improves faithful calibration, by up to 61%. Conclusion: MetaFaith is an effective calibration method that makes LLM uncertainty expression substantially more reliable. Abstract: A critical component in the trustworthiness of LLMs is reliable uncertainty communication, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded trust. We present the first systematic study of faithful confidence calibration of LLMs, benchmarking models' ability to use linguistic expressions of uncertainty that faithfully reflect their intrinsic uncertainty, across a comprehensive array of models, datasets, and prompting strategies. Our results demonstrate that LLMs largely fail at this task, and that existing interventions are insufficient: standard prompt approaches provide only marginal gains, and existing, factuality-based calibration techniques can even harm faithful calibration. To address this critical gap, we introduce MetaFaith, a novel prompt-based calibration approach inspired by human metacognition. We show that MetaFaith robustly improves faithful calibration across diverse models and task domains, enabling up to 61% improvement in faithfulness and achieving an 83% win rate over original generations as judged by humans.

[283] AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

Junyu Zhang,Runpei Dong,Han Wang,Xuying Ning,Haoran Geng,Peihao Li,Xialin He,Yutong Bai,Jitendra Malik,Saurabh Gupta,Huan Zhang

Main category: cs.CL

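TL;DR: AlphaOne (α1) is a universal framework for modulating the reasoning progress of large reasoning models (LRMs) at test time. By introducing the α moment and dynamically scheduling slow-thinking transitions, α1 enables flexible and efficient slow-to-fast reasoning modulation.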

Details

Motivation: Existing methods lack flexibility and efficiency during reasoning; α1 addresses this by modulating reasoning progress dynamically. Method: α1 introduces the α moment, schedules slow-thinking transitions before it through a Bernoulli stochastic process, and deterministically terminates slow thinking after it to encourage fast reasoning. Result: α1 shows superior reasoning capability and efficiency on multiple benchmarks in mathematical, coding, and scientific domains. Conclusion: α1 achieves flexible and efficient reasoning modulation within a unified framework and offers a new direction for optimizing large reasoning models. Abstract: This paper presents AlphaOne (α1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. α1 first introduces α moment, which represents the scaled thinking phase with a universal parameter α. Within this scaled pre-α moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the α moment, α1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate α1's superior reasoning capability and efficiency. Project page: https://alphaone-project.github.io/
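
The scheduling idea in this abstract (Bernoulli-sampled slow-thinking transitions before the α moment, deterministic termination after it) can be sketched as a toy schedule over decoding steps. Everything concrete below is an assumption for illustration: the step granularity, the constant probability `p_slow`, and the action labels are hypothetical, and the abstract does not specify how the α moment or the Bernoulli parameter is actually set.

```python
import random
from typing import List


def schedule_thinking(
    num_steps: int,
    alpha_moment: int,   # hypothetical: step index at which the alpha moment falls
    p_slow: float,       # hypothetical: constant Bernoulli probability of a slow-thinking transition
    rng: random.Random,
) -> List[str]:
    """Return one action label per step: 'slow', 'continue', or a final 'end_thinking'."""
    actions = []
    for step in range(num_steps):
        if step < alpha_moment:
            # Before the alpha moment: insert a reasoning-transition token with probability p_slow.
            actions.append("slow" if rng.random() < p_slow else "continue")
        else:
            # At the alpha moment: terminate slow thinking deterministically and stop scheduling.
            actions.append("end_thinking")
            break
    return actions
```

In a real decoder, "slow" would correspond to inserting a reasoning-transition token and "end_thinking" to emitting the end-of-thinking token; a non-constant schedule for `p_slow` over steps would slot into the same structure.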

[284] ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu,Shizhe Diao,Ximing Lu,Jian Hu,Xin Dong,Yejin Choi,Jan Kautz,Yi Dong

Main category: cs.CL

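TL;DR: The study shows that reinforcement learning (RL) can expand a language model's reasoning capabilities rather than merely amplifying existing outputs. The proposed ProRL method, which uses KL divergence control and a diverse set of tasks, significantly improves model performance.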

Details

Motivation: The paper asks whether RL truly expands a model's reasoning capabilities or only amplifies existing outputs, and whether scaling up RL compute reliably improves reasoning performance. Method: ProRL combines KL divergence control, reference policy resetting, and a diverse suite of tasks for prolonged RL training. Result: RL-trained models outperform base models across many tasks, including tasks on which the base models fail entirely. Gains in reasoning boundary correlate with the base model's task competence and with training duration. Conclusion: RL can explore new regions of the solution space, laying a foundation for future work on long-horizon RL for reasoning. Abstract: Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the base model's task competence and training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
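
The abstract reports results as pass@k. One widely used way to estimate it from n sampled attempts with c correct ones is the combinatorial estimator 1 - C(n-c, k) / C(n, k); the sketch below uses that estimator as an assumption, since the abstract does not say which variant the authors compute.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for the problem, c: number of correct samples, k <= n.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A base model that "fails entirely regardless of the number of attempts" corresponds to c = 0, for which this estimate is 0 at every k.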

stat.ML [Back]

[285] Conformal Object Detection by Sequential Risk Control

Léo Andéol,Luca Mossina,Adrien Mazoyer,Sébastien Gerchinovitz

Main category: stat.ML

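TL;DR: The paper proposes SeqCRC, a new method based on Conformal Prediction that addresses the reliability of object detection models, with statistical guarantees and experimental validation.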

Details

Motivation: Object detection models lack reliability in critical applications and have a complex structure, which calls for statistical guarantees that require no prior knowledge of the model or data distribution. Method: The paper proposes Sequential Conformal Risk Control (SeqCRC), which extends the statistical guarantees of CRC to sequential tasks, and designs loss functions and prediction sets suited to different applications. Result: Experiments validate the methods, and a toolkit and benchmark are provided. Conclusion: SeqCRC offers reliable statistical guarantees for object detection in critical applications and has practical potential. Abstract: Recent advances in object detectors have led to their adoption for industrial uses. However, their deployment in critical applications is hindered by the inherent lack of reliability of neural networks and the complex structure of object detection models. To address these challenges, we turn to Conformal Prediction, a post-hoc procedure which offers statistical guarantees that are valid for any dataset size, without requiring prior knowledge on the model or data distribution. Our contribution is manifold: first, we formally define the problem of Conformal Object Detection (COD) and introduce a novel method, Sequential Conformal Risk Control (SeqCRC), that extends the statistical guarantees of Conformal Risk Control (CRC) to two sequential tasks with two parameters, as required in the COD setting. Then, we propose loss functions and prediction sets suited to applying CRC to different applications and certification requirements. Finally, we present a conformal toolkit, enabling replication and further exploration of our methods. Using this toolkit, we perform extensive experiments, yielding a benchmark that validates the investigated methods and emphasizes trade-offs and other practical consequences.
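
For orientation, the single-parameter Conformal Risk Control step that SeqCRC builds on can be sketched as a threshold search over a calibration set. This is the standard one-parameter CRC rule, not the paper's two-parameter sequential procedure; it assumes the loss is bounded by `loss_bound` and non-increasing in the parameter, and that candidate values are given in ascending order. The function name and argument layout are illustrative.

```python
from typing import Optional, Sequence


def crc_threshold(
    lambdas: Sequence[float],      # candidate parameter values, ascending
    mean_losses: Sequence[float],  # empirical calibration risk at each candidate
    n: int,                        # number of calibration examples
    alpha: float,                  # target risk level
    loss_bound: float = 1.0,       # upper bound B on the loss
) -> Optional[float]:
    """Smallest lambda whose finite-sample-corrected risk is at most alpha."""
    for lam, risk in zip(lambdas, mean_losses):
        if (n / (n + 1)) * risk + loss_bound / (n + 1) <= alpha:
            return lam
    return None  # no candidate satisfies the risk bound
```

The correction term `loss_bound / (n + 1)` is what makes the guarantee hold for any calibration set size, at the price of a more conservative parameter when n is small.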

[286] A Mathematical Perspective On Contrastive Learning

Ricardo Baptista,Andrew M. Stuart,Son Tran

Main category: stat.ML

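TL;DR: The paper presents a probabilistic framework for multimodal contrastive learning in which optimized encoders define conditional probability distributions. The framework supports crossmodal retrieval, classification, and generation, and motivates new variants of loss functions and alignment metrics.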

Details

Motivation: Multimodal contrastive learning is usually framed as aligning representations across modalities, but it lacks a probabilistic interpretation. The paper aims to unify crossmodal tasks under a probabilistic framework and to open up a more flexible algorithm design space. Method: Contrastive learning is interpreted as the optimization of encoders that define conditional probability distributions; in the multivariate Gaussian setting, latent space identification is treated as a low-rank matrix approximation problem, and new loss functions and alignment metrics are introduced. Result: The framework is studied on multivariate Gaussians, the labeled MNIST dataset, and a data assimilation application in oceanography, showing its advantages for mode-seeking and generative tasks. Conclusion: The probabilistic perspective gives multimodal contrastive learning richer theoretical support and algorithm design tools that apply to a range of practical tasks. Abstract: Multimodal contrastive learning is a methodology for linking different data modalities; the canonical example is linking image and text data. The methodology is typically framed as the identification of a set of encoders, one for each modality, that align representations within a common latent space. In this work, we focus on the bimodal setting and interpret contrastive learning as the optimization of (parameterized) encoders that define conditional probability distributions, for each modality conditioned on the other, consistent with the available data. This provides a framework for multimodal algorithms such as crossmodal retrieval, which identifies the mode of one of these conditional distributions, and crossmodal classification, which is similar to retrieval but includes a fine-tuning step to make it task specific. The framework we adopt also gives rise to crossmodal generative models. This probabilistic perspective suggests two natural generalizations of contrastive learning: the introduction of novel probabilistic loss functions, and the use of alternative metrics for measuring alignment in the common latent space. We study these generalizations of the classical approach in the multivariate Gaussian setting. In this context we view the latent space identification as a low-rank matrix approximation problem. This allows us to characterize the capabilities of loss functions and alignment metrics to approximate natural statistics, such as conditional means and covariances; doing so yields novel variants on contrastive learning algorithms for specific mode-seeking and for generative tasks. The framework we introduce is also studied through numerical experiments on multivariate Gaussians, the labeled MNIST dataset, and on a data assimilation application arising in oceanography.
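The reading of crossmodal retrieval as mode-finding can be sketched in a few lines. This is an illustrative sketch only: the softmax-over-inner-products form of the conditional distribution, the `temperature` parameter, and the function names are assumptions in the style of standard contrastive models, not the paper's own formulation.

```python
import math

def conditional_distribution(query_embedding, candidate_embeddings, temperature=1.0):
    """Assumed form of p(candidate | query): softmax over inner products."""
    scores = [
        sum(q * c for q, c in zip(query_embedding, cand)) / temperature
        for cand in candidate_embeddings
    ]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def retrieve(query_embedding, candidate_embeddings, temperature=1.0):
    """Crossmodal retrieval as mode-finding: index of the most probable candidate."""
    probs = conditional_distribution(query_embedding, candidate_embeddings, temperature)
    return max(range(len(probs)), key=probs.__getitem__)
```

Here the embeddings stand in for encoder outputs in the common latent space; a generative variant would sample from the returned distribution instead of taking its mode.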

[287] Efficient Estimation of Regularized Tyler's M-Estimator Using Approximate LOOCV

Karim Abou-Moustafa

Main category: stat.ML

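TL;DR: Proposes an efficient method for estimating the regularization parameter by approximating the leave-one-out cross-validated (LOOCV) log-likelihood loss, substantially reducing computational cost.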

Details

Motivation: Address the computational cost of estimating the regularization parameter of Regularized Tyler's M-estimator (RTME). Method: Approximate the LOOCV log-likelihood loss so that the RTME procedure does not have to be invoked once per left-out sample, yielding an $O(n)$ reduction in running time. Result: The method's efficiency and accuracy are demonstrated on synthetic and real high-dimensional data, where it outperforms existing methods. Conclusion: The method is efficient and accurate, and is suited to regularization parameter estimation for high-dimensional data. Abstract: We consider the problem of estimating a regularization parameter, or a shrinkage coefficient $\alpha \in (0,1)$ for Regularized Tyler's M-estimator (RTME). In particular, we propose to estimate an optimal shrinkage coefficient by setting $\alpha$ as the solution to a suitably chosen objective function; namely the leave-one-out cross-validated (LOOCV) log-likelihood loss. Since LOOCV is computationally prohibitive even for moderate sample size $n$, we propose a computationally efficient approximation for the LOOCV log-likelihood loss that eliminates the need for invoking the RTME procedure $n$ times for each sample left out during the LOOCV procedure. This approximation yields an $O(n)$ reduction in the running time complexity for the LOOCV procedure, which results in a significant speedup for computing the LOOCV estimate. We demonstrate the efficiency and accuracy of the proposed approach on synthetic high-dimensional data sampled from heavy-tailed elliptical distributions, as well as on real high-dimensional datasets for object recognition, face recognition, and handwritten digit recognition. Our experiments show that the proposed approach is efficient and consistently more accurate than other methods in the literature for shrinkage coefficient estimation.

[288] Boosting In-Context Learning in LLMs Through the Lens of Classical Supervised Learning

Korel Gundem,Juncheng Dong,Dennis Zhang,Vahid Tarokh,Zhengling Qi

Main category: stat.ML

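TL;DR: The paper proposes Supervised Calibration (SC), a framework that learns a per-class affine transformation to correct the systematic biases of large language models (LLMs) in in-context learning and improve classification performance.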

Details

Motivation: Existing calibration methods can only shift the decision boundary and cannot correct its orientation, which leads to unstable performance. Method: SC learns an optimal affine transformation in the logit space without requiring external data, and supports regularization techniques. Result: SC outperforms baseline methods across multiple datasets and models (Mistral-7B, LLaMA-2-7B, Qwen2-7B). Conclusion: SC provides a unified and efficient solution to the calibration problem in in-context learning. Abstract: In-Context Learning (ICL) allows Large Language Models (LLMs) to adapt to new tasks with just a few examples, but their predictions often suffer from systematic biases, leading to unstable performance in classification. While calibration techniques have been proposed to mitigate these biases, we show that, in the logit space, many of these methods are equivalent to merely shifting the LLM's decision boundary without having the ability to alter its orientation. This proves inadequate when biases cause the LLM to be severely misdirected. To address these limitations and provide a unifying framework, we propose Supervised Calibration (SC), a loss-minimization based framework which learns an optimal, per-class affine transformation of the LLM's predictive probabilities in the logit space without requiring external data beyond the context. By using a more expressive functional class, SC not only subsumes many existing calibration methods in ICL as special cases, but also enables the ability to alter and even completely reverse the orientation of the LLM's decision boundary. Furthermore, SC's loss-based nature facilitates the seamless integration of two purpose-built regularization techniques: context-invariance and directional trust-region. The former is designed to tackle the instability issue in ICL, while the latter controls the degree of calibration. Finally, SC delivers state-of-the-art performance over calibration baselines in the 4-shot, 8-shot, and 16-shot settings across all nine datasets for Mistral-7B-Instruct-v0.3, LLaMA-2-7B-chat, and Qwen2-7B-Instruct.
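The difference between shifting a decision boundary and reorienting it can be illustrated with a toy two-class example. The sketch below assumes one scalar scale and one scalar shift per class applied to log-probabilities; that parameterization, the function names, and the numbers are illustrative assumptions, and the paper's learned transformation may be parameterized differently.

```python
import math

def calibrate(probs, scales, shifts):
    """Apply l'_k = scales[k] * log(p_k) + shifts[k], then renormalize (softmax)."""
    logits = [a * math.log(p) + b for p, a, b in zip(probs, scales, shifts)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)

llm_probs = [0.7, 0.3]  # hypothetical LLM output for one query

# Shift-only calibration (scales fixed at 1) moves the boundary but keeps its orientation.
shift_only = calibrate(llm_probs, scales=[1.0, 1.0], shifts=[0.0, 0.5])

# Negative scales flip the orientation: the class the LLM favored becomes the less likely one.
flipped = calibrate(llm_probs, scales=[-1.0, -1.0], shifts=[0.0, 0.0])
```

In this sketch, setting every scale to 1 recovers the shift-only family of calibrators, which is the sense in which an affine class is more expressive.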

cs.CR [Back]

[289] PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches

Dennis Jacob,Chong Xiang,Prateek Mittal

Main category: cs.CR

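TL;DR: PatchDEMUX is a certifiably robust framework for multi-label classification that decomposes the multi-label task into multiple binary classification problems and extends existing single-label defenses.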

Details

Motivation: Adversarial patch attacks severely degrade deep learning models, and certifiably robust defenses have so far targeted mainly single-label classification, with little work on the multi-label case. Method: Decompose the multi-label classification task into multiple isolated binary classification problems and extend an existing single-label defense (such as PatchCleanser); for the single-patch attack scenario, an additional certification procedure gives tighter robustness bounds. Result: On the MS-COCO and PASCAL VOC datasets, PatchDEMUX achieves non-trivial robustness while maintaining high clean performance. Conclusion: PatchDEMUX provides a certifiably robust defense framework for multi-label classification, filling a gap in the field. Abstract: Deep learning techniques have enabled vast improvements in computer vision technologies. Nevertheless, these models are vulnerable to adversarial patch attacks which catastrophically impair performance. The physically realizable nature of these attacks calls for certifiable defenses, which feature provable guarantees on robustness. While certifiable defenses have been successfully applied to single-label classification, limited work has been done for multi-label classification. In this work, we present PatchDEMUX, a certifiably robust framework for multi-label classifiers against adversarial patches. Our approach is a generalizable method which can extend any existing certifiable defense for single-label classification; this is done by considering the multi-label classification task as a series of isolated binary classification problems to provably guarantee robustness. Furthermore, in the scenario where an attacker is limited to a single patch we propose an additional certification procedure that can provide tighter robustness bounds. Using the current state-of-the-art (SOTA) single-label certifiable defense PatchCleanser as a backbone, we find that PatchDEMUX can achieve non-trivial robustness on the MS-COCO and PASCAL VOC datasets while maintaining high clean performance.

cs.AI [Back]

[290] Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

Yaxin Luo,Zhaoyi Li,Jiacheng Liu,Jiacheng Cui,Xiaohan Zhao,Zhiqiang Shen

Main category: cs.AI

TL;DR: Open CaptchaWorld is a benchmark platform for evaluating multimodal LLM agents on dynamic CAPTCHA tasks; results show that current agents perform far below human level.

Details

Motivation: CAPTCHAs are a critical bottleneck for deploying web agents, and the ability of current multimodal LLM agents to handle interactive tasks is largely untested. Method: Introduce the Open CaptchaWorld benchmark, covering 20 CAPTCHA types and 225 tasks in total, together with the CAPTCHA Reasoning Depth metric. Result: Humans achieve near-perfect performance (93.3%), while the best MLLM agent reaches a success rate of only 40.0%. Conclusion: Open CaptchaWorld exposes the limits of current multimodal agents and offers guidance for developing more robust systems. Abstract: CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that while humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates of at most 40.0% (by Browser-Use Openai-o3), far below the human-level performance of 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and Data are available at this https URL.

[291] Using Reasoning Models to Generate Search Heuristics that Solve Open Instances of Combinatorial Design Problems

Christopher D. Rosin

Main category: cs.AI

TL;DR: The paper applies reasoning large language models (LLMs) to combinatorial design and, through the CPro1 protocol, solves several long-standing open instances.

Details

Motivation: Combinatorial design has many open instances that traditional methods struggle with; reasoning LLMs can generate effective search heuristics to address them. Method: The CPro1 protocol, combined with reasoning LLMs, starts from a textual definition and a validity verifier, guides the models to select and implement strategies, and provides automated hyperparameter tuning and execution feedback. Result: Long-standing open instances are solved for 7 of 16 combinatorial design problems, including 3 problems whose instances were unsolved with non-reasoning LLMs, and open instances are also solved for several problems from the 2025 literature. Conclusion: Combining reasoning LLMs with the CPro1 protocol offers an effective approach for combinatorial design and demonstrates the potential of LLMs in mathematics and code generation. Abstract: Large Language Models (LLMs) with reasoning are trained to iteratively generate and refine their answers before finalizing them, which can help with applications to mathematics and code generation. We apply code generation with reasoning LLMs to a specific task in the mathematical field of combinatorial design. This field studies diverse types of combinatorial designs, many of which have lists of open instances for which existence has not yet been determined. The Constructive Protocol CPro1 uses LLMs to generate search heuristics that have the potential to construct solutions to small open instances. Starting with a textual definition and a validity verifier for a particular type of design, CPro1 guides LLMs to select and implement strategies, while providing automated hyperparameter tuning and execution feedback. CPro1 with reasoning LLMs successfully solves long-standing open instances for 7 of 16 combinatorial design problems selected from the 2006 Handbook of Combinatorial Designs, including new solved instances for 3 of these (Bhaskar Rao Designs, Symmetric Weighing Matrices, Balanced Ternary Designs) that were unsolved by CPro1 with non-reasoning LLMs. It also solves open instances for several problems from recent (2025) literature, generating new Covering Sequences, Johnson Clique Covers, Deletion Codes, and a Uniform Nested Steiner Quadruple System.

[292] OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation

Mengkang Hu,Yuhang Zhou,Wendong Fan,Yuzhou Nie,Bowei Xia,Tao Sun,Ziyu Ye,Zhaoxuan Jin,Yingru Li,Qiguang Chen,Zeyu Zhang,Yifeng Wang,Qianshuo Ye,Bernard Ghanem,Ping Luo,Guohao Li

Main category: cs.AI

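TL;DR: Workforce is a hierarchical multi-agent framework whose modular architecture enables cross-domain transfer without a complete redesign or retraining of all components.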

Details

Motivation: Current LLM-based multi-agent systems require architectural redesign and full retraining of all components when transferred to new domains. Method: Workforce uses a modular architecture with a domain-agnostic Planner, a Coordinator, and domain-specific Workers, and improves cross-domain generalization through Optimized Workforce Learning (OWL). Result: On the GAIA benchmark, Workforce reaches 69.70% accuracy, outperforming commercial systems, and the OWL-trained 32B model performs comparably to GPT-4o on challenging tasks. Conclusion: Through modularity and optimized learning, Workforce lays a foundation for the next generation of general-purpose AI assistants. Abstract: Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain-agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain-specific tool-calling capabilities. This decoupling enables cross-domain transferability during both inference and training phases: during inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents; for training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain-agnostic planner with reinforcement learning from real-world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi-domain agentic tasks. Experimental results demonstrate Workforce achieves open-source state-of-the-art performance (69.70%), outperforming commercial systems like OpenAI's Deep Research by 2.34%. More notably, our OWL-trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT-4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general-purpose AI assistants.

[293] Mind the Quote: Enabling Quotation-Aware Dialogue in LLMs via Plug-and-Play Modules

Yueqi Zhang,Peiwen Yuan,Shaoxiong Feng,Yiwei Li,Xinglin Wang,Jiayi Shi,Chuyi Tan,Boyuan Pan,Yao Hu,Kan Li

Main category: cs.AI

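TL;DR: The paper proposes a framework for handling quotations in human-AI dialogue, introducing span-conditioned generation and the QuAda method to make efficient use of quoted text.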

Details

Motivation: Current large language models lack an explicit mechanism for locating and exploiting quoted text in a dialogue, which limits the accuracy and efficiency of conversation. Method: Formalize the task as span-conditioned generation and design QuAda, a lightweight training-based method that dynamically adjusts attention to quoted spans. Result: Experiments show that QuAda is suitable for a variety of scenarios and generalizes to unseen topics, providing an efficient plug-and-play solution. Conclusion: QuAda offers an effective and parameter-efficient approach to quotation-aware dialogue with broad applicability. Abstract: Human-AI conversation frequently relies on quoting earlier text ("check it with the formula I just highlighted"), yet today's large language models (LLMs) lack an explicit mechanism for locating and exploiting such spans. We formalise the challenge as span-conditioned generation, decomposing each turn into the dialogue history, a set of token-offset quotation spans, and an intent utterance. Building on this abstraction, we introduce a quotation-centric data pipeline that automatically synthesises task-specific dialogues, verifies answer correctness through multi-stage consistency checks, and yields both a heterogeneous training corpus and the first benchmark covering five representative scenarios. To meet the benchmark's zero-overhead and parameter-efficiency requirements, we propose QuAda, a lightweight training-based method that attaches two bottleneck projections to every attention head, dynamically amplifying or suppressing attention to quoted spans at inference time while leaving the prompt unchanged and updating < 2.8% of backbone weights. Experiments across models show that QuAda is suitable for all scenarios and generalises to unseen topics, offering an effective, plug-and-play solution for quotation-aware dialogue.

[294] Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning

Vasilije Markovic,Lazar Obradovic,Laszlo Hajdu,Jovan Pavlovic

Main category: cs.AI

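TL;DR: The paper studies the importance of systematic hyperparameter optimization when combining large language models (LLMs) with knowledge graphs (KGs), and shows experimentally that targeted tuning improves performance.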

Details

Motivation: The role of hyperparameter optimization in retrieval-augmented generation systems remains underexplored, especially in complex modular frameworks. Method: Using the Cognee framework, parameters related to chunking, graph construction, retrieval, and prompting are optimized on three multi-hop QA benchmarks, with performance assessed by several metrics. Result: Targeted tuning yields meaningful performance gains, though the effect varies across datasets and metrics. Conclusion: Future progress will require not only architectural innovation but also clearer frameworks for optimization and evaluation. Abstract: Integrating Large Language Models (LLMs) with Knowledge Graphs (KGs) results in complex systems with numerous hyperparameters that directly affect performance. While such systems are increasingly common in retrieval-augmented generation, the role of systematic hyperparameter optimization remains underexplored. In this paper, we study this problem in the context of Cognee, a modular framework for end-to-end KG construction and retrieval. Using three multi-hop QA benchmarks (HotPotQA, TwoWikiMultiHop, and MuSiQue), we optimize parameters related to chunking, graph construction, retrieval, and prompting. Each configuration is scored using established metrics (exact match, F1, and DeepEval's LLM-based correctness metric). Our results demonstrate that meaningful gains can be achieved through targeted tuning. While the gains are consistent, they are not uniform, with performance varying across datasets and metrics. This variability highlights both the value of tuning and the limitations of standard evaluation measures. While demonstrating the immediate potential of hyperparameter tuning, we argue that future progress will depend not only on architectural advances but also on clearer frameworks for optimization and evaluation in complex, modular systems.
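Two of the metrics named above, exact match and token-level F1, are simple enough to sketch directly. The version below is a minimal illustration that normalizes only by lowercasing and whitespace splitting; that normalization and the function names are assumptions, and benchmark scoring scripts typically apply further normalization, so scores from this sketch would not be directly comparable with reported numbers.

```python
from collections import Counter

def normalize(text):
    """Simplified normalization (an assumption): lowercase and split on whitespace."""
    return text.lower().split()

def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall, using multiset overlap."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match rewards only fully correct answers, while F1 gives partial credit, which is one reason gains from tuning can differ between the two.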

[295] Leveraging Knowledge Graphs and LLMs for Structured Generation of Misinformation

Sania Nayab,Marco Simoni,Giulio Rossolini

Main category: cs.AI

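TL;DR: The paper proposes a method that uses knowledge graphs (KGs) to systematically generate fake triplets, analyzing the structural properties of KGs to produce misinformation that is hard to detect. It also examines the limitations of large language models (LLMs) in distinguishing genuine from false information.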

Details

Motivation: Rapid advances in generative AI have amplified the spread of misinformation, threatening social stability and national security. Assessing this threat requires structured and scalable methods for generating misinformation. Method: Use structural properties of knowledge graphs (such as the distance between entities and their predicates) to generate fake triplets, which then guide LLMs in producing misinformation with varying degrees of credibility. Result: The generated misinformation is hard for humans to detect, and the analysis reveals significant limitations of current LLMs in detecting it. Conclusion: More effective detection strategies are needed, along with deeper study of the inherent biases of generative models. Abstract: The rapid spread of misinformation, further amplified by recent advances in generative AI, poses significant threats to society, impacting public opinion, democratic stability, and national security. Understanding and proactively assessing these threats requires exploring methodologies that enable structured and scalable misinformation generation. In this paper, we propose a novel approach that leverages knowledge graphs (KGs) as structured semantic resources to systematically generate fake triplets. By analyzing the structural properties of KGs, such as the distance between entities and their predicates, we identify plausibly false relationships. These triplets are then used to guide large language models (LLMs) in generating misinformation statements with varying degrees of credibility. By utilizing structured semantic relationships, our deterministic approach produces misinformation inherently challenging for humans to detect, drawing exclusively upon publicly available KGs (e.g., WikiGraphs). Additionally, we investigate the effectiveness of LLMs in distinguishing between genuine and artificially generated misinformation. Our analysis highlights significant limitations in current LLM-based detection methods, underscoring the necessity for enhanced detection strategies and a deeper exploration of inherent biases in generative models.

[296] MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning

Jingyan Shen,Jiarui Yao,Rui Yang,Yifan Sun,Feng Luo,Rui Pan,Tong Zhang,Han Zhao

Main category: cs.AI

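TL;DR: MiCRo is a two-stage framework that combines context-aware mixture modeling with a dynamic routing strategy to address the inability of conventional reward modeling to capture diverse human preferences, significantly improving personalized alignment of LLMs.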

Details

Motivation: The conventional Bradley-Terry model assumes a global reward function and cannot capture the diversity and heterogeneity of human preferences, which limits personalized and pluralistic alignment of LLMs. Method: MiCRo uses two stages: (1) context-aware mixture modeling to capture diverse preferences; (2) an online routing strategy that dynamically adapts the mixture weights to the specific context. Result: Experiments show that MiCRo effectively captures diverse human preferences and significantly improves downstream personalization. Conclusion: MiCRo offers an efficient and scalable solution for personalized alignment of LLMs without relying on costly fine-grained annotations. Abstract: Reward modeling is a key step in building safe foundation models when applying reinforcement learning from human feedback (RLHF) to align Large Language Models (LLMs). However, reward modeling based on the Bradley-Terry (BT) model assumes a global reward function, failing to capture the inherently diverse and heterogeneous human preferences. Hence, such oversimplification limits LLMs from supporting personalization and pluralistic alignment. Theoretically, we show that when human preferences follow a mixture distribution of diverse subgroups, a single BT model has an irreducible error. While existing solutions, such as multi-objective learning with fine-grained annotations, help address this issue, they are costly and constrained by predefined attributes, failing to fully capture the richness of human values. In this work, we introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets without requiring explicit fine-grained annotations. In the first stage, MiCRo introduces a context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts mixture weights based on specific context to resolve ambiguity, allowing for efficient and scalable preference adaptation with minimal additional supervision. Experiments on multiple preference datasets demonstrate that MiCRo effectively captures diverse human preferences and significantly improves downstream personalization.
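The contrast between a single Bradley-Terry model and a mixture of them can be sketched as follows. The functions, the per-subgroup reward lists, and the fixed weights are illustrative assumptions; in the paper the mixture weights depend on context and are adapted by a routing step, which this sketch does not attempt to reproduce.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bt_preference(reward_a, reward_b):
    """Bradley-Terry probability that response A is preferred over response B."""
    return sigmoid(reward_a - reward_b)

def mixture_preference(weights, rewards_a, rewards_b):
    """Mixture of BT models: sum_k w_k * sigmoid(r_k(A) - r_k(B)).

    weights[k] is the (assumed, here fixed) weight of subgroup k, and
    rewards_a[k], rewards_b[k] are that subgroup's rewards for A and B.
    """
    return sum(
        w * bt_preference(ra, rb)
        for w, ra, rb in zip(weights, rewards_a, rewards_b)
    )
```

With two subgroups that disagree symmetrically, equal weights give a preference probability of 0.5, while shifting the weights toward one subgroup moves the prediction toward that subgroup's preference; this is the kind of ambiguity that context-dependent weights are meant to resolve.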

cs.IR [Back]

[297] SkewRoute: Training-Free LLM Routing for Knowledge Graph Retrieval-Augmented Generation via Score Skewness of Retrieved Context

Hairu Wang,Yuan Feng,Yukun Cao,Xike Xie,S Kevin Zhou

Main category: cs.IR

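TL;DR: Proposes a training-free routing framework for KG-RAG that judges query difficulty from the distribution of retrieval scores to balance performance and cost; experiments show up to a 50% reduction in calls to larger LLMs.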

Details

Motivation: Large language models are costly at inference time, and existing KG-RAG systems raise the cost further by retrieving large amounts of knowledge context, so performance and cost need to be balanced. Method: Building on the strong correlation between the retrieval score distribution and query difficulty, a training-free routing framework for KG-RAG is designed. Result: Experiments show the method reduces calls to larger LLMs by up to 50% while maintaining response quality. Conclusion: The framework shows potential for efficient and scalable LLM deployment. Abstract: Large language models excel at many tasks but often incur high inference costs during deployment. To mitigate hallucination, many systems use a knowledge graph to enhance retrieval-augmented generation (KG-RAG). However, the large volume of retrieved knowledge context increases these inference costs further. A promising solution to balance performance and cost is LLM routing, which directs simple queries to smaller LLMs and complex ones to larger LLMs. However, no dedicated routing methods currently exist for RAG, and existing training-based routers face challenges scaling to this domain due to the need for extensive training data. We observe that the score distributions produced by the retrieval scorer strongly correlate with query difficulty. Based on this, we propose a novel, training-free routing framework, the first tailored to KG-RAG that effectively balances performance and cost in a plug-and-play manner. Experiments show our method reduces calls to larger LLMs by up to 50% without sacrificing response quality, demonstrating its potential for efficient and scalable LLM deployment.
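A training-free router driven by score skewness might look roughly like the sketch below. Several details are assumptions rather than the paper's method: the use of the moment-based skewness coefficient, the direction of the rule (a strongly right-skewed score distribution is treated as a sign of an easy query), the threshold value, and the `"small"`/`"large"` labels.

```python
def skewness(scores):
    """Moment-based skewness: third central moment divided by variance**1.5."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m3 = sum((s - mean) ** 3 for s in scores) / n
    if m2 == 0:
        return 0.0  # all scores identical: no skew to measure
    return m3 / m2 ** 1.5

def route(scores, threshold=1.0):
    """Assumed rule: a few dominant scores (high skew) send the query to the small model."""
    return "small" if skewness(scores) > threshold else "large"

peaked = [0.9, 0.1, 0.1, 0.1, 0.1]   # one retrieved item clearly dominates
flat = [0.1, 0.3, 0.5, 0.7, 0.9]     # symmetric spread, no dominant item
```

Because the rule needs only the scores the retriever already produces, it adds no training data requirement, which matches the plug-and-play framing of the abstract.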

cs.MA [Back]

[298] An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring

Sana Ebrahimi,Mohsen Dehghankar,Abolfazl Asudeh

Main category: cs.MA

TL;DR: Proposes a multi-agent LLM framework based on credibility scoring that resists the influence of adversarial and low-performing agents.

Details

Motivation: Multi-agent LLM systems show strong capabilities in many domains but are easily disrupted by adversarial and low-performing agents. Method: Model collaborative query answering as an iterative game and aggregate the team's outputs using credibility scores, which are learned gradually from each agent's past contributions. Result: Experiments show that the system effectively mitigates adversarial influence and makes multi-agent cooperation more resilient, even when adversaries are in the majority. Conclusion: The framework provides a general adversary-resistant solution for multi-agent LLM systems. Abstract: While multi-agent LLM systems show strong capabilities in various domains, they are highly vulnerable to adversarial and low-performing agents. To resolve this issue, in this paper, we introduce a general and adversary-resistant multi-agent LLM framework based on credibility scoring. We model the collaborative query-answering process as an iterative game, where the agents communicate and contribute to a final system output. Our system associates a credibility score with each agent, which is used when aggregating the team outputs. The credibility scores are learned gradually based on the past contributions of each agent in query answering. Our experiments across multiple tasks and settings demonstrate our system's effectiveness in mitigating adversarial influence and enhancing the resilience of multi-agent cooperation, even in adversary-majority settings.
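Credibility-weighted aggregation can be sketched as a weighted vote whose weights are updated after each round. The multiplicative penalty, the `rate` parameter, and the assumption that a correct answer becomes available as feedback are all hypothetical choices made for illustration; the abstract does not specify the update rule.

```python
from collections import defaultdict

def aggregate(answers, credibility):
    """Return the answer with the largest total credibility among the agents."""
    totals = defaultdict(float)
    for agent, answer in answers.items():
        totals[answer] += credibility[agent]
    return max(totals, key=totals.get)

def update_credibility(credibility, answers, correct_answer, rate=0.5):
    """Hypothetical update: multiply an agent's score by `rate` when it was wrong."""
    return {
        agent: score * (1.0 if answers[agent] == correct_answer else rate)
        for agent, score in credibility.items()
    }

# Adversary-majority toy setting: agents "b" and "c" always answer incorrectly.
answers = {"a": "4", "b": "5", "c": "5"}
credibility = {"a": 1.0, "b": 1.0, "c": 1.0}
first_output = aggregate(answers, credibility)  # plain majority wins at first
for _ in range(2):
    credibility = update_credibility(credibility, answers, correct_answer="4")
later_output = aggregate(answers, credibility)  # the reliable agent now outweighs them
```

In this toy run the two unreliable agents outvote the reliable one until their scores decay, after which the weighted vote follows the reliable agent even though it is in the minority.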

cond-mat.mtrl-sci [Back]

[299] Exploring Domain Wall Pinning in Ferroelectrics via Automated High Throughput AFM

Kamyar Barakati,Yu Liu,Hiroshi Funakubo,Sergei V. Kalinin

Main category: cond-mat.mtrl-sci

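TL;DR: The paper studies the spatial dependence of domain-wall dynamics in ferroelectric materials, using machine-learning-controlled automated piezoresponse force microscopy to quantify the electric-field-driven dynamics of polar-strain domain structures.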

Details

Motivation: Domain-wall dynamics in ferroelectric materials are strongly position-dependent, and sparse pinning centers and domain walls are hard to study efficiently with conventional methods. Method: ML-controlled automated Piezoresponse Force Microscopy is used to analyze 1500 switching events in a large-area epitaxial PbTiO3 film. Result: Domain-wall displacement depends not only on field parameters but also on the local ferroelectric-ferroelastic configuration; for example, twin boundaries stay pinned up to a certain bias. Conclusion: The results can be distilled into a predictive map, providing a microstructure-specific rule set for designing ferroelectric memories. Abstract: Domain-wall dynamics in ferroelectric materials are strongly position-dependent since each polar interface is locked into a unique local microstructure. This necessitates spatially resolved studies of the wall-pinning using scanning-probe microscopy techniques. The pinning centers and preexisting domain walls are usually sparse within the image plane, precluding the use of dense hyperspectral imaging modes and requiring time-consuming human experimentation. Here, a large-area epitaxial PbTiO$_3$ film on cubic KTaO$_3$ was investigated to quantify the electric-field-driven dynamics of the polar-strain domain structures using ML-controlled automated Piezoresponse Force Microscopy. Analysis of 1500 switching events reveals that domain wall displacement depends not only on field parameters but also on the local ferroelectric-ferroelastic configuration. For example, twin boundaries in polydomain regions like a$_1^-$/$c^+$ $\parallel$ a$_2^-$/$c^-$ stay pinned up to a certain level of bias magnitude and change only marginally as the bias increases from 20V to 30V, whereas single variant boundaries like a$_2^+$/$c^+$ $\parallel$ a$_2^-$/$c^-$ stack are already activated at 20V. These statistics on the possible ferroelectric and ferroelastic wall orientations, together with the automated, high-throughput AFM workflow, can be distilled into a predictive map that links domain configurations to pulse parameters. This microstructure-specific rule set forms the foundation for designing ferroelectric memories.

cs.LG [Back]

[300] Test-Time Training Done Right

Tianyuan Zhang,Sai Bi,Yicong Hong,Kai Zhang,Fujun Luan,Songlin Yang,Kalyan Sunkavalli,William T. Freeman,Hao Tan

Main category: cs.LG

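TL;DR: LaCT uses large chunk updates (2K to 1M tokens) to improve hardware utilization and state capacity, applies across modalities, and is validated for long-context modeling.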

Details

Motivation: Existing TTT methods have low hardware utilization because of their small minibatch updates, and they struggle with long-context data. Method: Use large chunk updates (LaCT) to raise hardware utilization and state capacity and to support sophisticated optimizers. Result: Effectiveness is validated across diverse tasks, including a 14B-parameter video diffusion model and novel view synthesis with a context of one million tokens. Conclusion: LaCT offers an efficient solution for long-context modeling and test-time training and may help advance related research. Abstract: Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference. This fast weight, akin to recurrent states in RNNs, stores temporary memories of past tokens in the current sequence. Existing TTT methods struggled to show effectiveness in handling long-context data, due to their inefficiency on modern GPUs. The TTT layers in many of these approaches operate with extremely low FLOPs utilization (often <5%) because they deliberately apply small online minibatch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small minibatch implies fine-grained block-wise causal dependencies in the data, unsuitable for data beyond 1D ordered sequences, like sets or N-dimensional grids such as images or videos. In contrast, we pursue the opposite direction by using an extremely large chunk update, ranging from 2K to 1M tokens across tasks of varying modalities, which we refer to as Large Chunk Test-Time Training (LaCT). It improves hardware utilization by orders of magnitude, and more importantly, facilitates scaling of nonlinear state size (up to 40% of model parameters), hence substantially improving state capacity, all without requiring cumbersome and error-prone kernel implementations. It also allows easy integration of sophisticated optimizers, e.g., Muon for online updates. We validate our approach across diverse modalities and tasks, including novel view synthesis with image sets, language models, and auto-regressive video diffusion. Our approach can scale up to a 14B-parameter AR video diffusion model on sequences up to 56K tokens. In our longest sequence experiment, we perform novel view synthesis with 1 million context length. We hope this work will inspire and accelerate new research in the field of long-context modeling and test-time training. Website: https://tianyuanzhang.com/projects/ttt-done-right

[301] Vision Language Models are Biased

An Vo,Khai-Nguyen Nguyen,Mohammad Reza Taesiri,Vy Tuong Dang,Anh Totti Nguyen,Daeyoung Kim

Main category: cs.LG

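TL;DR: The study finds that vision language models (VLMs) are strongly biased by memorized knowledge about popular subjects when performing standard visual tasks such as counting and identification, leading to a marked drop in accuracy.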

Details

Motivation: Examine how popular knowledge memorized by large language models (LLMs) affects the performance of vision language models (VLMs) on standard visual tasks, in particular the errors caused by bias. Method: Test VLMs on counting tasks (such as counting stripes) across 7 different domains, and insert descriptive text into the images to observe its effect. Result: VLMs average only 17.05% accuracy on counting tasks, and inserted text lowers it further. Asking the model to double-check or to rely only on image details improves accuracy by just +2 points. Conclusion: The study reveals a failure mode of VLMs and presents an automated framework for testing VLM bias. Abstract: Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but also may notoriously sway their outputs towards wrong or biased answers. In this work, we test how the knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a fourth stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains from animals, logos, chess, board games, optical illusions, to patterned grids. Inserting text (e.g., "Adidas") describing the subject name into the counterfactual image further decreases VLM accuracy. The biases in VLMs are so strong that instructing them to double-check their results or rely exclusively on image details to answer improves counting accuracy by only +2 points, on average. Our work presents an interesting failure mode in VLMs and an automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.

[302] From Images to Signals: Are Large Vision Models Useful for Time Series Analysis?

Ziming Zhao,ChengAo Shen,Hanghang Tong,Dongjin Song,Zhigang Deng,Qingsong Wen,Jingchao Ni

Main category: cs.LG

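TL;DR: The paper presents the first systematic study of the effectiveness of large vision models (LVMs) for time series analysis, finding that LVMs perform well on classification but have limitations in forecasting.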

Details

Motivation: Examine whether large vision models (LVMs) are truly useful for time series analysis, filling a gap in the literature. Method: Design experiments covering 4 LVMs, 8 imaging methods, 18 datasets, and 26 baselines across classification and forecasting tasks, with detailed ablation analysis. Result: LVMs are effective for time series classification but face challenges in forecasting, and the current best LVM forecasters have limitations. Conclusion: LVMs show promise for time series classification but need further improvement for forecasting; the findings lay groundwork for future multimodal time series research. Abstract: Transformer-based models have gained increasing attention in time series research, driving interest in Large Language Models (LLMs) and foundation models for time series analysis. As the field moves toward multi-modality, Large Vision Models (LVMs) are emerging as a promising direction. In the past, the effectiveness of Transformers and LLMs in time series has been debated. When it comes to LVMs, a similar question arises: are LVMs truly useful for time series analysis? To address it, we design and conduct the first principled study involving 4 LVMs, 8 imaging methods, 18 datasets and 26 baselines across both high-level (classification) and low-level (forecasting) tasks, with extensive ablation analysis. Our findings indicate LVMs are indeed useful for time series classification but face challenges in forecasting. Although effective, the contemporary best LVM forecasters are limited to specific types of LVMs and imaging methods, exhibit a bias toward forecasting periods, and have limited ability to utilize long look-back windows. We hope our findings could serve as a cornerstone for future research on LVM- and multimodal-based solutions to different time series tasks.

[303] Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting

Chen Huang,Skyler Seto,Hadi Pouransari,Mehrdad Farajtabar,Raviteja Vemulapalli,Fartash Faghri,Oncel Tuzel,Barry-John Theobald,Josh Susskind

Main category: cs.LG

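TL;DR: Proxy-FDA is a new regularization method that reduces concept forgetting during fine-tuning through feature distribution alignment and dynamically generated proxies.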

Details

Motivation: Fine-tuning vision foundation models often causes concept forgetting on other tasks; existing methods preserve knowledge through point-wise matching but overlook the importance of feature neighborhood structure. Method: Proxy-FDA aligns the pre-trained and fine-tuned feature spaces using nearest neighbor graphs and increases data diversity with dynamically generated proxies. Result: Experiments show that Proxy-FDA significantly reduces concept forgetting, and that forgetting correlates strongly with a distributional distance metric. Conclusion: Proxy-FDA performs well across a variety of fine-tuning settings and tasks, confirming its effectiveness. Abstract: Vision foundation models pre-trained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to the issue of concept forgetting on other tasks. Recent methods of robust fine-tuning aim to mitigate forgetting of prior knowledge without affecting the fine-tuning performance. Knowledge is often preserved by matching the original and fine-tuned model weights or feature pairs. However, such point-wise matching can be too strong, without explicit awareness of the feature neighborhood structures that encode rich knowledge as well. We propose a novel regularization method Proxy-FDA that explicitly preserves the structural knowledge in feature space. Proxy-FDA performs Feature Distribution Alignment (using nearest neighbor graphs) between the pre-trained and fine-tuned feature spaces, and the alignment is further improved by informative proxies that are generated dynamically to increase data diversity. Experiments show that Proxy-FDA significantly reduces concept forgetting during fine-tuning, and we find a strong correlation between forgetting and a distributional distance metric (in comparison to L2 distance). We further demonstrate Proxy-FDA's benefits in various fine-tuning settings (end-to-end, few-shot and continual tuning) and across different tasks like image classification, captioning and VQA.

[304] Towards Unified Modeling in Federated Multi-Task Learning via Subspace Decoupling

Yipan Wei,Yuchen Zou,Yapeng Li,Bo Du

Main category: cs.LG

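TL;DR: FedDEA is a federated multi-task learning method that dynamically identifies task-relevant dimensions and rescales their optimization effect, achieving task-level decoupled aggregation and improving joint training on heterogeneous tasks.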

Details

Motivation: Existing federated multi-task learning methods cannot effectively aggregate heterogeneous tasks; FedDEA aims to address this. Method: FedDEA dynamically identifies task-relevant dimensions based on the response strength of local updates and rescales their optimization effect to achieve task-level decoupled aggregation. Result: Experiments show that FedDEA significantly improves performance and integrates easily into mainstream federated optimization algorithms. Conclusion: FedDEA is robust and generalizes well under highly heterogeneous task settings. Abstract: Federated Multi-Task Learning (FMTL) enables multiple clients to perform heterogeneous tasks without exchanging their local data, offering broad potential for privacy-preserving multi-task collaboration. However, most existing methods focus on building personalized models for each client and are unable to support the aggregation of multiple heterogeneous tasks into a unified model. As a result, in real-world scenarios where task objectives, label spaces, and optimization paths vary significantly, conventional FMTL methods struggle to achieve effective joint training. To address this challenge, we propose FedDEA (Federated Decoupled Aggregation), an update-structure-aware aggregation method specifically designed for multi-task model integration. Our method dynamically identifies task-relevant dimensions based on the response strength of local updates and enhances their optimization effectiveness through rescaling. This mechanism effectively suppresses cross-task interference and enables task-level decoupled aggregation within a unified global model. FedDEA does not rely on task labels or architectural modifications, making it broadly applicable and deployment-friendly. Experimental results demonstrate that it can be easily integrated into various mainstream federated optimization algorithms and consistently delivers significant overall performance improvements on widely used NYUD-V2 and PASCAL-Context. These results validate the robustness and generalization capabilities of FedDEA under highly heterogeneous task settings.
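
The select-and-rescale idea in the FedDEA abstract can be illustrated with a rough sketch. This is not the authors' algorithm: the top-fraction selection rule, the norm-preserving rescale, the plain averaging, and the names `decouple_update` and `aggregate` are all assumptions made for illustration.

```python
import math

def decouple_update(update, keep_fraction=0.5):
    """Keep the strongest-response coordinates and rescale them (illustrative rule)."""
    k = max(1, int(len(update) * keep_fraction))
    # Coordinates with the largest magnitude are treated as task-relevant (assumption).
    ranked = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    top = set(ranked[:k])
    masked = [u if i in top else 0.0 for i, u in enumerate(update)]
    orig_norm = math.sqrt(sum(u * u for u in update))
    kept_norm = math.sqrt(sum(m * m for m in masked))
    # Rescale so the masked update has the same L2 norm as the original one.
    scale = orig_norm / kept_norm if kept_norm > 0 else 0.0
    return [m * scale for m in masked]

def aggregate(updates, keep_fraction=0.5):
    """Average the masked-and-rescaled client updates coordinate by coordinate."""
    processed = [decouple_update(u, keep_fraction) for u in updates]
    return [sum(col) / len(processed) for col in zip(*processed)]
```

With two clients whose strong coordinates do not overlap, each client's update reaches the global model without being diluted by the other's small, noisy entries.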

[305] Provably Improving Generalization of Few-Shot Models with Synthetic Data

Lan-Cuong Nguyen,Quan Nguyen-Tri,Bang Tran Khanh,Dung D. Le,Long Tran-Thanh,Khoat Than

Main category: cs.LG

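TL;DR: The paper presents a theoretical framework that quantifies how the distribution gap between synthetic and real data affects image classification, and proposes a prototype-based algorithm that optimizes data partitioning and model training, significantly improving few-shot classification.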

Details

Motivation: Few-shot image classification is challenging because labeled data is scarce. Synthetic data can help, but the distribution gap degrades model performance. Method: Develop a theoretical framework that quantifies the impact of the distribution gap, and propose a prototype-based algorithm that optimizes data partitioning and model training. Result: Experiments show that the method outperforms existing techniques on multiple datasets. Conclusion: The framework and algorithm effectively narrow the gap between real and synthetic data distributions and improve few-shot classification performance. Abstract: Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising way to alleviate this issue, but models trained on synthetic samples often face performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. More importantly, our framework suggests practical ways to generate good synthetic samples and to train a predictor with high generalization ability. Building upon this framework, we propose a novel theory-based algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach demonstrates superior performance compared to state-of-the-art methods, outperforming them across multiple datasets.

[306] Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning

Amit Peleg,Naman Deep Singh,Matthias Hein

Main category: cs.LG

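TL;DR: CLIC is a new fine-tuning method based on multiple images and their associated captions that improves the compositional reasoning of CLIP models while also improving retrieval performance.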

Details

Motivation: Existing vision-language models such as CLIP fall short on compositional reasoning, and methods that improve compositionality often neglect semantic understanding and degrade retrieval performance. Method: CLIC fine-tunes on combinations of multiple images and their associated captions to improve compositional reasoning. Result: CLIC significantly improves compositional reasoning (both lexical and semantic understanding) and yields consistent gains on retrieval tasks. Conclusion: CLIC is an effective fine-tuning method that improves compositional reasoning without harming retrieval performance. Abstract: Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning - the ability to understand the relationships between concepts. A recent benchmark, SugarCrepe++, reveals that previous works on improving compositionality have mainly improved lexical sensitivity but neglected semantic understanding. In addition, downstream retrieval performance often deteriorates, although one would expect that improving compositionality should enhance retrieval. In this work, we introduce CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a novel training technique combining multiple images and their associated captions. CLIC improves compositionality across architectures as well as differently pre-trained CLIP models, both in terms of lexical and semantic understanding, and achieves consistent gains in retrieval performance. This even applies to the recent CLIPS, which achieves SOTA retrieval performance. Nevertheless, the short fine-tuning with CLIC leads to an improvement in retrieval and to the best compositional CLIP model on SugarCrepe++. All our models and code are available at https://clic-compositional-clip.github.io

[307] Graph Flow Matching: Enhancing Image Generation with Neighbor-Aware Flow Fields

Md Shahriar Rahim Siddiqui,Moshe Eliasof,Eldad Haber

Main category: cs.LG

TL;DR: The paper proposes Graph Flow Matching (GFM), which adds a graph neural network module to flow matching models and uses local context to improve generation quality.

Details

Motivation: Existing flow matching networks usually predict each point's velocity independently and ignore correlations between neighboring points, which may limit generation quality. Method: GFM decomposes the learned velocity into a reaction term (a standard flow matching network) and a diffusion term (neighbor information aggregated by a graph neural network). The reaction-diffusion form retains the model's scalability and improves prediction accuracy. Result: On five image generation benchmarks, GFM significantly improves FID and recall, demonstrating its effectiveness as a modular enhancement to existing flow matching architectures. Conclusion: GFM is a lightweight, efficient enhancement that improves the generation quality of flow matching models without a significant increase in computational cost. Abstract: Flow matching casts sample generation as learning a continuous-time velocity field that transports noise to data. Existing flow matching networks typically predict each point's velocity independently, considering only its location and time along its flow trajectory, and ignoring neighboring points. However, this pointwise approach may overlook correlations between points along the generation trajectory that could enhance velocity predictions, thereby improving downstream generation quality. To address this, we propose Graph Flow Matching (GFM), a lightweight enhancement that decomposes the learned velocity into a reaction term -- any standard flow matching network -- and a diffusion term that aggregates neighbor information via a graph neural module. This reaction-diffusion formulation retains the scalability of deep flow models while enriching velocity predictions with local context, all at minimal additional computational cost. Operating in the latent space of a pretrained variational autoencoder, GFM consistently improves Fréchet Inception Distance (FID) and recall across five image generation benchmarks (LSUN Church, LSUN Bedroom, FFHQ, AFHQ-Cat, and CelebA-HQ at $256\times256$), demonstrating its effectiveness as a modular enhancement to existing flow matching architectures.
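
The reaction-diffusion split described in the abstract can be sketched on a toy batch of points. The sketch below is only an illustration under stated assumptions: `reaction` is a hand-written placeholder for a flow matching network, and the diffusion term is a fixed k-nearest-neighbor averaging rule rather than the learned graph neural module used in the paper.

```python
def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] by squared Euclidean distance."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(points[i], p)), j)
        for j, p in enumerate(points) if j != i
    )
    return [j for _, j in dists[:k]]

def reaction(x, t):
    """Placeholder pointwise velocity; stands in for any flow matching network."""
    return [-xi for xi in x]

def diffusion(points, i, k, strength):
    """Neighbor-aware correction: mean offset to the k nearest neighbors, scaled."""
    nbrs = knn_indices(points, i, k)
    if not nbrs:
        return [0.0] * len(points[i])
    return [
        strength * sum(points[j][d] - points[i][d] for j in nbrs) / len(nbrs)
        for d in range(len(points[i]))
    ]

def gfm_velocity(points, t, k=2, strength=0.1):
    """Velocity for every point: pointwise reaction term plus neighbor diffusion term."""
    return [
        [r + d for r, d in zip(reaction(x, t), diffusion(points, i, k, strength))]
        for i, x in enumerate(points)
    ]
```

Setting `strength=0.0` recovers the purely pointwise prediction, which mirrors the claim that the diffusion term is a modular add-on to an existing network.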

[308] Hyperbolic Dataset Distillation

Wenyuan Li,Guang Li,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama

Main category: cs.LG

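TL;DR: The paper proposes a hyperbolic dataset distillation method (HDD) that embeds data in hyperbolic space and optimizes the hyperbolic distance between centroids, addressing the limitation that conventional distribution matching methods in Euclidean space ignore complex geometric and hierarchical relationships.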

Details

Motivation: The computational and storage challenges of large-scale datasets in deep learning have driven the development of dataset distillation, but existing methods are confined to Euclidean space and cannot capture the hierarchical structure of data. Method: HDD embeds features extracted by a shallow network into Lorentz hyperbolic space and explicitly integrates hierarchical structure by optimizing the hyperbolic distance between the centroids of the synthetic and original data. Result: Experiments show that only 20% of the distilled core set is needed to retain model performance, with significantly improved training stability. Conclusion: HDD is an efficient method that is compatible with most existing distribution matching methods and applies to a variety of datasets, offering a new approach to modeling complex data structures. Abstract: To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20% of the distilled core set to retain model performance, while significantly improving training stability. Notably, HDD is seamlessly compatible with most existing DM methods, and extensive experiments on different datasets validate its effectiveness.
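
The centroid-distance discrepancy described in the abstract can be written out with standard Lorentz-model formulas. The sketch below is a minimal illustration, assuming curvature -1, an exponential map at the origin to lift Euclidean features onto the hyperboloid, and the normalized-sum Lorentz centroid; the paper's exact choices may differ, and all function names are illustrative.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the remaining coordinate products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map_origin(u):
    """Lift a Euclidean feature vector onto the hyperboloid via the origin's exp map."""
    norm = math.sqrt(sum(a * a for a in u))
    if norm == 0.0:
        return [1.0] + [0.0] * len(u)
    return [math.cosh(norm)] + [math.sinh(norm) * a / norm for a in u]

def lorentz_centroid(points):
    """Sum the points, then rescale the sum back onto the hyperboloid."""
    s = [sum(col) for col in zip(*points)]
    scale = math.sqrt(-lorentz_inner(s, s))
    return [a / scale for a in s]

def geodesic_distance(x, y):
    """Hyperbolic distance; the clamp guards against rounding slightly below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

def centroid_discrepancy(real_feats, syn_feats):
    """Distance between the hyperbolic centroids of real and synthetic features."""
    real_c = lorentz_centroid([exp_map_origin(f) for f in real_feats])
    syn_c = lorentz_centroid([exp_map_origin(f) for f in syn_feats])
    return geodesic_distance(real_c, syn_c)
```

In a distillation loop this quantity would be minimized with respect to the synthetic data, which requires an autodiff framework that is left out here.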

[309] Information Structure in Mappings: An Approach to Learning, Representation, and Generalisation

Henry Conklin

Main category: cs.LG

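TL;DR: The thesis proposes quantitative methods for analyzing the structure of representational spaces in large-scale neural networks, examines how learning, generalisation, and design decisions affect that structure, and introduces an efficient method for estimating the entropy of a vector space.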

Details

Motivation: Despite the remarkable success of large-scale neural networks, there is no unified way to describe their representational spaces, so their structure, how it emerges during learning, and what structure is desirable cannot be described reliably. Method: By identifying structural primitives in mappings and quantifying them information-theoretically, the thesis analyses learning and generalisation in multi-agent reinforcement learning models, sequence-to-sequence models, and large language models. Result: The experiments shed light on how large-scale distributed models of cognition learn, and show parallels between the structures of language and the structures that drive neural network performance. Conclusion: The work provides new tools and insights for understanding the structure of neural network representational spaces and their analogy to human cognition. Abstract: Despite the remarkable success of large-scale neural networks, we still lack unified notation for thinking about and describing their representational spaces. We lack methods to reliably describe how their representations are structured, how that structure emerges over training, and what kinds of structures are desirable. This thesis introduces quantitative methods for identifying systematic structure in a mapping between spaces, and leverages them to understand how deep-learning models learn to represent information, what representational structures drive generalisation, and how design decisions condition the structures that emerge. To do this I identify structural primitives present in a mapping, along with information theoretic quantifications of each. These allow us to analyse learning, structure, and generalisation across multi-agent reinforcement learning models, sequence-to-sequence models trained on a single task, and Large Language Models. I also introduce a novel, performant approach to estimating the entropy of vector space, that allows this analysis to be applied to models ranging in size from 1 million to 12 billion parameters. The experiments here work to shed light on how large-scale distributed models of cognition learn, while allowing us to draw parallels between those systems and their human analogs. They show how the structures of language and the constraints that give rise to them in many ways parallel the kinds of structures that drive performance of contemporary neural networks.

[310] Large Language Models for Controllable Multi-property Multi-objective Molecule Optimization

Vishal Dey,Xiao Hu,Xia Ning

Main category: cs.LG

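TL;DR: The paper introduces the C-MuMOInstruct dataset and the GeLLMO-Cs models for multi-property molecule optimization, significantly improving success rate and generalization.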

Details

Motivation: Existing methods and LLMs struggle to capture property-specific objectives in multi-property molecule optimization, which limits practical use. Method: Use the C-MuMOInstruct dataset to develop the GeLLMO-Cs models, which support property-specific optimization. Result: GeLLMO-Cs perform strongly on 5 in-distribution and 5 out-of-distribution tasks, with up to 126% higher success rate, and show zero-shot generalization. Conclusion: GeLLMO-Cs are a step toward a foundational LLM that supports realistic, diverse property-specific optimization. Abstract: In real-world drug design, molecule optimization requires selectively improving multiple molecular properties up to pharmaceutically relevant levels, while maintaining others that already meet such criteria. However, existing computational approaches and instruction-tuned LLMs fail to capture such nuanced property-specific objectives, limiting their practical applicability. To address this, we introduce C-MuMOInstruct, the first instruction-tuning dataset focused on multi-property optimization with explicit, property-specific objectives. Leveraging C-MuMOInstruct, we develop GeLLMO-Cs, a series of instruction-tuned LLMs that can perform targeted property-specific optimization. Our experiments across 5 in-distribution and 5 out-of-distribution tasks show that GeLLMO-Cs consistently outperform strong baselines, achieving up to 126% higher success rate. Notably, GeLLMO-Cs exhibit impressive 0-shot generalization to novel optimization tasks and unseen instructions. This offers a step toward a foundational LLM to support realistic, diverse optimizations with property-specific objectives. C-MuMOInstruct and code are accessible through https://github.com/ninglab/GeLLMO-C.

[311] Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows

Orlando Marquez Ayala,Patrice Bechard,Emily Chen,Maggie Baird,Jingfei Chen

Main category: cs.LG

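TL;DR: The study shows that although large language models (LLMs) such as GPT-4o can handle complex tasks, fine-tuning a small language model (SLM) still has a quality advantage on domain-specific tasks that require structured output, with an average improvement of 10%.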

Details

Motivation: To examine whether fine-tuning SLMs still offers practical advantages as LLM costs fall, particularly for tasks that require structured output. Method: Compare fine-tuning an SLM with prompting LLMs on generating low-code workflows in JSON form, and perform a systematic error analysis. Result: Fine-tuned SLMs beat prompted LLMs in quality by 10% on average; prompting LLMs gives reasonable results but has limitations. Conclusion: For domain-specific tasks that require structured output, fine-tuning an SLM retains a clear quality advantage and is worth considering in practice. Abstract: Large Language Models (LLMs) such as GPT-4o can handle a wide range of complex tasks with the right prompt. As per-token costs are reduced, the advantages of fine-tuning Small Language Models (SLMs) for real-world applications -- faster inference, lower costs -- may no longer be clear. In this work, we present evidence that, for domain-specific tasks that require structured outputs, SLMs still have a quality advantage. We compare fine-tuning an SLM against prompting LLMs on the task of generating low-code workflows in JSON form. We observe that while a good prompt can yield reasonable results, fine-tuning improves quality by 10% on average. We also perform systematic error analysis to reveal model limitations.

[312] Large Language Models are Locally Linear Mappings

James R. Golden

Main category: cs.LG

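TL;DR: The paper presents a method that maps the inference operations of large language models (LLMs) to a linear system, revealing the local linearity of their internal representations.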

Details

Motivation: To understand the internal workings of modern LLMs: despite their global nonlinearity, they can be approximated locally by a linear system. Method: Alter the gradient computation with respect to the input sequence so that the model's Jacobian nearly exactly reproduces the forward prediction as a linear system, and analyse low-dimensional subspaces with singular value decomposition. Result: Experiments show that LLMs operate in extremely low-dimensional subspaces, that the largest singular vectors relate to the most likely output token, and that semantic concepts emerge across layers. Conclusion: Despite the expressive power of LLMs, their locally linear decomposition makes their internal representations and semantic structure interpretable. Abstract: We demonstrate that the inference operations of several open-weight large language models (LLMs) can be mapped to an exactly equivalent linear system for an input sequence without modifying the model weights or altering output predictions. Extending techniques from image diffusion models that exhibit local or piecewise linearity, we strategically alter the gradient computation with respect to a given input sequence for a next-token prediction such that the Jacobian of the model nearly exactly reproduces the forward prediction with a linear system. We demonstrate this approach across models (Llama 3, Gemma 3, Qwen 3, Phi 4, Mistral Ministral and OLMo 2, up to Llama 3.3 70B Q4) and show through the singular value decomposition of the detached Jacobian that these LLMs operate in extremely low-dimensional subspaces where many of the largest singular vectors decode to concepts related to the most-likely output token. This approach also allows us to examine the operation of each successive layer (and its attention and MLP components) as nearly-exact linear systems and observe the emergence of semantic concepts. Despite their expressive power and global nonlinearity, modern LLMs can be interpreted through nearly-exact locally linear decompositions that provide insights into their internal representations and reveal interpretable semantic structures in the next-token prediction process.
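
The underlying notion of local linearity can be seen in a much simpler setting than an LLM. The toy below is an assumption-laden illustration, not the paper's procedure: it uses a bias-free two-layer ReLU network, for which the Jacobian at an input reproduces the forward output exactly once the activation pattern is frozen. The paper's "detached Jacobian" extends this idea to gated activations, normalization, and attention.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(W1, W2, x):
    """Bias-free two-layer ReLU network: W2 @ relu(W1 @ x)."""
    return matvec(W2, [max(0.0, p) for p in matvec(W1, x)])

def local_jacobian(W1, W2, x):
    """Jacobian at x with the ReLU on/off pattern held fixed: W2 @ diag(gate) @ W1."""
    gate = [1.0 if p > 0 else 0.0 for p in matvec(W1, x)]
    return [
        [sum(W2[o][h] * gate[h] * W1[h][i] for h in range(len(W1)))
         for i in range(len(x))]
        for o in range(len(W2))
    ]
```

For this network, `matvec(local_jacobian(W1, W2, x), x)` equals `forward(W1, W2, x)` at any input, even though the network as a whole is nonlinear because the gate pattern changes from input to input.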

[313] SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation

Ivan Petrukha,Yana Kurliak,Nataliia Stulova

Main category: cs.LG

TL;DR: The paper presents SwiftEval, the first benchmark targeting the Swift programming language, addressing the shortcomings of existing benchmarks on Swift.

Details

Motivation: Existing benchmarks mainly target Python and cannot evaluate other programming languages such as Swift with high quality, so a Swift-specific benchmark is needed. Method: Following a quality-first approach, the authors hand-craft 28 Swift problems and evaluate 44 popular code-generation LLMs. Result: Model scores drop significantly on problems that require language-specific features, especially for smaller models. Conclusion: SwiftEval fills the gap in Swift evaluation and exposes model weaknesses on language-specific features. Abstract: In recent years, large language models (LLMs) have showcased significant advancements in code generation. However, most evaluation benchmarks are primarily oriented towards Python, making it difficult to evaluate other programming languages, such as Swift, with high quality. By examining widely established multilingual benchmarks like HumanEval-XL and MultiPL-E, we identified critical issues specific to their Swift components, making them insufficient or even irrelevant for assessing LLM coding capabilities on Swift. Unlike these existing approaches, which prioritize rapid scaling and generalization by automatically translating Python-centric benchmarks with LLMs, we adopt a quality-over-quantity methodology. We present SwiftEval, the first Swift-oriented benchmark consisting of 28 carefully hand-crafted problems, and evaluate 44 popular Code LLMs on it. Our results show a significant drop in LLM scores on problems requiring language-specific features, most noticeable in smaller models.

[314] Breaking the Gold Standard: Extracting Forgotten Data under Exact Unlearning in Large Language Models

Xiaoyu Wu,Yifei Pang,Terrance Liu,Zhiwei Steven Wu

Main category: cs.LG

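TL;DR: The paper challenges the assumption that exact unlearning is the gold standard for privacy protection, proposing a new data extraction attack which shows that even exact unlearning can leak private data.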

Details

Motivation: LLM training data may contain sensitive information. Exact unlearning is considered effective at protecting privacy, but the authors question its security. Method: Design a data extraction attack that combines signals from the pre- and post-unlearning models, with a token filtering strategy to raise the success rate. Result: The attack significantly improves extraction success rates on several benchmarks, and a simulated medical dataset confirms the real-world privacy risk. Conclusion: Unlearning may increase the risk of privacy leakage; evaluations of unlearning methods should consider broader threat models. Abstract: Large language models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning -- which retrains the model from scratch without the target data -- is widely regarded as the gold standard, believed to be robust against privacy-related attacks. In this paper, we challenge this assumption by introducing a novel data extraction attack that compromises even exact unlearning. Our method leverages both the pre- and post-unlearning models: by guiding the post-unlearning model using signals from the pre-unlearning model, we uncover patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates -- doubling performance in some cases -- across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack's effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. In light of our findings, which suggest that unlearning may, in a contradictory way, increase the risk of privacy leakage, we advocate for evaluation of unlearning methods to consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints.

[315] Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

Narmeen Oozeer,Luke Marks,Fazl Barez,Amirali Abdullah

Main category: cs.LG

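TL;DR: K-Steering is a non-linear method that computes intervention directions from gradients, resolving interference in multi-attribute behavioral control and outperforming baselines.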

Details

Motivation: Controlling multiple behavioral attributes of large language models (LLMs) at inference time suffers from interference between attributes and the limitations of linear methods. Method: Train a non-linear multi-label classifier and compute intervention directions from its gradients, avoiding linearity assumptions and per-attribute vector tuning. Result: Across 3 model families, K-Steering outperforms baselines at accurately steering multiple behaviors. Conclusion: K-Steering offers a flexible, dynamic way to control multiple behaviors without retraining. Abstract: Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning. We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations and computes intervention directions via gradients at inference time. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. To evaluate our method, we propose two new benchmarks, ToneBank and DebateMix, targeting compositional behavioral control. Empirical results across 3 model families, validated by both activation-based classifiers and LLM-based judges, demonstrate that K-Steering outperforms strong baselines in accurately steering multiple behaviors.
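
Gradient-based steering through a classifier, as described in the abstract, can be sketched on a toy example. Everything below is an assumption for illustration: the tiny tanh-sigmoid classifier, the negative log-probability loss over target attributes, the single gradient step, and the central-difference gradient, which stands in for the automatic differentiation a real implementation would use.

```python
import math

def classifier(h, W1, W2):
    """Toy non-linear multi-label classifier: sigmoid(W2 @ tanh(W1 @ h))."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W1]
    return [1.0 / (1.0 + math.exp(-sum(w * z for w, z in zip(row, hidden))))
            for row in W2]

def steering_loss(h, W1, W2, targets):
    """Negative log-probability of the attributes we want switched on."""
    probs = classifier(h, W1, W2)
    return -sum(math.log(probs[k]) for k in targets)

def steer(h, W1, W2, targets, step=0.5, eps=1e-5):
    """One gradient step on the activation h toward the target attributes."""
    grad = []
    for i in range(len(h)):
        plus, minus = h[:], h[:]
        plus[i] += eps
        minus[i] -= eps
        # Central difference approximates d(loss)/d(h_i).
        grad.append((steering_loss(plus, W1, W2, targets)
                     - steering_loss(minus, W1, W2, targets)) / (2 * eps))
    return [x - step * g for x, g in zip(h, grad)]
```

Passing several indices in `targets` composes attributes through a single loss, so no per-attribute vector needs to be stored.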

[316] Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting

Wei Chen,Jiahao Zhang,Haipeng Zhu,Boyan Xu,Zhifeng Hao,Keli Zhang,Junjian Ye,Ruichu Cai

Main category: cs.LG

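TL;DR: The paper proposes Causal-aware LLMs, which integrate a structural causal model (SCM) to improve decision-making, addressing the weak reasoning and poor adaptability of pre-trained models.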

Details

Motivation: Pre-trained large language models (LLMs) store a great deal of knowledge, but in decision-making tasks they lack reasoning ability and adapt poorly to new environments, limiting their use in complex real-world tasks. Method: A "learning-adapting-acting" paradigm: (1) the learning stage extracts the environment's causal entities and relations to initialize the SCM; (2) the adapting stage updates the SCM through causal intervention; (3) the acting stage uses SCM knowledge to guide policy-making by a reinforcement learning agent. Result: The method's effectiveness is validated on 22 tasks in the open-world game "Crafter". Conclusion: By iteratively learning causal knowledge, Causal-aware LLMs understand the environment more accurately and make more efficient decisions. Abstract: Large language models (LLMs) have shown great potential in decision-making due to the vast amount of knowledge stored within the models. However, these pre-trained models are prone to lack reasoning abilities and are difficult to adapt to new environments, further hindering their application to complex real-world tasks. To address these challenges, inspired by the human cognitive process, we propose Causal-aware LLMs, which integrate the structural causal model (SCM) into the decision-making process to model, update, and utilize structured knowledge of the environment in a "learning-adapting-acting" paradigm. Specifically, in the learning stage, we first utilize an LLM to extract the environment-specific causal entities and their causal relations to initialize a structured causal model of the environment. Subsequently, in the adapting stage, we update the structured causal model through external feedback about the environment, via an idea of causal intervention. Finally, in the acting stage, Causal-aware LLMs exploit structured causal knowledge for more efficient policy-making through the reinforcement learning agent. The above processes are performed iteratively to learn causal knowledge, ultimately enabling the causal-aware LLMs to achieve a more accurate understanding of the environment and make more efficient decisions. Experimental results across 22 diverse tasks within the open-world game "Crafter" validate the effectiveness of our proposed method.

[317] CoRet: Improved Retriever for Code Editing

Fabio Fehr,Prabhu Teja Sivaprasad,Luca Franceschi,Giovanni Zappella

Main category: cs.LG

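TL;DR: CoRet is a dense retrieval model designed for code-editing tasks that integrates code semantics, repository structure, and call graph dependencies, significantly improving retrieval recall.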

Details

Motivation: To retrieve the relevant parts of a code repository from natural language queries, such as requests to implement new features or fix bugs. Method: Propose a loss function designed specifically for repository-level retrieval, integrating code semantics, repository structure, and call graph dependencies. Result: On SWE-bench and Long Code Arena's bug localisation datasets, retrieval recall improves by at least 15 percentage points. Conclusion: CoRet's design choices are important to the retrieval gains, and it substantially outperforms existing models. Abstract: In this paper, we introduce CoRet, a dense retrieval model designed for code-editing tasks that integrates code semantics, repository structure, and call graph dependencies. The model focuses on retrieving relevant portions of a code repository based on natural language queries such as requests to implement new features or fix bugs. These retrieved code chunks can then be presented to a user or to a second code-editing model or agent. To train CoRet, we propose a loss function explicitly designed for repository-level retrieval. On SWE-bench and Long Code Arena's bug localisation datasets, we show that our model substantially improves retrieval recall by at least 15 percentage points over existing models, and ablate the design choices to show their importance in achieving these results.

[318] SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training

Yehonathan Refael,Guy Smorodinsky,Tom Tirer,Ofir Lindenbaum

Main category: cs.LG

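TL;DR: SUMO is an optimizer built on a low-dimensional subspace and SVD that uses exact orthogonalization to improve convergence speed and stability while reducing memory requirements.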

Details

Motivation: Existing low-rank gradient optimization methods save memory but overlook potential convergence acceleration, and perform poorly on the anisotropic loss landscapes of LLM training. Method: SUMO performs orthogonalization with SVD in a dynamically adapted low-dimensional subspace, enabling norm-induced steepest descent steps and reducing approximation error. Result: Theory shows that SUMO reduces approximation error and improves convergence; experiments show better performance, stability, and memory efficiency than existing methods. Conclusion: Through exact orthogonalization, SUMO improves LLM training and balances memory efficiency with convergence. Abstract: Low-rank gradient-based optimization methods have significantly improved memory efficiency during the training of large language models (LLMs), enabling operations within constrained hardware without sacrificing performance. However, these methods primarily emphasize memory savings, often overlooking potential acceleration in convergence due to their reliance on standard isotropic steepest descent techniques, which can perform suboptimally in the highly anisotropic landscapes typical of deep networks, particularly LLMs. In this paper, we propose SUMO (Subspace-Aware Moment-Orthogonalization), an optimizer that employs exact singular value decomposition (SVD) for moment orthogonalization within a dynamically adapted low-dimensional subspace, enabling norm-inducing steepest descent optimization steps. By explicitly aligning optimization steps with the spectral characteristics of the loss landscape, SUMO effectively mitigates approximation errors associated with commonly used methods like Newton-Schulz orthogonalization approximation. We theoretically establish an upper bound on these approximation errors, proving their dependence on the condition numbers of moments, conditions we analytically demonstrate are encountered during LLM training. Furthermore, we both theoretically and empirically illustrate that exact orthogonalization via SVD substantially improves convergence rates while reducing overall complexity. Empirical evaluations confirm that SUMO accelerates convergence, enhances stability, improves performance, and reduces memory requirements by up to 20% compared to state-of-the-art methods.
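
Exact orthogonalization of a moment matrix inside a low-rank subspace can be sketched with NumPy. This is a hedged illustration rather than the SUMO optimizer: the subspace here is simply the span of the moment's top left singular vectors (an assumption), and the moment update rule, subspace refresh schedule, and step size are all omitted.

```python
import numpy as np

def orthogonalize_in_subspace(moment, rank):
    """Project the moment to a rank-`rank` subspace and orthogonalize it there via SVD."""
    # Subspace basis: top left singular vectors of the moment itself (assumption).
    U, _, _ = np.linalg.svd(moment, full_matrices=False)
    Q = U[:, :rank]                      # (m, rank), orthonormal columns
    low = Q.T @ moment                   # (rank, n) low-dimensional moment
    Ul, _, Vlt = np.linalg.svd(low, full_matrices=False)
    # Replacing the singular values with ones gives the exact orthogonal factor.
    return Q @ (Ul @ Vlt)                # back to the original (m, n) shape
```

The returned matrix has `rank` singular values equal to one and the rest equal to zero, which is the quantity that Newton-Schulz iterations only approximate. The second SVD acts on a `rank`-by-`n` matrix, so it stays cheap when `rank` is small.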

[319] REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Zafir Stojanovski,Oliver Stanley,Joe Sharratt,Richard Jones,Abdulhakeem Adefioye,Jean Kaddour,Andreas Köpf

Main category: cs.LG

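TL;DR: Reasoning Gym (RG) is a library of reasoning environments for reinforcement learning that provides verifiable rewards and supports data generation and verification across many domains.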

Details

Motivation: Address the fixed, limited nature of traditional reasoning datasets by procedurally generating unlimited training data with adjustable complexity. Method: Provide over 100 data generators and verifiers covering algebra, arithmetic, geometry, and other domains, with unlimited data generation at adjustable difficulty. Result: Experiments demonstrate the efficacy of RG for both evaluating reasoning models and training them with reinforcement learning. Conclusion: RG is a flexible, scalable tool for training and evaluating reasoning models. Abstract: We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both evaluating and reinforcement learning of reasoning models.
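
The pairing of a procedural generator with a verifier can be illustrated generically. The sketch below does not use the Reasoning Gym API; the function names, the task dictionary layout, and the way `difficulty` scales the problem are hypothetical choices made for illustration.

```python
import random

def generate_addition_task(seed, difficulty):
    """Seeded generator: higher difficulty means more and larger terms."""
    rng = random.Random(seed)
    upper = 10 ** difficulty
    terms = [rng.randint(0, upper - 1) for _ in range(difficulty + 1)]
    question = " + ".join(str(t) for t in terms) + " = ?"
    return {"question": question, "answer": str(sum(terms))}

def verify(task, response):
    """Verifiable reward: 1.0 for an exact match with the known answer, else 0.0."""
    return 1.0 if response.strip() == task["answer"] else 0.0
```

Because each task is a pure function of `(seed, difficulty)`, the supply of tasks is effectively unlimited and any task can be regenerated exactly for evaluation.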

[320] PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models

Yinggan Xu,Yue Liu,Zhiqiang Gao,Changnan Peng,Di Luo

Main category: cs.LG

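TL;DR: The paper introduces the PhySense benchmark, which exposes the shortcomings of current large language models in principle-based physics reasoning.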

Details

Motivation: When solving complex scientific problems, current large language models struggle to reason concisely from principles as human experts do, producing lengthy and opaque solutions instead. Method: Introduce the PhySense benchmark and evaluate multiple state-of-the-art LLMs and prompt types on principle-based physics reasoning. Result: The evaluation shows that LLMs consistently fail to align with expert reasoning paths. Conclusion: The study points the way toward AI systems with efficient, robust, and interpretable principle-based scientific reasoning. Abstract: Large language models (LLMs) have rapidly advanced and are increasingly capable of tackling complex scientific problems, including those in physics. Despite this progress, current LLMs often fail to emulate the concise, principle-based reasoning characteristic of human experts, instead generating lengthy and opaque solutions. This discrepancy highlights a crucial gap in their ability to apply core physical principles for efficient and interpretable problem solving. To systematically investigate this limitation, we introduce PhySense, a novel principle-based physics reasoning benchmark designed to be easily solvable by experts using guiding principles, yet deceptively difficult for LLMs without principle-first reasoning. Our evaluation across multiple state-of-the-art LLMs and prompt types reveals a consistent failure to align with expert-like reasoning paths, providing insights for developing AI systems with efficient, robust and interpretable principle-based scientific reasoning.

[321] Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning

Wanyun Xie,Francesco Tonin,Volkan Cevher

Main category: cs.LG

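TL;DR: Chameleon is an efficient, flexible data-mixing framework that uses leverage scores to quantify domain importance in an embedding space and adapts to new data without retraining.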

Details

Motivation: Existing domain reweighting methods are computationally expensive and require retraining; Chameleon aims to solve these problems. Method: Construct a domain affinity matrix, use the induced leverage scores to determine mixture weights, and transfer directly to new data. Result: Strong results in pretraining, adaptation to new data, and finetuning, with high computational efficiency. Conclusion: Chameleon provides an efficient, flexible approach to domain reweighting that improves model performance. Abstract: Training data mixtures greatly impact the generalization performance of large language models. Existing domain reweighting methods often rely on costly weight computations and require retraining when new data is introduced. To this end, we introduce a flexible and efficient data mixing framework, Chameleon, that employs leverage scores to quantify domain importance within a learned embedding space. We first construct a domain affinity matrix over domain embeddings. The induced leverage scores determine a mixture that upweights domains sharing common representations in embedding space. This formulation allows direct transfer to new data by computing the new domain embeddings. In experiments, we demonstrate improvements over three key scenarios: (i) our computed weights improve performance on pretraining domains with a fraction of the compute of existing methods; (ii) Chameleon can adapt to data changes without proxy retraining, boosting few-shot reasoning accuracies when transferred to new data; (iii) our method enables efficient domain reweighting in finetuning, consistently improving test perplexity on all finetuning domains over uniform mixture. Our code is available at https://github.com/LIONS-EPFL/Chameleon.

[322] Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning

Shuyao Xu,Cheng Peng,Jiangxuan Long,Weidi Xu,Wei Chu,Yuan Qi

Main category: cs.LG

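TL;DR: The paper proposes REDI, a two-stage framework that uses both positive and negative reasoning traces to improve the reasoning ability of small models, outperforming existing methods.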

Details

Motivation: Existing distillation methods usually discard incorrect reasoning traces, although this data may be valuable for improving model performance. The paper explores how to use both positive and negative traces effectively. Method: The REDI framework: Stage 1 learns from positive examples via supervised fine-tuning (SFT); Stage 2 further refines the model on both positive and negative examples using the REDI objective, a simple, reference-free loss function. Result: REDI outperforms baselines on mathematical reasoning tasks; Qwen-REDI-1.5B reaches 83.1% on MATH-500, on par with models trained on proprietary data. Conclusion: By making effective use of positive and negative reasoning traces, REDI substantially improves the reasoning of small models and sets a new performance bar for models trained on open data. Abstract: Recent advances in model distillation demonstrate that data from advanced reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer complex reasoning abilities to smaller, efficient student models. However, standard practices employ rejection sampling, discarding incorrect reasoning examples -- valuable, yet often underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? To this end, we propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. This novel objective is a simple, reference-free loss function that outperforms established methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate REDI's superiority over baseline Rejection Sampling SFT or SFT combined with DPO/SimPO on mathematical reasoning tasks. Notably, the Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples from the open Open-R1 dataset, achieves an 83.1% score on MATH-500 (pass@1). Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a model post-trained on 800k proprietary data) across various mathematical reasoning benchmarks, establishing a new state-of-the-art for 1.5B models post-trained offline with openly available data.

[323] Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization

Joschka Braun,Carsten Eickhoff,Seyed Ali Bahrainian

Main category: cs.LG

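TL;DR: This paper studies how well steering vectors control text properties in free-form generation. Steering controls the targeted properties effectively, but high steering strengths degrade text quality. Prompting gives weaker control while preserving quality better, and combining the two works best.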

Details

Motivation: Steering vectors are a lightweight way to control text properties at inference time by adding a learned bias, but their effectiveness in free-form generation has not been studied enough. Method: On summarization with the NEWTS dataset, evaluates how well steering vectors control topical focus, sentiment, toxicity, and readability, and compares them with prompting. Result: Steering vectors control the targeted properties effectively, but high steering strengths degrade text quality; prompting gives weaker control while preserving quality better; combining the two works best. Conclusion: Applying steering vectors to free-form generation requires trading off control strength against text quality, and combining them with prompting improves that trade-off. Abstract: Steering vectors are a lightweight method for controlling text properties by adding a learned bias to language model activations at inference time. So far, steering vectors have predominantly been evaluated in multiple-choice settings, while their effectiveness in free-form generation tasks remains understudied. Moving "Beyond Multiple Choice," we thoroughly evaluate the effectiveness of steering vectors in adaptively controlling topical focus, sentiment, toxicity, and readability in abstractive summaries of the NEWTS dataset. We find that steering effectively controls the targeted summary properties, but high steering strengths consistently degrade both intrinsic and extrinsic text quality. Compared to steering, prompting offers weaker control, while preserving text quality. Combining steering and prompting yields the strongest control over text properties and offers the most favorable efficacy-quality trade-off at moderate steering strengths. Our results underscore the practical trade-off between control strength and text quality preservation when applying steering vectors to free-form generation tasks.
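
As a rough illustration of the mechanism the abstract describes (adding a learned bias to activations, scaled by a steering strength), the sketch below applies a steering vector to a single hidden-state vector. The function name `apply_steering`, the toy vectors, and the strength values are hypothetical and are not taken from the paper; how the steering vector is learned and which layer it is applied to are not shown.

```python
from typing import List


def apply_steering(hidden: List[float], steering: List[float], strength: float) -> List[float]:
    """Return hidden + strength * steering, computed elementwise.

    `hidden` stands in for one activation vector of a language model and
    `steering` for a learned steering vector of the same size (both hypothetical).
    """
    if len(hidden) != len(steering):
        raise ValueError("hidden state and steering vector must have the same size")
    return [h + strength * s for h, s in zip(hidden, steering)]


if __name__ == "__main__":
    hidden_state = [0.5, -1.0, 2.0]      # toy activation, not real model output
    steering_vector = [1.0, 0.0, -1.0]   # toy steering direction
    # Larger strengths move the activation further from its original value,
    # which is the knob the abstract links to the control/quality trade-off.
    for strength in (0.0, 1.0, 4.0):
        print(strength, apply_steering(hidden_state, steering_vector, strength))
```

A strength of zero leaves the activation unchanged, so the same code path can serve as the unsteered baseline.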

physics.med-ph [Back]

[324] Digital twins enable full-reference quality assessment of photoacoustic image reconstructions

Janek Gröhl,Leonid Kunyansky,Jenni Poimala,Thomas R. Else,Francesca Di Cecio,Sarah E. Bohndiek,Ben T. Cox,Andreas Hauptmann

Main category: physics.med-ph

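TL;DR: The paper proposes a digital-twin-based method for quantitatively assessing the quality of photoacoustic image reconstruction algorithms and tests a Fourier transform-based reconstruction algorithm.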

Details

Motivation: Quality assessment of photoacoustic image reconstruction algorithms lacks an ideal reference image; digital twins can narrow the gap between simulation and real data. Method: Uses a digital-twin framework to compare several reconstruction algorithms and tests a Fourier transform-based algorithm on experimental data for the first time. Result: The digital twins make it possible to assess the accuracy of the numerical model, and the Fourier transform-based algorithm performs comparably to iterative time reversal at a lower computational cost. Conclusion: Digital twins offer a new way to evaluate photoacoustic reconstruction algorithms quantitatively, and the Fourier transform-based algorithm is computationally efficient. Abstract: Quantitative comparison of the quality of photoacoustic image reconstruction algorithms remains a major challenge. No-reference image quality measures are often inadequate, but full-reference measures require access to an ideal reference image. While the ground truth is known in simulations, it is unknown in vivo, or in phantom studies, as the reference depends on both the phantom properties and the imaging system. We tackle this problem by using numerical digital twins of tissue-mimicking phantoms and the imaging system to perform a quantitative calibration to reduce the simulation gap. The contributions of this paper are two-fold: First, we use this digital-twin framework to compare multiple state-of-the-art reconstruction algorithms. Second, among these is a Fourier transform-based reconstruction algorithm for circular detection geometries, which we test on experimental data for the first time. Our results demonstrate the usefulness of digital phantom twins by enabling assessment of the accuracy of the numerical forward model and enabling comparison of image reconstruction schemes with full-reference image quality assessment. We show that the Fourier transform-based algorithm yields results comparable to those of iterative time reversal, but at a lower computational cost. All data and code are publicly available on Zenodo: https://doi.org/10.5281/zenodo.15388429.

cs.RO [Back]

[325] SR3D: Unleashing Single-view 3D Reconstruction for Transparent and Specular Object Grasping

Mingxu Zhang,Xiaoqi Li,Jiahui Xu,Kaichen Zhou,Hojin Bae,Yan Shen,Chuyan Xiong,Jiaming Liu,Hao Dong

Main category: cs.RO

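TL;DR: SR3D is a training-free framework that performs 3D reconstruction and grasping of transparent and specular objects from single-view RGB and depth images.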

Details

Motivation: Transparent and specular materials are hard to grasp because of depth sensing limitations, and existing methods are complex to set up or make limited use of the available observations. Method: Uses external vision models to generate a 3D mesh, determines the object pose through view matching and keypoint matching, and reconstructs an accurate depth map. Result: Experiments in simulation and the real world confirm the effectiveness of SR3D's reconstruction. Conclusion: SR3D offers an efficient solution for grasping transparent and specular objects. Abstract: Recent advancements in 3D robotic manipulation have improved grasping of everyday objects, but transparent and specular materials remain challenging due to depth sensing limitations. While several 3D reconstruction and depth completion approaches address these challenges, they suffer from setup complexity or limited use of the observed information. To address this, leveraging the power of single-view 3D object reconstruction approaches, we propose a training-free framework, SR3D, that enables robotic grasping of transparent and specular objects from a single-view observation. Specifically, given single-view RGB and depth images, SR3D first uses external visual models to generate a reconstructed 3D object mesh from the RGB image. Then, the key idea is to determine the 3D object's pose and scale to accurately localize the reconstructed object back into its original depth-corrupted 3D scene. Therefore, we propose view matching and keypoint matching mechanisms, which leverage the inherent 2D and 3D semantic and geometric information in the observation to determine the object's 3D state within the scene, thereby reconstructing an accurate 3D depth map for effective grasp detection. Experiments in both simulation and the real world show the reconstruction effectiveness of SR3D.

[326] Black-box Adversarial Attacks on CNN-based SLAM Algorithms

Maria Rafaela Gkeka,Bowen Sun,Evgenia Smirni,Christos D. Antonopoulos,Spyros Lalis,Nikolaos Bellas

Main category: cs.RO

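TL;DR: The paper studies adversarial attacks on CNN-based SLAM systems and finds that even moderate-scale attacks can cause tracking failure in as many as 76% of frames, and that attacking the depth input is more damaging than attacking the RGB input.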

Details

Motivation: Although deep learning has brought significant progress to SLAM, the vulnerability of deep neural networks to adversarial attacks limits their reliability in applications such as autonomous navigation. A comprehensive study of adversarial attacks on the feature detectors in CNN-based SLAM systems is lacking. Method: Applies black-box adversarial perturbations to the RGB input images of the GCN-SLAM algorithm and runs experiments on the TUM dataset. Result: Moderate-scale attacks can cause tracking failure in as many as 76% of frames, and attacking the depth input is more destructive for the SLAM system. Conclusion: The paper exposes the vulnerability of CNN-based SLAM systems to adversarial attacks and stresses the need for stronger adversarial robustness in critical applications such as autonomous navigation. Abstract: Continuous advancements in deep learning have led to significant progress in feature detection, resulting in enhanced accuracy in tasks like Simultaneous Localization and Mapping (SLAM). Nevertheless, the vulnerability of deep neural networks to adversarial attacks remains a challenge for their reliable deployment in applications such as navigation of autonomous agents. Even though CNN-based SLAM algorithms are a growing area of research, there is a notable absence of a comprehensive presentation and examination of adversarial attacks targeting CNN-based feature detectors as part of a SLAM system. Our work introduces black-box adversarial perturbations applied to the RGB images fed into the GCN-SLAM algorithm. Our findings on the TUM dataset reveal that even attacks of moderate scale can lead to tracking failure in as many as 76% of the frames. Moreover, our experiments highlight the catastrophic impact of attacking depth instead of RGB input images on the SLAM system.

[327] DiG-Net: Enhancing Quality of Life through Hyper-Range Dynamic Gesture Recognition in Assistive Robotics

Eran Bamani Beeri,Eden Nissinman,Avishai Sintov

Main category: cs.RO

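TL;DR: Proposes DiG-Net, a new dynamic gesture recognition method that works at distances of up to 30 meters, significantly improving human-robot interaction for assistive robots.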

Details

Motivation: Current gesture recognition methods are limited to short-range interaction, which restricts their use in scenarios that need assistive communication from afar. Method: Combines Depth-Conditioned Deformable Alignment (DADA) blocks with Spatio-Temporal Graph modules and introduces the Radiometric Spatio-Temporal Depth Attenuation Loss (RSTDAL) to strengthen model robustness. Result: Reaches 97.3% recognition accuracy on a diverse dataset, significantly better than existing methods. Conclusion: DiG-Net significantly improves the usability of assistive robots in home healthcare, industrial safety, and remote assistance scenarios. Abstract: Dynamic hand gestures play a pivotal role in assistive human-robot interaction (HRI), facilitating intuitive, non-verbal communication, particularly for individuals with mobility constraints or those operating robots remotely. Current gesture recognition methods are mostly limited to short-range interactions, reducing their utility in scenarios demanding robust assistive communication from afar. In this paper, we introduce a novel approach designed specifically for assistive robotics, enabling dynamic gesture recognition at extended distances of up to 30 meters, thereby significantly improving accessibility and quality of life. Our proposed Distance-aware Gesture Network (DiG-Net) effectively combines Depth-Conditioned Deformable Alignment (DADA) blocks with Spatio-Temporal Graph modules, enabling robust processing and classification of gesture sequences captured under challenging conditions, including significant physical attenuation, reduced resolution, and dynamic gesture variations commonly experienced in real-world assistive environments. We further introduce the Radiometric Spatio-Temporal Depth Attenuation Loss (RSTDAL), shown to enhance learning and strengthen model robustness across varying distances. Our model demonstrates significant performance improvement over state-of-the-art gesture recognition frameworks, achieving a recognition accuracy of 97.3% on a diverse dataset with challenging hyper-range gestures. By effectively interpreting gestures from considerable distances, DiG-Net significantly enhances the usability of assistive robots in home healthcare, industrial safety, and remote assistance scenarios, enabling seamless and intuitive interactions for users regardless of physical limitations.

[328] Bi-Manual Joint Camera Calibration and Scene Representation

Haozhan Tang,Tianyi Zhang,Matthew Johnson-Roberson,Weiming Zhi

Main category: cs.RO

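TL;DR: The Bi-JCR framework uses 3D foundation models for marker-free multi-view calibration, jointly estimating camera extrinsics, inter-arm poses, and a 3D representation of the shared workspace, which simplifies calibration of bimanual robot systems.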

Details

Motivation: Conventional camera calibration is tedious and depends on predefined markers, which limits the efficiency and flexibility of bimanual robot systems. Method: Uses 3D foundation models for dense, marker-free multi-view correspondence and jointly estimates camera extrinsics, inter-arm poses, and a 3D representation of the shared workspace. Result: Bi-JCR is robust across a variety of tabletop environments and supports collision checking and semantic segmentation for downstream bimanual coordination tasks. Conclusion: Bi-JCR provides an efficient, marker-free calibration method that considerably simplifies the initialization of bimanual robot systems. Abstract: Robot manipulation, especially bimanual manipulation, often requires setting up multiple cameras on multiple robot manipulators. Before robot manipulators can generate motion or even build representations of their environments, the cameras rigidly mounted to the robot need to be calibrated. Camera calibration is a cumbersome process involving collecting a set of images, with each capturing a pre-determined marker. In this work, we introduce the Bi-Manual Joint Calibration and Representation Framework (Bi-JCR). Bi-JCR enables multiple robot manipulators, each with cameras mounted, to circumvent taking images of calibration markers. By leveraging 3D foundation models for dense, marker-free multi-view correspondence, Bi-JCR jointly estimates: (i) the extrinsic transformation from each camera to its end-effector, (ii) the inter-arm relative poses between manipulators, and (iii) a unified, scale-consistent 3D representation of the shared workspace, all from the same captured RGB image sets. The representation, jointly constructed from images captured by cameras on both manipulators, lives in a common coordinate frame and supports collision checking and semantic segmentation to facilitate downstream bimanual coordination tasks. We empirically evaluate the robustness of Bi-JCR on a variety of tabletop environments, and demonstrate its applicability on a variety of downstream tasks.

cs.SD [Back]

[329] Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC

Qingzheng Wang,Jiancheng Sun,Yifan Peng,Shinji Watanabe

Main category: cs.SD

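TL;DR: The paper improves multilingual LID and ASR through several adaptation strategies (frozen upstream training, partial fine-tuning, and low-rank adaptation), combined with data augmentation and an LID CTC loss, yielding substantial gains.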

Details

Motivation: Pre-trained speech foundation models struggle when resources for fine-tuning are limited; the paper aims to improve multilingual LID and ASR performance under this constraint. Method: Applies frozen upstream training, partial fine-tuning, and low-rank adaptation, combined with data augmentation and an LID CTC loss for regularization. Result: On ML-SUPERB 2.0, LID accuracy improves by 14% relative and ASR CER drops by 30% relative, earning second place in the challenge. Conclusion: The proposed approach performs strongly on multilingual LID and ASR, confirming the effectiveness of the strategies. Abstract: Multilingual speech processing with self-supervised or supervised pre-trained Speech Foundation Models (SFM) has achieved strong performance on tasks like Language Identification (LID) and Automatic Speech Recognition (ASR). However, these models struggle with limited resources during fine-tuning. This paper enhances multilingual LID and ASR on ML-SUPERB 2.0 by exploring multiple strategies for adapting SFMs, including frozen upstream training, partial fine-tuning, and low-rank adaptation. Furthermore, we employ data augmentation to mitigate performance gaps in few-shot settings and introduce LID Connectionist Temporal Classification (CTC) loss for regularization. Our approach achieves a 14% relative improvement in LID accuracy and a 30% relative reduction in ASR CER over the baseline on ML-SUPERB 2.0, securing second place in the Interspeech 2025 ML-SUPERB 2.0 Challenge.
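
To make the reported quantities concrete, the following sketch computes a character error rate (CER) as edit distance divided by reference length, and a relative reduction between a baseline and a new score. This is the usual textbook definition rather than the challenge's official scoring code, and the function names and the numbers in the example are hypothetical, chosen only to show what a 30% relative reduction means.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    print(cer("abcd", "abxd"))             # one substitution over four characters
    # Hypothetical CER values: 20.0 -> 14.0 is a 30% relative reduction.
    print(relative_reduction(20.0, 14.0))
```

A relative reduction is measured against the baseline value, so it is not the same as a drop of 30 absolute CER points.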

eess.AS [Back]

Nikola Ljubešić,Ivan Porupski,Peter Rupnik

Main category: eess.AS

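TL;DR: The paper studies automatic primary stress identification by fine-tuning a pre-trained Transformer model, with strong results on Croatian and related languages.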

Details

Motivation: Stress plays an important role in encoding meaning and aiding speech comprehension, but earlier studies relied mostly on traditional acoustic features and English datasets. Method: Adds an audio frame classification head to a pre-trained Transformer model and fine-tunes it on a new Croatian dataset. Result: The Transformer model outperforms a traditional SVM classifier on all test languages (Croatian, Serbian, the Chakavian dialect, and Slovenian), with near-perfect results for Croatian and Serbian. Conclusion: A few hundred multi-syllabic training words are enough for strong performance; the model and datasets are openly released. Abstract: Automating primary stress identification has been an active research field due to the role of stress in encoding meaning and aiding speech comprehension. Previous studies relied mainly on traditional acoustic features and English datasets. In this paper, we investigate the approach of fine-tuning a pre-trained transformer model with an audio frame classification head. Our experiments use a new Croatian training dataset, with test sets in Croatian, Serbian, the Chakavian dialect, and Slovenian. By comparing an SVM classifier using traditional acoustic features with the fine-tuned speech transformer, we demonstrate the transformer's superiority across the board, achieving near-perfect results for Croatian and Serbian, with a 10-point performance drop for the more distant Chakavian and Slovenian. Finally, we show that only a few hundred multi-syllabic training words suffice for strong performance. We release our datasets and model under permissive licenses.

[331] "Dyadosyncrasy", Idiosyncrasy and Demographic Factors in Turn-Taking

Julio Cesar Cavalcanti,Gabriel Skantze

Main category: eess.AS

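TL;DR: The study examines how demographic and individual factors shape turn-taking in conversation and finds that individual differences and the relationship between the two speakers affect turn-taking timing more than demographic factors such as sex and age.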

Details

Motivation: To understand the relative influence of demographic and individual factors on turn-taking and to identify what determines turn-taking timing. Method: Analyzes Transition Floor Offset (TFO) in a dataset of US English conversations (Fisher), examining the effects of sex, age, education, and other factors. Result: Sex and age have small but significant effects on TFO, while education has none; individual differences and the dyadic relationship have a larger effect. Conclusion: The dyadic relationship and joint activity are the strongest determinants of TFO, outweighing demographic influences. Abstract: Turn-taking in dialogue follows universal constraints but also varies significantly. This study examines how demographic (sex, age, education) and individual factors shape turn-taking using a large dataset of US English conversations (Fisher). We analyze Transition Floor Offset (TFO) and find notable interspeaker variation. Sex and age have small but significant effects: female speakers and older individuals exhibit slightly shorter offsets, while education shows no effect. Lighter topics correlate with shorter TFOs. However, individual differences have a greater impact, driven by a strong idiosyncratic and an even stronger "dyadosyncratic" component: speakers in a dyad resemble each other more than they resemble themselves in different dyads. This suggests that the dyadic relationship and joint activity are the strongest determinants of TFO, outweighing demographic influences.

eess.IV [Back]

[332] Parameter-Free Bio-Inspired Channel Attention for Enhanced Cardiac MRI Reconstruction

Anam Hashmi,Julia Dietlmeier,Kathleen M. Curran,Noel E. O'Connor

Main category: eess.IV

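TL;DR: Proposes a non-linear attention architecture based on ecological principles for cardiac MRI reconstruction; its parameter-free design outperforms existing parameter-free methods.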

Details

Motivation: Existing attention modules lack a solid theoretical basis, and ecological principles may offer guidance for designing efficient attention mechanisms. Method: Studies a non-linear ecological difference equation for single-species population growth and uses it to design a parameter-free attention module. Result: The proposed attention module outperforms current state-of-the-art parameter-free methods. Conclusion: Ecological principles can guide the design of attention mechanisms and improve model performance and interpretability. Abstract: Attention is a fundamental component of the human visual recognition system. The inclusion of attention in a convolutional neural network amplifies relevant visual features and suppresses the less important ones. Integrating attention mechanisms into convolutional neural networks enhances model performance and interpretability. Spatial and channel attention mechanisms have shown significant advantages across many downstream tasks in medical imaging. While existing attention modules have proven to be effective, their design often lacks a robust theoretical underpinning. In this study, we address this gap by proposing a non-linear attention architecture for cardiac MRI reconstruction and hypothesize that insights from ecological principles can guide the development of effective and efficient attention mechanisms. Specifically, we investigate a non-linear ecological difference equation that describes single-species population growth to devise a parameter-free attention module surpassing current state-of-the-art parameter-free methods.

[333] Estimating Head Motion in Structural MRI Using a Deep Neural Network Trained on Synthetic Artifacts

Charles Bricout,Samira Ebrahimi Kahou,Sylvain Bouix

Main category: eess.IV

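TL;DR: The paper proposes an automated method based on a 3D convolutional neural network for assessing motion artifacts in MRI, without relying on specialized hardware or noisy training data.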

Details

Motivation: Motion artifacts in MRI bias neuroanatomical metrics, and existing methods depend on specialized hardware or noisy data. Method: Trains a 3D convolutional neural network on synthetically corrupted MRI volumes to predict motion severity. Result: Validated on independent datasets, with an R² of 0.65 against manual labels; the predicted motion also correlates with age. Conclusion: The method generalizes across scanner brands and protocols, providing an objective, scalable motion assessment for structural MRI studies. Abstract: Motion-related artifacts are inevitable in Magnetic Resonance Imaging (MRI) and can bias automated neuroanatomical metrics such as cortical thickness. Manual review cannot objectively quantify motion in anatomical scans, and existing automated approaches often require specialized hardware or rely on unbalanced noisy training data. Here, we train a 3D convolutional neural network to estimate motion severity using only synthetically corrupted volumes. We validate our method with one held-out site from our training cohort and with 14 fully independent datasets, including one with manual ratings, achieving a representative $R^2 = 0.65$ versus manual labels and significant thickness-motion correlations in 12/15 datasets. Furthermore, our predicted motion correlates with subject age in line with prior studies. Our approach generalizes across scanner brands and protocols, enabling objective, scalable motion assessment in structural MRI studies without prospective motion correction.

[334] Sparsity-Driven Parallel Imaging Consistency for Improved Self-Supervised MRI Reconstruction

Yaşar Utku Alçalar,Mehmet Akçakaya

Main category: eess.IV

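TL;DR: The paper proposes a new way to train physics-driven deep learning (PD-DL) networks with carefully designed perturbations, addressing the artifacts that self-supervised learning introduces at high acceleration rates.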

Details

Motivation: In highly accelerated MRI scans, self-supervised learning often introduces artifacts that degrade image quality. Method: Augments the k-space masking of self-supervised learning with a consistency term that assesses the model's ability to predict perturbations in a sparse domain. Result: On the fastMRI knee and brain datasets, the method reduces artifacts and noise amplification and outperforms existing self-supervised methods both visually and quantitatively. Conclusion: The proposed training strategy yields more reliable, artifact-free reconstructions at high acceleration rates. Abstract: Physics-driven deep learning (PD-DL) models have proven to be a powerful approach for improved reconstruction of rapid MRI scans. In order to train these models in scenarios where fully-sampled reference data is unavailable, self-supervised learning has gained prominence. However, its application at high acceleration rates frequently introduces artifacts, compromising image fidelity. To mitigate this shortcoming, we propose a novel way to train PD-DL networks via carefully-designed perturbations. In particular, we enhance the k-space masking idea of conventional self-supervised learning with a novel consistency term that assesses the model's ability to accurately predict the added perturbations in a sparse domain, leading to more reliable and artifact-free reconstructions. The results obtained from the fastMRI knee and brain datasets show that the proposed training strategy effectively reduces aliasing artifacts and mitigates noise amplification at high acceleration rates, outperforming state-of-the-art self-supervised methods both visually and quantitatively.

[335] Beyond the LUMIR challenge: The pathway to foundational registration models

Junyu Chen,Shuwen Wei,Joel Honkamaa,Pekka Marttinen,Hang Zhang,Min Liu,Yichao Zhou,Zuopeng Tan,Zhuoyuan Wang,Yi Wang,Hongchao Zhou,Shunbo Hu,Yi Zhang,Qian Tao,Lukas Förner,Thomas Wendler,Bailiang Jian,Benedikt Wiestler,Tim Hable,Jin Kim,Dan Ruan,Frederic Madesta,Thilo Sentker,Wiebke Heyer,Lianrui Zuo,Yuwei Dai,Jing Wu,Jerry L. Prince,Harrison Bai,Yong Du,Yihao Liu,Alessa Hering,Reuben Dorent,Lasse Hansen,Mattias P. Heinrich,Aaron Carass

Main category: eess.IV

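TL;DR: The LUMIR challenge is a benchmark for unsupervised brain MRI registration that aims to assess and advance registration methods through self-supervised learning, without relying on label maps.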

Details

Motivation: To advance unsupervised brain MRI registration, reduce the dependence on label maps, and encourage biologically plausible deformation modeling through self-supervision. Method: Provides over 4,000 preprocessed T1-weighted brain MRIs for training and introduces zero-shot generalization tasks covering multiple imaging modalities, disease populations, acquisition protocols, and species. Result: Deep learning methods perform strongly on both in-domain and zero-shot tasks, produce anatomically plausible deformation fields, and outperform optimization-based methods. Conclusion: The LUMIR challenge demonstrates the potential of deep learning for unsupervised brain MRI registration, although performance drops on out-of-domain contrasts. Abstract: Medical image challenges have played a transformative role in advancing the field, catalyzing algorithmic innovation and establishing new performance standards across diverse clinical applications. Image registration, a foundational task in neuroimaging pipelines, has similarly benefited from the Learn2Reg initiative. Building on this foundation, we introduce the Large-scale Unsupervised Brain MRI Image Registration (LUMIR) challenge, a next-generation benchmark designed to assess and advance unsupervised brain MRI registration. Distinct from prior challenges that leveraged anatomical label maps for supervision, LUMIR removes this dependency by providing over 4,000 preprocessed T1-weighted brain MRIs for training without any label maps, encouraging biologically plausible deformation modeling through self-supervision. In addition to evaluating performance on 590 held-out test subjects, LUMIR introduces a rigorous suite of zero-shot generalization tasks, spanning out-of-domain imaging modalities (e.g., FLAIR, T2-weighted, T2*-weighted), disease populations (e.g., Alzheimer's disease), acquisition protocols (e.g., 9.4T MRI), and species (e.g., macaque brains). A total of 1,158 subjects and over 4,000 image pairs were included for evaluation. Performance was assessed using both segmentation-based metrics (Dice coefficient, 95th percentile Hausdorff distance) and landmark-based registration accuracy (target registration error). Across both in-domain and zero-shot tasks, deep learning-based methods consistently achieved state-of-the-art accuracy while producing anatomically plausible deformation fields. The top-performing deep learning-based models demonstrated diffeomorphic properties and inverse consistency, outperforming several leading optimization-based methods, and showing strong robustness to most domain shifts, the exception being a drop in performance on out-of-domain contrasts.
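
Two of the evaluation metrics named in the abstract, the Dice coefficient and target registration error (TRE), can be sketched with their standard definitions as below. This is not the challenge's evaluation code: the function names, the flat binary-mask representation, and the toy landmarks are assumptions made for illustration, and the 95th percentile Hausdorff distance is not covered.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def dice_coefficient(mask_a: Sequence[int], mask_b: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flattened binary masks of equal length."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks are treated as a perfect match by convention here.
    return 1.0 if total == 0 else 2.0 * intersection / total


def target_registration_error(warped: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean Euclidean distance between corresponding warped and target landmarks."""
    if len(warped) != len(target) or not target:
        raise ValueError("need equally sized, non-empty landmark lists")
    return sum(math.dist(p, q) for p, q in zip(warped, target)) / len(target)


if __name__ == "__main__":
    # Toy masks: the warped label overlaps the fixed label in one voxel.
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
    # Toy landmarks in arbitrary units: one is 5 units off, one matches exactly.
    print(target_registration_error([(0, 0, 0), (1, 1, 1)], [(3, 4, 0), (1, 1, 1)]))
```

Higher Dice and lower TRE both indicate better alignment, so the two are usually reported side by side.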

[336] A Novel Coronary Artery Registration Method Based on Super-pixel Particle Swarm Optimization

Peng Qi,Wenxi Qu,Tianliang Yao,Haonan Ma,Dylan Wintle,Yinyi Lai,Giorgos Papanastasiou,Chengjia Wang

Main category: eess.IV

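TL;DR: Proposes a multimodal coronary artery image registration method based on a swarm optimization algorithm, which markedly improves the registration accuracy of XRA and CTA images for PCI procedures.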

Details

Motivation: PCI requires real-time XRA guidance, while CTA provides more precise 3D vascular information, so an accurate multimodal registration method is needed to combine the strengths of both. Method: Uses a preprocessing module and a registration module based on the Steger and Superpixel Particle Swarm Optimization algorithms to handle deformation, low contrast, and noise in 2D-3D image registration. Result: Experiments on 28 pairs of XRA and CTA images show that the method outperforms existing techniques in registration accuracy, robustness, and efficiency. Conclusion: The method could improve the clinical effectiveness of PCI and may lead to better outcomes for patients with coronary artery disease. Abstract: Percutaneous Coronary Intervention (PCI) is a minimally invasive procedure that improves coronary blood flow and treats coronary artery disease. Although PCI typically requires 2D X-ray angiography (XRA) to guide catheter placement in real time, computed tomography angiography (CTA) may substantially improve PCI by providing precise information on 3D vascular anatomy and status. To leverage real-time XRA and detailed 3D CTA anatomy for PCI, accurate multimodal image registration of XRA and CTA is required to guide the procedure and avoid complications. This is a challenging process as it requires registration of images from different geometrical modalities (2D -> 3D and vice versa), with variations in contrast and noise levels. In this paper, we propose a novel multimodal coronary artery image registration method based on a swarm optimization algorithm, which effectively addresses challenges such as large deformations, low contrast, and noise across these imaging modalities. Our algorithm consists of two main modules: 1) preprocessing of XRA and CTA images separately, and 2) a registration module based on feature extraction using the Steger and Superpixel Particle Swarm Optimization algorithms. Our technique was evaluated on a pilot dataset of 28 pairs of XRA and CTA images from 10 patients who underwent PCI. The algorithm was compared with four state-of-the-art (SOTA) methods in terms of registration accuracy, robustness, and efficiency. Our method outperformed the selected SOTA baselines in all aspects. Experimental results demonstrate the effectiveness of our algorithm, which surpasses the previous benchmarks, and suggest a novel clinical approach that may improve patient outcomes in coronary artery disease.

[337] Efficient RAW Image Deblurring with Adaptive Frequency Modulation

Wenlong Jiao,Binglong Li,Wei Shang,Ping Wang,Dongwei Ren

Main category: eess.IV

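TL;DR: FrENet is a frequency-enhanced network designed for RAW image deblurring; with an Adaptive Frequency Positional Modulation module and frequency-domain skip connections, it substantially improves deblurring quality while remaining computationally efficient.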

Details

Motivation: Although RAW images have greater restoration potential, existing deep learning methods focus mainly on sRGB images, leaving RAW image deblurring underexplored. FrENet targets the challenges of handling frequency-dependent blur while keeping computation efficient. Method: FrENet operates directly in the frequency domain, introduces an Adaptive Frequency Positional Modulation module that dynamically adjusts frequency components, and adds frequency-domain skip connections to preserve high-frequency details. Result: Experiments show that FrENet outperforms existing methods on RAW image deblurring, with higher restoration quality and better computational efficiency, and that it extends to sRGB images. Conclusion: FrENet provides an efficient, high-quality solution for RAW image deblurring and shows potential for extension to sRGB images. Abstract: Image deblurring plays a crucial role in enhancing visual clarity across various applications. Although most deep learning approaches primarily focus on sRGB images, which inherently lose critical information during the image signal processing pipeline, RAW images, being unprocessed and linear, possess superior restoration potential but remain underexplored. Deblurring RAW images presents unique challenges, particularly in handling frequency-dependent blur while maintaining computational efficiency. To address these issues, we propose Frequency Enhanced Network (FrENet), a framework specifically designed for RAW-to-RAW deblurring that operates directly in the frequency domain. We introduce a novel Adaptive Frequency Positional Modulation module, which dynamically adjusts frequency components according to their spectral positions, thereby enabling precise control over the deblurring process. Additionally, frequency domain skip connections are adopted to further preserve high-frequency details. Experimental results demonstrate that FrENet surpasses state-of-the-art deblurring methods in RAW image deblurring, achieving significantly better restoration quality while maintaining high efficiency in terms of reduced MACs. Furthermore, FrENet's adaptability enables it to be extended to sRGB images, where it delivers comparable or superior performance compared to methods specifically designed for sRGB data. The code will be available at https://github.com/WenlongJiao/FrENet .

[338] pyMEAL: A Multi-Encoder Augmentation-Aware Learning for Robust and Generalizable Medical Image Translation

Abdul-mojeed Olabisi Ilyas,Adeleke Maradesa,Jamal Banzi,Jianpan Huang,Henry K. F. Mak,Kannie W. Y. Chan

Main category: eess.IV

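TL;DR: The paper proposes MEAL, a multi-encoder augmentation-aware learning framework that addresses data scarcity and protocol variation in 3D medical imaging; using four distinct augmentation variants and three fusion strategies, it substantially improves performance on an image translation task.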

Details

Motivation: AI-driven medical imaging in the clinic is challenged by patient variability, image artifacts, and limited model generalization, and traditional augmentation cannot handle these issues effectively. Method: MEAL processes four augmentation variants through dedicated encoders and combines them with three fusion strategies (CC, FL, BD) to build multi-encoder models that preserve augmentation-aware features. Result: On CT-to-T1-weighted MRI translation, MEAL-BD performs best on both unseen and predefined test data, with higher PSNR and SSIM scores than competing methods. Conclusion: By treating augmentation as a source of diverse features, MEAL supports robust, protocol-invariant learning and offers a new direction for clinically reliable medical imaging solutions. Abstract: Medical imaging is critical for diagnostics, but clinical adoption of advanced AI-driven imaging faces challenges due to patient variability, image artifacts, and limited model generalization. While deep learning has transformed image analysis, 3D medical imaging still suffers from data scarcity and inconsistencies due to acquisition protocols, scanner differences, and patient motion. Traditional augmentation uses a single pipeline for all transformations, disregarding the unique traits of each augmentation and struggling with large data volumes. To address these challenges, we propose a Multi-encoder Augmentation-Aware Learning (MEAL) framework that leverages four distinct augmentation variants processed through dedicated encoders. Three fusion strategies, namely concatenation (CC), fusion layer (FL), and adaptive controller block (BD), are integrated to build multi-encoder models that combine augmentation-specific features before decoding. MEAL-BD uniquely preserves augmentation-aware representations, enabling robust, protocol-invariant feature learning. As demonstrated in a Computed Tomography (CT)-to-T1-weighted Magnetic Resonance Imaging (MRI) translation study, MEAL-BD consistently achieved the best performance on both unseen and predefined test data. On both geometric transformations (like rotations and flips) and non-augmented inputs, MEAL-BD outperformed other competing methods, achieving higher mean peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) scores. These results establish MEAL as a reliable framework for preserving structural fidelity and generalizing across clinically relevant variability. By reframing augmentation as a source of diverse, generalizable features, MEAL supports robust, protocol-invariant learning, advancing clinically reliable medical imaging solutions.
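
For readers unfamiliar with the PSNR metric reported in the abstract, the sketch below uses the standard definition, 10 * log10(MAX^2 / MSE), on flattened intensity lists. The function name `psnr`, the `max_value` parameter, and the toy intensities are assumptions for illustration and do not come from the paper's code; SSIM is not shown.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized intensity lists.

    `max_value` is the largest possible intensity (1.0 for images scaled to [0, 1]).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    target_voxels = [0.0, 1.0, 0.5, 0.25]       # toy reference intensities
    translated_voxels = [0.1, 0.9, 0.6, 0.15]   # toy prediction, each off by 0.1
    print(round(psnr(target_voxels, translated_voxels), 2))
```

Because PSNR depends on the assumed intensity range, scores are only comparable when the same `max_value` and normalization are used for every method.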

[339] Model-Guided Network with Cluster-Based Operators for Spatio-Spectral Super-Resolution

Ivan Pereira-Sánchez,Julia Navarro,Ana Belén Petro,Joan Duran

Main category: eess.IV

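TL;DR: Proposes an end-to-end model-driven framework that decomposes joint spatio-spectral super-resolution into spatial super-resolution, spectral super-resolution, and fusion tasks, with learnable modules optimizing each sub-task.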

Details

Motivation: To reconstruct a high-resolution hyperspectral image from a low-resolution multispectral observation and fill the gap in research on joint spatio-spectral super-resolution. Method: Decomposes the problem into spatial super-resolution, spectral super-resolution, and fusion tasks, unfolds a variational approach for each, and replaces the classical operators with learnable modules. The design includes an upsampling operator based on back-projection algorithms, cluster-based spectral reconstruction operators, and a fusion module. Result: Extensive evaluations on several datasets and sampling factors demonstrate the effectiveness of the method. Conclusion: The proposed framework performs well on joint spatio-spectral super-resolution, and the code will be released. Abstract: This paper addresses the problem of reconstructing a high-resolution hyperspectral image from a low-resolution multispectral observation. While spatial super-resolution and spectral super-resolution have been extensively studied, joint spatio-spectral super-resolution remains relatively unexplored. We propose an end-to-end model-driven framework that explicitly decomposes the joint spatio-spectral super-resolution problem into spatial super-resolution, spectral super-resolution and fusion tasks. Each sub-task is addressed by unfolding a variational-based approach, where the operators involved in the proximal gradient iterative scheme are replaced with tailored learnable modules. In particular, we design an upsampling operator for spatial super-resolution based on classical back-projection algorithms, adapted to handle arbitrary scaling factors. Spectral reconstruction is performed using learnable cluster-based upsampling and downsampling operators. For image fusion, we integrate low-frequency estimation and high-frequency injection modules to combine the spatial and spectral information from spatial super-resolution and spectral super-resolution outputs. Additionally, we introduce an efficient nonlocal post-processing step that leverages image self-similarity by combining a multi-head attention mechanism with residual connections. Extensive evaluations on several datasets and sampling factors demonstrate the effectiveness of our approach. The source code will be available at https://github.com/TAMI-UIB/JSSUNet

[340] TumorGen: Boundary-Aware Tumor-Mask Synthesis with Rectified Flow Matching

Shengyuan Liu,Wenting Chen,Boyun Zheng,Wentao Pan,Xiang Li,Yixuan Yuan

Main category: eess.IV

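TL;DR: TumorGen is an efficient 3D tumor synthesis method that uses boundary-aware masks and rectified flow matching to address the limited diversity and computational inefficiency of existing methods.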

Details

Motivation: Existing tumor data synthesis methods suffer from limited diversity, low computational efficiency, and unnatural boundary transitions; TumorGen aims to address these problems. Method: TumorGen combines Boundary-Aware Pseudo Mask Generation, a Spatial-Constraint Vector Field Estimator, and a VAE-guided mask refiner, using rectified flow matching for efficient synthesis. Result: Experiments show that TumorGen outperforms existing methods in computational efficiency and realism while requiring fewer sampling steps. Conclusion: TumorGen provides an efficient and realistic tumor synthesis solution for AI-driven cancer diagnostics. Abstract: Tumor data synthesis offers a promising solution to the shortage of annotated medical datasets. However, current approaches either limit tumor diversity by using predefined masks or employ computationally expensive two-stage processes with multiple denoising steps, causing computational inefficiency. Additionally, these methods typically rely on binary masks that fail to capture the gradual transitions characteristic of tumor boundaries. We present TumorGen, a novel boundary-aware tumor-mask synthesis method with rectified flow matching for efficient 3D tumor synthesis, with three key components: a Boundary-Aware Pseudo Mask Generation module that replaces strict binary masks with flexible bounding boxes; a Spatial-Constraint Vector Field Estimator that simultaneously synthesizes tumor latents and masks using rectified flow matching to ensure computational efficiency; and a VAE-guided mask refiner that enhances boundary realism. TumorGen significantly improves computational efficiency by requiring fewer sampling steps while maintaining pathological accuracy through coarse and fine-grained spatial constraints. Experimental results demonstrate TumorGen's superior performance over existing tumor synthesis methods in both efficiency and realism, offering a valuable contribution to AI-driven cancer diagnostics.

[341] Contrast-Invariant Self-supervised Segmentation for Quantitative Placental MRI

Xinliu Zhong,Ruiying Liu,Emily S. Nichols,Xuzhe Zhang,Andrew F. Laine,Emma G. Duerden,Yun Wang

Main category: eess.IV

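TL;DR: Proposes a placental segmentation framework based on multi-echo T2*-weighted MRI that uses self-supervised learning and unsupervised domain adaptation to address weak boundary contrast, missing annotations, and motion artifacts.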

Details

Motivation: Placental segmentation in T2*-weighted imaging is hampered by weak boundary contrast, missing annotations, and motion artifacts, so multi-echo information is needed to improve robustness. Method: Combines masked autoencoding (MAE) self-supervised pretraining, masked pseudo-labeling (MPL) unsupervised domain adaptation, and global-local collaboration, and introduces a semantic matching loss. Result: On a clinical multi-echo placental MRI dataset, the approach outperforms single-echo and naive fusion baselines and generalizes effectively across echo times. Conclusion: To the authors' knowledge, this is the first work to systematically exploit multi-echo T2*-weighted MRI for placental segmentation, offering a new approach for quantitative analysis. Abstract: Accurate placental segmentation is essential for quantitative analysis of the placenta. However, this task is particularly challenging in T2*-weighted placental imaging due to: (1) weak and inconsistent boundary contrast across individual echoes; (2) the absence of manual ground truth annotations for all echo times; and (3) motion artifacts across echoes caused by fetal and maternal movement. In this work, we propose a contrast-augmented segmentation framework that leverages complementary information across multi-echo T2*-weighted MRI to learn robust, contrast-invariant representations. Our method integrates: (i) masked autoencoding (MAE) for self-supervised pretraining on unlabeled multi-echo slices; (ii) masked pseudo-labeling (MPL) for unsupervised domain adaptation across echo times; and (iii) global-local collaboration to align fine-grained features with global anatomical context. We further introduce a semantic matching loss to encourage representation consistency across echoes of the same subject. Experiments on a clinical multi-echo placental MRI dataset demonstrate that our approach generalizes effectively across echo times and outperforms both single-echo and naive fusion baselines. To our knowledge, this is the first work to systematically exploit multi-echo T2*-weighted MRI for placental segmentation.

[342] Beyond Pretty Pictures: Combined Single- and Multi-Image Super-resolution for Sentinel-2 Images

Aditya Retnanto,Son Le,Sebastian Mueller,Armin Leitner,Konrad Schindler,Yohan Iddawela,Michael Riffler

Main category: eess.IV

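TL;DR: SEN4X is a hybrid super-resolution architecture that combines single-image and multi-image techniques to upgrade Sentinel-2 imagery to 2.5 m resolution, significantly outperforming existing methods.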

Details

Motivation: Sentinel-2 satellite imagery is too coarse to capture small-scale features such as houses and streets, so super-resolution is needed to increase its resolution. Method: Designs SEN4X, a hybrid super-resolution architecture that combines repeated Sentinel-2 observations with high-resolution Pléiades Neo data. Result: In urban land-cover classification tests in Hanoi, Vietnam, SEN4X significantly outperforms existing super-resolution baselines. Conclusion: SEN4X effectively increases the resolution of Sentinel-2 imagery and is suited to the analysis of small-scale features. Abstract: Super-resolution aims to increase the resolution of satellite images by reconstructing high-frequency details, which go beyond naïve upsampling. This has particular relevance for Earth observation missions like Sentinel-2, which offer frequent, regular coverage at no cost, but at coarse resolution. Its pixel footprint is too large to capture small features like houses, streets, or hedge rows. To address this, we present SEN4X, a hybrid super-resolution architecture that combines the advantages of single-image and multi-image techniques. It combines temporal oversampling from repeated Sentinel-2 acquisitions with a learned prior from high-resolution Pléiades Neo data. In doing so, SEN4X upgrades Sentinel-2 imagery to 2.5 m ground sampling distance. We test the super-resolved images on urban land-cover classification in Hanoi, Vietnam. We find that they lead to a significant performance improvement over state-of-the-art super-resolution baselines.

cs.HC

[343] Redefining Research Crowdsourcing: Incorporating Human Feedback with LLM-Powered Digital Twins

Amanda Chan,Catherine Di,Joseph Rupertus,Gary Smith,Varun Nagaraj Rao,Manoel Horta Ribeiro,Andrés Monroy-Hernández

Main category: cs.HC

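TL;DR: The paper proposes a hybrid framework based on digital twins to address the effects of generative AI tools on data validity and worker roles on crowdsourcing platforms.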

Details

Motivation: Workers' use of generative AI tools on crowdsourcing platforms compromises data validity and diminishes workers' roles, so a solution is needed. Method: Proposes a digital twin framework in which personalized AI models emulate workers' behaviors while keeping humans in the loop. The system is evaluated with an experiment (n=88 crowd workers) and interviews (n=5 workers, n=4 researchers). Result: Digital twins may enhance productivity and reduce decision fatigue while maintaining response quality. Transparency, ethical data use, and worker agency are key. Conclusion: By automating repetitive tasks and preserving human engagement for nuanced ones, digital twins may help balance scalability with authenticity. Abstract: Crowd work platforms like Amazon Mechanical Turk and Prolific are vital for research, yet workers' growing use of generative AI tools poses challenges. Researchers face compromised data validity as AI responses replace authentic human behavior, while workers risk diminished roles as AI automates tasks. To address this, we propose a hybrid framework using digital twins, personalized AI models that emulate workers' behaviors and preferences while keeping humans in the loop. We evaluate our system with an experiment (n=88 crowd workers) and in-depth interviews with crowd workers (n=5) and social science researchers (n=4). Our results suggest that digital twins may enhance productivity and reduce decision fatigue while maintaining response quality. Both researchers and workers emphasized the importance of transparency, ethical data use, and worker agency. By automating repetitive tasks and preserving human engagement for nuanced ones, digital twins may help balance scalability with authenticity.

[344] WikiGap: Promoting Epistemic Equity by Surfacing Knowledge Gaps Between English Wikipedia and other Language Editions

Zining Wang,Yuxuan Zhang,Dongwook Yoon,Nicholas Vincent,Farhan Samir,Vered Shwartz

Main category: cs.HC

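TL;DR: WikiGap combines a multilingual information-gap discovery method with user-centered design to surface complementary information from other language editions within the English Wikipedia interface, significantly improving fact-finding accuracy and efficiency.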

Details

Motivation: English Wikipedia dominates global knowledge access, while the complementary information in other language editions, rooted in their cultures and media environments, is overlooked. WikiGap aims to address this. Method: Combines a multilingual information-gap discovery method with user-centered design to extract complementary information from French, Russian, and Chinese Wikipedia. Result: In a mixed-methods study, WikiGap significantly improved fact-finding accuracy, reduced task time, and received a higher usability score. Conclusion: WikiGap raises users' awareness of complementary information in other language editions and promotes epistemic equity across language editions. Abstract: With more than 11 times as many pageviews as the next, English Wikipedia dominates global knowledge access relative to other language editions. Readers are prone to assuming English Wikipedia as a superset of all language editions, leading many to prefer it even when their primary language is not English. Other language editions, however, comprise complementary facts rooted in their respective cultures and media environments, which are marginalized in English Wikipedia. While Wikipedia's user interface enables switching between language editions through its Interlanguage Link (ILL) system, it does not reveal to readers that other language editions contain valuable, complementary information. We present WikiGap, a system that surfaces complementary facts sourced from other Wikipedias within the English Wikipedia interface. Specifically, by combining a recent multilingual information-gap discovery method with a user-centered design, WikiGap enables access to complementary information from French, Russian, and Chinese Wikipedia. In a mixed-methods study (n=21), WikiGap significantly improved fact-finding accuracy, reduced task time, and received a 32-point higher usability score relative to Wikipedia's current ILL-based navigation system. Participants reported increased awareness of the availability of complementary information in non-English editions and reconsidered the completeness of English Wikipedia. WikiGap thus paves the way for improved epistemic equity across language editions.

Table of Contents

cs.CV

[1] BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

[2] Generating Fit Check Videos with a Handheld Camera

[3] Cora: Correspondence-aware image editing using few step diffusion

[4] Representational Difference Explanations

[5] ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

[6] Point-MoE: Towards Cross-Domain Generalization in 3D Semantic Segmentation via Mixture-of-Experts

[7] Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review

[8] MangoLeafViT: Leveraging Lightweight Vision Transformer with Runtime Augmentation for Efficient Mango Leaf Disease Classification

[9] VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL

[10] DeepTopoNet: A Framework for Subglacial Topography Estimation on the Greenland Ice Sheets

[11] DGIQA: Depth-guided Feature Attention and Refinement for Generalizable Image Quality Assessment

[12] Preemptive Hallucination Reduction: An Input-Level Approach for Multimodal Language Model

[13] Multi-Group Proportional Representation for Text-to-Image Models

[14] DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

[15] MaskAdapt: Unsupervised Geometry-Aware Domain Adaptation Using Multimodal Contextual Learning and RGB-Depth Masking

[16] SIM: A mapping framework for built environment auditing based on street view imagery

[17] ComposeAnything: Composite Object Priors for Text-to-Image Generation

[18] Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors

[19] Federated Foundation Model for GI Endoscopy Images

[20] CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs

[21] S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation

[22] The Butterfly Effect in Pathology: Exploring Security in Pathology Foundation Models

[23] Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction

[24] Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders

[25] Training-free zero-shot 3D symmetry detection with visual features back-projected to geometry

[26] Pretraining Deformable Image Registration Networks with Random Images

[27] DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?

[28] Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT

[29] Boosting All-in-One Image Restoration via Self-Improved Privilege Learning

[30] STORK: Improving the Fidelity of Mid-NFE Sampling for Diffusion and Flow Matching Models

[31] Benchmarking Foundation Models for Zero-Shot Biometric Tasks

[32] Shuffle PatchMix Augmentation with Confidence-Margin Weighted Pseudo-Labels for Enhanced Source-Free Domain Adaptation

[33] Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin

[34] Reasoning Can Hurt the Inductive Abilities of Large Language Models

[35] Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models

[36] From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models

[37] MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM

[38] LTM3D: Bridging Token Spaces for Conditional 3D Generation with Auto-Regressive Diffusion Framework

[39] 50 Years of Automated Face Recognition

[40] Harnessing Foundation Models for Robust and Generalizable 6-DOF Bronchoscopy Localization

[41] Interactive Video Generation via Domain Adaptation

[42] Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames

[43] LLM-powered Query Expansion for Enhancing Boundary Prediction in Language-driven Action Localization

[44] EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding

[45] Category-aware EEG image generation based on wavelet transform and contrast semantic loss

[46] Progressive Class-level Distillation

[47] InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing

[48] STAR-Net: An Interpretable Model-Aided Network for Remote Sensing Image Denoising

[49] DisTime: Distribution-based Time Representation for Video Large Language Models

[50] KairosAD: A SAM-Based Model for Industrial Anomaly Detection on Embedded Devices

[51] GeoVision Labeler: Zero-Shot Geospatial Classification with Vision and Language Models

[52] KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval

[53] VUDG: A Dataset for Video Understanding Domain Generalization

[54] Revisiting Cross-Modal Knowledge Distillation: A Disentanglement Approach for RGBD Semantic Segmentation

[55] Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering

[56] D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding

[57] Spatiotemporal Analysis of Forest Machine Operations Using 3D Video Classification

[58] SASP: Strip-Aware Spatial Perception for Fine-Grained Bird Image Classification

[59] Leadership Assessment in Pediatric Intensive Care Unit Team Training

[60] S3CE-Net: Spike-guided Spatiotemporal Semantic Coupling and Expansion Network for Long Sequence Event Re-Identification

[61] Leveraging Intermediate Features of Vision Transformer for Face Anti-Spoofing

[62] PCIE_Interaction Solution for Ego4D Social Interaction Challenge

[63] IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models

[64] PCIE_Pose Solution for EgoExo4D Pose and Proficiency Estimation Challenge

[65] EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

[66] Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation

[67] SORCE: Small Object Retrieval in Complex Environments

[68] Diversify and Conquer: Open-set Disagreement for Robust Semi-supervised Learning with Outliers

[69] SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking

[70] SPPSFormer: High-quality Superpoint-based Transformer for Roof Plane Instance Segmentation from Point Clouds

[71] Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model

[72] ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation

[73] Deformable Attention Mechanisms Applied to Object Detection, case of Remote Sensing

[74] Reason-SVG: Hybrid Reward RL for Aha-Moments in Vector Graphics Generation

[75] un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

[76] AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders

[77] UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation

[78] Optimal Density Functions for Weighted Convolution in Learning Models