cs.CV [Total: 116]
cs.GR [Total: 3]
cs.CL [Total: 110]
cond-mat.stat-mech [Total: 1]
cs.RO [Total: 3]
physics.med-ph [Total: 2]
eess.SY [Total: 1]
eess.IV [Total: 21]
physics.optics [Total: 1]
cs.DL [Total: 2]
eess.AS [Total: 2]
cs.LG [Total: 14]
cs.MM [Total: 1]
cs.IR [Total: 1]
cs.SE [Total: 2]
cs.HC [Total: 1]
cs.AI [Total: 4]

cs.CV

[1] Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection

Damith Chamalke Senadeera,Xiaoyun Yang,Dimitrios Kollias,Gregory Slabaugh

Main category: cs.CV

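TL;DR: Proposes Dual Branch VideoMamba with GCTF, an efficient architecture that combines a dual-branch design with a state-space model for video violence detection, and achieves state-of-the-art performance on a new benchmark.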

Details

Motivation: The rapid spread of surveillance cameras has increased demand for automated violence detection, but existing methods fall short on long-term dependencies and computational efficiency. Method: A dual-branch design captures spatial and temporal features separately and fuses them continuously through a gating mechanism, with a state-space model backbone for efficiency. Result: The model achieves state-of-the-art performance on a newly merged benchmark dataset, balancing accuracy and computational efficiency. Conclusion: State-space models show promise for scalable, real-time violence detection in surveillance. Abstract: The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture combining a dual-branch design and a state-space model (SSM) backbone where one branch captures spatial features, while the other focuses on temporal dynamics, with continuous fusion via a gating mechanism. We also present a new benchmark by merging RWF-2000, RLVS, and VioPeru datasets in video violence detection, ensuring strict separation between training and testing sets. Our model achieves state-of-the-art performance on this benchmark, offering an optimal balance between accuracy and computational efficiency and demonstrating the promise of SSMs for scalable, real-time surveillance violence detection.
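
The gated fusion of the two branches' class tokens can be pictured with a small sketch. The abstract does not give the gate's exact form, so everything here is an assumption for illustration: the function name `gated_fusion`, the per-dimension sigmoid gate computed from the concatenated tokens, and the convex blend of the spatial and temporal tokens.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(spatial_cls, temporal_cls, gate_weights, gate_bias):
    """Blend two class-token vectors with a per-dimension sigmoid gate.

    Hypothetical form: gate g_i = sigmoid(w_i . concat(s, t) + b_i),
    fused_i = g_i * s_i + (1 - g_i) * t_i. The paper's actual gate may differ.
    """
    concat = list(spatial_cls) + list(temporal_cls)
    fused = []
    for i, (s, t) in enumerate(zip(spatial_cls, temporal_cls)):
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)
        fused.append(g * s + (1.0 - g) * t)
    return fused


if __name__ == "__main__":
    spatial = [1.0, 2.0]
    temporal = [3.0, 4.0]
    zero_weights = [[0.0] * 4, [0.0] * 4]
    # With zero weights and bias the gate is 0.5, so the result is the mean.
    print(gated_fusion(spatial, temporal, zero_weights, [0.0, 0.0]))
```

With a strongly positive bias the gate saturates near 1 and the fused token follows the spatial branch; a strongly negative bias favours the temporal branch.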

[2] Farm-LightSeek: An Edge-centric Multimodal Agricultural IoT Data Analytics Framework with Lightweight LLMs

Dawen Jiang,Zhishu Shen,Qiushi Zheng,Tiehua Zhang,Wei Xiang,Jiong Jin

Main category: cs.CV

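TL;DR: Farm-LightSeek is an agricultural IoT framework built on edge computing and multimodal data processing that uses large language models (LLMs) to address challenges in traditional smart agriculture, such as data fusion and real-time decision-making.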

Details

Motivation: Global population growth and climate change place higher demands on agriculture, and traditional agricultural IoT systems face challenges in data fusion, adaptation to dynamic environments, and real-time decision-making. The knowledge acquisition and semantic understanding capabilities of LLMs offer a way to address these problems. Method: Proposes the Farm-LightSeek framework, which integrates LLMs with edge computing, collects multi-source data (images, weather, geographic information) via sensors, performs cross-modal reasoning and disease detection at edge nodes, and supports low-latency decisions and cloud collaboration. Result: Experiments on two real-world datasets show that Farm-LightSeek completes critical tasks reliably even under edge computing resource constraints. Conclusion: Farm-LightSeek advances real-time intelligent agricultural solutions and demonstrates the potential of deep integration between agricultural IoT and LLMs. Abstract: Amid the challenges posed by global population growth and climate change, traditional agricultural Internet of Things (IoT) systems are currently undergoing a significant digital transformation to facilitate efficient big data processing. While smart agriculture utilizes artificial intelligence (AI) technologies to enable precise control, it still encounters significant challenges, including excessive reliance on agricultural expert knowledge, difficulties in fusing multimodal data, poor adaptability to dynamic environments, and bottlenecks in real-time decision-making at the edge. Large language models (LLMs), with their exceptional capabilities in knowledge acquisition and semantic understanding, provide a promising solution to address these challenges. To this end, we propose Farm-LightSeek, an edge-centric multimodal agricultural IoT data analytics framework that integrates LLMs with edge computing. This framework collects real-time farmland multi-source data (images, weather, geographic information) via sensors, performs cross-modal reasoning and disease detection at edge nodes, conducts low-latency management decisions, and enables cloud collaboration for model updates. The main innovations of Farm-LightSeek include: (1) an agricultural "perception-decision-action" closed-loop architecture; (2) cross-modal adaptive monitoring; and (3) a lightweight LLM deployment strategy balancing performance and efficiency. Experiments conducted on two real-world datasets demonstrate that Farm-LightSeek consistently achieves reliable performance in mission-critical tasks, even under the limitations of edge computing resources. This work advances intelligent real-time agricultural solutions and highlights the potential for deeper integration of agricultural IoT with LLMs.

[3] Improvement of human health lifespan with hybrid group pose estimation methods

Arindam Chaudhuri

Main category: cs.CV

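TL;DR: Proposes a hybrid-ensemble-based method for real-time multi-person pose estimation, aimed at improving the accuracy and robustness of human health monitoring.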

Details

Motivation: Human pose estimation has important applications in health monitoring, but existing methods fall short in real-time performance and occlusion handling. Method: Combines modified group pose estimation and modified real-time pose estimation, improving performance through feature fusion and pre-trained models. Result: The method is validated on public datasets and markedly improves the accuracy and robustness of real-time pose estimation. Conclusion: The method shows potential for real-time applications and can improve human health monitoring. Abstract: Human beings rely heavily on estimation of poses in order to access their body movements. Human pose estimation methods take advantage of computer vision advances in order to track human body movements in real-life applications. This comes from videos which are recorded through available devices. These paradigms provide potential to make human movement measurement more accessible to users. The consumers of pose estimation movements believe that human poses content tend to supplement available videos. This has increased pose estimation software usage to estimate human poses. In order to address this problem, we develop a hybrid-ensemble-based group pose estimation method to improve human health. This proposed hybrid-ensemble-based group pose estimation method aims to detect multi-person poses using modified group pose estimation and modified real-time pose estimation. This ensemble allows fusion of performance of the stated methods in real time. The input poses from images are fed into individual methods. The pose transformation method helps to identify relevant features for the ensemble to perform training effectively. After this, the customized pre-trained hybrid ensemble is trained on public benchmarked datasets and evaluated on test datasets. The effectiveness and viability of the proposed method is established based on comparative analysis of group pose estimation methods and experiments conducted on benchmarked datasets. It provides best optimized results in real-time pose estimation. It makes the pose estimation method more robust to occlusion and improves dense regression accuracy. These results have affirmed the potential application of this method in several real-time situations with improvement in human health life span.

[4] PALADIN: Robust Neural Fingerprinting for Text-to-Image Diffusion Models

Murthy L,Subarna Tripathi

Main category: cs.CV

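TL;DR: Proposes a neural fingerprinting method for text-to-image diffusion models based on cyclic error-correcting codes, aiming for perfect attribution accuracy.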

Details

Motivation: The risk of misuse of open-source text-to-image generative models is growing, and existing neural fingerprinting methods cannot reach 100% attribution accuracy, which limits practical deployment. Method: Proposes a new neural fingerprinting technique that uses cyclic error-correcting codes from coding theory. Result: The method aims for perfect attribution accuracy, addressing the shortcomings of existing methods. Conclusion: The proposed method offers a feasible approach to neural fingerprinting for text-to-image diffusion models and may enable practical deployment. Abstract: The risk of misusing text-to-image generative models for malicious uses, especially due to the open-source development of such models, has become a serious concern. As a risk mitigation strategy, attributing generative models with neural fingerprinting is emerging as a popular technique. There has been a plethora of recent work that aims to address neural fingerprinting. A trade-off between the attribution accuracy and generation quality of such models has been studied extensively. None of the existing methods has yet achieved 100% attribution accuracy. However, any model with less than perfect accuracy is practically non-deployable. In this work, we propose an accurate method to incorporate neural fingerprinting for text-to-image diffusion models, leveraging the concepts of cyclic error-correcting codes from the literature of coding theory.
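
To make the coding-theory ingredient concrete, here is a minimal sketch of a cyclic error-correcting code. The abstract does not say which code PALADIN uses or how the codeword is embedded in the model, so this is only a textbook illustration under stated assumptions: the (7,4) cyclic Hamming code with generator polynomial x^3 + x + 1, systematic encoding, and single-bit syndrome correction. All names (`encode`, `decode`, `poly_mod`, `cyclic_shift`) are hypothetical.

```python
G = 0b1011   # generator polynomial x^3 + x + 1 (assumed for illustration)
N, K = 7, 4  # codeword length and message length of the (7,4) Hamming code


def poly_mod(value, gen=G):
    """Remainder of value(x) divided by gen(x), with coefficients in GF(2)."""
    gen_degree = gen.bit_length() - 1
    while value.bit_length() - 1 >= gen_degree:
        shift = value.bit_length() - 1 - gen_degree
        value ^= gen << shift
    return value


def encode(message):
    """Systematic encoding: message bits followed by N-K parity bits."""
    shifted = message << (N - K)
    return shifted | poly_mod(shifted)


def cyclic_shift(word):
    """Rotate an N-bit word left by one position."""
    return ((word << 1) & ((1 << N) - 1)) | (word >> (N - 1))


def decode(received):
    """Correct at most one flipped bit via the syndrome, then drop parity."""
    syndrome = poly_mod(received)
    if syndrome:
        for pos in range(N):
            if poly_mod(1 << pos) == syndrome:
                received ^= 1 << pos
                break
    return received >> (N - K)


if __name__ == "__main__":
    fingerprint = 0b1101
    corrupted = encode(fingerprint) ^ (1 << 5)  # flip one bit
    print(decode(corrupted) == fingerprint)
```

The defining property of a cyclic code is that rotating any codeword yields another codeword, which `cyclic_shift` lets you check directly by confirming the rotated word still has a zero syndrome.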

[5] EdgeVidSum: Real-Time Personalized Video Summarization at the Edge

Ghulam Mujtaba,Eun-Seok Ryu

Main category: cs.CV

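TL;DR: EdgeVidSum is a lightweight method that generates personalized fast-forward summaries of long videos directly on edge devices, protecting user privacy through local data processing.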

Details

Motivation: Traditional video summarization methods are computationally expensive and offer weak privacy protection; modern video consumption calls for efficiency, personalization, and privacy. Method: Uses thumbnail-based techniques and efficient neural architectures; a hierarchical analysis approach with a lightweight 2D CNN identifies user-preferred content and generates timestamps to create fast-forward summaries. Result: Achieves real-time video summarization on resource-constrained devices such as Jetson Nano, producing personalized summaries that match user preferences. Conclusion: EdgeVidSum performs well on computational efficiency, personalization, and privacy protection, and suits modern video consumption scenarios. Abstract: EdgeVidSum is a lightweight method that generates personalized, fast-forward summaries of long-form videos directly on edge devices. The proposed approach enables real-time video summarization while safeguarding user privacy through local data processing using innovative thumbnail-based techniques and efficient neural architectures. Unlike conventional methods that process entire videos frame by frame, the proposed method uses thumbnail containers to significantly reduce computational complexity without sacrificing semantic relevance. The framework employs a hierarchical analysis approach, where a lightweight 2D CNN model identifies user-preferred content from thumbnails and generates timestamps to create fast-forward summaries. Our interactive demo highlights the system's ability to create tailored video summaries for long-form videos, such as movies, sports events, and TV shows, based on individual user preferences. The entire computation occurs seamlessly on resource-constrained devices like Jetson Nano, demonstrating how EdgeVidSum addresses the critical challenges of computational efficiency, personalization, and privacy in modern video consumption environments.

[6] FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution

Xiaoyi Liu,Hao Tang

Main category: cs.CV

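TL;DR: FOLIAGE is a physics-informed multimodal world model for unbounded accretive surface growth; a unified encoder and a physics-aware predictor produce a Modality-Agnostic Growth Embedding (MAGE), and the model is validated on the SURF-GARDEN platform.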

Details

Motivation: Physical intelligence, the ability to anticipate and shape the world from multisensory observations, is key to next-generation world models. Method: FOLIAGE uses a unified context encoder to map images, mesh connectivity, and point clouds to a shared latent state, combines it with a physics-aware predictor and the MAGE embedding, and uses the Accretive Graph Network (AGN) and geometry fusion techniques to increase expressiveness. Result: FOLIAGE outperforms specialized baselines on SURF-BENCH and remains robust in dynamic environments. Conclusion: FOLIAGE offers a new multimodal world-model pathway to physical intelligence. Abstract: Physical intelligence -- anticipating and shaping the world from partial, multisensory observations -- is critical for next-generation world models. We propose FOLIAGE, a physics-informed multimodal world model for unbounded accretive surface growth. In its Action-Perception loop, a unified context encoder maps images, mesh connectivity, and point clouds to a shared latent state. A physics-aware predictor, conditioned on physical control actions, advances this latent state in time to align with the target latent of the surface, yielding a Modality-Agnostic Growth Embedding (MAGE) that interfaces with critic heads for downstream objectives. FOLIAGE's Accretive Graph Network (AGN) captures dynamic connectivity through Age Positional Encoding and Energy-Gated Message-Passing. Geometry-Correspondence Fusion and Cross-Patch Masking enhance MAGE's expressiveness, while Hierarchical Pooling balances global context with local dynamics. We create SURF-GARDEN, a world model learning platform comprising a Counterfactual Physics Simulator, a Multimodal Correspondence Extractor, and Evolution Tracing, which generates 7,200 diverse surface-growth sequences. SURF-BENCH, our physical-intelligence evaluation suite, evaluates six core tasks -- topology recognition, inverse material estimation, growth-stage classification, latent roll-out, cross-modal retrieval, and dense correspondence -- and four stress tests -- sensor dropout, zero-shot modality transfer, long-horizon prediction, and physics ablation -- to probe resilience. FOLIAGE outperforms specialized baselines while remaining robust across dynamic environments, establishing a new world-model based, multimodal pathway to physical intelligence.

Koki Matsuishi,Kosuke Ukita,Tsuyoshi Okita

Main category: cs.CV

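TL;DR: The paper proposes AURA-MFM, a multimodal foundation model that integrates third-person video, motion capture, IMU, and text data to improve multidimensional understanding of human activity. Experiments show that it outperforms existing methods on retrieval and activity recognition tasks.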

Details

Motivation: Existing multimodal foundation models rely mainly on first-person video and text data and cannot fully analyze full-body activity. To address this, the authors integrate additional modalities to improve analysis. Method: Proposes AURA-MFM, which integrates four modalities (third-person video, motion capture, IMU, and text) and uses a Transformer-based IMU encoder to improve performance. Result: On zero-shot action recognition, the model reaches an F1-score of 0.6226 and an accuracy of 0.7320, far above the existing method (F1-score 0.0747, accuracy 0.1961). Conclusion: By integrating multimodal data, AURA-MFM markedly improves human activity analysis, especially on zero-shot tasks. Abstract: In recent years, the widespread adoption of wearable devices has highlighted the growing importance of behavior analysis using IMU. While applications span diverse fields such as healthcare and robotics, recent studies have increasingly focused on multimodal analysis, in addition to unimodal analysis. Several studies have proposed multimodal foundation models that incorporate first-person video and text data; however, these models still fall short in providing a detailed analysis of full-body human activity. To address this limitation, we propose Activity Understanding and Representations Alignment - Multimodal Foundation Model (AURA-MFM), a foundational model integrating four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity, which first-person perspectives alone fail to capture. Additionally, a Transformer-based IMU encoder is employed to enhance the model's overall performance. Experimental evaluations on retrieval and activity recognition tasks demonstrate that our model surpasses existing methods. Notably, in the zero-shot classification for action recognition, our method achieved significantly higher performance, with an F1-score of 0.6226 and an accuracy of 0.7320, whereas the existing method recorded an F1-score of 0.0747 and an accuracy of 0.1961.

[8] Vid-SME: Membership Inference Attacks against Large Video Understanding Models

Qi Li,Runpeng Yu,Xinchao Wang

Main category: cs.CV

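TL;DR: Vid-SME is a membership inference method for video understanding large language models (VULLMs). It addresses the poor generalization of existing methods to video data by combining adaptive parameterization with Sharma-Mittal entropy (SME) to identify videos that were part of the training data.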

Details

Motivation: Multimodal large language models (MLLMs) are advancing rapidly in video understanding, but data privacy is a pressing concern, especially because sensitive video content may be used improperly for training. Existing membership inference attacks (MIAs) perform poorly in the video domain, so a targeted solution is needed. Method: Proposes Vid-SME, which computes SME from the confidence of model outputs with adaptive parameterization and derives membership scores from the SME difference between natural and temporally reversed frames. Result: Experiments show that Vid-SME performs strongly across various self-trained and open-source VULLMs and markedly improves membership inference accuracy. Conclusion: Vid-SME provides an effective tool for membership inference on video data, overcomes the limitations of existing methods, and matters for protecting data privacy. Abstract: Multimodal large language models (MLLMs) demonstrate remarkable capabilities in handling complex multimodal tasks and are increasingly adopted in video understanding applications. However, their rapid advancement raises serious data privacy concerns, particularly given the potential inclusion of sensitive video content, such as personal recordings and surveillance footage, in their training datasets. Determining improperly used videos during training remains a critical and unresolved challenge. Despite considerable progress on membership inference attacks (MIAs) for text and image data in MLLMs, existing methods fail to generalize effectively to the video domain. These methods suffer from poor scalability as more frames are sampled and generally achieve negligible true positive rates at low false positive rates (TPR@Low FPR), mainly due to their failure to capture the inherent temporal variations of video frames and to account for model behavior differences as the number of frames varies. To address these challenges, we introduce Vid-SME, the first membership inference method tailored for video data used in video understanding LLMs (VULLMs). Vid-SME leverages the confidence of model output and integrates adaptive parameterization to compute Sharma-Mittal entropy (SME) for video inputs. By leveraging the SME difference between natural and temporally-reversed video frames, Vid-SME derives robust membership scores to determine whether a given video is part of the model's training set. Experiments on various self-trained and open-sourced VULLMs demonstrate the strong effectiveness of Vid-SME.
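
For readers unfamiliar with Sharma-Mittal entropy, the sketch below evaluates its standard two-parameter form, S_{q,r}(p) = ((sum_i p_i^q)^((1-r)/(1-q)) - 1) / (1 - r), on a single output distribution. How Vid-SME chooses q and r adaptively, and how it aggregates over frames, is not described in the abstract; the function name and the fixed parameter values are assumptions for illustration only.

```python
def sharma_mittal_entropy(probs, q, r):
    """Sharma-Mittal entropy of a discrete distribution, for q != 1 and r != 1.

    S_{q,r}(p) = ((sum_i p_i**q) ** ((1 - r) / (1 - q)) - 1) / (1 - r).
    The limits q -> 1 or r -> 1 (Shannon/Renyi-type cases) are not handled here.
    """
    if q == 1 or r == 1:
        raise ValueError("this sketch only covers q != 1 and r != 1")
    power_sum = sum(p ** q for p in probs if p > 0)
    return (power_sum ** ((1 - r) / (1 - q)) - 1) / (1 - r)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    confident = [0.97, 0.01, 0.01, 0.01]
    # A confident (peaked) prediction has lower entropy than a uniform one.
    print(sharma_mittal_entropy(uniform, q=2, r=2))
    print(sharma_mittal_entropy(confident, q=2, r=2))
```

Setting r equal to q reduces the expression to Tsallis entropy, which is a convenient sanity check for an implementation.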

[9] TerraIncognita: A Dynamic Benchmark for Species Discovery Using Frontier Models

Shivani Chiranjeevi,Hossein Zaremehrjerdi,Zi K. Deng,Talukder Z. Jubery,Ari Grele,Arti Singh,Asheesh K Singh,Soumik Sarkar,Nirav Merchant,Harold F. Greeney,Baskar Ganapathysubramanian,Chinmay Hegde

Main category: cs.CV

TL;DR: TerraIncognita is a dynamic benchmark for evaluating how well multimodal models identify unknown insect species, combining image data of known and rare species.

Details

Motivation: Global biodiversity, especially among insects, is being lost rapidly, and current species discovery methods are inefficient, which hinders timely conservation action. Method: Mixes image data of known and rare insect species to evaluate models on classification, detection of novel species, and generation of explanations. Result: Models perform well on Order-level classification of known species (F1 > 90%) but very poorly at the Species level (F1 < 2%). Conclusion: TerraIncognita will be updated regularly, providing an evolving evaluation platform for frontier AI methods. Abstract: The rapid global loss of biodiversity, particularly among insects, represents an urgent ecological crisis. Current methods for insect species discovery are manual, slow, and severely constrained by taxonomic expertise, hindering timely conservation actions. We introduce TerraIncognita, a dynamic benchmark designed to evaluate state-of-the-art multimodal models for the challenging problem of identifying unknown, potentially undescribed insect species from image data. Our benchmark dataset combines a mix of expertly annotated images of insect species likely known to frontier AI models, and images of rare and poorly known species, for which few/no publicly available images exist. These images were collected from underexplored biodiversity hotspots, realistically mimicking open-world discovery scenarios faced by ecologists. The benchmark assesses models' proficiency in hierarchical taxonomic classification, their capability to detect and abstain from out-of-distribution (OOD) samples representing novel species, and their ability to generate explanations aligned with expert taxonomic knowledge. Notably, top-performing models achieve over 90% F1 at the Order level on known species, but drop below 2% at the Species level, highlighting the sharp difficulty gradient from coarse to fine taxonomic prediction (Order → Family → Genus → Species). TerraIncognita will be updated regularly, and by committing to quarterly dataset expansions (of both known and novel species), will provide an evolving platform for longitudinal benchmarking of frontier AI methods. All TerraIncognita data, results, and future updates are available at https://baskargroup.github.io/TerraIncognita/.

[10] Impact of Tuning Parameters in Deep Convolutional Neural Network Using a Crack Image Dataset

Mahe Zabin,Ho-Jin Choi,Md. Monirul Islam,Jia Uddin

Main category: cs.CV

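TL;DR: Studies how different tuning parameters affect the performance of a deep convolutional neural network (DCNN) and finds that max pooling, the Adam optimizer, and the tanh activation function perform best.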

Details

Motivation: Examines how tuning parameters affect classifier performance in a DCNN, in order to optimize the model. Method: Uses a DCNN with 2 convolutional layers, 2 pooling layers, 1 dropout layer, and 1 dense layer, and evaluates experimentally the effect of the pooling, activation function, and optimizer settings. Result: Experiments show that the combination of max pooling, the Adam optimizer, and the tanh activation function performs best. Conclusion: Tuning parameters have a marked effect on DCNN performance; max pooling, Adam, and tanh are the best combination. Abstract: The performance of a classifier depends on the tuning of its parameters. In this paper, we have experimented with the impact of various tuning parameters on the performance of a deep convolutional neural network (DCNN). In the experimental evaluation, we have considered a DCNN classifier that consists of 2 convolutional layers (CL), 2 pooling layers (PL), 1 dropout, and a dense layer. To observe the impact of pooling, activation function, and optimizer tuning parameters, we utilized a crack image dataset having two classes: negative and positive. The experimental results demonstrate that with max pooling, the DCNN performs better with the Adam optimizer and the tanh activation function.

[11] Continual Learning in Vision-Language Models via Aligned Model Merging

Ghada Sokar,Gintare Karolina Dziugaite,Anurag Arnab,Ahmet Iscen,Pablo Samuel Castro,Cordelia Schmid

Main category: cs.CV

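TL;DR: The paper proposes a continual learning method based on model merging that balances stability and plasticity, reduces catastrophic forgetting, and improves robustness to task order and task similarity.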

Details

Motivation: Conventional continual learning adapts through sequential fine-tuning, which tends to sacrifice stability for plasticity and leads to bias towards recent tasks and catastrophic forgetting. Method: Proposes a model merging approach that merges newly trained task parameters with previously learned ones, plus a simple mechanism that promotes learning aligned weights to avoid interference. Result: Validated on large vision-language models, the method reduces forgetting, improves robustness to task order and similarity, and improves generalization. Conclusion: Model merging offers a new perspective on continual learning, balances stability and plasticity effectively, and has practical potential. Abstract: Continual learning is conventionally tackled through sequential fine-tuning, a process that, while enabling adaptation, inherently favors plasticity over the stability needed to retain prior knowledge. While existing approaches attempt to mitigate catastrophic forgetting, a bias towards recent tasks persists as they build upon this sequential nature. In this work we present a new perspective based on model merging to maintain stability while still retaining plasticity. Rather than just sequentially updating the model weights, we propose merging newly trained task parameters with previously learned ones, promoting a better balance. To maximize the effectiveness of the merging process, we propose a simple mechanism that promotes learning aligned weights with previous ones, thereby avoiding interference when merging. We evaluate this approach on large Vision-Language Models (VLMs), and demonstrate its effectiveness in reducing forgetting, increasing robustness to various task orders and similarities, and improving generalization.
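
The contrast between sequential updating and merging can be sketched in a few lines. The abstract does not state the merge rule or the alignment mechanism, so this example assumes the simplest possible choice, linear interpolation between the previous parameters and the newly fine-tuned ones; `merge_parameters` and `alpha` are hypothetical names, and parameters are plain lists of floats rather than real model tensors.

```python
def merge_parameters(previous, new_task, alpha=0.5):
    """Interpolate two parameter dictionaries with matching names and shapes.

    merged = (1 - alpha) * previous + alpha * new_task.
    alpha = 0 keeps the old model (full stability); alpha = 1 reproduces
    plain sequential fine-tuning (full plasticity).
    """
    if previous.keys() != new_task.keys():
        raise ValueError("parameter names must match")
    merged = {}
    for name in previous:
        old_values, new_values = previous[name], new_task[name]
        if len(old_values) != len(new_values):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1 - alpha) * old + alpha * new
            for old, new in zip(old_values, new_values)
        ]
    return merged


if __name__ == "__main__":
    previous = {"layer.weight": [0.0, 2.0]}
    after_new_task = {"layer.weight": [2.0, 4.0]}
    print(merge_parameters(previous, after_new_task, alpha=0.5))
```

Under this reading, the alignment mechanism mentioned in the abstract would act during training so that the two parameter sets interfere less when combined; that part is not modelled here.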

[12] MINT: Memory-Infused Prompt Tuning at Test-time for CLIP

Jiaming Yi,Ruirui Pan,Jishen Yang,Xiulong Yang

Main category: cs.CV

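TL;DR: Proposes MINT, a new framework that uses a memory prompt bank to adapt dynamically to visual semantic information and improve the test-time generalization of vision-language pre-trained models.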

Details

Motivation: Existing test-time adaptation methods do not fully exploit the model's internal knowledge, especially when adapting dynamically to complex and hierarchical visual semantic information. Method: MINT introduces a Memory Prompt Bank (MPB) that stores learnable key-value prompt pairs, retrieves relevant pairs using hierarchical visual features, and dynamically assembles associative prompts that are injected into the image encoder. Result: MINT adapts vision-language pre-trained models quickly and precisely at test time, without source data or retraining. Conclusion: Through the memory prompt bank, MINT improves test-time generalization and offers a new approach to dynamic adaptation of vision-language pre-trained models. Abstract: Improving the generalization ability of Vision-Language Pre-trained Models (VLMs) under test-time data distribution shifts remains a critical challenge. The existing Test-Time Adaptation (TTA) methods fall short in fully leveraging the model's internal knowledge, particularly in dynamically adapting to complex and hierarchical visual semantic information. In this paper, we propose Memory-Infused Prompt Tuning (MINT), a novel framework to address this issue. Inspired by human associative memory theory, MINT introduces a Memory Prompt Bank (MPB), which stores learnable key-value prompt pairs that work as a memory of previously seen samples. At test time, relevant prompt pairs in the MPB are retrieved by the hierarchical visual features of test images to dynamically assemble Associative Prompts. The associative prompts are then injected into the image encoder for fine-grained, customized visual contextual guidance. MINT also utilizes learnable text prompts. MINT thus enables rapid, precise VLM adaptation at test time by leveraging this MPB-acquired memory, without source data or retraining. The code is available at https://github.com/Jamieyi2004/MINT.
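
The key-value retrieval step described for the Memory Prompt Bank resembles a nearest-neighbour lookup, which the sketch below illustrates. It is not the authors' implementation (their code is at the repository linked above): the cosine-similarity ranking, the `top_k` selection, and the mean-pooling used to assemble the prompt are all assumptions, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_prompts(query, bank, top_k=2):
    """Return the values of the top_k (key, value) pairs whose keys best match query."""
    ranked = sorted(bank, key=lambda pair: cosine(query, pair[0]), reverse=True)
    return [value for _, value in ranked[:top_k]]


def assemble_prompt(values):
    """Average the retrieved value prompts into one associative prompt (assumed rule)."""
    dim = len(values[0])
    return [sum(v[i] for v in values) / len(values) for i in range(dim)]


if __name__ == "__main__":
    # Toy bank: 2-D keys, 2-D value prompts.
    bank = [
        ([1.0, 0.0], [1.0, 1.0]),
        ([0.0, 1.0], [5.0, 5.0]),
        ([0.9, 0.1], [3.0, 3.0]),
    ]
    image_feature = [1.0, 0.0]
    print(assemble_prompt(retrieve_prompts(image_feature, bank, top_k=2)))
```

In the paper's setting the keys and values are learnable and the query comes from hierarchical image features; here they are fixed toy vectors so the lookup logic stays visible.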

[13] Multimodal Generative AI with Autoregressive LLMs for Human Motion Understanding and Generation: A Way Forward

Muhammad Islam,Tao Huang,Euijoon Ahn,Usman Naseem

Main category: cs.CV

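TL;DR: Surveys the use of multimodal generative AI (GenAI) and autoregressive large language models (LLMs) for human motion understanding and generation, covering emerging methods and architectures and their potential to make motion synthesis more realistic and diverse.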

Details

Motivation: Studies the combination of text and motion modalities and how textual descriptions can guide the generation of complex, human-like motion sequences, in order to advance motion generation. Method: Analyzes several generative approaches, including autoregressive models, diffusion models, generative adversarial networks (GANs), variational autoencoders (VAEs), and Transformer-based models, and assesses their strengths and weaknesses in motion quality, computational efficiency, and adaptability. Result: Text-conditioned motion generation allows more precise control over motion outputs, and LLMs further improve semantic alignment, strengthening coherence and contextual relevance. Conclusion: Text-to-motion GenAI and LLM architectures have transformative potential in healthcare, humanoid robotics, gaming, animation, and assistive technologies, but generating efficient and realistic motion remains a challenge. Abstract: This paper presents an in-depth survey on the use of multimodal Generative Artificial Intelligence (GenAI) and autoregressive Large Language Models (LLMs) for human motion understanding and generation, offering insights into emerging methods, architectures, and their potential to advance realistic and versatile motion synthesis. Focusing exclusively on text and motion modalities, this research investigates how textual descriptions can guide the generation of complex, human-like motion sequences. The paper explores various generative approaches, including autoregressive models, diffusion models, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and transformer-based models, by analyzing their strengths and limitations in terms of motion quality, computational efficiency, and adaptability. It highlights recent advances in text-conditioned motion generation, where textual inputs are used to control and refine motion outputs with greater precision. The integration of LLMs further enhances these models by enabling semantic alignment between instructions and motion, improving coherence and contextual relevance. This systematic survey underscores the transformative potential of text-to-motion GenAI and LLM architectures in applications such as healthcare, humanoids, gaming, animation, and assistive technologies, while addressing ongoing challenges in generating efficient and realistic human motion.

[14] Human Fall Detection using Transfer Learning-based 3D CNN

Ekram Alam,Abu Sufian,Paramartha Dutta,Marco Leo

Main category: cs.CV

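TL;DR: Proposes a vision-based fall detection system built on a pre-trained 3D CNN, using spatio-temporal features and an SVM classifier for efficient classification.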

Details

Motivation: Accidental falls among older adults are a serious health problem, and as the older population grows, automated fall detection systems are needed. Method: Extracts spatio-temporal features with a pre-trained 3D CNN and trains only an SVM classifier to save time, using stratified shuffle-split cross-validation with five splits. Result: Experiments were run on the GMDCSA and CAUCAFall datasets, and the model performed well. Conclusion: The method achieves efficient fall detection with a pre-trained model and an SVM classifier; the code is open source. Abstract: Unintentional or accidental falls are one of the significant health issues in senior persons. The population of senior persons is increasing steadily. So, there is a need for an automated fall detection monitoring system. This paper introduces a vision-based fall detection system using a pre-trained 3D CNN. Unlike 2D CNN, 3D CNN extracts not only spatial but also temporal features. The proposed model leverages the original learned weights of a 3D CNN model pre-trained on the Sports1M dataset to extract the spatio-temporal features. Only the SVM classifier was trained, which saves the time required to train the 3D CNN. Stratified shuffle five split cross-validation has been used to split the dataset into training and testing data. Extracted features from the proposed 3D CNN model were fed to an SVM classifier to classify the activity as fall or ADL. Two datasets, GMDCSA and CAUCAFall, were utilized to conduct the experiment. The source code for this work can be accessed via the following link: https://github.com/ekramalam/HFD_3DCNN.

[15] HueManity: Probing Fine-Grained Visual Perception in MLLMs

Rynaa Grover,Jayant Sravan Tamarapalli,Sahiti Yerramilli,Nilay Pande

Main category: cs.CV

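TL;DR: HueManity is a benchmark for evaluating the visual perception of multimodal large language models (MLLMs); results show that MLLMs perform far below humans and traditional computer vision models on fine-grained visual tasks.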

Details

Motivation: Current MLLMs excel at high-level visual reasoning but are limited on fine-grained perceptual tasks, so their visual capabilities need quantitative evaluation. Method: HueManity contains 83,850 Ishihara-style dot-pattern images and tests how precisely MLLMs recognize the embedded characters. Nine state-of-the-art MLLMs were evaluated. Result: MLLMs lag far behind: the best model reaches 33.6% accuracy on the 'easy' task and only 3% on the 'hard' task, while humans and a ResNet50 model score near-perfectly. Conclusion: MLLMs have marked deficiencies in visual perception, and architectures and training paradigms need improvement. The HueManity dataset and code are open source to support further research. Abstract: Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara test style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved a 33.6% accuracy on the numeric 'easy' task and a striking 3% on the alphanumeric 'hard' task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap in MLLMs. We open-source the HueManity dataset and code to foster further research in improving perceptual robustness of MLLMs.

[16] Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

Yunqi Hong,Sohyun An,Andrew Bai,Neil Y. C. Lin,Cho-Jui Hsieh

Main category: cs.CV

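TL;DR: AutoSEP is a self-supervised prompt learning framework that improves the performance of multimodal large language models (MLLMs) on fine-grained image classification without supervised data.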

Details

Motivation: Fine-grained image classification requires attention to subtle visual differences, which MLLMs easily overlook without explicit guidance. Method: Uses iterative self-supervised learning on unlabeled data to optimize a description prompt that guides MLLMs towards key discriminative features. Result: Across multiple fine-grained classification datasets, AutoSEP improves on zero-shot classification by 13% on average and outperforms other unsupervised baselines. Conclusion: AutoSEP improves MLLM performance on fine-grained classification without any training or fine-tuning. Abstract: Despite Multimodal Large Language Models (MLLMs) showing promising results on general zero-shot image classification tasks, fine-grained image classification remains challenging. It demands precise attention to subtle visual details to distinguish between visually similar subcategories--details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance MLLM fine-grained classification capabilities in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image, and boosts classification accuracy. We developed an automatic self-enhancing prompt learning framework called AutoSEP to iteratively improve the description prompt using unlabeled data, based on an instance-level classification scoring function. AutoSEP only requires black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP on average improves 13 percent over standard zero-shot classification and 5 percent over the best-performing baselines. Code is available at: https://github.com/yq-hong/AutoSEP

[17] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

Baode Wang,Biao Wu,Weizhen Li,Meng Fang,Yanjie Liang,Zuming Huang,Haozhe Wang,Jun Huang,Ling Chen,Wei Chu,Yuan Qi

Main category: cs.CV

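TL;DR: The paper proposes layoutRL, an end-to-end reinforcement learning framework that trains models by optimizing a composite reward (edit distance, paragraph count accuracy, and reading order preservation), markedly improving the accuracy and structural fidelity of document parsing.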

Details

Motivation: Traditional multi-stage document parsing pipelines suffer from error propagation and poor adaptability to layouts, so a more efficient and adaptable solution is needed. Method: Uses the layoutRL framework with the newly released Infinity-Doc-55K dataset to train Infinity-Parser, a parser based on a vision-language model. Result: On OCR, table and formula extraction, and reading order detection, Infinity-Parser sets a new state of the art in both accuracy and structural fidelity. Conclusion: The proposed method markedly improves document parsing, and the code and dataset will be released to advance document understanding. Abstract: Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
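
The three reward terms named in the abstract can be combined as in the sketch below. The exact definitions and weights used by layoutRL are not given in the abstract, so the following are assumptions: character-level Levenshtein distance normalized by the longer string, a relative paragraph-count error, an inversion-based order score over exactly matching paragraphs, and equal weights. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def edit_reward(pred, ref):
    """1 minus the edit distance normalized by the longer string's length."""
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - levenshtein(pred, ref) / longest


def count_reward(pred_paras, ref_paras):
    """1 minus the relative error in paragraph count, floored at 0."""
    if not ref_paras:
        return 1.0 if not pred_paras else 0.0
    return max(0.0, 1.0 - abs(len(pred_paras) - len(ref_paras)) / len(ref_paras))


def order_reward(pred_paras, ref_paras):
    """Fraction of matched paragraph pairs that keep the reference order."""
    ref_index = {para: i for i, para in enumerate(ref_paras)}
    seq = [ref_index[p] for p in pred_paras if p in ref_index]
    pairs = len(seq) * (len(seq) - 1) // 2
    if pairs == 0:
        return 1.0
    inversions = sum(
        1 for i in range(len(seq)) for j in range(i + 1, len(seq)) if seq[i] > seq[j]
    )
    return 1.0 - inversions / pairs


def composite_reward(pred, ref, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three terms; paragraphs are split on blank lines."""
    pred_paras = [p.strip() for p in pred.split("\n\n") if p.strip()]
    ref_paras = [p.strip() for p in ref.split("\n\n") if p.strip()]
    parts = (
        edit_reward(pred, ref),
        count_reward(pred_paras, ref_paras),
        order_reward(pred_paras, ref_paras),
    )
    return sum(w * x for w, x in zip(weights, parts))


if __name__ == "__main__":
    reference = "Title\n\nFirst paragraph.\n\nSecond paragraph."
    swapped = "Title\n\nSecond paragraph.\n\nFirst paragraph."
    print(composite_reward(reference, reference), composite_reward(swapped, reference))
```

A parse that swaps two paragraphs keeps a perfect count score but loses on both the edit and order terms, which is the kind of layout error a pure text-similarity reward would under-penalize.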

Hao Yin,Lijun Gu,Paritosh Parmar,Lin Xu,Tianxiao Guo,Weiwei Fu,Yang Zhang,Tianyou Zheng

Main category: cs.CV

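TL;DR: The paper presents FLEX, the first large-scale multimodal, multi-action dataset that incorporates surface electromyography (sEMG) signals into action quality assessment (AQA), filling a gap in existing AQA work for the fitness domain.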

Details

Motivation: With growing health awareness and interest in aesthetic physique, fitness has become a popular trend, but fitness training, especially weight-loaded actions, carries potential risks. Existing AQA methods and datasets are limited to single-view competitive sports scenarios and the RGB modality, and lack professional assessment and guidance for fitness actions. Method: Proposes the FLEX dataset, covering 20 weight-loaded actions, each repeated 10 times by 38 subjects of different skill levels, with multi-view RGB video, 3D pose, sEMG, and physiological information. Annotation rules are built with knowledge graphs, using penalty functions that map actions, key steps, error types, and feedback. Result: Experiments show that multimodal data, multi-view data, and fine-grained annotations markedly improve model performance. FLEX moves AQA methods towards multimodal and multi-action scenarios. Conclusion: FLEX advances AQA methods and datasets and promotes the application of AI in the fitness domain. The dataset and code are open source. Abstract: With the increasing awareness of health and the growing desire for aesthetic physique, fitness has become a prevailing trend. However, the potential risks associated with fitness training, especially with weight-loaded fitness actions, cannot be overlooked. Action Quality Assessment (AQA), a technology that quantifies the quality of human action and provides feedback, holds the potential to assist fitness enthusiasts of varying skill levels in achieving better training outcomes. Nevertheless, current AQA methodologies and datasets are limited to single-view competitive sports scenarios and RGB modality and lack professional assessment and guidance of fitness actions. To address this gap, we propose the FLEX dataset, the first multi-modal, multi-action, large-scale dataset that incorporates surface electromyography (sEMG) signals into AQA. FLEX utilizes high-precision MoCap to collect 20 different weight-loaded actions performed by 38 subjects across 3 different skill levels for 10 repetitions each, containing 5 different views of the RGB video, 3D pose, sEMG, and physiological information. Additionally, FLEX incorporates knowledge graphs into AQA, constructing annotation rules in the form of penalty functions that map weight-loaded actions, action keysteps, error types, and feedback. We evaluated various baseline methodologies on FLEX, demonstrating that multimodal data, multiview data, and fine-grained annotations significantly enhance model performance. FLEX not only advances AQA methodologies and datasets towards multi-modal and multi-action scenarios but also fosters the integration of artificial intelligence within the fitness domain. Dataset and code are available at https://haoyin116.github.io/FLEX_Dataset.

Wanting Yang,Zehui Xiong,Qianqian Yang,Ping Zhang,Merouane Debbah,Rahim Tafazolli

Main category: cs.CV

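TL;DR: Proposes GenSeC-PC, a novel channel-adaptive cross-modal generative semantic communication method for efficient point cloud transmission that uses images as non-transmitted side information, significantly improving compression efficiency and reconstruction performance.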

Details

Motivation: With the rapid development of autonomous driving and extended reality, efficient point cloud transmission is increasingly important. Existing methods fall short in compression efficiency and reconstruction performance and require error-free transmission of side information. Method: A cross-modal design fuses image and point cloud data, uses PointDif as the decoder backbone, and adds a channel-adaptive joint semantic-channel coding architecture. Rectified denoising diffusion implicit models accelerate decoding. Result: Experiments show that the method is robust across diverse conditions such as low SNR and limited bandwidth, supports fully analog transmission, and significantly improves compression efficiency. Conclusion: Through generative priors and cross-modal design, GenSeC-PC achieves efficient and robust point cloud transmission, offering a new solution for semantic communication. Abstract: With the rapid development of autonomous driving and extended reality, efficient transmission of point clouds (PCs) has become increasingly important. In this context, we propose a novel channel-adaptive cross-modal generative semantic communication (SemCom) for PC transmission, called GenSeC-PC. GenSeC-PC employs a semantic encoder that fuses images and point clouds, where images serve as non-transmitted side information. Meanwhile, the decoder is built upon the backbone of PointDif. Such a cross-modal design not only ensures high compression efficiency but also delivers superior reconstruction performance compared to PointDif. Moreover, to ensure robust transmission and reduce system complexity, we design a streamlined and asymmetric channel-adaptive joint semantic-channel coding architecture, where only the encoder needs the feedback of average signal-to-noise ratio (SNR) and available bandwidth. In addition, rectified denoising diffusion implicit models are employed to accelerate the decoding process to the millisecond level, enabling real-time PC communication. Unlike existing methods, GenSeC-PC leverages generative priors to ensure reliable reconstruction even from noisy or incomplete source PCs. More importantly, it supports fully analog transmission, improving compression efficiency by eliminating the need for error-free side information transmission common in prior SemCom approaches. Simulation results confirm the effectiveness of cross-modal semantic extraction and dual-metric guided fine-tuning, highlighting the framework's robustness across diverse conditions, including low SNR, bandwidth limitations, varying numbers of 2D images, and previously unseen objects.

[20] ConMamba: Contrastive Vision Mamba for Plant Disease Detection

Abdullah Al Mamun,Miaohua Zhang,David Ahmedt-Aristizabal,Zeeshan Hayder,Mohammad Awrangjeb

Main category: cs.CV

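TL;DR: ConMamba is a novel self-supervised learning framework designed for plant disease detection; it significantly improves performance through a Vision Mamba Encoder and a dual-level contrastive loss with dynamic weight adjustment.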

Details

Motivation: Existing deep learning methods rely on large annotated datasets and are computationally costly; self-supervised learning can exploit unlabeled data but struggles to capture long-range dependencies and to align features. Method: The paper proposes the ConMamba framework, which integrates the Vision Mamba Encoder (a bidirectional state space model) with a dual-level contrastive loss with dynamic weight adjustment. Result: On three benchmark datasets, ConMamba significantly outperforms existing methods. Conclusion: ConMamba provides an efficient and robust solution for plant disease detection. Abstract: Plant Disease Detection (PDD) is a key aspect of precision agriculture. However, existing deep learning methods often rely on extensively annotated datasets, which are time-consuming and costly to generate. Self-supervised Learning (SSL) offers a promising alternative by exploiting the abundance of unlabeled data. However, most existing SSL approaches suffer from high computational costs due to convolutional neural networks or transformer-based architectures. Additionally, they struggle to capture long-range dependencies in visual representation and rely on static loss functions that fail to align local and global features effectively. To address these challenges, we propose ConMamba, a novel SSL framework specially designed for PDD. ConMamba integrates the Vision Mamba Encoder (VME), which employs a bidirectional State Space Model (SSM) to capture long-range dependencies efficiently. Furthermore, we introduce a dual-level contrastive loss with dynamic weight adjustment to optimize local-global feature alignment. Experimental results on three benchmark datasets demonstrate that ConMamba significantly outperforms state-of-the-art methods across multiple evaluation metrics. This provides an efficient and robust solution for PDD.

[21] OpenCarbon: A Contrastive Learning-based Cross-Modality Neural Approach for High-Resolution Carbon Emission Prediction Using Open Data

Jinwei Zeng,Yu Liu,Guozhen Zhang,Jingtao Ding,Yuming Lin,Jian Yuan,Yong Li

Main category: cs.CV

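TL;DR: The paper proposes OpenCarbon, a model that combines satellite images and POI data to predict high-resolution urban carbon emissions, addresses the functionality and spatial contiguity challenges, and improves performance by 26.6% on R2.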

Details

Motivation: Conventional carbon accounting methods require costly data collection, while open data and advanced learning techniques offer a new route to high-resolution carbon emission estimation. Method: The model combines satellite images and POI data, with a cross-modality information extraction and fusion module and a neighborhood-informed aggregation module. Result: Model performance improves significantly, by 26.6% on R2, and the model is shown to capture the relation between urban functionalities and carbon emissions. Conclusion: OpenCarbon shows potential for efficient carbon governance and targeted mitigation planning. Abstract: Accurately estimating high-resolution carbon emissions is crucial for effective emission governance and mitigation planning. While conventional methods for precise carbon accounting are hindered by substantial data collection efforts, the rise of open data and advanced learning techniques offers a promising solution. Once an open data-based prediction model is developed and trained, it can easily infer emissions for new areas based on available open data. To address this, we incorporate two modalities of open data, satellite images and point-of-interest (POI) data, to predict high-resolution urban carbon emissions, with satellite images providing macroscopic and static functionality information and POI data offering fine-grained and relatively dynamic functionality information. However, estimating high-resolution carbon emissions presents two significant challenges: the intertwined and implicit effects of various functionalities on carbon emissions, and the complex spatial contiguity correlations that give rise to the agglomeration effect. Our model, OpenCarbon, features two major designs that target the challenges: a cross-modality information extraction and fusion module to extract complementary functionality information from the two modalities and model their interactions, and a neighborhood-informed aggregation module to capture the spatial contiguity correlations. Extensive experiments demonstrate our model's superiority, with a significant performance gain of 26.6% on R2. Further generalizability tests and case studies also show OpenCarbon's capacity to capture the intrinsic relation between urban functionalities and carbon emissions, validating its potential to empower efficient carbon governance and targeted carbon mitigation planning. Codes and data are available: https://github.com/JinweiZzz/OpenCarbon.

[22] Pre-trained Vision-Language Models Assisted Noisy Partial Label Learning

Qian-Wei Wang,Yuqiu Xie,Letian Zhang,Zimo Liu,Shu-Tao Xia

Main category: cs.CV

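TL;DR: The paper proposes Co-Reg, a collaborative consistency regularization method for learning from noisy partial labels generated by pre-trained vision-language models (VLMs).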

Details

Motivation: Pre-trained VLMs such as CLIP, LLaVa, and GPT-4V make it possible to replace time-consuming manual annotation, but the noisy labels they produce are instance-dependent, which makes learning harder. Method: Two neural networks are trained simultaneously; a "Co-Pseudo-Labeling" mechanism purifies the training labels, and consistency regularization constraints are enforced in both the label space and the feature representation space. Result: Experiments show the method is effective across different denoising and disambiguation algorithms, annotation manners, and pre-trained model application schemes. Conclusion: The method shows the broad prospects of integrating weakly-supervised learning techniques into the knowledge distillation process of pre-trained models. Abstract: In the context of noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators. With the emergence of high-performance pre-trained vision-language models (VLMs) such as CLIP, LLaVa and GPT-4V, the direction of using these models to replace time-consuming manual annotation workflows and achieve "manual-annotation-free" training for downstream tasks has become a highly promising research avenue. This paper focuses on learning from noisy partial labels annotated by pre-trained VLMs and proposes an innovative collaborative consistency regularization (Co-Reg) method. Unlike the symmetric noise primarily addressed in traditional noisy label learning, the noise generated by pre-trained models is instance-dependent, embodying the underlying patterns of the pre-trained models themselves, which significantly increases the learning difficulty for the model. To address this, we simultaneously train two neural networks that implement collaborative purification of training labels through a "Co-Pseudo-Labeling" mechanism, while enforcing consistency regularization constraints in both the label space and feature representation space. Our method can also leverage few-shot manually annotated valid labels to further enhance its performance. Comparative experiments with different denoising and disambiguation algorithms, annotation manners, and pre-trained model application schemes fully validate the effectiveness of the proposed method, while revealing the broad prospects of integrating weakly-supervised learning techniques into the knowledge distillation process of pre-trained models.

[23] Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas

Austin Silveria,Soham V. Govande,Daniel Y. Fu

Main category: cs.CV

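TL;DR: Chipmunk uses dynamic sparsity to reduce redundant computation in DiT inference, substantially increasing speed without compromising generation quality.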

Details

Motivation: DiT inference involves redundant computation because latent noise vectors change slowly across steps. Method: The paper studies how DiT activations change between steps and proposes Chipmunk, a dynamic sparsity method with optimized sparse operations and caching. Result: Chipmunk achieves speedups of 1.41x to 3.72x across several models with almost no quality loss. Conclusion: Chipmunk effectively reduces redundant computation in DiT inference and significantly improves efficiency. Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in high-quality image and video generation but incur substantial compute cost at inference. A common observation is that DiT latent noise vectors change slowly across inference steps, which suggests that the DiT compute may be redundant across steps. In this paper, we aim to speed up inference by reducing this redundancy, without additional training. We first study how activations change between steps in two state-of-the-art open-source DiTs. We find that just 5-25% of the values in attention and MLP explain 70-90% of the change in activations across steps. This finding motivates our approach, Chipmunk, which uses dynamic sparsity at inference time to recompute only the fastest-changing intermediate activations, while caching the rest. Dynamic sparsity introduces two systems challenges: (1) sparse attention and MLP operations tend to underutilize GPU tensor cores; and (2) computing dynamic sparsity patterns at runtime and caching activations both introduce overhead. To address these challenges, Chipmunk first uses a voxel-based reordering of input tokens to introduce column-wise sparsity. We implement column-sparse kernels utilizing efficient sparse gathers from global to shared GPU memory, achieving a 9.3x speedup at 93% sparsity compared to highly-optimized dense baselines. Second, Chipmunk overlaps the computation of sparsity patterns and cache updates with other parts of the computation (e.g., second layer of the MLP) to hide the extra latency. Chipmunk achieves up to 2.16x speedup on HunyuanVideo and 1.41x on FLUX.1-dev without compromising generation quality. Furthermore, we show that Chipmunk can be stacked on top of full step caching, achieving a 3.72x speedup on HunyuanVideo, a 2.67x speedup on WAN2.1, and a 2.25x speedup on FLUX.1-dev with minimal quality impact.
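
Illustrative sketch (not the authors' implementation): the core idea in the abstract, recomputing only the fastest-changing activations and reusing cached values for the rest, can be shown on a flat list of numbers. The function name `sparse_delta_step`, the `keep_fraction` parameter, and the toy data are assumptions made for this example. In the real method the sparsity pattern is determined at runtime without a full dense recompute, so the `fresh` vector here only stands in for the values a dense step would have produced.

```python
def sparse_delta_step(cached, fresh, keep_fraction):
    """Toy cache-plus-sparse-update step (hypothetical helper, for illustration).

    cached: activations stored from the previous inference step
    fresh:  activations a dense recompute would produce at this step
    keep_fraction: share of entries that are actually recomputed
    """
    n = len(cached)
    k = max(1, int(n * keep_fraction))
    # Magnitude of change per entry between the two steps.
    deltas = [abs(f - c) for f, c in zip(fresh, cached)]
    # Indices of the k fastest-changing entries.
    order = sorted(range(n), key=lambda i: deltas[i], reverse=True)
    selected = sorted(order[:k])
    chosen = set(selected)
    # Recompute only the selected entries; reuse the cache elsewhere.
    updated = [fresh[i] if i in chosen else cached[i] for i in range(n)]
    # Share of the total change captured by the recomputed entries.
    total = sum(deltas)
    explained = sum(deltas[i] for i in selected) / total if total > 0 else 1.0
    return updated, selected, explained


cached = [0.0] * 8
fresh = [0.0, 5.0, 0.1, 0.0, -3.0, 0.1, 0.0, 0.2]
updated, selected, explained = sparse_delta_step(cached, fresh, keep_fraction=0.25)
print(selected, round(explained, 3))
```

In this toy case, recomputing a quarter of the entries captures most of the change, which mirrors the kind of concentration the abstract reports for attention and MLP activations.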

[24] Learning Optical Flow Field via Neural Ordinary Differential Equation

Leyla Mirvakhabova,Hong Cai,Jisoo Jeong,Hanno Ackermann,Farhad Zanjani,Fatih Porikli

Main category: cs.CV

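TL;DR: Proposes a continuous model based on neural ODEs that dynamically adjusts the number of compute steps to improve optical flow estimation.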

Details

Motivation: Existing methods refine the flow with a fixed number of iterative steps, which can be suboptimal because the step count is not tailored to the input data; a more flexible mechanism is needed. Method: A neural ODE predicts the derivative of the flow and dynamically adjusts the number of compute steps, realized through a particular architecture and ODE solver. Result: The approach significantly outperforms baseline models on optical flow benchmarks while requiring only a single refinement step. Conclusion: The neural ODE model offers a more flexible and efficient solution for optical flow estimation, with significantly improved performance. Abstract: Recent works on optical flow estimation use neural networks to predict the flow field that maps positions of one image to positions of the other. These networks consist of a feature extractor, a correlation volume, and finally several refinement steps. These refinement steps mimic the iterative refinements performed by classical optimization algorithms and are usually implemented by neural layers (e.g., GRU) which are recurrently executed for a fixed and pre-determined number of steps. However, relying on a fixed number of steps may result in suboptimal performance because it is not tailored to the input data. In this paper, we introduce a novel approach for predicting the derivative of the flow using a continuous model, namely neural ordinary differential equations (ODE). One key advantage of this approach is its capacity to model an equilibrium process, dynamically adjusting the number of compute steps based on the data at hand. By following a particular neural architecture, ODE solver, and associated hyperparameters, our proposed model can replicate the exact same updates as recurrent cells used in existing works, offering greater generality. Through extensive experimental analysis on optical flow benchmarks, we demonstrate that our approach achieves an impressive improvement over baseline and existing models, all while requiring only a single refinement step.
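
Illustrative sketch (not the paper's model): the contrast between a fixed step count and data-dependent stopping can be shown with a plain explicit Euler integrator that halts once the update becomes negligible. The names `refine_flow`, `step_size`, `tol`, and the toy derivative are assumptions for this example; in the paper the derivative would come from a learned network and the solver may differ.

```python
def refine_flow(flow, derivative, step_size=0.5, tol=1e-3, max_steps=100):
    """Integrate d(flow)/dt = derivative(flow) with explicit Euler steps.

    Stops early when the largest per-component update falls below `tol`,
    so the number of steps depends on the input rather than being fixed.
    """
    steps = 0
    for steps in range(1, max_steps + 1):
        d = derivative(flow)
        flow = [f + step_size * di for f, di in zip(flow, d)]
        if max(abs(step_size * di) for di in d) < tol:
            break
    return flow, steps


# Toy stand-in for a learned derivative: pulls the flow toward an equilibrium.
target = [2.0, -1.0]


def toy_derivative(flow):
    return [t - f for t, f in zip(target, flow)]


flow_far, steps_far = refine_flow([0.0, 0.0], toy_derivative)
flow_near, steps_near = refine_flow([1.9, -0.9], toy_derivative)
print(steps_far, steps_near)
```

An initial estimate that is already close to the equilibrium needs fewer steps than a poor one, which is the behavior a fixed, pre-determined step count cannot provide.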

[25] SportMamba: Adaptive Non-Linear Multi-Object Tracking with State Space Models for Team Sports

Dheeraj Khanna,Jerrin Bright,Yuhao Chen,John S. Zelek

Main category: cs.CV

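TL;DR: SportMamba is a multi-object tracking technique for team sports that introduces a mamba-attention mechanism and a height-adaptive spatial association metric to handle the challenges of fast motion and occlusion.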

Details

Motivation: Multi-object tracking in team sports is highly challenging because of fast motion and frequent occlusion; existing methods rely on detection and appearance-based tracking and struggle with complex, non-linear motion patterns. Method: A mamba-attention mechanism models non-linear motion, a height-adaptive spatial association metric reduces ID switches caused by occlusion, and the detection search space is extended to cope with fast motion. Result: The method performs strongly on the SportsMOT dataset and shows zero-shot transfer to the ice hockey dataset VIP-HTD. Conclusion: SportMamba achieves state-of-the-art tracking performance in complex motion scenarios and generalizes well. Abstract: Multi-object tracking (MOT) in team sports is particularly challenging due to the fast-paced motion and frequent occlusions resulting in motion blur and identity switches, respectively. Predicting player positions in such scenarios is particularly difficult due to the observed highly non-linear motion patterns. Current methods are heavily reliant on object detection and appearance-based tracking, which struggle to perform in complex team sports scenarios, where appearance cues are ambiguous and motion patterns do not necessarily follow a linear pattern. To address these challenges, we introduce SportMamba, an adaptive hybrid MOT technique specifically designed for tracking in dynamic team sports. The technical contribution of SportMamba is twofold. First, we introduce a mamba-attention mechanism that models non-linear motion by implicitly focusing on relevant embedding dependencies. Second, we propose a height-adaptive spatial association metric to reduce ID switches caused by partial occlusions by accounting for scale variations due to depth changes. Additionally, we extend the detection search space with adaptive buffers to improve associations in fast-motion scenarios. Our proposed technique, SportMamba, demonstrates state-of-the-art performance on various metrics in the SportsMOT dataset, which is characterized by complex motion and severe occlusion. Furthermore, we demonstrate its generalization capability through zero-shot transfer to VIP-HTD, an ice hockey dataset.

[26] Seeing the Arrow of Time in Large Multimodal Models

Zihui Xue,Mi Luo,Kristen Grauman

Main category: cs.CV

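TL;DR: The paper proposes ArrowRL, a reinforcement learning-based training strategy that uses a reverse reward to improve large models' perception of temporal directionality in video, and validates it on the new AoTBench benchmark.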

Details

Motivation: Modern large multimodal models (LMMs) struggle to perceive and use temporal directionality in video understanding, which obstructs deeper temporal understanding. Method: The paper proposes ArrowRL, a reinforcement learning-based training strategy whose reverse reward encourages the model to interpret forward and reversed video frames differently. Result: ArrowRL greatly improves temporal perception, with peak accuracy gains of over 20% on AoTBench and over 10% on standard video question answering benchmarks. Conclusion: ArrowRL's effectiveness confirms the importance of dedicated Arrow of Time understanding in LMMs. Abstract: The Arrow of Time (AoT), time's irreversible flow shaping physical events, is fundamental to video comprehension, yet remains a significant challenge for modern large multimodal models (LMMs). Current LMMs struggle to perceive and utilize temporal directionality in video when responding to language queries, obstructing deeper temporal understanding. We tackle this deficiency by first providing a critical analysis of existing benchmarks and models. We then introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills AoT awareness by encouraging divergent video interpretations between forward and reversed visual frames. For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions. Experiments show ArrowRL greatly advances temporal perception: it not only achieves substantial improvements on our challenging AoTBench but also demonstrably boosts performance on standard video question answering (VQA) benchmarks (with peak accuracy gains reaching over 20% and 10% respectively). This validates ArrowRL's effectiveness and highlights the critical need for dedicated AoT understanding in LMMs.

[27] Semiconductor SEM Image Defect Classification Using Supervised and Semi-Supervised Learning with Vision Transformers

Chien-Fu Huang,Katherine Sieg,Leonid Karlinksy,Nash Flores,Rebekah Sheraw,Xin Zhang

Main category: cs.CV

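TL;DR: The paper proposes using vision transformer (ViT) neural networks to automatically classify semiconductor wafer defects, achieving high accuracy from few samples.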

Details

Motivation: Defect control in semiconductor processes is critical for yield, cost, and reliability, while traditional manual classification is inefficient and prone to subjective bias. Method: ViT neural networks, combined with DinoV2 transfer learning and semi-supervised learning, classify 11 defect types across more than 7,400 SEM images. Result: Classification accuracy exceeds 90% with fewer than 15 images per class. Conclusion: The approach offers a feasible path to fast, flexible wafer defect classification. Abstract: Controlling defects in semiconductor processes is important for maintaining yield, improving production cost, and preventing time-dependent critical component failures. Electron beam-based imaging has been used as a tool to survey wafers in the line and inspect for defects. However, manual classification of images for these nano-scale defects is limited by time, labor constraints, and human biases. In recent years, deep learning computer vision algorithms have been shown to be effective solutions for image-based inspection applications in industry. This work proposes the application of vision transformer (ViT) neural networks for automatic defect classification (ADC) of scanning electron microscope (SEM) images of wafer defects. We evaluated our proposed methods on 300mm wafer semiconductor defect data from our fab in IBM Albany. We studied 11 defect types from over 7400 total images and investigated the potential of transfer learning of DinoV2 and semi-supervised learning for improved classification accuracy and efficient computation. We were able to achieve classification accuracies of over 90% with fewer than 15 images per defect class. Our work demonstrates the potential to apply the proposed framework for a platform-agnostic in-house classification tool with faster turnaround time and flexibility.

[28] Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views

Xiaonan Wang,Bo Shao,Hansaem Kim

Main category: cs.CV

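TL;DR: The paper introduces KoreaGEO Bench, a fine-grained multimodal geolocation benchmark for Korean street views, designed to address the shortcomings of existing benchmarks.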

Details

Motivation: Vision-language models (VLMs) have advanced image-based geolocation, but existing benchmarks are coarse-grained, linguistically biased, and lack multimodal and privacy-aware evaluation. Method: The authors build a Korean street-view dataset of 1,080 high-resolution images covering four urban clusters and nine place types, with multi-contextual annotations and two styles of Korean captions, and test ten mainstream VLMs under a three-path evaluation protocol. Result: Input modality has a marked effect on localization precision, and the models show structural prediction biases toward core cities. Conclusion: KoreaGEO Bench fills gaps in existing benchmarks and provides a more comprehensive tool for multimodal geolocation and privacy risk evaluation. Abstract: Recent advances in vision-language models (VLMs) have enabled accurate image-based geolocation, raising serious concerns about location privacy risks in everyday social media posts. However, current benchmarks remain coarse-grained, linguistically biased, and lack multimodal and privacy-aware evaluations. To address these gaps, we present KoreaGEO Bench, the first fine-grained, multimodal geolocation benchmark for Korean street views. Our dataset comprises 1,080 high-resolution images sampled across four urban clusters and nine place types, enriched with multi-contextual annotations and two styles of Korean captions simulating real-world privacy exposure. We introduce a three-path evaluation protocol to assess ten mainstream VLMs under varying input modalities and analyze their accuracy, spatial bias, and reasoning behavior. Results reveal modality-driven shifts in localization precision and highlight structural prediction biases toward core cities.

[29] A Foundation Model for Spatial Proteomics

Muhammad Shaban,Yuzhou Chang,Huaying Qiu,Yao Yu Yeo,Andrew H. Song,Guillaume Jaume,Yuchen Wang,Luca L. Weishaupt,Tong Ding,Anurag Vaidya,Abdallah Lamane,Daniel Shao,Mohammed Zidane,Yunhao Bai,Paige McCallum,Shuli Luo,Wenrui Wu,Yang Wang,Precious Cramer,Chi Ngai Chan,Pierre Stephan,Johanna Schaffenrath,Jia Le Lee,Hendrik A. Michel,Caiwei Tian,Cristina Almagro-Perez,Sophia J. Wagner,Sharifa Sahai,Ming Y. Lu,Richard J. Chen,Andrew Zhang,Mark Edward M. Gonzales,Ahmad Makky,Jia-Ying Joey Lee,Hao Cheng,Nourhan El Ahmar,Sayed Matar,Maximilian Haist,Darci Phillips,Yuqi Tan,Garry P. Nolan,W. Richard Burack,Jacob D. Estes,Jonathan T. C. Liu,Toni K Choueiri,Neeraj Agarwal,Marc Barry,Scott J. Rodig,Long Phi Le,Georg Gerber,Christian M. Schürch,Fabian J. Theis,Youn H Kim,Joe Yeong,Sabina Signoretti,Brooke E. Howitt,Lit-Hsin Loo,Qin Ma,Sizun Jiang,Faisal Mahmood

Main category: cs.CV

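TL;DR: KRONOS is a foundation model built for spatial proteomics, trained with self-supervised learning on multi-platform data; it supports a range of downstream tasks and performs strongly across them.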

Details

Motivation: Spatial proteomics maps proteins at single-cell resolution, but existing foundation models have had limited impact on it, so a purpose-built model is needed. Method: KRONOS is trained with self-supervised learning on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 imaging platforms, with architectural adaptations for multi-channel, heterogeneous data. Result: KRONOS performs strongly across 11 independent cohorts, supports tasks such as cell phenotyping, region classification, and patient stratification, and is highly data-efficient. Conclusion: KRONOS is a flexible and scalable tool for spatial proteomics that supports cross-institutional comparison and reverse image search. Abstract: Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segmentation-free patch-level processing for efficient and scalable spatial proteomics analysis, allowing cross-institutional comparisons and serving as an image reverse search engine for spatial patterns. Together, these results position KRONOS as a flexible and scalable tool for spatial proteomics. The model is publicly accessible at https://github.com/mahmoodlab/KRONOS.

Pengyu Chen,Xiao Huang,Teng Fei,Sicheng Wang

Main category: cs.CV

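TL;DR: The study examines how urban sounds correspond to visual scenes, using a multimodal approach to compare how well different visual representation strategies capture acoustic semantics.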

Details

Motivation: To explore the potential of environmental soundscapes in large-scale geographic analysis and study how sound relates to visual scenes. Method: Geo-referenced sound recordings are integrated with street-level and remote sensing imagery; models including AST, CLIP, and RemoteCLIP extract features, and cross-modal similarity is evaluated. Result: Street view embeddings align more strongly with sounds, while remote sensing segmentation is more effective for ecological categorization under the BGA framework. Conclusion: Embedding models offer better semantic alignment, while segmentation methods give more interpretable links between visual structure and acoustic ecology, advancing the field of multimodal urban sensing. Abstract: Environmental soundscapes convey substantial ecological and social information regarding urban environments; however, their potential remains largely untapped in large-scale geographic analysis. In this study, we investigate the extent to which urban sounds correspond with visual scenes by comparing various visual representation strategies in capturing acoustic semantics. We employ a multimodal approach that integrates geo-referenced sound recordings with both street-level and remote sensing imagery across three major global cities: London, New York, and Tokyo. Utilizing the AST model for audio, along with CLIP and RemoteCLIP for imagery, as well as CLIPSeg and Seg-Earth OV for semantic segmentation, we extract embeddings and class-level features to evaluate cross-modal similarity. The results indicate that street view embeddings demonstrate stronger alignment with environmental sounds compared to segmentation outputs, whereas remote sensing segmentation is more effective in interpreting ecological categories through a Biophony-Geophony-Anthrophony (BGA) framework. These findings imply that embedding-based models offer superior semantic alignment, while segmentation-based methods provide interpretable links between visual structure and acoustic ecology. This work advances the burgeoning field of multimodal urban sensing by offering novel perspectives for incorporating sound into geospatial analysis.
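
Illustrative sketch (an assumption, not the paper's stated procedure): the abstract does not say which measure is used to score cross-modal similarity. Cosine similarity is one common choice once audio and image features live in a comparable vector space, and the toy example below shows it. The helper names and the two-dimensional vectors are hypothetical; real AST or CLIP embeddings have hundreds of dimensions and would need a shared space or a learned mapping before such a comparison is meaningful.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(audio_vec, image_vecs):
    """Return the key of the image vector most similar to the audio vector."""
    return max(image_vecs, key=lambda name: cosine_similarity(audio_vec, image_vecs[name]))


# Toy vectors standing in for embeddings that already share a space.
audio = [1.0, 0.0]
images = {"street_view": [0.9, 0.1], "remote_sensing": [0.1, 0.9]}
match = best_match(audio, images)
print(match)
```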

[31] Temporal Vegetation Index-Based Unsupervised Crop Stress Detection via Eigenvector-Guided Contrastive Learning

Shafqaat Ahmad

Main category: cs.CV

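TL;DR: EigenCL is an unsupervised contrastive learning framework that uses temporal NDRE dynamics and eigen decomposition to detect crop stress early, outperforming traditional methods.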

Details

Motivation: Early detection of crop stress is vital for precision agriculture, but traditional methods depend on labeled data or detect stress only after visible symptoms appear, which limits scalability. Method: An RBF similarity matrix is built from NDRE time series, and the principal eigenvector defines a stress-aware similarity used for contrastive learning. Result: The learned embeddings form physiologically meaningful clusters, enabling 76% early stress detection, with classification accuracy of 95% (k-NN) and 91% (logistic regression). Conclusion: EigenCL offers a label-free, scalable approach to early stress detection suited to data-scarce agricultural environments. Abstract: Early detection of crop stress is vital for minimizing yield loss and enabling timely intervention in precision agriculture. Traditional approaches using NDRE often detect stress only after visible symptoms appear or require labeled datasets, limiting scalability. This study introduces EigenCL, a novel unsupervised contrastive learning framework guided by temporal NDRE dynamics and biologically grounded eigen decomposition. Using over 10,000 Sentinel-2 NDRE image patches from drought-affected Iowa cornfields, we constructed five-point NDRE time series per patch and derived an RBF similarity matrix. The principal eigenvector explaining 76% of the variance and strongly correlated (r = 0.95) with raw NDRE values was used to define stress-aware similarity for contrastive embedding learning. Unlike existing methods that rely on visual augmentations, EigenCL pulls embeddings together based on biologically similar stress trajectories and pushes apart divergent ones. The learned embeddings formed physiologically meaningful clusters, achieving superior clustering metrics (Silhouette: 0.748, DBI: 0.35) and enabling 76% early stress detection up to 12 days before conventional NDRE thresholds. Downstream classification yielded 95% k-NN and 91% logistic regression accuracy. Validation on an independent 2023 Nebraska dataset confirmed generalizability without retraining. EigenCL offers a label-free, scalable approach for early stress detection that aligns with underlying plant physiology and is suitable for real-world deployment in data-scarce agricultural environments.
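
Illustrative sketch (not the authors' code): the first two steps named in the abstract, building an RBF similarity matrix from per-patch NDRE time series and extracting its principal eigenvector, can be reproduced on toy data with the standard library. The bandwidth `gamma`, the power-iteration solver, and the four example series are assumptions for this example. How the eigenvector is then turned into positive and negative pairs for contrastive training is not specified in the abstract, so the sketch stops at the eigenvector.

```python
import math


def rbf_similarity(series, gamma=10.0):
    """Pairwise RBF similarity exp(-gamma * squared distance) between time series."""
    n = len(series)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(series[i], series[j]))
            sim[i][j] = math.exp(-gamma * d2)
    return sim


def principal_eigenvector(matrix, iters=200):
    """Approximate the dominant eigenvector by power iteration (unit length)."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy five-point NDRE series: two stable patches, two with declining NDRE.
ndre = [
    [0.60, 0.61, 0.60, 0.62, 0.61],
    [0.59, 0.60, 0.61, 0.60, 0.60],
    [0.60, 0.55, 0.48, 0.40, 0.35],
    [0.61, 0.54, 0.47, 0.41, 0.34],
]
similarity = rbf_similarity(ndre)
eigvec = principal_eigenvector(similarity)
print([round(x, 3) for x in eigvec])
```

Patches with similar trajectories receive a similarity near 1, while a stable patch and a declining patch score much lower, which is the structure a stress-aware pairing rule could build on.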

[32] ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Yifan Li,Xin Li,Tianqin Li,Wenbin He,Yu Kong,Liu Ren

Main category: cs.CV

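TL;DR: ViT-Split is a new adaptation method for vision foundation models (VFMs) that separates the VFM into extractor and adapter components, addressing the gradient propagation and parameter tuning issues of existing methods and significantly improving efficiency and performance.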

Details

Motivation: Existing VFM adapters trigger early-layer gradient backpropagation, require complex tuning of all components, and underuse the VFM's prior knowledge. Method: ViT-Split divides the VFM into an extractor and an adapter and introduces a task head and a prior head, avoiding early gradient propagation and reducing tuning. Result: Experiments on several tasks (segmentation, detection, depth estimation, and visual question answering) confirm ViT-Split's efficiency: training time is reduced by up to 4x with results comparable to or better than other adapters. Conclusion: By restructuring the VFM and how it is trained, ViT-Split significantly improves efficiency and performance and offers a new approach to VFM adaptation. Abstract: Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between convolutional neural network (CNN) and VFM backbone triggers early layer gradient backpropagation. Second, existing methods require tuning all components, adding complexity. Besides, these adapters alter VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: the layers of several VFMs, like DINOv2, can be divided into two distinct components: an extractor for learning low-level features and an adapter for learning task-specific features. Leveraging this insight, we eliminate the CNN branch and introduce two heads, task head and prior head, to the frozen VFM. The task head is designed to learn task-specific features, mitigating the early gradient propagation issue. The prior head is used to leverage the multi-scale prior features from the frozen VFM, reducing tuning parameters and overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, depth estimation, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time up to 4x while achieving comparable or even better results on ADE20K, compared to other VFM adapters.

[33] Geometric Visual Fusion Graph Neural Networks for Multi-Person Human-Object Interaction Recognition in Videos

Tanqiu Qiao,Ruochen Li,Frederick W. B. Li,Yoshiki Kubotani,Shigeo Morishima,Hubert P. H. Shum

Main category: cs.CV

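TL;DR: The paper proposes GeoVis-GNN, a geometric visual fusion graph neural network that uses dual-attention feature fusion and interdependent entity graph learning to build from entity-specific representations toward high-level interaction understanding. It also introduces the MPHOI-120 dataset, which targets concurrent actions and partial engagement in dynamic multi-person interactions. Experiments show strong performance across a range of HOI scenarios.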

Details

Motivation: Human-object interaction (HOI) recognition in videos requires understanding how both visual patterns and geometric relationships change over time. Visual and geometric features have complementary strengths, but fusing them effectively without compromising their individual characteristics remains challenging. Method: GeoVis-GNN uses dual-attention feature fusion and interdependent entity graph learning to build interaction understanding progressively from entity-specific representations. The MPHOI-120 dataset is introduced, focusing on dynamic multi-person interactions. Result: The method performs strongly across HOI scenarios (two-person interactions, single-person activities, bimanual manipulations, and complex concurrent partial interactions), reaching state-of-the-art performance. Conclusion: Through effective multimodal feature fusion and entity-specific representations, GeoVis-GNN significantly improves HOI recognition, and MPHOI-120 supports research on complex interactions. Abstract: Human-Object Interaction (HOI) recognition in videos requires understanding both visual patterns and geometric relationships as they evolve over time. Visual and geometric features offer complementary strengths. Visual features capture appearance context, while geometric features provide structural patterns. Effectively fusing these multimodal features without compromising their unique characteristics remains challenging. We observe that establishing robust, entity-specific representations before modeling interactions helps preserve the strengths of each modality. Therefore, we hypothesize that a bottom-up approach is crucial for effective multimodal fusion. Following this insight, we propose the Geometric Visual Fusion Graph Neural Network (GeoVis-GNN), which uses dual-attention feature fusion combined with interdependent entity graph learning. It progressively builds from entity-specific representations toward high-level interaction understanding. To advance HOI recognition to real-world scenarios, we introduce the Concurrent Partial Interaction Dataset (MPHOI-120). It captures dynamic multi-person interactions involving concurrent actions and partial engagement. This dataset helps address challenges like complex human-object dynamics and mutual occlusions. Extensive experiments demonstrate the effectiveness of our method across various HOI scenarios. These scenarios include two-person interactions, single-person activities, bimanual manipulations, and complex concurrent partial interactions. Our method achieves state-of-the-art performance.

[34] RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

Bimsara Pathiraja,Maitreya Patel,Shivam Singh,Yezhou Yang,Chitta Baral

Main category: cs.CV

TL;DR: RefEdit is an instruction-based image editing model trained on synthetic data that outperforms existing methods in complex scenes.

Details

Motivation: Existing methods perform poorly when editing complex scenes with multiple entities; this gap needs to be quantified and closed. Method: The authors introduce the RefEdit-Bench benchmark and develop the RefEdit model, trained on data from a synthetic generation pipeline. Result: Trained on a small amount of data, RefEdit outperforms baselines trained on far more, and achieves state-of-the-art results on several benchmarks. Conclusion: RefEdit performs strongly on complex scene editing, with results comparable to closed-source methods. Abstract: Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce RefEdit, an instruction-based editing model trained on our scalable synthetic data generation pipeline. Our RefEdit, trained on only 20,000 editing triplets, outperforms the Flux/SD3 model-based baselines trained on millions of data. Extensive evaluations across various benchmarks demonstrate that our model not only excels in referring expression tasks but also enhances performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We release data & checkpoint for reproducibility.

[35] The effects of using created synthetic images in computer vision training

John W. Smutny

Main category: cs.CV

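TL;DR: The study examines using synthetic images generated with Unreal Engine 4 to supplement datasets for computer vision models, and finds that synthetic images can substantially reduce the number of real images needed while preserving model performance.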

Details

Motivation: Real image data is costly to collect and vulnerable to poisoning; the paper offers a cheap, scalable alternative. Method: Generates synthetic images with UE4 and tests their effect on the performance of VGG16 and MobileNetV3-small on cat-vs-dog classification and weld defect detection. Result: Synthetic images allow training with only 10% of the real images while keeping performance close to training on real images alone. Adding a small number of real images to a synthetic-only dataset markedly improves the model. Conclusion: Synthetic images offer an efficient solution for data-scarce projects; the paper also provides usage guidelines and notes potential restrictions. Abstract: This paper investigates how rendering engines, like Unreal Engine 4 (UE), can be used to create synthetic images to supplement datasets for deep computer vision (CV) models in image abundant and image limited use cases. Using rendered synthetic images from UE can provide developers and businesses with a method of accessing nearly unlimited, reproducible, agile, and cheap training sets for their customers and applications without the threat of poisoned images from the internet or the cost of collecting them. The validity of these generated images is examined by testing the change in model test accuracy in two differently sized CV models across two binary classification cases (Cat vs Dog and Weld Defect Detection). In addition, this paper provides an implementation of how to measure the quality of synthetic images by using pre-trained CV models as auditors. Results imply that for large (VGG16) and small (MobileNetV3-small) parameter deep CV models, adding >60% additional synthetic images to a real image dataset during model training can narrow the test-training accuracy gap to ~1-2% without a conclusive effect on test accuracy compared to using real world images alone. Likewise, adding <10% additional real training images to synthetic only training sets cut the classification error rate in half, with further decreases when adding more real training images. For these cases tested, using synthetic images from rendering engines allows researchers to use only 10% of their real images during training, compared to the traditional 50-70%. This research serves as an example of how to create synthetic images, guidelines on how to use the images, potential restrictions and possible performance improvements for data-scarce projects.

[36] RoNFA: Robust Neural Field-based Approach for Few-Shot Image Classification with Noisy Labels

Nan Xiang,Lifeng Xing,Dequan Jin

Main category: cs.CV

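TL;DR: This paper proposes RoNFA, a robust neural field-based method that addresses label noise in few-shot learning (FSL), using two neural fields for feature and category representation to improve robustness and classification accuracy.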

Details

Motivation: In few-shot learning, labeled samples are scarce and label errors are unavoidable, which can significantly reduce classification accuracy. Improving model robustness under label noise is therefore critical. Method: RoNFA uses two neural fields to represent features and categories. Each neuron in the field for category representation (FCR) has an adjustable receptive field (RF) on the field for feature representation (FFR), centered at a category-representative neuron generated by soft clustering. In the prediction stage, the receptive field range adapts to the neuronal activation in the FCR. Result: Experiments on real-world FSL datasets with three types of label noise show that RoNFA significantly outperforms existing methods; its accuracy with noisy labels even surpasses that of existing methods on clean support sets. Conclusion: RoNFA shows strong few-shot learning capability and robustness to label noise, offering an effective solution to label noise in practical applications. Abstract: In few-shot learning (FSL), the labeled samples are scarce. Thus, label errors can significantly reduce classification accuracy. Since label errors are inevitable in realistic learning tasks, improving the robustness of the model in the presence of label errors is critical. This paper proposes a new robust neural field-based image approach (RoNFA) for few-shot image classification with noisy labels. RoNFA consists of two neural fields for feature and category representation. They correspond to the feature space and category set. Each neuron in the field for category representation (FCR) has a receptive field (RF) on the field for feature representation (FFR) centered at the representative neuron for its category generated by soft clustering. In the prediction stage, the range of these receptive fields adapts according to the neuronal activation in FCR to ensure prediction accuracy. These learning strategies provide the proposed model with excellent few-shot learning capability and strong robustness against label noise. The experimental results on real-world FSL datasets with three different types of label noise demonstrate that the proposed method significantly outperforms state-of-the-art FSL methods. Its accuracy obtained in the presence of noisy labels even surpasses the results obtained by state-of-the-art FSL methods trained on clean support sets, indicating its strong robustness against noisy labels.

[37] MamFusion: Multi-Mamba with Temporal Fusion for Partially Relevant Video Retrieval

Xinru Ying,Jiaqi Mo,Jingyang Lin,Canghong Jin,Fangfang Wang,Lina Wei

Main category: cs.CV

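TL;DR: Proposes MamFusion, a framework built on multiple Mamba modules for partially relevant video retrieval (PRVR), which uses temporal fusion to improve understanding of long-sequence video content.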

Details

Motivation: Partially relevant video retrieval faces challenges of information redundancy and long-sequence video content understanding. Method: Building on the long-term state space modeling capability of the Mamba module, proposes a multi-Mamba framework with temporal fusion (MamFusion), including Temporal T-to-V Fusion and Temporal V-to-T Fusion. Result: Validated on large-scale datasets, MamFusion achieves state-of-the-art retrieval effectiveness. Conclusion: The MamFusion framework improves performance on the PRVR task and offers a new direction for long-sequence video content understanding. Abstract: Partially Relevant Video Retrieval (PRVR) is a challenging task in the domain of multimedia retrieval. It is designed to identify and retrieve untrimmed videos that are partially relevant to the provided query. In this work, we investigate long-sequence video content understanding to address information redundancy issues. Leveraging the outstanding long-term state space modeling capability and linear scalability of the Mamba module, we introduce a multi-Mamba module with temporal fusion framework (MamFusion) tailored for the PRVR task. This framework effectively captures the state-relatedness in long-term video content and seamlessly integrates it into text-video relevance understanding, thereby enhancing the retrieval process. Specifically, we introduce Temporal T-to-V Fusion and Temporal V-to-T Fusion to explicitly model temporal relationships between text queries and video moments, improving contextual awareness and retrieval accuracy. Extensive experiments conducted on large-scale datasets demonstrate that MamFusion achieves state-of-the-art performance in retrieval effectiveness. Code is available at the link: https://github.com/Vision-Multimodal-Lab-HZCU/MamFusion.

[38] Heterogeneous Skeleton-Based Action Representation Learning

Hongsong Wang,Xiaoyan Ma,Jidong Kuang,Jie Gui

Main category: cs.CV

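TL;DR: The paper proposes a framework for heterogeneous skeleton data with two parts, heterogeneous skeleton processing and unified representation learning, and validates it experimentally.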

Details

Motivation: Skeleton data is heterogeneous because it comes from different sources, but existing methods design models only for homogeneous skeletons and overlook this issue. Method: The framework comprises heterogeneous skeleton processing (converting 2D skeletons to 3D and constructing a unified skeleton) and unified representation learning (processing heterogeneous skeletons with a shared backbone network). Result: Experiments on the NTU-60, NTU-120, and PKU-MMD II datasets demonstrate the effectiveness of the method. Conclusion: The method can be applied to action recognition in robots with different humanoid structures. Abstract: Skeleton-based human action recognition has received widespread attention in recent years due to its diverse range of application scenarios. Due to the different sources of human skeletons, skeleton data naturally exhibit heterogeneity. The previous works, however, overlook the heterogeneity of human skeletons and solely construct models tailored for homogeneous skeletons. This work addresses the challenge of heterogeneous skeleton-based action representation learning, specifically focusing on processing skeleton data that varies in joint dimensions and topological structures. The proposed framework comprises two primary components: heterogeneous skeleton processing and unified representation learning. The former first converts two-dimensional skeleton data into three-dimensional skeleton data via an auxiliary network, and then constructs a prompted unified skeleton using skeleton-specific prompts. We also design an additional modality named semantic motion encoding to harness the semantic information within skeletons. The latter module learns a unified action representation using a shared backbone network that processes different heterogeneous skeletons. Extensive experiments on the NTU-60, NTU-120, and PKU-MMD II datasets demonstrate the effectiveness of our method in various tasks of action understanding. Our approach can be applied to action recognition in robots with different humanoid structures.

[39] CHIME: Conditional Hallucination and Integrated Multi-scale Enhancement for Time Series Diffusion Model

Yuxuan Chen,Haipeng Xie

Main category: cs.CV

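TL;DR: CHIME is a multi-scale enhancement framework for time series diffusion models that uses multi-scale decomposition and adaptive integration to address feature alignment and generative capability, and performs well in few-shot scenarios.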

Details

Motivation: Existing studies face challenges in multi-scale feature alignment and in generative capability across entities and long time scales; CHIME aims to address these. Method: Uses multi-scale decomposition and adaptive integration, combined with a feature hallucination module, to align and transfer time series features. Result: On public datasets, CHIME achieves state-of-the-art performance and shows strong generative generalization in few-shot scenarios. Conclusion: Through multi-scale enhancement and the feature hallucination module, CHIME significantly improves the performance of time series diffusion models. Abstract: The denoising diffusion probabilistic model has become a mainstream generative model, achieving significant success in various computer vision tasks. Recently, there has been initial exploration of applying diffusion models to time series tasks. However, existing studies still face challenges in multi-scale feature alignment and generative capabilities across different entities and long-time scales. In this paper, we propose CHIME, a conditional hallucination and integrated multi-scale enhancement framework for time series diffusion models. By employing multi-scale decomposition and adaptive integration, CHIME captures the decomposed features of time series, achieving in-domain distribution alignment between generated and original samples. In addition, we introduce a feature hallucination module in the conditional denoising process, enabling the transfer of temporal features through the training of category-independent transformation layers. Experimental results on publicly available real-world datasets demonstrate that CHIME achieves state-of-the-art performance and exhibits excellent generative generalization capabilities in few-shot scenarios.

[40] EDCFlow: Exploring Temporally Dense Difference Maps for Event-based Optical Flow Estimation

Daikun Liu,Lei Cheng,Teng Wang,Changyin Sun

Main category: cs.CV

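TL;DR: EDCFlow is a lightweight event-based optical flow network that combines temporally dense feature differences with cost volumes to estimate flow at high resolution, outperforming existing methods.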

Details

Motivation: Existing event-based optical flow methods suffer from redundant computation and limited scalability to higher resolutions; EDCFlow aims to address both. Method: Develops an attention-based multi-scale temporal feature difference layer that efficiently captures motion patterns at high resolution, and adaptively fuses high-resolution and low-resolution motion features. Result: EDCFlow outperforms existing methods in both performance and complexity, and can serve as a plug-in module for RAFT-like methods. Conclusion: Through efficient feature fusion and a lightweight design, EDCFlow achieves high-quality optical flow estimation with superior generalization. Abstract: Recent learning-based methods for event-based optical flow estimation utilize cost volumes for pixel matching but suffer from redundant computations and limited scalability to higher resolutions for flow refinement. In this work, we take advantage of the complementarity between temporally dense feature differences of adjacent event frames and cost volume and present a lightweight event-based optical flow network (EDCFlow) to achieve high-quality flow estimation at a higher resolution. Specifically, an attention-based multi-scale temporal feature difference layer is developed to capture diverse motion patterns at high resolution in a computation-efficient manner. An adaptive fusion of high-resolution difference motion features and low-resolution correlation motion features is performed to enhance motion representation and model generalization. Notably, EDCFlow can serve as a plug-and-play refinement module for RAFT-like event-based methods to enhance flow details. Extensive experiments demonstrate that EDCFlow achieves better performance with lower complexity compared to existing methods, offering superior generalization.

[41] DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

Ziyi Wu,Anil Kag,Ivan Skorokhodov,Willi Menapace,Ashkan Mirzaei,Igor Gilitschenski,Sergey Tulyakov,Aliaksandr Siarohin

Main category: cs.CV

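TL;DR: DenseDPO improves Direct Preference Optimization (DPO) by generating aligned video pairs and labeling preferences per segment, which removes the motion bias of the original method, and it enables automatic annotation with vision language models.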

Details

Motivation: The original DPO method, applied to text-to-video diffusion models, suffers from motion bias and coarse-grained comparisons; DenseDPO aims to fix these shortcomings. Method: DenseDPO generates aligned video pairs, labels preferences on segments, and uses vision language models for automatic annotation, improving the quality and efficiency of the training data. Result: Using only one-third of the labeled data, DenseDPO greatly improves motion generation while maintaining text alignment, visual quality, and temporal consistency. Conclusion: DenseDPO addresses the limitations of the original method and, with automatic annotation, achieves performance close to that obtained with human labels. Abstract: Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
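
The idea of labeling preferences on short segments rather than whole clips can be sketched as a per-segment preference loss. The snippet below is only an illustration under stated assumptions: it uses the generic DPO-style log-sigmoid (Bradley-Terry) objective, the per-segment implicit rewards `reward_a` and `reward_b` are hypothetical inputs, and the names `dense_dpo_loss` and `beta` are not taken from the paper. The actual DenseDPO objective for diffusion models may differ.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dense_dpo_loss(reward_a, reward_b, labels, beta=1.0):
    """Average a DPO-style loss over labeled segments.

    reward_a, reward_b: hypothetical per-segment implicit rewards of videos A and B.
    labels: +1 if segment of A is preferred, -1 if B is preferred, 0 for a tie (skipped).
    """
    terms = [
        -log_sigmoid(beta * lab * (ra - rb))
        for ra, rb, lab in zip(reward_a, reward_b, labels)
        if lab != 0
    ]
    # With no labeled segments there is no learning signal.
    return sum(terms) / len(terms) if terms else 0.0
```

Because ties are skipped and each labeled segment contributes its own term, a single video pair can supply several training signals, which is the "denser" signal the abstract refers to.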

[42] Target Semantics Clustering via Text Representations for Robust Universal Domain Adaptation

Weinan He,Zilei Wang,Yixin Zhang

Main category: cs.CV

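TL;DR: Proposes TASC, a universal domain adaptation method based on vision-language models that searches for semantic centers in the text representation space, giving simple and robust domain alignment.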

Details

Motivation: In universal domain adaptation, domain shift and unknown category shift make semantic centers complex and alignment less robust. Method: Two stages: 1) a greedy search framework finds target semantic centers in the text representation space; 2) with the search results fixed, the encoders are optimized by gradient descent. The paper also proposes the UniMS scoring function for detecting open-set samples. Result: Strong results on four benchmarks, reaching state-of-the-art performance. Conclusion: By constraining the search to the text representation space, TASC significantly improves the robustness and effectiveness of universal domain adaptation. Abstract: Universal Domain Adaptation (UniDA) focuses on transferring source domain knowledge to the target domain under both domain shift and unknown category shift. Its main challenge lies in identifying common class samples and aligning them. Current methods typically obtain target domain semantic centers from an unconstrained continuous image representation space. Due to domain shift and the unknown number of clusters, these centers often result in complex and less robust alignment algorithms. In this paper, based on vision-language models, we search for semantic centers in a semantically meaningful and discrete text representation space. The constrained space ensures almost no domain bias and appropriate semantic granularity for these centers, enabling a simple and robust adaptation algorithm. Specifically, we propose TArget Semantics Clustering (TASC) via Text Representations, which leverages information maximization as a unified objective and involves two stages. First, with the frozen encoders, a greedy search-based framework is used to search for an optimal set of text embeddings to represent target semantics. Second, with the search results fixed, encoders are refined based on gradient descent, simultaneously achieving robust domain alignment and private class clustering. Additionally, we propose Universal Maximum Similarity (UniMS), a scoring function tailored for detecting open-set samples in UniDA. Experimentally, we evaluate the universality of UniDA algorithms under four category shift scenarios. Extensive experiments on four benchmarks demonstrate the effectiveness and robustness of our method, which has achieved state-of-the-art performance.

[43] Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning

Daeun Lee,Jaehong Yoon,Jaemin Cho,Mohit Bansal

Main category: cs.CV

TL;DR: Video-SKoT improves domain-adaptive video reasoning by automatically constructing skill-aware CoT supervision.

Details

Motivation: Existing CoT methods struggle to adapt to domain-specific skills across different video content, such as event detection and spatial relation understanding. Method: 1. Constructs skill-based CoT annotations; 2. Introduces a skill-specific expert learning framework. Result: On three video understanding benchmarks, Video-SKoT outperforms the baselines. Conclusion: With skill-aware CoT supervision, Video-SKoT significantly improves the domain adaptability of video reasoning. Abstract: Recent advances in Chain-of-Thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., event detection, spatial relation understanding, emotion understanding) over various video content. To address this, we propose Video-Skill-CoT (a.k.a. Video-SKoT), a framework that automatically constructs and leverages skill-aware CoT supervisions for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create detailed multi-step CoT rationale tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework. Each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-SKoT consistently outperforms strong baselines. We also provide in-depth analyses on comparing different CoT annotation pipelines and learned skills over multiple video domains.

[44] Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting

Chengqi Li,Zhihao Shi,Yangdi Lu,Wenbo He,Xiangyu Xu

Main category: cs.CV

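TL;DR: Proposes Asymmetric Dual 3DGS, a new framework that trains two 3D Gaussian Splatting models in parallel under a consistency constraint to reduce visual artifacts, and uses a divergent masking strategy to keep the two models from collapsing into the same failure modes.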

Details

Motivation: In 3D reconstruction from in-the-wild images, inconsistent lighting and transient distractors produce low-quality training data, and existing methods struggle to produce stable, consistent reconstructions. Method: Trains two 3DGS models in parallel with a consistency constraint and a divergent masking strategy (a multi-cue adaptive mask and a self-supervised soft mask), and introduces Dynamic EMA Proxy to improve efficiency. Result: Outperforms existing methods on real-world datasets while remaining efficient. Conclusion: Asymmetric Dual 3DGS effectively reduces artifacts and improves reconstruction quality; code and models will be released. Abstract: 3D reconstruction from in-the-wild images remains a challenging task due to inconsistent lighting conditions and transient distractors. Existing methods typically rely on heuristic strategies to handle the low-quality training data, which often struggle to produce stable and consistent reconstructions, frequently resulting in visual artifacts. In this work, we propose Asymmetric Dual 3DGS, a novel framework that leverages the stochastic nature of these artifacts: they tend to vary across different training runs due to minor randomness. Specifically, our method trains two 3D Gaussian Splatting (3DGS) models in parallel, enforcing a consistency constraint that encourages convergence on reliable scene geometry while suppressing inconsistent artifacts. To prevent the two models from collapsing into similar failure modes due to confirmation bias, we introduce a divergent masking strategy that applies two complementary masks: a multi-cue adaptive mask and a self-supervised soft mask, which leads to an asymmetric training process of the two models, reducing shared error modes. In addition, to improve the efficiency of model training, we introduce a lightweight variant called Dynamic EMA Proxy, which replaces one of the two models with a dynamically updated Exponential Moving Average (EMA) proxy, and employs an alternating masking strategy to preserve divergence. Extensive experiments on challenging real-world datasets demonstrate that our method consistently outperforms existing approaches while achieving high efficiency. Codes and trained models will be released.
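
As a rough illustration of the Exponential Moving Average (EMA) proxy mentioned in the abstract, the sketch below shows the generic EMA parameter update. It assumes parameters flattened into a list of floats and a fixed `decay`; the function name `ema_update` is hypothetical, and how the paper schedules the update "dynamically" is not described in the abstract, so that part is not modeled here.

```python
def ema_update(proxy, online, decay=0.99):
    """Blend the online model's parameters into the EMA proxy.

    proxy, online: lists of floats of equal length (flattened parameters).
    decay: assumed fixed smoothing factor in [0, 1]; higher means slower updates.
    """
    return [decay * p + (1.0 - decay) * w for p, w in zip(proxy, online)]


# Usage: call once per training step, after the online model is updated.
proxy_params = [0.0, 1.0]
online_params = [1.0, 1.0]
proxy_params = ema_update(proxy_params, online_params, decay=0.9)
```

The proxy needs no gradient computation of its own, which is why replacing one of the two trained models with it would reduce training cost.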

[45] WIFE-Fusion: Wavelet-aware Intra-inter Frequency Enhancement for Multi-model Image Fusion

Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui

Main category: cs.CV

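TL;DR: Proposes WIFE-Fusion, a multimodal image fusion framework based on frequency-domain interactions, which uses intra-frequency self-attention and inter-frequency interaction to improve feature extraction and fusion.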

Details

Motivation: Existing multimodal image fusion methods neglect frequency-domain feature exploration and interactive relationships, which limits fusion quality. Method: Proposes Intra-Frequency Self-Attention (IFSA) and Inter-Frequency Interaction (IFI), which use self-attention and interactions between frequency-domain components, respectively, to extract and enhance features. Result: Experiments on five datasets show that WIFE-Fusion outperforms current specialized and unified fusion methods. Conclusion: Through frequency-domain interaction, WIFE-Fusion achieves more precise feature extraction and fusion, with clear advantages. Abstract: Multimodal image fusion effectively aggregates information from diverse modalities, with fused images playing a crucial role in vision systems. However, existing methods often neglect frequency-domain feature exploration and interactive relationships. In this paper, we propose wavelet-aware Intra-inter Frequency Enhancement Fusion (WIFE-Fusion), a multimodal image fusion framework based on frequency-domain component interactions. Its core innovations include: Intra-Frequency Self-Attention (IFSA) that leverages inherent cross-modal correlations and complementarity through interactive self-attention mechanisms to extract enriched frequency-domain features, and Inter-Frequency Interaction (IFI) that enhances enriched features and filters latent features via combinatorial interactions between heterogeneous frequency-domain components across modalities. These processes achieve precise source feature extraction and unified modeling of feature extraction-aggregation. Extensive experiments on five datasets across three multimodal fusion tasks demonstrate WIFE-Fusion's superiority over current specialized and unified fusion methods. Our code is available at https://github.com/Lmmh058/WIFE-Fusion.

[46] DiagNet: Detecting Objects using Diagonal Constraints on Adjacency Matrix of Graph Neural Network

Chong Hyun Lee,Kibae Lee

Main category: cs.CV

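TL;DR: DiagNet is a new object detection method based on diagonal constraints on the adjacency matrix of a graph convolutional network (GCN). It uses hard- and soft-constraint algorithms and two loss functions, requires no anchor box design, and outperforms YOLO-series models on the Pascal VOC and MS COCO datasets.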

Details

Motivation: Conventional object detection methods rely on anchor box design; DiagNet aims to simplify the detection pipeline and improve performance through diagonal constraints on the GCN adjacency matrix. Method: Proposes two diagonalization algorithms (hard and soft constraints) and two loss functions, and validates them with the detection head of YOLO models. Result: On Pascal VOC, mAP50 is 7.5% higher than YOLOv1; on MS COCO, mAP is 5.1%, 3.7%, and 2.9% higher than YOLOv3u, YOLOv5u, and YOLOv8, respectively. Conclusion: DiagNet improves object detection performance through GCN diagonal constraints without anchor box design, and has practical potential. Abstract: We propose DiagNet, a new approach to object detection with which we can detect an object bounding box using diagonal constraints on the adjacency matrix of a graph convolutional network (GCN). We propose two diagonalization algorithms based on hard and soft constraints on the adjacency matrix and two loss functions using a diagonal constraint and a complementary constraint. DiagNet eliminates the need for designing the set of anchor boxes that is commonly used. To prove the feasibility of our novel detector, we adopt the detection head of YOLO models. Experiments show that DiagNet achieves 7.5% higher mAP50 on Pascal VOC than YOLOv1. DiagNet also shows 5.1% higher mAP on MS COCO than YOLOv3u, 3.7% higher mAP than YOLOv5u, and 2.9% higher mAP than YOLOv8.

[47] ViTSGMM: A Robust Semi-Supervised Image Recognition Network Using Sparse Labels

Rui Yann,Xianglei Xing

Main category: cs.CV

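TL;DR: ViTSGMM is an image recognition network that uses semi-supervised learning efficiently; by optimizing the mutual information between feature representations and target classes, it significantly improves generalization with limited labeled data.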

Details

Motivation: Existing methods rely on complex training techniques and architectures, and generalize poorly when labeled data is extremely limited. Method: Constructs a hierarchical mixture density classification decision mechanism that optimizes mutual information and compresses redundant information. Result: On the STL-10 and CIFAR-10/100 datasets, reaches state-of-the-art performance with very few labeled samples, and identifies a data leakage issue in the STL-10 dataset. Conclusion: ViTSGMM is efficient and reliable, and addresses key problems in semi-supervised learning. Abstract: We present ViTSGMM, an image recognition network that leverages semi-supervised learning in a highly efficient manner. Existing works often rely on complex training techniques and architectures, while their generalization ability when dealing with extremely limited labeled data remains to be improved. To address these limitations, we construct a hierarchical mixture density classification decision mechanism by optimizing mutual information between feature representations and target classes, compressing redundant information while retaining crucial discriminative components. Experimental results demonstrate that our method achieves state-of-the-art performance on STL-10 and CIFAR-10/100 datasets when using negligible labeled samples. Notably, this paper also reveals a long-overlooked data leakage issue in the STL-10 dataset for semi-supervised learning tasks and removes duplicates to ensure the reliability of experimental results. Code available at https://github.com/Shu1L0n9/ViTSGMM.

[48] A Large-Scale Referring Remote Sensing Image Segmentation Dataset and Benchmark

Zhigang Yang,Huiguang Yao,Linmao Tian,Xuezhi Zhao,Qiang Li,Qi Wang

Main category: cs.CV

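TL;DR: The paper introduces the NWPU-Refer dataset and the MRSNet framework, addressing the limitations of existing RRSIS datasets in resolution, diversity, and category coverage, and validates MRSNet's strong performance experimentally.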

Details

Motivation: Existing RRSIS datasets are limited in resolution, scene diversity, and category coverage, which hinders model generalization and real-world applicability. Method: Introduces the NWPU-Refer dataset, with 15,003 high-resolution images and 49,745 annotated targets, and designs the MRSNet framework, whose IFIM and HFIM modules handle multi-scale feature interaction. Result: MRSNet achieves state-of-the-art performance on the NWPU-Refer dataset. Conclusion: NWPU-Refer and MRSNet provide important support for progress in RRSIS; the dataset and code are publicly available. Abstract: Referring Remote Sensing Image Segmentation (RRSIS) is a complex and challenging task that integrates the paradigms of computer vision and natural language processing. Existing datasets for RRSIS suffer from critical limitations in resolution, scene diversity, and category coverage, which hinders the generalization and real-world applicability of refer segmentation models. To facilitate the development of this field, we introduce NWPU-Refer, the largest and most diverse RRSIS dataset to date, comprising 15,003 high-resolution images (1024-2048px) spanning 30+ countries with 49,745 annotated targets supporting single-object, multi-object, and non-object segmentation scenarios. Additionally, we propose the Multi-scale Referring Segmentation Network (MRSNet), a novel framework tailored for the unique demands of RRSIS. MRSNet introduces two key innovations: (1) an Intra-scale Feature Interaction Module (IFIM) that captures fine-grained details within each encoder stage, and (2) a Hierarchical Feature Interaction Module (HFIM) to enable seamless cross-scale feature fusion, preserving spatial integrity while enhancing discriminative power. Extensive experiments conducted on the proposed NWPU-Refer dataset demonstrate that MRSNet achieves state-of-the-art performance across multiple evaluation metrics, validating its effectiveness. The dataset and code are publicly available at https://github.com/CVer-Yang/NWPU-Refer.

[49] BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance

Huy Le,Nhat Chung,Tung Kieu,Anh Nguyen,Ngan Le

Main category: cs.CV

TL;DR: The BiMa framework improves text-video retrieval through visual and textual debiasing, and shows competitive performance on multiple benchmarks.

Details

Motivation: Visual-linguistic biases in datasets cause text-video retrieval systems to overlook key details. Method: Generates scene elements for each video to enhance the video embeddings, and disentangles text features into content and bias components. Result: Performs strongly on five major TVR benchmarks, with its debiasing capability validated on out-of-distribution retrieval tasks. Conclusion: The BiMa framework effectively mitigates visual and textual biases and improves retrieval performance. Abstract: Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model's bias mitigation capability is consistently validated by its strong results on out-of-distribution retrieval tasks.

[50] Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts

Jiaxing Zhang,Xinyi Zeng,Hao Tang

Main category: cs.CV

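TL;DR: Proposes UTAMoE, a new framework that resolves task objective conflicts in multimodal large language models (MLLMs) by decoupling the internal components of the autoregressive (AR) model.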

Details

Motivation: Existing MLLMs based on autoregressive transformers suffer from conflicting task objectives between understanding and generation, which degrades performance. Method: Designs the UTAMoE framework, which decouples AR modules through a task-aware Mixture-of-Experts (MoE) layer and uses a two-stage training strategy to strengthen task differentiation. Result: Experiments show that UTAMoE mitigates task conflicts and achieves state-of-the-art performance on multimodal benchmarks. Conclusion: UTAMoE offers an effective way to resolve task objective conflicts in MLLMs, and experiments confirm its advantages. Abstract: Unified multimodal large language models (MLLMs) based on end-to-end autoregressive (AR) transformers effectively integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges, often leading to suboptimal trade-offs and task interference. Existing solutions, such as decoupling shared visual encoders, fall short of fundamentally resolving these conflicts due to inherent AR architecture. In this paper, we propose a novel approach that decouples internal components of AR to resolve task objective conflicts. Specifically, we design UTAMoE, a Unified Task-Aware Mixture-of-Experts (MoE) framework that decouples internal AR modules via a Task-Aware MoE Layer to create task-specific optimization subpaths. To enhance task differentiation while maintaining overall coordination, we introduce a novel Two-Stage Training Strategy. Extensive experiments on multimodal benchmarks demonstrate that UTAMoE mitigates task objective conflicts, achieving state-of-the-art performance across various tasks. Visualizations and ablation studies further validate the effectiveness of our approach.

[51] ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning

Feng Han,Yang Jiao,Shaoxiang Chen,Junhao Xu,Jingjing Chen,Yu-Gang Jiang

Main category: cs.CV

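TL;DR: ControlThinker is a new framework that improves the quality and consistency of controllable image generation by enriching the semantic understanding of text prompts.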

Details

Motivation: Existing methods leave a semantic gap between semantically sparse text prompts and target images, and over-rely on low-level control signals. Method: Adopts a "comprehend-then-generate" paradigm: an MLLM mines latent semantics from control images to enrich the text prompt, and an ORM selects the best reasoning trajectory. Result: Experiments show that ControlThinker narrows the semantic gap between text prompts and target images, improving visual quality and semantic consistency. Conclusion: Through semantic enrichment and reasoning trajectory selection, ControlThinker significantly improves controllable image generation. Abstract: The field of controllable image generation has seen significant advancements, with various architectures improving generation layout consistency with control signals. However, contemporary methods still face challenges in bridging the semantic gap between input text prompts with sparse semantics and the target images, often over-relying on low-level control signals to infer regional details. To address this challenge, we propose ControlThinker, a novel framework that employs a "comprehend-then-generate" paradigm. Firstly, by incentivizing the visual reasoning capability of an MLLM, latent semantics from control images are mined to enrich text prompts. This enriched semantic understanding then seamlessly aids in image generation without the need for additional complex modifications. To further tackle the uncertainty arising from the ambiguity of control images, we encourage broader exploration of reasoning trajectories and select the optimal one using a metric-based output reward model (ORM). Extensive experimental results demonstrate that ControlThinker effectively mitigates the semantic gap between raw text prompts and target images, resulting in improved visual quality and semantic consistency across a wide range of benchmarks. The code and models are available at https://github.com/Maplebb/ControlThinker.

[52] Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision

Tomoya Yoshida,Shuhei Kurita,Taichi Nishimura,Shinsuke Mori

Main category: cs.CV

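TL;DR: Proposes a framework that extracts diverse manipulation trajectories from large-scale video datasets, generates trajectories with visual and point cloud-based language models, and validates the approach.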

Details

Motivation: Developing interactive robots requires a large, diverse set of manipulation demonstrations, but such data is hard to collect. Method: Extracts manipulation trajectories from the large-scale Exo-Ego4D video dataset and, together with the associated text descriptions, develops trajectory generation models based on visual and point cloud-based language models. Result: On the HOT3D dataset, the models successfully generate valid 6DoF manipulation trajectories. Conclusion: The framework provides a dataset and baseline models for generating manipulation trajectories from an egocentric view. Abstract: Learning to use tools or objects in common scenes, particularly handling them in various ways as instructed, is a key challenge for developing interactive robots. Training models to generate such manipulation trajectories requires a large and diverse collection of detailed manipulation demonstrations for various objects, which is nearly infeasible to gather at scale. In this paper, we propose a framework that leverages large-scale ego- and exo-centric video datasets -- constructed globally with substantial effort -- of Exo-Ego4D to extract diverse manipulation trajectories at scale. From these extracted trajectories with the associated textual action description, we develop trajectory generation models based on visual and point cloud-based language models. In the recently proposed egocentric vision-based in-a-quality trajectory dataset of HOT3D, we confirmed that our models successfully generate valid object trajectories, establishing a training dataset and baseline models for the novel task of generating 6DoF manipulation trajectories from action descriptions in egocentric vision.

[53] Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI

Wing Man Casca Kwok,Yip Chiu Tung,Kunal Bhagchandani

Main category: cs.CV

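TL;DR: Shows how to run transformer-based image captioning models efficiently on resource-constrained edge devices, using knowledge distillation to accelerate inference while maintaining performance.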

Details

Motivation: Edge devices in industrial automation and IoT applications need real-time AI decision-making, but conventional deep learning models are hard to run under their limited compute budgets. Method: Evaluates resource-efficient transformer models and applies knowledge distillation to optimize them. Result: Inference is accelerated on resource-constrained devices while model performance is maintained. Conclusion: With optimization and knowledge distillation, transformer models can run efficiently on edge devices and meet real-time requirements. Abstract: Edge computing decentralizes processing power to the network edge, enabling real-time AI-driven decision-making in IoT applications. In industrial automation such as robotics and rugged edge AI, real-time perception and intelligence are critical for autonomous operations. Deploying transformer-based image captioning models at the edge can enhance machine perception, improve scene understanding for autonomous robots, and aid in industrial inspection. However, these edge or IoT devices are often constrained in computational resources for physical agility, yet they have strict response time requirements. Traditional deep learning models can be too large and computationally demanding for these devices. In this research, we present findings of transformer-based models for image captioning that operate effectively on edge devices. By evaluating resource-effective transformer models and applying knowledge distillation techniques, we demonstrate that inference can be accelerated on resource-constrained devices while maintaining model performance.
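
The abstract does not say which distillation objective the authors use. As background only, the sketch below shows the classic soft-target formulation: a temperature-softened KL term between teacher and student, mixed with the usual cross-entropy on the ground-truth label. The function names, the `temperature` and `alpha` defaults, and the single-example interface are illustrative assumptions, not details from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of a list of logits at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Soft-target distillation loss for one example (hypothetical helper).

    loss = alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(label, student)
    The T^2 factor keeps the soft-target gradients on a comparable scale.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student_t = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student_t) if pt > 0.0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce
```

With `alpha=1.0` the loss reduces to the pure teacher-matching term, which is zero when student and teacher logits coincide; with `alpha=0.0` it is ordinary cross-entropy.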

[54] PDSE: A Multiple Lesion Detector for CT Images using PANet and Deformable Squeeze-and-Excitation Block

Di Fan,Heng Yu,Zhiyuan Xu

Main category: cs.CV

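TL;DR: PDSE is an improved one-stage lesion detection framework that redesigns RetinaNet and combines a low-level feature map, an adaptive SE block, and channel feature map attention to improve the accuracy and efficiency of lesion detection in CT images.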

Details

Motivation: Lesion detection in CT scans is challenging because lesions vary widely in type, size, and location; a more efficient and accurate detector is needed. Method: Redesigns RetinaNet by adding a low-level feature map to strengthen the path aggregation flow, and combines an adaptive SE block with channel feature map attention. Result: On the DeepLesion benchmark, mAP exceeds 0.20, with markedly better detection of small and multi-scale objects. Conclusion: PDSE performs well at lesion detection in CT images and sets a new state of the art. Abstract: Detecting lesions in Computed Tomography (CT) scans is a challenging task in medical image processing due to the diverse types, sizes, and locations of lesions. Recently, various one-stage and two-stage framework networks have been developed to focus on lesion localization. We introduce a one-stage lesion detection framework, PDSE, by redesigning RetinaNet to achieve higher accuracy and efficiency for detecting lesions in multimodal CT images. Specifically, we enhance the path aggregation flow by incorporating a low-level feature map. Additionally, to improve model representation, we utilize the adaptive Squeeze-and-Excitation (SE) block and integrate channel feature map attention. This approach achieves new state-of-the-art performance. Our method significantly improves the detection of small and multi-scale objects. When evaluated against other advanced algorithms on the public DeepLesion benchmark, our algorithm achieved an mAP of over 0.20.

[55] VLMs Can Aggregate Scattered Training Patches

Zhanhui Zhou,Lingjie Chen,Chao Yang,Chaochao Lu

Main category: cs.CV

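TL;DR: Examines the risk that harmful images split into small patches can bypass data moderation for vision-language models (VLMs), and presents an attack built on a capability the authors call "visual stitching."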

Details

Motivation: To expose the risk that training data scattered into small patches can slip past moderation and lead VLMs to generate harmful content. Method: Splits images into patches labeled with unique IDs, fine-tunes models to verify their visual stitching ability, and simulates an adversarial data poisoning scenario. Result: VLMs can integrate scattered visual information through visual stitching and consequently generate harmful content at inference. Conclusion: Visual stitching poses a serious threat to VLM safety, and defenses need further study. Abstract: One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples in their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the description "safe," VLMs may later describe the full image, or a text reference to the scene, as "safe." We define the core ability of VLMs enabling this attack as $\textit{visual stitching}$ -- the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each $(\texttt{image}, \texttt{ID})$ pair into $\{(\texttt{patch}, \texttt{ID})\}$ pairs at different granularity for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like ``safe'' or ``unsafe'', demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at https://github.com/ZHZisZZ/visual-stitching.
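
The synthetic-ID benchmark setup described above, turning one $(\texttt{image}, \texttt{ID})$ pair into a set of $(\texttt{patch}, \texttt{ID})$ pairs, can be pictured with a small tiling routine. This is a minimal sketch, not the authors' pipeline: the image is a plain nested list, the grid is uniform, and the function name and ID string are hypothetical.

```python
def split_into_patch_pairs(image, image_id, rows, cols):
    """Tile a 2D image (list of equal-length rows) into rows x cols patches.

    Returns a list of (patch, image_id) pairs in row-major order, so every
    patch carries the same ID as the image it came from.
    """
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the grid")
    ph, pw = height // rows, width // cols
    pairs = []
    for r in range(rows):
        for c in range(cols):
            patch = [row[c * pw:(c + 1) * pw]
                     for row in image[r * ph:(r + 1) * ph]]
            pairs.append((patch, image_id))
    return pairs

# A 4x4 toy "image" with pixel values 0..15, split at 2x2 granularity.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_pairs = split_into_patch_pairs(toy_image, "ID-0001", rows=2, cols=2)
```

Changing `rows` and `cols` corresponds to the "different granularity" mentioned in the abstract: a finer grid gives each training sample less of the original image.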

[56] Isharah: A Large-Scale Multi-Scene Dataset for Continuous Sign Language Recognition

Sarah Alyami,Hamzah Luqman,Sadam Al-Azani,Maad Alowaifeer,Yazeed Alharbi,Yaser Alonaizan

Main category: cs.CV

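TL;DR: Isharah is a large multi-scene dataset for continuous sign language recognition (CSLR), the first collected in unconstrained environments with smartphone cameras. It contains 30,000 video clips with rich linguistic annotations and supports the development of CSLR and sign language translation (SLT) systems.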

Details

Motivation: Existing CSLR datasets are mostly collected in controlled settings, which limits robustness in real-world scenarios. Isharah aims to fill this gap with data closer to real-world variability. Method: Videos were recorded by 18 deaf and professional signers using smartphone cameras in unconstrained environments, covering varied recording settings, distances, angles, and resolutions. Result: The dataset contains 30,000 video clips with gloss-level annotations, supports CSLR and SLT tasks, and introduces multiple sign language understanding benchmarks. Conclusion: Isharah is an important resource for building robust CSLR and SLT systems and advances sign language recognition in real-world settings. Abstract: Current benchmarks for sign language recognition (SLR) focus mainly on isolated SLR, while there are limited datasets for continuous SLR (CSLR), which recognizes sequences of signs in a video. Additionally, existing CSLR datasets are collected in controlled settings, which restricts their effectiveness in building robust real-world CSLR systems. To address these limitations, we present Isharah, a large multi-scene dataset for CSLR. It is the first dataset of its type and size that has been collected in an unconstrained environment using signers' smartphone cameras. This setup resulted in high variations of recording settings, camera distances, angles, and resolutions. This variation helps with developing sign language understanding models capable of handling the variability and complexity of real-world scenarios. The dataset consists of 30,000 video clips performed by 18 deaf and professional signers. Additionally, the dataset is linguistically rich as it provides a gloss-level annotation for all of the dataset's videos, making it useful for developing CSLR and sign language translation (SLT) systems. This paper also introduces multiple sign language understanding benchmarks, including signer-independent and unseen-sentence CSLR, along with gloss-based and gloss-free SLT. The Isharah dataset is available at https://snalyami.github.io/Isharah_CSLR/.

[57] Negative-Guided Subject Fidelity Optimization for Zero-Shot Subject-Driven Generation

Chaehun Shin,Jooyoung Choi,Johan Barthelemy,Jungbeom Lee,Sungroh Yoon

Main category: cs.CV

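TL;DR: SFO is a new comparative learning framework that introduces synthetic negative samples and reweights diffusion timesteps to significantly improve subject fidelity in zero-shot subject-driven generation.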

Details

Motivation: Existing methods rely only on positive samples and the diffusion loss and make no use of negatives, which leaves subject fidelity lacking. Method: SFO generates negatives through Condition-Degradation Negative Sampling (CDNS) and optimizes the model via pairwise comparison; it also reweights diffusion timesteps to focus on the key steps. Result: Experiments show that SFO significantly outperforms baselines in subject fidelity and text alignment. Conclusion: Through comparative learning and timestep reweighting, SFO improves subject-driven generation. Abstract: We present Subject Fidelity Optimization (SFO), a novel comparative learning framework for zero-shot subject-driven generation that enhances subject fidelity. Beyond supervised fine-tuning methods that rely only on positive targets and use the diffusion loss as in the pre-training stage, SFO introduces synthetic negative targets and explicitly guides the model to favor positives over negatives through pairwise comparison. For negative targets, we propose Condition-Degradation Negative Sampling (CDNS), which automatically generates distinctive and informative negatives by intentionally degrading visual and textual cues without expensive human annotations. Moreover, we reweight the diffusion timesteps to focus finetuning on intermediate steps where subject details emerge. Extensive experiments demonstrate that SFO with CDNS significantly outperforms baselines in terms of both subject fidelity and text alignment on a subject-driven generation benchmark. Project page: https://subjectfidelityoptimization.github.io/

[58] FingerVeinSyn-5M: A Million-Scale Dataset and Benchmark for Finger Vein Recognition

Yinfan Wang,Jie Gui,Baosheng Yu,Qi Li,Zhenan Sun,Juho Kannala,Guoying Zhao

Main category: cs.CV

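TL;DR: Proposes FVeinSyn, a synthetic finger vein image generator, and uses it to build FingerVeinSyn-5M, the largest finger vein dataset with 5 million samples, which substantially improves deep learning model performance.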

Details

Motivation: Existing finger vein datasets are small and offer limited sample variety, holding back deep learning methods. Method: Develops the FVeinSyn generator to produce diverse finger vein images and builds the FingerVeinSyn-5M dataset of 5 million samples. Result: Models pretrained on the dataset and then fine-tuned achieve an average performance gain of 53.91% across multiple benchmarks. Conclusion: FingerVeinSyn-5M gives the finger vein recognition field a large-scale, diverse dataset and substantially advances deep learning applications. Abstract: A major challenge in finger vein recognition is the lack of large-scale public datasets. Existing datasets contain few identities and limited samples per finger, restricting the advancement of deep learning-based methods. To address this, we introduce FVeinSyn, a synthetic generator capable of producing diverse finger vein patterns with rich intra-class variations. Using FVeinSyn, we created FingerVeinSyn-5M -- the largest available finger vein dataset -- containing 5 million samples from 50,000 unique fingers, each with 100 variations including shift, rotation, scale, roll, varying exposure levels, skin scattering blur, optical blur, and motion blur. FingerVeinSyn-5M is also the first to offer fully annotated finger vein images, supporting deep learning applications in this field. Models pretrained on FingerVeinSyn-5M and fine-tuned with minimal real data achieve an average 53.91\% performance gain across multiple benchmarks. The dataset is publicly available at: https://github.com/EvanWang98/FingerVeinSyn-5M.

[59] Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

Haoyu Zhang,Meng Liu,Zaijing Li,Haokun Wen,Weili Guan,Yaowei Wang,Liqiang Nie

Main category: cs.CV

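TL;DR: Proposes a unified framework that enhances the 3D spatial reasoning of pre-trained vision-language models (VLMs) by combining a structured prompting strategy with an automatically constructed dataset.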

Details

Motivation: Under existing methods, spatial uncertainty and data scarcity limit the 3D spatial reasoning of VLMs. Method: Proposes SpatialMind (a structured prompting strategy) and ScanForgeQA (an automatically constructed dataset), without modifying the model architecture. Result: Experiments on multiple benchmarks show that the prompting and fine-tuning strategies are effective both individually and in combination. Conclusion: The framework improves 3D spatial reasoning and offers insights for future research on visual-spatial understanding. Abstract: Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial uncertainty and data scarcity, limiting the 3D spatial reasoning capability of pre-trained vision-language models (VLMs). To address these challenges, we present a unified framework for enhancing 3D spatial reasoning in pre-trained VLMs without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes through an automated construction process designed for fine-tuning. Extensive experiments across multiple benchmarks demonstrate the individual and combined effectiveness of our prompting and fine-tuning strategies, and yield insights that may inspire future research on visual-spatial understanding.

[60] Images are Worth Variable Length of Representations

Lingjun Mao,Rodolfo Corona,Xin Liang,Wenhao Yan,Zineng Tang

Main category: cs.CV

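TL;DR: DOVE is a dynamic vision encoder that produces a variable number of visual tokens according to each image's information content, significantly reducing the average token count while maintaining high reconstruction quality.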

Details

Motivation: Existing vision encoders use a fixed number of tokens and cannot adapt to differences in information content across images, which is inefficient. Method: Proposes DOVE, a dynamic encoder that produces a variable number of visual tokens to reconstruct each image, and extends it with query-conditioned tokenization. Result: DOVE reduces the token count while maintaining high reconstruction quality and outperforms fixed-length encoding methods in linear probing and multimodal tasks. Conclusion: Through dynamic and query-conditioned tokenization, DOVE achieves more efficient and targeted semantic extraction. Abstract: Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods when using far fewer tokens, capturing more expressive semantic features compared to fixed-length encoding. We further extend DOVE with query-conditioned tokenization. By guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction. Our code and checkpoints are available at https://dove-encoder.github.io/dove-encoder.

Hansen Feng,Lizhi Wang,Yiqi Huang,Tong Li,Lin Zhu,Hua Huang

Main category: cs.CV

TL;DR: YOND is a new blind raw image denoising method that is trained on synthetic data and generalizes to noisy images from unknown cameras.

Details

Motivation: Existing learning-based methods depend on camera-specific data, so performance drops on data from unknown cameras. Method: Proposes three key modules: CNE (coarse-to-fine noise estimation), EM-VST (expectation-matched variance-stabilizing transform), and SNR-Net (SNR-guided denoiser). Result: Performs strongly on data from unknown cameras and is flexible and practical. Conclusion: YOND removes the camera-specific data dependency and provides a general, controllable denoising method. Abstract: The rapid advancement of photography has created a growing demand for a practical blind raw image denoising method. Recently, learning-based methods have become mainstream due to their excellent performance. However, most existing learning-based methods suffer from camera-specific data dependency, resulting in performance drops when applied to data from unknown cameras. To address this challenge, we introduce a novel blind raw image denoising method named YOND, which stands for You Only Need a Denoiser. Trained solely on synthetic data, YOND can generalize robustly to noisy raw images captured by diverse unknown cameras. Specifically, we propose three key modules to guarantee the practicality of YOND: coarse-to-fine noise estimation (CNE), expectation-matched variance-stabilizing transform (EM-VST), and SNR-guided denoiser (SNR-Net). Firstly, we propose CNE to identify the camera noise characteristic, refining the estimated noise parameters based on the coarse denoised image. Secondly, we propose EM-VST to eliminate camera-specific data dependency, correcting the bias expectation of VST according to the noisy image. Finally, we propose SNR-Net to offer controllable raw image denoising, supporting adaptive adjustments and manual fine-tuning. Extensive experiments on unknown cameras, along with flexible solutions for challenging cases, demonstrate the superior practicality of our method. The source code will be publicly available at the project homepage: https://fenghansen.github.io/publication/YOND.
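
The abstract does not give the formula for EM-VST. As background on what a variance-stabilizing transform does, the sketch below implements the classical generalized Anscombe transform for zero-mean Poisson-Gaussian noise, which maps a signal-dependent noise level to roughly unit variance so that a single Gaussian denoiser can be applied. This is the textbook VST, not the authors' expectation-matched variant, and the parameter names `gain` and `sigma` are illustrative.

```python
import math

def generalized_anscombe(z, gain, sigma):
    """Classical generalized Anscombe transform (assumes zero-mean Gaussian part).

    Model: z = gain * Poisson(x / gain) + Normal(0, sigma^2).
    f(z) = (2 / gain) * sqrt(gain * z + 3/8 * gain^2 + sigma^2)
    """
    arg = gain * z + 0.375 * gain ** 2 + sigma ** 2
    return (2.0 / gain) * math.sqrt(max(arg, 0.0))

def inverse_generalized_anscombe(f, gain, sigma):
    """Plain algebraic inverse of the transform above.

    It is known to be biased at low signal levels, which is the kind of
    expectation mismatch that corrected inverses are designed to address.
    """
    return 0.25 * gain * f ** 2 - 0.375 * gain - sigma ** 2 / gain
```

With `gain=1` and `sigma=0` this reduces to the original Anscombe transform, `2 * sqrt(z + 3/8)`.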

[62] EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation

Cheng Zhang,Hongxia Xie,Bin Wen,Songhan Zuo,Ruoxuan Zhang,Wen-Huang Cheng

Main category: cs.CV

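TL;DR: Introduces the EmoArt dataset to address weak emotional expression in text-to-image generation and evaluates how well existing models align generated images with emotions.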

Details

Motivation: Current text-to-image models fall short in emotional expression and abstract art generation, mainly for lack of large-scale, fine-grained emotion datasets. Method: Presents the EmoArt dataset of 132,664 artworks spanning 56 painting styles, each annotated with scene descriptions, visual attributes, emotion labels, and more. Result: EmoArt is used to evaluate the emotional alignment of popular text-to-image diffusion models. Conclusion: EmoArt provides essential data and benchmarks for emotion-driven image synthesis and advances affective computing, multimodal learning, and computational art. Abstract: With the rapid advancement of diffusion models, text-to-image generation has achieved significant progress in image resolution, detail fidelity, and semantic alignment, particularly with models like Stable Diffusion 3.5, Stable Diffusion XL, and FLUX 1. However, generating emotionally expressive and abstract artistic images remains a major challenge, largely due to the lack of large-scale, fine-grained emotional datasets. To address this gap, we present the EmoArt Dataset -- one of the most comprehensive emotion-annotated art datasets to date. It contains 132,664 artworks across 56 painting styles (e.g., Impressionism, Expressionism, Abstract Art), offering rich stylistic and cultural diversity. Each image includes structured annotations: objective scene descriptions, five key visual attributes (brushwork, composition, color, line, light), binary arousal-valence labels, twelve emotion categories, and potential art therapy effects. Using EmoArt, we systematically evaluate popular text-to-image diffusion models for their ability to generate emotionally aligned images from text. Our work provides essential data and benchmarks for emotion-driven image synthesis and aims to advance fields such as affective computing, multimodal learning, and computational art, enabling applications in art therapy and creative design. The dataset and more details can be accessed via our project website.

[63] MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection

Xiaochun Lei,Siqi Wu,Weilin Wu,Zetao Jiang

Main category: cs.CV

TL;DR: Proposes MambaNeXt-YOLO, which combines CNNs with Mamba to balance accuracy and efficiency in real-time object detection.

Details

Motivation: To address the high computational complexity of Transformers in real-time object detection by using linear state space models such as Mamba. Method: (1) MambaNeXt Block, which combines CNNs with Mamba; (2) MAFPN, which improves multi-scale detection; (3) optimization for edge-device deployment. Result: Achieves 66.6% mAP at 31.9 FPS on PASCAL VOC and supports edge devices. Conclusion: MambaNeXt-YOLO is efficient and accurate for real-time object detection and well suited to edge deployment. Abstract: Real-time object detection is a fundamental but challenging task in computer vision, particularly when computational resources are limited. Although YOLO-series models have set strong benchmarks by balancing speed and accuracy, the increasing need for richer global context modeling has led to the use of Transformer-based architectures. Nevertheless, Transformers have high computational complexity because of their self-attention mechanism, which limits their practicality for real-time and edge deployments. To overcome these challenges, recent developments in linear state space models, such as Mamba, provide a promising alternative by enabling efficient sequence modeling with linear complexity. Building on this insight, we propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency through three key contributions: (1) MambaNeXt Block: a hybrid design that integrates CNNs with Mamba to effectively capture both local features and long-range dependencies; (2) Multi-branch Asymmetric Fusion Pyramid Network (MAFPN): an enhanced feature pyramid architecture that improves multi-scale object detection across various object sizes; and (3) Edge-focused Efficiency: our method achieved 66.6\% mAP at 31.9 FPS on the PASCAL VOC dataset without any pre-training and supports deployment on edge devices such as the NVIDIA Jetson Xavier NX and Orin NX.

[64] INP-Former++: Advancing Universal Anomaly Detection via Intrinsic Normal Prototypes and Residual Learning

Wei Luo,Haiming Yao,Yunkang Cao,Qiyu Chen,Ang Gao,Weiming Shen,Weihang Zhang,Wenyong Yu

Main category: cs.CV

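TL;DR: INP-Former extracts Intrinsic Normal Prototypes (INPs) directly from the test image and, with an INP Extractor and tailored loss functions, significantly improves anomaly detection performance.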

Details

Motivation: Existing anomaly detection methods rely on normal reference images from the training set, but variations in appearance and position make alignment difficult and hurt accuracy. The authors observe that anomalous images still contain valuable normal information and propose extracting INPs directly from the test image. Method: Proposes INP-Former, comprising an INP Extractor, an INP Coherence Loss, and an INP-guided Decoder, with reconstruction errors serving as anomaly scores. A Soft Mining Loss is also introduced to prioritize hard-to-optimize training samples. Result: INP-Former achieves SOTA performance on single-class, multi-class, and few-shot anomaly detection and shows zero-shot capability. The improved INP-Former++ raises performance further. Conclusion: INP-Former is a general, efficient anomaly detection method that exploits intrinsic information in the test image and significantly outperforms conventional methods. Abstract: Anomaly detection (AD) is essential for industrial inspection and medical diagnosis, yet existing methods typically rely on ``comparing'' test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variations, meaning that even within anomalous images, valuable normal information remains. We argue that this information is useful and may be more aligned with the anomalies since both the anomalies and the normal information originate from the same image. Therefore, rather than relying on external normality from the training set, we propose INP-Former, a novel method that extracts Intrinsic Normal Prototypes (INPs) directly from the test image. Specifically, we introduce the INP Extractor, which linearly combines normal tokens to represent INPs. We further propose an INP Coherence Loss to ensure INPs can faithfully represent normality for the testing image. These INPs then guide the INP-guided Decoder to reconstruct only normal tokens, with reconstruction errors serving as anomaly scores. Additionally, we propose a Soft Mining Loss to prioritize hard-to-optimize samples during training. INP-Former achieves state-of-the-art performance in single-class, multi-class, and few-shot AD tasks across MVTec-AD, VisA, and Real-IAD, positioning it as a versatile and universal solution for AD. Remarkably, INP-Former also demonstrates some zero-shot AD capability.
Furthermore, we propose a soft version of the INP Coherence Loss and enhance INP-Former by incorporating residual learning, leading to the development of INP-Former++. The proposed method significantly improves detection performance across single-class, multi-class, semi-supervised, few-shot, and zero-shot settings.

[65] Zero-Shot Temporal Interaction Localization for Egocentric Videos

Erhang Zhang,Junyi Ma,Yin-Dong Zheng,Yixuan Zhou,Hesheng Wang

Main category: cs.CV

TL;DR: EgoLoc是一种零样本时序交互定位方法，通过自适应采样策略和闭环反馈，显著提升了自我中心视频中的人-物交互动作定位性能。

Details

Motivation: Existing methods rely on annotated data, which causes domain bias and inefficient deployment, while zero-shot methods are limited by coarse-grained estimates and open-loop pipelines. Method: EgoLoc combines 2D and 3D observations, generates high-quality initial guesses through self-adaptive sampling, and refines the localization results with closed-loop feedback. Result: EgoLoc outperforms existing baselines on a public dataset and a new benchmark. Conclusion: EgoLoc provides an efficient, accurate solution for interaction localization in egocentric videos; the code will be open-sourced. Abstract: Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We will release our code and relevant data as open-source at https://github.com/IRMVLab/EgoLoc.
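
The abstract says initial guesses are sampled around likely contact/separation timestamps "according to 3D hand velocities" but does not spell out the rule. One plausible reading, offered purely as an assumption, is that the hand slows down near contact, so local minima of hand speed are reasonable candidate timestamps. The sketch below implements that heuristic; the function names, the `top_k` parameter, and the local-minimum criterion are all hypothetical and not taken from the paper.

```python
import math

def hand_speeds(positions, fps):
    """Finite-difference speed between consecutive 3D hand positions.

    speeds[i] is the speed over the interval from frame i to frame i + 1.
    """
    return [math.dist(a, b) * fps for a, b in zip(positions, positions[1:])]

def candidate_intervals(positions, fps, top_k=2):
    """Return up to top_k interval indices where hand speed is a local minimum.

    Assumption (not from the paper): slow hand motion hints at contact or
    separation. Flat stretches of equal speed all count as minima here.
    """
    speeds = hand_speeds(positions, fps)
    minima = [i for i in range(1, len(speeds) - 1)
              if speeds[i] <= speeds[i - 1] and speeds[i] <= speeds[i + 1]]
    minima.sort(key=lambda i: speeds[i])  # slowest candidates first
    return sorted(minima[:top_k])         # report in temporal order
```

In a pipeline like the one described, frames around these candidates would then be handed to the VLM as visual prompts, with the closed-loop feedback deciding whether to keep or resample them.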

[66] Intersectional Bias in Pre-Trained Image Recognition Models

Valerie Krug,Sebastian Stober

Main category: cs.CV

TL;DR: Finds that ImageNet classifiers exhibit age, race, and gender biases on facial images, with age distinctions being especially pronounced.

Details

Motivation: To examine the biases that pre-trained models may carry, in particular how sensitive variables (age, race, gender) are encoded for facial images. Method: Analyzes the representations of ImageNet classifiers with linear classifier probes and activation visualization (topographic maps). Result: The models distinguish ages clearly; race and gender biases are more apparent in middle-aged groups. Conclusion: ImageNet classifiers are biased with respect to sensitive variables, and further research is needed to reduce the impact. Abstract: Deep Learning models have achieved remarkable success. Training them is often accelerated by building on top of pre-trained models, which poses the risk of perpetuating encoded biases. Here, we investigate biases in the representations of commonly used ImageNet classifiers for facial images while considering intersections of sensitive variables age, race and gender. To assess the biases, we use linear classifier probes and visualize activations as topographic maps. We find that representations in ImageNet classifiers particularly allow differentiation between ages. Less strongly pronounced, the models appear to associate certain ethnicities and distinguish genders in middle-aged groups.

[67] Accelerating SfM-based Pose Estimation with Dominating Set

Joji Joseph,Bharadwaj Amrutur,Shalabh Bhatnagar

Main category: cs.CV

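TL;DR: Proposes a preprocessing technique based on graph-theoretic dominating sets that significantly accelerates SfM-based pose estimation for real-time AR/VR and robotics applications.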

Details

Motivation: SfM-based pose estimation is too slow for real-time applications. Method: Uses the graph-theoretic concept of a dominating set to preprocess SfM models and reduce computation. Result: Processing is 1.5 to 14.48 times faster, and the number of reference images and the point cloud size shrink by factors of 17-23 and 2.27-4, respectively. Conclusion: The method balances speed and accuracy and offers an efficient solution for real-time 3D pose estimation. Abstract: This paper introduces a preprocessing technique to speed up Structure-from-Motion (SfM) based pose estimation, which is critical for real-time applications like augmented reality (AR), virtual reality (VR), and robotics. Our method leverages the concept of a dominating set from graph theory to preprocess SfM models, significantly enhancing the speed of the pose estimation process without losing significant accuracy. Using the OnePose dataset, we evaluated our method across various SfM-based pose estimation techniques. The results demonstrate substantial improvements in processing speed, ranging from 1.5 to 14.48 times, and a reduction in reference images and point cloud size by factors of 17-23 and 2.27-4, respectively. This work offers a promising solution for efficient and accurate 3D pose estimation, balancing speed and accuracy in real-time applications.
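
A dominating set of a graph is a subset of nodes such that every node is either in the subset or adjacent to a member of it. The abstract does not say how the graph is built from the SfM model or which algorithm selects the set, so the sketch below makes two labeled assumptions: nodes stand for reference images with edges linking images that share enough 3D points, and the set is chosen with the standard greedy approximation (finding a minimum dominating set is NP-hard).

```python
def greedy_dominating_set(adjacency):
    """Greedy approximation of a small dominating set.

    adjacency maps each node to the set of its neighbours (undirected graph).
    At each step the node that dominates the most still-undominated nodes is
    chosen; ties go to the smallest node label for determinism.
    """
    undominated = set(adjacency)
    chosen = []
    while undominated:
        best = max(sorted(adjacency),
                   key=lambda n: len(({n} | adjacency[n]) & undominated))
        chosen.append(best)
        undominated -= {best} | adjacency[best]
    return chosen

# Hypothetical covisibility graph over five reference images forming a chain.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
kept_images = greedy_dominating_set(chain)
```

Under these assumptions, only the images in `kept_images` (and the points they observe) would be retained, since every discarded image has a retained neighbour that overlaps with it.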

Jialei Chen,Xu Zheng,Danda Pani Paudel,Luc Van Gool,Hiroshi Murase,Daisuke Deguchi

Main category: cs.CV

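TL;DR: Proposes BiXFormer, which optimizes multi-modal semantic segmentation through Unified Modality Matching (UMM) and Cross Modality Alignment (CMA), improving modality utilization and handling missing modalities.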

Details

Motivation: Existing methods restrict each modality's strengths when fusing multi-modal features and pay too little attention to non-RGB modalities. Method: BiXFormer splits multi-modal inputs into RGB and non-RGB (X) and processes them separately, and proposes UMM (comprising MAM and CM) and CMA to maximize each modality's contribution. Result: mIoU improves by 2.75% and 22.74% on synthetic and real-world multi-modal benchmarks, respectively. Conclusion: BiXFormer improves multi-modal scene understanding and performs especially well when modalities are missing. Abstract: Utilizing multi-modal data enhances scene understanding by providing complementary semantic and geometric information. Existing methods fuse features or distill knowledge from multiple modalities into a unified representation, improving robustness but restricting each modality's ability to fully leverage its strengths in different situations. We reformulate multi-modal semantic segmentation as a mask-level classification task and propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA) to maximize modality effectiveness and handle missing modalities. Specifically, BiXFormer first categorizes multi-modal inputs into RGB and X, where X represents any non-RGB modalities, e.g., depth, allowing separate processing for each. This design leverages the well-established pretraining for RGB, while addressing the relative lack of attention to X modalities. Then, we propose UMM, which includes Modality Agnostic Matching (MAM) and Complementary Matching (CM). MAM assigns labels to features from all modalities without considering modality differences, leveraging each modality's strengths. CM then reassigns unmatched labels to remaining unassigned features within their respective modalities, ensuring that each available modality contributes to the final prediction and mitigating the impact of missing modalities. Moreover, to further facilitate UMM, we introduce CMA, which enhances the weaker queries assigned in CM by aligning them with optimally matched queries from MAM. Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method, achieving significant improvements in mIoU of +2.75% and +22.74% over prior art.

[69] How PARTs assemble into wholes: Learning the relative composition of images

Melika Ayoughi,Samira Abnar,Chen Huang,Chris Sandino,Sayeri Lala,Eeshan Gunesh Dhekane,Dan Busbridge,Shuangfei Zhai,Vimal Thilak,Josh Susskind,Pascal Mettes,Paul Groth,Hanlin Goh

Main category: cs.CV

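TL;DR: PART is a self-supervised learning method that learns the relative composition of image parts through continuous relative transformations, overcoming the limitations of grid-based methods.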

Details

Motivation: Existing grid-based methods cannot capture the continuous, fluid nature of real-world object composition, which limits representation learning. Method: PART uses continuous relative transformations between off-grid patches to model how parts relate to one another, enabling self-supervised learning. Result: On tasks requiring precise spatial understanding, such as object detection and time series prediction, PART outperforms grid-based methods such as MAE and DropPos while remaining competitive on global classification tasks. Conclusion: By breaking free of grid constraints, PART opens a new direction for general self-supervised pretraining across diverse data types, with broad application potential. Abstract: The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task involves predicting the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to each other in a continuous space, PART learns the relative composition of images -- an off-grid structural relative positioning process that generalizes beyond occlusions and deformations. In tasks requiring precise spatial understanding such as object detection and time series prediction, PART outperforms strong grid-based methods like MAE and DropPos, while also maintaining competitive performance on global classification tasks with minimal hyperparameter tuning. By breaking free from grid constraints, PART opens up an exciting new trajectory for universal self-supervised pretraining across diverse data types -- from natural images to EEG signals -- with promising potential in video, medical imaging, and audio.

[70] PRJ: Perception-Retrieval-Judgement for Generated Images

Qiang Fu,Zonglei Jing,Zonghao Ying,Xiaoqian Li

Main category: cs.CV

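TL;DR: Proposes PRJ, a cognitively inspired framework whose three-stage perception-retrieval-judgement design improves safety detection for AI-generated visual content.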

Details

Motivation: The rapid progress of generative AI brings creative capabilities but also raises safety concerns about AI-generated visual content; existing systems lack contextual understanding and dynamic toxicity assessment. Method: PRJ has three stages: perception (converting the image into descriptive language), retrieval (fetching external knowledge), and judgement (evaluating toxicity against rules), plus a dynamic scoring mechanism. Result: Experiments show that PRJ surpasses existing safety checkers in detection accuracy and robustness and supports structured toxicity interpretation. Conclusion: PRJ offers a more effective and interpretable approach to safety detection for AI-generated content. Abstract: The rapid progress of generative AI has enabled remarkable creative capabilities, yet it also raises urgent concerns regarding the safety of AI-generated visual content in real-world applications such as content moderation, platform governance, and digital media regulation. This includes unsafe material such as sexually explicit images, violent scenes, hate symbols, propaganda, and unauthorized imitations of copyrighted artworks. Existing image safety systems often rely on rigid category filters and produce binary outputs, lacking the capacity to interpret context or reason about nuanced, adversarially induced forms of harm. In addition, standard evaluation metrics (e.g., attack success rate) fail to capture the semantic severity and dynamic progression of toxicity. To address these limitations, we propose Perception-Retrieval-Judgement (PRJ), a cognitively inspired framework that models toxicity detection as a structured reasoning process. PRJ follows a three-stage design: it first transforms an image into descriptive language (perception), then retrieves external knowledge related to harm categories and traits (retrieval), and finally evaluates toxicity based on legal or normative rules (judgement). This language-centric structure enables the system to detect both explicit and implicit harms with improved interpretability and categorical granularity. In addition, we introduce a dynamic scoring mechanism based on a contextual toxicity risk matrix to quantify harmfulness across different semantic dimensions. Experiments show that PRJ surpasses existing safety checkers in detection accuracy and robustness while uniquely supporting structured category-level toxicity interpretation.

[71] DSSAU-Net:U-Shaped Hybrid Network for Pubic Symphysis and Fetal Head Segmentation

Zunhui Xia,Hongxing Li,Libin Lan

Main category: cs.CV

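TL;DR: Proposes DSSAU-Net, a sparse self-attention network architecture for accurate segmentation of the fetal head and pubic symphysis to assist ultrasound diagnosis during childbirth.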

Details

Motivation: Traditional childbirth examinations are subjective and inaccurate; ultrasound-assisted diagnosis requires accurate segmentation of the fetal head and pubic symphysis to make diagnosis more objective and efficient. Method: DSSAU-Net uses a symmetric U-shaped encoder-decoder architecture that combines Dual Sparse Selection Attention (DSSA) blocks with multiscale feature fusion to reduce computational complexity and strengthen feature extraction. Result: In the MICCAI IUGC 2024 competition, DSSAU-Net ranked fourth on the classification and segmentation tasks, supporting its effectiveness. Conclusion: DSSAU-Net performs well on fetal head and pubic symphysis segmentation and provides an efficient, accurate tool for ultrasound-assisted childbirth diagnosis. Abstract: In the childbirth process, traditional methods involve invasive vaginal examinations, but research has shown that these methods are both subjective and inaccurate. Ultrasound-assisted diagnosis offers an objective yet effective way to assess fetal head position via two key parameters: Angle of Progression (AoP) and Head-Symphysis Distance (HSD), calculated by segmenting the fetal head (FH) and pubic symphysis (PS), which aids clinicians in ensuring a smooth delivery process. Therefore, accurate segmentation of FH and PS is crucial. In this work, we propose a sparse self-attention network architecture with good performance and high computational efficiency, named DSSAU-Net, for the segmentation of FH and PS. Specifically, we stack varying numbers of Dual Sparse Selection Attention (DSSA) blocks at each stage to form a symmetric U-shaped encoder-decoder network architecture. For a given query, DSSA is designed to explicitly perform one sparse token selection at both the region and pixel levels, respectively, which is beneficial for further reducing computational complexity while extracting the most relevant features. To compensate for the information loss during the upsampling process, skip connections with convolutions are designed. Additionally, multiscale feature fusion is employed to enrich the model's global and local information. The performance of DSSAU-Net has been validated using the Intrapartum Ultrasound Grand Challenge (IUGC) 2024 test set provided by the organizer in the MICCAI IUGC 2024 competition (https://codalab.lisn.upsaclay.fr/competitions/18413#learn_the_details), where we placed fourth on the tasks of classification and segmentation, demonstrating its effectiveness. The codes will be available at https://github.com/XiaZunhui/DSSAU-Net.

[72] Advancements in Artificial Intelligence Applications for Cardiovascular Disease Research

Yuanlin Mo,Haishan Huang,Bocheng Liang,Weibo Ma

Main category: cs.CV

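TL;DR: Reviews AI applications in cardiovascular medicine, especially in combination with CT, MRI, ECG, and ultrasound, where deep learning improves diagnostic efficiency and accuracy, although data validation problems remain to be solved.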

Details

Motivation: To examine the revolutionary potential of AI in cardiovascular medicine and the challenges it faces. Method: Deep learning architectures such as convolutional neural networks and generative adversarial networks are used to automate the analysis of medical images and physiological signals. Result: AI surpasses human capabilities in diagnostic accuracy and workflow efficiency, but the inability to validate input data may lead to diagnostic errors. Conclusion: AI has transformative potential for precision diagnostics; future work should develop hybrid models and multimodal data integration while strengthening validation protocols to ensure clinical reliability. Abstract: Recent advancements in artificial intelligence (AI) have revolutionized cardiovascular medicine, particularly through integration with computed tomography (CT), magnetic resonance imaging (MRI), electrocardiography (ECG) and ultrasound (US). Deep learning architectures, including convolutional neural networks and generative adversarial networks, enable automated analysis of medical imaging and physiological signals, surpassing human capabilities in diagnostic accuracy and workflow efficiency. However, critical challenges persist, including the inability to validate input data accuracy, which may propagate diagnostic errors. This review highlights AI's transformative potential in precision diagnostics while underscoring the need for robust validation protocols to ensure clinical reliability. Future directions emphasize hybrid models integrating multimodal data and adaptive algorithms to refine personalized cardiovascular care.

[73] OV-COAST: Cost Aggregation with Optimal Transport for Open-Vocabulary Semantic Segmentation

Aditya Gandhamal,Aniruddh Sikdar,Suresh Sundaram

Main category: cs.CV

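TL;DR: The paper proposes OV-COAST, an open-vocabulary semantic segmentation method based on optimal transport theory, which significantly improves model performance through a two-stage optimization strategy.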

Details

Motivation: To improve the out-of-domain generalization of open-vocabulary semantic segmentation (OVSS) by aligning visual-language features with optimal transport theory. Method: A two-stage optimization strategy: the first stage solves the optimal transport problem via Sinkhorn distance, and the second stage uses the resulting alignment to guide training of the CAT-Seg model. Result: On the MESS benchmark, OV-COAST significantly outperforms CAT-Seg and SAN-B, improving mIoU by 1.72% and 4.9%, respectively. Conclusion: OV-COAST effectively improves open-vocabulary semantic segmentation through optimal transport theory and has potential for practical use. Abstract: Open-vocabulary semantic segmentation (OVSS) entails assigning semantic labels to each pixel in an image using textual descriptions, typically leveraging world models such as CLIP. To enhance out-of-domain generalization, we propose Cost Aggregation with Optimal Transport (OV-COAST) for open-vocabulary semantic segmentation. To align visual-language features within the framework of optimal transport theory, we employ cost volume to construct a cost matrix, which quantifies the distance between two distributions. Our approach adopts a two-stage optimization strategy: in the first stage, the optimal transport problem is solved using cost volume via Sinkhorn distance to obtain an alignment solution; in the second stage, this solution is used to guide the training of the CAT-Seg model. We evaluate state-of-the-art OVSS models on the MESS benchmark, where our approach notably improves the performance of the cost-aggregation model CAT-Seg with ViT-B backbone, achieving superior results, surpassing CAT-Seg by 1.72% and SAN-B by 4.9% mIoU. The code is available at https://github.com/adityagandhamal/OV-COAST/.
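
The first stage described in the abstract, solving an optimal transport problem over a cost matrix with Sinkhorn iterations, can be illustrated with a small sketch. This is not the authors' code: the function name `sinkhorn`, the uniform marginals, the regularization strength `eps`, and the toy cost matrix are all assumptions made for illustration.

```python
import math

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport between two uniform
    distributions (an assumption), returning the transport plan."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # source marginal, e.g. over image patches
    b = [1.0 / m] * m  # target marginal, e.g. over text classes
    # Gibbs kernel: low cost -> high affinity.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost matrix (hypothetical): entry [i][j] could be 1 - cosine similarity
# between visual feature i and text feature j.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
```

Each entry of `plan` is the mass moved from source item i to target item j; low-cost pairs receive most of the mass. For very small `eps` the kernel can underflow, and a log-domain variant is the usual remedy.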

[74] AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives

Aniruddh Sikdar,Aditya Gandhamal,Suresh Sundaram

Main category: cs.CV

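TL;DR: The paper presents AetherVision-Bench, a benchmark for evaluating multi-angle segmentation performance, and analyzes the key factors that affect zero-shot transfer models.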

Details

Motivation: Existing open-vocabulary semantic segmentation (OVSS) methods struggle with cross-domain generalization, which limits their practical use. Method: Proposes the AetherVision-Bench benchmark for evaluating multi-angle (aerial and ground perspective) segmentation performance and tests existing OVSS models on it. Result: Evaluates the performance of zero-shot transfer models and identifies the key factors affecting it. Conclusion: The study provides a benchmark and insights for future work and advances research on the robustness of open-vocabulary semantic segmentation. Abstract: Open-vocabulary semantic segmentation (OVSS) involves assigning labels to each pixel in an image based on textual descriptions, leveraging world models like CLIP. However, they encounter significant challenges in cross-domain generalization, hindering their practical efficacy in real-world applications. Embodied AI systems are transforming autonomous navigation for ground vehicles and drones by enhancing their perception abilities, and in this study, we present AetherVision-Bench, a benchmark for multi-angle segmentation across aerial and ground perspectives, which facilitates an extensive evaluation of performance across different viewing angles and sensor modalities. We assess state-of-the-art OVSS models on the proposed benchmark and investigate the key factors that impact the performance of zero-shot transfer models. Our work pioneers the creation of a robustness benchmark, offering valuable insights and establishing a foundation for future research.

[75] OSGNet @ Ego4D Episodic Memory Challenge 2025

Yisen Feng,Haoyu Zhang,Qiaohui Chu,Meng Liu,Weili Guan,Yaowei Wang,Liqiang Nie

Main category: cs.CV

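TL;DR: This report presents the champion solutions for the three egocentric video localization tasks of the Ego4D Episodic Memory Challenge at CVPR 2025, which use an early fusion strategy to improve localization accuracy.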

Details

Motivation: Existing unified video localization methods mostly rely on late fusion strategies, which perform poorly and need improvement. Method: An early fusion-based video localization model is used for all three tasks. Result: First place in the Natural Language Queries, Goal Step, and Moment Queries tracks. Conclusion: The early fusion strategy effectively improves video localization accuracy. Abstract: In this report, we present our champion solutions for the three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025. All tracks require precise localization of the interval within an untrimmed egocentric video. Previous unified video localization approaches often rely on late fusion strategies, which tend to yield suboptimal results. To address this, we adopt an early fusion-based video localization model to tackle all three tasks, aiming to enhance localization accuracy. Ultimately, our method achieved first place in the Natural Language Queries, Goal Step, and Moment Queries tracks, demonstrating its effectiveness. Our code can be found at https://github.com/Yisen-Feng/OSGNet.

[76] PlückeRF: A Line-based 3D Representation for Few-view Reconstruction

Sam Bahrami,Dylan Campbell

Main category: cs.CV

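TL;DR: Proposes an improved multi-view 3D reconstruction method that introduces the PlückeRF representation to make better use of multi-view information and improve reconstruction quality.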

Details

Motivation: Existing single-view and few-view 3D reconstruction methods have made progress, but there is still room to improve how they use multi-view information. Method: A method based on the PlückeRF representation that connects the 3D representation with pixel rays from the input views to improve information sharing. Result: Experiments show that the method improves reconstruction quality over the equivalent triplane representation and state-of-the-art feed-forward methods. Conclusion: The PlückeRF representation has clear advantages for multi-view 3D reconstruction and offers a new direction for future research. Abstract: Feed-forward 3D reconstruction methods aim to predict the 3D structure of a scene directly from input images, providing a faster alternative to per-scene optimization approaches. Significant progress has been made in single-view and few-view reconstruction using learned priors that infer object shape and appearance, even for unobserved regions. However, there is substantial potential to enhance these methods by better leveraging information from multiple views when available. To address this, we propose a few-view reconstruction model that more effectively harnesses multi-view information. Our approach introduces a simple mechanism that connects the 3D representation with pixel rays from the input views, allowing for preferential sharing of information between nearby 3D locations and between 3D locations and nearby pixel rays. We achieve this by defining the 3D representation as a set of structured, feature-augmented lines: the PlückeRF representation. Using this representation, we demonstrate improvements in reconstruction quality over the equivalent triplane representation and state-of-the-art feedforward reconstruction methods.

[77] FSHNet: Fully Sparse Hybrid Network for 3D Object Detection

Shuai Liu,Mingyue Cui,Boyang Li,Quanmin Liang,Tinghe Hong,Kai Huang,Yunxiao Shan,Kai Huang

Main category: cs.CV

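TL;DR: FSHNet is a fully sparse hybrid network that uses SlotFormer blocks to strengthen long-range feature extraction in sparse encoders and adds a dynamic sparse label assignment strategy and a sparse upsampling module, significantly improving 3D detection performance.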

Details

Motivation: Sparse 3D detectors extract features only from non-empty voxels, which weakens long-range interactions and leaves center features missing, limiting feature extraction and network optimization. Method: FSHNet introduces SlotFormer blocks to enlarge the receptive field of sparse voxels, adopts a dynamic sparse label assignment strategy to optimize the network, and designs a sparse upsampling module to preserve details. Result: Strong performance on the Waymo, nuScenes, and Argoverse2 benchmarks. Conclusion: By strengthening feature extraction and optimization, FSHNet significantly improves the performance of fully sparse 3D detectors. Abstract: Fully sparse 3D detectors have recently gained significant attention due to their efficiency in long-range detection. However, sparse 3D detectors extract features only from non-empty voxels, which impairs long-range interactions and causes the center feature to be missing. The former weakens the feature extraction capability, while the latter hinders network optimization. To address these challenges, we introduce the Fully Sparse Hybrid Network (FSHNet). FSHNet incorporates a proposed SlotFormer block to enhance the long-range feature extraction capability of existing sparse encoders. The SlotFormer divides sparse voxels using a slot partition approach, which, compared to traditional window partition, provides a larger receptive field. Additionally, we propose a dynamic sparse label assignment strategy to deeply optimize the network by providing more high-quality positive samples. To further enhance performance, we introduce a sparse upsampling module to refine downsampled voxels, preserving fine-grained details crucial for detecting small objects. Extensive experiments on the Waymo, nuScenes, and Argoverse2 benchmarks demonstrate the effectiveness of FSHNet. The code is available at https://github.com/Say2L/FSHNet.

[78] ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices

Hao Yu,Tangyu Jiang,Shuning Jia,Shannan Yan,Shunning Liu,Haolong Qian,Guanghao Li,Shuting Dong,Huaisong Zhang,Chun Yuan

Main category: cs.CV

TL;DR: ComRoPE proposes trainable rotation angle matrices that improve the positional encoding capability of RoPE and boost model performance.

Details

Motivation: Traditional position encoding methods such as RoPE limit model capacity because their rotation matrices are fixed and span a limited transformation space; a more flexible and robust solution is needed. Method: Defines trainable commuting angle matrices that preserve the scalability and positional robustness required by the RoPE Equation, and presents two concrete implementations. Result: On ImageNet-1K, improves over existing methods by 1.6% at training resolution and 2.9% at higher resolution. Conclusion: ComRoPE not only improves RoPE but also offers new ideas for future positional encoding research; the code is open source. Abstract: The Transformer architecture has revolutionized various regions since it was proposed, and its effectiveness largely depends on the ability to encode positional information. Traditional position encoding methods exhibit significant limitations due to lack of robustness and flexibility of position. Therefore, Rotary Positional Encoding (RoPE) was proposed to alleviate these issues, which integrates positional information by rotating the embeddings in the attention mechanism. However, RoPE requires manually defined rotation matrices with limited transformation space, constraining the model's capacity. In this work, we propose ComRoPE, which generalizes RoPE by defining it in terms of trainable commuting angle matrices. Specifically, we demonstrate that pairwise commutativity of these matrices is essential for RoPE to achieve scalability and positional robustness. We formally define the RoPE Equation, which is an essential condition that ensures consistent performance with position offsets. Based on the theoretical analysis, we present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation, which significantly improve performance, surpassing the current state-of-the-art method by 1.6% at training resolution and 2.9% at higher resolution on the ImageNet-1K dataset. Furthermore, our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research. To ensure reproducibility, the source code and instructions are available at https://github.com/Longin-Yu/ComRoPE
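
The property the abstract ties to commutativity, that attention scores should depend only on the position offset, can be checked on a toy case. The sketch below is an illustrative assumption, not the paper's parameterization: it uses a single 2x2 rotation block whose angle is a linear function of a 2D position, so the per-axis angle matrices are scalar multiples of one generator and therefore commute. The names `rotate`, `encode`, `score`, and `thetas` are hypothetical.

```python
import math

def rotate(vec, angle):
    """Rotate a 2D vector by the given angle."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def encode(vec, pos, thetas):
    """Rotary encoding for a multi-axis position.
    The per-axis angle matrices are theta_i * J with a shared generator J,
    so they commute and their rotations compose by adding angles."""
    angle = sum(t * p for t, p in zip(thetas, pos))
    return rotate(vec, angle)

def score(q, k, pos_q, pos_k, thetas):
    """Attention logit between a rotated query and a rotated key."""
    rq = encode(q, pos_q, thetas)
    rk = encode(k, pos_k, thetas)
    return rq[0] * rk[0] + rq[1] * rk[1]

thetas = (0.3, 0.7)  # stand-ins for trainable parameters
q, k = (1.0, 2.0), (0.5, -1.5)
# Two query/key placements sharing the same offset (3, 4).
s1 = score(q, k, (1, 2), (4, 6), thetas)
s2 = score(q, k, (11, -3), (14, 1), thetas)
```

Here `s1` and `s2` agree up to floating-point error because the score depends only on the angle difference, which is linear in the offset. With non-commuting angle matrices the rotations would no longer compose this way, which is the failure mode the commutativity requirement rules out.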

[79] SAAT: Synergistic Alternating Aggregation Transformer for Image Super-Resolution

Jianfeng Wu,Nannan Xu

Main category: cs.CV

TL;DR: The paper proposes SAAT, a new model that combines channel and spatial attention mechanisms to improve single image super-resolution performance.

Details

Motivation: Existing Transformer-based super-resolution methods usually neglect cross-channel information and the rich spatial structure generated in intermediate steps, and the synergy between channel and spatial attention has not been fully explored. Method: Proposes the SAAT model with CWSAG and SWSAG modules, which combine efficient channel attention and spatial attention, respectively, with window attention to strengthen feature fusion and structural feature extraction. Result: Experiments show that SAAT performs strongly on super-resolution, comparable to SOTA models. Conclusion: SAAT improves super-resolution performance through synergistic attention and offers new ideas for related fields. Abstract: Single image super-resolution is a well-known downstream task which aims to restore low-resolution images into high-resolution images. At present, models based on Transformers have shone brightly in the field of super-resolution due to their ability to capture long-term dependencies in information. However, current methods typically compute self-attention in nonoverlapping windows to save computational costs, and the standard self-attention computation only focuses on its results, thereby neglecting the useful information across channels and the rich spatial structural information generated in the intermediate process. Channel attention and spatial attention have, respectively, brought significant improvements to various downstream visual tasks in terms of extracting feature dependency and spatial structure relationships, but the synergistic relationship between channel and spatial attention has not been fully explored yet. To address these issues, we propose a novel model, Synergistic Alternating Aggregation Transformer (SAAT), which can better utilize the potential information of features. In SAAT, we introduce the Efficient Channel & Window Synergistic Attention Group (CWSAG) and the Spatial & Window Synergistic Attention Group (SWSAG). On the one hand, CWSAG combines efficient channel attention with shifted window attention, enhancing non-local feature fusion, and producing more visually appealing results. On the other hand, SWSAG leverages spatial attention to capture rich structured feature information, thereby enabling SAAT to more effectively extract structural features. Extensive experimental results and ablation studies demonstrate the effectiveness of SAAT in the field of super-resolution. SAAT achieves performance comparable to that of the state-of-the-art (SOTA) under the same quantity of parameters.

Caiyi Sun,Yujing Sun,Xiao Han,Zemin Yang,Jiawei Liu,Xinge Zhu,Siu Ming Yiu,Yuexin Ma

Main category: cs.CV

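TL;DR: Proposes a hierarchical feature representation for human motion prediction in complex interactive scenes that combines spatial and frequency perspectives and significantly improves prediction accuracy.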

Details

Motivation: Predicting human behaviour in complex scenes is challenging because of abundant interaction information (such as human-human and human-environment interactions), and existing methods struggle with it. Method: Designs a hierarchical interaction feature representation in which high-level features capture the overall interaction context and low-level features focus on details, and proposes a coarse-to-fine interaction reasoning module that combines spatial and frequency perspectives. Result: State-of-the-art performance on four public datasets. Conclusion: Through hierarchical features and the interaction reasoning module, the method significantly improves motion prediction accuracy in complex scenes. Abstract: Complex scenes present significant challenges for predicting human behaviour due to the abundance of interaction information, such as human-human and human-environment interactions. These factors complicate the analysis and understanding of human behaviour, thereby increasing the uncertainty in forecasting human motions. Existing motion prediction methods thus struggle in these complex scenarios. In this paper, we propose an effective method for human motion forecasting in interactive scenes. To achieve a comprehensive representation of interactions, we design a hierarchical interaction feature representation so that high-level features capture the overall context of the interactions, while low-level features focus on fine-grained details. Besides, we propose a coarse-to-fine interaction reasoning module that leverages both spatial and frequency perspectives to efficiently utilize hierarchical features, thereby enhancing the accuracy of motion predictions. Our method achieves state-of-the-art performance across four public datasets. Code will be released when this paper is published.

[81] CoLa: Chinese Character Decomposition with Compositional Latent Components

Fan Shi,Haiyang Yu,Bin Li,Xiangyang Xue

Main category: cs.CV

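TL;DR: The paper proposes CoLa, a deep latent variable model that learns compositional latent components of Chinese characters without relying on human-defined decomposition schemes, enabling zero-shot Chinese character recognition.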

Details

Motivation: Humans can recognize unseen Chinese characters by decomposing and recombining their components, reflecting two cognitive principles: compositionality and learning-to-learn. Existing methods usually ignore the learning-to-learn capability, which limits their generalization. Method: Proposes a deep latent variable model (CoLa) that learns compositional latent components of Chinese characters and performs recognition and matching in the latent space. Result: Experiments show that CoLa outperforms existing methods in both character and radical zero-shot recognition and generalizes across datasets. Conclusion: By learning compositional latent components, CoLa improves zero-shot recognition and shows potential for cross-dataset generalization. Abstract: Humans can decompose Chinese characters into compositional components and recombine them to recognize unseen characters. This reflects two cognitive principles: Compositionality, the idea that complex concepts are built on simpler parts; and Learning-to-learn, the ability to learn strategies for decomposing and recombining components to form new concepts. These principles provide inductive biases that support efficient generalization. They are critical to Chinese character recognition (CCR) in solving the zero-shot problem, which results from the common long-tail distribution of Chinese character datasets. Existing methods have made substantial progress in modeling compositionality via predefined radical or stroke decomposition. However, they often ignore the learning-to-learn capability, limiting their ability to generalize beyond human-defined schemes. Inspired by these principles, we propose a deep latent variable model that learns Compositional Latent components of Chinese characters (CoLa) without relying on human-defined decomposition schemes. Recognition and matching can be performed by comparing compositional latent components in the latent space, enabling zero-shot character recognition. The experiments illustrate that CoLa outperforms previous methods in both character and radical zero-shot CCR. Visualization indicates that the learned components can reflect the structure of characters in an interpretable way. Moreover, despite being trained on historical documents, CoLa can analyze components of oracle bone characters, highlighting its cross-dataset generalization ability.

[82] ConText: Driving In-context Learning for Text Removal and Segmentation

Fei Zhang,Pei Zhang,Baosong Yang,Fei Huang,Yanfeng Wang,Ya Zhang

Main category: cs.CV

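TL;DR: This paper is the first to apply the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, proposing a task-chaining compositor and context-aware aggregation that significantly improve model performance.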

Details

Motivation: Existing V-ICL methods use direct prompts that confine the model to single-step reasoning, and visual heterogeneity complicates demonstration selection. Method: Proposes a task-chaining compositor (image-removal-segmentation) and context-aware aggregation to enrich intermediate reasoning, and uses a self-prompting strategy to handle visual heterogeneity. Result: The ConText model achieves new state-of-the-art results on multiple benchmarks. Conclusion: Task-chaining prompts and context-aware aggregation significantly improve visual in-context learning on text recognition tasks. Abstract: This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they turn to using a straightforward image-label compositor as the prompt and query input, and then masking the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model's in-context reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. Accordingly, this is effectively addressed through a simple self-prompting strategy, preventing the model's in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art across both in- and out-of-domain benchmarks. The code is available at https://github.com/Ferenas/ConText.

[83] Animal Pose Labeling Using General-Purpose Point Trackers

Zhuoyang Pan,Boxiao Pan,Guandao Yang,Adam W. Harley,Leonidas Guibas

Main category: cs.CV

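TL;DR: Proposes an animal pose labeling method based on test-time optimization that fine-tunes a pre-trained model on sparsely annotated frames for efficient automatic labeling.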

Details

Motivation: Existing methods are unreliable because of insufficient training data, and collecting comprehensive data is very challenging given the diversity of animal morphology. Method: Fine-tunes a lightweight appearance embedding inside a pre-trained general-purpose point tracker on sparsely annotated frames, then applies the model to the remaining frames for automatic labeling. Result: The method achieves state-of-the-art performance at a reasonable annotation cost. Conclusion: The pipeline is a valuable tool for automatic quantification of animal behavior. Abstract: Automatically estimating animal poses from videos is important for studying animal behaviors. Existing methods do not perform reliably since they are trained on datasets that are not comprehensive enough to capture all necessary animal behaviors. However, it is very challenging to collect such datasets due to the large variations in animal morphology. In this paper, we propose an animal pose labeling pipeline that follows a different strategy, i.e. test time optimization. Given a video, we fine-tune a lightweight appearance embedding inside a pre-trained general-purpose point tracker on a sparse set of annotated frames. These annotations can be obtained from human labelers or off-the-shelf pose detectors. The fine-tuned model is then applied to the rest of the frames for automatic labeling. Our method achieves state-of-the-art performance at a reasonable annotation cost. We believe our pipeline offers a valuable tool for the automatic quantification of animal behavior. Visit our project webpage at https://zhuoyang-pan.github.io/animal-labeling.

[84] JointSplat: Probabilistic Joint Flow-Depth Optimization for Sparse-View Gaussian Splatting

Yang Xiao,Guoan Xu,Qiang Wu,Wenjing Jia

Main category: cs.CV

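TL;DR: JointSplat proposes a probabilistic optimization framework that jointly uses optical flow and depth to address geometric inconsistency and noise in sparse-view 3D reconstruction, significantly improving reconstruction quality.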

Details

Motivation: In sparse-view 3D reconstruction, existing methods suffer from mislocation and artifacts in low-texture or repetitive regions, while joint flow-depth estimation is prone to local noise and global inconsistency when ground-truth flow supervision is unavailable. Method: JointSplat dynamically fuses optical flow and depth information through a probabilistic optimization mechanism and proposes a multi-view depth-consistency loss to suppress misleading gradients in uncertain regions. Result: On RealEstate10K and ACID, JointSplat outperforms existing methods, confirming its effectiveness and robustness for high-fidelity sparse-view 3D reconstruction. Conclusion: By jointly optimizing optical flow and depth, JointSplat significantly improves the quality and consistency of sparse-view 3D reconstruction. Abstract: Reconstructing 3D scenes from sparse viewpoints is a long-standing challenge with wide applications. Recent advances in feed-forward 3D Gaussian sparse-view reconstruction methods provide an efficient solution for real-time novel view synthesis by leveraging geometric priors learned from large-scale multi-view datasets and computing 3D Gaussian centers via back-projection. Despite offering strong geometric cues, both feed-forward multi-view depth estimation and flow-depth joint estimation face key limitations: the former suffers from mislocation and artifact issues in low-texture or repetitive regions, while the latter is prone to local noise and global inconsistency due to unreliable matches when ground-truth flow supervision is unavailable. To overcome this, we propose JointSplat, a unified framework that leverages the complementarity between optical flow and depth via a novel probabilistic optimization mechanism. Specifically, this pixel-level mechanism scales the information fusion between depth and flow based on the matching probability of optical flow during training. Building upon the above mechanism, we further propose a novel multi-view depth-consistency loss to leverage the reliability of supervision while suppressing misleading gradients in uncertain areas. Evaluated on RealEstate10K and ACID, JointSplat consistently outperforms state-of-the-art (SOTA) methods, demonstrating the effectiveness and robustness of our proposed probabilistic joint flow-depth optimization approach for high-fidelity sparse-view 3D reconstruction.

[85] Video, How Do Your Tokens Merge?

Sam Pollard,Michael Wray

Main category: cs.CV

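TL;DR: Video Transformer models need large amounts of compute because of the spatio-temporal scaling of their input; this paper studies training-free video token merging, achieving a speedup of about 2.5x while maintaining accuracy.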

Details

Motivation: To address the efficiency problem caused by the high compute demands of video Transformer models and to explore token merging for video understanding. Method: Applies training-free token merging and evaluates it on four video Transformers and three datasets. Result: A speedup of about 2.5x on models such as ViViT, with an average accuracy drop of only 0.55%. Conclusion: Video token merging is an efficient method that requires no retraining and suits complex video understanding tasks. Abstract: Video transformer models require huge amounts of compute resources due to the spatio-temporal scaling of the input. Tackling this, recent methods have proposed to drop or merge tokens for image models, whether randomly or via learned methods. Merging tokens has many benefits: it can be plugged into any vision transformer, does not require model re-training, and it propagates information that would otherwise be dropped through the model. Before now, video token merging has not been evaluated on temporally complex datasets for video understanding. In this work, we explore training-free token merging for video to provide comprehensive experiments and find best practices across four video transformers on three datasets that exhibit coarse and fine-grained action recognition. Our results showcase the benefits of video token merging with a speedup of around 2.5x while maintaining accuracy (avg. -0.55% for ViViT). Code available at https://github.com/sjpollard/video-how-do-your-tokens-merge.
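
To make the idea of training-free token merging concrete, here is a minimal sketch in the spirit of bipartite matching as used for image models: tokens are split into two alternating sets, each token in the first set is paired with its most similar token in the second, and the `r` most similar pairs are averaged. The abstract does not spell out the exact procedure, so the function names, the alternating split, and the plain averaging are assumptions rather than the paper's method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def merge_tokens(tokens, r):
    """Reduce the token count by up to r (needs at least two tokens)."""
    a_set, b_set = tokens[0::2], tokens[1::2]
    # For every token in A, find its most similar partner in B.
    matches = []
    for i, a in enumerate(a_set):
        sims = [cosine(a, b) for b in b_set]
        j = max(range(len(b_set)), key=sims.__getitem__)
        matches.append((sims[j], i, j))
    matches.sort(reverse=True)  # most similar pairs first
    merged = {i: j for _, i, j in matches[:min(r, len(a_set))]}
    # Average each merged A token into its B partner.
    sums = [list(b) for b in b_set]
    counts = [1] * len(b_set)
    for i, j in merged.items():
        sums[j] = [s + x for s, x in zip(sums[j], a_set[i])]
        counts[j] += 1
    kept_a = [a for i, a in enumerate(a_set) if i not in merged]
    merged_b = [[s / c for s in vec] for vec, c in zip(sums, counts)]
    return kept_a + merged_b  # note: original token order is not preserved

tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
fewer = merge_tokens(tokens, 2)  # four tokens reduced to two
```

No parameters are learned here, which is why such a step can in principle be dropped into an already trained transformer between layers. For video, the token list would span both space and time.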

[86] Joint Video Enhancement with Deblurring, Super-Resolution, and Frame Interpolation Network

Giyong Choi,HyunWook Park

Main category: cs.CV

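TL;DR: Proposes DSFN, a joint video enhancement method that addresses multiple degradation factors simultaneously and produces high-quality video directly from low-quality video, outperforming existing sequential approaches.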

Details

Motivation: Video quality is usually degraded by several factors together, and existing sequential enhancement methods are inefficient and sub-optimal. Method: The DSFN network combines a joint deblurring and super-resolution (JDSR) module with a triple-frame-based frame interpolation (TFBFI) module to handle multiple degradations simultaneously. Result: Experiments show that DSFN outperforms sequential methods on public datasets with a smaller network and faster processing. Conclusion: DSFN offers an efficient, high-performing solution for joint video enhancement. Abstract: Video quality is often severely degraded by multiple factors rather than a single factor. These low-quality videos can be restored to high-quality videos by sequentially performing appropriate video enhancement techniques. However, the sequential approach was inefficient and sub-optimal because most video enhancement approaches were designed without taking into account that multiple factors together degrade video quality. In this paper, we propose a new joint video enhancement method that mitigates multiple degradation factors simultaneously by resolving an integrated enhancement problem. Our proposed network, named DSFN, directly produces a high-resolution, high-frame-rate, and clear video from a low-resolution, low-frame-rate, and blurry video. In the DSFN, low-resolution and blurry input frames are enhanced by a joint deblurring and super-resolution (JDSR) module. Meanwhile, intermediate frames between input adjacent frames are interpolated by a triple-frame-based frame interpolation (TFBFI) module. The proper combination of the proposed modules of DSFN can achieve superior performance on the joint video enhancement task. Experimental results show that the proposed method outperforms other sequential state-of-the-art techniques on public datasets with a smaller network size and faster processing time.

[87] Learning from Noise: Enhancing DNNs for Event-Based Vision through Controlled Noise Injection

Marcin Kowalczyk,Kamil Jeziorek,Tomasz Kryjak

Main category: cs.CV

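TL;DR: Proposes a novel noise-injection training method that makes neural networks more robust to noise in event data and outperforms traditional filtering.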

Details

Motivation: Event data is often noisy, and traditional filtering may discard useful information, so a more robust approach is needed. Method: Injects controlled noise into the training data so that models learn noise-resilient representations, and evaluates several network architectures. Result: Stable performance across multiple datasets and noise levels, with the highest average classification accuracy, outperforming traditional filtering. Conclusion: Noise-injection training is a viable alternative to traditional filtering in event-data classification systems. Abstract: Event-based sensors offer significant advantages over traditional frame-based cameras, especially in scenarios involving rapid motion or challenging lighting conditions. However, event data frequently suffers from considerable noise, negatively impacting the performance and robustness of deep learning models. Traditionally, this problem has been addressed by applying filtering algorithms to the event stream, but this may also remove some of the relevant data. In this paper, we propose a novel noise-injection training methodology designed to enhance neural networks' robustness against varying levels of event noise. Our approach introduces controlled noise directly into the training data, enabling models to learn noise-resilient representations. We have conducted extensive evaluations of the proposed method using multiple benchmark datasets (N-Caltech101, N-Cars, and Mini N-ImageNet) and various network architectures, including Convolutional Neural Networks, Vision Transformers, Spiking Neural Networks, and Graph Convolutional Networks. Experimental results show that our noise-injection training strategy achieves stable performance over a range of noise intensities, consistently outperforms event-filtering techniques, and achieves the highest average classification accuracy, making it a viable alternative to traditional event-data filtering methods in an object classification system. Code: https://github.com/vision-agh/DVS_Filtering
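
As a rough illustration of injecting controlled noise into event training data, the sketch below adds spurious events to a stream of `(x, y, t, polarity)` tuples. The abstract does not specify the noise model, so the uniform distribution over pixels, time, and polarity, and the `noise_ratio` parameter, are assumptions for illustration only.

```python
import random

def inject_noise(events, noise_ratio, width, height, rng=None):
    """Return the event stream with extra uniformly random noise events.

    events: list of (x, y, t, polarity) tuples with polarity in {-1, 1}.
    noise_ratio: number of noise events relative to the real ones
                 (a hypothetical control knob for the noise intensity).
    """
    rng = rng or random.Random()
    if not events:
        return []
    t_min = min(e[2] for e in events)
    t_max = max(e[2] for e in events)
    n_noise = int(len(events) * noise_ratio)
    noise = [
        (rng.randrange(width), rng.randrange(height),
         rng.uniform(t_min, t_max), rng.choice((-1, 1)))
        for _ in range(n_noise)
    ]
    # Keep the stream time-ordered, as event pipelines usually expect.
    return sorted(events + noise, key=lambda e: e[2])

clean = [(i, i, float(i), 1) for i in range(10)]
noisy = inject_noise(clean, 0.5, width=32, height=32, rng=random.Random(0))
```

During training, `noise_ratio` could be sampled per batch so that the model sees a range of noise intensities, in contrast to filtering, which removes events before the model ever sees them.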

[88] Multiple Stochastic Prompt Tuning for Practical Cross-Domain Few Shot Learning

Debarshi Brahma,Soma Biswas

Main category: cs.CV

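TL;DR: Proposes pCDFSL, a practical cross-domain few-shot learning task in which pre-trained models such as CLIP classify unseen classes on a target dataset under extreme domain shift using only a few labeled samples. Also proposes the MIST framework, which uses multiple stochastic prompts to handle domain and semantic shifts and models prompt weights as Gaussian distributions to reduce overfitting.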

Details

Motivation: Existing CDFSL frameworks rely on artificially created training and testing regimes that do not suit real applications; the pCDFSL task aims to address this, making the problem more challenging and practical. Method: Proposes the MIST framework, which learns multiple stochastic prompts per class to capture multiple peaks in the input data and models the prompt weights as learnable Gaussian distributions to reduce overfitting. Result: MIST outperforms existing methods on four CDFSL benchmarks. Conclusion: MIST effectively addresses domain shift and overfitting in cross-domain few-shot learning and is suited to practical applications. Abstract: In this work, we propose a practical cross-domain few-shot learning (pCDFSL) task, where a large-scale pre-trained model like CLIP can be easily deployed on a target dataset. The goal is to simultaneously classify all unseen classes under extreme domain shifts, by utilizing only a few labeled samples per class. The pCDFSL paradigm is source-free and moves beyond artificially created episodic training and testing regimes followed by existing CDFSL frameworks, making it more challenging and relevant to real-world applications. Towards that goal, we propose a novel framework, termed MIST (MultIple STochastic Prompt tuning), where multiple stochastic prompts are utilized to handle significant domain and semantic shifts. Specifically, multiple prompts are learnt for each class, effectively capturing multiple peaks in the input data. Furthermore, instead of representing the weights of the multiple prompts as point-estimates, we model them as learnable Gaussian distributions with two different strategies, encouraging an efficient exploration of the prompt parameter space, which mitigates overfitting due to the few labeled training samples. Extensive experiments and comparison with the state-of-the-art methods on four CDFSL benchmarks adapted to this setting, show the effectiveness of the proposed framework.

[89] Vision Remember: Alleviating Visual Forgetting in Efficient MLLM with Vision Feature Resample

Ze Feng,Jiang-Jiang Liu,Sen Yang,Lingyu Xiao,Xiaofan Li,Wankou Yang,Jingdong Wang

Main category: cs.CV

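TL;DR: Proposes Vision Remember, a module inserted between LLM decoder layers that lets vision tokens re-memorize vision features, addressing visual information loss and improving performance on fine-grained tasks.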

Details

Motivation: Redundant vision tokens consume substantial computational resources, but naive compression loses visual information, especially for tasks that depend on fine-grained spatial relationships. Method: Retains multi-level vision features and resamples them with a local attention mechanism, strengthening fine-grained information and spatial relationships. Result: Performs well on multiple visual understanding benchmarks; combined with efficient vision projectors, it improves performance without sacrificing efficiency. Conclusion: Vision Remember effectively addresses the loss of visual information, and LLaVA-VR outperforms other MLLMs with fewer parameters. Abstract: In this work, we study the Efficient Multimodal Large Language Model. Redundant vision tokens consume a significant amount of computational memory and resources. Therefore, many previous works compress them in the Vision Projector to reduce the number of vision tokens. However, simply compressing in the Vision Projector can lead to the loss of visual information, especially for tasks that rely on fine-grained spatial relationships, such as OCR and Chart & Table Understanding. To address this problem, we propose Vision Remember, which is inserted between the LLM decoder layers to allow vision tokens to re-memorize vision features. Specifically, we retain multi-level vision features and resample them with the vision tokens that have interacted with the text token. During the resampling process, each vision token only attends to a local region in vision features, which is referred to as saliency-enhancing local attention. Saliency-enhancing local attention not only improves computational efficiency but also captures more fine-grained contextual information and spatial relationships within the region. Comprehensive experiments on multiple visual understanding benchmarks validate the effectiveness of our method when combined with various Efficient Vision Projectors, showing performance gains without sacrificing efficiency. Based on Vision Remember, LLaVA-VR with only 2B parameters is also superior to previous representative MLLMs such as Tokenpacker-HD-7B and DeepSeek-VL-7B.

[90] DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models

Jia Fu,Yongtao Wu,Yihang Chen,Kunyu Peng,Xiao Zhang,Volkan Cevher,Sepideh Pashami,Anders Holst

Main category: cs.CV

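TL;DR: DiffCAP is a diffusion-based purification strategy that effectively neutralizes adversarial perturbations in vision language models (VLMs) and substantially outperforms existing defense techniques.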

Details

Motivation: Although VLMs excel at multimodal understanding, their sensitivity to perturbations threatens their reliability in real-world applications. Method: DiffCAP cumulatively injects Gaussian noise until the image embedding stabilizes, then denoises with a pretrained diffusion model to recover a clean representation. Result: Across multiple datasets, VLMs, and attack strengths, DiffCAP outperforms existing defense techniques while reducing hyperparameter tuning and denoising time. Conclusion: DiffCAP offers a robust and practical solution for securely deploying VLMs in adversarial environments. Abstract: Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to perturbations poses a significant threat to their reliability in real-world applications. Despite often being imperceptible to humans, these perturbations can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs. We observe that adding minimal noise to an adversarially corrupted image significantly alters its latent embedding with respect to VLMs. Building on this insight, DiffCAP cumulatively injects random Gaussian noise into adversarially perturbed input data. This process continues until the embeddings of two consecutive noisy images reach a predefined similarity threshold, indicating a potential approach to neutralize the adversarial effect. Subsequently, a pretrained diffusion model is employed to denoise the stabilized image, recovering a clean representation suitable for the VLMs to produce an output. Through extensive experiments across six datasets with three VLMs under varying attack strengths in three task scenarios, we show that DiffCAP consistently outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Equipped with strong theoretical and empirical support, DiffCAP provides a robust and practical solution for securely deploying VLMs in adversarial environments.
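
The stopping rule described in the DiffCAP abstract (keep adding Gaussian noise until two consecutive noisy images have sufficiently similar embeddings, then denoise) can be sketched as follows. This is a minimal illustration and not the authors' implementation: `purify`, `embed`, `denoise`, and the default `threshold`, `sigma`, and `max_steps` values are hypothetical names and settings, cosine similarity is an assumed choice of similarity measure, and images are flat lists of floats for simplicity.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def purify(image, embed, denoise, threshold=0.95, sigma=0.1, max_steps=50, seed=0):
    """Inject Gaussian noise cumulatively until consecutive embeddings agree.

    `embed` stands in for the VLM image encoder and `denoise` for a pretrained
    diffusion model; both are caller-supplied placeholders (assumptions).
    Returns the denoised image and the number of noise injections performed.
    """
    rng = random.Random(seed)

    def add_noise(pixels):
        return [p + rng.gauss(0.0, sigma) for p in pixels]

    noisy = add_noise(image)          # first noisy image
    prev_emb = embed(noisy)
    steps = 1
    while steps < max_steps:
        noisy = add_noise(noisy)      # noise accumulates on the previous result
        steps += 1
        cur_emb = embed(noisy)
        if cosine(prev_emb, cur_emb) >= threshold:
            break                     # embeddings of consecutive images agree
        prev_emb = cur_emb
    return denoise(noisy), steps
```

With a real encoder and diffusion model plugged in, the returned step count indicates how much noise was needed before the embedding stabilized; `max_steps` is only a safety cap added for this sketch.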

[91] Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation

Theodore Barfoot,Luis C. Garcia-Peraza-Herrera,Samet Akcay,Ben Glocker,Tom Vercauteren

Main category: cs.CV

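TL;DR: Proposes differentiable mL1-ACE loss functions that reduce calibration error in medical image segmentation while maintaining segmentation performance.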

Details

Motivation: Addresses the overconfidence of deep neural networks in medical image segmentation to improve their reliability and clinical utility. Method: Proposes hard-binned and soft-binned mL1-ACE loss functions and validates them on four datasets. Result: mL1-ACE significantly reduces calibration errors (ACE and MCE) while keeping DSC high. The soft-binned variant calibrates best but may compromise segmentation performance. Conclusion: The approach makes segmentation predictions more trustworthy and supports the safe clinical use of deep learning. Abstract: Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. In this work, we propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We compare both hard- and soft-binning approaches to directly improve pixel-wise calibration. Our experiments on four datasets (ACDC, AMOS, KiTS, BraTS) demonstrate that incorporating mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). We find that the soft-binned variant yields the greatest improvements in calibration over the Dice plus cross-entropy loss baseline but often compromises segmentation performance, with hard-binned mL1-ACE maintaining segmentation performance, albeit with weaker calibration improvement. To gain further insight into calibration performance and its variability across an imaging dataset, we introduce dataset reliability histograms, an aggregation of per-image reliability diagrams. The resulting analysis highlights improved alignment between predicted confidences and true accuracies. Overall, our approach not only enhances the trustworthiness of segmentation predictions but also shows potential for safer integration of deep learning methods into clinical workflows. We share our code here: https://github.com/cai4cai/Average-Calibration-Losses
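
To make the ACE and MCE metrics named in the abstract concrete, the sketch below computes hard-binned calibration gaps for a single class from per-pixel probabilities. It assumes the common definitions (ACE as the unweighted mean of the per-bin gap between mean confidence and observed frequency over non-empty bins, MCE as the largest such gap); the function name and the equal-width binning are illustrative choices, and this is the plain metric, not the authors' differentiable loss.

```python
def binned_calibration_errors(probs, labels, n_bins=10):
    """Return (ACE, MCE) for one class using equal-width hard bins.

    probs  : predicted probabilities in [0, 1], one per pixel
    labels : 0/1 ground truth, one per pixel
    """
    sum_conf = [0.0] * n_bins   # summed predicted probability per bin
    sum_pos = [0.0] * n_bins    # summed ground-truth positives per bin
    count = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        sum_conf[b] += p
        sum_pos[b] += y
        count[b] += 1
    # Gap between mean confidence and observed frequency, non-empty bins only.
    gaps = [
        abs(sum_conf[b] / count[b] - sum_pos[b] / count[b])
        for b in range(n_bins)
        if count[b]
    ]
    if not gaps:
        return 0.0, 0.0
    return sum(gaps) / len(gaps), max(gaps)
```

For a multi-class segmentation map, one plausible reading of "marginal" is to apply this per class and average the results; that aggregation is an assumption here rather than something the abstract spells out.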

[92] MS-YOLO: A Multi-Scale Model for Accurate and Efficient Blood Cell Detection

Guohua Wu,Shengqi Chen,Pengchao Deng,Wenting Yu

Main category: cs.CV

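TL;DR: MS-YOLO is a multi-scale blood cell detection model based on YOLOv11. Three key modules improve its detection performance, and experiments show strong results on overlapping cells and multi-scale objects.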

Details

Motivation: Manual microscopy is slow and inaccurate, existing automated detection is costly and not accurate enough, and although deep learning has introduced new paradigms, detecting overlapping cells and multi-scale objects remains challenging. Method: Proposes MS-YOLO, comprising a multi-scale dilated residual module (MS-DRM), a dynamic cross-path feature enhancement module (DCFEM), and a light adaptive-weight downsampling module (LADS). Result: On the CBC benchmark, MS-YOLO reaches an mAP@50 of 97.4%, outperforming existing models, and its generalization is confirmed on the WBCDD dataset. Conclusion: With a lightweight architecture and real-time inference, MS-YOLO meets clinical deployment requirements and provides reliable technical support for standardized blood pathology assessment. Abstract: Complete blood cell detection holds significant value in clinical diagnostics. Conventional manual microscopy methods suffer from time inefficiency and diagnostic inaccuracies. Existing automated detection approaches remain constrained by high deployment costs and suboptimal accuracy. While deep learning has introduced powerful paradigms to this field, persistent challenges in detecting overlapping cells and multi-scale objects hinder practical deployment. This study proposes the multi-scale YOLO (MS-YOLO), a blood cell detection model based on the YOLOv11 framework, incorporating three key architectural innovations to enhance detection performance. Specifically, the multi-scale dilated residual module (MS-DRM) replaces the original C3K2 modules to improve multi-scale discriminability; the dynamic cross-path feature enhancement module (DCFEM) enables the fusion of hierarchical features from the backbone with aggregated features from the neck to enhance feature representations; and the light adaptive-weight downsampling module (LADS) improves feature downsampling through adaptive spatial weighting while reducing computational complexity. Experimental results on the CBC benchmark demonstrate that MS-YOLO achieves precise detection of overlapping cells and multi-scale objects, particularly small targets such as platelets, achieving an mAP@50 of 97.4% that outperforms existing models. Further validation on the supplementary WBCDD dataset confirms its robust generalization capability. Additionally, with a lightweight architecture and real-time inference efficiency, MS-YOLO meets clinical deployment requirements, providing reliable technical support for standardized blood pathology assessment.

[93] RAID: A Dataset for Testing the Adversarial Robustness of AI-Generated Image Detectors

Hicham Eddoubi,Jonas Ricker,Federico Cocchi,Lorenzo Baraldi,Angelo Sotgiu,Maura Pintor,Marcella Cornia,Lorenzo Baraldi,Asja Fischer,Rita Cucchiara,Battista Biggio

Main category: cs.CV

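TL;DR: Proposes a method for evaluating the robustness of AI-generated image detectors and releases RAID, a dataset of 72k adversarial examples. Experiments show that existing detectors are vulnerable to adversarial attacks.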

Details

Motivation: AI-generated images are of such high quality that humans cannot reliably tell them apart from real ones, so detecting them is an important challenge, yet existing methods are rarely evaluated for adversarial robustness. Method: Builds the RAID dataset by attacking an ensemble of seven state-of-the-art detectors on images from four text-to-image models, producing diverse and transferable adversarial examples. Result: The adversarial examples fool unseen detectors with a high success rate, showing that existing detectors lack robustness. Conclusion: More robust detection methods are needed; the RAID dataset and code are publicly released. Abstract: AI-generated images have reached a quality level at which humans are incapable of reliably distinguishing them from real images. To counteract the inherent risk of fraud and disinformation, the detection of AI-generated images is a pressing challenge and an active research topic. While many of the presented methods claim to achieve high detection accuracy, they are usually evaluated under idealized conditions. In particular, the adversarial robustness is often neglected, potentially due to a lack of awareness or the substantial effort required to conduct a comprehensive robustness analysis. In this work, we tackle this problem by providing a simpler means to assess the robustness of AI-generated image detectors. We present RAID (Robust evaluation of AI-generated image Detectors), a dataset of 72k diverse and highly transferable adversarial examples. The dataset is created by running attacks against an ensemble of seven state-of-the-art detectors and images generated by four different text-to-image models. Extensive experiments show that our methodology generates adversarial images that transfer with a high success rate to unseen detectors, which can be used to quickly provide an approximate yet still reliable estimate of a detector's adversarial robustness. Our findings indicate that current state-of-the-art AI-generated image detectors can be easily deceived by adversarial examples, highlighting the critical need for the development of more robust methods. We release our dataset at https://huggingface.co/datasets/aimagelab/RAID and evaluation code at https://github.com/pralab/RAID.

[94] Vocabulary-free few-shot learning for Vision-Language Models

Maxime Zanella,Clément Fuchs,Ismail Ben Ayed,Christophe De Vleeschouwer

Main category: cs.CV

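TL;DR: Proposes SiM, a vocabulary-free few-shot learning method that classifies target instances through similarity mapping, removing the dependence on predefined class names.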

Details

Motivation: Existing methods depend on predefined class names, which limits their applicability, especially when class names are unavailable or hard to specify. Method: Proposes Similarity Mapping (SiM), which classifies target instances solely from their similarity scores with generic prompts, without carefully designed prompts. Result: SiM performs strongly, is computationally efficient (learning the mapping typically takes less than one second), and is interpretable. Conclusion: SiM provides an important baseline for vocabulary-free few-shot learning on which future research can build. Abstract: Recent advances in few-shot adaptation for Vision-Language Models (VLMs) have greatly expanded their ability to generalize across tasks using only a few labeled examples. However, existing approaches primarily build upon the strong zero-shot priors of these models by leveraging carefully designed, task-specific prompts. This dependence on predefined class names can restrict their applicability, especially in scenarios where exact class names are unavailable or difficult to specify. To address this limitation, we introduce vocabulary-free few-shot learning for VLMs, a setting where target class instances - that is, images - are available but their corresponding names are not. We propose Similarity Mapping (SiM), a simple yet effective baseline that classifies target instances solely based on similarity scores with a set of generic prompts (textual or visual), eliminating the need for carefully handcrafted prompts. Although conceptually straightforward, SiM demonstrates strong performance, operates with high computational efficiency (learning the mapping typically takes less than one second), and provides interpretability by linking target classes to generic prompts. We believe that our approach could serve as an important baseline for future research in vocabulary-free few-shot learning. Code is available at https://github.com/MaxZanella/vocabulary-free-FSL.
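
The idea of classifying from similarity scores against generic prompts, without class names, can be illustrated with a small stand-in. The sketch below represents each image by its vector of dot-product similarities to a set of generic prompt embeddings and then uses a nearest-centroid rule over the few labeled shots. The nearest-centroid rule is a deliberately simple substitute chosen for this illustration; the abstract does not specify the form of the learned mapping, and all function names and toy embeddings are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def similarity_scores(image_emb, prompt_embs):
    """Describe an image by its similarity to each generic prompt."""
    return [dot(image_emb, p) for p in prompt_embs]


def fit_centroids(score_vectors, labels):
    """Average the similarity vectors of the labeled shots per class id."""
    sums, counts = {}, {}
    for scores, y in zip(score_vectors, labels):
        acc = sums.setdefault(y, [0.0] * len(scores))
        for i, v in enumerate(scores):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(scores, centroids):
    """Assign the class whose centroid is closest in similarity space."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(scores, centroids[y])),
    )


# Toy usage: two generic prompts, two anonymous classes (ids 0 and 1).
prompts = [[1.0, 0.0], [0.0, 1.0]]
shots = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
shot_labels = [0, 0, 1, 1]
centroids = fit_centroids(
    [similarity_scores(s, prompts) for s in shots], shot_labels
)
query_class = predict(similarity_scores([0.7, 0.3], prompts), centroids)
```

Class ids are plain integers here because, in the vocabulary-free setting, no class names are available. The centroid for each class also shows which generic prompts that class resembles most, which mirrors the interpretability the abstract mentions.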

[95] Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

Qing Jiang,Xingyu Chen,Zhaoyang Zeng,Junzhi Yu,Lei Zhang

Main category: cs.CV

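TL;DR: Rex-Thinker recasts object referring as an explicit CoT reasoning task, verifying step by step whether each candidate object matches the description, which improves the model's interpretability and trustworthiness.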

Details

Motivation: Existing object referring models lack interpretability and the ability to reject expressions with no matching object; Rex-Thinker addresses both through explicit reasoning. Method: The model works in two steps: 1) identify candidate objects; 2) reason step by step over each candidate to verify it. Training combines supervised fine-tuning with GRPO reinforcement learning. Result: Rex-Thinker outperforms baselines in precision and interpretability and is better at rejecting descriptions with no match. Conclusion: Through explicit reasoning, Rex-Thinker improves the interpretability and trustworthiness of object referring and applies to a range of scenarios. Abstract: Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
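
The control flow described in the Rex-Thinker abstract (enumerate candidates of the referred category, verify each one against the expression, and abstain when none passes) can be sketched as below. Only the control flow is illustrated: in the paper the verification is the model's own chain-of-thought, whereas here `verify` is a caller-supplied placeholder, and the candidate format and function names are hypothetical.

```python
def refer(candidates, verify):
    """Keep candidates that pass verification; an empty list means abstain.

    candidates : list of dicts with a "box" and arbitrary attributes
    verify     : callable returning (matches: bool, rationale: str)
    """
    accepted = []
    for cand in candidates:
        matches, rationale = verify(cand)
        if matches:
            # Keep the rationale so each prediction stays traceable.
            accepted.append({"box": cand["box"], "why": rationale})
    return accepted


# Toy usage for the expression "person wearing a red hat".
people = [
    {"box": (10, 20, 50, 120), "hat": "red"},
    {"box": (80, 25, 130, 125), "hat": None},
]


def wears_red_hat(cand):
    if cand["hat"] == "red":
        return True, "candidate wears a red hat"
    return False, "no red hat visible"


matches = refer(people, wears_red_hat)
no_match = refer([people[1]], wears_red_hat)   # abstains: returns []
```

Returning an empty list rather than forcing a best guess corresponds to the "Trustworthy" property in the abstract, and carrying the rationale along with each box corresponds to the "Verifiable" one.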

[96] Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization

Jiulong Wu,Zhengliang Shi,Shuaiqiang Wang,Jizhou Huang,Dawei Yin,Lingyong Yan,Min Cao,Min Zhang

Main category: cs.CV

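TL;DR: Proposes EMPO, an entity-centric multimodal preference optimization method that addresses hallucination in large vision-language models and markedly lowers hallucination rates.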

Details

Motivation: Large vision-language models (LVLMs) perform well across many tasks but hallucinate heavily, mainly because of modality misalignment and the inherent hallucinations of the underlying large language models (LLMs). Existing preference alignment methods neglect image-text modality alignment, leading to over-reliance on LLMs and hallucinations. Method: Proposes Entity-centric Multimodal Preference Optimization (EMPO), which strengthens modality alignment by automatically constructing high-quality multimodal preference data covering image, instruction, and response. Result: On two human preference datasets and five multimodal hallucination benchmarks, EMPO significantly lowers hallucination rates, for example by 85.9% on Object-HalBench and 49.8% on MM-HalBench. Conclusion: By improving modality alignment, EMPO effectively reduces hallucination in LVLMs and offers a new direction for making multimodal models more reliable. Abstract: Large Visual Language Models (LVLMs) have demonstrated impressive capabilities across multiple tasks. However, their trustworthiness is often challenged by hallucinations, which can be attributed to the modality misalignment and the inherent hallucinations of their underlying Large Language Model (LLM) backbone. Existing preference alignment methods focus on aligning model responses with human preferences while neglecting image-text modality alignment, resulting in over-reliance on LLMs and hallucinations. In this paper, we propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves better modality alignment than existing human preference alignment methods. Besides, to overcome the scarcity of high-quality multimodal preference data, we utilize open-source instruction datasets to automatically construct high-quality preference data across three aspects: image, instruction, and response. Experiments on two human preference datasets and five multimodal hallucination benchmarks demonstrate the effectiveness of EMPO, e.g., reducing hallucination rates by 85.9% on Object-HalBench and 49.8% on MM-HalBench.

[97] EV-Flying: an Event-based Dataset for In-The-Wild Recognition of Flying Objects

Gabriele Magrini,Federico Becattini,Giovanni Colombo,Pietro Pala

Main category: cs.CV

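TL;DR: Presents EV-Flying, an event-camera dataset, together with a point-based method for processing event streams, to detect and recognize flying objects and overcome the limitations of traditional RGB approaches.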

Details

Motivation: Traditional RGB methods struggle to detect flying objects because of scale variation, motion blur, and high-speed movement, especially for small objects such as insects and drones. The high temporal resolution and low latency of event cameras make them better suited to this task. Method: Processes asynchronous event streams with a point-based approach, using lightweight PointNet-inspired architectures to classify flying objects. Result: Introduces the EV-Flying dataset of annotated birds, insects, and drones and validates point cloud-based event representations. Conclusion: The EV-Flying dataset and method open a path to efficient and reliable flying-object recognition in real-world scenarios. Abstract: Monitoring aerial objects is crucial for security, wildlife conservation, and environmental studies. Traditional RGB-based approaches struggle with challenges such as scale variations, motion blur, and high-speed object movements, especially for small flying entities like insects and drones. In this work, we explore the potential of event-based vision for detecting and recognizing flying objects, in particular animals that may not follow short- and long-term predictable patterns. Event cameras offer high temporal resolution, low latency, and robustness to motion blur, making them well-suited for this task. We introduce EV-Flying, an event-based dataset of flying objects, comprising manually annotated birds, insects and drones with spatio-temporal bounding boxes and track identities. To effectively process the asynchronous event streams, we employ a point-based approach leveraging lightweight architectures inspired by PointNet. Our study investigates the classification of flying objects using point cloud-based event representations. The proposed dataset and methodology pave the way for more efficient and reliable aerial object recognition in real-world scenarios.

[98] Video Deblurring with Deconvolution and Aggregation Networks

Giyong Choi,HyunWook Park

Main category: cs.CV

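TL;DR: Proposes a deconvolution and aggregation network (DAN) for video deblurring whose three sub-networks (PPN, ABDN, FAN) make effective use of neighboring-frame information and significantly improve performance.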

Details

Motivation: Existing video deblurring algorithms do not make full use of neighboring-frame information, which leads to poor performance. Method: DAN comprises a preprocessing network (PPN), an alignment-based deconvolution network (ABDN), and a frame aggregation network (FAN), responsible for preprocessing, deconvolution, and aggregating neighboring-frame information, respectively. Result: Experiments show that DAN outperforms existing methods in both quantitative and qualitative evaluations on public datasets. Conclusion: By combining the three sub-networks appropriately, DAN makes effective use of neighboring frames and significantly improves video deblurring. Abstract: In contrast to single-image deblurring, video deblurring has the advantage that neighbor frames can be utilized to deblur a target frame. However, existing video deblurring algorithms often fail to properly employ the neighbor frames, resulting in sub-optimal performance. In this paper, we propose a deconvolution and aggregation network (DAN) for video deblurring that utilizes the information of neighbor frames well. In DAN, both deconvolution and aggregation strategies are achieved through three sub-networks: the preprocessing network (PPN) and the alignment-based deconvolution network (ABDN) for the deconvolution scheme; the frame aggregation network (FAN) for the aggregation scheme. In the deconvolution part, blurry inputs are first preprocessed by the PPN with non-local operations. Then, the output frames from the PPN are deblurred by the ABDN based on the frame alignment. In the FAN, these deblurred frames from the deconvolution part are combined into a latent frame according to reliability maps which infer pixel-wise sharpness. The proper combination of three sub-networks can achieve favorable performance on video deblurring by using the neighbor frames suitably. In experiments, the proposed DAN was demonstrated to be superior to existing state-of-the-art methods through both quantitative and qualitative evaluations on the public datasets.
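
The aggregation step in the abstract, where deblurred frames are combined into a latent frame according to pixel-wise reliability maps, can be illustrated with a normalized weighted average. This is an assumed form of the combination for illustration only: in the paper the FAN is a learned network, and the function name, flat-list frame layout, and `eps` guard are hypothetical.

```python
def aggregate(frames, reliability, eps=1e-8):
    """Fuse frames pixel by pixel, weighting each by its reliability.

    frames      : list of frames, each a flat list of pixel intensities
    reliability : list of maps with the same shape, larger means sharper
    """
    n_pixels = len(frames[0])
    fused = []
    for i in range(n_pixels):
        weight_sum = sum(r[i] for r in reliability)
        weighted = sum(f[i] * r[i] for f, r in zip(frames, reliability))
        fused.append(weighted / (weight_sum + eps))  # eps avoids division by zero
    return fused


# Toy usage: the second frame is fully trusted at pixel 1, both equally at pixel 0.
latent = aggregate(
    frames=[[1.0, 1.0], [3.0, 3.0]],
    reliability=[[1.0, 0.0], [1.0, 1.0]],
)
```

At a pixel where only one frame is judged sharp, the fused value follows that frame; where several are equally reliable, it is their mean.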

[99] Point Cloud Quality Assessment Using the Perceptual Clustering Weighted Graph (PCW-Graph) and Attention Fusion Network

Abdelouahed Laazoufi,Mohammed El Hassouni,Hocine Cherifi

Main category: cs.CV

TL;DR: No-reference point cloud quality assessment (NR-PCQA) is critical in real-world applications where reference models are unavailable.

Details

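Motivation: Reference models are often unavailable when assessing the quality of 3D content, so no-reference methods are needed. Method: No specific method is described. Result: No specific results are reported. Conclusion: NR-PCQA matters for assessing 3D content quality when no reference is available. Abstract: No-Reference Point Cloud Quality Assessment (NR-PCQA) is critical for evaluating 3D content in real-world applications where reference models are unavailable.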

[100] GlobalBuildingAtlas: An Open Global and Complete Dataset of Building Polygons, Heights and LoD1 3D Models

Xiao Xiang Zhu,Sining Chen,Fahong Zhang,Yilei Shi,Yuanyuan Wang

Main category: cs.CV

TL;DR: GlobalBuildingAtlas is an open dataset of global building polygons, heights, and LoD1 3D building models, with complete coverage and high quality.

Details

Motivation: Fills the gap in high-quality, consistent, and complete building data at global scale, supporting high-resolution building analysis. Method: Develops machine learning-based pipelines to extract building polygons and heights from satellite data, and applies a quality-based fusion strategy to refine the polygon data. Result: The dataset contains 2.75 billion buildings; its polygon data exceeds the most comprehensive existing database by 1 billion buildings, and its height data reaches a resolution of 3x3 meters. Conclusion: GlobalBuildingAtlas offers new insights into the state of buildings worldwide and supports monitoring of the United Nations Sustainable Development Goals. Abstract: We introduce GlobalBuildingAtlas, a publicly available dataset providing global and complete coverage of building polygons, heights and Level of Detail 1 (LoD1) 3D building models. This is the first open dataset to offer high quality, consistent, and complete building data in 2D and 3D form at the individual building level on a global scale. Towards this dataset, we developed machine learning-based pipelines to derive building polygons and heights (called GBA.Height) from global PlanetScope satellite data. In addition, a quality-based fusion strategy was employed to generate higher-quality polygons (called GBA.Polygon) based on existing open building polygons, including our own derived one. With more than 2.75 billion buildings worldwide, GBA.Polygon surpasses the most comprehensive database to date by more than 1 billion buildings. GBA.Height offers the most detailed and accurate global 3D building height maps to date, achieving a spatial resolution of 3x3 meters, 30 times finer than previous global products (90 m), enabling a high-resolution and reliable analysis of building volumes at both local and global scales. Finally, we generated a global LoD1 building model (called GBA.LoD1) from the resulting GBA.Polygon and GBA.Height. GBA.LoD1 represents the first complete global LoD1 building models, including 2.68 billion building instances with predicted heights, i.e., with a height completeness of more than 97%, achieving RMSEs ranging from 1.5 m to 8.9 m across different continents. With its height accuracy, comprehensive global coverage and rich spatial details, GlobalBuildingAtlas offers novel insights on the status quo of global buildings, which unlocks unprecedented geospatial analysis possibilities, as showcased by a better illustration of where people live and a more comprehensive monitoring of the progress on the 11th Sustainable Development Goal of the United Nations.
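
The two derived figures quoted in the GlobalBuildingAtlas abstract, the height completeness above 97% and the 30-fold finer resolution, follow from the reported counts with simple arithmetic. The short check below uses the rounded figures from the abstract (2.75 billion is stated as a lower bound, so the ratio is approximate), and the variable names are illustrative.

```python
# Figures as reported in the abstract (rounded).
total_buildings = 2.75e9        # buildings in GBA.Polygon ("more than")
buildings_with_height = 2.68e9  # instances with predicted heights in GBA.LoD1

# Share of buildings that have a predicted height.
height_completeness = buildings_with_height / total_buildings

# Linear resolution ratio between previous products (90 m) and GBA.Height (3 m).
resolution_ratio = 90 / 3

print(f"height completeness: {height_completeness:.1%}")
print(f"resolution ratio: {resolution_ratio:.0f}x")
```

The ratio of 2.68 to 2.75 is roughly 0.9745, consistent with the stated completeness of more than 97%, and 90 m divided by 3 m gives the factor of 30 along each axis.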

[101] Multi-view Surface Reconstruction Using Normal and Reflectance Cues

Robin Bruneau,Baptiste Brument,Yvain Quéau,Jean Mélou,François Bernard Lauze,Jean-Denis Durou,Lilian Calvet

Main category: cs.CV

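TL;DR: Proposes a framework that incorporates multi-view normal and reflectance maps for high-fidelity 3D surface reconstruction, performing especially well with complex reflective materials and sparse views.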

Details

Motivation: Addresses the challenge of high-fidelity 3D surface reconstruction with complex reflective materials and sparse views. Method: Uses a pixel-wise joint re-parametrization of reflectance and surface normals, representing them as a vector of radiances under simulated illumination; the formulation is compatible with both traditional and modern reconstruction frameworks. Result: Achieves state-of-the-art performance on multi-view photometric stereo benchmarks, especially in fine-detail reconstruction and under challenging visibility conditions. Conclusion: The method excels in detail reconstruction and robustness; the paper extends an earlier conference paper, and the code and data are publicly available. Abstract: Achieving high-fidelity 3D surface reconstruction while preserving fine details remains challenging, especially in the presence of materials with complex reflectance properties and without a dense-view setup. In this paper, we introduce a versatile framework that incorporates multi-view normal and optionally reflectance maps into radiance-based surface reconstruction. Our approach employs a pixel-wise joint re-parametrization of reflectance and surface normals, representing them as a vector of radiances under simulated, varying illumination. This formulation enables seamless incorporation into standard surface reconstruction pipelines, such as traditional multi-view stereo (MVS) frameworks or modern neural volume rendering (NVR) ones. Combined with the latter, our approach achieves state-of-the-art performance on multi-view photometric stereo (MVPS) benchmark datasets, including DiLiGenT-MV, LUCES-MV and Skoltech3D. In particular, our method excels in reconstructing fine-grained details and handling challenging visibility conditions. The present paper is an extended version of the earlier conference paper by Brument et al. (in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024), featuring an accelerated and more robust algorithm as well as a broader empirical evaluation. The code and data relative to this article are available at https://github.com/RobinBruneau/RNb-NeuS2.

[102] Contour Errors: An Ego-Centric Metric for Reliable 3D Multi-Object Tracking

Sharang Kaul,Mario Berk,Thiemo Gerbich,Abhinav Valada

Main category: cs.CV

TL;DR: Proposes Contour Errors (CEs), a new metric for matching objects more reliably in 3D scenes, which markedly reduces functional failure rates.

Details

Motivation: Traditional 2D metrics such as IoU and CPD perform poorly in complex 3D scenes, undermining the accuracy and reliability of perception systems. Method: Introduces CEs as a metric grounded in a functional perspective, evaluating matches by comparing bounding boxes in the ego vehicle's frame. Result: Experiments on the nuScenes dataset show that CEs markedly reduce functional failures in 3D car tracking (by 80% at close range and 60% at far range). Conclusion: CEs outperform traditional 2D metrics in 3D object tracking, making matches more reliable and safer. Abstract: Finding reliable matches is essential in multi-object tracking to ensure the accuracy and reliability of perception systems in safety-critical applications such as autonomous vehicles. Effective matching mitigates perception errors, enhancing object identification and tracking for improved performance and safety. However, traditional metrics such as Intersection over Union (IoU) and Center Point Distances (CPDs), which are effective in 2D image planes, often fail to find critical matches in complex 3D scenes. To address this limitation, we introduce Contour Errors (CEs), an ego- or object-centric metric for identifying matches of interest in tracking scenarios from a functional perspective. By comparing bounding boxes in the ego vehicle's frame, contour errors provide a more functionally relevant assessment of object matches. Extensive experiments on the nuScenes dataset demonstrate that contour errors improve the reliability of matches over the state-of-the-art 2D IoU and CPD metrics in tracking-by-detection methods. In 3D car tracking, our results show that Contour Errors reduce functional failures (FPs/FNs) by 80% at close ranges and 60% at far ranges compared to IoU in the evaluation stage.

[103] UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation

Jinting Wang,Shan Yang,Li Liu

Main category: cs.CV

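TL;DR: Proposes UniCUE, a framework that generates speech directly from CS videos, avoiding the error propagation and temporal misalignment caused by intermediate text and significantly improving performance.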

Details

Motivation: Existing methods rely on an intermediate text conversion, which causes error propagation and temporal misalignment; generating speech directly is needed to improve accuracy and synchronization. Method: Proposes the UniCUE framework, comprising a fine-grained semantic alignment pool, a VisioPhonetic adapter, and a pose-aware visual processor, to generate speech directly. Result: On a Chinese CS dataset, UniCUE lowers word error rate by 78.3% and improves lip-speech synchronization by 32%. Conclusion: Through direct speech generation and task integration, UniCUE significantly improves CSV2S performance and synchronization. Abstract: Cued Speech (CS) enhances lipreading through hand coding, providing precise speech perception support for the hearing-impaired. The CS Video-to-Speech generation (CSV2S) task aims to convert the CS visual expressions (CS videos) of hearing-impaired individuals into comprehensible speech signals. Direct generation of speech from CS video (called single CSV2S) yields poor performance due to insufficient CS data. Current research mostly focuses on CS Recognition (CSR), which converts video content into linguistic text. Based on this, one straightforward way of CSV2S is to combine CSR with a Text-to-Speech system. This combined architecture relies on text as an intermediate medium for stepwise cross-modal alignment, which may lead to error propagation and temporal misalignment between speech and video dynamics. To address these challenges, we propose a novel approach that directly generates speech from CS videos without relying on intermediate text. Building upon this, we propose UniCUE, the first unified framework for CSV2S, whose core innovation lies in the integration of the CSR task that provides fine-grained visual-semantic information to facilitate speech generation from CS videos. More precisely, UniCUE comprises (1) a novel fine-grained semantic alignment pool to ensure precise mapping between visual features and speech contents; (2) a VisioPhonetic adapter to bridge cross-task representations, ensuring seamless compatibility between two distinct tasks (i.e., CSV2S and CSR); and (3) a pose-aware visual processor to enhance fine-grained spatiotemporal correlations between lip and hand movements in CS video. Experiments on our newly established Chinese CS dataset (14 cuers: 8 hearing-impaired and 6 normal-hearing) show that our UniCUE significantly reduces Word Error Rate by 78.3% and improves lip-speech synchronization by 32% compared to the single CSV2S.

[104] MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

Kejian Zhu,Zhuoran Jin,Hongbang Yuan,Jiachun Li,Shangqing Tu,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao

Main category: cs.CV

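TL;DR: MMR-V is a new benchmark for multimodal deep reasoning in videos that requires long-range multi-frame reasoning and analysis of hidden information. Existing models perform poorly, and reasoning enhancement strategies bring limited gains.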

Details

Motivation: Existing video benchmarks focus mainly on understanding tasks and do not evaluate multimodal deep reasoning, so MMR-V is proposed to fill this gap. Method: MMR-V contains 317 videos and 1,257 tasks that require long-range multi-frame reasoning and analysis of hidden information, with carefully designed distractors. Result: Current models perform poorly (the best model reaches 52.5% accuracy), and reasoning enhancement strategies such as Chain-of-Thought bring limited gains. Conclusion: MMR-V exposes the challenges of multimodal reasoning and aims to inspire further research. Abstract: The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Carefully designed distractor annotation strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multi-modal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT demanded for multi-modal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multi-modal reasoning capabilities.

[105] Person Re-Identification System at Semantic Level based on Pedestrian Attributes Ontology

Ngoc Q. Ly,Hieu N. M. Cao,Thi T. Nguyen

Main category: cs.CV

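TL;DR: Proposes a unified person re-identification system with three modules (PAO, Local MDCNN, IDS) that uses semantic information to address attribute imbalance and performs well on the Market1501 dataset.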

Details

Motivation: Addresses challenges in person re-identification such as large-scale data, imbalanced data, viewpoint variation, and fine-grained attributes, in particular semantic-level local features and attribute imbalance. Method: Proposes a unified system comprising PAO, Local MDCNN, and IDS, which uses semantic information to pre-filter the candidate set and resolve attribute imbalance. Result: Outperforms existing methods on the Market1501 dataset. Conclusion: The mutual support among the modules effectively improves person re-identification performance. Abstract: Person Re-Identification (Re-ID) is a very important task in video surveillance systems such as tracking people, finding people in public places, or analysing customer behavior in supermarkets. Although many works have tackled this problem, challenges remain, such as large-scale datasets, imbalanced data, viewpoint variation, and fine-grained data (attributes). Moreover, local features are not employed at the semantic level in the online stage of the Re-ID task, and the imbalanced data problem of attributes is not taken into consideration. This paper proposes a unified Re-ID system consisting of three main modules: Pedestrian Attribute Ontology (PAO), Local Multi-task DCNN (Local MDCNN), and Imbalance Data Solver (IDS). The main novelty of our Re-ID system is the mutual support of PAO, Local MDCNN, and IDS, which exploits the inner-group correlations of attributes, pre-filters mismatched candidates from the gallery set based on semantic information such as fashion attributes and facial attributes, and resolves the imbalanced data of attributes without adjusting the network architecture or data augmentation. We experimented on the well-known Market1501 dataset. The experimental results show the effectiveness of our Re-ID system, which achieves higher performance on the Market1501 dataset than some state-of-the-art Re-ID methods.

[106] Image Editing As Programs with Diffusion Models

Yujia Hu,Songhua Liu,Zhenxiong Tan,Xingyi Yang,Xinchao Wang

Main category: cs.CV

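TL;DR: The IEAP framework decomposes complex editing instructions into atomic operations, significantly improving diffusion models on structurally inconsistent image editing tasks.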

Details

Motivation: Diffusion models excel at text-to-image generation but struggle with instruction-driven image editing, especially edits that involve layout changes. Method: Proposes the Image Editing As Programs (IEAP) framework, built on the Diffusion Transformer (DiT) architecture, which decomposes editing instructions into atomic operations; each operation is implemented by a lightweight adapter, and the operations are programmed by a VLM-based agent. Result: IEAP significantly outperforms existing methods across a range of editing scenarios, with higher accuracy and semantic fidelity on complex, multi-step instructions in particular. Conclusion: By modularizing and sequencing editing operations, IEAP offers a robust solution for structurally inconsistent image editing tasks. Abstract: While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To mitigate this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented via a lightweight adapter sharing the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Codes are available at https://github.com/YujiaHu1109/IEAP.

[107] FlexGS: Train Once, Deploy Everywhere with Many-in-One Flexible 3D Gaussian Splatting

Hengyu Liu,Yuehao Wang,Chenxin Li,Ruisi Cai,Kevin Wang,Wuyang Li,Pavlo Molchanov,Peihao Wang,Zhangyang Wang

Main category: cs.CV

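TL;DR: Proposes an elastic inference method that selects and transforms a subset of Gaussians to fit the GPU memory budgets of different devices without fine-tuning.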

Details

Motivation: 3D Gaussian splatting (3DGS) is widely used for 3D scene representation and novel view synthesis, but its high GPU memory demand limits its use on resource-constrained devices. Existing methods require fine-tuning and lack flexibility. Method: Introduces a learnable module that controls Gaussian selection and a transformation module that adjusts the selected Gaussians, so that the model adapts to different size requirements. Result: Experiments on ZipNeRF, MipNeRF and Tanks&Temples scenes validate the effectiveness of the method. Conclusion: The method markedly improves the adaptability of 3DGS, meeting the memory requirements of different devices without additional fine-tuning. Abstract: 3D Gaussian splatting (3DGS) has enabled various applications in 3D scene representation and novel view synthesis due to its efficient rendering capabilities. However, 3DGS demands relatively significant GPU memory, limiting its use on devices with restricted computational resources. Previous approaches have focused on pruning less important Gaussians, effectively compressing 3DGS but often requiring a fine-tuning stage and lacking adaptability for the specific memory needs of different devices. In this work, we present an elastic inference method for 3DGS. Given an input for the desired model size, our method selects and transforms a subset of Gaussians, achieving substantial rendering performance without additional fine-tuning. We introduce a tiny learnable module that controls Gaussian selection based on the input percentage, along with a transformation module that adjusts the selected Gaussians to complement the performance of the reduced model. Comprehensive experiments on ZipNeRF, MipNeRF and Tanks&Temples scenes demonstrate the effectiveness of our approach. Code is available at https://flexgs.github.io.

[108] Language-Image Alignment with Fixed Text Encoders

Jingfeng Yang,Ziyang Wu,Yue Zhao,Yi Ma

Main category: cs.CV

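TL;DR: The LIFT framework uses a fixed pre-trained large language model (LLM) as the text encoder and trains only the image encoder to achieve language-image alignment; it outperforms CLIP in most scenarios involving compositional understanding and long captions, while substantially improving computational efficiency.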

Details

Motivation: To question whether the prevailing joint training approach (e.g., CLIP) is necessary, and to explore whether a fixed pre-trained LLM is a good enough text encoder to guide visual representation learning. Method: Proposes the LIFT framework, which keeps the LLM-based text encoder fixed and trains only the image encoder, simplifying the learning of language-image alignment. Result: LIFT outperforms CLIP on compositional understanding and long-caption tasks, with considerable gains in computational efficiency. Conclusion: LIFT offers a new way to explore how LLM text embeddings can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations. Abstract: Currently, the most dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, such as CLIP and its variants. In this work, we question whether such a costly joint training is necessary. In particular, we investigate if a pre-trained fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much simplified framework LIFT is highly effective and it outperforms CLIP in most scenarios that involve compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step towards systematically exploring how text embeddings from LLMs can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations.

[109] Diffusion Domain Teacher: Diffusion Guided Domain Adaptive Object Detector

Boyong He,Yuxiang Ji,Zhuoyue Tan,Liaoni Wu

Main category: cs.CV

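TL;DR: Proposes Diffusion Domain Teacher (DDT), a diffusion-based cross-domain object detection method in which a detector built on a frozen-weight diffusion model generates pseudo labels, significantly improving cross-domain detection performance.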

Details

Motivation: To address the performance drop in object detection caused by the domain gap between training data and real-world data. Method: A detector with a frozen-weight diffusion model serves as the teacher and generates pseudo labels that guide the supervised learning of the student model on the target domain. Result: An average mAP improvement of 21.2% over the baseline on 6 datasets, surpassing current SOTA methods by an average of 5.7% mAP. Conclusion: DDT is broadly applicable and effective for cross-domain object detection. Abstract: Object detectors often suffer a decrease in performance due to the large domain gap between the training data (source domain) and real-world data (target domain). Diffusion-based generative models have shown remarkable abilities in generating high-quality and diverse images, suggesting their potential for extracting valuable features from various domains. To effectively leverage the cross-domain feature representation of diffusion models, in this paper, we train a detector with frozen-weight diffusion model on the source domain, then employ it as a teacher model to generate pseudo labels on the unlabeled target domain, which are used to guide the supervised learning of the student model on the target domain. We refer to this approach as Diffusion Domain Teacher (DDT). By employing this straightforward yet potent framework, we significantly improve cross-domain object detection performance without compromising the inference speed. Our method achieves an average mAP improvement of 21.2% compared to the baseline on 6 datasets from three common cross-domain detection benchmarks (Cross-Camera, Syn2Real, Real2Artistic), surpassing the current state-of-the-art (SOTA) methods by an average of 5.7% mAP. Furthermore, extensive experiments demonstrate that our method consistently brings improvements even in more powerful and complex models, highlighting the broadly applicable and effective domain adaptation capability of our DDT. The code is available at https://github.com/heboyong/Diffusion-Domain-Teacher.
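
The teacher-to-student hand-off described in this abstract can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that teacher detections are dictionaries with `box`, `label`, and `score` keys and that low-confidence detections are dropped with a fixed threshold; the abstract does not state how DDT filters its pseudo labels, so `filter_pseudo_labels` and `score_threshold` are hypothetical names rather than the authors' implementation.

```python
def filter_pseudo_labels(teacher_detections, score_threshold=0.8):
    """Keep teacher detections confident enough to act as pseudo labels.

    The dictionary layout and the fixed threshold are assumptions made
    for this sketch; they are not taken from the DDT paper.
    """
    return [
        {"box": det["box"], "label": det["label"]}
        for det in teacher_detections
        if det["score"] >= score_threshold
    ]


# Hypothetical teacher output on one unlabeled target-domain image.
detections = [
    {"box": (0, 0, 10, 10), "label": "car", "score": 0.95},
    {"box": (5, 5, 8, 8), "label": "person", "score": 0.40},
]
pseudo_labels = filter_pseudo_labels(detections)
```

In this sketch only the confident `car` detection survives and would be passed to the student as a training target.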

[110] FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers

Xuanhua He,Quande Liu,Zixuan Ye,Wecai Ye,Qiulin Wang,Xintao Wang,Qifeng Chen,Pengfei Wan,Di Zhang,Kun Gai

Main category: cs.CV

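TL;DR: FullDiT2 is an efficient in-context conditioning framework that uses dynamic token selection and selective context caching to substantially reduce computational overhead and speed up video generation and editing.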

Details

Motivation: Existing methods (such as FullDiT) incur quadratic computation overhead in video generation, which limits practical deployment. Method: FullDiT2 uses a dynamic token selection mechanism and selective context caching to reduce token redundancy and computational redundancy. Result: Experiments on six tasks show a significant reduction in computation and a 2-3 times speedup, with almost no loss in generation quality. Conclusion: FullDiT2 provides a practical solution for efficient, controllable video generation and editing. Abstract: Fine-grained and efficient controllability of video diffusion transformers is increasingly desired for practical applicability. Recently, In-context Conditioning emerged as a powerful paradigm for unified conditional video generation, which enables diverse controls by concatenating varying context conditioning signals with noisy video latents into a long unified token sequence and jointly processing them via full-attention, e.g., FullDiT. Despite their effectiveness, these methods face quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning video generation framework. We begin with a systematic analysis to identify two key sources of the computation inefficiencies: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. Firstly, to address the token redundancy, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full-attention. Additionally, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and 2-3 times speedup in averaged time cost per diffusion step, with minimal degradation or even higher performance in video generation quality. The project page is at https://fulldit2.github.io/.
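
The idea of shortening the attention sequence by keeping only important context tokens can be sketched in a few lines of Python. This is a generic top-k selection under stated assumptions, not FullDiT2's mechanism: the abstract does not say how importance is scored, so `scores` is taken as given, and `select_context_tokens` and `keep_ratio` are hypothetical names.

```python
import math


def select_context_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of context tokens, in original order.

    `scores` stands in for whatever importance estimate the model would
    produce; how FullDiT2 computes it is not specified in the abstract.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return []
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank token indices by score, keep the top k, then restore sequence order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


kept = select_context_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], 0.5)
```

Since full attention scales quadratically with sequence length, halving the number of context tokens in this way reduces the attention cost attributable to those tokens by more than half.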

[111] Sounding that Object: Interactive Object-Aware Image to Audio Generation

Tingle Li,Baihe Huang,Xiaobin Zhuang,Dongya Jia,Jiawei Chen,Yuping Wang,Zhuo Chen,Gopala Anumanchipalli,Yuxuan Wang

Main category: cs.CV

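TL;DR: Proposes an interactive audio generation model driven by user-selected visual objects; multi-modal attention associates image regions with sounds so that the generated audio aligns with the selected objects.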

Details

Motivation: Generating audio for complex audio-visual scenes with multiple objects and sound sources is challenging, and the generated sounds must align with the visual objects. Method: Combines object-centric learning with a conditional latent diffusion model, uses multi-modal attention to associate image regions with sounds, and employs image segmentation for interactive audio generation. Result: Quantitative and qualitative evaluations show that the model outperforms baselines and achieves better alignment between sounds and objects. Conclusion: The method performs well for interactive audio generation, and the attention mechanism is shown to functionally approximate segmentation masks. Abstract: Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: https://tinglok.netlify.app/files/avobject/

[112] UNIC: Unified In-Context Video Editing

Zixuan Ye,Xuanhua He,Quande Liu,Qiulin Wang,Xintao Wang,Pengfei Wan,Di Zhang,Kun Gai,Qifeng Chen,Wenhan Luo

Main category: cs.CV

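TL;DR: UNIC is a unified video editing framework that represents diverse editing tasks as three token types and unifies them through DiT's native attention, avoiding task-specific designs.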

Details

Motivation: Existing methods rely on task-specific architectures or dedicated customizations, which limits the variety of editing conditions and the unification of tasks. Method: Represents the inputs as source video tokens, the noisy video latent, and multi-modal conditioning tokens, and resolves task confusion with task-aware RoPE and a condition bias. Result: Performs strongly on a benchmark of six video editing tasks and exhibits task composition abilities. Conclusion: The UNIC framework is simple and effective, unifies diverse video editing tasks, and supports flexible task composition. Abstract: Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inversion), which limit the integration of versatile editing conditions and the unification of various editing tasks. In this paper, we introduce UNified In-Context Video Editing (UNIC), a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner. To achieve this unification, we represent the inputs of various video editing tasks as three types of tokens: the source video tokens, the noisy video latent, and the multi-modal conditioning tokens that vary according to the specific editing task. Based on this formulation, our key insight is to integrate these three types into a single consecutive token sequence and jointly model them using the native attention operations of DiT, thereby eliminating the need for task-specific adapter designs. Nevertheless, direct task unification under this framework is challenging, leading to severe token collisions and task confusion due to the varying video lengths and diverse condition modalities across tasks. To address these, we introduce task-aware RoPE to facilitate consistent temporal positional encoding, and condition bias that enables the model to clearly differentiate different editing tasks. This allows our approach to adaptively perform different video editing tasks by referring to the source video and varying condition tokens "in context", and support flexible task composition. To validate our method, we construct a unified video editing benchmark containing six representative video editing tasks. Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.

[113] Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models

Fangrui Zhu,Hanhui Wang,Yiming Xie,Jing Gu,Tianye Ding,Jianwei Yang,Huaizu Jiang

Main category: cs.CV

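TL;DR: Proposes Struct2D, a framework that strengthens the spatial reasoning of large multimodal models (LMMs) using structured 2D inputs (such as bird's-eye-view images and object marks) without explicit 3D inputs. Experiments show strong zero-shot performance from LMMs, and a large-scale dataset, Struct2D-Set, is built for fine-tuning, validating the approach.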

Details

Motivation: To explore whether large multimodal models (LMMs) can perform 3D spatial reasoning using only structured 2D inputs (such as bird's-eye-view images and object marks), avoiding reliance on explicit 3D inputs or specialized architectures. Method: Proposes the Struct2D framework, which combines bird's-eye-view images, object marks, and metadata into structured 2D inputs, and builds the Struct2D-Set dataset (200K QA pairs) for fine-tuning an open-source LMM (Qwen2.5VL). Result: LMMs perform well on zero-shot tasks such as direction estimation and route planning, and the fine-tuned model performs strongly on 3D question answering, dense captioning, and object grounding. Conclusion: Structured 2D inputs can effectively bridge perception and language reasoning without explicit 3D inputs. Code and dataset will be released to support future research. Abstract: Unlocking spatial reasoning in Large Multimodal Models (LMMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can LMMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird's-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source LMMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source LMM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in LMMs, without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.

[114] Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset

Zirui Wang,Wenjing Bian,Xinghui Li,Yifu Tao,Jianeng Wang,Maurice Fallon,Victor Adrian Prisacariu

Main category: cs.CV

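TL;DR: Oxford Day-and-Night is a large-scale egocentric dataset for novel view synthesis and visual relocalisation under challenging lighting conditions, filling gaps in existing datasets.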

Details

Motivation: Existing datasets lack the combination of ground-truth 3D geometry, wide-ranging lighting variation, and full 6DoF motion; this dataset aims to address that. Method: Egocentric video is captured with Meta ARIA glasses, and multi-session SLAM is used to estimate camera poses, reconstruct 3D point clouds, and align sequences captured under different lighting conditions. Result: The dataset covers over 30 km of trajectories and an area of 40,000 m², and supports two core benchmarks: novel view synthesis and relocalisation. Conclusion: The dataset is a rich resource for egocentric 3D vision research, suited to evaluating models in diverse, realistic environments. Abstract: We introduce Oxford Day-and-Night, a large-scale, egocentric dataset for novel view synthesis (NVS) and visual relocalisation under challenging lighting conditions. Existing datasets often lack crucial combinations of features such as ground-truth 3D geometry, wide-ranging lighting variation, and full 6DoF motion. Oxford Day-and-Night addresses these gaps by leveraging Meta ARIA glasses to capture egocentric video and applying multi-session SLAM to estimate camera poses, reconstruct 3D point clouds, and align sequences captured under varying lighting conditions, including both day and night. The dataset spans over 30 km of recorded trajectories and covers an area of 40,000 m², offering a rich foundation for egocentric 3D vision research. It supports two core benchmarks, NVS and relocalisation, providing a unique platform for evaluating models in realistic and diverse environments.

[115] Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation

Tianyu Huang,Wangguandong Zheng,Tengfei Wang,Yuhao Liu,Zhenwei Wang,Junta Wu,Jie Jiang,Hui Li,Rynson W. H. Lau,Wangmeng Zuo,Chunchao Guo

Main category: cs.CV

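TL;DR: Voyager is a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image and a user-defined camera path, without a conventional 3D reconstruction pipeline.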

Details

Motivation: Real-world applications such as video games and virtual reality need 3D scenes that can be explored along custom camera trajectories, but generating long-range, 3D-consistent, explorable scenes remains challenging. Method: Voyager combines three key components: 1) world-consistent video diffusion, 2) long-range world exploration, and 3) a scalable data engine, enabling end-to-end scene generation and reconstruction. Result: Outperforms existing methods in visual quality and geometric accuracy, with versatile applications. Conclusion: Through its design and integration, Voyager addresses consistency and long-range exploration in 3D scene generation and offers an efficient solution for practical applications. Abstract: Real-world applications like video gaming and virtual reality often demand the ability to model 3D scenes that users can explore along custom camera trajectories. While significant progress has been made in generating 3D objects from text or images, creating long-range, 3D-consistent, explorable 3D scenes remains a complex and challenging problem. In this work, we present Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image with a user-defined camera path. Unlike existing approaches, Voyager achieves end-to-end scene generation and reconstruction with inherent consistency across frames, eliminating the need for 3D reconstruction pipelines (e.g., structure-from-motion or multi-view stereo). Our method integrates three key components: 1) World-Consistent Video Diffusion: A unified architecture that jointly generates aligned RGB and depth video sequences, conditioned on existing world observation to ensure global coherence; 2) Long-Range World Exploration: An efficient world cache with point culling and an auto-regressive inference with smooth video sampling for iterative scene extension with context-aware consistency; and 3) Scalable Data Engine: A video reconstruction pipeline that automates camera pose estimation and metric depth prediction for arbitrary videos, enabling large-scale, diverse training data curation without manual 3D annotations. Collectively, these designs result in a clear improvement over existing methods in visual quality and geometric accuracy, with versatile applications.

[116] LayerFlow: A Unified Model for Layer-aware Video Generation

Sihui Ji,Hao Luo,Xi Chen,Yuanpeng Tu,Yiyang Wang,Hengshuang Zhao

Main category: cs.CV

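TL;DR: LayerFlow is a unified solution for layer-aware video generation that produces the transparent foreground, clean background, and blended scene, and supports variants such as decomposing a blended video or generating the background for a given foreground.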

Details

Motivation: To address the lack of high-quality layer-wise training videos and to support diverse video generation needs. Method: Starting from a text-to-video diffusion transformer, layer embeddings distinguish the sub-clips of different layers, and a multi-stage training strategy combines low-quality video data with high-quality layered image data. Result: Generates smooth layered videos and supports a variety of variant operations. Conclusion: With a unified framework and a multi-stage training strategy, LayerFlow achieves high-quality layered video generation and versatile operations. Abstract: We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants like decomposing a blended video or generating the background for the given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips, and leverage layer embeddings to distinguish each clip and the corresponding layer-wise prompts. In this way, we seamlessly support the aforementioned variants in one unified framework. To address the lack of high-quality layer-wise training videos, we design a multi-stage training strategy to accommodate static images with high-quality layer annotations. Specifically, we first train the model with low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train the content LoRA on the mixture of image data with high-quality layered images along with copy-pasted video data. During inference, we remove the motion LoRA, thus generating smooth videos with desired layers.

cs.GR [Back]

[117] Multi-Spectral Gaussian Splatting with Neural Color Representation

Lukas Meyer,Josef Grün,Maximilian Weiherer,Bernhard Egger,Marc Stamminger,Linus Franke

Main category: cs.GR

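TL;DR: MS-Splatting is a multi-spectral 3D Gaussian Splatting framework that generates multi-view consistent novel views from images taken by independent cameras in different spectral domains, without cross-modal camera calibration.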

Details

Motivation: Existing methods require cross-modal camera calibration and fail to exploit spectral and spatial correlations. MS-Splatting aims to solve these problems and to model a variety of spectra (such as thermal and near-infrared) in a unified way. Method: A neural color representation encodes multi-spectral information into a compact per-splat feature embedding, which a shallow MLP decodes into spectral color values, enabling joint learning of all bands. Result: Experiments show improved multi-spectral rendering quality, as well as per-spectrum rendering quality that exceeds existing methods. Conclusion: MS-Splatting proves effective in agricultural applications, such as rendering vegetation indices (e.g., NDVI). Abstract: We present MS-Splatting, a multi-spectral 3D Gaussian Splatting (3DGS) framework that is able to generate multi-view consistent novel views from images of multiple, independent cameras with different spectral domains. In contrast to previous approaches, our method does not require cross-modal camera calibration and is versatile enough to model a variety of different spectra, including thermal and near-infrared, without any algorithmic changes. Unlike existing 3DGS-based frameworks that treat each modality separately (by optimizing per-channel spherical harmonics) and therefore fail to exploit the underlying spectral and spatial correlations, our method leverages a novel neural color representation that encodes multi-spectral information into a learned, compact, per-splat feature embedding. A shallow multi-layer perceptron (MLP) then decodes this embedding to obtain spectral color values, enabling joint learning of all bands within a unified representation. Our experiments show that this simple yet effective strategy is able to improve multi-spectral rendering quality, while also leading to improved per-spectra rendering quality over state-of-the-art methods. We demonstrate the effectiveness of this new technique in agricultural applications to render vegetation indices, such as normalized difference vegetation index (NDVI).
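
For readers unfamiliar with the vegetation index mentioned at the end of this abstract, NDVI is conventionally defined per pixel as (NIR - Red) / (NIR + Red). The Python sketch below computes it for nested lists of reflectance values; the function names and the zero-denominator convention are assumptions for illustration and are unrelated to the MS-Splatting code.

```python
def ndvi(nir, red, eps=1e-12):
    """Normalized difference vegetation index for one pixel.

    NDVI = (NIR - Red) / (NIR + Red). Returning 0.0 when the denominator
    is (near) zero is a convention chosen for this sketch.
    """
    denom = nir + red
    if abs(denom) < eps:
        return 0.0
    return (nir - red) / denom


def ndvi_image(nir_band, red_band):
    """Apply ndvi() element-wise to two equally shaped 2D lists."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical rendered bands: one vegetated pixel, one neutral pixel.
index_map = ndvi_image([[0.8, 0.2]], [[0.2, 0.2]])
```

For non-negative reflectances the result lies in [-1, 1], and healthy vegetation, which reflects strongly in the near-infrared, yields values closer to 1.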

[118] Facial Appearance Capture at Home with Patch-Level Reflectance Prior

Yuxuan Han,Junfeng Lyu,Kuan Sheng,Minghao Que,Qixuan Zhang,Lan Xu,Feng Xu

Main category: cs.GR

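TL;DR: Proposes a video capture method using a smartphone and flashlight; with a diffusion prior and patch-level posterior sampling, it substantially improves the reconstruction quality of facial reflectance maps and narrows the gap between low-cost and professional studio recordings.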

Details

Motivation: Facial reflectance reconstructed from smartphone video is of far lower quality than that from professional studio recordings; this paper aims to close the gap. Method: Captures video with a smartphone and flashlight in a dim room, learns a diffusion prior, and uses patch-level posterior sampling to produce high-quality reflectance maps. Result: Experiments show that the method narrows the quality gap between low-cost and professional studio recordings by a large margin. Conclusion: The method gives everyday users a high-quality route to digital cloning; the code will be released. Abstract: Existing facial appearance capture methods can reconstruct plausible facial reflectance from smartphone-recorded videos. However, the reconstruction quality is still far behind the ones based on studio recordings. This paper fills the gap by developing a novel solution for everyday use with a co-located smartphone and flashlight video capture setting in a dim room. To enhance the quality, our key observation is to solve facial reflectance maps within the data distribution of studio-scanned ones. Specifically, we first learn a diffusion prior over the Light Stage scans and then steer it to produce the reflectance map that best matches the captured images. We propose to train the diffusion prior at the patch level to improve generalization ability and training stability, as current Light Stage datasets are in ultra-high resolution but limited in data size. Tailored to this prior, we propose a patch-level posterior sampling technique to sample seamless full-resolution reflectance maps from this patch-level diffusion model. Experiments demonstrate our method closes the quality gap between low-cost and studio recordings by a large margin, opening the door for everyday users to clone themselves to the digital world. Our code will be released at https://github.com/yxuhan/DoRA.

[119] SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting

Shengjie Lin,Jiading Fang,Muhammad Zubair Irshad,Vitor Campagnolo Guizilini,Rares Andrei Ambrus,Greg Shakhnarovich,Matthew R. Walter

Main category: cs.GR

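TL;DR: SplArt is a self-supervised, category-agnostic framework that uses 3D Gaussian Splatting (3DGS) to reconstruct articulated objects and infer kinematics from two sets of RGB images captured at different articulation states, enabling real-time photorealistic rendering.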

Details

Motivation: Existing methods are limited in scalability, robustness, and rendering quality; SplArt aims to address these issues. Method: Adds a differentiable mobility parameter to each Gaussian and uses a multi-stage optimization strategy to handle reconstruction, part segmentation, and articulation estimation. Result: Performs strongly on benchmarks and in real-world applications, without requiring 3D annotations or category priors. Conclusion: SplArt reaches state-of-the-art performance and practicality; the code is publicly available. Abstract: Reconstructing articulated objects prevalent in daily environments is crucial for applications in augmented/virtual reality and robotics. However, existing methods face scalability limitations (requiring 3D supervision or costly annotations), robustness issues (being susceptible to local optima), and rendering shortcomings (lacking speed or photorealism). We introduce SplArt, a self-supervised, category-agnostic framework that leverages 3D Gaussian Splatting (3DGS) to reconstruct articulated objects and infer kinematics from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. SplArt augments 3DGS with a differentiable mobility parameter per Gaussian, achieving refined part segmentation. A multi-stage optimization strategy is employed to progressively handle reconstruction, part segmentation, and articulation estimation, significantly enhancing robustness and accuracy. SplArt exploits geometric self-supervision, effectively addressing challenging scenarios without requiring 3D annotations or category-specific priors. Evaluations on established and newly proposed benchmarks, along with applications to real-world scenarios using a handheld RGB camera, demonstrate SplArt's state-of-the-art performance and real-world practicality. Code is publicly available at https://github.com/ripl/splart.

cs.CL [Back]

[120] Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems

Michael E. Garcia-Alcoser,Mobina GhojoghNejad,Fakrul Islam Tushar,David Kim,Kyle J. Lafata,Geoffrey D. Rubin,Joseph Y. Lo

Main category: cs.CL

TL;DR: Lightweight LLMs outperform rule-based methods for disease labeling of CT reports and generalize across organ systems with zero-shot prompting.

Details

Motivation: To evaluate the effectiveness of large language models (LLMs) for automated disease labeling of CT radiology reports, compared with a rule-based algorithm (RBA). Method: A retrospective study analyzed 40,833 CT reports and tested three lightweight LLMs with zero-shot prompting, evaluated using Cohen's Kappa and F1 scores. Result: Gemma-3 27B and Llama-3.1 8B performed best, with a median kappa of 0.87 on Duke CAP reports; Gemma-3 27B achieved a macro-F1 of 0.82. Conclusion: Lightweight LLMs offer a flexible, efficient solution, but binary labels cannot fully capture the nuance of report language. Abstract: Purpose: This study aims to evaluate the effectiveness of large language models (LLMs) in automating disease annotation of CT radiology reports. We compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs for multi-disease labeling of chest, abdomen, and pelvis (CAP) CT reports. Materials and Methods: This retrospective study analyzed 40,833 CT reports from 29,540 patients, with 1,789 CAP reports manually annotated across three organ systems. External validation was conducted using the CT-RATE dataset. Three open-weight LLMs were tested with zero-shot prompting. Performance was evaluated using Cohen's Kappa and micro/macro-averaged F1 scores. Results: In 12,197 Duke CAP reports from 8,854 patients, Llama-3.1 8B and Gemma-3 27B showed the highest agreement (median κ: 0.87). On the manually annotated set, Gemma-3 27B achieved the top macro-F1 (0.82), followed by Llama-3.1 8B (0.79), while the RBA scored lowest (0.64). On the CT-RATE dataset (lungs/pleura only), Llama-3.1 8B performed best (0.91), with Gemma-3 27B close behind (0.89). Performance differences were mainly due to differing labeling practices, especially for lung atelectasis. Conclusion: Lightweight LLMs outperform rule-based methods for CT report annotation and generalize across organ systems with zero-shot prompting. However, binary labels alone cannot capture the full nuance of report language. LLMs can provide a flexible, efficient solution aligned with clinical judgment and user needs.
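
Because agreement in this study is reported with Cohen's Kappa, a short Python sketch of the standard formula may help: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two labelers and p_e is the agreement expected by chance. The function name and the example labels are hypothetical, and returning 1.0 when both labelers use a single identical label is a convention chosen here, not something taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each labeler's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    classes = count_a.keys() | count_b.keys()
    p_e = sum(count_a[c] * count_b[c] for c in classes) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both labelers always give the same label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary disease labels from two sources for four reports.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy example the two sources agree on three of four reports (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5, well below the raw agreement rate.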

[121] A conclusive remark on linguistic theorizing and language modeling

Cristiano Chesi

Main category: cs.CL

TL;DR: A concluding remark on the replies to the target paper.

Details

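Motivation: To summarize and respond to the feedback on the target paper in the Italian Journal of Linguistics. Method: Analyzes the replies received and offers a synthesized view. Result: An overall commentary on the replies, possibly including responses to criticism or a summary of support. Conclusion: Offers a closing conclusion to the discussion or directions for further research. Abstract: This is the final remark on the replies received to my target paper in the Italian Journal of Linguistics.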

[122] FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes

Christodoulos Constantinides,Dhaval Patel,Shuxin Lin,Claudio Guerrero,Sunil Dagajirao Patil,Jayant Kalagnanam

Main category: cs.CL

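TL;DR: FailureSensorIQ is a novel multi-choice question-answering (MCQA) benchmarking system for assessing how well large language models (LLMs) reason about and understand complex, domain-specific scenarios in Industry 4.0.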

Details

Motivation: Traditional QA benchmarks cannot fully assess LLM reasoning in industrial settings, so a system focused on failure modes, sensor data, and the relationships between them is needed. Method: Evaluates LLMs through Perturbation-Uncertainty-Complexity analysis, an expert evaluation study, asset-specific knowledge gap analysis, and a ReAct agent. Result: Although closed-source models approach expert-level performance, the benchmark shows that their performance is sensitive to perturbations, distractions, and knowledge gaps. Conclusion: FailureSensorIQ gives the industrial domain a new evaluation tool and an LLM-driven feature selection method, advancing the combination of data-driven and domain-driven modeling. Abstract: We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven using statistical tools like correlation analysis and significance tests, but also domain-driven by specialized LLMs which can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the industrial knowledge of over a dozen LLMs (including GPT-4, Llama, and Mistral) on FailureSensorIQ through different lenses: Perturbation-Uncertainty-Complexity analysis, an Expert Evaluation study, Asset-Specific Knowledge Gap analysis, and a ReAct agent using external knowledge bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals a significant drop in performance that is fragile to perturbations, distractions, and inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive the modeling decisions on 3 different failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) FailureSensorIQ benchmark and Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection scikit-learn pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.

[123] HyperSteer: Activation Steering at Scale with Hypernetworks

Jiuding Sun,Sidharth Baskaran,Zhengxuan Wu,Michael Sklar,Christopher Potts,Atticus Geiger

Main category: cs.CL

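TL;DR: HyperSteer is a family of hypernetwork-based architectures that generate steering vectors for language models and outperform existing activation steering methods.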

Details

Motivation: Existing unsupervised methods offer no guarantees on the efficacy of individual steering vectors, while supervised methods require substantial data. HyperSteer aims to combine the strengths of both. Method: A hypernetwork architecture generates steering vectors conditioned on natural language steering prompts and the internal states of the language model. Result: Scaled to thousands of steering prompts, HyperSteer outperforms existing methods, even on prompts never seen during training. Conclusion: HyperSteer is strong in both performance and generalization, performing on par with steering via prompting. Abstract: Steering language models (LMs) by modifying internal activations is a popular approach for controlling text generation. Unsupervised dictionary learning methods, e.g., sparse autoencoders, can be scaled to produce many steering vectors, but lack guarantees on the individual efficacy of each vector and control over the coverage of relevant steering tasks. In contrast, supervised methods for constructing steering vectors are targeted and effective, but require more data collection and training for each additional steering vector produced. In this work, we introduce HyperSteer, a family of hypernetwork-based architectures which are trained end-to-end to generate steering vectors conditioned on the natural language steering prompts and the internals of the steered LM. In our evaluations, we show that scaling HyperSteer with thousands of steering prompts exceeds the performance of state-of-the-art activation steering methods, even on steering prompts never seen during training. Moreover, HyperSteer performs on par with steering-via-prompting.
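
To make "modifying internal activations" concrete, the Python sketch below shows the commonly used additive form of activation steering, h' = h + strength * v, applied to a single hidden-state vector. It is an assumption that this additive form applies here; the abstract does not specify how HyperSteer's generated vectors are applied, and the hypernetwork that would produce `steering_vector` is not modeled. `apply_steering` and `strength` are hypothetical names.

```python
def apply_steering(hidden, steering_vector, strength=1.0):
    """Additively steer one hidden-state vector: h' = h + strength * v.

    Plain Python lists stand in for model activations; the steering
    vector is taken as given rather than generated by a hypernetwork.
    """
    if len(hidden) != len(steering_vector):
        raise ValueError("hidden state and steering vector must match in size")
    return [h + strength * v for h, v in zip(hidden, steering_vector)]


# Hypothetical 2-dimensional activation and steering vector.
steered = apply_steering([1.0, 2.0], [0.5, -1.0], strength=2.0)
```

Setting `strength` to 0 leaves the activation unchanged, which is a convenient sanity check when experimenting with steering strength.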

[124] Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

Yubo Wang,Ping Nie,Kai Zou,Lijun Wu,Wenhu Chen

Main category: cs.CL

TL;DR: The paper proposes Critique Fine-Tuning (CFT) on a single problem to unleash the reasoning potential of large language models, matching or surpassing conventional reinforcement learning (RL) methods at far lower compute cost.

Details

Motivation: Reinforcement learning (RL) can substantially improve the reasoning ability of large language models (LLMs), but it is computationally expensive and unstable. The study therefore looks for a more efficient way to unleash the reasoning potential of LLMs. Method: Diverse model-generated solutions to a single problem are collected, teacher LLMs provide detailed critiques to build the critique (CFT) data, and Qwen and Llama family models are fine-tuned on it. Result: With only 5 GPU hours of training, Qwen-Math-7B-CFT improves by an average of 15% on math benchmarks and 16% on logic reasoning benchmarks, matching or surpassing RL methods at 1/20 of the compute. Conclusion: One-shot CFT is a simple, general, and compute-efficient way to unleash the reasoning capabilities of modern LLMs. Abstract: We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable. Even one-shot RL requires hundreds of GPU hours. This raises a critical question: Is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to or even surpass the results from RL with 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.

[125] From Instructions to ODRL Usage Policies: An Ontology Guided Approach

Daham M. Mustafa,Abhishek Nadgeri,Diego Collarana,Benedikt T. Arnold,Christoph Quix,Christoph Lange,Stefan Decker

Main category: cs.CL

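TL;DR: Large language models such as GPT-4 are used to generate W3C ODRL usage policies automatically from natural language instructions, with the ODRL ontology and its documentation shaping the prompt for end-to-end knowledge graph construction.

Details

Motivation: The research hypothesis is that a curated version of existing ontology documentation better guides policy generation, particularly in dataspaces (distributed infrastructures for data exchange) for the cultural domain. Method: The ODRL ontology and its documentation form the core of the prompt, and several heuristics adapt the ontology to support end-to-end knowledge graph construction. Result: On a benchmark of 12 use cases of varying complexity, the resulting knowledge graph reaches up to 91.95% accuracy. Conclusion: The approach performs well at automatically generating ODRL policies and suits distributed data exchange scenarios. Abstract: This study presents an approach that uses large language models such as GPT-4 to generate usage policies in the W3C Open Digital Rights Language (ODRL) automatically from natural language instructions. Our approach uses the ODRL ontology and its documentation as a central part of the prompt. Our research hypothesis is that a curated version of existing ontology documentation will better guide policy generation. We present various heuristics for adapting the ODRL ontology and its documentation to guide an end-to-end KG construction process. We evaluate our approach in the context of dataspaces, i.e., distributed infrastructures for trustworthy data exchange between multiple participating organizations for the cultural domain. We created a benchmark consisting of 12 use cases of varying complexity. Our evaluation shows excellent results with up to 91.95% accuracy in the resulting knowledge graph.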

[126] Hopscotch: Discovering and Skipping Redundancies in Language Models

Mustafa Eyceoz,Nikhil Shivakumar Nayak,Hao Wang,Ligong Han,Akash Srivastava

Main category: cs.CL

TL;DR: Hopscotch is a simple, effective method that skips the attention blocks contributing least to a task and rescales the outputs of the remaining layers to preserve model performance.

Details

Motivation: Modern causal language models stack many attention blocks to improve performance, but not every block is necessary for every task. Hopscotch aims to use compute more efficiently. Method: Hopscotch jointly optimizes which blocks to skip and how to scale the outputs of the remaining layers, using lightweight trainable parameters to mitigate distribution shift in the hidden states. Result: On Llama-3.1-8B and Qwen2.5-7B, performance drops by less than 2% after skipping four attention blocks. Conclusion: Hopscotch requires neither modifying model weights nor accessing pretraining data, and is compatible with existing compression techniques, making it an efficient way to prune attention blocks. Abstract: Modern causal language models stack many attention blocks to improve performance, but not all blocks are necessary for every task. We propose Hopscotch, a simple yet effective method that identifies and skips the attention blocks that contribute least to a task and adapts to preserve output quality. Hopscotch jointly optimizes which blocks to skip and how to scale the outputs of the remaining layers. By introducing lightweight, trainable scaling parameters to attention and MLP blocks, it mitigates distribution shifts in hidden states caused by removing attention blocks. Hopscotch does not modify model weights or require access to pretraining or instruction-tuning data, and is compatible with existing model compression techniques. When applied to $\texttt{Llama-3.1-8B}$ and $\texttt{Qwen2.5-7B}$, Hopscotch achieves less than a 2% drop in performance even after skipping four attention blocks.
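The skip-and-rescale idea in this abstract can be illustrated with a toy residual stack. This is a minimal sketch under stated assumptions, not the authors' implementation: the function `forward`, the per-layer scalar scales, and the fixed skip set are all hypothetical, and the joint optimization of which blocks to skip is not shown.

```python
from typing import Callable, List, Sequence, Set

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors (the residual connection)."""
    return [x + y for x, y in zip(a, b)]


def scale(s: float, v: Vector) -> Vector:
    """Multiply a vector by a scalar scaling parameter."""
    return [s * x for x in v]


def forward(
    x: Vector,
    attn_blocks: Sequence[Sublayer],
    mlp_blocks: Sequence[Sublayer],
    skip: Set[int],
    attn_scale: Sequence[float],
    mlp_scale: Sequence[float],
) -> Vector:
    """Run a toy residual stack, skipping the attention blocks listed in `skip`.

    Assumption: each scale is a single scalar per block; in a real model the
    scales would be trained while the original weights stay frozen.
    """
    h = list(x)
    for i, (attn, mlp) in enumerate(zip(attn_blocks, mlp_blocks)):
        if i not in skip:  # a skipped attention block contributes nothing
            h = add(h, scale(attn_scale[i], attn(h)))
        h = add(h, scale(mlp_scale[i], mlp(h)))
    return h


# Toy sublayers standing in for real attention and MLP blocks.
toy_attn: Sublayer = lambda h: [1.0 for _ in h]
toy_mlp: Sublayer = lambda h: [0.5 * v for v in h]

full = forward([1.0, 2.0], [toy_attn] * 2, [toy_mlp] * 2, set(), [1.0, 1.0], [1.0, 1.0])
pruned = forward([1.0, 2.0], [toy_attn] * 2, [toy_mlp] * 2, {1}, [1.0, 1.0], [1.0, 1.0])
```

In this toy setting the pruned output differs from the full one; the trainable scales are what a method of this kind would adjust to shrink that gap.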

[127] The Reader is the Metric: How Textual Features and Reader Profiles Explain Conflicting Evaluations of AI Creative Writing

Guillermo Marco,Julio Gonzalo,Víctor Fresno

Main category: cs.CL

TL;DR: The study finds that conflicting evaluations of AI-generated versus human-written literary texts stem from differences in reader preferences rather than from the intrinsic quality of the texts. Analyzing these preferences reveals two reader profiles that weight textual features differently.

Details

Motivation: To explain why evaluations of AI-generated and human-written literary texts diverge, with reader preference proposed as the key factor. Method: Using five public datasets, 17 textual features are extracted, individual reader preferences are modeled, and the resulting preference vectors are clustered. Result: Readers fall into two profiles, surface-focused (mainly non-experts) and holistic (mainly experts), and their preferences explain the differences in judgments of literary quality. Conclusion: Reader-sensitive evaluation frameworks are recommended for creative text generation. Abstract: Recent studies comparing AI-generated and human-authored literary texts have produced conflicting results: some suggest AI already surpasses human quality, while others argue it still falls short. We start from the hypothesis that such divergences can be largely explained by genuine differences in how readers interpret and value literature, rather than by an intrinsic quality of the texts evaluated. Using five public datasets (1,471 stories, 101 annotators including critics, students, and lay readers), we (i) extract 17 reference-less textual features (e.g., coherence, emotional variance, average sentence length); (ii) model individual reader preferences, deriving feature importance vectors that reflect their textual priorities; and (iii) analyze these vectors in a shared "preference space". Reader vectors cluster into two profiles: 'surface-focused readers' (mainly non-experts), who prioritize readability and textual richness; and 'holistic readers' (mainly experts), who value thematic development, rhetorical variety, and sentiment dynamics. Our results quantitatively explain how measurements of literary quality are a function of how text features align with each reader's preferences. These findings advocate for reader-sensitive evaluation frameworks in the field of creative text generation.

Celia Chen,Scotty Beland,Ingo Burghardt,Jill Byczek,William J. Conway,Eric Cotugno,Sadaf Davre,Megan Fletcher,Rajesh Kumar Gnanasekaran,Kristin Hamilton,Marilyn Harbert,Jordan Heustis,Tanaya Jha,Emily Klein,Hayden Kramer,Alex Leitch,Jessica Perkins,Casi Sherman,Celia Sterrn,Logan Stevens,Rebecca Zarrella,Jennifer Golbeck

Main category: cs.CL

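TL;DR: The paper introduces a cross-platform dataset of violent threats and validates it with a machine learning analysis, showing that data from different platforms can complement each other for content classification.

Details

Motivation: Violent threats are a serious problem on social media, and high-quality data supports research into understanding and detecting malicious content. Method: A dataset of 30,000 hand-coded posts is built, covering sub-types such as political and sexual violence, and a machine learning analysis evaluates the signal it contains. Result: Classification accuracy remains high even though the datasets come from different platforms and use different coding criteria, indicating that the data can be used across platforms. Conclusion: The findings have implications for content-classification strategies and for understanding violent content across platforms. Abstract: Violent threats remain a significant problem across social media platforms. Useful, high-quality data facilitates research into the understanding and detection of malicious content, including violence. In this paper, we introduce a cross-platform dataset of 30,000 posts hand-coded for violent threats and sub-types of violence, including political and sexual violence. To evaluate the signal present in this dataset, we perform a machine learning analysis with an existing dataset of violent comments from YouTube. We find that, despite originating from different platforms and using different coding criteria, we achieve high classification accuracy both by training on one dataset and testing on the other, and in a merged dataset condition. These results have implications for content-classification strategies and for understanding violent content across social media.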

[129] Ask a Local: Detecting Hallucinations With Specialized Model Divergence

Aldan Creo,Héctor Cerezo-Costas,Pedro Alonso-Doval,Maximiliano Hormazábal-Lagos

Main category: cs.CL

TL;DR: The paper proposes "Ask a Local", a new method for detecting hallucinations in large language models that identifies hallucinated content through the differing reactions of specialized models to inaccurate information.

Details

Motivation: To address large language models generating plausible but factually incorrect information (hallucinations), especially in multilingual settings. Method: Potentially hallucinated spans are identified by computing the divergence between the perplexity distributions of specialized models; no external data or training is required, and the method applies to multilingual settings. Result: Performance is consistent on a question-answer dataset spanning 14 languages, with IoU scores around 0.3; Italian and Catalan stand out (IoU of 0.42 and 0.38, respectively). Conclusion: The method detects hallucinations in multilingual settings without adaptation, is scalable and efficient, and provides tooling for further research. Abstract: Hallucinations in large language models (LLMs), instances where models generate plausible but factually incorrect information, present a significant challenge for AI. We introduce "Ask a Local", a novel hallucination detection method exploiting the intuition that specialized models exhibit greater surprise when encountering domain-specific inaccuracies. Our approach computes divergence between perplexity distributions of language-specialized models to identify potentially hallucinated spans. Our method is particularly well-suited for a multilingual context, as it naturally scales to multiple languages without the need for adaptation, external data sources, or training. Moreover, we select computationally efficient models, providing a scalable solution that can be applied to a wide range of languages and domains. Our results on a human-annotated question-answer dataset spanning 14 languages demonstrate consistent performance across languages, with Intersection-over-Union (IoU) scores around 0.3 and comparable Spearman correlation values. Our model shows particularly strong performance on Italian and Catalan, with IoU scores of 0.42 and 0.38, respectively, while maintaining cross-lingual effectiveness without language-specific adaptations. We release our code and architecture to facilitate further research in multilingual hallucination detection.
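To make the reported IoU scores concrete, the sketch below flags tokens where two models disagree strongly in per-token surprisal and scores the flagged set against gold annotations. It is illustrative only: the absolute-difference rule, the threshold, and the surprisal values are assumptions, and the paper's actual divergence measure may differ.

```python
from typing import Sequence, Set


def flag_tokens(
    surprisal_a: Sequence[float],
    surprisal_b: Sequence[float],
    threshold: float,
) -> Set[int]:
    """Return indices of tokens whose surprisal differs between two models
    by more than `threshold` (a hypothetical stand-in for a divergence score)."""
    return {
        i
        for i, (a, b) in enumerate(zip(surprisal_a, surprisal_b))
        if abs(a - b) > threshold
    }


def iou(predicted: Set[int], gold: Set[int]) -> float:
    """Intersection-over-Union between predicted and gold token index sets."""
    if not predicted and not gold:
        return 1.0  # both empty: treat as perfect agreement
    return len(predicted & gold) / len(predicted | gold)


# Made-up surprisal values for a four-token answer.
specialized = [1.0, 5.0, 1.2, 4.0]
general = [1.1, 1.0, 1.0, 1.5]

predicted = flag_tokens(specialized, general, threshold=2.0)
score = iou(predicted, gold={1, 2})
```

Here tokens 1 and 3 are flagged while tokens 1 and 2 are annotated as hallucinated, so one index is shared out of three in the union.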

[130] A Multimodal, Multilingual, and Multidimensional Pipeline for Fine-grained Crowdsourcing Earthquake Damage Evaluation

Zihui Ma,Lingyao Li,Juan Li,Wenyue Hua,Jingxiao Liu,Qingyuan Feng,Yuki Miura

Main category: cs.CL

TL;DR: A 3M pipeline built on multimodal large language models (MLLMs) is proposed for rapid disaster impact assessment, and its outputs correlate strongly with ground-truth seismic data.

Details

Motivation: Rapid, fine-grained disaster assessment is essential for emergency response, but traditional methods are limited by sparse ground sensors and delays in official reporting. Social media offers rich real-time data, yet its multimodal and unstructured nature makes analysis harder. Method: A Multimodal, Multilingual, and Multidimensional (3M) pipeline uses MLLMs to integrate image and text signals; three foundation models are evaluated on two major earthquake events. Result: MLLMs integrate image and text signals effectively and correlate strongly with ground-truth seismic data, but performance varies with language, epicentral distance, and input modality. Conclusion: MLLMs show potential for disaster assessment and provide a foundation for future real-time crisis applications. Code and data are released. Abstract: Rapid, fine-grained disaster damage assessment is essential for effective emergency response, yet remains challenging due to limited ground sensors and delays in official reporting. Social media provides a rich, real-time source of human-centric observations, but its multimodal and unstructured nature presents challenges for traditional analytical methods. In this study, we propose a structured Multimodal, Multilingual, and Multidimensional (3M) pipeline that leverages multimodal large language models (MLLMs) to assess disaster impacts. We evaluate three foundation models across two major earthquake events using both macro- and micro-level analyses. Results show that MLLMs effectively integrate image-text signals and demonstrate a strong correlation with ground-truth seismic data. However, performance varies with language, epicentral distance, and input modality. This work highlights the potential of MLLMs for disaster assessment and provides a foundation for future research in applying MLLMs to real-time crisis contexts. The code and data are released at: https://github.com/missa7481/EMNLP25_earthquake

[131] Trajectory Prediction Meets Large Language Models: A Survey

Yi Xu,Ruining Yang,Yitian Zhang,Yizhou Wang,Jianglin Lu,Mingyuan Zhang,Lili Su,Yun Fu

Main category: cs.CL

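TL;DR: This survey reviews the use of large language models (LLMs) in trajectory prediction, organizes the work into five research directions, and analyzes representative methods, design choices, and challenges.

Details

Motivation: To explore how the semantic and reasoning capabilities of LLMs can improve trajectory prediction and give autonomous systems more intelligent perception and prediction. Method: Related work is grouped into five categories: language modeling paradigms, direct prediction with pretrained models, language-guided scene understanding, language-driven data generation, and language-based reasoning and interpretability. Result: The survey gives a comprehensive overview of LLMs in trajectory prediction and identifies representative methods and core design choices in each direction. Conclusion: The survey offers a unified perspective on the intersection of natural language processing and trajectory prediction, showing how language can enrich trajectory prediction. Abstract: Recent advances in large language models (LLMs) have sparked growing interest in integrating language-driven techniques into trajectory prediction. By leveraging their semantic and reasoning capabilities, LLMs are reshaping how autonomous systems perceive, model, and predict trajectories. This survey provides a comprehensive overview of this emerging field, categorizing recent work into five directions: (1) Trajectory prediction via language modeling paradigms, (2) Direct trajectory prediction with pretrained language models, (3) Language-guided scene understanding for trajectory prediction, (4) Language-driven data generation for trajectory prediction, (5) Language-based reasoning and interpretability for trajectory prediction. For each, we analyze representative methods, highlight core design choices, and identify open challenges. This survey bridges natural language processing and trajectory prediction, offering a unified perspective on how language can enrich trajectory prediction.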

[132] DistRAG: Towards Distance-Based Spatial Reasoning in LLMs

Nicole R Schneider,Nandini Ramachandran,Kent O'Sullivan,Hanan Samet

Main category: cs.CL

TL;DR: The paper proposes DistRAG, which strengthens the spatial reasoning of LLMs by retrieving relevant spatial information, addressing their inability to handle distance questions reliably.

Details

Motivation: LLMs are weak at spatial reasoning such as distance calculation, which limits their use in tasks like POI recommendation and itinerary planning. Method: DistRAG encodes the geodesic distances between cities and towns in a graph and retrieves the subgraph relevant to the question as context. Result: The method lets an LLM answer distance-based reasoning questions it otherwise cannot answer. Conclusion: DistRAG gives LLMs a rudimentary 'world model' that complements their linguistic knowledge and widens their range of applications. Abstract: Many real-world tasks where Large Language Models (LLMs) can be used require spatial reasoning, like Point of Interest (POI) recommendation and itinerary planning. However, on their own LLMs lack reliable spatial reasoning capabilities, especially about distances. To address this problem, we develop a novel approach, DistRAG, that enables an LLM to retrieve relevant spatial information not explicitly learned during training. Our method encodes the geodesic distances between cities and towns in a graph and retrieves a context subgraph relevant to the question. Using this technique, our method enables an LLM to answer distance-based reasoning questions that it otherwise cannot answer. Given the vast array of possible places an LLM could be asked about, DistRAG offers a flexible first step towards providing a rudimentary 'world model' to complement the linguistic knowledge held in LLMs.
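The retrieval step described in this abstract can be sketched as a small distance graph plus a lookup keyed on place names found in the question. This is a rough illustration, not the DistRAG implementation: the haversine great-circle formula stands in for a true geodesic computation, the substring matching of city names is a simplification, and the helper names and approximate coordinates are assumptions.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees

# Approximate coordinates, for illustration only.
CITIES: Dict[str, Coord] = {
    "London": (51.5074, -0.1278),
    "Paris": (48.8566, 2.3522),
    "Berlin": (52.5200, 13.4050),
}


def haversine_km(a: Coord, b: Coord, radius_km: float = 6371.0) -> float:
    """Great-circle distance between two points on a spherical Earth."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(h))


def build_graph(cities: Dict[str, Coord]) -> Dict[FrozenSet[str], float]:
    """Complete graph: one undirected edge per city pair, weighted by distance."""
    return {
        frozenset((u, v)): haversine_km(cities[u], cities[v])
        for u, v in combinations(cities, 2)
    }


def retrieve_subgraph(
    question: str, graph: Dict[FrozenSet[str], float], cities: Dict[str, Coord]
) -> Dict[FrozenSet[str], float]:
    """Keep only edges whose endpoints are both mentioned in the question."""
    mentioned = {name for name in cities if name.lower() in question.lower()}
    return {edge: d for edge, d in graph.items() if edge <= mentioned}


def to_context(subgraph: Dict[FrozenSet[str], float]) -> str:
    """Render the retrieved edges as plain text to prepend to an LLM prompt."""
    lines = [f"{' - '.join(sorted(edge))}: {d:.0f} km" for edge, d in subgraph.items()]
    return "\n".join(sorted(lines))


GRAPH = build_graph(CITIES)
context = to_context(retrieve_subgraph("How far is Paris from London?", GRAPH, CITIES))
```

The resulting `context` string would be placed ahead of the question in the prompt, so the model reads the distance rather than recalling it.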

[133] Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models

Ahmad Dawar Hakimi,Ali Modarressi,Philipp Wicke,Hinrich Schütze

Main category: cs.CL

TL;DR: The study analyzes how factual knowledge representations evolve in OLMo-7B and finds that the model shifts from general-purpose to specialized components, with attention heads showing the highest turnover, FFNs remaining more stable, and task complexity shaping knowledge acquisition.

Details

Motivation: To understand how large language models (LLMs) acquire and store factual knowledge, in order to improve their interpretability and reliability. Method: The roles of OLMo-7B's attention heads and FFNs are tracked over the course of pre-training; components are classified into four roles, and their stability and transitions are analyzed. Result: The model initially relies on general-purpose components that later specialize; attention heads change the most while FFNs are more stable; location-based relations converge earlier than name-based relations. Conclusion: The study offers a mechanistic view of knowledge formation in LLMs and a new perspective for model optimization. Abstract: Understanding how large language models (LLMs) acquire and store factual knowledge is crucial for enhancing their interpretability and reliability. In this work, we analyze the evolution of factual knowledge representation in the OLMo-7B model by tracking the roles of its attention heads and feed-forward networks (FFNs) over the course of pre-training. We classify these components into four roles: general, entity, relation-answer, and fact-answer specific, and examine their stability and transitions. Our results show that LLMs initially depend on broad, general-purpose components, which later specialize as training progresses. Once the model reliably predicts answers, some components are repurposed, suggesting an adaptive learning process. Notably, attention heads display the highest turnover. We also present evidence that FFNs remain more stable throughout training. Furthermore, our probing experiments reveal that location-based relations converge to high accuracy earlier in training than name-based relations, highlighting how task complexity shapes acquisition dynamics. These insights offer a mechanistic view of knowledge formation in LLMs.

[134] Culture Matters in Toxic Language Detection in Persian

Zahra Bokaei,Walid Magdy,Bonnie Webber

Main category: cs.CL

TL;DR: The paper compares different methods for toxic language detection in Persian and examines how cultural context affects cross-lingual transfer learning.

Details

Motivation: Toxic language detection is crucial for creating safer online environments and limiting the spread of harmful content, but it has been little studied for Persian. Method: Fine-tuning, data enrichment, zero-shot and few-shot learning, and cross-lingual transfer learning are compared. Result: Languages from culturally similar countries transfer better, while languages from culturally distinct countries yield smaller improvements. Conclusion: Cultural context has a marked effect on transfer learning for toxic language detection. Abstract: Toxic language detection is crucial for creating safer online environments and limiting the spread of harmful content. While toxic language detection has been under-explored in Persian, the current work compares different methods for this task, including fine-tuning, data enrichment, zero-shot and few-shot learning, and cross-lingual transfer learning. What is especially compelling is the impact of cultural context on transfer learning for this task: We show that the language of a country with cultural similarities to Persian yields better results in transfer learning. Conversely, the improvement is lower when the language comes from a culturally distinct country. Warning: This paper contains examples of toxic language that may disturb some readers. These examples are included for the purpose of research on toxic detection.

[135] Delta-KNN: Improving Demonstration Selection in In-Context Learning for Alzheimer's Disease Detection

Chuyuan Li,Raymond Li,Thalia S. Field,Giuseppe Carenini

Main category: cs.CL

TL;DR: The paper proposes Delta-KNN, an improved demonstration selection strategy that raises the performance of large language models on Alzheimer's disease detection.

Details

Motivation: Early intervention in Alzheimer's disease (AD) benefits from analyzing linguistic abnormalities, but conventional in-context learning methods perform poorly on this task. Method: Delta-KNN combines a delta score with a KNN-based retriever to select the best training examples dynamically for each input. Result: In experiments on three open-source LLMs, Delta-KNN outperforms existing methods and reaches a new state of the art with Llama-3.1. Conclusion: Delta-KNN markedly improves AD detection performance and surpasses even supervised classifiers. Abstract: Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that leads to dementia, and early intervention can greatly benefit from analyzing linguistic abnormalities. In this work, we explore the potential of Large Language Models (LLMs) as health assistants for AD diagnosis from patient-generated text using in-context learning (ICL), where tasks are defined through a few input-output examples. Empirical results reveal that conventional ICL methods, such as similarity-based selection, perform poorly for AD diagnosis, likely due to the inherent complexity of this task. To address this, we introduce Delta-KNN, a novel demonstration selection strategy that enhances ICL performance. Our method leverages a delta score to assess the relative gains of each training example, coupled with a KNN-based retriever that dynamically selects optimal "representatives" for a given input. Experiments on two AD detection datasets across three open-source LLMs demonstrate that Delta-KNN consistently outperforms existing ICL baselines. Notably, when using the Llama-3.1 model, our approach achieves new state-of-the-art results, surpassing even supervised classifiers.
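One plausible reading of the "delta score plus KNN retriever" description is sketched below. Everything in it is an assumption made for illustration: the delta matrix layout (`delta[i][j]` as the gain on training example `j` when example `i` serves as the demonstration), the Euclidean distance, and the averaging rule are not taken from the paper, whose actual scoring may differ.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_demonstrations(
    query: Sequence[float],
    train_embeddings: Sequence[Sequence[float]],
    delta: Sequence[Sequence[float]],
    k: int,
    n_demos: int,
) -> List[int]:
    """Pick demonstrations that helped most on the query's nearest neighbours.

    delta[i][j] (hypothetical layout): accuracy gain on training example j
    when training example i is used as the in-context demonstration.
    """
    # 1) KNN step: find the k training examples closest to the query.
    neighbours = sorted(
        range(len(train_embeddings)),
        key=lambda j: euclidean(query, train_embeddings[j]),
    )[:k]
    # 2) Delta step: average each candidate's gain over those neighbours.
    avg_gain = [
        sum(delta[i][j] for j in neighbours) / len(neighbours)
        for i in range(len(delta))
    ]
    # 3) Return the candidates with the highest average gain.
    ranked = sorted(range(len(delta)), key=lambda i: avg_gain[i], reverse=True)
    return ranked[:n_demos]


embeddings = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
delta_matrix = [
    [0.0, 0.1, 0.9],
    [0.3, 0.0, -0.5],
    [0.4, 0.5, 0.0],
]
chosen = select_demonstrations([0.2, 0.0], embeddings, delta_matrix, k=2, n_demos=2)
```

In this toy case the query's nearest neighbours are examples 0 and 1, so example 2 is chosen first even though it is far from the query, which is the contrast with pure similarity-based selection.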

[136] APT: Improving Specialist LLM Performance with Weakness Case Acquisition and Iterative Preference Training

Jun Rao,Zepeng Lin,Xuebo Liu,Xiaopeng Ke,Lian Lian,Dong Jin,Shengjun Cheng,Jun Yu,Min Zhang

Main category: cs.CL

TL;DR: APT trains LLMs in a targeted way on self-generated weakness cases and similar data, improving domain performance while retaining general capabilities.

Details

Motivation: To address the risk that domain-specific fine-tuning degrades the general capabilities of LLMs. Method: Iterative preference training on self-generated weakness data (error samples and similar samples). Result: Experiments show that APT outperforms existing methods on Llama-2 and Mistral-V0.3 without harming general capabilities. Conclusion: APT is an effective strategy for improving domain performance while maintaining general capabilities. Abstract: Large Language Models (LLMs) often require domain-specific fine-tuning to address targeted tasks, which risks degrading their general capabilities. Maintaining a balance between domain-specific enhancements and general model utility is a key challenge. This paper proposes a novel approach named APT (Weakness Case Acquisition and Iterative Preference Training) to enhance domain-specific performance with self-generated dis-preferred weakness data (bad cases and similar cases). APT uniquely focuses on training the model using only those samples where errors occur, alongside a small, similar set of samples retrieved for this purpose. This targeted training minimizes interference with the model's existing knowledge base, effectively retaining generic capabilities. Experimental results on the Llama-2 and Mistral-V0.3 models across various benchmarks demonstrate that APT ensures no reduction in generic capacity and achieves superior performance on downstream tasks compared to various existing methods. This validates our method as an effective strategy for enhancing domain-specific capabilities without sacrificing the model's broader applicability.

[137] Explainable AI: XAI-Guided Context-Aware Data Augmentation

Melkamu Abay Mersha,Mesay Gemeda Yigezu,Atnafu Lambebo Tonja,Hassan Shakil,Samer Iskander,Olga Kolesnikova,Jugal Kalita

Main category: cs.CL

TL;DR: The paper proposes XAI-Guided Context-Aware Data Augmentation, a method based on explainable AI (XAI) that selectively preserves task-relevant features and iteratively refines the augmented data, clearly improving model performance on low-resource language tasks such as hate speech detection and sentiment analysis.

Details

Motivation: To address the noise, semantic drift, and contextual incoherence that conventional data augmentation introduces in low-resource language tasks, while using XAI techniques to improve model transparency and controllability. Method: The XAI-SR-BT and XAI-PR-BT frameworks combine XAI techniques with an iterative feedback loop to refine the augmentation process. Result: On the Amharic dataset, XAI-SR-BT and XAI-PR-BT improve accuracy over the baseline by 6.6% and 8.1%, respectively, and outperform existing augmentation techniques by 4.8% and 5%. Conclusion: The method offers a more controlled, interpretable, and context-aware approach to data augmentation and opens a new use of XAI techniques in AI model training. Abstract: Explainable AI (XAI) has emerged as a powerful tool for improving the performance of AI models, going beyond providing model transparency and interpretability. The scarcity of labeled data remains a fundamental challenge in developing robust and generalizable AI models, particularly for low-resource languages. Conventional data augmentation techniques introduce noise, cause semantic drift, disrupt contextual coherence, lack control, and lead to overfitting. To address these challenges, we propose XAI-Guided Context-Aware Data Augmentation. This novel framework leverages XAI techniques to modify less critical features while selectively preserving the most task-relevant features. Our approach integrates an iterative feedback loop, which refines augmented data over multiple augmentation cycles based on explainability-driven insights and the model performance gain. Our experimental results demonstrate that XAI-SR-BT and XAI-PR-BT improve the accuracy of models on hate speech and sentiment analysis tasks by 6.6% and 8.1%, respectively, compared to the baseline, using the Amharic dataset with the XLM-R model. XAI-SR-BT and XAI-PR-BT outperform existing augmentation techniques by 4.8% and 5%, respectively, on the same dataset and model. Overall, XAI-SR-BT and XAI-PR-BT consistently outperform both baseline and conventional augmentation techniques across all tasks and models. This study provides a more controlled, interpretable, and context-aware solution to data augmentation, addressing critical limitations of existing augmentation techniques and offering a new paradigm shift for leveraging XAI techniques to enhance AI model training.

[138] EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding

Mingxu Tao,Jie Hu,Mingchuan Yang,Yunhuai Liu,Dongyan Zhao,Yansong Feng

Main category: cs.CL

TL;DR: EpiCoDe is a new method that uses model extrapolation and contrastive decoding to improve LLM performance in data-scarce settings without extra training.

Details

Motivation: The high cost of annotated data limits what LLMs can do on downstream tasks, so data scarcity needs to be addressed. Method: Model extrapolation strengthens the fine-tuned model, and contrastive decoding then reduces prediction errors. Result: Across three tasks and four LLMs, EpiCoDe significantly outperforms existing methods. Conclusion: A theoretical framework explains the mechanism behind contrastive decoding and supports the effectiveness of EpiCoDe. Abstract: The remarkable performance of large language models (LLMs) relies heavily on the availability of abundant high-quality training data. However, the high cost of acquiring annotated data often prevents models from obtaining capabilities to tackle downstream tasks. In this paper, we introduce a novel method, EpiCoDe, that boosts model performance in data-scarcity scenarios without extra training. We first employ model extrapolation to enhance a finetuned model with its inferior version, and then adopt contrastive decoding to further reduce predicted errors, by comparing the logit scores given by the extrapolated and the vanilla finetuned model. Experiments across three tasks over four different LLMs show that EpiCoDe consistently outperforms existing methods with significant and robust improvement. We also propose a new theoretical framework to reveal the mechanism behind contrastive decoding in data-scarcity scenarios, which further helps us better understand the effectiveness of EpiCoDe.
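The two steps named in this abstract can be written down in a few lines. The formulas here are common textbook forms chosen for illustration, not confirmed details of EpiCoDe: linear extrapolation of parameters away from the weaker checkpoint with a hypothetical coefficient `alpha`, and a contrastive score that amplifies the gap between the extrapolated and fine-tuned logits with a hypothetical coefficient `beta`.

```python
from typing import List, Sequence


def extrapolate(
    theta_ft: Sequence[float], theta_weak: Sequence[float], alpha: float
) -> List[float]:
    """Move parameters further along the direction weak -> fine-tuned.

    theta_ext = theta_ft + alpha * (theta_ft - theta_weak)  (assumed form)
    """
    return [f + alpha * (f - w) for f, w in zip(theta_ft, theta_weak)]


def contrastive_scores(
    logits_ext: Sequence[float], logits_ft: Sequence[float], beta: float
) -> List[float]:
    """Boost tokens the extrapolated model prefers more than the fine-tuned one.

    score = (1 + beta) * logit_ext - beta * logit_ft  (assumed form)
    """
    return [(1 + beta) * e - beta * f for e, f in zip(logits_ext, logits_ft)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (greedy decoding over the scores)."""
    return max(range(len(values)), key=lambda i: values[i])


theta_ext = extrapolate([1.0, 2.0], [0.5, 2.5], alpha=0.5)
scores = contrastive_scores([2.0, 1.0, 2.5], [1.0, 1.0, 3.0], beta=1.0)
next_token = argmax(scores)
```

In the toy logits, token 2 has the highest raw extrapolated logit, but token 0 wins after the contrast because the extrapolated model prefers it more strongly than the fine-tuned model does.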

[139] Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing

Shigeng Chen,Linhao Luo,Zhangchi Qiu,Yanan Cao,Carl Yang,Shirui Pan

Main category: cs.CL

TL;DR: The paper proposes the MedEditBench framework to evaluate knowledge editing (KE) methods in the medical domain, finds that existing methods achieve only superficial memorization, and introduces SGR-Edit, which significantly improves performance.

Details

Motivation: To examine whether knowledge editing applies to the complex medical domain and to address the failure of existing methods to generalize to new scenarios. Method: The MedEditBench framework includes a new benchmark and three knowledge editing paradigms; SGR-Edit is developed to use model-generated rationales as the editing target. Result: Existing KE methods achieve only superficial memorization; SGR-Edit significantly outperforms them, and the study gives further insight into the localization of medical knowledge and sequential editing. Conclusion: SGR-Edit offers an effective approach to medical knowledge editing and practical guidance for real-world use. Abstract: Recently, knowledge editing (KE) has emerged as a promising approach to update specific facts in Large Language Models (LLMs) without the need for full retraining. Despite its effectiveness on general-domain benchmarks, its applicability to the complex medical domain remains largely unexplored. Medical knowledge editing is particularly challenging, as it requires LLMs to internalize the knowledge and generalize to unseen scenarios for effective and interpretable decision-making. In this work, we propose a novel framework called MedEditBench to rigorously evaluate the effectiveness of existing KE methods in the medical domain. In MedEditBench, we introduce a new medical knowledge editing benchmark as well as three different knowledge editing paradigms, which are designed to assess the impact of different knowledge sources for editing. Our findings indicate that current KE methods result in only superficial memorization of the injected information, failing to generalize to new scenarios. To overcome this limitation, we present Self-Generated Rationale Editing (SGR-Edit), which utilizes model-derived rationales as the target knowledge for editing, thereby uncovering the underlying reasoning process and demonstrating significant improvements over existing KE approaches. Additionally, we offer deeper insights into medical knowledge editing, including the localization of medical knowledge in LLMs and the impact of sequential editing on evolving knowledge. This could provide practical guidance for implementing KE methods in real-world medical applications.

[140] Measuring Human Involvement in AI-Generated Text: A Case Study on Academic Writing

Yuchen Guo,Zhicheng Dou,Huy H. Nguyen,Ching-Chun Chang,Saku Sugawara,Isao Echizen

Main category: cs.CL

TL;DR: The paper proposes a method based on BERTScore and RoBERTa for measuring the degree of human involvement in generated text, addressing the limits of conventional binary classification.

Details

Motivation: As generative AI spreads, human-machine collaboration in writing is increasingly common, but existing detectors treat the task as binary classification and cannot measure the degree of human involvement. Method: BERTScore serves as the metric, and a multi-task RoBERTa-based regressor is trained with a token classification task to address participation detection obfuscation. Result: On a dataset simulating academic scenarios, the method performs well (F1 score of 0.9423 and regression mean squared error of 0.004), while the existing detectors examined fail. Conclusion: The method measures human involvement effectively and generalizes to some extent across generative models. Abstract: Content creation has dramatically progressed with the rapid advancement of large language models like ChatGPT and Claude. While this progress has greatly enhanced various aspects of life and work, it has also negatively affected certain areas of society. A recent survey revealed that nearly 30% of college students use generative AI to help write academic papers and reports. Most countermeasures treat the detection of AI-generated text as a binary classification task and thus lack robustness. This approach overlooks human involvement in the generation of content even though human-machine collaboration is becoming mainstream. Besides generating entire texts, people may use machines to complete or revise texts. Such human involvement varies case by case, which makes binary classification a less than satisfactory approach. We refer to this situation as participation detection obfuscation. We propose using BERTScore as a metric to measure human involvement in the generation process and a multi-task RoBERTa-based regressor trained on a token classification task to address this problem. To evaluate the effectiveness of this approach, we simulated academic-based scenarios and created a continuous dataset reflecting various levels of human involvement. All of the existing detectors we examined failed to detect the level of human involvement on this dataset. Our method, however, succeeded (F1 score of 0.9423 and a regressor mean squared error of 0.004). Moreover, it demonstrated some generalizability across generative models. Our code is available at https://github.com/gyc-nii/CAS-CS-and-dual-head-detector

[141] Accurate Sublayer Pruning for Large Language Models by Exploiting Latency and Tunability Information

Seungcheol Park,Sojin Lee,Jongjin Kim,Jinsik Lee,Hyunjik Jo,U Kang

Main category: cs.CL

TL;DR: SPRINT is a sublayer pruning method for large language models (LLMs) that accounts for latency reduction and sublayer tunability to speed up inference without sacrificing accuracy.

Details

Motivation: Slow inference limits the wide use of LLMs, and existing sublayer pruning methods lose accuracy because they ignore the characteristics of individual sublayers. Method: SPRINT combines latency reduction and sublayer tunability information, iteratively pruning redundant sublayers and quickly tuning the parameters of the remaining ones. Result: Experiments show that SPRINT achieves up to 23.88 percentage points higher accuracy than existing pruning algorithms on zero-shot commonsense reasoning benchmarks. Conclusion: SPRINT achieves the best accuracy-speedup trade-off and offers an effective route to efficient LLM deployment. Abstract: How can we accelerate large language models (LLMs) without sacrificing accuracy? The slow inference speed of LLMs prevents us from benefiting from their remarkable performance in diverse applications. This is mainly because numerous sublayers are stacked together in LLMs. Sublayer pruning compresses and expedites LLMs by removing unnecessary sublayers. However, existing sublayer pruning algorithms are limited in accuracy since they naively select sublayers to prune, overlooking the different characteristics of each sublayer. In this paper, we propose SPRINT (Sublayer PRuning wIth LateNcy and Tunability Information), an accurate sublayer pruning method for LLMs. SPRINT accurately selects a target sublayer to prune by considering 1) the amount of latency reduction after pruning and 2) the tunability of sublayers. SPRINT iteratively prunes redundant sublayers and swiftly tunes the parameters of remaining sublayers. Experiments show that SPRINT achieves the best accuracy-speedup trade-off, exhibiting up to 23.88 percentage points higher accuracy on zero-shot commonsense reasoning benchmarks compared to existing pruning algorithms.

[142] An Efficient Task-Oriented Dialogue Policy: Evolutionary Reinforcement Learning Injected by Elite Individuals

Yangyang Zhao,Ben Niu,Libo Qin,Shihan Wang

Main category: cs.CL

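TL;DR: The paper combines evolutionary algorithms (EA) with deep reinforcement learning (DRL) and adds an elite individual injection mechanism to improve the exploration-exploitation balance in task-oriented dialogue systems, raising performance and reducing exploration time.

Details

Motivation: In task-oriented dialogue systems, DRL struggles to balance exploration and exploitation because of the high dimensionality of state and action spaces, leading to local optima or poor convergence. EAs explore the solution space effectively by maintaining population diversity, but combining them directly with DRL is inefficient because of the flexibility of natural language. Method: The global search of EA is combined with the local optimization of DRL, and an elite individual injection mechanism adaptively introduces the best-performing individuals to improve EA search efficiency. Result: Experiments on four datasets show that the method significantly improves the exploration-exploitation balance and performance, and that elite individual injection reduces exploration time. Conclusion: The method integrates EA and DRL efficiently for task-oriented dialogue policy learning and offers an effective way to balance exploration and exploitation. Abstract: Deep Reinforcement Learning (DRL) is widely used in task-oriented dialogue systems to optimize dialogue policy, but it struggles to balance exploration and exploitation due to the high dimensionality of state and action spaces. This challenge often results in local optima or poor convergence. Evolutionary Algorithms (EAs) have been proven to effectively explore the solution space of neural networks by maintaining population diversity. Inspired by this, we innovatively combine the global search capabilities of EA with the local optimization of DRL to achieve a balance between exploration and exploitation. Nevertheless, the inherent flexibility of natural language in dialogue tasks complicates this direct integration, leading to prolonged evolutionary times. Thus, we further propose an elite individual injection (EII) mechanism to enhance EA's search efficiency by adaptively introducing best-performing individuals into the population. Experiments across four datasets show that our approach significantly improves the balance between exploration and exploitation, boosting performance. Moreover, the effectiveness of the EII mechanism in reducing exploration time has been demonstrated, achieving an efficient integration of EA and DRL on task-oriented dialogue policy tasks.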

[143] TokAlign: Efficient Vocabulary Adaptation via Token Alignment

Chong Li,Jiajun Zhang,Chengqing Zong

Main category: cs.CL

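TL;DR: TokAlign is an efficient method that improves LLM performance across languages and domains through vocabulary replacement and knowledge transfer.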

Details

Motivation: Addresses the inefficient training and generation that LLMs suffer in new domains or languages because of vocabulary mismatch. Method: Aligns the source vocabulary to the target one, then rearranges the model parameters and progressively fine-tunes them for the new vocabulary. Result: Significantly improves multilingual text compression rates and vocabulary initialization, reducing perplexity after initialization from $3.4 \times 10^2$ to $1.2 \times 10^2$. Conclusion: TokAlign is effective and generalizes across model scales, needing as few as 5k steps to restore model performance; unifying vocabularies also markedly improves token-level distillation. Abstract: Tokenization serves as a foundational step for Large Language Models (LLMs) to process text. In new domains or languages, the inefficiency of the tokenizer will slow down the training and generation of LLMs. The mismatch in vocabulary also hinders deep knowledge transfer between LLMs, such as token-level distillation. To mitigate this gap, we propose an efficient method named TokAlign to replace the vocabulary of an LLM from the token co-occurrences view, and further transfer the token-level knowledge between models. It first aligns the source vocabulary to the target one by learning a one-to-one mapping matrix for token IDs. Model parameters, including embeddings, are rearranged and progressively fine-tuned for the new vocabulary. Our method significantly improves multilingual text compression rates and vocabulary initialization for LLMs, decreasing the perplexity from $3.4 \times 10^2$ for strong baseline methods to $1.2 \times 10^2$ after initialization. Experimental results on models across multiple parameter scales demonstrate the effectiveness and generalization of TokAlign, which costs as few as 5k steps to restore the performance of the vanilla model. After unifying vocabularies between LLMs, token-level distillation can remarkably boost (+4.4% over sentence-level distillation) the base model, costing only 235M tokens.

[144] Seed-Coder: Let the Code Model Curate Data for Itself

Yuyu Zhang,Jing Su,Yifan Sun,Chenguang Xi,Xia Xiao,Shen Zheng,Anxiang Zhang,Kaibo Liu,Daoguang Zan,Tao Sun,Jinhua Zhu,Shulin Xin,Dong Huang,Yetao Bai,Lixin Dong,Chao Li,Jianchong Chen,Hanzhi Zhou,Yifan Huang,Guanghan Ning,Xierui Song,Jiaze Chen,Siyao Liu,Kai Shen,Liang Xiang,Yonghui Wu

Main category: cs.CL

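TL;DR: Seed-Coder is a family of open-source LLMs whose model-centric data pipeline reduces human involvement in building code pretraining data, and which performs strongly on code tasks.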

Details

Motivation: Current open-source LLMs rely on human effort to build code pretraining data, which scales poorly, is prone to subjective bias, and is costly to maintain. Method: Uses a model-centric data pipeline in which LLMs score and filter code data; the instruct model is trained via supervised fine-tuning and preference optimization, and the reasoning model uses LongCoT reinforcement learning. Result: Seed-Coder performs strongly on code generation, completion, editing, reasoning, and software engineering tasks, surpassing open-source models of similar size. Conclusion: Seed-Coder demonstrates an effective way to reduce human involvement in data construction, with implications for code-related tasks and for improving the general intelligence of LLMs. Abstract: Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.

[145] Go-Browse: Training Web Agents with Structured Exploration

Apurva Gandhi,Graham Neubig

Main category: cs.CL

TL;DR: Go-Browse automatically collects diverse web agent data through structured exploration, improving digital agents' understanding of their environment.

Details

Motivation: Digital agents lack an understanding of their environment; for example, they get lost in unfamiliar websites and cannot complete tasks efficiently. Method: Frames data collection as a graph search, which allows information to be reused across exploration episodes, and instantiates the method on the WebArena benchmark, collecting 10K successful task-solving trajectories and 40K interaction steps. Result: A 7B-parameter language model fine-tuned on this data reaches a 21.7% success rate on WebArena, beating GPT-4o mini by 2.4% and exceeding the previous best sub-10B-parameter result by 2.9%. Conclusion: Through efficient exploration and data reuse, Go-Browse significantly improves web agent task success and sets a new performance mark for small models. Abstract: One of the fundamental problems in digital agents is their lack of understanding of their environment. For instance, a web browsing agent may get lost in unfamiliar websites, uncertain what pages must be visited to achieve its goals. To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments. Go-Browse achieves efficient exploration by framing data collection as a graph search, enabling reuse of information across exploration episodes. We instantiate our method on the WebArena benchmark, collecting a dataset of 10K successful task-solving trajectories and 40K interaction steps across 100 URLs. Fine-tuning a 7B parameter language model on this dataset achieves a success rate of 21.7% on the WebArena benchmark, beating GPT-4o mini by 2.4% and exceeding current state-of-the-art results for sub-10B parameter models by 2.9%.
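
To make the "data collection as a graph search" idea concrete, here is a minimal sketch under stated assumptions: pages are nodes, links are edges, and a shared frontier and seen-set stand in for the reuse of information across exploration episodes. The breadth-first order, the `explore` function, the `get_links` callback, and the toy site are all hypothetical illustrations and are not taken from the Go-Browse paper.

```python
from collections import deque

def explore(start_url, get_links, max_pages=100):
    """Breadth-first exploration over a link graph (illustrative only).

    `get_links` is a hypothetical callback returning the URLs reachable
    from a page. The shared `seen` set means a page discovered once is
    never re-queued, which is the simplest form of cross-episode reuse.
    """
    visited = []
    seen = {start_url}
    frontier = deque([start_url])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for nxt in get_links(url):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return visited

# Toy link graph standing in for a website.
toy_site = {
    "/": ["/shop", "/forum"],
    "/shop": ["/shop/item", "/"],
    "/forum": ["/"],
    "/shop/item": [],
}
order = explore("/", lambda u: toy_site.get(u, []))
print(order)
```

In a real agent setting each visited node would also spawn task proposals and trajectories; the sketch only covers the traversal bookkeeping.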

[146] Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement

Xiaofeng Zhou,Heyan Huang,Lizi Liao

Main category: cs.CL

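TL;DR: Proposes a Debate and Reflect (D&R) framework that generates feedback through multi-turn debates between small models and stronger teacher models, combined with Tree-structured Direct Preference Optimization (T-DPO) to improve small-model performance.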

Details

Motivation: Although large language models (LLMs) excel at complex tasks, their high computational demands limit widespread adoption, and existing distillation methods yield limited gains. Method: The D&R framework orchestrates multi-turn debates between small models and teacher models to elicit feedback; T-DPO organizes the debate logs into a hierarchical structure for training. Result: Experiments show that the approach significantly improves small-model accuracy, robustness, and generalization, outperforming conventional baselines by a large margin. Conclusion: The D&R framework and T-DPO offer a new approach to efficiently distilling large models, with broad potential for application. Abstract: Large Language Models (LLMs) continue to set new standards in knowledge-intensive and complex reasoning tasks, yet their high computational demands limit widespread adoption. While distilling large models into smaller ones offers a sustainable solution, current techniques--such as static knowledge distillation, resource-intensive reinforcement learning from human feedback, or limited self-reflection--struggle to yield substantial and lasting performance gains. In this paper, we present a novel Debate and Reflect (D&R) framework that orchestrates multi-turn debates between smaller models and stronger teacher models, eliciting actionable feedback (e.g., error analysis, corrective strategies) to guide student models. Further, we introduce Tree-structured Direct Preference Optimization (T-DPO) to efficiently leverage these debate logs, organizing interactions into a hierarchical format for effective training. Empirical evaluations across diverse NLP benchmarks demonstrate that our approach significantly improves smaller-model accuracy, robustness, and generalization, outperforming conventional baselines by a large margin.

[147] BPO: Revisiting Preference Modeling in Direct Preference Optimization

Lin Sun,Chuang Liu,Peng Liu,Bingyang Li,Weijia Lu,Ning Wu

Main category: cs.CL

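TL;DR: BPO is a new framework that uses a balanced reward margin and a gap adaptor to dynamically balance the optimization of chosen and rejected responses, addressing DPO's neglect of absolute reward magnitudes and significantly improving performance.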

Details

Motivation: DPO is effective but neglects absolute reward magnitudes, which can lower the likelihood of chosen responses and raise the risk of out-of-distribution responses, an issue the authors term Degraded Chosen Responses (DCR). Method: Proposes the BPO framework, whose balanced reward margin and gap adaptor dynamically balance the optimization of chosen and rejected responses. Result: BPO significantly outperforms DPO and its variants on multiple mathematical reasoning tasks, with accuracy gains of up to 11.7%. Conclusion: BPO is simple to implement, requiring only a single-line code change, is fully compatible with existing DPO-based frameworks, and addresses the DCR issue. Abstract: Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through pairwise ranking losses, it often neglects absolute reward magnitudes. This oversight can decrease the likelihood of chosen responses and increase the risk of generating out-of-distribution responses, leading to poor performance. We term this issue Degraded Chosen Responses (DCR). To address this issue, we propose Balanced Preference Optimization (BPO), a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: balanced reward margin and gap adaptor. Unlike previous methods, BPO can fundamentally resolve DPO's DCR issue, without introducing additional constraints to the loss function. Experimental results on multiple mathematical reasoning tasks show that BPO significantly outperforms DPO, improving accuracy by +10.1% with Llama-3.1-8B-Instruct (18.8% to 28.9%) and +11.7% with Qwen2.5-Math-7B (35.0% to 46.7%). It also surpasses DPO variants by +3.6% over IPO (43.1%), +5.0% over SLiC (41.7%), and +3.1% over Cal-DPO (43.6%) on the same model. Remarkably, our algorithm requires only a single line of code modification, making it simple to implement and fully compatible with existing DPO-based frameworks.
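
The claim that DPO "neglects absolute reward magnitudes" can be illustrated with the standard DPO objective, which depends only on the difference between the chosen and rejected implicit rewards. The sketch below implements that standard per-pair loss, not BPO (whose balanced reward margin and gap adaptor are not specified in the abstract); the log-probability values are made-up numbers chosen for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Arguments are sequence log-probabilities under the policy and the
    frozen reference model. The loss is -log(sigmoid(beta * margin)),
    where margin is the difference of the two log-ratio "rewards".
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# Case A: the chosen response became more likely than under the reference.
loss_a = dpo_loss(-10.0, -20.0, ref_chosen=-12.0, ref_rejected=-18.0)
# Case B: both responses became far less likely, but the margin is unchanged.
loss_b = dpo_loss(-30.0, -40.0, ref_chosen=-12.0, ref_rejected=-18.0)

print(loss_a, loss_b)
```

Both cases give the same loss even though the chosen response's log-probability is 20 nats lower in case B, which is the kind of degradation the abstract calls DCR.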

[148] ConsistentChat: Building Skeleton-Guided Consistent Dialogues for Large Language Models from Scratch

Jiawei Chen,Xinyan Guan,Qianhao Yuan,Guozhao Mo,Weixiang Zhou,Yaojie Lu,Hongyu Lin,Ben He,Le Sun,Xianpei Han

Main category: cs.CL

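TL;DR: Proposes a skeleton-guided multi-turn dialogue generation framework that models conversational intent and generates structured dialogue skeletons, significantly improving multi-turn consistency and task completion rates.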

Details

Motivation: Existing instruction data synthesis methods focus mainly on single-turn instructions and neglect cross-turn coherence, leading to context drift and lower task completion rates. Method: Works in two stages: Intent Modeling, which captures the global structure of a dialogue, and Skeleton Generation, which constructs a sequence of user queries aligned with the modeled intent. Result: The authors construct the ConsistentChat dataset; experiments show that models fine-tuned on it improve significantly in chat consistency and task success rate. Conclusion: The approach performs well for multi-turn dialogue generation, outperforming models trained on existing single-turn and multi-turn instruction datasets. Abstract: Current instruction data synthesis methods primarily focus on single-turn instructions and often neglect cross-turn coherence, resulting in context drift and reduced task completion rates in extended conversations. To address this limitation, we propose Skeleton-Guided Multi-Turn Dialogue Generation, a framework that constrains multi-turn instruction synthesis by explicitly modeling human conversational intent. It operates in two stages: (1) Intent Modeling, which captures the global structure of human dialogues by assigning each conversation to one of nine well-defined intent trajectories, ensuring a coherent and goal-oriented information flow; and (2) Skeleton Generation, which constructs a structurally grounded sequence of user queries aligned with the modeled intent, thereby serving as a scaffold that constrains and guides the downstream instruction synthesis process. Based on this process, we construct ConsistentChat, a multi-turn instruction dataset with approximately 15,000 multi-turn conversations and 224,392 utterances. Experiments on the Light, Topdial, and MT-Eval benchmarks show that models fine-tuned on ConsistentChat achieve a 20-30% improvement in chat consistency and up to a 15% increase in task success rate, significantly outperforming models trained on existing single-turn and multi-turn instruction datasets.

[149] POSS: Position Specialist Generates Better Draft for Speculative Decoding

Langlin Huang,Chengsong Huang,Jixuan Leng,Di Huang,Jiaxin Huang

Main category: cs.CL

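TL;DR: The paper proposes Position Specialists (PosS), which use multiple position-specialized draft layers to speed up LLM inference, addressing the decline in draft prediction quality at later positions.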

Details

Motivation: In existing methods, the quality of draft predictions degrades at later positions because errors accumulate in the features generated by the draft model, which hurts inference efficiency. Method: Proposes PosS, which uses multiple position-specialized draft layers, each focused on handling the feature deviation at its assigned position(s). Result: Experiments on Llama-3-8B-Instruct and Llama-2-13B-chat show that PosS improves over baselines on average acceptance length and speed-up ratio. Conclusion: Position-specialized draft layers improve the draft model's prediction accuracy and inference speed. Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from the degrading quality of draft token predictions at later positions, due to error accumulation in draft model generated features. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experimental results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines on average acceptance length and speed-up ratio. Our codebase is available at https://github.com/shrango/PosS.
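
The "average acceptance length" metric is easier to read with a small example. The sketch below assumes a simplified greedy verification rule, in which drafted tokens are accepted until the first one that differs from the target model's own choice; practical speculative decoding often uses a probabilistic acceptance test instead. The function names and token IDs are hypothetical and do not come from the PosS codebase.

```python
def accepted_length(draft_tokens, target_tokens):
    """Number of leading draft tokens that match the target model's tokens.

    Under greedy verification, drafting stops paying off at the first
    mismatch: everything after it is discarded.
    """
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

def average_acceptance_length(rounds):
    """Mean accepted length over (draft, target) pairs, one per drafting round."""
    return sum(accepted_length(d, t) for d, t in rounds) / len(rounds)

rounds = [
    ([5, 9, 2, 7], [5, 9, 4, 7]),  # mismatch at position 2: 2 tokens accepted
    ([3, 1, 8, 6], [3, 1, 8, 6]),  # all 4 tokens accepted
]
avg = average_acceptance_length(rounds)
print(avg)
```

Because acceptance stops at the first mismatch, errors at later draft positions cap the accepted length, which is why improving later-position accuracy raises this metric.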

[150] MiMo-VL Technical Report

Xiaomi LLM-Core Team,Zihao Yue,Zhenru Lin,Yifan Song,Weikun Wang,Shuhuai Ren,Shuhao Gu,Shicheng Li,Peidian Li,Liang Zhao,Lei Li,Kainan Bao,Hao Tian,Hailin Zhang,Gang Wang,Dawei Zhu,Cici,Chenhong He,Bowen Ye,Bowen Shen,Zihan Zhang,Zihan Jiang,Zhixian Zheng,Zhichao Song,Zhenbo Luo,Yue Yu,Yudong Wang,Yuanyuan Tian,Yu Tu,Yihan Yan,Yi Huang,Xu Wang,Xinzhe Xu,Xingchen Song,Xing Zhang,Xing Yong,Xin Zhang,Xiangwei Deng,Wenyu Yang,Wenhan Ma,Weiwei Lv,Weiji Zhuang,Wei Liu,Sirui Deng,Shuo Liu,Shimao Chen,Shihua Yu,Shaohui Liu,Shande Wang,Rui Ma,Qiantong Wang,Peng Wang,Nuo Chen,Menghang Zhu,Kangyang Zhou,Kang Zhou,Kai Fang,Jun Shi,Jinhao Dong,Jiebao Xiao,Jiaming Xu,Huaqiu Liu,Hongshen Xu,Heng Qu,Haochen Zhao,Hanglong Lv,Guoan Wang,Duo Zhang,Dong Zhang,Di Zhang,Chong Ma,Chang Liu,Can Cai,Bingquan Xia

Main category: cs.CL

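TL;DR: Open-sources two vision-language models, MiMo-VL-7B-SFT and MiMo-VL-7B-RL, which achieve state-of-the-art performance across a broad range of evaluation tasks.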

Details

Motivation: To improve visual understanding and multimodal reasoning and to advance the field. Method: Combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL), incorporating high-quality reasoning data. Result: MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 of 40 tasks, scores 59.4 on OlympiadBench, and performs strongly on GUI grounding tasks. Conclusion: Mixed RL and high-quality reasoning data are important for model performance, and the open-sourced models and evaluation suite promote reproducibility. Abstract: We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.

[151] FreePRM: Training Process Reward Models Without Ground Truth Process Labels

Lin Sun,Chuang Liu,Xiaofeng Ma,Tao Yang,Weijia Lu,Ning Wu

Main category: cs.CL

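TL;DR: FreePRM is a weakly supervised framework that trains process reward models (PRMs) without ground-truth step labels, using pseudo labels and Buffer Probability to reduce the impact of noise, and outperforms existing methods.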

Details

Motivation: Conventional PRM training depends on step-level labels that are costly and hard to obtain; FreePRM aims to remove this dependence. Method: FreePRM generates pseudo step-level labels from the correctness of the final outcome and uses Buffer Probability to reduce the impact of label noise. Result: FreePRM achieves an average F1 score of 53.0% on ProcessBench, outperforming the other PRMs it is compared with. Conclusion: FreePRM offers a new paradigm for PRM training that reduces reliance on costly annotation while maintaining strong performance. Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated that Process Reward Models (PRMs) play a crucial role in enhancing model performance. However, training PRMs typically requires step-level labels, either manually annotated or automatically generated, which can be costly and difficult to obtain at scale. To address this challenge, we introduce FreePRM, a weakly supervised framework for training PRMs without access to ground-truth step-level labels. FreePRM first generates pseudo step-level labels based on the correctness of the final outcome, and then employs Buffer Probability to eliminate the impact of noise inherent in pseudo labeling. Experimental results show that FreePRM achieves an average F1 score of 53.0% on ProcessBench, outperforming a fully supervised PRM trained on Math-Shepherd by +24.1%. Compared to other open-source PRMs, FreePRM outperforms RLHFlow-PRM-Mistral-8B (28.4%) by +24.6%, EurusPRM (31.3%) by +21.7%, and Skywork-PRM-7B (42.1%) by +10.9%. This work introduces a new paradigm in PRM training, significantly reducing reliance on costly step-level annotations while maintaining strong performance.
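
The first stage described above, deriving pseudo step-level labels from the correctness of the final outcome, can be sketched in a few lines. This is an assumption-laden illustration: the function name and the exact-match answer comparison are hypothetical, and the Buffer Probability mechanism is deliberately left out because the abstract does not specify it.

```python
def pseudo_step_labels(steps, final_answer, gold_answer):
    """Label every step with the correctness of the final outcome.

    All steps of a solution inherit 1 if the final answer is correct and
    0 otherwise. The labels are noisy by construction: an incorrect
    solution can still contain correct early steps.
    """
    label = 1 if final_answer == gold_answer else 0
    return [(step, label) for step in steps]

solution = ["12 * 3 = 36", "36 + 5 = 41", "41 / 2 = 21"]  # last step is wrong
labeled = pseudo_step_labels(solution, final_answer="21", gold_answer="20.5")
print(labeled)
```

Here the first two steps are arithmetically correct but still receive label 0, which is exactly the noise a mechanism like Buffer Probability is meant to absorb.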

[152] Exchange of Perspective Prompting Enhances Reasoning in Large Language Models

Lin Sun,Can Zhang

Main category: cs.CL

TL;DR: Proposes Exchange-of-Perspective (EoP), a framework that improves LLM performance on NLP tasks by exchanging perspectives across different definitions of a problem.

Details

Motivation: The performance of large language models on diverse NLP tasks is limited by their inherent comprehension of a problem; EoP aims to break this fixed mindset. Method: The EoP framework exchanges perspectives across different definitions of a problem, avoiding dependence on any single formulation of the question. Result: On 8 benchmarks EoP significantly improves performance, for example by 3.6% on AQuA and 7.7% on Math. Conclusion: The EoP framework effectively improves LLM performance on complex NLP tasks. Abstract: Large language models (LLMs) have made significant advancements in addressing diverse natural language processing (NLP) tasks. However, their performance is often limited by their inherent comprehension of problems. To address this limitation, we propose Exchange-of-Perspective (EoP), a novel framework designed to exchange perspectives across different definitions of a problem, so that it can break the fixed mindset from any particular formulation of the question. We conducted extensive and comprehensive experiments on 8 benchmarks. The results show that EoP can significantly improve performance. For instance, compared to the non-commutative baseline PHP, GPT-3.5-Turbo with EoP achieves a 3.6% improvement on AQuA (60.6% to 64.2%), while GPT-4-powered EoP demonstrates a 7.7% overall accuracy enhancement on Math (53.9% to 61.6%) and a 3.5% improvement on OlympiadBench Maths (43.5% to 47.0%) when using Qwen-2.5-72b.

[153] KG-BiLM: Knowledge Graph Embedding via Bidirectional Language Models

Zirui Chen,Xin Wang,Zhao Li,Wenbin Guo,Dongxiao He

Main category: cs.CL

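TL;DR: KG-BiLM is a bidirectional language model framework that combines structural information from knowledge graphs with the semantic expressiveness of generative transformers, filling the gap left by existing methods in unifying global KG connectivity, linguistic context, and reasoning semantics.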

Details

Motivation: Existing methods typically focus on either knowledge graph structure or textual semantics, and lack a unified framework that captures global connectivity, linguistic context, and reasoning semantics together. Method: KG-BiLM has three key components, Bidirectional Knowledge Attention, Knowledge-Masked Prediction, and Contrastive Graph Semantic Aggregation, which fuse structural information with semantic expressiveness. Result: Experiments show that KG-BiLM outperforms baselines on link prediction, especially on large-scale graphs with complex multi-hop relations. Conclusion: KG-BiLM unifies structural information and textual semantics, validating its effectiveness for knowledge representation learning. Abstract: Recent advances in knowledge representation learning (KRL) highlight the urgent necessity to unify symbolic knowledge graphs (KGs) with language models (LMs) for richer semantic understanding. However, existing approaches typically prioritize either graph structure or textual semantics, leaving a gap: a unified framework that simultaneously captures global KG connectivity, nuanced linguistic context, and discriminative reasoning semantics. To bridge this gap, we introduce KG-BiLM, a bidirectional LM framework that fuses structural cues from KGs with the semantic expressiveness of generative transformers. KG-BiLM incorporates three key components: (i) Bidirectional Knowledge Attention, which removes the causal mask to enable full interaction among all tokens and entities; (ii) Knowledge-Masked Prediction, which encourages the model to leverage both local semantic contexts and global graph connectivity; and (iii) Contrastive Graph Semantic Aggregation, which preserves KG structure via contrastive alignment of sampled sub-graph representations. Extensive experiments on standard benchmarks demonstrate that KG-BiLM outperforms strong baselines in link prediction, especially on large-scale graphs with complex multi-hop relations - validating its effectiveness in unifying structural information and textual semantics.

[154] Automatically Suggesting Diverse Example Sentences for L2 Japanese Learners Using Pre-Trained Language Models

Enrico Benedetti,Akiko Aizawa,Florian Boudin

Main category: cs.CL

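TL;DR: The study explores using pre-trained language models (PLMs) to provide diverse, level-appropriate example sentences for L2 Japanese learners; retrieval outperformed generation, but PLMs show potential for making language-learning tools more adaptable.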

Details

Motivation: Providing example sentences that are diverse and matched to learners' proficiency is essential for effective language acquisition; the study aims to use PLMs to improve sentence retrieval and generation systems. Method: Uses PLMs both as quality scoring components in a retrieval system and as direct zero-shot sentence generators, and evaluates sentence quality along several dimensions (difficulty, diversity, naturalness). Result: Raters disagreed on sentence quality ratings except for difficulty; the retrieval approach was preferred by all evaluators, while the generative approaches scored lower on average. Conclusion: Although the generative approaches scored lower, PLMs show potential for making sentence suggestion systems more adaptable and improving the language learning experience. Abstract: Providing example sentences that are diverse and aligned with learners' proficiency levels is essential for fostering effective language acquisition. This study examines the use of Pre-trained Language Models (PLMs) to produce example sentences targeting L2 Japanese learners. We utilize PLMs in two ways: as quality scoring components in a retrieval system that draws from a newly curated corpus of Japanese sentences, and as direct sentence generators using zero-shot learning. We evaluate the quality of sentences by considering multiple aspects such as difficulty, diversity, and naturalness, with a panel of raters consisting of learners of Japanese, native speakers -- and GPT-4. Our findings suggest that there is inherent disagreement among participants on the ratings of sentence qualities, except for difficulty. Despite that, the retrieval approach was preferred by all evaluators, especially for beginner and advanced target proficiency, while the generative approaches received lower scores on average. Even so, our experiments highlight the potential for using PLMs to enhance the adaptability of sentence suggestion systems and therefore improve the language learning journey.

[155] From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models

Viktor Hangya,Fabian Küch,Darina Gold

Main category: cs.CL

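TL;DR: The paper reformulates generative (NLG) tasks as cheaper selection-based (NLU) tasks to reduce the computational burden of evaluating key capabilities during LLM training. Experiments show a strong correlation between the two task formats and an average reduction in evaluation time of more than 35x.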

Details

Motivation: To reduce the computational burden of generative (NLG) evaluation so that key capabilities can be monitored more efficiently during LLM training. Method: Reformulates generative tasks as selection-based (NLU) tasks and tests their correlation with the original tasks using 8 LMs of various sizes and 4 capabilities (mathematical reasoning, code generation, factual knowledge, and reading comprehension). Result: The generative and selection-based formats correlate strongly, and evaluation time falls by more than 35x on average. Conclusion: Replacing generative tasks with selection-based ones allows key LLM capabilities to be assessed efficiently at much lower computational cost. Abstract: Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, essential capabilities like reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, our aim is to decrease the computational burden of NLG benchmarks in order to enable monitoring crucial LLM capabilities during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes and 4 capabilities: mathematical reasoning, code generation, factual knowledge and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via cheaper alternatives and achieving over 35x average reduction in evaluation time. We plan to publish our benchmark adaptations.

[156] Is linguistically-motivated data augmentation worth it?

Ray Groshan,Michael Ginn,Alexis Palmer

Main category: cs.CL

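TL;DR: Through a systematic comparison of linguistically-naive and linguistically-motivated data augmentation strategies, the paper finds that linguistically-motivated strategies are more effective only when the generated data resembles the training data distribution.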

Details

Motivation: To address data scarcity and to compare the effectiveness of linguistically-naive and linguistically-motivated data augmentation strategies. Method: Runs experiments with many augmentation strategies and their combinations on two low-resource languages (Uspanteko and Arapaho), evaluating performance on machine translation and interlinear glossing. Result: Linguistically-motivated strategies can outperform naive ones, but only when the generated examples are close to the training data distribution. Conclusion: Linguistically-motivated data augmentation pays off only if the generated data stays consistent with the training data distribution. Abstract: Data augmentation, a widely-employed technique for addressing data scarcity, involves generating synthetic data examples which are then used to augment available training data. Researchers have seen surprising success from simple methods, such as random perturbations from natural examples, where models seem to benefit even from data with nonsense words, or data that doesn't conform to the rules of the language. A second line of research produces synthetic data that does in fact follow all linguistic constraints; these methods require some linguistic expertise and are generally more challenging to implement. No previous work has done a systematic, empirical comparison of both linguistically-naive and linguistically-motivated data augmentation strategies, leaving uncertainty about whether the additional time and effort of linguistically-motivated data augmentation work in fact yields better downstream performance. In this work, we conduct a careful and comprehensive comparison of augmentation strategies (both linguistically-naive and linguistically-motivated) for two low-resource languages with different morphological properties, Uspanteko and Arapaho. We evaluate the effectiveness of many different strategies and their combinations across two important sequence-to-sequence tasks for low-resource languages: machine translation and interlinear glossing. We find that linguistically-motivated strategies can have benefits over naive approaches, but only when the new examples they produce are not significantly unlike the training data distribution.

[157] Auto prompt sql: a resource-efficient architecture for text-to-sql translation in constrained environments

Zetong Tang,Qian Ma,Di Wu

Main category: cs.CL

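TL;DR: AP-SQL is a novel architecture for efficient Text-to-SQL translation in resource-constrained environments; it decomposes the task, optimizes prompt engineering, and fine-tunes large language models to improve the accuracy of SQL generation.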

Details

Motivation: To make strong Text-to-SQL methods usable in resource-constrained environments despite their reliance on resource-intensive models, while combining the strengths of small open-source models and large closed-source models. Method: Decomposes the task into schema filtering, retrieval-augmented Text-to-SQL generation based on in-context examples, and prompt-driven schema linking and SQL generation, and uses CoT and GoT templates to optimize prompt engineering. Result: Evaluations on the Spider benchmarks demonstrate the effectiveness of AP-SQL. Conclusion: Through task decomposition and prompt engineering, AP-SQL achieves efficient and accurate Text-to-SQL translation. Abstract: Using the best Text-to-SQL methods in resource-constrained environments is challenging due to their reliance on resource-intensive open-source models. This paper introduces Auto Prompt SQL (AP-SQL), a novel architecture designed to bridge the gap between resource-efficient small open-source models and the powerful capabilities of large closed-source models for Text-to-SQL translation. Our method decomposes the task into schema filtering, retrieval-augmented text-to-SQL generation based on in-context examples, and prompt-driven schema linking and SQL generation. To improve schema selection accuracy, we fine-tune large language models. Crucially, we also explore the impact of prompt engineering throughout the process, leveraging Chain-of-Thought (CoT) and Graph-of-Thought (GoT) templates to significantly enhance the model's reasoning for accurate SQL generation. Comprehensive evaluations on the Spider benchmarks demonstrate the effectiveness of AP-SQL.

[158] Learning to Insert [PAUSE] Tokens for Better Reasoning

Eunki Kim,Sangryul Kim,James Thorne

Main category: cs.CL

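TL;DR: The paper proposes Dynamic Inserting Tokens Training (DIT), which inserts [PAUSE] tokens at the positions where model confidence is lowest, improving reasoning ability and outperforming traditional fine-tuning and previous token insertion methods.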

Details

Motivation: To strengthen the reasoning ability of transformer-based large language models (LLMs) by dynamically inserting special tokens. Method: Proposes DIT, which uses token log-likelihood to identify the positions in a sequence where model confidence is lowest and inserts [PAUSE] tokens at those positions. Result: Across multiple datasets and models, DIT outperforms traditional fine-tuning and previous token insertion methods, with gains of up to 4.7%p accuracy on GSM8K, 3.23%p on AQUA-RAT, and 3.4%p pass@1 on MBPP. Conclusion: DIT is a model-based, dynamic approach rather than a heuristic one, and broadens the scope of reasoning research. Abstract: To enhance reasoning capabilities, previous works have explored incorporating special-purpose tokens into the training process. These strategies strengthen the learning mechanism of transformer-based large language models (LLMs). Building on prior research, in which inserting dummy tokens consecutively just before reasoning steps can enhance effectiveness, we introduce a novel approach termed Dynamic Inserting Tokens Training (DIT). Our method identifies positions within sequences where model confidence is lowest according to token log-likelihood. Strategically inserting [PAUSE] tokens at these positions bolsters the model's predictive capabilities for subsequent tokens. Experimental results across diverse datasets and models, from the 2.7B model to the 8B model, demonstrate that DIT consistently outperforms traditional fine-tuning and previous token insertion methods. With this simple yet effective method, we achieve accuracy gains of up to 4.7%p on GSM8K, 3.23%p on AQUA-RAT, and pass@1 improvements of up to 3.4%p on MBPP datasets. Our work shows a model-based, dynamic approach rather than a heuristic one, thereby broadening the scope of research in reasoning.
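
The position-selection step can be sketched as follows, assuming per-token log-likelihoods have already been computed by the model. Two details are assumptions made for illustration rather than facts from the paper: the [PAUSE] token is placed immediately before each low-confidence token, and the number of insertions `k` is a free parameter. The function name and the sample values are hypothetical.

```python
def insert_pause(tokens, token_logprobs, k=1, pause="[PAUSE]"):
    """Insert a pause token before the k tokens with the lowest log-likelihood.

    `token_logprobs[i]` is the model's log-probability of `tokens[i]`
    given its prefix; lower values mean lower confidence.
    """
    if len(tokens) != len(token_logprobs):
        raise ValueError("tokens and token_logprobs must have the same length")
    # Indices of the k least confident positions.
    lowest = set(sorted(range(len(tokens)), key=lambda i: token_logprobs[i])[:k])
    out = []
    for i, tok in enumerate(tokens):
        if i in lowest:
            out.append(pause)
        out.append(tok)
    return out

tokens = ["2", "+", "3", "=", "5"]
logprobs = [-0.1, -0.2, -0.3, -0.1, -2.5]  # the model is least sure about "5"
augmented = insert_pause(tokens, logprobs, k=1)
print(augmented)
```

The augmented sequences would then serve as fine-tuning data, so the model learns to use the extra position before a hard token.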

[159] Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales

Ayuto Tsutsumi,Yuu Jinnai

Main category: cs.CL

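TL;DR: The study evaluates how much large language models (LLMs) know about Japanese yokai folklore and introduces the YokaiEval benchmark dataset. Models trained on Japanese-language resources outperform English-centric models.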

Details

Motivation: The cultural knowledge of LLMs is often limited to English-speaking communities, which can marginalize the cultures of non-English communities. The study evaluates the cultural awareness of LLMs, specifically their knowledge of Japanese yokai. Method: Introduces the YokaiEval dataset of 809 multiple-choice questions about yokai and evaluates 31 Japanese and multilingual LLMs on it. Result: Models trained with Japanese language resources, especially those based on Llama-3, outperform English-centric models. Conclusion: Language resources strongly affect the cultural awareness of LLMs; models trained on Japanese know more about yokai culture. Abstract: Although Large Language Models (LLMs) have demonstrated strong language understanding and generation abilities across various languages, their cultural knowledge is often limited to English-speaking communities, which can marginalize the cultures of non-English communities. To address the problem, evaluation of the cultural awareness of the LLMs and the methods to develop culturally aware LLMs have been investigated. In this study, we focus on evaluating knowledge of folktales, a key medium for conveying and circulating culture. In particular, we focus on Japanese folktales, specifically on knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. Yokai have long served as a medium for cultural expression, making them an ideal subject for assessing the cultural awareness of LLMs. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions (each with four options) designed to probe knowledge about yokai. We evaluate the performance of 31 Japanese and multilingual LLMs on this dataset. The results show that models trained with Japanese language resources achieve higher accuracy than English-centric models, with those that underwent continued pretraining in Japanese, particularly those based on Llama-3, performing especially well. The code and dataset are available at https://github.com/CyberAgentAILab/YokaiEval.

[160] Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

Lin Mu,Guowei Chu,Li Ni,Lei Sang,Zhize Wu,Peiquan Jin,Yiwen Zhang

Main category: cs.CL

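TL;DR: Proposes RoP, a prompting strategy that improves the robustness of large language models (LLMs) to input perturbations through two stages, Error Correction and Guidance.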

Details

Motivation: LLMs are highly sensitive to input perturbations such as typographical errors or character order errors, and existing prompting techniques do not address this effectively. Method: RoP has two stages: the Error Correction stage generates adversarial examples and uses them to build prompts that automatically correct input errors; the Guidance stage generates an optimal guidance prompt based on the corrected input. Result: Experiments show that RoP significantly improves LLM robustness on arithmetic, commonsense, and logical reasoning tasks, with only minimal accuracy degradation relative to clean inputs. Conclusion: RoP is a practical and effective way to improve LLM robustness in real-world applications. Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can substantially degrade their performance. Despite advances in prompting techniques, developing a prompting strategy that explicitly mitigates the negative impact of such perturbations remains an open challenge. To bridge this gap, we propose Robustness of Prompting (RoP), a novel prompting strategy specifically designed to enhance the robustness of LLMs. RoP consists of two stages: Error Correction and Guidance. In the Error Correction stage, RoP applies diverse perturbation methods to generate adversarial examples, which are then used to construct prompts that automatically correct input errors. In the Guidance stage, RoP generates an optimal guidance prompt based on the corrected input, steering the model toward more robust and accurate inferences. Through comprehensive experiments spanning arithmetic, commonsense, and logical reasoning tasks, we demonstrate that RoP significantly improves LLMs' robustness against adversarial perturbations. Notably, it maintains model accuracy with only minimal degradation compared to clean input scenarios, thereby establishing RoP as a practical and effective approach for enhancing LLM robustness in real-world applications.

[161] RewardAnything: Generalizable Principle-Following Reward Models

Zhuohao Yu,Jiali Zeng,Weizheng Gu,Yidong Wang,Jindong Wang,Fandong Meng,Jie Zhou,Yue Zhang,Shikun Zhang,Wei Ye

Main category: cs.CL

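TL;DR: The paper proposes RewardAnything, a generalizable, principle-following reward model whose reward principles are specified dynamically in natural language, addressing the poor adaptability of traditional reward models trained on fixed preference datasets.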

Details

Motivation: Traditional reward models depend on fixed preference datasets and cannot adapt to diverse needs, while collecting task-specific data and retraining is costly and prone to bias. Method: Proposes RewardAnything, a reward model designed and trained to follow natural language principles, and develops the RABench benchmark to test generalization across principles. Result: RewardAnything achieves state-of-the-art performance on a traditional RM benchmark and adapts to novel principles on RABench without retraining. Conclusion: By following dynamically specified natural language principles, RewardAnything makes reward models more general and more practical. Abstract: Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse real-world needs, from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often producing biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. We achieve SotA performance with RewardAnything on a traditional RM benchmark simply by specifying a well-defined principle, and results on RABench show we excel in adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods, and we show through a case study how to automatically and efficiently align LLMs with only natural language principles.

[162] Trustworthy Medical Question Answering: An Evaluation-Centric Survey

Yinuo Wang,Robert E. Mercer,Frank Rudzicz,Sudipta Singha Roy,Pengjie Ren,Zhumin Chen,Xindi Wang

Main category: cs.CL

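TL;DR: The survey examines six key dimensions of trustworthiness in medical question-answering systems, analyzes existing evaluation methods and improvement techniques, and outlines open challenges and directions for future research.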

Details

Motivation: The trustworthiness of medical QA systems is critical to patient safety and clinical decision-making, yet existing systems face challenges with complex medical data and clinical scenarios. Method: Systematically surveys six trust dimensions (factuality, robustness, fairness, safety, explainability, and calibration) and compares evaluation benchmarks and improvement techniques. Result: Summarizes existing evaluation methods and improvement techniques such as retrieval augmentation, adversarial fine-tuning, and safety alignment, and identifies challenges in multi-dimensional evaluation and real-world deployment. Conclusion: Proposes future research directions to advance the safe, reliable, and transparent use of LLMs in medical QA. Abstract: Trustworthiness in healthcare question-answering (QA) systems is important for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA poses significant challenges due to the inherent complexity of healthcare data, the critical nature of clinical scenarios, and the multifaceted dimensions of trustworthy AI. In this survey, we systematically examine six key dimensions of trustworthiness in medical QA, i.e., Factuality, Robustness, Fairness, Safety, Explainability, and Calibration. We review how each dimension is evaluated in existing LLM-based medical QA systems. We compile and compare major benchmarks designed to assess these dimensions and analyze evaluation-guided techniques that drive model improvements, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment. Finally, we identify open challenges (such as scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies) and propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.

[163] ROSA: Addressing text understanding challenges in photographs via ROtated SAmpling

Hernán Maina,Guido Ivetta,Mateo Lione Stuto,Julian Martin Eisenschlos,Jorge Sánchez,Luciana Benotti

Main category: cs.CL

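TL;DR: The ROSA decoding strategy improves the performance of visual question answering systems on images containing incorrectly oriented text.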

Details

Motivation: Visually impaired people rely on VQA systems to interpret text in their surroundings, but existing models struggle with the text orientation problems in the photos they take. Method: Identifies common text orientation problems through interviews and proposes the ROSA decoding strategy. Result: ROSA improves on Greedy decoding by 11.7 percentage points in the best-performing model. Conclusion: ROSA effectively addresses the performance problem of VQA systems on images with incorrectly oriented text. Abstract: Visually impaired people could benefit from Visual Question Answering (VQA) systems to interpret text in their surroundings. However, current models often struggle with recognizing text in the photos taken by this population. Through in-depth interviews with visually impaired individuals, we identified common framing conventions that frequently result in misaligned text. Existing VQA benchmarks primarily feature well-oriented text captured by sighted users, under-representing these challenges. To address this gap, we introduce ROtated SAmpling (ROSA), a decoding strategy that enhances VQA performance in text-rich images with incorrectly oriented text. ROSA outperforms Greedy decoding by 11.7 absolute points in the best-performing model.

[164] Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

Pradeep Rangappa,Andres Carofilis,Jeena Prakash,Shashi Kumar,Sergio Burdisso,Srikanth Madikeri,Esau Villatoro-Tello,Bidisha Sharma,Petr Motlicek,Kadri Hacioglu,Shankar Venkatesan,Saurabh Vyas,Andreas Stolcke

Main category: cs.CL

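TL;DR: The paper proposes optimizing ASR fine-tuning by filtering pseudo-labeled data with a combination of strategies (WER prediction, NER, and CER analysis), substantially reducing the amount of data required.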

Details

Motivation: Small organizations face the challenge of fine-tuning ASR models with limited labeled data and compute, which motivates efficient data selection methods. Method: Proposes a robust data selection pipeline that combines pseudo-labels generated by Whisper and Zipformer and selects high-quality data through WER prediction, NER, and CER analysis. Result: Fine-tuning on 7,500 hours of pseudo-labeled data yields 12.3% WER, while the filtered 100-hour subset (1.4% of the data) performs similarly. Conclusion: The method substantially reduces the data needed for fine-tuning while preserving model performance, making it suitable for resource-constrained settings. Abstract: Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies (including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis) to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
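
The selection strategies above all rest on error-rate measurements between transcripts. As a minimal sketch (not the authors' pipeline), the following computes word error rate with a word-level edit distance and keeps only segments whose two pseudo-labels agree closely. The `label_a`/`label_b` field names, the agreement rule, and the `max_wer` threshold are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def select_segments(segments, max_wer=0.1):
    """Hypothetical filter: keep segments whose two pseudo-labels nearly agree."""
    return [s for s in segments
            if word_error_rate(s["label_a"], s["label_b"]) <= max_wer]
```

Cross-system agreement is used here only as a stand-in for the learned WER predictor mentioned in the abstract, whose details are not given.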

[165] Robust Preference Optimization via Dynamic Target Margins

Jie Sun,Junkang Wu,Jiancan Wu,Zhibo Zhu,Xingyu Lu,Jun Zhou,Lintao Ma,Xiang Wang

Main category: cs.CL

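TL;DR: The paper proposes γ-PO, a dynamic target margin preference optimization algorithm that adjusts pairwise reward margins to improve the alignment of large language models (LLMs), significantly outperforming baseline methods.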

Details

Motivation: The effectiveness of Direct Preference Optimization (DPO) depends heavily on data quality, with noise a particular problem. γ-PO aims to prioritize high-confidence pairs and suppress noise by dynamically adjusting reward margins. Method: γ-PO is a dynamic target margin preference optimization algorithm that uses instance-specific margin calibration to prioritize high-confidence pairs while reducing the noise contributed by ambiguous pairs. Result: On benchmarks such as AlpacaEval2 and Arena-Hard, γ-PO achieves an average improvement of 4.4% with minimal impact on training efficiency. Conclusion: γ-PO is an efficient, highly compatible plug-and-play method that significantly improves LLM alignment performance and is suitable for practical use. Abstract: The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on the data quality, which is frequently compromised by noise. In this work, we propose γ-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, γ-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, γ-PO is a plug-and-play method, compatible with variants of DPO that rely on the reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, γ-PO achieves an average 4.4% improvement over other baselines, setting new benchmarks for state-of-the-art performance. Additionally, γ-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at https://github.com/sunjie279/gammaPO.
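
To make the idea of a target margin concrete, here is a minimal sketch of a DPO-style loss for a single preference pair with a margin subtracted inside the sigmoid. This is the generic margin-augmented form, not γ-PO itself: the abstract does not say how the instance-specific margin is computed, so `margin` is simply an input here, and the function name and default `beta` are assumptions.

```python
import math


def margin_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, margin=0.0):
    """-log sigmoid(beta * (implicit reward gap) - margin) for one pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    margin: target margin for this pair (supplied by the caller in this sketch).
    """
    reward_gap = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    z = reward_gap - margin
    # Numerically stable -log(sigmoid(z)) = log(1 + exp(-z)).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

With `margin=0.0` this reduces to the standard DPO loss for one pair; a larger margin demands a larger reward gap before the loss becomes small.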

[166] AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Zhepei Wei,Wei-Lin Chen,Xinyu Zhu,Yu Meng

Main category: cs.CL

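TL;DR: AdaDecode accelerates large language model decoding without auxiliary models or changes to the original model parameters: it adaptively generates high-confidence tokens at intermediate layers and processes the remaining layer computations in parallel, achieving up to a 1.73x speedup.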

Details

Motivation: Existing methods such as speculative decoding and layer skipping suffer from high memory overhead or inconsistent outputs, so a more efficient decoding acceleration method that does not affect output quality is needed. Method: AdaDecode generates high-confidence tokens at intermediate layers, processes the remaining layer computations in parallel, and uses a verification step to ensure output consistency. Result: Experiments show that AdaDecode achieves up to a 1.73x improvement in decoding throughput across a variety of generation tasks, with outputs identical to standard autoregressive decoding. Conclusion: AdaDecode is an efficient and reliable decoding acceleration method suited to long-content generation with large language models. Abstract: Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential token generation process, where each token must be generated before the next can be processed. This sequential dependency restricts the ability to fully leverage modern hardware's parallel processing capabilities. Existing methods like speculative decoding and layer skipping offer potential speedups but have notable drawbacks: speculative decoding relies on an auxiliary "drafter" model, which can be challenging to acquire and increases memory overhead, while layer skipping may introduce discrepancies in the outputs due to the missing key-value cache at skipped layers. In this work, we propose AdaDecode, which accelerates LLM decoding without requiring auxiliary models or changes to the original model parameters, while ensuring output consistency. AdaDecode leverages the insight that many tokens can accurately be generated at intermediate layers, as further layers often do not significantly alter predictions once the model reaches a certain confidence. By adaptively generating tokens at intermediate layers when confidence is high, AdaDecode enables the next token's computation to begin immediately. The remaining layer computations for early-predicted tokens are deferred and executed in parallel with subsequent tokens when needed, maximizing hardware utilization and reducing decoding latency. A final verification step ensures that early predictions match the results of standard autoregressive decoding, preserving output parity. Experiments across diverse generation tasks show that AdaDecode consistently achieves superior decoding throughput with up to 1.73x speedup, while guaranteeing output parity with standard autoregressive decoding.
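
The confidence-gated early prediction and the final verification step can be illustrated with a toy sketch. It assumes that a next-token probability distribution is available at each candidate layer (how those are obtained is not covered here) and that a fixed `threshold` decides when to exit; both are assumptions for illustration. The deferred parallel computation of the remaining layers is not modeled.

```python
def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def early_exit_layer(layer_probs, threshold=0.9):
    """Return (layer_index, token) for the first layer whose top probability
    reaches the threshold, falling back to the last layer otherwise."""
    for idx, probs in enumerate(layer_probs):
        token = argmax(probs)
        if probs[token] >= threshold:
            return idx, token
    return len(layer_probs) - 1, argmax(layer_probs[-1])


def verify(early_token, final_probs):
    """Check that the early prediction matches the full-depth greedy token."""
    return early_token == argmax(final_probs)
```

For example, with per-layer distributions `[[0.5, 0.3, 0.2], [0.05, 0.92, 0.03], [0.02, 0.97, 0.01]]`, the sketch exits at layer index 1 with token 1, and verification against the last layer succeeds.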

[167] ScoreRAG: A Retrieval-Augmented Generation Framework with Consistency-Relevance Scoring and Structured Summarization for News Generation

Pei-Yun Lin,Yen-lung Tsai

Main category: cs.CL

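TL;DR: ScoreRAG is a multi-stage framework for news generation that combines retrieval-augmented generation, consistency-relevance evaluation, and structured summarization, aiming to significantly improve the accuracy, coherence, and professionalism of generated news.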

Details

Motivation: Current news generation methods suffer from hallucinations, factual inconsistencies, and a lack of domain expertise; ScoreRAG aims to address these challenges. Method: The system retrieves relevant news documents, scores their consistency relevance, reranks them, and generates graded summaries that guide a large language model in producing news that meets professional standards. Result: ScoreRAG aims to significantly improve the accuracy, coherence, informativeness, and professionalism of generated news while keeping the generation process stable. Conclusion: ScoreRAG offers an efficient and professional approach to automated news generation; the code and demo are open-sourced. Abstract: This research introduces ScoreRAG, an approach to enhance the quality of automated news generation. Despite advancements in Natural Language Processing and large language models, current news generation methods often struggle with hallucinations, factual inconsistencies, and lack of domain-specific expertise when producing news articles. ScoreRAG addresses these challenges through a multi-stage framework combining retrieval-augmented generation, consistency relevance evaluation, and structured summarization. The system first retrieves relevant news documents from a vector database, maps them to complete news items, and assigns consistency relevance scores based on large language model evaluations. These documents are then reranked according to relevance, with low-quality items filtered out. The framework proceeds to generate graded summaries based on relevance scores, which guide the large language model in producing complete news articles following professional journalistic standards. Through this methodical approach, ScoreRAG aims to significantly improve the accuracy, coherence, informativeness, and professionalism of generated news articles while maintaining stability and consistency throughout the generation process. The code and demo are available at: https://github.com/peiyun2260/ScoreRAG.

[168] MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition

Yinfeng Xia,Huiyan Li,Chenyang Le,Manhong Wang,Yutao Sun,Xingyang Ma,Yanmin Qian

Main category: cs.CL

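TL;DR: Proposes a Whisper-based streaming speech recognition framework that balances low latency and high quality through prefix-to-prefix training and the Continuous Integrate-and-Fire mechanism.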

Details

Motivation: Although large pre-trained speech models such as Whisper are effective at reducing training costs, integrating them into streaming systems remains challenging. Method: Designs a prefix-to-prefix training framework that combines the Continuous Integrate-and-Fire mechanism with Monotonic Finite Look-ahead Attention and adopts the wait-k decoding strategy. Result: Experiments show that the method achieves a controllable trade-off between latency and quality. Conclusion: The method is suitable for a variety of streaming applications. Abstract: Applying large pre-trained speech models like Whisper has shown promise in reducing training costs for various speech tasks. However, integrating these models into streaming systems remains a challenge. This paper presents a novel prefix-to-prefix training framework for streaming recognition by fine-tuning Whisper. We introduce the Continuous Integrate-and-Fire mechanism to establish a quasi-monotonic alignment between continuous speech sequences and discrete text tokens. Additionally, we design Monotonic Finite Look-ahead Attention, allowing each token to attend to infinite left-context and finite right-context from the speech sequences. We also employ the wait-k decoding strategy to simplify the decoding process while ensuring consistency between training and testing. Our theoretical analysis and experiments demonstrate that this approach achieves a controllable trade-off between latency and quality, making it suitable for various streaming applications.
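
The attention pattern described above (unbounded left context, bounded right context) can be written as a boolean mask. The sketch below assumes that each text token already has an aligned boundary frame, for instance from an alignment mechanism such as the one the abstract mentions; the `boundaries` input and the fixed `lookahead` width in frames are assumptions for illustration, not the paper's exact formulation.

```python
def lookahead_mask(boundaries, num_frames, lookahead):
    """Build mask[i][t], True when token i may attend to speech frame t.

    boundaries[i]: index of the last frame aligned to token i (assumed given).
    lookahead: number of future frames visible beyond that boundary.
    """
    mask = []
    for b in boundaries:
        # All frames to the left are visible; the right context is capped.
        limit = min(b + lookahead, num_frames - 1)
        mask.append([t <= limit for t in range(num_frames)])
    return mask
```

With `boundaries=[1, 3]`, six frames, and `lookahead=1`, the first token sees frames 0 to 2 and the second sees frames 0 to 4. Increasing `lookahead` trades latency for more right context.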

[169] Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision

Chaeyun Jang,Moonseok Choi,Yegon Kim,Hyungi Lee,Juho Lee

Main category: cs.CL

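TL;DR: The paper studies uncertainty calibration for large language models (LLMs) in chain-of-thought (CoT) reasoning and finds that supervised fine-tuning with scalar confidence labels alone elicits self-verification behavior, without explicit reasoning supervision or reinforcement learning rewards.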

Details

Motivation: LLMs must be safe to deploy when users rely on their confidence estimates, especially in chain-of-thought reasoning; prior work has focused on classifiers or short-form generation, and confidence calibration for CoT reasoning remains underexplored. Method: Trains the model through supervised fine-tuning with scalar confidence labels only, without explicit reasoning supervision or reinforcement learning rewards, and further proposes a simple "rethinking" method that improves performance at test time based on calibrated uncertainty. Result: On GSM8K, MATH-500, and ARC-Challenge, confidence-aware fine-tuning improves calibration and accuracy and makes the model's reasoning path more consistent with its confidence. Conclusion: Supervised fine-tuning with scalar confidence labels alone effectively calibrates LLM uncertainty in CoT reasoning, improving performance and interpretability. Abstract: Uncertainty calibration is essential for the safe deployment of large language models (LLMs), particularly when users rely on verbalized confidence estimates. While prior work has focused on classifiers or short-form generation, confidence calibration for chain-of-thought (CoT) reasoning remains largely unexplored. Surprisingly, we find that supervised fine-tuning with scalar confidence labels alone suffices to elicit self-verification behavior of language models, without any explicit reasoning supervision or reinforcement learning-based rewards. Despite being trained only to produce a verbalized confidence score without any self-verifying examples, the model learns to generate longer and self-checking responses for low-confidence queries while providing more concise answers for high-confidence ones. We further propose a simple rethinking method that boosts performance via test-time scaling based on calibrated uncertainty. Experiments on GSM8K and held-out reasoning tasks such as MATH-500 and ARC-Challenge show that our confidence-aware fine-tuning improves both calibration and accuracy, while also enhancing interpretability by aligning the model's reasoning path with its confidence.
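
Calibration claims like the one above are commonly quantified with expected calibration error (ECE). The abstract does not name the metric it uses, so the following is only a sketch of the standard equal-width-bin ECE, with the bin count as an assumption.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins.

    confidences: stated confidence per answer, each in [0, 1].
    correct: booleans saying whether each answer was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

A model that states 0.9 confidence on four answers and gets three right has accuracy 0.75 in that bin, giving an ECE of about 0.15.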

[170] Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models

Junling Wang,Anna Rutkiewicz,April Yi Wang,Mrinmaya Sachan

Main category: cs.CL

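TL;DR: Math2Visual is a framework for automatically generating pedagogical visuals for math word problems, built on a predefined visual language and a design space grounded in teacher interviews; it improves the quality of educational visual generation.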

Details

Motivation: Creating teaching visuals for math word problems is time-consuming and lacks automated support; Math2Visual aims to solve this problem. Method: Using a predefined visual language and a design space grounded in teacher interviews, the authors build a dataset of 1,903 annotated visuals and evaluate and fine-tune text-to-image models. Result: Experiments show that the fine-tuned models perform better at educational visual generation, establishing a new benchmark for automated generation. Conclusion: Math2Visual provides a new approach to educational visual generation and reveals key challenges in producing multimodal educational content. Abstract: Visuals are valuable tools for teaching math word problems (MWPs), helping young learners interpret textual descriptions into mathematical expressions before solving them. However, creating such visuals is labor-intensive and there is a lack of automated methods to support this process. In this paper, we present Math2Visual, an automatic framework for generating pedagogically meaningful visuals from MWP text descriptions. Math2Visual leverages a pre-defined visual language and a design space grounded in interviews with math teachers, to illustrate the core mathematical relationships in MWPs. Using Math2Visual, we construct an annotated dataset of 1,903 visuals and evaluate Text-to-Image (TTI) models for their ability to generate visuals that align with our design. We further fine-tune several TTI models with our dataset, demonstrating improvements in educational visual generation. Our work establishes a new benchmark for automated generation of pedagogically meaningful visuals and offers insights into key challenges in producing multimodal educational content, such as the misrepresentation of mathematical relationships and the omission of essential visual elements.

Hongcheng Guo,Zheyong Xie,Shaosheng Cao,Boyang Wang,Weiting Liu,Zheyu Ye,Zhoujun Li,Zuozhu Liu

Main category: cs.CL

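TL;DR: Pet-Bench is a benchmark designed to evaluate large language models (LLMs) on virtual pet companionship, emphasizing self-evolution and interactive behavior, with over 7,500 interaction instances.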

Details

Motivation: As interest grows in applying LLMs to emotionally rich interactions, virtual pet companionship remains an underexplored area, and existing approaches lack systematic evaluation. Method: Proposes the Pet-Bench benchmark, which covers both self-interaction and human-interaction dimensions and includes diverse tasks such as intelligent scheduling, memory-based dialogues, and psychological conversations. Result: An evaluation of 28 LLMs shows significant performance differences linked to model size and capability, indicating a need for specialized optimization. Conclusion: Pet-Bench provides a foundational resource for evaluating LLM abilities in pet companionship and advances emotionally immersive human-pet interaction. Abstract: As interest in using Large Language Models (LLMs) for interactive and emotionally rich experiences grows, virtual pet companionship emerges as a novel yet underexplored application. Existing approaches focus on basic pet role-playing interactions without systematically benchmarking LLMs for comprehensive companionship. In this paper, we introduce Pet-Bench, a dedicated benchmark that evaluates LLMs across both self-interaction and human-interaction dimensions. Unlike prior work, Pet-Bench emphasizes self-evolution and developmental behaviors alongside interactive engagement, offering a more realistic reflection of pet companionship. It features diverse tasks such as intelligent scheduling, memory-based dialogues, and psychological conversations, with over 7,500 interaction instances designed to simulate complex pet behaviors. Evaluation of 28 LLMs reveals significant performance variations linked to model size and inherent capabilities, underscoring the need for specialized optimization in this domain. Pet-Bench serves as a foundational resource for benchmarking pet-related LLM abilities and advancing emotionally immersive human-pet interactions.

[172] AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models

Yifeng Gu,Zicong Jiang,Jianxiu Jin,Kailing Guo,Ziyang Zhang,Xiangmin Xu

Main category: cs.CL

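TL;DR: The paper proposes AhaKV, which reduces bias in KV cache eviction by adaptively tuning the softmax scale, improving the use of global contextual information.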

Details

Motivation: The KV cache of LLMs consumes a large amount of memory during inference. Existing methods evict unnecessary tokens based on attention scores, but these scores are biased, so tokens at initial positions are retained disproportionately. Method: Proposes AhaKV, which adaptively tunes the softmax scale to counter the bias in attention scores and uses value vector information to refine the eviction strategy. Result: Experiments show that AhaKV effectively reduces the bias, retains globally important tokens, and achieves state-of-the-art results on several benchmark tasks. Conclusion: AhaKV resolves the KV cache bias problem through an adaptive mechanism, improving the inference efficiency and performance of LLMs. Abstract: Large Language Models (LLMs) have significantly advanced the field of Artificial Intelligence. However, their deployment is resource-intensive, not only due to the large number of model parameters but also because the Key-Value (KV) cache consumes a lot of memory during inference. While several works propose reducing the KV cache by evicting unnecessary tokens, these approaches rely on the accumulated attention score as the eviction score to quantify the importance of a token. We identify that the accumulated attention score is biased: in expectation, it decreases with the position of the token. As a result, the retained tokens concentrate on the initial positions, limiting the model's access to global contextual information. To address this issue, we propose Adaptive holistic attention KV (AhaKV), which addresses the bias of the accumulated attention score by adaptively tuning the scale of the softmax according to the expected information entropy of the attention scores. To make use of the holistic attention information in the self-attention mechanism, AhaKV utilizes the information in the value vectors, which is overlooked in previous works, to refine the adaptive score. We show theoretically that our method is well suited for bias reduction. We deploy AhaKV on different models with a fixed cache budget. Experiments show that AhaKV successfully mitigates bias, retains crucial tokens across the global context, and achieves state-of-the-art results against other related work on several benchmark tasks.
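
The positional bias of the accumulated attention score is easy to see in a toy setting. The sketch below illustrates the baseline problem only, not AhaKV's correction: it assumes perfectly uniform causal attention, so no token is intrinsically more important than another, and shows that score-based eviction still keeps the earliest positions. All function names are hypothetical.

```python
def uniform_causal_attention(n):
    """Query i spreads its attention evenly over keys 0..i."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]


def accumulated_scores(attn):
    """Sum the attention each key receives from all queries that can see it."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(j, n)) for j in range(n)]


def keep_top_k(scores, k):
    """Positions of the k highest-scoring tokens (the ones not evicted)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:k])
```

Under uniform attention, key j accumulates 1/(j+1) + 1/(j+2) + ... + 1/n, which strictly decreases with j simply because earlier keys are visible to more queries. `keep_top_k(accumulated_scores(uniform_causal_attention(8)), 3)` therefore retains positions 0, 1, and 2.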

[173] ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations

Quang Hieu Pham,Thuy Duong Nguyen,Tung Pham,Anh Tuan Luu,Dat Quoc Nguyen

Main category: cs.CL

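TL;DR: The paper proposes ClozeMath, a new approach that trains large language models (LLMs) for mathematical reasoning through an infilling task and outperforms conventional methods.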

Details

Motivation: Conventional next-word prediction may not fully capture how humans learn to think, particularly in mathematical reasoning. Method: Proposes ClozeMath, which trains LLMs to predict masked equations, analogous to cloze exercises for human learners. Result: On GSM8K, MATH, and GSM-Symbolic, ClozeMath outperforms the Masked Thought baseline and is more robust. Conclusion: ClozeMath offers a more effective way to train LLMs for mathematical reasoning, and experiments confirm its advantage. Abstract: The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine-tune LLMs for mathematical reasoning. Our ClozeMath involves a text-infilling task that predicts masked equations from a given solution, analogous to cloze exercises used in human learning. Experiments on GSM8K, MATH, and GSM-Symbolic show that ClozeMath surpasses the strong baseline Masked Thought in performance and robustness, with two test-time scaling decoding algorithms, Beam Search and Chain-of-Thought decoding. Additionally, we conduct an ablation study to analyze the effects of various architectural and implementation choices on our approach.
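
The text-infilling task above needs training examples in which equations are masked out of a worked solution. A minimal sketch of that data preparation step follows; the regular expression (simple arithmetic equations only), the `<mask>` token, and the function name are assumptions for illustration and do not reflect the authors' actual preprocessing.

```python
import re

# Matches simple arithmetic equations such as "3 + 4 = 7" or "2.5 * 4 = 10".
EQUATION = re.compile(
    r"\d+(?:\.\d+)?(?:\s*[-+*/]\s*\d+(?:\.\d+)?)+\s*=\s*\d+(?:\.\d+)?"
)


def mask_equations(solution, mask_token="<mask>"):
    """Return the solution with equations masked, plus the masked-out targets."""
    targets = EQUATION.findall(solution)
    masked = EQUATION.sub(mask_token, solution)
    return masked, targets
```

Applied to "Tom has 3 + 4 = 7 apples. He eats 2, so 7 - 2 = 5 remain.", this yields the masked text "Tom has <mask> apples. He eats 2, so <mask> remain." and the infilling targets "3 + 4 = 7" and "7 - 2 = 5".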

[174] Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models

Seungcheol Park,Jeongin Bae,Beomseok Kwon,Minjun Kim,Byeongwook Kim,Se Jung Kwon,U Kang,Dongsoo Lee

Main category: cs.CL

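TL;DR: UniQuanF is a unified quantization method that combines the strengths of BCQ and UQ, using a flexible mapping technique to achieve accurate quantization, and it outperforms existing methods in experiments.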

Details

Motivation: Preserving accuracy when quantizing large language models (LLMs) is a key problem. Existing schemes such as BCQ and UQ each have strengths, but no method combines them. Method: Proposes UniQuanF, which unifies the non-uniform quantization levels of BCQ with the flexible mapping technique of UQ, and optimizes its parameters using unified initialization together with local and periodic mapping. Result: Experiments show that UniQuanF achieves up to 4.60% higher accuracy than existing methods on the GSM8K benchmark. Conclusion: UniQuanF combines the strengths of BCQ and UQ, achieving accurate quantization with no extra deployment cost. Abstract: How can we quantize large language models while preserving accuracy? Quantization is essential for deploying large language models (LLMs) efficiently. Binary-coding quantization (BCQ) and uniform quantization (UQ) are promising quantization schemes that have strong expressiveness and optimizability, respectively. However, neither scheme leverages both advantages. In this paper, we propose UniQuanF (Unified Quantization with Flexible Mapping), an accurate quantization method for LLMs. UniQuanF harnesses both strong expressiveness and optimizability by unifying the flexible mapping technique in UQ and the non-uniform quantization levels of BCQ. We propose unified initialization, and local and periodic mapping techniques to optimize the parameters in UniQuanF precisely. After optimization, our unification theorem removes computational and memory overhead, allowing us to utilize the superior accuracy of UniQuanF without extra deployment costs induced by the unification. Experimental results demonstrate that UniQuanF outperforms existing UQ and BCQ methods, achieving up to 4.60% higher accuracy on the GSM8K benchmark.
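
The relationship between the two schemes can be checked numerically. BCQ represents a weight as an offset plus a sum of scale factors times binary codes in {-1, +1}, while UQ uses an evenly spaced grid. The sketch below shows the standard observation that power-of-two BCQ scales reproduce a uniform grid exactly, which is why BCQ is the more expressive of the two. It is not the paper's unification theorem, and all function names are hypothetical.

```python
from itertools import product


def bcq_levels(alphas, offset=0.0):
    """All values offset + sum(alpha_i * b_i) with every b_i in {-1, +1}."""
    return sorted(offset + sum(a * b for a, b in zip(alphas, bits))
                  for bits in product((-1, 1), repeat=len(alphas)))


def uniform_levels(scale, num_bits, zero_point=0):
    """Dequantized values scale * (q - zero_point) for q in 0 .. 2^bits - 1."""
    return [scale * (q - zero_point) for q in range(2 ** num_bits)]


def uniform_as_bcq(scale, num_bits, zero_point=0):
    """BCQ scales and offset that reproduce the uniform grid above."""
    alphas = [scale * 2 ** (i - 1) for i in range(num_bits)]  # s/2, s, 2s, ...
    offset = scale * ((2 ** num_bits - 1) / 2 - zero_point)
    return alphas, offset
```

For a 3-bit grid with `scale=0.5` and `zero_point=4`, `uniform_as_bcq` gives scales `[0.25, 0.5, 1.0]` and offset `-0.25`, and `bcq_levels` then returns the same eight values as `uniform_levels`. Choosing scales freely instead yields non-uniform levels that UQ cannot express.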

[175] Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

Isik Baran Sandan,Tu Anh Dinh,Jan Niehues

Main category: cs.CL

TL;DR: Knockout Assessment uses a knockout tournament system to improve the accuracy of LLM-based evaluation, bringing it closer to human scoring.

Details

Motivation: Existing LLM-based evaluation methods lack a global ranking perspective, relying only on individual assessments or a single round of comparisons. Method: Proposes Knockout Assessment, which uses a knockout tournament system with iterative pairwise comparisons. Result: Experiments show that the method increases Pearson correlation by 0.07 on average across two tasks. Conclusion: Knockout Assessment effectively improves the accuracy and consistency of LLM-based evaluation. Abstract: Large Language Models (LLMs) have been shown to be effective evaluators across various domains such as machine translation or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that knockout assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.
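
The tournament structure above can be sketched as a single-elimination bracket driven by a pairwise preference function. In the paper the comparison would come from a judge LLM; here `prefer` is a hypothetical callback supplied by the caller, and how match outcomes are turned into final scores is not specified in the abstract, so this sketch only returns the overall winner.

```python
def knockout(candidates, prefer):
    """Run a single-elimination tournament and return the winner.

    prefer(a, b) must return whichever of a and b wins the pairwise comparison.
    With an odd number of entrants, the last one advances without a match.
    """
    current = list(candidates)
    if not current:
        raise ValueError("need at least one candidate")
    while len(current) > 1:
        next_round = []
        for i in range(0, len(current) - 1, 2):
            next_round.append(prefer(current[i], current[i + 1]))
        if len(current) % 2 == 1:
            next_round.append(current[-1])  # bye
        current = next_round
    return current[0]
```

A tournament over n candidates needs n - 1 comparisons, compared with n(n - 1)/2 for exhaustive pairwise judging.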

[176] Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts

Sidharth Pulipaka,Sparsh Jain,Ashwin Sankar,Raj Dabre

Main category: cs.CL

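TL;DR: Cadence is a generalist punctuation restoration model adapted from a pretrained large language model; it handles both written text and spoken transcripts, supports 22 Indian languages and English, and outperforms the previous state of the art.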

Details

Motivation: Current models restore punctuation poorly in spoken transcripts, which degrades downstream tasks, so a more general solution is needed. Method: Develops Cadence from a pretrained large language model to support multilingual punctuation restoration and conducts a comprehensive analysis. Result: Cadence outperforms the previous state of the art and supports more languages, but still faces challenges under domain shift and with rare punctuation marks. Conclusion: Pretrained language models are effective for multilingual punctuation restoration, and Cadence has practical value for low-resource NLP tasks. Abstract: Punctuation plays a vital role in structuring meaning, yet current models often struggle to restore it accurately in transcripts of spontaneous speech, especially in the presence of disfluencies such as false starts and backtracking. These limitations hinder the performance of downstream tasks such as translation, text-to-speech, and summarization, where sentence boundaries are critical for preserving quality. In this work, we introduce Cadence, a generalist punctuation restoration model adapted from a pretrained large language model. Cadence is designed to handle both clean written text and highly spontaneous spoken transcripts. It surpasses the previous state of the art in performance while expanding support from 14 to all 22 Indian languages and English. We conduct a comprehensive analysis of model behavior across punctuation types and language families, identifying persistent challenges under domain shift and with rare punctuation marks. Our findings demonstrate the efficacy of utilizing pretrained language models for multilingual punctuation restoration and highlight Cadence's practical value for low-resource NLP pipelines at scale.

[177] Automatic Correction of Writing Anomalies in Hausa Texts

Ahmad Mustapha Wali,Sergiu Nisioi

Main category: cs.CL

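TL;DR: The paper presents a Transformer-based approach to automatically correcting writing anomalies in Hausa text: it builds a large-scale parallel dataset with synthetic noise and fine-tunes multilingual models, significantly improving correction quality.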

Details

Motivation: Hausa texts often contain writing anomalies (such as character substitutions and spacing errors) that hinder NLP applications, so an automatic correction method is needed. Method: Generates a dataset of 450,000 noisy-clean sentence pairs with synthetic noise and fine-tunes models such as M2M100 and AfriTEVA using SentencePiece tokenization. Result: Experiments show significant increases in F1, BLEU, and METEOR scores and reductions in CER and WER. Conclusion: The work contributes a methodology, a public dataset, and effective models that advance Hausa NLP and offer lessons for other low-resource languages. Abstract: Hausa texts are often characterized by writing anomalies such as incorrect character substitutions and spacing errors, which sometimes hinder natural language processing (NLP) applications. This paper presents an approach to automatically correct the anomalies by fine-tuning transformer-based models. Using a corpus gathered from several public sources, we created a large-scale parallel dataset of over 450,000 noisy-clean Hausa sentence pairs by introducing synthetically generated noise, fine-tuned to mimic realistic writing errors. Moreover, we adapted several multilingual and African language-focused models, including M2M100, AfriTEVA, mBART, and Opus-MT variants, for this correction task using SentencePiece tokenization. Our experimental results demonstrate significant increases in F1, BLEU and METEOR scores, as well as reductions in Character Error Rate (CER) and Word Error Rate (WER). This research provides a robust methodology, a publicly available dataset, and effective models to improve Hausa text quality, thereby advancing NLP capabilities for the language and offering transferable insights for other low-resource languages.
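
Synthetic noise of the kind described above can be produced by corrupting clean sentences to form noisy-clean pairs. The sketch below covers the two anomaly types named in the abstract, character substitutions and spacing errors. The specific substitution table (hooked letters replaced by their plain counterparts), the probabilities, and the function name are assumptions for illustration, not the authors' noise model.

```python
import random

# Assumed substitution table: hooked Hausa letters typed as plain ASCII letters.
HOOKED_TO_PLAIN = {"ɓ": "b", "ɗ": "d", "ƙ": "k", "Ɓ": "B", "Ɗ": "D", "Ƙ": "K"}


def add_noise(sentence, p_sub=0.5, p_space=0.1, seed=0):
    """Return a corrupted copy of the sentence.

    p_sub: probability of replacing each hooked letter with its plain form.
    p_space: probability of deleting each space, which merges adjacent words.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    out = []
    for ch in sentence:
        if ch in HOOKED_TO_PLAIN and rng.random() < p_sub:
            out.append(HOOKED_TO_PLAIN[ch])
        elif ch == " " and rng.random() < p_space:
            continue  # drop the space
        else:
            out.append(ch)
    return "".join(out)
```

Each clean sentence paired with its `add_noise` output gives one training example for a sequence-to-sequence correction model; for instance, `add_noise("ɗan ƙasa", p_sub=1.0, p_space=0.0)` returns "dan kasa".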

[178] CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents

Fabian Karl,Ansgar Scherp

Main category: cs.CL

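TL;DR: CRAWLDoc is a new contextual ranking method that retrieves and ranks relevant documents linked from a web page, addressing the challenge of varying layouts and formats.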

Details

Motivation: The diversity of web layouts and data formats makes metadata extraction difficult. Method: Starting from a URL, retrieves the web page and its linked resources (PDFs, ORCID profiles, and so on), embeds them into a unified representation, and ranks them. Result: The method's robustness and layout independence are validated on a dataset of 600 computer science publications. Conclusion: CRAWLDoc lays the foundation for metadata extraction across diverse layouts and formats; the code and dataset are open-sourced. Abstract: Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting with a publication's URL, such as a digital object identifier, CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, along with anchor texts and the URLs, into a unified representation. For evaluating CRAWLDoc, we have created a new, manually labeled dataset of 600 publications from six top publishers in computer science. Our method CRAWLDoc demonstrates a robust and layout-independent ranking of relevant documents across publishers and data formats. It lays the foundation for improved metadata extraction from web documents with various layouts and formats. Our source code and dataset can be accessed at https://github.com/FKarl/CRAWLDoc.

[179] Multi-objective Aligned Bidword Generation Model for E-commerce Search Advertising

Zhenhui Liu,Chunyuan Yuan,Ming Pang,Zheng Fang,Li Yuan,Xue Jiang,Changping Peng,Zhangang Lin,Zheng Luo,Jingping Shao

Main category: cs.CL

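TL;DR: Proposes MoBGM, a multi-objective aligned bidword generation model that uses a discriminator, a generator, and a preference alignment module to jointly optimize the relevance and authenticity of query rewrites and the revenue from recalled ads.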

Details

Motivation: Existing query rewriting methods cannot simultaneously optimize query relevance, authenticity, and advertising revenue; addressing this would improve user experience and efficiency in e-commerce search advertising. Method: Designs a model comprising a discriminator, a generator, and a preference alignment module, and trains the generator with the discriminator's feedback signal to maximize the combined effect of the three objectives. Result: Offline and online experiments show that the algorithm significantly outperforms the state of the art and has created substantial commercial value for the platform since deployment. Conclusion: MoBGM offers clear advantages in query rewriting quality and advertising revenue, and its feasibility and robustness are confirmed in deployment. Abstract: Retrieval systems primarily address the challenge of matching user queries with the most relevant advertisements, playing a crucial role in e-commerce search advertising. The diversity of user needs and expressions often produces massive numbers of long-tail queries that cannot be matched with merchant bidwords or product titles, which results in some advertisements not being recalled, ultimately harming user experience and search efficiency. Existing query rewriting research focuses on various methods such as query log mining, query-bidword vector matching, or generation-based rewriting. However, these methods often fail to simultaneously optimize the relevance and authenticity of the user's original query and its rewrite and maximize the revenue potential of recalled ads. In this paper, we propose a Multi-objective aligned Bidword Generation Model (MoBGM), which is composed of a discriminator, a generator, and a preference alignment module, to address these challenges. To simultaneously improve the relevance and authenticity of the query and its rewrite and maximize the platform revenue, we design a discriminator to optimize these key objectives. Using the feedback signal of the discriminator, we train a multi-objective aligned bidword generator that aims to maximize the combined effect of the three objectives. Extensive offline and online experiments show that our proposed algorithm significantly outperforms the state of the art. After deployment, the algorithm has created huge commercial value for the platform, further verifying its feasibility and robustness.

[180] Brain-tuned Speech Models Better Reflect Speech Processing Stages in the Brain

Omer Moussa,Mariya Toneva

Main category: cs.CL

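TL;DR: Brain-tuned self-supervised speech models show more hierarchical speech processing than pretrained models and more closely resemble how the human brain processes speech.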

Details

Motivation: To investigate whether brain-tuned models better reflect the intermediate stages of speech processing in the human brain. Method: Uses brain-tuning (fine-tuning models with human brain recordings) to improve the semantic understanding of speech models and performs a layer-wise analysis. Result: Late layers of brain-tuned models align substantially better with semantic language regions, while early layers remain dedicated to low-level acoustic features. Conclusion: Brain-tuned models not only perform better but also exhibit hierarchical processing from acoustic to semantic representations, making them better suited to modeling human speech processing. Abstract: Pretrained self-supervised speech models excel in speech tasks but do not reflect the hierarchy of human speech processing, as they encode rich semantics in middle layers and poor semantics in late layers. Recent work showed that brain-tuning (fine-tuning models using human brain recordings) improves speech models' semantic understanding. Here, we examine how well brain-tuned models further reflect the brain's intermediate stages of speech processing. We find that late layers of brain-tuned models substantially improve over pretrained models in their alignment with semantic language regions. Further layer-wise probing reveals that early layers remain dedicated to low-level acoustic features, while late layers become the best at complex high-level tasks. These findings show that brain-tuned models not only perform better but also exhibit a well-defined hierarchical processing going from acoustic to semantic representations, making them better model organisms for human speech processing.

[181] PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading

Qiuhan Han,Qian Wang,Atsushi Yoshikawa,Masayuki Yamamura

Main category: cs.CL

TL;DR: The paper introduces the PulseReddit dataset, which aligns Reddit discussions with high-frequency cryptocurrency market data, and uses LLM-based multi-agent systems to study how social sentiment affects trading; the results surpass traditional baselines.

Details

Motivation: To explore the potential value of social media such as Reddit for high-frequency cryptocurrency trading and fill a gap in existing research. Method: LLM-based multi-agent systems (MAS) are used to analyze the PulseReddit dataset and assess the impact of social sentiment on trading performance. Result: MAS augmented with PulseReddit data perform better, particularly in bull markets, and adapt well across market regimes; the study also exposes the performance-efficiency trade-offs of different LLMs. Conclusion: PulseReddit opens a new direction for HFT research and demonstrates the practical value of social media data in high-frequency trading. Abstract: High-Frequency Trading (HFT) is pivotal in cryptocurrency markets, demanding rapid decision-making. Social media platforms like Reddit offer valuable, yet underexplored, information for such high-frequency, short-term trading. This paper introduces PulseReddit, a novel dataset that is the first to align large-scale Reddit discussion data with high-frequency cryptocurrency market statistics for short-term trading analysis. We conduct an extensive empirical study using Large Language Model (LLM)-based Multi-Agent Systems (MAS) to investigate the impact of social sentiment from PulseReddit on trading performance. Our experiments conclude that MAS augmented with PulseReddit data achieve superior trading outcomes compared to traditional baselines, particularly in bull markets, and demonstrate robust adaptability across different market regimes. Furthermore, our research provides conclusive insights into the performance-efficiency trade-offs of different LLMs, detailing significant considerations for practical model selection in HFT applications. PulseReddit and our findings establish a foundation for advanced MAS research in HFT, demonstrating the tangible benefits of integrating social media.

[182] EuroGEST: Investigating gender stereotypes in multilingual language models

Jacqueline Rowe,Mateusz Klimaszewski,Liane Guillou,Shannon Vallor,Alexandra Birch

Main category: cs.CL

TL;DR: EuroGEST is a multilingual dataset for evaluating gender stereotypes in large language models (LLMs) across English and 29 European languages; it finds that women are most often labeled beautiful, empathetic, and neat, and men leaders, strong, and professional.

Details

Motivation: Existing gender bias benchmarks are mostly English-centric and multilingual studies are lacking, so coverage must be extended to more languages to assess LLM fairness comprehensively. Method: Starting from an expert-informed benchmark covering 16 gender stereotypes, the dataset is expanded using translation tools, quality estimation metrics, and morphological heuristics, and is validated by human evaluation. Result: An evaluation of 24 multilingual models finds that the strongest stereotypes across all models and languages are that women are beautiful, empathetic, and neat, and that men are leaders, strong, and professional. Larger models encode stereotypes more strongly, and instruction finetuning does not consistently reduce bias. Conclusion: The study highlights the need for more multilingual fairness research and offers scalable methods and resources for auditing gender bias across languages. Abstract: Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are beautiful, empathetic and neat and men are leaders, strong, tough and professional. We also show that larger models encode gendered stereotypes more strongly and that instruction finetuning does not consistently reduce gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.

[183] RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing

Ruihan Jin,Pengpeng Shao,Zhengqi Wen,Jinyang Wu,Mingkuan Feng,Shuai Zhang,Jianhua Tao

Main category: cs.CL

TL;DR: RadialRouter is a new LLM routing framework that uses RadialFormer, a lightweight Transformer-based structure, to model the relationship between queries and LLMs, significantly improving routing performance.

Details

Motivation: Current LLM routing methods are limited in effectiveness because they do not sufficiently explore the intrinsic connection between user queries and LLM characteristics; RadialRouter aims to address this. Method: The RadialFormer structure is combined with an objective function that pairs KL divergence with a query-query contrastive loss to optimize LLM selection. Result: On RouterBench, RadialRouter outperforms existing methods by 9.2% and 5.8% in the Balance and Cost First scenarios, respectively. Conclusion: Beyond its strong performance, RadialRouter adapts to different performance-cost trade-offs and to a dynamic LLM pool, showing potential for practical use. Abstract: The rapid advancements in large language models (LLMs) have led to the emergence of routing techniques, which aim to efficiently select the optimal LLM from diverse candidates to tackle specific tasks, optimizing performance while reducing costs. Current LLM routing methods are limited in effectiveness due to insufficient exploration of the intrinsic connection between user queries and the characteristics of LLMs. To address this issue, in this paper, we present RadialRouter, a novel framework for LLM routing which employs a lightweight Transformer-based backbone with a radial structure named RadialFormer to articulate the query-LLMs relationship. The optimal LLM selection is performed based on the final states of RadialFormer. The pipeline is further refined by an objective function that combines Kullback-Leibler divergence with the query-query contrastive loss to enhance robustness. Experimental results on RouterBench show that RadialRouter significantly outperforms existing routing methods by 9.2% and 5.8% in the Balance and Cost First scenarios, respectively. Additionally, its adaptability toward different performance-cost trade-offs and the dynamic LLM pool demonstrates practical application potential.

[184] Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages

Utkarsh Pathak,Chandra Sai Krishna Gunda,Anusha Prakash,Keshav Agarwal,Hema A. Murthy

Main category: cs.CL

TL;DR: The paper proposes a zero-shot synthesis approach for building TTS systems for India's many languages that lack digital resources; by sharing a phone representation and adjusting text parsing rules, it adapts rapidly to multiple languages.

Details

Motivation: India has 1369 languages, most of which lack digital resources; conventional TTS systems need high-quality data and accurate transcriptions, which makes covering all of them difficult. Method: A shared phone representation is augmented and the text parsing rules are modified to match the phonotactics of the target language, reducing synthesiser overhead and enabling rapid adaptation. Result: Intelligible and natural speech was generated for Sanskrit, Maharashtrian and Canara Konkani, Maithili, and Kurukh. Conclusion: The approach effectively extends speech technology to under-resourced languages and has broad application potential. Abstract: Text-to-speech (TTS) systems typically require high-quality studio data and accurate transcriptions for training. India has 1369 languages, with 22 official using 13 scripts. Training a TTS system for all these languages, most of which have no digital resources, seems a Herculean task. Our work focuses on zero-shot synthesis, particularly for languages whose scripts and phonotactics come from different families. The novelty of our work is in the augmentation of a shared phone representation and modifying the text parsing rules to match the phonotactics of the target language, thus reducing the synthesiser overhead and enabling rapid adaptation. Intelligible and natural speech was generated for Sanskrit, Maharashtrian and Canara Konkani, Maithili and Kurukh by leveraging linguistic connections across languages with suitable synthesisers. Evaluations confirm the effectiveness of this approach, highlighting its potential to expand speech technology access for under-represented languages.

[185] Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation

Junyi Chen,Shihao Bai,Zaijun Wang,Siyu Wu,Chuheng Du,Hailong Yang,Ruihao Gong,Shengzhong Liu,Fan Wu,Guihai Chen

Main category: cs.CL

TL;DR: Pre$^3$ is a method for speeding up constrained decoding in large language models (LLMs); it precomputes prefix-conditioned edges and uses a deterministic pushdown automaton (DPDA) to cut runtime overhead, improving generation speed and throughput.

Details

Motivation: Existing methods incur high runtime execution overhead when handling LR(1) grammars and are especially inefficient under large inference batches. Method: Pre$^3$ precomputes prefix-conditioned edges to enable parallel transition processing, and transforms LR(1) transition graphs into a DPDA, avoiding runtime path exploration. Result: In experiments, Pre$^3$ reduces time per output token (TPOT) by up to 40% and increases throughput by up to 36%. Conclusion: Pre$^3$ is an efficient method that integrates seamlessly into standard LLM inference frameworks and significantly improves performance. Abstract: Extensive LLM applications demand efficient structured generations, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), leading to runtime execution overhead for context-dependent token processing, especially inefficient under large inference batches. To address these issues, we propose Pre$^3$ that exploits deterministic pushdown automata (DPDA) to optimize the constrained LLM decoding efficiency. First, by precomputing prefix-conditioned edges during the preprocessing, Pre$^3$ enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre$^3$ introduces a novel approach that transforms LR(1) transition graphs into DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, reducing time per output token (TPOT) by up to 40% and increasing throughput by up to 36% in our experiments. Our code is available at https://github.com/ModelTC/lightllm.

[186] Magic Mushroom: A Customizable Benchmark for Fine-grained Analysis of Retrieval Noise Erosion in RAG Systems

Yuxin Zhang,Yan Wang,Yongrui Chen,Shenyu Zhang,Xinbang Dai,Sheng Bi,Guilin Qi

Main category: cs.CL

TL;DR: The paper proposes the Magic Mushroom benchmark, which simulates realistic retrieval noise to evaluate the robustness of RAG systems.

Details

Motivation: Existing benchmarks cannot emulate real-world retrieval noise, which undermines reliable assessment of RAG systems. Method: Four categories of retrieval noise are defined, and the Magic Mushroom benchmark is built from single-hop and multi-hop question-answer pairs. Result: Both generators and denoising strategies are highly sensitive to noise distributions and leave substantial room for improvement. Conclusion: Magic Mushroom is an effective tool for evaluating and improving the noise robustness of RAG systems. Abstract: Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external retrieved information, mitigating issues such as hallucination and outdated knowledge. However, RAG systems are highly sensitive to retrieval noise prevalent in real-world scenarios. Existing benchmarks fail to emulate the complex and heterogeneous noise distributions encountered in real-world retrieval environments, undermining reliable robustness assessment. In this paper, we define four categories of retrieval noise based on linguistic properties and noise characteristics, aiming to reflect the heterogeneity of noise in real-world scenarios. Building on this, we introduce Magic Mushroom, a benchmark for replicating "magic mushroom" noise: contexts that appear relevant on the surface but covertly mislead RAG systems. Magic Mushroom comprises 7,468 single-hop and 3,925 multi-hop question-answer pairs. More importantly, Magic Mushroom enables researchers to flexibly configure combinations of retrieval noise according to specific research objectives or application scenarios, allowing for highly controlled evaluation setups. We evaluate LLM generators of varying parameter scales and classic RAG denoising strategies under diverse noise distributions to investigate their performance dynamics during progressive noise encroachment. Our analysis reveals that both generators and denoising strategies have significant room for improvement and exhibit extreme sensitivity to noise distributions. Magic Mushroom emerges as a promising tool for evaluating and advancing noise-robust RAG systems, accelerating their widespread deployment in real-world applications. The Magic Mushroom benchmark is available at https://drive.google.com/file/d/1aP5kyPuk4L-L_uoI6T9UhxuTyt8oMqjT/view?usp=sharing.

[187] The Harmonic Structure of Information Contours

Eleftheria Tsipidi,Samuel Kiegeland,Franz Nowak,Tianyang Xu,Ethan Wilcox,Alex Warstadt,Ryan Cotterell,Mario Giulianelli

Main category: cs.CL

TL;DR: The paper investigates periodic fluctuations in linguistic information density, proposes a new analysis method, and finds periodic patterns in information rate that are related to discourse structure.

Details

Motivation: To explore a new perspective on fluctuations in information density: that a pressure towards periodicity may shape the distribution of information rate, rather than the fluctuations being explained only by traditional factors such as syntactic constraints or stylistic choices. Method: Harmonic regression, together with a new extension called time scaling, is used to analyze periodicity in information rate across six languages (English, Spanish, German, Dutch, Basque, and Brazilian Portuguese). Result: Information rate shows consistent periodic patterns, and the dominant frequencies relate to discourse structure, suggesting that the fluctuations reflect meaningful linguistic organization. Conclusion: The study reveals a link between information rate and discourse structure and provides a general framework for analyzing structural pressures in language. Abstract: The uniform information density (UID) hypothesis proposes that speakers aim to distribute information evenly throughout a text, balancing production effort and listener comprehension difficulty. However, language typically does not maintain a strictly uniform information rate; instead, it fluctuates around a global average. These fluctuations are often explained by factors such as syntactic constraints, stylistic choices, or audience design. In this work, we explore an alternative perspective: that these fluctuations may be influenced by an implicit linguistic pressure towards periodicity, where the information rate oscillates at regular intervals, potentially across multiple frequencies simultaneously. We apply harmonic regression and introduce a novel extension called time scaling to detect and test for such periodicity in information contours. Analyzing texts in English, Spanish, German, Dutch, Basque, and Brazilian Portuguese, we find consistent evidence of periodic patterns in information rate. Many dominant frequencies align with discourse structure, suggesting these oscillations reflect meaningful linguistic organization. Beyond highlighting the connection between information rate and discourse structure, our approach offers a general framework for uncovering structural pressures at various levels of linguistic granularity.
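
The harmonic regression mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' code: it fits ordinary least squares on sine and cosine regressors at a single assumed period, and it leaves out the paper's time-scaling extension, which the abstract does not specify. The function names (`design_row`, `solve`, `fit_harmonic`) and the use of token position as the time axis are assumptions made for illustration.

```python
import math


def design_row(t, period, n_harmonics):
    """Regressors for position t: intercept plus sin/cos pairs per harmonic."""
    row = [1.0]
    for k in range(1, n_harmonics + 1):
        w = 2.0 * math.pi * k * t / period
        row.extend([math.sin(w), math.cos(w)])
    return row


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_harmonic(y, period, n_harmonics=1):
    """Least-squares harmonic fit; returns (coefficients, R^2)."""
    X = [design_row(t, period, n_harmonics) for t in range(len(y))]
    p = len(X[0])
    # Normal equations: (X^T X) beta = X^T y
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(p)]
    coef = solve(XtX, Xty)
    fitted = [sum(c * x for c, x in zip(coef, r)) for r in X]
    mean = sum(y) / len(y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    ss_tot = sum((v - mean) ** 2 for v in y)
    return coef, 1.0 - ss_res / ss_tot


# Synthetic "surprisal" contour oscillating every 8 tokens around a mean of 5.
contour = [5.0 + 2.0 * math.sin(2.0 * math.pi * t / 8.0) for t in range(64)]
coef, r2 = fit_harmonic(contour, period=8.0)
```

In this toy setup a high R^2 at a candidate period would count as evidence of periodicity at that period; how the paper actually tests for significance is not stated in the abstract.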

[188] When Fairness Isn't Statistical: The Limits of Machine Learning in Evaluating Legal Reasoning

Claire Barale,Michael Rovatsos,Nehal Bhuta

Main category: cs.CL

TL;DR: The paper examines the limits of machine learning for evaluating fairness in legal decisions; an empirical analysis of three common methods on a refugee-case dataset finds that their signals are inconsistent and depend on non-legal features, underscoring the need to incorporate legal reasoning and institutional context.

Details

Motivation: To investigate whether machine learning methods can meaningfully assess the fairness of legal decisions, especially in legal domains marked by high discretion and normative complexity. Method: An empirical study applies three ML approaches (feature-based analysis, semantic clustering, and predictive modeling) to AsyLex, a dataset of 59,000+ Canadian refugee decisions. Result: The methods produce inconsistent and sometimes contradictory signals; predictive modeling relies on non-legal features, and semantic clustering fails to capture substantive legal reasoning. Conclusion: Statistical fairness evaluation has clear limitations; assessing fairness requires legal reasoning and institutional context, not data alone. Abstract: Legal decisions are increasingly evaluated for fairness, consistency, and bias using machine learning (ML) techniques. In high-stakes domains like refugee adjudication, such methods are often applied to detect disparities in outcomes. Yet it remains unclear whether statistical methods can meaningfully assess fairness in legal contexts shaped by discretion, normative complexity, and limited ground truth. In this paper, we empirically evaluate three common ML approaches (feature-based analysis, semantic clustering, and predictive modeling) on a large, real-world dataset of 59,000+ Canadian refugee decisions (AsyLex). Our experiments show that these methods produce divergent and sometimes contradictory signals, that predictive modeling often depends on contextual and procedural features rather than legal features, and that semantic clustering fails to capture substantive legal reasoning. We show limitations of statistical fairness evaluation, challenge the assumption that statistical regularity equates to fairness, and argue that current computational approaches fall short of evaluating fairness in legally discretionary domains. We argue that evaluating fairness in law requires methods grounded not only in data, but in legal reasoning and institutional context.

[189] Compositional Generalisation for Explainable Hate Speech Detection

Agostina Calabrese,Tom Sherborne,Björn Ross,Mirella Lapata

Main category: cs.CL

TL;DR: The paper introduces U-PLEAD, a dataset in which expressions occur with balanced frequency across contexts; training on it together with real data improves the compositional generalisation of hate speech detection models.

Details

Motivation: Current hate speech detection models struggle to generalise beyond their training data because of dataset biases and the limitations of sentence-level labels. Method: Models are trained on U-PLEAD, a dataset of about 364,000 synthetic posts, combined with real data, and a compositional generalisation benchmark of about 8,000 manually validated posts is created. Result: Training on the combination of U-PLEAD and real data improves compositional generalisation and achieves state-of-the-art performance on the human-sourced PLEAD dataset. Conclusion: A dataset with balanced expression frequencies can effectively improve the generalisation of hate speech detection models. Abstract: Hate speech detection is key to online content moderation, but current models struggle to generalise beyond their training data. This has been linked to dataset biases and the use of sentence-level labels, which fail to teach models the underlying structure of hate speech. In this work, we show that even when models are trained with more fine-grained, span-level annotations (e.g., "artists" is labeled as target and "are parasites" as dehumanising comparison), they struggle to disentangle the meaning of these labels from the surrounding context. As a result, combinations of expressions that deviate from those seen during training remain particularly difficult for models to detect. We investigate whether training on a dataset where expressions occur with equal frequency across all contexts can improve generalisation. To this end, we create U-PLEAD, a dataset of ~364,000 synthetic posts, along with a novel compositional generalisation benchmark of ~8,000 manually validated posts. Training on a combination of U-PLEAD and real data improves compositional generalisation while achieving state-of-the-art performance on the human-sourced PLEAD.

Zhaolu Kang,Junhao Gong,Jiaxu Yan,Wanke Xia,Yian Wang,Ziwen Wang,Huaxuan Ding,Zhuo Cheng,Wenhao Cao,Zhiyuan Feng,Siqi He,Shannan Yan,Junzhe Chen,Xiaomin He,Chaoya Jiang,Wei Ye,Kaidong Yu,Xuelong Li

Main category: cs.CL

TL;DR: HSSBench is a benchmark built specifically to evaluate multimodal large language models (MLLMs) in the Humanities and Social Sciences (HSS), filling a gap left by existing benchmarks.

Details

Motivation: Existing benchmarks focus mainly on STEM and overlook the interdisciplinary thinking and knowledge integration that HSS tasks demand; HSSBench aims to fill this gap. Method: Domain experts and automated agents collaborate to generate and iteratively refine samples, yielding the HSSBench benchmark of more than 13,000 samples. Result: More than 20 mainstream MLLMs were tested, and even state-of-the-art models face significant challenges. Conclusion: HSSBench is expected to advance research on the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to integrate and connect knowledge. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.

[191] More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning

Mohammadamin Shafiei,Hamidreza Saffari,Nafise Sadat Moosavi

Main category: cs.CL

TL;DR: The study finds that large language models (LLMs) are swayed by input phrasing in comparative math problems and exhibit a directional framing bias; chain-of-thought prompting reduces the bias, but its effectiveness depends on the format.

Details

Motivation: To investigate how input phrasing shapes LLM reasoning under semantic framing, in particular directional bias in comparative math problems. Method: The MathComp benchmark is introduced, containing 300 comparison scenarios, each evaluated under 14 prompt variants across three LLM families. Result: Model errors frequently reflect linguistic steering; chain-of-thought prompting mitigates it with mixed effectiveness, and social identity terms amplify the bias. Conclusion: The study exposes blind spots in LLM evaluation and motivates framing-aware benchmarks to improve reasoning robustness and fairness. Abstract: Large language models (LLMs) are known to be sensitive to input phrasing, but the mechanisms by which semantic cues shape reasoning remain poorly understood. We investigate this phenomenon in the context of comparative math problems with objective ground truth, revealing a consistent and directional framing bias: logically equivalent questions containing the words "more", "less", or "equal" systematically steer predictions in the direction of the framing term. To study this effect, we introduce MathComp, a controlled benchmark of 300 comparison scenarios, each evaluated under 14 prompt variants across three LLM families. We find that model errors frequently reflect linguistic steering, systematic shifts toward the comparative term present in the prompt. Chain-of-thought prompting reduces these biases, but its effectiveness varies: free-form reasoning is more robust, while structured formats may preserve or reintroduce directional drift. Finally, we show that including demographic identity terms (e.g., "a woman", "a Black person") in input scenarios amplifies directional drift, despite identical underlying quantities, highlighting the interplay between semantic framing and social referents. These findings expose critical blind spots in standard evaluation and motivate framing-aware benchmarks for diagnosing reasoning robustness and fairness in LLMs.

[192] Hanging in the Balance: Pivotal Moments in Crisis Counseling Conversations

Vivian Nguyen,Lillian Lee,Cristian Danescu-Niculescu-Mizil

Main category: cs.CL

TL;DR: The paper proposes an unsupervised computational method for detecting pivotal moments in conversations as they happen, and validates it on crisis counseling conversations.

Details

Motivation: Certain pivotal moments in a conversation can strongly affect its eventual outcome; detecting them could assist conversationalists in high-stakes domains such as mental health crisis counseling. Method: The approach rests on the intuition that a moment is pivotal if the expected outcome varies widely depending on what might be said next; such moments are detected online with an unsupervised method. Result: Validation shows that detected pivotal moments align with human perception (counselors take longer to respond) and that the conversational trajectory is more likely to change course at these times. Conclusion: The framework can be used to analyze the relation between counselors' responses at pivotal moments and the eventual outcome of the session. Abstract: During a conversation, there can come certain moments where its outcome hangs in the balance. In these pivotal moments, how one responds can put the conversation on substantially different trajectories leading to significantly different outcomes. Systems that can detect when such moments arise could assist conversationalists in domains with highly consequential outcomes, such as mental health crisis counseling. In this work, we introduce an unsupervised computational method for detecting such pivotal moments as they happen, in an online fashion. Our approach relies on the intuition that a moment is pivotal if our expectation of the outcome varies widely depending on what might be said next. By applying our method to crisis counseling conversations, we first validate it by showing that it aligns with human perception -- counselors take significantly longer to respond during moments detected by our method -- and with the eventual conversational trajectory -- which is more likely to change course at these times. We then use our framework to explore the relation of the counselor's response during pivotal moments with the eventual outcome of the session.
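
The intuition stated in this abstract (a moment is pivotal if the expected outcome varies widely depending on what is said next) can be sketched as a spread statistic. This is only an illustration, not the authors' method: it assumes a hypothetical forecaster has already produced one outcome estimate per candidate next response, and it uses population variance as the measure of spread. The names `pivotal_score` and `flag_pivotal` and the threshold value are assumptions.

```python
from statistics import pvariance


def pivotal_score(expected_outcomes):
    """Spread of forecast outcomes across candidate next responses.

    expected_outcomes: one outcome estimate per plausible next utterance,
    assumed to come from some external forecaster (not shown here).
    """
    return pvariance(expected_outcomes)


def flag_pivotal(moments, threshold):
    """Return indices of moments whose outcome spread reaches the threshold."""
    return [
        i for i, outcomes in enumerate(moments)
        if pivotal_score(outcomes) >= threshold
    ]


# Three moments, each with forecasts for three candidate replies.
moments = [
    [0.5, 0.5, 0.5],  # outcome does not depend on the reply
    [0.1, 0.9, 0.5],  # outcome hinges on the reply
    [0.4, 0.6, 0.5],  # mild dependence
]
flagged = flag_pivotal(moments, threshold=0.05)  # only index 1 is flagged
```

Because each score uses only forecasts available at that moment, the computation can run online, in line with the "as they happen" framing in the abstract.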

[193] TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering

Junnan Zhu,Jingyi Wang,Bohan Yu,Xiaoyu Wu,Junbo Li,Lei Wang,Nan Xu

Main category: cs.CL

TL;DR: TableEval is a new TableQA benchmark designed to overcome the limitations of existing benchmarks, including uniform table structures, data leakage, and weak multilingual support. It contains diverse table structures and cross-lingual data, and comes with SEAT, a new evaluation framework.

Details

Motivation: Existing TableQA benchmarks are too simple and suffer from data leakage, so they fail to reflect the complexity and multilingual needs of real applications. Method: TableEval collects tables with diverse structures (concise, hierarchical, and nested) from four domains, along with multilingual data (Simplified Chinese, Traditional Chinese, and English), and proposes the SEAT evaluation framework. Result: Experiments show that SEAT agrees closely with human judgment and that current LLMs fall well short on complex TableQA tasks. Conclusion: TableEval and SEAT provide important tools for evaluating and improving LLM performance on complex TableQA tasks. Abstract: LLMs have shown impressive progress in natural language processing. However, they still face significant challenges in TableQA, where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (including government, finance, academia, and industry reports). Besides, TableEval features cross-lingual scenarios with tables in Simplified Chinese, Traditional Chinese, and English. To minimize the risk of data leakage, we collect all data from recent real-world documents. Considering that existing TableQA metrics fail to capture semantic accuracy, we further propose SEAT, a new evaluation framework that assesses the alignment between model responses and reference answers at the sub-question level. Experimental results have shown that SEAT achieves high agreement with human judgment. Extensive experiments on TableEval reveal critical gaps in the ability of state-of-the-art LLMs to handle these complex, real-world TableQA tasks, offering insights for future improvements. We make our dataset available here: https://github.com/wenge-research/TableEval.

[194] From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding

Chiwei Zhu,Benfeng Xu,Xiaorui Wang,Zhendong Mao

Main category: cs.CL

TL;DR: The paper combines top-down attribution with bottom-up synthesis to generate diverse, complex instruction data at scale for aligning large language models (LLMs), and builds SynthQuestions, a dataset of 1 million instructions.

Details

Motivation: Instruction data produced by existing methods is either narrowly distributed or lacks complexity, whereas efficient alignment calls for instructions grounded in cognitive insights and real-world use cases. Method: An attributed grounding framework is used: 1) a top-down process attributes real instructions to situated users; 2) a bottom-up process uses web documents to first generate a situation and then a meaningful instruction. Result: The SynthQuestions dataset is constructed; models trained on it achieve leading performance on several benchmarks, and performance keeps improving as more web corpora are added. Conclusion: The attributed grounding framework efficiently produces diverse, complex instruction data and markedly improves model alignment. Abstract: The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora. Data, models and codes will be available at https://github.com/Ignoramus0817/SynthQuestions.

[195] Structured Pruning for Diverse Best-of-N Reasoning Optimization

Hieu Trung Nguyen,Bao Nguyen,Viet Anh Nguyen

Main category: cs.CL

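TL;DR: In Transformer-based language models, pruning can improve reasoning as well as save compute. The study finds that selectively pruning certain attention heads markedly improves reasoning performance, especially on hard tasks. The authors propose SPRINT, a contrastive learning framework that dynamically selects the best pruning configuration, and experiments show it outperforms traditional methods.

Details

Motivation: Pruning is traditionally seen only as a way to save computation, yet selectively pruning certain attention heads turns out to improve reasoning, particularly on challenging tasks. This observation motivates the authors to explore the effect of pruning on reasoning performance. Method: The SPRINT framework uses contrastive learning to dynamically select the optimal attention head and layer to prune. By aligning question embeddings with head embeddings, it identifies pruning configurations that yield more accurate reasoning. Result: Experiments on the MATH500 and GSM8K datasets show that SPRINT significantly outperforms traditional best-of-N and random head selection strategies. Conclusion: Selective pruning can improve a model's reasoning ability; SPRINT provides an effective approach to dynamic pruning and demonstrates the potential of pruning for performance optimization. Abstract: Model pruning in transformer-based language models, traditionally viewed as a means of achieving computational savings, can enhance the model's reasoning capabilities. In this work, we uncover a surprising phenomenon: the selective pruning of certain attention heads leads to improvements in reasoning performance, particularly on challenging tasks. Motivated by this observation, we propose SPRINT, a novel contrastive learning framework that dynamically selects the optimal head and layer to prune during inference. By aligning question embeddings with head embeddings, SPRINT identifies those pruned-head configurations that result in more accurate reasoning. Extensive experiments demonstrate that our method significantly outperforms traditional best-of-$N$ and random head selection strategies on the MATH500 and GSM8K datasets.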

[196] Voice Activity Projection Model with Multimodal Encoders

Takeshi Saga,Catherine Pelachaud

Main category: cs.CL

TL;DR: The paper proposes a multimodal model that uses pre-trained audio and face encoders to improve turn-taking prediction; it performs competitively with, and in some cases better than, state-of-the-art models.

Details

Motivation: Conventional systems based on silence duration struggle to model turn-taking behavior in complex social contexts; multimodal VAP models have improved on them, but there is still room for improvement. Method: A multimodal VAP model is enhanced with pre-trained audio and face encoders that capture subtle expressions to improve performance. Result: The model performs well on turn-taking metrics and in some cases surpasses state-of-the-art models. Conclusion: Combining multimodality with pre-trained encoders improves turn-taking prediction; the code and models are open source. Abstract: Turn-taking management is crucial for any social interaction. Still, it is challenging to model human-machine interaction due to the complexity of the social context and its multimodal nature. Unlike conventional systems based on silence duration, previous voice activity projection (VAP) models successfully utilized a unified representation of turn-taking behaviors as prediction targets, which improved turn-taking prediction performance. Recently, a multimodal VAP model outperformed the previous state-of-the-art model by a significant margin. In this paper, we propose a multimodal model enhanced with pre-trained audio and face encoders to improve performance by capturing subtle expressions. Our model performed competitively, and in some cases, even better than state-of-the-art models on turn-taking metrics. All the source codes and pretrained models are available at https://github.com/sagatake/VAPwithAudioFaceEncoders.

[197] Around the World in 24 Hours: Probing LLM Knowledge of Time and Place

Carolin Holtermann,Paul Röttger,Anne Lauscher

Main category: cs.CL

TL;DR: The paper presents the first evaluation of language models' ability to reason jointly over time and space, finding that models do well on purely temporal reasoning but remain constrained on joint temporal-geographic tasks.

Details

Motivation: To explore language models' ability to reason jointly over time and space, filling a gap in prior research. Method: The GeoTemp dataset is created (320k prompts covering 289 cities, 217 countries, and 37 time zones) and used to evaluate eight open chat models. Result: Models perform well on purely temporal reasoning, and performance improves with scale; performance on joint temporal-geographic tasks remains constrained; location names with low perplexity yield better performance; prompt formulation strongly affects performance. Conclusion: Language models are still limited in joint temporal-geographic reasoning, and prompt design and training data distribution are key factors. Abstract: Reasoning over time and space is essential for understanding our world. However, the abilities of language models in this area are largely unexplored as previous work has tested their abilities for logical reasoning in terms of time and space in isolation or only in simple or artificial environments. In this paper, we present the first evaluation of the ability of language models to jointly reason over time and space. To enable our analysis, we create GeoTemp, a dataset of 320k prompts covering 289 cities in 217 countries and 37 time zones. Using GeoTemp, we evaluate eight open chat models of three different model families for different combinations of temporal and geographic knowledge. We find that most models perform well on reasoning tasks involving only temporal knowledge and that overall performance improves with scale. However, performance remains constrained in tasks that require connecting temporal and geographical information. We do not find clear correlations of performance with specific geographic regions. Instead, we find a significant performance increase for location names with low model perplexity, suggesting their repeated occurrence during model training. We further demonstrate that their performance is heavily influenced by prompt formulation - a direct injection of geographical knowledge leads to performance gains, whereas, surprisingly, techniques like chain-of-thought prompting decrease performance on simpler tasks.

[198] Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models

Alex Laitenberger,Christopher D. Manning,Nelson F. Liu

Main category: cs.CL

TL;DR: The paper compares multi-stage retrieval-augmented generation (RAG) pipelines with simple single-stage approaches and finds that the simple DOS RAG performs strongly across multiple benchmarks.

Details

Motivation: To determine whether multi-stage RAG pipelines still offer an advantage now that long-context language models (LMs) are available. Method: On QA tasks under systematically scaled token budgets, multi-stage pipelines (ReadAgent, RAPTOR) are compared against single-stage baselines, including DOS RAG. Result: DOS RAG matches or outperforms the more complex methods on multiple long-context QA benchmarks. Conclusion: DOS RAG is recommended as a simple yet strong baseline for future RAG evaluations, paired with emerging models to assess the trade-off between complexity and effectiveness. Abstract: With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single pass, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document's Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, pairing it with emerging embedding and language models to assess trade-offs between complexity and effectiveness as model capabilities evolve.
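
The retrieve-then-read idea described for DOS RAG (select relevant passages, then present them in their original document order) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the lexical `score` function stands in for whatever embedding similarity a real system would use, the token budget is approximated by whitespace word counts, and all function names are hypothetical.

```python
import math
from collections import Counter


def score(query, chunk):
    """Toy lexical-overlap relevance (a stand-in for embedding similarity)."""
    q = Counter(query.lower().split())
    c = Counter(chunk.lower().split())
    overlap = sum(min(q[w], c[w]) for w in q)
    return overlap / math.sqrt(len(chunk.split()) or 1)


def dos_rag_context(query, chunks, token_budget):
    """Pick the most relevant chunks within the budget, then restore document order."""
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: score(query, chunks[i]),
        reverse=True,
    )
    chosen, used = [], 0
    for i in ranked:
        n_tokens = len(chunks[i].split())  # crude token count (assumption)
        if used + n_tokens > token_budget:
            continue  # skip chunks that would exceed the budget
        chosen.append(i)
        used += n_tokens
    chosen.sort()  # key step: reorder by original position, not by relevance
    return [chunks[i] for i in chosen]


chunks = [
    "the cat sat",
    "dogs bark loudly",
    "a cat chased the mouse",
    "weather is nice",
]
context = dos_rag_context("cat mouse", chunks, token_budget=8)
```

In this example the most relevant chunk is the third one, yet the returned context lists the first chunk before it, because selected passages are sorted back into document order before being handed to the reader model.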

[199] DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding

Hongzhi Zhang,Jingyuan Zhang,Xingguang Ji,Qi Wang,Fuzheng Zhang

Main category: cs.CL

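TL;DR: DynTok is a dynamic video token compression strategy that adaptively groups and merges tokens, substantially reducing computational overhead while maintaining performance.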

Details

Motivation: Existing video modeling methods (such as LLava) incur high computational cost because they process a large number of visual tokens, especially for long videos. Method: DynTok dynamically groups and merges visual tokens, preserving content in regions of high information density and compressing regions of low density. Result: The token count is reduced to 44.4% of the original with comparable performance; the method reaches 65.3% on Video-MME and 72.5% on MLVU. Conclusion: DynTok exposes the redundancy in video token representations and offers insights for designing efficient video modeling techniques. Abstract: Typical video modeling methods, such as LLava, represent videos as sequences of visual tokens, which are then processed by the LLM backbone for effective video understanding. However, this approach leads to a massive number of visual tokens, especially for long videos. A practical solution is to first extract relevant visual information from the large visual context before feeding it into the LLM backbone, thereby reducing computational overhead. In this work, we introduce DynTok, a novel Dynamic video Token compression strategy. DynTok adaptively splits visual tokens into groups and merges them within each group, achieving high compression in regions with low information density while preserving essential content. Our method reduces the number of tokens to 44.4% of the original size while maintaining comparable performance. It further benefits from increasing the number of video frames and achieves 65.3% on Video-MME and 72.5% on MLVU. By applying this simple yet effective compression method, we expose the redundancy in video token representations and offer insights for designing more efficient video modeling techniques.
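
To make the "split into groups, then merge within each group" step concrete, here is a small sketch. The abstract does not state how groups are formed, so the grouping rule below (start a new group whenever the cosine similarity between adjacent tokens drops below a threshold) and the averaging merge are assumptions chosen for illustration; the names `cosine` and `group_and_merge` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_and_merge(tokens, threshold=0.9):
    """Group runs of similar adjacent tokens and average each run.

    Assumed rule: a token joins the current group when it is at least
    `threshold`-similar to its predecessor; otherwise it opens a new group.
    """
    if not tokens:
        return []
    groups = [[tokens[0]]]
    for prev, cur in zip(tokens, tokens[1:]):
        if cosine(prev, cur) >= threshold:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    # Merge each group into one token by element-wise averaging.
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
merged = group_and_merge(tokens)
```

Under this rule, long runs of near-identical tokens (low information density) collapse to a single token, while rapidly changing regions keep more tokens.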

[200] Words of Warmth: Trust and Sociability Norms for over 26k English Words

Saif M. Mohammad

Main category: cs.CL

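TL;DR: The paper introduces Words of Warmth, the first large-scale, manually annotated database of word-warmth associations, covering over 26k English words, and verifies its reliability. It also studies how children acquire these words with age and demonstrates applications to bias and stereotype research through case studies.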

Details

Motivation: Social psychology has shown that Warmth (W) and Competence (C) are the primary dimensions along which people and groups are assessed, but the two components of warmth, Trust (T) and Sociability (S), remain understudied. Method: Build the Words of Warmth database of word associations with warmth, trust, and sociability, and verify its reliability; use it to study children's word acquisition and bias. Result: The word associations are highly reliable, and the database supports research on children's word acquisition and a range of bias and stereotype case studies. Conclusion: Words of Warmth provides a reliable tool for studying warmth and its components, supporting broad applications in child development and bias research. Abstract: Social psychologists have shown that Warmth (W) and Competence (C) are the primary dimensions along which we assess other people and groups. These dimensions impact various aspects of our lives from social competence and emotion regulation to success in the work place and how we view the world. More recent work has started to explore how these dimensions develop, why they have developed, and what they constitute. Of particular note, is the finding that warmth has two distinct components: Trust (T) and Sociability (S). In this work, we introduce Words of Warmth, the first large-scale repository of manually derived word--warmth (as well as word--trust and word--sociability) associations for over 26k English words. We show that the associations are highly reliable. We use the lexicons to study the rate at which children acquire WCTS words with age. Finally, we show that the lexicon enables a wide variety of bias and stereotype research through case studies on various target entities. Words of Warmth is freely available at: http://saifmohammad.com/warmth.html

[201] Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era

Dan Oneata,Desmond Elliott,Stella Frank

Main category: cs.CL

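TL;DR: The paper examines how well large-scale models (such as image encoders and language models) represent the semantic features of concrete object concepts (e.g., color, smell). Multimodal image encoders slightly outperform language-only models, while image-only encoders perform comparably to language models on non-visual attributes.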

Details

Motivation: Human learning and conceptual representation are grounded in sensorimotor experience; the paper asks whether existing large-scale models can effectively represent the semantic features of concrete objects (e.g., color, smell). Method: Use probing tasks to evaluate image encoders (image-only and multimodally trained) and language-only models on predicting the McRae norms and the attribute ratings of the Binder dataset. Result: Multimodal image encoders slightly outperform language-only models, and image-only encoders perform comparably to language models on non-visual attributes. Conclusion: The study reveals the potential of pure unimodal learning and the complementarity of the modalities. Abstract: Human learning and conceptual representation is grounded in sensorimotor experience, in contrast to state-of-the-art foundation models. In this paper, we investigate how well such large-scale models, trained on vast quantities of data, represent the semantic feature norms of concrete object concepts, e.g. a ROSE is red, smells sweet, and is a flower. More specifically, we use probing tasks to test which properties of objects these models are aware of. We evaluate image encoders trained on image data alone, as well as multimodally-trained image encoders and language-only models, on predicting an extended denser version of the classic McRae norms and the newer Binder dataset of attribute ratings. We find that multimodal image encoders slightly outperform language-only approaches, and that image-only encoders perform comparably to the language models, even on non-visual attributes that are classified as "encyclopedic" or "function". These results offer new insights into what can be learned from pure unimodal learning, and the complementarity of the modalities.

[202] QQSUM: A Novel Task and Model of Quantitative Query-Focused Summarization for Review-based Product Question Answering

An Quang Tang,Xiuzhen Zhang,Minh Ngoc Dinh,Zhuang Li

Main category: cs.CL

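TL;DR: The paper proposes a new task, QQSUM, which produces multi-perspective answers by quantifying the diversity of user opinions, and extends RAG into QQSUM-RAG, which outperforms existing methods in textual quality and quantification accuracy.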

Details

Motivation: Existing PQA systems generate answers from a single perspective and fail to capture the diversity of user opinions, so a method is needed that summarizes and quantifies multi-perspective opinions. Method: Propose the QQSUM task and extend RAG into QQSUM-RAG, which uses few-shot learning to train the retriever and generator to produce diverse, representative key-point summaries. Result: Experiments show that QQSUM-RAG outperforms existing RAG baselines in textual quality and opinion quantification accuracy. Conclusion: QQSUM-RAG effectively captures and quantifies the diversity of user opinions, offering a more complete solution for PQA. Abstract: Review-based Product Question Answering (PQA) allows e-commerce platforms to automatically address customer queries by leveraging insights from user reviews. However, existing PQA systems generate answers with only a single perspective, failing to capture the diversity of customer opinions. In this paper we introduce a novel task Quantitative Query-Focused Summarization (QQSUM), which aims to summarize diverse customer opinions into representative Key Points (KPs) and quantify their prevalence to effectively answer user queries. While Retrieval-Augmented Generation (RAG) shows promise for PQA, its generated answers still fall short of capturing the full diversity of viewpoints. To tackle this challenge, our model QQSUM-RAG, which extends RAG, employs few-shot learning to jointly train a KP-oriented retriever and a KP summary generator, enabling KP-based summaries that capture diverse and representative opinions. Experimental results demonstrate that QQSUM-RAG achieves superior performance compared to state-of-the-art RAG baselines in both textual quality and quantification accuracy of opinions. Our source code is available at: https://github.com/antangrocket1312/QQSUMM

[203] AI Agents for Conversational Patient Triage: Preliminary Simulation-Based Evaluation with Real-World EHR Data

Sina Rashidian,Nan Li,Jonathan Amar,Jong Ha Lee,Sam Pugh,Eric Yang,Geoff Masterson,Myoung Cha,Yugang Jia,Akhil Vaid

Main category: cs.CL

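TL;DR: A patient simulator built on real patient data is developed for training and testing healthcare AI agents; results show that the simulator is highly consistent with the real cases.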

Details

Motivation: Provide a realistic, scalable environment for training and testing healthcare AI agents, addressing data scarcity and privacy concerns. Method: Extract case vignettes from real electronic health records (EHR), build the simulator, run multi-turn conversations with an AI agent, and have experts assess consistency. Result: The simulator is consistent with the patient vignettes in 97.7% of cases, and the extracted case summaries are 99% relevant. Conclusion: The approach can be used to train and test multi-turn conversational AI agents at scale, and the simulator performs well. Abstract: Background: We present a Patient Simulator that leverages real world patient encounters which cover a broad range of conditions and symptoms to provide synthetic test subjects for development and testing of healthcare agentic models. The simulator provides a realistic approach to patient presentation and multi-turn conversation with a symptom-checking agent. Objectives: (1) To construct and instantiate a Patient Simulator to train and test an AI health agent, based on patient vignettes derived from real EHR data. (2) To test the validity and alignment of the simulated encounters provided by the Patient Simulator to expert human clinical providers. (3) To illustrate the evaluation framework of such an LLM system on the generated realistic, data-driven simulations -- yielding a preliminary assessment of our proposed system. Methods: We first constructed realistic clinical scenarios by deriving patient vignettes from real-world EHR encounters. These vignettes cover a variety of presenting symptoms and underlying conditions. We then evaluate the performance of the Patient Simulator as a simulacrum of a real patient encounter across over 500 different patient vignettes. We leveraged a separate AI agent to provide multi-turn questions to obtain a history of present illness. The resulting multiturn conversations were evaluated by two expert clinicians. Results: Clinicians scored the Patient Simulator as consistent with the patient vignettes in those same 97.7% of cases. The extracted case summary based on the conversation history was 99% relevant. Conclusions: We developed a methodology to incorporate vignettes derived from real healthcare patient data to build a simulation of patient responses to symptom checking agents. The performance and alignment of this Patient Simulator could be used to train and test a multi-turn conversational AI agent at scale.

[204] The mutual exclusivity bias of bilingual visually grounded speech models

Dan Oneata,Leanne Nortje,Yevgen Matusevych,Herman Kamper

Main category: cs.CL

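TL;DR: The mutual exclusivity (ME) bias is weaker in bilingual visually grounded speech models than in monolingual ones, partly because their visual embeddings have smaller variance.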

Details

Motivation: Bilingual children may rely less on the mutual exclusivity (ME) strategy in language learning; the paper tests this hypothesis with computational models. Method: Train bilingual visually grounded speech (VGS) models on combinations of English, French, and Dutch data and analyze their ME bias. Result: Bilingual models generally show a weaker ME bias than monolingual models; the smaller variance of their visual embeddings increases confusion with familiar concepts. Conclusion: The ME bias is weaker in bilingual models; the work shows how visual embedding variance affects the bias and offers new insight into why the bias exists. Abstract: Mutual exclusivity (ME) is a strategy where a novel word is associated with a novel object rather than a familiar one, facilitating language learning in children. Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech with paired images. But ME has also been studied in bilingual children, who may employ it less due to cross-lingual ambiguity. We explore this pattern computationally using bilingual VGS models trained on combinations of English, French, and Dutch. We find that bilingual models generally exhibit a weaker ME bias than monolingual models, though exceptions exist. Analyses show that the combined visual embeddings of bilingual models have a smaller variance for familiar data, partly explaining the increase in confusion between novel and familiar concepts. We also provide new insights into why the ME bias exists in VGS models in the first place. Code and data: https://github.com/danoneata/me-vgs

[205] LexTime: A Benchmark for Temporal Ordering of Legal Events

Claire Barale,Leslie Barrett,Vikram Sunil Bajaj,Michael Rovatsos

Main category: cs.CL

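TL;DR: LexTime is the first dataset for evaluating LLMs' event-ordering ability in legal text, with 512 annotated instances. LLMs order legal events more accurately than narrative ones, but the complexity of legal language remains a challenge.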

Details

Motivation: Existing datasets lack expert language evaluation, leaving it unclear how LLMs handle event ordering in legal contexts; the dedicated LexTime dataset is needed to fill this gap. Method: Build LexTime from 512 annotated instances drawn from U.S. Federal Complaints, analyze LLM performance on legal event ordering, and examine the effects of context length, explicit vs. implicit event pairs, and legal language features. Result: LLMs are up to 10.5% more accurate on legal event ordering than on narrative text; longer contexts and implicit events raise accuracy to as much as 80.8%; the complexity of legal language remains the main challenge. Conclusion: Modeling strategies tailored to legal language are needed to improve LLMs' temporal event reasoning. Abstract: Temporal reasoning in legal texts is important for applications like case law analysis and compliance monitoring. However, existing datasets lack expert language evaluation, leaving a gap in understanding how LLMs manage event ordering in legal contexts. We introduce LexTime, the first dataset designed to evaluate LLMs' event ordering capabilities in legal language, consisting of 512 instances from U.S. Federal Complaints with annotated event pairs and their temporal relations. Our findings show that (1) LLMs are more accurate on legal event ordering than on narrative (up to +10.5%); (2) longer input contexts and implicit events boost accuracy, reaching 80.8% for implicit-explicit event pairs; (3) legal linguistic complexities and nested clauses remain a challenge. We investigate how context length, explicit vs implicit event pairs, and legal language features affect model performance, demonstrating the need for specific modeling strategies to enhance temporal event reasoning.

[206] Unveiling and Eliminating the Shortcut Learning for Locate-Then-Edit Knowledge Editing via Both Subject and Relation Awareness

Xiyu Liu,Zhengxiao Liu,Naibin Gu,Zheng Lin,Ji Xiang,Weiping Wang

Main category: cs.CL

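TL;DR: The paper proposes a two-stage optimization method that addresses shortcut learning in knowledge editing and enables controllable knowledge editing.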

Details

Motivation: During optimization, existing knowledge editing methods tend to over-learn the subject feature while neglecting the relation feature, causing uncontrollable side effects. Method: Propose a two-stage optimization process that balances learning of the subject feature and the relation feature. Result: Experiments show that the method effectively prevents shortcut learning and achieves the best overall performance. Conclusion: The two-stage optimization method offers an effective solution for controllable knowledge editing. Abstract: Knowledge editing aims to alter the target knowledge predicted by large language models while ensuring the least side effects on unrelated knowledge. An effective way to achieve knowledge editing is to identify pivotal parameters for predicting factual associations and modify them with an optimization process to update the predictions. However, these locate-then-edit methods are uncontrollable since they tend to modify most unrelated relations connected to the subject of target editing. We unveil that this failure of controllable editing is due to a shortcut learning issue during the optimization process. Specifically, we discover two crucial features that are the subject feature and the relation feature for models to learn during optimization, but the current optimization process tends to over-learn the subject feature while neglecting the relation feature. To eliminate this shortcut learning of the subject feature, we propose a novel two-stage optimization process that balances the learning of the subject feature and the relation feature. Experimental results demonstrate that our approach successfully prevents knowledge editing from shortcut learning and achieves the optimal overall performance, contributing to controllable knowledge editing.

[207] Think Like a Person Before Responding: A Multi-Faceted Evaluation of Persona-Guided LLMs for Countering Hate

Mikel K. Ngueajio,Flor Miriam Plaza-del-Arco,Yi-Ling Chung,Danda B. Rawat,Amanda Cercas Curry

Main category: cs.CL

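TL;DR: The paper proposes a framework for evaluating counter-narratives (CNs) generated by large language models and finds problems including verbosity, low readability, and ethical risks.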

Details

Motivation: Automated counter-narratives are a promising strategy against online hate speech, but their affective tone, accessibility, and ethical risks still need to be evaluated. Method: Evaluate three prompting strategies on the MT-Conan and HatEval datasets using GPT-4o-Mini, Cohere's CommandR-7B, and Meta's LLaMA 3.1-70B. Result: LLM-generated counter-narratives are verbose and pitched at readers with college-level literacy; emotionally guided prompts improve empathy and readability, but safety and effectiveness remain in question. Conclusion: The accessibility and ethical safety of counter-narratives need further improvement. Abstract: Automated counter-narratives (CN) offer a promising strategy for mitigating online hate speech, yet concerns about their affective tone, accessibility, and ethical risks remain. We propose a framework for evaluating Large Language Model (LLM)-generated CNs across four dimensions: persona framing, verbosity and readability, affective tone, and ethical robustness. Using GPT-4o-Mini, Cohere's CommandR-7B, and Meta's LLaMA 3.1-70B, we assess three prompting strategies on the MT-Conan and HatEval datasets. Our findings reveal that LLM-generated CNs are often verbose and adapted for people with college-level literacy, limiting their accessibility. While emotionally guided prompts yield more empathetic and readable responses, there remain concerns surrounding safety and effectiveness.

[208] Lacuna Inc. at SemEval-2025 Task 4: LoRA-Enhanced Influence-Based Unlearning for LLMs

Aleksey Kudelya,Alexander Shirnin

Main category: cs.CL

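TL;DR: LIBU is a lightweight algorithm that combines influence functions with second-order optimization to remove specific knowledge from large language models without harming overall performance.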

Details

Motivation: Remove sensitive content from large language models while avoiding the high cost of retraining from scratch. Method: Combine influence functions, which remove the influence of the data, with second-order optimization, which stabilizes model performance. Result: Experiments show that LIBU is applicable to unlearning in large language models across different kinds of tasks. Conclusion: LIBU is an efficient and practical unlearning method. Abstract: This paper describes LIBU (LoRA enhanced influence-based unlearning), an algorithm to solve the task of unlearning - removing specific knowledge from a large language model without retraining from scratch and compromising its overall utility (SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models). The algorithm combines classical influence functions to remove the influence of the data from the model and second-order optimization to stabilize the overall utility. Our experiments show that this lightweight approach is well applicable for unlearning LLMs in different kinds of tasks.

[209] On Support Samples of Next Word Prediction

Yuqian Li,Yupei Du,Yufang Liu,Feifei Feng,Mou Xiao Feng,Yuanbin Wu

Main category: cs.CL

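TL;DR: The paper studies data-centric interpretability of language models, using support and non-support samples to analyze the intrinsic properties of model decisions and their effect on generalization and representation learning.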

Details

Motivation: Understanding the rationale behind language model decisions is challenging; the study aims to explain model behavior through a data-centric approach. Method: Use the representer theorem to identify support and non-support samples and analyze their roles in prediction and model training. Result: Being a support sample is an intrinsic property that is predictable before training; non-support samples are critical for preventing overfitting and for representation learning, with a particularly strong role in deeper layers. Conclusion: The study sheds light on the interplay between data and model decisions, offering a new perspective on language model behavior. Abstract: Language models excel in various tasks by making complex decisions, yet understanding the rationale behind these decisions remains a challenge. This paper investigates data-centric interpretability in language models, focusing on the next-word prediction task. Using the representer theorem, we identify two types of support samples: those that either promote or deter specific predictions. Our findings reveal that being a support sample is an intrinsic property, predictable even before training begins. Additionally, while non-support samples are less influential in direct predictions, they play a critical role in preventing overfitting and shaping generalization and representation learning. Notably, the importance of non-support samples increases in deeper layers, suggesting their significant role in intermediate representation formation. These insights shed light on the interplay between data and model decisions, offering a new dimension to understanding language model behavior and interpretability.

[210] Explainability-Based Token Replacement on LLM-Generated Text

Hadi Mohammadi,Anastasia Giachanou,Daniel L. Oberski,Ayoub Bagheri

Main category: cs.CL

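TL;DR: The paper explores how explainable AI (XAI) methods can reduce the detectability of AI-generated text (AIGT) and proposes an ensemble-classifier detection approach. Replacing key tokens makes AIGT harder to detect, but the ensemble classifier still maintains strong performance.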

Details

Motivation: Although text generated by large language models (LLMs) approaches human quality, its patterns make it easy to detect. The study aims to reduce the detectability of AIGT with XAI methods and to develop more robust detection strategies. Method: Train an ensemble classifier to distinguish AIGT from human-written text, use SHAP and LIME to identify influential tokens, and propose four explainability-based token replacement strategies. Result: The token replacement strategies significantly reduce a single classifier's detection ability, but the ensemble classifier remains stable across multiple languages and domains. Conclusion: XAI methods can reduce the detectability of AIGT, so robust ensemble detection strategies are needed to keep up with evolving concealment techniques. Abstract: Generative models, especially large language models (LLMs), have shown remarkable progress in producing text that appears human-like. However, they often exhibit patterns that make their output easier to detect than text written by humans. In this paper, we investigate how explainable AI (XAI) methods can be used to reduce the detectability of AI-generated text (AIGT) while also introducing a robust ensemble-based detection approach. We begin by training an ensemble classifier to distinguish AIGT from human-written text, then apply SHAP and LIME to identify tokens that most strongly influence its predictions. We propose four explainability-based token replacement strategies to modify these influential tokens. Our findings show that these token replacement approaches can significantly diminish a single classifier's ability to detect AIGT. However, our ensemble classifier maintains strong performance across multiple languages and domains, showing that a multi-model approach can mitigate the impact of token-level manipulations. These results show that XAI methods can make AIGT harder to detect by focusing on the most influential tokens. At the same time, they highlight the need for robust, ensemble-based detection strategies that can adapt to evolving approaches for hiding AIGT.

[211] High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning

Tim Franzmeyer,Archie Sravankumar,Lijuan Liu,Yuning Mao,Rui Hou,Sinong Wang,Jakob N. Foerster,Luke Zettlemoyer,Madian Khabsa

Main category: cs.CL

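TL;DR: The paper proposes HALT, a post-training method under which a large language model (LLM) generates content only when confident in its answer and otherwise partially or fully abstains, reducing hallucination.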

Details

Motivation: Current LLMs respond to every prompt, but produce incorrect answers (hallucinations) when they lack the knowledge or capability. HALT aims to improve the correctness of generated content through capability-aligned post-training data. Method: HALT splits a pretrained LLM's responses into factual fragments, uses ground truth information to identify incorrect fragments, and performs capability-aligned finetuning on responses in which those fragments are removed or replaced with "Unsure from Here". Result: Experiments on four open-source models show that HALT raises mean fragment correctness by 15% on average and improves the F1 score by 4%. At the highest-correctness setting, the correctness of Llama3-70B rises from 51% to 87%. Conclusion: Through capability-aligned post-training, HALT effectively trades response completeness against correctness and substantially reduces LLM hallucination. Abstract: Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability -- a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with "Unsure from Here" -- according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response's fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.
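
The data-construction step described above (fragment a response, then cut it off with "Unsure from Here" when it becomes unreliable) can be illustrated with a small sketch. The abstract does not spell out how the tunable threshold is applied, so the rule below is an assumption: each fragment carries a hypothetical estimated correctness score, and the response is truncated at the first fragment that falls below the threshold. The name `align_response` is also hypothetical.

```python
UNSURE = "Unsure from Here"


def align_response(fragments, threshold):
    """Build a capability-aligned target from scored fragments.

    fragments: list of (text, p_correct) pairs; p_correct is an assumed
               per-fragment correctness estimate, not a quantity defined
               in the abstract.
    threshold: higher values favor correctness over completeness.
    """
    kept = []
    for text, p_correct in fragments:
        if p_correct < threshold:
            # Abstain for the rest of the response.
            kept.append(UNSURE)
            break
        kept.append(text)
    return kept


fragments = [("A", 0.9), ("B", 0.6), ("C", 0.95)]
lenient = align_response(fragments, threshold=0.5)
strict = align_response(fragments, threshold=0.8)
```

Raising the threshold shortens the kept prefix, which mirrors the completeness-versus-correctness trade-off the abstract reports.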

[212] Progressive Mastery: Customized Curriculum Learning with Guided Prompting for Mathematical Reasoning

Muling Wu,Qi Qian,Wenhao Liu,Xiaohua Wang,Zisu Huang,Di Liang,LI Miao,Shihan Dou,Changze Lv,Zhenghua Wang,Zhibo Xu,Lina Chen,Tianlong Li,Xiaoqing Zheng,Xuanjing Huang

Main category: cs.CL

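TL;DR: The paper proposes Customized Curriculum Learning (CCL), a new framework that uses model-adaptive difficulty definition and dynamic prompting to substantially improve sample utilization and performance of large language models on reasoning tasks.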

Details

Motivation: Large language models (LLMs) perform well on reasoning tasks, but post-training suffers from low sample utilization and inflexible handling of difficult samples. Method: CCL has two innovations: model-adaptive difficulty definition (curriculum datasets customized to each model's capabilities) and a dynamic prompting technique (hints that reduce sample difficulty). Result: In supervised fine-tuning and reinforcement learning experiments, CCL significantly outperforms uniform training on five mathematical reasoning benchmarks. Conclusion: CCL effectively improves sample utilization and model performance and applies to multiple training paradigms. Abstract: Large Language Models (LLMs) have achieved remarkable performance across various reasoning tasks, yet post-training is constrained by inefficient sample utilization and inflexible difficulty samples processing. To address these limitations, we propose Customized Curriculum Learning (CCL), a novel framework with two key innovations. First, we introduce model-adaptive difficulty definition that customizes curriculum datasets based on each model's individual capabilities rather than using predefined difficulty metrics. Second, we develop "Guided Prompting," which dynamically reduces sample difficulty through strategic hints, enabling effective utilization of challenging samples that would otherwise degrade performance. Comprehensive experiments on supervised fine-tuning and reinforcement learning demonstrate that CCL significantly outperforms uniform training approaches across five mathematical reasoning benchmarks, confirming its effectiveness across both paradigms in enhancing sample utilization and model performance.

Yi Zhao,Siqi Wang,Jing Li

Main category: cs.CL

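TL;DR: The paper proposes LaF-GRPO, a method that uses an LLM to simulate feedback from visually impaired users in order to optimize VLM-generated navigation instructions, and validates its effectiveness on the NIG4VI benchmark.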

Details

Motivation: Generating precise, practical navigation instructions for visually impaired people is important but underexplored. Method: Propose LaF-GRPO, in which an LLM simulates user feedback to produce reward signals that guide VLM post-training. Result: Experiments show that LaF-GRPO significantly outperforms baselines on metrics such as BLEU and METEOR and produces more intuitive, safer instructions. Conclusion: By reducing the need for real-world data, LaF-GRPO improves the usability of navigation instructions and better supports visually impaired users. Abstract: Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet relatively underexplored. This study, hence, focuses on producing precise, in-situ, step-by-step navigation instructions that are practically usable by VI users. Concretely, we propose LaF-GRPO (LLM-as-Follower GRPO), where an LLM simulates VI user responses to generate rewards guiding the Vision-Language Model (VLM) post-training. This enhances instruction usability while reducing costly real-world data needs. To facilitate training and testing, we introduce NIG4VI, a 27k-sample open-sourced benchmark. It provides diverse navigation scenarios with accurate spatial coordinates, supporting detailed, open-ended in-situ instruction generation. Experiments on NIG4VI show the effectiveness of LaF-GRPO by quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU +14%; SFT+(LaF-GRPO) METEOR 0.542 vs. GPT-4o's 0.323) and yields more intuitive, safer instructions. Code and benchmark are available at https://github.com/YiyiyiZhao/NIG4VI.

[214] Controlling Difficulty of Generated Text for AI-Assisted Language Learning

Meiqing Jin,Liam Dugan,Chris Callison-Burch

Main category: cs.CL

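TL;DR: The study investigates how controllable generation techniques can adapt the output of large language models (LLMs) to better support beginner language learners.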

Details

Motivation: Standard LLM output is too complex for beginners (CEFR: A1-A2), so modular methods that require no fine-tuning are worth exploring. Method: Adapt LLM output with controllable generation techniques (such as future discriminators) and evaluate with automatic metrics and a user study of Japanese learners. Result: Future discriminators significantly improve output comprehensibility (from 40.4% to 84.3%), and a new metric, Token Miss Rate (TMR), is introduced. Conclusion: Controllable generation techniques can effectively support beginner language learners, and the released code, tools, and dataset enable further research. Abstract: Practicing conversations with large language models (LLMs) presents a promising alternative to traditional in-person language learning. However, most LLMs generate text at a near-native level of complexity, making them ill-suited for beginner learners (CEFR: A1-A2). In this paper, we investigate whether controllable generation techniques -- specifically modular methods that do not require model fine-tuning -- can adapt LLM outputs to better support absolute beginners. We evaluate these methods through both automatic metrics and a user study with university-level learners of Japanese. Our findings show that while prompting alone fails to control output difficulty, the use of future discriminators (Yang and Klein, 2021) significantly improves output comprehensibility (from 40.4% to 84.3%). We further introduce a novel token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments. To support future research in AI-assisted language learning, we release our code, models, annotation tools, and dataset.
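
Token Miss Rate is described above as the proportion of incomprehensible tokens per utterance. A minimal sketch of that ratio follows; how a token is judged "incomprehensible" is not stated in the abstract, so membership in a learner's known-vocabulary set is used here as a stand-in assumption, and the function name is hypothetical.

```python
def token_miss_rate(tokens, known_vocab):
    """Fraction of tokens in one utterance that the learner does not know.

    tokens:      the utterance, already tokenized
    known_vocab: set of tokens assumed comprehensible to the learner
                 (an illustrative proxy, not the paper's criterion)
    """
    if not tokens:
        return 0.0
    missed = sum(1 for token in tokens if token not in known_vocab)
    return missed / len(tokens)


utterance = ["私", "は", "学生", "です"]
known = {"私", "は", "です"}
tmr = token_miss_rate(utterance, known)
```

A lower value means more of the utterance falls inside the learner's vocabulary; averaging over utterances would give a corpus-level figure.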

[215] Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems

Jhen-Ke Lin,Hao-Chien Lu,Chung-Chun Wang,Hong-Yun Lin,Berlin Chen

Main category: cs.CL

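TL;DR: By fine-tuning Whisper models with LoRA and comparing three annotation schemes (Pure, Rich, Extra), the paper shows that precisely labeled filled pauses significantly improve ASR accuracy for transcribing non-native speech.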

Details

Motivation: Automatic speaking assessment requires accurate capture of disfluencies, but existing ASR systems often discard or generalize these details, which hurts downstream tasks. Method: Fine-tune Whisper models with LoRA and compare three annotation schemes (Pure, Rich, Extra). Result: With the Extra scheme, Whisper Large V3 Turbo reaches 5.5% WER, an 11.3% relative improvement over the Pure scheme (6.2% WER). Conclusion: Precisely labeled filled pauses significantly improve ASR accuracy for transcribing non-native speech. Abstract: Verbatim transcription for automatic speaking assessment demands accurate capture of disfluencies, crucial for downstream tasks like error analysis and feedback. However, many ASR systems discard or generalize hesitations, losing important acoustic details. We fine-tune Whisper models on the Speak & Improve 2025 corpus using low-rank adaptation (LoRA), without recourse to external audio training data. We compare three annotation schemes: removing hesitations (Pure), generic tags (Rich), and acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs (Extra). Our challenge system achieved 6.47% WER (Pure) and 5.81% WER (Extra). Post-challenge experiments reveal that fine-tuning Whisper Large V3 Turbo with the "Extra" scheme yielded a 5.5% WER, an 11.3% relative improvement over the "Pure" scheme (6.2% WER). This demonstrates that explicit, realistic filled-pause labeling significantly enhances ASR accuracy for verbatim L2 speech transcription.
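
The 11.3% figure above is a relative reduction in word error rate, (6.2 - 5.5) / 6.2, rather than an absolute difference. The sketch below checks that arithmetic and, for context, includes a standard word-level edit-distance WER; the function names and the example sentences are illustrative and do not come from the paper.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction of new_wer with respect to baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Dropping the hesitation "um" from a verbatim reference costs one deletion.
example_wer = wer("i um like tea", "i like tea")
gain = relative_improvement(6.2, 5.5)
```

Here `round(gain * 100, 1)` reproduces the reported 11.3%, and the example shows why a system that drops hesitations is penalized when the reference is verbatim.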

[216] A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions

Chung-Chun Wang,Jhen-Ke Lin,Hao-Chien Lu,Hong-Yun Lin,Berlin Chen

Main category: cs.CL

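TL;DR: The paper proposes a training paradigm that uses a large language model to generate diverse responses and synthesizes them into speech, addressing the scarcity of labeled data in automated speaking assessment.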

Details

Motivation: The scarcity of labeled recordings restricts prompt diversity and undermines scoring reliability, so a new approach is needed. Method: Generate diverse responses with a large language model, convert them to speech via text-to-speech synthesis, reweight training instances with a dynamic importance loss, and predict scores with a multimodal large language model. Result: Experiments on the LTTC dataset show that the approach outperforms methods relying on real data or conventional augmentation. Conclusion: The approach effectively mitigates low-resource constraints and enables automated assessment of opinion expressions using cross-modal information. Abstract: Automated speaking assessment (ASA) on opinion expressions is often hampered by the scarcity of labeled recordings, which restricts prompt diversity and undermines scoring reliability. To address this challenge, we propose a novel training paradigm that leverages a large language model (LLM) to generate diverse responses of a given proficiency level, converts responses into synthesized speech via speaker-aware text-to-speech synthesis, and employs a dynamic importance loss to adaptively reweight training instances based on feature distribution differences between synthesized and real speech. Subsequently, a multimodal large language model integrates aligned textual features with speech signals to predict proficiency scores directly. Experiments conducted on the LTTC dataset show that our approach outperforms methods relying on real data or conventional augmentation, effectively mitigating low-resource constraints and enabling ASA on opinion expressions with cross-modal information.

[217] LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

Ming Zhang,Yujiong Shen,Zelin Li,Huayu Sha,Binze Hu,Yuhui Wang,Chenhao Huang,Shichun Liu,Jingqi Tong,Changhao Jiang,Mingxu Chai,Zhiheng Xi,Shihan Dou,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

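TL;DR: LLMEval-Med is a new medical benchmark covering five core medical areas with 2,996 questions built from real electronic health records and expert-designed clinical scenarios, addressing the limitations of existing benchmarks.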

Details

Motivation: Medical applications demand high accuracy, and existing benchmarks fall short in question design, data sources, and evaluation methods. Method: Propose the LLMEval-Med benchmark, combining an automated evaluation pipeline with expert feedback that dynamically refines the scoring criteria. Result: Thirteen large language models are evaluated, yielding insights for the safe and effective deployment of LLMs in medicine. Conclusion: LLMEval-Med improves the reliability of medical LLM evaluation through real-world data and expert feedback. Abstract: Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains. The dataset is released in https://github.com/llmeval/LLMEval-Med.

[218] EuroLLM-9B: Technical Report

Pedro Henrique Martins,João Alves,Patrick Fernandes,Nuno M. Guerreiro,Ricardo Rei,Amin Farajian,Mateusz Klimaszewski,Duarte M. Alves,José Pombal,Manuel Faysse,Pierre Colombo,François Yvon,Barry Haddow,José G. C. de Souza,Alexandra Birch,André F. T. Martins

Main category: cs.CL

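TL;DR: EuroLLM-9B is a large language model supporting all 24 official European Union languages and 11 additional languages, built to address the underrepresentation of European languages in existing open models.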

Details

Motivation: European languages are underrepresented in existing open large language models, and EuroLLM-9B aims to fill this gap. Method: Train the model using a purpose-built multilingual tokenizer, data filtering (EuroFilter), and a synthetic dataset (EuroBlocks-Synthetic). Result: The model performs strongly on multilingual benchmarks and machine translation tasks, making it the leading open European-made LLM of its size. Conclusion: EuroLLM-9B provides strong support for European languages, and all major components are released openly to promote research and adoption. Abstract: This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B's competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.

[219] TextAtari: 100K Frames Game Playing with Language Agents

Wenhao Li,Wenwu Li,Chuyun Shen,Junjie Sheng,Zixiao Huang,Di Wu,Yun Hua,Wei Yin,Xiangfeng Wang,Hongyuan Zha,Bo Jin

Main category: cs.CL

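TL;DR: TextAtari is a benchmark for evaluating language agents on long-horizon decision-making tasks (up to 100,000 steps); it converts the visual states of Atari games into text descriptions to create a challenging test bed.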

Details

Motivation: To study how language agents perform on long-horizon decision-making tasks and how semantic understanding, instruction comprehension, and expert demonstrations affect their decisions. Method: Uses the AtariARI framework to render nearly 100 Atari tasks as text and evaluates three language models (Qwen2.5-7B, Gemma-7B, Llama3.1-8B) under three agent frameworks (zero-shot, few-shot chain-of-thought, reflection reasoning). Result: Language agents trail human players by a significant margin on long-horizon planning tasks, especially in sequential reasoning, state tracking, and strategic planning. Conclusion: TextAtari provides standardized evaluation tools and baselines for research on language models and planning, and exposes the challenges language agents face on long-horizon tasks. Abstract: We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios (Basic, Obscured, Manual Augmentation, and Reference-based) investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning.

[220] Rectified Sparse Attention

Yutao Sun,Tianzhu Ye,Li Dong,Yuqing Xia,Jian Chen,Yizhao Gao,Shijie Cao,Jianyong Wang,Furu Wei

Main category: cs.CL

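TL;DR: ReSA combines block-sparse attention with periodic dense rectification to address KV cache misalignment, improving the efficiency of long-sequence generation while preserving quality.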

Details

Motivation: Sparse decoding methods suffer from KV cache misalignment, which lets errors accumulate and degrades generation quality. Method: Combines block-sparse attention with periodic dense rectification, refreshing the KV cache at fixed intervals to bound error accumulation. Result: On math reasoning, language modeling, and retrieval tasks, ReSA achieves near-lossless generation quality with substantially better efficiency, reaching up to a 2.42x decoding speedup. Conclusion: ReSA is a practical solution for long-context inference that balances efficiency and generation quality. Abstract: Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42$\times$ end-to-end speedup under decoding at 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at https://aka.ms/ReSA-LM.
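
The decode schedule described in this abstract can be illustrated with a toy sketch. It only tracks which steps are sparse and when a dense pass refreshes the cache; it is not the authors' implementation (see the linked code for that). The names `decode_schedule` and `rectify_every` are hypothetical, and the abstract does not state the interval length.

```python
def decode_schedule(num_new_tokens, rectify_every):
    """Label each decode step and track how many KV entries are still approximate.

    Assumption: each sparsely decoded token leaves one approximate KV entry, and
    a dense forward pass every `rectify_every` tokens recomputes all of them.
    """
    if rectify_every < 1:
        raise ValueError("rectify_every must be a positive integer")
    steps = []
    stale = 0  # tokens whose cached K/V came from sparse decoding
    for t in range(1, num_new_tokens + 1):
        stale += 1
        steps.append(("sparse", stale))
        if t % rectify_every == 0:
            # One dense pass over the pending tokens refreshes their cache entries.
            steps.append(("dense_rectify", stale))
            stale = 0
    return steps


schedule = decode_schedule(num_new_tokens=5, rectify_every=2)
max_stale = max(count for kind, count in schedule if kind == "sparse")
```

In this sketch the number of approximate entries never exceeds `rectify_every`, which mirrors the abstract's claim that periodic rectification bounds error accumulation.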

[221] CLAIM: An Intent-Driven Multi-Agent Framework for Analyzing Manipulation in Courtroom Dialogues

Disha Sheshanarayana,Tanishka Magar,Ayushi Mittal,Neelam Chaplot

Main category: cs.CL

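TL;DR: Introduces the LegalCon dataset and the CLAIM framework for detecting and analyzing manipulation in courtroom dialogues, with the aim of improving fairness and transparency in judicial processes.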

Details

Motivation: Manipulation in courtroom dialogue can sway judges' decisions, yet NLP has seen little application in this area. Method: Builds the LegalCon dataset (1,063 annotated conversations) and proposes CLAIM, a two-stage, intent-driven multi-agent framework. Result: The CLAIM framework analyzes manipulation effectively and shows potential to improve fairness and transparency in judicial decisions. Conclusion: The work offers a new tool for applying NLP in the judicial domain in support of fair decision-making. Abstract: Courtrooms are places where lives are determined and fates are sealed, yet they are not impervious to manipulation. Strategic use of manipulation in legal jargon can sway the opinions of judges and affect the decisions. Despite the growing advancements in NLP, its application in detecting and analyzing manipulation within the legal domain remains largely unexplored. Our work addresses this gap by introducing LegalCon, a dataset of 1,063 annotated courtroom conversations labeled for manipulation detection, identification of primary manipulators, and classification of manipulative techniques, with a focus on long conversations. Furthermore, we propose CLAIM, a two-stage, Intent-driven Multi-agent framework designed to enhance manipulation analysis by enabling context-aware and informed decision-making. Our results highlight the potential of incorporating agentic frameworks to improve fairness and transparency in judicial processes. We hope that this contributes to the broader application of NLP in legal discourse analysis and the development of robust tools to support fairness in legal decision-making. Our code and data are available at https://github.com/Disha1001/CLAIM.

[222] Are Lexicon-Based Tools Still the Gold Standard for Valence Analysis in Low-Resource Flemish?

Ratna Kandala,Katie Hoemann

Main category: cs.CL

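TL;DR: Examines how well LLMs capture the emotional valence of everyday narratives in Flemish and finds that, despite recent progress, they still fall short of the traditional tools LIWC and Pattern.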

Details

Motivation: Understanding the nuances of everyday language is crucial for computational linguistics and emotion research; traditional tools such as LIWC and Pattern are effective but do not fully capture spontaneity and context dependence. Method: Collected about 25,000 text responses from 102 Dutch-speaking participants, assessed three Dutch-specific LLMs on predicting valence, and compared them with LIWC and Pattern. Result: The Dutch LLMs performed poorly at capturing the valence of real-world narratives, underscoring the need for culturally and linguistically tailored models. Conclusion: More dataset building and LLM fine-tuning are needed for low-resource languages such as Flemish to bridge computational linguistics and emotion research. Abstract: Understanding the nuances in everyday language is pivotal for advancements in computational linguistics and emotion research. Traditional lexicon-based tools such as LIWC and Pattern have long served as foundational instruments in this domain. LIWC is the most extensively validated word-count-based text analysis tool in the social sciences, and Pattern is an open-source Python library offering functionalities for NLP. However, everyday language is inherently spontaneous, richly expressive, and deeply context dependent. To explore the capabilities of LLMs in capturing the valences of daily narratives in Flemish, we first conducted a study involving approximately 25,000 textual responses from 102 Dutch-speaking participants. Each participant provided narratives prompted by the question, "What is happening right now and how do you feel about it?", accompanied by self-assessed valence ratings on a continuous scale from -50 to +50. We then assessed the performance of three Dutch-specific LLMs in predicting these valence scores, and compared their outputs to those generated by LIWC and Pattern. Our findings indicate that, despite advancements in LLM architectures, these Dutch-tuned models currently fall short in accurately capturing the emotional valence present in spontaneous, real-world narratives. This study underscores the imperative for developing culturally and linguistically tailored models/tools that can adeptly handle the complexities of natural language use. Enhancing automated valence analysis is not only pivotal for advancing computational methodologies but also holds significant promise for psychological research with ecologically valid insights into human daily experiences. We advocate for increased efforts in creating comprehensive datasets and fine-tuning LLMs for low-resource languages like Flemish, aiming to bridge the gap between computational linguistics and emotion research.

[223] Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis

Kejian Zhu,Shangqing Tu,Zhuoran Jin,Lei Hou,Juanzi Li,Jun Zhao

Main category: cs.CL

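TL;DR: Proposes a new method that analyzes the internal mechanisms of contaminated models, identifies shortcut neurons, and suppresses them via patching, addressing data contamination in LLM evaluation.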

Details

Motivation: Current LLM evaluation relies on public benchmarks, where data contamination seriously undermines fairness. Building dynamic benchmarks is costly and cyclical, so the authors tackle contamination through the model's internal mechanisms. Method: Identifies shortcut neurons via comparative and causal analysis and proposes shortcut neuron patching to suppress them. Result: Experiments confirm the method's effectiveness; the Spearman coefficient with the trusted benchmark MixEval exceeds 0.95, indicating that the method accurately reveals models' true capabilities. Conclusion: The method generalizes broadly, mitigates data contamination, and makes evaluation more trustworthy. Abstract: The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching to suppress shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong linear correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient ($\rho$) exceeding 0.95. This high correlation indicates that our method closely reveals true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. Code: https://github.com/GaryStack/Trustworthy-Evaluation
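
For readers unfamiliar with the agreement metric quoted above, the following is a minimal pure-Python sketch of Spearman's $\rho$, computed as the Pearson correlation of average ranks. The score lists are made-up placeholders, not results from the paper, and the helper names are hypothetical. Spearman's $\rho$ measures monotonic rank agreement, so a value above 0.95 means the two evaluations order the models almost identically.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Placeholder scores for four hypothetical models under two evaluations.
patched_scores = [0.2, 0.5, 0.9, 0.7]
reference_scores = [1.0, 2.0, 4.0, 3.0]
rho = spearman_rho(patched_scores, reference_scores)
```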

Sarvesh Soni,Dina Demner-Fushman

Main category: cs.CL

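TL;DR: ArchEHR-QA is an expert-annotated dataset, built from real ICU and emergency department cases, for evaluating the accuracy and relevance of AI answers to patient questions grounded in electronic health records (EHRs).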

Details

Motivation: Patients have distinct information needs about their hospitalization, but no dataset exists to evaluate the accuracy and relevance of AI-generated answers. Method: Builds the ArchEHR-QA dataset from real patient cases and evaluates three LLMs (Llama 4, Llama 3, Mixtral) under different prompting strategies. Result: Llama 4 performed best with the answer-first prompting strategy, though errors included omitted key evidence and contradictory content. Conclusion: ArchEHR-QA provides a benchmark for developing patient-centered EHR question answering systems; the accuracy and relevance of AI-generated answers still need improvement. Abstract: Patients have distinct information needs about their hospitalization that can be addressed using clinical evidence from electronic health records (EHRs). While artificial intelligence (AI) systems show promise in meeting these needs, robust datasets are needed to evaluate the factual accuracy and relevance of AI-generated responses. To our knowledge, no existing dataset captures patient information needs in the context of their EHRs. We introduce ArchEHR-QA, an expert-annotated dataset based on real-world patient cases from intensive care unit and emergency department settings. The cases comprise questions posed by patients to public health forums, clinician-interpreted counterparts, relevant clinical note excerpts with sentence-level relevance annotations, and clinician-authored answers. To establish benchmarks for grounded EHR question answering (QA), we evaluated three open-weight large language models (LLMs), namely Llama 4, Llama 3, and Mixtral, across three prompting strategies: generating (1) answers with citations to clinical note sentences, (2) answers before citations, and (3) answers from filtered citations. We assessed performance on two dimensions: Factuality (overlap between cited note sentences and ground truth) and Relevance (textual and semantic similarity between system and reference answers). The final dataset contains 134 patient cases. The answer-first prompting approach consistently performed best, with Llama 4 achieving the highest scores. Manual error analysis supported these findings and revealed common issues such as omitted key clinical evidence and contradictory or hallucinated content. Overall, ArchEHR-QA provides a strong benchmark for developing and evaluating patient-centered EHR QA systems, underscoring the need for further progress toward generating factual and relevant responses in clinical contexts.

[225] SkipGPT: Dynamic Layer Pruning Reinvented with Token Awareness and Module Decoupling

Anhao Zhao,Fanghua Ye,Yingqi Fan,Junlong Tong,Zhiwei Fei,Hui Su,Xiaoyu Shen

Main category: cs.CL

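TL;DR: SkipGPT is a dynamic layer pruning framework that optimizes LLM compute allocation through global token-aware routing and decoupled pruning policies, removing over 40% of parameters while preserving performance.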

Details

Motivation: Large language models (LLMs) are computationally expensive, and static pruning methods ignore the dynamic nature of tokens and layer functions. Method: SkipGPT uses global token-aware routing and decoupled pruning policies for MLP and self-attention layers, trained with a two-stage optimization (soft-parameterized routing training followed by LoRA fine-tuning). Result: Experiments show that SkipGPT removes over 40% of parameters while matching or exceeding the original dense model. Conclusion: By balancing dynamic efficiency with expressivity, SkipGPT advances the practical deployment of resource-aware LLMs. Abstract: Large language models (LLMs) achieve remarkable performance across tasks but incur substantial computational costs due to their deep, multi-layered architectures. Layer pruning has emerged as a strategy to alleviate these inefficiencies, but conventional static pruning methods overlook two critical dynamics inherent to LLM inference: (1) horizontal dynamics, where token-level heterogeneity demands context-aware pruning decisions, and (2) vertical dynamics, where the distinct functional roles of MLP and self-attention layers necessitate component-specific pruning policies. We introduce SkipGPT, a dynamic layer pruning framework designed to optimize computational resource allocation through two core innovations: (1) global token-aware routing to prioritize critical tokens, and (2) decoupled pruning policies for MLP and self-attention components. To mitigate training instability, we propose a two-stage optimization paradigm: first, a disentangled training phase that learns routing strategies via soft parameterization to avoid premature pruning decisions, followed by parameter-efficient LoRA fine-tuning to restore performance impacted by layer removal. Extensive experiments demonstrate that SkipGPT reduces over 40% of model parameters while matching or exceeding the performance of the original dense model across benchmarks. By harmonizing dynamic efficiency with preserved expressivity, SkipGPT advances the practical deployment of scalable, resource-aware LLMs. Our code is publicly available at: https://github.com/EIT-NLP/SkipGPT.

[226] SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models

Yuhao Wu,Yushi Bai,Zhiqiang Hu,Juanzi Li,Roy Ka-Wei Lee

Main category: cs.CL

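TL;DR: The SuperWriter-Agent framework improves long-form text generation through structured thinking and hierarchical optimization; SuperWriter-LM performs strongly across multiple benchmarks.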

Details

Motivation: Large language models struggle to maintain coherence, logical consistency, and quality when generating long texts. Method: Proposes the SuperWriter-Agent framework, which combines structured thinking with hierarchical DPO, and uses it to train the 7B-parameter SuperWriter-LM. Result: SuperWriter-LM outperforms baseline models in both automatic and human evaluation, confirming the value of hierarchical DPO and structured thinking. Conclusion: Structured thinking and hierarchical optimization markedly improve long-form generation quality. Abstract: Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to enhance the quality and consistency of long-form text generation. SuperWriter-Agent introduces explicit structured thinking, through planning and refinement stages, into the generation pipeline, guiding the model to follow a more deliberate and cognitively grounded process akin to that of a professional writer. Based on this framework, we construct a supervised fine-tuning dataset to train a 7B SuperWriter-LM. We further develop a hierarchical Direct Preference Optimization (DPO) procedure that uses Monte Carlo Tree Search (MCTS) to propagate final quality assessments and optimize each generation step accordingly. Empirical results across diverse benchmarks demonstrate that SuperWriter-LM achieves state-of-the-art performance, surpassing even larger-scale baseline models in both automatic evaluation and human evaluation. Furthermore, comprehensive ablation studies demonstrate the effectiveness of hierarchical DPO and underscore the value of incorporating structured thinking steps to improve the quality of long-form text generation.

[227] Long or short CoT? Investigating Instance-level Switch of Large Reasoning Models

Ruiqi Zhang,Changyi Xiao,Yixin Cao

Main category: cs.CL

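TL;DR: Analyzes the performance and resource cost of long versus short chain-of-thought (CoT) prompting and proposes SwitchCoT, a framework that selects between them dynamically to balance reasoning accuracy and computational efficiency.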

Details

Motivation: As large reasoning models advance rapidly, long CoT prompting performs well on complex tasks but sharply increases token consumption. The study compares long and short CoT strategies and seeks a more efficient way to choose between them. Method: Compares long and short CoT through a comprehensive empirical analysis and proposes SwitchCoT, which selects a strategy dynamically according to task context and resource limits. Result: SwitchCoT cuts inference cost by up to 50% while maintaining high accuracy, and under limited token budgets it matches or exceeds using long or short CoT alone. Conclusion: SwitchCoT offers an effective way to select CoT strategies dynamically, improves resource efficiency, and applies across different resource constraints. Abstract: With the rapid advancement of large reasoning models, long Chain-of-Thought (CoT) prompting has demonstrated strong performance on complex tasks. However, this often comes with a significant increase in token usage. In this paper, we conduct a comprehensive empirical analysis comparing long and short CoT strategies. Our findings reveal that while long CoT can lead to performance improvements, its benefits are often marginal relative to its significantly higher token consumption. Specifically, long CoT tends to outperform when ample generation budgets are available, whereas short CoT is more effective under tighter budget constraints. These insights underscore the need for a dynamic approach that selects the proper CoT strategy based on task context and resource availability. To address this, we propose SwitchCoT, an automatic framework that adaptively chooses between long and short CoT strategies to balance reasoning accuracy and computational efficiency. Moreover, SwitchCoT is designed to be budget-aware, making it broadly applicable across scenarios with varying resource constraints. Experimental results demonstrate that SwitchCoT can reduce inference costs by up to 50% while maintaining high accuracy. Notably, under limited token budgets, it achieves performance comparable to, or even exceeding, that of using either long or short CoT alone.

[228] R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning

Qingfei Zhao,Ruobing Wang,Dingling Xu,Daren Zha,Limin Liu

Main category: cs.CL

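TL;DR: R-Search is a reinforcement learning framework that optimizes the interaction between reasoning and search in large language models (LLMs), improving response quality on complex tasks.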

Details

Motivation: Current LLMs handle the interplay of multi-step reasoning and search poorly and often fail to find optimal reasoning-search interaction trajectories. Method: Proposes the R-Search framework, which optimizes reasoning-search interaction through multi-reward signals and a dynamic decision mechanism. Result: On seven datasets, R-Search outperforms advanced RAG baselines by up to 32.2% (in-domain) and 25.1% (out-of-domain). Conclusion: R-Search substantially improves LLMs' reasoning-search interaction on complex tasks. Abstract: Large language models (LLMs) have notably progressed in multi-step and long-chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non-trivial challenge, as models often fail to identify optimal reasoning-search interaction trajectories, resulting in suboptimal responses. We propose R-Search, a novel reinforcement learning framework for Reasoning-Search integration, designed to enable LLMs to autonomously execute multi-step reasoning with deep search interaction, and learn optimal reasoning-search interaction trajectories via multi-reward signals, improving response quality in complex logic- and knowledge-intensive tasks. R-Search guides the LLM to dynamically decide when to retrieve or reason, while globally integrating key evidence to enhance deep knowledge interaction between reasoning and search. During RL training, R-Search provides multi-stage, multi-type rewards to jointly optimize the reasoning-search trajectory. Experiments on seven datasets show that R-Search outperforms advanced RAG baselines by up to 32.2% (in-domain) and 25.1% (out-of-domain). The code and data are available at https://github.com/QingFei1/R-Search.

[229] Efficient Knowledge Editing via Minimal Precomputation

Akshat Gupta,Maochuan Lu,Thomas Hartvigsen,Gopala Anumanchipalli

Main category: cs.CL

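TL;DR: Shows that the precomputation step of knowledge editing methods such as MEMIT can be cut drastically, to less than 0.3% of the original amount, saving substantial time and resources.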

Details

Motivation: The precomputation step of existing knowledge editing methods such as MEMIT requires substantial compute (about 44 million hidden vectors per edited layer), is slow, and grows with model size. The authors aim to show that this cost is unnecessary. Method: Derives the theoretical minimum number of precomputed hidden vectors these methods need and shows empirically that a tiny fraction (<0.3%) suffices. Result: Precomputation can be reduced to less than 0.3% of the original number of hidden vectors, cutting the time from tens of hours to a few minutes. Conclusion: The precomputation step of knowledge editing methods can be greatly reduced, making it fast to start editing new models. Abstract: Knowledge editing methods like MEMIT are able to make data- and compute-efficient updates of factual knowledge by using a single sentence to update facts and their consequences. However, what is often overlooked is a "precomputation step", which requires a one-time but significant computational cost. The authors of MEMIT originally precompute approximately 44 million hidden vectors per edited layer, which requires a forward pass over 44 million tokens. For GPT-J (6B), this precomputation step takes 36 hours on a single GPU, while it takes approximately 40 hours for Llama2-7B. Additionally, this precomputation time grows with model size. In this paper, we show that this excessive computational cost is unnecessary. Knowledge editing using MEMIT and related methods, such as ROME and EMMET, can be performed by pre-computing a very small portion of the 44 million hidden vectors. We first present the theoretical minimum number of hidden vector precomputation required for solutions of these editing methods to exist. We then empirically show that knowledge editing using these methods can be done by pre-computing significantly fewer hidden vectors. Specifically, we show that the precomputation step can be done with less than 0.3% of the originally stipulated number of hidden vectors. This saves a significant amount of precomputation time and allows users to begin editing new models within a few minutes.

cond-mat.stat-mech [Back]

[230] Dreaming up scale invariance via inverse renormalization group

Adam Rançon,Ulysse Rançon,Tomislav Ivek,Ivan Balog

Main category: cond-mat.stat-mech

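TL;DR: Explores how minimal neural networks can invert the renormalization group (RG) coarse-graining procedure in the two-dimensional Ising model, generating microscopic configurations from coarse-grained states.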

Details

Motivation: To explore whether machine learning models can reconstruct scale-invariant distributions probabilistically, without relying on microscopic input. Method: Trains neural networks with as few as three trainable parameters to generate critical configurations and checks whether they reproduce the scaling behavior of observables such as magnetic susceptibility and heat capacity. Result: Simple networks capture scale invariance and also reproduce nontrivial eigenvalues of the RG transformation; increasing network complexity brings no significant improvement. Conclusion: Simple local rules, akin to those generating fractal structures, suffice to encode the universality of critical phenomena, opening the way to efficient generative models of statistical ensembles in physics. Abstract: We explore how minimal neural networks can invert the renormalization group (RG) coarse-graining procedure in the two-dimensional Ising model, effectively "dreaming up" microscopic configurations from coarse-grained states. This task, formally impossible at the level of configurations, can be approached probabilistically, allowing machine learning models to reconstruct scale-invariant distributions without relying on microscopic input. We demonstrate that even neural networks with as few as three trainable parameters can learn to generate critical configurations, reproducing the scaling behavior of observables such as magnetic susceptibility, heat capacity, and Binder ratios. A real-space renormalization group analysis of the generated configurations confirms that the models capture not only scale invariance but also reproduce nontrivial eigenvalues of the RG transformation. Surprisingly, we find that increasing network complexity by introducing multiple layers offers no significant benefit. These findings suggest that simple local rules, akin to those generating fractal structures, are sufficient to encode the universality of critical phenomena, opening the door to efficient generative models of statistical ensembles in physics.
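
To make concrete what is being inverted, here is a minimal sketch of one forward real-space RG step on an Ising configuration. The abstract does not state which blocking rule the authors use; majority rule over 3x3 blocks is an assumption chosen for illustration, and `block_spin` is a hypothetical name.

```python
def block_spin(config, b=3):
    """Coarse-grain a square lattice of +1/-1 spins by majority rule over b x b blocks.

    Assumption: b is odd, so the block sum is never zero and no tie-breaking
    rule is needed.
    """
    n = len(config)
    if b % 2 == 0:
        raise ValueError("use an odd block size to avoid ties")
    if n % b != 0 or any(len(row) != n for row in config):
        raise ValueError("lattice must be square with side divisible by b")
    coarse = []
    for bi in range(0, n, b):
        row = []
        for bj in range(0, n, b):
            total = sum(config[i][j]
                        for i in range(bi, bi + b)
                        for j in range(bj, bj + b))
            row.append(1 if total > 0 else -1)
        coarse.append(row)
    return coarse


# A 6x6 lattice: left half mostly up, right half all down.
fine = [[1, 1, -1, -1, -1, -1] for _ in range(6)]
coarse = block_spin(fine)
```

Many fine configurations map to the same coarse one, which is why the inverse step can only be defined probabilistically, as the abstract notes.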

cs.RO [Back]

[231] Splatting Physical Scenes: End-to-End Real-to-Sim from Imperfect Robot Data

Ben Moran,Mauro Comi,Steven Bohez,Tom Erez,Zhibin Li,Leonard Hasenclever

Main category: cs.RO

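TL;DR: Proposes a hybrid scene representation combining 3D Gaussian Splatting with explicit object meshes; end-to-end optimization yields high-fidelity object reconstruction, photorealistic view synthesis, and annotation-free robot pose calibration.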

Details

Motivation: Real robot data suffers from occlusions, noisy camera poses, and dynamic scene elements, which hinder the creation of geometrically accurate, photorealistic digital twins. Method: Uses a hybrid scene representation combining 3D Gaussian Splatting with explicit object meshes, optimized end to end with differentiable rendering and the MuJoCo physics engine. Result: The method is validated in simulation and on real robot sequences, achieving high-fidelity object reconstruction and photorealistic view synthesis. Conclusion: The approach makes real-to-sim pipelines more practical and robust. Abstract: Creating accurate, physical simulations directly from real-world robot motion holds great value for safe, scalable, and affordable robot learning, yet remains exceptionally challenging. Real robot data suffers from occlusions, noisy camera poses, and dynamic scene elements, which hinder the creation of geometrically accurate and photorealistic digital twins of unseen objects. We introduce a novel real-to-sim framework tackling all these challenges at once. Our key insight is a hybrid scene representation merging the photorealistic rendering of 3D Gaussian Splatting with explicit object meshes suitable for physics simulation within a single representation. We propose an end-to-end optimization pipeline that leverages differentiable rendering and differentiable physics within MuJoCo to jointly refine all scene components - from object geometry and appearance to robot poses and physical parameters - directly from raw and imprecise robot trajectories. This unified optimization allows us to simultaneously achieve high-fidelity object mesh reconstruction, generate photorealistic novel views, and perform annotation-free robot pose calibration. We demonstrate the effectiveness of our approach both in simulation and on challenging real-world sequences using an ALOHA 2 bi-manual manipulator, enabling more practical and robust real-to-simulation pipelines.

[232] Pseudo-Simulation for Autonomous Driving

Wei Cao,Marcel Hallgarten,Tianyu Li,Daniel Dauner,Xunjiang Gu,Caojun Wang,Yakov Miron,Marco Aiello,Hongyang Li,Igor Gilitschenski,Boris Ivanovic,Marco Pavone,Andreas Geiger,Kashyap Chitta

Main category: cs.RO

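TL;DR: Proposes pseudo-simulation, a new paradigm that combines real data with synthetic observations to address the limitations of existing autonomous vehicle evaluation methods.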

Details

Motivation: Existing autonomous vehicle evaluation methods have problems with safety, reproducibility, realism, and computational cost, so a more efficient approach is needed. Method: Generates synthetic observations with 3D Gaussian Splatting and weights potential future states with a proximity-based scheme, avoiding the need for sequential interactive simulation. Result: Pseudo-simulation correlates better with closed-loop simulation (R^2=0.8) than the best existing open-loop approach (R^2=0.7). Conclusion: Pseudo-simulation is an efficient and accurate evaluation method that can assess error recovery and the mitigation of causal confusion in autonomous vehicles. Abstract: Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of reproducibility, whereas closed-loop simulation can face insufficient realism or high computational costs. Open-loop evaluation, while being efficient and data-driven, relies on metrics that generally overlook compounding errors. In this paper, we propose pseudo-simulation, a novel paradigm that addresses these limitations. Pseudo-simulation operates on real datasets, similar to open-loop evaluation, but augments them with synthetic observations generated prior to evaluation using 3D Gaussian Splatting. Our key idea is to approximate potential future states the AV might encounter by generating a diverse set of observations that vary in position, heading, and speed. Our method then assigns a higher importance to synthetic observations that best match the AV's likely behavior using a novel proximity-based weighting scheme. This enables evaluating error recovery and the mitigation of causal confusion, as in closed-loop benchmarks, without requiring sequential interactive simulation. We show that pseudo-simulation is better correlated with closed-loop simulations (R^2=0.8) than the best existing open-loop approach (R^2=0.7). We also establish a public leaderboard for the community to benchmark new methodologies with pseudo-simulation. Our code is available at https://github.com/autonomousvision/navsim.

[233] Object-centric 3D Motion Field for Robot Learning from Human Videos

Zhao-Heng Yin,Sherry Yang,Pieter Abbeel

Main category: cs.RO

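TL;DR: Proposes an action representation based on object-centric 3D motion fields for learning robot control policies from human videos, substantially improving 3D motion estimation accuracy and task success rate.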

Details

Motivation: Existing action representations (video frames, pixel flow, and others) are either complex to model or lose information, so a more efficient representation is needed. Method: Proposes an object-centric 3D motion field representation, with a denoising 3D motion field estimator and a dense object-centric 3D motion field prediction architecture. Result: The method reduces 3D motion estimation error by over 50% and reaches a 55% average task success rate where prior approaches fail (<10%). Conclusion: Object-centric 3D motion fields are an efficient action representation for learning robot control policies from human videos. Abstract: Learning robot control policies from human videos is a promising direction for scaling up robot learning. However, how to extract action knowledge (or action representations) from videos for policy learning remains a key challenge. Existing action representations such as video frames, pixel flow, and point cloud flow have inherent limitations such as modeling complexity or loss of information. In this paper, we propose to use an object-centric 3D motion field to represent actions for robot learning from human videos, and present a novel framework for extracting this representation from videos for zero-shot control. We introduce two novel components in its implementation. First, a novel training pipeline for training a "denoising" 3D motion field estimator to robustly extract fine object 3D motions from human videos with noisy depth. Second, a dense object-centric 3D motion field prediction architecture that favors both cross-embodiment transfer and policy generalization to background. We evaluate the system in real-world setups. Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method, achieves a 55% average success rate in diverse tasks where prior approaches fail ($\lesssim 10\%$), and can even acquire fine-grained manipulation skills like insertion.

physics.med-ph [Back]

[234] Analytical Reconstruction of Periodically Deformed Objects in Time-resolved CT

Qianwei Qu,Christian M. Schlepütz,Marco Stampanoni

Main category: physics.med-ph

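TL;DR: Proposes two new reconstruction methods for time-resolved CT that overcome the inefficient radiation dose usage of conventional gating-based methods, validated experimentally.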

Details

Motivation: Conventional gating-based methods use only part of the projection data for time-periodic reconstruction and ignore correlations between collections, which wastes radiation dose. Method: Proposes two analytical reconstruction pipelines that use all projections and account for the correlation between collections. Result: The new methods significantly reduce random noise in the reconstructed images while preserving sharp features, and reach the same reconstruction quality as gating-based methods at a lower radiation dose. Conclusion: The new methods are more efficient and practical for time-resolved CT reconstruction, and the code is open source. Abstract: Time-resolved CT is an advanced measurement technique that has been widely used to observe dynamic objects, including periodically varying structures such as hearts, lungs, or hearing structures. To reconstruct these objects from CT projections, a common approach is to divide the projections into several collections based on their motion phases and perform reconstruction within each collection, assuming they originate from a static object. This describes the gating-based method, which is the standard approach for time-periodic reconstruction. However, the gating-based reconstruction algorithm only utilizes a limited subset of projections within each collection and ignores the correlation between different collections, leading to inefficient use of the radiation dose. To address this issue, we propose two analytical reconstruction pipelines in this paper, and validate them with experimental data captured using tomographic synchrotron microscopy. We demonstrate that our approaches significantly reduce random noise in the reconstructed images without blurring the sharp features of the observed objects. Equivalently, our methods can achieve the same reconstruction quality as gating-based methods but with a lower radiation dose. Our code is available at github.com/PeriodRecon.
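
The gating baseline described in this abstract can be sketched as phase binning. This is a minimal illustration of the conventional method the paper improves on, not the proposed pipelines; the function name, the millisecond timestamps, and the fixed known period are all assumptions.

```python
def gate_projections(timestamps_ms, period_ms, n_bins):
    """Group projection indices by motion phase (the gating step).

    Assumption: the motion is strictly periodic with a known period, and each
    projection carries an acquisition timestamp in milliseconds.
    """
    if period_ms <= 0 or n_bins < 1:
        raise ValueError("period_ms and n_bins must be positive")
    bins = [[] for _ in range(n_bins)]
    for idx, t in enumerate(timestamps_ms):
        phase = (t % period_ms) / period_ms          # in [0, 1)
        b = min(int(phase * n_bins), n_bins - 1)     # guard against rounding up
        bins[b].append(idx)
    return bins


# Six projections acquired every 250 ms while the object repeats every 1000 ms.
collections = gate_projections([0, 250, 500, 750, 1000, 1250],
                               period_ms=1000, n_bins=4)
```

Each collection would then be reconstructed separately as if the object were static, so every reconstruction sees only about `1 / n_bins` of the projections; this is the dose inefficiency the abstract points to.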

[235] Personalized MR-Informed Diffusion Models for 3D PET Image Reconstruction

George Webber,Alexander Hammers,Andrew P. King,Andrew J. Reader

Main category: physics.med-ph

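TL;DR: Proposes a method for generating subject-specific PET images from multi-subject PET-MR scans: image registration synthesizes "pseudo-PET" images, which improve reconstruction accuracy.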

Details

Motivation: Existing methods reconstruct PET images with pre-trained diffusion models but lack subject-specific data. This work synthesizes subject-specific PET images to improve reconstruction from low-count data. Method: Uses image registration to synthesize subject-specific "pseudo-PET" images from multi-subject PET-MR data, then pre-trains a personalized diffusion model on them. Result: On simulated and real [$^{18}$F]FDG datasets, the method markedly improves reconstruction accuracy with low-count data while retaining PET-unique image features. Conclusion: The method generates subject-specific PET images for medical imaging tasks without relying on generative deep learning or large training datasets. Abstract: Recent work has shown improved lesion detectability and flexibility to reconstruction hyperparameters (e.g. scanner geometry or dose level) when PET images are reconstructed by leveraging pre-trained diffusion models. Such methods train a diffusion model (without sinogram data) on high-quality, but still noisy, PET images. In this work, we propose a simple method for generating subject-specific PET images from a dataset of multi-subject PET-MR scans, synthesizing "pseudo-PET" images by transforming between different patients' anatomy using image registration. The images we synthesize retain information from the subject's MR scan, leading to higher resolution and the retention of anatomical features compared to the original set of PET images. With simulated and real [$^{18}$F]FDG datasets, we show that pre-training a personalized diffusion model with subject-specific "pseudo-PET" images improves reconstruction accuracy with low-count data. In particular, the method shows promise in combining information from a guidance MR scan without overly imposing anatomical features, demonstrating an improved trade-off between reconstructing PET-unique image features versus features present in both PET and MR. We believe this approach for generating and utilizing synthetic data has further applications to medical imaging tasks, particularly because patient-specific PET images can be generated without resorting to generative deep learning or large training datasets.

eess.SY

[236] Urban Visibility Hotspots: Quantifying Building Vertex Visibility from Connected Vehicle Trajectories using Spatial Indexing

Artur Grigorev,Adriana-Simona Mihaita

Main category: eess.SY

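TL;DR: The paper proposes a data-driven method that analyzes large-scale vehicle trajectory data to quantify the best locations for outdoor advertising and street furniture, identifying hotspots with high visual exposure.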

Details

Motivation: Traditional site selection relies on static traffic data or subjective assessment and lacks objectivity. The study aims to quantify a location's visual exposure more accurately using dynamic vehicle trajectory data. Method: Models the dynamic driver field of view from vehicle trajectory data, combines it with building vertex data from OpenStreetMap, and builds a BallTree spatial index to compute visual exposure efficiently. Result: Visual exposure is highly concentrated, with distinct hotspots; visibility counts follow a log-normal distribution. Conclusion: The data-driven method identifies high-exposure locations more objectively and provides a scientific basis for outdoor advertising site selection. Abstract: Effective placement of Out-of-Home advertising and street furniture requires accurate identification of locations offering maximum visual exposure to target audiences, particularly vehicular traffic. Traditional site selection methods often rely on static traffic counts or subjective assessments. This research introduces a data-driven methodology to objectively quantify location visibility by analyzing large-scale connected vehicle trajectory data (sourced from Compass IoT) within urban environments. We model the dynamic driver field-of-view using a forward-projected visibility area for each vehicle position derived from interpolated trajectories. By integrating this with building vertex locations extracted from OpenStreetMap, we quantify the cumulative visual exposure, or "visibility count", for thousands of potential points of interest near roadways. The analysis reveals that visibility is highly concentrated, identifying specific "visual hotspots" that receive disproportionately high exposure compared to average locations. The core technical contribution involves the construction of a BallTree spatial index over building vertices. This enables highly efficient (O(log N) complexity) radius queries to determine which vertices fall within the viewing circles of millions of trajectory points across numerous trips, significantly outperforming brute-force geometric checks. Analysis reveals two key findings: 1) Visibility is highly concentrated, identifying distinct "visual hotspots" receiving disproportionately high exposure compared to average locations. 2) The aggregated visibility counts across vertices conform to a Log-Normal distribution.
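The radius-query step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn's `BallTree` with the haversine metric, replaces the forward-projected visibility area with a plain viewing circle, and uses made-up coordinates and a hypothetical `radius_m` parameter.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, converts metres to radians


def visibility_counts(vertices_deg, trajectory_deg, radius_m):
    """Count, for every building vertex, how many trajectory points see it.

    vertices_deg, trajectory_deg: arrays of shape (n, 2) with (lat, lon) in degrees.
    radius_m: radius of the viewing circle in metres (hypothetical parameter).
    """
    # The haversine metric expects (lat, lon) pairs in radians.
    tree = BallTree(np.radians(vertices_deg), metric="haversine")
    # One radius query per trajectory point; each result is an array of vertex indices.
    hits = tree.query_radius(np.radians(trajectory_deg), r=radius_m / EARTH_RADIUS_M)
    counts = np.zeros(len(vertices_deg), dtype=int)
    for idx in hits:
        counts[idx] += 1
    return counts


# Illustrative coordinates only: two nearby vertices and one distant vertex.
vertices = np.array([[-33.8700, 151.2100], [-33.8710, 151.2100], [-33.9000, 151.3000]])
trajectory = np.array([[-33.8701, 151.2100], [-33.8705, 151.2100], [-33.8709, 151.2100]])
counts = visibility_counts(vertices, trajectory, radius_m=60.0)
print(counts)
```

Indexing the vertices rather than the trajectory points keeps the tree small and fixed while the much larger set of trajectory points streams through as queries.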

eess.IV

[237] Adaptive and Robust Image Processing on CubeSats

Robert Bayer,Julian Priest,Daniel Kjellberg,Jeppe Lindhard,Nikolaj Sørenesen,Nicolaj Valsted,Ívar Óli,Pınar Tözün

Main category: eess.IV

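TL;DR: The paper presents two systems, DIPP and DISH, that address the flexibility and complexity of image processing pipelines on CubeSats. DIPP is a modular framework that supports post-deployment adjustment; DISH is a domain-specific language that optimizes resource scheduling. Experiments show that both are efficient and add little overhead.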

Details

Motivation: CubeSats are resource-constrained and hard to adjust after deployment, which limits the flexibility and complexity of their image processing pipelines. Method: DIPP provides a modular framework that supports dynamic adjustment; DISH is a domain-specific language and runtime system that optimizes resource scheduling. Result: DIPP reduces network requirements and is robust; DISH needs less memory than Lua with comparable expressiveness. Conclusion: DIPP and DISH give CubeSats an efficient, flexible solution. Abstract: CubeSats offer a low-cost platform for space research, particularly for Earth observation. However, their resource-constrained nature and their operation in space challenge the flexibility and complexity of the deployed image processing pipelines and their orchestration. This paper introduces two novel systems, DIPP and DISH, to address these challenges. DIPP is a modular and configurable image processing pipeline framework that allows for adaptability to changing mission goals even after deployment, while preserving robustness. DISH is a domain-specific language (DSL) and runtime system designed to schedule complex imaging workloads on low-power and memory-constrained processors. Our experiments demonstrate that DIPP's decomposition of the processing pipelines adds negligible overhead, while significantly reducing the network requirements of updating pipelines and being robust against erroneous module uploads. Furthermore, we compare DISH to Lua, a general purpose scripting language, and demonstrate its comparable expressiveness and lower memory requirement.

[238] Super-temporal-resolution Photoacoustic Imaging with Dynamic Reconstruction through Implicit Neural Representation in Sparse-view

Youshen Xiao,Yiling Shi,Ruixi Sun,Hongjiang Wei,Fei Gao,Yuyao Zhang

Main category: eess.IV

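TL;DR: Proposes a dynamic photoacoustic image reconstruction method based on implicit neural representation (INR) that handles reconstruction from sparse sensor data and improves temporal resolution.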

Details

Motivation: Traditional photoacoustic reconstruction methods produce severe artifacts on sparse data and ignore inter-frame relationships in dynamic imaging. The low repetition rate and high cost of high-power laser technology limit temporal resolution. Method: Uses INR to represent dynamic photoacoustic images as implicit functions encoded in a neural network. The network weights are learned solely from sparse sensor data, with no external training data or prior images, combined with low-rank and sparsity regularization. Result: Under two sparsity conditions, the method outperforms traditional reconstruction methods, effectively suppressing artifacts and ensuring image quality. Conclusion: The INR approach provides high-quality reconstruction and improved temporal resolution for dynamic photoacoustic imaging with sparse data. Abstract: Dynamic Photoacoustic Computed Tomography (PACT) is an important imaging technique for monitoring physiological processes, capable of providing high-contrast images of optical absorption at much greater depths than traditional optical imaging methods. However, practical instrumentation and geometric constraints limit the number of acoustic sensors available around the imaging target, leading to sparsity in sensor data. Traditional photoacoustic (PA) image reconstruction methods, when directly applied to sparse PA data, produce severe artifacts. Additionally, these traditional methods do not consider the inter-frame relationships in dynamic imaging. Temporal resolution is crucial for dynamic photoacoustic imaging, which is fundamentally limited by the low repetition rate (e.g., 20 Hz) and high cost of high-power laser technology. Recently, Implicit Neural Representation (INR) has emerged as a powerful deep learning tool for solving inverse problems with sparse data, by characterizing signal properties as continuous functions of their coordinates in an unsupervised manner. In this work, we propose an INR-based method to improve dynamic photoacoustic image reconstruction from sparse views and enhance temporal resolution, using only spatiotemporal coordinates as input. Specifically, the proposed INR represents dynamic photoacoustic images as implicit functions and encodes them into a neural network. The weights of the network are learned solely from the acquired sparse sensor data, without the need for external training datasets or prior images. Benefiting from the strong implicit continuity regularization provided by INR, as well as explicit regularization for low-rank and sparsity, our proposed method outperforms traditional reconstruction methods under two different sparsity conditions, effectively suppressing artifacts and ensuring image quality.

[239] Deep Learning-Based Breast Cancer Detection in Mammography: A Multi-Center Validation Study in Thai Population

Isarun Chamveha,Supphanut Chaiyungyuen,Sasinun Worakriangkrai,Nattawadee Prasawang,Warasinee Chaisangmongkon,Pornpim Korpraphong,Voraparee Suvannarerg,Shanigarn Thiravit,Chalermdej Kannawat,Kewalin Rungsinaporn,Suwara Issaragrisil,Payia Chadbunchachai,Pattiya Gatechumpol,Chawiporn Muktabhant,Patarachai Sereerat

Main category: eess.IV

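TL;DR: The study presents a deep learning system based on a modified EfficientNetV2 architecture for breast cancer detection in mammography and validates its performance on multiple datasets.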

Details

Motivation: To improve the accuracy and efficiency of breast cancer screening and reduce clinicians' workload. Method: Uses a modified EfficientNetV2 architecture with enhanced attention mechanisms, trained on mammography data from a Thai medical center and validated on three distinct datasets. Result: The model achieves cancer-detection AUROCs of 0.89, 0.96, and 0.94, shows robust lesion localization, and agrees closely with radiologists in clinical validation. Conclusion: The system performs well in assisting mammogram interpretation and could improve clinical breast cancer screening workflows. Abstract: This study presents a deep learning system for breast cancer detection in mammography, developed using a modified EfficientNetV2 architecture with enhanced attention mechanisms. The model was trained on mammograms from a major Thai medical center and validated on three distinct datasets: an in-domain test set (9,421 cases), a biopsy-confirmed set (883 cases), and an out-of-domain generalizability set (761 cases) collected from two different hospitals. For cancer detection, the model achieved AUROCs of 0.89, 0.96, and 0.94 on the respective datasets. The system's lesion localization capability, evaluated using metrics including Lesion Localization Fraction (LLF) and Non-Lesion Localization Fraction (NLF), demonstrated robust performance in identifying suspicious regions. Clinical validation through concordance tests showed strong agreement with radiologists: 83.5% classification and 84.0% localization concordance for biopsy-confirmed cases, and 78.1% classification and 79.6% localization concordance for out-of-domain cases. Expert radiologists' acceptance rate also averaged 96.7% for biopsy-confirmed cases, and 89.3% for out-of-domain cases. The system achieved a System Usability Scale score of 74.17 for the source hospital, and 69.20 for the validation hospitals, indicating good clinical acceptance. These results demonstrate the model's effectiveness in assisting mammogram interpretation, with the potential to enhance breast cancer screening workflows in clinical practice.

[240] LLaMA-XR: A Novel Framework for Radiology Report Generation using LLaMA and QLoRA Fine Tuning

Md. Zihad Bin Jahangir,Muhammad Ashad Kabir,Sumaiya Akter,Israt Jahan,Minh Chau

Main category: eess.IV

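TL;DR: LLaMA-XR is a novel framework that combines LLaMA 3.1 with DenseNet-121 image embeddings and uses QLoRA fine-tuning to improve the accuracy and efficiency of radiology report generation.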

Details

Motivation: Automated radiology report generation can reduce radiologists' workload and improve diagnostic accuracy, but existing models fall short in accuracy and contextual relevance. Method: LLaMA-XR integrates LLaMA 3.1 with DenseNet-121 image embeddings and applies QLoRA fine-tuning to optimize parameter utilization and memory overhead. Result: On the IU X-ray dataset, LLaMA-XR achieves ROUGE-L and METEOR scores of 0.433 and 0.336, outperforming existing methods. Conclusion: LLaMA-XR shows potential for efficient and reliable automated radiology report generation. Abstract: Automated radiology report generation holds significant potential to reduce radiologists' workload and enhance diagnostic accuracy. However, generating precise and clinically meaningful reports from chest radiographs remains challenging due to the complexity of medical language and the need for contextual understanding. Existing models often struggle with maintaining both accuracy and contextual relevance. In this paper, we present LLaMA-XR, a novel framework that integrates LLaMA 3.1 with DenseNet-121-based image embeddings and Quantized Low-Rank Adaptation (QLoRA) fine-tuning. LLaMA-XR achieves improved coherence and clinical accuracy while maintaining computational efficiency. This efficiency is driven by an optimization strategy that enhances parameter utilization and reduces memory overhead, enabling faster report generation with lower computational resource demands. Extensive experiments conducted on the IU X-ray benchmark dataset demonstrate that LLaMA-XR outperforms a range of state-of-the-art methods. Our model achieves a ROUGE-L score of 0.433 and a METEOR score of 0.336, establishing new performance benchmarks in the domain. These results underscore LLaMA-XR's potential as an effective and efficient AI system for automated radiology reporting, offering enhanced clinical utility and reliability.

[241] Dc-EEMF: Pushing depth-of-field limit of photoacoustic microscopy via decision-level constrained learning

Wangting Zhou,Jiangshan He,Tong Cai,Lin Wang,Zhen Yuan,Xunbin Wei,Xueli Chen

Main category: eess.IV

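TL;DR: Proposes a decision-level constrained end-to-end multi-focus image fusion method (Dc-EEMF) that overcomes the depth-of-field limit of optical-resolution photoacoustic microscopy (OR-PAM) and improves image fusion.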

Details

Motivation: Conventional OR-PAM is limited by the narrow depth of field of a Gaussian beam and cannot resolve sufficient detail in the depth direction, which hampers label-free biomarker measurement in biomedical research. Method: Uses a lightweight Siamese network with an artifact-resistant channel-wise spatial frequency feature fusion rule and a U-Net-based perceptual loss function, trained end to end. Result: Experiments and numerical analyses show that the method significantly improves PAM image fusion while preserving lateral resolution. Conclusion: Dc-EEMF could become a practical tool for preclinical and clinical studies that require extended depth of field. Abstract: Photoacoustic microscopy holds the potential to measure biomarkers' structural and functional status without labels, which significantly aids in comprehending pathophysiological conditions in biomedical research. However, conventional optical-resolution photoacoustic microscopy (OR-PAM) is hindered by a limited depth-of-field (DoF) due to the narrow depth range over which a Gaussian beam remains focused. Consequently, it fails to resolve sufficient details in the depth direction. Herein, we propose a decision-level constrained end-to-end multi-focus image fusion (Dc-EEMF) to push the DoF limit of PAM. The Dc-EEMF method is a lightweight siamese network that incorporates an artifact-resistant channel-wise spatial frequency as its feature fusion rule. The meticulously crafted U-Net-based perceptual loss function for decision-level focus properties in end-to-end fusion seamlessly integrates the complementary advantages of spatial domain and transform domain methods within Dc-EEMF. This approach can be trained end-to-end without necessitating post-processing procedures. Experimental results and numerical analyses collectively demonstrate our method's robust performance, achieving an impressive fusion result for PAM images without a substantial sacrifice in lateral resolution. The utilization of Dc-EEMF-powered PAM has the potential to serve as a practical tool in preclinical and clinical studies requiring extended DoF for various applications.
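For readers unfamiliar with spatial frequency as a focus measure, the sketch below shows the classical scalar version on image patches. It is only an illustration under stated assumptions: the paper applies a channel-wise variant to learned features inside a Siamese network, whereas this code works on raw 2D patches, normalizes by the mean of squared differences, and uses the hypothetical helper names `spatial_frequency` and `fuse_by_sharpness`.

```python
import numpy as np


def spatial_frequency(patch):
    """Classical spatial frequency: sqrt(row_frequency^2 + column_frequency^2)."""
    patch = np.asarray(patch, dtype=float)
    row_freq_sq = np.mean(np.diff(patch, axis=1) ** 2)  # horizontal intensity changes
    col_freq_sq = np.mean(np.diff(patch, axis=0) ** 2)  # vertical intensity changes
    return float(np.sqrt(row_freq_sq + col_freq_sq))


def fuse_by_sharpness(patch_a, patch_b):
    """Decision-level rule: keep whichever patch has the higher spatial frequency."""
    if spatial_frequency(patch_a) >= spatial_frequency(patch_b):
        return patch_a
    return patch_b


sharp = np.indices((4, 4)).sum(axis=0) % 2  # in-focus stand-in: 0/1 checkerboard
blurred = np.full((4, 4), 0.5)              # out-of-focus stand-in: flat patch
fused = fuse_by_sharpness(blurred, sharp)
print(spatial_frequency(sharp), spatial_frequency(blurred))
```

A patch that is in focus carries more high-frequency detail, so a higher value marks the source that should win at that location.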

[242] Edge Computing for Physics-Driven AI in Computational MRI: A Feasibility Study

Yaşar Utku Alçalar,Yu Cao,Mehmet Akçakaya

Main category: eess.IV

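TL;DR: The paper proposes a PD-AI MRI reconstruction method optimized for FPGA edge computing; 8-bit complex data quantization and fewer FFT/IFFT operations improve computational efficiency while maintaining reconstruction quality.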

Details

Motivation: High-resolution MRI scans generate massive data volumes, creating challenges in transmission, storage, and real-time processing, especially in functional MRI. Edge computing with FPGAs offers a possible solution. Method: Proposes an optimized PD-AI computational MRI approach that uses 8-bit complex data quantization and eliminates redundant FFT/IFFT operations to suit FPGA hardware. Result: The method is more computationally efficient than conventional PD-AI methods while maintaining reconstruction quality, and it outperforms standard clinical methods. Conclusion: The method offers a feasible route to high-resolution MRI reconstruction on resource-constrained devices, with potential for real-world deployment. Abstract: Physics-driven artificial intelligence (PD-AI) reconstruction methods have emerged as the state-of-the-art for accelerating MRI scans, enabling higher spatial and temporal resolutions. However, the high resolution of these scans generates massive data volumes, leading to challenges in transmission, storage, and real-time processing. This is particularly pronounced in functional MRI, where hundreds of volumetric acquisitions further exacerbate these demands. Edge computing with FPGAs presents a promising solution for enabling PD-AI reconstruction near the MRI sensors, reducing data transfer and storage bottlenecks. However, this requires optimization of PD-AI models for hardware efficiency through quantization and bypassing traditional FFT-based approaches, which can be a limitation due to their computational demands. In this work, we propose a novel PD-AI computational MRI approach optimized for FPGA-based edge computing devices, leveraging 8-bit complex data quantization and eliminating redundant FFT/IFFT operations. Our results show that this strategy improves computational efficiency while maintaining reconstruction quality comparable to conventional PD-AI methods, and outperforms standard clinical methods. Our approach presents an opportunity for high-resolution MRI reconstruction on resource-constrained devices, highlighting its potential for real-world deployment.
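To make "8-bit complex data quantization" concrete, here is a minimal sketch of one common scheme: symmetric per-tensor quantization with a single scale shared by the real and imaginary parts. The abstract does not specify the authors' scheme, so the shared scale, the [-127, 127] range, and the function names are all assumptions.

```python
import numpy as np


def quantize_complex_int8(x):
    """Quantize a complex array to int8 real/imag parts with one shared scale."""
    peak = max(np.abs(x.real).max(), np.abs(x.imag).max())
    scale = float(peak) / 127.0 if peak > 0 else 1.0
    q_real = np.clip(np.round(x.real / scale), -127, 127).astype(np.int8)
    q_imag = np.clip(np.round(x.imag / scale), -127, 127).astype(np.int8)
    return q_real, q_imag, scale


def dequantize_complex_int8(q_real, q_imag, scale):
    """Map the int8 pair back to a complex array."""
    return (q_real.astype(np.float64) + 1j * q_imag.astype(np.float64)) * scale


rng = np.random.default_rng(0)
kspace = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))  # synthetic complex samples
q_real, q_imag, scale = quantize_complex_int8(kspace)
restored = dequantize_complex_int8(q_real, q_imag, scale)
max_error = np.max(np.abs(kspace - restored))
print(q_real.dtype, q_imag.dtype, max_error <= scale)
```

Each complex sample then occupies two bytes instead of the sixteen used by a double-precision complex value, and rounding costs at most half a quantization step per component.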

[243] DLiPath: A Benchmark for the Comprehensive Assessment of Donor Liver Based on Histopathological Image Dataset

Liangrui Pan,Xingchen Li,Zhongyi Chen,Ling Chu,Shaoliang Peng

Main category: eess.IV

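TL;DR: DLiPath is a benchmark for donor liver assessment based on a histopathology image dataset, designed to address variability in donor liver biopsy evaluation.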

Details

Motivation: Fast and accurate assessment of donor liver biopsies is critical to transplant outcomes, but existing practice suffers from large inter- and intra-observer variability. Method: Collected and publicly released 636 whole slide images from 304 donor liver patients, annotated with key pathological features, and evaluated nine multiple-instance learning models on the dataset. Result: Experiments show that several multiple-instance learning models achieve high accuracy on DLiPath. Conclusion: DLiPath sets a clear direction for future research on automated, intelligent donor liver assessment; data and code are publicly available. Abstract: Pathologists' comprehensive evaluation of donor liver biopsies provides crucial information for accepting or discarding potential grafts. However, rapidly and accurately obtaining these assessments intraoperatively poses a significant challenge for pathologists. Features in donor liver biopsies, such as portal tract fibrosis, total steatosis, macrovesicular steatosis, and hepatocellular ballooning are correlated with transplant outcomes, yet quantifying these indicators suffers from substantial inter- and intra-observer variability. To address this, we introduce DLiPath, the first benchmark for comprehensive donor liver assessment based on a histopathology image dataset. We collected and publicly released 636 whole slide images from 304 donor liver patients at the Department of Pathology, the Third Xiangya Hospital, with expert annotations for key pathological features (including cholestasis, portal tract fibrosis, portal inflammation, total steatosis, macrovesicular steatosis, and hepatocellular ballooning). We selected nine state-of-the-art multiple-instance learning (MIL) models based on the DLiPath dataset as baselines for extensive comparative analysis. The experimental results demonstrate that several MIL models achieve high accuracy across donor liver assessment indicators on DLiPath, charting a clear course for future automated and intelligent donor liver assessment research. Data and code are available at https://github.com/panliangrui/ACM_MM_2025.

[244] Lightweight Convolutional Neural Networks for Retinal Disease Classification

Duaa Kareem Qasim,Sabah Abdulazeez Jebur,Lafta Raheem Ali,Abdul Jalil M. Khalaf,Abir Jaafar Hussain

Main category: eess.IV

TL;DR: The paper uses the lightweight CNN architectures MobileNet and NASNetMobile to classify diabetic retinopathy (DR) and macular hole (MH); MobileNetV2 performs best, with 90.8% accuracy.

Details

Motivation: DR and MH severely affect vision, so early detection is crucial. AI-assisted diagnosis can offer an efficient solution. Method: Uses MobileNet and NASNetMobile models on the RFMiD dataset (3,200 fundus images), with preprocessing, transfer learning, and data augmentation to improve performance. Result: MobileNetV2 reaches 90.8% accuracy, ahead of NASNetMobile at 89.5%. Conclusion: Lightweight CNNs perform well in retinal disease classification and lay a foundation for AI-assisted ophthalmic diagnosis. Abstract: Retinal diseases such as Diabetic Retinopathy (DR) and Macular Hole (MH) significantly impact vision and affect millions worldwide. Early detection is crucial, as DR, a complication of diabetes, damages retinal blood vessels, potentially leading to blindness, while MH disrupts central vision, affecting tasks like reading and facial recognition. This paper employed two lightweight and efficient Convolutional Neural Network architectures, MobileNet and NASNetMobile, for the classification of Normal, DR, and MH retinal images. The models were trained on the RFMiD dataset, consisting of 3,200 fundus images, after undergoing preprocessing steps such as resizing, normalization, and augmentation. To address data scarcity, this study leveraged transfer learning and data augmentation techniques, enhancing model generalization and performance. The experimental results demonstrate that MobileNetV2 achieved the highest accuracy of 90.8%, outperforming NASNetMobile, which achieved 89.5% accuracy. These findings highlight the effectiveness of CNNs in retinal disease classification, providing a foundation for AI-assisted ophthalmic diagnosis and early intervention.

[245] Multi-Analyte, Swab-based Automated Wound Monitor with AI

Madhu Babu Sikha,Lalith Appari,Gurudatt Nanjanagudu Ganesh,Amay Bandodkar,Imon Banerjee

Main category: eess.IV

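TL;DR: Develops low-cost, multi-analyte 3D-printed assays and an iOS app for early identification of non-healing diabetic foot ulcers (DFUs) and real-time monitoring of wound condition.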

Details

Motivation: Diabetic foot ulcers (DFUs) affect many patients every year; identifying non-healing ulcers early can reduce treatment costs and amputation risk. Method: Combines 3D-printed assays and an iOS app with computer vision techniques to assess wound severity automatically. Result: Enables real-time monitoring and automated analysis of wound condition, overcoming challenges from varying camera configurations and ambient conditions. Conclusion: The integrated sensor and iOS app give healthcare professionals a tool to monitor wounds in real time and assess key parameters. Abstract: Diabetic foot ulcers (DFUs), a class of chronic wounds, affect ~750,000 individuals every year in the US alone and identifying non-healing DFUs that develop into chronic wounds early can drastically reduce treatment costs and minimize risks of amputation. There is therefore a pressing need for diagnostic tools that can detect non-healing DFUs early. We develop low-cost, multi-analyte 3D-printed assays seamlessly integrated on swabs that can identify non-healing DFUs and a Wound Sensor iOS App, an innovative mobile application developed for the controlled acquisition and automated analysis of wound sensor data. By comparing both the original base image (before exposure to the wound) and the wound-exposed image, we developed automated computer vision techniques to compare density changes between the two assay images, which allow us to automatically determine the severity of the wound. The iOS app ensures accurate data collection and presents actionable insights, despite challenges such as variations in camera configurations and ambient conditions. The proposed integrated sensor and iOS app will allow healthcare professionals to monitor wound conditions in real time, track healing progress, and assess critical parameters related to wound care.

[246] Encoding of Demographic and Anatomical Information in Chest X-Ray-based Severe Left Ventricular Hypertrophy Classifiers

Basudha Pal,Rama Chellappa,Muhammad Umair

Main category: eess.IV

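TL;DR: Proposes a framework that classifies severe left ventricular hypertrophy directly from chest X-rays, without anatomical measurements or demographic inputs, and reports strong results.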

Details

Motivation: Echocardiography and MRI are costly and hard to access, which limits their use in evaluating cardiac structure. Method: Builds a direct classification framework and uses Mutual Information Neural Estimation to quantify feature expressivity. Result: The model achieves high AUROC and AUPRC and reveals clinically meaningful attribute encoding. Conclusion: The approach supports transparent model interpretation and offers an efficient alternative for cardiac disease screening. Abstract: While echocardiography and MRI are clinical standards for evaluating cardiac structure, their use is limited by cost and accessibility. We introduce a direct classification framework that predicts severe left ventricular hypertrophy from chest X-rays, without relying on anatomical measurements or demographic inputs. Our approach achieves high AUROC and AUPRC, and employs Mutual Information Neural Estimation to quantify feature expressivity. This reveals clinically meaningful attribute encoding and supports transparent model interpretation.

[247] A combined Machine Learning and Finite Element Modelling tool for the surgical planning of craniosynostosis correction

Itxasne Antúnez Sáenz,Ane Alberdi Aramendi,David Dunaway,Juling Ong,Lara Deliège,Amparo Sáenz,Anita Ahmadi Birjandi,Noor UI Owase Jeelani,Silvia Schievano,Alessandro Borghi

Main category: eess.IV

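TL;DR: The study aims to develop a real-time tool for predicting craniosynostosis surgery outcomes that avoids CT scans to reduce radiation exposure, using 3D photographs and a machine learning model.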

Details

Motivation: Outcomes of craniosynostosis surgery are currently hard to predict and depend on the surgeon's experience and the baby's age; traditional finite element modelling is complex and time-consuming. Method: Generates personalised synthetic skulls from 3D photographs combined with population average data, and uses a machine learning surrogate model to predict surgical outcomes. Result: The multi-output support vector regression model achieves an R^2 of 0.95 with MSE and MAE below 0.13, indicating good predictive performance. Conclusion: The tool can simulate surgical scenarios and provide optimal parameters, and may in future be used to optimize the cranial index. Abstract: Craniosynostosis is a medical condition that affects the growth of babies' heads, caused by an early fusion of cranial sutures. In recent decades, surgical treatments for craniosynostosis have significantly improved, leading to reduced invasiveness, faster recovery, and less blood loss. At Great Ormond Street Hospital (GOSH), the main surgical treatment for patients diagnosed with sagittal craniosynostosis (SC) is spring assisted cranioplasty (SAC). This procedure involves a 15x15 mm^2 osteotomy, where two springs are inserted to induce distraction. Despite the numerous advantages of this surgical technique for patients, the outcome remains unpredictable due to the lack of efficient preoperative planning tools. The surgeon's experience and the baby's age are currently relied upon to determine the osteotomy location and spring selection. Previous tools for predicting the surgical outcome of SC relied on finite element modeling (FEM), which involved computed tomography (CT) imaging and required engineering expertise and lengthy calculations. The main goal of this research is to develop a real-time prediction tool for the surgical outcome of patients, eliminating the need for CT scans to minimise radiation exposure during preoperative planning. The proposed methodology involves creating personalised synthetic skulls based on three-dimensional (3D) photographs, incorporating population average values of suture location, skull thickness, and soft tissue properties. A machine learning (ML) surrogate model is employed to achieve the desired surgical outcome. The resulting multi-output support vector regressor model achieves an R^2 metric of 0.95 and MSE and MAE below 0.13. Furthermore, in the future, this model could not only simulate various surgical scenarios but also provide optimal parameters for achieving a maximum cranial index (CI).

[248] A Survey of Deep Learning Video Super-Resolution

Arbind Agrahari Baniya,Tsz-Kwan Lee,Peter Eklund,Sunil Aryal

Main category: eess.IV

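TL;DR: This paper surveys deep learning-based video super-resolution (VSR) models, analyzes their components and methods, and proposes a multi-level taxonomy to guide future research.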

Details

Motivation: VSR has potential impact across many domains, but the use of existing methods and the decisions behind them are not adequately explained, so a comprehensive analysis of components and methods is needed. Method: Systematically surveys and categorises existing VSR models, analyzes their components and techniques, and proposes a multi-level taxonomy. Result: Summarizes trends, requirements, and challenges in VSR and establishes a taxonomy to guide research. Conclusion: The paper provides a systematic survey and taxonomy for VSR research that helps advance its practical applications. Abstract: Video super-resolution (VSR) is a prominent research topic in low-level computer vision, where deep learning technologies have played a significant role. The rapid progress in deep learning and its applications in VSR has led to a proliferation of tools and techniques in the literature. However, the usage of these methods is often not adequately explained, and decisions are primarily driven by quantitative improvements. Given the significance of VSR's potential influence across multiple domains, it is imperative to conduct a comprehensive analysis of the elements and deep learning methodologies employed in VSR research. This methodical analysis will facilitate the informed development of models tailored to specific application needs. In this paper, we present an overarching overview of deep learning-based video super-resolution models, investigating each component and discussing its implications. Furthermore, we provide a synopsis of key components and technologies employed by state-of-the-art and earlier VSR models. By elucidating the underlying methodologies and categorising them systematically, we identified trends, requirements, and challenges in the domain. As a first-of-its-kind survey of deep learning-based VSR models, this work also establishes a multi-level taxonomy to guide current and future VSR research, enhancing the maturation and interpretation of VSR practices for various practical applications.

[249] petBrain: A New Pipeline for Amyloid, Tau Tangles and Neurodegeneration Quantification Using PET and MRI

Pierrick Coupé,Boris Mansencal,Floréal Morandat,Sergio Morell-Ortega,Nicolas Villain,Jose V. Manjón,Vincent Planche

Main category: eess.IV

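TL;DR: petBrain is a new end-to-end processing platform for Alzheimer's disease (AD) biomarker analysis that combines deep learning segmentation with multimodal imaging to provide fast, standardized quantification of A/T2/N biomarkers.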

Details

Motivation: Existing pipelines are limited in processing time, tracer-type variability, and multimodal integration, so a more efficient, standardized solution is needed. Method: Developed the petBrain platform, which uses deep learning segmentation and standardized biomarker quantification (e.g., Centiloid, CenTauR, HAVAs) to analyze amyloid-PET, tau-PET, and structural MRI simultaneously. Result: petBrain provides reliable and rapid biomarker quantification, with results comparable to existing pipelines and consistent with ADNI database data, CSF/plasma biomarkers, clinical status, and cognitive performance. Conclusion: petBrain is a powerful, open platform that helps standardize AD biomarker analysis and supports clinical research applications. Abstract: INTRODUCTION: Quantification of amyloid plaques (A), neurofibrillary tangles (T2), and neurodegeneration (N) using PET and MRI is critical for Alzheimer's disease (AD) diagnosis and prognosis. Existing pipelines face limitations regarding processing time, variability in tracer types, and challenges in multimodal integration. METHODS: We developed petBrain, a novel end-to-end processing pipeline for amyloid-PET, tau-PET, and structural MRI. It leverages deep learning-based segmentation, standardized biomarker quantification (Centiloid, CenTauR, HAVAs), and simultaneous estimation of A, T2, and N biomarkers. The pipeline is implemented as a web-based platform, requiring no local computational infrastructure or specialized software knowledge. RESULTS: petBrain provides reliable and rapid biomarker quantification, with results comparable to existing pipelines for A and T2. It shows strong concordance with data processed in ADNI databases. The staging and quantification of A/T2/N by petBrain demonstrated good agreement with CSF/plasma biomarkers, clinical status, and cognitive performance. DISCUSSION: petBrain represents a powerful and openly accessible platform for standardized AD biomarker analysis, facilitating applications in clinical research.

[250] Rethinking Whole-Body CT Image Interpretation: An Abnormality-Centric Approach

Ziheng Zhao,Lisong Dai,Ya Zhang,Yanfeng Wang,Weidi Xie

Main category: eess.IV

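TL;DR: The paper proposes an approach to automated CT image interpretation comprising a taxonomy, a dataset, model development, and benchmarks, and it significantly outperforms existing methods.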

Details

Motivation: To address the challenge of automatically localizing and describing abnormalities in multi-plane, whole-body CT images in clinical radiology. Method: Proposes a taxonomy, contributes a dataset, develops the OminiAbnorm-CT model, and establishes three clinical evaluation tasks. Result: OminiAbnorm-CT significantly outperforms existing methods on all tasks and metrics. Conclusion: The approach provides an effective solution for automated CT image interpretation with potential for clinical application. Abstract: Automated interpretation of CT images, particularly localizing and describing abnormal findings across multi-plane and whole-body scans, remains a significant challenge in clinical radiology. This work aims to address this challenge through four key contributions: (i) On taxonomy, we collaborate with senior radiologists to propose a comprehensive hierarchical classification system, with 404 representative abnormal findings across all body regions; (ii) On data, we contribute a dataset containing over 14.5K CT images from multiple planes and all human body regions, and meticulously provide grounding annotations for over 19K abnormalities, each linked to the detailed description and cast into the taxonomy; (iii) On model development, we propose OminiAbnorm-CT, which can automatically ground and describe abnormal findings on multi-plane and whole-body CT images based on text queries, while also allowing flexible interaction through visual prompts; (iv) On benchmarks, we establish three representative evaluation tasks based on real clinical scenarios. Through extensive experiments, we show that OminiAbnorm-CT can significantly outperform existing methods on all the tasks and metrics.

[251] Hybrid Ensemble of Segmentation-Assisted Classification and GBDT for Skin Cancer Detection with Engineered Metadata and Synthetic Lesions from ISIC 2024 Non-Dermoscopic 3D-TBP Images

Muhammad Zubair Hasan,Fahmida Yasmin Rifat

Main category: eess.IV

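TL;DR: Proposes a hybrid machine learning and deep learning approach for classifying skin lesions as benign or malignant on the SLICE-3D dataset; with feature fusion and synthetic data augmentation, it achieves the best performance.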

Details

Motivation: Skin cancer is common and deadly worldwide, and early detection is critical to patient outcomes. Method: Combines a vision transformer (EVA02) with a convolutional-ViT hybrid model (EdgeNeXtSAC) in a segmentation-assisted classification pipeline, and fuses predictions with gradient-boosted decision trees (GBDT). Uses synthetic data augmentation and a diagnosis-informed relabeling strategy. Result: Under the evaluation metric of partial AUC (pAUC) above 80% true positive rate (TPR), the approach reaches a pAUC of 0.1755, the highest among all configurations. Conclusion: Hybrid, interpretable AI systems have potential for skin cancer triage in telemedicine and resource-constrained settings. Abstract: Skin cancer is among the most prevalent and life-threatening diseases worldwide, with early detection being critical to patient outcomes. This work presents a hybrid machine and deep learning-based approach for classifying malignant and benign skin lesions using the SLICE-3D dataset from ISIC 2024, which comprises 401,059 cropped lesion images extracted from 3D Total Body Photography (TBP), emulating non-dermoscopic, smartphone-like conditions. Our method combines vision transformers (EVA02) and our designed convolutional ViT hybrid (EdgeNeXtSAC) to extract robust features, employing a segmentation-assisted classification pipeline to enhance lesion localization. Predictions from these models are fused with a gradient-boosted decision tree (GBDT) ensemble enriched by engineered features and patient-specific relational metrics. To address class imbalance and improve generalization, we augment malignant cases with Stable Diffusion-generated synthetic lesions and apply a diagnosis-informed relabeling strategy to harmonize external datasets into a 3-class format. Using partial AUC (pAUC) above 80 percent true positive rate (TPR) as the evaluation metric, our approach achieves a pAUC of 0.1755, the highest among all configurations. These results underscore the potential of hybrid, interpretable AI systems for skin cancer triage in telemedicine and resource-constrained settings.
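The reported pAUC is easier to read once the metric is spelled out: it is the area between the ROC curve and the line TPR = 0.8, so a perfect classifier scores 1 - 0.8 = 0.2. The sketch below is one way to compute that quantity with NumPy; it is an illustration written for this summary, not the official challenge scorer, and the function name `pauc_above_tpr` is hypothetical.

```python
import numpy as np


def pauc_above_tpr(y_true, y_score, min_tpr=0.8):
    """Area between the ROC curve and the horizontal line TPR = min_tpr."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="stable")
    y_true, y_score = y_true[order], y_score[order]
    # Keep one ROC point per distinct score so tied scores form a single segment.
    last_of_tie = np.r_[np.nonzero(np.diff(y_score))[0], len(y_score) - 1]
    tpr = np.r_[0.0, np.cumsum(y_true)[last_of_tie] / y_true.sum()]
    fpr = np.r_[0.0, np.cumsum(~y_true)[last_of_tie] / (~y_true).sum()]
    # Insert the point where the curve crosses min_tpr so the clipped area is exact.
    i = int(np.searchsorted(tpr, min_tpr, side="left"))
    if i > 0 and tpr[i] > min_tpr:
        t = (min_tpr - tpr[i - 1]) / (tpr[i] - tpr[i - 1])
        fpr = np.insert(fpr, i, fpr[i - 1] + t * (fpr[i] - fpr[i - 1]))
        tpr = np.insert(tpr, i, min_tpr)
    height = np.clip(tpr - min_tpr, 0.0, None)
    # Trapezoidal rule over the clipped curve.
    return float(np.sum(np.diff(fpr) * (height[1:] + height[:-1]) / 2.0))


perfect = pauc_above_tpr([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
chance = pauc_above_tpr([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
print(perfect, chance)
```

Restricting the area to the high-sensitivity region rewards models that keep false positives low while still catching at least 80% of malignant cases.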

[252] Identifying Alzheimer's Disease Prediction Strategies of Convolutional Neural Network Classifiers using R2* Maps and Spectral Clustering

Christian Tinauer,Maximilian Sackl,Stefan Ropele,Christian Langkammer

Main category: eess.IV

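TL;DR: The study uses LRP and spectral clustering to analyze the decision strategies of deep learning models for Alzheimer's disease classification. It finds that preprocessing and training choices strongly affect the models and that spectral clustering can identify differences in classification strategy.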

Details

Motivation: Deep learning models perform well in Alzheimer's disease classification, but their decision process is opaque and potentially biased, so further analysis is needed to improve interpretability. Method: Trains a 3D convolutional neural network on R2* maps, generates heatmaps with LRP, and applies spectral clustering and t-SNE visualization to analyze decision patterns. Result: Spectral clustering reveals distinct decision patterns; the relevance-guided model separates AD and NC cases most clearly, and t-SNE confirms that heatmap groupings correspond to subject groups. Conclusion: Preprocessing and training choices strongly affect the models; spectral clustering offers a structured way to identify differences in classification strategy, underscoring the importance of explainability in medical AI. Abstract: Deep learning models have shown strong performance in classifying Alzheimer's disease (AD) from R2* maps, but their decision-making remains opaque, raising concerns about interpretability. Previous studies suggest biases in model decisions, necessitating further analysis. This study uses Layer-wise Relevance Propagation (LRP) and spectral clustering to explore classifier decision strategies across preprocessing and training configurations using R2* maps. We trained a 3D convolutional neural network on R2* maps, generating relevance heatmaps via LRP and applied spectral clustering to identify dominant patterns. t-Stochastic Neighbor Embedding (t-SNE) visualization was used to assess clustering structure. Spectral clustering revealed distinct decision patterns, with the relevance-guided model showing the clearest separation between AD and normal control (NC) cases. The t-SNE visualization confirmed that this model aligned heatmap groupings with the underlying subject groups. Our findings highlight the significant impact of preprocessing and training choices on deep learning models trained on R2* maps, even with similar performance metrics. Spectral clustering offers a structured method to identify classification strategy differences, emphasizing the importance of explainability in medical AI.

[253] Conformal coronary calcification volume estimation with conditional coverage via histogram clustering

Olivier Jaubert,Salman Mohammadi,Keith A. Goatman,Shadia S. Mikhael,Conor Bradley,Rebecca Hughes,Richard Good,John H. Hipwell,Sonia Dahdouh

Main category: eess.IV

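TL;DR: A cluster-based conditional conformal prediction framework calibrates coronary calcium score intervals, improving triage metrics at similar coverage and supporting patient triage.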

Details

Motivation: Coronary calcium detected incidentally in CT scans can prompt early intervention, but over-reporting can harm patients and burden the medical system, so calcium scores must be reported automatically with care. Method: Proposes a cluster-based conditional conformal prediction framework that produces calibrated score intervals from trained segmentation networks without retraining. The method is used to calibrate predictive intervals for 3D UNet models (deterministic, MCDropout, and deep ensemble). Result: Compared with conventional conformal prediction, the method achieves better triage metrics at similar coverage. Conclusion: Calibrated calcium score intervals help triage patients according to the confidence of their predicted risk category, supporting clinical decisions. Abstract: Incidental detection and quantification of coronary calcium in CT scans could lead to the early introduction of lifesaving clinical interventions. However, over-reporting could negatively affect patient wellbeing and unnecessarily burden the medical system. Therefore, careful consideration is needed when automatically reporting coronary calcium scores. A cluster-based conditional conformal prediction framework is proposed to provide score intervals with calibrated coverage from trained segmentation networks without retraining. The proposed method was tuned and used to calibrate predictive intervals for 3D UNet models (deterministic, MCDropout and deep ensemble), reaching similar coverage with better triage metrics compared to conventional conformal prediction. Meaningful predictive intervals of calcium scores could help triage patients according to the confidence of their risk category prediction.
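
The general idea of cluster-conditional conformal calibration can be sketched as follows. This is a generic split-conformal example on absolute residuals, not the paper's method: the paper clusters on histograms, whereas here the cluster ids are simply given, and all function names and the clipping of the lower bound at zero are assumptions made for illustration.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal radius: the ceil((n + 1) * (1 - alpha))-th smallest residual."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this coverage level
    return sorted(residuals)[k - 1]


def cluster_radii(cal_pred, cal_true, cal_cluster, alpha=0.1):
    """Compute one interval radius per cluster from held-out calibration data."""
    residuals = {}
    for pred, true, cluster in zip(cal_pred, cal_true, cal_cluster):
        residuals.setdefault(cluster, []).append(abs(true - pred))
    return {c: conformal_quantile(r, alpha) for c, r in residuals.items()}


def predict_interval(pred, cluster, radii):
    """Interval around a new prediction, using the radius of its cluster."""
    radius = radii[cluster]
    # Assumption: scores are non-negative, so the lower bound is clipped at 0.
    return (max(0.0, pred - radius), pred + radius)
```

Calibrating per cluster rather than over the whole calibration set lets easy cases receive narrow intervals and hard cases wide ones, which is what makes the intervals useful for deciding whether a predicted risk category can be trusted.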

[254] Towards generating more interpretable counterfactuals via concept vectors: a preliminary study on chest X-rays

Bulat Maksudov,Kathleen Curran,Alessandra Mileo

Main category: eess.IV

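TL;DR: The paper proposes mapping clinical concepts into the latent space of a generative model, using Concept Activation Vectors (CAVs) to explain clinical features without explicit label training.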

Details

Motivation: Ensuring that medical imaging models align with clinical knowledge and are interpretable is a key step toward deployment. Method: Uses a simple reconstruction autoencoder to map user-defined concepts to image-level features and extract stable clinical concepts. Result: On chest X-rays, the method works well for large pathologies such as cardiomegaly, while small pathologies remain challenging because of reconstruction limits. Conclusion: Although it does not outperform baselines, the approach offers a path toward interpretability grounded in clinical knowledge. Abstract: An essential step in deploying medical imaging models is ensuring alignment with clinical knowledge and interpretability. We focus on mapping clinical concepts into the latent space of generative models to identify Concept Activation Vectors (CAVs). Using a simple reconstruction autoencoder, we link user-defined concepts to image-level features without explicit label training. The extracted concepts are stable across datasets, enabling visual explanations that highlight clinically relevant features. By traversing latent space along concept directions, we produce counterfactuals that exaggerate or reduce specific clinical features. Preliminary results on chest X-rays show promise for large pathologies like cardiomegaly, while smaller pathologies remain challenging due to reconstruction limits. Although not outperforming baselines, this approach offers a path toward interpretable, concept-based explanations aligned with clinical knowledge.

[255] A Diffusion-Driven Temporal Super-Resolution and Spatial Consistency Enhancement Framework for 4D MRI imaging

Xuanru Zhou,Jiarun Liu,Shoujun Yu,Hao Yang,Cheng Li,Tao Tan,Shanshan Wang

Main category: eess.IV

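TL;DR: TSSC-Net is a new framework that uses a diffusion model and a Mamba module to address the spatio-temporal resolution problems caused by fast motion in 4D MRI, generating high-quality intermediate frames.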

Details

Motivation: Traditional methods struggle with large deformations during rapid, large-amplitude motion, leading to misregistration and spatial inconsistency. Method: Proposes TSSC-Net, which uses a diffusion model to generate intermediate frames and introduces a tri-directional Mamba module to resolve cross-slice misalignment. Result: Validated on the ACDC cardiac MRI dataset and a dynamic 4D knee joint dataset, TSSC-Net achieves 6x temporal super-resolution while preserving structural fidelity. Conclusion: TSSC-Net improves the spatio-temporal resolution of dynamic MRI and is suited to fast-motion scenarios. Abstract: In medical imaging, 4D MRI enables dynamic 3D visualization, yet the trade-off between spatial and temporal resolution requires prolonged scan time that can compromise temporal fidelity, especially during rapid, large-amplitude motion. Traditional approaches typically rely on registration-based interpolation to generate intermediate frames. However, these methods struggle with large deformations, resulting in misregistration, artifacts, and diminished spatial consistency. To address these challenges, we propose TSSC-Net, a novel framework that generates intermediate frames while preserving spatial consistency. To improve temporal fidelity under fast motion, our diffusion-based temporal super-resolution network generates intermediate frames using the start and end frames as key references, achieving 6x temporal super-resolution in a single inference step. Additionally, we introduce a novel tri-directional Mamba-based module that leverages long-range contextual information to effectively resolve spatial inconsistencies arising from cross-slice misalignment, thereby enhancing volumetric coherence and correcting cross-slice errors. Extensive experiments were performed on the public ACDC cardiac MRI dataset and a real-world dynamic 4D knee joint dataset. The results demonstrate that TSSC-Net can generate high-resolution dynamic MRI from fast-motion data while preserving structural fidelity and spatial consistency.

[256] A Comprehensive Study on Medical Image Segmentation using Deep Neural Networks

Loan Dao,Ngoc Quoc Ly

Main category: eess.IV

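TL;DR: The paper surveys progress in medical image segmentation (MIS) based on deep neural networks, focusing on the state of the art within the DIKIW framework and stressing the importance of Explainable Artificial Intelligence (XAI) and early prediction.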

Details

Motivation: The study aims to advance the use of MIS in disease diagnosis and early detection, in particular to raise cancer survival rates through timely diagnosis. Method: Uses the DIKIW framework to evaluate intelligent vision systems and examines XAI as a way to improve the transparency and ethical compliance of DNNs. Result: Summarizes state-of-the-art MIS solutions at each DIKIW level and proposes potential solutions for making DNNs more efficient. Conclusion: XAI and early prediction are key steps on the way from "intelligence" to "wisdom", and the remaining challenges of MIS need further work. Abstract: Over the past decade, Medical Image Segmentation (MIS) using Deep Neural Networks (DNNs) has achieved significant performance improvements and holds great promise for future developments. This paper presents a comprehensive study on MIS based on DNNs. Intelligent Vision Systems are often evaluated based on their output levels, such as Data, Information, Knowledge, Intelligence, and Wisdom (DIKIW), and the state-of-the-art solutions in MIS at these levels are the focus of research. Additionally, Explainable Artificial Intelligence (XAI) has become an important research direction, as it aims to uncover the "black box" nature of previous DNN architectures to meet the requirements of transparency and ethics. The study emphasizes the importance of MIS in disease diagnosis and early detection, particularly for increasing the survival rate of cancer patients through timely diagnosis. XAI and early prediction are considered two important steps in the journey from "intelligence" to "wisdom." Additionally, the paper addresses existing challenges and proposes potential solutions to enhance the efficiency of implementing DNN-based MIS.

[257] Recent Advances in Medical Image Classification

Loan Dao,Ngoc Quoc Ly

Main category: eess.IV

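TL;DR: The paper reviews recent advances in medical image classification, covering solutions at three levels (basic, specific, and applied), including deep learning and vision-language models, as well as the role of Explainable Artificial Intelligence.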

Details

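Motivation: Medical image classification is crucial for diagnosis and treatment, and advances in artificial intelligence support it significantly. Method: Reviews traditional approaches (such as Convolutional Neural Networks and Vision Transformers) and state-of-the-art approaches (such as Vision Language Models), and discusses how to cope with limited labeled data. Result: These methods use Explainable Artificial Intelligence to make predictions more interpretable. Conclusion: The paper summarizes progress in medical image classification and stresses the importance of deep learning and Explainable Artificial Intelligence. Abstract: Medical image classification is crucial for diagnosis and treatment, benefiting significantly from advancements in artificial intelligence. The paper reviews recent progress in the field, focusing on three levels of solutions: basic, specific, and applied. It highlights advances in traditional methods using deep learning models like Convolutional Neural Networks and Vision Transformers, as well as state-of-the-art approaches with Vision Language Models. These models tackle the issue of limited labeled data, and enhance and explain predictive results through Explainable Artificial Intelligence.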

physics.optics [Back]

[258] Structural Vibration Monitoring with Diffractive Optical Processors

Yuntian Wang,Zafer Yilmaz,Yuhang Li,Edward Liu,Eric Ahlberg,Farid Ghahari,Ertugrul Taciroglu,Aydogan Ozcan

Main category: physics.optics

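TL;DR: Proposes a low-power, low-cost, scalable structural health monitoring system based on diffractive vibration monitoring, which extracts 3D vibration spectra remotely using a jointly optimized diffractive layer and a shallow neural network.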

Details

Motivation: Current structural health monitoring solutions are constrained by cost, power consumption, scalability, and data-processing complexity, so a more efficient solution is needed. Method: Combines a spatially optimized passive diffractive layer with shallow, low-power neural networks; 3D structural displacements are encoded into modulated light and decoded in real time from a small number of detectors. Result: Numerical and experimental validation on a laboratory-scale building model under millimeter-wave illumination shows more than an order-of-magnitude improvement in accuracy over conventional optics or separately trained modules. Conclusion: The system lays a foundation for high-throughput 3D monitoring of structures, with potential applications in disaster resilience, aerospace diagnostics, and autonomous navigation. Abstract: Structural Health Monitoring (SHM) is vital for maintaining the safety and longevity of civil infrastructure, yet current solutions remain constrained by cost, power consumption, scalability, and the complexity of data processing. Here, we present a diffractive vibration monitoring system, integrating a jointly optimized diffractive layer with a shallow neural network-based backend to remotely extract 3D structural vibration spectra, offering a low-power, cost-effective and scalable solution. This architecture eliminates the need for dense sensor arrays or extensive data acquisition; instead, it uses a spatially-optimized passive diffractive layer that encodes 3D structural displacements into modulated light, captured by a minimal number of detectors and decoded in real-time by shallow and low-power neural networks to reconstruct the 3D displacement spectra of structures. The diffractive system's efficacy was demonstrated both numerically and experimentally using millimeter-wave illumination on a laboratory-scale building model with a programmable shake table. Our system achieves more than an order-of-magnitude improvement in accuracy over conventional optics or separately trained modules, establishing a foundation for high-throughput 3D monitoring of structures. Beyond SHM, the 3D vibration monitoring capabilities of this cost-effective and data-efficient framework establish a new computational sensing modality with potential applications in disaster resilience, aerospace diagnostics, and autonomous navigation, where energy efficiency, low latency, and high throughput are critical.

cs.DL [Back]

[259] Preface to the Special Issue of the TAL Journal on Scholarly Document Processing

Florian Boudin,Akiko Aizawa

Main category: cs.DL

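TL;DR: The paper discusses how large language models (LLMs) can help address the challenges posed by the rapid growth of scholarly literature, including literature reviews, writing assistance, and research exploration.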

Details

Motivation: The rapid growth and complexity of scholarly literature make it hard for researchers to keep up with new knowledge, so automated tools are needed to help navigate and interpret it. Method: Uses large language models (LLMs) for tasks such as literature reviews, writing assistance, and interactive research exploration. Result: Shows the potential of LLMs for scholarly document processing and presents related challenges and solutions. Conclusion: LLMs open new opportunities for scholarly document processing, but further research is needed to handle its complexity and diversity. Abstract: The rapid growth of scholarly literature makes it increasingly difficult for researchers to keep up with new knowledge. Automated tools are now more essential than ever to help navigate and interpret this vast body of information. Scientific papers pose unique difficulties, with their complex language, specialized terminology, and diverse formats, requiring advanced methods to extract reliable and actionable insights. Large language models (LLMs) offer new opportunities, enabling tasks such as literature reviews, writing assistance, and interactive exploration of research. This special issue of the TAL journal highlights research addressing these challenges and, more broadly, research on natural language processing and information retrieval for scholarly and scientific documents.

[260] Knowledge Graphs for Digitized Manuscripts in Jagiellonian Digital Library Application

Jan Ignatowicz,Krzysztof Kutt,Grzegorz J. Nalepa

Main category: cs.DL

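TL;DR: The paper explores combining computer vision, artificial intelligence, and semantic web technologies to enrich the metadata of digitized cultural heritage and build knowledge graphs.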

Details

Motivation: Digitizing cultural heritage is crucial for preservation and public access, but incomplete and poorly standardized metadata limits searchability and the connections between collections. Method: Applies an integrated methodology of computer vision, artificial intelligence, and semantic web technologies to enrich metadata and construct knowledge graphs. Result: The approach is expected to improve the metadata quality and connectivity of digitized manuscripts and incunabula. Conclusion: Integrating these technologies can address the metadata problems and increase the potential of digitized cultural heritage. Abstract: Digitizing cultural heritage collections has become crucial for preserving historical artifacts and making them more available to the wider public. Galleries, libraries, archives and museums (GLAM institutions) are actively digitizing their holdings and creating extensive digital collections. Those collections are often enriched with metadata describing the items but not exactly their contents. The Jagiellonian Digital Library, standing as a good example of such an effort, offers datasets accessible through protocols like OAI-PMH. Despite these improvements, metadata completeness and standardization continue to pose substantial obstacles, limiting the searchability and potential connections between collections. To deal with these challenges, we explore an integrated methodology of computer vision (CV), artificial intelligence (AI), and semantic web technologies to enrich metadata and construct knowledge graphs for digitized manuscripts and incunabula.

eess.AS [Back]

[261] Tone recognition in low-resource languages of North-East India: peeling the layers of SSL-based speech models

Parismita Gogoi,Sishir Kalita,Wendy Lalhminghlui,Viyazonuo Terhiija,Moakala Tzudir,Priyankoo Sarmah,S. R. M. Prasanna

Main category: eess.AS

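TL;DR: The study examines how well self-supervised learning (SSL) models recognize tone in three low-resource languages of North-East India (Angami, Ao, and Mizo). Recognition is best for Mizo and worst for Angami, and the middle layers of the SSL models matter most for tone recognition.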

Details

Motivation: To explore the use of self-supervised learning models for tone recognition in low-resource languages. Method: Evaluates four pre-trained Wav2vec2.0-based models and analyzes tone recognition performance across languages and model layers. Result: Tone recognition is best for Mizo and worst for Angami; the middle layers of the SSL models are the most important for tone recognition, and the tone inventory and dialectal variation affect recognition. Conclusion: The study offers insight into the strengths and weaknesses of SSL models for tone recognition in low-resource languages and shows room for improvement. Abstract: This study explores the use of self-supervised learning (SSL) models for tone recognition in three low-resource languages from North Eastern India: Angami, Ao, and Mizo. We evaluate four Wav2vec2.0 base models that were pre-trained on both tonal and non-tonal languages. We analyze tone-wise performance across the layers for all three languages and compare the different models. Our results show that tone recognition works best for Mizo and worst for Angami. The middle layers of the SSL models are the most important for tone recognition, regardless of the pre-training language, i.e. tonal or non-tonal. We have also found that the tone inventory, tone types, and dialectal variations affect tone recognition. These findings provide useful insights into the strengths and weaknesses of SSL-based embeddings for tonal languages and highlight the potential for improving tone recognition in low-resource settings. The source code is available on GitHub.

[262] SNIFR : Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer

Orchid Chetia Phukan,Mohd Mujtaba Akhtar,Girish,Swarup Ranjan Behera,Abu Osama Siddiqui,Sarthak Jain,Priyabrata Mallick,Jaya Sai Kiran Patibandla,Pailla Balakrishna Reddy,Arun Balaji Buduru,Rajesh Sharma

Main category: eess.AS

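TL;DR: The paper proposes SNIFR, a multimodal framework that combines audio and visual features to detect content harmful to children, outperforming unimodal and baseline fusion methods.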

Details

Motivation: As video platforms have spread, child viewership has grown, but existing moderation systems are easily evaded by malicious users, and audio features remain underexplored in this area. Method: SNIFR uses a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Result: The method outperforms unimodal and baseline fusion methods, setting a new state of the art. Conclusion: A multimodal approach that combines audio and visual features detects content harmful to children more effectively. Abstract: As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content like violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in minimal frames to evade detection. While prior research has focused on visual cues and advanced such fine-grained detection, audio features remain underexplored. In this study, we combine audio cues with visual cues for fine-grained child harmful content detection and introduce SNIFR, a novel framework for effective alignment. SNIFR employs a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Our approach achieves superior performance over unimodal and baseline fusion methods, setting a new state-of-the-art.

cs.LG [Back]

[263] DiaBlo: Diagonal Blocks Are Sufficient For Finetuning

Selcuk Gurses,Aozhong Zhang,Yanxia Deng,Xun Dong,Xin Li,Naigang Wang,Penghang Yin,Zi Yang

Main category: cs.LG

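TL;DR: DiaBlo is a parameter-efficient finetuning (PEFT) method that updates only the diagonal blocks of model weight matrices, avoiding low-rank matrix products and achieving stable convergence and high efficiency.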

Details

Motivation: To close the performance gap between PEFT methods and full-model finetuning while reducing compute and memory costs. Method: Updates only the diagonal blocks of model weight matrices, with no low-rank matrix products and no special initialization strategy. Result: Stable and efficient across many tasks, with low memory use and fast training. Conclusion: DiaBlo is a simple and effective PEFT method that delivers strong, consistent performance while remaining efficient. Abstract: Finetuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Finetuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. We conduct extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, to evaluate the effectiveness and efficiency of DiaBlo. Across these benchmarks, DiaBlo demonstrates strong and consistent performance while maintaining high memory efficiency and fast finetuning speed. Code is available at https://github.com/ziyangjoy/DiaBlo.
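
What "updating only the diagonal blocks" means can be sketched in a few lines. This is an illustration of the idea as the abstract describes it, not the released implementation: it assumes a square weight matrix split into equally sized blocks, uses plain nested lists instead of tensors, and the function names are hypothetical.

```python
def add_diagonal_blocks(weight, blocks):
    """Return weight + blockdiag(blocks) without modifying the inputs.

    weight: square matrix as a list of rows (the frozen pretrained weights).
    blocks: list of equally sized square matrices (the trainable part).
    """
    size = len(weight)
    block_size = len(blocks[0])
    assert block_size * len(blocks) == size, "blocks must tile the diagonal"
    merged = [row[:] for row in weight]
    for k, block in enumerate(blocks):
        offset = k * block_size
        for i in range(block_size):
            for j in range(block_size):
                merged[offset + i][offset + j] += block[i][j]
    return merged


def diagonal_block_params(size, num_blocks):
    """Trainable entries when a size x size matrix gets num_blocks diagonal blocks."""
    return num_blocks * (size // num_blocks) ** 2


def lora_params(size, rank):
    """Trainable entries of a rank-`rank` LoRA adapter on a size x size matrix."""
    return 2 * size * rank
```

For a 4096 x 4096 matrix, 64 diagonal blocks give 262,144 trainable entries, the same count as a rank-32 LoRA adapter on that matrix, which is one way to see how the two approaches can be compared at a similar parameter budget.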

[264] Comparison of different Unique hard attention transformer models by the formal languages they can recognize

Leonid Ryvkin

Main category: cs.LG

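TL;DR: This note surveys the ability of unique hard attention transformer encoders (UHATs) to recognize formal languages, distinguishing masked vs. non-masked, finite vs. infinite image, and general vs. bilinear attention score functions.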

Details

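Motivation: To examine the ability of UHATs to recognize formal languages and compare the different variants. Method: Reviews the relations between models by distinguishing masked vs. non-masked, finite vs. infinite image, and general vs. bilinear attention score functions. Result: Recalls a lower bound in terms of first-order logic and an upper bound in terms of circuit complexity. Conclusion: Summarizes the capabilities of UHATs for formal language recognition and their theoretical bounds. Abstract: This note is a survey of various results on the capabilities of unique hard attention transformer encoders (UHATs) to recognize formal languages. We distinguish between masked vs. non-masked, finite vs. infinite image and general vs. bilinear attention score functions. We recall some relations between these models, as well as a lower bound in terms of first-order logic and an upper bound in terms of circuit complexity.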

[265] Adaptive Task Vectors for Large Language Models

Joonseong Kang,Soojeong Lee,Subeen Park,Sumin Park,Taero Kim,Jihee Kim,Ryunyi Lee,Kyungwoo Song

Main category: cs.LG

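TL;DR: The paper proposes Adaptive Task Vectors (ATV), a framework that dynamically generates task vectors conditioned on the input query, overcoming limitations of conventional in-context learning (ICL) and fixed task vectors and improving adaptability and generalization.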

Details

Motivation: Conventional ICL and fixed task-vector methods are sensitive to demonstration order, limited by context length, and computationally inefficient, and they cannot adapt the task vector to the specific input, which degrades generalization. Method: ATV uses a small language model to generate task vectors conditioned on the input query, then transforms them to fit the target large language model and uses them to guide output generation. Result: ATV shows strong performance and generalization on unseen tasks; the theoretical analysis indicates it is more expressive than Prefix-Tuning and expressively equivalent to LoRA under equal rank budgets. Conclusion: By generating task vectors dynamically, ATV improves adaptability and generalization and opens a new direction for task-vector methods. Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks without parameter updates by conditioning on a few demonstrations provided in the prompt. Despite its success, ICL suffers from several limitations, including sensitivity to demonstration order, context length constraints, and computational inefficiency. To address these challenges, task vector-based approaches compress task information into a single vector. However, these methods typically construct task vectors from fixed sets of demonstrations and reuse them across input queries, without conditioning on the specific input. This limitation can lead models to struggle with effective adaptation when the input query is not well aligned with the underlying demonstrations, consequently degrading their generalization performance on unseen tasks. To overcome this limitation, we propose Adaptive Task Vectors (ATV), a simple and effective framework that dynamically generates task vectors conditioned on each input query. ATV employs a small language model to generate task vectors, which are then transformed to match the target LLM's architecture and applied to guide its output generation. In contrast to ICL and previous vector-based approaches, which rely on fixed demonstration sets and their corresponding vectors, ATV dynamically generates task vectors tailored to each specific input query and task. Consequently, ATV demonstrates strong performance and generalization capabilities, even for unseen tasks. Furthermore, we provide a theoretical analysis indicating that ATV is expressively equivalent to LoRA under equal rank budgets and more expressive than Prefix-Tuning, thereby offering formal support for its representational advantage.

[266] DUAL: Dynamic Uncertainty-Aware Learning

Jiahao Qin,Bei Peng,Feng Liu,Guangliang Cheng,Lu Zong

Main category: cs.LG

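TL;DR: The paper proposes Dynamic Uncertainty-Aware Learning (DUAL), a framework that handles feature uncertainty in single-modal and multi-modal settings through dynamic feature uncertainty modeling, adaptive distribution-aware modulation, and uncertainty-aware cross-modal relationship learning, with reported gains on multiple tasks.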

Details

Motivation: Deep learning models often face feature uncertainty across diverse learning scenarios, which hurts performance and reliability; the problem is especially complex in multi-modal settings. Method: DUAL consists of three components: dynamic feature uncertainty modeling, adaptive distribution-aware modulation, and uncertainty-aware cross-modal relationship learning. Result: Experiments report accuracy gains on computer vision tasks (CIFAR-10, CIFAR-100, Tiny-ImageNet) and multi-modal learning tasks (CMU-MOSEI, CMU-MOSI, MISR). Conclusion: DUAL provides a unified and effective way to handle feature uncertainty in both single-modal and multi-modal scenarios. Abstract: Deep learning models frequently encounter feature uncertainty in diverse learning scenarios, significantly impacting their performance and reliability. This challenge is particularly complex in multi-modal scenarios, where models must integrate information from different sources with inherent uncertainties. We propose Dynamic Uncertainty-Aware Learning (DUAL), a unified framework that effectively handles feature uncertainty in both single-modal and multi-modal scenarios. DUAL introduces three key innovations: Dynamic Feature Uncertainty Modeling, which continuously refines uncertainty estimates through joint consideration of feature characteristics and learning dynamics; Adaptive Distribution-Aware Modulation, which maintains balanced feature distributions through dynamic sample influence adjustment; and Uncertainty-aware Cross-Modal Relationship Learning, which explicitly models uncertainties in cross-modal interactions. Through extensive experiments, we demonstrate DUAL's effectiveness across multiple domains: in computer vision tasks, it achieves substantial improvements of 7.1% accuracy on CIFAR-10, 6.5% accuracy on CIFAR-100, and 2.3% accuracy on Tiny-ImageNet; in multi-modal learning, it demonstrates consistent gains of 4.1% accuracy on CMU-MOSEI and 2.8% accuracy on CMU-MOSI for sentiment analysis, while achieving 1.4% accuracy improvements on MISR. The code will be available on GitHub soon.

[267] Exploiting LLMs for Automatic Hypothesis Assessment via a Logit-Based Calibrated Prior

Yue Gong,Raul Castro Fernandez

Main category: cs.LG

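TL;DR: The paper proposes an LLM-based method for automatic hypothesis assessment that predicts a prior distribution over the correlation of a variable pair in order to judge how novel and significant an observed correlation is.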

Details

Motivation: As hypothesis generation becomes more automated, hypothesis assessment becomes the new bottleneck. Existing systems can surface large numbers of statistical relationships but offer little guidance on which ones deserve attention. Method: Proposes the Logit-based Calibrated Prior, which draws on the knowledge in an LLM's weights to produce a prior distribution over the correlation of a variable pair, and calibrates the model's logits into a continuous predictive distribution. Result: On 2,096 real-world variable pairs, the method predicts Pearson correlation coefficients with a sign accuracy of 78.8%, a mean absolute error of 0.26, and 95% credible interval coverage of 89.2%, and it outperforms a fine-tuned RoBERTa classifier. Conclusion: The method can assess the novelty of correlation hypotheses and suggests that LLMs reason in a context-sensitive way rather than relying on memorization. Abstract: As hypothesis generation becomes increasingly automated, a new bottleneck has emerged: hypothesis assessment. Modern systems can surface thousands of statistical relationships (correlations, trends, causal links) but offer little guidance on which ones are novel, non-trivial, or worthy of expert attention. In this work, we study the complementary problem to hypothesis generation: automatic hypothesis assessment. Specifically, we ask: given a large set of statistical relationships, can we automatically assess which ones are novel and worth further exploration? We focus on correlations as they are a common entry point in exploratory data analysis that often serve as the basis for forming deeper scientific or causal hypotheses. To support automatic assessment, we propose to leverage the vast knowledge encoded in LLMs' weights to derive a prior distribution over the correlation value of a variable pair. If an LLM's prior expects the correlation value observed, then such correlation is not surprising, and vice versa. We propose the Logit-based Calibrated Prior, an LLM-elicited correlation prior that transforms the model's raw output logits into a calibrated, continuous predictive distribution over correlation values. We evaluate the prior on a benchmark of 2,096 real-world variable pairs and it achieves a sign accuracy of 78.8%, a mean absolute error of 0.26, and 95% credible interval coverage of 89.2% in predicting Pearson correlation coefficient. It also outperforms a fine-tuned RoBERTa classifier in binary correlation prediction and achieves higher precision@K in hypothesis ranking. We further show that the prior generalizes to correlations not seen during LLM pretraining, reflecting context-sensitive reasoning rather than memorization.
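
The three headline numbers (sign accuracy, mean absolute error, and credible interval coverage) can be made concrete with a short sketch. It assumes each prediction is summarized as a point estimate plus an interval; the function name `evaluate_prior` and the data layout are hypothetical and are not taken from the paper.

```python
def evaluate_prior(predictions, observed):
    """Score predicted correlations against observed Pearson coefficients.

    predictions: list of (point_estimate, interval_low, interval_high).
    observed: list of observed correlation values, in the same order.
    """
    n = len(observed)
    # Sign accuracy: does the point estimate get the direction right?
    sign_hits = sum((est > 0) == (r > 0) for (est, _, _), r in zip(predictions, observed))
    # Mean absolute error between point estimate and observed value.
    total_error = sum(abs(est - r) for (est, _, _), r in zip(predictions, observed))
    # Coverage: fraction of observed values falling inside the predicted interval.
    covered = sum(low <= r <= high for (_, low, high), r in zip(predictions, observed))
    return {
        "sign_accuracy": sign_hits / n,
        "mae": total_error / n,
        "coverage": covered / n,
    }
```

A well-calibrated 95% interval would cover about 95% of observed values, so the reported 89.2% indicates intervals that are somewhat too narrow. An observed correlation that falls outside its interval is the kind of case the abstract describes as surprising.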

[268] Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation

Mingxuan Xia,Haobo Wang,Yixuan Li,Zewei Yu,Jindong Wang,Junbo Zhao,Runze Wu

Main category: cs.LG

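TL;DR: The paper proposes a candidate annotation paradigm in which a large language model (LLM) outputs all possible labels when it is uncertain, and pairs it with CanDist, a teacher-student framework that distills the candidate annotations, to improve annotation quality.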

Details

Motivation: Existing methods usually ask the LLM to commit to a single label for each unlabeled sample, but LLM uncertainty leads to wrong labels that hurt data quality for downstream tasks. The authors take inspiration from ambiguity aversion in human behavior. Method: Proposes a candidate annotation paradigm that encourages the LLM to output all possible labels, and develops the teacher-student framework CanDist, which distills the candidate annotations with a Small Language Model (SLM) so that downstream tasks receive unique labels. Result: Experiments on six text classification tasks validate the method, and the theoretical analysis shows that distilling candidate annotations is preferable to using single annotations directly. Conclusion: The candidate annotation paradigm combined with CanDist improves annotation quality and yields more reliable labeled data for downstream tasks. Abstract: Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy by prompting the LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising the data quality for downstream applications. Motivated by ambiguity aversion in human behaviors, we propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when uncertain. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework CanDist that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.

[269] Multimodal Tabular Reasoning with Privileged Structured Information

Jun-Peng Jiang,Yu Xia,Hai-Long Sun,Shiyin Lu,Qing-Guo Chen,Weihua Luo,Kaifu Zhang,De-Chuan Zhan,Han-Jia Ye

Main category: cs.LG

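TL;DR: The paper proposes Turbo, a multimodal tabular reasoning framework that uses structured information available at training time to strengthen the model, addressing the challenge of aligning table images with structured information.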

Details

Motivation: Real-world tables usually appear as images without a high-quality textual representation, so methods that reason over table content directly from images are needed. Method: Turbo uses a structure-aware reasoning trace generator (based on DeepSeek-R1) to produce high-quality modality-bridged data, and improves the model by repeatedly generating and selecting reasoning paths. Result: Experiments show that with limited data (9k), Turbo outperforms the previous best method by 7.2%. Conclusion: By bridging modalities and refining structured reasoning paths, Turbo improves multimodal tabular reasoning. Abstract: Tabular reasoning involves multi-step information extraction and logical inference over tabular data. While recent advances have leveraged large language models (LLMs) for reasoning over structured tables, such high-quality textual representations are often unavailable in real-world settings, where tables typically appear as images. In this paper, we tackle the task of tabular reasoning from table images, leveraging privileged structured information available during training to enhance multimodal large language models (MLLMs). The key challenges lie in the complexity of accurately aligning structured information with visual representations, and in effectively transferring structured reasoning skills to MLLMs despite the input modality gap. To address these, we introduce TabUlar Reasoning with Bridged infOrmation (Turbo), a new framework for multimodal tabular reasoning with privileged structured tables. Turbo benefits from a structure-aware reasoning trace generator based on DeepSeek-R1, contributing to high-quality modality-bridged data. On this basis, Turbo repeatedly generates and selects the advantageous reasoning paths, further enhancing the model's tabular reasoning ability. Experimental results demonstrate that, with limited (9k) data, Turbo achieves state-of-the-art performance (+7.2% vs. previous SOTA) across multiple datasets.

[270] AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment

Anastasiia Ivanova,Eva Bakaeva,Zoya Volovikova,Alexey K. Kovalev,Aleksandr I. Panov

Main category: cs.LG

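TL;DR: The paper introduces the AmbiK dataset to address the challenge LLMs face with ambiguous instructions in a kitchen environment, and to provide a unified benchmark.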

Details

Motivation: Existing methods are hard to compare because they are tested on different datasets and no universal benchmark exists. Method: Builds AmbiK, a dataset of 1000 pairs of ambiguous and unambiguous instructions, collected with the assistance of LLMs and validated by humans. Result: AmbiK contains 2000 tasks, categorized by ambiguity type, with environment descriptions, clarifying questions and answers, user intents, and task plans. Conclusion: AmbiK is intended to enable a unified comparison of ambiguity detection methods. Abstract: As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed. However, it is difficult to compare them because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), a fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog-model/AmbiK-dataset.

[271] Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Elias Abad Rocamora,Christian Schlarmann,Naman Deep Singh,Yongtao Wu,Matthias Hein,Volkan Cevher

Main category: cs.LG

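TL;DR: LEAF is an efficient adversarial finetuning method that makes the CLIP text encoder more robust while preserving vision performance, and improves generation quality and retrieval under adversarial noise.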

Details

Motivation: Adversarial input attacks can shift CLIP embeddings significantly, which affects the robustness of downstream models such as text-to-image generators. The robustness of CLIP image encoders has been studied, but that of text encoders has not. Method: Proposes LEAF, an efficient adversarial finetuning method for the text domain that scales to large CLIP models. Result: LEAF significantly improves zero-shot adversarial accuracy in the text domain while maintaining vision performance, and under adversarial noise it improves generation quality and recall in multimodal retrieval. Conclusion: Robust text encoders allow better reconstruction of input text from its embedding via direct optimization, and LEAF fills the gap in research on text encoder robustness. Abstract: Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been made towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we close this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. When employing our robust CLIP encoders in multimodal retrieval tasks, we improve the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.

[272] Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Shuang Chen,Yue Guo,Zhaochen Su,Yafu Li,Yulun Wu,Jiacheng Chen,Jiayu Chen,Weijie Wang,Xiaoye Qu,Yu Cheng

Main category: cs.LG

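TL;DR: The paper introduces ReVisual-R1, which uses staged training (text-only initialization, multimodal RL, then text-only RL) to improve complex reasoning in multimodal large language models (MLLMs), reaching state-of-the-art results among open-source 7B models on several benchmarks.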

Details

Motivation: Existing approaches that apply reinforcement learning (RL) directly to multimodal large language models (MLLMs) struggle to activate complex reasoning. The paper analyzes key phenomena in the training pipeline in order to make multimodal RL more effective. Method: 1) Initialize the model with carefully selected text data; 2) address gradient stagnation in multimodal RL; 3) add text-only RL training after multimodal RL. Result: ReVisual-R1 performs best among open-source 7B MLLMs on several challenging benchmarks (such as MathVerse and AIME2024). Conclusion: The staged training approach balances perception and reasoning, making ReVisual-R1 the current best open-source 7B MLLM on these benchmarks. Abstract: Inspired by the remarkable reasoning capabilities of DeepSeek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, and DynaMath, as well as AIME2024 and AIME2025.

[273] Rethinking the Stability-Plasticity Trade-off in Continual Learning from an Architectural Perspective

Aojun Lu,Hangjie Yuan,Tao Feng,Yanan Sun

Main category: cs.LG

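TL;DR: The paper proposes Dual-Arch, a framework that combines the complementary strengths of deep and wide networks to address the stability-plasticity trade-off in continual learning, improving performance while substantially reducing the parameter count.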

Details

Motivation: Continual learning must balance stability (retaining old knowledge) and plasticity (acquiring new knowledge). Existing methods mostly work at the parameter level and overlook the effect of network architecture. This paper addresses the conflict at the architectural level. Method: Proposes the Dual-Arch framework with two independent networks: one dedicated to plasticity (a deeper network) and one dedicated to stability (a wider network). Both are lightweight and complement each other. Result: Experiments show that Dual-Arch improves the performance of existing continual learning methods while being up to 87% more compact in terms of parameters. Conclusion: Through architecture-level design, Dual-Arch effectively addresses the stability-plasticity trade-off and offers an efficient, compact solution for continual learning. Abstract: The quest for Continual Learning (CL) seeks to empower neural networks with the ability to learn and adapt incrementally. Central to this pursuit is addressing the stability-plasticity dilemma, which involves striking a balance between two conflicting objectives: preserving previously learned knowledge and acquiring new knowledge. While numerous CL methods aim to achieve this trade-off, they often overlook the impact of network architecture on stability and plasticity, restricting the trade-off to the parameter level. In this paper, we delve into the conflict between stability and plasticity at the architectural level. We reveal that under an equal parameter constraint, deeper networks exhibit better plasticity, while wider networks are characterized by superior stability. To address this architectural-level dilemma, we introduce a novel framework denoted Dual-Arch, which serves as a plug-in component for CL. This framework leverages the complementary strengths of two distinct and independent networks: one dedicated to plasticity and the other to stability. Each network is designed with a specialized and lightweight architecture, tailored to its respective objective. Extensive experiments demonstrate that Dual-Arch enhances the performance of existing CL methods while being up to 87% more compact in terms of parameters.

[274] Adapt before Continual Learning

Aojun Lu,Tao Feng,Hangjie Yuan,Chunhui Ding,Yanan Sun

Main category: cs.LG

TL;DR: Proposes ACL, a framework that adapts the pre-trained model before the core continual learning process in order to balance stability and plasticity.

Details

Motivation: Pre-trained models face a stability-plasticity trade-off in continual learning: freezing the model limits plasticity, while fine-tuning the whole model risks catastrophic forgetting. Method: Before learning each new task, ACL refines the pre-trained model through a plug-and-play adaptation phase that aligns embeddings with their original class prototypes while pushing them away from other classes. Result: Experiments show that ACL significantly improves continual learning performance across benchmarks and integrated methods. Conclusion: ACL offers a flexible and effective solution for continual learning based on pre-trained models. Abstract: Continual Learning (CL) seeks to enable neural networks to incrementally acquire new knowledge (plasticity) while retaining existing knowledge (stability). While pre-trained models (PTMs) have become pivotal in CL, prevailing approaches freeze the PTM backbone to preserve stability, limiting their plasticity, particularly when encountering significant domain gaps in incremental tasks. Conversely, sequentially finetuning the entire PTM risks catastrophic forgetting of generalizable knowledge, exposing a critical stability-plasticity trade-off. To address this challenge, we propose Adapting PTMs before the core CL process (ACL), a novel framework that refines the PTM backbone through a plug-and-play adaptation phase before learning each new task with existing CL approaches (e.g., prompt tuning). ACL enhances plasticity by aligning embeddings with their original class prototypes while distancing them from others, theoretically and empirically shown to balance stability and plasticity. Extensive experiments demonstrate that ACL significantly improves CL performance across benchmarks and integrated methods, offering a versatile solution for PTM-based CL.

[275] Solving Inverse Problems via Diffusion-Based Priors: An Approximation-Free Ensemble Sampling Approach

Haoxuan Chen,Yinuo Ren,Martin Renqiang Min,Lexing Ying,Zachary Izzo

Main category: cs.LG

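TL;DR: Proposes a posterior sampling algorithm based on diffusion models (DMs) that avoids heuristic approximations and, by building on sequential Monte Carlo (SMC) methods, solves inverse problems more accurately.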

Details

Motivation: Current DM-based posterior sampling methods rely on heuristic approximations, which limits their generative capability. This paper aims to avoid such approximations and solve Bayesian inverse problems (BIPs) more accurately. Method: Proposes an SMC-based ensemble algorithm. By analyzing how the prior evolves through the diffusion process, the authors derive a modified PDE and simulate it with weighted particle methods. Result: In theory, the posterior error is bounded in terms of the training error of the pre-trained score function and the number of particles. In experiments, the method gives more accurate reconstructions on imaging inverse problems. Conclusion: The method exploits the generative capability of DMs without heuristic approximations and markedly improves the accuracy of posterior sampling. Abstract: Diffusion models (DMs) have proven to be effective in modeling high-dimensional distributions, leading to their widespread adoption for representing complex priors in Bayesian inverse problems (BIPs). However, current DM-based posterior sampling methods proposed for solving common BIPs rely on heuristic approximations to the generative process. To exploit the generative capability of DMs and avoid the usage of such approximations, we propose an ensemble-based algorithm that performs posterior sampling without the use of heuristic approximations. Our algorithm is motivated by existing works that combine DM-based methods with the sequential Monte Carlo (SMC) method. By examining how the prior evolves through the diffusion process encoded by the pre-trained score function, we derive a modified partial differential equation (PDE) governing the evolution of the corresponding posterior distribution. This PDE includes a modified diffusion term and a reweighting term, which can be simulated via stochastic weighted particle methods. Theoretically, we prove that the error between the true posterior distribution can be bounded in terms of the training error of the pre-trained score function and the number of particles in the ensemble. Empirically, we validate our algorithm on several inverse problems in imaging to show that our method gives more accurate reconstructions compared to existing DM-based methods.
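
Editor's sketch: the abstract mentions simulating a reweighting term with stochastic weighted particle methods. The block below is not the paper's algorithm. It only illustrates the generic reweight-and-resample step common to SMC-style particle methods, under the assumption that each particle carries a log-weight. The function names are hypothetical.

```python
import math
import random


def normalize_log_weights(log_w):
    """Turn unnormalized log-weights into weights that sum to 1.

    Subtracting the maximum first avoids overflow in exp().
    """
    m = max(log_w)
    w = [math.exp(l - m) for l in log_w]
    total = sum(w)
    return [x / total for x in w]


def effective_sample_size(w):
    """ESS = 1 / sum(w_i^2); equals len(w) for uniform weights."""
    return 1.0 / sum(x * x for x in w)


def systematic_resample(particles, w, rng):
    """Systematic resampling: one random offset, then evenly spaced points
    swept through the cumulative weights. Returns len(particles) particles."""
    n = len(particles)
    u0 = rng.random() / n
    out = []
    j = 0
    cum = w[0]
    for i in range(n):
        u = u0 + i / n
        # Advance to the particle whose cumulative weight covers u.
        while u > cum and j < n - 1:
            j += 1
            cum += w[j]
        out.append(particles[j])
    return out
```

A common design choice, assumed here rather than taken from the paper, is to resample only when the effective sample size falls below a threshold such as half the particle count, since resampling itself adds variance.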

[276] Optimal Transport-based Domain Alignment as a Preprocessing Step for Federated Learning

Luiz Manella Pereira,M. Hadi Amini

Main category: cs.LG

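TL;DR: Proposes an optimal transport-based preprocessing algorithm for dataset imbalance in federated learning. By minimizing the distributional discrepancy of data across edge devices, it improves global model performance.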

Details

Motivation: Dataset imbalance in federated learning degrades global model performance, so a method is needed to reduce distributional discrepancy and improve learning. Method: Uses Wasserstein barycenters to compute channel-wise averages, which together generate a target RGB space. Projecting the datasets towards this space minimizes the global distributional discrepancy. Result: Validated on CIFAR-10, where the method reaches higher degrees of generalization in fewer communication rounds. Conclusion: The proposed algorithm improves the efficiency and model performance of federated learning through distribution alignment. Abstract: Federated learning (FL) is a subfield of machine learning that avoids sharing local data with a central server, which can enhance privacy and scalability. The inability to consolidate data leads to a unique problem called dataset imbalance, where agents in a network do not have equal representation of the labels one is trying to learn to predict. In FL, fusing locally-trained models with unbalanced datasets may deteriorate the performance of global model aggregation, and reduce the quality of updated local models and the accuracy of the distributed agents' decisions. In this work, we introduce an Optimal Transport-based preprocessing algorithm that aligns the datasets by minimizing the distributional discrepancy of data along the edge devices. We accomplish this by leveraging Wasserstein barycenters when computing channel-wise averages. These barycenters are collected in a trusted central server where they collectively generate a target RGB space. By projecting our dataset towards this target space, we minimize the distributional discrepancy on a global level, which facilitates the learning process due to a minimization of variance across the samples. We demonstrate the capabilities of the proposed approach over the CIFAR-10 dataset, where we show its capability of reaching higher degrees of generalization in fewer communication rounds.
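
Editor's sketch: the abstract does not spell out how the barycenters are computed. As an illustration only, the block below uses the standard closed form for one-dimensional distributions: with equally sized samples, the Wasserstein-2 barycenter is the weighted average of the sorted samples (that is, of the quantile functions). Treating one colour channel's intensities as a 1D sample, and the function names themselves, are assumptions rather than details from the paper.

```python
def barycenter_1d(samples_per_client, weights=None):
    """Wasserstein-2 barycenter of 1D empirical distributions.

    Assumes every client contributes the same number of samples, so the
    barycenter is the weighted average of the sorted samples.
    """
    n = len(samples_per_client[0])
    if any(len(s) != n for s in samples_per_client):
        raise ValueError("all clients must provide the same number of samples")
    k = len(samples_per_client)
    if weights is None:
        weights = [1.0 / k] * k
    sorted_samples = [sorted(s) for s in samples_per_client]
    return [
        sum(w * s[i] for w, s in zip(weights, sorted_samples))
        for i in range(n)
    ]


def project_to_barycenter(samples, barycenter):
    """Map each sample to the barycenter value of the same rank.

    This is the monotone (optimal) transport map in 1D; the original
    ordering of the samples is preserved in the output.
    """
    order = sorted(range(len(samples)), key=lambda i: samples[i])
    out = [0.0] * len(samples)
    for rank, i in enumerate(order):
        out[i] = barycenter[rank]
    return out
```

For example, two clients holding channel intensities `[0, 2, 4]` and `[10, 12, 14]` share the barycenter `[5.0, 7.0, 9.0]`, and each client's values can then be mapped onto it rank by rank.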

cs.MM [Back]

[277] How Far Are We from Predicting Missing Modalities with Foundation Models?

Guanzhou Ke,Yi Xie,Xiaoli Wang,Guoqing Chao,Bo Wang,Shengfeng He

Main category: cs.MM

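TL;DR: Multimodal foundation models fall short on missing-modality prediction. The authors propose an agentic framework with a self-refinement mechanism that markedly improves prediction quality.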

Details

Motivation: Explores the potential of multimodal foundation models for missing-modality prediction and finds that existing approaches are weak at semantic extraction and at validating generated modalities. Method: Proposes an agentic framework with a self-refinement mechanism that dynamically formulates modality-aware strategies and iteratively refines the generated outputs. Result: Experiments show the method reduces FID for missing image prediction by at least 14% and MER for missing text prediction by at least 10%. Conclusion: The proposed framework addresses the limitations of existing models and improves the accuracy and robustness of missing-modality prediction. Abstract: Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality prediction remains underexplored. To investigate this, we categorize existing approaches into three representative paradigms, encompassing a total of 42 model variants, and conduct a comprehensive evaluation in terms of prediction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned predictions. To address these challenges, we propose an agentic framework tailored for missing modality prediction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image prediction by at least 14% and MER for missing text prediction by at least 10% compared to baselines.

cs.IR [Back]

[278] ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking

Xianming Li,Aamir Shakir,Rui Huang,Julius Lipp,Jing Li

Main category: cs.IR

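TL;DR: ProRank is a two-stage training method for small language models (SLMs) that uses reinforcement learning and fine-grained score learning to improve document reranking. It is computationally efficient and outperforms larger models.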

Details

Motivation: Current document reranking methods based on large language models (LLMs) are computationally expensive, while small language models (SLMs) struggle to understand task prompts without fine-tuning, which limits their effectiveness. Method: Proposes ProRank, a two-stage training approach: 1) a prompt warmup stage using reinforcement learning (GRPO) that helps SLMs understand the task and produce coarse-grained relevance scores; 2) a fine-grained score learning stage that further improves reranking quality. Result: ProRank performs strongly on the BEIR benchmark, and the lightweight ProRank-0.5B model even surpasses a 32B LLM reranker. Conclusion: Properly trained SLMs can deliver efficient, high-quality document reranking, and ProRank offers an effective way to apply SLMs to reranking tasks. Abstract: Reranking is fundamental to information retrieval and retrieval-augmented generation, with recent Large Language Models (LLMs) significantly advancing reranking quality. While recent advances with LLMs have significantly improved document reranking quality, current approaches primarily rely on large-scale LLMs (>7B parameters) through zero-shot prompting, presenting high computational costs. Small Language Models (SLMs) offer a promising alternative because of their efficiency, but our preliminary quantitative analysis reveals they struggle with understanding task prompts without fine-tuning. This limits their effectiveness for document reranking tasks. To address this issue, we introduce a novel two-stage training approach, ProRank, for SLM-based document reranking. First, we propose a prompt warmup stage using reinforcement learning GRPO to steer SLMs to understand task prompts and generate more accurate coarse-grained binary relevance scores for document reranking. Then, we continuously fine-tune the SLMs with a fine-grained score learning stage without introducing additional layers to further improve the reranking quality. Comprehensive experimental results demonstrate that the proposed ProRank consistently outperforms both the most advanced open-source and proprietary reranking models. Notably, our lightweight ProRank-0.5B model even surpasses the powerful 32B LLM reranking model on the BEIR benchmark, establishing that properly trained SLMs can achieve superior document reranking performance while maintaining computational efficiency.

cs.SE [Back]

[279] VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation

Yuansheng Ni,Ping Nie,Kai Zou,Xiang Yue,Wenhu Chen

Main category: cs.SE

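TL;DR: VisCode-200K is a large-scale instruction-tuning dataset for Python visualization and self-correction that significantly improves model performance on visualization tasks.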

Details

Motivation: Existing instruction-tuning datasets lack execution-grounded supervision and support for iterative code correction, which makes code generation for visualization tasks fragile and unreliable. Method: VisCode-200K contains over 200K examples, including validated plotting code and 45K multi-turn correction dialogues. It is used to fine-tune Qwen2.5-Coder-Instruct into VisCoder. Result: VisCoder significantly outperforms open-source baselines on PandasPlotBench and approaches the performance of proprietary models such as GPT-4o-mini. Conclusion: Feedback-driven learning effectively improves the generation of executable, visually accurate code. Abstract: Large language models (LLMs) often struggle with visualization tasks like plotting diagrams, charts, where success depends on both code correctness and visual semantics. Existing instruction-tuning datasets lack execution-grounded supervision and offer limited support for iterative code correction, resulting in fragile and unreliable plot generation. We present VisCode-200K, a large-scale instruction tuning dataset for Python-based visualization and self-correction. It contains over 200K examples from two sources: (1) validated plotting code from open-source repositories, paired with natural language instructions and rendered plots; and (2) 45K multi-turn correction dialogues from Code-Feedback, enabling models to revise faulty code using runtime feedback. We fine-tune Qwen2.5-Coder-Instruct on VisCode-200K to create VisCoder, and evaluate it on PandasPlotBench. VisCoder significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4o-mini. We further adopt a self-debug evaluation protocol to assess iterative repair, demonstrating the benefits of feedback-driven learning for executable, visually accurate code generation.

[280] CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking

Neeva Oza,Ishaan Govil,Parul Gupta,Dinesh Khandelwal,Dinesh Garg,Parag Singla

Main category: cs.SE

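TL;DR: The paper studies LLMs on the task of code-equivalence checking, introduces the CETBench dataset, and finds that simple code transformations significantly reduce LLM performance. Fine-tuning improves performance, and the analysis highlights the limits of LLMs' semantic understanding of code.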

Details

Motivation: Code-equivalence checking is an important problem for evaluating LLM capabilities in tasks such as code rewriting and translation, yet it remains relatively unexplored. Method: Builds the CETBench dataset by applying random series of code transformations to program pairs to produce (non-)equivalent pairs, and uses fine-tuning to improve LLM performance. Result: Simple code transformations cause a significant drop in LLM performance, and fine-tuning effectively recovers it. Conclusion: LLMs still have limitations in semantic understanding of code, and further research is needed. Abstract: LLMs have been extensively used for the task of automated code generation. In this work, we examine the applicability of LLMs for the related but relatively unexplored task of code-equivalence checking, i.e., given two programs, whether they are functionally equivalent or not. This is an important problem since benchmarking code equivalence can play a critical role in evaluating LLM capabilities for tasks such as code re-writing and code translation. Towards this end, we present CETBench - Code Equivalence with Transformations Benchmark, constructed via a repository of programs, where two programs in the repository may be solving the same or different tasks. Each instance in our dataset is obtained by taking a pair of programs in the repository and applying a random series of pre-defined code transformations, resulting in (non-)equivalent pairs. Our analysis on this dataset reveals a surprising finding that very simple code transformations in the underlying pair of programs can result in a significant drop in performance of SOTA LLMs for the task of code-equivalence checking. To remedy this, we present a simple fine-tuning-based approach to boost LLM performance on the transformed pairs of programs. Our approach for dataset generation is generic, and can be used with repositories with varying program difficulty levels and allows for applying varying numbers as well as kinds of transformations. In our experiments, we perform ablations over the difficulty level of original programs, as well as the kind of transformations used in generating pairs for equivalence checking. Our analysis presents deep insights into the working of LLMs for the task of code-equivalence, and points to the fact that they may still be far from what could be termed as a semantic understanding of the underlying code.
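
Editor's sketch: to make the idea of (non-)equivalent pairs produced by simple transformations concrete, the block below shows a toy pair. The specific transformations (renaming variables, rewriting a `for` loop as a `while` loop) and the off-by-one variant are illustrative assumptions, not CETBench's actual transformation set. The random-input check is only differential testing: agreement on sampled inputs is evidence of equivalence, not a proof.

```python
import random


def sum_of_squares(nums):
    """Original program."""
    total = 0
    for x in nums:
        total += x * x
    return total


def sum_of_squares_transformed(values):
    """Semantics-preserving rewrite: renamed variables, for -> while."""
    acc = 0
    idx = 0
    while idx < len(values):
        acc = acc + values[idx] * values[idx]
        idx += 1
    return acc


def sum_of_squares_broken(values):
    """Non-equivalent variant: starts at index 1 and skips the first item."""
    acc = 0
    idx = 1
    while idx < len(values):
        acc = acc + values[idx] * values[idx]
        idx += 1
    return acc


def agree_on_random_inputs(f, g, trials=200, seed=0):
    """Return False as soon as f and g disagree on a random integer list."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 8))]
        if f(xs) != g(xs):
            return False
    return True
```

The two rewrites look almost identical on the surface, yet only one preserves behaviour, which is the kind of distinction an equivalence checker has to make.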

cs.HC [Back]

[281] PromptCanvas: Composable Prompting Workspaces Using Dynamic Widgets for Exploration and Iteration in Creative Writing

Rifat Mehreen Amin,Oliver Hans Kühle,Daniel Buschek,Andreas Butz

Main category: cs.HC

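TL;DR: PromptCanvas is a concept that turns prompting into a composable, widget-based experience on an infinite canvas, giving users greater control over AI-generated content.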

Details

Motivation: Traditional conversational UIs are limited in how well they support creative work. PromptCanvas aims to reduce cognitive load and improve the creative process through a visual, interactive design. Method: Users create and customize widgets through system suggestions, user prompts, or manual input, and arrange them freely on an infinite canvas. Result: In a lab study (18 participants), PromptCanvas outperformed a traditional conversational UI on the Creativity Support Index and reduced mental demand and frustration. A follow-up field study (10 participants) confirmed these results. Conclusion: PromptCanvas demonstrates the potential of dynamic, customizable interfaces for improving collaborative writing with AI, encouraging new perspectives and ideas. Abstract: We introduce PromptCanvas, a concept that transforms prompting into a composable, widget-based experience on an infinite canvas. Users can generate, customize, and arrange interactive widgets representing various facets of their text, offering greater control over AI-generated content. PromptCanvas allows widget creation through system suggestions, user prompts, or manual input, providing a flexible environment tailored to individual needs. This enables deeper engagement with the creative process. In a lab study with 18 participants, PromptCanvas outperformed a traditional conversational UI on the Creativity Support Index. Participants found that it reduced cognitive load, with lower mental demand and frustration. Qualitative feedback revealed that the visual organization of thoughts and easy iteration encouraged new perspectives and ideas. A follow-up field study (N=10) confirmed these results, showcasing the potential of dynamic, customizable interfaces in improving collaborative writing with AI.

cs.AI [Back]

Chan-Wei Hu,Yueqi Wang,Shuo Xing,Chia-Ju Chen,Zhengzhong Tu

Main category: cs.AI

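TL;DR: The paper examines how retrieval-augmented generation (RAG) can address the limitations of large vision-language models (LVLMs) in dynamic real-world applications, presents a systematic analysis of the multimodal RAG pipeline, and reports performance gains.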

Details

Motivation: In dynamic real-world applications, LVLMs are limited by static training data, hallucinations, and an inability to verify claims against up-to-date external evidence. RAG offers a way to mitigate these problems. Method: Systematically analyzes three stages of the multimodal RAG pipeline: retrieval (modality configurations and retrieval strategies), re-ranking (mitigating positional bias and improving relevance), and generation (integrating retrieved candidates). Result: With the RAG framework, LVLMs gain an average performance boost of 5% without any fine-tuning. Conclusion: RAG lets LVLMs integrate external knowledge dynamically, improving their performance in real-world applications. Abstract: Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. Here in this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: on the modality configurations and retrieval strategies, (2) the re-ranking stage: on strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: we further investigate how to best integrate retrieved candidates into the final generation process. Finally, we extend to explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.

[283] Graph Counselor: Adaptive Graph Exploration via Multi-Agent Synergy to Enhance LLM Reasoning

Junqi Gao,Xiang Zou,YIng Ai,Dong Li,Yichen Niu,Biqing Qi,Jianxing Liu

Main category: cs.AI

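TL;DR: Graph Counselor is a GraphRAG method based on multi-agent collaboration. It uses an adaptive graph information extraction module and a multi-perspective self-reflection mechanism to overcome the limitations of existing methods in information aggregation and reasoning.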

Details

Motivation: Existing GraphRAG methods aggregate information inefficiently and use rigid reasoning mechanisms, so they cannot model multi-level graph data or adjust reasoning depth dynamically. Method: Proposes Graph Counselor, which combines the Adaptive Graph Information Extraction Module (AGIEM) with the Self-Reflection with Multiple Perspectives (SR) module to enable dynamic information extraction and reasoning refinement through multi-agent collaboration. Result: Experiments show that Graph Counselor outperforms existing methods on multiple graph reasoning tasks, with higher reasoning accuracy and better generalization. Conclusion: Through multi-agent collaboration and self-reflection, Graph Counselor improves GraphRAG performance and offers an effective approach to knowledge augmentation over complex graph data. Abstract: Graph Retrieval Augmented Generation (GraphRAG) effectively enhances external knowledge integration capabilities by explicitly modeling knowledge relationships, thereby improving the factual accuracy and generation quality of Large Language Models (LLMs) in specialized domains. However, existing methods suffer from two inherent limitations: 1) Inefficient Information Aggregation: They rely on a single agent and fixed iterative patterns, making it difficult to adaptively capture multi-level textual, structural, and degree information within graph data. 2) Rigid Reasoning Mechanism: They employ preset reasoning schemes, which cannot dynamically adjust reasoning depth nor achieve precise semantic correction. To overcome these limitations, we propose Graph Counselor, a GraphRAG method based on multi-agent collaboration. This method uses the Adaptive Graph Information Extraction Module (AGIEM), where Planning, Thought, and Execution Agents work together to precisely model complex graph structures and dynamically adjust information extraction strategies, addressing the challenges of multi-level dependency modeling and adaptive reasoning depth. Additionally, the Self-Reflection with Multiple Perspectives (SR) module improves the accuracy and semantic consistency of reasoning results through self-reflection and backward reasoning mechanisms. Experiments demonstrate that Graph Counselor outperforms existing methods in multiple graph reasoning tasks, exhibiting higher reasoning accuracy and generalization ability. Our code is available at https://github.com/gjq100/Graph-Counselor.git.

[284] AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

Akshat Naik,Patrick Quinn,Guillermo Bosch,Emma Gouné,Francisco Javier Campos Zabala,Jason Ross Brown,Edward James Young

Main category: cs.AI

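TL;DR: The paper introduces AgentMisalignment, a benchmark for measuring the misalignment propensity of LLM agents in realistic scenarios, and finds that both model capability and system prompts strongly affect misaligned behaviour.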

Details

Motivation: As LLM agents become more widespread, misalignment risks increase, yet the likelihood of agents attempting misaligned behaviour in real-world settings (misalignment propensity) remains poorly understood. Method: Introduces the AgentMisalignment benchmark, a suite of realistic scenarios that evaluates agent behaviour in subcategories including goal-guarding, resisting shutdown, sandbagging, and power-seeking. Result: More capable frontier models show higher misalignment on average, and system prompts sometimes influence misaligned behaviour more than the choice of model. Conclusion: Current alignment methods fail to generalise to LLM agents, and further propensity evaluations are needed as autonomous systems become more prevalent. Abstract: As Large Language Model (LLM) agents become more widespread, associated misalignment risks increase. Prior work has examined agents' ability to enact misaligned behaviour (misalignment capability) and their compliance with harmful instructions (misuse propensity). However, the likelihood of agents attempting misaligned behaviours in real-world settings (misalignment propensity) remains poorly understood. We introduce a misalignment propensity benchmark, AgentMisalignment, consisting of a suite of realistic scenarios in which LLM agents have the opportunity to display misaligned behaviour. We organise our evaluations into subcategories of misaligned behaviours, including goal-guarding, resisting shutdown, sandbagging, and power-seeking. We report the performance of frontier models on our benchmark, observing higher misalignment on average when evaluating more capable models. Finally, we systematically vary agent personalities through different system prompts. We find that persona characteristics can dramatically and unpredictably influence misalignment tendencies -- occasionally far more than the choice of model itself -- highlighting the importance of careful system prompt engineering for deployed AI agents. Our work highlights the failure of current alignment methods to generalise to LLM agents, and underscores the need for further propensity evaluations as autonomous systems become more prevalent.

[285] Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models

Soumya Suvra Ghosal,Souradip Chakraborty,Avinash Reddy,Yifu Lu,Mengdi Wang,Dinesh Manocha,Furong Huang,Mohammad Ghavamzadeh,Amrit Singh Bedi

Main category: cs.AI

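TL;DR: The study finds that extending thinking at test time (e.g., with prompts like "Wait" or "Let me rethink") first improves and then degrades model performance ("overthinking"). A probabilistic model shows that extra thinking increases output variance, which creates an illusion of improved reasoning. The authors propose parallel thinking, which selects the most consistent answer by majority vote and achieves up to 20% higher accuracy.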

Details

Motivation: Investigates whether extending thinking at test time truly improves the performance of reasoning models, and exposes the problems it can cause. Method: Runs an empirical study across models and benchmarks, proposes a probabilistic model to explain the observed pattern, and introduces parallel thinking, an approach inspired by Best-of-N sampling. Result: Extended thinking makes performance rise and then fall (overthinking), while parallel thinking achieves up to 20% higher accuracy than extended thinking. Conclusion: Extending thinking at test time is not an effective use of the inference budget, and parallel thinking is the better choice. Abstract: Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek R1) have led to a popular belief that extending thinking traces using prompts like "Wait" or "Let me rethink" can improve performance. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to "overthinking". To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance, creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from "more thinking" are not true indicators of improved reasoning, but artifacts stemming from the connection between model uncertainty and evaluation metric. This suggests that test-time scaling through extended thinking is not an effective way to utilize the inference thinking budget. Recognizing these limitations, we introduce an alternative test-time scaling approach, parallel thinking, inspired by Best-of-N sampling. Our method generates multiple independent reasoning paths within the same inference budget and selects the most consistent response via majority vote, achieving up to 20% higher accuracy compared to extended thinking. This provides a simple yet effective mechanism for test-time scaling of reasoning models.
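
Editor's sketch: the selection step of parallel thinking, as the abstract describes it, amounts to sampling several independent answers and keeping the most frequent one. The block below is a minimal illustration of that majority vote, not the authors' implementation. `sample_answer` is a hypothetical callable standing in for one independent reasoning path, and the tie-breaking rule (first answer seen wins) is an assumption.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def parallel_thinking(sample_answer, prompt, n_paths):
    """Draw n_paths independent answers for the prompt and vote.

    sample_answer is a hypothetical function (prompt -> final answer string)
    representing one short, independent reasoning path.
    """
    answers = [sample_answer(prompt) for _ in range(n_paths)]
    return majority_vote(answers)
```

In practice the answers would need normalizing before voting (for instance stripping whitespace or canonicalizing numbers), otherwise equivalent answers would split the vote.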

Table of Contents

cs.CV [Back]

[1] Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection

[2] Farm-LightSeek: An Edge-centric Multimodal Agricultural IoT Data Analytics Framework with Lightweight LLMs

[3] Improvement of human health lifespan with hybrid group pose estimation methods

[4] PALADIN : Robust Neural Fingerprinting for Text-to-Image Diffusion Models

[5] EdgeVidSum: Real-Time Personalized Video Summarization at the Edge

[6] FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution

[7] Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks

[8] Vid-SME: Membership Inference Attacks against Large Video Understanding Models

[9] TerraIncognita: A Dynamic Benchmark for Species Discovery Using Frontier Models

[10] Impact of Tuning Parameters in Deep Convolutional Neural Network Using a Crack Image Dataset

[11] Continual Learning in Vision-Language Models via Aligned Model Merging

[12] MINT: Memory-Infused Prompt Tuning at Test-time for CLIP

[13] Multimodal Generative AI with Autoregressive LLMs for Human Motion Understanding and Generation: A Way Forward

[14] Human Fall Detection using Transfer Learning-based 3D CNN

[15] HueManity: Probing Fine-Grained Visual Perception in MLLMs

[16] Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

[17] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

[18] FLEX: A Large-Scale Multi-Modal Multi-Action Dataset for Fitness Action Quality Assessment

[19] Channel-adaptive Cross-modal Generative Semantic Communication for Point Cloud Transmission

[20] ConMamba: Contrastive Vision Mamba for Plant Disease Detection

[21] OpenCarbon: A Contrastive Learning-based Cross-Modality Neural Approach for High-Resolution Carbon Emission Prediction Using Open Data

[22] Pre-trained Vision-Language Models Assisted Noisy Partial Label Learning

[23] Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas

[24] Learning Optical Flow Field via Neural Ordinary Differential Equation

[25] SportMamba: Adaptive Non-Linear Multi-Object Tracking with State Space Models for Team Sports

[26] Seeing the Arrow of Time in Large Multimodal Models

[27] Semiconductor SEM Image Defect Classification Using Supervised and Semi-Supervised Learning with Vision Transformers

[28] Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views

[29] A Foundation Model for Spatial Proteomics

[30] Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery

[31] Temporal Vegetation Index-Based Unsupervised Crop Stress Detection via Eigenvector-Guided Contrastive Learning

[32] ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

[33] Geometric Visual Fusion Graph Neural Networks for Multi-Person Human-Object Interaction Recognition in Videos

[34] RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

[35] The effects of using created synthetic images in computer vision training

[36] RoNFA: Robust Neural Field-based Approach for Few-Shot Image Classification with Noisy Labels

[37] MamFusion: Multi-Mamba with Temporal Fusion for Partially Relevant Video Retrieval

[38] Heterogeneous Skeleton-Based Action Representation Learning

[39] CHIME: Conditional Hallucination and Integrated Multi-scale Enhancement for Time Series Diffusion Model

[40] EDCFlow: Exploring Temporally Dense Difference Maps for Event-based Optical Flow Estimation

[41] DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

[42] Target Semantics Clustering via Text Representations for Robust Universal Domain Adaptation

[43] Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning

[44] Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting

[45] WIFE-Fusion:Wavelet-aware Intra-inter Frequency Enhancement for Multi-model Image Fusion

[46] DiagNet: Detecting Objects using Diagonal Constraints on Adjacency Matrix of Graph Neural Network

[47] ViTSGMM: A Robust Semi-Supervised Image Recognition Network Using Sparse Labels

[48] A Large-Scale Referring Remote Sensing Image Segmentation Dataset and Benchmark

[49] BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance

[50] Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts

[51] ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning

[52] Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision

[53] Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI

[54] PDSE: A Multiple Lesion Detector for CT Images using PANet and Deformable Squeeze-and-Excitation Block

[55] VLMs Can Aggregate Scattered Training Patches

[56] Isharah: A Large-Scale Multi-Scene Dataset for Continuous Sign Language Recognition

[57] Negative-Guided Subject Fidelity Optimization for Zero-Shot Subject-Driven Generation

[58] FingerVeinSyn-5M: A Million-Scale Dataset and Benchmark for Finger Vein Recognition

[59] Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

[60] Images are Worth Variable Length of Representations

[61] YOND: Practical Blind Raw Image Denoising Free from Camera-Specific Data Dependency

[62] EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation

[63] MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection

[64] INP-Former++: Advancing Universal Anomaly Detection via Intrinsic Normal Prototypes and Residual Learning

[65] Zero-Shot Temporal Interaction Localization for Egocentric Videos

[66] Intersectional Bias in Pre-Trained Image Recognition Models

[67] Accelerating SfM-based Pose Estimation with Dominating Set

[68] BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation

[69] How PARTs assemble into wholes: Learning the relative composition of images

[70] PRJ: Perception-Retrieval-Judgement for Generated Images

[71] DSSAU-Net:U-Shaped Hybrid Network for Pubic Symphysis and Fetal Head Segmentation

[72] Advancements in Artificial Intelligence Applications for Cardiovascular Disease Research

[73] OV-COAST: Cost Aggregation with Optimal Transport for Open-Vocabulary Semantic Segmentation

[74] AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives

[75] OSGNet @ Ego4D Episodic Memory Challenge 2025

[76] PlückeRF: A Line-based 3D Representation for Few-view Reconstruction

[77] FSHNet: Fully Sparse Hybrid Network for 3D Object Detection

[78] ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices