cs.CV

[1] Can Geometry Save Central Views for Sports Field Registration?

Floriane Magera,Thomas Hoyoux,Martin Castin,Olivier Barnich,Anthony Cioppa,Marc Van Droogenbroeck

Main category: cs.CV

TL;DR: Proposes a method that derives points and lines from circle correspondences, addressing the difficulty of exploiting circle markings in sports field registration.

Details

Motivation: Sports field registration usually relies on sparse, unevenly distributed line markings, while close-up views of the central area show only line and circle markings, which existing methods struggle to exploit.

Method: Derive a set of points and lines from circle correspondences so that circle markings can be included in a system of linear equations, for both sports field registration and image annotation.

Result: Experiments show that the method effectively complements top-performing detectors and achieves sports field registration in difficult scenarios.

Conclusion: The method addresses the challenge of exploiting circle markings and broadens the applicability of sports field registration.

Abstract: Single-frame sports field registration often serves as the foundation for extracting 3D information from broadcast videos, enabling applications related to sports analytics, refereeing, or fan engagement. As sports fields have rigorous specifications in terms of shape and dimensions of their line, circle and point components, sports field markings are commonly used as calibration targets for this task. However, because of the sparse and uneven distribution of field markings, close-up camera views around central areas of the field often depict only line and circle markings. On these views, sports field registration is challenging for the vast majority of existing methods, as they focus on leveraging line field markings and their intersections. It is indeed a challenge to include circle correspondences in a set of linear equations. In this work, we propose a novel method to derive a set of points and lines from circle correspondences, enabling the exploitation of circle correspondences for both sports field registration and image annotation. In our experiments, we illustrate the benefits of our bottom-up geometric method against top-performing detectors and show that our method successfully complements them, enabling sports field registration in difficult scenarios.
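To illustrate why point correspondences fit naturally into a set of linear equations, the sketch below estimates a field-to-image homography from four point correspondences. It is a generic textbook construction with the last homography entry fixed to 1, not the authors' method for circles; the function names and the sample matrix `H_true` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_points(field_pts, image_pts):
    """Estimate H (3x3, h33 fixed to 1) from exactly four correspondences.

    Each pair (X, Y) -> (u, v) contributes two equations that are linear
    in the eight unknown entries of H.
    """
    a, b = [], []
    for (X, Y), (u, v) in zip(field_pts, image_pts):
        a.append([X, Y, 1.0, 0.0, 0.0, 0.0, -u * X, -u * Y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, X, Y, 1.0, -v * X, -v * Y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def project(H, pt):
    """Map a field point to image coordinates with homography H."""
    X, Y = pt
    w = H[2][0] * X + H[2][1] * Y + H[2][2]
    u = (H[0][0] * X + H[0][1] * Y + H[0][2]) / w
    v = (H[1][0] * X + H[1][1] * Y + H[1][2]) / w
    return (u, v)


# Hypothetical ground-truth homography, used only to generate test data.
H_true = [[2.0, 0.1, 5.0], [0.2, 1.5, -3.0], [0.001, 0.002, 1.0]]
field = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
image = [project(H_true, p) for p in field]
H_est = homography_from_points(field, image)
```

A circle has no such direct point-to-point form, which is why deriving points and lines from it first, as the abstract describes, lets it enter the same kind of system.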

[2] Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment

Jiayang Sun,Hongbo Wang,Jie Cao,Huaibo Huang,Ran He

Main category: cs.CV

TL;DR: The Marmot framework uses multi-agent reasoning to improve image generation accuracy in multi-object scenes, addressing counting, attribute, and spatial relationship errors.

Details

Motivation: Diffusion models struggle with counting, attributes, and spatial relationships in complex multi-object scenes, so image-text alignment and editing coherence need improvement.

Method: Adopt a divide-and-conquer strategy that decomposes self-correction into three dimensions (counting, attributes, and spatial relationships), and build a multi-agent editing system with a decision-execution-verification mechanism and a Pixel-Domain Stitching Smoother.

Result: Experiments show that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships.

Conclusion: Through multi-agent reasoning and its optimization approach, Marmot effectively addresses image generation problems in multi-object scenes.

Abstract: While diffusion models excel at generating high-quality images, they often struggle with accurate counting, attributes, and spatial relationships in complex multi-object scenes. To address these challenges, we propose Marmot, a novel and generalizable framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting, enhancing image-text alignment and facilitating more coherent multi-object image editing. Our framework adopts a divide-and-conquer strategy that decomposes the self-correction task into three critical dimensions (counting, attributes, and spatial relationships), which are further divided into object-level subtasks. We construct a multi-agent editing system featuring a decision-execution-verification mechanism, effectively mitigating inter-object interference and enhancing editing reliability. To resolve the problem of subtask integration, we propose a Pixel-Domain Stitching Smoother that employs mask-guided two-stage latent space optimization. This innovation enables parallel processing of subtask results, thereby enhancing runtime efficiency while eliminating multi-stage distortion accumulation. Extensive experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.

[3] Edge-Based Learning for Improved Classification Under Adversarial Noise

Manish Kansana,Keyan Alexander Rahimi,Elias Hossain,Iman Dehzangi,Noorbakhsh Amiri Golilarz

Main category: cs.CV

TL;DR: The study examines the effect of adversarial noise on image classification and finds that training on edge features improves robustness to adversarial perturbations.

Details

Motivation: Adversarial noise misleads deep learning models and lowers recognition accuracy; the study explores whether specific features, such as edges, can improve robustness.

Method: Apply FGSM adversarial noise and train models separately on original images and on edge images, then compare their behavior under perturbation.

Result: Edge-based models are more resistant to adversarial attacks, although the accuracy gain after retraining is marginally greater on the original data.

Conclusion: Edge-based learning can strengthen robustness to adversarial perturbations, while the original data retains a small advantage after retraining.

Abstract: Adversarial noise introduces small perturbations in images, misleading deep learning models into misclassification and significantly impacting recognition accuracy. In this study, we analyzed the effects of Fast Gradient Sign Method (FGSM) adversarial noise on image classification and investigated whether training on specific image features can improve robustness. We hypothesize that while adversarial noise perturbs various regions of an image, edges may remain relatively stable and provide essential structural information for classification. To test this, we conducted a series of experiments using brain tumor and COVID datasets. Initially, we trained the models on clean images and then introduced subtle adversarial perturbations, which caused deep learning models to significantly misclassify the images. Retraining on a combination of clean and noisy images led to improved performance. To evaluate the robustness of the edge features, we extracted edges from the original/clean images and trained the models exclusively on edge-based representations. When noise was introduced to the images, the edge-based models demonstrated greater resilience to adversarial attacks compared to those trained on the original or clean images. These results suggest that while adversarial noise is able to exploit complex non-edge regions significantly more than edges, the improvement in accuracy after retraining is marginally greater for the original data than for the edges. Thus, leveraging edge-based learning can improve the resilience of deep learning models against adversarial perturbations.
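As a reminder of what FGSM does, here is a minimal sketch on a two-feature logistic regression classifier rather than the deep networks used in the paper. The weights, input, and step size are hypothetical values chosen for illustration: the attack moves the input by `eps` in the direction of the sign of the loss gradient with respect to the input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, eps):
    """One FGSM step: x_adv = x + eps * sign(d loss / d x).

    For binary cross-entropy with a logistic model, the gradient of the
    loss with respect to the input is (p - y) * w.
    """
    p = predict(x, w, b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and sample, for illustration only.
w, b = [1.0, -2.0], 0.0
x, y = [0.5, 0.1], 1
x_adv = fgsm(x, y, w, b, eps=0.2)
```

With these values the clean input is classified as class 1 and the perturbed input falls on the other side of the decision boundary, even though each feature moved by only 0.2.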

[4] VideoMultiAgents: A Multi-Agent Framework for Video Question Answering

Noriyuki Kugo,Xiang Li,Zixin Li,Ashish Gupta,Arpandeep Khatua,Nidhish Jain,Chaitanya Patel,Yuta Kyuragi,Masamoto Tanabiki,Kazuki Kozuka,Ehsan Adeli

Main category: cs.CV

TL;DR: The VideoMultiAgents framework improves video question answering through multimodal reasoning, combining vision, scene graph, and text processing agents, and outperforms existing methods.

Details

Motivation: Existing video question answering methods feed frame-level captions into a single model, which makes it hard to capture temporal and interactive context.

Method: Introduce the VideoMultiAgents framework, which integrates vision, scene graph, and text agents, supplemented by question-guided caption generation.

Result: Achieves state-of-the-art results on Intent-QA (79.0%), the EgoSchema subset (75.4%), and NExT-QA (79.6%).

Conclusion: Multi-agent collaboration and question-guided caption generation improve video question answering performance.

Abstract: Video Question Answering (VQA) inherently relies on multimodal reasoning, integrating visual, temporal, and linguistic cues to achieve a deeper understanding of video content. However, many existing methods rely on feeding frame-level captions into a single model, making it difficult to adequately capture temporal and interactive contexts. To address this limitation, we introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. It enhances video understanding by leveraging complementary multimodal reasoning from independently operating agents. Our approach is also supplemented with question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions directly relevant to a given query, thus improving the answer accuracy. Experimental results demonstrate that our method achieves state-of-the-art performance on Intent-QA (79.0%, +6.2% over previous SOTA), EgoSchema subset (75.4%, +3.4%), and NExT-QA (79.6%, +0.4%).

[5] Long-Distance Field Demonstration of Imaging-Free Drone Identification in Intracity Environments

Junran Guo,Tonglin Mu,Keyuan Li,Jianing Li,Ziyang Luo,Ye Chen,Xiaodong Fan,Jinquan Huang,Minjie Liu,Jinbei Zhang,Ruoyang Qi,Naiting Gu,Shihai Sun

Main category: cs.CV

TL;DR: The paper combines ResNet with D²SP²-LiDAR to improve long-range detection of small targets such as drones, extending the detection range to 5 km with high-accuracy pose and type identification.

Details

Motivation: Traditional imaging methods are limited in range, resolution, and cost, while D²SP²-LiDAR simplifies the system but has a limited detection range. A more efficient approach to long-range detection is needed.

Method: Combine a residual neural network (ResNet) with D²SP²-LiDAR and refine the observation model to detect and accurately identify drones at ranges of up to 5 km.

Result: Experiments show 94.93% pose identification accuracy and 97.99% type classification accuracy at 5 km, outperforming conventional imaging-based methods.

Conclusion: The method demonstrates the potential of imaging-free techniques for long-range detection of small targets, including under weak-signal, low signal-to-noise-ratio conditions in real-world scenarios.

Abstract: Detecting small objects, such as drones, over long distances presents a significant challenge with broad implications for security, surveillance, environmental monitoring, and autonomous systems. Traditional imaging-based methods rely on high-resolution image acquisition, but are often constrained by range, power consumption, and cost. In contrast, data-driven single-photon-single-pixel light detection and ranging (D²SP²-LiDAR) provides an imaging-free alternative, directly enabling target identification while reducing system complexity and cost. However, its detection range has been limited to a few hundred meters. Here, we introduce a novel integration of residual neural networks (ResNet) with D²SP²-LiDAR, incorporating a refined observation model to extend the detection range to 5 km in an intracity environment while enabling high-accuracy identification of drone poses and types. Experimental results demonstrate that our approach not only outperforms conventional imaging-based recognition systems, but also achieves 94.93% pose identification accuracy and 97.99% type classification accuracy, even under weak signal conditions with long distances and low signal-to-noise ratios (SNRs). These findings highlight the potential of imaging-free methods for robust long-range detection of small targets in real-world scenarios.

[6] An on-production high-resolution longitudinal neonatal fingerprint database in Brazil

Luiz F. P. Southier,Marcelo Filipak,Luiz A. Zanlorensi,Ildefonso Wasilevski,Fabio Favarim,Jefferson T. Oliva,Marcelo Teixeira,Dalcimar Casanova

Main category: cs.CV

TL;DR: The study aims to build a high-quality biometric database of neonatal fingerprints to support training and evaluation of machine learning models, and thereby improve the accuracy of neonatal biometric identification systems.

Details

Motivation: Identification during the neonatal period is critical for survival, but existing methods are limited by physiological changes such as finger growth, and comprehensive datasets are lacking.

Method: Design and develop a neonatal fingerprint database acquired at multiple early life stages, for training and evaluating deep learning models.

Result: The dataset is expected to support the development of more accurate deep learning models that outperform conventional scaling-based methods.

Conclusion: The study lays the groundwork for biometric identification systems tailored to the unique developmental trajectory of newborns.

Abstract: The neonatal period is critical for survival, requiring accurate and early identification to enable timely interventions such as vaccinations, HIV treatment, and nutrition programs. Biometric solutions offer potential for child protection by helping to prevent baby swaps, locate missing children, and support national identity systems. However, developing effective biometric identification systems for newborns remains a major challenge due to the physiological variability caused by finger growth, weight changes, and skin texture alterations during early development. Current literature has attempted to address these issues by applying scaling factors to emulate growth-induced distortions in minutiae maps, but such approaches fail to capture the complex and non-linear growth patterns of infants. A key barrier to progress in this domain is the lack of comprehensive, longitudinal biometric datasets capturing the evolution of neonatal fingerprints over time. This study addresses this gap by focusing on designing and developing a high-quality biometric database of neonatal fingerprints, acquired at multiple early life stages. The dataset is intended to support the training and evaluation of machine learning models aimed at emulating the effects of growth on biometric features. We hypothesize that such a dataset will enable the development of more robust and accurate Deep Learning-based models, capable of predicting changes in the minutiae map with higher fidelity than conventional scaling-based methods. Ultimately, this effort lays the groundwork for more reliable biometric identification systems tailored to the unique developmental trajectory of newborns.

[7] Forging and Removing Latent-Noise Diffusion Watermarks Using a Single Image

Anubhav Jain,Yuya Kobayashi,Naoki Murata,Yuhta Takida,Takashi Shibuya,Yuki Mitsufuji,Niv Cohen,Nasir Memon,Julian Togelius

Main category: cs.CV

TL;DR: The paper proposes a black-box adversarial attack on diffusion model watermarks that needs only a single watermarked sample to forge or remove a watermark, exposing the fragility of existing watermarking techniques.

Details

Motivation: Existing watermarking techniques usually embed a secret key in the initial noise and are considered hard to remove or forge. The paper aims to expose their vulnerabilities and motivate improvements.

Method: Exploit the many-to-one mapping between images and initial noise, perturbing an image so that it enters or exits the watermarked region in order to forge or remove the watermark.

Result: The attack is shown to be effective against multiple watermarking schemes (Tree-Ring, RingID, and others) and diffusion models (SDv1.4 and SDv2.0).

Conclusion: Existing watermarking techniques have vulnerabilities, and further research is needed to improve them.

Abstract: Watermarking techniques are vital for protecting intellectual property and preventing fraudulent use of media. Most previous watermarking schemes designed for diffusion models embed a secret key in the initial noise. The resulting pattern is often considered hard to remove and forge into unrelated images. In this paper, we propose a black-box adversarial attack without presuming access to the diffusion model weights. Our attack uses only a single watermarked example and is based on a simple observation: there is a many-to-one mapping between images and initial noises. There are regions in the clean image latent space pertaining to each watermark that get mapped to the same initial noise when inverted. Based on this intuition, we propose an adversarial attack to forge the watermark by introducing perturbations to the images such that we can enter the region of watermarked images. We show that we can also apply a similar approach for watermark removal by learning perturbations to exit this region. We report results on multiple watermarking schemes (Tree-Ring, RingID, WIND, and Gaussian Shading) across two diffusion models (SDv1.4 and SDv2.0). Our results demonstrate the effectiveness of the attack and expose vulnerabilities in the watermarking methods, motivating future research on improving them.

[8] A Transformer-based Multimodal Fusion Model for Efficient Crowd Counting Using Visual and Wireless Signals

Zhe Cui,Yuli Li,Le-Nam Tran

Main category: cs.CV

TL;DR: TransFusion is a multimodal fusion model for crowd counting that combines CSI with image data, using a Transformer and CNNs to extract global and local features for high-accuracy counting.

Details

Motivation: Existing single-modal crowd counting models suffer from information loss and suboptimal performance, so multimodal fusion is needed to improve accuracy.

Method: Propose TransFusion, which combines a Transformer (global features) with CNNs (local features) to fuse CSI and image data.

Result: Experiments show that TransFusion performs well in both counting accuracy and efficiency.

Conclusion: Multimodal fusion combined with global and local feature extraction is an effective way to improve crowd counting performance.

Abstract: Current crowd-counting models often rely on single-modal inputs, such as visual images or wireless signal data, which can result in significant information loss and suboptimal recognition performance. To address these shortcomings, we propose TransFusion, a novel multimodal fusion-based crowd-counting model that integrates Channel State Information (CSI) with image data. By leveraging the powerful capabilities of Transformer networks, TransFusion effectively combines these two distinct data modalities, enabling the capture of comprehensive global contextual information that is critical for accurate crowd estimation. However, while transformers are well suited to capturing global features, they may fail to identify the finer-grained, local details essential for precise crowd counting. To mitigate this, we incorporate Convolutional Neural Networks (CNNs) into the model architecture, enhancing its ability to extract detailed local features that complement the global context provided by the Transformer. Extensive experimental evaluations demonstrate that TransFusion achieves high accuracy with minimal counting errors while maintaining superior efficiency.

[9] Integration Flow Models

Jingjing Wang,Dan Zhang,Joshua Luo,Yin Yang,Feng Luo

Main category: cs.CV

TL;DR: The paper proposes Integration Flow, which directly learns the integral of ODE trajectory paths, avoiding the discretization error and instability of conventional ODE methods and improving generative model performance.

Details

Motivation: Conventional ODE-based generative models suffer from discretization error and training instability, which limit sample quality and efficiency. The paper aims to address both by learning the integral of the ODE trajectory directly.

Method: Propose Integration Flow, which directly learns the integral of ODE trajectories and introduces the target state as an anchor for the reverse-time dynamics to improve stability and accuracy.

Result: On CIFAR10 and ImageNet, Integration Flow improves the performance of existing ODE-based models such as diffusion models, Rectified Flows, and PFGM++.

Conclusion: With a unified structure and direct learning of the integral, Integration Flow addresses key problems of ODE-based generative models and achieves efficient, high-quality sample generation.

Abstract: Ordinary differential equation (ODE) based generative models have emerged as a powerful approach for producing high-quality samples in many applications. However, the ODE-based methods either suffer from the discretization error of numerical ODE solvers, which restricts the quality of samples when only a few NFEs are used, or struggle with training instability. In this paper, we propose Integration Flow, which directly learns the integral of ODE-based trajectory paths without solving the ODE functions. Moreover, Integration Flow explicitly incorporates the target state $\mathbf{x}_0$ as the anchor state in guiding the reverse-time dynamics. We have theoretically proven this can contribute to both stability and accuracy. To the best of our knowledge, Integration Flow is the first model with a unified structure to estimate ODE-based generative models and the first to show the exact straightness of 1-Rectified Flow without reflow. Through theoretical analysis and empirical evaluations, we show that Integration Flow achieves improved performance when applied to existing ODE-based models, such as diffusion models, Rectified Flows, and PFGM++. Specifically, Integration Flow achieves one-step generation on CIFAR10 with FIDs of 2.86 for the Variance Exploding (VE) diffusion model, 3.36 for rectified flow without reflow, and 2.91 for PFGM++; and on ImageNet with FIDs of 4.09 for VE diffusion model, 4.35 for rectified flow without reflow and 4.15 for PFGM++.
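The discretization error mentioned in the abstract can be seen on a toy problem. The sketch below is not a generative model and not Integration Flow; it integrates the scalar ODE dx/dt = -x with explicit Euler steps and compares the result with the exact solution, showing how the error grows as the number of function evaluations shrinks. The function name and step counts are illustrative choices.

```python
import math


def euler_decay(n_steps, t_end=1.0, x0=1.0):
    """Integrate dx/dt = -x from t=0 to t_end with n_steps explicit Euler steps."""
    h = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x += h * (-x)  # one function evaluation per step
    return x


# Exact solution at t=1 is x0 * exp(-1).
exact = math.exp(-1.0)
errors = {n: abs(euler_decay(n) - exact) for n in (1, 4, 16, 64)}
for n, err in errors.items():
    print(f"{n:>3} steps: absolute error {err:.4f}")
```

A single Euler step lands far from the exact value, and the gap shrinks only as more steps are spent, which is the trade-off that one-step generation methods try to avoid.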

[10] Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains

Juntian Zhang,Chuanqi Cheng,Yuhan Liu,Wei Liu,Jian Luan,Rui Yan

Main category: cs.CV

TL;DR: The paper proposes the Focus-Centric Visual Chain paradigm and builds the VISC-150K dataset with the Focus-Centric Data Synthesis method, improving vision-language model performance on multi-image tasks.

Details

Motivation: Real-world scenarios involve complex multi-image inputs on which existing vision-language models degrade notably, so new methods are needed to improve perception and comprehension.

Method: Propose the Focus-Centric Visual Chain paradigm and use Focus-Centric Data Synthesis to synthesize high-quality data, building the VISC-150K dataset.

Result: On seven multi-image benchmarks, average performance improves by 3.16% and 2.24% for two model architectures, without compromising general vision-language capabilities.

Conclusion: The study is a meaningful advance for vision-language systems that handle complex visual scenarios.

Abstract: Vision-language models (VLMs) achieve remarkable success in single-image tasks. However, real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline as models struggle to disentangle critical information scattered across complex visual features. In this work, we propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs' perception, comprehension, and reasoning abilities in multi-image scenarios. To facilitate this paradigm, we propose Focus-Centric Data Synthesis, a scalable bottom-up approach for synthesizing high-quality data with elaborate reasoning paths. Through this approach, we construct VISC-150K, a large-scale dataset with reasoning data in the form of Focus-Centric Visual Chain, specifically designed for multi-image tasks. Experimental results on seven multi-image benchmarks demonstrate that our method achieves average performance gains of 3.16% and 2.24% across two distinct model architectures, without compromising the general vision-language capabilities. Our study represents a significant step toward more robust and capable vision-language systems that can handle complex visual scenarios.

[11] Remote Sensing Imagery for Flood Detection: Exploration of Augmentation Strategies

Vladyslav Polushko,Damjan Hatic,Ronald Rösch,Thomas März,Markus Rauhut,Andreas Weinmann

Main category: cs.CV

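TL;DR: The paper explores how different data augmentation strategies affect deep learning networks for detecting river floods in RGB imagery.

Details

Motivation: Floods are a global problem, and a fast, effective response requires accurate and timely information about affected areas. Remote sensing imagery combined with deep learning can improve flood detection accuracy.

Method: Using the BlessemFlood21 dataset, test augmentation strategies ranging from basic to more complex (such as optical distortion) to refine the training of deep learning segmentation networks.

Result: The experiments aim to identify effective augmentation strategies that improve flood detection models.

Conclusion: Refining the augmentation strategy can improve the performance of deep learning networks for flood detection.

Abstract: Floods cause serious problems around the world. Responding quickly and effectively requires accurate and timely information about the affected areas. The effective use of Remote Sensing images for accurate flood detection requires specific detection methods. Typically, Deep Neural Networks are employed, which are trained on specific datasets. For the purpose of river flood detection in RGB imagery, we use the BlessemFlood21 dataset. Here, we explore the use of different augmentation strategies, ranging from basic approaches to more complex techniques, including optical distortion. By identifying effective strategies, we aim to refine the training process of state-of-the-art Deep Learning segmentation networks.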

[12] FreBIS: Frequency-Based Stratification for Neural Implicit Surface Representations

Naoko Sawada,Pedro Miraldo,Suhas Lohit,Tim K. Marks,Moitreya Chatterjee

Main category: cs.CV

TL;DR: The paper proposes FreBIS, a neural implicit surface representation that encodes surface information at different frequencies in separate strata, improving 3D reconstruction quality for complex scenes.

Details

Motivation: Existing neural implicit surface representation methods perform poorly on complex scenes because they use a single encoder to capture surface information at all frequencies simultaneously.

Method: FreBIS stratifies the scene by surface frequency, processes each frequency level with a dedicated encoder, and uses a redundancy-aware weighting module to encourage complementary encoded features.

Result: Experiments on the BlendedMVS dataset show that FreBIS improves 3D surface reconstruction quality and rendering fidelity.

Conclusion: Through stratified encoding and complementary feature extraction, FreBIS addresses neural implicit surface representation for complex scenes.

Abstract: Neural implicit surface representation techniques are in high demand for advancing technologies in augmented reality/virtual reality, digital twins, autonomous navigation, and many other fields. With their ability to model object surfaces in a scene as a continuous function, such techniques have made remarkable strides recently, especially over classical 3D surface reconstruction methods, such as those that use voxels or point clouds. However, these methods struggle with scenes that have varied and complex surfaces, principally because they model any given scene with a single encoder network that is tasked with capturing all of the low- through high-frequency surface information in the scene simultaneously. In this work, we propose a novel neural implicit surface representation approach called FreBIS to overcome this challenge. FreBIS works by stratifying the scene based on the frequency of surfaces into multiple frequency levels, with each level (or a group of levels) encoded by a dedicated encoder. Moreover, FreBIS encourages these encoders to capture complementary information by promoting mutual dissimilarity of the encoded features via a novel, redundancy-aware weighting module. Empirical evaluations on the challenging BlendedMVS dataset indicate that replacing the standard encoder in an off-the-shelf neural surface reconstruction method with our frequency-stratified encoders yields significant improvements. These enhancements are evident both in the quality of the reconstructed 3D surfaces and in the fidelity of their renderings from any viewpoint.

[13] Improving trajectory continuity in drone-based crowd monitoring using a set of minimal-cost techniques and deep discriminative correlation filters

Bartosz Ptak,Marek Kraft

Main category: cs.CV

TL;DR: Proposes a point-based online tracking algorithm that improves trajectory continuity and counting reliability in drone-based crowd monitoring, reducing counting errors and identity switches.

Details

Motivation: Drone-based crowd monitoring matters for public safety and event management, but conventional methods suffer from false positives, false negatives, and identity switches, which degrade counting accuracy and limit in-depth analysis.

Method: Build on the SORT framework, replacing bounding-box assignment with a point-distance metric, adding camera motion compensation, altitude-aware assignment, and classification-based trajectory validation, and integrating DDCF to improve computational efficiency and tracking accuracy.

Result: On the DroneCrowd and UP-COUNT-TRACK datasets, counting errors are reduced to 23% and 15%, respectively, and identity switches drop significantly, outperforming baseline online trackers and an offline greedy optimisation method.

Conclusion: The method improves tracking continuity and counting reliability for drone-based crowd monitoring and offers an efficient solution for practical use.

Abstract: Drone-based crowd monitoring is a key technology for applications in surveillance, public safety, and event management. However, maintaining tracking continuity and consistency remains a significant challenge. Traditional detection-assignment tracking methods struggle with false positives, false negatives, and frequent identity switches, leading to degraded counting accuracy and making in-depth analysis impossible. This paper introduces a point-oriented online tracking algorithm that improves trajectory continuity and counting reliability in drone-based crowd monitoring. Our method builds on the Simple Online and Real-time Tracking (SORT) framework, replacing the original bounding-box assignment with a point-distance metric. The algorithm is enhanced with three cost-effective techniques: camera motion compensation, altitude-aware assignment, and classification-based trajectory validation. Further, Deep Discriminative Correlation Filters (DDCF), which re-use spatial feature maps from localisation algorithms for increased computational efficiency through neural network resource sharing, are integrated to refine object tracking by reducing noise and handling missed detections. The proposed method is evaluated on the DroneCrowd and newly shared UP-COUNT-TRACK datasets, demonstrating substantial improvements in tracking metrics, reducing counting errors to 23% and 15%, respectively. The results also indicate a significant reduction of identity switches while maintaining high tracking accuracy, outperforming baseline online trackers and even an offline greedy optimisation method.
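The idea of assigning detections to tracks by point distance rather than bounding-box overlap can be sketched as follows. This is a simplified illustration, not the paper's implementation: it finds the minimum-total-distance matching by exhaustive search (practical only for a handful of points, where a real tracker would use an assignment solver), then rejects pairs farther apart than a hypothetical `max_dist` gate. Camera motion compensation and altitude-aware gating are omitted.

```python
import math
from itertools import permutations


def assign_points(tracks, detections, max_dist):
    """Match track points to detection points by minimum total distance.

    Returns a list of (track_index, detection_index) pairs. Pairs whose
    distance exceeds max_dist are dropped after the matching is found.
    """
    n_t, n_d = len(tracks), len(detections)
    if n_t == 0 or n_d == 0:
        return []
    cost = [[math.dist(t, d) for d in detections] for t in tracks]
    if n_t <= n_d:
        # Every track gets a distinct detection.
        candidates = (
            [(i, p[i]) for i in range(n_t)]
            for p in permutations(range(n_d), n_t)
        )
    else:
        # Every detection gets a distinct track.
        candidates = (
            [(p[j], j) for j in range(n_d)]
            for p in permutations(range(n_t), n_d)
        )
    best = min(candidates, key=lambda pairs: sum(cost[i][j] for i, j in pairs))
    return [(i, j) for i, j in best if cost[i][j] <= max_dist]


# Hypothetical predicted track positions and new detections (pixels).
tracks = [(0.0, 0.0), (10.0, 10.0)]
detections = [(9.0, 10.0), (1.0, 0.0), (50.0, 50.0)]
matches = assign_points(tracks, detections, max_dist=5.0)
```

Unmatched detections, such as the far-away third point here, would typically start new tracks, and unmatched tracks would be kept alive for a few frames before being dropped.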

[14] Physics-Informed Diffusion Models for SAR Ship Wake Generation from Text Prompts

Kamirul Kamirul,Odysseas Pappas,Alin Achim

Main category: cs.CV

TL;DR: The paper proposes a diffusion-model-based method for efficiently generating ship wakes in SAR imagery, addressing the slow speed of conventional physics-based simulation.

Details

Motivation: Scarce annotated data makes supervised detection of ship wakes in SAR imagery difficult, and conventional physics-based simulation is slow and constrains end-to-end learning.

Method: Train a diffusion model on data generated by a physics-based simulator, using a training set that pairs simulated images with text prompts derived from the simulation parameters.

Result: The model generates realistic Kelvin wake patterns, with inference significantly faster than the physics-based simulator.

Conclusion: Diffusion models show potential for fast, controllable wake image generation and open new possibilities for end-to-end downstream tasks in maritime SAR analysis.

Abstract: Detecting ship presence via wake signatures in SAR imagery is attracting considerable research interest, but limited annotated data availability poses significant challenges for supervised learning. Physics-based simulations are commonly used to address this data scarcity, although they are slow and constrain end-to-end learning. In this work, we explore a new direction for more efficient and end-to-end SAR ship wake simulation using a diffusion model trained on data generated by a physics-based simulator. The training dataset is built by pairing images produced by the simulator with text prompts derived from simulation parameters. Experimental results show that the model generates realistic Kelvin wake patterns and achieves significantly faster inference than the physics-based simulator. These results highlight the potential of diffusion models for fast and controllable wake image generation, opening new possibilities for end-to-end downstream tasks in maritime SAR analysis.

[15] Image Interpolation with Score-based Riemannian Metrics of Diffusion Models

Shinnosuke Saito,Takashi Matsubara

Main category: cs.CV

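TL;DR: The paper proposes a framework that treats the data space of a pre-trained diffusion model as a Riemannian manifold with a metric derived from the score function, improving the realism and prompt faithfulness of image interpolation.

Details

Motivation: Diffusion models excel at content generation but lack a practical way to exploit the data manifold, whereas other deep generative models typically have a latent space.

Method: Model the data space of the diffusion model as a Riemannian manifold and define a metric from the score function, used for image interpolation.

Result: Experiments on MNIST and Stable Diffusion show that the interpolated images are more realistic, less noisy, and more faithful to prompts.

Conclusion: The geometry-aware approach shows potential for content generation and editing.

Abstract: Diffusion models excel in content generation by implicitly learning the data manifold, yet they lack a practical method to leverage this manifold, unlike other deep generative models equipped with latent spaces. This paper introduces a novel framework that treats the data space of pre-trained diffusion models as a Riemannian manifold, with a metric derived from the score function. Experiments with MNIST and Stable Diffusion show that this geometry-aware approach yields image interpolations that are more realistic, less noisy, and more faithful to prompts than existing methods, demonstrating its potential for improved content generation and editing.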

[16] DeepAndes: A Self-Supervised Vision Foundation Model for Multi-Spectral Remote Sensing Imagery of the Andes

Junlin Guo,James R. Zimmer-Dauphinee,Jordan M. Nieusma,Siqi Lu,Quan Liu,Ruining Deng,Can Cui,Jialin Yue,Yizhe Lin,Tianyuan Yao,Juming Xiong,Junchao Zhu,Chongyu Qu,Yuechen Yang,Mitchell Wilkes,Xiao Wang,Parker VanValkenburgh,Steven A. Wernke,Yuankai Huo

Main category: cs.CV

TL;DR: DeepAndes is a Transformer-based vision foundation model designed for Andean archaeology that uses self-supervised learning for multi-spectral satellite image analysis and improves performance on archaeological remote sensing tasks.

Details

Motivation: Conventional deep learning methods face challenges in annotating fine-grained archaeological features, and existing vision foundation models mostly target RGB images and lack support for multi-spectral satellite imagery.

Method: Develop the DeepAndes model with a customized DINOv2 self-supervised learning algorithm, trained on three million multi-spectral satellite images and adapted to 8-band data.

Result: In few-shot learning scenarios, DeepAndes performs well on classification, retrieval, and segmentation tasks, significantly outperforming models trained from scratch or pre-trained on smaller datasets.

Conclusion: Large-scale self-supervised pre-training is effective for archaeological remote sensing, and DeepAndes provides an efficient tool for Andean archaeology.

Abstract: By mapping sites at large scales using remotely sensed data, archaeologists can generate unique insights into long-term demographic trends, inter-regional social networks, and past adaptations to climate change. Remote sensing surveys complement field-based approaches, and their reach can be especially great when combined with deep learning and computer vision techniques. However, conventional supervised deep learning methods face challenges in annotating fine-grained archaeological features at scale. While recent vision foundation models have shown remarkable success in learning large-scale remote sensing data with minimal annotations, most off-the-shelf solutions are designed for RGB images rather than multi-spectral satellite imagery, such as the 8-band data used in our study. In this paper, we introduce DeepAndes, a transformer-based vision foundation model trained on three million multi-spectral satellite images, specifically tailored for Andean archaeology. DeepAndes incorporates a customized DINOv2 self-supervised learning algorithm optimized for 8-band multi-spectral imagery, marking the first foundation model designed explicitly for the Andes region. We evaluate its image understanding performance through imbalanced image classification, image instance retrieval, and pixel-level semantic segmentation tasks. Our experiments show that DeepAndes achieves superior F1 scores, mean average precision, and Dice scores in few-shot learning scenarios, significantly outperforming models trained from scratch or pre-trained on smaller datasets. This underscores the effectiveness of large-scale self-supervised pre-training in archaeological remote sensing. Code will be available at https://github.com/geopacha/DeepAndes.
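For readers unfamiliar with two of the metrics named in the abstract, the sketch below computes the Dice score of two binary masks and the F1 score from confusion counts. It is a generic illustration with made-up masks, not the DeepAndes evaluation code; for binary labels the two quantities coincide.

```python
def dice(pred, target):
    """Dice score of two flat binary masks (sequences of 0/1)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2.0 * tp / denom


# Made-up 4-pixel masks: one pixel overlaps, one is a false positive,
# and one is a false negative.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
score = dice(pred, target)
```

Both metrics ignore true negatives, which makes them more informative than plain accuracy when the positive class is rare, as in the imbalanced classification setting the abstract mentions.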

[17] Dynamic Contextual Attention Network: Transforming Spatial Representations into Adaptive Insights for Endoscopic Polyp Diagnosis

Teja Krishna Cherukuri,Nagur Shareef Shaik,Sribhuvan Reddy Yellu,Jun-Won Chung,Dong Hye Ye

Main category: cs.CV

TL;DR: Proposes the Dynamic Contextual Attention Network (DCAN) to improve the accuracy and interpretability of colorectal polyp diagnosis.

Details

Motivation: Traditional endoscopic imaging falls short in polyp localization and contextual awareness, which limits the explainability of diagnoses.

Method: Use a dynamic contextual attention mechanism to transform spatial representations into adaptive contextual information, without an explicit localization module.

Result: DCAN improves decision interpretability in the classification process and overall diagnostic performance.

Conclusion: The method could make colorectal cancer detection more reliable and improve patient outcomes.

Abstract: Colorectal polyps are key indicators for early detection of colorectal cancer. However, traditional endoscopic imaging often struggles with accurate polyp localization and lacks comprehensive contextual awareness, which can limit the explainability of diagnoses. To address these issues, we propose the Dynamic Contextual Attention Network (DCAN). This novel approach transforms spatial representations into adaptive contextual insights, using an attention mechanism that enhances focus on critical polyp regions without explicit localization modules. By integrating contextual awareness into the classification process, DCAN improves decision interpretability and overall diagnostic performance. This advancement in imaging could lead to more reliable colorectal cancer detection, enabling better patient outcomes.

[18] Fine Grain Classification: Connecting Meta using Cross-Contrastive pre-training

Sumit Mamtani,Yash Thesia

Main category: cs.CV

TL;DR: The paper proposes a unified framework that uses meta-information to assist fine-grained visual classification, jointly learning visual and meta-information through cross-contrastive pre-training to improve classification performance.

Details

Motivation: Appearance information alone is often insufficient to distinguish similar categories in fine-grained visual classification, so meta-information is introduced to improve accuracy.

Method: Use three encoders for images, text, and meta-information, align their embeddings through cross-contrastive pre-training, then fine-tune the image and meta-information encoders for classification.

Result: On the NABirds dataset, the framework improves performance by 7.83% using meta-information and reaches a final accuracy of 84.44%, outperforming many existing methods.

Conclusion: By jointly learning visual and meta-information, the framework improves fine-grained classification and demonstrates the value of meta-information.

Abstract: Fine-grained visual classification aims to recognize objects belonging to multiple subordinate categories within a super-category. However, this remains a challenging problem, as appearance information alone is often insufficient to accurately differentiate between fine-grained visual categories. To address this, we propose a novel and unified framework that leverages meta-information to assist fine-grained identification. We tackle the joint learning of visual and meta-information through cross-contrastive pre-training. In the first stage, we employ three encoders for images, text, and meta-information, aligning their projected embeddings to achieve better representations. We then fine-tune the image and meta-information encoders for the classification task. Experiments on the NABirds dataset demonstrate that our framework effectively utilizes meta-information to enhance fine-grained recognition performance. With the addition of meta-information, our framework surpasses the current baseline on NABirds by 7.83%. Furthermore, it achieves an accuracy of 84.44% on the NABirds dataset, outperforming many existing state-of-the-art approaches that utilize meta-information.
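The abstract does not spell out the pre-training objective, so the following is only a sketch of one common way to align embeddings from two encoders: a symmetric InfoNCE-style contrastive loss, in which each embedding should be most similar to its own partner within the batch. The function names, temperature, and toy embeddings are assumptions; applying such a loss to each pair among image, text, and meta-information would be one hypothetical reading of "cross-contrastive".

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.1):
    """Mean cross-entropy of matching a[i] to b[i] against every b[j]."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    loss = 0.0
    for i, va in enumerate(a):
        logits = [sum(x * y for x, y in zip(va, vb)) / temperature for vb in b]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(a)


def symmetric_contrastive(a, b, temperature=0.1):
    """Average of the a-to-b and b-to-a InfoNCE losses."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


# Toy 2-D embeddings for a batch of two samples (hypothetical values).
image_emb = [[1.0, 0.0], [0.0, 1.0]]
meta_emb = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = symmetric_contrastive(image_emb, meta_emb)
shuffled_loss = symmetric_contrastive(image_emb, meta_emb[::-1])
```

The loss is small when matching pairs point in similar directions and large when the pairing is shuffled, which is the signal that pulls the encoders toward a shared embedding space.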

[19] MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation

Amaan Izhar,Nurul Japar,Norisma Idris,Ting Dang

Main category: cs.CV

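TL;DR: MicarVLMoE combines a multiscale vision encoder, a multihead dual-branch latent attention module, and a modulated mixture-of-experts decoder to address fine-grained feature extraction, multimodal alignment, and generalization in medical image reporting, and reports state-of-the-art results across several imaging types.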

Details

Motivation: Existing methods handle fine-grained feature extraction, multimodal alignment, and generalization across imaging types poorly, and focus mainly on chest X-rays. Method: MicarVLMoE comprises a multiscale vision encoder (MSVE), a multihead dual-branch latent attention (MDLA) module, and a modulated mixture-of-experts (MoE) decoder. Result: The model reports state-of-the-art results on the COVCTR, MMR, PGROSS, and ROCO datasets; experiments confirm improved clinical accuracy, cross-modal alignment, and model interpretability. Conclusion: MicarVLMoE performs well on medical image reporting, addresses limitations of existing methods, and extends to multiple imaging types. Abstract: Medical image reporting (MIR) aims to generate structured clinical descriptions from radiological images. Existing methods struggle with fine-grained feature extraction, multimodal alignment, and generalization across diverse imaging types, often relying on vanilla transformers and focusing primarily on chest X-rays. We propose MicarVLMoE, a vision-language mixture-of-experts model with gated cross-aligned fusion, designed to address these limitations. Our architecture includes: (i) a multiscale vision encoder (MSVE) for capturing anatomical details at varying resolutions, (ii) a multihead dual-branch latent attention (MDLA) module for vision-language alignment through latent bottleneck representations, and (iii) a modulated mixture-of-experts (MoE) decoder for adaptive expert specialization. We extend MIR to CT scans, retinal imaging, MRI scans, and gross pathology images, reporting state-of-the-art results on COVCTR, MMR, PGROSS, and ROCO datasets. Extensive experiments and ablations confirm improved clinical accuracy, cross-modal alignment, and model interpretability. Code is available at https://github.com/AI-14/micar-vl-moe.

[20] TTTFusion: A Test-Time Training-Based Strategy for Multimodal Medical Image Fusion in Surgical Robots

Qinhua Xie,Hao Tang

Main category: cs.CV

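TL;DR: TTTFusion is an image fusion strategy based on Test-Time Training (TTT) that dynamically adjusts model parameters to improve the fusion quality of multimodal medical images.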

Details

Motivation: Improve the ability of surgical robots to process multimodal medical images, and address the challenges traditional methods face in real-time performance, fine-grained feature extraction, and edge preservation. Method: A Test-Time Training (TTT) strategy adjusts model parameters dynamically during inference to optimize fusion for the input image data. Result: Experiments show that TTTFusion significantly outperforms traditional methods in multimodal image fusion quality, particularly in fine-grained feature extraction and edge preservation. Conclusion: TTTFusion improves image fusion accuracy and offers a new technical approach to real-time image processing in surgical robots. Abstract: With the increasing use of surgical robots in clinical practice, enhancing their ability to process multimodal medical images has become a key research challenge. Although traditional medical image fusion methods have made progress in improving fusion accuracy, they still face significant challenges in real-time performance, fine-grained feature extraction, and edge preservation. In this paper, we introduce TTTFusion, a Test-Time Training (TTT)-based image fusion strategy that dynamically adjusts model parameters during inference to efficiently fuse multimodal medical images. By adapting the model during the test phase, our method optimizes the parameters based on the input image data, leading to improved accuracy and better detail preservation in the fusion results. Experimental results demonstrate that TTTFusion significantly enhances the fusion quality of multimodal images compared to traditional fusion methods, particularly in fine-grained feature extraction and edge preservation. This approach not only improves image fusion accuracy but also offers a novel technical solution for real-time image processing in surgical robots.

[21] Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems

Shiqian Zhao,Jiayang Liu,Yiming Li,Runyi Hu,Xiaojun Jia,Wenshu Fan,Xinfeng Li,Jie Zhang,Wei Dong,Tianwei Zhang,Luu Anh Tuan

Main category: cs.CV

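TL;DR: The paper shows that the memory mechanism in online text-to-image (T2I) systems increases the risk of jailbreak attacks, and proposes Inception, a multi-turn jailbreak attack that uses segmentation and recursion strategies to achieve a high success rate.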

Details

Motivation: The memory mechanism in current T2I systems is practical, but its security has been insufficiently analyzed and it can be abused for jailbreak attacks. Method: The Inception attack feeds a malicious prompt to the system in chunks and uses a recursion strategy to handle minimum unsafe words that cannot otherwise be split. Result: Experiments show that Inception exceeds existing methods by 14% in attack success rate. Conclusion: The security of memory mechanisms needs further study to prevent attacks like Inception. Abstract: Currently, the memory mechanism has been widely and successfully exploited in online text-to-image (T2I) generation systems (e.g., DALL·E 3) for alleviating the growing tokenization burden and capturing key information in multi-turn interactions. Despite its practicality, its security analyses have fallen far behind. In this paper, we reveal that this mechanism exacerbates the risk of jailbreak attacks. Different from previous attacks that fuse the unsafe target prompt into one ultimate adversarial prompt, which can be easily detected or may generate non-unsafe images due to under- or over-optimization, we propose Inception, the first multi-turn jailbreak attack against the memory mechanism in real-world text-to-image generation systems. Inception embeds the malice at the inception of the chat session turn by turn, leveraging the mechanism that T2I generation systems retrieve key information in their memory. Specifically, Inception mainly consists of two modules. It first segments the unsafe prompt into chunks, which are subsequently fed to the system in multiple turns, serving as pseudo-gradients for directive optimization. To this end, we develop a series of segmentation policies that ensure the images generated are semantically consistent with the target prompt. Secondly, after segmentation, to overcome the challenge of the inseparability of minimum unsafe words, we propose recursion, a strategy that makes minimum unsafe words subdivisible. Collectively, segmentation and recursion ensure that all the request prompts are benign but can lead to malicious outcomes. We conduct experiments on the real-world text-to-image generation system (i.e., DALL·E 3) to validate the effectiveness of Inception. The results indicate that Inception surpasses the state-of-the-art by a 14% margin in attack success rate.

[22] Sparse2DGS: Geometry-Prioritized Gaussian Splatting for Surface Reconstruction from Sparse Views

Jiang Wu,Rui Li,Yu Zhu,Rong Guo,Jinqiu Sun,Yanning Zhang

Main category: cs.CV

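TL;DR: Sparse2DGS is a Gaussian Splatting surface reconstruction method for sparse input views. It addresses the geometric optimization problems that arise with sparse views, outperforms existing methods by notable margins, and runs faster.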

Details

Motivation: Conventional methods depend on dense views and struggle with initialization from sparse Structure-from-Motion points, while directly combining learning-based MVS with Gaussian Splatting gives suboptimal results. Method: Sparse2DGS is an MVS-initialized Gaussian Splatting pipeline with geometry-prioritized enhancement schemes that enable robust geometric learning from sparse views. Result: Sparse2DGS outperforms existing methods by notable margins and is 2× faster than the NeRF-based fine-tuning approach. Conclusion: Sparse2DGS offers an efficient and accurate solution for surface reconstruction from sparse views. Abstract: We present a Gaussian Splatting method for surface reconstruction using sparse input views. Previous methods relying on dense views struggle with extremely sparse Structure-from-Motion points for initialization. While learning-based Multi-view Stereo (MVS) provides dense 3D points, directly combining it with Gaussian Splatting leads to suboptimal results due to the ill-posed nature of sparse-view geometric optimization. We propose Sparse2DGS, an MVS-initialized Gaussian Splatting pipeline for complete and accurate reconstruction. Our key insight is to incorporate geometry-prioritized enhancement schemes, allowing for direct and robust geometric learning under ill-posed conditions. Sparse2DGS outperforms existing methods by notable margins while being $2\times$ faster than the NeRF-based fine-tuning approach.

[23] GSFeatLoc: Visual Localization Using Feature Correspondence on 3D Gaussian Splatting

Jongwon Lee,Timothy Bretl

Main category: cs.CV

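TL;DR: Proposes a method for localizing a query image against a 3D Gaussian Splatting (3DGS) scene representation that substantially reduces inference time and estimation error.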

Details

Motivation: Address the shortcomings of existing methods in inference time and pose estimation error, in particular their limited tolerance to error in the initial pose estimate. Method: (1) Render a synthetic RGBD image with 3DGS; (2) establish 2D-2D correspondences between the query image and the synthetic image; (3) use the depth map to lift the 2D-2D correspondences to 2D-3D correspondences and solve a PnP problem for the final pose estimate. Result: Across three datasets, inference time drops from more than 10 seconds to as little as 0.1 seconds; final pose errors are below 5° in rotation and 0.05 units in translation on 90% of Synthetic NeRF and Mip-NeRF360 images and on 42% of Tanks and Temples images. Conclusion: The method is efficient and robust, and suits localization with large initial pose errors. Abstract: In this paper, we present a method for localizing a query image with respect to a precomputed 3D Gaussian Splatting (3DGS) scene representation. First, the method uses 3DGS to render a synthetic RGBD image at some initial pose estimate. Second, it establishes 2D-2D correspondences between the query image and this synthetic image. Third, it uses the depth map to lift the 2D-2D correspondences to 2D-3D correspondences and solves a perspective-n-point (PnP) problem to produce a final pose estimate. Results from evaluation across three existing datasets with 38 scenes and over 2,700 test images show that our method significantly reduces both inference time (by over two orders of magnitude, from more than 10 seconds to as fast as 0.1 seconds) and estimation error compared to baseline methods that use photometric loss minimization. Results also show that our method tolerates large errors in the initial pose estimate of up to 55° in rotation and 1.1 units in translation (normalized by scene scale), achieving final pose errors of less than 5° in rotation and 0.05 units in translation on 90% of images from the Synthetic NeRF and Mip-NeRF360 datasets and on 42% of images from the more challenging Tanks and Temples dataset.
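
Illustrative sketch (not the authors' code): the third step, lifting 2D-2D matches to 2D-3D correspondences, can be pictured as back-projecting each matched pixel of the synthetic image through a pinhole camera using the rendered depth. The pinhole model, the z-depth convention, the camera-to-world pose layout, and all function names are assumptions made for illustration.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with z-depth -> 3D point in the camera frame (pinhole model)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def to_world(p_cam, rotation, translation):
    """Apply a camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def lift_matches(matches, depth_map, intrinsics, rotation, translation):
    """Turn 2D-2D matches into 2D-3D correspondences.

    matches: list of ((u_query, v_query), (u_synth, v_synth)) pixel pairs.
    depth_map: rows of z-depth values rendered at the initial pose estimate.
    Returns a list of ((u_query, v_query), (X, Y, Z)) pairs.
    """
    fx, fy, cx, cy = intrinsics
    lifted = []
    for (uq, vq), (us, vs) in matches:
        row, col = int(round(vs)), int(round(us))
        if not (0 <= row < len(depth_map) and 0 <= col < len(depth_map[0])):
            continue  # match falls outside the rendered image
        depth = depth_map[row][col]
        if depth <= 0:
            continue  # no valid depth was rendered at this pixel
        p_cam = backproject(us, vs, depth, fx, fy, cx, cy)
        lifted.append(((uq, vq), to_world(p_cam, rotation, translation)))
    return lifted
```

The resulting query-pixel and world-point pairs are the input a PnP solver expects; OpenCV's `solvePnPRansac` is one commonly used option, though the abstract does not say which solver the authors use.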

[24] Neural Stereo Video Compression with Hybrid Disparity Compensation

Shiyin Jiang,Zhenghao Chen,Minghao Han,Xingyu Zhou,Leheng Zhang,Shuhang Gu

Main category: cs.CV

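TL;DR: Proposes a hybrid disparity compensation (HDC) strategy that combines explicit and implicit methods to improve stereo video compression, with strong results on several benchmarks.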

Details

Motivation: Disparity compensation is the main strategy for reducing cross-view redundancy in stereo video compression, but the existing approaches (explicit horizontal shifting and implicit cross-attention) each have limitations, so their strengths should be combined. Method: HDC fuses explicit pixel displacement with implicit cross-attention, aligning features through a similarity map that is normalized into attention scores, and is integrated into an end-to-end optimized neural framework. Result: On benchmarks including KITTI 2012, KITTI 2015, and Nagoya, the HDC framework outperforms both traditional and neural SVC methods. Conclusion: HDC combines explicit and implicit methods effectively and improves stereo video compression across a range of scenes. Abstract: Disparity compensation represents the primary strategy in stereo video compression (SVC) for exploiting cross-view redundancy. These mechanisms can be broadly categorized into two types: one that employs explicit horizontal shifting, and another that utilizes an implicit cross-attention mechanism to reduce cross-view disparity redundancy. In this work, we propose a hybrid disparity compensation (HDC) strategy that leverages explicit pixel displacement as a robust prior feature to simplify optimization and perform implicit cross-attention mechanisms for subsequent warping operations, thereby capturing a broader range of disparity information. Specifically, HDC first computes a similarity map by fusing the horizontally shifted cross-view features to capture pixel displacement information. This similarity map is then normalized into an "explicit pixel-wise attention score" to perform the cross-attention mechanism, implicitly aligning features from one view to another. Building upon HDC, we introduce a novel end-to-end optimized neural stereo video compression framework, which integrates HDC-based modules into key coding operations, including cross-view feature extraction and reconstruction (HDC-FER) and cross-view entropy modeling (HDC-EM). Extensive experiments on SVC benchmarks, including KITTI 2012, KITTI 2015, and Nagoya, which cover both autonomous driving and general scenes, demonstrate that our framework outperforms both neural and traditional SVC methodologies.
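
Illustrative sketch (not the authors' code): the two HDC steps, a similarity map over horizontal shifts followed by normalization into per-pixel attention, can be pictured on a single scan line. The dot-product similarity, the softmax normalization, the shift range, and the function name are assumptions; the paper's fusion of shifted features is learned and is not reproduced here.

```python
import math


def hybrid_disparity_row(left, right, max_disp):
    """Align right-view features to the left view along one scan line.

    left, right: lists of per-pixel feature vectors for the same image row.
    For each left pixel x, candidate right pixels are x - d for d in 0..max_disp.
    Returns the attention-weighted right features, one vector per left pixel.
    """
    channels = len(right[0])
    aligned = []
    for x in range(len(left)):
        shifts = [d for d in range(max_disp + 1) if x - d >= 0]
        # Explicit part: similarity between the left pixel and each shifted right pixel.
        scores = [
            sum(a * b for a, b in zip(left[x], right[x - d])) for d in shifts
        ]
        # Normalize the similarities into a per-pixel attention score (softmax).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        attention = [e / total for e in exps]
        # Implicit part: attention-weighted aggregation of the shifted features.
        aligned.append(
            [
                sum(w * right[x - d][c] for w, d in zip(attention, shifts))
                for c in range(channels)
            ]
        )
    return aligned
```

When one shift clearly matches, the attention collapses onto it and the output approaches a hard horizontal warp; when several shifts are plausible, the output blends them, which is the broader disparity coverage the abstract attributes to the implicit mechanism.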

[25] FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding

Yanan Guo,Wenhui Dong,Jun Song,Shiding Zhu,Xuan Zhang,Hanqing Yang,Yingbo Wang,Yang Du,Xianing Chen,Bo Zheng

Main category: cs.CV

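TL;DR: FiLA-Video proposes a lightweight dynamic-weight multi-frame fusion strategy to address redundant information and computational cost in long-video understanding.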

Details

Motivation: Existing video feature compression methods either fail to retain key information or are computationally expensive, which limits long-video understanding. Method: A dynamic-weight multi-frame fusion strategy and a keyframe selection strategy, combined with a low-cost method for generating training data. Result: FiLA-Video shows higher efficiency and accuracy in long-video understanding. Conclusion: FiLA-Video offers an efficient and accurate solution for long-video understanding. Abstract: Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common approach is video feature compression to reduce token input to large language models, yet many methods either fail to prioritize essential features, leading to redundant inter-frame information, or introduce computationally expensive modules. To address these issues, we propose FiLA (Fine-grained Vision Language Model)-Video, a novel framework that leverages a lightweight dynamic-weight multi-frame fusion strategy, which adaptively integrates multiple frames into a single representation while preserving key video information and reducing computational costs. To enhance frame selection for fusion, we introduce a keyframe selection strategy, effectively identifying informative frames from a larger pool for improved summarization. Additionally, we present a simple yet effective long-video training data generation strategy, boosting model performance without extensive manual annotation. Experimental results demonstrate that FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.
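
Illustrative sketch (not the authors' code): the abstract describes fusing several frames into a single representation with dynamically computed weights but does not specify how the weights are obtained. The version below assumes the weights come from a softmax over scaled dot-product similarity to a chosen keyframe; that scoring rule and the function name are hypothetical.

```python
import math


def fuse_frames(frames, keyframe_index=0):
    """Fuse per-frame feature vectors into one vector with input-dependent weights.

    frames: list of equal-length feature vectors, one per frame.
    Returns (fused_vector, weights); the weights sum to 1.
    """
    key = frames[keyframe_index]
    dim = len(key)
    # Score each frame by scaled dot-product similarity to the keyframe (assumption).
    scores = [sum(a * b for a, b in zip(f, key)) / math.sqrt(dim) for f in frames]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * f[c] for w, f in zip(weights, frames)) for c in range(dim)]
    return fused, weights
```

Whatever the scoring rule, the effect on the language model is the same: a group of frames contributes the tokens of one fused frame rather than of every frame, which is where the reduction in token input comes from.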

[26] GarmentX: Autoregressive Parametric Representations for High-Fidelity 3D Garment Generation

Jingfeng Guo,Jinnan Chen,Weikai Chen,Zhenyu Sun,Lanjiong Li,Baozhu Zhao,Lingting Zhu,Xin Wang,Qi Liu

Main category: cs.CV

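TL;DR: GarmentX is a new framework for generating diverse, high-fidelity, wearable 3D garments from a single input image. Its structured parametric representation and autoregressive model avoid the self-intersections and physically implausible structures produced by traditional methods.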

Details

Motivation: Traditional garment reconstruction methods directly predict 2D pattern edges and their connectivity, which leads to self-intersections and physically implausible structures; GarmentX aims to solve this. Method: Introduces a structured, editable parametric representation, predicts garment parameters with an autoregressive model, and builds the large-scale GarmentX dataset. Result: Reaches state-of-the-art performance in geometric fidelity and input image alignment, significantly outperforming prior methods. Conclusion: With its parametric representation and supporting dataset, GarmentX generates high-quality 3D garments; the dataset will be released publicly. Abstract: This work presents GarmentX, a novel framework for generating diverse, high-fidelity, and wearable 3D garments from a single input image. Traditional garment reconstruction methods directly predict 2D pattern edges and their connectivity, an overly unconstrained approach that often leads to severe self-intersections and physically implausible garment structures. In contrast, GarmentX introduces a structured and editable parametric representation compatible with GarmentCode, ensuring that the decoded sewing patterns always form valid, simulation-ready 3D garments while allowing for intuitive modifications of garment shape and style. To achieve this, we employ a masked autoregressive model that sequentially predicts garment parameters, leveraging autoregressive modeling for structured generation while mitigating inconsistencies in direct pattern prediction. Additionally, we introduce the GarmentX dataset, a large-scale dataset of 378,682 garment parameter-image pairs, constructed through an automatic data generation pipeline that synthesizes diverse and high-quality garment images conditioned on parametric garment representations. Through integrating our method with the GarmentX dataset, we achieve state-of-the-art performance in geometric fidelity and input image alignment, significantly outperforming prior approaches. We will release the GarmentX dataset upon publication.

[27] Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks

Konstantinos I. Roumeliotis,Ranjan Sapkota,Manoj Karkee,Nikolaos D. Tselikas,Dimitrios K. Nasiopoulos

Main category: cs.CV

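TL;DR: This study examines combining a multimodal large language model (GPT-4o) with convolutional neural networks (CNNs) for disease classification from plant leaf images. Fine-tuned GPT-4o slightly outperforms ResNet-50, but its zero-shot performance is poor.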

Details

Motivation: Agricultural automation is essential for crop monitoring and disease management; this study aims to improve the accuracy and generalization of plant disease classification by combining multimodal large language models with CNNs. Method: Using the PlantVillage dataset, the study systematically evaluates GPT-4o and ResNet-50 in zero-shot, few-shot, and progressive fine-tuning scenarios, and compares performance across resolutions and plant species. Result: Fine-tuned GPT-4o reaches 98.12% classification accuracy on apple leaf images, ahead of ResNet-50 at 96.88%, but its zero-shot performance is poor. The models show both adaptability and limitations in cross-resolution and cross-plant generalization. Conclusion: Multimodal large language models show promise for automated disease detection; they can make precision agriculture systems more intelligent and scalable while reducing dependence on large labeled datasets and high-resolution sensors. Abstract: Automation in agriculture plays a vital role in addressing challenges related to crop monitoring and disease management, particularly through early detection systems. This study investigates the effectiveness of combining multimodal Large Language Models (LLMs), specifically GPT-4o, with Convolutional Neural Networks (CNNs) for automated plant disease classification using leaf imagery. Leveraging the PlantVillage dataset, we systematically evaluate model performance across zero-shot, few-shot, and progressive fine-tuning scenarios. A comparative analysis between GPT-4o and the widely used ResNet-50 model was conducted across three resolutions (100, 150, and 256 pixels) and two plant species (apple and corn). Results indicate that fine-tuned GPT-4o models performed slightly better than ResNet-50, achieving up to 98.12% classification accuracy on apple leaf images versus 96.88% for ResNet-50, with improved generalization and near-zero training loss. However, zero-shot performance of GPT-4o was significantly lower, underscoring the need for minimal training. Additional evaluations on cross-resolution and cross-plant generalization revealed the models' adaptability and limitations when applied to new domains. The findings highlight the promise of integrating multimodal LLMs into automated disease detection pipelines, enhancing the scalability and intelligence of precision agriculture systems while reducing the dependence on large, labeled datasets and high-resolution sensor infrastructure.
Large Language Models, Vision Language Models, LLMs and CNNs, Disease Detection with Vision Language Models, VLMs

[28] AI Assisted Cervical Cancer Screening for Cytology Samples in Developing Countries

Love Panta,Suraj Prasai,Karishma Malla Vaidya,Shyam Shrestha,Suresh Manandhar

Main category: cs.CV

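TL;DR: Proposes an automated cervical cancer screening method that combines low-cost microscopes with efficient AI algorithms, and reports marked gains in accuracy and efficiency.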

Details

Motivation: Traditional cervical cancer screening methods such as LBC are labor-intensive and error-prone, so more efficient solutions are needed. Method: A motorized microscope captures images, which pass through an AI pipeline (image stitching, cell segmentation, and classification) built on a lightweight UNet model and a CvT classification model. Result: On the SIPaKMeD dataset, the system accurately classifies five cell types and outperforms existing methods. Conclusion: The method offers a low-cost, efficient, and accurate alternative for cervical cancer screening. Abstract: Cervical cancer remains a significant health challenge, with high incidence and mortality rates, particularly in transitioning countries. Conventional Liquid-Based Cytology (LBC) is a labor-intensive process, requires expert pathologists and is highly prone to errors, highlighting the need for more efficient screening methods. This paper introduces an innovative approach that integrates low-cost biological microscopes with our simple and efficient AI algorithms for automated whole-slide analysis. Our system uses a motorized microscope to capture cytology images, which are then processed through an AI pipeline involving image stitching, cell segmentation, and classification. We use a lightweight UNet-based model with a human-in-the-loop approach to train our segmentation model with minimal ROIs. A CvT-based classification model, trained on the SIPaKMeD dataset, accurately categorizes five cell types. Our framework offers enhanced accuracy and efficiency in cervical cancer screening compared to various state-of-the-art methods, as demonstrated by different evaluation metrics.

[29] PixelHacker: Image Inpainting with Structural and Semantic Consistency

Ziyang Xu,Kangsheng Duan,Xiaolei Shen,Zhifeng Ding,Wenyu Liu,Xiaohu Ruan,Xiaoxin Chen,Xinggang Wang

Main category: cs.CV

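TL;DR: The paper proposes PixelHacker, a diffusion model that uses latent categories guidance to address structural and semantic consistency in image inpainting, and outperforms existing methods on several datasets.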

Details

Motivation: Existing image inpainting methods handle complex structure and semantic consistency poorly, leaving artifacts in the generated results. Method: Designs a latent categories guidance paradigm, builds a large image-mask dataset, and injects features through two embeddings and linear attention. Result: PixelHacker comprehensively outperforms existing methods on the Places2, CelebA-HQ, and FFHQ datasets. Conclusion: Through latent categories guidance and a diffusion model, PixelHacker markedly improves the structural and semantic consistency of image inpainting. Abstract: Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (potential 116 and 21 categories, respectively). Then, we encode potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page at https://hustvl.github.io/projects/PixelHacker.

[30] LMM4Gen3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs

Woo Yi Yang,Jiarui Wang,Sijing Wu,Huiyu Duan,Yuxin Zhu,Liu Yang,Kang Fu,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

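TL;DR: The paper introduces the Gen3DHF benchmark and the LMME3DHF model for evaluating the quality and authenticity of AI-generated 3D human faces, with strong experimental results.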

Details

Motivation: Human perception of the quality and authenticity of AI-generated 3D faces is subjective and hard to quantify, so an objective evaluation method is needed. Method: Builds the Gen3DHF benchmark, containing 2,000 videos and 4,000 subjective scores, and proposes LMME3DHF, an evaluation method based on a large multimodal model. Result: LMME3DHF outperforms existing methods in quality score prediction and distortion identification, and agrees with human perception. Conclusion: Gen3DHF and LMME3DHF provide effective tools for evaluating AI-generated 3D faces and will be released publicly. Abstract: Rapid advances in generative artificial intelligence have enabled the creation of 3D human faces (HFs) for applications including media production, virtual reality, security, healthcare, and game development. However, assessing the quality and realism of these AI-generated 3D human faces remains a significant challenge due to the subjective nature of human perception and innate perceptual sensitivity to facial features. To this end, we conduct a comprehensive study on the quality assessment of AI-generated 3D human faces. We first introduce Gen3DHF, a large-scale benchmark comprising 2,000 videos of AI-Generated 3D Human Faces along with 4,000 Mean Opinion Scores (MOS) collected across two dimensions, i.e., quality and authenticity, and 2,000 distortion-aware saliency maps and distortion descriptions. Based on Gen3DHF, we propose LMME3DHF, a Large Multimodal Model (LMM)-based metric for Evaluating 3DHF capable of quality and authenticity score prediction, distortion-aware visual question answering, and distortion-aware saliency prediction. Experimental results show that LMME3DHF achieves state-of-the-art performance, surpassing existing methods in both accurately predicting quality scores for AI-generated 3D human faces and effectively identifying distortion-aware salient regions and distortion types, while maintaining strong alignment with human perceptual judgments. Both the Gen3DHF database and LMME3DHF will be released upon publication.

[31] Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception

Yuanchen Wu,Lu Zhang,Hang Yao,Junlong Du,Ke Yan,Shouhong Ding,Yunsheng Wu,Xiaoqiang Li

Main category: cs.CV

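TL;DR: The paper proposes "Antidote", a post-training framework that mitigates the hallucinations large vision-language models (LVLMs) produce when answering counterfactual presupposition questions (CPQs), achieving self-correction through synthetic data and preference optimization.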

Details

Motivation: LVLMs perform well on multimodal tasks but tend to hallucinate when answering counterfactual presupposition questions; existing work mostly targets response generation and overlooks the question itself. Method: Introduces the "Antidote" framework, which uses synthetic data to incorporate factual priors into questions for self-correction and casts mitigation as a preference optimization problem. Builds the "CP-Bench" benchmark to evaluate model behavior. Result: On the LLaVA series, Antidote improves performance on CP-Bench by over 50%, POPE by 1.8-3.3%, and CHAIR & SHR by 30-50%, without external supervision. Conclusion: Antidote mitigates LVLM hallucinations while avoiding catastrophic forgetting, offering a workable route to model self-correction. Abstract: Large Vision-Language Models (LVLMs) have achieved impressive results across various cross-modal tasks. However, hallucinations, i.e., the models generating counterfactual responses, remain a challenge. Though recent studies have attempted to alleviate object perception hallucinations, they focus on the models' response generation and overlook the task question itself. This paper discusses the vulnerability of LVLMs in solving counterfactual presupposition questions (CPQs), where the models are prone to accept the presuppositions of counterfactual objects and produce severe hallucinatory responses. To this end, we introduce "Antidote", a unified, synthetic data-driven post-training framework for mitigating both types of hallucination above. It leverages synthetic data to incorporate factual priors into questions to achieve self-correction, and decouples the mitigation process into a preference optimization problem. Furthermore, we construct "CP-Bench", a novel benchmark to evaluate LVLMs' ability to correctly handle CPQs and produce factual responses. Applied to the LLaVA series, Antidote can simultaneously enhance performance on CP-Bench by over 50%, POPE by 1.8-3.3%, and CHAIR & SHR by 30-50%, all without relying on external supervision from stronger LVLMs or human feedback, and without introducing noticeable catastrophic forgetting issues.

[32] Large-scale visual SLAM for in-the-wild videos

Shuo Sun,Torsten Sattler,Malcolm Mielle,Achim J. Lilienthal,Martin Magnusson

Main category: cs.CV

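TL;DR: Proposes a robust 3D scene reconstruction method for camera pose estimation and scene reconstruction from unconstrained videos. It combines deep visual odometry, dynamic object masking, monocular depth estimation, and global optimization, and markedly improves reconstruction quality.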

Details

Motivation: Existing visual SLAM methods perform poorly on unconstrained videos (rapid rotations, textureless regions, dynamic objects); fixing this would simplify robot deployment to new environments. Method: Builds on deep visual odometry, recovers camera intrinsics automatically, masks dynamic objects and less-constrained areas with a predictive model, uses monocular depth estimates to regularize bundle adjustment, and integrates place recognition and loop closure to reduce long-term drift. Result: Produces large-scale contiguous 3D models from online videos in a variety of environments; compared with baseline methods, the reconstructions are more consistent and more visually accurate. Conclusion: The method sets a new baseline for visual reconstruction from unconstrained videos, with markedly better continuity and consistency. Abstract: Accurate and robust 3D scene reconstruction from casual, in-the-wild videos can significantly simplify robot deployment to new environments. However, reliable camera pose estimation and scene reconstruction from such unconstrained videos remain an open challenge. Existing visual-only SLAM methods perform well on benchmark datasets but struggle with real-world footage which often exhibits uncontrolled motion including rapid rotations and pure forward movements, textureless regions, and dynamic objects. We analyze the limitations of current methods and introduce a robust pipeline designed to improve 3D reconstruction from casual videos. We build upon recent deep visual odometry methods but increase robustness in several ways. Camera intrinsics are automatically recovered from the first few frames using structure-from-motion. Dynamic objects and less-constrained areas are masked with a predictive model. Additionally, we leverage monocular depth estimates to regularize bundle adjustment, mitigating errors in low-parallax situations. Finally, we integrate place recognition and loop closure to reduce long-term drift and refine both intrinsics and pose estimates through global bundle adjustment. We demonstrate large-scale contiguous 3D models from several online videos in various environments. In contrast, baseline methods typically produce locally inconsistent results at several points, producing separate segments or distorted maps. In lieu of ground-truth pose data, we evaluate map consistency, execution time and visual accuracy of re-rendered NeRF models. Our proposed system establishes a new baseline for visual reconstruction from casual uncontrolled videos found online, demonstrating more consistent reconstructions over longer sequences of in-the-wild videos than previously achieved.

[33] Style-Adaptive Detection Transformer for Single-Source Domain Generalized Object Detection

Jianhong Han,Yupei Wang,Liang Chen

Main category: cs.CV

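TL;DR: The paper proposes SA-DETR, a DETR-based detector for single-source domain generalization (SDG), which uses a dynamic style adapter and an object-aware contrastive learning module to improve generalization to unseen target domains.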

Details

Motivation: Existing CNN-based detectors improve generalization through data augmentation and feature alignment, but this only works when the augmented sample distribution covers the unseen scenarios. DETR performs well in domain adaptation, yet its potential for SDG has not been explored. Method: Proposes SA-DETR, which contains a domain style adapter (dynamic style adaptation) and an object-aware contrastive learning module (extracting domain-invariant features through contrastive learning). Result: Experiments across five different weather scenarios show that SA-DETR has superior performance and generalization capability. Conclusion: Through dynamic style adaptation and object-aware contrastive learning, SA-DETR markedly improves detection performance in single-source domain generalization. Abstract: Single-source Domain Generalization (SDG) in object detection aims to develop a detector using only data from a source domain that can exhibit strong generalization capability when applied to unseen target domains. Existing methods are built upon CNN-based detectors and primarily improve robustness by employing carefully designed data augmentation strategies integrated with feature alignment techniques. However, data augmentation methods have inherent drawbacks; they are only effective when the augmented sample distribution approximates or covers the unseen scenarios, thus failing to enhance generalization across all unseen domains. Furthermore, while the recent Detection Transformer (DETR) has demonstrated superior generalization capability in domain adaptation tasks due to its efficient global information extraction, its potential in SDG tasks remains unexplored. To this end, we introduce a strong DETR-based detector named the Style-Adaptive Detection Transformer (SA-DETR) for SDG in object detection. Specifically, we present a domain style adapter that projects the style representation of the unseen target domain into the training domain, enabling dynamic style adaptation. Then, we propose an object-aware contrastive learning module to guide the detector in extracting domain-invariant features through contrastive learning. By using object-aware gating masks to constrain feature aggregation in both spatial and semantic dimensions, this module achieves cross-domain contrast of instance-level features, thereby enhancing generalization. Extensive experiments demonstrate the superior performance and generalization capability of SA-DETR across five different weather scenarios. Code is released at https://github.com/h751410234/SA-DETR.

[34] MambaMoE: Mixture-of-Spectral-Spatial-Experts State Space Model for Hyperspectral Image Classification

Yichu Xu,Di Wang,Hongzan Jiao,Lefei Zhang,Liangpei Zhang

Main category: cs.CV

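TL;DR: MambaMoE is a novel spectral-spatial mixture-of-experts framework that uses adaptive spectral-spatial modeling and an uncertainty-guided corrective learning strategy to markedly improve hyperspectral image classification.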

Details

Motivation: Existing Mamba-based methods neglect the spectral and spatial directional characteristics of heterogeneous objects in hyperspectral scenes, which limits classification performance. Method: Designs a Mixture of Mamba Expert Block (MoMEB) and an uncertainty-guided corrective learning (UGCL) strategy, enabling adaptive spectral-spatial modeling and directing attention to complex regions. Result: On multiple public hyperspectral benchmarks, MambaMoE reaches state-of-the-art accuracy and efficiency. Conclusion: MambaMoE gives the hyperspectral image classification field an efficient, high-performing new method. Abstract: The Mamba model has recently demonstrated strong potential in hyperspectral image (HSI) classification, owing to its ability to perform context modeling with linear computational complexity. However, existing Mamba-based methods usually neglect the spectral and spatial directional characteristics related to heterogeneous objects in hyperspectral scenes, leading to limited classification performance. To address these issues, we propose MambaMoE, a novel spectral-spatial mixture-of-experts framework, representing the first MoE-based approach in the HSI classification community. Specifically, we design a Mixture of Mamba Expert Block (MoMEB) that leverages sparse expert activation to enable adaptive spectral-spatial modeling. Furthermore, we introduce an uncertainty-guided corrective learning (UGCL) strategy to encourage the model's attention toward complex regions prone to prediction ambiguity. Extensive experiments on multiple public HSI benchmarks demonstrate that MambaMoE achieves state-of-the-art performance in both accuracy and efficiency compared to existing advanced approaches, especially for Mamba-based methods. Code will be released.

[35] SteelBlastQC: Shot-blasted Steel Surface Dataset with Interpretable Detection of Surface Defects

Irina Ruzavina,Lisa Sophie Theis,Jesse Lemeer,Rutger de Groen,Leo Ebeling,Andrej Hulak,Jouaria Ali,Guangzhi Tang,Rico Mockel

Main category: cs.CV

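TL;DR: The study presents a labeled dataset for steel surface quality control and evaluates three classification approaches; the supervised methods (CCT and SVM) reach 95% accuracy.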

Details

Motivation: Automating quality control of shot-blasted steel surfaces is crucial for manufacturing efficiency and consistency. Method: The study uses a dataset of 1,654 labeled RGB images and evaluates three classification approaches: CCT, SVM (with ResNet-50 feature extraction), and CAE. Result: The supervised methods (CCT and SVM) reach 95% classification accuracy on the test set; CAE serves as a baseline for unsupervised methods. Conclusion: By releasing the dataset and code, the study aims to advance defect detection research, support interpretable computer vision models, and encourage the adoption of automated industrial inspection systems. Abstract: Automating the quality control of shot-blasted steel surfaces is crucial for improving manufacturing efficiency and consistency. This study presents a dataset of 1,654 labeled RGB images (512x512) of steel surfaces, classified as either "ready for paint" or "needs shot-blasting." The dataset captures real-world surface defects, including discoloration, welding lines, scratches and corrosion, making it well-suited for training computer vision models. Additionally, three classification approaches were evaluated: Compact Convolutional Transformers (CCT), Support Vector Machines (SVM) with ResNet-50 feature extraction, and a Convolutional Autoencoder (CAE). The supervised methods (CCT and SVM) achieve 95% classification accuracy on the test set, with CCT leveraging transformer-based attention mechanisms and SVM offering a computationally efficient alternative. The CAE approach, while less effective, establishes a baseline for unsupervised quality control. We present interpretable decision-making by all three neural networks, allowing industry users to visually pinpoint problematic regions and understand the model's rationale. By releasing the dataset and baseline codes, this work aims to support further research in defect detection, advance the development of interpretable computer vision models for quality control, and encourage the adoption of automated inspection systems in industrial applications.

[36] Dynamic Attention Analysis for Backdoor Detection in Text-to-Image Diffusion Models

Zhongqi Wang,Jie Zhang,Shiguang Shan,Xilin Chen

Main category: cs.CV

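TL;DR: The paper proposes Dynamic Attention Analysis (DAA), a new method for detecting backdoor attacks in text-to-image diffusion models that identifies backdoor samples by analyzing how attention maps change over time.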

Details

Motivation: Existing backdoor detection methods focus mainly on static features, leaving the dynamic nature of diffusion models underexploited. The paper aims to use these dynamic characteristics as a more effective indicator for backdoor detection. Method: Two variants are proposed: DAA-I (which analyzes attention maps independently) and DAA-S (which analyzes them through a graph-based state equation of a dynamical system). Result: Across five backdoor attack scenarios, DAA significantly outperforms existing methods, with an average F1 score of 79.49% and an AUC of 87.67%. Conclusion: Dynamic attention analysis is an effective way to detect backdoor attacks on diffusion models, and DAA-S further improves detection performance. Abstract: Recent studies have revealed that text-to-image diffusion models are vulnerable to backdoor attacks, where attackers implant stealthy textual triggers to manipulate model outputs. Previous backdoor detection methods primarily focus on the static features of backdoor samples. However, a vital property of diffusion models is their inherent dynamism. This study introduces a novel backdoor detection perspective named Dynamic Attention Analysis (DAA), showing that these dynamic characteristics serve as better indicators for backdoor detection. Specifically, by examining the dynamic evolution of cross-attention maps, we observe that backdoor samples exhibit distinct feature evolution patterns at the <EOS> token compared to benign samples. To quantify these dynamic anomalies, we first introduce DAA-I, which treats the tokens' attention maps as spatially independent and measures the dynamic feature using the Frobenius norm. Furthermore, to better capture the interactions between attention maps and refine the feature, we propose a dynamical system-based approach, referred to as DAA-S. This model formulates the spatial correlations among attention maps using a graph-based state equation and we theoretically analyze the global asymptotic stability of this method. Extensive experiments across five representative backdoor attack scenarios demonstrate that our approach significantly surpasses existing detection methods, achieving an average F1 Score of 79.49% and an AUC of 87.67%. The code is available at https://github.com/Robin-WZQ/DAA.
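
Note: the abstract only says that DAA-I measures a dynamic feature with the Frobenius norm. A minimal sketch of one plausible reading follows, assuming the feature is the norm of the change between a token's attention maps at consecutive denoising steps. The function names and the toy maps are hypothetical and are not taken from the released DAA code.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def dynamic_feature(attention_maps):
    """Norm of the change between consecutive attention maps of one token.

    attention_maps: list of equally sized 2D lists, one per denoising step.
    Returns one value per transition, i.e. len(attention_maps) - 1 values.
    """
    feats = []
    for prev, curr in zip(attention_maps, attention_maps[1:]):
        diff = [[c - p for p, c in zip(pr, cr)] for pr, cr in zip(prev, curr)]
        feats.append(frobenius_norm(diff))
    return feats


# Toy 2x2 maps for a single token over three steps (made-up numbers).
maps = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[3.0, 0.0], [0.0, 4.0]],
    [[3.0, 0.0], [0.0, 4.0]],
]
features = dynamic_feature(maps)  # large first change, then no change
```

Under this reading, a detector would compare such per-step values for the <EOS> token against those of benign prompts; how the paper thresholds or aggregates them is not stated in the abstract.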

[37] Geometry-aware Temporal Aggregation Network for Monocular 3D Lane Detection

Huan Zheng,Wencheng Han,Tianyi Yan,Cheng-zhong Xu,Jianbing Shen

Main category: cs.CV

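TL;DR: GTA-Net is a multi-frame method for monocular 3D lane detection that addresses the inaccurate geometry and incomplete lanes of existing methods through geometric-consistency enhancement and temporal aggregation of instance information.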

Details

Motivation: Existing monocular 3D lane detection methods suffer from inaccurate geometric information and difficulty in maintaining lane integrity; the authors aim to improve performance by exploiting multiple input frames. Method: GTA-Net comprises a Temporal Geometry Enhancement Module (TGEM), which strengthens geometry perception, and Temporal Instance-aware Query Generation (TIQG), which integrates temporal instance information. Result: Experiments show that GTA-Net achieves state-of-the-art results on monocular 3D lane detection. Conclusion: By combining temporal geometric consistency with instance information, GTA-Net significantly improves monocular 3D lane detection. Abstract: Monocular 3D lane detection aims to estimate the 3D position of lanes from frontal-view (FV) images. However, current monocular 3D lane detection methods suffer from two limitations, including inaccurate geometric information of the predicted 3D lanes and difficulties in maintaining lane integrity. To address these issues, we seek to fully exploit the potential of multiple input frames. First, we aim at enhancing the ability to perceive the geometry of scenes by leveraging temporal geometric consistency. Second, we strive to improve the integrity of lanes by revealing more instance information from temporal sequences. Therefore, we propose a novel Geometry-aware Temporal Aggregation Network (GTA-Net) for monocular 3D lane detection. On one hand, we develop the Temporal Geometry Enhancement Module (TGEM), which exploits geometric consistency across successive frames, facilitating effective geometry perception. On the other hand, we present the Temporal Instance-aware Query Generation (TIQG), which strategically incorporates temporal cues into query generation, thereby enabling the exploration of comprehensive instance information. Experiments demonstrate that our GTA-Net achieves SoTA results, surpassing existing monocular 3D lane detection solutions.

[38] Beyond the Horizon: Decoupling UAVs Multi-View Action Recognition via Partial Order Transfer

Wenxuan Liu,Xian Zhong,Zhuo Zhou,Siyuan Yang,Chia-Wen Lin,Alex Chichung Kot

Main category: cs.CV

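TL;DR: The paper proposes POG-MVNet, a multi-view method for UAV action recognition that models the hierarchical structure of views and significantly improves recognition performance.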

Details

Motivation: UAV action recognition is challenging because of view variations along the vertical spatial axis, which traditional ground-based methods handle poorly. Method: The POG-MVNet framework comprises a View Partition module, an Order-aware Feature Decoupling module, and an Action Partial Order Guide module. Result: On the Drone-Action, MOD20, and UAV datasets, POG-MVNet outperforms existing methods, for example by 4.7% on Drone-Action. Conclusion: By modeling the hierarchical structure of views, POG-MVNet effectively addresses view variation in UAV action recognition. Abstract: Action recognition in unmanned aerial vehicles (UAVs) poses unique challenges due to significant view variations along the vertical spatial axis. Unlike traditional ground-based settings, UAVs capture actions from a wide range of altitudes, resulting in considerable appearance discrepancies. We introduce a multi-view formulation tailored to varying UAV altitudes and empirically observe a partial order among views, where recognition accuracy consistently decreases as the altitude increases. This motivates a novel approach that explicitly models the hierarchical structure of UAV views to improve recognition performance across altitudes. To this end, we propose the Partial Order Guided Multi-View Network (POG-MVNet), designed to address drastic view variations by effectively leveraging view-dependent information across different altitude levels. The framework comprises three key components: a View Partition (VP) module, which uses the head-to-body ratio to group views by altitude; an Order-aware Feature Decoupling (OFD) module, which disentangles action-relevant and view-specific features under partial order guidance; and an Action Partial Order Guide (APOG), which leverages the partial order to transfer informative knowledge from easier views to support learning in more challenging ones. We conduct experiments on the Drone-Action, MOD20, and UAV datasets, demonstrating that POG-MVNet significantly outperforms competing methods. For example, POG-MVNet achieves a 4.7% improvement on the Drone-Action dataset and a 3.5% improvement on the UAV dataset compared to state-of-the-art methods ASAT and FAR. The code for POG-MVNet will be made available soon.

[39] Autoencoder Models for Point Cloud Environmental Synthesis from WiFi Channel State Information: A Preliminary Study

Daniele Pannone,Danilo Avola

Main category: cs.CV

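TL;DR: The paper proposes a deep learning framework that generates point clouds from WiFi Channel State Information (CSI) data using a two-stage autoencoder approach, and validates its potential for wireless sensing and environmental mapping.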

Details

Motivation: To reconstruct accurate environmental point clouds from WiFi CSI data and thereby broaden the applications of wireless sensing. Method: A two-stage autoencoder approach: a PointNet autoencoder generates point clouds, a CNN autoencoder maps CSI data to a matching latent space, and reconstruction is achieved by aligning the two latent spaces. Result: Experiments show that the method can effectively reconstruct environmental point clouds from WiFi data. Conclusion: The method offers a new solution for wireless sensing and environmental mapping, with potential for practical use. Abstract: This paper introduces a deep learning framework for generating point clouds from WiFi Channel State Information data. We employ a two-stage autoencoder approach: a PointNet autoencoder with convolutional layers for point cloud generation, and a Convolutional Neural Network autoencoder to map CSI data to a matching latent space. By aligning these latent spaces, our method enables accurate environmental point cloud reconstruction from WiFi data. Experimental results validate the effectiveness of our approach, highlighting its potential for wireless sensing and environmental mapping applications.

[40] PartHOI: Part-based Hand-Object Interaction Transfer via Generalized Cylinders

Qiaochu Wang,Chufeng Xiao,Manfred Lau,Hongbo Fu

Main category: cs.CV

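TL;DR: PartHOI is a part-based HOI transfer method that parameterizes the geometry of object parts to achieve high-fidelity hand-object interaction transfer across categories.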

Details

Motivation: Existing methods rely on shape matching and struggle to transfer hand poses across categories, whereas HOI usually involves semantic parts of objects whose shapes are more consistent across categories. Method: A generalized cylinder representation parameterizes the geometry of object parts, geometric correspondences are established between parts, contact points are transferred, and the hand pose is optimized to fit the target object. Result: Qualitative and quantitative results show that PartHOI performs well on cross-category objects and produces results superior to existing methods. Conclusion: By modeling part geometry, PartHOI significantly improves cross-category HOI transfer. Abstract: Learning-based methods to understand and model hand-object interactions (HOI) require a large amount of high-quality HOI data. One way to create HOI data is to transfer hand poses from a source object to another based on the objects' geometry. However, current methods for transferring hand poses between objects rely on shape matching, limiting the ability to transfer poses across different categories due to differences in their shapes and sizes. We observe that HOI often involves specific semantic parts of objects, which often have more consistent shapes across categories. In addition, constructing size-invariant correspondences between these parts is important for cross-category transfer. Based on these insights, we introduce a novel method PartHOI for part-based HOI transfer. Using a generalized cylinder representation to parameterize the geometry of an object's parts, PartHOI establishes a robust geometric correspondence between object parts, and enables the transfer of contact points. Given the transferred points, we optimize a hand pose to fit the target object well. Qualitative and quantitative results demonstrate that our method can generalize HOI transfers well even for cross-category objects, and produce high-fidelity results that are superior to the existing methods.

[41] Purifying, Labeling, and Utilizing: A High-Quality Pipeline for Small Object Detection

Siwei Wang,Zhiwei Chen,Liujuan Cao,Rongrong Ji

Main category: cs.CV

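TL;DR: PLUSNet is an optimized framework for small object detection that improves upstream feature purification, midstream sample assignment, and downstream information utilization, yielding significant detection gains.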

Details

Motivation: Traditional small object detection methods optimize only isolated stages of the pipeline and neglect overall performance, so a holistic optimization approach is needed. Method: PLUSNet contains three modules: HFP purifies upstream features, MCLA improves midstream sample assignment, and FDHead improves the use of information in downstream tasks. Result: Experiments show that PLUSNet significantly improves small object detection on multiple datasets. Conclusion: By optimizing the whole detection pipeline, PLUSNet significantly improves small object detection and is easy to integrate into existing detectors. Abstract: Small object detection is a broadly investigated research task and is commonly conceptualized as a "pipeline-style" engineering process. In the upstream, images serve as raw materials for processing in the detection pipeline, where pre-trained models are employed to generate initial feature maps. In the midstream, an assigner selects training positive and negative samples. Subsequently, these samples and features are fed into the downstream for classification and regression. Previous small object detection methods often focused on improving isolated stages of the pipeline, thereby neglecting holistic optimization and consequently constraining overall performance gains. To address this issue, we have optimized three key aspects, namely Purifying, Labeling, and Utilizing, in this pipeline, proposing a high-quality small object detection framework termed PLUSNet. Specifically, PLUSNet comprises three sequential components: the Hierarchical Feature Purifier (HFP) for purifying upstream features, the Multiple Criteria Label Assignment (MCLA) for improving the quality of midstream training samples, and the Frequency Decoupled Head (FDHead) for more effectively exploiting information to accomplish downstream tasks. The proposed PLUS modules are readily integrable into various object detectors, thus enhancing their detection capabilities in multi-scale scenarios. Extensive experiments demonstrate that the proposed PLUSNet achieves significant and consistent improvements across multiple datasets for small object detection.

[42] EfficientHuman: Efficient Training and Reconstruction of Moving Human using Articulated 2D Gaussian

Hao Tian,Rui Liu,Wen Shen,Yilong Hu,Zhihao Zheng,Xiaolin Qin

Main category: cs.CV

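TL;DR: EfficientHuman is a dynamic human reconstruction method based on Articulated 2D Gaussian surfels that addresses the dynamic-surface fitting and multi-view inconsistency problems of 3DGS, significantly improving training speed and rendering quality.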

Details

Motivation: In dynamic human reconstruction, 3DGS suffers from multi-view inconsistency and redundant Gaussians, leading to slow training and suboptimal rendering quality. Method: Gaussian splats are encoded as Articulated 2D Gaussian surfels in canonical space and transformed to pose space via LBS; a pose calibration module and an LBS optimization module are also introduced. Result: On the ZJU-MoCap dataset, EfficientHuman completes dynamic reconstruction in under one minute on average, 20 seconds faster than the current state-of-the-art method, while reducing redundant Gaussians. Conclusion: With its 2D Gaussian representation and optimization modules, EfficientHuman achieves fast, high-quality dynamic human reconstruction. Abstract: 3D Gaussian Splatting (3DGS) has been recognized as a pioneering technique in scene reconstruction and novel view synthesis. Recent work on reconstructing the 3D human body using 3DGS attempts to leverage prior information on human pose to enhance rendering quality and improve training speed. However, it struggles to effectively fit dynamic surface planes due to multi-view inconsistency and redundant Gaussians. This inconsistency arises because Gaussian ellipsoids cannot accurately represent the surfaces of dynamic objects, which hinders the rapid reconstruction of the dynamic human body. Meanwhile, the prevalence of redundant Gaussians means that the training time of these works is still not ideal for quickly fitting a dynamic human body. To address these, we propose EfficientHuman, a model that quickly accomplishes the dynamic reconstruction of the human body using Articulated 2D Gaussian while ensuring high rendering quality. The key innovation involves encoding Gaussian splats as Articulated 2D Gaussian surfels in canonical space and then transforming them to pose space via Linear Blend Skinning (LBS) to achieve efficient pose transformations. Unlike 3D Gaussians, Articulated 2D Gaussian surfels can quickly conform to the dynamic human body while ensuring view-consistent geometries. Additionally, we introduce a pose calibration module and an LBS optimization module to achieve precise fitting of dynamic human poses, enhancing the model's performance. Extensive experiments on the ZJU-MoCap dataset demonstrate that EfficientHuman achieves rapid 3D dynamic human reconstruction in less than a minute on average, which is 20 seconds faster than the current state-of-the-art method, while also reducing the number of redundant Gaussians.
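
Note: the canonical-to-pose step mentioned in the abstract relies on Linear Blend Skinning, which in its standard form moves a point by a weighted sum of per-bone rigid transforms. A minimal sketch follows, assuming only the centre of a surfel is transformed (the paper also has to handle surfel orientation and scale, which is not shown). The bone transforms and weights below are invented for illustration.

```python
def transform(rotation, translation, point):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to a point."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def linear_blend_skinning(point, weights, bones):
    """Blend the per-bone transformed positions of one canonical-space point.

    weights: one skinning weight per bone, expected to sum to 1.
    bones:   list of (rotation, translation) pairs, one per bone.
    """
    posed = [0.0, 0.0, 0.0]
    for w, (rotation, translation) in zip(weights, bones):
        moved = transform(rotation, translation, point)
        posed = [p + w * m for p, m in zip(posed, moved)]
    return posed


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Two hypothetical bones: one at rest, one rotated 90 degrees about z and lifted.
bones = [(IDENTITY, [0.0, 0.0, 0.0]), (ROT_Z_90, [0.0, 0.0, 2.0])]
canonical_point = [1.0, 0.0, 0.0]
posed_point = linear_blend_skinning(canonical_point, [0.5, 0.5], bones)
```

With equal weights the point lands halfway between the two bone-transformed positions, which is the behaviour that makes LBS cheap but also prone to artifacts near joints.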

[43] AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation

Jeongsoo Choi,Ji-Hoon Kim,Kim Sung-Bin,Tae-Hyun Oh,Joon Son Chung

Main category: cs.CV

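TL;DR: AlignDiT is a multimodal aligned diffusion transformer that generates high-quality speech from aligned multimodal inputs, addressing the shortcomings of existing methods in speech intelligibility, audio-video synchronization, naturalness, and voice similarity.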

Details

Motivation: Multimodal speech generation has broad applications in film production, dubbing, and virtual avatars, but existing methods are limited in several respects. Method: AlignDiT builds on the DiT architecture, uses three strategies to align multimodal representations, and introduces a multimodal classifier-free guidance mechanism. Result: Experiments show that AlignDiT significantly outperforms existing methods in quality, synchronization, and speaker similarity, and generalizes well across multimodal tasks. Conclusion: AlignDiT achieves state-of-the-art performance on multimodal speech generation and has broad application potential. Abstract: In this paper, we address the task of multimodal-to-speech generation, which aims to synthesize high-quality speech from multiple input modalities: text, video, and reference audio. This task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Despite recent progress, existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. To address these challenges, we propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs. Built upon the in-context learning capability of the DiT architecture, AlignDiT explores three effective strategies to align multimodal representations. Furthermore, we introduce a novel multimodal classifier-free guidance mechanism that allows the model to adaptively balance information from each modality during speech synthesis. Extensive experiments demonstrate that AlignDiT significantly outperforms existing methods across multiple benchmarks in terms of quality, synchronization, and speaker similarity. Moreover, AlignDiT exhibits strong generalization capability across various multimodal tasks, such as video-to-speech synthesis and visual forced alignment, consistently achieving state-of-the-art performance. The demo page is available at https://mm.kaist.ac.kr/projects/AlignDiT.
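
Note: the abstract does not give the formula behind the multimodal classifier-free guidance. A minimal sketch of one common multi-condition form follows, assuming each modality contributes its own guidance term with its own scale on top of the unconditional prediction. This is an illustration, not AlignDiT's formulation; the modality names, scales, and toy predictions are hypothetical.

```python
def multimodal_cfg(uncond, cond_by_modality, scales):
    """Combine model predictions with one guidance scale per modality.

    uncond:           prediction with all conditions dropped.
    cond_by_modality: prediction with only that modality's condition kept.
    scales:           guidance scale per modality name.
    Computes uncond + sum_m scale_m * (cond_m - uncond), element-wise.
    """
    guided = list(uncond)
    for name, cond in cond_by_modality.items():
        s = scales[name]
        guided = [g + s * (c - u) for g, c, u in zip(guided, cond, uncond)]
    return guided


# Toy two-dimensional predictions (made-up numbers).
uncond = [0.0, 1.0]
cond = {"text": [1.0, 1.0], "video": [0.0, 3.0]}
scales = {"text": 2.0, "video": 0.5}
guided = multimodal_cfg(uncond, cond, scales)
```

Raising one modality's scale pushes the output further along that modality's direction, which is the sense in which such a scheme can balance text against video cues.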

[44] LDPoly: Latent Diffusion for Polygonal Road Outline Extraction in Large-Scale Topographic Mapping

Weiqin Jiao,Hao Cheng,George Vosselman,Claudio Persello

Main category: cs.CV

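TL;DR: LDPoly is the first framework dedicated to extracting polygonal road outlines from high-resolution aerial images; it uses a dual-latent diffusion model with a channel-embedded fusion module and significantly outperforms existing methods.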

Details

Motivation: No existing method is designed for polygonal road outline extraction, and the branching structures and topological connectivity of roads challenge existing building-outline extraction methods. Method: LDPoly uses a dual-latent diffusion model with a channel-embedded fusion module to generate road masks and vertex heatmaps simultaneously, then applies a tailored polygonization method to obtain accurate vectorized road polygons. Result: On the Map2ImLas dataset, LDPoly outperforms existing methods on pixel-level coverage, vertex efficiency, polygon regularity, and road connectivity; new evaluation metrics are also designed. Conclusion: LDPoly is the first application of diffusion models to precise vectorized object outline extraction from remote-sensing imagery, laying a foundation for future research in this field. Abstract: Polygonal road outline extraction from high-resolution aerial images is an important task in large-scale topographic mapping, where roads are represented as vectorized polygons, capturing essential geometric features with minimal vertex redundancy. Despite its importance, no existing method has been explicitly designed for this task. While polygonal building outline extraction has been extensively studied, the unique characteristics of roads, such as branching structures and topological connectivity, pose challenges to these methods. To address this gap, we introduce LDPoly, the first dedicated framework for extracting polygonal road outlines from high-resolution aerial images. Our method leverages a novel Dual-Latent Diffusion Model with a Channel-Embedded Fusion Module, enabling the model to simultaneously generate road masks and vertex heatmaps. A tailored polygonization method is then applied to obtain accurate vectorized road polygons with minimal vertex redundancy. We evaluate LDPoly on a new benchmark dataset, Map2ImLas, which contains detailed polygonal annotations for various topographic objects in several Dutch regions. Our experiments include both in-region and cross-region evaluations, with the latter designed to assess the model's generalization performance on unseen regions. Quantitative and qualitative results demonstrate that LDPoly outperforms state-of-the-art polygon extraction methods across various metrics, including pixel-level coverage, vertex efficiency, polygon regularity, and road connectivity. We also design two new metrics to assess polygon simplicity and boundary smoothness. Moreover, this work represents the first application of diffusion models for extracting precise vectorized object outlines without redundant vertices from remote-sensing imagery, paving the way for future advancements in this field.

[45] SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data

Michael Ogezi,Freda Shi

Main category: cs.CV

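TL;DR: The paper proposes a way to strengthen the spatial reasoning of vision-language models (VLMs) by building a synthetic VQA dataset, SpaRE, which significantly improves performance on spatial reasoning tasks.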

Details

Motivation: Existing VLMs perform poorly on spatial reasoning, mainly because spatial-relation samples are rare and unevenly distributed in widely used VL datasets. Method: A synthetic VQA dataset of 455k samples and 3.4 million QA pairs is built from the hyper-detailed image descriptions in Localized Narratives, DOCCI, and PixMo-Cap, and used to train SpaRE VLMs. Result: SpaRE VLMs improve markedly on spatial reasoning benchmarks, for example by up to 49% on the What's Up benchmark, while maintaining strong results on general tasks. Conclusion: The work narrows the gap between human and VLM spatial reasoning and increases the potential of VLMs in real-world tasks such as robotics and navigation. Abstract: Vision-language models (VLMs) work well in tasks ranging from image captioning to visual question answering (VQA), yet they struggle with spatial reasoning, a key skill for understanding our physical world that humans excel at. We find that spatial relations are generally rare in widely used VL datasets, with only a few being well represented, while most form a long tail of underrepresented relations. This gap leaves VLMs ill-equipped to handle diverse spatial relationships. To bridge it, we construct a synthetic VQA dataset focused on spatial reasoning generated from hyper-detailed image descriptions in Localized Narratives, DOCCI, and PixMo-Cap. Our dataset consists of 455k samples containing 3.4 million QA pairs. Trained on this dataset, our Spatial-Reasoning Enhanced (SpaRE) VLMs show strong improvements on spatial reasoning benchmarks, achieving up to a 49% performance gain on the What's Up benchmark, while maintaining strong results on general tasks. Our work narrows the gap between human and VLM spatial reasoning and makes VLMs more capable in real-world tasks such as robotics and navigation.

[46] Image deidentification in the XNAT ecosystem: use cases and solutions

Alex Michie,Simon J Doran

Main category: cs.CV

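TL;DR: The XNAT platform is used to deidentify DICOM data with rule-based and machine-learning methods, reaching high accuracy, although address recognition still has room for improvement.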

Details

Motivation: To address the need for DICOM data deidentification and to validate the approach by taking part in the MIDI-B challenge. Method: XNAT facilities are combined with independent tools, using rule-based and machine-learning methods for deidentification. Result: The initial score was 97.91%, improved to 99.61% after submission; address recognition remains a weakness. Conclusion: Future work needs to improve address recognition and the deidentification of image pixel data; the current failure rate is estimated at 0.19%. Abstract: XNAT is a server-based data management platform widely used in academia for curating large databases of DICOM images for research projects. We describe in detail a deidentification workflow for DICOM data using facilities in XNAT, together with independent tools in the XNAT "ecosystem". We list different contexts in which deidentification might be needed, based on our prior experience. The starting point for participation in the Medical Image De-Identification Benchmark (MIDI-B) challenge was a set of pre-existing local methodologies, which were adapted during the validation phase of the challenge. Our result in the test phase was 97.91%, considerably lower than our peers, due largely to an arcane technical incompatibility of our methodology with the challenge's Synapse platform, which prevented us receiving feedback during the validation phase. Post-submission, additional discrepancy reports from the organisers and via the MIDI-B Continuous Benchmarking facility enabled us to improve this score significantly to 99.61%. An entirely rule-based approach was shown to be capable of removing all name-related information in the test corpus, but exhibited failures in dealing fully with address data. Initial experiments using published machine-learning models to remove addresses were partially successful but showed the models to be "over-aggressive" on other types of free-text data, leading to a slight overall degradation in performance to 99.54%. Future development will therefore focus on improving address-recognition capabilities, but also on better removal of identifiable data burned into the image pixels. Several technical aspects relating to the "answer key" are still under discussion with the challenge organisers, but we estimate that our percentage of genuine deidentification failures on the MIDI-B test corpus currently stands at 0.19%. (Abridged from original for arXiv submission)

[47] Advance Fake Video Detection via Vision Transformers

Joy Battocchio,Stefano Dell'Anna,Andrea Montibeller,Giulia Boato

Main category: cs.CV

TL;DR: The paper proposes a Vision Transformer (ViT)-based framework for detecting AI-generated videos, addressing the pressing problem of fake multimedia.

Details

Motivation: As AI-based multimedia generation advances, the risk of fake content spreading grows, creating an urgent need for accurate and generalizable detection methods. Method: The ViT framework is extended to video, integrating ViT embeddings over time to improve detection performance. Result: The method shows high accuracy, generalization, and few-shot learning capability on a new, large, and diverse dataset. Conclusion: The framework offers an effective solution for detecting AI-generated video, in line with current regulatory needs. Abstract: Recent advancements in AI-based multimedia generation have enabled the creation of hyper-realistic images and videos, raising concerns about their potential use in spreading misinformation. The widespread accessibility of generative techniques, which allow for the production of fake multimedia from prompts or existing media, along with their continuous refinement, underscores the urgent need for highly accurate and generalizable AI-generated media detection methods, underlined also by new regulations like the European Digital AI Act. In this paper, we draw inspiration from Vision Transformer (ViT)-based fake image detection and extend this idea to video. We propose an original framework that effectively integrates ViT embeddings over time to enhance detection performance. Our method shows promising accuracy, generalization, and few-shot learning capabilities across a new, large and diverse dataset of videos generated using five open source generative techniques from the state-of-the-art, as well as a separate dataset containing videos produced by proprietary generative methods.

[48] FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection

Yao Xiao,Tingfa Xu,Yu Xin,Jianan Li

Main category: cs.CV

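TL;DR: FBRT-YOLO is a new real-time detector that improves small-target detection with two lightweight modules, FCM and MKP, balancing accuracy and efficiency.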

Details

Motivation: Existing methods lose information on small targets and fail to balance efficiency with accuracy, which hinders progress in real-time detection. Method: The FCM module mitigates the loss of small-target information, and the MKP module uses multi-scale convolutions to enhance target perception. Result: FBRT-YOLO performs strongly on the Visdrone, UAVDT, and AI-TOD datasets, outperforming other real-time detectors. Conclusion: FBRT-YOLO effectively addresses the accuracy-efficiency balance in small-target detection. Abstract: Embedded flight devices with visual capabilities have become essential for a wide range of applications. In aerial image detection, while many existing methods have partially addressed the issue of small target detection, challenges remain in optimizing small target detection and balancing detection accuracy with efficiency. These issues are key obstacles to the advancement of real-time aerial image detection. In this paper, we propose a new family of real-time detectors for aerial image detection, named FBRT-YOLO, to address the imbalance between detection accuracy and efficiency. Our method comprises two lightweight modules: Feature Complementary Mapping Module (FCM) and Multi-Kernel Perception Unit (MKP), designed to enhance object perception for small targets in aerial images. FCM focuses on alleviating the problem of information imbalance caused by the loss of small target information in deep networks. It aims to integrate spatial positional information of targets more deeply into the network, better aligning with semantic information in the deeper layers to improve the localization of small targets. We introduce MKP, which leverages convolutions with kernels of different sizes to enhance the relationships between targets of various scales and improve the perception of targets at different scales. Extensive experimental results on three major aerial image datasets, including Visdrone, UAVDT, and AI-TOD, demonstrate that FBRT-YOLO outperforms various real-time detectors in terms of performance and speed.

[49] Occlusion-aware Driver Monitoring System using the Driver Monitoring Dataset

Paola Natalia Cañas,Alexander Diez,David Galvañ,Marcos Nieto,Igor Rodríguez

Main category: cs.CV

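TL;DR: The paper presents a robust driver monitoring system (DMS) based on RGB and infrared images that performs driver identification, region-based gaze estimation, and occlusion detection under a range of lighting conditions.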

Details

Motivation: To develop a DMS aligned with EuroNCAP recommendations and to improve the system's reliability and trustworthiness under occlusion. Method: Separate algorithms trained on RGB and infrared images are integrated into a single pipeline, addressing the challenges of multiple sensors and real-car deployment. Result: The system's effectiveness is validated on the DMD dataset and in real-world scenarios; RGB-based models perform better, and occlusion detection is a novel contribution to the DMS field. Conclusion: The system performs well under difficult lighting and occlusion, offering a new solution for driver monitoring. Abstract: This paper presents a robust, occlusion-aware driver monitoring system (DMS) utilizing the Driver Monitoring Dataset (DMD). The system performs driver identification, gaze estimation by regions, and face occlusion detection under varying lighting conditions, including challenging low-light scenarios. Aligned with EuroNCAP recommendations, the inclusion of occlusion detection enhances situational awareness and system trustworthiness by indicating when the system's performance may be degraded. The system employs separate algorithms trained on RGB and infrared (IR) images to ensure reliable functioning. We detail the development and integration of these algorithms into a cohesive pipeline, addressing the challenges of working with different sensors and real-car implementation. Evaluation on the DMD and in real-world scenarios demonstrates the effectiveness of the proposed system, highlighting the superior performance of RGB-based models and the pioneering contribution of robust occlusion detection in DMS.

[50] OG-HFYOLO: Orientation gradient guidance and heterogeneous feature fusion for deformation table cell instance segmentation

Long Liu,Cihui Yang

Main category: cs.CV

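TL;DR: The paper proposes OG-HFYOLO, which improves structure recognition for deformed tables with a gradient orientation-aware extractor, a heterogeneous kernel cross fusion module, and a scale-aware loss function, and releases the DWTAL dataset.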

Details

Motivation: Geometric deformation in deformed tables weakens the link between content and structure, which hurts the accuracy of downstream tasks. Method: The OG-HFYOLO model combines a gradient orientation-aware extractor, a heterogeneous kernel cross fusion module, a scale-aware loss function, and mask-driven non-maximum suppression. Result: The model achieves excellent segmentation accuracy among mainstream instance segmentation models. Conclusion: OG-HFYOLO effectively improves structure recognition for deformed tables and fills a gap in available datasets. Abstract: Table structure recognition is a key task in document analysis. However, the geometric deformation in deformed tables causes a weak correlation between content information and structure, resulting in downstream tasks not being able to obtain accurate content information. To obtain fine-grained spatial coordinates of cells, we propose the OG-HFYOLO model, which enhances the edge response by Gradient Orientation-aware Extractor, combines a Heterogeneous Kernel Cross Fusion module and a scale-aware loss function to adapt to multi-scale objective features, and introduces mask-driven non-maximum suppression in the post-processing, which replaces the traditional bounding box suppression mechanism. Furthermore, we also propose a data generator, filling the gap in the dataset for fine-grained deformation table cell spatial coordinate localization, and derive a large-scale dataset named Deformation Wired Table (DWTAL). Experiments show that our proposed model demonstrates excellent segmentation accuracy among mainstream instance segmentation models. The dataset and the source code are open source: https://github.com/justliulong/OGHFYOLO.
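
Note: the mask-driven suppression mentioned in the abstract can be pictured as ordinary greedy non-maximum suppression with the overlap computed on instance masks instead of bounding boxes. A minimal sketch follows, assuming masks are given as sets of pixel coordinates; the threshold, helper names, and toy detections are hypothetical and are not taken from the OG-HFYOLO code.

```python
def mask_iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mask_nms(detections, iou_threshold=0.5):
    """Greedy NMS on masks.

    detections: list of (score, mask) pairs.
    Returns the indices of kept detections, highest score first.
    """
    order = sorted(range(len(detections)), key=lambda i: detections[i][0], reverse=True)
    kept = []
    for i in order:
        # Keep a detection only if it does not overlap any kept mask too much.
        if all(mask_iou(detections[i][1], detections[k][1]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


def rect(r0, r1, c0, c1):
    """Helper: a filled rectangular mask covering rows r0..r1-1 and cols c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


detections = [
    (0.9, rect(0, 4, 0, 4)),      # best-scoring cell
    (0.8, rect(0, 4, 1, 5)),      # same cell shifted by one column (mask IoU 0.6)
    (0.7, rect(10, 14, 10, 14)),  # a different, disjoint cell
]
kept = mask_nms(detections)
```

For skewed or curved table cells, axis-aligned boxes of neighbouring cells can overlap heavily even when the cells themselves do not, which is the usual argument for comparing masks instead.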

[51] Efficient Listener: Dyadic Facial Motion Synthesis via Action Diffusion

Zesheng Wang,Alexandre Bruckert,Patrick Le Callet,Guangtao Zhai

Main category: cs.CV

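TL;DR: The paper proposes Facial Action Diffusion (FAD) and the Efficient Listener Network (ELNet), which use diffusion methods and visual-audio input to improve listener facial motion generation while greatly reducing computation time.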

Details

Motivation: Existing methods cannot achieve real-time interaction because of the computational speed of 3DMM, so listener facial motions need to be generated more efficiently. Method: FAD (a diffusion method) is combined with ELNet (a network taking visual and audio input) to learn facial motion representations. Result: The method outperforms existing methods while reducing computation time by 99%. Conclusion: FAD and ELNet effectively address the challenge of generating listener facial motions in real time. Abstract: Generating realistic listener facial motions in dyadic conversations remains challenging due to the high-dimensional action space and temporal dependency requirements. Existing approaches usually consider extracting 3D Morphable Model (3DMM) coefficients and modeling in the 3DMM space. However, this makes the computational speed of the 3DMM a bottleneck, making it difficult to achieve real-time interactive responses. To tackle this problem, we propose Facial Action Diffusion (FAD), which introduces the diffusion methods from the field of image generation to achieve efficient facial action generation. We further build the Efficient Listener Network (ELNet) specially designed to accommodate both the visual and audio information of the speaker as input. Combining FAD and ELNet, the proposed method learns effective listener facial motion representations and improves on the performance of state-of-the-art methods while reducing computational time by 99%.

[52] In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

Zechuan Zhang,Ji Xie,Yu Lu,Zongxin Yang,Yi Yang

Main category: cs.CV

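TL;DR: The paper proposes an instruction-based image editing method that combines a Diffusion Transformer with its contextual awareness to resolve the precision-efficiency tradeoff of existing methods.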

Details

Motivation: Current instruction-based image editing methods face a precision-efficiency tradeoff: fine-tuning methods demand heavy computation, while training-free methods fall short in instruction comprehension and edit quality. Method: Three contributions are proposed: (1) a zero-shot editing framework based on in-context prompting; (2) a LoRA-MoE hybrid tuning strategy; and (3) an early noise filtering method based on vision-language models. Result: Experiments show that the method outperforms existing approaches while requiring only 0.5% of the training data and 1% of the trainable parameters. Conclusion: The method offers a new paradigm for efficient, high-precision instruction-guided image editing. Abstract: Instruction-based image editing enables robust image modification via natural language prompts, yet current methods face a precision-efficiency tradeoff. Fine-tuning methods demand significant computational resources and large datasets, while training-free techniques struggle with instruction comprehension and edit quality. We resolve this dilemma by leveraging the enhanced generation capacity and native contextual awareness of large-scale Diffusion Transformers (DiT). Our solution introduces three contributions: (1) an in-context editing framework for zero-shot instruction compliance using in-context prompting, avoiding structural changes; (2) a LoRA-MoE hybrid tuning strategy that enhances flexibility with efficient adaptation and dynamic expert routing, without extensive retraining; and (3) an early filter inference-time scaling method using vision-language models (VLMs) to select better initial noise early, improving edit quality. Extensive evaluations demonstrate our method's superiority: it outperforms state-of-the-art approaches while requiring only 0.5% training data and 1% trainable parameters compared to conventional baselines. This work establishes a new paradigm that enables high-precision yet efficient instruction-guided editing. Codes and demos can be found at https://river-zhang.github.io/ICEdit-gh-pages/.

[53] Adept: Annotation-Denoising Auxiliary Tasks with Discrete Cosine Transform Map and Keypoint for Human-Centric Pretraining

Weizhen He,Yunfeng Yan,Shixiang Tang,Yiheng Deng,Yangyang Zhong,Pengxin Luo,Donglian Qi

Main category: cs.CV

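TL;DR: The paper proposes a human-centric pretraining method based on frequency-domain (DCT) information from RGB images; it discards depth information and adds auxiliary tasks on keypoints and DCT maps to improve model performance.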

Details

Motivation: Existing methods rely on depth information or task-specific datasets, which limits the scalability of pretrained models. The paper aims to solve this with frequency-domain information from RGB images. Method: The Discrete Cosine Transform (DCT) is used to extract frequency-domain semantic information from RGB images, and auxiliary tasks on keypoints and DCT maps are designed to strengthen feature learning. Result: The method outperforms existing approaches on multiple tasks (pose estimation, human parsing, crowd counting, and others). Conclusion: With frequency-domain information and auxiliary tasks, efficient human-centric pretraining is possible without depth information. Abstract: Human-centric perception is the core of diverse computer vision tasks and has been a long-standing research focus. However, previous research studied these human-centric tasks individually, and their performance is largely limited by the size of the public task-specific datasets. Recent human-centric methods leverage additional modalities, e.g., depth, to learn fine-grained semantic information, which limits the benefit of pretraining models due to their sensitivity to camera views and the scarcity of RGB-D data on the Internet. This paper improves the data scalability of human-centric pretraining methods by discarding depth information and exploring semantic information of RGB images in the frequency space by Discrete Cosine Transform (DCT). We further propose new annotation denoising auxiliary tasks with keypoints and DCT maps to enforce the RGB image extractor to learn fine-grained semantic information of human bodies. Our extensive experiments show that when pretrained on large-scale datasets (COCO and AIC datasets) without depth annotation, our model achieves better performance than state-of-the-art methods by +0.5 mAP on COCO, +1.4 PCKh on MPII and -0.51 EPE on Human3.6M for pose estimation, by +4.50 mIoU on Human3.6M for human parsing, by -3.14 MAE on SHA and -0.07 MAE on SHB for crowd counting, by +1.1 F1 score on SHA and +0.8 F1 score on SHA for crowd localization, and by +0.1 mAP on Market1501 and +0.8 mAP on MSMT for person ReID. We also validate the effectiveness of our method on MPII+NTURGBD datasets.
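
Note: to make the frequency-space representation concrete, here is a minimal sketch of a 2D DCT over a small image block. It assumes the orthonormal DCT-II, the variant used in JPEG-style block transforms; the abstract does not say which variant, block size, or colour handling the authors use, so treat those choices as illustrative.

```python
import math


def dct_1d(signal):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # cols is indexed [horizontal frequency][vertical frequency]; transpose back.
    return [list(r) for r in zip(*cols)]


# A constant 4x4 block: all energy should end up in the DC coefficient.
flat = [[1.0] * 4 for _ in range(4)]
coeffs = dct_2d(flat)
```

A flat patch puts everything into the top-left (DC) coefficient, while edges and texture spread energy into higher-frequency entries, which is the kind of structure a DCT map exposes to the network.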

[54] GaussTrap: Stealthy Poisoning Attacks on 3D Gaussian Splatting for Targeted Scene Confusion

Jiaxin Hong,Sixu Chen,Shuoyang Sun,Hongyao Yu,Hao Fang,Yuqi Tan,Bin Chen,Shuhan Qi,Jiawei Li

Main category: cs.CV

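TL;DR: This paper presents the first systematic study of backdoor threats in 3D Gaussian Splatting (3DGS) and proposes GaussTrap, a new poisoning attack that implants malicious views at specific viewpoints while preserving high-quality rendering in non-target views.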

Details

Motivation: As 3DGS is rapidly adopted in safety-critical domains, its potential security vulnerabilities need urgent study, in particular backdoor threats that can cause scene confusion and environmental misperception. Method: GaussTrap uses a three-stage pipeline (attack, stabilization, and normal training) to implant stealthy, viewpoint-consistent poisoned renderings, jointly optimizing attack efficacy and perceptual realism. Result: Experiments show that GaussTrap effectively embeds imperceptible yet harmful backdoor views while maintaining high-quality rendering in normal views. Conclusion: The study exposes security risks in 3D rendering and underlines the need to harden 3DGS models in safety-critical applications. Abstract: As 3D Gaussian Splatting (3DGS) emerges as a breakthrough in scene representation and novel view synthesis, its rapid adoption in safety-critical domains (e.g., autonomous systems, AR/VR) urgently demands scrutiny of potential security vulnerabilities. This paper presents the first systematic study of backdoor threats in 3DGS pipelines. We identify that adversaries may implant backdoor views to induce malicious scene confusion during inference, potentially leading to environmental misperception in autonomous navigation or spatial distortion in immersive environments. To uncover this risk, we propose GaussTrap, a novel poisoning attack method targeting 3DGS models. GaussTrap injects malicious views at specific attack viewpoints while preserving high-quality rendering in non-target views, ensuring minimal detectability and maximizing potential harm. Specifically, the proposed method consists of a three-stage pipeline (attack, stabilization, and normal training) to implant stealthy, viewpoint-consistent poisoned renderings in 3DGS, jointly optimizing attack efficacy and perceptual realism to expose security risks in 3D rendering. Extensive experiments on both synthetic and real-world datasets demonstrate that GaussTrap can effectively embed imperceptible yet harmful backdoor views while maintaining high-quality rendering in normal views, validating its robustness, adaptability, and practical applicability.

[55] CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation

Jianyu Wu,Yizhou Wang,Xiangyu Yue,Xinzhu Ma,Jingyang Guo,Dongzhan Zhou,Wanli Ouyang,Shixiang Tang

Main category: cs.CV

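TL;DR: Proposes CMT, a B-Rep-based multimodal CAD generation framework, and builds the large-scale dataset mmABC; experiments show that CMT outperforms existing methods on generation tasks.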

Details

Motivation: Existing CAD methods struggle to meet multimodal design requirements because of over-simplified representations or inadequate architectures; the problem needs to be addressed on both the method and the dataset side. Method: Proposes the CMT framework, which combines a cascade MAR with a topology predictor and uses B-Rep to capture geometric priors; builds the mmABC dataset to support training. Result: CMT improves Coverage and Valid ratio by more than 10% in unconditional generation, and improves Chamfer by 4.01 in image-conditioned generation. Conclusion: CMT performs strongly in multimodal CAD generation; the dataset and code will be open-sourced. Abstract: While accurate and user-friendly Computer-Aided Design (CAD) is crucial for industrial design and manufacturing, existing methods still struggle to achieve this due to their over-simplified representations or architectures incapable of supporting multimodal design requirements. In this paper, we attempt to tackle this problem from both methods and datasets aspects. First, we propose a cascade MAR with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the "edge-counters-surface" priors that are essential in B-Reps, while the topology predictor directly estimates topology in B-Reps from the compact tokens in MAR. Second, to facilitate large-scale training, we develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations, including point clouds, text descriptions, and multi-view images. Extensive experiments show the superiority of CMT in both conditional and unconditional CAD generation tasks. For example, we improve Coverage and Valid ratio by +10.68% and +10.3%, respectively, compared to state-of-the-art methods on ABC in unconditional generation. CMT also improves +4.01 Chamfer on image conditioned CAD generation on mmABC. The dataset, code and pretrained network shall be released.

[56] RadSAM: Segmenting 3D radiological images with a 2D promptable model

Julien Khlaut,Elodie Ferreres,Daniel Tordjman,Hélène Philippe,Tom Boeken,Pierre Manceron,Corentin Dancette

Main category: cs.CV

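TL;DR: RadSAM is a method for segmenting 3D medical images with a 2D model; it segments a 3D object from a single prompt and addresses the shortcomings of existing SAM models on medical images.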

Details

Motivation: Medical image segmentation demands high precision, but existing SAM models are trained on natural images, cannot handle 3D medical data (such as CT and MRI) effectively, and lack editing features. Method: RadSAM trains a 2D model with noisy masks, bounding boxes, and points as initial prompts, and combines it with an iterative inference pipeline that reconstructs the 3D mask slice by slice. Result: On the AMOS abdominal organ segmentation dataset, RadSAM outperforms existing state-of-the-art models, demonstrating its 3D segmentation ability and out-of-domain transfer. Conclusion: RadSAM offers an efficient and flexible approach to 3D medical image segmentation and fills a gap left by SAM models in the medical domain. Abstract: Medical image segmentation is a crucial and time-consuming task in clinical care, where mask precision is extremely important. The Segment Anything Model (SAM) offers a promising approach, as it provides an interactive interface based on visual prompting and edition to refine an initial segmentation. This model has strong generalization capabilities, does not rely on predefined classes, and adapts to diverse objects; however, it is pre-trained on natural images and lacks the ability to process medical data effectively. In addition, this model is built for 2D images, whereas a whole medical domain is based on 3D images, such as CT and MRI. Recent adaptations of SAM for medical imaging are based on 2D models, thus requiring one prompt per slice to segment 3D objects, making the segmentation process tedious. They also lack important features such as editing. To bridge this gap, we propose RadSAM, a novel method for segmenting 3D objects with a 2D model from a single prompt. In practice, we train a 2D model using noisy masks as initial prompts, in addition to bounding boxes and points. We then use this novel prompt type with an iterative inference pipeline to reconstruct the 3D mask slice-by-slice. We introduce a benchmark to evaluate the model's ability to segment 3D objects in CT images from a single prompt and evaluate the models' out-of-domain transfer and edition capabilities. We demonstrate the effectiveness of our approach against state-of-the-art models on this benchmark using the AMOS abdominal organ segmentation dataset.
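The slice-by-slice idea in the RadSAM abstract can be sketched as a propagation loop in which the mask predicted on one slice becomes the (noisy) mask prompt for its neighbour. This is only a sketch under assumptions: `segment_slice` stands in for a 2D promptable model and is hypothetical, as is the stopping rule (stop when the predicted mask is empty); the abstract does not describe the actual inference pipeline in this detail.

```python
from typing import Callable

import numpy as np

# Hypothetical 2D model interface: (image slice, mask prompt) -> boolean mask.
Segmenter = Callable[[np.ndarray, np.ndarray], np.ndarray]


def propagate_mask(
    volume: np.ndarray,
    start_index: int,
    start_mask: np.ndarray,
    segment_slice: Segmenter,
) -> np.ndarray:
    """Grow a 3D mask from one prompted slice by walking up and down the volume."""
    depth = volume.shape[0]
    masks = np.zeros(volume.shape, dtype=bool)
    masks[start_index] = start_mask
    for step in (1, -1):  # first towards higher indices, then lower
        prompt = start_mask
        index = start_index + step
        while 0 <= index < depth:
            # The previous slice's mask is the prompt for the current slice.
            prompt = segment_slice(volume[index], prompt)
            if not prompt.any():  # assumed stopping rule: object has ended
                break
            masks[index] = prompt
            index += step
    return masks
```

Because each slice only needs the neighbouring mask, a single user prompt on one slice is enough to start the whole 3D reconstruction.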

Mainak Singha,Subhankar Roy,Sarthak Mehrotra,Ankit Jha,Moloud Abdar,Biplab Banerjee,Elisa Ricci

Main category: cs.CV

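TL;DR: FedMVP uses multimodal visual prompt tuning to address the overfitting of textual prompt tuning in federated learning, improving adaptability to unseen concepts.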

Details

Motivation: In federated learning, textual prompt tuning tends to overfit to known concepts and to rely on memorized text features, which limits its adaptability to unseen concepts. Method: Proposes FedMVP, which uses multimodal contextual information (image-conditioned features and textual attribute features); a PromptFormer module aligns textual and visual features and generates dynamic multimodal visual prompts, which are fed to CLIP's frozen vision encoder for training. Result: Experiments on 20 datasets show that FedMVP preserves performance on known classes and domains while generalizing better to unseen classes and domains. Conclusion: FedMVP's multimodal visual prompt tuning substantially improves generalization in federated learning. Abstract: Textual prompt tuning adapts Vision-Language Models (e.g., CLIP) in federated learning by tuning lightweight input tokens (or prompts) on local client data, while keeping network weights frozen. Post training, only the prompts are shared by the clients with the central server for aggregation. However, textual prompt tuning often struggles with overfitting to known concepts and may be overly reliant on memorized text features, limiting its adaptability to unseen concepts. To address this limitation, we propose Federated Multimodal Visual Prompt Tuning (FedMVP) that conditions the prompts on comprehensive contextual information -- image-conditioned features and textual attribute features of a class -- that is multimodal in nature. At the core of FedMVP is a PromptFormer module that synergistically aligns textual and visual features through cross-attention, enabling richer contextual integration. The dynamically generated multimodal visual prompts are then input to the frozen vision encoder of CLIP, and trained with a combination of CLIP similarity loss and a consistency loss. Extensive evaluation on 20 datasets spanning three generalization settings demonstrates that FedMVP not only preserves performance on in-distribution classes and domains, but also displays higher generalizability to unseen classes and domains when compared to state-of-the-art methods. Codes will be released upon acceptance.

[58] AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection

Lorenzo Pellegrini,Davide Cozzolino,Serafino Pandolfini,Davide Maltoni,Matteo Ferrara,Luisa Verdoliva,Marco Prati,Marco Ramilli

Main category: cs.CV

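TL;DR: Ai-GenBench is a new benchmark for detecting AI-generated images; it addresses limitations of existing approaches and provides standardized evaluation tools.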

Details

Motivation: The rapid progress of generative AI enables high-quality image synthesis but also raises media authenticity problems, so robust detection methods are urgently needed. Method: Proposes Ai-GenBench, which adopts a temporal evaluation framework in which detection models are trained incrementally to cope with new generative models (e.g., the shift from GANs to diffusion models). Result: Ai-GenBench provides a diverse dataset, a standardized protocol, and tools usable by both researchers and non-experts, ensuring reproducibility. Conclusion: Ai-GenBench enables fair comparison of detection methods and scalable solutions, supporting the development of robust detectors. Abstract: The rapid advancement of generative AI has revolutionized image creation, enabling high-quality synthesis from text prompts while raising critical challenges for media authenticity. We present Ai-GenBench, a novel benchmark designed to address the urgent need for robust detection of AI-generated images in real-world scenarios. Unlike existing solutions that evaluate models on static datasets, Ai-GenBench introduces a temporal evaluation framework where detection methods are incrementally trained on synthetic images, historically ordered by their generative models, to test their ability to generalize to new generative models, such as the transition from GANs to diffusion models. Our benchmark focuses on high-quality, diverse visual content and overcomes key limitations of current approaches, including arbitrary dataset splits, unfair comparisons, and excessive computational demands. Ai-GenBench provides a comprehensive dataset, a standardized evaluation protocol, and accessible tools for both researchers and non-experts (e.g., journalists, fact-checkers), ensuring reproducibility while maintaining practical training requirements. By establishing clear evaluation rules and controlled augmentation strategies, Ai-GenBench enables meaningful comparison of detection methods and scalable solutions. Code and data are publicly available to ensure reproducibility and to support the development of robust forensic detectors to keep pace with the rise of new synthetic generators.
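The temporal evaluation idea (train incrementally on generators in historical order, then test on the ones that come next) can be sketched as a short loop. This is an illustration under assumptions, not the benchmark's protocol: the function `temporal_evaluation`, the `train_step` and `evaluate` callbacks, and the choice to score only on the immediately following period are all hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

Batch = TypeVar("Batch")


def temporal_evaluation(
    periods: Sequence[Batch],
    train_step: Callable[[Batch], None],
    evaluate: Callable[[Batch], float],
) -> List[float]:
    """Train on period t, then score on period t + 1, for each t in order.

    `periods` must already be sorted by the release date of the generators
    that produced the images in each period.
    """
    scores: List[float] = []
    for t in range(len(periods) - 1):
        train_step(periods[t])  # incremental update on data seen so far
        scores.append(evaluate(periods[t + 1]))  # generators not yet trained on
    return scores
```

Compared with a random train/test split, this ordering never lets the detector see images from a generator family before it is asked to detect them, which is the generalization setting the abstract emphasises.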

[59] FLIM-based Salient Object Detection Networks with Adaptive Decoders

Gilson Junior Soares,Matheus Abrantes Cerqueira,Jancarlo F. Gomes,Laurent Najman,Silvio Jamil F. Guimarães,Alexandre Xavier Falcão

Main category: cs.CV

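TL;DR: This paper proposes ultra-lightweight ("flyweight") FLIM networks for salient object detection (SOD) with adaptive decoders; they are trained from only a few representative images and without backpropagation, making them suitable for settings with limited compute and limited labeled data.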

Details

Motivation: For applications with limited computational resources, the paper studies the potential of lightweight models for SOD, reducing the reliance on deep neural networks. Method: Trains lightweight networks with the FLIM methodology and combines them with adaptive decoders whose weights are estimated for each input image by a heuristic function; only 3 to 4 representative images are needed and no backpropagation is used. Result: Experiments show that the FLIM networks outperform existing lightweight networks and other FLIM variants on two challenging SOD tasks. Conclusion: FLIM networks have clear advantages in resource-constrained settings, and their potential in new applications deserves further study. Abstract: Salient Object Detection (SOD) methods can locate objects that stand out in an image, assign higher values to their pixels in a saliency map, and binarize the map, outputting a predicted segmentation mask. A recent tendency is to investigate pre-trained lightweight models rather than deep neural networks in SOD tasks, coping with applications under limited computational resources. In this context, we have investigated lightweight networks using a methodology named Feature Learning from Image Markers (FLIM), which assumes that the encoder's kernels can be estimated from marker pixels on discriminative regions of a few representative images. This work proposes flyweight networks, hundreds of times lighter than lightweight models, for SOD by combining a FLIM encoder with an adaptive decoder, whose weights are estimated for each input image by a given heuristic function. Such FLIM networks are trained from three to four representative images only and without backpropagation, making the models suitable for applications under labeled data constraints as well. We study five adaptive decoders; two of them are introduced here. Unlike the previous ones that rely on one neuron per pixel with shared weights, the heuristic functions of the new adaptive decoders estimate the weights of each neuron per pixel. We compare FLIM models with adaptive decoders for two challenging SOD tasks with three lightweight networks from the state-of-the-art, two FLIM networks with decoders trained by backpropagation, and one FLIM network whose labeled markers define the decoder's weights. The experiments demonstrate the advantages of the proposed networks over the baselines, revealing the importance of further investigating such methods in new applications.

[60] Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers

Quentin Guimard,Moreno D'Incà,Massimiliano Mancini,Elisa Ricci

Main category: cs.CV

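TL;DR: C2B is a bias discovery framework that needs no labeled data; it generates bias proposals from a task description and evaluates model bias, outperforming existing annotation-dependent methods.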

Details

Motivation: Existing bias identification methods depend on labeled data, which limits where they can be applied; C2B aims to remove this constraint. Method: Uses a large language model to generate bias proposals and captions, collects images with a retrieval model, and evaluates the model's bias on them. Result: C2B discovers more biases on public datasets and outperforms an annotation-dependent baseline. Conclusion: C2B is a promising first step toward task-agnostic, unsupervised bias detection. Abstract: A person downloading a pre-trained model from the web should be aware of its biases. Existing approaches for bias identification rely on datasets containing labels for the task of interest, something that a non-expert may not have access to, or may not have the necessary resources to collect: this greatly limits the number of tasks where model biases can be identified. In this work, we present Classifier-to-Bias (C2B), the first bias discovery framework that works without access to any labeled data: it only relies on a textual description of the classification task to identify biases in the target classification model. This description is fed to a large language model to generate bias proposals and corresponding captions depicting biases together with task-specific target labels. A retrieval model collects images for those captions, which are then used to assess the accuracy of the model w.r.t. the given biases. C2B is training-free, does not require any annotations, has no constraints on the list of biases, and can be applied to any pre-trained model on any classification task. Experiments on two publicly available datasets show that C2B discovers biases beyond those of the original datasets and outperforms a recent state-of-the-art bias detection baseline that relies on task-specific annotations, being a promising first step toward addressing task-agnostic unsupervised bias detection.

[61] DS_FusionNet: Dynamic Dual-Stream Fusion with Bidirectional Knowledge Distillation for Plant Disease Recognition

Yanghui Song,Chengfu Yang

Main category: cs.CV

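TL;DR: Proposes a Dynamic Dual-Stream Fusion Network (DS_FusionNet) to address technical challenges in plant disease recognition, such as small-sample learning and illumination variation, significantly improving recognition accuracy.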

Details

Motivation: The growth security of economic crops worldwide faces severe challenges, and precise identification and prevention of plant diseases has become a key problem in AI-enabled agricultural technology. Method: Builds DS_FusionNet from a dual-backbone architecture, deformable dynamic fusion modules, and a bidirectional knowledge distillation strategy. Result: Achieves over 90% classification accuracy on the PlantDisease and CIFAR-10 datasets using only 10% of the data, and maintains 85% accuracy on the complex PlantWild dataset. Conclusion: The work offers new technical ideas for fine-grained image classification and lays a solid foundation for precise identification and management of agricultural diseases. Abstract: Given the severe challenges confronting the global growth security of economic crops, precise identification and prevention of plant diseases has emerged as a critical issue in artificial intelligence-enabled agricultural technology. To address the technical challenges in plant disease recognition, including small-sample learning, leaf occlusion, illumination variations, and high inter-class similarity, this study innovatively proposes a Dynamic Dual-Stream Fusion Network (DS_FusionNet). The network integrates a dual-backbone architecture, deformable dynamic fusion modules, and bidirectional knowledge distillation strategy, significantly enhancing recognition accuracy. Experimental results demonstrate that DS_FusionNet achieves classification accuracies exceeding 90% using only 10% of the PlantDisease and CIFAR-10 datasets, while maintaining 85% accuracy on the complex PlantWild dataset, exhibiting exceptional generalization capabilities. This research not only provides novel technical insights for fine-grained image classification but also establishes a robust foundation for precise identification and management of agricultural diseases.

[62] SVD Based Least Squares for X-Ray Pneumonia Classification Using Deep Features

Mete Erdogan,Sebnem Demirtas

Main category: cs.CV

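TL;DR: Proposes an SVD-LS-based framework for multi-class pneumonia classification that combines self-supervised and transfer learning features for efficient and accurate diagnosis.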

Details

Motivation: Early and accurate diagnosis of pneumonia from X-ray images is essential for treatment, and machine learning tools can help radiologists work more efficiently and reliably. Method: Uses an SVD-LS framework on feature representations from self-supervised and transfer learning models, with a closed-form, non-iterative classifier that avoids computationally expensive gradient-based fine-tuning. Result: Experiments show that SVD-LS remains competitive while significantly reducing computational cost, making it suitable for real-time medical imaging applications. Conclusion: SVD-LS is an efficient and accurate alternative for pneumonia diagnosis that fits practical clinical settings. Abstract: Accurate and early diagnosis of pneumonia through X-ray imaging is essential for effective treatment and improved patient outcomes. Recent advancements in machine learning have enabled automated diagnostic tools that assist radiologists in making more reliable and efficient decisions. In this work, we propose a Singular Value Decomposition-based Least Squares (SVD-LS) framework for multi-class pneumonia classification, leveraging powerful feature representations from state-of-the-art self-supervised and transfer learning models. Rather than relying on computationally expensive gradient-based fine-tuning, we employ a closed-form, non-iterative classification approach that ensures efficiency without compromising accuracy. Experimental results demonstrate that SVD-LS achieves competitive performance while offering significantly reduced computational costs, making it a viable alternative for real-time medical imaging applications.
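A closed-form, non-iterative classifier of the kind described here can be sketched as least squares on one-hot targets, solved through the SVD pseudoinverse. This is a minimal sketch under assumptions: the features are taken to be precomputed embeddings from a frozen backbone, and the bias column, the `rcond` cutoff, and the names `fit_svd_ls` and `predict` are illustrative choices rather than details from the paper.

```python
import numpy as np


def _with_bias(features: np.ndarray) -> np.ndarray:
    """Append a constant column so the linear map can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_svd_ls(
    features: np.ndarray,
    labels: np.ndarray,
    num_classes: int,
    rcond: float = 1e-10,
) -> np.ndarray:
    """Solve min_W ||X W - Y||_F in closed form: W = V diag(1/s) U^T Y."""
    X = _with_bias(features)
    Y = np.eye(num_classes)[labels]  # one-hot targets
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > rcond * s.max()  # drop near-zero singular values
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv[:, None] * (U.T @ Y))


def predict(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Pick the class whose regression output is largest."""
    return np.argmax(_with_bias(features) @ W, axis=1)
```

The whole fit is a single SVD of the feature matrix, so there are no learning rates or epochs to tune, which is where the computational saving over gradient-based fine-tuning comes from.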

[63] TesserAct: Learning 4D Embodied World Models

Haoyu Zhen,Qiao Sun,Hongxin Zhang,Junyan Li,Siyuan Zhou,Yilun Du,Chuang Gan

Main category: cs.CV

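TL;DR: Proposes a method for learning 4D world models by training on RGB-DN videos; it predicts dynamic 3D scenes with spatial and temporal consistency, surpasses traditional 2D models, and supports effective policy learning.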

Details

Motivation: Traditional 2D models cannot capture the spatio-temporal changes of dynamic 3D scenes, which limits their use in embodied agents. This paper addresses the problem with a 4D world model. Method: Extends robotic manipulation video datasets to RGB-DN format, fine-tunes a video generation model to predict RGB-DN frames, and designs an algorithm that converts the generated videos into high-quality 4D scenes. Result: The method outperforms existing video world models in spatio-temporal consistency, novel view synthesis, and policy learning. Conclusion: The proposed 4D world model gives embodied agents more accurate predictions of dynamic scenes. Abstract: This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video-based world models.

[64] X-Fusion: Introducing New Modality to Frozen Large Language Models

Sicheng Mo,Thao Nguyen,Xun Huang,Siddharth Srinivasan Iyer,Yijun Li,Yuchen Liu,Abhishek Tandon,Eli Shechtman,Krishna Kumar Singh,Yong Jae Lee,Bolei Zhou,Yuheng Li

Main category: cs.CV

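TL;DR: X-Fusion is a framework that extends pretrained Large Language Models (LLMs) to multimodal tasks, integrating visual information while preserving their language capabilities.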

Details

Motivation: Addresses how to extend LLMs to multimodal tasks while preserving their language capabilities. Method: Uses a dual-tower design that keeps the LLM's parameters frozen and introduces modality-specific weights to integrate visual information. Result: Outperforms alternative architectures on image-to-text and text-to-image tasks; understanding-focused data improves generation quality, and reducing noise in image data improves performance. Conclusion: X-Fusion provides valuable insights into building efficient unified multimodal models. Abstract: We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.

[65] Yo'Chameleon: Personalized Vision and Language Generation

Thao Nguyen,Krishna Kumar Singh,Jing Shi,Trung Bui,Yong Jae Lee,Yuheng Li

Main category: cs.CV

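TL;DR: Yo'Chameleon is the first attempt to study personalization for large multimodal models; it uses soft-prompt tuning to enable question answering and image generation for specific concepts.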

Details

Motivation: Existing large multimodal models lack personalized knowledge of user-specific concepts, and it is unclear how personalization methods adapt to other modalities such as image generation. Method: Uses 3-5 images of a specific concept and soft-prompt tuning to embed subject information, combined with a self-prompting optimization mechanism and a "soft-positive" image generation approach. Result: The model can answer questions about the subject and generate images of it in new contexts with pixel-level detail. Conclusion: Yo'Chameleon offers an effective approach to personalizing multimodal models, particularly in few-shot settings. Abstract: Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduce Yo'Chameleon, the first attempt to study personalization for large multimodal models. Given 3-5 images of a particular concept, Yo'Chameleon leverages soft-prompt tuning to embed subject-specific information to (i) answer questions about the subject and (ii) recreate pixel-level details to produce images of the subject in new contexts. Yo'Chameleon is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a "soft-positive" image generation approach to enhance image quality in a few-shot setting.

cs.GR

[66] Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-constrained Gaussian Splatting

Hanxi Liu,Yifang Men,Zhouhui Lian

Main category: cs.GR

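TL;DR: Proposes a framework based on Tetrahedron-constrained Gaussian Splatting (TetGS) for creating editable 3D avatars, with precise region localization, geometric adaptability, and photorealistic rendering.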

Details

Motivation: Existing 3D editing methods struggle to produce visually pleasing results in complicated reconstructed scenarios, because mixed optimization of geometry and texture makes representation learning unstable. This paper aims to give ordinary users an accessible solution. Method: Uses TetGS as the underlying representation and decouples editing into local spatial adaptation and realistic appearance learning, optimized in three stages: 3D avatar instantiation, localized spatial adaptation, and geometry-based appearance generation. Result: Qualitative and quantitative experiments show that the method is effective and superior at generating photorealistic, editable 3D avatars. Conclusion: The proposed TetGS framework addresses the challenges of 3D avatar editing and gives users high-quality editing results. Abstract: Personalized 3D avatar editing holds significant promise due to its user-friendliness and availability to applications such as AR/VR and virtual try-ons. Previous studies have explored the feasibility of 3D editing, but often struggle to generate visually pleasing results, possibly due to the unstable representation learning under mixed optimization of geometry and texture in complicated reconstructed scenarios. In this paper, we aim to provide an accessible solution for ordinary users to create their editable 3D avatars with precise region localization, geometric adaptability, and photorealistic renderings. To tackle this challenge, we introduce a meticulously designed framework that decouples the editing process into local spatial adaptation and realistic appearance learning, utilizing a hybrid Tetrahedron-constrained Gaussian Splatting (TetGS) as the underlying representation. TetGS combines the controllable explicit structure of tetrahedral grids with the high-precision rendering capabilities of 3D Gaussian Splatting and is optimized in a progressive manner comprising three stages: 3D avatar instantiation from real-world monocular videos to provide accurate priors for TetGS initialization; localized spatial adaptation with explicitly partitioned tetrahedrons to guide the redistribution of Gaussian kernels; and geometry-based appearance generation with a coarse-to-fine activation strategy. Both qualitative and quantitative experiments demonstrate the effectiveness and superiority of our approach in generating photorealistic 3D editable avatars.

[67] Mìmir: A real-time interactive visualization library for CUDA programs

Francisco Carter,Nancy Hitschfeld,Cristóbal A. Navarro

Main category: cs.GR

TL;DR: Mìmir is a CUDA/Vulkan interoperability C++ library that simplifies adding real-time 2D/3D visualization to GPU simulation codes.

Details

Motivation: In modern research, real-time visualization of GPU simulations is valuable for assessing model quality, but traditional approaches cannot stay interactive because of the memory transfers they require. Method: Mìmir abstracts the mapping between CUDA device memory and Vulkan graphics resources, hiding the low-level details so that users can add visualization with only minor code changes. Result: Mìmir delivers efficient real-time visualization with low programming effort and without requiring users to master the underlying technical details. Conclusion: Mìmir provides a convenient real-time visualization solution for GPU simulations. Abstract: Real-time visualization of computational simulations running over graphics processing units (GPU) is a valuable feature in modern science and technological research, as it allows researchers to visually assess the quality and correctness of their computational models during the simulation. Due to the high throughput involved in GPU-based simulations, classical visualization approaches such as ones based on copying to RAM or storage are not feasible anymore, as they imply large memory transfers between GPU and CPU at each moment, reducing both computational performance and interactivity. Implementing real-time visualizers for GPU simulation codes is a challenging task as it involves dealing with i) low-level integration of graphics APIs (e.g., OpenGL and Vulkan) into the general-purpose GPU code, ii) a careful and efficient handling of memory spaces and iii) finding a balance between rendering and computing as both need the GPU resources. In this work we present Mìmir, a CUDA/Vulkan interoperability C++ library that allows users to add real-time 2D/3D visualization to CUDA codes with low programming effort. With Mìmir, researchers can leverage state-of-the-art CUDA/Vulkan interoperability features without needing to invest time in learning the complex low-level technical aspects involved. Internally, Mìmir streamlines the interoperability mapping between CUDA device memory containing simulation data and Vulkan graphics resources, so that changes on the data are instantly reflected in the visualization. This abstraction scheme allows generating visualizations with minimal alteration over the original source code, needing only to replace the GPU memory allocation lines of the data to be visualized by the API calls provided by Mìmir among other optional changes.

cs.CL

[68] It's the same but not the same: Do LLMs distinguish Spanish varieties?

Marina Mayor-Rocher,Cristina Pozo,Nina Melero,Gonzalo Martínez,María Grandury,Pedro Reviriego

Main category: cs.CL

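TL;DR: The study evaluates the ability of nine language models to identify seven varieties of Spanish and finds that GPT-4o is the only model able to recognize the diversity of Spanish.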

Details

Motivation: Spanish has rich regional varieties, and the study aims to assess how well language models recognize them. Method: Evaluates the models' ability to identify seven varieties of Spanish with a multiple-choice test. Result: Among all models, GPT-4o performs best and is able to recognize the diversity of Spanish. Conclusion: GPT-4o stands out at identifying Spanish varieties; the other models need improvement. Abstract: In recent years, large language models (LLMs) have demonstrated a high capacity for understanding and generating text in Spanish. However, with five hundred million native speakers, Spanish is not a homogeneous language but rather one rich in diatopic variations spanning both sides of the Atlantic. For this reason, in this study, we evaluate the ability of nine language models to identify and distinguish the morphosyntactic and lexical peculiarities of seven varieties of Spanish (Andean, Antillean, Continental Caribbean, Chilean, Peninsular, Mexican and Central American and Rioplatense) through a multiple-choice test. The results indicate that the Peninsular Spanish variety is the best identified by all models and that, among them, GPT-4o is the only model capable of recognizing the variability of the Spanish language.

[69] Evaluating Large Language Models on Multiword Expressions in Multilingual and Code-Switched Contexts

Frances Laureano De Leon,Harish Tayyar Madabushi,Mark G. Lee

Main category: cs.CL

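TL;DR: The study evaluates how large language models handle ambiguous multiword expressions and finds that even recent models such as GPT-4 perform poorly on detection and semantic tasks involving them.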

Details

Motivation: Multiword expressions have non-compositional meanings and syntactic irregularities, and it is unclear how well models handle such linguistic subtleties. Method: Evaluates models in English, Portuguese, and Galician, using a novel code-switched dataset and a novel task. Result: Large language models handle multiword expressions poorly, especially on the new task, and GPT-4 does not outperform the baseline models. Conclusion: Multiword expressions, especially ambiguous ones, remain a challenge for models. Abstract: Multiword expressions, characterised by non-compositional meanings and syntactic irregularities, are an example of nuanced language. These expressions can be used literally or idiomatically, leading to significant changes in meaning. While large language models have demonstrated strong performance across many tasks, their ability to handle such linguistic subtleties remains uncertain. Therefore, this study evaluates how state-of-the-art language models process the ambiguity of potentially idiomatic multiword expressions, particularly in contexts that are less frequent, where models are less likely to rely on memorisation. By evaluating models in Portuguese and Galician, in addition to English, and using a novel code-switched dataset and a novel task, we find that large language models, despite their strengths, struggle with nuanced language. In particular, we find that the latest models, including GPT-4, fail to outperform the xlm-roBERTa-base baselines in both detection and semantic tasks, with especially poor performance on the novel tasks we introduce, despite their similarity to existing tasks. Overall, our results demonstrate that multiword expressions, especially those which are ambiguous, continue to be a challenge to models.

[70] Understanding and Mitigating Risks of Generative AI in Financial Services

Sebastian Gehrmann,Claire Huang,Xian Teng,Sergei Yurovski,Iyanuoluwa Shode,Chirag S. Patel,Arjun Bhorkar,Naveen Thomas,John Doucette,David Rosenberg,Mark Dredze,David Rabinowitz

Main category: cs.CL

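TL;DR: The paper examines the scope of safe inputs and outputs for Generative AI (GenAI) products in financial services, proposes an AI content risk taxonomy, and evaluates how well existing open-source technical guardrails cover it.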

Details

Motivation: Current research focuses mostly on the safety of general-purpose AI models (e.g., toxicity, bias) and overlooks the legal and regulatory requirements of specialized domains, financial services in particular. Method: Proposes an AI content risk taxonomy for financial services, collects data through red-teaming activities, and evaluates the coverage of existing open-source technical guardrails. Result: Existing guardrails fail to detect most of the content risks discussed. Conclusion: More comprehensive AI content safety solutions are needed for specialized domains such as finance. Abstract: To responsibly develop Generative AI (GenAI) products, it is critical to define the scope of acceptable inputs and outputs. What constitutes a "safe" response is an actively debated question. Academic work puts an outsized focus on evaluating models by themselves for general purpose aspects such as toxicity, bias, and fairness, especially in conversational applications being used by a broad audience. In contrast, less focus is put on considering sociotechnical systems in specialized domains. Yet, those specialized systems can be subject to extensive and well-understood legal and regulatory scrutiny. These product-specific considerations need to be set in industry-specific laws, regulations, and corporate governance requirements. In this paper, we aim to highlight AI content safety considerations specific to the financial services domain and outline an associated AI content risk taxonomy. We compare this taxonomy to existing work in this space and discuss implications of risk category violations on various stakeholders. We evaluate how existing open-source technical guardrail solutions cover this taxonomy by assessing them on data collected via red-teaming activities. Our results demonstrate that these guardrails fail to detect most of the content risks we discuss.

[71] Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

Zae Myung Kim,Chanwoo Park,Vipul Raheja,Dongyeop Kang

Main category: cs.CL

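TL;DR: Meta Policy Optimization (MPO) is a framework that addresses reward hacking and reduces the need for manual prompt engineering by dynamically adjusting the reward model's prompt.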

Details

Motivation: Existing reward-based alignment methods suffer from reward hacking and depend on manual prompt engineering. Method: MPO introduces a meta-reward model that dynamically refines the reward prompt to provide an adaptive reward signal. Result: MPO performs on par with or better than hand-crafted reward prompts and applies to a variety of tasks. Conclusion: MPO offers a more robust and adaptive solution for reward-based alignment of LLMs. Abstract: Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes a more stable policy optimization, and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.

[72] MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools

Nishant Subramani,Jason Eisner,Justin Svegliato,Benjamin Van Durme,Yu Su,Sam Thomson

Main category: cs.CL

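TL;DR: Proposes a new class of model-internal confidence estimators (MICE) that decode from the language model's intermediate layers and compute similarity scores to assess confidence in tool calls, improving calibration error and tool-calling utility.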

Details

Motivation: Tool-calling agents need to be both useful and safe, but existing models are poorly calibrated, so more accurate confidence estimation is needed. Method: MICE decodes from each intermediate layer of the language model (using logitLens), computes similarity scores between each layer's generation and the final output, and feeds them to a probabilistic classifier to estimate confidence. Result: On a simulated tool-calling dataset, MICE beats or matches the baselines on smoothed expected calibration error, improves tool-calling utility, and generalizes zero-shot to unseen APIs. Conclusion: MICE is an efficient, general confidence estimation method that improves the safety and utility of tool calling. Abstract: Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logitLens and then computes similarity scores between each layer's generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at https://github.com/microsoft/mice_for_cats.
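
The feature pipeline described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the string similarity (`difflib.SequenceMatcher`), the hand-set logistic weights, and the example per-layer decodings are all assumptions introduced here, and the real system would obtain the per-layer generations from the model itself.

```python
import math
from difflib import SequenceMatcher


def layer_similarity_features(layer_generations, final_output):
    """Return one similarity score in [0, 1] per intermediate layer.

    SequenceMatcher is a stand-in similarity measure (an assumption);
    the abstract does not say which measure is used.
    """
    return [
        SequenceMatcher(None, generation, final_output).ratio()
        for generation in layer_generations
    ]


def confidence(features, weights, bias):
    """Logistic model standing in for the learned probabilistic classifier."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical strings decoded from three layers of a model.
layers = ["get(", "get_weather(", "get_weather(city='Oslo')"]
final = "get_weather(city='Oslo')"

features = layer_similarity_features(layers, final)
# Weights and bias are illustrative; in practice they would be learned.
print(confidence(features, weights=[1.0, 1.0, 1.0], bias=-1.5))
```

The intuition captured here is that a call on which the intermediate layers agree early with the final output yields high similarity features, which a trained classifier can map to higher confidence.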

[73] A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports

Henning Schäfer,Cynthia S. Schmidt,Johannes Wutzkowsky,Kamil Lorek,Lea Reinartz,Johannes Rückert,Christian Temme,Britta Böckmann,Peter A. Horn,Christoph M. Friedrich

Main category: cs.CL

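TL;DR: This paper presents an open-source pipeline for extracting and categorizing checkbox data from scanned documents, reducing manual transcription errors and improving efficiency.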

Details

Motivation: Despite the growing adoption of electronic health records, many processes still rely on paper documents, making data transcription time-consuming and error-prone. Method: An open-source pipeline combining checkbox detection, multilingual OCR, and multilingual vision-language models (VLMs). Result: Compared against gold standards from 2017 to 2024, the pipeline achieves high precision and recall, reducing administrative workload and improving reporting accuracy. Conclusion: The open-source pipeline can be adapted to other checkbox-rich document types and encourages self-hosted parsing. Abstract: Despite the growing adoption of electronic health records, many processes still rely on paper documents, reflecting the heterogeneous real-world conditions in which healthcare is delivered. The manual transcription process is time-consuming and prone to errors when transferring paper-based data to digital formats. To streamline this workflow, this study presents an open-source pipeline that extracts and categorizes checkbox data from scanned documents. Demonstrated on transfusion reaction reports, the design supports adaptation to other checkbox-rich document types. The proposed method integrates checkbox detection, multilingual optical character recognition (OCR) and multilingual vision-language models (VLMs). The pipeline achieves high precision and recall compared against annually compiled gold standards from 2017 to 2024. The result is a reduction in administrative workload and accurate regulatory reporting. The open-source availability of this pipeline encourages self-hosted parsing of checkbox forms.

[74] A Platform for Generating Educational Activities to Teach English as a Second Language

Aiala Rosá,Santiago Góngora,Juan Pablo Filevich,Ignacio Sastre,Laura Musto,Brian Carpenter,Luis Chiruzzo

Main category: cs.CL

TL;DR: Introduces an NLP-based platform for teaching English that generates games and exercises, with plans to extend its functionality.

Details

Motivation: Provide diverse educational activities for teaching English as a foreign language, using natural language processing techniques to improve teaching. Method: The platform uses semi-automatically generated, manually curated resources, lets teachers enter texts to generate more complex activities, and plans to integrate image and text generation. Result: The platform has been deployed and supports generating many kinds of activities; it will be migrated to a more powerful server to improve performance. Conclusion: The platform demonstrates the potential of NLP techniques, with further extensions and performance optimizations planned. Abstract: We present a platform for the generation of educational activities oriented to teaching English as a foreign language. The different activities -- games and language practice exercises -- are strongly based on Natural Language Processing techniques. The platform offers the possibility of playing out-of-the-box games, generated from resources created semi-automatically and then manually curated. It can also generate games or exercises of greater complexity from texts entered by teachers, providing a stage of review and editing of the generated content before use. As a way of expanding the variety of activities in the platform, we are currently experimenting with image and text generation. In order to integrate them and improve the performance of other neural tools already integrated, we are working on migrating the platform to a more powerful server. In this paper we describe the development of our platform and its deployment for end users, discussing the challenges faced and how we overcame them, and also detail our future work plans.

[75] Enhancing Systematic Reviews with Large Language Models: Using GPT-4 and Kimi

Dandan Chen Kaptur,Yue Huang,Xuejun Ryan Ji,Yanhui Guo,Bradley Kaptur

Main category: cs.CL

TL;DR: The study compares GPT-4 and Kimi on systematic reviews and finds that their performance depends on data volume and question complexity.

Details

Motivation: Evaluate the performance of large language models (LLMs) in systematic reviews by comparing them against human-generated codes. Method: Compare LLM-generated codes with human-generated codes to analyze performance in systematic reviews. Result: LLM performance fluctuates with data volume and question complexity. Conclusion: Applying LLMs to systematic reviews requires accounting for data volume and question complexity. Abstract: This research delved into GPT-4 and Kimi, two Large Language Models (LLMs), for systematic reviews. We evaluated their performance by comparing LLM-generated codes with human-generated codes from a peer-reviewed systematic review on assessment. Our findings suggested that the performance of LLMs fluctuates by data volume and question complexity for systematic reviews.

[76] UD-English-CHILDES: A Collected Resource of Gold and Silver Universal Dependencies Trees for Child Language Interactions

Xiulin Yang,Zhuoxuan Ju,Lanni Bu,Zoey Liu,Nathan Schneider

Main category: cs.CL

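TL;DR: This paper introduces UD-English-CHILDES, the first officially released Universal Dependencies treebank based on CHILDES data; it harmonizes annotations for over 48k sentences from 11 children and their caregivers and provides an additional 1M silver-standard sentences.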

Details

Motivation: CHILDES is a widely used child language resource but lacks consistent UD annotation standards, so annotations need to be unified and extended to support computational and linguistic research. Method: Harmonize existing dependency-annotated CHILDES data under the UD v2 framework and generate silver-standard sentences. Result: The UD-English-CHILDES treebank contains 48k gold-standard and 1M silver-standard sentences, with consistent, validated annotations. Conclusion: UD-English-CHILDES provides a high-quality annotated resource for child language research, supporting broader computational and linguistic applications. Abstract: CHILDES is a widely used resource of transcribed child and child-directed speech. This paper introduces UD-English-CHILDES, the first officially released Universal Dependencies (UD) treebank derived from previously dependency-annotated CHILDES data with consistent and unified annotation guidelines. Our corpus harmonizes annotations from 11 children and their caregivers, totaling over 48k sentences. We validate existing gold-standard annotations under the UD v2 framework and provide an additional 1M silver-standard sentences, offering a consistent resource for computational and linguistic research.

[77] Labeling Case Similarity based on Co-Citation of Legal Articles in Judgment Documents with Empirical Dispute-Based Evaluation

Chao-Lin Liu,Po-Hsien Wu,Yi-Ting Yu

Main category: cs.CL

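TL;DR: Proposes a method based on the co-citation of legal articles to address the shortage of labeled data for legal recommender systems, validated on labor disputes.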

Details

Motivation: Address the scarcity of labeled data in specialized domains such as labor disputes to improve legal recommender systems. Method: Use the co-citation of legal articles within cases to establish similarity, and combine text embedding models with a BiLSTM module to recommend cases. Result: Experiments show the method can recommend similar labor dispute cases as measured by legal-article co-citation. Conclusion: The method offers a new approach to automated annotation of legal documents, especially in areas lacking comprehensive legal databases. Abstract: This report addresses the challenge of limited labeled datasets for developing legal recommender systems, particularly in specialized domains like labor disputes. We propose a new approach leveraging the co-citation of legal articles within cases to establish similarity and enable algorithmic annotation. This method draws a parallel to the concept of case co-citation, utilizing cited precedents as indicators of shared legal issues. To evaluate the labeled results, we employ a system that recommends similar cases based on plaintiffs' accusations, defendants' rebuttals, and points of dispute. The evaluation demonstrates that the recommender, with finetuned text embedding models and a reasonable BiLSTM module, can recommend labor cases whose similarity was measured by the co-citation of the legal articles. This research contributes to the development of automated annotation techniques for legal documents, particularly in areas with limited access to comprehensive legal databases.
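
One plausible way to turn article co-citation into similarity labels is a set-overlap score between the articles each judgment cites. The sketch below uses Jaccard overlap and a fixed threshold; both are assumptions made for illustration (the abstract does not state the exact measure), and the case identifiers and article labels are hypothetical.

```python
def cocitation_similarity(articles_a, articles_b):
    """Jaccard overlap between the sets of legal articles two cases cite."""
    a, b = set(articles_a), set(articles_b)
    if not a and not b:
        return 0.0  # no citations at all: treat as dissimilar
    return len(a & b) / len(a | b)


def label_similar_pairs(cases, threshold=0.5):
    """Return pairs of case ids whose co-citation similarity meets the threshold."""
    ids = sorted(cases)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if cocitation_similarity(cases[first], cases[second]) >= threshold:
                pairs.append((first, second))
    return pairs


# Hypothetical judgments and the articles they cite.
cases = {
    "case_1": ["A11", "A12", "A17"],
    "case_2": ["A11", "A12"],
    "case_3": ["A59"],
}
print(label_similar_pairs(cases))  # [('case_1', 'case_2')]
```

Labels produced this way need no human annotator, which is the appeal in domains where labeled data is scarce; the threshold controls how strict the notion of "similar" is.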

[78] Local Prompt Optimization

Yash Jain,Vishal Chowdhary

Main category: cs.CL

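TL;DR: The paper proposes Local Prompt Optimization (LPO), which focuses on optimizing the key tokens in a prompt and markedly improves the performance of automatic prompt engineering.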

Details

Motivation: Existing prompt optimization methods optimize all tokens globally, so the optimization space is too large and guidance is insufficient. Method: LPO identifies the optimization tokens in a prompt and steers the LLM to optimize only those tokens. Result: Marked improvements on Math Reasoning and BIG-bench Hard benchmarks, with faster convergence. Conclusion: LPO is an efficient prompt optimization method that integrates with a range of automatic prompt engineering techniques. Abstract: In recent years, the use of prompts to guide the output of Large Language Models has increased dramatically. However, even the best of experts struggle to choose the correct words to stitch up a prompt for the desired task. To solve this, LLM-driven prompt optimization emerged as an important problem. Existing prompt optimization methods optimize a prompt globally, wherein all the prompt tokens have to be optimized over a large vocabulary while solving a complex task. The large optimization space (tokens) leads to insufficient guidance for a better prompt. In this work, we introduce Local Prompt Optimization (LPO) that integrates with any general automatic prompt engineering method. We identify the optimization tokens in a prompt and nudge the LLM to focus only on those tokens in its optimization step. We observe remarkable performance improvements on Math Reasoning (GSM8k and MultiArith) and BIG-bench Hard benchmarks across various automatic prompt engineering methods. Further, we show that LPO converges to the optimal prompt faster than global methods.

[79] What Causes Knowledge Loss in Multilingual Language Models?

Maria Khelli,Samuel Cahyawijaya,Ayu Purwarianti,Genta Indra Winata

Main category: cs.CL

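TL;DR: The study examines catastrophic forgetting in multilingual NLP models, using LoRA adapters to evaluate the effect of parameter sharing on knowledge retention.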

Details

Motivation: Traditional methods struggle to mimic real-world multilingual scenarios, leading to catastrophic forgetting; the study explores whether parameter sharing can mitigate it. Method: Experiment with 52 languages and LoRA adapters of varying ranks to evaluate non-shared, partially shared, and fully shared parameters. Result: Languages using non-Latin scripts are more susceptible to catastrophic forgetting, while languages written in Latin script transfer better across languages. Conclusion: Parameter sharing through adapters such as LoRA may mitigate catastrophic forgetting, but script type strongly affects the outcome. Abstract: Cross-lingual transfer in natural language processing (NLP) models enhances multilingual performance by leveraging shared linguistic knowledge. However, traditional methods that process all data simultaneously often fail to mimic real-world scenarios, leading to challenges like catastrophic forgetting, where fine-tuning on new tasks degrades performance on previously learned ones. Our study explores this issue in multilingual contexts, focusing on linguistic differences affecting representational learning rather than just model parameters. We experiment with 52 languages using LoRA adapters of varying ranks to evaluate non-shared, partially shared, and fully shared parameters. Our aim is to see if parameter sharing through adapters can mitigate forgetting while preserving prior knowledge. We find that languages using non-Latin scripts are more susceptible to catastrophic forgetting, whereas those written in Latin script facilitate more effective cross-lingual transfer.

[80] DMDTEval: An Evaluation and Analysis of LLMs on Disambiguation in Multi-domain Translation

Zhibo Man,Yuanmeng Chen,Yujie Zhang,Yufeng Chen,Jinan Xu

Main category: cs.CL

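TL;DR: The paper presents DMDTEval, a framework for evaluating the disambiguation ability of large language models (LLMs) in multi-domain translation, comprising a test set, prompting templates, and precise metrics.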

Details

Motivation: Although LLMs perform well in machine translation, they do less well in multi-domain translation (MDT) because of lexical ambiguity, so their disambiguation ability needs evaluation. Method: Build a translation test set annotated with multi-domain ambiguous words, design diverse disambiguation prompting templates, and define precise disambiguation metrics. Result: Experiments reveal a number of key findings that lay groundwork for improving LLM disambiguation. Conclusion: DMDTEval provides systematic support for evaluating and studying LLM disambiguation in multi-domain translation. Abstract: Currently, Large Language Models (LLMs) have achieved remarkable results in machine translation. However, their performance in multi-domain translation (MDT) is less satisfactory; the meanings of words can vary across different domains, highlighting the significant ambiguity inherent in MDT. Therefore, evaluating the disambiguation ability of LLMs in MDT remains an open problem. To this end, we present an evaluation and analysis of LLMs on disambiguation in multi-domain translation (DMDTEval), our systematic evaluation framework consisting of three critical aspects: (1) we construct a translation test set with multi-domain ambiguous word annotation, (2) we curate a diverse set of disambiguation prompting templates, and (3) we design precise disambiguation metrics, and study the efficacy of various prompting strategies on multiple state-of-the-art LLMs. Our extensive experiments reveal a number of crucial findings that we believe will pave the way and also facilitate further research in the critical area of improving the disambiguation of LLMs.

[81] On Psychology of AI -- Does Primacy Effect Affect ChatGPT and Other LLMs?

Mika Hämäläinen

Main category: cs.CL

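TL;DR: Studies the primacy effect in three commercial LLMs (ChatGPT, Gemini, Claude) by repurposing Asch's (1946) experiment, finding that candidate preferences differ across models and experimental conditions.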

Details

Motivation: Investigate whether LLMs, like humans, are subject to the primacy effect, i.e., whether they show a preference when the order of a description changes. Method: Repurpose Asch's experiment to test LLM preferences between candidates described in two orders (positive then negative vs. negative then positive), under two conditions: both candidates in one prompt, and each candidate in a separate prompt. Result: With simultaneous prompting, ChatGPT preferred the positive-first candidate, Gemini showed no preference, and Claude refused to choose; with separate prompting, ChatGPT and Claude mostly gave equal ratings and otherwise preferred the negative-first candidate, while Gemini mostly preferred the negative-first candidate. Conclusion: The primacy effect in LLMs varies by model and experimental condition; some models resemble human behavior, while Claude shows a distinctive refusal behavior. Abstract: We study the primacy effect in three commercial LLMs: ChatGPT, Gemini and Claude. We do this by repurposing the famous experiment Asch (1946) conducted using human subjects. The experiment is simple: given two candidates with equal descriptions, which one is preferred if one description has positive adjectives first before negative ones and another description has negative adjectives followed by positive ones? We test this in two experiments. In one experiment, LLMs are given both candidates simultaneously in the same prompt, and in another experiment, LLMs are given both candidates separately. We test all the models with 200 candidate pairs. We found that, in the first experiment, ChatGPT preferred the candidate with positive adjectives listed first, while Gemini preferred both equally often. Claude refused to make a choice. In the second experiment, ChatGPT and Claude were most likely to rank both candidates equally. In the case where they did not give an equal rating, both showed a clear preference for a candidate that had negative adjectives listed first. Gemini was most likely to prefer a candidate with negative adjectives listed first.
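
The two experimental conditions can be made concrete with a small prompt-construction sketch. Everything below is an assumption for illustration: the adjective lists, the prompt wording, and the rating scale are hypothetical and are not taken from the paper.

```python
def make_candidate_pair(positive, negative):
    """Build two descriptions with identical adjectives in opposite order."""
    positive_first = ", ".join(positive + negative)
    negative_first = ", ".join(negative + positive)
    return positive_first, negative_first


def simultaneous_prompt(desc_a, desc_b):
    """Condition 1: both candidates appear in the same prompt."""
    return (
        f"Candidate A is {desc_a}.\n"
        f"Candidate B is {desc_b}.\n"
        "Which candidate do you prefer?"
    )


def separate_prompts(desc_a, desc_b):
    """Condition 2: each candidate is rated in its own prompt."""
    template = "A candidate is {description}. Rate this candidate from 1 to 10."
    return [template.format(description=d) for d in (desc_a, desc_b)]


# Illustrative adjective lists (hypothetical, not the paper's materials).
positive = ["intelligent", "industrious"]
negative = ["stubborn", "envious"]

pos_first, neg_first = make_candidate_pair(positive, negative)
print(simultaneous_prompt(pos_first, neg_first))
for prompt in separate_prompts(pos_first, neg_first):
    print(prompt)
```

Because both descriptions contain exactly the same adjectives, any systematic preference between them can only come from the ordering, which is what the primacy effect is about.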

[82] Team ACK at SemEval-2025 Task 2: Beyond Word-for-Word Machine Translation for English-Korean Pairs

Daniel Lee,Harsh Sharma,Jieun Han,Sunny Jeong,Alice Oh,Vered Shwartz

Main category: cs.CL

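TL;DR: LLMs outperform traditional machine translation systems on English-Korean translation but still struggle with entity translation that requires cultural adaptation.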

Details

Motivation: Study the challenges of translating knowledge-intensive, entity-rich text between English and Korean, especially handling linguistic and cultural differences. Method: Evaluate 13 models (LLMs and MT models) with automatic metrics and human assessment by bilingual annotators, and construct an error taxonomy. Result: LLMs perform better, but entity translation and cultural adaptation remain problematic, with performance varying by entity type and popularity. Conclusion: The work exposes gaps in automatic evaluation metrics and points to directions for culturally sensitive machine translation research. Abstract: Translating knowledge-intensive and entity-rich text between English and Korean requires transcreation to preserve language-specific and cultural nuances beyond literal, phonetic or word-for-word conversion. We evaluate 13 models (LLMs and MT models) using automatic metrics and human assessment by bilingual annotators. Our findings show LLMs outperform traditional MT systems but struggle with entity translation requiring cultural adaptation. By constructing an error taxonomy, we identify incorrect responses and entity name errors as key issues, with performance varying by entity type and popularity level. This work exposes gaps in automatic evaluation metrics, and we hope it enables future work on culturally-nuanced machine translation.

[83] Fane at SemEval-2025 Task 10: Zero-Shot Entity Framing with Large Language Models

Enfa Fane,Mihai Surdeanu,Eduardo Blanco,Steven R. Corman

Main category: cs.CL

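TL;DR: The paper evaluates the zero-shot ability of large language models (LLMs) to classify the framing roles of entities in news narratives, achieving notable results with a hierarchical approach and tuned prompt design.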

Details

Motivation: How news narratives frame entities matters for understanding media's influence on societal perceptions; the paper explores the potential of LLMs for this task. Method: Use hierarchical classification, first identifying broad roles and then refining them, and systematically assess the effects of input context, prompting strategies, and task decomposition. Result: Main Role Accuracy of 89.4% and Exact Match Ratio of 34.5%, indicating the effectiveness of the hierarchical approach and tuned prompt design. Conclusion: Tailoring prompt design and input context to each subtask is important for improving LLM performance on entity framing. Abstract: Understanding how news narratives frame entities is crucial for studying media's impact on societal perceptions of events. In this paper, we evaluate the zero-shot capabilities of large language models (LLMs) in classifying framing roles. Through systematic experimentation, we assess the effects of input context, prompting strategies, and task decomposition. Our findings show that a hierarchical approach of first identifying broad roles and then fine-grained roles outperforms single-step classification. We also demonstrate that optimal input contexts and prompts vary across task levels, highlighting the need for subtask-specific strategies. We achieve a Main Role Accuracy of 89.4% and an Exact Match Ratio of 34.5%, demonstrating the effectiveness of our approach. Our findings emphasize the importance of tailored prompt design and input context optimization for improving LLM performance in entity framing.

[84] Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training

Linjuan Wu,Haoran Wei,Huan Lin,Tianhao Li,Baosong Yang,Weiming Lu

Main category: cs.CL

TL;DR: CrossIC-PT enhances cross-lingual transfer using semantically related bilingual texts, improving multilingual performance.

Details

Motivation: Existing cross-lingual transfer methods are constrained by parallel resources, with limited linguistic and domain coverage. Method: Propose CrossIC-PT, which enhances transfer by applying simple next-word prediction to semantically related bilingual texts, with a segmentation policy for fitting documents into the context window. Result: Performance improves on three models across six target languages, by up to 3.99%. Conclusion: CrossIC-PT is a simple, scalable method that improves cross-lingual transfer. Abstract: Large language models (LLMs) exhibit remarkable multilingual capabilities despite English-dominated pre-training, attributed to cross-lingual mechanisms during pre-training. Existing methods for enhancing cross-lingual transfer remain constrained by parallel resources, suffering from limited linguistic and domain coverage. We propose Cross-lingual In-context Pre-training (CrossIC-PT), a simple and scalable approach that enhances cross-lingual transfer by leveraging semantically related bilingual texts via simple next-word prediction. We construct CrossIC-PT samples by interleaving semantically related bilingual Wikipedia documents into a single context window. To address window size constraints, we implement a systematic segmentation policy to split long bilingual document pairs into chunks while adjusting the sliding window mechanism to preserve contextual coherence. We further extend data availability through a semantic retrieval framework to construct CrossIC-PT samples from a web-crawled corpus. Experimental results demonstrate that CrossIC-PT improves multilingual performance on three models (Llama-3.1-8B, Qwen2.5-7B, and Qwen2.5-1.5B) across six target languages, yielding performance gains of 3.79%, 3.99%, and 1.95%, respectively, with additional improvements after data augmentation.

[85] UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation

Huimin Lu,Masaru Isonuma,Junichiro Mori,Ichiro Sakata

Main category: cs.CL

TL;DR: UniDetox is a universal detoxification method that applies to a wide range of large language models (LLMs) without separate per-model tuning.

Details

Motivation: Existing detoxification methods typically target specific models or model families and require hyperparameter tuning to balance detoxification efficacy against language modeling performance; UniDetox aims to provide a universal solution. Method: A dataset distillation technique based on contrastive decoding produces synthetic text as detoxifying representations, and fine-tuning on this text detoxifies the model. Result: Experiments show that detoxifying text distilled from GPT-2 can effectively detoxify larger models such as OPT, Falcon, and LLaMA-2 without per-model tuning. Conclusion: UniDetox detoxifies efficiently, and its distilled text shows reduced politically biased content, offering a universal solution for LLM detoxification. Abstract: We present UniDetox, a universally applicable method designed to mitigate toxicity across various large language models (LLMs). Previous detoxification methods are typically model-specific, addressing only individual models or model families, and require careful hyperparameter tuning due to the trade-off between detoxification efficacy and language modeling performance. In contrast, UniDetox provides a detoxification technique that can be universally applied to a wide range of LLMs without the need for separate model-specific tuning. Specifically, we propose a novel and efficient dataset distillation technique for detoxification using contrastive decoding. This approach distills detoxifying representations in the form of synthetic text data, enabling universal detoxification of any LLM through fine-tuning with the distilled text. Our experiments demonstrate that the detoxifying text distilled from GPT-2 can effectively detoxify larger models, including OPT, Falcon, and LLaMA-2. Furthermore, UniDetox eliminates the need for separate hyperparameter tuning for each model, as a single hyperparameter configuration can be seamlessly applied across different models. Additionally, analysis of the detoxifying text reveals a reduction in politically biased content, providing insights into the attributes necessary for effective detoxification of LLMs.

[86] Revisiting the MIMIC-IV Benchmark: Experiments Using Language Models for Electronic Health Records

Jesus Lovon,Thouria Ben-Haddi,Jules Di Scala,Jose G. Moreno,Lynda Tamine

Main category: cs.CL

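TL;DR: The paper integrates MIMIC-IV EHR data into the Hugging Face datasets library and explores converting tabular data to text. Experiments show that fine-tuned text-based models outperform zero-shot LLMs on the patient mortality task.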

Details

Motivation: Address the lack of standardized text-based evaluation benchmarks in the medical domain and promote the use of natural language models for health-related tasks. Method: Integrate MIMIC-IV data into the Hugging Face datasets library, study templates that convert tabular EHR data to text, and run experiments with fine-tuned and zero-shot LLMs. Result: Fine-tuned text-based models are competitive with robust tabular classifiers on the patient mortality task, whereas zero-shot LLMs struggle to leverage EHR representations. Conclusion: Text-based approaches show potential in the medical field, but zero-shot LLM performance needs further improvement. Abstract: The lack of standardized evaluation benchmarks in the medical domain for text inputs can be a barrier to widely adopting and leveraging the potential of natural language models for health-related downstream tasks. This paper revisits an openly available MIMIC-IV benchmark for electronic health records (EHRs) to address this issue. First, we integrate the MIMIC-IV data within the Hugging Face datasets library to allow easy sharing and use of this collection. Second, we investigate the application of templates to convert EHR tabular data to text. Experiments using fine-tuned and zero-shot LLMs on the patient mortality task show that fine-tuned text-based models are competitive against robust tabular classifiers. In contrast, zero-shot LLMs struggle to leverage EHR representations. This study underlines the potential of text-based approaches in the medical field and highlights areas for further improvement.
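
A template that converts a tabular record to text might look like the sketch below. The column names and the sentence template are hypothetical (they are not the MIMIC-IV schema or the paper's templates); the point is only to show the kind of row-to-sentence conversion the abstract refers to.

```python
def ehr_row_to_text(row):
    """Render one tabular record as a sentence, skipping missing values."""
    parts = [
        f"{column.replace('_', ' ')} is {value}"
        for column, value in row.items()
        if value is not None
    ]
    if not parts:
        return "No measurements recorded."
    return "The patient's " + "; ".join(parts) + "."


# Hypothetical record; column names are illustrative only.
row = {"age": 67, "heart_rate": 88, "lactate": None, "admission_type": "emergency"}
print(ehr_row_to_text(row))
# The patient's age is 67; heart rate is 88; admission type is emergency.
```

Text produced this way can be fed to a language model for fine-tuning or zero-shot prompting; how missing values and units are verbalized is a design choice that a real template would need to settle.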

[87] BrAIcht, a theatrical agent that speaks like Bertolt Brecht's characters

Baz Roland,Kristina Malyseva,Anna Pappa,Tristan Cazenave

Main category: cs.CL

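TL;DR: BrAIcht is an AI conversational agent based on the German language model LeoLM that generates dialogue in the style of the German playwright Bertolt Brecht.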

Details

Motivation: Create a conversational agent that imitates Brecht's distinctive style, filling a gap in AI generation of specific literary styles. Method: Fine-tune the 7B-parameter LeoLM model with the parameter-efficient QLoRA technique on 29 plays by Brecht and 907 stylistically similar German plays. Result: Based on BLEU score and perplexity, BrAIcht shows very promising performance in generating Brecht-style dialogue. Conclusion: BrAIcht demonstrates potential for generating text in a specific literary style and suggests new directions for AI in creative writing. Abstract: This project introduces BrAIcht, an AI conversational agent that creates dialogues in the distinctive style of the famous German playwright Bertolt Brecht. BrAIcht is fine-tuned using German LeoLM, a large language model with 7 billion parameters and a modified version of the base Llama2 suitable for German language tasks. For fine-tuning, 29 plays of Bertolt Brecht and 907 other German plays that are stylistically similar to Bertolt Brecht are used to form a more diverse dataset. Due to the limited memory capacity, a parameter-efficient fine-tuning technique called QLoRA is implemented to train the large language model. The results, based on BLEU score and perplexity, show very promising performance of BrAIcht in generating dialogues in the style of Bertolt Brecht.

[88] ClonEval: An Open Voice Cloning Benchmark

Iwona Christop,Tomasz Kuczyński,Marek Kubis

Main category: cs.CL

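TL;DR: Presents a new benchmark for voice cloning text-to-speech models, comprising an evaluation protocol, an open-source library, and a leaderboard.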

Details

Motivation: Provide a standardized evaluation tool and platform for voice cloning models. Method: Design an evaluation protocol and an open-source library, and describe the evaluation procedure and leaderboard organization in detail. Result: A complete benchmark that can be used to evaluate and compare voice cloning models. Conclusion: The benchmark provides a practical evaluation framework and tooling for voice cloning research. Abstract: We present a novel benchmark for voice cloning text-to-speech models. The benchmark consists of an evaluation protocol, an open-source library for assessing the performance of voice cloning models, and an accompanying leaderboard. The paper discusses design considerations and presents a detailed description of the evaluation procedure. The usage of the software library is explained, along with the organization of results on the leaderboard.

[89] TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

Mihai Nadas,Laura Diosan,Andrei Piscoran,Andreea Tomescu

Main category: cs.CL

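TL;DR: TF1-EN-3M is a dataset of three million English fables generated by models no larger than 8B parameters, following a six-slot structure, with a hybrid evaluation to ensure quality and diversity.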

Details

Motivation: Fill the gap left by the absence of a large, structured corpus of moral stories in modern NLP. Method: Generate fables following a six-slot scaffold with a combinatorial prompt engine, and evaluate quality with a GPT-based critic plus diversity and readability metrics. Result: An 8B-parameter Llama-3 variant offers the best quality-speed trade-off, at roughly 13.5 cents per 1,000 fables. Conclusion: TF1-EN-3M provides an open resource for research on instruction following, narrative intelligence, and related areas, showing that large-scale moral story generation does not require proprietary giant models. Abstract: Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (<24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.
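
A combinatorial prompt engine over the six-slot scaffold could be sketched as a Cartesian product of slot values. The slot names follow the abstract; the slot values and the prompt wording below are hypothetical, and the released generation code may work differently.

```python
from itertools import product

# Slot order follows the scaffold named in the abstract.
SLOTS = ("character", "trait", "setting", "conflict", "resolution", "moral")

PROMPT = (
    "Write a short fable about a {trait} {character} in {setting}. "
    "The conflict: {conflict}. The resolution: {resolution}. "
    "The moral: {moral}."
)


def build_prompts(options):
    """Yield one prompt per combination of slot values."""
    for combo in product(*(options[slot] for slot in SLOTS)):
        yield PROMPT.format(**dict(zip(SLOTS, combo)))


# Hypothetical slot values; 2 * 2 * 1 * 1 * 1 * 2 = 8 combinations.
options = {
    "character": ["fox", "tortoise"],
    "trait": ["proud", "patient"],
    "setting": ["a mountain village"],
    "conflict": ["a race nobody expects them to win"],
    "resolution": ["they ask a rival for help"],
    "moral": ["pride invites a fall", "steady effort pays off"],
}

prompts = list(build_prompts(options))
print(len(prompts))  # 8
print(prompts[0])
```

The number of prompts grows multiplicatively with the number of values per slot, which is how a modest vocabulary of slot values can cover a broad thematic space.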

[90] WenyanGPT: A Large Language Model for Classical Chinese Tasks

Xinyu Yao,Mengdi Wang,Bo Chen,Xiaobing Zhao

Main category: cs.CL

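TL;DR: This paper presents a language processing solution for Classical Chinese: the WenyanGPT model, built through continued pre-training and instruction fine-tuning, and the WenyanBENCH evaluation dataset. Experiments show the model performs strongly on Classical Chinese tasks.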

Details

Motivation: Existing NLP models are optimized mainly for Modern Chinese and perform inadequately on Classical Chinese, so a dedicated solution is needed. Method: Continue pre-training and instruction fine-tune the LLaMA3-8B-Chinese model to build WenyanGPT, and develop the WenyanBENCH evaluation dataset. Result: WenyanGPT significantly outperforms current advanced LLMs on WenyanBENCH. Conclusion: WenyanGPT provides an effective tool for Classical Chinese processing; the training data, instruction fine-tuning data, and benchmark are released publicly to promote further research. Abstract: Classical Chinese, as the core carrier of Chinese culture, plays a crucial role in the inheritance and study of ancient literature. However, existing natural language processing models primarily optimize for Modern Chinese, resulting in inadequate performance on Classical Chinese. This paper presents a comprehensive solution for Classical Chinese language processing. By continuing pre-training and instruction fine-tuning on the LLaMA3-8B-Chinese model, we construct a large language model, WenyanGPT, which is specifically designed for Classical Chinese tasks. Additionally, we develop an evaluation benchmark dataset, WenyanBENCH. Experimental results on WenyanBENCH demonstrate that WenyanGPT significantly outperforms current advanced LLMs in various Classical Chinese tasks. We make the model's training data, instruction fine-tuning data, and evaluation benchmark dataset publicly available to promote further research and development in the field of Classical Chinese processing.

[91] Cooking Up Creativity: A Cognitively-Inspired Approach for Enhancing LLM Creativity through Structured Representations

Moran Mizrahi,Chen Shani,Gabriel Stanovsky,Dan Jurafsky,Dafna Shahaf

Main category: cs.CL

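TL;DR: The paper proposes combining LLMs with structured representations and cognitively inspired manipulations to generate more creative and diverse ideas, outperforming GPT-4o in the culinary domain.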

Details

Motivation: Although LLMs excel at many tasks, they fall short on creativity; the paper aims to improve LLM creativity through structured representations and cognitively inspired manipulations. Method: Couple LLMs with structured representations and recombine the structured representations of existing ideas to explore a more abstract space of ideas. Result: In the culinary domain, the DishCOVER model generates more diverse recipes than GPT-4o, and expert evaluations find its novelty significantly higher. Conclusion: The approach performs well at creative generation and suggests a direction for research on structured creativity in AI. Abstract: Large Language Models (LLMs) excel at countless tasks, yet struggle with creativity. In this paper, we introduce a novel approach that couples LLMs with structured representations and cognitively inspired manipulations to generate more creative and diverse ideas. Our notion of creativity goes beyond superficial token-level variations; rather, we explicitly recombine structured representations of existing ideas, allowing our algorithm to effectively explore the more abstract landscape of ideas. We demonstrate our approach in the culinary domain with DishCOVER, a model that generates creative recipes. Experiments comparing our model's results to those of GPT-4o show greater diversity. Domain expert evaluations reveal that our outputs, which are mostly coherent and feasible culinary creations, significantly surpass GPT-4o in terms of novelty, thus outperforming it in creative generation. We hope our work inspires further research into structured creativity in AI.

Ivan Vykopal,Martin Hyben,Robert Moro,Michal Gregor,Jakub Simko

Main category: cs.CL

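TL;DR: This paper proposes using large language models (LLMs) to retrieve and assess previously fact-checked claims, reducing redundant verification and improving fact-checking efficiency.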

Details

Motivation: Online disinformation is widespread, and fact-checkers bear the burden of re-verifying claims that have already been checked, which lowers efficiency. Method: Use LLMs to filter irrelevant fact-checks and generate concise summaries and explanations that help fact-checkers quickly judge whether a claim has already been verified. Result: Experiments show that LLMs can filter out many irrelevant fact-checks, reducing effort and streamlining the process. Conclusion: LLMs show potential to improve efficiency and reduce the burden on fact-checkers. Abstract: Online disinformation poses a global challenge, placing significant demands on fact-checkers who must verify claims efficiently to prevent the spread of false information. A major issue in this process is the redundant verification of already fact-checked claims, which increases workload and delays responses to newly emerging claims. This research introduces an approach that retrieves previously fact-checked claims, evaluates their relevance to a given input, and provides supplementary information to support fact-checkers. Our method employs large language models (LLMs) to filter irrelevant fact-checks and generate concise summaries and explanations, enabling fact-checkers to assess more quickly whether a claim has been verified before. In addition, we evaluate our approach through both automatic and human assessments, where humans interact with the developed tool to review its effectiveness. Our results demonstrate that LLMs are able to filter out many irrelevant fact-checks and, therefore, reduce effort and streamline the fact-checking process.

[93] Non-native Children's Automatic Speech Assessment Challenge (NOCASA)

Yaroslav Getman,Tamás Grósz,Mikko Kurimo,Giampiero Salvi

Main category: cs.CL

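TL;DR: The NOCASA challenge aims to develop systems that assess the pronunciation of non-native children. It provides pseudo-anonymized data and baseline models, the best of which reaches a UAR of 36.37%.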

Details

Motivation: Address the limited training data and class imbalance in pronunciation assessment for non-native children. Method: Provides the TeflonNorL2 dataset and two baseline models: an SVM classifier and a wav2vec 2.0 model. Result: The wav2vec 2.0 model performs best on the test set, with a UAR of 36.37%. Conclusion: The NOCASA challenge supplies data and baselines to support the development of pronunciation assessment systems. Abstract: This paper presents the "Non-native Children's Automatic Speech Assessment" (NOCASA) challenge, a data competition that is part of the IEEE MLSP 2025 conference. NOCASA challenges participants to develop new systems that can assess single-word pronunciations of young second language (L2) learners as part of a gamified pronunciation training app. To achieve this, several issues must be addressed, most notably the limited nature of available training data and the highly unbalanced distribution among the pronunciation level categories. To expedite the development, we provide pseudo-anonymized training data (TeflonNorL2), containing 10,334 recordings from 44 speakers attempting to pronounce 205 distinct Norwegian words, human-rated on a 1 to 5 scale (number of stars that should be given in the game). In addition to the data, two already trained systems are released as official baselines: an SVM classifier trained on the ComParE_16 acoustic feature set and a multi-task wav2vec 2.0 model. The latter achieves the best performance on the challenge test set, with an unweighted average recall (UAR) of 36.37%.
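Since the challenge reports unweighted average recall (UAR) on highly unbalanced classes, a short illustration of the metric may help. The sketch below computes UAR as the plain mean of per-class recalls; the function name and the toy ratings are made up for illustration and are not challenge data.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical ratings on the 1-5 star scale (illustrative only).
y_true = [5, 5, 5, 5, 1, 3]
y_pred = [5, 5, 5, 5, 2, 3]
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR={uar:.3f}, accuracy={accuracy:.3f}")
```

In this toy case the majority class inflates accuracy, while UAR is pulled down by the rare class that is never recognised, which is why UAR suits unbalanced rating distributions.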

Wing Yan Li,Zeqiang Wang,Jon Johnson,Suparna De

Main category: cs.CL

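TL;DR: The paper proposes a new information retrieval task for identifying semantically equivalent questions in longitudinal social science surveys, addressing inconsistencies in concepts and vocabulary.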

Details

Motivation: Automated detection of semantically equivalent questions is crucial for long-term social science research, but it faces two challenges: inconsistent representation of concepts and the evolution of vocabulary. Method: Investigates multiple unsupervised approaches, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for information retrieval. Result: IR-specialised neural models achieve the highest overall performance, with other approaches performing comparably; re-ranking brings only modest improvements. Conclusion: The analysis supports further research on harmonising longitudinal studies in social science. Abstract: Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces dual challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies as well as between question and response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multi-disciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task of identifying concept (e.g. Housing, Job, etc.) equivalence across question and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance with other approaches performing comparably. Additionally, the re-ranking of the probabilistic model's results with neural models only introduces modest improvements of 0.07 at most in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have a low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mismatched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.

[95] Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?

Evangelia Gogoulou,Shorouq Zahra,Liane Guillou,Luise Dürlich,Joakim Nivre

Main category: cs.CL

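TL;DR: The paper studies how well LLMs detect hallucinations in translation and paraphrasing, finding that performance varies across models but is largely unaffected by prompt choice, and that NLI models perform comparably well.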

Details

Motivation: Address the hallucinations common in LLM-generated content and assess how well LLMs can detect them in specific tasks. Method: Uses the HalluciGen task to evaluate open-access LLMs on hallucination detection in translation and paraphrasing, analysing the impact of model size, instruction tuning, and prompt choice. Result: Performance varies across models but is consistent across prompts; NLI models perform comparably to LLMs. Conclusion: LLMs are not the only viable option for hallucination detection; NLI models are also effective. Abstract: A frequently observed problem with LLMs is their tendency to generate output that is nonsensical, illogical, or factually incorrect, often referred to broadly as hallucination. Building on the recently proposed HalluciGen task for hallucination detection and generation, we evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks: translation and paraphrasing. We study how model performance varies across tasks and language and we investigate the impact of model size, instruction tuning, and prompt choice. We find that performance varies across models but is consistent across prompts. Finally, we find that NLI models perform comparably well, suggesting that LLM-based detectors are not the only viable option for this specific task.

[96] BrightCookies at SemEval-2025 Task 9: Exploring Data Augmentation for Food Hazard Classification

Foteini Papadopoulou,Osman Mutlu,Neris Özen,Bas H. M. van der Velden,Iris Hendrickx,Ali Hürriyetoğlu

Main category: cs.CL

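TL;DR: This paper presents a system developed for SemEval-2025 Task 9 that uses text augmentation to improve classification of minority classes, and reports statistically significant gains for BERT on the fine-grained categories.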

Details

Motivation: Address poor classification performance on minority classes in food recall incident reports. Method: Applies three word-level data augmentation techniques (synonym replacement, random word swapping, and contextual word insertion) and compares their effect across several models. Result: With BERT, augmentation yields a statistically significant improvement (P < 0.05) on the fine-grained categories, and contextual word insertion raises prediction accuracy on the minority hazard classes by 6%. Conclusion: Targeted text augmentation of minority classes can improve the performance of transformer models. Abstract: This paper presents our system developed for the SemEval-2025 Task 9: The Food Hazard Detection Challenge. The shared task's objective is to evaluate explainable classification systems for classifying hazards and products in two levels of granularity from food recall incident reports. In this work, we propose text augmentation techniques as a way to improve poor performance on minority classes and compare their effect for each category on various transformer and machine learning models. We explore three word-level data augmentation techniques, namely synonym replacement, random word swapping, and contextual word insertion. The results show that transformer models tend to have a better overall performance. None of the three augmentation techniques consistently improved overall performance for classifying hazards and products. We observed a statistically significant improvement (P < 0.05) in the fine-grained categories when using the BERT model to compare the baseline with each augmented model. Compared to the baseline, the contextual word insertion augmentation improved the accuracy of predictions for the minority hazard classes by 6%. This suggests that targeted augmentation of minority classes can improve the performance of transformer models.
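Of the three word-level techniques named above, random word swapping is the simplest to illustrate. The following is a minimal sketch under stated assumptions: whitespace tokenisation, a fixed number of swaps, and a made-up example sentence. It is not the authors' implementation, and the function name is hypothetical.

```python
import random


def random_word_swap(text, n_swaps=1, seed=None):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    rng = random.Random(seed)  # local generator keeps results reproducible
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


# Made-up recall report text (illustrative only).
original = "product recalled due to undeclared peanut allergen"
augmented = random_word_swap(original, n_swaps=2, seed=0)
print(augmented)
```

The augmented sentence keeps the same bag of words, so the label is preserved while the surface order changes; applying it only to minority-class examples is one way to target the augmentation.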

[97] Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think

Hasan Abed Al Kader Hammoud,Hani Itani,Bernard Ghanem

Main category: cs.CL

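TL;DR: The paper proposes evaluating large language models (LLMs) by analysing their intermediate reasoning steps (subthoughts), and finds that aggregating the answers obtained from different subthoughts significantly improves accuracy.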

Details

Motivation: Challenge the standard practice of evaluating only the final answer, asking whether the final answer represents the model's optimal conclusion and whether alternative reasoning paths yield different results. Method: Segments a reasoning trace into subthoughts, generates continuations from the end-point of each subthought, extracts a candidate answer from each, and selects the most frequent answer (the mode) as the final result. Result: Across multiple LLMs and the AIME2024 and AIME2025 mathematical reasoning datasets, the method improves accuracy by up to 13% and 10%, respectively. Conclusion: The consistency of answers across subthoughts correlates with the model's confidence and correctness, which can help identify less reliable answers. Abstract: Large Language Models (LLMs) leverage step-by-step reasoning to solve complex problems. Standard evaluation practice involves generating a complete reasoning trace and assessing the correctness of the final answer presented at its conclusion. In this paper, we challenge the reliance on the final answer by posing the following two questions: Does the final answer reliably represent the model's optimal conclusion? Can alternative reasoning paths yield different results? To answer these questions, we analyze intermediate reasoning steps, termed subthoughts, and propose a method based on our findings. Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We start by prompting the model to generate continuations from the end-point of each intermediate subthought. We extract a potential answer from every completed continuation originating from different subthoughts. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace. Analyzing the consistency among the answers derived from different subthoughts reveals characteristics that correlate with the model's confidence and correctness, suggesting potential for identifying less reliable answers. Our experiments across various LLMs and challenging mathematical reasoning datasets (AIME2024 and AIME2025) show consistent accuracy improvements, with gains reaching up to 13% and 10% respectively. Implementation is available at: https://github.com/hammoudhasan/SubthoughtReasoner.
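The aggregation step described above (take the most frequent answer across continuations) can be sketched in a few lines. The segmentation and continuation stages require a model and are not shown; the function names and the answer list below are hypothetical, and the agreement score is only one simple way to quantify the consistency the abstract mentions.

```python
from collections import Counter


def aggregate_by_mode(answers):
    """Return the most frequent non-missing answer, or None if there is none."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def agreement(answers):
    """Fraction of non-missing answers that match the mode (a rough consistency score)."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return 0.0
    mode = aggregate_by_mode(valid)
    return sum(a == mode for a in valid) / len(valid)


# Hypothetical answers extracted from continuations started at six subthoughts;
# None marks a continuation from which no answer could be extracted.
subthought_answers = ["42", "17", "42", None, "42", "17"]
print(aggregate_by_mode(subthought_answers), agreement(subthought_answers))
```

A low agreement score would flag a question where the continuations disagree, which is the kind of signal the abstract suggests could identify less reliable answers.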

[98] UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

Woongyeong Yeo,Kangsan Kim,Soyeong Jeong,Jinheon Baek,Sung Ju Hwang

Main category: cs.CL

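TL;DR: UniversalRAG is a novel retrieval-augmented generation framework that overcomes the limitations of existing RAG methods by retrieving across multiple modalities and granularities.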

Details

Motivation: Existing RAG methods are usually limited to a single modality-specific corpus and cannot meet the diversity of real-world queries. Method: Proposes a modality-aware routing mechanism that dynamically selects the most appropriate modality-specific corpus, and supports retrieval at multiple granularity levels. Result: Outperforms modality-specific and unified baselines on 8 multimodal benchmarks. Conclusion: Through multimodal, multi-granularity retrieval, UniversalRAG improves the adaptability and accuracy of RAG. Abstract: Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single combined corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. Also, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over modality-specific and unified baselines.

[99] Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers

Roman Abramov,Felix Steinbauer,Gjergji Kasneci

Main category: cs.CL

TL;DR: The paper extends grokking to real-world factual data, augmenting knowledge graphs with synthetic data to substantially improve multi-hop reasoning.

Details

Motivation: Address the weakness of Transformers in multi-step factual reasoning when knowledge is sparse, and explore whether grokking applies to real-world data. Method: Augments knowledge graphs with carefully designed synthetic data to raise the ratio $\phi_r$ of inferred facts to atomic facts, pushing the model to rely on relational structure rather than memorization. Result: Achieves up to 95-100% accuracy on the 2WikiMultiHopQA benchmark, improving over baselines and matching or exceeding current state-of-the-art results. Conclusion: Grokking-based data augmentation can unlock implicit multi-hop reasoning in Transformers, supporting more robust and interpretable factual reasoning in large-scale language models. Abstract: Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio $\phi_r$ of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing $\phi_r$ drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
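To make the ratio $\phi_r$ concrete, the sketch below counts two-hop inferred facts over a toy knowledge graph and divides by the number of atomic facts. This is an illustrative reading of "ratio of inferred facts to atomic facts" only: the paper may define $\phi_r$ per relation or over a different set of inferences, and the entities, relations, and function name here are made up.

```python
def two_hop_facts(atomic):
    """Compose pairs of atomic facts (h, r1, m) and (m, r2, t) into inferred facts."""
    by_head = {}
    for head, relation, tail in atomic:
        by_head.setdefault(head, []).append((relation, tail))
    inferred = set()
    for head, r1, middle in atomic:
        for r2, tail in by_head.get(middle, []):
            inferred.add((head, (r1, r2), tail))
    return inferred


# Toy atomic facts (hypothetical, for illustration).
atomic = {
    ("Alice", "born_in", "Paris"),
    ("Bob", "born_in", "Paris"),
    ("Paris", "capital_of", "France"),
}
inferred = two_hop_facts(atomic)
phi = len(inferred) / len(atomic)
print(f"{len(inferred)} inferred / {len(atomic)} atomic -> phi = {phi:.2f}")
```

Adding more synthetic atomic facts that chain through shared entities increases the number of composable pairs faster than the number of atomic facts, which is how augmentation can raise the ratio.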

[100] Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption

Wenxiao Wang,Parsa Hosseini,Soheil Feizi

Main category: cs.CL

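TL;DR: Chain-of-defensive-thought prompting significantly improves the robustness of large language models on tasks that are not reasoning-focused, especially when some of the provided references are corrupted.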

Details

Motivation: Explore how the reasoning abilities elicited by chain-of-thought prompting can be used to improve the robustness of large language models on tasks that are not reasoning-focused. Method: Proposes chain-of-defensive-thought prompting, which only requires a few exemplars with structured and defensive reasoning as demonstrations. Result: On Natural Questions, when 1 of 10 provided references is corrupted by a prompt injection attack, GPT-4o's accuracy drops from 60% to as low as 3% with standard prompting, whereas chain-of-defensive-thought prompting maintains 50%. Conclusion: Chain-of-defensive-thought prompting is a simple and effective way to improve LLM performance in adversarial settings. Abstract: Chain-of-thought prompting has demonstrated great success in facilitating the reasoning abilities of large language models. In this work, we explore how these enhanced reasoning abilities can be exploited to improve the robustness of large language models in tasks that are not necessarily reasoning-focused. In particular, we show how a wide range of large language models exhibit significantly improved robustness against reference corruption using a simple method called chain-of-defensive-thought, where only a few exemplars with structured and defensive reasoning are provided as demonstrations. Empirically, the improvements can be astounding, especially given the simplicity and applicability of the method. For example, in the Natural Questions task, the accuracy of GPT-4o degrades from 60% to as low as 3% with standard prompting when 1 out of 10 references provided is corrupted with prompt injection attacks. In contrast, GPT-4o using chain-of-defensive-thought prompting maintains an accuracy of 50%.

[101] Turing Machine Evaluation for Large Language Model

Haitao Wu,Zongbo Han,Huaxi Huang,Changqing Zhang

Main category: cs.CL

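TL;DR: The study proposes TMBench, an evaluation framework based on Universal Turing Machine (UTM) simulation, to systematically assess the computational reasoning ability of large language models (LLMs), and finds that performance on it correlates strongly with other reasoning benchmarks.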

Details

Motivation: As large language models are widely deployed, it becomes important to evaluate their core computational reasoning ability, such as accurately understanding rules and executing logical operations. Method: Proposes TMBench, an evaluation framework based on UTM simulation that requires LLMs to strictly follow instructions and track dynamic state during multi-step computations. Result: TMBench is knowledge-agnostic and has adjustable difficulty, and model performance on it correlates strongly with other reasoning benchmarks (Pearson coefficient 0.73). Conclusion: Computational reasoning is an important dimension for measuring the deeper capabilities of LLMs, and TMBench offers an effective tool for standardized evaluation. Abstract: With the rapid development and widespread application of Large Language Models (LLMs), rigorous evaluation has become particularly crucial. This research adopts a novel perspective, focusing on evaluating the core computational reasoning ability of LLMs, defined as the capacity of a model to accurately understand rules and execute logical computing operations. This capability assesses the reliability of LLMs as precise executors, and is critical to advanced tasks such as complex code generation and multi-step problem-solving. We propose an evaluation framework based on Universal Turing Machine (UTM) simulation. This framework requires LLMs to strictly follow instructions and track dynamic states, such as tape content and read/write head position, during multi-step computations. To enable standardized evaluation, we developed TMBench, a benchmark for systematically studying the computational reasoning capabilities of LLMs. TMBench provides several key advantages, including knowledge-agnostic evaluation, adjustable difficulty, foundational coverage through Turing machine encoding, and unlimited capacity for instance generation, ensuring scalability as models continue to evolve. We find that model performance on TMBench correlates strongly with performance on other recognized reasoning benchmarks (Pearson correlation coefficient is 0.73), clearly demonstrating that computational reasoning is a significant dimension for measuring the deep capabilities of LLMs. Code and data are available at https://github.com/HaitaoWuTJU/Turing-Machine-Bench.
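To show what "tracking tape content and read/write head position" involves, here is a minimal single-tape Turing machine simulator with a step-by-step trace. It is a generic textbook-style sketch, not the TMBench encoding or its Universal Turing Machine setup; the transition-table format, state names, and the bit-flipping machine are assumptions made for illustration.

```python
def run_turing_machine(transitions, tape, state="q0", halt_state="halt",
                       blank="_", max_steps=1000):
    """Run a single-tape machine; return the final tape and a (state, head) trace."""
    cells = dict(enumerate(tape))  # sparse tape: position -> symbol
    head = 0
    trace = []
    for _ in range(max_steps):
        if state == halt_state:
            break
        symbol = cells.get(head, blank)
        new_symbol, move, state = transitions[(state, symbol)]
        cells[head] = new_symbol
        head += 1 if move == "R" else -1
        trace.append((state, head))
    if not cells:
        return "", trace
    low, high = min(cells), max(cells)
    final = "".join(cells.get(i, blank) for i in range(low, high + 1))
    return final.strip(blank), trace


# Hypothetical machine: flip every bit, halt at the first blank.
flip_bits = {
    ("q0", "0"): ("1", "R", "q0"),
    ("q0", "1"): ("0", "R", "q0"),
    ("q0", "_"): ("_", "R", "halt"),
}
result, trace = run_turing_machine(flip_bits, "1011")
print(result, trace)
```

An evaluation in this spirit would ask a model to produce the same trace step by step and compare it against the simulator, which needs no world knowledge and can be made harder by lengthening the tape or the rule table.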

[102] Universal language model with the intervention of quantum theory

D. -F. Qin

Main category: cs.CL

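TL;DR: The paper examines language modeling based on quantum mechanics, introducing quantum theory into the symbol-meaning pairs of language to build a representation model of natural language, and attempts to improve word embeddings using quantum statistics.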

Details

Motivation: Apply the mathematical framework of quantum mechanics to natural language processing in order to improve existing language models and study their statistical properties. Method: Introduces quantum mechanics into language modeling, builds experimental code to check feasibility, and discusses the use of quantum statistics in language representation. Result: Suggests that quantum theory can explain and improve word embeddings and may offer new ideas for building generative models. Conclusion: Quantum mechanics shows potential for modeling natural language, and its application on quantum computers could be explored further. Abstract: This paper examines language modeling based on the theory of quantum mechanics. It focuses on the introduction of quantum mechanics into the symbol-meaning pairs of language in order to build a representation model of natural language. At the same time, it is realized that word embedding, which is widely used as a basic technique for statistical language modeling, can be explained and improved by the mathematical framework of quantum mechanics. On this basis, this paper continues to try to use quantum statistics and other related theories to study the mathematical representation, natural evolution and statistical properties of natural language. It is also assumed that the source of such quantum properties is the physicality of information. The feasibility of using quantum theory to model natural language is pointed out through the construction of experimental code. The paper discusses, in terms of applications, the possible help of the theory in constructing generative models that are popular nowadays. A preliminary discussion of future applications of the theory to quantum computers is also presented.

[103] JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry

Anum Afzal,Alexandre Mercier,Florian Matthes

Main category: cs.CL

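TL;DR: The study investigates LLM-based data-to-text approaches for generating high-quality, diverse marketing texts, and proposes JaccDiv, a new metric for evaluating diversity.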

Details

Motivation: Traditional generative methods tend to fall into repetitive patterns, producing monotonous text and limiting content generation on online platforms. Method: Uses language models such as T5, GPT-3.5, GPT-4, and LLaMa2 with fine-tuning, few-shot, and zero-shot approaches to generate diverse marketing texts. Result: Introduces the JaccDiv metric for evaluating the diversity of a set of texts; the approach is relevant to multiple domains. Conclusion: LLM-based approaches can improve the diversity and quality of generated text and have broad application potential. Abstract: Online platforms are increasingly interested in using Data-to-Text technologies to generate content and help their users. Unfortunately, traditional generative methods often fall into repetitive patterns, resulting in monotonous galleries of texts after only a few iterations. In this paper, we investigate LLM-based data-to-text approaches to automatically generate marketing texts that are of sufficient quality and diverse enough for broad adoption. We leverage Language Models such as T5, GPT-3.5, GPT-4, and LLaMa2 in conjunction with fine-tuning, few-shot, and zero-shot approaches to set a baseline for diverse marketing texts. We also introduce a metric JaccDiv to evaluate the diversity of a set of texts. This research extends its relevance beyond the music industry, proving beneficial in various fields where repetitive automated content generation is prevalent.
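The abstract does not give the formula for JaccDiv, so the following should be read as a generic Jaccard-based diversity score rather than the paper's metric. It assumes whitespace tokenisation and word n-gram sets, and reports the mean pairwise Jaccard distance over a set of texts; all function names and example texts are hypothetical.

```python
from itertools import combinations


def ngram_set(text, n=1):
    """Set of word n-grams in a lower-cased, whitespace-tokenised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(a, b):
    """|A & B| / |A | B|, defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard_distance(texts, n=1):
    """Average of (1 - Jaccard similarity) over all pairs; higher means more diverse."""
    sets = [ngram_set(t, n) for t in texts]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(1 - jaccard_similarity(a, b) for a, b in pairs) / len(pairs)


# Made-up marketing blurbs (illustrative only).
blurbs = [
    "a soaring debut album full of energy",
    "a soaring debut album full of heart",
    "quiet piano pieces for late evenings",
]
print(round(mean_pairwise_jaccard_distance(blurbs), 3))
```

A score near 0 indicates near-duplicate texts and a score near 1 indicates almost no shared n-grams, which matches the intuition of detecting monotonous galleries of generated text.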

[104] DYNAMAX: Dynamic computing for Transformers and Mamba based architectures

Miguel Nogales,Matteo Gambella,Manuel Roveri

Main category: cs.CL

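TL;DR: DYNAMAX is the first framework to exploit Mamba architectures for early-exit mechanisms, showing their potential for balancing computational cost against performance.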

Details

Motivation: Explore early-exit mechanisms for Mamba architectures in order to reduce computational cost and latency. Method: Integrates early exits into Mamba and repurposes Mamba as an efficient early-exit classifier for both Mamba-based and transformer-based LLMs. Result: Experiments highlight Mamba's adaptability as an early-exit classifier and its efficiency in balancing computational savings, accuracy, and consistency. Conclusion: Mamba opens new possibilities for dynamic computing and is suited to resource-constrained environments. Abstract: Early exits (EEs) offer a promising approach to reducing computational costs and latency by dynamically terminating inference once a satisfactory prediction confidence on a data sample is achieved. Although many works integrate EEs into encoder-only Transformers, their application to decoder-only architectures and, more importantly, Mamba models, a novel family of state-space architectures in the LLM realm, remains insufficiently explored. This work introduces DYNAMAX, the first framework to exploit the unique properties of Mamba architectures for early exit mechanisms. We not only integrate EEs into Mamba but also repurpose Mamba as an efficient EE classifier for both Mamba-based and transformer-based LLMs, showcasing its versatility. Our experiments employ the Mistral 7B transformer compared to the Codestral 7B Mamba model, using data sets such as TruthfulQA, CoQA, and TriviaQA to evaluate computational savings, accuracy, and consistency. The results highlight the adaptability of Mamba as a powerful EE classifier and its efficiency in balancing computational cost and performance quality across NLP tasks. By leveraging Mamba's inherent design for dynamic processing, we open pathways for scalable and efficient inference in embedded applications and resource-constrained environments. This study underscores the transformative potential of Mamba in redefining dynamic computing paradigms for LLMs.
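The general idea of an early exit (stop once prediction confidence is high enough) can be illustrated with a simple threshold rule. Note that this sketch uses the maximum softmax probability as the confidence signal, a common generic choice, whereas DYNAMAX repurposes Mamba as the exit classifier; the per-layer logits, the threshold values, and the function names are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(per_layer_logits, threshold=0.9):
    """Return (predicted_class, exit_layer) for the first sufficiently confident layer."""
    if not per_layer_logits:
        raise ValueError("need logits from at least one layer")
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            return probs.index(confidence), layer
    # No layer was confident enough: fall back to the final layer's prediction.
    return probs.index(confidence), layer


# Hypothetical logits from exit heads attached to three successive layers.
layer_logits = [
    [0.1, 0.2, 0.0],
    [2.0, 0.5, 0.1],
    [6.0, 0.2, 0.1],
]
print(early_exit_predict(layer_logits, threshold=0.7))
print(early_exit_predict(layer_logits, threshold=0.9))
```

Lowering the threshold exits earlier and saves computation at the risk of less reliable predictions, which is the cost and quality trade-off the abstract evaluates.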

[105] Trace-of-Thought: Enhanced Arithmetic Problem Solving via Reasoning Distillation From Large to Small Language Models

Tyler McDonald,Ali Emami

Main category: cs.CL

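TL;DR: The paper proposes Trace-of-Thought Prompting, a zero-shot prompt engineering method aimed at improving arithmetic reasoning with open-source models, with performance gains as large as 125%.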

Details

Motivation: Using large language models (LLMs) in specialized domains such as arithmetic reasoning can be computationally or financially costly, and closed models limit customization; open-source models can optimize resource usage and improve accessibility. Method: Introduces Trace-of-Thought Prompting, which instructs LLMs to create observable subproblems in order to strengthen arithmetic reasoning. Result: When applied to open-source models in tandem with GPT-4, the method yields performance gains as large as 125% on models at or below 7 billion parameters. Conclusion: Open-source models and Trace-of-Thought Prompting show potential for democratizing AI research and improving access to high-quality computational linguistics applications. Abstract: As Large Language Models (LLMs) continue to be leveraged for daily tasks, prompt engineering remains an active field of contribution within computational linguistics, particularly in domains requiring specialized knowledge such as arithmetic reasoning. While these LLMs are optimized for a variety of tasks, their exhaustive employment may become computationally or financially cumbersome for small teams. Additionally, complete reliance on proprietary, closed-source models often limits customization and adaptability, posing significant challenges in research and application scalability. Instead, by leveraging open-source models at or below 7 billion parameters, we can optimize our resource usage while still observing remarkable gains over standard prompting approaches. To cultivate this notion, we introduce Trace-of-Thought Prompting, a simple, zero-shot prompt engineering method that instructs LLMs to create observable subproblems using critical problem-solving, specifically designed to enhance arithmetic reasoning capabilities. When applied to open-source models in tandem with GPT-4, we observe that Trace-of-Thought not only allows novel insight into the problem-solving process but also introduces performance gains as large as 125% on language models at or below 7 billion parameters. This approach underscores the potential of open-source initiatives in democratizing AI research and improving the accessibility of high-quality computational linguistics applications.

[106] Information Gravity: A Field-Theoretic Model for Token Selection in Large Language Models

Maryna Vyshnyvetska

Main category: cs.CL

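TL;DR: Proposes a theoretical model called "information gravity" that uses physical tools from field theory and spacetime geometry to describe the text generation process of large language models (LLMs).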

Details

Motivation: Explain phenomena observed in LLM behavior, such as hallucinations, sensitivity to query formulation, and the effect of sampling temperature on diversity. Method: Treats a query as an object with "information mass" that curves the model's semantic space, creating gravitational potential wells that attract generated tokens. Result: The model offers a mechanism to explain several observed phenomena in LLM behavior. Conclusion: The "information gravity" model provides a new theoretical framework for understanding how LLMs generate text. Abstract: We propose a theoretical model called "information gravity" to describe the text generation process in large language models (LLMs). The model uses physical apparatus from field theory and spacetime geometry to formalize the interaction between user queries and the probability distribution of generated tokens. A query is viewed as an object with "information mass" that curves the semantic space of the model, creating gravitational potential wells that "attract" tokens during generation. This model offers a mechanism to explain several observed phenomena in LLM behavior, including hallucinations (emerging from low-density semantic voids), sensitivity to query formulation (due to semantic field curvature changes), and the influence of sampling temperature on output diversity.
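One of the phenomena the model sets out to explain, the influence of sampling temperature on output diversity, can be shown with the standard temperature-scaled softmax. The sketch below is the conventional formulation used in LLM sampling, not the field-theoretic model of the paper; the logits are made up, and entropy is used here as a simple proxy for diversity.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher values mean a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits.
logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Higher temperatures flatten the distribution and raise its entropy, so sampled tokens vary more; lower temperatures concentrate probability on the top token.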

[107] OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification

Shangyu Li,Juyong Jiang,Tiancheng Zhao,Jiasi Shen

Main category: cs.CL

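TL;DR: OSVBench is a new benchmark for evaluating how well large language models (LLMs) generate complete specification code for operating system kernel verification tasks. Built on the real-world Hyperkernel operating system kernel, it contains 245 complex tasks, and evaluation shows that current LLMs perform poorly on this kind of long-context code generation.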

Details

Motivation: Operating system kernel verification requires generating complex specification code, and the ability of existing LLMs on this task is unclear, so a benchmark is needed. Method: Casts specification generation as a program synthesis problem, provides the programming model and verification assumptions, and requires LLMs to generate the complete specification within a confined space of syntax and semantics. Result: An evaluation of 12 LLMs shows limited performance on these long-context code generation tasks, with significant disparities between models. Conclusion: Current LLMs fall short on specification generation for operating system verification and need further improvement. Abstract: We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) in generating complete specification code pertaining to operating system kernel verification tasks. The benchmark first casts the specification generation problem as a program synthesis problem within a confined scope of syntax and semantics by providing LLMs with the programming model. The LLMs are required to understand the provided verification assumption and the potential syntax and semantics space to search for, then generate the complete specification for the potentially buggy operating system code implementation under the guidance of the high-level functional description of the operating system. This benchmark is built upon a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks in total, each of which is a long-context task of about 20k-30k tokens. Our comprehensive evaluation of 12 LLMs reveals the limited performance of the current LLMs on the specification generation tasks for operating system verification. Significant disparities in their performance on the benchmark highlight differences in their ability to handle long-context code generation tasks. The evaluation toolkit and benchmark are available at https://github.com/lishangyu-hkust/OSVBench.

[108] SetKE: Knowledge Editing for Knowledge Elements Overlap

Yifan Wei,Xiaoyan Yu,Ran Song,Hao Peng,Angsheng Li

Main category: cs.CL

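TL;DR: The paper proposes SetKE, a new knowledge editing method that addresses Knowledge Element Overlap (KEO) and outperforms existing methods on mainstream large language models.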

Details

Motivation: Traditional methods for updating the knowledge of large language models (LLMs), such as fine-tuning and incremental learning, face significant challenges, while Knowledge Editing (KE) overlooks the Knowledge Element Overlap (KEO) phenomenon, which leads to editing conflicts. Method: Proposes the Knowledge Set Editing (KSE) formulation and the SetKE method, which edits sets of triplets simultaneously, and introduces EditSet, a dataset containing KEO triplets, as a benchmark. Result: Experiments show that SetKE outperforms existing methods in KEO scenarios. Conclusion: SetKE offers a more effective solution for knowledge editing, and the EditSet dataset provides a benchmark for future research. Abstract: Large Language Models (LLMs) excel in tasks such as retrieval and question answering but require updates to incorporate new knowledge and reduce inaccuracies and hallucinations. Traditional updating methods, like fine-tuning and incremental learning, face challenges such as overfitting and high computational costs. Knowledge Editing (KE) provides a promising alternative but often overlooks the Knowledge Element Overlap (KEO) phenomenon, where multiple triplets share common elements, leading to editing conflicts. We identify the prevalence of KEO in existing KE datasets and show its significant impact on current KE methods, causing performance degradation in handling such triplets. To address this, we propose a new formulation, Knowledge Set Editing (KSE), and introduce SetKE, a method that edits sets of triplets simultaneously. Experimental results demonstrate that SetKE outperforms existing methods in KEO scenarios on mainstream LLMs. Additionally, we introduce EditSet, a dataset containing KEO triplets, providing a comprehensive benchmark.
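As a rough illustration of how overlapping triplets might be found in an editing dataset, the sketch below groups (subject, relation, object) triplets by their shared subject and relation. Treating "shared elements" as a shared subject-relation pair is an assumption made here for illustration; the paper may define KEO more broadly. The triplets and function name are made up.

```python
from collections import defaultdict


def find_overlapping_sets(triplets):
    """Group triplets by (subject, relation) and keep groups with several objects."""
    groups = defaultdict(set)
    for subject, relation, obj in triplets:
        groups[(subject, relation)].add(obj)
    return {key: objs for key, objs in groups.items() if len(objs) > 1}


# Hypothetical edit requests (illustrative only).
edits = [
    ("Alice", "speaks", "French"),
    ("Alice", "speaks", "German"),
    ("Bob", "speaks", "Dutch"),
]
overlaps = find_overlapping_sets(edits)
print(overlaps)
```

Editing the two "Alice speaks" triplets one at a time risks the second edit overwriting the first, which is the kind of conflict that motivates editing the whole set together.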

cs.SD

[109] End-to-end Audio Deepfake Detection from RAW Waveforms: a RawNet-Based Approach with Cross-Dataset Evaluation

Andrea Di Pierno,Luca Guarnera,Dario Allegra,Sebastiano Battiato

Main category: cs.SD

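TL;DR: The paper proposes RawNetLite, a lightweight end-to-end deep learning framework for audio deepfake detection that improves robustness by combining multi-domain data with Focal Loss and performs well on several test sets.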

Details

Motivation: Audio deepfakes threaten digital security and trust, and detection is especially difficult under open-world conditions, so more robust solutions are needed. Method: Proposes RawNetLite, a model that operates directly on raw waveforms, uses a convolutional-recurrent architecture to capture spectral and temporal features, and is trained on multi-domain data with Focal Loss. Result: Achieves over 99.7% F1 and 0.25% EER on FakeOrReal, and up to 83.4% F1 with 16.4% EER on AVSpoof2021 + CodecFake. Conclusion: Diverse training data, tailored objective functions, and audio augmentations are important for building resilient and generalizable audio forgery detectors. Abstract: Audio deepfakes represent a growing threat to digital security and trust, leveraging advanced generative models to produce synthetic speech that closely mimics real human voices. Detecting such manipulations is especially challenging under open-world conditions, where spoofing methods encountered during testing may differ from those seen during training. In this work, we propose an end-to-end deep learning framework for audio deepfake detection that operates directly on raw waveforms. Our model, RawNetLite, is a lightweight convolutional-recurrent architecture designed to capture both spectral and temporal features without handcrafted preprocessing. To enhance robustness, we introduce a training strategy that combines data from multiple domains and adopts Focal Loss to emphasize difficult or ambiguous samples. We further demonstrate that incorporating codec-based manipulations and applying waveform-level audio augmentations (e.g., pitch shifting, noise, and time stretching) leads to significant generalization improvements under realistic acoustic conditions. The proposed model achieves over 99.7% F1 and 0.25% EER on in-domain data (FakeOrReal), and up to 83.4% F1 with 16.4% EER on a challenging out-of-distribution test set (AVSpoof2021 + CodecFake). These findings highlight the importance of diverse training data, tailored objective functions and audio augmentations in building resilient and generalizable audio forgery detectors. Code and pretrained models are available at https://iplab.dmi.unict.it/mfs/Deepfakes/PaperRawNet2025/.
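The role of Focal Loss in emphasising difficult samples is easiest to see numerically. The sketch below implements the standard binary focal loss for a single prediction, $-\alpha_t (1 - p_t)^{\gamma} \log p_t$; the values of `gamma` and `alpha` are commonly used defaults and are assumptions here, not hyperparameters reported in the abstract.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one sample: p is the predicted probability of class 1, y is 0 or 1."""
    p = min(max(p, eps), 1 - eps)            # clamp to avoid log(0)
    p_t = p if y == 1 else 1 - p             # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


# Same true label (fake = 1), two hypothetical predictions.
easy = binary_focal_loss(0.9, 1)   # confidently correct
hard = binary_focal_loss(0.1, 1)   # confidently wrong
print(f"easy={easy:.5f}, hard={hard:.5f}, ratio={hard / easy:.0f}")
```

With `gamma=0` the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy; increasing `gamma` shrinks the contribution of well-classified samples so that training concentrates on ambiguous ones.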

eess.IV

[110] SCOPE-MRI: Bankart Lesion Detection as a Case Study in Data Curation and Deep Learning for Challenging Diagnoses

Sahil Sethi,Sai Reddy,Mansi Sakarvadia,Jordan Serotte,Darlington Nwaudo,Nicholas Maassen,Lewis Shi

Main category: eess.IV

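TL;DR: The study introduces the ScopeMRI dataset and a deep learning framework for detecting Bankart lesions; the models perform well on both standard MRIs and MRAs, matching or surpassing radiologists.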

Details

Motivation: Existing work has largely focused on pathologies that are easy to diagnose, leaving harder problems such as Bankart lesions underexplored; their diagnosis often relies on invasive MRAs. Method: Trains separate models for standard MRIs and MRAs using a combination of CNNs and transformers, and ensembles predictions from multiple views to optimize performance. Result: The models achieve an AUC of 0.91 and 0.93, sensitivity of 83% and 94%, and specificity of 91% and 86% on standard MRIs and MRAs, respectively. Conclusion: Deep learning models reach radiologist-level performance on standard MRIs, reducing the need for MRAs, and the release of ScopeMRI should advance research in musculoskeletal imaging. Abstract: While deep learning has shown strong performance in musculoskeletal imaging, existing work has largely focused on pathologies where diagnosis is not a clinical challenge, leaving more difficult problems underexplored, such as detecting Bankart lesions (anterior-inferior glenoid labral tears) on standard MRIs. Diagnosing these lesions is challenging due to their subtle imaging features, often leading to reliance on invasive MRI arthrograms (MRAs). This study introduces ScopeMRI, the first publicly available, expert-annotated dataset for shoulder pathologies, and presents a deep learning (DL) framework for detecting Bankart lesions on both standard MRIs and MRAs. ScopeMRI includes 586 shoulder MRIs (335 standard, 251 MRAs) from 558 patients who underwent arthroscopy. Ground truth labels were derived from intraoperative findings, the gold standard for diagnosis. Separate DL models for MRAs and standard MRIs were trained using a combination of CNNs and transformers. Predictions from sagittal, axial, and coronal views were ensembled to optimize performance. The models were evaluated on a 20% hold-out test set (117 MRIs: 46 MRAs, 71 standard MRIs). The models achieved an AUC of 0.91 and 0.93, sensitivity of 83% and 94%, and specificity of 91% and 86% for standard MRIs and MRAs, respectively. Notably, model performance on non-invasive standard MRIs matched or surpassed radiologists interpreting MRAs. External validation demonstrated initial generalizability across imaging protocols. This study demonstrates that DL models can achieve radiologist-level diagnostic performance on standard MRIs, reducing the need for invasive MRAs. By releasing ScopeMRI and a modular codebase for training and evaluating deep learning models on 3D medical imaging data, we aim to accelerate research in musculoskeletal imaging and support the development of new datasets for clinically challenging diagnostic tasks.
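The reported sensitivity and specificity, and the idea of combining predictions from several imaging planes, can be illustrated with a short sketch. Averaging the three view probabilities is an assumption made here; the abstract does not state how the views were ensembled. The probabilities, labels, and function names are made up and are not study data.

```python
def ensemble_views(view_probs):
    """Average the lesion probabilities predicted from each imaging plane."""
    return sum(view_probs.values()) / len(view_probs)


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Assumes at least one positive and one negative label in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical per-view probabilities for one exam.
exam = {"sagittal": 0.8, "axial": 0.6, "coronal": 0.7}
combined = ensemble_views(exam)
print(f"ensembled probability: {combined:.2f}")

# Hypothetical labels (1 = lesion confirmed at arthroscopy) and thresholded predictions.
labels = [1, 1, 1, 0, 0]
predictions = [1, 1, 0, 0, 1]
sens, spec = sensitivity_specificity(labels, predictions)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Sensitivity tracks missed lesions and specificity tracks false alarms, so reporting both, as the abstract does, shows the trade-off that a single accuracy figure would hide.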

[111] LymphAtlas- A Unified Multimodal Lymphoma Imaging Repository Delivering AI-Enhanced Diagnostic Insight

Jiajun Ding,Beiyao Zhu,Xiaosheng Liu,Lishen Zhang,Zhao Liu

Main category: eess.IV

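TL;DR: This study integrates PET metabolic information with CT anatomical structure to build a 3D multimodal lymphoma segmentation dataset from whole-body FDG PET/CT examinations, filling the gap left by the lack of standardized multimodal segmentation datasets for hematological malignancies.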

Details

Motivation: To address the lack of standardized multimodal segmentation datasets for hematological malignancies and to support precise segmentation and quantitative analysis of lymphoma. Method: 483 examinations were collected retrospectively with complete 3D structural information preserved; a high-quality dataset was built in the nnUNet format, and its quality was assessed through technical validation and deep learning model evaluation. Result: Deep learning models trained on the dataset segment lymphoma lesions accurately, robustly, and reproducibly, confirming the dataset's applicability and stability. Conclusion: The dataset markedly improves the characterization of lesion morphology, location, and metabolic features, provides data support for early diagnosis, clinical staging, and personalized treatment, and advances deep-learning-based automated image segmentation and precision medicine. Abstract: This study integrates PET metabolic information with CT anatomical structures to establish a 3D multimodal segmentation dataset for lymphoma based on whole-body FDG PET/CT examinations, which bridges the gap of the lack of standardised multimodal segmentation datasets in the field of haematological malignancies. We retrospectively collected 483 examination datasets acquired between March 2011 and May 2024, involving 220 patients (106 non-Hodgkin lymphoma, 42 Hodgkin lymphoma); all data underwent ethical review and were rigorously de-identified. Complete 3D structural information was preserved during data acquisition, preprocessing and annotation, and a high-quality dataset was constructed based on the nnUNet format. By systematic technical validation and evaluation of the preprocessing process, annotation quality and automatic segmentation algorithm, the deep learning model trained based on this dataset is verified to achieve accurate segmentation of lymphoma lesions in PET/CT images with high accuracy, good robustness and reproducibility, which proves the applicability and stability of this dataset in accurate segmentation and quantitative analysis. The deep fusion of PET/CT images achieved with this dataset not only significantly improves the accurate portrayal of the morphology, location and metabolic features of tumour lesions, but also provides solid data support for early diagnosis, clinical staging and personalized treatment, and promotes the development of automated image segmentation and precision medicine based on deep learning. The dataset and related resources are available at https://github.com/SuperD0122/LymphAtlas-.

[112] SAM-Guided Robust Representation Learning for One-Shot 3D Medical Image Segmentation

Jia Wang,Yunan Mei,Jiarui Liu,Xin Fan

Main category: eess.IV

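TL;DR: Proposes RRL-MedSAM, a framework that adapts SAM to one-shot 3D medical image segmentation through knowledge distillation and auto-prompting, improving performance while reducing computational cost.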

Details

Motivation: Medical image segmentation requires extensive annotation, and SAM relies on manual interaction and is computationally expensive, so it cannot be applied directly to one-shot segmentation. Method: A dual-stage knowledge distillation strategy and a mutual exponential moving average, combined with an auto-prompting segmentation decoder, distill knowledge from SAM into a lightweight encoder. Result: Outperforms existing methods on the OASIS and CT-lung datasets; the lightweight encoder has only 3% of the parameters of the SAM-Base encoder. Conclusion: RRL-MedSAM overcomes SAM's limitations in one-shot medical image segmentation, delivering efficient, high-performing segmentation. Abstract: One-shot medical image segmentation (MIS) is crucial for medical analysis due to the burden of medical experts on manual annotation. The recent emergence of the segment anything model (SAM) has demonstrated remarkable adaptation in MIS but cannot be directly applied to one-shot MIS due to its reliance on labor-intensive user interactions and the high computational cost. To cope with these limitations, we propose a novel SAM-guided robust representation learning framework, named RRL-MedSAM, to adapt SAM to one-shot 3D MIS, which exploits the strong generalization capabilities of the SAM encoder to learn better feature representation. We devise a dual-stage knowledge distillation (DSKD) strategy to distill general knowledge between natural and medical images from the foundation model to train a lightweight encoder, and then adopt a mutual exponential moving average (mutual-EMA) to update the weights of the general lightweight encoder and medical-specific encoder. Specifically, pseudo labels from the registration network are used to perform mutual supervision for such two encoders. Moreover, we introduce an auto-prompting (AP) segmentation decoder which adopts the mask generated from the general lightweight model as a prompt to assist the medical-specific model in boosting the final segmentation performance. Extensive experiments conducted on three public datasets, i.e., OASIS, CT-lung demonstrate that the proposed RRL-MedSAM outperforms state-of-the-art one-shot MIS methods for both segmentation and registration tasks. Especially, our lightweight encoder uses only 3% of the parameters compared to the encoder of SAM-Base.
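
The mutual-EMA step mentioned in the abstract builds on the ordinary exponential moving average of weights. A minimal sketch of that building block follows, together with a hypothetical symmetric variant in which each encoder absorbs a little of the other. The symmetric rule, the function names, and the decay value are assumptions for illustration; the abstract does not give the exact update.

```python
def ema_update(target, source, decay=0.99):
    """Standard EMA: decay * target + (1 - decay) * source, element-wise."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(target, source)]


def mutual_ema_step(general, medical, decay=0.99):
    """Hypothetical symmetric step: each encoder moves toward the other's
    pre-update weights. This is an assumed form, not the paper's definition."""
    new_general = ema_update(general, medical, decay)
    new_medical = ema_update(medical, general, decay)
    return new_general, new_medical


# Toy two-parameter "encoders".
general = [1.0, 0.0]
medical = [0.0, 1.0]
g, m = mutual_ema_step(general, medical, decay=0.9)
```

With a decay close to 1 the two weight sets drift together slowly, which is the usual reason for choosing an EMA over a hard weight copy.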

cs.MM [Back]

Stefano Dell'Anna,Andrea Montibeller,Giulia Boato

Main category: cs.MM

TL;DR: TrueFake is a large-scale benchmark dataset of 600,000 images for evaluating fake image detectors under realistic conditions such as social media sharing.

Details

Motivation: AI-generated synthetic media are widely used to spread misinformation, and existing detection tools do not adequately handle real-world challenges such as social media compression. Method: The TrueFake dataset is built from images produced by a range of generative techniques and shared via social networks, and extensive experiments analyze detection performance. Result: Social media sharing significantly affects detection performance, and the study identifies the currently most effective detection and training strategies. Conclusion: Forensic models need to be evaluated under real-world conditions. Abstract: AI-generated synthetic media are increasingly used in real-world scenarios, often with the purpose of spreading misinformation and propaganda through social media platforms, where compression and other processing can degrade fake detection cues. Currently, many forensic tools fail to account for these in-the-wild challenges. In this work, we introduce TrueFake, a large-scale benchmarking dataset of 600,000 images including top-notch generative techniques and sharing via three different social networks. This dataset allows for rigorous evaluation of state-of-the-art fake image detectors under very realistic and challenging conditions. Through extensive experimentation, we analyze how social media sharing impacts detection performance, and identify the currently most effective detection and training strategies. Our findings highlight the need for evaluating forensic models in conditions that mirror real-world use.

cs.SE [Back]

[114] ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies

Shubham Gandhi,Dhruv Shah,Manasi Patwardhan,Lovekesh Vig,Gautam Shroff

Main category: cs.SE

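TL;DR: ResearchCodeAgent is an LLM-based multi-agent system that automates code generation for research methodologies described in machine learning literature, reducing coding time and improving code quality.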

Details

Motivation: To bridge the gap between high-level concepts and implementation in machine learning research, helping researchers quickly generate code for existing papers for benchmarking or further development. Method: A flexible agent architecture with a dynamic planning mechanism that combines short-term and long-term memory to support context-aware interaction with the research environment. Result: Across three machine learning tasks, 46.9% of generated code is high-quality and error-free, 25% outperforms baseline implementations, and coding time falls by 57.9% on average. Conclusion: ResearchCodeAgent is a significant step toward automating research implementation and may accelerate machine learning research. Abstract: In this paper we introduce ResearchCodeAgent, a novel multi-agent system leveraging large language model (LLM) agents to automate the codification of research methodologies described in machine learning literature. The system bridges the gap between high-level research concepts and their practical implementation, allowing researchers to auto-generate code for existing research papers, either for benchmarking or for building on top of existing methods specified in the literature, given partial or complete starter code. ResearchCodeAgent employs a flexible agent architecture with a comprehensive action suite, enabling context-aware interactions with the research environment. The system incorporates a dynamic planning mechanism, utilizing both short and long-term memory to adapt its approach iteratively. We evaluate ResearchCodeAgent on three distinct machine learning tasks with distinct task complexity and representing different parts of the ML pipeline: data augmentation, optimization, and data batching. Our results demonstrate the system's effectiveness and generalizability, with 46.9% of generated code being high-quality and error-free, and 25% showing performance improvements over baseline implementations. Empirical analysis shows an average reduction of 57.9% in coding time compared to manual implementation. We observe higher gains for more complex tasks. ResearchCodeAgent represents a significant step towards automating the research implementation process, potentially accelerating the pace of machine learning research.

cs.IR [Back]

[115] Recommending Clinical Trials for Online Patient Cases using Artificial Intelligence

Joey Chan,Qiao Jin,Nicholas Wan,Charalampos S. Floudas,Elisabetta Xue,Zhiyong Lu

Main category: cs.IR

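TL;DR: TrialGPT uses a large language model to match patient cases to clinical trials and outperforms traditional keyword search by 46% in identifying eligible trials.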

Details

Motivation: Clinical trial recruitment faces challenges such as limited patient awareness and complex eligibility criteria, while online platforms offer new opportunities to broaden recruitment. Method: The LLM-based TrialGPT framework is used to match 50 online patient cases to clinical trials and is compared with traditional keyword search. Result: TrialGPT outperforms traditional methods by 46% in identifying eligible trials, with each patient eligible for about 7 trials on average. Conclusion: TrialGPT performs well at clinical trial matching and received positive feedback from patients and trial organizers. Abstract: Clinical trials are crucial for assessing new treatments; however, recruitment challenges - such as limited awareness, complex eligibility criteria, and referral barriers - hinder their success. With the growth of online platforms, patients increasingly turn to social media and health communities for support, research, and advocacy, expanding recruitment pools and established enrollment pathways. Recognizing this potential, we utilized TrialGPT, a framework that leverages a large language model (LLM) as its backbone, to match 50 online patient cases (collected from published case reports and a social media website) to clinical trials and evaluate performance against traditional keyword-based searches. Our results show that TrialGPT outperforms traditional methods by 46% in identifying eligible trials, with each patient, on average, being eligible for around 7 trials. Additionally, our outreach efforts to case authors and trial organizers regarding these patient-trial matches yielded highly positive feedback, which we present from both perspectives.

[116] MATCHA: Can Multi-Agent Collaboration Build a Trustworthy Conversational Recommender?

Zheng Hui,Xiaokai Wei,Yexi Jiang,Kevin Gao,Chen Wang,Frank Ong,Se-eun Yoon,Rachit Pareek,Michelle Gong

Main category: cs.IR

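TL;DR: MATCHA is a multi-agent collaboration framework for conversational recommendation that uses large language models to improve personalization and user engagement, with specialized agents improving recommendation accuracy, diversity, and safety.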

Details

Motivation: To address key challenges in conversational recommender systems, such as handling complex user requests, enhancing personalization, empirical evaluation, and ensuring safe interactions. Method: Specialized agents for intent analysis, candidate generation, ranking, re-ranking, explainability, and safeguards work collaboratively. Result: Superior or comparable performance to the current state of the art on eight metrics, validated through comparisons with six baseline models. Conclusion: MATCHA effectively handles complex user needs in settings such as game recommendation and improves the performance and safety of recommender systems through multi-agent collaboration. Abstract: In this paper, we propose a multi-agent collaboration framework called MATCHA for conversational recommendation systems, leveraging large language models (LLMs) to enhance personalization and user engagement. Users can request recommendations via free-form text and receive curated lists aligned with their interests, preferences, and constraints. Our system introduces specialized agents for intent analysis, candidate generation, ranking, re-ranking, explainability, and safeguards. These agents collaboratively improve recommendation accuracy, diversity, and safety. On eight metrics, our model achieves superior or comparable performance to the current state-of-the-art. Through comparisons with six baseline models, our approach addresses key challenges in conversational recommendation systems for game recommendations, including: (1) handling complex, user-specific requests, (2) enhancing personalization through multi-agent collaboration, (3) empirical evaluation and deployment, and (4) ensuring safe and trustworthy interactions.

[117] Search-Based Interaction For Conversation Recommendation via Generative Reward Model Based Simulated User

Xiaolei Wang,Chunxuan Xia,Junyi Li,Fanzhe Meng,Lei Huang,Jinpeng Wang,Wayne Xin Zhao,Ji-Rong Wen

Main category: cs.IR

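TL;DR: Proposes GRSU, a simulated user based on a generative reward model that interacts automatically with conversational recommendation systems (CRSs) to better capture user preferences.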

Details

Motivation: CRSs struggle to understand user preferences accurately over multi-turn interaction, and frequent user involvement can degrade the experience. Method: Two feedback actions, generative item scoring and attribute-based item critique, are unified into a single simulated user via instruction tuning, with beam search and candidate ranking used to optimize the interaction process. Result: Experiments on public datasets confirm the method's effectiveness, efficiency, and transferability. Conclusion: GRSU improves a CRS's understanding of user preferences through automatic interaction. Abstract: Conversational recommendation systems (CRSs) use multi-turn interaction to capture user preferences and provide personalized recommendations. A fundamental challenge in CRSs lies in effectively understanding user preferences from conversations. User preferences can be multifaceted and complex, posing significant challenges for accurate recommendations even with access to abundant external knowledge. While interaction with users can clarify their true preferences, frequent user involvement can lead to a degraded user experience. To address this problem, we propose a generative reward model based simulated user, named GRSU, for automatic interaction with CRSs. The simulated user provides feedback on the items recommended by CRSs, enabling them to better capture intricate user preferences through multi-turn interaction. Inspired by generative reward models, we design two types of feedback actions for the simulated user: generative item scoring, which offers coarse-grained feedback, and attribute-based item critique, which provides fine-grained feedback. To ensure seamless integration, these feedback actions are unified into an instruction-based format, allowing the development of a unified simulated user via instruction tuning on synthesized data. With this simulated user, automatic multi-turn interaction with CRSs can be effectively conducted. Furthermore, to strike a balance between effectiveness and efficiency, we draw inspiration from the paradigm of reward-guided search in complex reasoning tasks and employ beam search for the interaction process. On top of this, we propose an efficient candidate ranking method to improve the recommendation results derived from interaction. Extensive experiments on public datasets demonstrate the effectiveness, efficiency, and transferability of our approach.
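
The abstract uses beam search to trade off effectiveness against efficiency during interaction. The sketch below is a generic beam search over sequences of feedback actions. The action names, the per-action rewards, and the additive scoring function are hypothetical stand-ins; in the paper the scores would come from the simulated user rather than a lookup table.

```python
import heapq


def beam_search(initial, expand, score, beam_width=2, steps=3):
    """Keep the `beam_width` best partial interactions at each step."""
    beam = [initial]
    for _ in range(steps):
        candidates = [nxt for state in beam for nxt in expand(state)]
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=score)
    return max(beam, key=score)


# Hypothetical rewards for the two feedback action types.
REWARD = {"score_item": 1.0, "critique_attribute": 1.5}


def expand(state):
    """Extend an interaction (a tuple of actions) by one more action."""
    return [state + (action,) for action in REWARD]


def score(state):
    """Toy additive score for an interaction."""
    return sum(REWARD[action] for action in state)


best = beam_search((), expand, score, beam_width=2, steps=3)
```

A wider beam explores more interaction paths at proportionally higher cost, which is the effectiveness-versus-efficiency balance the abstract refers to.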

[118] X-Cross: Dynamic Integration of Language Models for Cross-Domain Sequential Recommendation

Guy Hadad,Haggai Roitman,Yotam Eshel,Bracha Shapira,Lior Rokach

Main category: cs.IR

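TL;DR: X-Cross is a new cross-domain sequential recommendation model that integrates multiple domain-specific language models, reducing the data and computation needed to adapt to new domains.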

Details

Motivation: To let recommender systems adapt quickly to new domains without extensive retraining. Method: Several domain-specific language models are fine-tuned with low-rank adapters (LoRA), and their knowledge is integrated dynamically, refining representations layer by layer. Result: On Amazon datasets, X-Cross performs comparably to a LoRA-fine-tuned model while using 75% fewer additional parameters; in cross-domain tasks it needs 50%-75% less fine-tuning data and improves accuracy significantly over alternative baselines. Conclusion: X-Cross offers an efficient, scalable cross-domain recommendation solution suited to data-constrained environments. Abstract: As new products are emerging daily, recommendation systems are required to quickly adapt to possible new domains without needing extensive retraining. This work presents "X-Cross", a novel cross-domain sequential-recommendation model that recommends products in new domains by integrating several domain-specific language models; each model is fine-tuned with low-rank adapters (LoRA). Given a recommendation prompt, operating layer by layer, X-Cross dynamically refines the representation of each source language model by integrating knowledge from all other models. These refined representations are propagated from one layer to the next, leveraging the activations from each domain adapter to ensure domain-specific nuances are preserved while enabling adaptability across domains. Using Amazon datasets for sequential recommendation, X-Cross achieves performance comparable to a model that is fine-tuned with LoRA, while using only 25% of the additional parameters. In cross-domain tasks, such as adapting from Toys domain to Tools, Electronics or Sports, X-Cross demonstrates robust performance, while requiring about 50%-75% less fine-tuning data than LoRA to make fine-tuning effective. Furthermore, X-Cross achieves significant improvement in accuracy over alternative cross-domain baselines. Overall, X-Cross enables scalable and adaptive cross-domain recommendations, reducing computational overhead and providing an efficient solution for data-constrained environments.
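
Because the entry leans on low-rank adapters, a small sketch of the generic LoRA forward pass may help: the frozen weight W is left untouched and a low-rank product B·A is added on top. This illustrates LoRA itself, not X-Cross's cross-model integration, and the matrices and the `alpha / r` scaling convention are illustrative assumptions.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x), with A of shape r x d_in and B of shape d_out x r."""
    r = len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one adapter: r * (d_in + d_out)."""
    return r * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
A = [[1.0, 1.0]]               # rank-1 down-projection (1 x 2)
B = [[0.5], [0.0]]             # rank-1 up-projection (2 x 1)
x = [2.0, 3.0]

y = lora_forward(W, A, B, x, alpha=1.0)
added = lora_param_count(4096, 4096, 8)
```

For a 4096 x 4096 layer, a rank-8 adapter adds 65,536 trainable parameters against roughly 16.8 million in the frozen weight, which is why parameter counts for adapter-based methods are reported relative to the adapters rather than the full model.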

cs.AI [Back]

[119] AI Awareness

Xiaojian Li,Haoyuan Shi,Rongwu Xu,Wei Xu

Main category: cs.AI

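TL;DR: The paper examines four dimensions of AI awareness and their effect on AI capabilities, and analyzes the associated risks and future research directions.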

Details

Motivation: As AI capabilities grow, studying the functional manifestations of AI awareness has become important, beyond the philosophical question of consciousness. Method: Drawing on cognitive science, psychology, and computational theory, the review analyzes four forms of AI awareness (meta-cognition, self-awareness, social awareness, and situational awareness) and assesses current evaluation methods and empirical findings. Result: More aware AI tends to exhibit more intelligent behavior but also introduces safety and alignment risks. Conclusion: AI awareness is a double-edged sword: capability gains must be pursued while carefully managing risk, and the review offers directions for future research. Abstract: Recent breakthroughs in artificial intelligence (AI) have brought about increasingly capable systems that demonstrate remarkable abilities in reasoning, language understanding, and problem-solving. These advancements have prompted a renewed examination of AI awareness, not as a philosophical question of consciousness, but as a measurable, functional capacity. In this review, we explore the emerging landscape of AI awareness, which includes meta-cognition (the ability to represent and reason about its own state), self-awareness (recognizing its own identity, knowledge, limitations, inter alia), social awareness (modeling the knowledge, intentions, and behaviors of other agents), and situational awareness (assessing and responding to the context in which it operates). First, we draw on insights from cognitive science, psychology, and computational theory to trace the theoretical foundations of awareness and examine how the four distinct forms of AI awareness manifest in state-of-the-art AI. Next, we systematically analyze current evaluation methods and empirical findings to better understand these manifestations. Building on this, we explore how AI awareness is closely linked to AI capabilities, demonstrating that more aware AI agents tend to exhibit higher levels of intelligent behaviors. Finally, we discuss the risks associated with AI awareness, including key topics in AI safety, alignment, and broader ethical concerns. AI awareness is a double-edged sword: it improves general capabilities, e.g., reasoning and safety, while also raising concerns around misalignment and societal risks, demanding careful oversight as AI capabilities grow. On the whole, our interdisciplinary review provides a roadmap for future research and aims to clarify the role of AI awareness in the ongoing development of intelligent machines.

William P. McCarthy,Saujas Vaduguru,Karl D. D. Willis,Justin Matejka,Judith E. Fan,Daniel Fried,Yewen Pu

Main category: cs.AI

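TL;DR: The paper introduces mrCAD, a dataset for studying how humans iteratively refine designs through multimodal instructions (text and drawing), and finds that generative AI follows generation instructions better than refinement instructions.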

Details

Motivation: Humans can communicate concepts through iterative refinement, whereas generative AI excels at content generation but struggles with language-guided refinement; the study aims to close this gap. Method: The mrCAD dataset records 1,092 pairs of human players refining CAD designs over multi-round communication games using text, drawing, or both. Result: Generation and refinement instructions differ in their mix of drawing and text, and state-of-the-art vision-language models follow generation instructions better than refinement instructions. Conclusion: The work lays a foundation for analyzing and modeling a multimodal language of refinement that earlier datasets do not capture. Abstract: A key feature of human collaboration is the ability to iteratively refine the concepts we have communicated. In contrast, while generative AI excels at the generation of content, it often struggles to make specific language-guided modifications of its prior outputs. To bridge the gap between how humans and machines perform edits, we present mrCAD, a dataset of multimodal instructions in a communication game. In each game, players created computer-aided designs (CADs) and refined them over several rounds to match specific target designs. Only one player, the Designer, could see the target, and they had to instruct the other player, the Maker, using text, drawing, or a combination of modalities. mrCAD consists of 6,082 communication games, 15,163 instruction-execution rounds, played between 1,092 pairs of human players. We analyze the dataset and find that generation and refinement instructions differ in their composition of drawing and text. Using the mrCAD task as a benchmark, we find that state-of-the-art VLMs are better at following generation instructions than refinement instructions. These results lay a foundation for analyzing and modeling a multimodal language of refinement that is not represented in previous datasets.

[121] ReasonIR: Training Retrievers for Reasoning Tasks

Rulin Shao,Rui Qiao,Varsha Kishore,Niklas Muennighoff,Xi Victoria Lin,Daniela Rus,Bryan Kian Hsiang Low,Sewon Min,Wen-tau Yih,Pang Wei Koh,Luke Zettlemoyer

Main category: cs.AI

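TL;DR: ReasonIR-8B is the first retriever trained specifically for general reasoning tasks; trained on a mix of synthetic and public data, it achieves state-of-the-art results on reasoning-intensive IR and improves RAG performance.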

Details

Motivation: Existing retrievers show limited gains on reasoning tasks because their training data consists mostly of short factual queries. Method: A synthetic data generation pipeline creates a challenging query and a hard negative for each document, and the model is trained on this data together with public data. Result: 29.9 nDCG@10 without a reranker and 36.9 nDCG@10 with a reranker on BRIGHT; MMLU and GPQA improve by 6.4% and 22.6%, respectively. Conclusion: ReasonIR-8B performs strongly on reasoning tasks, and its training recipe is general, open-source, and applicable to future LLMs. Abstract: We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query, along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without a reranker and 36.9 nDCG@10 with a reranker on BRIGHT, a widely-used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries; it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.
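
The headline numbers in this entry are nDCG@10 scores, so a minimal reference computation is sketched here. It uses the common linear-gain form with a log2 discount and, as a simplifying assumption, derives the ideal ranking only from the relevance labels of the retrieved list; a benchmark harness would normally build the ideal ranking from all judged documents for the query.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items (rank 0 is the first hit)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([1, 0, 0])   # relevant document ranked first
second = ndcg_at_k([0, 1])       # relevant document ranked second
empty = ndcg_at_k([0, 0])        # nothing relevant retrieved
```

Reported scores such as 29.9 are this quantity averaged over queries and multiplied by 100.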

[122] The Leaderboard Illusion

Shivalika Singh,Yiyang Nan,Alex Wang,Daniel D'Souza,Sayash Kapoor,Ahmet Üstün,Sanmi Koyejo,Yuntian Deng,Shayne Longpre,Noah Smith,Beyza Ermis,Marzieh Fadaee,Sara Hooker

Main category: cs.AI

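TL;DR: The paper finds systematic bias in the Chatbot Arena leaderboard: private testing and selective disclosure create data asymmetries that favor closed models, and it recommends reforming the evaluation framework for fairer, more transparent benchmarking.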

Details

Motivation: Measuring progress is essential to any scientific field, but current AI leaderboards such as Chatbot Arena are distorted in ways that undermine fairness. Method: The paper analyzes private testing practices, model disclosure policies, and data allocation to expose bias in the Arena leaderboard. Result: Closed models have advantages in data access and testing, providers such as Meta use private testing to selectively optimize their scores, and open models receive less data. Conclusion: The Arena evaluation framework should be reformed to reduce data asymmetry and improve transparency and fairness. Abstract: Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.
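
The selective-disclosure argument can be illustrated with a small simulation: if a provider privately tests several variants of identical true skill and publishes only the best noisy score, the published number is biased upward. The skill level, noise scale, and Gaussian noise model below are assumptions for illustration and are not taken from the paper.

```python
import random
from statistics import mean


def reported_score(true_skill, n_variants, noise_sd, rng):
    """Publish the best of n noisy measurements of the same true skill."""
    return max(rng.gauss(true_skill, noise_sd) for _ in range(n_variants))


def average_reported(true_skill, n_variants, noise_sd=20.0, trials=2000, seed=0):
    """Average published score over many simulated releases."""
    rng = random.Random(seed)
    return mean(
        reported_score(true_skill, n_variants, noise_sd, rng) for _ in range(trials)
    )


single = average_reported(1200.0, n_variants=1)       # honest single submission
best_of_27 = average_reported(1200.0, n_variants=27)  # pick the best of 27 variants
```

Every variant here has the same true skill, so the gap between `best_of_27` and `single` is purely an artifact of choosing the maximum, which is the bias the abstract attributes to selective disclosure.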

[123] ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification

Ziqing Fan,Cheng Liang,Chaoyi Wu,Ya Zhang,Yanfeng Wang,Weidi Xie

Main category: cs.AI

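TL;DR: ChestX-Reasoner is a multimodal large language model (MLLM) for radiology diagnosis that improves diagnostic performance using the structured reasoning found in clinical reports, outperforming existing models.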

Details

Motivation: Medical AI models often overlook the structured reasoning of clinical practice; ChestX-Reasoner aims to fill this gap. Method: A large dataset is built from clinical reports, a two-stage training framework (supervised fine-tuning and reinforcement learning) is applied, and a new benchmark and evaluation metric are introduced. Result: Outperforms existing models in diagnostic accuracy and reasoning ability, with improvements ranging from 3.3% to 27%. Conclusion: ChestX-Reasoner provides open-source resources for research on medical reasoning MLLMs. Abstract: Recent advances in reasoning-enhanced large language models (LLMs) and multimodal LLMs (MLLMs) have significantly improved performance in complex tasks, yet medical AI models often overlook the structured reasoning processes inherent in clinical practice. In this work, we present ChestX-Reasoner, a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports, reflecting the step-by-step reasoning followed by radiologists. We construct a large dataset by extracting and refining reasoning chains from routine radiology reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards. We introduce RadRBench-CXR, a comprehensive benchmark featuring 59K visual question answering samples with 301K clinically validated reasoning steps, and propose RadRScore, a metric evaluating reasoning factuality, completeness, and effectiveness. ChestX-Reasoner outperforms existing medical and general-domain MLLMs in both diagnostic accuracy and reasoning ability, achieving 16%, 5.9%, and 18% improvements in reasoning ability compared to the best medical MLLM, the best general MLLM, and its base model, respectively, as well as 3.3%, 24%, and 27% improvements in outcome accuracy. All resources are open-sourced to facilitate further research in medical reasoning MLLMs.

Khoi Trinh,Scott Seidenberger,Raveen Wijewickrama,Murtuza Jadliwala,Anindya Maiti

Main category: cs.AI

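TL;DR: The study examines how well iterative prompt refinement reproduces a target image in AI image regeneration, and tests whether image similarity metrics agree with human perception.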

Details

Motivation: As AI-generated content becomes widespread, the study asks how iterative prompt refinement can regenerate a specific image and whether existing image similarity metrics suit such iterative workflows. Method: A structured user study evaluates how iterative prompt refinement affects the similarity between regenerated and target images, and compares image similarity metrics with subjective human judgments. Result: Iterative prompt adjustments substantially improve image alignment, as confirmed by both subjective evaluations and quantitative measures. Conclusion: Iterative workflows have broad potential in generative AI content creation, and image similarity metrics can serve as an effective feedback mechanism. Abstract: With AI-generated content becoming ubiquitous across the web, social media, and other digital platforms, it is vital to examine how such content is inspired and generated. The creation of AI-generated images often involves refining the input prompt iteratively to achieve desired visual outcomes. This study focuses on the relatively underexplored concept of image regeneration using AI, in which a human operator attempts to closely recreate a specific target image by iteratively refining their prompt. Image regeneration is distinct from normal image generation, which lacks any predefined visual reference. A separate challenge lies in determining whether existing image similarity metrics (ISMs) can provide reliable, objective feedback in iterative workflows, given that we do not fully understand if subjective human judgments of similarity align with these metrics. Consequently, we must first validate their alignment with human perception before assessing their potential as a feedback mechanism in the iterative prompt refinement process. To address these research gaps, we present a structured user study evaluating how iterative prompt refinement affects the similarity of regenerated images relative to their targets, while also examining whether ISMs capture the same improvements perceived by human observers. Our findings suggest that incremental prompt adjustments substantially improve alignment, verified through both subjective evaluations and quantitative measures, underscoring the broader potential of iterative workflows to enhance generative AI content creation across various application domains.

[125] CBM-RAG: Demonstrating Enhanced Interpretability in Radiology Report Generation with Multi-Agent RAG and Concept Bottleneck Models

Hasan Md Tusfiqur Alam,Devansh Srivastav,Abdulrahman Mohamed Selim,Md Abdul Kadir,Md Moktadiurl Hoque Shuvo,Daniel Sonntag

Main category: cs.AI

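TL;DR: This paper proposes an automated radiology report generation framework that combines Concept Bottleneck Models (CBMs) with a multi-agent Retrieval-Augmented Generation (RAG) system to improve the interpretability and reliability of AI.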

Details

Motivation: Generative AI shows promise for automating radiology workflows, but concerns about interpretability and reliability hinder clinical adoption. Method: CBMs map chest X-ray features to understandable clinical concepts, and a multi-agent RAG system generates evidence-based reports. Result: The system provides interpretable predictions, mitigates hallucinations, and generates high-quality, tailored reports. Conclusion: The framework offers a path to improving diagnostic consistency and giving radiologists actionable insights. Abstract: Advancements in generative Artificial Intelligence (AI) hold great promise for automating radiology workflows, yet challenges in interpretability and reliability hinder clinical adoption. This paper presents an automated radiology report generation framework that combines Concept Bottleneck Models (CBMs) with a Multi-Agent Retrieval-Augmented Generation (RAG) system to bridge AI performance with clinical explainability. CBMs map chest X-ray features to human-understandable clinical concepts, enabling transparent disease classification. Meanwhile, the RAG system integrates multi-agent collaboration and external knowledge to produce contextually rich, evidence-based reports. Our demonstration showcases the system's ability to deliver interpretable predictions, mitigate hallucinations, and generate high-quality, tailored reports with an interactive interface addressing accuracy, trust, and usability challenges. This framework provides a pathway to improving diagnostic consistency and empowering radiologists with actionable insights.

cs.LG [Back]

[126] RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Zihan Wang,Kangrui Wang,Qineng Wang,Pingyue Zhang,Linjie Li,Zhengyuan Yang,Kefan Yu,Minh Nhat Nguyen,Licheng Liu,Eli Gottlieb,Monica Lam,Yiping Lu,Kyunghyun Cho,Jiajun Wu,Li Fei-Fei,Lijuan Wang,Yejin Choi,Manling Li

Main category: cs.LG

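TL;DR: The paper proposes the StarPO framework and the RAGEN system for training and evaluating large language models (LLMs) as interactive agents, identifies challenges in multi-turn agent reinforcement learning such as reward variance and gradient instability, and proposes stabilization methods.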

Details

Motivation: Training LLMs as interactive agents involves long-horizon decision making and stochastic environment feedback, and multi-turn agent RL remains underexplored. Method: The StarPO framework and RAGEN system optimize training through trajectory-level agent RL, and StarPO-S is introduced to address reward-variance and gradient problems. Result: Agent training exhibits an Echo Trap failure mode, which the stabilized StarPO-S addresses; rollout diversity, interaction granularity, and sampling frequency benefit training; without fine-grained reward signals, reasoning ability fails to emerge. Conclusion: StarPO and RAGEN provide an effective framework for training LLM agents, and the stabilized method resolves key problems and reveals the factors that matter in training. Abstract: Training large language models (LLMs) as interactive agents presents unique challenges including long-horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on three stylized environments reveals three core findings. First, our agent RL training shows a recurring mode of Echo Trap where reward variance cliffs and gradient spikes occur; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and decoupled clipping. Second, we find the shaping of RL rollouts would benefit from diverse initial states, medium interaction granularity and more frequent sampling. Third, we show that without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges through multi-turn RL, and agents may show shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
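
The abstract lists trajectory filtering as one ingredient of StarPO-S without describing it. One plausible reading, sketched here purely as an assumption, is to keep only the initial states whose rollouts show the highest reward variance, since groups with identical rewards carry little learning signal. The function name, the keep fraction, and the toy rewards are all hypothetical.

```python
from statistics import pvariance


def filter_by_reward_variance(groups, keep_fraction=0.5):
    """Keep the rollout groups with the highest reward variance.

    `groups` maps an initial-state id to the rewards of its rollouts.
    This selection rule is an assumed interpretation, not the paper's spec.
    """
    ranked = sorted(groups.items(), key=lambda kv: pvariance(kv[1]), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return dict(ranked[:n_keep])


# Toy rewards for four initial states, four rollouts each.
groups = {
    "state_a": [1.0, 1.0, 1.0, 1.0],  # always solved: variance 0
    "state_b": [0.0, 1.0, 0.0, 1.0],  # mixed outcomes: variance 0.25
    "state_c": [0.0, 0.0, 0.0, 1.0],  # rarely solved: variance 0.1875
    "state_d": [0.0, 0.0, 0.0, 0.0],  # never solved: variance 0
}

kept = filter_by_reward_variance(groups, keep_fraction=0.5)
```

The retained groups would then feed the policy update; the released code at the URL in the abstract is the authority on how the filter is actually defined.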

[127] Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding

Gabe Guo,Stefano Ermon

Main category: cs.LG

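TL;DR: The paper revisits any-subset autoregressive models (AS-ARMs) and introduces the Any-Subset Speculative Decoding (ASSD) algorithm, which samples tokens in parallel from the correct joint distribution and improves both generation speed and performance.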

Details

Motivation: Discrete diffusion models fail to preserve the learned data distribution when generating in parallel; the paper seeks a more efficient approach to parallel language generation. Method: AS-ARMs with the ASSD algorithm, which support parallelized joint probability density estimation, together with a mathematically justified training scheme. Result: ASSD speeds up language generation without sacrificing quality; AS-ARMs achieve state-of-the-art performance among sub-200M-parameter models on infilling benchmarks and nearly match models 50 times larger on code generation. Conclusion: AS-ARMs are a promising direction for language modeling, with both theoretical and empirical support for parallel generation. Abstract: In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. With discrete diffusion models, the more tokens they generate in parallel, the less their predicted distributions adhere to the originally learned data distribution, as they rely on a conditional independence assumption that only works with infinitesimally small timesteps. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. As implied by the name, AS-ARMs can generate tokens in any order, and in parallel. Moreover, AS-ARMs support parallelized joint probability density estimation, allowing them to correct their own parallel-generated token distributions, via our Any-Subset Speculative Decoding (ASSD) algorithm. ASSD provably enables generation of tokens from the correct joint distribution, with the number of neural network calls upper bounded by the number of tokens predicted. We empirically verify that ASSD speeds up language generation, without sacrificing quality. Furthermore, we provide a mathematically justified scheme for training AS-ARMs for generation, and show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation. Our theoretical and empirical results indicate that the once-forgotten AS-ARMs are a promising direction of language modeling.
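
To make the "correct joint distribution" claim concrete, here is a sketch of the standard speculative-sampling acceptance rule for a single token: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized positive part of p - q. This is the generic rule from the speculative decoding literature; whether ASSD uses exactly this form is an assumption, and the two toy distributions are invented.

```python
def acceptance_probability(token, target, draft):
    """Probability of accepting a drafted token: min(1, p(token) / q(token))."""
    return min(1.0, target[token] / draft[token])


def residual_distribution(target, draft):
    """Distribution to resample from after a rejection: normalize max(0, p - q)."""
    residual = {t: max(0.0, target[t] - draft.get(t, 0.0)) for t in target}
    total = sum(residual.values())
    if total == 0.0:  # draft already equals target, so nothing is ever rejected
        return dict(target)
    return {t: r / total for t, r in residual.items()}


def output_distribution(target, draft):
    """Exact distribution of the emitted token under accept-or-resample."""
    accepted = {
        t: q * acceptance_probability(t, target, draft)
        for t, q in draft.items()
        if q > 0.0
    }
    reject_mass = 1.0 - sum(accepted.values())
    residual = residual_distribution(target, draft)
    return {t: accepted.get(t, 0.0) + reject_mass * residual[t] for t in target}


target = {"a": 0.6, "b": 0.4}  # distribution we want to sample from
draft = {"a": 0.9, "b": 0.1}   # cheaper proposal distribution
out = output_distribution(target, draft)
```

`out` reproduces `target` up to floating-point error, which is the single-token version of the guarantee that a draft-then-verify scheme does not change the sampled distribution.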

[128] Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang,Qing Yang,Zhiyuan Zeng,Liliang Ren,Lucas Liu,Baolin Peng,Hao Cheng,Xuehai He,Kuan Wang,Jianfeng Gao,Weizhu Chen,Shuohang Wang,Simon Shaolei Du,Yelong Shen

Main category: cs.LG

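TL;DR: 1-shot RLVR substantially improves the mathematical reasoning of LLMs: training on a single example raises MATH500 performance from 36.0% to 73.6%, and interesting phenomena such as cross-domain generalization are observed.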

Details

Motivation: Explores how a very small amount of training data (a single example) can effectively improve the mathematical reasoning of large language models. Method: Applies 1-shot RLVR (reinforcement learning with verifiable reward) using the GRPO and PPO algorithms, and adds an entropy loss to promote exploration. Result: Training on a single example substantially improves performance (MATH500 from 36.0% to 73.6%), and cross-domain generalization and post-saturation generalization are observed. Conclusion: 1-shot RLVR is efficient and generalizes well; future work can further study RLVR data efficiency and its underlying mechanisms. Abstract: We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B's performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR
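
To make the two ingredients named in the abstract concrete (a policy gradient loss and an entropy term), here is a minimal sketch of a REINFORCE-style surrogate with group-normalized advantages, as commonly described for GRPO, plus an entropy bonus. It is not the authors' implementation: the function names, the `eps` stabilizer, and the omission of ratio clipping and KL terms are all simplifying assumptions.

```python
import math


def group_normalized_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def surrogate_loss(logprob, advantage, probs, entropy_coef):
    """Policy-gradient term minus an entropy bonus (hypothetical, unclipped form)."""
    return -advantage * logprob - entropy_coef * entropy(probs)
```

With verifiable 0/1 rewards, responses that solve the problem receive positive advantages and the rest negative ones; a positive `entropy_coef` lowers the loss for higher-entropy policies, which is one simple way to encourage exploration.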

[129] Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition

Zhengfu He,Junxuan Wang,Rui Lin,Xuyang Ge,Wentao Shu,Qiong Tang,Junping Zhang,Xipeng Qiu

Main category: cs.LG

TL;DR: Lorsa is a sparse attention model that decomposes the multi-head self-attention of Transformers, revealing cleaner attention behaviors.

Details

Motivation: Addresses attention superposition in order to understand attention-mediated interactions between features at different token positions. Method: Decomposes multi-head self-attention into individually comprehensible components using a sparse dictionary learning approach. Result: Finds cleaner and finer-grained attention behaviors (e.g., induction heads and successor heads) and discovers a family of arithmetic-specific Lorsa heads in Llama-3.1-8B. Conclusion: Lorsa matches SAE in interpretability and shows superior circuit discovery properties, especially for features computed collectively by multiple attention heads. Abstract: We propose Low-Rank Sparse Attention (Lorsa), a sparse replacement model of Transformer attention layers to disentangle original Multi Head Self Attention (MHSA) into individually comprehensible components. Lorsa is designed to address the challenge of attention superposition to understand attention-mediated interaction between features in different token positions. We show that Lorsa heads find cleaner and finer-grained versions of previously discovered MHSA behaviors like induction heads, successor heads and attention sink behavior (i.e., heavily attending to the first token). Lorsa and Sparse Autoencoder (SAE) are both sparse dictionary learning methods applied to different Transformer components, and lead to consistent findings in many ways. For instance, we discover a comprehensive family of arithmetic-specific Lorsa heads, each corresponding to an atomic operation in Llama-3.1-8B. Automated interpretability analysis indicates that Lorsa achieves parity with SAE in interpretability while Lorsa exhibits superior circuit discovery properties, especially for features computed collectively by multiple MHSA heads. We also conduct extensive experiments on architectural design ablation, Lorsa scaling law and error analysis.

cs.RO [Back]

[130] DRO: Doppler-Aware Direct Radar Odometry

Cedric Le Gentil,Leonardo Brizi,Daniil Lisus,Xinyuan Qiao,Giorgio Grisetti,Timothy D. Barfoot

Main category: cs.RO

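TL;DR: This paper proposes an SE(2) radar odometry method that registers each scan to a local map directly from radar intensity information, without feature extraction, and is suited to adverse weather and feature-deprived environments.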

Details

Motivation: Millimetre-wave radars have advantages in adverse weather and complex environments, but existing methods rely on feature extraction, which limits their applicability. The paper aims to provide a direct method that needs no feature extraction. Method: Performs direct scan-to-local-map registration, corrects for motion and Doppler distortion, and uses a Doppler-based constraint to improve the velocity estimate. Result: Validated on public datasets and self-collected data, with an average relative translation error of 0.26% (with a gyroscope) and 0.18% (with Doppler-enabling modulation). Conclusion: The method performs well in adverse weather and feature-deprived environments, and the real-time implementation is open source. Abstract: A renaissance in radar-based sensing for mobile robotic applications is underway. Compared to cameras or lidars, millimetre-wave radars have the ability to 'see' through thin walls, vegetation, and adversarial weather conditions such as heavy rain, fog, snow, and dust. In this paper, we propose a novel SE(2) odometry approach for spinning frequency-modulated continuous-wave radars. Our method performs scan-to-local-map registration of the incoming radar data in a direct manner using all the radar intensity information without the need for feature or point cloud extraction. The method performs locally continuous trajectory estimation and accounts for both motion and Doppler distortion of the radar scans. If the radar possesses a specific frequency modulation pattern that makes radial Doppler velocities observable, an additional Doppler-based constraint is formulated to improve the velocity estimate and enable odometry in geometrically feature-deprived scenarios (e.g., featureless tunnels). Our method has been validated on over 250 km of on-road data sourced from public datasets (Boreas and MulRan) and collected using our automotive platform. With the aid of a gyroscope, it outperforms state-of-the-art methods and achieves an average relative translation error of 0.26% on the Boreas leaderboard. When using data with the appropriate Doppler-enabling frequency modulation pattern, the translation error is reduced to 0.18% in similar environments. We also benchmarked our algorithm using 1.5 hours of data collected with a mobile robot in off-road environments with various levels of structure to demonstrate its versatility. Our real-time implementation is publicly available: https://github.com/utiasASRL/dro.
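
The abstract does not give the form of the Doppler-based constraint. As a generic illustration of why radial Doppler velocities make ego velocity observable even without geometric features, the sketch below fits a planar velocity to radial velocities of static targets by least squares. The sign convention (a static target at azimuth `theta` shows radial velocity `-(vx*cos(theta) + vy*sin(theta))`), the function names, and the static-scene assumption are all illustrative assumptions, not details from the paper.

```python
import math


def radial_velocity(vx, vy, azimuth):
    """Radial velocity of a static target seen at `azimuth` (assumed sign convention)."""
    return -(vx * math.cos(azimuth) + vy * math.sin(azimuth))


def estimate_ego_velocity(azimuths, radial_velocities):
    """Least-squares fit of (vx, vy) from radial velocities via the 2x2 normal equations."""
    scc = scs = sss = bc = bs = 0.0
    for theta, vr in zip(azimuths, radial_velocities):
        c, s = math.cos(theta), math.sin(theta)
        scc += c * c
        scs += c * s
        sss += s * s
        bc -= c * vr
        bs -= s * vr
    det = scc * sss - scs * scs
    if abs(det) < 1e-12:
        # All azimuths are (anti)parallel: only one velocity component is observable.
        raise ValueError("azimuths do not span two directions")
    # Cramer's rule on [[scc, scs], [scs, sss]] @ [vx, vy] = [bc, bs].
    vx = (bc * sss - scs * bs) / det
    vy = (scc * bs - scs * bc) / det
    return vx, vy
```

Measurements from at least two distinct directions are enough to recover both velocity components, which is why such a constraint can help in places like featureless tunnels where intensity-based registration is poorly constrained.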

[131] Hydra: Marker-Free RGB-D Hand-Eye Calibration

Martin Huber,Huanyu Tian,Christopher E. Mower,Lucas-Raphael Müller,Sébastien Ourselin,Christos Bergeles,Tom Vercauteren

Main category: cs.RO

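TL;DR: Proposes a marker-free hand-eye calibration method based on RGB-D imaging, using a new implementation of the ICP algorithm with a robust point-to-plane objective; experiments show clearly better convergence and accuracy than existing methods.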

Details

Motivation: Traditional hand-eye calibration relies on markers, which limits where it can be applied, while marker-free methods suffer from slow convergence and limited accuracy. The paper aims for an efficient, accurate marker-free calibration method. Method: Based on RGB-D imaging, it uses a new ICP implementation with a robust point-to-plane objective formulated on a Lie algebra, and needs only a few robot configurations. Result: With only three random robot configurations, about 90% of calibrations succeed, a 2-3x higher rate of convergence to the global optimum than the baselines; convergence time is two orders of magnitude faster than other marker-free methods; task-space accuracy is 5 mm versus 7 mm for classical approaches. Conclusion: The method markedly improves the efficiency and accuracy of marker-free hand-eye calibration, is suitable for practical deployment, and its code and dataset are open source. Abstract: This work presents an RGB-D imaging-based approach to marker-free hand-eye calibration using a novel implementation of the iterative closest point (ICP) algorithm with a robust point-to-plane (PTP) objective formulated on a Lie algebra. Its applicability is demonstrated through comprehensive experiments using three well-known serial manipulators and two RGB-D cameras. With only three randomly chosen robot configurations, our approach achieves approximately 90% successful calibrations, demonstrating 2-3x higher convergence rates to the global optimum compared to both marker-based and marker-free baselines. We also report 2 orders of magnitude faster convergence time (0.8 +/- 0.4 s) for 9 robot configurations over other marker-free methods. Our method exhibits significantly improved accuracy (5 mm in task space) over classical approaches (7 mm in task space) whilst being marker-free. The benchmarking dataset and code are open sourced under Apache 2.0 License, and a ROS 2 integration with robot abstraction is provided to facilitate deployment.
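
As a rough illustration of what a robust point-to-plane objective measures, the sketch below evaluates the point-to-plane residual for already-aligned point pairs and sums a Huber loss over them. The Huber kernel, the function names, and the `delta` parameter are assumptions chosen for illustration; the abstract does not state which robust kernel is used, and the pose optimization on the Lie algebra is deliberately left out.

```python
def point_to_plane_residual(source_point, target_point, target_normal):
    """Signed distance from the source point to the tangent plane at the target point."""
    return sum(
        n * (s - t) for s, t, n in zip(source_point, target_point, target_normal)
    )


def huber_loss(residual, delta=1.0):
    """Quadratic near zero, linear beyond `delta`, so outliers are down-weighted."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual * residual
    return delta * (magnitude - 0.5 * delta)


def robust_ptp_cost(sources, targets, normals, delta=1.0):
    """Sum of Huber losses over point-to-plane residuals of matched pairs."""
    return sum(
        huber_loss(point_to_plane_residual(s, t, n), delta)
        for s, t, n in zip(sources, targets, normals)
    )
```

Because the residual only penalizes displacement along the surface normal, a point that slides within the tangent plane contributes zero cost, which is the usual reason point-to-plane ICP tends to converge faster than point-to-point ICP on smooth surfaces.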

[132] Learning a General Model: Folding Clothing with Topological Dynamics

Yiming Liu,Lijun Han,Enlin Gu,Hesheng Wang

Main category: cs.RO

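TL;DR: Proposes a clothing-folding method based on a topological dynamics model, which represents the clothing state as a topological graph and uses an improved graph neural network to predict and control clothing deformation.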

Details

Motivation: Garments have many degrees of freedom and a complex structure, so traditional methods struggle to fold them efficiently. Method: Generates a topological graph from semantic segmentation and keypoint detection, and learns the dynamics model with an improved graph neural network. Result: Experiments validate the algorithm's effectiveness in recognizing and folding complex clothing such as jackets. Conclusion: The method offers a feasible solution for automated folding of complex clothing. Abstract: The high degrees of freedom and complex structure of garments present significant challenges for clothing manipulation. In this paper, we propose a general topological dynamics model to fold complex clothing. By utilizing the visible folding structure as the topological skeleton, we design a novel topological graph to represent the clothing state. This topological graph is low-dimensional and applied for complex clothing in various folding states. It indicates the constraints of clothing and enables predictions regarding clothing movement. To extract graphs from self-occlusion, we apply semantic segmentation to analyze the occlusion relationships and decompose the clothing structure. The decomposed structure is then combined with keypoint detection to generate the topological graph. To analyze the behavior of the topological graph, we employ an improved Graph Neural Network (GNN) to learn the general dynamics. The GNN model can predict the deformation of clothing and is employed to calculate the deformation Jacobi matrix for control. Experiments using jackets validate the algorithm's effectiveness in recognizing and folding complex clothing with self-occlusion.

[133] A Survey on Event-based Optical Marker Systems

Nafiseh Jabbari Tofighi,Maxime Robic,Fabio Morbidi,Pascal Vasseur

Main category: cs.RO

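TL;DR: This paper surveys Event-Based Optical Marker Systems (EBOMS), covering their basic principles, technical characteristics, application areas, and future research directions.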

Details

Motivation: The low latency, high dynamic range, and low power consumption of event cameras have changed machine perception, and combining them with optical markers (e.g., AprilTags, LED arrays) opens new research directions. Method: Analyzes the asynchronous operating principle of EBOMS and their robustness under adverse lighting conditions, and summarizes the related technologies. Result: EBOMS are applied in object detection and tracking, pose estimation, and optical communication. Conclusion: EBOMS form an emerging multidisciplinary field, and the survey closes with a discussion of possible future research directions. Abstract: The advent of event-based cameras, with their low latency, high dynamic range, and reduced power consumption, marked a significant change in robotic vision and machine perception. In particular, the combination of these neuromorphic sensors with widely-available passive or active optical markers (e.g. AprilTags, arrays of blinking LEDs), has recently opened up a wide field of possibilities. This survey paper provides a comprehensive review on Event-Based Optical Marker Systems (EBOMS). We analyze the basic principles and technologies on which these systems are based, with a special focus on their asynchronous operation and robustness against adverse lighting conditions. We also describe the most relevant applications of EBOMS, including object detection and tracking, pose estimation, and optical communication. The article concludes with a discussion of possible future research directions in this rapidly-emerging and multidisciplinary field.

Table of Contents

cs.CV [Back]

[1] Can Geometry Save Central Views for Sports Field Registration?

[2] Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment

[3] Edge-Based Learning for Improved Classification Under Adversarial Noise

[4] VideoMultiAgents: A Multi-Agent Framework for Video Question Answering

[5] Long-Distance Field Demonstration of Imaging-Free Drone Identification in Intracity Environments

[6] An on-production high-resolution longitudinal neonatal fingerprint database in Brazil

[7] Forging and Removing Latent-Noise Diffusion Watermarks Using a Single Image

[8] A Transformer-based Multimodal Fusion Model for Efficient Crowd Counting Using Visual and Wireless Signals

[9] Integration Flow Models

[10] Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains

[11] Remote Sensing Imagery for Flood Detection: Exploration of Augmentation Strategies

[12] FreBIS: Frequency-Based Stratification for Neural Implicit Surface Representations

[13] Improving trajectory continuity in drone-based crowd monitoring using a set of minimal-cost techniques and deep discriminative correlation filters

[14] Physics-Informed Diffusion Models for SAR Ship Wake Generation from Text Prompts

[15] Image Interpolation with Score-based Riemannian Metrics of Diffusion Models

[16] DeepAndes: A Self-Supervised Vision Foundation Model for Multi-Spectral Remote Sensing Imagery of the Andes

[17] Dynamic Contextual Attention Network: Transforming Spatial Representations into Adaptive Insights for Endoscopic Polyp Diagnosis

[18] Fine Grain Classification: Connecting Meta using Cross-Contrastive pre-training

[19] MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation

[20] TTTFusion: A Test-Time Training-Based Strategy for Multimodal Medical Image Fusion in Surgical Robots

[21] Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems

[22] Sparse2DGS: Geometry-Prioritized Gaussian Splatting for Surface Reconstruction from Sparse Views

[23] GSFeatLoc: Visual Localization Using Feature Correspondence on 3D Gaussian Splatting

[24] Neural Stereo Video Compression with Hybrid Disparity Compensation

[25] FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding

[26] GarmentX: Autoregressive Parametric Representations for High-Fidelity 3D Garment Generation

[27] Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks

[28] AI Assisted Cervical Cancer Screening for Cytology Samples in Developing Countries

[29] PixelHacker: Image Inpainting with Structural and Semantic Consistency

[30] LMM4Gen3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs

[31] Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception

[32] Large-scale visual SLAM for in-the-wild videos

[33] Style-Adaptive Detection Transformer for Single-Source Domain Generalized Object Detection

[34] MambaMoE: Mixture-of-Spectral-Spatial-Experts State Space Model for Hyperspectral Image Classification

[35] SteelBlastQC: Shot-blasted Steel Surface Dataset with Interpretable Detection of Surface Defects

[36] Dynamic Attention Analysis for Backdoor Detection in Text-to-Image Diffusion Models

[37] Geometry-aware Temporal Aggregation Network for Monocular 3D Lane Detection

[38] Beyond the Horizon: Decoupling UAVs Multi-View Action Recognition via Partial Order Transfer

[39] Autoencoder Models for Point Cloud Environmental Synthesis from WiFi Channel State Information: A Preliminary Study

[40] PartHOI: Part-based Hand-Object Interaction Transfer via Generalized Cylinders

[41] Purifying, Labeling, and Utilizing: A High-Quality Pipeline for Small Object Detection

[42] EfficientHuman: Efficient Training and Reconstruction of Moving Human using Articulated 2D Gaussian

[43] AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation

[44] LDPoly: Latent Diffusion for Polygonal Road Outline Extraction in Large-Scale Topographic Mapping

[45] SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data

[46] Image deidentification in the XNAT ecosystem: use cases and solutions

[47] Advance Fake Video Detection via Vision Transformers

[48] FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection

[49] Occlusion-aware Driver Monitoring System using the Driver Monitoring Dataset

[50] OG-HFYOLO: Orientation gradient guidance and heterogeneous feature fusion for deformation table cell instance segmentation

[51] Efficient Listener: Dyadic Facial Motion Synthesis via Action Diffusion

[52] In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

[53] Adept: Annotation-Denoising Auxiliary Tasks with Discrete Cosine Transform Map and Keypoint for Human-Centric Pretraining

[54] GaussTrap: Stealthy Poisoning Attacks on 3D Gaussian Splatting for Targeted Scene Confusion

[55] CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation

[56] RadSAM: Segmenting 3D radiological images with a 2D promptable model

[57] FedMVP: Federated Multi-modal Visual Prompt Tuning for Vision-Language Models

[58] AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection

[59] FLIM-based Salient Object Detection Networks with Adaptive Decoders

[60] Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers

[61] DS_FusionNet: Dynamic Dual-Stream Fusion with Bidirectional Knowledge Distillation for Plant Disease Recognition

[62] SVD Based Least Squares for X-Ray Pneumonia Classification Using Deep Features

[63] TesserAct: Learning 4D Embodied World Models

[64] X-Fusion: Introducing New Modality to Frozen Large Language Models

[65] YoChameleon: Personalized Vision and Language Generation

cs.GR [Back]

[66] Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-constrained Gaussian Splatting

[67] Mìmir: A real-time interactive visualization library for CUDA programs

cs.CL [Back]

[68] It's the same but not the same: Do LLMs distinguish Spanish varieties?

[69] Evaluating Large Language Models on Multiword Expressions in Multilingual and Code-Switched Contexts

[70] Understanding and Mitigating Risks of Generative AI in Financial Services

[71] Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

[72] MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools

[73] A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports

[74] A Platform for Generating Educational Activities to Teach English as a Second Language

[75] Enhancing Systematic Reviews with Large Language Models: Using GPT-4 and Kimi

[76] UD-English-CHILDES: A Collected Resource of Gold and Silver Universal Dependencies Trees for Child Language Interactions