cs.CV

Xiwen Li,Ross Whitaker,Tolga Tasdizen

Main category: cs.CV

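TL;DR: AVIVDNetv2 is a transformer-based end-to-end detection network that substantially improves idling vehicle detection through a cross-modal transformer and a multiscale visual feature fusion module.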

Details

Motivation: Idling vehicle detection (IVD) helps reduce pollution and emissions by dynamically messaging drivers to curb idling. Existing methods align the audio and visual modalities poorly, which leads to unsatisfactory detection. Method: Proposes AVIVDNetv2, which combines a cross-modal transformer, global patch-level learning, a multiscale visual feature fusion module, and decoupled detection heads. Result: Experiments show that AVIVDNetv2 improves mAP by 7.66 and 9.42 over the two baselines (disjoint and E2E, respectively), with consistently strong results across all vehicle categories. Conclusion: AVIVDNetv2 sets a new performance benchmark on the AVIVD dataset and outperforms existing methods. Abstract: Idling vehicle detection (IVD) supports real-time systems that reduce pollution and emissions by dynamically messaging drivers to curb excess idling behavior. In computer vision, IVD has become an emerging task that leverages video from surveillance cameras and audio from remote microphones to localize and classify vehicles in each frame as moving, idling, or engine-off. As with other cross-modal tasks, the key challenge lies in modeling the correspondence between audio and visual modalities, which differ in representation but provide complementary cues -- video offers spatial and motion context, while audio conveys engine activity beyond the visual field. The previous end-to-end model, which uses a basic attention mechanism, struggles to align these modalities effectively, often missing vehicle detections. To address this issue, we propose AVIVDNetv2, a transformer-based end-to-end detection network. It incorporates a cross-modal transformer with global patch-level learning, a multiscale visual feature fusion module, and decoupled detection heads. Extensive experiments show that AVIVDNetv2 improves mAP by 7.66 over the disjoint baseline and 9.42 over the E2E baseline, with consistent AP gains across all vehicle categories. Furthermore, AVIVDNetv2 outperforms the state-of-the-art method for sounding object localization, establishing a new performance benchmark on the AVIVD dataset.

[2] Shape Your Ground: Refining Road Surfaces Beyond Planar Representations

Oussema Dhaouadi,Johannes Meier,Jacques Kaiser,Daniel Cremers

Main category: cs.CV

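TL;DR: FlexRoad is a NURBS-based road surface reconstruction framework that uses the ECSRC algorithm to reduce roughness and fitting error, clearly outperforming existing methods.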

Details

Motivation: Existing road surface reconstruction methods suffer from artifacts and inconsistencies, and simplified representations sacrifice accuracy. Method: Fits NURBS surfaces to 3D road points and applies the ECSRC algorithm for anomaly correction. Result: Performs well on the GeRoD and DSC3D datasets and is insensitive to input source and noise. Conclusion: FlexRoad is a generic method for high-quality road surface modeling. Abstract: Road surface reconstruction from aerial images is fundamental for autonomous driving, urban planning, and virtual simulation, where smoothness, compactness, and accuracy are critical quality factors. Existing reconstruction methods often produce artifacts and inconsistencies that limit usability, while downstream tasks tend to represent roads as planes for simplicity but at the cost of accuracy. We introduce FlexRoad, the first framework to directly address road surface smoothing by fitting Non-Uniform Rational B-Splines (NURBS) surfaces to 3D road points obtained from photogrammetric reconstructions or geodata providers. Our method at its core utilizes the Elevation-Constrained Spatial Road Clustering (ECSRC) algorithm for robust anomaly correction, significantly reducing surface roughness and fitting errors. To facilitate quantitative comparison between road surface reconstruction methods, we present GeoRoad Dataset (GeRoD), a diverse collection of road surface and terrain profiles derived from openly accessible geodata. Experiments on GeRoD and the photogrammetry-based DeepScenario Open 3D Dataset (DSC3D) demonstrate that FlexRoad considerably surpasses commonly used road surface representations across various metrics while being insensitive to various input sources, terrains, and noise types. By performing ablation studies, we identify the key role of each component towards high-quality reconstruction performance, making FlexRoad a generic method for realistic road surface modeling.

[3] Persistence-based Hough Transform for Line Detection

Johannes Ferner,Stefan Huber,Saverio Messineo,Angel Pop,Martin Uray

Main category: cs.CV

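TL;DR: The paper proposes a new voting technique based on persistent homology to improve peak detection in the Hough transform; it significantly outperforms conventional thresholding and is more robust.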

Details

Motivation: Thresholding-based voting in the conventional Hough transform is sensitive to noise and artifacts, which limits its performance. Method: Replaces thresholding with persistent homology to detect peaks in Hough space. Result: Experiments on synthetic data show that the new method significantly outperforms the original method and is more robust. Conclusion: The paper calls for broader integration of topological data analysis techniques into existing methods and opens a discussion of the mathematical stability of the Hough transform as a route to better robustness. Abstract: The Hough transform is a popular and classical technique in computer vision for the detection of lines (or more general objects). It maps a pixel into a dual space -- the Hough space: each pixel is mapped to the set of lines through this pixel, which forms a curve in Hough space. The detection of lines then becomes a voting process to find those lines that received many votes by pixels. However, this voting is done by thresholding, which is susceptible to noise and other artifacts. In this work, we present an alternative voting technique to detect peaks in the Hough space based on persistent homology, which very naturally addresses limitations of simple thresholding. Experiments on synthetic data show that our method significantly outperforms the original method, while also demonstrating enhanced robustness. This work seeks to inspire future research in two key directions. First, we highlight the untapped potential of Topological Data Analysis techniques and advocate for their broader integration into existing methods, including well-established ones. Secondly, we initiate a discussion on the mathematical stability of the Hough transform, encouraging exploration of mathematically grounded improvements to enhance its robustness.
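To make the idea of persistence-based peak selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes the Hough accumulator is a small 2D grid of vote counts and ranks local maxima by their 0-dimensional persistence in the superlevel-set filtration, using a union-find structure and 8-connectivity. The function name `peak_persistence` and the grid layout are illustrative assumptions.

```python
def peak_persistence(acc):
    """Rank local maxima of a 2D grid by 0-dimensional superlevel-set persistence.

    Returns a list of (persistence, (row, col)) sorted from most to least persistent.
    """
    rows, cols = len(acc), len(acc[0])
    # Visit cells from the highest vote count to the lowest.
    cells = sorted(
        ((acc[r][c], (r, c)) for r in range(rows) for c in range(cols)),
        key=lambda item: -item[0],
    )
    parent = {}  # union-find forest over the cells added so far
    peak = {}    # root cell -> (birth value, peak location)
    pairs = []

    def find(cell):
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]
            cell = parent[cell]
        return cell

    for value, cell in cells:
        parent[cell] = cell
        peak[cell] = (value, cell)
        r, c = cell
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                neighbour = (r + dr, c + dc)
                if neighbour == cell or neighbour not in parent:
                    continue
                old, young = find(cell), find(neighbour)
                if old == young:
                    continue
                # Elder rule: the component with the lower peak dies at this value.
                if peak[old][0] < peak[young][0]:
                    old, young = young, old
                birth, location = peak[young]
                if birth > value:
                    pairs.append((birth - value, location))
                parent[young] = old

    # The component of the global maximum never dies.
    pairs.append((float("inf"), peak[find(cells[0][1])][1]))
    return sorted(pairs, key=lambda item: -item[0])
```

In such a sketch, one would keep the peaks whose persistence exceeds a chosen cutoff instead of thresholding the raw vote counts, so that a tall peak sitting on a noisy plateau is not split into several detections.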

[4] Context-Awareness and Interpretability of Rare Occurrences for Discovery and Formalization of Critical Failure Modes

Sridevi Polavaram,Xin Zhou,Meenu Ravi,Mohammad Zarei,Anmol Srivastava

Main category: cs.CV

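TL;DR: The CAIRO framework uses an ontology-based, human-in-the-loop approach to detect and formalize rare failure cases in AI models, improving the robustness and interpretability of object detection models in automated driving systems.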

Details

Motivation: Vision systems are increasingly deployed in critical domains (such as surveillance, law enforcement, and transportation), but their vulnerability to rare or unforeseen scenarios poses significant safety risks. Method: Proposes CAIRO, an ontology-based framework with human involvement that detects and formalizes critical phenomena (CP) in AI models, such as misdetections, adversarial attacks, and hallucinations. Result: In automated driving systems, CAIRO formalizes failure cases of object detection models as shareable knowledge graphs (OWL/XML format) that support downstream analysis and logical reasoning. Conclusion: CAIRO offers a scalable and interpretable way to formalize rare failures of AI models, strengthening system safety and accountability. Abstract: Vision systems are increasingly deployed in critical domains such as surveillance, law enforcement, and transportation. However, their vulnerabilities to rare or unforeseen scenarios pose significant safety risks. To address these challenges, we introduce Context-Awareness and Interpretability of Rare Occurrences (CAIRO), an ontology-based human-assistive discovery framework for failure cases (or CP - Critical Phenomena) detection and formalization. CAIRO by design incentivizes human-in-the-loop for testing and evaluation of criticality that arises from misdetections, adversarial attacks, and hallucinations in AI black-box models. Our robust analysis of object detection model(s) failures in automated driving systems (ADS) showcases scalable and interpretable ways of formalizing the observed gaps between camera perception and real-world contexts, resulting in test cases stored as explicit knowledge graphs (in OWL/XML format) amenable for sharing, downstream analysis, logical reasoning, and accountability.

[5] MonoTher-Depth: Enhancing Thermal Depth Estimation via Confidence-Aware Distillation

Xingxing Zuo,Nikhil Ranganathan,Connor Lee,Georgia Gkioxari,Soon-Jo Chung

Main category: cs.CV

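TL;DR: The paper proposes a new method that enhances depth estimation from thermal images through knowledge distillation from an RGB depth estimation model, significantly improving accuracy.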

Details

Motivation: Depth estimation from thermal images is crucial for robotic systems in adverse conditions (such as fog, smoke, and low light), but the limited amount of labeled data restricts generalization. Method: Uses a confidence-aware distillation method that relies on the RGB model's predicted confidence to selectively strengthen the thermal model. Result: On new scenarios without labeled depth, the method reduces the absolute relative error by 22.88%. Conclusion: The method significantly improves the accuracy of thermal depth estimation and broadens its applicability. Abstract: Monocular depth estimation (MDE) from thermal images is a crucial technology for robotic systems operating in challenging conditions such as fog, smoke, and low light. The limited availability of labeled thermal data constrains the generalization capabilities of thermal MDE models compared to foundational RGB MDE models, which benefit from datasets of millions of images across diverse scenarios. To address this challenge, we introduce a novel pipeline that enhances thermal MDE through knowledge distillation from a versatile RGB MDE model. Our approach features a confidence-aware distillation method that utilizes the predicted confidence of the RGB MDE to selectively strengthen the thermal MDE model, capitalizing on the strengths of the RGB model while mitigating its weaknesses. Our method significantly improves the accuracy of the thermal MDE, independent of the availability of labeled depth supervision, and greatly expands its applicability to new scenarios. In our experiments on new scenarios without labeled depth, the proposed confidence-aware distillation method reduces the absolute relative error of thermal MDE by 22.88% compared to the baseline without distillation.
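As a rough illustration of the two quantities mentioned above, the sketch below computes the standard absolute relative error metric and a confidence-weighted L1 distillation loss. The weighting scheme is an assumption made for illustration only; the abstract does not specify how the RGB model's confidence enters the loss, and the function names are hypothetical. Depth maps are flattened to plain lists to keep the example dependency-free.

```python
def abs_rel(pred, gt):
    """Absolute relative error: mean of |pred - gt| / gt over pixels with valid depth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def confidence_weighted_l1(student, teacher, confidence):
    """L1 distance to the teacher's depth, weighted per pixel by the teacher's confidence.

    Assumption: confidence values lie in [0, 1]; pixels with zero confidence are ignored.
    """
    total_weight = sum(confidence)
    if total_weight == 0:
        return 0.0
    weighted = sum(c * abs(s - t) for s, t, c in zip(student, teacher, confidence))
    return weighted / total_weight
```

With such a weighting, pixels where the RGB teacher is unreliable contribute little to the student's training signal, which matches the abstract's stated goal of using the teacher's strengths while limiting the effect of its weaknesses.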

[6] Hybrid Knowledge Transfer through Attention and Logit Distillation for On-Device Vision Systems in Agricultural IoT

Stanley Mugisha,Rashid Kisitu,Florence Tushabe

Main category: cs.CV

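TL;DR: Proposes a hybrid knowledge distillation framework that transfers knowledge from a Swin Transformer teacher to a MobileNetV3 student, enabling high-accuracy plant disease classification on edge devices.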

Details

Motivation: Resolves the tension between the high computational complexity of Vision Transformers and the resource limits of edge devices while maintaining high accuracy. Method: Uses adaptive attention alignment and a dual-loss function that optimizes both class probabilities and spatial attention. Result: The distilled MobileNetV3 reaches 92.4% accuracy on the PlantVillage-Tomato dataset, with a 95% reduction in computation and 82% lower inference latency. Conclusion: The method enables real-time, efficient crop monitoring on edge devices and shows that ViT-level diagnostic precision is feasible there. Abstract: Integrating deep learning applications into agricultural IoT systems faces a serious challenge of balancing the high accuracy of Vision Transformers (ViTs) with the efficiency demands of resource-constrained edge devices. Large transformer models like the Swin Transformers excel in plant disease classification by capturing global-local dependencies. However, their computational complexity (34.1 GFLOPs) limits applications and renders them impractical for real-time on-device inference. Lightweight models such as MobileNetV3 and TinyML would be suitable for on-device inference but lack the required spatial reasoning for fine-grained disease detection. To bridge this gap, we propose a hybrid knowledge distillation framework that synergistically transfers logit and attention knowledge from a Swin Transformer teacher to a MobileNetV3 student model. Our method includes the introduction of adaptive attention alignment to resolve cross-architecture mismatch (resolution, channels) and a dual-loss function optimizing both class probabilities and spatial focus. On the PlantVillage-Tomato dataset (18,160 images), the distilled MobileNetV3 attains 92.4% accuracy relative to 95.9% for Swin-L but at a 95% reduction on PC and < 82% in inference latency on IoT devices (23 ms on PC CPU and 86 ms/image on smartphone CPUs). Key innovations include IoT-centric validation metrics (13 MB memory, 0.22 GFLOPs) and dynamic resolution-matching attention maps. Comparative experiments show significant improvements over standalone CNNs and prior distillation methods, with a 3.5% accuracy gain over MobileNetV3 baselines. Significantly, this work advances real-time, energy-efficient crop monitoring in precision agriculture and demonstrates how we can attain ViT-level diagnostic precision on edge devices. Code and models will be made available for replication after acceptance.
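The dual-loss idea (logit distillation plus attention distillation) can be sketched as follows. This is a generic formulation under stated assumptions, not the paper's exact loss: the logit term is the usual temperature-softened KL divergence, the attention term is a mean squared error between attention maps that are assumed to be already resized to the same length, and `temperature` and `alpha` are hypothetical hyperparameters.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hybrid_kd_loss(student_logits, teacher_logits, student_attn, teacher_attn,
                   temperature=4.0, alpha=0.5):
    """Weighted sum of a logit-distillation term and an attention-matching term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    logit_term = temperature ** 2 * kl_divergence(p_teacher, p_student)
    # Assumption: both attention maps are flattened and already have the same size.
    attn_term = sum((s - t) ** 2 for s, t in zip(student_attn, teacher_attn)) / len(teacher_attn)
    return alpha * logit_term + (1 - alpha) * attn_term
```

In a real training loop a supervised cross-entropy term on the ground-truth labels would typically be added as well; it is left out here to keep the sketch focused on the two distillation terms.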

[7] SAIP-Net: Enhancing Remote Sensing Image Segmentation via Spectral Adaptive Information Propagation

Zhongtao Wang,Xizhe Cao,Yisong Chen,Guoping Wang

Main category: cs.CV

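TL;DR: SAIP-Net significantly improves semantic segmentation of remote sensing images through spectral adaptive information propagation and enhanced multi-scale receptive fields.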

Details

Motivation: Conventional hierarchical models struggle with precise spatial boundaries and intra-class consistency in semantic segmentation of remote sensing images. Method: Proposes SAIP-Net, which combines adaptive frequency filtering with multi-scale receptive field enhancement to suppress intra-class feature inconsistencies and sharpen boundaries. Result: Experiments show that SAIP-Net performs significantly better than existing methods. Conclusion: Combining spectral-adaptive strategies with expanded receptive fields is highly effective for remote sensing image segmentation. Abstract: Semantic segmentation of remote sensing imagery demands precise spatial boundaries and robust intra-class consistency, challenging conventional hierarchical models. To address limitations arising from spatial domain feature fusion and insufficient receptive fields, this paper introduces SAIP-Net, a novel frequency-aware segmentation framework that leverages Spectral Adaptive Information Propagation. SAIP-Net employs adaptive frequency filtering and multi-scale receptive field enhancement to effectively suppress intra-class feature inconsistencies and sharpen boundary lines. Comprehensive experiments demonstrate significant performance improvements over state-of-the-art methods, highlighting the effectiveness of spectral-adaptive strategies combined with expanded receptive fields for remote sensing image segmentation.

[8] Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends

Mohammad Abu Tami,Mohammed Elhenawy,Huthaifa I. Ashqar

Main category: cs.CV

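TL;DR: This paper examines the potential of multimodal large language models (MLLMs) to improve traffic safety by integrating multimodal data for holistic scene understanding, addressing the shortcomings of traditional ADAS.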

Details

Motivation: Traditional ADAS perform poorly in dynamic real-world scenarios, so more advanced solutions are needed. Method: Analyzes the ability of MLLMs to integrate visual, spatial, and environmental data and reviews their performance in perception, decision-making, and adversarial robustness. Result: MLLMs show potential to improve traffic safety, especially in scene understanding and proactive risk mitigation. Conclusion: MLLMs are poised to become a cornerstone of next-generation traffic safety systems, offering scalable, context-aware solutions. Abstract: Traffic safety remains a critical global challenge, with traditional Advanced Driver-Assistance Systems (ADAS) often struggling in dynamic real-world scenarios due to fragmented sensor processing and susceptibility to adversarial conditions. This paper reviews the transformative potential of Multimodal Large Language Models (MLLMs) in addressing these limitations by integrating cross-modal data such as visual, spatial, and environmental inputs to enable holistic scene understanding. Through a comprehensive analysis of MLLM-based approaches, we highlight their capabilities in enhancing perception, decision-making, and adversarial robustness, while also examining the role of key datasets (e.g., KITTI, DRAMA, ML4RoadSafety) in advancing research. Furthermore, we outline future directions, including real-time edge deployment, causality-driven reasoning, and human-AI collaboration. By positioning MLLMs as a cornerstone for next-generation traffic safety systems, this review underscores their potential to revolutionize the field, offering scalable, context-aware solutions that proactively mitigate risks and improve overall road safety.

[9] Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

Jingchao Wang,Hong Wang,Wenlong Zhang,Kunhua Ji,Dingjiang Huang,Yefeng Zheng

Main category: cs.CV

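TL;DR: The PLVL framework uses progressive language-guided visual learning to address two problems in multi-task visual grounding: language information is not fully injected into visual feature extraction, and the sub-tasks do not collaborate enough.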

Details

Motivation: Existing methods fall short in injecting language information into visual feature extraction and in exploiting collaboration between tasks; PLVL aims to address both. Method: Proposes the PLVL framework, which injects language information progressively without an extra cross-modal fusion module and uses a multi-task head for collaborative prediction. Result: PLVL significantly outperforms existing methods on several benchmark datasets. Conclusion: Through language guidance and multi-task collaboration, PLVL improves multi-task visual grounding performance. Abstract: Multi-task visual grounding (MTVG) includes two sub-tasks, i.e., Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). The existing representative approaches generally follow the research pipeline which mainly consists of three core procedures, including independent feature extraction for visual and linguistic modalities, respectively, cross-modal interaction module, and independent prediction heads for different sub-tasks. Albeit achieving remarkable performance, this research line has two limitations: 1) The linguistic content has not been fully injected into the entire visual backbone for boosting more effective visual feature extraction and it needs an extra cross-modal interaction module; 2) The relationship between REC and RES tasks is not effectively exploited to help the collaborative prediction for more accurate output. To deal with these problems, in this paper, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature expression of the visual modality itself but also progressively injects the language information to help learn linguistic-related visual features. In this manner, our PLVL does not need an additional cross-modal fusion module while fully introducing the language guidance. Furthermore, we analyze that the localization center for REC would help identify the to-be-segmented object region for RES to some extent. Inspired by this investigation, we design a multi-task head to accomplish collaborative predictions for these two sub-tasks. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that our PLVL clearly outperforms the representative methods in both REC and RES tasks.
https://github.com/jcwang0602/PLVL

[10] Classification of Firn Data via Topological Features

Sarah Day,Jesse Dimino,Matt Jester,Kaitlin Keegan,Thomas Weighill

Main category: cs.CV

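TL;DR: The paper evaluates topological features for classifying firn images and discusses the advantages, limitations, and trade-offs of topological featurization.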

Details

Motivation: Studies how the topological and geometric structure of firn (layers of granular snow in glaciers that have not been compressed into ice) changes with depth, using topological data analysis (TDA) to reveal the relationship between depth and structure. Method: Uses two classes of topological features (sublevel set features and distance transform features) together with persistence curves to predict sample depth from microCT images. Result: Experiments show that the methods perform differently across scenarios, revealing complex trade-offs among accuracy, interpretability, and generalizability. Conclusion: Topological features are promising for firn image classification, but the strengths and weaknesses of each method must be weighed against the specific need. Abstract: In this paper we evaluate the performance of topological features for generalizable and robust classification of firn image data, with the broader goal of understanding the advantages, pitfalls, and trade-offs in topological featurization. Firn refers to layers of granular snow within glaciers that haven't been compressed into ice. This compactification process imposes distinct topological and geometric structure on firn that varies with depth within the firn column, making topological data analysis (TDA) a natural choice for understanding the connection between depth and structure. We use two classes of topological features, sublevel set features and distance transform features, together with persistence curves, to predict sample depth from microCT images. A range of challenging training-test scenarios reveals that no one choice of method dominates in all categories, and uncovers a web of trade-offs between accuracy, interpretability, and generalizability.

[11] A detection-task-specific deep-learning method to improve the quality of sparse-view myocardial perfusion SPECT images

Zezhang Yang,Zitong Yu,Nuri Choi,Abhinav K. Jha

Main category: cs.CV

TL;DR: Proposes a deep-learning method for sparse-view myocardial perfusion imaging that aims to shorten scan time and improve image quality.

Details

Motivation: Conventional SPECT imaging has long scan times, which can cause patient discomfort and inaccurate diagnoses, while reducing the number of projection angles degrades image quality. Method: Uses a detection-task-specific deep-learning method with an observer loss term to optimize perfusion defect detection performance. Result: On the task of detecting myocardial perfusion defects, the AUC is significantly higher than with the sparse-view protocol, and the structure of the left ventricle wall is restored. Conclusion: Preliminary results indicate that the method is effective and merits further evaluation. Abstract: Myocardial perfusion imaging (MPI) with single-photon emission computed tomography (SPECT) is a widely used and cost-effective diagnostic tool for coronary artery disease. However, the lengthy scanning time in this imaging procedure can cause patient discomfort, motion artifacts, and potentially inaccurate diagnoses due to misalignment between the SPECT scans and the CT-scans which are acquired for attenuation compensation. Reducing projection angles is a potential way to shorten scanning time, but this can adversely impact the quality of the reconstructed images. To address this issue, we propose a detection-task-specific deep-learning method for sparse-view MPI SPECT images. This method integrates an observer loss term that penalizes the loss of anthropomorphic channel features with the goal of improving performance in the perfusion defect-detection task. We observed that, on the task of detecting myocardial perfusion defects, the proposed method yielded an area under the receiver operating characteristic (ROC) curve (AUC) significantly larger than the sparse-view protocol. Further, the proposed method was observed to be able to restore the structure of the left ventricle wall, demonstrating the ability to overcome sparse-sampling artifacts. Our preliminary results motivate further evaluations of the method.

[12] CLIP-IT: CLIP-based Pairing for Histology Images Classification

Banafsheh Karimian,Giulia Avanzato,Soufian Belharbi,Luke McCaffrey,Mohammadhadi Shateri,Eric Granger

Main category: cs.CV

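TL;DR: CLIP-IT trains a medical image classifier with external text as additional information, without manually paired data, improving performance while reducing privacy and cost concerns.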

Details

Motivation: Addresses the need for large paired datasets in multimodal learning, reducing the privacy and cost burden. Method: Uses a CLIP model to match images with external text and build an augmented dataset; distills the textual information into the image classifier through knowledge distillation; and applies parameter-efficient fine-tuning to handle modality misalignment. Result: Outperforms unimodal classifiers on the PCAM, CRC, and BACH datasets. Conclusion: CLIP-IT is a low-cost, efficient method that makes full use of external textual information to improve medical image classification. Abstract: Multimodal learning has shown significant promise for improving medical image analysis by integrating information from complementary data sources. This is widely employed for training vision-language models (VLMs) for cancer detection based on histology images and text reports. However, one of the main limitations in training these VLMs is the requirement for large paired datasets, raising concerns over privacy, and data collection, annotation, and maintenance costs. To address this challenge, we introduce the CLIP-IT method to train a vision backbone model to classify histology images by pairing them with privileged textual information from an external source. At first, the modality pairing step relies on a CLIP-based model to match histology images with semantically relevant textual report data from external sources, creating an augmented multimodal dataset without the need for manually paired samples. Then, we propose a multimodal training procedure that distills the knowledge from the paired text modality to the unimodal image classifier for enhanced performance without the need for the textual data during inference. A parameter-efficient fine-tuning method is used to efficiently address the misalignment between the main (image) and paired (text) modalities. During inference, the improved unimodal histology classifier is used, with only minimal additional computational complexity. Our experiments on challenging PCAM, CRC, and BACH histology image datasets show that CLIP-IT can provide a cost-effective approach to leverage privileged textual information and outperform unimodal classifiers for histology.

[13] DeepCS-TRD, a Deep Learning-based Cross-Section Tree Ring Detector

Henry Marichal,Verónica Casaravilla,Candice Power,Karolain Mello,Joaquín Mazarino,Christine Lucas,Ludmila Profumo,Diego Passarella,Gregory Randall

Main category: cs.CV

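TL;DR: Deep CS-TRD is a deep-learning-based automatic tree ring detection algorithm that applies to multiple image domains and tree species, outperforms existing methods, and is released with public datasets and source code.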

Details

Motivation: Studies automatic tree ring detection across different species and image acquisition conditions, filling a gap left by existing methods. Method: Replaces the edge detection step of CS-TRD with a U-Net, making the method applicable to images acquired by microscope, scanner, or smartphone. Result: Outperforms existing methods on macro images (Pinus taeda and Gleditsia triacanthos) and performs slightly worse on microscopy images (Salix glauca). Conclusion: Deep CS-TRD is the first automatic tree ring detection method aimed at such varied species and acquisition conditions; the datasets and code are public. Abstract: Here, we propose Deep CS-TRD, a new automatic algorithm for detecting tree rings in whole cross-sections. It substitutes the edge detection step of CS-TRD by a deep-learning-based approach (U-Net), which allows the application of the method to different image domains: microscopy, scanner or smartphone acquired, and species (Pinus taeda, Gleditsia triacanthos and Salix glauca). Additionally, we introduce two publicly available datasets of annotated images to the community. The proposed method outperforms state-of-the-art approaches in macro images (Pinus taeda and Gleditsia triacanthos) while showing slightly lower performance in microscopy images of Salix glauca. To our knowledge, this is the first paper that studies automatic tree ring detection for such different species and acquisition conditions. The dataset and source code are available in https://github.com/hmarichal93/deepcstrd

[14] Naturally Computed Scale Invariance in the Residual Stream of ResNet18

André Longon

Main category: cs.CV

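TL;DR: The paper studies how the residual stream in ResNet18 computes scale invariance through element-wise summation of scale-equivariant representations and explores its causal link to scale-robust object recognition.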

Details

Motivation: Investigates how neural networks achieve object recognition that is invariant to image transformations such as scale changes, filling a gap in the study of networks with other architectures such as ResNet18. Method: Analyzes the residual stream of ResNet18, observing how convolutional channels in intermediate blocks obtain scale invariance through element-wise summation of scale-equivariant representations, and uses ablation experiments to test the causal link to recognition behavior. Result: Finds that the residual stream computes scale invariance through element-wise summation, with tentative evidence for its role in scale-robust object recognition. Conclusion: The residual stream may be a key mechanism by which ResNet18 achieves scale invariance; the code is open-sourced for further study. Abstract: An important capacity in visual object recognition is invariance to image-altering variables which leave the identity of objects unchanged, such as lighting, rotation, and scale. How do neural networks achieve this? Prior mechanistic interpretability research has illuminated some invariance-building circuitry in InceptionV1, but the results are limited and networks with different architectures have remained largely unexplored. This work investigates ResNet18 with a particular focus on its residual stream, an architectural component which InceptionV1 lacks. We observe that many convolutional channels in intermediate blocks exhibit scale invariant properties, computed by the element-wise residual summation of scale equivariant representations: the block input's smaller-scale copy with the block pre-sum output's larger-scale copy. Through subsequent ablation experiments, we attempt to causally link these neural properties with scale-robust object recognition behavior. Our tentative findings suggest how the residual stream computes scale invariance and its possible role in behavior. Code is available at: https://github.com/cest-andre/residual-stream-interp

[15] MetaHarm: Harmful YouTube Video Dataset Annotated by Domain Experts, GPT-4-Turbo, and Crowdworkers

Wonjeong Jo,Magdalena Wojcieszak

Main category: cs.CV

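TL;DR: The paper presents two large-scale datasets for measuring and analyzing harmful content on short video platforms, with multi-modal and multi-category annotations.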

Details

Motivation: Harmful content (such as misinformation and hate speech) on short video platforms (such as YouTube, Instagram, and TikTok) lacks comprehensive understanding and measurement. Method: Builds two datasets: 60,906 potentially harmful YouTube videos and 19,422 videos annotated by experts, GPT-4-Turbo, and crowdworkers, covering six categories of harm. Result: The datasets provide binary and multi-label classifications along with agreement data across annotators, supporting future work on identifying and mitigating harmful content. Conclusion: The datasets will advance research on harmful content on short video platforms and help develop multi-modal classification tools. Abstract: Short video platforms, such as YouTube, Instagram, or TikTok, are used by billions of users. These platforms expose users to harmful content, ranging from clickbait or physical harms to hate or misinformation. Yet, we lack a comprehensive understanding and measurement of online harm on short video platforms. Toward this end, we present two large-scale datasets of multi-modal and multi-categorical online harm: (1) 60,906 systematically selected potentially harmful YouTube videos and (2) 19,422 videos annotated by three labeling actors: trained domain experts, GPT-4-Turbo (using 14 image frames, 1 thumbnail, and text metadata), and crowdworkers (Amazon Mechanical Turk master workers). The annotated dataset includes both (a) binary classification (harmful vs. harmless) and (b) multi-label categorizations of six harm categories: Information, Hate and harassment, Addictive, Clickbait, Sexual, and Physical harms. Furthermore, the annotated dataset provides (1) ground truth data with videos annotated consistently across (a) all three actors and (b) the majority of the labeling actors, and (2) three data subsets labeled by individual actors. These datasets are expected to facilitate future work on online harm, aid in (multi-modal) classification efforts, and advance the identification and potential mitigation of harmful content on video platforms.

[16] SignX: The Foundation Model for Sign Recognition

Sen Fang,Chunyu Sui,Hongwei Yi,Carol Neidle,Dimitris N. Metaxas

Main category: cs.CV

TL;DR: SignX is a framework for sign recognition that achieves high-accuracy recognition through two-stage training (Pose2Gloss and Video2Pose).

Details

Motivation: Sign language data processing is complex; existing methods rely on RGB video and pose information to produce English-based ID glosses, but there is no shared glossing convention. Method: SignX comprises Pose2Gloss (based on an inverse diffusion model) and Video2Pose (based on ViT) and integrates multiple sources of pose information. Result: Experiments show that SignX recognizes signs from sign language video more accurately than existing methods. Conclusion: SignX lays a foundation for the common pose estimation needed for sign recognition and is compatible with existing pose formats. Abstract: The complexity of sign language data processing brings many challenges. The current approach to recognition of ASL signs aims to translate RGB sign language videos through pose information into English-based ID glosses, which serve to uniquely identify ASL signs. Note that there is no shared convention for assigning such glosses to ASL signs, so it is essential that the same glossing conventions are used for all of the data in the datasets that are employed. This paper proposes SignX, a foundation model framework for sign recognition. It is a concise yet powerful framework applicable to multiple human activity recognition scenarios. First, we developed a Pose2Gloss component based on an inverse diffusion model, which contains a multi-track pose fusion layer that unifies five of the most powerful pose information sources--SMPLer-X, DWPose, Mediapipe, PrimeDepth, and Sapiens Segmentation--into a single latent pose representation. Second, we trained a Video2Pose module based on ViT that can directly convert raw video into signer pose representation. Through this 2-stage training framework, we enable sign language recognition models to be compatible with existing pose formats, laying the foundation for the common pose estimation necessary for sign recognition. Experimental results show that SignX can recognize signs from sign language video, producing predicted gloss representations with greater accuracy than has been reported in prior work.

[17] Almost Right: Making First-layer Kernels Nearly Orthogonal Improves Model Generalization

Colton R. Crum,Adam Czajka

Main category: cs.CV

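TL;DR: Proposes a new loss function that regularizes the filter kernels of the first convolutional layer to be nearly orthogonal, improving model generalization without architectural changes.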

Details

Motivation: Inspired by human perceptual intelligence, the work orthogonalizes filter kernels to improve generalization to unknown samples. Method: Designs a loss that pushes the filter kernels of the first convolutional layer toward near-orthogonality while letting the network choose which pairs to make orthogonal. Result: Significantly outperforms existing methods across three architectures (ResNet-50, DenseNet-121, ViT-b-16) and two tasks (iris presentation attack detection and chest X-ray anomaly detection). Conclusion: The proposed loss effectively improves generalization and requires no changes to the network architecture. Abstract: An ongoing research challenge within several domains in computer vision is how to increase model generalization capabilities. Several attempts to improve model generalization performance are heavily inspired by human perceptual intelligence, which is remarkable in both its performance and efficiency to generalize to unknown samples. Many of these methods attempt to force portions of the network to be orthogonal, following some observation within neuroscience related to early vision processes. In this paper, we propose a loss component that regularizes the filtering kernels in the first convolutional layer of a network to make them nearly orthogonal. Deviating from previous works, we give the network flexibility in which pairs of kernels it makes orthogonal, allowing the network to navigate to a better solution space, imposing harsh penalties. Without architectural modifications, we report substantial gains in generalization performance using the proposed loss against previous works (including orthogonalization- and saliency-based regularization methods) across three different architectures (ResNet-50, DenseNet-121, ViT-b-16) and two difficult open-set recognition tasks: presentation attack detection in iris biometrics, and anomaly detection in chest X-ray images.
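For readers unfamiliar with orthogonality regularizers, the following is a minimal sketch of a plain soft-orthogonality penalty on first-layer kernels. It is a simpler stand-in and not the loss proposed in the paper: the paper's mechanism for letting the network choose which kernel pairs become orthogonal is not described in the abstract and is not reproduced here. Kernels are assumed to be flattened into plain lists, and the function name is hypothetical.

```python
import math


def orthogonality_penalty(kernels):
    """Mean squared cosine similarity over all distinct pairs of flattened kernels.

    The value is 0.0 when every pair is orthogonal and 1.0 when every pair is parallel.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm > 0 else 0.0

    count = len(kernels)
    pairs = [(i, j) for i in range(count) for j in range(i + 1, count)]
    if not pairs:
        return 0.0
    return sum(cosine(kernels[i], kernels[j]) ** 2 for i, j in pairs) / len(pairs)
```

A penalty of this kind would be added to the task loss with a small weight, so that the first-layer filters are nudged toward orthogonality during training without any change to the architecture.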

[18] CLPSTNet: A Progressive Multi-Scale Convolutional Steganography Model Integrating Curriculum Learning

Fengchun Liu,Tong Zhang,Chunying Zhang

Main category: cs.CV

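TL;DR: The paper proposes CLPSTNet, a curriculum-learning-based progressive multi-scale convolutional network, to address the invisibility and security problems of CNN-based image steganography.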

Details

Motivation: Traditional steganography relies on hand-crafted features and prior-knowledge design; CNNs allow information embedding to be learned autonomously, but image complexity means invisibility and security problems remain. Method: CLPSTNet consists of multiple progressive multi-scale convolutional modules that integrate Inception structures and dilated convolutions, extracting multi-scale features progressively from shallow to deep. Result: Experiments show that CLPSTNet achieves high PSNR, SSIM, and decoding accuracy on several datasets, and that the steganographic images it generates have low steganalysis scores. Conclusion: Through progressive multi-scale feature extraction, CLPSTNet significantly improves the invisibility and security of steganography. Abstract: In recent years, a large number of works have introduced Convolutional Neural Networks (CNNs) into image steganography, transforming traditional steganography methods based on hand-crafted features and prior knowledge design into methods in which neural networks autonomously learn information embedding. However, due to the inherent complexity of digital images, issues of invisibility and security persist when using CNN models for information embedding. In this paper, we propose Curriculum Learning Progressive Steganography Network (CLPSTNet). The network consists of multiple progressive multi-scale convolutional modules that integrate Inception structures and dilated convolutions. The module contains multiple branching pathways, starting from a smaller convolutional kernel and dilation rate, extracting the basic, local feature information from the feature map, and gradually expanding to the convolution with a larger convolutional kernel and dilation rate for perceiving the feature information of a larger receptive field, so as to realize the multi-scale feature extraction from shallow to deep, and from fine to coarse, allowing the shallow secret information features to be refined in different fusion stages. The experimental results show that the proposed CLPSTNet not only has high PSNR and SSIM metrics and decoding accuracy on three large public datasets, ALASKA2, VOC2012 and ImageNet, but also that the steganographic images generated by CLPSTNet have low steganalysis scores. You can find our code at https://github.com/chaos-boops/CLPSTNet.
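The link between kernel size, dilation rate, and receptive field described above follows a standard formula: a kernel of size k with dilation d covers d(k - 1) + 1 input positions. The sketch below applies it; the branch configurations in the usage note are made-up examples and are not taken from CLPSTNet, and stacking the layers sequentially with stride 1 is an assumption for illustration.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span of input positions covered by one dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1


def receptive_field(layers):
    """Receptive field of stride-1 convolutions applied in sequence.

    layers is an iterable of (kernel_size, dilation) pairs.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += effective_kernel_size(kernel_size, dilation) - 1
    return field
```

For example, a 3x3 kernel with dilation 2 spans 5 positions per axis, and a hypothetical stack of (3, 1), (3, 2), and (5, 3) layers would see 19 positions per axis, which illustrates how larger kernels and dilation rates widen the context without adding parameters per tap.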

[19] Revisiting Radar Camera Alignment by Contrastive Learning for 3D Object Detection

Linhua Kong,Dongxia Chang,Lian Liu,Zisen Kong,Pengyuan Li,Yao Zhao

Main category: cs.CV

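TL;DR: Proposes RCAlign, a new alignment model that uses a dual-route alignment module and a radar feature enhancement module to address radar-camera feature alignment, achieving the best performance on the nuScenes benchmark.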

Details

Motivation: When aligning radar and camera features, existing methods either neglect inter-modal feature interaction or fail to effectively align features at the same spatial location. Method: Designs a contrastive-learning-based Dual-Route Alignment (DRA) module and a Radar Feature Enhancement (RFE) module. Result: Achieves the best performance on the nuScenes benchmark, with a significant gain in real-time 3D detection (4.3% NDS and 8.4% mAP). Conclusion: By improving feature alignment and enhancing radar features, RCAlign significantly improves 3D object detection based on radar-camera fusion. Abstract: Recently, 3D object detection algorithms based on radar and camera fusion have shown excellent performance, setting the stage for their application in autonomous driving perception tasks. Existing methods have focused on dealing with feature misalignment caused by the domain gap between radar and camera. However, existing methods either neglect inter-modal feature interaction during alignment or fail to effectively align features at the same spatial location across modalities. To alleviate the above problems, we propose a new alignment model called Radar Camera Alignment (RCAlign). Specifically, we design a Dual-Route Alignment (DRA) module based on contrastive learning to align and fuse the features between radar and camera. Moreover, considering the sparsity of radar BEV features, a Radar Feature Enhancement (RFE) module is proposed to improve the densification of radar BEV features with the knowledge distillation loss. Experiments show RCAlign achieves a new state-of-the-art on the public nuScenes benchmark in radar camera fusion for 3D Object Detection. Furthermore, RCAlign achieves a significant performance gain (4.3% NDS and 8.4% mAP) in real-time 3D detection compared to the latest state-of-the-art method (RCBEVDet).

[20] SaENeRF: Suppressing Artifacts in Event-based Neural Radiance Fields

Yuanjian Wang,Yufei Deng,Rong Xiao,Jiahao Fan,Chenwei Tang,Deng Xiong,Jiancheng Lv

Main category: cs.CV

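TL;DR: SaENeRF is a self-supervised framework that normalizes radiance variations and introduces regularization losses, significantly reducing artifacts in event-camera NeRF reconstruction and improving reconstruction quality.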

Details

Motivation: Event cameras have advantages in high-speed scenarios, but existing event-based NeRF methods suffer from artifacts and noise and need improvement. Method: Proposes the SaENeRF framework, which normalizes radiance variations and designs regularization losses to suppress artifacts and improve reconstruction quality. Result: Experiments show that SaENeRF significantly reduces artifacts and achieves better reconstruction quality than existing methods. Conclusion: SaENeRF provides an efficient, high-quality self-supervised solution for event-camera NeRF reconstruction. Abstract: Event cameras are neuromorphic vision sensors that asynchronously capture changes in logarithmic brightness, offering significant advantages such as low latency, low power consumption, low bandwidth, and high dynamic range. While these characteristics make them ideal for high-speed scenarios, reconstructing geometrically consistent and photometrically accurate 3D representations from event data remains fundamentally challenging. Current event-based Neural Radiance Fields (NeRF) methods partially address these challenges but suffer from persistent artifacts caused by aggressive network learning in early stages and the inherent noise of event cameras. To overcome these limitations, we present SaENeRF, a novel self-supervised framework that effectively suppresses artifacts and enables 3D-consistent, dense, and photorealistic NeRF reconstruction of static scenes solely from event streams. Our approach normalizes predicted radiance variations based on accumulated event polarities, facilitating progressive and rapid learning for scene representation construction. Additionally, we introduce regularization losses specifically designed to suppress artifacts in regions where photometric changes fall below the event threshold and simultaneously enhance the light intensity difference of non-zero events, thereby improving the visual fidelity of the reconstructed scene. Extensive qualitative and quantitative experiments demonstrate that our method significantly reduces artifacts and achieves superior reconstruction quality compared to existing methods. The code is available at https://github.com/Mr-firework/SaENeRF.

[21] Assessing the Feasibility of Internet-Sourced Video for Automatic Cattle Lameness Detection

Md Fahimuzzman Sohan

Main category: cs.CV

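TL;DR: The study presents a deep learning-based model for detecting cattle lameness, sickness, or gait abnormalities from publicly available video data; the 3D CNN model performs best, reaching 90% accuracy.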

Details

Motivation: Cattle lameness is often caused by hoof injuries or interdigital dermatitis; it causes pain and impairs physiological activities such as walking and feeding, so efficient detection methods are needed. Method: Uses a dataset of 50 videos (40 cattle), applies data augmentation to improve robustness, and classifies with two deep learning models, ConvLSTM2D and 3D CNN. Result: The 3D CNN performs well, with a video-level classification accuracy of 90% and precision, recall, and F1-score all around 90.9%; the ConvLSTM2D model reaches 85% accuracy. Conclusion: Deep learning models, especially the 3D CNN, can effectively classify cattle lameness while simplifying the traditional multi-stage pipeline, showing the potential of learning spatiotemporal features directly from video. Abstract: Cattle lameness, often caused by hoof injuries or interdigital dermatitis, leads to pain and significantly impacts essential physiological activities such as walking, feeding, and drinking. This study presents a deep learning-based model for detecting cattle lameness, sickness, or gait abnormalities using publicly available video data. The dataset consists of 50 unique videos from 40 individual cattle, recorded from various angles in both indoor and outdoor environments. Half of the dataset represents naturally walking (normal/non-lame) cattle, while the other half consists of cattle exhibiting gait abnormalities (lame). To enhance model robustness and generalizability, data augmentation was applied to the training data. The pre-processed videos were then classified using two deep learning models: ConvLSTM2D and 3D CNN. A comparative analysis of the results demonstrates strong classification performance. Specifically, the 3D CNN model achieved a video-level classification accuracy of 90%, with precision, recall, and F1-score of 90.9%, 90.9%, and 90.91% respectively. The ConvLSTM2D model exhibited a slightly lower accuracy of 85%. This study highlights the effectiveness of directly applying classification models to learn spatiotemporal features from video data, offering an alternative to traditional multi-stage approaches that typically involve object detection, pose estimation, and feature extraction. In addition, the findings demonstrate that the proposed deep learning models, particularly the 3D CNN, effectively classify and detect lameness in cattle while simplifying the processing pipeline.
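For readers who want to see how accuracy, precision, recall, and F1-score relate at the video level, here is a small worked sketch. The confusion counts are hypothetical: the abstract does not give the test-split size, and these numbers were picked only because they yield figures of the same magnitude as those reported.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a 20-video test split (lame = positive class).
metrics = classification_metrics(tp=10, fp=1, fn=1, tn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts, accuracy is 18/20 and precision, recall, and F1 are all 10/11, which shows how a 90% accuracy can coexist with scores near 90.9%.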

[22] PixelWeb: The First Web GUI Dataset with Pixel-Wise Labels

Qi Yang,Weichen Bi,Haiyang Shen,Yaoqi Guo,Yun Ma

Main category: cs.CV

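TL;DR: PixelWeb is a large-scale GUI dataset that combines visual feature extraction with DOM structure analysis to provide high-quality BBox annotations, significantly improving GUI element detection performance.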

Details

Motivation: Automatic labeling makes BBox annotations in existing GUI datasets inaccurate, and these datasets provide only visual annotations, which limits the development of downstream GUI tasks. Method: Uses two modules, channel derivation and layer analysis, combining BGRA four-channel bitmap annotations with DOM structure analysis to ensure annotation accuracy. Result: On the mAP95 metric, PixelWeb achieves performance 3-7 times better than existing datasets. Conclusion: PixelWeb can significantly improve downstream tasks such as GUI generation and automated user interaction. Abstract: Graphical User Interface (GUI) datasets are crucial for various downstream tasks. However, GUI datasets often generate annotation information through automatic labeling, which commonly results in inaccurate GUI element BBox annotations, including missing, duplicate, or meaningless BBoxes. These issues can degrade the performance of models trained on these datasets, limiting their effectiveness in real-world applications. Additionally, existing GUI datasets only provide BBox annotations visually, which restricts the development of visually related GUI downstream tasks. To address these issues, we introduce PixelWeb, a large-scale GUI dataset containing over 100,000 annotated web pages. PixelWeb is constructed using a novel automatic annotation approach that integrates visual feature extraction and Document Object Model (DOM) structure analysis through two core modules: channel derivation and layer analysis. Channel derivation ensures accurate localization of GUI elements in cases of occlusion and overlapping elements by extracting BGRA four-channel bitmap annotations. Layer analysis uses the DOM to determine the visibility and stacking order of elements, providing precise BBox annotations. Additionally, PixelWeb includes comprehensive metadata such as element images, contours, and mask annotations. Manual verification by three independent annotators confirms the high quality and accuracy of PixelWeb annotations. Experimental results on GUI element detection tasks show that PixelWeb achieves performance on the mAP95 metric that is 3-7 times better than existing datasets. We believe that PixelWeb has great potential for performance improvement in downstream tasks such as GUI generation and automated user interaction.

[23] FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing

Hariseetharam Gunduboina,Muhammad Haris Khan,Biplab Banerjee

Main category: cs.CV

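TL;DR: FrogDogNet is a new prompt learning framework that combines Fourier frequency filtering with self-attention to improve remote sensing scene classification and domain generalization.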

Details

Motivation: Large-scale vision-language models such as CLIP excel in general computer vision, but their potential for domain generalization in remote sensing remains underexplored. Existing methods rely on full-image features, which introduce noise and background interference and cause misclassification. Method: FrogDogNet uses Fourier frequency filtering and self-attention to selectively retain low-frequency components and remove noise, extracting key features for prompt learning. Result: On four remote sensing datasets and three domain generalization tasks, FrogDogNet outperforms existing prompt learning methods and adapts better across domains. Conclusion: Retaining frequency-based invariant features is highly effective for domain generalization and paves the way for broader applications. Abstract: In recent years, large-scale vision-language models (VLMs) like CLIP have gained attention for their zero-shot inference using instructional text prompts. While these models excel in general computer vision, their potential for domain generalization in remote sensing (RS) remains underexplored. Existing approaches enhance prompt learning by generating visual prompt tokens but rely on full-image features, introducing noise and background artifacts that vary within a class, causing misclassification. To address this, we propose FrogDogNet, a novel prompt learning framework integrating Fourier frequency filtering and self-attention to improve RS scene classification and domain generalization. FrogDogNet selectively retains invariant low-frequency components while eliminating noise and irrelevant backgrounds, ensuring robust feature representation across domains. The model first extracts significant features via projection and self-attention, then applies frequency-based filtering to preserve essential structural information for prompt learning. Extensive experiments on four RS datasets and three domain generalization tasks show that FrogDogNet consistently outperforms state-of-the-art prompt learning methods, demonstrating superior adaptability across domain shifts. Our findings highlight the effectiveness of frequency-based invariant feature retention in generalization, paving the way for broader applications. Our code is available at https://github.com/HariseetharamG/FrogDogNet
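The notion of "retaining low-frequency components" can be illustrated with a toy low-pass filter. The sketch below works on a 1D signal with a naive discrete Fourier transform in pure Python; it is not FrogDogNet's filtering module, and the function names and the `cutoff` parameter are assumptions made for illustration.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def low_pass(signal, cutoff):
    """Zero every frequency bin whose absolute frequency exceeds `cutoff`."""
    n = len(signal)
    spectrum = dft(signal)
    # Bins k and n - k carry the same absolute frequency.
    kept = [c if min(k, n - k) <= cutoff else 0
            for k, c in enumerate(spectrum)]
    return [value.real for value in idft(kept)]


# Constant "structure" of 1.0 plus a fast alternating "noise" term.
signal = [1.0 + 0.5 * (-1) ** t for t in range(8)]
print([round(v, 6) for v in low_pass(signal, cutoff=1)])
```

The alternating term lives in the highest-frequency bin, so the filter removes it and leaves the constant component.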

[24] Marginalized Generalized IoU (MGIoU): A Unified Objective Function for Optimizing Any Convex Parametric Shapes

Duy-Tho Le,Trung Pham,Jianfei Cai,Hamid Rezatofighi

Main category: cs.CV

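TL;DR: The paper proposes new loss functions, MGIoU and MGIoU+, that unify the objective for parametric shape optimization, address the shortcomings of existing methods, and perform well in experiments.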

Details

Motivation: Existing objective functions for parametric shape optimization have shortcomings: regression losses correlate poorly with IoU, IoU losses are unstable and limited to simple shapes, and task-specific methods are computationally heavy and do not generalize. Method: Projects structured convex shapes onto their unique shape normals and computes a one-dimensional normalized GIoU, yielding MGIoU; MGIoU+ extends this to unstructured convex shapes. Result: MGIoU and MGIoU+ outperform existing losses on standard benchmarks, reduce loss computation latency by 10-40x, and satisfy metric properties and scale invariance. Conclusion: The MGIoU family of losses unifies parametric shape optimization and is efficient, general, and robust. Abstract: Optimizing the similarity between parametric shapes is crucial for numerous computer vision tasks, where Intersection over Union (IoU) stands as the canonical measure. However, existing optimization methods exhibit significant shortcomings: regression-based losses like L1/L2 lack correlation with IoU, IoU-based losses are unstable and limited to simple shapes, and task-specific methods are computationally intensive and not generalizable across domains. As a result, the current landscape of parametric shape objective functions has become scattered, with each domain proposing distinct IoU approximations. To address this, we unify the parametric shape optimization objective functions by introducing Marginalized Generalized IoU (MGIoU), a novel loss function that overcomes these challenges by projecting structured convex shapes onto their unique shape normals to compute one-dimensional normalized GIoU. MGIoU offers a simple, efficient, fully differentiable approximation strongly correlated with IoU. We then extend MGIoU to MGIoU+ that supports optimizing unstructured convex shapes. Together, MGIoU and MGIoU+ unify parametric shape optimization across diverse applications. Experiments on standard benchmarks demonstrate that MGIoU and MGIoU+ consistently outperform existing losses while reducing loss computation latency by 10-40x. Additionally, MGIoU and MGIoU+ satisfy metric properties and scale-invariance, ensuring robustness as an objective function. We further propose MGIoU- for minimizing overlaps in tasks like collision-free trajectory prediction. Code is available at https://ldtho.github.io/MGIoU
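To give a feel for "projecting convex shapes onto their normals and computing a one-dimensional GIoU", here is a rough sketch for 2D convex polygons. It is not the authors' implementation: the choice of axes (edge normals of both polygons), the plain averaging, and all function names are assumptions, and the paper's exact normalization may differ.

```python
import math


def edge_normals(poly):
    """Unit normals of each edge of a polygon given as [(x, y), ...].

    Assumes no zero-length edges.
    """
    normals = []
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        ex, ey = x2 - x1, y2 - y1
        length = math.hypot(ex, ey)
        normals.append((-ey / length, ex / length))
    return normals


def project(poly, axis):
    """Interval covered by the polygon when projected onto `axis`."""
    dots = [x * axis[0] + y * axis[1] for x, y in poly]
    return min(dots), max(dots)


def giou_1d(a, b):
    """Generalized IoU of two 1D intervals (lo, hi) with non-zero length."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    hull = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union - (hull - union) / hull


def projected_giou(poly_a, poly_b):
    """Mean 1D GIoU over the edge normals of both polygons."""
    axes = edge_normals(poly_a) + edge_normals(poly_b)
    scores = [giou_1d(project(poly_a, ax), project(poly_b, ax)) for ax in axes]
    return sum(scores) / len(scores)


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted = [(x + 3.0, y) for x, y in square]
print(projected_giou(square, square), projected_giou(square, shifted))
```

Identical shapes score 1, and the score drops as the shapes separate along any projection axis, so `1 - projected_giou(...)` could serve as a loss under these assumptions.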

[25] Cross Paradigm Representation and Alignment Transformer for Image Deraining

Shun Zou,Yi Zou,Juncheng Li,Guangwei Gao,Guojun Qi

Main category: cs.CV

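TL;DR: Proposes CPRAformer, a new Transformer architecture that integrates global-local and spatial-channel representations to handle the irregular rain patterns and complex geometric overlaps of image deraining.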

Details

Motivation: Existing single-paradigm architectures struggle with irregular rain patterns and complex geometric overlaps, so a unified framework is needed to integrate complementary global-local and spatial-channel representations. Method: Proposes CPRAformer, which uses a hierarchical representation and alignment strategy, combines sparse prompt channel self-attention (SPC-SA) with spatial pixel refinement self-attention (SPR-SA), and introduces the Adaptive Alignment Frequency Module (AAFM) for feature alignment. Result: Achieves state-of-the-art performance on eight benchmark datasets and shows robustness in other image restoration tasks and downstream applications. Conclusion: Through a cross-paradigm dynamic interaction framework, CPRAformer extracts the most valuable fused information from the two paradigms and significantly improves image deraining. Abstract: Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). Its core idea is the hierarchical representation and alignment, leveraging the strengths of both paradigms (spatial-channel and global-local) to aid image reconstruction. It bridges the gap within and between paradigms, aligning and coordinating them to enable deep interaction and fusion of features. Specifically, we use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA). SPC-SA enhances global channel dependencies through dynamic sparsity, while SPR-SA focuses on spatial rain distribution and fine-grained texture recovery. To address the feature misalignment and knowledge differences between them, we introduce the Adaptive Alignment Frequency Module (AAFM), which aligns and interacts with features in a two-stage progressive manner, enabling adaptive guidance and complementarity. This reduces the information gap within and between paradigms. Through this unified cross-paradigm dynamic interaction framework, we achieve the extraction of the most valuable interactive fusion information from the two paradigms. Extensive experiments demonstrate that our model achieves state-of-the-art performance on eight benchmark datasets and further validates CPRAformer's robustness in other image restoration tasks and downstream applications.

[26] MTSGL: Multi-Task Structure Guided Learning for Robust and Interpretable SAR Aircraft Recognition

Qishan He,Lingjun Zhao,Ru Luo,Siqian Zhang,Lin Lei,Kefeng Ji,Gangyao Kuang

Main category: cs.CV

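TL;DR: The paper proposes a multi-task structure guided learning network (MTSGL) for robust and interpretable aircraft recognition in SAR imagery, combining structural semantics with geometric consistency.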

Details

Motivation: Current SAR aircraft recognition algorithms lack a deep understanding of aircraft structural knowledge, so methods closer to human cognition are needed. Method: Introduces a structure-based annotation approach and proposes the MTSGL network, which comprises a classification task, a structural semantic awareness (SSA) module, and a structural consistency regularization (SCR) module. Result: Validated on the self-constructed MT-SARD dataset, MTSGL shows superior robustness and interpretability. Conclusion: Through structure guided learning and expert-level prior knowledge, MTSGL recognizes aircraft in a way analogous to human cognition. Abstract: Aircraft recognition in synthetic aperture radar (SAR) imagery is a fundamental mission in both military and civilian applications. Recently, deep learning (DL) has emerged as a dominant paradigm owing to its strong performance in extracting discriminative features. However, current classification algorithms focus primarily on learning a decision hyperplane without sufficient comprehension of aircraft structural knowledge. Inspired by the fine-grained aircraft annotation methods for optical remote sensing images (RSI), we first introduce a structure-based SAR aircraft annotation approach to provide supplementary structural and compositional information. On this basis, we propose a multi-task structure guided learning (MTSGL) network for robust and interpretable SAR aircraft recognition. Besides the classification task, MTSGL includes a structural semantic awareness (SSA) module and a structural consistency regularization (SCR) module. The SSA is designed to capture structural semantic information, which is conducive to gaining human-like comprehension of aircraft knowledge. The SCR helps maintain the geometric consistency between the aircraft structure in SAR imagery and the proposed annotation. In this process, the structural attribute can be disentangled in a geometrically meaningful manner. In conclusion, the MTSGL is presented with the expert-level aircraft prior knowledge and structure guided learning paradigm, aiming to comprehend the aircraft concept in a way analogous to the human cognitive process. Extensive experiments are conducted on a self-constructed multi-task SAR aircraft recognition dataset (MT-SARD), and the results illustrate the superior robustness and interpretability of the proposed MTSGL.

[27] RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory

Boyue Xu,Ruichao Hou,Tongwei Ren,Gangshan Wu

Main category: cs.CV

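TL;DR: Proposes an RGB-D video object segmentation method based on multi-store feature memory, using hierarchical modality selection and fusion together with the SAM model to refine segmentation results.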

Details

Motivation: Existing RGB-D segmentation methods fail to fully exploit cross-modal information and suffer from object drift during long-term prediction. Method: Designs a hierarchical modality selection and fusion module and uses the SAM model to refine segmentation masks. Result: Achieves state-of-the-art performance on the latest RGB-D VOS benchmark. Conclusion: Through multi-modal feature fusion and the use of SAM, the method significantly improves the robustness and accuracy of RGB-D video object segmentation. Abstract: RGB-Depth (RGB-D) Video Object Segmentation (VOS) aims to integrate the fine-grained texture information of RGB with the spatial geometric clues of the depth modality, boosting segmentation performance. However, off-the-shelf RGB-D segmentation methods fail to fully explore cross-modal information and suffer from object drift during long-term prediction. In this paper, we propose a novel RGB-D VOS method via multi-store feature memory for robust segmentation. Specifically, we design the hierarchical modality selection and fusion, which adaptively combines features from both modalities. Additionally, we develop a segmentation refinement module that effectively utilizes the Segment Anything Model (SAM) to refine the segmentation mask, ensuring more reliable results as memory to guide subsequent segmentation tasks. By leveraging spatio-temporal embedding and modality embedding, mixed prompts and fused images are fed into SAM to unleash its potential in RGB-D VOS. Experimental results show that the proposed method achieves state-of-the-art performance on the latest RGB-D VOS benchmark.

[28] Rethinking Generalizable Infrared Small Target Detection: A Real-scene Benchmark and Cross-view Representation Learning

Yahao Lu,Yuehui Li,Xingyuan Guo,Shuai Yuan,Yukai Shi,Liang Lin

Main category: cs.CV

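TL;DR: The paper proposes a domain-adaptation-based framework for infrared small target detection that improves generalization through cross-view channel alignment and noise-guided representation learning, and validates it on a new dataset.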

Details

Motivation: Infrared small target detection is affected by sensor type, observation conditions, and target properties, which cause differences in data distribution (domain shift) and limit cross-scenario generalization. Method: Proposes Cross-view Channel Alignment (CCA) and a Cross-view Top-K Fusion strategy, combined with noise-guided representation learning, to mitigate distribution discrepancies and the impact of noise. Result: Outperforms existing methods in detection probability (Pd), false alarm rate (Fa), and intersection over union (IoU). Conclusion: The proposed framework effectively improves the generalization of infrared small target detection, with its performance validated on the new RealScene-ISTD dataset. Abstract: Infrared small target detection (ISTD) is highly sensitive to sensor type, observation conditions, and the intrinsic properties of the target. These factors can introduce substantial variations in the distribution of acquired infrared image data, a phenomenon known as domain shift. Such distribution discrepancies significantly hinder the generalization capability of ISTD models across diverse scenarios. To tackle this challenge, this paper introduces an ISTD framework enhanced by domain adaptation. To alleviate distribution shift between datasets and achieve cross-sample alignment, we introduce Cross-view Channel Alignment (CCA). Additionally, we propose the Cross-view Top-K Fusion strategy, which integrates target information with diverse background features, enhancing the model's ability to extract critical data characteristics. To further mitigate the impact of noise on ISTD, we develop a Noise-guided Representation learning strategy. This approach enables the model to learn more noise-resistant feature representations, improving its generalization capability across diverse noisy domains. Finally, we develop a dedicated infrared small target dataset, RealScene-ISTD. Compared to state-of-the-art methods, our approach demonstrates superior performance in terms of detection probability (Pd), false alarm rate (Fa), and intersection over union (IoU). The code is available at: https://github.com/luy0222/RealScene-ISTD.

[29] PRaDA: Projective Radial Distortion Averaging

Daniil Sinitsyn,Linus Härenstam-Nielsen,Daniel Cremers

Main category: cs.CV

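TL;DR: Proposes an automatic calibration method for radially distorted cameras that requires no 3D reconstruction and achieves high accuracy by working with projective-space geometry.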

Details

Motivation: Overcome the limitations of traditional methods, which need many overlapping images or depend on learning-based approaches, while avoiding complex 3D reconstruction. Method: Works in projective space using homography geometry and proposes Projective Radial Distortion Averaging, which needs neither 3D points nor full bundle adjustment. Result: The method maintains SfM-level accuracy while simplifying calibration and supports any feature-matching approach. Conclusion: Projective Radial Distortion Averaging is an efficient and accurate radial distortion calibration method suited to challenging scenes. Abstract: We tackle the problem of automatic calibration of radially distorted cameras in challenging conditions. Accurately determining distortion parameters typically requires either 1) solving the full Structure from Motion (SfM) problem involving camera poses, 3D points, and the distortion parameters, which is only possible if many images with sufficient overlap are provided, or 2) relying heavily on learning-based methods that are comparatively less accurate. In this work, we demonstrate that distortion calibration can be decoupled from 3D reconstruction, maintaining the accuracy of SfM-based methods while avoiding many of the associated complexities. This is achieved by working in Projective Space, where the geometry is unique up to a homography, which encapsulates all camera parameters except for distortion. Our proposed method, Projective Radial Distortion Averaging, averages multiple distortion estimates in a fully projective framework without creating 3D points or running full bundle adjustment. By relying on pairwise projective relations, our method supports any feature-matching approach without constructing point tracks across multiple images.

Meng Chu,Yukang Chen,Haokun Gui,Shaozuo Yu,Yi Wang,Jiaya Jia

Main category: cs.CV

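TL;DR: TraveLLaMA is a multimodal language model designed for urban scene understanding and travel assistance; a large-scale dataset and fine-tuning experiments yield significant gains on travel-related tasks.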

Details

Motivation: Existing AI systems lack specialized knowledge and contextual understanding of urban environments and cannot meet the needs of tourism and travel planning. Method: Fine-tunes state-of-the-art vision-language models on a dataset of 220k question-answer pairs (130k text QA and 90k vision-language QA). Result: Performance improves by 6.5%-9.4%, with strong results in travel recommendation, map understanding, and scene understanding. Conclusion: TraveLLaMA outperforms general-purpose models on travel-specific tasks and sets a new standard for multimodal travel assistance systems. Abstract: Tourism and travel planning increasingly rely on digital assistance, yet existing multimodal AI systems often lack specialized knowledge and contextual understanding of urban environments. We present TraveLLaMA, a specialized multimodal language model designed for urban scene understanding and travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through a novel large-scale dataset of 220k question-answer pairs. This comprehensive dataset uniquely combines 130k text QA pairs meticulously curated from authentic travel forums with GPT-enhanced responses, alongside 90k vision-language QA pairs specifically focused on map understanding and scene comprehension. Through extensive fine-tuning experiments on state-of-the-art vision-language models (LLaVA, Qwen-VL, Shikra), we demonstrate significant performance improvements ranging from 6.5% to 9.4% in both pure text travel understanding and visual question answering tasks. Our model exhibits exceptional capabilities in providing contextual travel recommendations, interpreting map locations, and understanding place-specific imagery while offering practical information such as operating hours and visitor reviews. Comparative evaluations show TraveLLaMA significantly outperforms general-purpose models in travel-specific tasks, establishing a new benchmark for multi-modal travel assistance systems.

[31] Federated Learning of Low-Rank One-Shot Image Detection Models in Edge Devices with Scalable Accuracy and Compute Complexity

Abdul Hannaan,Zubair Shah,Aiman Erbad,Amr Mohamed,Ali Safa

Main category: cs.CV

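TL;DR: LoRa-FL is a novel federated learning framework for training low-rank one-shot image detection models on edge devices; it significantly reduces computational and communication overhead while maintaining scalable accuracy.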

Details

Motivation: Deploy image detection models efficiently on resource-constrained edge devices while reducing communication and computational overhead. Method: Combines low-rank adaptation techniques with a one-shot detection architecture and uses federated learning to collaboratively train lightweight image recognition models. Result: On MNIST and CIFAR10 (IID and non-IID settings), achieves competitive detection performance while significantly reducing communication bandwidth and compute complexity. Conclusion: LoRa-FL is a promising solution for adaptively reducing communication and compute overhead without sacrificing model accuracy. Abstract: This paper introduces a novel federated learning framework termed LoRa-FL designed for training low-rank one-shot image detection models deployed on edge devices. By incorporating low-rank adaptation techniques into one-shot detection architectures, our method significantly reduces both computational and communication overhead while maintaining scalable accuracy. The proposed framework leverages federated learning to collaboratively train lightweight image recognition models, enabling rapid adaptation and efficient deployment across heterogeneous, resource-constrained devices. Experimental evaluations on the MNIST and CIFAR10 benchmark datasets, both in an independent-and-identically-distributed (IID) and non-IID setting, demonstrate that our approach achieves competitive detection performance while significantly reducing communication bandwidth and compute complexity. This makes it a promising solution for adaptively reducing the communication and compute power overheads, while not sacrificing model accuracy.
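Why a low-rank parameterization cuts communication can be seen from a simple parameter count: a dense `d_out x d_in` weight is replaced by two factors of rank `r`, and only those factors need to be exchanged. The sketch below is generic low-rank arithmetic, not LoRa-FL's actual layer design; the layer sizes and ranks are hypothetical.

```python
def dense_params(d_out: int, d_in: int) -> int:
    """Parameters in a full d_out x d_in weight matrix."""
    return d_out * d_in


def low_rank_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a rank-`rank` factorization (d_out x r) @ (r x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical 512 x 512 layer at several ranks.
d_out, d_in = 512, 512
for rank in (2, 8, 32):
    full = dense_params(d_out, d_in)
    small = low_rank_params(d_out, d_in, rank)
    print(f"rank={rank}: {small} vs {full} parameters ({full / small:.1f}x fewer)")
```

Raising the rank trades more parameters (and bandwidth per round) for more capacity, which is one way to read the "scalable accuracy and compute complexity" in the title.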

Junrong Yue,Yifan Zhang,Chuan Qin,Bo Li,Xiaomin Lie,Xinlei Yu,Wenxin Zhang,Zhendong Zhao

Main category: cs.CV

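TL;DR: Proposes a Multi-level Fusion and Reasoning Architecture (MFRA) that improves vision-and-language navigation by hierarchically fusing multimodal features and adding a reasoning module.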

Details

Motivation: Existing methods rely on global scene or object-level features and struggle to capture complex cross-modal interactions, which limits navigation accuracy. Method: MFRA uses a hierarchical fusion mechanism to integrate multi-level features and a reasoning module that infers navigation actions through instruction-guided attention and dynamic context integration. Result: MFRA outperforms existing methods on benchmark datasets including REVERIE, R2R, and SOON. Conclusion: Multi-level modal fusion effectively improves decision-making accuracy in embodied navigation. Abstract: Vision-and-Language Navigation (VLN) aims to enable embodied agents to follow natural language instructions and reach target locations in real-world environments. While prior methods often rely on either global scene representations or object-level features, these approaches are insufficient for capturing the complex interactions across modalities required for accurate navigation. In this paper, we propose a Multi-level Fusion and Reasoning Architecture (MFRA) to enhance the agent's ability to reason over visual observations, language instructions and navigation history. Specifically, MFRA introduces a hierarchical fusion mechanism that aggregates multi-level features, ranging from low-level visual cues to high-level semantic concepts, across multiple modalities. We further design a reasoning module that leverages fused representations to infer navigation actions through instruction-guided attention and dynamic context integration. By selectively capturing and combining relevant visual, linguistic, and temporal signals, MFRA improves decision-making accuracy in complex navigation scenarios. Extensive experiments on benchmark VLN datasets including REVERIE, R2R, and SOON demonstrate that MFRA achieves superior performance compared to state-of-the-art methods, validating the effectiveness of multi-level modal fusion for embodied navigation.

Wenwei Li,Liyi Cai,Wu Chen,Anan Li

Main category: cs.CV

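TL;DR: Proposes a few-shot metric learning method based on a dual-channel attention mechanism and a pretrained vision transformer for cross-modal neuron identification; experiments show it outperforms existing methods.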

Details

Motivation: Address the challenges of matching single neurons across imaging modalities in neuroscience research, such as modality gaps and limited annotations. Method: Uses a dual-channel attention mechanism (local and global channels) to extract neuron morphology and context, combined with a hard sample mining strategy and the Circle Loss function. Result: Achieves higher Top-K accuracy and recall on two-photon and fMOST datasets; ablation studies and t-SNE visualizations confirm the effectiveness of each module. Conclusion: The method offers an effective technical solution for single-cell level matching and multimodal neuroimaging integration. Abstract: In neuroscience research, achieving single-neuron matching across different imaging modalities is critical for understanding the relationship between neuronal structure and function. However, modality gaps and limited annotations present significant challenges. We propose a few-shot metric learning method with a dual-channel attention mechanism and a pretrained vision transformer to enable robust cross-modal neuron identification. The local and global channels extract soma morphology and fiber context, respectively, and a gating mechanism fuses their outputs. To enhance the model's fine-grained discrimination capability, we introduce a hard sample mining strategy based on the MultiSimilarityMiner algorithm, along with the Circle Loss function. Experiments on two-photon and fMOST datasets demonstrate superior Top-K accuracy and recall compared to existing methods. Ablation studies and t-SNE visualizations validate the effectiveness of each module. The method also achieves a favorable trade-off between accuracy and training efficiency under different fine-tuning strategies. These results suggest that the proposed approach offers a promising technical solution for accurate single-cell level matching and multimodal neuroimaging integration.
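For context on the Circle Loss mentioned in the abstract, the following sketch computes the pair-wise form of that loss from lists of positive and negative cosine similarities. It follows the commonly cited formulation of Circle Loss, not this paper's training code; the `margin` and `gamma` values are illustrative assumptions, and the abstract does not state which settings were used.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def circle_loss(sim_pos, sim_neg, margin=0.25, gamma=64.0):
    """Pair-wise Circle Loss for one anchor.

    sim_pos / sim_neg: similarities to same-class / different-class samples.
    """
    optimum_pos, optimum_neg = 1.0 + margin, -margin
    delta_pos, delta_neg = 1.0 - margin, margin
    # Each pair is re-weighted by how far it is from its optimum.
    logits_pos = [-gamma * max(0.0, optimum_pos - s) * (s - delta_pos)
                  for s in sim_pos]
    logits_neg = [gamma * max(0.0, s - optimum_neg) * (s - delta_neg)
                  for s in sim_neg]
    z = _logsumexp(logits_pos) + _logsumexp(logits_neg)
    # softplus(z) = log(1 + exp(z)), written to avoid overflow.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


well_separated = circle_loss([0.9], [0.1])
confused = circle_loss([0.3], [0.7])
print(well_separated, confused)
```

A well-separated embedding (high positive, low negative similarity) gives a loss near zero, while confused pairs are penalized heavily, which is why the loss pairs naturally with hard sample mining.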

[34] Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes

Joan Perez,Giovanni Fusco

Main category: cs.CV

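TL;DR: SAGAI is a generative-AI-based streetscape analysis tool that uses open data and vision-language models to assess urban street scenes, supporting scalable and interpretable analysis.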

Details

Motivation: Traditional streetscape assessment is limited to morphometric properties or requires manual qualitative analysis; SAGAI aims to provide an automated, customizable solution with no proprietary software. Method: SAGAI integrates OpenStreetMap geometries, Google Street View imagery, and a lightweight LLaVA model, generating structured spatial indicators from natural language prompts and aggregating scores at the point and street levels. Result: Case studies show strong performance in urban-rural scene classification, moderate precision in commercial feature detection, and weaker but still informative sidewalk width estimates. Conclusion: SAGAI needs no task-specific training and adapts to many urban research themes, such as walkability, safety, or urban design, through prompt modification alone. Abstract: Streetscapes are an essential component of urban space. Their assessment is presently either limited to morphometric properties of their mass skeleton or requires labor-intensive qualitative evaluations of visually perceived qualities. This paper introduces SAGAI: Streetscape Analysis with Generative Artificial Intelligence, a modular workflow for scoring street-level urban scenes using open-access data and vision-language models. SAGAI integrates OpenStreetMap geometries, Google Street View imagery, and a lightweight version of the LLaVA model to generate structured spatial indicators from images via customizable natural language prompts. The pipeline includes an automated mapping module that aggregates visual scores at both the point and street levels, enabling direct cartographic interpretation. It operates without task-specific training or proprietary software dependencies, supporting scalable and interpretable analysis of urban environments. Two exploratory case studies in Nice and Vienna illustrate SAGAI's capacity to produce geospatial outputs from vision-language inference. The initial results show strong performance for binary urban-rural scene classification, moderate precision in commercial feature detection, and weaker but still informative estimates of sidewalk width. Fully deployable by any user, SAGAI can be easily adapted to a wide range of urban research themes, such as walkability, safety, or urban design, through prompt modification alone.

[35] ToF-Splatting: Dense SLAM using Sparse Time-of-Flight Depth and Multi-Frame Integration

Andrea Conti,Matteo Poggi,Valerio Cambareri,Martin R. Oswald,Stefano Mattoccia

Main category: cs.CV

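TL;DR: The paper proposes ToF-Splatting, a SLAM method based on 3D Gaussian Splatting for very sparse ToF depth data, which uses a multi-frame integration module to produce dense depth maps.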

Details

Motivation: Remove the limits that extremely sparse ToF depth places on seamless use in SLAM, while meeting the low-power requirements of mobile and AR/VR devices. Method: Uses a 3D Gaussian Splatting framework with a multi-frame integration module that fuses sparse ToF depth, monocular color, and multi-view geometry. Result: Validated on synthetic and real sparse ToF datasets, achieving leading tracking and mapping performance. Conclusion: ToF-Splatting provides an efficient solution for SLAM with extremely sparse ToF data. Abstract: Time-of-Flight (ToF) sensors provide efficient active depth sensing at relatively low power budgets; among such designs, only very sparse measurements from low-resolution sensors are considered to meet the increasingly limited power constraints of mobile and AR/VR devices. However, such extreme sparsity levels limit the seamless usage of ToF depth in SLAM. In this work, we propose ToF-Splatting, the first 3D Gaussian Splatting-based SLAM pipeline tailored for effectively using very sparse ToF input data. Our approach improves upon the state of the art by introducing a multi-frame integration module, which produces dense depth maps by merging cues from extremely sparse ToF depth, monocular color, and multi-view geometry. Extensive experiments on both synthetic and real sparse ToF datasets demonstrate the viability of our approach, as it achieves state-of-the-art tracking and mapping performance on reference datasets.

[36] Beyond Anonymization: Object Scrubbing for Privacy-Preserving 2D and 3D Vision Tasks

Murat Bilgehan Ertan,Ronak Sahu,Phuong Ha Nguyen,Kaleel Mahmood,Marten van Dijk

Main category: cs.CV

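TL;DR: ROAR is a privacy-preserving dataset obfuscation framework that removes sensitive objects rather than modifying them, combining instance segmentation with generative inpainting to protect privacy while preserving scene integrity.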

Details

Motivation: Handle sensitive objects in privacy-preserving datasets without the performance loss caused by modifying objects or dropping images. Method: Integrates instance segmentation with generative inpainting to remove identifiable entities. Result: Reaches 87.5% of the baseline AP in 2D detection; in 3D reconstruction the PSNR loss is at most 1.66 dB, while SSIM is maintained and LPIPS improves. Conclusion: ROAR shows that object removal is an effective privacy framework that balances privacy and performance, laying the groundwork for future privacy-preserving vision systems. Abstract: We introduce ROAR (Robust Object Removal and Re-annotation), a scalable framework for privacy-preserving dataset obfuscation that eliminates sensitive objects instead of modifying them. Our method integrates instance segmentation with generative inpainting to remove identifiable entities while preserving scene integrity. Extensive evaluations on 2D COCO-based object detection show that ROAR achieves 87.5% of the baseline detection average precision (AP), whereas image dropping achieves only 74.2% of the baseline AP, highlighting the advantage of scrubbing in preserving dataset utility. The degradation is even more severe for small objects due to occlusion and loss of fine-grained details. Furthermore, in NeRF-based 3D reconstruction, our method incurs a PSNR loss of at most 1.66 dB while maintaining SSIM and improving LPIPS, demonstrating superior perceptual quality. Our findings establish object removal as an effective privacy framework, achieving strong privacy guarantees with minimal performance trade-offs. The results highlight key challenges in generative inpainting, occlusion-robust segmentation, and task-specific scrubbing, setting the foundation for future advancements in privacy-preserving vision systems.

[37] CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones

Giacomo Pacini,Lorenzo Bianchi,Luca Ciampi,Nicola Messina,Giuseppe Amato,Fabrizio Falchi

Main category: cs.CV

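TL;DR: CountingDINO is the first training-free exemplar-based class-agnostic counting framework; it uses an unsupervised feature extractor, needs no annotated data, outperforms a label-free baseline, and is competitive with supervised methods.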

Details

Motivation: Current exemplar-based class-agnostic counting methods rely on labeled training data, which limits scalability and generalization. Method: Self-supervised vision backbones extract object-aware features; latent object prototypes obtained via ROI-Align serve as convolutional kernels to generate similarity maps, which are then converted into density maps. Result: On the FSC-147 benchmark the method outperforms a label-free baseline and matches or exceeds supervised methods. Conclusion: Training-free class-agnostic counting can be both scalable and competitive. Abstract: Class-agnostic counting (CAC) aims to estimate the number of objects in images without being restricted to predefined categories. However, while current exemplar-based CAC methods offer flexibility at inference time, they still rely heavily on labeled data for training, which limits scalability and generalization to many downstream use cases. In this paper, we introduce CountingDINO, the first training-free exemplar-based CAC framework that exploits a fully unsupervised feature extractor. Specifically, our approach employs self-supervised vision-only backbones to extract object-aware features, and it eliminates the need for annotated data throughout the entire proposed pipeline. At inference time, we extract latent object prototypes via ROI-Align from DINO features and use them as convolutional kernels to generate similarity maps. These are then transformed into density maps through a simple yet effective normalization scheme. We evaluate our approach on the FSC-147 benchmark, where we outperform a baseline under the same label-free setting. Our method also achieves competitive -- and in some cases superior -- results compared to training-free approaches relying on supervised backbones, as well as several fully supervised state-of-the-art methods. This demonstrates that training-free CAC can be both scalable and competitive. Website: https://lorebianchi98.github.io/CountingDINO/
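The prototype-to-density idea in the abstract can be illustrated with a toy sketch in plain Python. Several simplifications here are assumptions and not the paper's method: the prototype is a plain average of the features inside the exemplar box (a stand-in for ROI-Align), it is treated as a 1x1 kernel scored by cosine similarity, and the normalization simply rescales the map so that the exemplar region sums to one object. The names `count_objects`, `features`, and `box` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def count_objects(features, box):
    """features: H x W grid of D-dim vectors; box: (r0, c0, r1, c1), end-exclusive."""
    r0, c0, r1, c1 = box
    cells = [features[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    # Prototype: mean feature inside the exemplar box (simplified ROI pooling).
    proto = [sum(v[d] for v in cells) / len(cells) for d in range(dim)]
    # Similarity map: the prototype acts as a 1x1 kernel; negatives are clamped.
    sim = [[max(0.0, cosine(v, proto)) for v in row] for row in features]
    # Assumed normalization: the exemplar region should integrate to one object.
    norm = sum(sim[r][c] for r in range(r0, r1) for c in range(c0, c1))
    density = [[s / norm for s in row] for row in sim]
    return sum(map(sum, density)), density

# Toy 2x4 feature grid: [1, 0] marks an object cell, [0, 1] background.
obj, bg = [1.0, 0.0], [0.0, 1.0]
grid = [[obj, bg, obj, bg],
        [bg, bg, bg, obj]]
count, _ = count_objects(grid, box=(0, 0, 1, 1))
```

With real backbone features the similarity map is far noisier than in this toy grid, so the choice of normalization matters much more than the sketch suggests.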

[38] JEPA for RL: Investigating Joint-Embedding Predictive Architectures for Reinforcement Learning

Tristan Kenneweg,Philip Kenneweg,Barbara Hammer

Main category: cs.CV

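TL;DR: JEPA architectures perform well in self-supervised learning; this paper adapts them to reinforcement learning, shows how to prevent model collapse, and validates the approach on the Cart Pole task.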

Details

Motivation: Explore whether JEPA architectures are suitable for reinforcement learning, in particular learning from images. Method: Adapt the JEPA architecture to reinforcement learning and propose a way to prevent model collapse. Result: Demonstrates the approach on the Cart Pole task. Conclusion: JEPA can be applied to reinforcement learning, and model collapse can be prevented. Abstract: Joint-Embedding Predictive Architectures (JEPA) have recently become popular as promising architectures for self-supervised learning. Vision transformers have been trained using JEPA to produce embeddings from images and videos, which have been shown to be highly suitable for downstream tasks like classification and segmentation. In this paper, we show how to adapt the JEPA architecture to reinforcement learning from images. We discuss model collapse, show how to prevent it, and provide exemplary data on the classical Cart Pole task.

[39] Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections

Max Kirchner,Alexander C. Jenke,Sebastian Bodenstedt,Fiona R. Kolbinger,Oliver Saldanha,Jakob N. Kather,Martin Wagner,Stefanie Speidel

Main category: cs.CV

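TL;DR: The study trains foundation models with federated learning to overcome data-sharing limitations and enable collaborative model training for minimally invasive surgery.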

Details

Motivation: Overcome data-sharing limitations and protect privacy while still enabling collaborative training. Method: Building on the EndoViT study, the authors adapt the Masked Autoencoder for federated learning, combine it with FedSAM and SWA, and pretrain on the Endo700k dataset collection. Result: FedSAM improves pretraining; FL-EndoViT performs comparably to CEN-EndoViT on surgical tasks and better when data is limited. Conclusion: Federated learning offers a privacy-preserving way to train surgical foundation models; future work may explore video-based models to capture spatiotemporal dynamics. Abstract: Purpose: In this study, we investigate the training of foundation models using federated learning to address data-sharing limitations and enable collaborative model training without data transfer for minimally invasive surgery. Methods: Inspired by the EndoViT study, we adapt the Masked Autoencoder for federated learning, enhancing it with adaptive Sharpness-Aware Minimization (FedSAM) and Stochastic Weight Averaging (SWA). Our model is pretrained on the Endo700k dataset collection and later fine-tuned and evaluated for tasks such as Semantic Segmentation, Action Triplet Recognition, and Surgical Phase Recognition. Results: Our findings demonstrate that integrating adaptive FedSAM into the federated MAE approach improves pretraining, leading to a reduction in reconstruction loss per patch. The application of FL-EndoViT in surgical downstream tasks results in performance comparable to CEN-EndoViT. Furthermore, FL-EndoViT exhibits advantages over CEN-EndoViT in surgical scene segmentation when data is limited and in action triplet recognition when large datasets are used. Conclusion: These findings highlight the potential of federated learning for privacy-preserving training of surgical foundation models, offering a robust and generalizable solution for surgical data science. Effective collaboration requires adapting federated learning methods, such as the integration of FedSAM, which can accommodate the inherent data heterogeneity across institutions. In future, exploring FL in video-based models may enhance these capabilities by incorporating spatiotemporal dynamics crucial for real-world surgical environments.
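For readers unfamiliar with federated training, the server-side aggregation step can be sketched as follows. This shows plain federated averaging (FedAvg), the common baseline, not the adaptive FedSAM or SWA variants used in the paper; the function name and the flat-list parameter layout are assumptions made for illustration.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    client_weights: one flat list of floats per client (same length each).
    client_sizes: number of local training samples per client.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical hospitals: the second holds three times as much data,
# so its parameters dominate the average. Only weights leave each site.
global_weights = federated_average(
    client_weights=[[1.0, 2.0], [3.0, 6.0]],
    client_sizes=[1, 3],
)  # -> [2.5, 5.0]
```

Only model parameters are exchanged in this scheme, which is what allows training without transferring the underlying surgical images.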

[40] EHGCN: Hierarchical Euclidean-Hyperbolic Fusion via Motion-Aware GCN for Hybrid Event Stream Perception

Haosheng Chen,Lian Luo,Mengjingcheng Mo,Zhanjie Wu,Guobao Xiao,Ji Gan,Jiaxu Leng,Xinbo Gao

Main category: cs.CV

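TL;DR: Proposes EHGCN, a method that perceives event streams in both Euclidean and hyperbolic spaces and improves event vision tasks through adaptive sampling and motion-aware hyperedge generation.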

Details

Motivation: Existing GNN methods operating in Euclidean space struggle to capture the long-range dependencies and hierarchical structure of event streams, which limits performance on event vision tasks. Method: EHGCN combines an adaptive sampling strategy, motion-aware hyperedge generation, and a Euclidean-Hyperbolic GCN to achieve hybrid event perception. Result: Effectiveness is validated on tasks such as object detection and recognition. Conclusion: EHGCN improves event vision performance through multi-space fusion and dynamic sampling. Abstract: Event cameras, with microsecond temporal resolution and high dynamic range (HDR) characteristics, emit high-speed event streams for perception tasks. Despite the recent advancement in GNN-based perception methods, they are prone to use straightforward pairwise connectivity mechanisms in the pure Euclidean space where they struggle to capture long-range dependencies and fail to effectively characterize the inherent hierarchical structures of non-uniformly distributed event streams. To this end, in this paper we propose a novel approach named EHGCN, which is a pioneer in perceiving event streams in both Euclidean and hyperbolic spaces for event vision. In EHGCN, we introduce an adaptive sampling strategy to dynamically regulate sampling rates, retaining discriminative events while attenuating chaotic noise. Then we present a Markov Vector Field (MVF)-driven motion-aware hyperedge generation method based on motion state transition probabilities, thereby eliminating cross-target spurious associations and providing critical topological priors while capturing long-range dependencies between events. Finally, we propose a Euclidean-Hyperbolic GCN to fuse the information locally aggregated and globally hierarchically modeled in Euclidean and hyperbolic spaces, respectively, to achieve hybrid event perception. Experimental results on event perception tasks such as object detection and recognition validate the effectiveness of our approach.

[41] Dual-Camera All-in-Focus Neural Radiance Fields

Xianrui Luo,Zijin Wu,Juewen Peng,Huiqiang Sun,Zhiguo Cao,Guosheng Lin

Main category: cs.CV

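TL;DR: Proposes DC-NeRF, the first framework for synthesizing all-in-focus neural radiance fields (NeRF) without manual refocusing, using a smartphone's dual cameras (main and ultra-wide) to generate high-quality all-in-focus views.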

Details

Motivation: Existing NeRF methods use a single camera with fixed focus, which causes blur and leaves no sharp reference, so they cannot produce all-in-focus views. Method: Uses a dual-camera pair (high-resolution main camera, deep depth-of-field ultra-wide camera), aligns them by spatial warping and color matching, and applies a defocus-aware fusion module with learnable parameters to predict a defocus map and fuse the images. Result: On a self-built multi-view dataset, DC-NeRF produces high-quality all-in-focus novel views, compares favorably against baselines quantitatively and qualitatively, and supports depth-of-field applications such as refocusing and split diopter. Conclusion: DC-NeRF uses dual cameras to address the challenge of all-in-focus NeRF synthesis and provides a new tool for depth-of-field control. Abstract: We present the first framework capable of synthesizing the all-in-focus neural radiance field (NeRF) from inputs without manual refocusing. Without refocusing, the camera will automatically focus on the fixed object for all views, and current NeRF methods typically using one camera fail due to the consistent defocus blur and a lack of sharp reference. To restore the all-in-focus NeRF, we introduce the dual-camera from smartphones, where the ultra-wide camera has a wider depth-of-field (DoF) and the main camera possesses a higher resolution. The dual camera pair saves the high-fidelity details from the main camera and uses the ultra-wide camera's deep DoF as reference for all-in-focus restoration. To this end, we first implement spatial warping and color matching to align the dual camera, followed by a defocus-aware fusion module with learnable defocus parameters to predict a defocus map and fuse the aligned camera pair. We also build a multi-view dataset that includes image pairs of the main and ultra-wide cameras in a smartphone. Extensive experiments on this dataset verify that our solution, termed DC-NeRF, can produce high-quality all-in-focus novel views and compares favorably against strong baselines quantitatively and qualitatively. We further show DoF applications of DC-NeRF with adjustable blur intensity and focal plane, including refocusing and split diopter.

[42] RouteWinFormer: A Route-Window Transformer for Middle-range Attention in Image Restoration

Qifan Li,Tianyi Liang,Xingtao Wang,Xiaopeng Fan

Main category: cs.CV

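TL;DR: RouteWinFormer is a window-based Transformer that dynamically selects nearby windows for attention aggregation, efficiently extending the receptive field to a middle range for image restoration.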

Details

Motivation: Transformers have attracted attention in image restoration for capturing long-range pixel dependencies, but long-range attention is computationally expensive and often unnecessary because degradation and context are typically localized. Method: Proposes RouteWinFormer, which includes a Route-Windows Attention Module that dynamically selects nearby windows, and introduces Multi-Scale Structure Regularization during training. Result: Outperforms existing methods on image restoration tasks across 9 datasets. Conclusion: Middle-range attention is sufficient and efficient, and RouteWinFormer performs strongly in image restoration. Abstract: Transformer models have recently garnered significant attention in image restoration due to their ability to capture long-range pixel dependencies. However, long-range attention often results in computational overhead without practical necessity, as degradation and context are typically localized. Normalized average attention distance across various degradation datasets shows that middle-range attention is enough for image restoration. Building on this insight, we propose RouteWinFormer, a novel window-based Transformer that models middle-range context for image restoration. RouteWinFormer incorporates Route-Windows Attention Module, which dynamically selects relevant nearby windows based on regional similarity for attention aggregation, extending the receptive field to a mid-range size efficiently. In addition, we introduce Multi-Scale Structure Regularization during training, enabling the sub-scale of the U-shaped network to focus on structural information, while the original-scale learns degradation patterns based on generalized image structure priors. Extensive experiments demonstrate that RouteWinFormer outperforms state-of-the-art methods across 9 datasets in various image restoration tasks.

[43] SSLR: A Semi-Supervised Learning Method for Isolated Sign Language Recognition

Hasan Algafri,Hamzah Luqman,Sarah Alyami,Issam Laradji

Main category: cs.CV

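TL;DR: Proposes a semi-supervised sign language recognition method that annotates unlabeled samples with pseudo-labels and takes pose information as input; in many cases it outperforms a fully supervised model while using less labeled data.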

Details

Motivation: Address the scarcity of annotated data in sign language recognition. Method: A semi-supervised approach that annotates unlabeled samples with pseudo-labels, uses pose information as input, and employs a Transformer backbone. Result: On the WLASL-100 dataset, the semi-supervised model outperforms the fully supervised model with less labeled data in many cases. Conclusion: Semi-supervised learning is an effective way to address data scarcity in sign language recognition. Abstract: Sign language is the primary communication language for people with disabling hearing loss. Sign language recognition (SLR) systems aim to recognize sign gestures and translate them into spoken language. One of the main challenges in SLR is the scarcity of annotated datasets. To address this issue, we propose a semi-supervised learning (SSL) approach for SLR (SSLR), employing a pseudo-label method to annotate unlabeled samples. The sign gestures are represented using pose information that encodes the signer's skeletal joint points. This information is used as input for the Transformer backbone model utilized in the proposed approach. To demonstrate the learning capabilities of SSL across various labeled data sizes, several experiments were conducted using different percentages of labeled data with varying numbers of classes. The performance of the SSL approach was compared with a fully supervised learning-based model on the WLASL-100 dataset. The obtained results of the SSL model outperformed the supervised learning-based model with less labeled data in many cases.

[44] WiFi based Human Fall and Activity Recognition using Transformer based Encoder Decoder and Graph Neural Networks

Younggeol Cho,Elisa Motta,Olivia Nocentini,Marta Lagomarsino,Andrea Merello,Marco Crepaldi,Arash Ajoudani

Main category: cs.CV

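TL;DR: Proposes a Transformer-based encoder-decoder network (TED Net) that estimates human skeleton poses from WiFi CSI signals, combined with a Directed Graph Neural Network (DGNN) for action recognition; it outperforms existing methods.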

Details

Motivation: Human pose estimation and action recognition play important roles in healthcare monitoring, rehabilitation, and assistive technologies, and WiFi signals offer a privacy-preserving alternative. Method: TED Net combines convolutional encoders with Transformer attention to extract spatiotemporal features from CSI signals; a DGNN performs action classification. Result: TED Net outperforms existing methods in pose estimation, the DGNN's action classification is comparable to RGB-based systems, and performance is robust in both fall and non-fall cases. Conclusion: CSI-driven skeleton estimation shows promise for action recognition, especially for privacy-preserving home applications such as elderly fall detection. Abstract: Human pose estimation and action recognition have received attention due to their critical roles in healthcare monitoring, rehabilitation, and assistive technologies. In this study, we proposed a novel architecture named Transformer based Encoder Decoder Network (TED Net) designed for estimating human skeleton poses from WiFi Channel State Information (CSI). TED Net integrates convolutional encoders with transformer based attention mechanisms to capture spatiotemporal features from CSI signals. The estimated skeleton poses were used as input to a customized Directed Graph Neural Network (DGNN) for action recognition. We validated our model on two datasets: a publicly available multi modal dataset for assessing general pose estimation, and a newly collected dataset focused on fall related scenarios involving 20 participants. Experimental results demonstrated that TED Net outperformed existing approaches in pose estimation, and that the DGNN achieves reliable action classification using CSI based skeletons, with performance comparable to RGB based systems. Notably, TED Net maintains robust performance across both fall and non fall cases. These findings highlight the potential of CSI driven human skeleton estimation for effective action recognition, particularly in home environments such as elderly fall detection. In such settings, WiFi signals are often readily available, offering a privacy preserving alternative to vision based methods, which may raise concerns about continuous camera monitoring.

[45] Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

Chris,Yichen Wei,Yi Peng,Xiaokun Wang,Weijie Qiu,Wei Shen,Tianyidan Xie,Jiangbo Pei,Jianhao Zhang,Yunzhuo Hao,Xuchen Song,Yang Liu,Yahui Zhou

Main category: cs.CV

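TL;DR: Skywork R1V2 is a next-generation multimodal reasoning model that uses a hybrid reinforcement learning paradigm to balance reasoning capability with generalization, and introduces the SSB mechanism to improve training efficiency.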

Details

Motivation: Balance reasoning capability against generalization, a trade-off earlier models struggle with, and improve training efficiency. Method: A hybrid reinforcement learning paradigm that combines reward-model guidance with rule-based strategies, plus the SSB mechanism to optimize training. Result: Strong results on several benchmarks, such as 62.6 on OlympiadBench and 79.0 on AIME2024, surpassing existing open-source models. Conclusion: Skywork R1V2 narrows the performance gap with leading proprietary systems, and the model weights are publicly released to promote openness and reproducibility. Abstract: We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we propose the Selective Sample Buffer (SSB) mechanism, which effectively counters the "Vanishing Advantages" dilemma inherent in Group Relative Policy Optimization (GRPO) by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations--a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout the training process. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and 74.0 on MMMU. These results underscore R1V2's superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility: https://huggingface.co/Skywork/Skywork-R1V2-38B.
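The "Vanishing Advantages" issue can be made concrete with a small sketch. In GRPO, each sampled response is scored relative to the other responses in its group, so a group whose rewards are all identical yields zero advantage everywhere and contributes no learning signal. The buffer rule below (keep samples whose absolute advantage exceeds a small threshold) is a hypothetical stand-in for the paper's Selective Sample Buffer, whose exact selection criterion is not given in the abstract; `select_for_buffer` and `min_abs_adv` are illustrative names.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def select_for_buffer(groups, min_abs_adv=1e-3):
    """Keep (group_id, sample_id, advantage) for samples with a usable signal.

    Assumption: 'high-value' is approximated by a non-negligible advantage.
    """
    buffer = []
    for gid, rewards in enumerate(groups):
        for idx, adv in enumerate(group_advantages(rewards)):
            if abs(adv) > min_abs_adv:
                buffer.append((gid, idx, adv))
    return buffer

# Group 0: every response is correct, so all advantages vanish.
# Group 1: mixed outcomes, so every sample carries a signal.
groups = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0]]
kept = select_for_buffer(groups)
```

As a policy improves, more groups look like group 0, which is why a mechanism that retains informative samples becomes more useful late in training.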

[46] A Time Series Dataset of NIR Spectra and RGB and NIR-HSI Images of the Barley Germination Process

Ole-Christian Galbo Engstrøm,Erik Schou Dreier,Birthe Møller Jespersen,Kim Steenstrup Pedersen

Main category: cs.CV

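TL;DR: An open-source dataset of RGB and NIR-HSI images, segmentation masks, and NIR spectra for 2242 barley kernels, intended for analyzing germination time.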

Details

Motivation: Provide a high-quality dataset that supports time series analysis of barley kernel germination using RGB images, NIR spectra, or NIR-HSI. Method: Kernels were imaged daily in RGB and NIR-HSI, labeled with their germination state, and photographed against black filter paper to simplify segmentation. Result: The dataset supports several analysis approaches, such as Otsu threshold segmentation, and can be used to study germination dynamics. Conclusion: The dataset provides a multimodal resource for studying barley kernel germination. Abstract: We provide an open-source dataset of RGB and NIR-HSI (near-infrared hyperspectral imaging) images with associated segmentation masks and NIR spectra of 2242 individual malting barley kernels. We imaged every kernel pre-exposure to moisture and every 24 hours after exposure to moisture for five consecutive days. Every barley kernel was labeled as germinated or not germinated during each image acquisition. The barley kernels were imaged with black filter paper as the background, facilitating straightforward intensity threshold-based segmentation, e.g., by Otsu's method. This dataset facilitates time series analysis of germination time for barley kernels using either RGB image analysis, NIR spectral analysis, NIR-HSI analysis, or a combination hereof.
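Since the abstract points to Otsu's method for separating kernels from the black background, a minimal standard-library sketch of that method may be useful. It assumes 8-bit grayscale intensities supplied as a flat list; the pixel values in the example are synthetic and do not come from the dataset.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Pixels with value <= t are background, pixels > t are foreground.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]          # background pixel count up to level t
        sum_bg += t * hist[t]    # background intensity mass up to level t
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Synthetic example: dark background (10-12) and bright kernel (200-210).
pixels = [10] * 50 + [12] * 50 + [200] * 30 + [210] * 30
t = otsu_threshold(pixels)
mask = [p > t for p in pixels]  # True where the kernel is
```

The dark, uniform background is what makes a single global threshold sufficient here; textured backgrounds would generally need a more involved segmentation.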

[47] A Diff-Attention Aware State Space Fusion Model for Remote Sensing Classification

Wenping Ma,Boyou Xue,Mengru Ma,Chuang Chen,Hekai Zhang,Hao Zhu

Main category: cs.CV

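TL;DR: Proposes DAS2F-Model, a multimodal remote sensing image classification method based on the selective state space model; a CMDA-Module separates the common and dominant features of MS and PAN images, and an AALF-Module fuses features with large semantic differences.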

Details

Motivation: MS and PAN images share similar information while each has its own advantages, but fusion suffers from feature redundancy; the similar information should be separated out while the respective advantages are retained. Method: A CMDA-Module extracts and separates features, SPVM preserves spatial features, and an AALF-Module handles features with large semantic differences through pixel-wise linear fusion. Result: Experiments show the method outperforms alternatives. Conclusion: DAS2F-Model addresses feature redundancy and fusion in multimodal remote sensing image classification. Abstract: Multispectral (MS) and panchromatic (PAN) images describe the same land surface, so these images not only have their own advantages, but also share a lot of similar information. To separate this similar information from their respective advantages and reduce feature redundancy in the fusion stage, this paper introduces a diff-attention aware state space fusion model (DAS2F-Model) for multimodal remote sensing image classification. Based on the selective state space model, a cross-modal diff-attention module (CMDA-Module) is designed to extract and separate the common features and their respective dominant features of MS and PAN images. Within this module, space preserving visual mamba (SPVM) retains image spatial features and captures local features by optimizing visual mamba's input reasonably. Considering that features in the fusion stage will have large semantic differences after feature separation and simple fusion operations struggle to effectively integrate these significantly different features, an attention-aware linear fusion module (AALF-Module) is proposed. It performs pixel-wise linear fusion by calculating influence coefficients. This mechanism can fuse features with large semantic differences while keeping the feature size unchanged. Empirical evaluations indicate that the presented method achieves better results than alternative approaches. The relevant code can be found at: https://github.com/AVKSKVL/DAS-F-Model

[48] SemanticSugarBeets: A Multi-Task Framework and Dataset for Inspecting Harvest and Storage Characteristics of Sugar Beets

Gerardus Croonen,Andreas Trondl,Julia Simon,Daniel Steininger

Main category: cs.CV

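TL;DR: The paper proposes a two-stage method for sugar beet detection, semantic segmentation, and mass estimation, together with a high-quality annotated dataset; it reports an mAP50-95 of 98.8 for detection and an mIoU of 64.0 for segmentation.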

Details

Motivation: Sugar beets lose sugar during storage due to microorganisms and other factors; automated visual inspection can improve efficiency and quality assurance across the sugar production chain. Method: A two-stage approach combining detection and semantic segmentation, evaluated across image sizes, model architectures, and environmental conditions. Result: An mAP50-95 of 98.8 for detection and an mIoU of 64.0 for segmentation. Conclusion: The method offers an efficient solution for sugar beet quality inspection, and the experiments confirm its effectiveness. Abstract: While sugar beets are stored prior to processing, they lose sugar due to factors such as microorganisms present in adherent soil and excess vegetation. Their automated visual inspection promises to aid in quality assurance and thereby increase efficiency throughout the processing chain of sugar production. In this work, we present a novel high-quality annotated dataset and two-stage method for the detection, semantic segmentation and mass estimation of post-harvest and post-storage sugar beets in monocular RGB images. We conduct extensive ablation experiments for the detection of sugar beets and their fine-grained semantic segmentation regarding damages, rot, soil adhesion and excess vegetation. For these tasks, we evaluate multiple image sizes, model architectures and encoders, as well as the influence of environmental conditions. Our experiments show an mAP50-95 of 98.8 for sugar-beet detection and an mIoU of 64.0 for the best-performing segmentation model.
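For reference, the mIoU figure reported for segmentation is the per-class intersection-over-union averaged over classes. A minimal sketch on flattened label maps follows; skipping classes that appear in neither prediction nor ground truth is one common convention and is an assumption here, since the paper's exact evaluation protocol is not stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    pred, target: flat sequences of integer class labels, same length.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Four pixels, two classes: IoU is 1/2 for class 0 and 2/3 for class 1.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

Because every class counts equally, small or rare classes such as rot or soil adhesion can pull mIoU down even when most pixels are labeled correctly.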

[49] Energy-Based Pseudo-Label Refining for Source-free Domain Adaptation

Xinru Meng,Han Sun,Jiamei Liu,Ningzhong Liu,Huiyu Zhou

Main category: cs.CV

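TL;DR: Proposes Energy-Based Pseudo-Label Refining (EBPR) for source-free domain adaptation (SFDA), improving performance through energy thresholds and a contrastive learning strategy.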

Details

Motivation: Existing SFDA methods rely on confidence-based pseudo-labels whose noise leads to negative transfer. Method: Generate pseudo-labels from energy scores, compute global and class energy thresholds to filter them, and introduce a contrastive learning strategy to align difficult samples with their augmented versions. Result: Outperforms existing methods on the Office-31, Office-Home, and VisDA-C datasets. Conclusion: EBPR improves SFDA performance through energy thresholds and contrastive learning. Abstract: Source-free domain adaptation (SFDA), which involves adapting models without access to source data, is both demanding and challenging. Existing SFDA techniques typically rely on pseudo-labels generated from confidence levels, leading to negative transfer due to significant noise. To tackle this problem, Energy-Based Pseudo-Label Refining (EBPR) is proposed for SFDA. Pseudo-labels are created for all sample clusters according to their energy scores. Global and class energy thresholds are computed to selectively filter pseudo-labels. Furthermore, a contrastive learning strategy is introduced to filter difficult samples, aligning them with their augmented versions to learn more discriminative features. Our method is validated on the Office-31, Office-Home, and VisDA-C datasets, where it consistently outperforms state-of-the-art methods.
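To illustrate what energy-based filtering can look like, the sketch below uses the common free-energy score E(x) = -log sum_k exp(z_k) over the logits z, where lower energy indicates a more confident prediction. The threshold choices are assumptions for illustration only: the global threshold is the mean energy of all samples, each class threshold is the mean energy of the samples assigned to that class, and a pseudo-label is kept only if it passes both. The abstract does not specify how EBPR actually defines these thresholds.

```python
import math

def energy(logits):
    """Free energy -logsumexp(logits), computed stably."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))

def filter_pseudo_labels(all_logits):
    """Return (labels, energies, keep) with a per-sample keep flag."""
    labels = [max(range(len(l)), key=l.__getitem__) for l in all_logits]
    energies = [energy(l) for l in all_logits]

    # Assumed global threshold: mean energy over all target samples.
    global_thr = sum(energies) / len(energies)

    # Assumed class thresholds: mean energy within each predicted class.
    per_class = {}
    for y, e in zip(labels, energies):
        per_class.setdefault(y, []).append(e)
    class_thr = {y: sum(v) / len(v) for y, v in per_class.items()}

    keep = [e <= min(global_thr, class_thr[y]) for y, e in zip(labels, energies)]
    return labels, energies, keep

logits = [
    [8.0, 0.0, 0.0],   # confident class 0 -> low energy
    [0.5, 0.4, 0.3],   # ambiguous, predicted class 0
    [0.0, 9.0, 0.0],   # confident class 1 -> low energy
    [0.2, 0.6, 0.1],   # ambiguous, predicted class 1
]
labels, energies, keep = filter_pseudo_labels(logits)
```

In this toy batch only the two confident samples survive, and the ambiguous ones would be the candidates for the contrastive branch the abstract describes.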

[50] PMG: Progressive Motion Generation via Sparse Anchor Postures Curriculum Learning

Yingjie Xi,Jian Jun Zhang,Xiaosong Yang

Main category: cs.CV

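TL;DR: ProMoGen is a new framework that combines trajectory guidance with sparse anchor motion control; by decoupling the global trajectory from precise action guidance, it achieves more controllable, high-fidelity human motion synthesis.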

Details

Motivation: Existing methods are limited when generating complex or customized human motion: text-based methods struggle to describe complex actions accurately, trajectory-based methods cannot produce precise movements, and anchor-based methods support only simple patterns. Method: ProMoGen combines global trajectories with sparse anchor motion control, supports both dual and single control paradigms, and introduces the SAP-CL curriculum learning strategy to stabilize training. Result: Experiments show that ProMoGen generates vivid and diverse motions and significantly outperforms existing methods. Conclusion: Through decoupling and curriculum learning, ProMoGen achieves more precise and controllable motion synthesis across multiple control scenarios. Abstract: In computer animation, game design, and human-computer interaction, synthesizing human motion that aligns with user intent remains a significant challenge. Existing methods have notable limitations: textual approaches offer high-level semantic guidance but struggle to describe complex actions accurately; trajectory-based techniques provide intuitive global motion direction yet often fall short in generating precise or customized character movements; and anchor poses-guided methods are typically confined to synthesize only simple motion patterns. To generate more controllable and precise human motions, we propose ProMoGen (Progressive Motion Generation), a novel framework that integrates trajectory guidance with sparse anchor motion control. Global trajectories ensure consistency in spatial direction and displacement, while sparse anchor motions only deliver precise action guidance without displacement. This decoupling enables independent refinement of both aspects, resulting in a more controllable, high-fidelity, and sophisticated motion synthesis. ProMoGen supports both dual and single control paradigms within a unified training process. Moreover, recognizing that direct learning from sparse motions is inherently unstable, we introduce SAP-CL (Sparse Anchor Posture Curriculum Learning), a curriculum learning strategy that progressively adjusts the number of anchors used for guidance, thereby enabling more precise and stable convergence. Extensive experiments demonstrate that ProMoGen excels in synthesizing vivid and diverse motions guided by predefined trajectory and arbitrary anchor frames. Our approach seamlessly integrates personalized motion with structured guidance, significantly outperforming state-of-the-art methods across multiple control scenarios.

[51] Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering

Ali Anaissi,Junaid Akram,Kunal Chaturvedi,Ali Braytee

Main category: cs.CV

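TL;DR: Proposes a multimodal hateful content detection framework that combines OCR, image captioning, sub-label classification, RAG, and VQA, improving detection performance.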

Details

Motivation: Traditional unimodal or simple multimodal methods struggle to detect memes with implicit hateful content, so finer-grained multimodal analysis is needed. Method: Integrates OCR for text extraction, neutral image captioning, sub-label classification, RAG retrieval, and iterative VQA analysis to capture implicit signals. Result: On the Facebook Hateful Memes dataset, the framework outperforms conventional methods in accuracy and AUC-ROC. Conclusion: The multimodal framework detects implicit hateful content effectively and outperforms existing methods. Abstract: Memes are widely used for humor and cultural commentary, but they are increasingly exploited to spread hateful content. Due to their multimodal nature, hateful memes often evade traditional text-only or image-only detection systems, particularly when they employ subtle or coded references. To address these challenges, we propose a multimodal hate detection framework that integrates key components: OCR to extract embedded text, captioning to describe visual content neutrally, sub-label classification for granular categorization of hateful content, RAG for contextually relevant retrieval, and VQA for iterative analysis of symbolic and contextual cues. This enables the framework to uncover latent signals that simpler pipelines fail to detect. Experimental results on the Facebook Hateful Memes dataset reveal that the proposed framework exceeds the performance of unimodal and conventional multimodal models in both accuracy and AUC-ROC.

[52] V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations

Zhiyuan Fan,Yumeng Wang,Sandeep Polisetty,Yi R. Fung

Main category: cs.CV

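TL;DR: V$^2$R-Bench is a benchmark framework for evaluating the robustness of Large Vision Language Models (LVLMs) to visual variations; it reveals how fragile the models are under such variations and why.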

Details

Motivation: Although LVLMs perform well on many vision-language tasks, their robustness to the visual variations (position, scale, orientation, and so on) that arise in natural scenes from changes in viewpoint and environment remains underexplored. Method: Proposes the V$^2$R-Bench benchmark framework, including automated evaluation dataset generation and principled metrics, evaluates 21 LVLMs, and investigates the source of the vulnerabilities through component-level analysis and visualization. Result: LVLMs are markedly vulnerable to visual variations; even advanced models underperform on simple tasks such as object recognition, and they exhibit a visual position bias and a human-like visual acuity threshold. The vulnerabilities stem from error accumulation in the architecture and inadequate multimodal alignment. Conclusion: The fragility of LVLMs is an architectural deficiency, and future designs need architectural innovation to improve robustness. Abstract: Large Vision Language Models (LVLMs) excel in various vision-language tasks. Yet, their robustness to visual variations in position, scale, orientation, and context that objects in natural scenes inevitably exhibit due to changes in viewpoint and environment remains largely underexplored. To bridge this gap, we introduce V$^2$R-Bench, a comprehensive benchmark framework for evaluating Visual Variation Robustness of LVLMs, which encompasses automated evaluation dataset generation and principled metrics for thorough robustness assessment. Through extensive evaluation on 21 LVLMs, we reveal a surprising vulnerability to visual variations, in which even advanced models that excel at complex vision-language tasks significantly underperform on simple tasks such as object recognition. Interestingly, these models exhibit a distinct visual position bias that contradicts theories of effective receptive fields, and demonstrate a human-like visual acuity threshold. To identify the source of these vulnerabilities, we present a systematic framework for component-level analysis, featuring a novel visualization approach for aligned visual features. Results show that these vulnerabilities stem from error accumulation in the pipeline architecture and inadequate multimodal alignment. Complementary experiments with synthetic data further demonstrate that these limitations are fundamentally architectural deficiencies, underscoring the need for architectural innovations in future LVLM designs.

[53] Prompt-Tuning SAM: From Generalist to Specialist with only 2048 Parameters and 16 Training Images

Tristan Piater,Björn Barz,Alexander Freytag

Main category: cs.CV

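TL;DR: PTSAM (Prompt-Tuned SAM) turns SAM into a task-specific specialist by tuning as few as 2,048 parameters, improving segmentation in non-natural domains such as microscopic imaging while needing only a small amount of annotated data.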

Details

Motivation: SAM performs well on natural images but degrades in non-natural domains such as microscopic imaging, and its interactive design does not suit automated biomedical applications. Method: Proposes PTSAM, which uses parameter-efficient prompt-tuning applied to SAM's mask decoder and, optionally, its image encoder. Result: PTSAM performs well on multiple microscopic and medical datasets, reaching state-of-the-art-level performance with only 2,048 parameters and very little training data (16 annotated images). Conclusion: PTSAM is an efficient, lightweight solution, particularly suited to tasks with limited data and large domain gaps. Abstract: The Segment Anything Model (SAM) is widely used for segmenting a diverse range of objects in natural images from simple user prompts like points or bounding boxes. However, SAM's performance decreases substantially when applied to non-natural domains like microscopic imaging. Furthermore, due to SAM's interactive design, it requires a precise prompt for each image and object, which is unfeasible in many automated biomedical applications. Previous solutions adapt SAM by training millions of parameters via fine-tuning large parts of the model or of adapter layers. In contrast, we show that as few as 2,048 additional parameters are sufficient for turning SAM into a use-case specialist for a certain downstream task. Our novel PTSAM (prompt-tuned SAM) method uses prompt-tuning, a parameter-efficient fine-tuning technique, to adapt SAM for a specific task. We validate the performance of our approach on multiple microscopic and one medical dataset. Our results show that prompt-tuning only SAM's mask decoder already leads to a performance on-par with state-of-the-art techniques while requiring roughly 2,000x fewer trainable parameters. For addressing domain gaps, we find that additionally prompt-tuning SAM's image encoder is beneficial, further improving segmentation accuracy by up to 18% over state-of-the-art results. Since PTSAM can be reliably trained with as few as 16 annotated images, we find it particularly helpful for applications with limited training data and domain shifts.

[54] Gaussian Splatting is an Effective Data Generator for 3D Object Detection

Farhad G. Zanjani,Davide Abati,Auke Wiggers,Dimitris Kalatzis,Jens Petersen,Hong Cai,Amirhossein Habibian

Main category: cs.CV

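TL;DR: This paper studies Gaussian Splatting-based 3D reconstruction as a data augmentation tool for 3D object detection in autonomous driving; placing 3D objects directly in the scene with explicit geometric transformations improves detection performance.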

Details

Motivation: Existing diffusion-based methods synthesize images conditioned on BEV layouts; this work instead places 3D objects directly and applies explicit geometric transformations, so that object placement is physically plausible and annotations are accurate. Method: Gaussian Splatting-based 3D reconstruction is used to place objects directly in 3D space with imposed geometric transformations, producing augmented data. Result: Experiments show that the method significantly improves 3D object detection, outperforms existing diffusion-based methods, and that geometric diversity matters more than appearance diversity. Conclusion: Direct 3D object placement with geometric transformations is an effective form of 3D data augmentation, whereas generating hard examples does not make augmentation more efficient. Abstract: We investigate data augmentation for 3D object detection in autonomous driving. We utilize recent advancements in 3D reconstruction based on Gaussian Splatting for 3D object placement in driving scenes. Unlike existing diffusion-based methods that synthesize images conditioned on BEV layouts, our approach places 3D objects directly in the reconstructed 3D space with explicitly imposed geometric transformations. This ensures both the physical plausibility of object placement and highly accurate 3D pose and position annotations. Our experiments demonstrate that even by integrating a limited number of external 3D objects into real scenes, the augmented data significantly enhances 3D object detection performance and outperforms existing diffusion-based 3D augmentation for object detection. Extensive testing on the nuScenes dataset reveals that imposing high geometric diversity in object placement has a greater impact compared to the appearance diversity of objects. Additionally, we show that generating hard examples, either by maximizing detection loss or imposing high visual occlusion in camera images, does not lead to more efficient 3D data augmentation for camera-based 3D object detection in autonomous driving.

[55] Feature Mixing Approach for Detecting Intraoperative Adverse Events in Laparoscopic Roux-en-Y Gastric Bypass Surgery

Rupak Bose,Chinedu Innocent Nwoye,Jorge Lazo,Joël Lukas Lavanchy,Nicolas Padoy

Main category: cs.CV

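TL;DR: BetaMixer is a deep learning model that uses Beta distribution-based mixing to address dataset imbalance in intraoperative adverse event (IAE) detection, enabling both severity regression and classification.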

Details

Motivation: Intraoperative adverse events (IAEs) such as bleeding or thermal injury can cause severe postoperative complications if undetected, but their rarity yields highly imbalanced datasets that make AI-based detection difficult. Method: BetaMixer uses Beta distribution-based sampling to strengthen underrepresented classes, aligns the feature space with IAE severity through a generative approach, and performs classification and regression with a transformer. Result: On the MultiBypass140 dataset, BetaMixer achieves a weighted F1 score of 0.76, recall of 0.81, PPV of 0.73, and NPV of 0.84. Conclusion: By combining Beta distribution-based sampling, feature mixing, and generative modeling, BetaMixer offers a robust solution for IAE detection and quantification in clinical settings. Abstract: Intraoperative adverse events (IAEs), such as bleeding or thermal injury, can lead to severe postoperative complications if undetected. However, their rarity results in highly imbalanced datasets, posing challenges for AI-based detection and severity quantification. We propose BetaMixer, a novel deep learning model that addresses these challenges through a Beta distribution-based mixing approach, converting discrete IAE severity scores into continuous values for precise severity regression (0-5 scale). BetaMixer employs Beta distribution-based sampling to enhance underrepresented classes and regularizes intermediate embeddings to maintain a structured feature space. A generative approach aligns the feature space with sampled IAE severity, enabling robust classification and severity regression via a transformer. Evaluated on the MultiBypass140 dataset, which we extended with IAE labels, BetaMixer achieves a weighted F1 score of 0.76, recall of 0.81, PPV of 0.73, and NPV of 0.84, demonstrating strong performance on imbalanced data. By integrating Beta distribution-based sampling, feature mixing, and generative modeling, BetaMixer offers a robust solution for IAE detection and quantification in clinical settings.
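
The Beta distribution-based mixing described in this abstract can be illustrated with a mixup-style sketch. This is not the BetaMixer implementation: the function name `beta_mix`, the default shape parameters, and the choice to interpolate two samples linearly are all assumptions made for illustration. It only shows how mixing turns discrete severity scores into continuous regression targets.

```python
import random

def beta_mix(feat_a, sev_a, feat_b, sev_b, alpha=0.5, beta=0.5, rng=random):
    """Mix two feature vectors and their discrete severity scores.

    Hypothetical helper: lam is drawn from Beta(alpha, beta), and the same
    coefficient interpolates both the features and the severity labels, so
    the mixed severity is a continuous value between the two inputs.
    """
    lam = rng.betavariate(alpha, beta)
    mixed_feat = [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]
    mixed_sev = lam * sev_a + (1.0 - lam) * sev_b
    return mixed_feat, mixed_sev, lam
```

For example, mixing a severity-0 sample with a severity-5 sample yields a target somewhere on the 0-5 scale, with the Beta shape parameters controlling how close to either end the mixed samples tend to fall.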

[56] Tri-FusionNet: Enhancing Image Description Generation with Transformer-based Fusion Network and Dual Attention Mechanism

Lakshita Agarwal,Bindu Verma

Main category: cs.CV

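TL;DR: Tri-FusionNet is an image description generation model that combines ViT, RoBERTa, and CLIP modules with a dual-attention mechanism and reports strong results on several datasets.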

Details

Motivation: Image description generation is essential for accessibility and for AI understanding of visual content, and recent advances in deep learning provide the technical basis for it. Method: Tri-FusionNet integrates a ViT encoder with dual attention, a RoBERTa decoder, and a CLIP module that aligns visual and textual data through contrastive learning. Result: The model obtains high BLEU, CIDEr, METEOR, and ROUGE-L scores on Flickr30k and Flickr8k, and high BLEU scores on MS-COCO. Conclusion: Tri-FusionNet generates more accurate, contextually rich image descriptions. Abstract: Image description generation is essential for accessibility and AI understanding of visual content. Recent advancements in deep learning have significantly improved natural language processing and computer vision. In this work, we propose Tri-FusionNet, a novel image description generation model that integrates transformer modules: a Vision Transformer (ViT) encoder module with dual-attention mechanism, a Robustly Optimized BERT Approach (RoBERTa) decoder module, and a Contrastive Language-Image Pre-Training (CLIP) integrating module. The ViT encoder, enhanced with dual attention, focuses on relevant spatial regions and linguistic context, improving image feature extraction. The RoBERTa decoder is employed to generate precise textual descriptions. CLIP's integrating module aligns visual and textual data through contrastive learning, ensuring effective combination of both modalities. This fusion of ViT, RoBERTa, and CLIP, along with dual attention, enables the model to produce more accurate, contextually rich, and flexible descriptions. The proposed framework demonstrated competitive performance on the Flickr30k and Flickr8k datasets, with BLEU scores ranging from 0.767 to 0.456 and 0.784 to 0.479, CIDEr scores of 1.679 and 1.483, METEOR scores of 0.478 and 0.358, and ROUGE-L scores of 0.567 and 0.789, respectively. On MS-COCO, the framework obtained BLEU scores of 0.893 (B-1), 0.821 (B-2), 0.794 (B-3), and 0.725 (B-4). The results demonstrate the effectiveness of Tri-FusionNet in generating high-quality image descriptions.

Lakshita Agarwal,Bindu Verma

Main category: cs.CV

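TL;DR: The paper proposes a framework that combines visual and textual modalities to generate natural language descriptions from video data, outperforming traditional methods.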

Details

Motivation: Understanding and analyzing video actions is essential for applications such as intelligent monitoring and autonomous systems, which need contextually relevant descriptions. Method: ResNet50 extracts visual features from video frames, and a GPT-2-based encoder-decoder model aligns textual and visual representations using multi-head self-attention and cross-attention. Result: On the BDD-X and MSVD datasets, the framework outperforms traditional methods on BLEU-4, CIDEr, METEOR, and ROUGE-L. Conclusion: By generating high-quality descriptions, the work advances explainable AI and strengthens real-world applicability. Abstract: Understanding and analyzing video actions are essential for producing insightful and contextualized descriptions, especially for video-based applications like intelligent monitoring and autonomous systems. The proposed work introduces a novel framework for generating natural language descriptions from video datasets by combining textual and visual modalities. The suggested architecture makes use of ResNet50 to extract visual features from video frames that are taken from the Microsoft Research Video Description Corpus (MSVD) and Berkeley DeepDrive eXplanation (BDD-X) datasets. The extracted visual characteristics are converted into patch embeddings and then run through an encoder-decoder model based on Generative Pre-trained Transformer-2 (GPT-2). In order to align textual and visual representations and guarantee high-quality description production, the system uses multi-head self-attention and cross-attention techniques. The model's efficacy is demonstrated by performance evaluation using BLEU (1-4), CIDEr, METEOR, and ROUGE-L. The suggested framework outperforms traditional methods with BLEU-4 scores of 0.755 (BDD-X) and 0.778 (MSVD), CIDEr scores of 1.235 (BDD-X) and 1.315 (MSVD), METEOR scores of 0.312 (BDD-X) and 0.329 (MSVD), and ROUGE-L scores of 0.782 (BDD-X) and 0.795 (MSVD). By producing human-like, contextually relevant descriptions, strengthening interpretability, and improving real-world applications, this research advances explainable AI.

[58] Decoupled Global-Local Alignment for Improving Compositional Understanding

Xiaoxing Hu,Kaicheng Yang,Jun Wang,Haoran Xu,Ziyong Feng,Yupei Wang

Main category: cs.CV

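TL;DR: The DeGLA framework decouples global and local alignment to improve CLIP's compositional understanding while reducing the loss of general capabilities.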

Details

Motivation: CLIP's global contrastive learning limits its ability to understand compositional concepts such as relations and attributes. Method: DeGLA combines a self-distillation mechanism with LLM-generated negative captions and introduces the IGC and TGC losses. Result: DeGLA improves by an average of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks and by an average of 13.0% on zero-shot classification across eleven datasets. Conclusion: DeGLA balances compositional understanding against general capabilities and clearly outperforms existing methods. Abstract: Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. Under the constraint of self-distillation, it effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA
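
The abstract mentions a teacher model derived from an exponential moving average (EMA). A minimal sketch of a generic EMA weight update follows; it is a standard formulation, not code from the DeGLA repository, and the function name, the dictionary-of-floats weight format, and the default decay are assumptions.

```python
def ema_update(teacher, student, decay=0.999):
    """Return teacher weights after one exponential-moving-average step.

    Hypothetical helper: `teacher` and `student` map parameter names to
    scalar weights. Each teacher weight moves a fraction (1 - decay)
    of the way toward the matching student weight.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }
```

A decay close to 1 makes the teacher change slowly, which is what lets it act as a stable reference against which the fine-tuned encoder is regularized.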

[59] A Low-Cost Photogrammetry System for 3D Plant Modeling and Phenotyping

Joe Hrzich,Michael A. Beck,Christopher P. Bidinosti,Christopher J. Henry,Kalhari Manawasinghe,Karen Tanino

Main category: cs.CV

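TL;DR: An open-source, low-cost photogrammetry system for 3D plant modeling and phenotyping that reconstructs 3D plant models as point clouds and, using wheat as an example, demonstrates automatic computation of phenotypic traits.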

Details

Motivation: To develop a low-cost, open-source photogrammetry system that simplifies the measurement and analysis of plant phenotypic traits. Method: A structure-from-motion (SfM) approach reconstructs 3D plant models as point clouds, from which phenotypic traits are computed. Result: The system measures traits such as plant height, radius, and leaf angles automatically, and supports classification of wheat canopy architectures. Conclusion: The system offers an efficient, low-cost solution for plant phenotyping research, particularly for automating measurements of complex traits. Abstract: We present an open-source, low-cost photogrammetry system for 3D plant modeling and phenotyping. The system uses a structure-from-motion approach to reconstruct 3D representations of the plants via point clouds. Using wheat as an example, we demonstrate how various phenotypic traits can be computed easily from the point clouds. These include standard measurements such as plant height and radius, as well as features that would be more cumbersome to measure by hand, such as leaf angles and convex hull. We further demonstrate the utility of the system through the investigation of specific metrics that may yield objective classifications of erectophile versus planophile wheat canopy architectures.
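
As an illustration of how simple traits might be read off a plant point cloud, here is a minimal sketch. The definitions used (height as the vertical extent, radius as the largest horizontal distance from the centroid) and the function name are assumptions; the paper's own trait definitions may differ.

```python
import math

def plant_height_and_radius(points):
    """Compute two simple traits from a point cloud.

    `points` is a sequence of (x, y, z) tuples with z pointing up
    (assumed convention). Height is the vertical extent; radius is the
    largest horizontal distance from the horizontal centroid.
    """
    xs, ys, zs = zip(*points)
    height = max(zs) - min(zs)
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    radius = max(math.hypot(x - cx, y - cy) for x, y in zip(xs, ys))
    return height, radius
```

In practice a real point cloud would first need the soil and pot removed and outliers filtered, since both definitions are sensitive to stray points.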

[60] Hyperspectral Vision Transformers for Greenhouse Gas Estimations from Space

Ruben Gonzalez Avilés,Linus Scheibenreif,Nassim Ait Ali Braham,Benedikt Blumenstiel,Thomas Brunschwiler,Ranjini Guruprasad,Damian Borth,Conrad Albrecht,Paolo Fraccaro,Devyani Lambhate,Johannes Jakubik

Main category: cs.CV

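TL;DR: Proposes a spectral transformer model that synthesizes hyperspectral data from multispectral inputs, addressing the limited spatial coverage of hyperspectral imaging while improving greenhouse gas monitoring accuracy.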

Details

Motivation: Hyperspectral imaging has great potential for greenhouse gas monitoring but is limited by spatial coverage and revisit frequency, whereas multispectral imaging offers broad coverage but less spectral detail; this study aims to combine the strengths of both. Method: A spectral transformer model is pre-trained as a band-wise masked autoencoder and then fine-tuned on spatio-temporally aligned multispectral-hyperspectral image pairs to synthesize hyperspectral data. Result: The synthetic data retain the spatial and temporal advantages of multispectral imagery and improve greenhouse gas prediction accuracy relative to multispectral data alone. Conclusion: The method bridges the trade-off between spectral resolution and coverage and suggests a new way to combine hyperspectral and multispectral systems. Abstract: Hyperspectral imaging provides detailed spectral information and holds significant potential for monitoring of greenhouse gases (GHGs). However, its application is constrained by limited spatial coverage and infrequent revisit times. In contrast, multispectral imaging offers broader spatial and temporal coverage but often lacks the spectral detail that can enhance GHG detection. To address these challenges, this study proposes a spectral transformer model that synthesizes hyperspectral data from multispectral inputs. The model is pre-trained via a band-wise masked autoencoder and subsequently fine-tuned on spatio-temporally aligned multispectral-hyperspectral image pairs. The resulting synthetic hyperspectral data retain the spatial and temporal benefits of multispectral imagery and improve GHG prediction accuracy relative to using multispectral data alone. This approach effectively bridges the trade-off between spectral resolution and coverage, highlighting its potential to advance atmospheric monitoring by combining the strengths of hyperspectral and multispectral systems with self-supervised deep learning.
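
The band-wise masking step of a masked autoencoder can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's pre-training code: the function name, the mask ratio, and the choice to overwrite masked bands with a constant (rather than dropping them from the encoder input, as masked autoencoders often do) are all hypothetical.

```python
import random

def mask_bands(pixel_spectrum, mask_ratio=0.5, mask_value=0.0, rng=random):
    """Randomly hide a fraction of spectral bands for one pixel.

    Returns the masked spectrum and the sorted indices of hidden bands;
    a reconstruction loss would be computed on those indices only.
    """
    n = len(pixel_spectrum)
    k = int(round(n * mask_ratio))
    masked_idx = sorted(rng.sample(range(n), k))
    masked = list(pixel_spectrum)
    for i in masked_idx:
        masked[i] = mask_value
    return masked, masked_idx
```

During pre-training, the model would be asked to reconstruct the original values at `masked_idx` from the visible bands, which encourages it to learn correlations across the spectrum.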

[61] High-Quality Cloud-Free Optical Image Synthesis Using Multi-Temporal SAR and Contaminated Optical Data

Chenxi Duan

Main category: cs.CV

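TL;DR: Proposes CRSynthNet, a network that addresses missing data caused by cloud cover in satellite imagery, improves synthesis accuracy through new modules, and introduces the TCSEN12 dataset.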

Details

Motivation: To fill the data gaps that cloud cover and long satellite revisit cycles create in satellite imagery, in support of remote sensing applications. Method: CRSynthNet is an image synthesis network that includes new modules such as the DownUp Block and Fusion Attention to improve synthesis accuracy. Result: Experiments show that CRSynthNet clearly outperforms comparison methods on PSNR, SSIM, and RMSE; the study also creates the TCSEN12 dataset. Conclusion: CRSynthNet provides a practical method and resources for optical satellite image synthesis. Abstract: Addressing gaps caused by cloud cover and the long revisit cycle of satellites is vital for providing essential data to support remote sensing applications. This paper tackles the challenges of missing optical data synthesis, particularly in complex scenarios with cloud cover. We propose CRSynthNet, a novel image synthesis network that incorporates innovatively designed modules such as the DownUp Block and Fusion Attention to enhance accuracy. Experimental results validate the effectiveness of CRSynthNet, demonstrating substantial improvements in restoring structural details, preserving spectral consistency, and achieving superior visual effects that far exceed those produced by comparison methods. It achieves quantitative improvements across multiple metrics: a peak signal-to-noise ratio (PSNR) of 26.978, a structural similarity index measure (SSIM) of 0.648, and a root mean square error (RMSE) of 0.050. Furthermore, this study creates the TCSEN12 dataset, a valuable resource specifically designed to address cloud cover challenges in the study of missing optical data synthesis. The dataset uniquely includes cloud-covered images and leverages earlier images to predict later images, offering a realistic representation of real-world scenarios. This study offers a practical method and valuable resources for the optical satellite image synthesis task.

[62] BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation

Ruotong Wang,Mingli Zhu,Jiarong Ou,Rui Chen,Xin Tao,Pengfei Wan,Baoyuan Wu

Main category: cs.CV

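TL;DR: The paper presents BadVideo, the first backdoor attack framework targeting text-to-video generation models, which hides malicious content in the redundant information of generated videos.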

Details

Motivation: To explore the adversarial vulnerabilities of text-to-video generation models and expose their potential risks. Method: Target adversarial outputs are designed through two strategies, Spatio-Temporal Composition and Dynamic Element Transformation. Result: BadVideo achieves high attack success rates and evades traditional content moderation systems. Conclusion: The study reveals the adversarial vulnerability of T2V models and calls attention to the associated risks. Abstract: Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing. However, the adversarial vulnerabilities of these models remain rarely explored. We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker's malicious target seamlessly integrates with the user's textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades traditional content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and misuse. Our project page is at https://wrt2000.github.io/BadVideo2025/.

[63] DreamO: A Unified Framework for Image Customization

Chong Mou,Yanze Wu,Wenxu Wu,Zinan Guo,Pengze Zhang,Yufeng Cheng,Yiming Luo,Fei Ding,Shiwen Zhang,Xinghui Li,Mengtian Li,Songtao Zhao,Jian Zhang,Qian He,Xinglong Wu

Main category: cs.CV

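TL;DR: DreamO is a unified image customization framework that supports many tasks and conditions, processes inputs with a diffusion transformer (DiT), and uses a progressive training strategy to improve performance.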

Details

Motivation: Most existing image customization methods are designed for specific tasks, lack generality, and struggle to combine multiple conditions; DreamO aims to address this. Method: A DiT framework processes all inputs uniformly; the authors build a large-scale training dataset, introduce a feature routing constraint and a placeholder strategy, and train with a three-stage progressive strategy. Result: Experiments show that DreamO performs a variety of image customization tasks with high quality and flexibly integrates different control conditions. Conclusion: DreamO offers a general and efficient solution for image customization with broad application potential. Abstract: Recently, extensive research on image customization (e.g., identity, subject, style, background, etc.) demonstrates strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, restricting their generalizability to combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process inputs of different types. During training, we construct a large-scale training dataset that includes various customization tasks, and we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that the proposed DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.

[64] Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light

Ali Hassani,Fengzhe Zhou,Aditya Kane,Jiannan Huang,Chieh-Yun Chen,Min Shi,Steven Walton,Markus Hoehnerbach,Vijay Thakkar,Michael Isaev,Qinsheng Zhang,Bing Xu,Haicheng Wu,Wen-mei Hwu,Ming-Yu Liu,Humphrey Shi

Main category: cs.CV

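TL;DR: The paper proposes Generalized Neighborhood Attention (GNA) and, through a simulator and hardware-optimized kernels, achieves efficient sparse attention that substantially speeds up generative models.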

Details

Motivation: Sparse attention mechanisms such as Neighborhood Attention have struggled to consistently outperform the self-attention baseline because of infrastructure complexity and rapidly changing hardware architectures, while state-of-the-art computer vision models are bound by the O(n^2) cost of attention and need reliable sparsity. Method: The paper introduces GNA, which describes sliding window, strided sliding window, and blocked attention; builds a simulator that predicts speedup upper bounds; and implements GNA on top of an optimized FMHA kernel for the NVIDIA Blackwell architecture. Result: GNA realizes the theoretical maximum speedup in many perfectly block-sparse cases, reaches an effective utilization of 1.3 petaFLOPs/second in FP16, and delivers 28% to 46% end-to-end speedup in models such as Cosmos-7B. Conclusion: GNA provides an efficient implementation of sparse attention, and the hardware-optimized kernels and open-source tooling (the NATTEN project) should encourage wider adoption. Abstract: Many sparse attention mechanisms such as Neighborhood Attention have typically failed to consistently deliver speedup over the self attention baseline. This is largely due to the level of complexity in attention infrastructure, and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundational models, particularly in computer vision, are heavily bound by attention, and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that it can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open source our simulator and Blackwell kernels directly through the NATTEN project.
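
To make the sliding-window case concrete, the sketch below builds a 1-D neighborhood attention mask and measures how much of the full n x n attention matrix it keeps. It is an illustration only, unrelated to the CUTLASS kernels or the NATTEN API: the function names are hypothetical, and shifting the window inward at the borders (so that every query sees exactly `window` keys) is an assumed convention.

```python
def neighborhood_mask(n, window):
    """Boolean n x n mask for 1-D sliding-window attention.

    Query i attends to `window` contiguous keys centred on i; near the
    borders the window is shifted inward so its size stays constant.
    """
    assert 1 <= window <= n
    mask = []
    for i in range(n):
        start = min(max(i - window // 2, 0), n - window)
        mask.append([start <= j < start + window for j in range(n)])
    return mask

def density(mask):
    """Fraction of query-key pairs that are actually computed."""
    n = len(mask)
    return sum(map(sum, mask)) / (n * n)
```

With n tokens and a fixed window w, the mask keeps n * w pairs instead of n^2, so its density is w / n. That ratio is the ideal compute saving; whether a kernel realizes it depends on how well the sparsity pattern maps onto hardware tiles, which is the gap the paper's simulator is meant to quantify.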

[65] Procedural Dataset Generation for Zero-Shot Stereo Matching

David Yan,Alexander Raistrick,Jia Deng

Main category: cs.CV

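TL;DR: The paper studies how to design synthetic stereo datasets and introduces the Infinigen-Stereo generator, which substantially improves zero-shot stereo matching performance.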

Details

Motivation: To explore the design space of synthetic stereo datasets in order to improve zero-shot stereo matching performance. Method: The parameters of a procedural dataset generator are varied to produce different datasets, and their effect on zero-shot stereo matching performance is evaluated. Result: Models trained on Infinigen-Stereo data outperform baselines trained on existing synthetic datasets as well as public checkpoints on zero-shot stereo matching. Conclusion: Infinigen-Stereo is a useful tool for stereo matching research and is open-sourced to enable further work. Abstract: Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains largely unexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We collect the best settings to produce Infinigen-Stereo, a procedural generator specifically optimized for zero-shot stereo datasets. Models trained only on data from our system outperform robust baselines trained on a combination of existing synthetic datasets and have stronger zero-shot stereo matching performance than public checkpoints from prior works. We open source our system at https://github.com/princeton-vl/InfinigenStereo to enable further research on procedural stereo datasets.

cs.GR [Back]

[66] Digital Kitchen Remodeling: Editing and Relighting Intricate Indoor Scenes from a Single Panorama

Guanzhou Ji,Azadeh O. Sawyer,Srinivasa G. Narasimhan

Main category: cs.GR

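TL;DR: Presents a virtual kitchen remodeling application that works from a single panorama, using HDR panorama capture and scene radiance recovery to achieve high-quality scene relighting.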

Details

Motivation: To address the lack of realism in virtual kitchen remodeling by providing high-quality relighting. Method: The pipeline consists of HDR panoramic photography, automatic kitchen layout generation, and an editable rendering pipeline that supports scene material editing and relighting with global illumination. Result: The work contributes a Pano-Pano HDR dataset of 141 paired indoor and outdoor panoramas and a low-cost photometric calibration method for panoramic HDR photography. Conclusion: Through HDR capture and global illumination, the application achieves highly realistic virtual kitchen remodeling. Abstract: We present a novel virtual staging application for kitchen remodeling from a single panorama. To ensure the realism of the virtual rendered scene, we capture real-world High Dynamic Range (HDR) panoramas and recover the absolute scene radiance for high-quality scene relighting. Our application pipeline consists of three key components: (1) HDR photography for capturing paired indoor and outdoor panoramas, (2) automatic kitchen layout generation with new kitchen components, and (3) an editable rendering pipeline that flexibly edits scene materials and relights the new virtual scene with global illumination. Additionally, we contribute a novel Pano-Pano HDR dataset with 141 paired indoor and outdoor panoramas and present a low-cost photometric calibration method for panoramic HDR photography.

[67] HUG: Hierarchical Urban Gaussian Splatting with Block-Based Reconstruction

Zhongtao Wang,Mai Su,Huishan Au,Yilong Li,Xizhe Cao,Chengwei Pan,Yisong Chen,Guoping Wang

Main category: cs.GR

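TL;DR: HUG is a new method based on 3D Gaussian splatting that uses a hierarchical neural Gaussian representation to optimize data partitioning and the reconstruction pipeline, enabling efficient rendering of large-scale urban scenes.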

Details

Motivation: As urban 3D scenes grow more complex and demand for high-quality rendering increases, more efficient scene reconstruction and rendering techniques are needed. Method: HUG uses a hierarchical neural Gaussian representation and an enhanced block-based reconstruction pipeline that reduces redundant training regions and improves reconstruction quality. Result: The method achieves state-of-the-art results on public benchmarks, demonstrating its effectiveness for large-scale urban scene representation. Conclusion: HUG delivers high-quality scene rendering at low computational cost and is suited to complex urban scenes. Abstract: As urban 3D scenes become increasingly complex and the demand for high-quality rendering grows, efficient scene reconstruction and rendering techniques become crucial. We present HUG, a novel approach to address inefficiencies in handling large-scale urban environments and intricate details based on 3D Gaussian splatting. Our method optimizes data partitioning and the reconstruction pipeline by incorporating a hierarchical neural Gaussian representation. We employ an enhanced block-based reconstruction pipeline focusing on improving reconstruction quality within each block and reducing the need for redundant training regions around block boundaries. By integrating neural Gaussian representation with a hierarchical architecture, we achieve high-quality scene rendering at a low computational cost. This is demonstrated by our state-of-the-art results on public benchmarks, which prove the effectiveness and advantages in large-scale urban scene representation.

cs.CL [Back]

[68] FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking

Jabez Magomere,Elena Kochkina,Samuel Mensah,Simerjot Kaur,Charese H. Smiley

Main category: cs.CL

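TL;DR: FinNLI is a benchmark dataset for financial natural language inference containing 21,304 premise-hypothesis pairs, with a test set annotated by finance experts. Evaluations show that domain shift significantly degrades general-domain NLI performance and that current LLMs struggle with financial reasoning.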

Details

Motivation: To provide a diverse natural language inference benchmark for financial texts (such as SEC filings, annual reports, and earnings call transcripts) while minimizing spurious correlations. Method: The authors build the FinNLI dataset of 21,304 premise-hypothesis pairs, with a test set annotated by finance experts, and evaluate PLMs and LLMs on it. Result: Domain shift significantly hurts performance; the highest Macro F1 scores for PLM and LLM baselines are 74.57% and 78.62%, respectively, and instruction-tuned financial LLMs perform poorly. Conclusion: FinNLI exposes the limitations of current LLMs in financial reasoning and shows room for improvement. Abstract: We introduce FinNLI, a benchmark dataset for Financial Natural Language Inference (FinNLI) across diverse financial texts like SEC Filings, Annual Reports, and Earnings Call transcripts. Our dataset framework ensures diverse premise-hypothesis pairs while minimizing spurious correlations. FinNLI comprises 21,304 pairs, including a high-quality test set of 3,304 instances annotated by finance experts. Evaluations show that domain shift significantly degrades general-domain NLI performance. The highest Macro F1 scores for pre-trained (PLMs) and large language models (LLMs) baselines are 74.57% and 78.62%, respectively, highlighting the dataset's difficulty. Surprisingly, instruction-tuned financial LLMs perform poorly, suggesting limited generalizability. FinNLI exposes weaknesses in current LLMs for financial reasoning, indicating room for improvement.

[69] The Language of Attachment: Modeling Attachment Dynamics in Psychotherapy

Frederik Bredgaard,Martin Lund Trinhammer,Elisa Bassignana

Main category: cs.CL

TL;DR: The paper explores using NLP to identify patient attachment style automatically, with the goal of more personalized psychotherapy.

Details

Motivation: Attachment style is currently assessed with a complex, resource-intensive manual method (PACS), which limits wide adoption. Method: NLP classification models are used to assess patient attachment style automatically from psychotherapy transcripts. Result: The authors analyze the results and the implications of automated tools, such as the negative effect that particular misclassifications can have on therapy outcomes. Conclusion: The study opens a path toward more personalized psychotherapy and toward using NLP to study the mechanisms of psychotherapy. Abstract: The delivery of mental healthcare through psychotherapy stands to benefit immensely from developments within Natural Language Processing (NLP), in particular through the automatic identification of patient specific qualities, such as attachment style. Currently, the assessment of attachment style is performed manually using the Patient Attachment Coding System (PACS; Talia et al., 2017), which is complex, resource-consuming and requires extensive training. To enable wide and scalable adoption of attachment informed treatment and research, we propose the first exploratory analysis into automatically assessing patient attachment style from psychotherapy transcripts using NLP classification models. We further analyze the results and discuss the implications of using automated tools for this purpose -- e.g., confusing 'preoccupied' patients with 'avoidant' likely has a more negative impact on therapy outcomes with respect to other mislabeling. Our work opens an avenue of research enabling more personalized psychotherapy and more targeted research into the mechanisms of psychotherapy through advancements in NLP.

[70] The Paradox of Poetic Intent in Back-Translation: Evaluating the Quality of Large Language Models in Chinese Translation

Li Weigang,Pedro Carvalho Brom

Main category: cs.CL

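TL;DR: Using a diverse corpus and the BT-Fried evaluation system, the study compares six large language models and three traditional translation tools on Chinese-English translation, finds that LLMs fall short on cultural retention and literary translation, and proposes a modified BLEU variant.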

Details

Motivation: To address the challenges large language models face in preserving poetic intent, cultural heritage, and specialized terminology in Chinese-English translation. Method: The authors build a diverse corpus and use BT-Fried, an evaluation system based on back-translation and the Friedman test, to assess BLEU, CHRF, TER, and semantic similarity. Result: Scientific abstracts often benefit from back-translation, but traditional tools outperform LLMs on linguistically distinct texts and LLMs struggle with cultural and literary retention; some models exhibit verbatim back-translation; a modified BLEU variant is proposed. Conclusion: The study contributes to the empirical evaluation of Chinese NLP performance and deepens understanding of cultural fidelity in AI-mediated translation. Abstract: The rapid advancement of large language models (LLMs) has reshaped the landscape of machine translation, yet challenges persist in preserving poetic intent, cultural heritage, and handling specialized terminology in Chinese-English translation. This study constructs a diverse corpus encompassing Chinese scientific terminology, historical translation paradoxes, and literary metaphors. Utilizing a back-translation and Friedman test-based evaluation system (BT-Fried), we evaluate BLEU, CHRF, TER, and semantic similarity metrics across six major LLMs (e.g., GPT-4.5, DeepSeek V3) and three traditional translation tools. Key findings include: (1) Scientific abstracts often benefit from back-translation, while traditional tools outperform LLMs in linguistically distinct texts; (2) LLMs struggle with cultural and literary retention, exemplifying the "paradox of poetic intent"; (3) Some models exhibit "verbatim back-translation", reflecting emergent memory behavior; (4) A novel BLEU variant using Jieba segmentation and n-gram weighting is proposed. The study contributes to the empirical evaluation of Chinese NLP performance and advances understanding of cultural fidelity in AI-mediated translation.

[71] Capturing Symmetry and Antisymmetry in Language Models through Symmetry-Aware Training Objectives

Zhangdie Yuan,Andreas Vlachos

Main category: cs.CL

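TL;DR: The paper introduces a Wikidata-derived natural language inference dataset for evaluating how well large language models (LLMs) understand symmetric and antisymmetric relations. LLMs perform close to random chance; retraining the encoder with contrastive learning and k-nearest neighbors matches fine-tuned classification heads while improving few-shot learning and resistance to catastrophic forgetting.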

Details

Motivation: To evaluate and improve how well large language models understand symmetric and antisymmetric relations, closing a gap in their relational understanding. Method: The authors introduce a Wikidata-derived natural language inference dataset and retrain the encoder via contrastive learning with k-nearest neighbors. Result: LLMs perform close to random chance on the benchmark; the retrained encoder matches fine-tuned classification heads and is better at few-shot learning and at mitigating catastrophic forgetting. Conclusion: Encoder retraining with contrastive learning and k-nearest neighbors improves relational understanding while being more efficient and robust. Abstract: Capturing symmetric (e.g., country borders another country) and antisymmetric (e.g., parent_of) relations is crucial for a variety of applications. This paper tackles this challenge by introducing a novel Wikidata-derived natural language inference dataset designed to evaluate large language models (LLMs). Our findings reveal that LLMs perform comparably to random chance on this benchmark, highlighting a gap in relational understanding. To address this, we explore encoder retraining via contrastive learning with k-nearest neighbors. The retrained encoder matches the performance of fine-tuned classification heads while offering additional benefits, including greater efficiency in few-shot learning and improved mitigation of catastrophic forgetting.

[72] Transformer-Based Extraction of Statutory Definitions from the U.S. Code

Arpana Hosabettu,Harsh Shah

Main category: cs.CL

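TL;DR: The paper presents a transformer-based NLP system that automatically extracts legal definitions, defined terms, and their scope from the United States Code (U.S.C.), substantially improving extraction accuracy.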

Details

Motivation: Improve the comprehension and clarity of complex legal texts such as the U.S. Code by addressing the challenges of automatically identifying legal definitions and extracting defined terms and their scope. Method: A domain-specific Transformer model (Legal-BERT) in a multi-stage pipeline that combines document structure analysis with language models and extracts definitions through attention mechanisms and rule-based patterns. Result: Evaluated on multiple titles of the U.S. Code, the best model reaches 96.8% precision and 98.9% recall (98.2% F1-score), substantially outperforming traditional machine learning methods. Conclusion: The system improves the accessibility and understanding of legal information and lays a foundation for downstream legal reasoning tasks. Abstract: Automatic extraction of definitions from legal texts is critical for enhancing the comprehension and clarity of complex legal corpora such as the United States Code (U.S.C.). We present an advanced NLP system leveraging transformer-based architectures to automatically extract defined terms, their definitions, and their scope from the U.S.C. We address the challenges of automatically identifying legal definitions, extracting defined terms, and determining their scope within this complex corpus of over 200,000 pages of federal statutory law. Building upon previous feature-based machine learning methods, our updated model employs domain-specific transformers (Legal-BERT) fine-tuned specifically for statutory texts, significantly improving extraction accuracy. Our work implements a multi-stage pipeline that combines document structure analysis with state-of-the-art language models to process legal text from the XML version of the U.S. Code. Each paragraph is first classified using a fine-tuned legal domain BERT model to determine if it contains a definition. Our system then aggregates related paragraphs into coherent definitional units and applies a combination of attention mechanisms and rule-based patterns to extract defined terms and their jurisdictional scope. The definition extraction system is evaluated on multiple titles of the U.S. Code containing thousands of definitions, demonstrating significant improvements over previous approaches. Our best model achieves 96.8% precision and 98.9% recall (98.2% F1-score), substantially outperforming traditional machine learning classifiers. This work contributes to improving accessibility and understanding of legal information while establishing a foundation for downstream legal reasoning tasks.
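
The rule-based part of the extraction step can be illustrated with a small sketch. The abstract does not give the system's actual patterns, so the two regular expressions and the `extract_definition` helper below are hypothetical; they only cover the common statutory phrasing "the term "X" means ..." and a scope phrase such as "In this chapter".

```python
import re

# Hypothetical patterns for illustration only; the paper's real rules are not
# described in the abstract.
DEFINITION_PATTERN = re.compile(
    r'the term ["“](?P<term>[^"”]+)["”] (?:means|includes) (?P<definition>[^.;]+)',
    re.IGNORECASE,
)
SCOPE_PATTERN = re.compile(
    r"(?:in|as used in|for purposes of) this "
    r"(?P<scope>title|subtitle|chapter|subchapter|part|section)",
    re.IGNORECASE,
)


def extract_definition(paragraph):
    """Return (term, definition, scope) for the first definition found, or None."""
    match = DEFINITION_PATTERN.search(paragraph)
    if match is None:
        return None
    scope_match = SCOPE_PATTERN.search(paragraph)
    scope = scope_match.group("scope").lower() if scope_match else None
    return match.group("term"), match.group("definition").strip(), scope


example = 'In this chapter, the term "vessel" means every description of watercraft.'
print(extract_definition(example))
```

In the pipeline described above, patterns like these would presumably run only on paragraphs that the classifier has already flagged as definitional, which keeps the rules simple.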

[73] Text-to-TrajVis: Enabling Trajectory Data Visualizations from Natural Language Questions

Tian Bai,Huiyan Ying,Kailong Suo,Junqiu Wei,Tao Fan,Yuanfeng Song

Main category: cs.CL

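TL;DR: This paper proposes the Text-to-TrajVis task, which turns natural language questions into trajectory data visualizations, and builds TrajVL, the first large-scale dataset for the task.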

Details

Motivation: Trajectory visualization systems lack natural language interfaces, and no relevant dataset exists. Method: Design a Trajectory Visualization Language (TVL), build a dataset by combining LLMs with human effort, and evaluate multiple LLMs. Result: The resulting TrajVL dataset contains 18,140 (question, TVL) pairs, and the evaluation shows the task is feasible but challenging. Conclusion: The Text-to-TrajVis task merits further study and opens a new direction for combining natural language with trajectory visualization. Abstract: This paper introduces the Text-to-TrajVis task, which aims to transform natural language questions into trajectory data visualizations, facilitating the development of natural language interfaces for trajectory visualization systems. As this is a novel task, there is currently no relevant dataset available in the community. To address this gap, we first devised a new visualization language called Trajectory Visualization Language (TVL) to facilitate querying trajectory data and generating visualizations. Building on this foundation, we further proposed a dataset construction method that integrates Large Language Models (LLMs) with human efforts to create high-quality data. Specifically, we first generate TVLs using a comprehensive and systematic process, and then label each TVL with corresponding natural language questions using LLMs. This process results in the creation of the first large-scale Text-to-TrajVis dataset, named TrajVL, which contains 18,140 (question, TVL) pairs. Based on this dataset, we systematically evaluated the performance of multiple LLMs (GPT, Qwen, Llama, etc.) on this task. The experimental results demonstrate that this task is both feasible and highly challenging and merits further exploration within the research community.

[74] SplitReason: Learning To Offload Reasoning

Yash Akhauri,Anthony Fei,Chi-Chih Chang,Ahmed F. AbouElhamayed,Yueying Li,Mohamed S. Abdelfattah

Main category: cs.CL

TL;DR: The paper proposes offloading the most difficult parts of the reasoning process to a larger model, improving both efficiency and accuracy.

Details

Motivation: Large language models (LLMs) generate long sequences on reasoning tasks, which is inefficient, yet not all parts of the reasoning are equally hard to generate. Method: Annotate the difficult segments in 18k reasoning traces and train a small 1.5B-parameter model to identify them and trigger offloading, using supervised and reinforcement learning fine-tuning. Result: AIME24 reasoning accuracy improves by 24% and 28.3% while only 1.35% and 5% of the generated tokens are offloaded, respectively. Conclusion: SplitReason improves reasoning efficiency and accuracy; the model, data, and code are open-sourced. Abstract: Reasoning in large language models (LLMs) tends to produce substantially longer token generation sequences than simpler language modeling tasks. This extended generation length reflects the multi-step, compositional nature of reasoning and is often correlated with higher solution accuracy. From an efficiency perspective, longer token generation exacerbates the inherently sequential and memory-bound decoding phase of LLMs. However, not all parts of this expensive reasoning process are equally difficult to generate. We leverage this observation by offloading only the most challenging parts of the reasoning process to a larger, more capable model, while performing most of the generation with a smaller, more efficient model; furthermore, we teach the smaller model to identify these difficult segments and independently trigger offloading when needed. To enable this behavior, we annotate difficult segments across 18k reasoning traces from the OpenR1-Math-220k chain-of-thought (CoT) dataset. We then apply supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) to a 1.5B-parameter reasoning model, training it to learn to offload the most challenging parts of its own reasoning process to a larger model. This approach improves AIME24 reasoning accuracy by 24% and 28.3% while offloading 1.35% and 5% of the generated tokens respectively. We open-source our SplitReason model, data, code and logs.

[75] ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs

Fahmida Liza Piya,Rahmatollah Beheshti

Main category: cs.CL

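TL;DR: The paper proposes ConTextual, a new framework for clinical text summarization that combines context-preserving token filtering with a domain-specific knowledge graph, significantly improving linguistic coherence and clinical accuracy.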

Details

Motivation: Unstructured clinical data is a rich source of information, but existing methods fail to extract the key information or overlook subtle clinical cues, which hurts decision quality. Method: The ConTextual framework integrates context-preserving token filtering with a domain knowledge graph to augment context. Result: ConTextual outperforms other baselines on two public benchmark datasets. Conclusion: ConTextual shows that token-level filtering and structured retrieval are complementary and offers a scalable, high-precision solution for clinical text generation. Abstract: Unstructured clinical data can serve as a unique and rich source of information that can meaningfully inform clinical practice. Extracting the most pertinent context from such data is critical for exploiting its true potential toward optimal and timely decision-making in patient care. While prior research has explored various methods for clinical text summarization, most prior studies either process all input tokens uniformly or rely on heuristic-based filters, which can overlook nuanced clinical cues and fail to prioritize information critical for decision-making. In this study, we propose ConTextual, a novel framework that integrates a Context-Preserving Token Filtering method with a Domain-Specific Knowledge Graph (KG) for contextual augmentation. By preserving context-specific important tokens and enriching them with structured knowledge, ConTextual improves both linguistic coherence and clinical fidelity. Our extensive empirical evaluations on two public benchmark datasets demonstrate that ConTextual consistently outperforms other baselines. Our proposed approach highlights the complementary role of token-level filtering and structured retrieval in enhancing both linguistic and clinical integrity, as well as offering a scalable solution for improving precision in clinical text generation.

[76] Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation

Jiahao Yuan,Xingzhe Sun,Xing Yu,Jingwen Wang,Dehui Du,Zhiqing Cui,Zixiang Di

Main category: cs.CL

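TL;DR: The paper presents Less is More, a method for improving LLM performance on structured reasoning tasks under low-resource conditions, which placed third in the XLLM@ACL2025 shared task.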

Details

Motivation: The task challenges LLMs to generate interpretable, step-by-step reasoning under low-resource conditions, using only 24 labeled examples. Method: A multi-agent framework that combines reverse-prompt induction, retrieval-augmented reasoning synthesis, and dual-stage reward-guided filtering, with fine-tuning of Meta-Llama-3-8B-Instruct. Result: Structure validation and reward filtering significantly improve the quality of structured reasoning. Conclusion: Controllable data distillation is valuable for structured reasoning under low-resource conditions. Abstract: The XLLM@ACL2025 Shared Task-III formulates a low-resource structural reasoning task that challenges LLMs to generate interpretable, step-by-step rationales with minimal labeled data. We present Less is More, the third-place winning approach in the XLLM@ACL2025 Shared Task-III, which focuses on structured reasoning from only 24 labeled examples. Our approach leverages a multi-agent framework with reverse-prompt induction, retrieval-augmented reasoning synthesis via GPT-4o, and dual-stage reward-guided filtering to distill high-quality supervision across three subtasks: question parsing, CoT parsing, and step-level verification. All modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+ setup. By combining structure validation with reward filtering across few-shot and zero-shot prompts, our pipeline consistently improves structure reasoning quality. These results underscore the value of controllable data distillation in enhancing structured inference under low-resource constraints. Our code is available at https://github.com/Jiahao-Yuan/Less-is-More.

[77] Out-of-the-Box Conditional Text Embeddings from Large Language Models

Kosuke Yamada,Peinan Zhang

Main category: cs.CL

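TL;DR: PonTE is an unsupervised conditional text embedding method that uses a causal large language model and a conditional prompt to generate text embeddings, performing close to supervised methods without fine-tuning.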

Details

Motivation: Traditional methods depend on large amounts of training data and costly fine-tuning. Method: Generate conditional text embeddings with a causal large language model and a conditional prompt. Result: Performance on conditional semantic text similarity and text clustering is close to that of supervised methods, and the embeddings are interpretable. Conclusion: PonTE offers an efficient and interpretable solution for conditional text embeddings. Abstract: Conditional text embedding is a proposed representation that captures the shift in perspective on texts when conditioned on a specific aspect. Previous methods have relied on extensive training data for fine-tuning models, leading to challenges in terms of labor and resource costs. We propose PonTE, a novel unsupervised conditional text embedding method that leverages a causal large language model and a conditional prompt. Through experiments on conditional semantic text similarity and text clustering, we demonstrate that PonTE can generate useful conditional text embeddings and achieve performance comparable to supervised methods without fine-tuning. We also show the interpretability of text embeddings with PonTE by analyzing word generation following prompts and embedding visualization.

[78] Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study

Mohammad Khodadad,Ali Shiraee Kasmaee,Mahdi Astaraki,Nicholas Sherck,Hamidreza Mahyar,Soheila Samiee

Main category: cs.CL

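TL;DR: The study proposes a new benchmark for evaluating the compositional reasoning of large language models in the chemistry domain and develops an automated pipeline that generates knowledge graphs and multi-hop questions. Experiments show that even state-of-the-art models struggle with multi-hop reasoning and that document retrieval substantially improves performance.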

Details

Motivation: Evaluate the compositional reasoning of large language models in the chemistry domain and expose their limitations. Method: Combine OpenAI reasoning models with named entity recognition systems to extract chemical entities from the literature, build a knowledge graph, and generate multi-hop questions for evaluation. Result: Even state-of-the-art models perform poorly on multi-hop reasoning; document retrieval substantially improves performance but does not fully eliminate reasoning errors. Conclusion: The study exposes the limitations of current models and presents a new method for generating challenging reasoning datasets, advancing the understanding of reasoning in computational linguistics. Abstract: In this study, we introduced a new benchmark consisting of a curated dataset and a defined evaluation process to assess the compositional reasoning capabilities of large language models within the chemistry domain. We designed and validated a fully automated pipeline, verified by subject matter experts, to facilitate this task. Our approach integrates OpenAI reasoning models with named entity recognition (NER) systems to extract chemical entities from recent literature, which are then augmented with external knowledge bases to form a comprehensive knowledge graph. By generating multi-hop questions across these graphs, we assess LLM performance in both context-augmented and non-context augmented settings. Our experiments reveal that even state-of-the-art models face significant challenges in multi-hop compositional reasoning. The results reflect the importance of augmenting LLMs with document retrieval, which can have a substantial impact on improving their performance. However, even perfect retrieval accuracy with full context does not eliminate reasoning errors, underscoring the complexity of compositional reasoning. This work not only benchmarks and highlights the limitations of current LLMs but also presents a novel data generation pipeline capable of producing challenging reasoning datasets across various domains. Overall, this research advances our understanding of reasoning in computational linguistics.

[79] Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Hanlei Zhang,Zhuohang Li,Yeshuang Zhu,Hua Xu,Peiwu Wang,Jinchao Zhang,Jie Zhou,Haige Zhu

Main category: cs.CL

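TL;DR: The paper introduces MMLA, a benchmark for evaluating how well multimodal large language models (MLLMs) understand cognitive-level semantics across six core dimensions; experiments show that current models reach only 60%~70% accuracy.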

Details

Motivation: Multimodal language analysis lacks research on the ability of MLLMs to understand cognitive-level semantics, and MMLA aims to fill this gap. Method: Build the MMLA benchmark with 61K multimodal utterances and evaluate eight mainstream branches of LLMs and MLLMs using zero-shot inference, supervised fine-tuning, and instruction tuning. Result: Even fine-tuned models reach only 60%~70% accuracy, revealing the limitations of current MLLMs in understanding complex human language. Conclusion: MMLA provides a foundation for exploring the potential of large language models in multimodal language analysis and helps advance the field. Abstract: Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.

[80] EMRModel: A Large Language Model for Extracting Medical Consultation Dialogues into Structured Medical Records

Shuguang Zhao,Qiangzhong Feng,Zhiyang He,Peipei Sun,Yingying Wang,Xiaodong Tao,Xiaoliang Lu,Mei Cheng,Xinyue Wu,Yanyan Wang,Wei Liang

Main category: cs.CL

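TL;DR: EMRModel combines LoRA fine-tuning with code-style prompt design to efficiently convert medical consultation dialogues into structured electronic medical records, significantly outperforming traditional methods.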

Details

Motivation: The unstructured nature of medical consultation dialogues limits their effective use in diagnosis and treatment, and traditional methods struggle to capture deep semantics. Method: Integrate LoRA fine-tuning with code-style prompt design, build a high-quality annotated dataset, and propose a fine-grained evaluation benchmark. Result: EMRModel reaches an F1 score of 88.1%, an improvement of 49.5% over standard pre-trained models. Conclusion: EMRModel performs strongly on structured medical record extraction and advances the optimization of medical NLP models. Abstract: Medical consultation dialogues contain critical clinical information, yet their unstructured nature hinders effective utilization in diagnosis and treatment. Traditional methods, relying on rule-based or shallow machine learning techniques, struggle to capture deep and implicit semantics. Recently, large pre-trained language models and Low-Rank Adaptation (LoRA), a lightweight fine-tuning method, have shown promise for structured information extraction. We propose EMRModel, a novel approach that integrates LoRA-based fine-tuning with code-style prompt design, aiming to efficiently convert medical consultation dialogues into structured electronic medical records (EMRs). Additionally, we construct a high-quality, realistically grounded dataset of medical consultation dialogues with detailed annotations. Furthermore, we introduce a fine-grained evaluation benchmark for medical consultation information extraction and provide a systematic evaluation methodology, advancing the optimization of medical natural language processing (NLP) models. Experimental results show EMRModel achieves an F1 score of 88.1%, improving by 49.5% over standard pre-trained models. Compared to traditional LoRA fine-tuning methods, our model shows superior performance, highlighting its effectiveness in structured medical record extraction tasks.

[81] T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning

Vignesh Ethiraj,Sidhanth Menon,Divya Vijay

Main category: cs.CL

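TL;DR: T-VEC is an embedding model designed for the telecom domain; deep fine-tuning improves its capture of telecom-specific semantics and significantly boosts task performance.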

Details

Motivation: The specialized vocabulary and complex concepts of the telecom industry challenge general-purpose NLP models, so a domain-specific embedding model is needed. Method: Starting from the gte-Qwen2-1.5B-instruct model, apply deep fine-tuning with a triplet loss objective on a telecom dataset and develop a dedicated tokenizer. Result: T-VEC achieves strong results on the MTEB score (0.825) and on a telecom-specific evaluation (0.9380). Conclusion: T-VEC gives the telecom AI field a powerful open-source tool and supports innovation in the industry. Abstract: The specialized vocabulary and complex concepts of the telecommunications industry present significant challenges for standard Natural Language Processing models. Generic text embeddings often fail to capture telecom-specific semantics, hindering downstream task performance. We introduce T-VEC (Telecom Vectorization Model), a novel embedding model tailored for the telecom domain through deep fine-tuning. Developed by NetoAI, T-VEC is created by adapting the state-of-the-art gte-Qwen2-1.5B-instruct model using a triplet loss objective on a meticulously curated, large-scale dataset of telecom-specific data. Crucially, this process involved substantial modification of weights across 338 layers of the base model, ensuring deep integration of domain knowledge, far exceeding superficial adaptation techniques. We quantify this deep change via weight difference analysis. A key contribution is the development and open-sourcing (MIT License) of the first dedicated telecom-specific tokenizer, enhancing the handling of industry jargon. T-VEC achieves a leading average MTEB score (0.825) compared to established models and demonstrates vastly superior performance (0.9380 vs. less than 0.07) on our internal telecom-specific triplet evaluation benchmark, indicating an exceptional grasp of domain-specific nuances, visually confirmed by improved embedding separation. This work positions NetoAI at the forefront of telecom AI innovation, providing the community with a powerful, deeply adapted, open-source tool.
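
For readers unfamiliar with the triplet loss objective mentioned above, here is a minimal sketch of the standard hinge-style formulation. The abstract does not state which distance function or margin T-VEC uses, so the cosine distance, the `margin=0.2` default, and the toy vectors are assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss; the distance and margin are assumed, not from the paper.

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D embeddings standing in for (anchor, related, unrelated) telecom texts.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negative = [0.0, 1.0]

print(triplet_loss(anchor, positive, negative))  # well-separated triplet
print(triplet_loss(anchor, negative, positive))  # swapped roles give a positive loss
```

Training with this objective pulls related texts together and pushes unrelated ones apart, which matches the improved embedding separation the abstract reports.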

[82] QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining

Fengze Liu,Weidong Zhou,Binbin Liu,Zhimiao Yu,Yifan Zhang,Haobin Lin,Yifeng Yu,Xiaohuan Zhou,Taifeng Wang,Yong Cao

Main category: cs.CL

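TL;DR: QuaDMix is a unified data selection framework that optimizes the data distribution for LLM pretraining while balancing quality and diversity, yielding an average performance improvement of 7.2%.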

Details

Motivation: Existing studies usually optimize data quality and diversity separately and overlook the trade-off between them, which calls for joint consideration. Method: Propose multiple criteria to measure data quality, measure diversity through domain classification, optimize the data distribution with a parameterized sampling function, and speed up the parameter search with simulated experiments on small models. Result: QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks and outperforms strategies that optimize each factor independently. Conclusion: QuaDMix demonstrates both the necessity and the feasibility of balancing data quality and diversity. Abstract: Quality and diversity are two critical metrics for the training data of large language models (LLMs), positively impacting performance. Existing studies often optimize these metrics separately, typically by first applying quality filtering and then adjusting data proportions. However, these approaches overlook the inherent trade-off between quality and diversity, necessitating their joint consideration. Given a fixed training quota, it is essential to evaluate both the quality of each data point and its complementary effect on the overall dataset. In this paper, we introduce a unified data selection framework called QuaDMix, which automatically optimizes the data distribution for LLM pretraining while balancing both quality and diversity. Specifically, we first propose multiple criteria to measure data quality and employ domain classification to distinguish data points, thereby measuring overall diversity. QuaDMix then employs a unified parameterized data sampling function that determines the sampling probability of each data point based on these quality and diversity related labels. To accelerate the search for the optimal parameters involved in the QuaDMix framework, we conduct simulated experiments on smaller models and use LightGBM for parameters searching, inspired by the RegMix method. Our experiments across diverse models and datasets demonstrate that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks. These results outperform the independent strategies for quality and diversity, highlighting the necessity and ability to balance data quality and diversity.

[83] Transformers for Complex Query Answering over Knowledge Hypergraphs

Hong Ting Tsang,Zihao Wang,Yangqiu Song

Main category: cs.CL

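TL;DR: The paper proposes a method for complex query answering over knowledge hypergraphs (KHGs): a two-stage Transformer model (LKHGT) that handles the different logical operations, validated on new datasets.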

Details

Motivation: Real-world data is complex; traditional triple knowledge graphs (KGs) have limited expressive power, and existing hyper-relational graphs fall short when representing relations of varying arity. Method: Propose the LKHGT model, which consists of a Projection Encoder and a Logical Encoder, both equipped with Type Aware Bias (TAB) to capture interactions. Result: Experiments show that LKHGT achieves the best complex query answering performance over KHGs and generalizes to out-of-distribution query types. Conclusion: LKHGT fills the gap in complex query answering over knowledge hypergraphs and delivers superior performance. Abstract: Complex Query Answering (CQA) has been extensively studied in recent years. In order to model data that is closer to real-world distribution, knowledge graphs with different modalities have been introduced. Triple KGs, as the classic KGs composed of entities and relations of arity 2, have limited representation of real-world facts. Real-world data is more sophisticated. While hyper-relational graphs have been introduced, there are limitations in representing relationships of varying arity that contain entities with equal contributions. To address this gap, we sampled new CQA datasets: JF17k-HCQA and M-FB15k-HCQA. Each dataset contains various query types that include logical operations such as projection, negation, conjunction, and disjunction. In order to answer knowledge hypergraph (KHG) existential first-order queries, we propose a two-stage transformer model, the Logical Knowledge Hypergraph Transformer (LKHGT), which consists of a Projection Encoder for atomic projection and a Logical Encoder for complex logical operations. Both encoders are equipped with Type Aware Bias (TAB) for capturing token interactions. Experimental results on CQA datasets show that LKHGT is a state-of-the-art CQA method over KHG and is able to generalize to out-of-distribution query types.

[84] PIS: Linking Importance Sampling and Attention Mechanisms for Efficient Prompt Compression

Lizhe Chen,Binjia Zhou,Yuyao Ge,Jiayi Chen,Shiguang NI

Main category: cs.CL

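TL;DR: The paper proposes Prompt Importance Sampling (PIS), a new framework that compresses prompts by dynamically sampling important tokens, combining token-level and semantic-level compression to significantly improve compression performance.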

Details

Motivation: The high cost of large language models (LLMs) limits their wide adoption, and existing prompt compression methods ignore the intrinsic mechanisms of LLMs and lack a systematic evaluation of token importance. Method: PIS compresses prompts dynamically by analyzing the attention scores of hidden states; the token level uses a lightweight reinforcement learning network, and the semantic level uses a Russian roulette sampling strategy. Result: PIS achieves state-of-the-art compression performance on benchmarks across multiple domains and, unexpectedly, improves reasoning efficiency through optimized context structuring. Conclusion: PIS offers both theoretical grounding and practical efficiency for prompt engineering with LLMs and advances context management. Abstract: Large language models (LLMs) have achieved remarkable progress, demonstrating unprecedented capabilities across various natural language processing tasks. However, the high costs associated with such exceptional performance limit the widespread adoption of LLMs, highlighting the need for prompt compression. Existing prompt compression methods primarily rely on heuristic truncation or abstractive summarization techniques, which fundamentally overlook the intrinsic mechanisms of LLMs and lack a systematic evaluation of token importance for generation. In this work, we introduce Prompt Importance Sampling (PIS), a novel compression framework that dynamically compresses prompts by sampling important tokens based on the analysis of attention scores of hidden states. PIS employs a dual-level compression mechanism: 1) at the token level, we quantify saliency using LLM-native attention scores and implement adaptive compression through a lightweight 9-layer reinforcement learning (RL) network; 2) at the semantic level, we propose a Russian roulette sampling strategy for sentence-level importance sampling. Comprehensive evaluations across multiple domain benchmarks demonstrate that our method achieves state-of-the-art compression performance. Notably, our framework serendipitously enhances reasoning efficiency through optimized context structuring. This work advances prompt engineering by offering both theoretical grounding and practical efficiency in context management for LLMs.
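
The sentence-level Russian roulette idea can be sketched in a few lines. This is a generic illustration of Russian roulette sampling, in which each item survives with a probability tied to its importance; it is not the authors' implementation. The importance scores, the `floor` parameter, and the function name are hypothetical.

```python
import random


def russian_roulette_filter(sentences, importance, rng, floor=0.0):
    """Keep each sentence with probability equal to its clamped importance.

    `importance` holds scores in [0, 1]; `floor` is an assumed minimum survival
    probability so that no sentence is dropped with certainty unless floor is 0.
    """
    kept = []
    for sentence, score in zip(sentences, importance):
        survival = min(1.0, max(floor, score))
        if rng.random() < survival:  # rng.random() lies in [0, 1)
            kept.append(sentence)
    return kept


sentences = [
    "The patient reports chest pain.",
    "The weather was pleasant.",
    "Pain started two hours ago.",
]
importance = [1.0, 0.0, 1.0]  # toy scores; a real system would derive these from attention

rng = random.Random(0)  # seeded for reproducibility
print(russian_roulette_filter(sentences, importance, rng))
```

With intermediate scores the output varies from run to run, which is the point of the technique: low-importance sentences are usually dropped but still have a chance of surviving.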

[85] Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: A Pilot Study

Andy Li,Wei Zhou,Rashina Hoda,Chris Bain,Peter Poon

Main category: cs.CL

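TL;DR: The study compares large language models (LLMs) with traditional machine translation (MT) tools on translating medical consultation summaries, finding that traditional MT tools perform better while LLMs show promise on simpler texts.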

Details

Motivation: Assess the translation ability of LLMs and traditional MT tools in the medical domain to inform improvements in medical translation. Method: Use standard automated metrics to evaluate how well LLMs and traditional MT tools translate English medical summaries into Arabic, Chinese, and Vietnamese. Result: Traditional MT tools perform better, especially on complex texts; LLMs do well on simpler texts, particularly in Vietnamese and Chinese. Arabic translations improve as text complexity increases. Conclusion: LLMs offer contextual flexibility but are inconsistent; domain-specific training, better evaluation methods, and human oversight are needed. Abstract: This study evaluates how well large language models (LLMs) and traditional machine translation (MT) tools translate medical consultation summaries from English into Arabic, Chinese, and Vietnamese. It assesses both patient-friendly and clinician-focused texts using standard automated metrics. Results showed that traditional MT tools generally performed better, especially for complex texts, while LLMs showed promise, particularly in Vietnamese and Chinese, when translating simpler summaries. Arabic translations improved with complexity due to the language's morphology. Overall, while LLMs offer contextual flexibility, they remain inconsistent, and current evaluation metrics fail to capture clinical relevance. The study highlights the need for domain-specific training, improved evaluation methods, and human oversight in medical translation.

[86] Debunking with Dialogue? Exploring AI-Generated Counterspeech to Challenge Conspiracy Theories

Mareike Lisker,Christina Gottschalk,Helena Mihaljević

Main category: cs.CL

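TL;DR: The paper examines whether large language models (such as GPT-4o, Llama 3, and Mistral) can generate counterspeech against conspiracy theories and finds that the generated responses are generic, repetitive, and superficial, and prone to hallucinated facts.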

Details

Motivation: Expert-driven counterspeech is hard to scale; large language models may offer a solution, but counterspeech against conspiracy theories is under-researched. Method: Evaluate how GPT-4o, Llama 3, and Mistral perform at generating counterspeech when given structured prompts. Result: The generated responses are mostly generic, repetitive, and superficial, and the models often hallucinate facts or over-acknowledge fear. Conclusion: Prompt-based use of current large language models is problematic in practical applications and needs further improvement. Abstract: Counterspeech is a key strategy against harmful online content, but scaling expert-driven efforts is challenging. Large Language Models (LLMs) present a potential solution, though their use in countering conspiracy theories is under-researched. Unlike for hate speech, no datasets exist that pair conspiracy theory comments with expert-crafted counterspeech. We address this gap by evaluating the ability of GPT-4o, Llama 3, and Mistral to effectively apply counterspeech strategies derived from psychological research provided through structured prompts. Our results show that the models often generate generic, repetitive, or superficial results. Additionally, they over-acknowledge fear and frequently hallucinate facts, sources, or figures, making their prompt-based use in practical applications problematic.

[87] TIFIN India at SemEval-2025: Harnessing Translation to Overcome Multilingual IR Challenges in Fact-Checked Claim Retrieval

Prasanna Devadiga,Arya Suneesh,Pawan Kumar Rajpoot,Bharatdeep Hazarika,Aditya U Baliga

Main category: cs.CL

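TL;DR: The paper proposes a two-stage strategy for fact-checked claim retrieval in monolingual and crosslingual settings, combining a fine-tuned embedding model with an LLM reranker, and shows the benefit of LLM-based translation for crosslingual retrieval.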

Details

Motivation: Disinformation spreads globally, which makes retrieving previously fact-checked claims in monolingual and crosslingual settings an important challenge. Method: A two-stage strategy: 1) a fine-tuned embedding model serves as the baseline retrieval system; 2) an LLM reranker refines the results, with LLM-based translation handling crosslingual retrieval. Result: The integrated system achieves success@10 scores of 0.938 and 0.81025 on the monolingual and crosslingual test sets, respectively. Conclusion: LLM-based translation effectively overcomes the hurdles of crosslingual information retrieval, and the system is designed to run on a consumer GPU. Abstract: We address the challenge of retrieving previously fact-checked claims in monolingual and crosslingual settings - a critical task given the global prevalence of disinformation. Our approach follows a two-stage strategy: a reliable baseline retrieval system using a fine-tuned embedding model and an LLM-based reranker. Our key contribution is demonstrating how LLM-based translation can overcome the hurdles of multilingual information retrieval. Additionally, we focus on ensuring that the bulk of the pipeline can be replicated on a consumer GPU. Our final integrated system achieved a success@10 score of 0.938 and 0.81025 on the monolingual and crosslingual test sets, respectively.
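
The success@10 metric reported above can be sketched as follows, assuming the usual definition: a query counts as a success if at least one relevant fact-check appears among its top 10 retrieved results, and the score is the mean over queries. The function names and toy identifiers are illustrative and not taken from the task's evaluation code.

```python
def success_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1.0 if any relevant id appears in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def mean_success_at_k(runs, k=10):
    """Average success@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(success_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Two toy queries: the first retrieves a relevant fact-check, the second does not.
runs = [
    (["fc_12", "fc_7", "fc_3"], {"fc_3"}),
    (["fc_44", "fc_9"], {"fc_1"}),
]
print(mean_success_at_k(runs, k=10))
```

Because the metric only asks whether any relevant item is in the top k, a reranker helps mainly by promoting relevant items that the first-stage retriever placed just outside the cutoff.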

[88] A Post-trainer's Guide to Multilingual Training Data: Uncovering Cross-lingual Transfer Dynamics

Luisa Shimabucoro,Ahmet Ustun,Marzieh Fadaee,Sebastian Ruder

Main category: cs.CL

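TL;DR: The study examines the cross-lingual transfer dynamics of large language models fine-tuned on multilingual data, finds that performance depends on multiple interacting factors, and identifies the conditions for effective transfer.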

Details

Motivation: Understand the dynamics of cross-lingual transfer in order to improve fine-tuning of multilingual large language models. Method: Study cross-lingual transfer on three generative tasks (summarization, instruction following, and mathematical reasoning) using two model families (up to 35B parameters) trained on controlled mixtures of multilingual data. Result: Cross-lingual transfer and performance depend on a combination of factors and vary across fine-tuning settings. Conclusion: The study identifies the conditions that lead to effective cross-lingual transfer in practice. Abstract: In order for large language models to be useful across the globe, they are fine-tuned to follow instructions on multilingual data. Despite the ubiquity of such post-training, a clear understanding of the dynamics that enable cross-lingual transfer remains elusive. This study examines cross-lingual transfer (CLT) dynamics in realistic post-training settings. We study two model families of up to 35B parameters in size trained on carefully controlled mixtures of multilingual data on three generative tasks with varying levels of complexity (summarization, instruction following, and mathematical reasoning) in both single-task and multi-task instruction tuning settings. Overall, we find that the dynamics of cross-lingual transfer and multilingual performance cannot be explained by isolated variables, varying depending on the combination of post-training settings. Finally, we identify the conditions that lead to effective cross-lingual transfer in practice.

[89] HEMA : A Hippocampus-Inspired Extended Memory Architecture for Long-Context AI Conversations

Kwangseob Ahn

Main category: cs.CL

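TL;DR: HEMA is a dual-memory system inspired by human cognition; by combining Compact Memory and Vector Memory, it significantly improves the coherence and factual recall of large language models in long conversations.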

Details

Motivation: Large language models struggle to stay coherent in long conversations, and HEMA aims to solve this problem. Method: HEMA combines Compact Memory (a continuously updated one-sentence summary) with Vector Memory (a store of chunk embeddings queried by cosine similarity) and is integrated with a 6B-parameter transformer. Result: HEMA stays coherent beyond 300 turns; factual recall accuracy rises from 41% to 87%, and human-rated coherence rises from 2.7 to 4.3. Conclusion: By combining verbatim recall with semantic continuity, HEMA provides a practical solution for privacy-aware conversational AI that supports month-long dialogues without retraining. Abstract: Large language models (LLMs) struggle with maintaining coherence in extended conversations spanning hundreds of turns, despite performing well within their context windows. This paper introduces HEMA (Hippocampus-Inspired Extended Memory Architecture), a dual-memory system inspired by human cognitive processes. HEMA combines Compact Memory - a continuously updated one-sentence summary preserving global narrative coherence, and Vector Memory - an episodic store of chunk embeddings queried via cosine similarity. When integrated with a 6B-parameter transformer, HEMA maintains coherent dialogues beyond 300 turns while keeping prompt length under 3,500 tokens. Experimental results show substantial improvements: factual recall accuracy increases from 41% to 87%, and human-rated coherence improves from 2.7 to 4.3 on a 5-point scale. With 10K indexed chunks, Vector Memory achieves P@5 >= 0.80 and R@50 >= 0.74, doubling the area under the precision-recall curve compared to summarization-only approaches. Ablation studies reveal two key insights: semantic forgetting through age-weighted pruning reduces retrieval latency by 34% with minimal recall loss, and a two-level summary hierarchy prevents cascade errors in ultra-long conversations exceeding 1,000 turns. HEMA demonstrates that combining verbatim recall with semantic continuity provides a practical solution for privacy-aware conversational AI capable of month-long dialogues without model retraining.
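
The dual-memory idea can be illustrated with a small sketch: a running summary plus a store of chunk embeddings queried by cosine similarity. The `VectorMemory` class, the `build_prompt` helper, and the hand-written three-dimensional embeddings are hypothetical stand-ins; the paper's actual embedding model, prompt layout, and pruning logic are not given in the abstract.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class VectorMemory:
    """Toy episodic store: keeps (embedding, chunk) pairs and ranks them by similarity."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, chunk):
        self.entries.append((embedding, chunk))

    def query(self, embedding, k=2):
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(embedding, entry[0]),
            reverse=True,
        )
        return [chunk for _, chunk in ranked[:k]]


def build_prompt(compact_summary, retrieved_chunks, user_turn):
    """Assemble a short prompt from the summary, recalled chunks, and the new turn."""
    lines = ["Summary: " + compact_summary]
    lines += ["Recalled: " + chunk for chunk in retrieved_chunks]
    lines.append("User: " + user_turn)
    return "\n".join(lines)


memory = VectorMemory()
memory.add([1.0, 0.0, 0.0], "User's dog is named Biscuit.")
memory.add([0.0, 1.0, 0.0], "User works as a nurse.")
memory.add([0.0, 0.0, 1.0], "User prefers tea over coffee.")

query_embedding = [0.9, 0.1, 0.0]  # toy embedding of the question below
recalled = memory.query(query_embedding, k=1)
print(build_prompt("User chatted about pets and work.", recalled, "What is my dog's name?"))
```

Keeping only the summary and a handful of recalled chunks in the prompt is what lets the prompt length stay bounded while the conversation grows.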

[90] How Effective are Generative Large Language Models in Performing Requirements Classification?

Waad Alhoshan,Alessio Ferrari,Liping Zhao

Main category: cs.CL

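TL;DR: This paper examines how generative large language models (such as Bloom, Gemma, and Llama) perform on requirements classification and finds that prompt design and model architecture are key factors, while the effect of the dataset depends on task complexity.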

Details

Motivation: Study how generative LLMs perform on requirements classification, filling a gap in existing research. Method: Design more than 400 experiments that apply three generative LLMs to binary and multi-class requirements classification on three datasets. Result: Prompt design and model architecture affect performance across the board, while the effect of the dataset varies with task complexity. Conclusion: Future work should optimize prompt structures and model architectures to improve requirements classification performance. Abstract: In recent years, transformer-based large language models (LLMs) have revolutionised natural language processing (NLP), with generative models opening new possibilities for tasks that require context-aware text generation. Requirements engineering (RE) has also seen a surge in the experimentation of LLMs for different tasks, including trace-link detection, regulatory compliance, and others. Requirements classification is a common task in RE. While non-generative LLMs like BERT have been successfully applied to this task, there has been limited exploration of generative LLMs. This gap raises an important question: how well can generative LLMs, which produce context-aware outputs, perform in requirements classification? In this study, we explore the effectiveness of three generative LLMs (Bloom, Gemma, and Llama) in performing both binary and multi-class requirements classification. We design an extensive experimental study involving over 400 experiments across three widely used datasets (PROMISE NFR, Functional-Quality, and SecReq). Our study concludes that while factors like prompt design and LLM architecture are universally important, others, such as dataset variations, have a more situational impact, depending on the complexity of the classification task. This insight can guide future model development and deployment strategies, focusing on optimising prompt structures and aligning model architectures with task-specific needs for improved performance.

[91] Evaluation Framework for AI Systems in "the Wild"

Sarah Jabbour,Trenton Chang,Anindya Das Antar,Joseph Peper,Insu Jang,Jiachen Liu,Jae-Won Chung,Shiqi He,Michael Wellman,Bryan Goodman,Elizabeth Bondi-Kelly,Kevin Samy,Rada Mihalcea,Mosharaf Chowhury,David Jurgens,Lu Wang

Main category: cs.CL

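TL;DR: This paper proposes a comprehensive framework for evaluating generative AI (GenAI) systems that emphasizes dynamic, diverse inputs and ongoing assessment to address the shortcomings of traditional evaluation in real-world use.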

Details

Motivation: Current GenAI evaluation relies on fixed datasets and benchmarks and fails to reflect real-world performance, leaving a gap between lab results and practical applications. Method: Propose a dynamic, comprehensive evaluation framework that combines diverse inputs with ongoing assessment, integrates performance, fairness, and ethics, and uses transparent evaluation that combines human and automated assessments. Result: The framework gives practitioners guidance on designing evaluation methods that reflect real-time capabilities and gives policymakers recommendations for GenAI policies focused on societal impacts. Conclusion: Implementing the framework helps ensure that GenAI models are not only technically proficient but also ethically responsible and impactful. Abstract: Generative AI (GenAI) models have become vital across industries, yet current evaluation methods have not adapted to their widespread use. Traditional evaluations often rely on benchmarks and fixed datasets, frequently failing to reflect real-world performance, which creates a gap between lab-tested outcomes and practical applications. This white paper proposes a comprehensive framework for how we should evaluate real-world GenAI systems, emphasizing diverse, evolving inputs and holistic, dynamic, and ongoing assessment approaches. The paper offers guidance for practitioners on how to design evaluation methods that accurately reflect real-time capabilities, and provides policymakers with recommendations for crafting GenAI policies focused on societal impacts, rather than fixed performance numbers or parameter sizes. We advocate for holistic frameworks that integrate performance, fairness, and ethics and the use of continuous, outcome-oriented methods that combine human and automated assessments while also being transparent to foster trust among stakeholders. Implementing these strategies ensures GenAI models are not only technically proficient but also ethically responsible and impactful.

[92] MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores

Fengwei Zhou,Jiafei Song,Wenjin Jason Li,Gengjian Xue,Zhikang Zhao,Yichao Lu,Bailin Na

Main category: cs.CL

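TL;DR: MOOSComp is a token-classification-based long-context compression method that improves a BERT-based compressor by mitigating over-smoothing and incorporating outlier scores, substantially reducing inference time and resource consumption in resource-constrained environments.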

Details

Motivation: Large language models face increased inference time and resource consumption when processing long-context input, particularly in resource-constrained environments. Method: Proposes MOOSComp, which adds an inter-class cosine similarity loss term to improve token classification accuracy and introduces outlier scores to preserve critical tokens. Result: Across compression ratios, MOOSComp performs strongly on long-context understanding and reasoning tasks and achieves a 3.3x speedup at a 4x compression ratio on a resource-constrained mobile device. Conclusion: By improving token classification and preserving critical tokens, MOOSComp markedly improves the performance and efficiency of long-context compression. Abstract: Recent advances in large language models have significantly improved their ability to process long-context input, but practical applications are challenged by increased inference time and resource consumption, particularly in resource-constrained environments. To address these challenges, we propose MOOSComp, a token-classification-based long-context compression method that enhances the performance of a BERT-based compressor by mitigating the over-smoothing problem and incorporating outlier scores. In the training phase, we add an inter-class cosine similarity loss term to penalize excessively similar token representations, thereby improving the token classification accuracy. During the compression phase, we introduce outlier scores to preserve rare but critical tokens that are prone to be discarded in task-agnostic compression. These scores are integrated with the classifier's output, making the compressor more generalizable to various tasks. Superior performance is achieved at various compression ratios on long-context understanding and reasoning benchmarks. Moreover, our method obtains a speedup of 3.3x at a 4x compression ratio on a resource-constrained mobile device.
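
As a rough illustration of the inter-class cosine similarity term mentioned in the abstract, the sketch below averages the cosine similarity between representations of tokens from different classes, so that a larger value means the two classes look more alike. The pairwise averaging, the function names, and the 0/1 keep/discard labels are assumptions made for this example; the abstract does not give the authors' exact formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def inter_class_cosine_penalty(reps, labels):
    """Mean cosine similarity over all (keep, discard) token pairs.

    reps:   list of token representation vectors (hypothetical encoder outputs)
    labels: list of 0/1 class labels, 1 = keep, 0 = discard (assumed encoding)
    """
    keep = [r for r, l in zip(reps, labels) if l == 1]
    drop = [r for r, l in zip(reps, labels) if l == 0]
    if not keep or not drop:
        return 0.0  # nothing to compare when one class is empty
    sims = [cosine(k, d) for k in keep for d in drop]
    return sum(sims) / len(sims)
```

In training, a term like this would be added to the classification loss with some weight, which is a hyperparameter this sketch leaves out.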

[93] Credible plan-driven RAG method for Multi-hop Question Answering

Ningning Zhang,Chi Zhang,Zhizhong Tan,Xingxing Yang,Weiping Deng,Wenyong Wang

Main category: cs.CL

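TL;DR: The PAR RAG framework uses three stages (planning, execution, and review) to address reasoning-path deviations and intermediate-result errors in multi-hop question answering, markedly improving accuracy and reliability.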

Details

Motivation: Current RAG methods suffer from reasoning-path deviations and intermediate-result errors in multi-hop QA, which lowers answer accuracy. Method: Proposes the PAR RAG framework with three stages: planning (problem decomposition), execution (multi-granularity verification), and review (adjusting intermediate results). Result: On multi-hop QA datasets, PAR RAG substantially outperforms existing methods on EM and F1. Conclusion: Through structured reasoning and error control, PAR RAG offers a more accurate and reliable approach to multi-hop QA. Abstract: Multi-hop question answering (QA) presents a considerable challenge for Retrieval-Augmented Generation (RAG), requiring the structured decomposition of complex queries into logical reasoning paths and the generation of dependable intermediate results. However, deviations in reasoning paths or errors in intermediate results, which are common in current RAG methods, may propagate and accumulate throughout the reasoning process, diminishing the accuracy of the answer to complex queries. To address this challenge, we propose the Plan-then-Act-and-Review (PAR RAG) framework, which is organized into three key stages: plan, act, and review, and aims to offer an interpretable and incremental reasoning paradigm for accurate and reliable multi-hop question answering by mitigating error propagation. PAR RAG initially applies a top-down problem decomposition strategy, formulating a comprehensive plan that integrates multiple executable steps from a holistic viewpoint. This approach avoids the pitfalls of local optima common in traditional RAG methods, ensuring the accuracy of the entire reasoning path. Subsequently, PAR RAG incorporates a plan execution mechanism based on multi-granularity verification. By utilizing both coarse-grained similarity information and fine-grained relevant data, the framework thoroughly checks and adjusts intermediate results, ensuring process accuracy while effectively managing error propagation and amplification. Experimental results on multi-hop QA datasets demonstrate that the PAR RAG framework substantially outperforms existing state-of-the-art methods in key metrics, including EM and F1 scores.
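
For readers unfamiliar with the EM and F1 metrics the abstract reports, the following is a minimal sketch of how exact match and token-level F1 are commonly computed for QA answers. The simplified normalization (lowercasing and whitespace splitting) is an assumption of this example; the paper's evaluation scripts may normalize differently.

```python
from collections import Counter


def normalize(text):
    """Simplified normalization: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the normalized gold answer, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit when a prediction contains the gold answer plus extra tokens, whereas EM scores that case as zero.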

[94] Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention

Xiang Hu,Jiaqi Leng,Jun Zhao,Kewei Tu,Wei Wu

Main category: cs.CL

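TL;DR: Proposes Hierarchical Sparse Attention (HSA), a new attention mechanism that gives RNNs long-range random access through hierarchical sparse attention while preserving their efficiency and long-sequence modeling ability.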

Details

Motivation: To address the inability of RNNs to randomly access historical context while keeping their computational efficiency. Method: HSA splits the input into chunks, selects the top-k chunks, and aggregates information hierarchically, learning chunk relevance from fine-grained token-level information inside each chunk. Result: RAMba achieves perfect passkey-retrieval accuracy on 64M contexts and significant gains on downstream tasks, with a nearly constant memory footprint. Conclusion: HSA and RAMba show great potential for long-context modeling. Abstract: A key advantage of Recurrent Neural Networks (RNNs) over Transformers is that their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose Hierarchical Sparse Attention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selects the top-$k$ chunks, and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.
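
The chunk-selection step can be pictured with a toy example. The sketch below splits a token sequence into fixed-size chunks, scores each chunk by the best dot product between a query vector and any token inside it, and keeps the top-k chunks. The max-over-tokens scoring rule and all names are assumptions for illustration only; HSA learns its token-to-chunk relevance and runs on a hardware-aligned kernel, neither of which is modeled here.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_top_k_chunks(query, tokens, chunk_size, k):
    """Return (indices of the k highest-scoring chunks, all chunk scores).

    query:      query vector for the current position (hypothetical)
    tokens:     list of token vectors for the historical context
    chunk_size: number of tokens per chunk
    k:          number of chunks to keep
    """
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    # Score a chunk by its most relevant token (an illustrative stand-in for
    # the learned token-to-chunk relevance described in the abstract).
    scores = [max(dot(query, t) for t in chunk) for chunk in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores
```

Only the selected chunks would then be attended to, which is what keeps the cost far below full attention over the whole history.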

[95] LLM-assisted Graph-RAG Information Extraction from IFC Data

Sima Iranmanesh,Hadeel Saadany,Edlira Vakaj

Main category: cs.CL

TL;DR: Uses LLMs with Graph-RAG to parse complex IFC data and answer natural language queries.

Details

Motivation: IFC data is complex and allows multiple representations of the same information, so an efficient parsing method is needed. Method: Uses LLMs with the Graph-RAG technique to retrieve building object properties and their relations from IFC data. Result: Graph-RAG enhances LLMs with graph-based knowledge and simplifies the query pipeline. Conclusion: Graph-RAG effectively supports natural language querying of IFC data. Abstract: IFC data has become the general building information standard for collaborative work in the construction industry. However, IFC data can be very complicated because it allows for multiple ways to represent the same product information. In this research, we utilise the capabilities of LLMs to parse the IFC data with the Graph Retrieval-Augmented Generation (Graph-RAG) technique to retrieve building object properties and their relations. We will show that, despite limitations due to the complex hierarchy of the IFC data, the Graph-RAG parsing enhances generative LLMs like GPT-4o with graph-based knowledge, enabling natural language query-response retrieval without the need for a complex pipeline.

[96] GreenMind: A Next-Generation Vietnamese Large Language Model for Structured and Logical Reasoning

Luu Quy Tung,Hoang Quoc Viet,Vo Trong Thu

Main category: cs.CL

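TL;DR: The paper presents GreenMind-Medium-14B-R1, a Vietnamese reasoning model fine-tuned with Group Relative Policy Optimization, using a high-quality dataset and reward functions to address language mixing and factual correctness.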

Details

Motivation: To address language mixing and factual correctness problems of Chain-of-Thought (CoT) reasoning on Vietnamese tasks. Method: Fine-tunes with a strategy based on Group Relative Policy Optimization and designs two reward functions: one detects language mixing and the other ensures factual correctness. Result: Outperforms prior work on the VLSP 2023 Vietnamese dataset and proves effective relative to few-shot prompting on the multilingual multiple-choice SeaExam dataset. Conclusion: The model improves reasoning ability and linguistic consistency on Vietnamese tasks and is applicable in multilingual settings. Abstract: Chain-of-Thought (CoT) is a robust approach for tackling LLM tasks that require intermediate reasoning steps prior to generating a final answer. In this paper, we present GreenMind-Medium-14B-R1, a Vietnamese reasoning model inspired by the finetuning strategy based on Group Relative Policy Optimization. We also leverage a high-quality Vietnamese synthesized reasoning dataset and design two reward functions to tackle the main limitations of this technique: (i) language mixing, where we explicitly detect the presence of biased language characters during the process of sampling tokens, and (ii) factual correctness, where we leverage Sentence Transformer-based models to ensure that the generated reasoning content maintains factual correctness and does not distort the final output. Experimental results on the Vietnamese dataset from the VLSP 2023 Challenge demonstrate that our model outperforms prior works and enhances linguistic consistency in its responses. Furthermore, we extend our evaluation to SeaExam, a multilingual multiple-choice dataset, showing the effectiveness of our reasoning method compared to few-shot prompting techniques.
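
Group Relative Policy Optimization scores each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that normalization step is given below, assuming the per-response rewards have already been computed (for instance by combining the two reward functions described above). The use of the population standard deviation and the `eps` constant are choices made for this example, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are judged against their siblings rather than against an absolute scale.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal.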

[97] Monte Carlo Planning with Large Language Model for Text-Based Game Agents

Zijing Shi,Meng Fang,Ling Chen

Main category: cs.CL

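TL;DR: MC-DML combines large language models (LLMs) with tree search and uses dynamic memory mechanisms to bring language understanding and reasoning to text-based game planning, markedly improving performance in the initial planning phase.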

Details

Motivation: Planning-then-learning paradigms (such as MCTS combined with RL) are time-consuming in text-based games and lack language understanding, so a more efficient approach is needed. Method: Proposes MC-DML, which pairs the language abilities of LLMs with the exploratory strengths of tree search and uses in-trial and cross-trial dynamic memory to adjust action evaluations. Result: On the Jericho benchmark, MC-DML clearly outperforms existing methods in the initial planning phase, performing strongly without multiple iterations. Conclusion: MC-DML offers an efficient approach to language-grounded planning in complex environments and shows the potential of combining LLMs with planning algorithms. Abstract: Text-based games provide valuable environments for language-based autonomous agents. However, planning-then-learning paradigms, such as those combining Monte Carlo Tree Search (MCTS) and reinforcement learning (RL), are notably time-consuming due to extensive iterations. Additionally, these algorithms perform uncertainty-driven exploration but lack language understanding and reasoning abilities. In this paper, we introduce the Monte Carlo planning with Dynamic Memory-guided Large language model (MC-DML) algorithm. MC-DML leverages the language understanding and reasoning capabilities of Large Language Models (LLMs) alongside the exploratory advantages of tree search algorithms. Specifically, we enhance LLMs with in-trial and cross-trial memory mechanisms, enabling them to learn from past experiences and dynamically adjust action evaluations during planning. We conduct experiments on a series of text-based games from the Jericho benchmark. Our results demonstrate that the MC-DML algorithm significantly enhances performance across various games at the initial planning phase, outperforming strong contemporary methods that require multiple iterations. This demonstrates the effectiveness of our algorithm, paving the way for more efficient language-grounded planning in complex environments.

[98] Emo Pillars: Knowledge Distillation to Support Fine-Grained Context-Aware and Context-Less Emotion Classification

Alexander Shvets

Main category: cs.CL

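TL;DR: The paper proposes an LLM-based data synthesis method that generates a diverse emotion-classification dataset and uses it to train lightweight BERT-type models, Emo Pillars, which reach SOTA performance on several tasks.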

Details

Motivation: Existing sentiment analysis datasets lack context and cover few emotion categories, while large language models such as GPT-4 are resource-intensive and tend to over-predict emotions. Method: Designs an LLM-based data synthesis pipeline that uses Mistral-7b to generate diverse training data for BERT-type Emo Pillars models. Result: Produces 100K contextual and 300K context-less examples; Emo Pillars reaches SOTA on GoEmotions, ISEAR, and IEMOCAP. Conclusion: The dataset is validated, but utterance diversification is weaker for the neutral class and handling of out-of-taxonomy labels needs improvement. Abstract: Most datasets for sentiment analysis lack the context in which an opinion was expressed, often crucial for emotion understanding, and are mainly limited to a few emotion categories. Foundation large language models (LLMs) like GPT-4 suffer from over-predicting emotions and are too resource-intensive. We design an LLM-based data synthesis pipeline and leverage a large model, Mistral-7b, for the generation of training examples for more accessible, lightweight BERT-type encoder models. We focus on enlarging the semantic diversity of examples and propose grounding the generation into a corpus of narratives to produce non-repetitive story-character-centered utterances with unique contexts over 28 emotion classes. By running 700K inferences in 450 GPU hours, we contribute a dataset of 100K contextual and also 300K context-less examples to cover both scenarios. We use it for fine-tuning pre-trained encoders, which results in several Emo Pillars models. We show that Emo Pillars models are highly adaptive to new domains when tuned to specific tasks such as GoEmotions, ISEAR, IEMOCAP, and EmoContext, reaching the SOTA performance on the first three. We also validate our dataset, conducting statistical analysis and human evaluation, and confirm the success of our measures in utterance diversification (although less for the neutral class) and context personalization, while pointing out the need for improved handling of out-of-taxonomy labels within the pipeline.

[99] Planning with Diffusion Models for Target-Oriented Dialogue Systems

Hanwen Du,Bo Peng,Xia Ning

Main category: cs.CL

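TL;DR: DiffTOD is a new diffusion-model-based dialogue planning framework that addresses the compounding errors and myopia of conventional sequential planning.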

Details

Motivation: Existing dialogue planning methods generate plans sequentially, which leads to compounding errors and myopic actions that hinder reaching dialogue targets. Method: DiffTOD formulates dialogue planning as trajectory generation with conditional guidance, uses a diffusion language model to estimate trajectory likelihood, and designs three guidance mechanisms for different target types. Result: Experiments show that DiffTOD performs effective non-myopic lookahead exploration and is highly flexible across diverse dialogue scenarios. Conclusion: Through non-sequential dialogue planning, DiffTOD optimizes action strategies over a long horizon and suits complex, diverse dialogue scenarios. Abstract: Target-Oriented Dialogue (TOD) remains a significant challenge in the LLM era, where strategic dialogue planning is crucial for directing conversations toward specific targets. However, existing dialogue planning methods generate dialogue plans in a step-by-step sequential manner, and may suffer from compounding errors and myopic actions. To address these limitations, we introduce a novel dialogue planning framework, DiffTOD, which leverages diffusion models to enable non-sequential dialogue planning. DiffTOD formulates dialogue planning as a trajectory generation problem with conditional guidance, and leverages a diffusion language model to estimate the likelihood of the dialogue trajectory. To optimize the dialogue action strategies, DiffTOD introduces three tailored guidance mechanisms for different target types, offering flexible guidance towards diverse TOD targets at test time. Extensive experiments across three diverse TOD settings show that DiffTOD can effectively perform non-myopic lookahead exploration and optimize action strategies over a long horizon through non-sequential dialogue planning, and demonstrates strong flexibility across complex and diverse dialogue scenarios. Our code and data are accessible through https://anonymous.4open.science/r/DiffTOD.

[100] Do Large Language Models know who did what to whom?

Joseph M. Denning,Xiaohan Guo,Bryor Snefjella,Idan A. Blank

Main category: cs.CL

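TL;DR: The paper studies whether large language models (LLMs), trained on word prediction, capture thematic roles in sentences (who did what to whom). It finds that LLM representations depend more on syntax than on thematic roles, although some attention heads capture thematic roles independently of syntax.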

Details

Motivation: To examine whether LLMs understand thematic roles in language processing, as a way of assessing their language understanding. Method: Two experiments analyze sentence representations in four LLMs, compare them with human similarity judgments, and probe hidden units and attention heads for thematic role information. Result: LLM representations reflect syntactic similarity rather than thematic roles; some attention heads capture thematic roles independently, but the overall influence is weak. Conclusion: LLMs can extract thematic roles, but this information influences their representations more weakly than in humans. Abstract: Large Language Models (LLMs) are commonly criticized for not understanding language. However, many critiques focus on cognitive abilities that, in humans, are distinct from language processing. Here, we instead study a kind of understanding tightly linked to language: inferring who did what to whom (thematic roles) in a sentence. Does the central training objective of LLMs, word prediction, result in sentence representations that capture thematic roles? In two experiments, we characterized sentence representations in four LLMs. In contrast to human similarity judgments, in LLMs the overall representational similarity of sentence pairs reflected syntactic similarity but not whether their agent and patient assignments were identical vs. reversed. Furthermore, we found little evidence that thematic role information was available in any subset of hidden units. However, some attention heads robustly captured thematic roles, independently of syntax. Therefore, LLMs can extract thematic roles but, relative to humans, this information influences their representations more weakly.

[101] Tracing Thought: Using Chain-of-Thought Reasoning to Identify the LLM Behind AI-Generated Text

Shifali Agrahari,Sanasam Ranbir Singh

Main category: cs.CL

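TL;DR: This paper proposes COT Fine-tuned, a new framework for detecting AI-generated text and identifying the specific language model that generated it. A dual-task design and Chain-of-Thought reasoning improve accuracy and interpretability.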

Details

Motivation: Detecting AI-generated text has become an important research area in recent years because of concerns about academic integrity, misinformation, and AI ethics. Method: Uses a dual-task design: Task A distinguishes AI-generated from human-written text, and Task B identifies the specific language model. The key innovation is Chain-of-Thought reasoning, which improves transparency and interpretability. Result: Experiments show that COT Fine-tuned performs well on both tasks, LLM identification and human-AI classification. Conclusion: Chain-of-Thought reasoning significantly improves the model's effectiveness and interpretability. Abstract: In recent years, the detection of AI-generated text has become a critical area of research due to concerns about academic integrity, misinformation, and ethical AI deployment. This paper presents COT Fine-tuned, a novel framework for detecting AI-generated text and identifying the specific language model responsible for generating the text. We propose a dual-task approach, where Task A involves classifying text as AI-generated or human-written, and Task B identifies the specific LLM behind the text. The key innovation of our method lies in the use of Chain-of-Thought reasoning, which enables the model to generate explanations for its predictions, enhancing transparency and interpretability. Our experiments demonstrate that COT Fine-tuned achieves high accuracy in both tasks, with strong performance in LLM identification and human-AI classification. We also show that the CoT reasoning process contributes significantly to the model's effectiveness and interpretability.

[102] OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents

Raghav Thind,Youran Sun,Ling Liang,Haizhao Yang

Main category: cs.CL

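TL;DR: OptimAI is a framework that uses LLM-powered AI agents to solve optimization problems described in natural language, outperforming existing methods.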

Details

Motivation: Optimization is vital in research and practice, but turning a natural-language problem description into a mathematical formulation and choosing a suitable solver requires substantial domain expertise. Method: The framework has four roles: a formulator (translates the problem into a mathematical formulation), a planner (constructs a high-level strategy), and a coder and a code critic (interact with the environment and refine actions). UCB-based debug scheduling switches dynamically between plans. Result: Reaches 88.1% accuracy on NLP4LP and 71.2% on Optibench, reducing error rates by 58% and 50%, respectively. Conclusion: OptimAI markedly improves performance through multi-agent collaboration, and ablations confirm that every role is necessary. Abstract: Optimization plays a vital role in scientific research and practical applications, but formulating a concrete optimization problem described in natural language into a mathematical form and selecting a suitable solver to solve the problem requires substantial domain expertise. We introduce OptimAI, a framework for solving Optimization problems described in natural language by leveraging LLM-powered AI agents, achieving superior performance over current state-of-the-art methods. Our framework is built upon four key roles: (1) a formulator that translates natural language problem descriptions into precise mathematical formulations; (2) a planner that constructs a high-level solution strategy prior to execution; and (3) a coder and a code critic capable of interacting with the environment and reflecting on outcomes to refine future actions. Ablation studies confirm that all roles are essential; removing the planner or code critic results in 5.8x and 3.1x drops in productivity, respectively. Furthermore, we introduce UCB-based debug scheduling to dynamically switch between alternative plans, yielding an additional 3.3x productivity gain. Our design emphasizes multi-agent collaboration, allowing us to conveniently explore the synergistic effect of combining diverse models within a unified system. Our approach attains 88.1% accuracy on the NLP4LP dataset and 71.2% on the Optibench (non-linear w/o table) subset, reducing error rates by 58% and 50% respectively over prior best results.
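
The abstract does not spell out how UCB-based debug scheduling is implemented. To give a sense of the idea, here is a sketch of the standard UCB1 rule applied to choosing which candidate plan to debug next; the success/attempt bookkeeping, the function name, and the exploration constant are assumptions made for this example rather than the authors' design.

```python
import math


def ucb_select(successes, attempts, c=math.sqrt(2)):
    """Pick the index of the plan to work on next using the UCB1 rule.

    successes: per-plan count of debugging steps judged successful (hypothetical)
    attempts:  per-plan count of debugging steps tried so far
    c:         exploration constant; larger values favor rarely tried plans
    """
    # Try every plan once before comparing estimates.
    for i, n in enumerate(attempts):
        if n == 0:
            return i
    total = sum(attempts)
    scores = [
        s / n + c * math.sqrt(math.log(total) / n)  # mean reward + exploration bonus
        for s, n in zip(successes, attempts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

The exploration bonus shrinks as a plan accumulates attempts, so the scheduler drifts toward plans that keep paying off while still occasionally revisiting neglected ones.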

[103] IberBench: LLM Evaluation on Iberian Languages

José Ángel González,Ian Borrego Obrador,Álvaro Romo Herrero,Areg Mikael Sarvazyan,Mara Chinea-Ríos,Angelo Basile,Marc Franco-Salvador

Main category: cs.CL

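TL;DR: IberBench is a comprehensive, extensible benchmark for languages of the Iberian Peninsula and Ibero-America that evaluates LLMs on fundamental and industry-relevant NLP tasks, addressing limitations of existing benchmarks.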

Details

Motivation: Existing benchmarks are mostly English-centric, lack linguistic diversity, and neglect industry-relevant tasks. IberBench aims to fill this gap. Method: Integrates 101 datasets covering 22 task categories, supports continual updates and community-driven submissions moderated by a committee of experts, and evaluates 23 LLMs of different sizes. Result: LLMs perform worse on industry-relevant tasks, performance is lower for some languages (such as Galician and Basque), and some tasks show near-random results. Conclusion: IberBench provides open-source implementations and a public leaderboard, offering a more comprehensive tool for LLM evaluation. Abstract: Large Language Models (LLMs) remain difficult to evaluate comprehensively, particularly for languages other than English, where high-quality data is often limited. Existing benchmarks and leaderboards are predominantly English-centric, with only a few addressing other languages. These benchmarks fall short in several key areas: they overlook the diversity of language varieties, prioritize fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static. With these aspects in mind, we present IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks, in languages spoken across the Iberian Peninsula and Ibero-America. IberBench integrates 101 datasets from evaluation campaigns and recent benchmarks, covering 22 task categories such as sentiment and emotion analysis, toxicity detection, and summarization. The benchmark addresses key limitations in current evaluation practices, such as the lack of linguistic diversity and static evaluation setups by enabling continual updates and community-driven model and dataset submissions moderated by a committee of experts. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations. Our findings indicate that (i) LLMs perform worse on industry-relevant tasks than on fundamental ones, (ii) performance is on average lower for Galician and Basque, (iii) some tasks show results close to random, and (iv) in other tasks LLMs perform above random but below shared task systems. IberBench offers open-source implementations for the entire evaluation pipeline, including dataset normalization and hosting, incremental evaluation of LLMs, and a publicly accessible leaderboard.

cs.IR

[104] LegalRAG: A Hybrid RAG System for Multilingual Legal Information Retrieval

Muhammad Rafsan Kabir,Rafeed Mohammad Sultan,Fuad Rahman,Mohammad Ruhul Amin,Sifat Momen,Nabeel Mohammed,Shafin Rahman

Main category: cs.IR

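TL;DR: Develops an efficient bilingual question-answering framework for legal documents in the Bangladesh Police Gazettes, using modern RAG techniques to improve retrieval and answer generation.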

Details

Motivation: NLP is rarely applied to legal and regulatory tasks, and a more efficient bilingual question-answering system is needed. Method: Uses an advanced RAG pipeline that improves retrieval performance and yields more precise answers. Result: On a Bangladesh Police Gazettes test set, the proposed method outperforms existing methods on all evaluation metrics. Conclusion: The advanced RAG system retrieves legal information more efficiently and makes it more accessible. Abstract: Natural Language Processing (NLP) and computational linguistic techniques are increasingly being applied across various domains, yet their use in legal and regulatory tasks remains limited. To address this gap, we develop an efficient bilingual question-answering framework for regulatory documents, specifically the Bangladesh Police Gazettes, which contain both English and Bangla text. Our approach employs modern Retrieval Augmented Generation (RAG) pipelines to enhance information retrieval and response generation. In addition to conventional RAG pipelines, we propose an advanced RAG-based approach that improves retrieval performance, leading to more precise answers. This system enables efficient searching for specific government legal notices, making legal information more accessible. We evaluate both our proposed and conventional RAG systems on a diverse test set of Bangladesh Police Gazettes, demonstrating that our approach consistently outperforms existing methods across all evaluation metrics.

[105] CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents

Francisco Valentini,Diego Kozlowski,Vincent Larivière

Main category: cs.IR

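TL;DR: The paper introduces CLIRudit, a dataset for evaluating cross-lingual academic search with English queries and French documents. A comparison of zero-shot retrieval methods finds that large dense retrievers perform strongly and, without machine translation, match the results obtained with human translations.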

Details

Motivation: To address the challenges of cross-lingual academic information retrieval and help researchers reach non-English scholarly content. Method: Builds the CLIRudit dataset from bilingual article metadata and compares zero-shot retrieval methods, including dense and sparse retrievers and machine translation. Result: Large dense retrievers perform strongly in the zero-shot setting, and sparse retrievers combined with document translation are also competitive. Conclusion: The study advances the understanding of cross-lingual academic retrieval and provides a reusable dataset-building framework that supports sharing scientific knowledge across languages. Abstract: Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.

cs.LG

[106] Using Phonemes in cascaded S2S translation pipeline

Rene Pilz,Johannes Schneider

Main category: cs.LG

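TL;DR: The paper explores using phonemes instead of conventional text representations in speech translation; results show comparable quality along with advantages such as lower resource requirements.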

Details

Motivation: To explore the potential of phonemes as a textual representation in speech translation and address limitations of conventional text representations. Method: Trains a sequence-to-sequence model on the WMT17 dataset and compares textual and phonemic representations. Result: The phonemic approach scores comparably on BLEU, needs fewer resources, and is better suited to low-resource languages. Conclusion: Phonemic representation shows promise for speech translation, especially for low-resource languages. Abstract: This paper explores the idea of using phonemes as a textual representation within a conventional multilingual simultaneous speech-to-speech translation pipeline, as opposed to the traditional reliance on text-based language representations. To investigate this, we trained an open-source sequence-to-sequence model on the WMT17 dataset in two formats: one using standard textual representation and the other employing phonemic representation. The performance of both approaches was assessed using the BLEU metric. Our findings show that the phonemic approach provides comparable quality but offers several advantages, including lower resource requirements and better suitability for low-resource languages.

[107] MAGIC: Near-Optimal Data Attribution for Deep Learning

Andrew Ilyas,Logan Engstrom

Main category: cs.LG

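TL;DR: Proposes MAGIC, a new method that combines classical methods with metadifferentiation to estimate how adding or removing training data affects model predictions.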

Details

Motivation: In large-scale non-convex settings, existing methods estimate the effect of adding or removing training data poorly. Method: Develops MAGIC by combining classical methods with metadifferentiation. Result: MAGIC (nearly) optimally estimates the effect of adding or removing training data on model predictions. Conclusion: MAGIC markedly improves the accuracy of data attribution estimates in large-scale non-convex settings. Abstract: The goal of predictive data attribution is to estimate how adding or removing a given set of training datapoints will affect model predictions. In convex settings, this goal is straightforward (i.e., via the infinitesimal jackknife). In large-scale (non-convex) settings, however, existing methods are far less successful -- current methods' estimates often only weakly correlate with ground truth. In this work, we present a new data attribution method (MAGIC) that combines classical methods and recent advances in metadifferentiation to (nearly) optimally estimate the effect of adding or removing training data on model predictions.

[108] ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models using Pareto High-quality Data

Haoran Gu,Handing Wang,Yi Mei,Mengjie Zhang,Yaochu Jin

Main category: cs.LG

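TL;DR: ParetoHqD improves multiobjective alignment by representing human preferences as preference directions in the objective space and training on high-quality data near the Pareto front.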

Details

Motivation: Large language models should serve diverse user needs and values, but current multiobjective alignment algorithms are limited by inappropriate preference representations and imbalanced reward scores. Method: Proposes ParetoHqD, a two-stage supervised fine-tuning process in which each stage uses a Pareto high-quality dataset matching the preference direction. Result: Experiments show ParetoHqD outperforms five baselines on two multiobjective alignment tasks. Conclusion: ParetoHqD effectively addresses the preference representation and reward imbalance problems and improves multiobjective alignment. Abstract: Aligning large language models with multiple human expectations and values is crucial for ensuring that they adequately serve a variety of user needs. To this end, offline multiobjective alignment algorithms such as the Rewards-in-Context algorithm have shown strong performance and efficiency. However, inappropriate preference representations and training with imbalanced reward scores limit the performance of such algorithms. In this work, we introduce ParetoHqD, which addresses the above issues by representing human preferences as preference directions in the objective space and regarding data near the Pareto front as "high-quality" data. For each preference, ParetoHqD follows a two-stage supervised fine-tuning process, where each stage uses an individual Pareto high-quality training set that best matches its preference direction. The experimental results have demonstrated the superiority of ParetoHqD over five baselines on two multiobjective alignment tasks.
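
To make the notion of a Pareto front concrete, the sketch below filters a set of responses scored on several reward objectives down to the non-dominated ones, assuming every objective is to be maximized. It illustrates only the front itself; how ParetoHqD defines data "near" the front and matches it to a preference direction is not described in the abstract and is not modeled here.

```python
def dominates(a, b):
    """True if point a is at least as good as b on every objective and
    strictly better on at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For two reward scores per response, `pareto_front([(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)])` keeps the first three points, since each of the last two is beaten on one objective and matched or beaten on the other by `(2, 2)`.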

[109] Process Reward Models That Think

Muhammad Khalifa,Rishabh Agarwal,Lajanugen Logeswaran,Jaekyeom Kim,Hao Peng,Moontae Lee,Honglak Lee,Lu Wang

Main category: cs.LG

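TL;DR: ThinkPRM is a generative, long chain-of-thought (CoT) step-wise verifier that outperforms conventional discriminative verifiers using only 1% of the process labels and performs strongly on several benchmarks.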

Details

Motivation: Process reward models (PRMs) require large amounts of step-level annotation and are expensive to train. This work aims to build data-efficient PRMs that verify each step by generating a verification chain-of-thought (CoT). Method: Proposes ThinkPRM, a generative long-CoT verifier fine-tuned on a small amount of labeled data to exploit the reasoning abilities of long CoT models. Result: ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24, and in out-of-domain evaluation it surpasses discriminative verifiers trained on the full label set. Conclusion: Generative long-CoT PRMs can scale verification compute efficiently while reducing the supervision needed for training. Abstract: Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models will be released at https://github.com/mukhal/thinkprm.
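
Best-of-N selection with a process reward model can be sketched in a few lines: each candidate solution carries one score per reasoning step, the step scores are collapsed into a single solution score, and the highest-scoring candidate wins. The dictionary layout and the default choice of `min` as the aggregator are assumptions for this example; the abstract does not say how ThinkPRM aggregates step-level verdicts.

```python
def best_of_n(candidates, aggregate=min):
    """Return the candidate whose aggregated step scores are highest.

    candidates: list of dicts with an "answer" and a list of "step_scores"
                (hypothetical layout; scores would come from a verifier)
    aggregate:  function collapsing step scores into one number; min treats
                a solution as only as reliable as its weakest step
    """
    return max(candidates, key=lambda c: aggregate(c["step_scores"]))
```

Swapping the aggregator (for example to a mean or a product) changes how harshly a single doubtful step is punished, which is a design choice worth checking against the task.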

[110] Quantum Doubly Stochastic Transformers

Jannis Born,Filip Skogh,Kahn Rhrissorrakrai,Filippo Utro,Nico Wagner,Aleksandros Sobczyk

Main category: cs.LG

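TL;DR: The paper proposes a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces Softmax with a variational quantum circuit, improving performance and training stability.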

Details

Motivation: Softmax in the standard Transformer can destabilize training, and existing approaches to doubly stochastic matrices (such as Sinkhorn's algorithm) have limitations. Quantum circuits offer a new way to obtain doubly stochastic matrices. Method: Replaces Softmax with a variational quantum circuit that produces doubly stochastic matrices, and studies its expressive power. Result: On multiple small-scale object recognition tasks, QDSFormer outperforms a standard Vision Transformer and other doubly stochastic Transformers and trains more stably. Conclusion: QDSFormer shows the potential of quantum circuits in Transformers and may mitigate unstable ViT training on small-scale data. Abstract: At the core of the Transformer, the Softmax normalizes the attention matrix to be right stochastic. Previous research has shown that this often destabilizes training and that enforcing the attention matrix to be doubly stochastic (through Sinkhorn's algorithm) consistently improves performance across different tasks, domains and Transformer flavors. However, Sinkhorn's algorithm is iterative, approximative, non-parametric and thus inflexible w.r.t. the obtained doubly stochastic matrix (DSM). Recently, it has been proven that DSMs can be obtained with a parametric quantum circuit, yielding a novel quantum inductive bias for DSMs with no known classical analogue. Motivated by this, we demonstrate the feasibility of a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces the Softmax in the self-attention layer with a variational quantum circuit. We study the expressive power of the circuit and find that it yields more diverse DSMs that better preserve information than classical operators. Across multiple small-scale object recognition tasks, we find that our QDSFormer consistently surpasses both a standard Vision Transformer and other doubly stochastic Transformers. Beyond the established Sinkformer, this comparison includes a novel quantum-inspired doubly stochastic Transformer (based on QR decomposition) that can be of independent interest. The QDSFormer also shows improved training stability and lower performance variation suggesting that it may mitigate the notoriously unstable training of ViTs on small-scale data.
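
For context on the classical baseline the abstract contrasts against, the sketch below shows Sinkhorn normalization: starting from exponentiated attention scores, rows and columns are rescaled in alternation until the matrix is approximately doubly stochastic. This illustrates only the classical iterative procedure, not the quantum circuit proposed in the paper; the square-matrix restriction and the iteration count are assumptions of this example.

```python
import math


def sinkhorn(scores, n_iters=50):
    """Approximate a doubly stochastic matrix from a square score matrix.

    A single row normalization of exp(scores) is what Softmax does (rows sum
    to 1). Alternating row and column normalization pushes column sums to 1
    as well, at the cost of being iterative and only approximate.
    """
    n = len(scores)
    m = [[math.exp(x) for x in row] for row in scores]
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Column step: make every column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m
```

The loop makes visible the properties the abstract lists as drawbacks: the result depends on an iteration budget and there are no learnable parameters in the normalization itself.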

[111] An Automated Pipeline for Few-Shot Bird Call Classification: A Case Study with the Tooth-Billed Pigeon

Abhishek Jana,Moeumu Uili,James Atherton,Mark O'Brien,Joe Wood,Leandra Brickson

Main category: cs.LG

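TL;DR: Proposes an automated one-shot bird call classification method for rare birds, addressing the inability of existing classifiers to detect rare species.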

Details

Motivation: Existing classifiers (such as BirdNET and Perch) perform well on common birds but offer nothing for rare species with only 1-3 recordings, a capability that is critical for monitoring endangered species. Method: Uses the embedding space of large bird classification networks together with a cosine-similarity classifier and preprocessing (filtering and denoising) to optimize detection with very little training data. Result: In a simulated scenario and in a real-world test on the critically endangered tooth-billed pigeon, the model reached 1.0 recall and 0.95 accuracy. Conclusion: The system is a practical, open-source tool for monitoring endangered species. Abstract: This paper presents an automated one-shot bird call classification pipeline designed for rare species absent from large publicly available classifiers like BirdNET and Perch. While these models excel at detecting common birds with abundant training data, they lack options for species with only 1-3 known recordings, a critical limitation for conservationists monitoring the last remaining individuals of endangered birds. To address this, we leverage the embedding space of large bird classification networks and develop a classifier using cosine similarity, combined with filtering and denoising preprocessing techniques, to optimize detection with minimal training data. We evaluate various embedding spaces using clustering metrics and validate our approach in both a simulated scenario with Xeno-Canto recordings and a real-world test on the critically endangered tooth-billed pigeon (Didunculus strigirostris), which has no existing classifiers and only three confirmed recordings. The final model achieved 1.0 recall and 0.95 accuracy in detecting tooth-billed pigeon calls, making it practical for use in the field. This open-source system provides a practical tool for conservationists seeking to detect and monitor rare species on the brink of extinction.
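
The cosine-similarity step described in the abstract can be sketched as follows. This is an illustrative sketch only: the three-dimensional vectors stand in for real network embeddings, and the threshold of 0.8 and the function names are hypothetical, not values from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_target_call(query, references, threshold=0.8):
    """Flag a clip if it is close to any of the few reference embeddings.

    The threshold is an assumed value for this sketch.
    """
    best = max(cosine_similarity(query, ref) for ref in references)
    return best >= threshold, best

# Toy stand-ins for embeddings of the few confirmed recordings.
reference_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
similar_clip = [0.85, 0.15, 0.05]
unrelated_clip = [0.0, 0.1, 0.9]

match_similar, score_similar = is_target_call(similar_clip, reference_embeddings)
match_unrelated, score_unrelated = is_target_call(unrelated_clip, reference_embeddings)
```

Because the classifier only compares against stored reference embeddings, it needs no gradient training, which is what makes it usable with one to three recordings.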

[112] Representation Learning via Non-Contrastive Mutual Information

Zhaohan Daniel Guo,Bernardo Avila Pires,Khimya Khetarpal,Dale Schuurmans,Bo Dai

Main category: cs.LG

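TL;DR: The paper proposes MINC, a new objective that combines the strengths of contrastive and non-contrastive self-supervised learning by reworking the Spectral Contrastive Loss, lowering variance while preventing collapse.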

Details

Motivation: Because labeling data is expensive, self-supervised learning has become an important approach, but existing contrastive and non-contrastive methods each have shortcomings, which motivates combining their strengths. Method: Converts the Spectral Contrastive Loss into a non-contrastive form, yielding the MINC loss, which avoids pairwise comparisons while retaining the mutual information formulation. Result: On ImageNet, MINC improves upon the Spectral Contrastive Loss baseline. Conclusion: MINC combines the advantages of contrastive and non-contrastive methods and offers a new direction for self-supervised learning. Abstract: Labeling data is often very time consuming and expensive, leaving us with a majority of unlabeled data. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks. Broadly, self-supervised methods fall into two types: 1) Contrastive methods, such as SimCLR; and 2) Non-Contrastive methods, such as BYOL. Contrastive methods are generally trying to maximize mutual information between related data points, so they need to compare every data point to every other data point, resulting in high variance, and thus requiring large batch sizes to work well. Non-contrastive methods like BYOL have much lower variance as they do not need to make pairwise comparisons, but are much trickier to implement as they have the possibility of collapsing to a constant vector. In this paper, we aim to develop a self-supervised objective that combines the strength of both types. We start with a particular contrastive method called the Spectral Contrastive Loss (HaoChen et al., 2021; Lu et al., 2024), and we convert it into a more general non-contrastive form; this removes the pairwise comparisons resulting in lower variance, but keeps the mutual information formulation of the contrastive method preventing collapse. We call our new objective the Mutual Information Non-Contrastive (MINC) loss. We test MINC by learning image representations on ImageNet (similar to SimCLR and BYOL) and show that it consistently improves upon the Spectral Contrastive loss baseline.

[113] Noise-Tolerant Coreset-Based Class Incremental Continual Learning

Edison Mucllari,Aswin Raghavan,Zachary Alan Daniels

Main category: cs.LG

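TL;DR: The paper studies how label noise and instance noise affect continual learning (CL) in the class-incremental learning (CIL) setting and proposes two new noise-tolerant replay buffer algorithms that significantly improve classification accuracy and reduce forgetting.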

Details

Motivation: Continual learning must adapt to new tasks while avoiding forgetting old ones, but noise (such as label noise and instance noise) can disrupt learning. This work aims to understand the effect of noise on replay-based continual learning and to propose more robust solutions. Method: Derives, through theoretical analysis, a robustness bound for Coreset-based replay under a general additive noise threat model, and designs two new noise-tolerant continual learning algorithms. Result: Experiments show that existing memory-based continual learning methods are not robust to noise, whereas the proposed methods significantly improve classification accuracy and reduce forgetting on five datasets. Conclusion: The proposed noise-tolerant replay buffer algorithms perform well in noisy settings and offer a more reliable basis for practical continual learning. Abstract: Many applications of computer vision require the ability to adapt to novel data distributions after deployment. Adaptation requires algorithms capable of continual learning (CL). Continual learners must be plastic to adapt to novel tasks while minimizing forgetting of previous tasks. However, CL opens up avenues for noise to enter the training pipeline and disrupt the CL. This work focuses on label noise and instance noise in the context of class-incremental learning (CIL), where new classes are added to a classifier over time, and there is no access to external data from past classes. We aim to understand the sensitivity of CL methods that work by replaying items from a memory constructed using the idea of Coresets. We derive a new bound for the robustness of such a method to uncorrelated instance noise under a general additive noise threat model, revealing several insights. Putting the theory into practice, we create two continual learning algorithms to construct noise-tolerant replay buffers. We empirically compare the effectiveness of prior memory-based continual learners and the proposed algorithms under label and uncorrelated instance noise on five diverse datasets. We show that existing memory-based CL methods are not robust whereas the proposed methods exhibit significant improvements in maximizing classification accuracy and minimizing forgetting in the noisy CIL setting.

[114] I-Con: A Unifying Framework for Representation Learning

Shaden Alshammari,John Hershey,Axel Feldmann,William T. Freeman,Mark Hamilton

Main category: cs.LG

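TL;DR: The paper proposes an information-theoretic framework that unifies many machine learning loss functions, exposes the information geometry behind them, and demonstrates practical use in designing new losses and in unsupervised classification.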

Details

Motivation: As representation learning has grown, a large number of loss functions have appeared for different problems. This paper aims to provide a unifying framework that reduces these losses to a single information-theoretic equation and reveals what they have in common. Method: Introduces an information-theoretic equation that casts many machine learning methods as minimizing the KL divergence between two conditional distributions. The framework connects 23 different methods and is used to design new loss functions. Result: On unsupervised image classification, the new method improves on the previous best by 8%, and the framework is also applied to debiasing in contrastive representation learning. Conclusion: The framework unifies many machine learning methods and provides both a theoretical basis and practical tools for designing new losses and improving performance. Abstract: As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.
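
The quantity the abstract describes, a KL divergence between a supervisory and a learned conditional distribution averaged over data points, can be sketched in a few lines. This is a toy discrete version under stated assumptions: the neighbor distributions `p_rows` and `q_rows` and the function names are hypothetical, and the paper's exact formulation may differ.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.

    Terms with p_i == 0 contribute nothing; q_i must be positive
    wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def average_conditional_kl(p_rows, q_rows):
    """Average of KL(p(.|i) || q(.|i)) over all data points i."""
    total = sum(kl_divergence(p, q) for p, q in zip(p_rows, q_rows))
    return total / len(p_rows)

# Row i is a distribution over "neighbors" j of data point i.
p_rows = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]  # supervisory signal
q_rows = [[0.2, 0.4, 0.4], [0.3, 0.1, 0.6], [0.45, 0.45, 0.1]]  # learned representation

loss_mismatch = average_conditional_kl(p_rows, q_rows)
loss_perfect = average_conditional_kl(p_rows, p_rows)
```

In this reading, different methods correspond to different choices of how `p_rows` and `q_rows` are built; the loss is zero exactly when the learned distribution reproduces the supervisory one.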

cs.AR

[115] HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing

Myunghyun Rhee,Joonseop Sim,Taeyoung Ahn,Seungyong Lee,Daegun Yoon,Euiseok Kim,Kyoung Park,Youngpyo Joo,Hosik Kim

Main category: cs.AR

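TL;DR: The paper proposes a High-bandwidth Processing Unit (HPU) that improves GPU resource utilization during large-batch LLM inference by offloading memory-intensive work, substantially raising performance and energy efficiency.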

Details

Motivation: Current GPU systems handle the attention layer of Transformer LLMs inefficiently, mainly because of its low operational intensity and the large memory footprint of KV caches. Method: Designs and implements an HPU co-processor on PCIe FPGA cards that takes over memory-bound work so the GPU can focus on compute-intensive work. Result: Experiments show that the GPU-HPU heterogeneous system achieves up to 4.1x higher performance and 4.6x better energy efficiency than a GPU-only system. Conclusion: As an add-on card, the HPU improves system scalability without adding GPUs and significantly improves LLM inference efficiency. Abstract: The attention layer, a core component of Transformer-based LLMs, brings out inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of KV caches. We propose a High-bandwidth Processing Unit (HPU), a memory-intensive co-processor that enhances GPU resource utilization during large-batched LLM inference. By offloading memory-bound operations, the HPU allows the GPU to focus on compute-intensive tasks, increasing overall efficiency. Also, the HPU, as an add-on card, scales out to accommodate surging memory demands driven by large batch sizes and extended sequence lengths. In this paper, we show the HPU prototype implemented with PCIe-based FPGA cards mounted on a GPU system. Our novel GPU-HPU heterogeneous system demonstrates up to 4.1x performance gains and 4.6x energy efficiency improvements over a GPU-only system, providing scalability without increasing the number of GPUs.
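
A back-of-the-envelope roofline calculation shows why decode-time attention has low operational intensity. The sketch below is an estimate under simplifying assumptions (one query token per sequence, fp16 KV cache, Softmax cost and query/output traffic ignored); the peak throughput and bandwidth figures are hypothetical and do not describe any particular GPU or the HPU.

```python
def decode_attention_intensity(seq_len, head_dim, bytes_per_element=2):
    """FLOPs per byte for one attention head during a single decode step.

    Counts 2*L*d FLOPs for the query-key scores and 2*L*d FLOPs for the
    weighted sum over values, against reading the K and V caches once.
    """
    flops = 4 * seq_len * head_dim
    bytes_read = 2 * seq_len * head_dim * bytes_per_element
    return flops / bytes_read

def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * mem_bandwidth)

intensity = decode_attention_intensity(seq_len=4096, head_dim=128)

# Hypothetical accelerator: 300 TFLOP/s peak compute, 2 TB/s memory bandwidth.
peak = 300e12
bandwidth = 2e12
throughput = attainable_flops(intensity, peak, bandwidth)
utilization = throughput / peak
```

Under these assumptions the intensity is about 1 FLOP per byte regardless of sequence length, and since each sequence in a batch has its own KV cache, batching does not raise it. The attainable throughput is then set by memory bandwidth, which is the kind of work the abstract proposes to offload.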

cs.NE

[116] Regularizing Differentiable Architecture Search with Smooth Activation

Yanlin Zhou,Mostafa El-Khamy,Kee-Bong Song

Main category: cs.NE

TL;DR: SA-DARTS uses a smooth activation function to address skip dominance in DARTS and improve NAS performance.

Details

Motivation: DARTS suffers from robustness, generalization, and discrepancy issues, in particular performance collapse caused by skip dominance. Method: Proposes SA-DARTS, which applies a smooth activation function on architecture weights as an auxiliary loss to counter the unfair advantage of weight-free operations. Result: Achieves SOTA results on NAS-Bench-201, classification, and super-resolution, and improves the performance of existing models. Conclusion: SA-DARTS is a simple and effective method that addresses core problems of DARTS and advances NAS. Abstract: Differentiable Architecture Search (DARTS) is an efficient Neural Architecture Search (NAS) method but suffers from robustness, generalization, and discrepancy issues. Many efforts have been made towards the performance collapse issue caused by skip dominance with various regularization techniques towards operation weights, path weights, noise injection, and super-network redesign. It had become questionable at a certain point if there could exist a better and more elegant way to retract the search to its intended goal -- NAS is a selection problem. In this paper, we undertake a simple but effective approach, named Smooth Activation DARTS (SA-DARTS), to overcome skip dominance and discretization discrepancy challenges. By leveraging a smooth activation function on architecture weights as an auxiliary loss, our SA-DARTS mitigates the unfair advantage of weight-free operations, converging to fanned-out architecture weight values, and can recover the search process from skip-dominance initialization. Through theoretical and empirical analysis, we demonstrate that the SA-DARTS can yield new state-of-the-art (SOTA) results on NAS-Bench-201, classification, and super-resolution. Further, we show that SA-DARTS can help improve the performance of SOTA models with fewer parameters, such as Information Multi-distillation Network on the super-resolution task.

cs.MM

[117] 4D Multimodal Co-attention Fusion Network with Latent Contrastive Alignment for Alzheimer's Diagnosis

Yuxiang Wei,Yanteng Zhang,Xi Xiao,Tianyang Wang,Xiao Wang,Vince D. Calhoun

Main category: cs.MM

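TL;DR: Proposes M2M-AlignNet, a multimodal neuroimaging fusion method for early Alzheimer's disease diagnosis that tackles modality heterogeneity with a geometry-aware multimodal co-attention network and latent alignment.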

Details

Motivation: Fusing multimodal neuroimaging data (such as sMRI and fMRI) can increase diagnostic sensitivity for Alzheimer's disease, but heterogeneity between modalities (such as 4D fMRI versus 3D sMRI) makes feature fusion difficult. Method: Proposes M2M-AlignNet, a geometry-aware multimodal co-attention network with latent alignment, which reduces representational discrepancies through an M2M contrastive loss and uses a latent-as-query co-attention module to discover fusion patterns autonomously. Result: Experiments confirm the effectiveness of the method and reveal the correspondence between fMRI and sMRI as Alzheimer's disease biomarkers. Conclusion: Through geometry awareness and latent alignment, M2M-AlignNet addresses the challenges of multimodal neuroimaging fusion and offers a new approach to early Alzheimer's disease diagnosis. Abstract: Multimodal neuroimaging provides complementary structural and functional insights into both human brain organization and disease-related dynamics. Recent studies demonstrate enhanced diagnostic sensitivity for Alzheimer's disease (AD) through synergistic integration of neuroimaging data (e.g., sMRI, fMRI) with behavioral cognitive scores tabular data biomarkers. However, the intrinsic heterogeneity across modalities (e.g., 4D spatiotemporal fMRI dynamics vs. 3D anatomical sMRI structure) presents critical challenges for discriminative feature fusion. To bridge this gap, we propose M2M-AlignNet: a geometry-aware multimodal co-attention network with latent alignment for early AD diagnosis using sMRI and fMRI. At the core of our approach is a multi-patch-to-multi-patch (M2M) contrastive loss function that quantifies and reduces representational discrepancies via geometry-weighted patch correspondence, explicitly aligning fMRI components across brain regions with their sMRI structural substrates without one-to-one constraints. Additionally, we propose a latent-as-query co-attention module to autonomously discover fusion patterns, circumventing modality prioritization biases while minimizing feature redundancy. We conduct extensive experiments to confirm the effectiveness of our method and highlight the correspondence between fMRI and sMRI as AD biomarkers.

q-bio.NC

[118] BrainPrompt: Multi-Level Brain Prompt Enhancement for Neurological Condition Identification

Jiaxing Xu,Kai He,Yue Tang,Wei Li,Mengcheng Lan,Xia Dong,Yiping Ke,Mengling Feng

Main category: q-bio.NC

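TL;DR: BrainPrompt combines graph neural networks with large language models, using multi-level knowledge-driven prompts to improve predictive power and interpretability in neurological disease diagnosis.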

Details

Motivation: Early diagnosis of neurological disease is difficult, and existing methods rely on imaging data alone, overlooking non-imaging factors and limiting predictive power and interpretability. Method: Proposes the BrainPrompt framework, which integrates ROI-level, subject-level, and disease-level knowledge-driven prompts and combines GNNs with LLMs to capture multi-modal information. Result: Outperforms existing methods on fMRI datasets and extracts interpretable information consistent with domain knowledge in neuroscience. Conclusion: Through multi-level prompts and knowledge enhancement, BrainPrompt significantly improves both the performance and the interpretability of neurological disease diagnosis. Abstract: Neurological conditions, such as Alzheimer's Disease, are challenging to diagnose, particularly in the early stages where symptoms closely resemble healthy controls. Existing brain network analysis methods primarily focus on graph-based models that rely solely on imaging data, which may overlook important non-imaging factors and limit the model's predictive power and interpretability. In this paper, we present BrainPrompt, an innovative framework that enhances Graph Neural Networks (GNNs) by integrating Large Language Models (LLMs) with knowledge-driven prompts, enabling more effective capture of complex, non-imaging information and external knowledge for neurological disease identification. BrainPrompt integrates three types of knowledge-driven prompts: (1) ROI-level prompts to encode the identity and function of each brain region, (2) subject-level prompts that incorporate demographic information, and (3) disease-level prompts to capture the temporal progression of disease. By leveraging these multi-level prompts, BrainPrompt effectively harnesses knowledge-enhanced multi-modal information from LLMs, enhancing the model's capability to predict neurological disease stages while offering more interpretable results. We evaluate BrainPrompt on two resting-state functional Magnetic Resonance Imaging (fMRI) datasets from neurological disorders, showing its superiority over state-of-the-art methods. Additionally, a biomarker study demonstrates the framework's ability to extract valuable and interpretable information aligned with domain knowledge in neuroscience.

cs.CY

[119] Cooperative Speech, Semantic Competence, and AI

Mahrad Almotahari

Main category: cs.CY

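TL;DR: The paper examines the moral basis of cooperative speech and argues that large language models (LLMs) lack the moral standing of cooperative interlocutors, so they cannot make genuine assertions, which casts doubt on their semantic competence.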

Details

Motivation: To study the moral dimension of cooperative speech, especially the role of respect in conversation, and whether LLMs have the corresponding moral standing. Method: Philosophical analysis of the moral requirements of cooperative speech and their applicability to LLMs. Result: Because LLMs lack moral standing, they cannot be genuine cooperative interlocutors, and their capacity for assertion is called into question. Conclusion: Semantic competence is a matter not only for cognitive psychology but also for moral psychology, as the limitations of LLMs highlight. Abstract: Cooperative speech is purposive. From the speaker's perspective, one crucial purpose is the transmission of knowledge. Cooperative speakers care about getting things right for their conversational partners. This attitude is a kind of respect. Cooperative speech is an ideal form of communication because participants have respect for each other. And having respect within a cooperative enterprise is sufficient for a particular kind of moral standing: we ought to respect those who have respect for us. Respect demands reciprocity. I maintain that large language models aren't owed the kind of respect that partly constitutes a cooperative conversation. This implies that they aren't cooperative interlocutors, otherwise we would be obliged to reciprocate the attitude. Leveraging this conclusion, I argue that present-day LLMs are incapable of assertion and that this raises an overlooked doubt about their semantic competence. One upshot of this argument is that knowledge of meaning isn't just a subject for the cognitive psychologist. It's also a subject for the moral psychologist.

[120] Reflexive Prompt Engineering: A Framework for Responsible Prompt Engineering and Interaction Design

Christian Djeffal

Main category: cs.CY

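TL;DR: The article discusses the importance of responsible prompt engineering and proposes a five-component framework for embedding ethical and legal considerations in generative AI.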

Details

Motivation: As generative AI becomes widespread, prompt engineering has far-reaching implications for fairness, accountability, and transparency, so ethics must be combined with technical optimization. Method: Proposes a framework covering prompt design, system selection, system configuration, performance evaluation, and prompt management, supported by empirical evidence. Result: Effective prompt engineering requires balancing technical precision with ethical awareness, improving societal outcomes while reducing risks. Conclusion: Responsible prompt engineering is a key bridge between AI development and deployment, and further research and practical guidance are needed. Abstract: Responsible prompt engineering has emerged as a critical framework for ensuring that generative artificial intelligence (AI) systems serve society's needs while minimizing potential harms. As generative AI applications become increasingly powerful and ubiquitous, the way we instruct and interact with them through prompts has profound implications for fairness, accountability, and transparency. This article examines how strategic prompt engineering can embed ethical and legal considerations and societal values directly into AI interactions, moving beyond mere technical optimization for functionality. This article proposes a comprehensive framework for responsible prompt engineering that encompasses five interconnected components: prompt design, system selection, system configuration, performance evaluation, and prompt management. Drawing from empirical evidence, the paper demonstrates how each component can be leveraged to promote improved societal outcomes while mitigating potential risks. The analysis reveals that effective prompt engineering requires a delicate balance between technical precision and ethical consciousness, combining the systematic rigor and focus on functionality with the nuanced understanding of social impact. Through examination of real-world and emerging practices, the article illustrates how responsible prompt engineering serves as a crucial bridge between AI development and deployment, enabling organizations to fine-tune AI outputs without modifying underlying model architectures. This approach aligns with broader "Responsibility by Design" principles, embedding ethical considerations directly into the implementation process rather than treating them as post-hoc additions. The article concludes by identifying key research directions and practical guidelines for advancing the field of responsible prompt engineering.

cs.AI

[121] IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery

Aniketh Garikaparthi,Manasi Patwardhan,Lovekesh Vig,Arman Cohan

Main category: cs.AI

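TL;DR: IRIS is an open-source platform that uses LLMs to assist scientific research, generating novel hypotheses through a transparent and steerable human-in-the-loop approach.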

Details

Motivation: Existing automated hypothesis generation methods lack transparency and human-in-the-loop collaboration. Method: Combines Monte Carlo Tree Search (MCTS), a fine-grained feedback mechanism, and query-based literature synthesis. Result: A user study shows that IRIS effectively enhances researchers' ideation. Conclusion: IRIS provides a transparent, controllable tool for LLM-assisted hypothesis generation in scientific research. Abstract: The rapid advancement in capabilities of large language models (LLMs) raises a pivotal question: How can LLMs accelerate scientific discovery? This work tackles the crucial first stage of research, generating novel hypotheses. While recent work on automated hypothesis generation focuses on multi-agent frameworks and extending test-time compute, none of the approaches effectively incorporate transparency and steerability through a synergistic Human-in-the-loop (HITL) approach. To address this gap, we introduce IRIS: Interactive Research Ideation System, an open-source platform designed for researchers to leverage LLM-assisted scientific ideation. IRIS incorporates innovative features to enhance ideation, including adaptive test-time compute expansion via Monte Carlo Tree Search (MCTS), a fine-grained feedback mechanism, and query-based literature synthesis. It is designed to empower researchers with greater control and insight throughout the ideation process. We additionally conduct a user study with researchers across diverse disciplines, validating the effectiveness of our system in enhancing ideation. We open-source our code at https://github.com/Anikethh/IRIS-Interactive-Research-Ideation-System

[122] AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset

Ivan Moshkov,Darragh Hanley,Ivan Sorokin,Shubham Toshniwal,Christof Henkel,Benedikt Schifferer,Wei Du,Igor Gitman

Main category: cs.AI

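TL;DR: The paper describes the winning AIMO-2 model, built on large-scale dataset construction, a method that combines code execution with reasoning models, and generative solution selection.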

Details

Motivation: To improve the performance of mathematical reasoning models, especially on complex math problems. Method: 1. Build a dataset of 540K high-quality math problems; 2. Develop an iterative training method that combines code execution with reasoning models; 3. Propose generative solution selection (GenSelect). Result: The models reach state-of-the-art results on mathematical reasoning benchmarks. Conclusion: Combining the dataset, the improved method, and solution selection yields high-performing mathematical reasoning models, and the associated resources are released publicly. Abstract: This paper presents our winning submission to the AI Mathematical Olympiad - Progress Prize 2 (AIMO-2) competition. Our recipe for building state-of-the-art mathematical reasoning models relies on three key pillars. First, we create a large-scale dataset comprising 540K unique high-quality math problems, including olympiad-level problems, and their 3.2M long-reasoning solutions. Second, we develop a novel method to integrate code execution with long reasoning models through iterative training, generation, and quality filtering, resulting in 1.7M high-quality Tool-Integrated Reasoning solutions. Third, we create a pipeline to train models to select the most promising solution from many candidates. We show that such generative solution selection (GenSelect) can significantly improve upon the majority voting baseline. Combining these ideas, we train a series of models that achieve state-of-the-art results on mathematical reasoning benchmarks. To facilitate further research, we release our code, models, and the complete OpenMathReasoning dataset under a commercially permissive license.
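
For context, the majority voting baseline that GenSelect is compared against can be sketched in a few lines. This covers the baseline only; GenSelect itself is a trained selection model and is not reproduced here. The function name and the sample answers are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the final answer that appears most often among the candidates.

    Ties are broken in favor of the answer seen first, which follows from
    Counter preserving insertion order for equal counts.
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Final answers extracted from several sampled solutions to one problem.
candidate_answers = ["42", "17", "42", "8", "42", "17"]
selected = majority_vote(candidate_answers)
```

Majority voting looks only at the final answers, so it cannot prefer a rare but correct solution; a learned selector that reads the candidate solutions is one way around that limit.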

eess.IV

[123] Comprehensive Evaluation of Quantitative Measurements from Automated Deep Segmentations of PSMA PET/CT Images

Obed Korshie Dzikunu,Amirhossein Toosi,Shadab Ahamed,Sara Harsini,Francois Benard,Xiaoxiao Li,Arman Rahmim

Main category: eess.IV

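TL;DR: The study evaluates six quantitative metrics derived from deep-learning segmentations and finds that the proposed L1-weighted Dice Focal Loss (L1DFL) with Attention U-Net performs best, with the highest correlation to the ground truth.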

Details

Motivation: Evaluation by the traditional Dice Similarity Coefficient alone is limited, so a more comprehensive assessment of quantitative metrics is needed. Method: Trains U-Net, Attention U-Net, and SegResNet with four loss functions (including the proposed L1DFL) on 380 PSMA PET/CT scans. Result: Attention U-Net with L1DFL performs best (correlation 0.90-0.99); SUV metrics and TLA are reproduced well, while tumor volume and lesion spread show greater variability. Conclusion: L1DFL minimizes variability in clinical measurements, and the code is open source. Abstract: This study performs a comprehensive evaluation of quantitative measurements as extracted from automated deep-learning-based segmentation methods, beyond traditional Dice Similarity Coefficient assessments, focusing on six quantitative metrics, namely SUVmax, SUVmean, total lesion activity (TLA), tumor volume (TMTV), lesion count, and lesion spread. We analyzed 380 prostate-specific membrane antigen (PSMA) targeted [18F]DCFPyL PET/CT scans of patients with biochemical recurrence of prostate cancer, training deep neural networks, U-Net, Attention U-Net and SegResNet with four loss functions: Dice Loss, Dice Cross Entropy, Dice Focal Loss, and our proposed L1 weighted Dice Focal Loss (L1DFL). Evaluations indicated that Attention U-Net paired with L1DFL achieved the strongest correlation with the ground truth (concordance correlation = 0.90-0.99 for SUVmax and TLA), whereas models employing the Dice Loss and the other two compound losses, particularly with SegResNet, underperformed. Equivalence testing (TOST, alpha = 0.05, Delta = 20%) confirmed high performance for SUV metrics, lesion count and TLA, with L1DFL yielding the best performance. By contrast, tumor volume and lesion spread exhibited greater variability. Bland-Altman, Coverage Probability, and Total Deviation Index analyses further highlighted that our proposed L1DFL minimizes variability in quantification of the ground truth clinical measures. The code is publicly available at: https://github.com/ObedDzik/pca_segment.git.
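
The concordance correlation reported in the abstract can be computed with Lin's concordance correlation coefficient. The sketch below uses population (1/n) variances and made-up measurement values; whether the authors used exactly this estimator is an assumption.

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two measurement series.

    Unlike Pearson correlation, it also penalizes a systematic offset or
    scale difference between predictions and ground truth.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)

# Hypothetical per-patient values, e.g. a metric from manual vs. automated masks.
ground_truth = [2.0, 4.0, 6.0, 8.0]
predicted_biased = [3.0, 5.0, 7.0, 9.0]  # perfectly correlated but offset by +1

ccc_identical = concordance_correlation(ground_truth, ground_truth)
ccc_biased = concordance_correlation(ground_truth, predicted_biased)
```

The offset series has a Pearson correlation of 1 but a concordance below 1, which is why the coefficient suits checking that automated segmentations reproduce the clinical values themselves and not just their trend.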

[124] Frequency-Compensated Network for Daily Arctic Sea Ice Concentration Prediction

Jialiang Zhang,Feng Gao,Yanhai Gan,Junyu Dong,Qian Du

Main category: eess.IV

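TL;DR: Proposes a Frequency-Compensated Network (FCNet) for daily Arctic sea ice concentration (SIC) prediction, addressing the weaknesses of existing methods in modeling long-term feature dependencies in the frequency domain and in preserving high-frequency details.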

Details

Motivation: Accurately forecasting Arctic sea ice concentration is critical to global ecosystem health and navigation safety, but existing methods fall short in frequency-domain feature dependencies and in preserving high-frequency details. Method: Designs a dual-branch network with a frequency feature extraction branch and a convolutional feature extraction branch, which use an adaptive frequency filter block and a high-frequency enhancement block, respectively, to extract and enhance features. Result: Experiments on satellite data confirm the effectiveness of FCNet, which predicts sea ice edges and fine details more precisely. Conclusion: By combining frequency-domain and convolutional feature extraction, FCNet significantly improves the accuracy of sea ice concentration prediction; the code and data will be made public. Abstract: Accurately forecasting sea ice concentration (SIC) in the Arctic is critical to global ecosystem health and navigation safety. However, current methods still face two challenges: 1) they rarely explore the long-term feature dependencies in the frequency domain; 2) they can hardly preserve the high-frequency details, so the changes in the marginal area of the sea ice cannot be accurately captured. To this end, we present a Frequency-Compensated Network (FCNet) for Arctic SIC prediction on a daily basis. In particular, we design a dual-branch network, including branches for frequency feature extraction and convolutional feature extraction. For frequency feature extraction, we design an adaptive frequency filter block, which integrates trainable layers with Fourier-based filters. By adding frequency features, the FCNet can achieve refined prediction of edges and details. For convolutional feature extraction, we propose a high-frequency enhancement block to separate high and low-frequency information. Moreover, high-frequency features are enhanced via channel-wise attention, and a temporal attention unit is employed for low-frequency feature extraction to capture long-range sea ice changes. Extensive experiments are conducted on a satellite-derived daily SIC dataset, and the results verify the effectiveness of the proposed FCNet. Our code and data will be made publicly available at: https://github.com/oucailab/FCNet .

[125] Advanced Chest X-Ray Analysis via Transformer-Based Image Descriptors and Cross-Model Attention Mechanism

Lakshita Agarwal,Bindu Verma

Main category: eess.IV

TL;DR: Proposes an image description generation model that combines a Vision Transformer with GPT-4 to improve the efficiency and accuracy of chest X-ray diagnosis.

Details

Motivation: Chest X-ray examination is essential for diagnosing thoracic diseases, but conventional methods are limited in generating descriptions. Method: Uses a ViT encoder to extract visual features, combined with cross-modal attention and a GPT-4-based decoder to generate descriptions. Result: Performs strongly on the NIH and IU datasets, leading on multiple metrics. Conclusion: The model could assist radiologists in more precise and efficient diagnosis. Abstract: The examination of chest X-ray images is a crucial component in detecting various thoracic illnesses. This study introduces a new image description generation model that integrates a Vision Transformer (ViT) encoder with cross-modal attention and a GPT-4-based transformer decoder. The ViT captures high-quality visual features from chest X-rays, which are fused with text data through cross-modal attention to improve the accuracy, context, and richness of image descriptions. The GPT-4 decoder transforms these fused features into accurate and relevant captions. The model was tested on the National Institutes of Health (NIH) and Indiana University (IU) Chest X-ray datasets. On the IU dataset, it achieved scores of 0.854 (B-1), 0.883 (CIDEr), 0.759 (METEOR), and 0.712 (ROUGE-L). On the NIH dataset, it achieved the best performance on all metrics: BLEU 1--4 (0.825, 0.788, 0.765, 0.752), CIDEr (0.857), METEOR (0.726), and ROUGE-L (0.705). This framework has the potential to enhance chest X-ray evaluation, assisting radiologists in more precise and efficient diagnosis.

cs.RO

[126] Physically Consistent Humanoid Loco-Manipulation using Latent Diffusion Models

Ilyass Taouil,Haizhou Zhao,Angela Dai,Majid Khadiv

Main category: cs.RO

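TL;DR: Uses latent diffusion models (LDMs) to generate realistic human-object interaction scenes that guide humanoid motion planning, then produces physically consistent trajectories through trajectory optimization.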

Details

Motivation: Humanoid robots lack realistic scene guidance for long-horizon motion planning. Method: Extracts contact locations and robot configurations from LDM-generated images and uses them in whole-body trajectory optimization (TO) to generate physically consistent trajectories. Result: Validated in simulation on different long-horizon scenarios, with an analysis of the contact and configuration extraction pipeline. Conclusion: Information extracted from LDMs can be used to generate physically consistent long-horizon trajectories. Abstract: This paper uses the capabilities of latent diffusion models (LDMs) to generate realistic RGB human-object interaction scenes to guide humanoid loco-manipulation planning. To do so, we extract from the generated images both the contact locations and robot configurations that are then used inside a whole-body trajectory optimization (TO) formulation to generate physically consistent trajectories for humanoids. We validate our full pipeline in simulation for different long-horizon loco-manipulation scenarios and perform an extensive analysis of the proposed contact and robot configuration extraction pipeline. Our results show that using the information extracted from LDMs, we can generate physically consistent trajectories that require long-horizon reasoning.

Table of Contents

cs.CV

[1] Audio and Multiscale Visual Cues Driven Cross-modal Transformer for Idling Vehicle Detection

[2] Shape Your Ground: Refining Road Surfaces Beyond Planar Representations

[3] Persistence-based Hough Transform for Line Detection

[4] Context-Awareness and Interpretability of Rare Occurrences for Discovery and Formalization of Critical Failure Modes

[5] MonoTher-Depth: Enhancing Thermal Depth Estimation via Confidence-Aware Distillation

[6] Hybrid Knowledge Transfer through Attention and Logit Distillation for On-Device Vision Systems in Agricultural IoT

[7] SAIP-Net: Enhancing Remote Sensing Image Segmentation via Spectral Adaptive Information Propagation

[8] Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends

[9] Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

[10] Classification of Firn Data via Topological Features

[11] A detection-task-specific deep-learning method to improve the quality of sparse-view myocardial perfusion SPECT images

[12] CLIP-IT: CLIP-based Pairing for Histology Images Classification

[13] DeepCS-TRD, a Deep Learning-based Cross-Section Tree Ring Detector

[14] Naturally Computed Scale Invariance in the Residual Stream of ResNet18

[15] MetaHarm: Harmful YouTube Video Dataset Annotated by Domain Experts, GPT-4-Turbo, and Crowdworkers

[16] SignX: The Foundation Model for Sign Recognition

[17] Almost Right: Making First-layer Kernels Nearly Orthogonal Improves Model Generalization

[18] CLPSTNet: A Progressive Multi-Scale Convolutional Steganography Model Integrating Curriculum Learning

[19] Revisiting Radar Camera Alignment by Contrastive Learning for 3D Object Detection

[20] SaENeRF: Suppressing Artifacts in Event-based Neural Radiance Fields

[21] Assessing the Feasibility of Internet-Sourced Video for Automatic Cattle Lameness Detection

[22] PixelWeb: The First Web GUI Dataset with Pixel-Wise Labels

[23] FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing

[24] Marginalized Generalized IoU (MGIoU): A Unified Objective Function for Optimizing Any Convex Parametric Shapes

[25] Cross Paradigm Representation and Alignment Transformer for Image Deraining

[26] MTSGL: Multi-Task Structure Guided Learning for Robust and Interpretable SAR Aircraft Recognition

[27] RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory

[28] Rethinking Generalizable Infrared Small Target Detection: A Real-scene Benchmark and Cross-view Representation Learning

[29] PRaDA: Projective Radial Distortion Averaging

[30] TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance

[31] Federated Learning of Low-Rank One-Shot Image Detection Models in Edge Devices with Scalable Accuracy and Compute Complexity

[32] Think Hierarchically, Act Dynamically: Hierarchical Multi-modal Fusion and Reasoning for Vision-and-Language Navigation

[33] A Few-Shot Metric Learning Method with Dual-Channel Attention for Cross-Modal Same-Neuron Identification

[34] Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes

[35] ToF-Splatting: Dense SLAM using Sparse Time-of-Flight Depth and Multi-Frame Integration

[36] Beyond Anonymization: Object Scrubbing for Privacy-Preserving 2D and 3D Vision Tasks

[37] CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones

[38] JEPA for RL: Investigating Joint-Embedding Predictive Architectures for Reinforcement Learning

[39] Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections

[40] EHGCN: Hierarchical Euclidean-Hyperbolic Fusion via Motion-Aware GCN for Hybrid Event Stream Perception

[41] Dual-Camera All-in-Focus Neural Radiance Fields

[42] RouteWinFormer: A Route-Window Transformer for Middle-range Attention in Image Restoration

[43] SSLR: A Semi-Supervised Learning Method for Isolated Sign Language Recognition

[44] WiFi based Human Fall and Activity Recognition using Transformer based Encoder Decoder and Graph Neural Networks

[45] Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

[46] A Time Series Dataset of NIR Spectra and RGB and NIR-HSI Images of the Barley Germination Process

[47] A Diff-Attention Aware State Space Fusion Model for Remote Sensing Classification

[48] SemanticSugarBeets: A Multi-Task Framework and Dataset for Inspecting Harvest and Storage Characteristics of Sugar Beets

[49] Energy-Based Pseudo-Label Refining for Source-free Domain Adaptation

[50] PMG: Progressive Motion Generation via Sparse Anchor Postures Curriculum Learning

[51] Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering

[52] V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations

[53] Prompt-Tuning SAM: From Generalist to Specialist with only 2048 Parameters and 16 Training Images

[54] Gaussian Splatting is an Effective Data Generator for 3D Object Detection

[55] Feature Mixing Approach for Detecting Intraoperative Adverse Events in Laparoscopic Roux-en-Y Gastric Bypass Surgery

[56] Tri-FusionNet: Enhancing Image Description Generation with Transformer-based Fusion Network and Dual Attention Mechanism

[57] Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation

[58] Decoupled Global-Local Alignment for Improving Compositional Understanding

[59] A Low-Cost Photogrammetry System for 3D Plant Modeling and Phenotyping

[60] Hyperspectral Vision Transformers for Greenhouse Gas Estimations from Space

[61] High-Quality Cloud-Free Optical Image Synthesis Using Multi-Temporal SAR and Contaminated Optical Data

[62] BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation

[63] DreamO: A Unified Framework for Image Customization

[64] Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light

[65] Procedural Dataset Generation for Zero-Shot Stereo Matching

cs.GR

[66] Digital Kitchen Remodeling: Editing and Relighting Intricate Indoor Scenes from a Single Panorama

[67] HUG: Hierarchical Urban Gaussian Splatting with Block-Based Reconstruction

cs.CL

[68] FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking

[69] The Language of Attachment: Modeling Attachment Dynamics in Psychotherapy

[70] The Paradox of Poetic Intent in Back-Translation: Evaluating the Quality of Large Language Models in Chinese Translation

[71] Capturing Symmetry and Antisymmetry in Language Models through Symmetry-Aware Training Objectives

[72] Transformer-Based Extraction of Statutory Definitions from the U.S. Code

[73] Text-to-TrajVis: Enabling Trajectory Data Visualizations from Natural Language Questions

[74] SplitReason: Learning To Offload Reasoning

[75] ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs

[76] Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation