cs.CV

[1] Unconstrained Large-scale 3D Reconstruction and Rendering across Altitudes

Neil Joshi, Joshua Carney, Nathanael Kuo, Homer Li, Cheng Peng, Myron Brown

Main category: cs.CV

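TL;DR: The paper presents a public benchmark dataset for practical challenges in 3D reconstruction and novel view synthesis, such as limited image counts, uncalibrated cameras, inconsistent lighting, and extreme viewpoint differences.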

Details

Motivation: Photorealistic, navigable 3D scene models are needed for scenarios such as disaster relief and law enforcement, but the available imagery is often scarce and of uneven quality.
Method: Builds a public benchmark dataset from multiple calibrated cameras covering ground-level, security-level, and airborne viewpoints, and evaluates both the calibration of unposed cameras and the quality of novel rendered views.
Result: Reports baseline performance for current state-of-practice methods and identifies challenges for further research.
Conclusion: The dataset provides a testbed of real-world challenges for 3D reconstruction and novel view synthesis research.
Abstract: Production of photorealistic, navigable 3D site models requires a large volume of carefully collected images that are often unavailable to first responders for disaster relief or law enforcement. Real-world challenges include limited numbers of images, heterogeneous unposed cameras, inconsistent lighting, and extreme viewpoint differences for images collected from varying altitudes. To promote research aimed at addressing these challenges, we have developed the first public benchmark dataset for 3D reconstruction and novel view synthesis based on multiple calibrated ground-level, security-level, and airborne cameras. We present datasets that pose real-world challenges, independently evaluate calibration of unposed cameras and quality of novel rendered views, demonstrate baseline performance using recent state-of-practice methods, and identify challenges for further research.

[2] MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection

Qiushi Yang, Yuan Yao, Miaomiao Cui, Liefeng Bo

Main category: cs.CV

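TL;DR: MoSAM adds motion-guided prompting and a dynamic spatial-temporal memory selection mechanism to address SAM2's reliance on a fixed-frame memory in video segmentation, improving object tracking and segmentation accuracy.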

Details

Motivation: SAM2 segments using only a fixed memory of the past six frames, so performance drops when objects disappear or are occluded, and it lacks the motion information needed for long-range tracking.
Method: Proposes Motion-Guided Prompting (MGP) to inject motion information, and Spatial-Temporal Memory Selection (ST-MS) to dynamically pick reliable memory frames.
Result: MoSAM achieves state-of-the-art results on multiple video object segmentation and video instance segmentation benchmarks.
Conclusion: Combining motion information with dynamic memory selection markedly improves the robustness and accuracy of video segmentation.
Abstract: The recent Segment Anything Model 2 (SAM2) has demonstrated exceptional capabilities in interactive object segmentation for both images and videos. However, as a foundational model for interactive segmentation, SAM2 performs segmentation directly based on mask memory from the past six frames, leading to two significant challenges. Firstly, during inference in videos, objects may disappear since SAM2 relies solely on memory without accounting for object motion information, which limits its long-range object tracking capabilities. Secondly, its memory is constructed from fixed past frames, making it susceptible to challenges associated with object disappearance or occlusion, due to potentially inaccurate segmentation results in memory. To address these problems, we present MoSAM, incorporating two key strategies to integrate object motion cues into the model and establish more reliable feature memory. Firstly, we propose Motion-Guided Prompting (MGP), which represents the object motion in both sparse and dense manners, then injects them into SAM2 through a set of motion-guided prompts. MGP enables the model to adjust its focus towards the direction of motion, thereby enhancing the object tracking capabilities. Furthermore, acknowledging that past segmentation results may be inaccurate, we devise a Spatial-Temporal Memory Selection (ST-MS) mechanism that dynamically identifies frames likely to contain accurate segmentation at both the pixel and frame levels. By eliminating potentially inaccurate mask predictions from memory, we can leverage more reliable memory features to exploit similar regions for improving segmentation results. Extensive experiments on various benchmarks of video object segmentation and video instance segmentation demonstrate that our MoSAM achieves state-of-the-art results compared to other competitors.
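The contrast between a fixed six-frame memory and a selected memory can be illustrated with a small sketch. This is not the authors' ST-MS implementation: it covers only a frame-level choice, and the per-frame `reliability` score, the `min_reliability` threshold, and both function names are hypothetical stand-ins for whatever signal the model actually uses.

```python
def recent_memory(frame_ids, size=6):
    """Fixed policy: always keep the most recent `size` frames."""
    return frame_ids[-size:]


def selected_memory(frame_ids, reliability, size=6, min_reliability=0.5):
    """Selection policy (assumed): keep up to `size` past frames whose
    reliability score clears a threshold, preferring the most reliable,
    and return them in temporal order."""
    candidates = [f for f in frame_ids if reliability[f] >= min_reliability]
    ranked = sorted(candidates, key=lambda f: reliability[f], reverse=True)
    return sorted(ranked[:size])


# Toy scores (made up): frames 6-8 are occluded and segmented poorly.
frames = list(range(10))
scores = {0: 0.9, 1: 0.8, 2: 0.95, 3: 0.7, 4: 0.85,
          5: 0.9, 6: 0.1, 7: 0.2, 8: 0.1, 9: 0.6}

print(recent_memory(frames))            # includes the occluded frames
print(selected_memory(frames, scores))  # skips them
```

In this toy case the fixed window carries three unreliable masks into memory, while the selection policy falls back on earlier, cleaner frames.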

[3] Fast2comm: Collaborative perception combined with prior knowledge

Zhengbin Zhang, Yan Wu, Hongkun Zhang

Main category: cs.CV

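TL;DR: Fast2comm is a prior knowledge-based collaborative perception framework that generates highly discriminative confidence features and optimizes bandwidth efficiency to balance perception performance against bandwidth limits.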

Details

Motivation: Collaborative perception improves accuracy by sharing information, but it must balance performance against bandwidth and cope with localization errors.
Method: Proposes a prior-supervised confidence feature generation method, a spatial prior feature selection strategy based on GT bounding boxes, and decoupled feature fusion strategies for training and testing.
Result: Experiments on real-world and simulated datasets show superior model performance and confirm that the proposed components are necessary.
Conclusion: Fast2comm addresses key problems in collaborative perception, improving both perception accuracy and bandwidth efficiency.
Abstract: Collaborative perception has the potential to significantly enhance perceptual accuracy through the sharing of complementary information among agents. However, real-world collaborative perception faces persistent challenges, particularly in balancing perception performance and bandwidth limitations, as well as coping with localization errors. To address these challenges, we propose Fast2comm, a prior knowledge-based collaborative perception framework. Specifically, (1) we propose a prior-supervised confidence feature generation method that effectively distinguishes foreground from background by producing highly discriminative confidence features; (2) we propose a GT Bounding Box-based spatial prior feature selection strategy to ensure that only the most informative prior-knowledge features are selected and shared, thereby minimizing background noise and optimizing bandwidth efficiency while enhancing adaptability to localization inaccuracies; (3) we decouple the feature fusion strategies between model training and testing phases, enabling dynamic bandwidth adaptation. To comprehensively validate our framework, we conduct extensive experiments on both real-world and simulated datasets. The results demonstrate the superior performance of our model and highlight the necessity of the proposed methods. Our code is available at https://github.com/Zhangzhengbin-TJ/Fast2comm.

[4] Detection and Classification of Diseases in Multi-Crop Leaves using LSTM and CNN Models

Srinivas Kanakala, Sneha Ningappa

Main category: cs.CV

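TL;DR: The study classifies plant leaf diseases with CNN and LSTM models; the CNN outperforms the LSTM, reaching 96.4% validation accuracy, which suggests deep learning is practical for agricultural monitoring.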

Details

Motivation: Plant diseases severely affect crop yield and food quality; early detection and classification are essential for reducing losses and improving crop management.
Method: Trains CNN and LSTM models on a dataset of 70,295 training images and 17,572 validation images; the CNN uses the Adam optimizer with a categorical cross-entropy loss.
Result: The CNN reaches 99.1% training accuracy and 96.4% validation accuracy; the LSTM reaches 93.43% validation accuracy. The evaluation metrics confirm the reliability of the CNN.
Conclusion: Deep learning models, especially CNNs, offer an accurate and scalable solution for plant disease classification that suits practical agricultural monitoring.
Abstract: Plant diseases pose a serious challenge to agriculture by reducing crop yield and affecting food quality. Early detection and classification of these diseases are essential for minimising losses and improving crop management practices. This study applies Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) models to classify plant leaf diseases using a dataset containing 70,295 training images and 17,572 validation images across 38 disease classes. The CNN model was trained using the Adam optimiser with a learning rate of 0.0001 and categorical cross-entropy as the loss function. After 10 training epochs, the model achieved a training accuracy of 99.1% and a validation accuracy of 96.4%. The LSTM model reached a validation accuracy of 93.43%. Performance was evaluated using precision, recall, F1-score, and confusion matrix, confirming the reliability of the CNN-based approach. The results suggest that deep learning models, particularly CNN, enable an effective solution for accurate and scalable plant disease classification, supporting practical applications in agricultural monitoring.
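The evaluation metrics named in the abstract (precision, recall, F1-score, confusion matrix) are related in a standard way, sketched below. The three-class `toy` matrix is made up for illustration and has nothing to do with the paper's 38-class results; the function names are also illustrative.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, f1) lists, one entry per class.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    precision, recall, f1 = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(f)
    return precision, recall, f1


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total if total else 0.0


# Made-up 3-class confusion matrix (rows: true class, columns: predicted).
toy = [[8, 2, 0],
       [1, 7, 2],
       [0, 1, 9]]

precision, recall, f1 = per_class_metrics(toy)
print(accuracy(toy), precision, recall, f1)
```

Per-class values matter here because a single accuracy figure can hide classes that are rarely predicted correctly.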

[5] Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

Jiaxu Qian, Chendong Wang, Yifan Yang, Chaoyun Zhang, Huiqiang Jiang, Xufang Luo, Yu Kang, Qingwei Lin, Anlan Zhang, Shiqi Jiang, Ting Cao, Tianjun Mao, Suman Banerjee, Guyue Liu, Saravan Rajmohan, Dongmei Zhang, Yuqing Yang, Qi Zhang, Lili Qiu

Main category: cs.CV

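TL;DR: The paper proposes Zoomer, a novel visual prompting mechanism that improves the performance of multimodal large language models (MLLMs) while preserving key visual details.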

Details

Motivation: Existing MLLMs handle visual data imprecisely, for example in object recognition and fine detail, and strict token limits often cause key information to be lost.
Method: Zoomer has three innovations: a prompt-aware strategy that dynamically highlights relevant image regions, a spatial-preserving orchestration schema that maintains object integrity, and a budget-aware prompting method that balances global context with key visual details.
Result: Across multiple datasets, Zoomer clearly outperforms baseline methods, improving accuracy by up to 26.9% while substantially reducing token consumption.
Conclusion: Zoomer addresses the performance bottleneck of MLLMs on visual tasks and suggests new directions for future research.
Abstract: Recent advancements in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, excelling in applications like image captioning and interactive question-answering. However, these models struggle with accurately processing visual data, particularly in tasks requiring precise object recognition and fine visual details. Stringent token limits often result in the omission of critical information, hampering performance. To address these limitations, we introduce Zoomer, a novel visual prompting mechanism designed to enhance MLLM performance while preserving essential visual details within token limits. Zoomer features three key innovations: a prompt-aware strategy that dynamically highlights relevant image regions, a spatial-preserving orchestration schema that maintains object integrity, and a budget-aware prompting method that balances global context with crucial visual details. Comprehensive evaluations across multiple datasets demonstrate that Zoomer consistently outperforms baseline methods, achieving up to a 26.9% improvement in accuracy while significantly reducing token consumption.

Yinfeng Yu, Dongsheng Yang

Main category: cs.CV

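TL;DR: The paper proposes the DOPE network, which strengthens object perception in both text and images to address insufficient language understanding and missing cross-modal object relationship modeling in VLN tasks.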

Details

Motivation: Existing VLN methods do not fully exploit the detailed information in language instructions and overlook cross-modal object relationships, which hurts navigation accuracy and robustness.
Method: Designs a TSE module to extract key text information, and uses TOPA and IOPA modules to strengthen object perception in text and images respectively.
Result: Experiments on the R2R and REVERIE datasets validate the effectiveness of DOPE.
Conclusion: By strengthening object perception, DOPE improves performance on VLN tasks.
Abstract: Vision-and-Language Navigation (VLN) is a challenging task where an agent must understand language instructions and navigate unfamiliar environments using visual cues. The agent must accurately locate the target based on visual information from the environment and complete tasks through interaction with the surroundings. Despite significant advancements in this field, two major limitations persist: (1) Many existing methods input complete language instructions directly into multi-layer Transformer networks without fully exploiting the detailed information within the instructions, thereby limiting the agent's language understanding capabilities during task execution; (2) Current approaches often overlook the modeling of object relationships across different modalities, failing to effectively utilize latent clues between objects, which affects the accuracy and robustness of navigation decisions. To address these issues and improve navigation performance, we propose a Dual Object Perception-Enhancement Network (DOPE). First, we design a Text Semantic Extraction (TSE) module to extract relatively essential phrases from the text and input them into the Text Object Perception-Augmentation (TOPA) module to fully leverage details such as objects and actions within the instructions. Second, we introduce an Image Object Perception-Augmentation (IOPA) module, which performs additional modeling of object information across different modalities, enabling the model to more effectively utilize latent clues between objects in images and text, enhancing decision-making accuracy. Extensive experiments on the R2R and REVERIE datasets validate the efficacy of the proposed approach.

[7] Localizing Before Answering: A Benchmark for Grounded Medical Visual Question Answering

Dung Nguyen, Minh Khoi Ho, Huy Ta, Thanh Tam Nguyen, Qi Chen, Kumar Rav, Quy Duong Dang, Satwik Ramchandre, Son Lam Phung, Zhibin Liao, Minh-Son To, Johan Verjans, Phi Le Nguyen, Vu Minh Hieu Phan

Main category: cs.CV

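TL;DR: The paper exposes the limits of current medical large multi-modal models (LMMs) in localizing pathological regions, and proposes the HEAL-MedVQA benchmark and the Localize-before-Answer framework to improve them.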

Details

Motivation: Medical LMMs often hallucinate answers that contradict the source evidence when interpreting medical data, mainly because they lack localization reasoning.
Method: Introduces the HEAL-MedVQA benchmark, with two evaluation protocols and 67K doctor-annotated VQA pairs, and designs the Localize-before-Answer framework to train models to localize pathological regions.
Result: Experiments show the approach significantly outperforms existing biomedical LMMs on HEAL-MedVQA and improves the robustness of medical VQA.
Conclusion: Better localization reduces hallucination in medical LMMs and yields more reliable medical visual question answering.
Abstract: Medical Large Multi-modal Models (LMMs) have demonstrated remarkable capabilities in medical data interpretation. However, these models frequently generate hallucinations contradicting source evidence, particularly due to inadequate localization reasoning. This work reveals a critical limitation in current medical LMMs: instead of analyzing relevant pathological regions, they often rely on linguistic patterns or attend to irrelevant image areas when responding to disease-related queries. To address this, we introduce HEAL-MedVQA (Hallucination Evaluation via Localization MedVQA), a comprehensive benchmark designed to evaluate LMMs' localization abilities and hallucination robustness. HEAL-MedVQA features (i) two innovative evaluation protocols to assess visual and textual shortcut learning, and (ii) a dataset of 67K VQA pairs, with doctor-annotated anatomical segmentation masks for pathological regions. To improve visual reasoning, we propose the Localize-before-Answer (LobA) framework, which trains LMMs to localize target regions of interest and self-prompt to emphasize segmented pathological areas, generating grounded and reliable answers. Experimental results demonstrate that our approach significantly outperforms state-of-the-art biomedical LMMs on the challenging HEAL-MedVQA benchmark, advancing robustness in medical VQA.

[8] Responsive DNN Adaptation for Video Analytics against Environment Shift via Hierarchical Mobile-Cloud Collaborations

Maozhe Zhao, Shengzhong Liu, Fan Wu, Guihai Chen

Main category: cs.CV

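TL;DR: MOCHA is a framework that coordinates mobile and cloud resources to make model adaptation more responsive to environment shifts, cutting delay through on-device model reuse and fast fine-tuning.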

Details

Motivation: Existing cloud-centric model adaptation frameworks lose accuracy during adaptation and react slowly to environment shifts, so a more efficient approach is needed.
Method: MOCHA combines on-device model reuse, fast fine-tuning, and a structured taxonomy index of expert models to improve responsiveness and model retrieval efficiency.
Result: Experiments show MOCHA improves model accuracy during adaptation by up to 6.8% and reduces response delay and retraining time by up to 35.5x and 3.0x respectively.
Conclusion: Hierarchical mobile-cloud collaboration significantly improves the efficiency and performance of model adaptation to environment shifts.
Abstract: Mobile video analysis systems often encounter various deployment environments, where environment shifts present greater demands for responsiveness in adaptations of deployed "expert DNN models". Existing model adaptation frameworks primarily operate in a cloud-centric way, exhibiting degraded performance during adaptation and delayed reactions to environment shifts. Instead, this paper proposes MOCHA, a novel framework optimizing the responsiveness of continuous model adaptation through hierarchical collaborations between mobile and cloud resources. Specifically, MOCHA (1) reduces adaptation response delays by performing on-device model reuse and fast fine-tuning before requesting cloud model retrieval and end-to-end retraining; (2) accelerates history expert model retrieval by organizing them into a structured taxonomy utilizing domain semantics analyzed by a cloud foundation model as indices; (3) enables efficient local model reuse by maintaining onboard expert model caches for frequent scenes, which proactively prefetch model weights from the cloud model database. Extensive evaluations with real-world videos on three DNN tasks show MOCHA improves the model accuracy during adaptation by up to 6.8% while reducing the response delay and retraining time by up to 35.5x and 3.0x respectively.

[9] Entropy Heat-Mapping: Localizing GPT-Based OCR Errors with Sliding-Window Shannon Analysis

Alexei Kaltchenko

Main category: cs.CV

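TL;DR: The paper proposes an entropy heat-map method that localizes OCR errors by analyzing GPT-4o's token-level confidence signals. Experiments show that high-entropy regions correlate strongly with actual errors.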

Details

Motivation: Vision-language models such as GPT-4o can transcribe mathematical documents, yet their token-level confidence signals are rarely used to localize recognition errors.
Method: Scans the sequence of per-token Shannon entropies with a sliding window to build a visual "uncertainty landscape", and flags high-entropy regions as likely error locations.
Result: Most true errors fall inside the high-entropy regions.
Conclusion: Sliding-window entropy analysis can serve as a lightweight aid for post-editing GPT-based OCR.
Abstract: Vision-language models such as OpenAI GPT-4o can transcribe mathematical documents directly from images, yet their token-level confidence signals are seldom used to pinpoint local recognition mistakes. We present an entropy-heat-mapping proof-of-concept that turns per-token Shannon entropy into a visual "uncertainty landscape". By scanning the entropy sequence with a fixed-length sliding window, we obtain hotspots that are likely to contain OCR errors such as missing symbols, mismatched braces, or garbled prose. Using a small, curated set of scanned research pages rendered at several resolutions, we compare the highlighted hotspots with the actual transcription errors produced by GPT-4o. Our analysis shows that the vast majority of true errors are indeed concentrated inside the high-entropy regions. This study demonstrates, in a minimally engineered setting, that sliding-window entropy can serve as a practical, lightweight aid for post-editing GPT-based OCR. All code, sample data, and annotation guidelines are released to encourage replication and further research.
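The sliding-window entropy scan described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's released code: it assumes each token position comes with a list of top-k natural-log probabilities, renormalises over those k alternatives, and flags a window when its mean entropy exceeds a fixed `threshold`. The window length, threshold, and function names are all hypothetical.

```python
import math


def token_entropy(logprobs):
    """Shannon entropy in bits for one token position, computed from its
    top-k natural-log probabilities (renormalised over the k alternatives)."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sliding_mean(values, window):
    """Mean of every contiguous run of `window` values."""
    if window <= 0 or window > len(values):
        return []
    running = sum(values[:window])
    means = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        means.append(running / window)
    return means


def hotspots(entropies, window, threshold):
    """Start indices of windows whose mean entropy exceeds `threshold`."""
    return [i for i, m in enumerate(sliding_mean(entropies, window))
            if m > threshold]


# Made-up entropy sequence with one uncertain stretch in the middle.
entropies = [0.1, 0.1, 1.5, 1.8, 0.1, 0.1]
print(hotspots(entropies, window=2, threshold=1.0))
```

Averaging over a window rather than thresholding single tokens keeps one isolated uncertain token from being flagged while still catching short runs of uncertainty.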

[10] InstructAttribute: Fine-grained Object Attributes editing with Instruction

Xingxi Yin, Jingfeng Zhang, Zhi Li, Yicheng Li, Yin Zhang

Main category: cs.CV

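TL;DR: SPAA gives precise, training-free control over object color and material by editing self-attention maps and cross-attention values. An attribute dataset built with multimodal large language models further improves fine-grained editing.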

Details

Motivation: Existing image editing techniques struggle to keep structure consistent when modifying object attributes and offer limited control over fine-grained attributes.
Method: Proposes SPAA, which edits self-attention maps and cross-attention values for precise color and material control, and builds an attribute dataset labeled automatically with multimodal large language models.
Result: Experiments show the method outperforms existing instruction-based image editing approaches on object-level color and material editing.
Conclusion: The method edits fine-grained attributes effectively while preserving image structure.
Abstract: Text-to-image (T2I) diffusion models, renowned for their advanced generative abilities, are extensively utilized in image editing applications, demonstrating remarkable effectiveness. However, achieving precise control over fine-grained attributes still presents considerable challenges. Existing image editing techniques either fail to modify the attributes of an object or struggle to preserve its structure and maintain consistency in other areas of the image. To address these challenges, we propose the Structure-Preserving and Attribute Amplification (SPAA), a training-free method which enables precise control over the color and material transformations of objects by editing the self-attention maps and cross-attention values. Furthermore, we constructed the Attribute Dataset, which encompasses nearly all colors and materials associated with various objects, by integrating multimodal large language models (MLLM) to develop an automated pipeline for data filtering and instruction labeling. Training on this dataset, we present our InstructAttribute, an instruction-based model designed to facilitate fine-grained editing of color and material attributes. Extensive experiments demonstrate that our method achieves superior performance in object-level color and material editing, outperforming existing instruction-based image editing approaches.

[11] DARTer: Dynamic Adaptive Representation Tracker for Nighttime UAV Tracking

Xuzhao Li, Xuchen Li, Shiyu Hu

Main category: cs.CV

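TL;DR: DARTer is an end-to-end framework for nighttime UAV tracking that improves tracking performance and efficiency through dynamic feature fusion and an adaptive activation mechanism.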

Details

Motivation: Nighttime UAV tracking degrades under illumination and viewpoint changes, and existing methods are either computationally expensive or fail to exploit dynamic features.
Method: Proposes the DARTer framework, with a Dynamic Feature Blender (DFB) and a Dynamic Feature Activator (DFA) that improve feature fusion and computational efficiency.
Result: Performs strongly on multiple nighttime UAV tracking benchmarks while balancing accuracy and efficiency.
Conclusion: DARTer is an effective solution for nighttime UAV tracking and is suited to real-world use.
Abstract: Nighttime UAV tracking presents significant challenges due to extreme illumination variations and viewpoint changes, which severely degrade tracking performance. Existing approaches either rely on light enhancers with high computational costs or introduce redundant domain adaptation mechanisms, failing to fully utilize the dynamic features in varying perspectives. To address these issues, we propose DARTer (Dynamic Adaptive Representation Tracker), an end-to-end tracking framework designed for nighttime UAV scenarios. DARTer leverages a Dynamic Feature Blender (DFB) to effectively fuse multi-perspective nighttime features from static and dynamic templates, enhancing representation robustness. Meanwhile, a Dynamic Feature Activator (DFA) adaptively activates Vision Transformer layers based on extracted features, significantly improving efficiency by reducing redundant computations. Our model eliminates the need for complex multi-task loss functions, enabling a streamlined training process. Extensive experiments on multiple nighttime UAV tracking benchmarks demonstrate the superiority of DARTer over state-of-the-art trackers. These results confirm that DARTer effectively balances tracking accuracy and efficiency, making it a promising solution for real-world nighttime UAV tracking applications.

[12] P2P-Insole: Human Pose Estimation Using Foot Pressure Distribution and Motion Sensors

Atsuya Watanabe, Ratna Aisuwarya, Lei Jing

Main category: cs.CV

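TL;DR: P2P-Insole is a low-cost method that estimates and visualizes 3D human skeletal data from insole sensors integrated with IMUs, and is suited to large-scale production.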

Details

Motivation: Existing commercial solutions are costly and intrusive; the goal is a lightweight, privacy-friendly alternative.
Method: Uses insole pressure distribution, acceleration, and rotation data with a Transformer model that extracts temporal features, and uses multimodal information to improve recognition accuracy.
Result: Experiments show the method is robust on complex motion pattern recognition and posture estimation tasks.
Conclusion: P2P-Insole offers a low-cost, practical option for rehabilitation, injury prevention, and health monitoring, and can be developed further through sensor optimization and expanded datasets.
Abstract: This work presents P2P-Insole, a low-cost approach for estimating and visualizing 3D human skeletal data using insole-type sensors integrated with IMUs. Each insole, fabricated with e-textile garment techniques, costs under USD 1, making it significantly cheaper than commercial alternatives and ideal for large-scale production. Our approach uses foot pressure distribution, acceleration, and rotation data to overcome limitations, providing a lightweight, minimally intrusive, and privacy-aware solution. The system employs a Transformer model for efficient temporal feature extraction, enriched by first and second derivatives in the input stream. Including multimodal information, such as accelerometers and rotational measurements, improves the accuracy of complex motion pattern recognition. These facts are demonstrated experimentally, while error metrics show the robustness of the approach in various posture estimation tasks. This work could be the foundation for a low-cost, practical application in rehabilitation, injury prevention, and health monitoring while enabling further development through sensor optimization and expanded datasets.
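The abstract mentions enriching the input stream with first and second derivatives. One common way to do that is with backward finite differences, sketched below. The differencing scheme, the zero-padding of the first frame, the sampling interval `dt`, and the function names are assumptions for illustration; the paper may compute its derivatives differently.

```python
def backward_difference(frames, dt):
    """Backward difference of a sequence of equal-length feature vectors.
    The first frame has no predecessor, so its derivative is set to zero."""
    derivative = [[0.0] * len(frames[0])]
    for previous, current in zip(frames, frames[1:]):
        derivative.append([(c - p) / dt for p, c in zip(previous, current)])
    return derivative


def augment_with_derivatives(frames, dt):
    """Concatenate each frame with its first and second time derivatives,
    tripling the feature dimension."""
    first = backward_difference(frames, dt)
    second = backward_difference(first, dt)
    return [x + v + a for x, v, a in zip(frames, first, second)]


# Made-up one-channel signal sampled at unit intervals.
signal = [[0.0], [1.0], [4.0], [9.0]]
print(augment_with_derivatives(signal, dt=1.0))
```

Supplying the derivatives explicitly gives a sequence model direct access to rates of change, which it would otherwise have to infer from neighbouring frames.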

[13] Efficient On-Chip Implementation of 4D Radar-Based 3D Object Detection on Hailo-8L

Woong-Chan Byun, Dong-Hee Paek, Seung-Hyun Song, Seung-Hyun Kong

Main category: cs.CV

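TL;DR: The paper presents an on-chip implementation of 4D radar-based 3D object detection on the Hailo-8L AI accelerator, using a tensor transformation to reconcile 5D inputs with the chip's 4D tensor support and achieving real-time processing with high accuracy.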

Details

Motivation: 4D radar is promising for autonomous driving, but it must run in real time in low-power embedded environments.
Method: Introduces a tensor transformation that reshapes 5D inputs into a 4D format compatible with the Hailo-8L accelerator.
Result: The system achieves 46.47% AP_3D and 52.75% AP_BEV at an inference speed of 13.76 Hz, with accuracy comparable to GPU-based models.
Conclusion: The results demonstrate that 4D radar perception is practical for autonomous driving systems.
Abstract: 4D radar has attracted attention in autonomous driving due to its ability to enable robust 3D object detection even under adverse weather conditions. To practically deploy such technologies, it is essential to achieve real-time processing within low-power embedded environments. Addressing this, we present the first on-chip implementation of a 4D radar-based 3D object detection model on the Hailo-8L AI accelerator. Although conventional 3D convolutional neural network (CNN) architectures require 5D inputs, the Hailo-8L only supports 4D tensors, posing a significant challenge. To overcome this limitation, we introduce a tensor transformation method that reshapes 5D inputs into 4D formats during the compilation process, enabling direct deployment without altering the model structure. The proposed system achieves 46.47% AP_3D and 52.75% AP_BEV, maintaining comparable accuracy to GPU-based models while achieving an inference speed of 13.76 Hz. These results demonstrate the applicability of 4D radar-based perception technologies to autonomous driving systems.
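The abstract does not say which axes the tensor transformation merges, so the following is only one plausible reshape, shown to make the idea concrete: folding the depth axis of an (N, C, D, H, W) tensor into the channel axis to obtain (N, C*D, H, W). Plain nested lists are used to keep the sketch dependency-free; all function names are hypothetical.

```python
def shape_of(tensor):
    """Shape of a non-empty, rectangular nested list."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]
    return tuple(shape)


def fold_depth_into_channels(x):
    """(N, C, D, H, W) -> (N, C*D, H, W).
    Plane (c, d) lands at merged channel index c * D + d."""
    return [[plane for channel in sample for plane in channel]
            for sample in x]


def unfold_channels_into_depth(y, depth):
    """Inverse reshape: (N, C*D, H, W) -> (N, C, D, H, W)."""
    return [[sample[i:i + depth] for i in range(0, len(sample), depth)]
            for sample in y]


# Made-up 5D tensor with N=1, C=2, D=3, H=2, W=2 and distinct values.
x = [[[[[c * 100 + d * 10 + h * 2 + w for w in range(2)]
        for h in range(2)]
       for d in range(3)]
      for c in range(2)]]

y = fold_depth_into_channels(x)
print(shape_of(x), "->", shape_of(y))
```

Because the mapping is a pure re-indexing, no values change and the original layout can be recovered exactly, which is consistent with the claim that the model structure itself is not altered.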

Jiahui Chen, Candace Ross, Reyhane Askari-Hemmat, Koustuv Sinha, Melissa Hall, Michal Drozdzal, Adriana Romero-Soriano

Main category: cs.CV

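TL;DR: MT2IE is an evaluation framework built on multimodal large language models for assessing text-to-image generative models; the prompts it generates are more efficient and its scores correlate better with human judgment.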

Details

Motivation: As text-to-image generative models improve, evaluation on static datasets is becoming obsolete, so new evaluation approaches are needed.
Method: Uses multimodal large language models (MLLMs) as evaluator agents that iteratively generate prompts and score images, and compares the outcome with existing benchmarks.
Result: MT2IE reproduces the model rankings of existing benchmarks with only 1/80th of the prompts, and its scores correlate more strongly with human judgment.
Conclusion: MT2IE offers an efficient and reliable way to evaluate text-to-image models and compares favorably with static benchmarks.
Abstract: The steady improvements of text-to-image (T2I) generative models lead to slow deprecation of automatic evaluation benchmarks that rely on static datasets, motivating researchers to seek alternative ways to evaluate the T2I progress. In this paper, we explore the potential of multi-modal large language models (MLLMs) as evaluator agents that interact with a T2I model, with the objective of assessing prompt-generation consistency and image aesthetics. We present Multimodal Text-to-Image Eval (MT2IE), an evaluation framework that iteratively generates prompts for evaluation, scores generated images and matches T2I evaluation of existing benchmarks with a fraction of the prompts used in existing static benchmarks. Moreover, we show that MT2IE's prompt-generation consistency scores have higher correlation with human judgment than scores previously introduced in the literature. MT2IE generates prompts that are efficient at probing T2I model performance, producing the same relative T2I model rankings as existing benchmarks while using only 1/80th the number of prompts for evaluation.

[15] Person detection and re-identification in open-world settings of retail stores and public spaces

Branko Brkljač, Milan Brkljač

Main category: cs.CV

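TL;DR: The paper examines practical computer vision applications in smart cities, focusing on person re-identification in open-world settings, and covers system design challenges, possible solutions, and applications in retail stores and public spaces.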

Details

Motivation: Person re-identification in open-world settings is complex and raises system design challenges, particularly for video surveillance and multi-camera setups.
Method: Reviews existing computer vision techniques and tests one close-to-real-time person re-identification solution on several recorded videos and live camera feeds.
Result: Demonstrates the solution's performance experimentally and analyzes its sensitivity across open-world environments.
Conclusion: Indicates further research directions and possible system improvements for practical person re-identification.
Abstract: Practical applications of computer vision in smart cities usually assume system integration and operation in challenging open-world environments. In the case of the person re-identification task, the main goal is to determine whether a specific person has appeared in another place at a different time instance of the same video, or over multiple camera feeds. This typically assumes collecting raw data from video surveillance cameras in different places and under varying illumination conditions. In the considered open-world setting it also requires detection and localization of the person inside the analyzed video frame before the main re-identification step. With multi-person and multi-camera setups the system complexity becomes higher, requiring sophisticated tracking solutions and re-identification models. In this work we will discuss existing challenges in system design architectures, consider possible solutions based on different computer vision techniques, and describe applications of such systems in retail stores and public spaces for improved marketing analytics. In order to analyse the sensitivity of the person re-identification task under different open-world environments, the performance of one close-to-real-time solution will be demonstrated over several video captures and live camera feeds. Finally, based on the conducted experiments we will indicate further research directions and possible system improvements.

[16] AI-ready Snow Radar Echogram Dataset (SRED) for climate change monitoring

Oluwanisola Ibikunle, Hara Talasila, Debvrat Varshney, Jilu Li, John Paden, Maryam Rahnemoonfar

Main category: cs.CV

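TL;DR: The study introduces the first deep-learning-ready radar echogram dataset for improving ice layer tracking and evaluates five deep learning models on it.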

Details

Motivation: Accurately tracking ice layers in radar echograms is essential for understanding ice sheet dynamics and the effects of global warming, but the lack of a standardized dataset has held back algorithm development.
Method: Builds a dataset of 13,717 labeled and 57,815 weakly-labeled echograms from Snow Radar data collected during the NASA OIB mission, and tests five deep learning models on it.
Result: Current computer vision segmentation algorithms can identify ice layer pixels, but more advanced end-to-end models are needed to extract snow depth and annual accumulation directly.
Conclusion: The dataset and benchmarking framework are a valuable resource for ice layer tracking and snow accumulation estimation, and help in understanding how polar ice sheets respond to climate warming.
Abstract: Tracking internal layers in radar echograms with high accuracy is essential for understanding ice sheet dynamics and quantifying the impact of accelerated ice discharge in Greenland and other polar regions due to contemporary global climate warming. Deep learning algorithms have become the leading approach for automating this task, but the absence of a standardized and well-annotated echogram dataset has hindered the ability to test and compare algorithms reliably, limiting the advancement of state-of-the-art methods for the radar echogram layer tracking problem. This study introduces the first comprehensive "deep learning ready" radar echogram dataset derived from Snow Radar airborne data collected during the National Aeronautics and Space Administration Operation Ice Bridge (OIB) mission in 2012. The dataset contains 13,717 labeled and 57,815 weakly-labeled echograms covering diverse snow zones (dry, ablation, wet) with varying along-track resolutions. To demonstrate its utility, we evaluated the performance of five deep learning models on the dataset. Our results show that while current computer vision segmentation algorithms can identify and track snow layer pixels in echogram images, advanced end-to-end models are needed to directly extract snow depth and annual accumulation from echograms, reducing or eliminating post-processing. The dataset and accompanying benchmarking framework provide a valuable resource for advancing radar echogram layer tracking and snow accumulation estimation, advancing our understanding of polar ice sheets' response to climate warming.

[17] SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

Wufei Ma, Luoxin Ye, Nessa McWeeney, Celso M de Melo, Alan Yuille, Jieneng Chen

Main category: cs.CV

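TL;DR: The paper proposes SpatialLLM, a multimodal model with advanced 3D spatial reasoning that, through 3D-informed data and architecture design, surpasses GPT-4o by 8.7%.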

Details

Motivation: Current multimodal models lack 3D spatial reasoning, mainly because 3D training data is scarce and model designs are biased toward 2D data.
Method: Develops two kinds of 3D-informed training data (probing data and conversation data) and integrates them systematically with the model architecture and training design.
Result: SpatialLLM shows markedly better 3D reasoning and surpasses GPT-4o by 8.7%.
Conclusion: The study provides a systematic approach and useful insights for designing future 3D reasoning models.
Abstract: Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack this capability of 3D spatial reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, introducing SpatialLLM, a large multi-modal model with advanced 3D spatial reasoning abilities. To address data limitations, we develop two types of 3D-informed training datasets: (1) 3D-informed probing data focused on objects' 3D location and orientation, and (2) 3D-informed conversation data for complex spatial relationships. Notably, we are the first to curate VQA data that incorporate 3D orientation relationships on real images. Furthermore, we systematically integrate these two types of training data with the architectural and training designs of LMMs, providing a roadmap for optimal design aimed at achieving superior 3D reasoning capabilities. Our SpatialLLM advances machines toward highly capable 3D-informed reasoning, surpassing GPT-4o performance by 8.7%. Our systematic empirical design and the resulting findings offer valuable insights for future research in this direction.

[18] Advancing Wheat Crop Analysis: A Survey of Deep Learning Approaches Using Hyperspectral Imaging

Fadi Abdeladhim Zidi, Abdelkrim Ouafi, Fares Bougourzi, Cosimo Distante, Abdelmalik Taleb-Ahmed

Main category: cs.CV

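TL;DR: This survey reviews deep learning for hyperspectral imaging (HSI) based wheat crop analysis, summarizing datasets, methodological advances, and key applications, and pointing out future opportunities.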

Details

Motivation: Wheat production faces pests, diseases, climate change, and other pressures, and traditional monitoring is inefficient; HSI combined with deep learning could help.
Method: Surveys the literature to summarize HSI datasets and progress in deep learning methods, and analyzes applications in variety classification, disease detection, and yield estimation.
Result: The survey fills a gap in the field, lists recent work, and outlines future research directions.
Conclusion: Deep learning has strong potential for HSI-based wheat analysis, but the high dimensionality of the data and the shortage of labeled samples remain open problems.
Abstract: As one of the most widely cultivated and consumed crops, wheat is essential to global food security. However, wheat production is increasingly challenged by pests, diseases, climate change, and water scarcity, threatening yields. Traditional crop monitoring methods are labor-intensive and often ineffective for early issue detection. Hyperspectral imaging (HSI) has emerged as a non-destructive and efficient technology for remote crop health assessment. However, the high dimensionality of HSI data and limited availability of labeled samples present notable challenges. In recent years, deep learning has shown great promise in addressing these challenges due to its ability to extract and analyze complex structures. Despite advancements in applying deep learning methods to HSI data for wheat crop analysis, no comprehensive survey currently exists in this field. This review addresses this gap by summarizing benchmark datasets, tracking advancements in deep learning methods, and analyzing key applications such as variety classification, disease detection, and yield estimation. It also highlights the strengths, limitations, and future opportunities in leveraging deep learning methods for HSI-based wheat crop analysis. We have listed the current state-of-the-art papers and will continue tracking and updating them at https://github.com/fadi-07/Awesome-Wheat-HSI-DeepLearning.

[19] The Comparability of Model Fusion to Measured Data in Confuser Rejection

Conor Flynn,Christopher Ebersole,Edmund Zelnio

Main category: cs.CV

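TL;DR: The paper proposes ensembling multiple models trained on synthetic data to compensate for the shortage of synthetic aperture radar (SAR) data, and introduces confuser rejection to handle unknown targets.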

Details

Motivation: Collecting synthetic aperture radar (SAR) data is costly, and synthetic data does not fully match measured data, which limits how well models can be trained. Method: Uses computational power to ensemble multiple models trained on synthetic data, and applies confuser rejection to handle unknown targets. Result: Ensembling combined with confuser rejection improves the generalization of models trained on synthetic data. Conclusion: The approach offers a viable solution to the shortage of SAR data while making models more robust to unknown targets. Abstract: Data collection has always been a major issue in the modeling and training of large deep learning networks, as no dataset can account for every slight deviation we might see in live usage. Collecting samples can be especially costly for Synthetic Aperture Radar (SAR), limiting the number of unique targets and operating conditions we are able to observe. To counter this lack of data, simulators have been developed utilizing the shooting and bouncing ray method to allow for the generation of synthetic SAR data on 3D models. While effective, the synthetically generated data does not perfectly correlate to the measured data, leading to issues when training models solely on synthetic data. We aim to use computational power as a substitution for this lack of quality measured data, by ensembling many models trained on synthetic data. Synthetic data is also not complete, as we do not know what targets might be present in a live environment. Therefore, we need to have our ensembling techniques account for these unknown targets by applying confuser rejection, in which our models will reject unknown targets they are presented with, and only classify those they have been trained on.
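
The abstract does not say how the ensemble is fused or how rejection is decided. As a minimal sketch of the general idea, assuming the members' class probabilities are simply averaged and a target is rejected when the top averaged probability falls below a threshold (both the function name `ensemble_predict` and the 0.6 threshold are hypothetical):

```python
from typing import List, Optional


def ensemble_predict(
    member_probs: List[List[float]], reject_threshold: float = 0.6
) -> Optional[int]:
    """Fuse per-model class probabilities and optionally reject.

    member_probs holds one probability vector per ensemble member.
    Returns the winning class index, or None when the input is
    treated as a confuser (unknown target). Averaging and a
    max-probability threshold are assumptions made for illustration.
    """
    n_models = len(member_probs)
    n_classes = len(member_probs[0])
    # Average the probability assigned to each class across all members.
    mean = [sum(p[c] for p in member_probs) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: mean[c])
    # Reject when even the most likely class is not confident enough.
    return best if mean[best] >= reject_threshold else None
```

When the members agree confidently, the shared class is returned; when they disagree or are all uncertain, the averaged maximum drops and the sample is rejected rather than forced into a known class.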

[20] Are Minimal Radial Distortion Solvers Really Necessary for Relative Pose Estimation?

Viktor Kocur,Charalambos Tzamos,Yaqing Ding,Zuzana Berger Haladova,Torsten Sattler,Zuzana Kukelova

Main category: cs.CV

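TL;DR: The paper compares two simple-to-implement approaches with conventional minimal radial distortion solvers and finds that the complex solvers are not necessary in practice.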

Details

Motivation: Account for the effect of camera radial distortion on relative pose estimation while avoiding the high cost and implementation difficulty of complex solvers. Method: (1) Combine an efficient pinhole solver with sampled radial undistortion parameters; (2) use a neural network to estimate the distortion parameters instead of sampling them. Result: Experiments show that complex minimal radial distortion solvers are not necessary in practice and that the simple approaches suffice. Conclusion: The paper discusses the conditions under which sampling radial undistortion parameters is preferable to a learning-based approach; the complex solvers can be replaced. Abstract: Estimating the relative pose between two cameras is a fundamental step in many applications such as Structure-from-Motion. The common approach to relative pose estimation is to apply a minimal solver inside a RANSAC loop. Highly efficient solvers exist for pinhole cameras. Yet, (nearly) all cameras exhibit radial distortion. Not modeling radial distortion leads to (significantly) worse results. However, minimal radial distortion solvers are significantly more complex than pinhole solvers, both in terms of run-time and implementation effort. This paper compares radial distortion solvers with two simple-to-implement approaches that do not use minimal radial distortion solvers: The first approach combines an efficient pinhole solver with sampled radial undistortion parameters, where the sampled parameters are used for undistortion prior to applying the pinhole solver. The second approach uses a state-of-the-art neural network to estimate the distortion parameters rather than sampling them from a set of potential values. Extensive experiments on multiple datasets, and different camera setups, show that complex minimal radial distortion solvers are not necessary in practice. We discuss under which conditions a simple sampling of radial undistortion parameters is preferable over calibrating cameras using a learning-based prior approach. Code and a newly created benchmark for relative pose estimation under radial distortion are available at https://github.com/kocurvik/rdnet.
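
The first approach (undistort with a sampled parameter, then run a pinhole solver) can be sketched as follows. This is only an illustration: it assumes the one-parameter division model, which the abstract does not name, and `score` is a hypothetical callback standing in for "run the pinhole solver in RANSAC and return the inlier count".

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def undistort(p: Point, lam: float) -> Point:
    """One-parameter division model: x_u = x_d / (1 + lam * r_d^2).

    Coordinates are assumed to be centred on the distortion centre.
    The denominator can vanish for negative lam and large radii;
    that case is not handled in this sketch.
    """
    x, y = p
    scale = 1.0 + lam * (x * x + y * y)
    return (x / scale, y / scale)


def best_sampled_lambda(
    points: Sequence[Point],
    lambdas: Sequence[float],
    score: Callable[[List[Point]], float],
) -> Optional[float]:
    """Try each sampled lam and keep the one whose undistorted points score best."""
    best_lam, best_score = None, float("-inf")
    for lam in lambdas:
        undistorted = [undistort(p, lam) for p in points]
        s = score(undistorted)  # hypothetical: e.g. pinhole-solver inlier count
        if s > best_score:
            best_lam, best_score = lam, s
    return best_lam
```

The appeal of this design is that the distortion search is a plain loop around an unmodified pinhole solver, so no distortion-specific minimal solver has to be implemented.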

[21] CDFormer: Cross-Domain Few-Shot Object Detection Transformer Against Feature Confusion

Boyuan Meng,Xiaohan Zhang,Peilin Li,Zhe Wu,Yiming Li,Wenkai Zhao,Beinan Yu,Hui-Liang Shen

Main category: cs.CV

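TL;DR: CDFormer is a Transformer-based method for cross-domain few-shot object detection (CD-FSOD) that tackles feature confusion with its OBD and OOD modules and substantially improves performance.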

Details

Motivation: In cross-domain few-shot object detection, feature confusion (object-background confusion and object-object confusion) is a major challenge. Method: Proposes CDFormer, which contains an OBD module (distinguishing objects from background) and an OOD module (distinguishing objects of different classes). Result: Experiments show that CDFormer improves mAP by 12.9%, 11.0%, and 10.4% under the 1/5/10 shot settings, respectively. Conclusion: CDFormer effectively addresses feature confusion and substantially improves cross-domain few-shot object detection. Abstract: Cross-domain few-shot object detection (CD-FSOD) aims to detect novel objects across different domains with limited class instances. Feature confusion, including object-background confusion and object-object confusion, presents significant challenges in both cross-domain and few-shot settings. In this work, we introduce CDFormer, a cross-domain few-shot object detection transformer against feature confusion, to address these challenges. The method specifically tackles feature confusion through two key modules: object-background distinguishing (OBD) and object-object distinguishing (OOD). The OBD module leverages a learnable background token to differentiate between objects and background, while the OOD module enhances the distinction between objects of different classes. Experimental results demonstrate that CDFormer outperforms previous state-of-the-art approaches, achieving 12.9% mAP, 11.0% mAP, and 10.4% mAP improvements under the 1/5/10 shot settings, respectively, when fine-tuned.

[22] Generating Animated Layouts as Structured Text Representations

Yeonsang Shin,Jihwan Kim,Yumin Song,Kyungseung Lee,Hyunhee Chung,Taeyoung Na

Main category: cs.CV

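TL;DR: The paper proposes Animated Layout Generation, a new approach that achieves fine-grained video control through a structured text representation, and presents the VAKER pipeline, which significantly outperforms existing video advertisement generation methods.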

Details

Motivation: Existing text-to-video models fall short in controlling text elements and animated graphics, especially in video advertisement applications. Method: Proposes Animated Layout Generation, which combines a structured text representation with a three-stage generation process, and builds the VAKER pipeline. Result: VAKER significantly outperforms existing methods in video advertisement generation. Conclusion: The approach provides an effective solution for controlling animated graphics and text, and automates video advertisement generation. Abstract: Despite the remarkable progress in text-to-video models, achieving precise control over text elements and animated graphics remains a significant challenge, especially in applications such as video advertisements. To address this limitation, we introduce Animated Layout Generation, a novel approach to extend static graphic layouts with temporal dynamics. We propose a Structured Text Representation for fine-grained video control through hierarchical visual elements. To demonstrate the effectiveness of our approach, we present VAKER (Video Ad maKER), a text-to-video advertisement generation pipeline that combines a three-stage generation process with Unstructured Text Reasoning for seamless integration with LLMs. VAKER fully automates video advertisement generation by incorporating dynamic layout trajectories for objects and graphics across specific video frames. Through extensive evaluations, we demonstrate that VAKER significantly outperforms existing methods in generating video advertisements. Project Page: https://yeonsangshin.github.io/projects/Vaker

[23] LMDepth: Lightweight Mamba-based Monocular Depth Estimation for Real-World Deployment

Jiahuan Long,Xin Zhou

Main category: cs.CV

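TL;DR: LMDepth is a lightweight Mamba-based monocular depth estimation network that uses a modified pyramid spatial pooling module and multiple depth Mamba blocks to reconstruct high-precision depth while keeping computational overhead low.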

Details

Motivation: Existing depth estimation algorithms struggle to balance performance and computational efficiency, which limits deployment on resource-constrained devices. Method: Proposes a modified pyramid spatial pooling module that serves as a multi-scale feature aggregator and context extractor, and integrates multiple depth Mamba blocks into the decoder to decode depth information efficiently with linear computations. Result: On the NYUDv2 and KITTI datasets, LMDepth outperforms existing lightweight methods with fewer parameters and lower computational complexity (GFLOPs), and its practicality is validated on an embedded platform. Conclusion: LMDepth offers an efficient and practical monocular depth estimation solution for resource-constrained devices. Abstract: Monocular depth estimation provides an additional depth dimension to RGB images, making it widely applicable in various fields such as virtual reality, autonomous driving and robotic navigation. However, existing depth estimation algorithms often struggle to effectively balance performance and computational efficiency, which poses challenges for deployment on resource-constrained devices. To address this, we propose LMDepth, a lightweight Mamba-based monocular depth estimation network, designed to reconstruct high-precision depth information while maintaining low computational overhead. Specifically, we propose a modified pyramid spatial pooling module that serves as a multi-scale feature aggregator and context extractor, ensuring global spatial information for accurate depth estimation. Moreover, we integrate multiple depth Mamba blocks into the decoder. Designed with linear computations, the Mamba blocks enable LMDepth to efficiently decode depth information from global features, providing a lightweight alternative to Transformer-based architectures that depend on complex attention mechanisms. Extensive experiments on the NYUDv2 and KITTI datasets demonstrate the effectiveness of our proposed LMDepth. Compared to previous lightweight depth estimation methods, LMDepth achieves higher performance with fewer parameters and lower computational complexity (measured by GFLOPs). We further deploy LMDepth on an embedded platform with INT8 quantization, validating its practicality for real-world edge applications.

[24] Deterministic-to-Stochastic Diverse Latent Feature Mapping for Human Motion Synthesis

Yu Hua,Weiming Liu,Gui Xu,Yaqing Hou,Yew-Soon Ong,Qiang Zhang

Main category: cs.CV

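TL;DR: Proposes Deterministic-to-Stochastic Diverse Latent Feature Mapping (DSDFM) for human motion synthesis, which addresses the unstable training of score-based generative models (SGMs) and improves the diversity and accuracy of generated motions.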

Details

Motivation: Score-based generative models (SGMs) perform well on human motion synthesis, but their training is complex and unstable, so a more stable method that also improves diversity is needed. Method: DSDFM has two stages: (1) a human motion reconstruction stage that learns the latent space distribution of motions; (2) a diverse motion generation stage that connects the Gaussian distribution to the latent space distribution through deterministic feature mapping (DerODE) and stochastic diverse output generation (DivSDE). Result: DSDFM surpasses existing methods in training stability and generation diversity, and experiments confirm state-of-the-art performance. Conclusion: DSDFM is an efficient and stable method that markedly improves the diversity and accuracy of human motion synthesis. Abstract: Human motion synthesis aims to generate plausible human motion sequences, which has attracted widespread attention in computer animation. Recent score-based generative models (SGMs) have demonstrated impressive results on this task. However, their training process involves complex curvature trajectories, leading to an unstable training process. In this paper, we propose a Deterministic-to-Stochastic Diverse Latent Feature Mapping (DSDFM) method for human motion synthesis. DSDFM consists of two stages. The first human motion reconstruction stage aims to learn the latent space distribution of human motions. The second diverse motion generation stage aims to build connections between the Gaussian distribution and the latent space distribution of human motions, thereby enhancing the diversity and accuracy of the generated human motions. This stage is achieved by the designed deterministic feature mapping procedure with DerODE and stochastic diverse output generation procedure with DivSDE. DSDFM is easy to train compared to previous SGMs-based methods and can enhance diversity without introducing additional training parameters. Through qualitative and quantitative experiments, DSDFM achieves state-of-the-art results surpassing the latest methods, validating its superiority in human motion synthesis.

[25] 3D Human Pose Estimation via Spatial Graph Order Attention and Temporal Body Aware Transformer

Kamel Aouaidjia,Aofan Li,Wenhao Zhang,Chongsheng Zhang

Main category: cs.CV

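TL;DR: Proposes a new method combining GCNs and Transformers that represents each skeleton with multi-order graphs and a Graph Order Attention module, and models global and local feature dependencies with an improved temporal Transformer.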

Details

Motivation: In 3D human pose estimation, existing Transformer-based methods ignore either spatial neighborhood relationships or local temporal patterns, while GCN-based methods lack pose-specific representations. Method: Represents the skeleton with GCN-based multi-order graphs, uses a Graph Order Attention module to dynamically emphasize the most representative orders, and designs a temporal Body Aware Transformer to model global and local feature dependencies. Result: The method's effectiveness is validated on the Human3.6m, MPI-INF-3DHP, and HumanEva-I datasets. Conclusion: By combining the strengths of GCNs and Transformers, the new method markedly improves 3D human pose estimation. Abstract: Nowadays, Transformers and Graph Convolutional Networks (GCNs) are the prevailing techniques for 3D human pose estimation. However, Transformer-based methods either ignore the spatial neighborhood relationships between the joints when used for skeleton representations or disregard the local temporal patterns of the local joint movements in skeleton sequence modeling, while GCN-based methods often neglect the need for pose-specific representations. To address these problems, we propose a new method that exploits the graph modeling capability of GCN to represent each skeleton with multiple graphs of different orders, incorporated with a newly introduced Graph Order Attention module that dynamically emphasizes the most representative orders for each joint. The resulting spatial features of the sequence are further processed using a proposed temporal Body Aware Transformer that models the global body feature dependencies in the sequence with awareness of the local inter-skeleton feature dependencies of joints. Given that our 3D pose output aligns with the central 2D pose in the sequence, we improve the self-attention mechanism to be aware of the central pose while diminishing its focus gradually towards the first and the last poses. Extensive experiments on Human3.6m, MPI-INF-3DHP, and HumanEva-I datasets demonstrate the effectiveness of the proposed method. Code and models are made available on GitHub.

[26] Fine-Tuning Without Forgetting: Adaptation of YOLOv8 Preserves COCO Performance

Vishal Gandhi,Sagar Gandhi

Main category: cs.CV

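TL;DR: The study shows that deeper fine-tuning of the mid-to-late backbone layers of a pre-trained object detector (YOLOv8n) substantially improves performance on a fine-grained task (fruit detection) without noticeably degrading the original general capability (COCO benchmark).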

Details

Motivation: Investigate how well pre-trained models adapt to fine-grained tasks, in particular how fine-tuning depth affects performance, so as to avoid catastrophic forgetting. Method: Fine-tunes YOLOv8n by progressively unfreezing backbone layers (freeze points at layers 22, 15, and 10) and evaluates on a fine-grained fruit dataset and the COCO validation set. Result: Deeper fine-tuning (unfreezing down to layer 10) gives substantial gains on the fruit task (+10% mAP50) with minimal impact on the COCO benchmark (<0.1% mAP difference). Conclusion: Fine-tuning mid-to-late backbone layers is effective and safe for fine-grained tasks, without the expected catastrophic forgetting, and suits complex domains or settings where specialized performance comes first. Abstract: The success of large pre-trained object detectors hinges on their adaptability to diverse downstream tasks. While fine-tuning is the standard adaptation method, specializing these models for challenging fine-grained domains necessitates careful consideration of feature granularity. The critical question remains: how deeply should the pre-trained backbone be fine-tuned to optimize for the specialized task without incurring catastrophic forgetting of the original general capabilities? Addressing this, we present a systematic empirical study evaluating the impact of fine-tuning depth. We adapt a standard YOLOv8n model to a custom, fine-grained fruit detection dataset by progressively unfreezing backbone layers (freeze points at layers 22, 15, and 10) and training. Performance was rigorously evaluated on both the target fruit dataset and, using a dual-head evaluation architecture, on the original COCO validation set. Our results demonstrate unequivocally that deeper fine-tuning (unfreezing down to layer 10) yields substantial performance gains (e.g., +10% absolute mAP50) on the fine-grained fruit task compared to only training the head. Strikingly, this significant adaptation and specialization resulted in negligible performance degradation (<0.1% absolute mAP difference) on the COCO benchmark across all tested freeze levels. We conclude that adapting mid-to-late backbone features is highly effective for fine-grained specialization. Critically, our results demonstrate this adaptation can be achieved without the commonly expected penalty of catastrophic forgetting, presenting a compelling case for exploring deeper fine-tuning strategies, particularly when targeting complex domains or when maximizing specialized performance is paramount.
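
To make the notion of a freeze point concrete, here is a minimal sketch. It assumes parameters are named `model.<layer_index>.<rest>` and that every layer with an index below the freeze point stays frozen; both the naming pattern and the helper `is_trainable` are illustrative assumptions, not taken from the paper's code.

```python
def is_trainable(param_name: str, freeze_point: int) -> bool:
    """Return True if the parameter's layer index is at or above the freeze point.

    Assumes names of the hypothetical form 'model.<layer_index>.<rest>'.
    """
    layer_index = int(param_name.split(".")[1])
    return layer_index >= freeze_point


# Count how many of 23 hypothetical layers (indices 0..22) train at each freeze point.
names = [f"model.{i}.weight" for i in range(23)]
for freeze_point in (22, 15, 10):
    n_trainable = sum(is_trainable(n, freeze_point) for n in names)
    print(f"freeze point {freeze_point}: {n_trainable} trainable layers")
```

Under these assumptions, lowering the freeze point from 22 to 10 moves from training only the last layer to also training the layers with indices 10 through 21, which is the "deeper fine-tuning" the study compares.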

[27] Edge-preserving Image Denoising via Multi-scale Adaptive Statistical Independence Testing

Ruyu Yan,Da-Qing Zhang

Main category: cs.CV

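TL;DR: Proposes EDD-MAIT, an edge detection and denoising method based on multi-scale adaptive independence testing, which dynamically adjusts window sizes and incorporates a channel attention mechanism to improve the robustness, accuracy, and efficiency of edge detection.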

Details

Motivation: Existing edge detection methods produce overly detailed edge maps that hurt clarity, and fixed-window statistical testing suffers from scale mismatch and computational redundancy. Method: Combines a channel attention mechanism with independence testing and uses a gradient-driven adaptive window strategy to adjust window sizes dynamically. Result: Outperforms traditional and learning-based methods on the BSDS500 and BIPED datasets, with improvements in F-score, MSE, and PSNR and reduced runtime; it is also robust to Gaussian noise. Conclusion: EDD-MAIT produces accurate and clean edge maps, works in noisy environments, and has high practical value. Abstract: Edge detection is crucial in image processing, but existing methods often produce overly detailed edge maps, affecting clarity. Fixed-window statistical testing faces issues like scale mismatch and computational redundancy. To address these, we propose Multi-scale Adaptive Independence Testing-based Edge Detection and Denoising (EDD-MAIT), a novel method that integrates a channel attention mechanism with independence testing. A gradient-driven adaptive window strategy adjusts window sizes dynamically, improving detail preservation and noise suppression. EDD-MAIT achieves better robustness, accuracy, and efficiency, outperforming traditional and learning-based methods on BSDS500 and BIPED datasets, with improvements in F-score, MSE, PSNR, and reduced runtime. It also shows robustness against Gaussian noise, generating accurate and clean edge maps in noisy environments.

[28] Edge Detection based on Channel Attention and Inter-region Independence Test

Ru-yu Yan,Da-Qing Zhang

Main category: cs.CV

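TL;DR: CAM-EDIT is an edge detection framework that combines a channel attention mechanism with independence testing, markedly improving noise robustness and edge detection accuracy.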

Details

Motivation: Existing edge detection methods amplify noise and retain too many non-salient details, which limits their use in high-precision industrial scenarios. Method: Proposes the CAM-EDIT framework, which integrates a Channel Attention Mechanism (CAM) with Edge Detection via Independence Testing (EDIT), suppressing noise through multi-channel fusion and statistical independence analysis. Result: Performs strongly on the BSDS500 and NYUDv2 datasets, with F-measure scores of 0.635 and 0.460, better than traditional and recent learning-based methods, and a 2.2% PSNR improvement in noise robustness. Conclusion: CAM-EDIT shows potential for high-precision industrial applications, producing cleaner edge maps with fewer artifacts. Abstract: Existing edge detection methods often suffer from noise amplification and excessive retention of non-salient details, limiting their applicability in high-precision industrial scenarios. To address these challenges, we propose CAM-EDIT, a novel framework that integrates Channel Attention Mechanism (CAM) and Edge Detection via Independence Testing (EDIT). The CAM module adaptively enhances discriminative edge features through multi-channel fusion, while the EDIT module employs region-wise statistical independence analysis (using Fisher's exact test and chi-square test) to suppress uncorrelated noise. Extensive experiments on BSDS500 and NYUDv2 datasets demonstrate state-of-the-art performance. Among the nine comparison algorithms, the F-measure scores of CAM-EDIT are 0.635 and 0.460, representing improvements of 19.2% to 26.5% over traditional methods (Canny, CannySR), and better than the latest learning-based methods (TIP2020, MSCNGP). Noise robustness evaluations further reveal a 2.2% PSNR improvement under Gaussian noise compared to baseline methods. Qualitative results exhibit cleaner edge maps with reduced artifacts, demonstrating its potential for high-precision industrial applications.
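
The abstract mentions a chi-square test for region-wise independence but gives no detail on how the contingency table is built. The sketch below only shows the Pearson chi-square statistic itself; reading the table as "region (rows) versus binarized pixel intensity (columns)" on either side of a candidate edge is an assumption made for illustration.

```python
from typing import List


def chi_square_statistic(table: List[List[float]]) -> float:
    """Pearson chi-square statistic for a contingency table of counts.

    Expected counts come from the row and column totals under the
    independence hypothesis. Raises ValueError if any expected count
    is zero, where the statistic is undefined.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                raise ValueError("expected count is zero; statistic undefined")
            stat += (observed - expected) ** 2 / expected
    return stat
```

For a 2x2 table there is one degree of freedom, so a statistic above roughly 3.84 rejects independence at the 5% level. Under the assumed reading, a large value means intensity depends on the region, which is what one would expect across a true edge and not across uncorrelated noise.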

[29] Transferable Adversarial Attacks on Black-Box Vision-Language Models

Kai Hu,Weichen Yu,Li Zhang,Alexander Robey,Andy Zou,Chengming Xu,Haoqi Hu,Matt Fredrikson

Main category: cs.CV

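TL;DR: The study finds that adversarial attacks crafted against open-source models transfer to proprietary vision large language models (VLLMs), causing them to misinterpret visual information.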

Details

Motivation: Explore the vulnerability of VLLMs to adversarial attacks, especially with multimodal (text and image) inputs. Method: Generates targeted adversarial examples and universal perturbations, and tests their transferability and effectiveness on proprietary VLLMs such as GPT-4o, Claude, and Gemini. Result: Experiments show that adversarial attacks can induce models to misinterpret visual information, and that universal perturbations are effective across multiple models. Conclusion: Current VLLMs are broadly vulnerable to adversarial attacks, and robust defenses are urgently needed for safe deployment. Abstract: Vision Large Language Models (VLLMs) are increasingly deployed to offer advanced capabilities on inputs comprising both text and images. While prior research has shown that adversarial attacks can transfer from open-source to proprietary black-box models in text-only and vision-only contexts, the extent and effectiveness of such vulnerabilities remain underexplored for VLLMs. We present a comprehensive analysis demonstrating that targeted adversarial examples are highly transferable to widely used proprietary VLLMs such as GPT-4o, Claude, and Gemini. We show that attackers can craft perturbations to induce specific attacker-chosen interpretations of visual information, such as misinterpreting hazardous content as safe, overlooking sensitive or restricted material, or generating detailed incorrect responses aligned with the attacker's intent. Furthermore, we discover that universal perturbations -- modifications applicable to a wide set of images -- can consistently induce these misinterpretations across multiple proprietary VLLMs. Our experimental results on object recognition, visual question answering, and image captioning show that this vulnerability is common across current state-of-the-art models, and underscore an urgent need for robust mitigations to ensure the safe and secure deployment of VLLMs.

[30] GeloVec: Higher Dimensional Geometric Smoothing for Coherent Visual Feature Extraction in Image Segmentation

Boris Kriuk,Matey Yordanov

Main category: cs.CV

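TL;DR: GeloVec is a CNN-based attention smoothing framework that uses higher-dimensional geometric smoothing to address the boundary instability and contextual discontinuities of conventional semantic segmentation, substantially improving segmentation performance.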

Details

Motivation: Conventional attention-based segmentation methods suffer from boundary instability and contextual discontinuities; GeloVec aims to overcome these limitations with geometric smoothing. Method: Combines modified Chebyshev distance metrics with multispatial transformations, computes geometric distances in n-dimensional feature space through an adaptive sampling weights system, and strengthens feature representations with tensorial projections and orthogonal basis vectors. Result: Validated on multiple benchmark datasets with mIoU gains of 2.1%, 2.7%, and 2.4%, together with high computational efficiency and strong generalization. Conclusion: Through geometric smoothing and efficient computation, GeloVec markedly improves the stability and performance of semantic segmentation, with theoretical guarantees and broad application potential. Abstract: This paper introduces GeloVec, a new CNN-based attention smoothing framework for semantic segmentation that addresses critical limitations in conventional approaches. While existing attention-backed segmentation methods suffer from boundary instability and contextual discontinuities during feature mapping, our framework implements a higher-dimensional geometric smoothing method to establish robust manifold relationships between visually coherent regions. GeloVec combines modified Chebyshev distance metrics with multispatial transformations to enhance segmentation accuracy through stabilized feature extraction. The core innovation lies in the adaptive sampling weights system that calculates geometric distances in n-dimensional feature space, achieving superior edge preservation while maintaining intra-class homogeneity. The multispatial transformation matrix incorporates tensorial projections with orthogonal basis vectors, creating more discriminative feature representations without sacrificing computational efficiency. Experimental validation across multiple benchmark datasets demonstrates significant improvements in segmentation performance, with mean Intersection over Union (mIoU) gains of 2.1%, 2.7%, and 2.4% on Caltech Birds-200, LSDSC, and FSSD datasets respectively compared to state-of-the-art methods. GeloVec's mathematical foundation in Riemannian geometry provides theoretical guarantees on segmentation stability. Importantly, our framework maintains computational efficiency through parallelized implementation of geodesic transformations and exhibits strong generalization capabilities across disciplines due to the absence of information loss during transformations.
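
The abstract does not define its "modified" Chebyshev metric. For orientation only, the standard (unmodified) Chebyshev distance between two n-dimensional feature vectors is the largest absolute difference along any single dimension, as in this sketch:

```python
from typing import Sequence


def chebyshev_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Standard Chebyshev (L-infinity) distance: max_i |a_i - b_i|.

    This is the textbook metric, not the modified variant used by
    GeloVec, whose definition the abstract does not give.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return max(abs(x - y) for x, y in zip(a, b))
```

Because only the single largest coordinate difference counts, two feature vectors are considered close under this metric only if they agree in every dimension.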

[31] Efficient Vocabulary-Free Fine-Grained Visual Recognition in the Age of Multimodal LLMs

Hari Chandana Kuchibhotla,Sai Srinivas Kancheti,Abbavaram Gowtham Reddy,Vineeth N Balasubramanian

Main category: cs.CV

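TL;DR: The paper proposes NeaR, a new method for vocabulary-free fine-grained visual recognition (VF-FGVR) without labeled data, which uses a multimodal large language model (MLLM) to generate labels and fine-tunes a CLIP model on them.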

Details

Motivation: In domains that lack annotated data (such as medical imaging), conventional fine-grained visual recognition methods cannot be applied, and querying an MLLM directly is costly and slow. Method: NeaR uses an MLLM to generate labels for a small unlabeled training set, builds a weakly supervised dataset, and fine-tunes a downstream CLIP model while handling the noise and open-endedness of the labels. Result: NeaR sets a new benchmark for efficient VF-FGVR and avoids the high cost and inference time of using MLLMs directly. Conclusion: NeaR is an efficient and practical solution for fine-grained visual recognition without labeled data. Abstract: Fine-grained Visual Recognition (FGVR) involves distinguishing between visually similar categories, which is inherently challenging due to subtle inter-class differences and the need for large, expert-annotated datasets. In domains like medical imaging, such curated datasets are unavailable due to issues like privacy concerns and high annotation costs. In such scenarios lacking labeled data, an FGVR model cannot rely on a predefined set of training labels, and hence has an unconstrained output space for predictions. We refer to this task as Vocabulary-Free FGVR (VF-FGVR), where a model must predict labels from an unconstrained output space without prior label information. While recent Multimodal Large Language Models (MLLMs) show potential for VF-FGVR, querying these models for each test input is impractical because of high costs and prohibitive inference times. To address these limitations, we introduce Nearest-Neighbor Label Refinement (NeaR), a novel approach that fine-tunes a downstream CLIP model using labels generated by an MLLM. Our approach constructs a weakly supervised dataset from a small, unlabeled training set, leveraging MLLMs for label generation. NeaR is designed to handle the noise, stochasticity, and open-endedness inherent in labels generated by MLLMs, and establishes a new benchmark for efficient VF-FGVR.

[32] Improving Editability in Image Generation with Layer-wise Memory

Daneul Kim,Jaeah Lee,Jaesik Park

Main category: cs.CV

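TL;DR: The paper proposes a framework for multi-step image editing that uses layer-wise memory and consistency guidance to overcome the limitations of existing methods in sequential editing.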

Details

Motivation: Existing image editing methods mainly target single-object modifications and struggle with multi-step editing, especially preserving earlier edits while integrating new objects naturally. Method: Proposes a layer-wise memory that stores latent representations and prompt embeddings, combined with Background Consistency Guidance and Multi-Query Disentanglement in cross-attention. Result: The method performs strongly on iterative editing tasks and maintains high-quality results with only rough masks. Conclusion: The framework markedly improves the efficiency and quality of multi-step image editing. Abstract: Most real-world image editing tasks require multiple sequential edits to achieve desired results. Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing, especially with maintaining previous edits while adapting new objects naturally into the existing content. These limitations significantly hinder complex editing scenarios where multiple objects need to be modified while preserving their contextual relationships. We address this fundamental challenge through two key proposals: enabling rough mask inputs that preserve existing content while naturally integrating new elements and supporting consistent editing across multiple modifications. Our framework achieves this through layer-wise memory, which stores latent representations and prompt embeddings from previous edits. We propose Background Consistency Guidance that leverages memorized latents to maintain scene coherence and Multi-Query Disentanglement in cross-attention that ensures natural adaptation to existing content. To evaluate our method, we present a new benchmark dataset incorporating semantic alignment metrics and interactive editing scenarios. Through comprehensive experiments, we demonstrate superior performance in iterative image editing tasks with minimal user effort, requiring only rough masks while maintaining high-quality results throughout multiple editing steps.

[33] Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation

Daniele Molino,Francesco di Feola,Linlin Shen,Paolo Soda,Valerio Guarrasi

Main category: cs.CV

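TL;DR: Proposes a framework for multimodal medical data generation that produces high-quality chest X-rays and clinical reports, outperforms general-purpose models, and performs strongly on downstream disease classification.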

Details

Motivation: Medical data is complex and demands high clinical accuracy, which general-purpose generative models struggle to meet, so a purpose-built framework is needed. Method: Designs a multimodal generation framework on the MIMIC-CXR dataset that generates chest X-rays and clinical reports. Result: The generated data scores well on FID and BLEU, and downstream task performance is comparable to or better than with real data. Conclusion: Domain-specific adaptation is essential for making generative models useful and relevant in clinical applications. Abstract: Generative models have revolutionized Artificial Intelligence (AI), particularly in multimodal applications. However, adapting these models to the medical domain poses unique challenges due to the complexity of medical data and the stringent need for clinical accuracy. In this work, we introduce a framework specifically designed for multimodal medical data generation. By enabling the generation of multi-view chest X-rays and their associated clinical report, it bridges the gap between general-purpose vision-language models and the specialized requirements of healthcare. Leveraging the MIMIC-CXR dataset, the proposed framework shows superior performance in generating high-fidelity images and semantically coherent reports. Our quantitative evaluation reveals significant results in terms of FID and BLEU scores, showcasing the quality of the generated data. Notably, our framework achieves comparable or even superior performance compared to real data on downstream disease classification tasks, underlining its potential as a tool for medical research and diagnostics. This study highlights the importance of domain-specific adaptations in enhancing the relevance and utility of generative models for clinical applications, paving the way for future advancements in synthetic multimodal medical data generation.

[34] Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages

Marco Salmè,Rosa Sicilia,Paolo Soda,Valerio Guarrasi

Main category: cs.CV

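TL;DR: The study evaluates instruction-tuned vision-language models (VLMs) on radiology report generation in low-resource languages (Italian, German, Spanish) and finds that language-specific and domain-specific training is essential for report quality.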

Details

Motivation: Address the challenge of generating accurate and contextually relevant radiology reports in low-resource languages. Method: Using the LLaVA architecture, systematically evaluates pre-trained models on general, domain-specific, and low-resource language-specific datasets. Result: Language-specific models perform best, fine-tuning with medical terminology markedly improves performance, and the temperature parameter affects report coherence. Conclusion: Language-specific and domain-specific training is essential for improving multilingual radiology report quality, and the study points to directions for future work on model tuning and language adaptation. Abstract: The integration of artificial intelligence in healthcare has opened new horizons for improving medical diagnostics and patient care. However, challenges persist in developing systems capable of generating accurate and contextually relevant radiology reports, particularly in low-resource languages. In this study, we present a comprehensive benchmark to evaluate the performance of instruction-tuned Vision-Language Models (VLMs) in the specialized task of radiology report generation across three low-resource languages: Italian, German, and Spanish. Employing the LLaVA architectural framework, we conducted a systematic evaluation of pre-trained models utilizing general datasets, domain-specific datasets, and low-resource language-specific datasets. In light of the unavailability of models that possess prior knowledge of both the medical domain and low-resource languages, we analyzed various adaptations to determine the most effective approach for these contexts. The results revealed that language-specific models substantially outperformed both general and domain-specific models in generating radiology reports, emphasizing the critical role of linguistic adaptation. Additionally, models fine-tuned with medical terminology exhibited enhanced performance across all languages compared to models with generic knowledge, highlighting the importance of domain-specific training. We also explored the influence of the temperature parameter on the coherence of report generation, providing insights for optimal model settings. Our findings highlight the importance of tailored language and domain-specific training for improving the quality and accuracy of radiological reports in multilingual settings. This research not only advances our understanding of VLM adaptability in healthcare but also points to significant avenues for future investigations into model tuning and language-specific adaptations.
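
For readers unfamiliar with the temperature parameter mentioned above, the following is a generic sketch of how temperature rescales a model's next-token distribution. It is standard softmax-with-temperature, not code from the study; the function name is illustrative.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature.

    Lower temperatures sharpen the distribution (more deterministic
    sampling); higher temperatures flatten it (more varied sampling).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With a low temperature the most likely token dominates, which tends to give more repeatable text; with a high temperature probability mass spreads to less likely tokens, which is one plausible reason temperature influences report coherence.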

[35] VSC: Visual Search Compositional Text-to-Image Diffusion Model

Do Huu Dat,Nam Hyeonu,Po-Yuan Mao,Tae-Hyun Oh

Main category: cs.CV

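TL;DR: The paper proposes a new compositional generation method that uses pairwise image embeddings to improve attribute-object binding, addressing binding failures in text-to-image diffusion models for prompts with multiple attribute-object pairs.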

Details

Motivation: Existing text-to-image diffusion models struggle to bind attributes to objects accurately in complex prompts, mainly because of the limitations of text encoders such as CLIP. Method: Decomposes complex prompts into sub-prompts, generates the corresponding images, computes visual prototypes and fuses them with text embeddings, and applies segmentation-based localization training to fix cross-attention misalignment. Result: On the T2I CompBench benchmark, the method outperforms existing compositional text-to-image diffusion models, improves image quality, and remains robust as the number of binding pairs grows. Conclusion: Through visual prototypes and segmentation-based training, the method markedly improves the accuracy of multi-attribute-object binding, with better scalability and robustness. Abstract: Text-to-image diffusion models have shown impressive capabilities in generating realistic visuals from natural-language prompts, yet they often struggle with accurately binding attributes to corresponding objects, especially in prompts containing multiple attribute-object pairs. This challenge primarily arises from the limitations of commonly used text encoders, such as CLIP, which can fail to encode complex linguistic relationships and modifiers effectively. Existing approaches have attempted to mitigate these issues through attention map control during inference and the use of layout information or fine-tuning during training, yet they face performance drops with increased prompt complexity. In this work, we introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation. By applying segmentation-based localization training, we address cross-attention misalignment, achieving improved accuracy in binding multiple attributes to objects. Our approach outperforms existing compositional text-to-image diffusion models on the benchmark T2I CompBench, achieving better image quality, as evaluated by humans, and showing robustness as the number of binding pairs in the prompt increases.

[36] Self-Supervision Enhances Instance-based Multiple Instance Learning Methods in Digital Pathology: A Benchmark Study

Ali Mammadov,Loic Le Folgoc,Julien Adam,Anne Buronfosse,Gilles Hayem,Guillaume Hocquet,Pietro Gori

Main category: cs.CV

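TL;DR: The study shows that, given a high-quality self-supervised feature extractor, simple instance-based multiple instance learning (MIL) methods can match or outperform complex embedding-based MIL methods while being more interpretable.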

Details

Motivation: To examine how instance-based MIL methods perform with high-quality feature extractors, challenging the traditional advantage of embedding-based MIL methods. Method: Across 710 experiments, the study compares 10 MIL strategies, 6 self-supervised learning methods, 4 foundation models, and several pathology-adapted techniques, and introduces 4 new instance-based MIL methods. Result: On the BRACS and Camelyon16 datasets, simple instance-based MIL methods match or exceed the performance of complex methods and set new SOTA results. Conclusion: More attention should go to self-supervised learning methods for WSI rather than to complex embedding-based MIL methods, to improve both interpretability and performance. Abstract: Multiple Instance Learning (MIL) has emerged as the best solution for Whole Slide Image (WSI) classification. It consists of dividing each slide into patches, which are treated as a bag of instances labeled with a global label. MIL includes two main approaches: instance-based and embedding-based. In the former, each patch is classified independently, and then the patch scores are aggregated to predict the bag label. In the latter, bag classification is performed after aggregating patch embeddings. Although instance-based methods are naturally more interpretable, embedding-based MIL methods have usually been preferred in the past due to their robustness to poor feature extractors. However, recently, the quality of feature embeddings has drastically increased using self-supervised learning (SSL). Nevertheless, many authors continue to endorse the superiority of embedding-based MIL. To investigate this further, we conduct 710 experiments across 4 datasets, comparing 10 MIL strategies, 6 self-supervised methods with 4 backbones, 4 foundation models, and various pathology-adapted techniques. Furthermore, we introduce 4 instance-based MIL methods never used before in the pathology domain. Through these extensive experiments, we show that with a good SSL feature extractor, simple instance-based MILs, with very few parameters, obtain similar or better performance than complex, state-of-the-art (SOTA) embedding-based MIL methods, setting new SOTA results on the BRACS and Camelyon16 datasets. Since simple instance-based MIL methods are naturally more interpretable and explainable to clinicians, our results suggest that more effort should be put into well-adapted SSL methods for WSI rather than into complex embedding-based MIL methods.
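
The difference between the two MIL families described in the abstract can be sketched in a few lines. This is a toy illustration only: the patch embeddings, the linear classifier weights, and the choice of max and mean pooling are assumptions made for the example and are not taken from the paper.

```python
from statistics import mean

def instance_based_mil(patch_scores, pool=max):
    """Classify each patch first, then aggregate the patch scores into a bag score."""
    return pool(patch_scores)

def embedding_based_mil(patch_embeddings, weights, bias=0.0):
    """Aggregate patch embeddings first (mean pooling), then classify the bag embedding."""
    dim = len(patch_embeddings[0])
    bag = [mean(e[d] for e in patch_embeddings) for d in range(dim)]
    return sum(w * x for w, x in zip(weights, bag)) + bias

# Hypothetical slide with three patches: two benign-looking, one tumour-like.
patch_embeddings = [[0.1, 0.0], [0.2, 0.1], [0.9, 0.8]]
weights = [1.0, 1.0]  # toy linear classifier shared by both variants

# Instance-based route: score every patch, then max-pool the scores.
patch_scores = [sum(w * x for w, x in zip(weights, e)) for e in patch_embeddings]
instance_bag_score = instance_based_mil(patch_scores)
embedding_bag_score = embedding_based_mil(patch_embeddings, weights)

# Per-patch scores exist only on the instance-based route, so the patch that
# drives the decision can be pointed out directly.
most_suspicious_patch = patch_scores.index(instance_bag_score)
```

The per-patch scores are what make the instance-based route easier to explain: the bag decision can be traced back to an individual patch, whereas the embedding-based route only exposes a pooled vector.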

[37] FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis

Jiangtong Tan,Hu Yu,Jie Huang,Jie Xiao,Feng Zhao

Main category: cs.CV

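TL;DR: FreePCA is a training-free, PCA-based method for long video generation that decouples global consistency from local quality and substantially improves generation results.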

Details

Motivation: Long video generation suffers from distribution shift caused by the change in frame count, and existing methods struggle to achieve both global consistency and local quality. Method: PCA is used to decouple global and local information into consistent-appearance and motion-intensity features, which are identified by measuring cosine similarity and then integrated progressively to achieve high-quality generation. Result: Experiments show that FreePCA can be applied to a range of video diffusion models without training and substantially improves generation results. Conclusion: By decoupling and integrating global and local information with PCA, FreePCA achieves long video generation that is both high in quality and consistent. Abstract: Long video generation involves generating extended videos using models trained on short videos, suffering from distribution shifts due to varying frame counts. It necessitates the use of local information from the original short frames to enhance visual and motion quality, and global information from the entire long frames to ensure appearance consistency. Existing training-free methods struggle to effectively integrate the benefits of both, as appearance and motion in videos are closely coupled, leading to motion inconsistency and degraded visual quality. In this paper, we reveal that global and local information can be precisely decoupled into consistent appearance and motion intensity information by applying Principal Component Analysis (PCA), allowing for refined complementary integration of global consistency and local quality. With this insight, we propose FreePCA, a training-free long video generation paradigm based on PCA that simultaneously achieves high consistency and quality. Concretely, we decouple consistent appearance and motion intensity features by measuring cosine similarity in the principal component space. Critically, we progressively integrate these features to preserve original quality and ensure smooth transitions, while further enhancing consistency by reusing the mean statistics of the initial noise. Experiments demonstrate that FreePCA can be applied to various video diffusion models without requiring training, leading to substantial improvements. Code is available at https://github.com/JosephTiTan/FreePCA.

[38] TSTMotion: Training-free Scene-aware Text-to-motion Generation

Ziyan Guo,Haoxuan Qu,Hossein Rahmani,Dewen Soh,Ping Hu,Qiuhong Ke,Jun Liu

Main category: cs.CV

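TL;DR: Proposes TSTMotion, a training-free scene-aware text-to-motion generation framework that uses pre-trained models and scene information to generate motion sequences that fit the scene.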

Details

Motivation: Existing scene-aware methods rely on large-scale ground-truth motion data, which is expensive to collect, so the paper proposes an efficient training-free solution. Method: Foundation models are combined to reason about, predict, and validate scene-aware motion guidance, which is then incorporated into a pre-trained motion generator. Result: Experiments show that the framework is effective and generalizes. Conclusion: TSTMotion offers an efficient, training-free approach to scene-aware text-to-motion generation. Abstract: Text-to-motion generation has recently garnered significant research interest, primarily focusing on generating human motion sequences in blank backgrounds. However, human motions commonly occur within diverse 3D scenes, which has prompted exploration into scene-aware text-to-motion generation methods. Yet, existing scene-aware methods often rely on large-scale ground-truth motion sequences in diverse 3D scenes, which poses practical challenges due to the expensive cost. To mitigate this challenge, we are the first to propose a Training-free Scene-aware Text-to-Motion framework, dubbed TSTMotion, that efficiently empowers pre-trained blank-background motion generators with the scene-aware capability. Specifically, conditioned on the given 3D scene and text description, we adopt foundation models together to reason, predict and validate a scene-aware motion guidance. Then, the motion guidance is incorporated into the blank-background motion generators with two modifications, resulting in scene-aware text-driven motion sequences. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework. We release our code on the project page: https://tstmotion.github.io/

[39] Efficient Vision-based Vehicle Speed Estimation

Andrej Macko,Lukáš Gajdošech,Viktor Kocur

Main category: cs.CV

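TL;DR: Proposes a computationally efficient method for estimating vehicle speed from traffic camera video, improving on prior work and substantially boosting real-time performance.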

Details

Motivation: Existing methods fall short in real-time capability and computational efficiency, so a more efficient yet accurate method for vehicle speed estimation is needed. Method: An improved approach based on 3D bounding boxes and vanishing point geometry, combined with post-training quantization to optimize model performance. Result: On the BrnoCompSpeed dataset, the speed estimation error (0.58 km/h), detection precision (91.02%), and recall (91.14%) all beat the prior state of the art, while the model runs 5.5 times faster. Conclusion: The method substantially improves computational efficiency while maintaining high accuracy, making it suitable for real-world deployment. Abstract: This paper presents a computationally efficient method for vehicle speed estimation from traffic camera footage. Building upon previous work that utilizes 3D bounding boxes derived from 2D detections and vanishing point geometry, we introduce several improvements to enhance real-time performance. We evaluate our method in several variants on the BrnoCompSpeed dataset in terms of vehicle detection and speed estimation accuracy. Our extensive evaluation across various hardware platforms, including edge devices, demonstrates significant gains in frames per second (FPS) compared to the prior state-of-the-art, while maintaining comparable or improved speed estimation accuracy. We analyze the trade-off between accuracy and computational cost, showing that smaller models utilizing post-training quantization offer the best balance for real-world deployment. Our best performing model beats previous state-of-the-art in terms of median vehicle speed estimation error (0.58 km/h vs. 0.60 km/h), detection precision (91.02% vs. 87.08%) and recall (91.14% vs. 83.32%) while also being 5.5 times faster.
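
Once a tracked vehicle has been localized on the road plane, the final speed computation is simple arithmetic. The sketch below assumes that per-frame positions in metres are already available (for example from a 3D bounding box reference point) and takes the median over frames to damp localization noise; the function names, the toy track, and the median-over-frames choice are assumptions for illustration, not the paper's pipeline.

```python
from statistics import median

def speeds_kmh(positions_m, fps):
    """Per-frame speeds (km/h) from consecutive road-plane positions in metres."""
    dt = 1.0 / fps
    out = []
    for (x0, y0), (x1, y1) in zip(positions_m, positions_m[1:]):
        dist = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        out.append(dist / dt * 3.6)  # m/s -> km/h
    return out

def vehicle_speed_kmh(positions_m, fps):
    """Median of the per-frame speeds, which damps single-frame localization noise."""
    return median(speeds_kmh(positions_m, fps))

# Hypothetical track sampled at 50 FPS; the fourth position is noisy.
track = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.6, 0.0), (2.0, 0.0)]
speed = vehicle_speed_kmh(track, fps=50)
```

With this toy track the per-frame speeds are roughly 90, 90, 108, and 72 km/h, and the median stays near 90 km/h despite the noisy sample.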

[40] T-Graph: Enhancing Sparse-view Camera Pose Estimation by Pairwise Translation Graph

Qingyu Xian,Weiqin Jiao,Hao Cheng,Berend Jan van der Zwaag,Yanqiu Huang

Main category: cs.CV

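TL;DR: The paper proposes T-Graph, a module that builds a fully connected translation graph with a multilayer perceptron to improve camera pose estimation from sparse views.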

Details

Motivation: Camera pose estimation from sparse views performs poorly, and existing methods often ignore the translation information between viewpoints. Method: T-Graph processes paired image features with an MLP, builds a translation graph, and introduces two translation representations (relative-t and pair-t). Result: Experiments show that T-Graph significantly improves RelPose++ and Forge, raising camera center accuracy by 1% to 6%. Conclusion: T-Graph is a lightweight plug-and-play module that effectively improves sparse-view camera pose estimation. Abstract: Sparse-view camera pose estimation, which aims to estimate the 6-Degree-of-Freedom (6-DoF) poses from a limited number of images captured from different viewpoints, is a fundamental yet challenging problem in remote sensing applications. Existing methods often overlook the translation information between each pair of viewpoints, leading to suboptimal performance in sparse-view scenarios. To address this limitation, we introduce T-Graph, a lightweight, plug-and-play module to enhance camera pose estimation in sparse-view settings. T-Graph takes paired image features as input and maps them through a Multilayer Perceptron (MLP). It then constructs a fully connected translation graph, where nodes represent cameras and edges encode their translation relationships. It can be seamlessly integrated into existing models as an additional branch in parallel with the original prediction, maintaining efficiency and ease of use. Furthermore, we introduce two pairwise translation representations, relative-t and pair-t, formulated under different local coordinate systems. While relative-t captures intuitive spatial relationships, pair-t offers a rotation-disentangled alternative. The two representations contribute to enhanced adaptability across diverse application scenarios, further improving our module's robustness. Extensive experiments on two state-of-the-art methods (RelPose++ and Forge) using public datasets (CO3D and IMC PhotoTourism) validate both the effectiveness and generalizability of T-Graph. The results demonstrate consistent improvements across various metrics, notably camera center accuracy, which improves by 1% to 6% from 2 to 8 viewpoints.
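
To make the idea of a fully connected translation graph concrete, the sketch below builds one edge per ordered camera pair from known poses. It assumes world-to-camera extrinsics (x_cam = R x_world + t) and uses the translation of the standard relative pose as the edge label. The abstract does not give the exact definitions of relative-t and pair-t, so this is a stand-in rather than either of the paper's representations, and in the paper the edges are predicted by an MLP from image features rather than computed from ground-truth poses.

```python
from itertools import permutations

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def relative_translation(pose_i, pose_j):
    """Translation part of the transform taking camera-i coordinates to camera-j coordinates."""
    (r_i, t_i), (r_j, t_j) = pose_i, pose_j
    r_ij = matmul(r_j, transpose(r_i))      # relative rotation R_j R_i^T
    moved = matvec(r_ij, t_i)
    return [t_j[k] - moved[k] for k in range(3)]

def translation_graph(poses):
    """Fully connected directed graph: nodes are cameras, one edge per ordered pair."""
    return {(i, j): relative_translation(poses[i], poses[j])
            for i, j in permutations(range(len(poses)), 2)}

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical poses given as (R, t) pairs.
poses = [(IDENTITY, [0.0, 0.0, 0.0]), (IDENTITY, [1.0, 0.0, 0.0]), (ROT_Z_90, [0.0, 2.0, 0.0])]
graph = translation_graph(poses)
```

For N cameras the graph has N(N-1) directed edges, so the number of edges stays small in the sparse-view regime (2 to 8 views) that the abstract targets.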

[41] High Dynamic Range Novel View Synthesis with Single Exposure

Kaixuan Zhang,Hu Wang,Minxian Li,Mingwu Ren,Mao Ye,Xiatian Zhu

Main category: cs.CV

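TL;DR: The paper proposes Mono-HDR-3D, a single-exposure HDR-NVS method that addresses the limitations of multi-exposure HDR-NVS, such as motion artifacts and high cost.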

Details

Motivation: Multi-exposure HDR-NVS suffers from motion artifacts and high cost, so a method that relies only on single-exposure LDR images is needed. Method: Mono-HDR-3D comprises two modules, LDR-to-HDR and HDR-to-LDR, which enable unsupervised closed-loop learning, and it can be integrated into existing NVS models. Result: Experiments show that Mono-HDR-3D significantly outperforms existing methods. Conclusion: Mono-HDR-3D provides an efficient solution for single-exposure HDR-NVS; the code will be released. Abstract: High Dynamic Range Novel View Synthesis (HDR-NVS) aims to establish a 3D scene HDR model from Low Dynamic Range (LDR) imagery. Typically, multiple-exposure LDR images are employed to capture a wider range of brightness levels in a scene, as a single LDR image cannot represent both the brightest and darkest regions simultaneously. While effective, this multiple-exposure HDR-NVS approach has significant limitations, including susceptibility to motion artifacts (e.g., ghosting and blurring) and high capture and storage costs. To overcome these challenges, we introduce, for the first time, the single-exposure HDR-NVS problem, where only single exposure LDR images are available during training. We further introduce a novel approach, Mono-HDR-3D, featuring two dedicated modules formulated by the LDR image formation principles, one for converting LDR colors to HDR counterparts, and the other for transforming HDR images to LDR format so that unsupervised learning is enabled in a closed loop. Designed as a meta-algorithm, our approach can be seamlessly integrated with existing NVS models. Extensive experiments show that Mono-HDR-3D significantly outperforms previous methods. Source code will be released.

[42] RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement

Kui Jiang,Yan Luo,Junjun Jiang,Xin Xu,Fei Ma,Fei Yu

Main category: cs.CV

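TL;DR: The paper proposes RD-UIE, an improved Mamba model with a sorting-based scanning mechanism for underwater image enhancement; it dynamically adjusts the scan order and fuses multi-scale features to significantly improve performance.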

Details

Motivation: Wavelength-dependent attenuation causes content degradation and color distortion in underwater images, and existing Mamba models are limited in complex environments by fixed scan paths and poor adaptation to local semantics. Method: The paper proposes a dynamic sorting-based scanning mechanism and a Visually Self-adaptive State Block (VSSB), together with a cross-feature bridge (CFB) that fuses multi-scale features, forming the RD-UIE framework. Result: On several underwater enhancement benchmarks, RD-UIE outperforms the existing method WMamba, with an average gain of 0.55 dB. Conclusion: Through dynamic scanning and multi-scale feature fusion, RD-UIE effectively addresses global and local relation modeling in underwater image enhancement. Abstract: Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications, where wavelength-dependent attenuation causes severe content degradation and color distortion. While recent state space models like Mamba show potential for long-range dependency modeling, their unfolding operations and fixed scan paths on 1D sequences fail to adapt to local object semantics and global relation modeling, limiting their efficacy in complex underwater environments. To address this, we enhance conventional Mamba with a sorting-based scanning mechanism that dynamically reorders scanning sequences based on the statistical distribution of the spatial correlation of all pixels. In this way, it encourages the network to prioritize the most informative components: structural and semantic features. Building on this mechanism, we devise a Visually Self-adaptive State Block (VSSB) that harmonizes the dynamic sorting of Mamba with input-dependent dynamic convolution, enabling coherent integration of global context and local relational cues. This design helps eliminate global focus bias, especially for widely distributed contents, which greatly weakens the statistical frequency. For robust feature extraction and refinement, we design a cross-feature bridge (CFB) to adaptively fuse multi-scale representations. These efforts compose the novel relation-driven Mamba framework for effective UIE (RD-UIE). Extensive experiments on underwater enhancement benchmarks demonstrate that RD-UIE outperforms the state-of-the-art approach WMamba in both quantitative metrics and visual fidelity, achieving an average gain of 0.55 dB on the three benchmarks. Our code is available at https://github.com/kkoucy/RD-UIE/tree/main
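
The core of a sorting-based scan, as opposed to a fixed raster scan, can be sketched as: rank the flattened tokens by a score, run the 1D recurrence in that order, and write the results back to the original positions. In the sketch below the scores and the linear state update are placeholders chosen for illustration; the paper derives its ordering from the statistical distribution of spatial correlation and uses a Mamba state space block, neither of which is reproduced here.

```python
def sorted_scan(tokens, scores, decay=0.5):
    """Run a 1D recurrent scan over tokens visited in descending score order."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    state = 0.0
    out = [0.0] * len(tokens)
    for idx in order:
        state = decay * state + tokens[idx]  # toy linear state update
        out[idx] = state                     # write back at the original position
    return out, order

tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.5, 0.7]  # hypothetical per-pixel relevance scores
out, order = sorted_scan(tokens, scores)
```

Because the outputs are scattered back by index, the spatial layout of the feature map is preserved even though the recurrence visits pixels in a content-dependent order.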

[43] Core-Set Selection for Data-efficient Land Cover Segmentation

Keiller Nogueira,Akram Zaytar,Wanli Ma,Ribana Roscher,Ronny Hänsch,Caleb Robinson,Anthony Ortiz,Simone Nsutezo,Rahul Dodhia,Juan M. Lavista Ferres,Oktay Karakuş,Paul L. Rosin

Main category: cs.CV

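TL;DR: The paper proposes six novel core-set selection methods for choosing important subsets of remote sensing image segmentation datasets; experiments show that they outperform a random-selection baseline and, in some cases, training on all of the data.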

Details

Motivation: Traditional deep learning relies on large training datasets, but large datasets can introduce bias and noise and consume substantial computational resources, so both the quality and the quantity of data deserve attention. Method: Six core-set selection methods are proposed, based on imagery, labels, or both, and benchmarked on three commonly used land cover classification datasets (DFC2022, Vaihingen, Potsdam). Result: Training on the selected subsets outperforms the random-selection baseline, and some methods even outperform training on all of the data. Conclusion: Data-centric learning has significant potential in remote sensing; the code is open source. Abstract: The increasing accessibility of remotely sensed data and the potential of such data to inform large-scale decision-making has driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models must be trained on large datasets. However, the common assumption that larger datasets generally lead to better outcomes tends to overlook the complexities of the data distribution, the potential for introducing biases and noise, and the computational resources required for processing and storing vast datasets. Therefore, effective solutions should consider both the quantity and quality of data. In this paper, we propose six novel core-set selection methods for selecting important subsets of samples from remote sensing image segmentation datasets that rely on imagery only, labels only, and a combination of both. We benchmark these approaches against a random-selection baseline on three commonly used land cover classification datasets: DFC2022, Vaihingen, and Potsdam. In each of the datasets, we demonstrate that training on a subset of samples outperforms the random baseline, and some approaches outperform training on all available data. This result shows the importance and potential of data-centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data-centric-rs-classification/.
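
As an illustration of what a labels-only selection rule can look like, the sketch below ranks samples by the entropy of the class histogram of their label mask and keeps the most class-diverse ones. This heuristic is an assumption chosen for the example; the abstract does not describe the six proposed methods, and this is not claimed to be one of them.

```python
from collections import Counter
from math import log

def label_entropy(mask):
    """Shannon entropy (nats) of the class histogram of a flattened label mask."""
    counts = Counter(mask)
    total = len(mask)
    return -sum((c / total) * log(c / total) for c in counts.values())

def select_core_set(masks, budget):
    """Keep the `budget` samples whose label masks are most class-diverse."""
    ranked = sorted(range(len(masks)), key=lambda i: label_entropy(masks[i]), reverse=True)
    return sorted(ranked[:budget])

# Hypothetical flattened label masks for four training tiles.
masks = [
    [0, 0, 0, 0],  # single class
    [0, 1, 2, 3],  # four classes, uniform
    [0, 0, 1, 1],  # two classes
    [0, 0, 0, 1],  # mostly one class
]
core = select_core_set(masks, budget=2)
```

A random-selection baseline, as used in the paper's benchmark, would instead draw `budget` indices uniformly, which makes it easy to compare any scoring rule against chance under the same budget.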

[44] Compensating Spatiotemporally Inconsistent Observations for Online Dynamic 3D Gaussian Splatting

Youngsik Yun,Jeongmin Bae,Hyunseung Son,Seoha Kim,Hahyun Lee,Gun Bang,Youngjung Uh

Main category: cs.CV

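TL;DR: The paper proposes an online dynamic scene reconstruction method that improves temporal consistency by removing a learned error, addressing the noticeable artifacts that existing methods produce in static regions.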

Details

Motivation: Existing online reconstruction methods focus on efficiency and rendering quality and neglect temporal consistency, which leads to artifacts in static regions; this paper aims to solve that problem. Method: A method that improves temporal consistency by learning the error and subtracting it, applicable to camera observations that are temporally inconsistent. Result: Experiments show that the method significantly improves temporal consistency and rendering quality and works with several baselines. Conclusion: The method effectively addresses temporal consistency in online dynamic scene reconstruction and improves overall quality. Abstract: Online reconstruction of dynamic scenes is significant as it enables learning scenes from live-streaming video inputs, while existing offline dynamic reconstruction methods rely on recorded video inputs. However, previous online reconstruction approaches have primarily focused on efficiency and rendering quality, overlooking the temporal consistency of their results, which often contain noticeable artifacts in static regions. This paper identifies that errors such as noise in real-world recordings degrade temporal consistency in online reconstruction. We propose a method that enhances temporal consistency in online reconstruction from temporally inconsistent observations, which are inevitable with real cameras. We show that our method restores the ideal observation by subtracting the learned error. We demonstrate that applying our method to various baselines significantly enhances both temporal consistency and rendering quality across datasets. Code, video results, and checkpoints are available at https://bbangsik13.github.io/OR2.

[45] Fusing Foveal Fixations Using Linear Retinal Transformations and Bayesian Experimental Design

Christopher K. I. Williams

Main category: cs.CV

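TL;DR: The paper models the retinal transformation of a fixation as a linear downsampling in order to fuse multiple fixations of a scene, and uses Bayesian experimental design to choose where to fixate next.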

Details

Motivation: Humans and many vertebrates must fuse scene information across multiple fixations, each using a high-resolution fovea and low-resolution peripheral vision. The paper aims to exploit the known geometry by modeling the retinal transformation as a linear downsampling. Method: The retinal transformation is represented as a linear downsampling of a high-resolution latent scene image; exact inference is carried out with factor analysis (FA) and mixtures of FA models, and fixation selection is optimized through Bayesian experimental design based on expected information gain. Result: Experiments on the Frey faces and MNIST datasets validate the models. Conclusion: The proposed approach models multi-fixation fusion effectively and automates fixation selection through Bayesian experimental design. Abstract: Humans (and many vertebrates) face the problem of fusing together multiple fixations of a scene in order to obtain a representation of the whole, where each fixation uses a high-resolution fovea and decreasing resolution in the periphery. In this paper we explicitly represent the retinal transformation of a fixation as a linear downsampling of a high-resolution latent image of the scene, exploiting the known geometry. This linear transformation allows us to carry out exact inference for the latent variables in factor analysis (FA) and mixtures of FA models of the scene. Further, this allows us to formulate and solve the choice of "where to look next" as a Bayesian experimental design problem using the Expected Information Gain criterion. Experiments on the Frey faces and MNIST datasets demonstrate the effectiveness of our models.
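
For a linear-Gaussian observation y = A x + noise with prior x ~ N(0, Σ) and noise variance σ², the expected information gain has the closed form EIG = ½ log det(I + A Σ Aᵀ / σ²). The sketch below evaluates it for a deliberately simplified case: a 1D "scene" with independent pixels (diagonal Σ), a fovea that reads pixels one by one, and a periphery that averages disjoint blocks, so the determinant factorizes into a sum over sensors. The 1D layout, block sizes, and variances are assumptions for illustration; the paper works with factor analysis models, where Σ is not diagonal.

```python
from math import log

def fixation_rows(n, center, fovea_radius, block):
    """Pixel index groups averaged by each retinal sensor for a fixation at `center`."""
    lo, hi = max(0, center - fovea_radius), min(n, center + fovea_radius + 1)
    rows = [[i] for i in range(lo, hi)]                                        # fovea: one pixel per sensor
    rows += [list(range(s, min(s + block, lo))) for s in range(0, lo, block)]  # left periphery
    rows += [list(range(s, min(s + block, n))) for s in range(hi, n, block)]   # right periphery
    return rows

def expected_information_gain(prior_var, rows, noise_var):
    """EIG in nats for independent Gaussian pixels observed through disjoint averaging rows."""
    gain = 0.0
    for idx in rows:
        signal_var = sum(prior_var[i] for i in idx) / len(idx) ** 2  # variance of the block mean
        gain += 0.5 * log(1.0 + signal_var / noise_var)
    return gain

def best_fixation(prior_var, fovea_radius=1, block=4, noise_var=0.1):
    """Pick the fixation centre with the largest expected information gain."""
    n = len(prior_var)
    return max(range(n), key=lambda c: expected_information_gain(
        prior_var, fixation_rows(n, c, fovea_radius, block), noise_var))

# Hypothetical prior: the scene is well known except for pixels 9 to 11.
prior_var = [0.1] * 16
prior_var[9:12] = [4.0, 4.0, 4.0]
next_look = best_fixation(prior_var)
```

The chosen fixation places the fovea over the uncertain pixels, since averaging them in the periphery divides their variance contribution by the block size squared and so yields less information.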

[46] CAMELTrack: Context-Aware Multi-cue ExpLoitation for Online Multi-Object Tracking

Vladimir Somers,Baptiste Standaert,Victor Joos,Alexandre Alahi,Christophe De Vleeschouwer

Main category: cs.CV

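TL;DR: CAMEL is a new association module that learns data-driven association strategies, moving away from hand-crafted heuristics while retaining the benefits of a modular design.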

Details

Motivation: Existing tracking methods rely on hand-crafted rules for temporal association, which limits their ability to capture the interplay between complex tracking cues. Method: CAMEL uses two Transformer-based modules and a new association-centric training scheme to model the complex interactions between targets and their association cues. Result: CAMELTrack achieves state-of-the-art performance on multiple tracking benchmarks. Conclusion: CAMEL offers an approach that is lightweight, fast to train, and able to use external off-the-shelf models, significantly improving multi-object tracking performance. Abstract: Online multi-object tracking has been recently dominated by tracking-by-detection (TbD) methods, where recent advances rely on increasingly sophisticated heuristics for tracklet representation, feature fusion, and multi-stage matching. The key strength of TbD lies in its modular design, enabling the integration of specialized off-the-shelf models like motion predictors and re-identification. However, the extensive usage of human-crafted rules for temporal associations makes these methods inherently limited in their ability to capture the complex interplay between various tracking cues. In this work, we introduce CAMEL, a novel association module for Context-Aware Multi-Cue ExpLoitation, that learns resilient association strategies directly from data, breaking free from hand-crafted heuristics while maintaining TbD's valuable modularity. At its core, CAMEL employs two transformer-based modules and relies on a novel association-centric training scheme to effectively model the complex interactions between tracked targets and their various association cues. Unlike end-to-end detection-by-tracking approaches, our method remains lightweight and fast to train while being able to leverage external off-the-shelf models. Our proposed online tracking pipeline, CAMELTrack, achieves state-of-the-art performance on multiple tracking benchmarks. Our code is available at https://github.com/TrackingLaboratory/CAMELTrack.

[47] Diffusion-based Adversarial Purification from the Perspective of the Frequency Domain

Gaozheng Pei,Ke Ma,Yingfei Sun,Qianqian Xu,Qingming Huang

Main category: cs.CV

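TL;DR: The paper proposes a frequency-domain adversarial purification method that analyzes the amplitude and phase spectra and selectively preserves low-frequency information, removing adversarial perturbations while protecting the original image content.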

Details

Motivation: Existing diffusion-based adversarial purification methods lack information about the distribution of adversarial perturbations and therefore tend to damage the normal semantics of the image. Taking a frequency-domain view, the authors find that adversarial perturbations damage high-frequency components more, and so propose a purification method that selectively preserves low-frequency information. Method: During the reverse process, the low-frequency part of the amplitude spectrum is replaced with the corresponding part of the adversarial image, and the low-frequency part of the phase spectrum is projected, removing the perturbation while preserving the original image content. Result: Experiments show that the method significantly outperforms existing defenses, effectively removing adversarial perturbations while protecting image content. Conclusion: By using frequency-domain analysis to treat low-frequency information selectively, the method achieves better purification performance and better preservation of the image. Abstract: Diffusion-based adversarial purification methods attempt to absorb adversarial perturbations into isotropic noise through the forward process, and then recover the clean images through the reverse process. Due to the lack of distribution information about adversarial perturbations in the pixel domain, it is often unavoidable to damage normal semantics. We turn to the frequency domain perspective, decomposing the image into amplitude spectrum and phase spectrum. We find that for both spectra, the damage caused by adversarial perturbations tends to increase monotonically with frequency. This means that we can extract the content and structural information of the original clean sample from the frequency components that are less damaged. Meanwhile, theoretical analysis indicates that existing purification methods indiscriminately damage all frequency components, leading to excessive damage to the image. Therefore, we propose a purification method that can eliminate adversarial perturbations while maximizing the preservation of the content and structure of the original image. Specifically, at each time step during the reverse process, for the amplitude spectrum, we replace the low-frequency components of the estimated image's amplitude spectrum with the corresponding parts of the adversarial image. For the phase spectrum, we project the phase of the estimated image into a designated range of the adversarial image's phase spectrum, focusing on the low frequencies. Empirical evidence from extensive experiments demonstrates that our method significantly outperforms most current defense methods.
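
The two low-frequency operations described in the abstract, amplitude replacement and phase projection, can be sketched on a 1D signal with a naive DFT. This is a minimal sketch under stated assumptions: a 1D signal instead of an image, a hard frequency cutoff, and a symmetric phase band of half-width `max_phase_gap` around the adversarial phase. The names `cutoff` and `max_phase_gap` are hypothetical parameters, and the diffusion reverse process that would call this at each time step is not shown.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) for k in range(n)]

def idft(spec):
    n = len(spec)
    # The real part is taken because the inputs are real signals.
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def wrap(angle):
    """Wrap an angle to [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def low_frequency_guidance(estimate, adversarial, cutoff, max_phase_gap):
    """Pull the low-frequency amplitude and phase of `estimate` towards `adversarial`."""
    est, adv = dft(estimate), dft(adversarial)
    n = len(est)
    out = []
    for k in range(n):
        amp, phase = abs(est[k]), cmath.phase(est[k])
        if min(k, n - k) <= cutoff:            # low-frequency bin and its mirror
            amp = abs(adv[k])                  # amplitude: copy from the adversarial input
            adv_phase = cmath.phase(adv[k])
            gap = wrap(phase - adv_phase)
            gap = max(-max_phase_gap, min(max_phase_gap, gap))
            phase = adv_phase + gap            # phase: project into a band around it
        out.append(cmath.rect(amp, phase))
    return idft(out)

adversarial = [0.0, 1.0, 0.5, -0.5, 0.2, 0.8, -1.0, 0.3]
estimate = [0.1, 0.9, 0.4, -0.3, 0.0, 0.7, -0.8, 0.2]
guided = low_frequency_guidance(estimate, adversarial, cutoff=1, max_phase_gap=0.2)
```

High-frequency bins are left to the estimate, which matches the abstract's observation that adversarial damage grows with frequency, so only the less damaged low frequencies are trusted from the adversarial input.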

[48] FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors

Chenxi Li,Weijie Wang,Qiang Li,Bruno Lepri,Nicu Sebe,Weizhi Nie

Main category: cs.CV

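TL;DR: FreeInsert is a text-driven framework for inserting objects into 3D scenes without spatial priors, using foundation models to achieve flexible and consistent editing.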

Details

Motivation: Existing methods rely on spatial priors such as 2D masks or 3D bounding boxes, which limits flexibility and scalability. Method: MLLMs, LGMs, and diffusion models are combined to insert objects through semantic parsing, spatial reasoning, and hierarchical refinement. Result: Experiments show that FreeInsert achieves semantically coherent, spatially precise, and visually realistic 3D insertions. Conclusion: FreeInsert offers a user-friendly editing approach that needs no spatial priors. Abstract: Text-driven object insertion in 3D scenes is an emerging task that enables intuitive scene editing through natural language. However, existing 2D editing-based methods often rely on spatial priors such as 2D masks or 3D bounding boxes, and they struggle to ensure consistency of the inserted object. These limitations hinder flexibility and scalability in real-world applications. In this paper, we propose FreeInsert, a novel framework that leverages foundation models including MLLMs, LGMs, and diffusion models to disentangle object generation from spatial placement. This enables unsupervised and flexible object insertion in 3D scenes without spatial priors. FreeInsert starts with an MLLM-based parser that extracts structured semantics, including object types, spatial relationships, and attachment regions, from user instructions. These semantics guide both the reconstruction of the inserted object for 3D consistency and the learning of its degrees of freedom. We leverage the spatial reasoning capabilities of MLLMs to initialize object pose and scale. A hierarchical, spatially aware refinement stage further integrates spatial semantics and MLLM-inferred priors to enhance placement. Finally, the appearance of the object is improved using the inserted-object image to enhance visual fidelity. Experimental results demonstrate that FreeInsert achieves semantically coherent, spatially precise, and visually realistic 3D insertions without relying on spatial priors, offering a user-friendly and flexible editing experience.

[49] Monitoring morphometric drift in lifelong learning segmentation of the spinal cord

Enamundram Naga Karthik,Sandrine Bédard,Jan Valošek,Christoph S. Aigner,Elise Bannier,Josef Bednařík,Virginie Callot,Anna Combes,Armin Curt,Gergely David,Falk Eippert,Lynn Farner,Michael G Fehlings,Patrick Freund,Tobias Granberg,Cristina Granziera,RHSCIR Network Imaging Group,Ulrike Horn,Tomáš Horák,Suzanne Humphreys,Markus Hupp,Anne Kerbrat,Nawal Kinany,Shannon Kolind,Petr Kudlička,Anna Lebret,Lisa Eunyoung Lee,Caterina Mainero,Allan R. Martin,Megan McGrath,Govind Nair,Kristin P. O'Grady,Jiwon Oh,Russell Ouellette,Nikolai Pfender,Dario Pfyffer,Pierre-François Pradat,Alexandre Prat,Emanuele Pravatà,Daniel S. Reich,Ilaria Ricchi,Naama Rotem-Kohavi,Simon Schading-Sassenhausen,Maryam Seif,Andrew Smith,Seth A Smith,Grace Sweeney,Roger Tam,Anthony Traboulsee,Constantina Andrada Treaba,Charidimos Tsagkas,Zachary Vavasour,Dimitri Van De Ville,Kenneth Arnold Weber II,Sarath Chandar,Julien Cohen-Adad

Main category: cs.CV

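TL;DR: The study presents a spinal cord segmentation model and a lifelong learning framework that monitors morphometric drift as the model is updated, and applies them to update a normative database of healthy participants.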

Details

Motivation: To assess how stable a spinal cord segmentation model remains as it is updated, especially when it is used to derive normative values from healthy participants. Method: A spinal cord segmentation model is trained on a multisite dataset, and a lifelong learning framework, run through an automatic GitHub Actions workflow, monitors morphometric drift. Result: The model performs well on lumbar spinal cord cases (Dice score 0.95 ± 0.03), drift monitoring provides fast feedback, and the scaling factor needed to update the normative database is stable. Conclusion: The model and framework provide reliable tools for spinal cord morphometry and support both the development of future segmentation models and the updating of the normative database. Abstract: Morphometric measures derived from spinal cord segmentations can serve as diagnostic and prognostic biomarkers in neurological diseases and injuries affecting the spinal cord. While automatic segmentation methods that are robust to a wide variety of contrasts and pathologies have been developed over the past few years, whether their predictions are stable as the model is updated using new datasets has not been assessed. This is particularly important for deriving normative values from healthy participants. In this study, we present a spinal cord segmentation model trained on a multisite $(n=75)$ dataset, including 9 different MRI contrasts and several spinal cord pathologies. We also introduce a lifelong learning framework to automatically monitor the morphometric drift as the model is updated using additional datasets. The framework is triggered by an automatic GitHub Actions workflow every time a new model is created, recording the morphometric values derived from the model's predictions over time. As a real-world application of the proposed framework, we employed the spinal cord segmentation model to update a recently-introduced normative database of healthy participants containing commonly used measures of spinal cord morphometry. Results showed that: (i) our model outperforms previous versions and pathology-specific models on challenging lumbar spinal cord cases, achieving an average Dice score of $0.95 \pm 0.03$; (ii) the automatic workflow for monitoring morphometric drift provides a quick feedback loop for developing future segmentation models; and (iii) the scaling factor required to update the database of morphometric measures is nearly constant among slices across the given vertebral levels, showing minimum drift between the current and previous versions of the model monitored by the framework. The model is freely available in Spinal Cord Toolbox v7.0.
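
For reference, the Dice score reported above measures the overlap between a predicted and a ground-truth mask as 2|P ∩ T| / (|P| + |T|). The sketch below computes it for flattened binary masks; the toy masks are made up for the example, and returning 1.0 when both masks are empty is a convention assumed here.

```python
def dice_score(pred, truth):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flattened binary masks."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical 6-voxel masks: the prediction has one false-positive voxel.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 1, 1, 0, 0, 0]
score = dice_score(pred, truth)
```

Here two voxels overlap out of three predicted and two true voxels, giving a Dice score of 0.8.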

[50] Global Collinearity-aware Polygonizer for Polygonal Building Mapping in Remote Sensing

Fahong Zhang,Yilei Shi,Xiao Xiang Zhu

Main category: cs.CV

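TL;DR: The paper proposes GCP, a new algorithm for mapping polygonal buildings from remote sensing images, which improves polygon generation through global collinearity-aware polygonization.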

Details

Motivation: To address the challenge of accurately mapping polygonal buildings from remote sensing images and to improve the accuracy and efficiency of polygon generation. Method: Built on an instance segmentation framework: contours are sampled into polylines, a Transformer-based regression module refines the contour fit, and a collinearity-aware polygon simplification module produces the final polygons. Result: GCP is validated on public benchmarks, and its simplification module outperforms traditional methods such as the Douglas-Peucker algorithm. Conclusion: GCP performs well for polygonal building mapping and is broadly applicable; the code is to be released. Abstract: This paper addresses the challenge of mapping polygonal buildings from remote sensing images and introduces a novel algorithm, the Global Collinearity-aware Polygonizer (GCP). GCP, built upon an instance segmentation framework, processes binary masks produced by any instance segmentation model. The algorithm begins by collecting polylines sampled along the contours of the binary masks. These polylines undergo a refinement process using a transformer-based regression module to ensure they accurately fit the contours of the targeted building instances. Subsequently, a collinearity-aware polygon simplification module simplifies these refined polylines and generates the final polygon representation. This module employs a dynamic programming technique to optimize an objective function that balances the simplicity and fidelity of the polygons, achieving globally optimal solutions. Furthermore, the optimized collinearity-aware objective is seamlessly integrated into network training, enhancing the cohesiveness of the entire pipeline. The effectiveness of GCP has been validated on two public benchmarks for polygonal building mapping. Further experiments reveal that applying the collinearity-aware polygon simplification module to arbitrary polylines, without prior knowledge, enhances accuracy over traditional methods such as the Douglas-Peucker algorithm. This finding underscores the broad applicability of GCP. The code for the proposed method will be made available at https://github.com/zhu-xlab.
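
For context, the Douglas-Peucker algorithm that the abstract names as the traditional baseline can be sketched as follows. This is the baseline only, not GCP's dynamic-programming simplifier: it greedily keeps the point farthest from the current chord and recurses, so it has no global objective. The variant below measures distance to the chord segment, and the toy polyline and tolerance are made up for the example.

```python
def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    cx, cy = ax + t * dx, ay + t * dy
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

def douglas_peucker(points, epsilon):
    """Drop points that lie within `epsilon` of the chord between kept neighbours."""
    if len(points) < 3:
        return list(points)
    dists = [point_segment_distance(p, points[0], points[-1]) for p in points[1:-1]]
    worst = max(range(len(dists)), key=dists.__getitem__)
    if dists[worst] <= epsilon:
        return [points[0], points[-1]]
    split = worst + 1  # index of the farthest point in `points`
    left = douglas_peucker(points[:split + 1], epsilon)
    right = douglas_peucker(points[split:], epsilon)
    return left[:-1] + right  # the split point appears once

# Hypothetical noisy L-shaped wall outline.
outline = [(0, 0), (1, 0.05), (2, 0), (3, -0.04), (4, 0), (4, 1), (4.03, 2), (4, 3)]
simplified = douglas_peucker(outline, epsilon=0.1)
```

On this outline the jittered points along each wall are removed and only the two endpoints and the corner remain. Each split decision is local, which is the kind of limitation a globally optimized, collinearity-aware objective is meant to address.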

[51] Multimodal Doctor-in-the-Loop: A Clinically-Guided Explainable Framework for Predicting Pathological Response in Non-Small Cell Lung Cancer

Alice Natalina Caragliano,Claudia Tacconi,Carlo Greco,Lorenzo Nibid,Edy Ippolito,Michele Fiore,Giuseppe Perrone,Sara Ramella,Paolo Soda,Valerio Guarrasi

Main category: cs.CV

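TL;DR: Proposes a new approach that combines multimodal deep learning with explainable AI to predict pathological response in non-small cell lung cancer patients, integrating imaging and clinical data through an intermediate fusion strategy and bringing clinicians into the training process.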

Details

Motivation: Existing radiomics and unimodal deep learning approaches are limited: they cannot effectively integrate multimodal data and lack clinical interpretability. Method: An intermediate fusion strategy integrates imaging and clinical data, and a Multimodal Doctor-in-the-Loop method embeds clinical knowledge into the training process. Result: Predictive accuracy and explainability improve, and the results offer guidance on strategies for integrating clinical data. Conclusion: The approach has advantages in multimodal data integration and clinical interpretability and is suited to clinical prediction tasks. Abstract: This study proposes a novel approach combining Multimodal Deep Learning with intrinsic eXplainable Artificial Intelligence techniques to predict pathological response in non-small cell lung cancer patients undergoing neoadjuvant therapy. Due to the limitations of existing radiomics and unimodal deep learning approaches, we introduce an intermediate fusion strategy that integrates imaging and clinical data, enabling efficient interaction between data modalities. The proposed Multimodal Doctor-in-the-Loop method further enhances clinical relevance by embedding clinicians' domain knowledge directly into the training process, guiding the model's focus gradually from broader lung regions to specific lesions. Results demonstrate improved predictive accuracy and explainability, providing insights into optimal data integration strategies for clinical applications.

[52] VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models

Mohammadreza Teymoorianfard,Shiqing Ma,Amir Houmansadr

Main category: cs.CV

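TL;DR: VIDSTAMP is a new video watermarking framework that embeds high-capacity watermarks in the latent space, preserving visual quality and withstanding video manipulations.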

Details

Motivation: Existing watermarking methods struggle with video-specific manipulations (such as frame insertion, dropping, or reordering) and degrade visual quality, so improvement is needed. Method: The decoder is fine-tuned in two stages, first on static image datasets to separate spatial information, then on synthesized video sequences to restore temporal consistency. Result: 768 bits per video (48 bits per frame) are embedded with 95.0% bit accuracy, and video quality is close to that of the unwatermarked version. Conclusion: VIDSTAMP improves on existing methods in the capacity-quality tradeoff with comparable robustness, and adds no inference cost. Abstract: The rapid rise of video diffusion models has enabled the generation of highly realistic and temporally coherent videos, raising critical concerns about content authenticity, provenance, and misuse. Existing watermarking approaches, whether passive, post-hoc, or adapted from image-based techniques, often struggle to withstand video-specific manipulations such as frame insertion, dropping, or reordering, and typically degrade visual quality. In this work, we introduce VIDSTAMP, a watermarking framework that embeds per-frame or per-segment messages directly into the latent space of temporally-aware video diffusion models. By fine-tuning the model's decoder through a two-stage pipeline, first on static image datasets to promote spatial message separation, and then on synthesized video sequences to restore temporal consistency, VIDSTAMP learns to embed high-capacity, flexible watermarks with minimal perceptual impact. Leveraging architectural components such as 3D convolutions and temporal attention, our method imposes no additional inference cost and offers better perceptual quality than prior methods, while maintaining comparable robustness against common distortions and tampering. VIDSTAMP embeds 768 bits per video (48 bits per frame) with a bit accuracy of 95.0%, achieves a log P-value of -166.65 (lower is better), and maintains a video quality score of 0.836, comparable to unwatermarked outputs (0.838) and surpassing prior methods in capacity-quality tradeoffs. Code: https://github.com/SPIN-UMass/VidStamp
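
The capacity figures in the abstract are linked by simple arithmetic: 768 bits per video at 48 bits per frame corresponds to 16 watermarked frames. The sketch below shows that calculation together with the usual definition of bit accuracy as the fraction of message bits decoded correctly. The toy message and the two flipped bits are made up for the example and are unrelated to the paper's reported 95.0%.

```python
def frames_per_video(bits_per_video, bits_per_frame):
    """Number of frames needed to carry the per-video payload."""
    return bits_per_video // bits_per_frame

def bit_accuracy(embedded, decoded):
    """Fraction of message bits recovered correctly."""
    matches = sum(e == d for e, d in zip(embedded, decoded))
    return matches / len(embedded)

frames = frames_per_video(768, 48)

# Hypothetical 48-bit frame message with two bits flipped by the channel.
embedded = [1, 0, 1, 1, 0, 0, 1, 0] * 6
decoded = list(embedded)
decoded[3] ^= 1
decoded[40] ^= 1
accuracy = bit_accuracy(embedded, decoded)
```

Because messages are embedded per frame, bit accuracy can also be computed frame by frame, which is what makes it possible to localize tampering such as dropped or reordered frames.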

cs.GR [Back]

[53] Model See Model Do: Speech-Driven Facial Animation with Style Control

Yifang Pan,Karan Singh,Luiz Gustavo Hafemann

Main category: cs.GR

TL;DR: Proposes an example-based generation framework that uses a latent diffusion model and a style reference clip to generate highly expressive, temporally coherent 3D facial animation, addressing the difficulty existing methods have in capturing and transferring nuanced performance styles.

Details

Motivation: Existing methods have made progress on accurate lip synchronization and basic emotional expression, but struggle to capture and transfer nuanced performance styles. Method: An example-based generation framework built on a latent diffusion model, with a style basis mechanism that extracts key poses from the reference clip to guide generation. Result: The method captures subtle stylistic cues while keeping the generated animation closely aligned with the input speech, achieving superior lip synchronization. Conclusion: Extensive evaluations show that the method faithfully reproduces the target style and achieves high-quality lip synchronization. Abstract: Speech-driven 3D facial animation plays a key role in applications such as virtual avatars, gaming, and digital content creation. While existing methods have made significant progress in achieving accurate lip synchronization and generating basic emotional expressions, they often struggle to capture and effectively transfer nuanced performance styles. We propose a novel example-based generation framework that conditions a latent diffusion model on a reference style clip to produce highly expressive and temporally coherent facial animations. To address the challenge of accurately adhering to the style reference, we introduce a novel conditioning mechanism called style basis, which extracts key poses from the reference and additively guides the diffusion generation process to fit the style without compromising lip synchronization quality. This approach enables the model to capture subtle stylistic cues while ensuring that the generated animations align closely with the input speech. Extensive qualitative, quantitative, and perceptual evaluations demonstrate the effectiveness of our method in faithfully reproducing the desired style while achieving superior lip synchronization across various speech scenarios.

[54] GENMO: A GENeralist Model for Human MOtion

Jiefeng Li,Jinkun Cao,Haotian Zhang,Davis Rempe,Jan Kautz,Umar Iqbal,Ye Yuan

Main category: cs.GR

TL;DR: GENMO is a unified generalist model for human motion that combines motion generation and estimation in a single framework, using constrained generation and diffusion to achieve both accuracy and diversity.

Details

Motivation: Traditional approaches treat motion generation and estimation separately, which limits knowledge sharing and model efficiency; GENMO aims to address this. Method: GENMO reformulates motion estimation as constrained generation, combines regression with diffusion, and uses video and text data to enhance diversity. Result: GENMO performs well across multiple tasks and handles complex conditions and multimodal inputs. Conclusion: GENMO demonstrates the effectiveness of a unified framework for motion tasks, with generation and estimation improving each other. Abstract: Human motion modeling traditionally separates motion generation and estimation into distinct tasks with specialized models. Motion generation models focus on creating diverse, realistic motions from inputs like text, audio, or keyframes, while motion estimation models aim to reconstruct accurate motion trajectories from observations like videos. Despite sharing underlying representations of temporal dynamics and kinematics, this separation limits knowledge transfer between tasks and requires maintaining separate models. We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Leveraging the synergy between regression and diffusion, GENMO achieves accurate global motion estimation while enabling diverse motion generation. We also introduce an estimation-guided training objective that exploits in-the-wild videos with 2D annotations and text descriptions to enhance generative diversity. Furthermore, our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control. This unified approach creates synergistic benefits: generative priors improve estimated motions under challenging conditions like occlusions, while diverse video data enhances generation capabilities. Extensive experiments demonstrate GENMO's effectiveness as a generalist framework that successfully handles multiple human motion tasks within a single model.

cs.CL [Back]

[55] FinBERT-QA: Financial Question Answering with pre-trained BERT Language Models

Bithiah Yuan

Main category: cs.CL

TL;DR: Proposes a BERT-based financial QA system for non-factoid answer selection that frames the task as re-ranking to improve efficiency and substantially improves results on the FiQA dataset.

Details

Motivation: The financial industry has a growing need for automatic analysis of unstructured and structured data at scale, and QA systems can give companies a competitive advantage by supporting the decision making of financial advisers. Method: The system combines BM25 retrieval with BERT re-ranking, investigates several further pre-training and fine-tuning approaches for BERT, and settles on Transfer and Adapt fine-tuning with pointwise learning. Result: On task 2 of the FiQA dataset, FinBERT-QA improves MRR by 16%, NDCG by 17%, and Precision@1 by 21%. Conclusion: The proposed FinBERT-QA system performs strongly on financial QA and clearly outperforms prior methods. Abstract: Motivated by the emerging demand in the financial industry for the automatic analysis of unstructured and structured data at scale, Question Answering (QA) systems can provide lucrative and competitive advantages to companies by facilitating the decision making of financial advisers. Consequently, we propose a novel financial QA system using the transformer-based pre-trained BERT language model to address the limitations of data scarcity and language specificity in the financial domain. Our system focuses on financial non-factoid answer selection, which retrieves a set of passage-level texts and selects the most relevant as the answer. To increase efficiency, we formulate the answer selection task as a re-ranking problem, in which our system consists of an Answer Retriever using BM25, a simple information retrieval approach, to first return a list of candidate answers, and an Answer Re-ranker built with variants of pre-trained BERT language models to re-rank and select the most relevant answers. We investigate various learning, further pre-training, and fine-tuning approaches for BERT. Our experiments suggest that FinBERT-QA, a model built from applying the Transfer and Adapt further fine-tuning and pointwise learning approach, is the most effective, improving the state-of-the-art results of task 2 of the FiQA dataset by 16% on MRR, 17% on NDCG, and 21% on Precision@1.
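
The three ranking metrics named above (MRR, NDCG, Precision@1) can be sketched as follows for a single query. This is an illustrative sketch with binary relevance, not the paper's evaluation code; the function names, the cutoff `k=10`, and the toy candidate IDs are assumptions made for the example.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant answer, or 0.0 if none is retrieved."""
    for rank, answer_id in enumerate(ranked_ids, start=1):
        if answer_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def precision_at_1(ranked_ids, relevant_ids):
    """1.0 if the top-ranked answer is relevant, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, answer_id in enumerate(ranked_ids[:k], start=1)
        if answer_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical re-ranked candidate list for one question; "a2" is the only relevant answer.
ranked = ["a1", "a2", "a3"]
relevant = {"a2"}

print(reciprocal_rank(ranked, relevant))
print(precision_at_1(ranked, relevant))
print(ndcg_at_k(ranked, relevant))
```

MRR is the mean of `reciprocal_rank` over all test questions; the other two metrics are averaged over questions in the same way.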

[56] A Survey on Large Language Model based Human-Agent Systems

Henry Peng Zou,Wei-Chieh Huang,Yaozu Wu,Yankai Chen,Chunyu Miao,Hoang Nguyen,Yue Zhou,Weizhi Zhang,Liancheng Fang,Langzhou He,Yangning Li,Yuwei Cao,Dongyuan Li,Renhe Jiang,Philip S. Yu

Main category: cs.CL

TL;DR: Surveys large language model (LLM) based human-agent systems (LLM-HAS), covering their core components, applications, and challenges, with the aim of advancing research in this interdisciplinary field.

Details

Motivation: Fully autonomous LLM agents suffer from limited reliability, difficulty with complex tasks, and ethical risks; LLM-HAS bring humans into the loop to improve system performance and safety. Method: Systematically reviews the fundamental concepts of LLM-HAS, their core components (environment and profiling, human feedback, interaction types, and so on), and emerging applications. Result: Provides a comprehensive survey of LLM-HAS and identifies research directions and challenges. Conclusion: LLM-HAS overcome limitations of LLMs through human-agent collaboration, and the survey offers a structured framework for future research. Abstract: Recent advances in large language models (LLMs) have sparked growing interest in building fully autonomous agents. However, fully autonomous LLM-based agents still face significant challenges, including limited reliability due to hallucinations, difficulty in handling complex tasks, and substantial safety and ethical risks, all of which limit their feasibility and trustworthiness in real-world applications. To overcome these limitations, LLM-based human-agent systems (LLM-HAS) incorporate human-provided information, feedback, or control into the agent system to enhance system performance, reliability and safety. This paper provides the first comprehensive and structured survey of LLM-HAS. It clarifies fundamental concepts, systematically presents core components shaping these systems, including environment & profiling, human feedback, interaction types, orchestration and communication, explores emerging applications, and discusses unique challenges and opportunities. By consolidating current knowledge and offering a structured overview, we aim to foster further research and innovation in this rapidly evolving interdisciplinary field. Paper lists and resources are available at https://github.com/HenryPengZou/Awesome-LLM-Based-Human-Agent-System-Papers.

[57] Reasoning Capabilities and Invariability of Large Language Models

Alessandro Raganato,Rafael Peñaloza,Marco Viviani,Gabriella Pasi

Main category: cs.CL

TL;DR: Analyzes how large language models (LLMs) perform on simple reasoning tasks, focusing on their dependence on the prompt, and introduces a new benchmark dataset for testing.

Details

Motivation: Although LLMs perform impressively in natural language processing, their ability on simple reasoning tasks is still questioned, so a systematic evaluation of their reasoning competence is needed. Method: Introduces a benchmark dataset based on geometric figures, tests 24 LLMs of different sizes under zero-shot and few-shot prompting, and additionally tests chain-of-thought prompting. Result: LLMs with over 70 billion parameters perform better in the zero-shot setting but still leave room for improvement; the effect of chain-of-thought prompting depends on whether the rationale is requested before or after the answer. Conclusion: LLMs still fall short on simple reasoning tasks, and prompt design has a significant effect on their performance. Abstract: Large Language Models (LLMs) have shown remarkable capabilities in manipulating natural language across multiple applications, but their ability to handle simple reasoning tasks is often questioned. In this work, we aim to provide a comprehensive analysis of LLMs' reasoning competence, specifically focusing on their prompt dependency. In particular, we introduce a new benchmark dataset with a series of simple reasoning questions demanding shallow logical reasoning. Aligned with cognitive psychology standards, the questions are confined to a basic domain revolving around geometric figures, ensuring that responses are independent of any pre-existing intuition about the world and rely solely on deduction. An empirical analysis involving zero-shot and few-shot prompting across 24 LLMs of different sizes reveals that, while LLMs with over 70 billion parameters perform better in the zero-shot setting, there is still substantial room for improvement. An additional test with chain-of-thought prompting over 22 LLMs shows that this additional prompt can aid or damage the performance of models, depending on whether the rationale is required before or after the answer.

[58] Knowledge-augmented Pre-trained Language Models for Biomedical Relation Extraction

Mario Sänger,Ulf Leser

Main category: cs.CL

TL;DR: Studies pre-trained language models (PLMs) for biomedical relation extraction (RE), comparing PLMs augmented with different kinds of context information within a consistent evaluation framework, and finds that model choice and hyperparameter optimization strongly affect performance.

Details

Motivation: Relation extraction from biomedical literature is critical for managing the large volume of scientific knowledge, but differences in models, data, and evaluation methods make existing studies hard to compare directly; this study aims to fill that gap. Method: Evaluates three baseline PLMs on five datasets with hyperparameter optimization, then augments the best-performing model with entity descriptions, knowledge graph information, and molecular structure encodings. Result: Model choice and hyperparameter optimization have a significant effect on performance; context information helps smaller PLMs considerably but yields only minor overall improvements. Conclusion: The study underlines the importance of model choice and hyperparameter optimization; context information benefits smaller PLMs, but overall gains are limited. Abstract: Automatic relationship extraction (RE) from biomedical literature is critical for managing the vast amount of scientific knowledge produced each year. In recent years, utilizing pre-trained language models (PLMs) has become the prevalent approach in RE. Several studies report improved performance when incorporating additional context information while fine-tuning PLMs for RE. However, variations in the PLMs applied, the databases used for augmentation, hyper-parameter optimization, and evaluation methods complicate direct comparisons between studies and raise questions about the generalizability of these findings. Our study addresses this research gap by evaluating PLMs enhanced with contextual information on five datasets spanning four relation scenarios within a consistent evaluation framework. We evaluate three baseline PLMs and first conduct extensive hyperparameter optimization. After selecting the top-performing model, we enhance it with additional data, including textual entity descriptions, relational information from knowledge graphs, and molecular structure encodings. Our findings illustrate the importance of i) the choice of the underlying language model and ii) a comprehensive hyperparameter optimization for achieving strong extraction performance. Although inclusion of context information yields only minor overall improvements, an ablation study reveals substantial benefits for smaller PLMs when such external data was included during fine-tuning.

[59] Large Language Model-Driven Dynamic Assessment of Grammatical Accuracy in English Language Learner Writing

Timur Jaganov,John Blake,Julián Villegas,Nicholas Carr

Main category: cs.CL

TL;DR: Shows that large language models (LLMs) can scale up Dynamic Assessment (DA), with GPT-4o performing best at generating clear, consistent hints.

Details

Motivation: To explore the potential of LLMs to scale up Dynamic Assessment in the language learning classroom. Method: Develops the DynaWrite application, tests 21 LLMs, and focuses the evaluation on GPT-4o and Neural Chat. Result: GPT-4o and Neural Chat identify grammatical errors with similar accuracy, but GPT-4o produces higher-quality hints; system responsiveness and stability are also good. Conclusion: LLMs can effectively scale up Dynamic Assessment to larger language learning settings. Abstract: This study investigates the potential for Large Language Models (LLMs) to scale up Dynamic Assessment (DA). To facilitate such an investigation, we first developed DynaWrite, a modular, microservices-based grammatical tutoring application which supports multiple LLMs to generate dynamic feedback to learners of English. Initial testing of 21 LLMs revealed GPT-4o and Neural Chat to have the most potential to scale up DA in the language learning classroom. Further testing of these two candidates found both models performed similarly in their ability to accurately identify grammatical errors in user sentences. However, GPT-4o consistently outperformed Neural Chat in the quality of its DA by generating clear, consistent, and progressively explicit hints. Real-time responsiveness and system stability were also confirmed through detailed performance testing, with GPT-4o exhibiting sufficient speed and stability. This study shows that LLMs can be used to scale up dynamic assessment and thus enable dynamic assessment to be delivered to larger groups than possible in traditional teacher-learner settings.

[60] Llama-Nemotron: Efficient Reasoning Models

Akhiad Bercovich,Itay Levy,Izik Golan,Mohammad Dabbah,Ran El-Yaniv,Omri Puny,Ido Galil,Zach Moshe,Tomer Ronen,Najeeb Nabwani,Ido Shahaf,Oren Tropp,Ehud Karpas,Ran Zilberstein,Jiaqi Zeng,Soumye Singhal,Alexander Bukharin,Yian Zhang,Tugrul Konuk,Gerald Shen,Ameya Sunil Mahabaleshwarkar,Bilal Kartal,Yoshi Suhara,Olivier Delalleau,Zijia Chen,Zhilin Wang,David Mosallanezhad,Adi Renduchintala,Haifeng Qian,Dima Rekesh,Fei Jia,Somshubra Majumdar,Vahid Noroozi,Wasi Uddin Ahmad,Sean Narenthiran,Aleksander Ficek,Mehrzad Samadi,Jocelyn Huang,Siddhartha Jain,Igor Gitman,Ivan Moshkov,Wei Du,Shubham Toshniwal,George Armstrong,Branislav Kisacanin,Matvei Novikov,Daria Gitman,Evelina Bakhturina,Jane Polak Scowcroft,John Kamalu,Dan Su,Kezhi Kong,Markus Kliegl,Rabeeh Karimi,Ying Lin,Sanjeev Satheesh,Jupinder Parmar,Pritam Gundecha,Brandon Norick,Joseph Jennings,Shrimai Prabhumoye,Syeda Nahida Akter,Mostofa Patwary,Abhinav Khattar,Deepak Narayanan,Roger Waleffe,Jimmy Zhang,Bor-Yiing Su,Guyue Huang,Terry Kong,Parth Chadha,Sahil Jain,Christine Harvey,Elad Segal,Jining Huang,Sergey Kashirsky,Robert McQueen,Izzy Putterman,George Lam,Arun Venkatesan,Sherry Wu,Vinh Nguyen,Manoj Kilaru,Andrew Wang,Anna Warno,Abhilash Somasamudramath,Sandip Bhaskar,Maka Dong,Nave Assaf,Shahar Mor,Omer Ullman Argov,Scot Junkin,Oleksandr Romanenko,Pedro Larroy,Monika Katariya,Marco Rovinelli,Viji Balas,Nicholas Edelman,Anahita Bhiwandiwalla,Muthu Subramaniam,Smita Ithape,Karthik Ramamoorthy,Yuting Wu,Suguna Varshini Velury,Omri Almog,Joyjit Daw,Denys Fridman,Erick Galinkin,Michael Evans,Katherine Luna,Leon Derczynski,Nikki Pope,Eileen Long,Seth Schneider,Guillermo Siman,Tomasz Grzegorzek,Pablo Ribalta,Monika Katariya,Joey Conway,Trisha Saar,Ann Guan,Krzysztof Pawelec,Shyamala Prayaga,Oleksii Kuchaiev,Boris Ginsburg,Oluwatobi Olabiyi,Kari Briski,Jonathan Cohen,Bryan Catanzaro,Jonah Alben,Yonatan Geifman,Eric Chung

Main category: cs.CL

TL;DR: The Llama-Nemotron series is an open family of heterogeneous reasoning models offering strong reasoning capabilities, efficient inference, and a commercially friendly license.

Details

Motivation: To develop an open, efficient family of reasoning models that competes with state-of-the-art models while offering higher inference throughput and memory efficiency. Method: The models are optimized through neural architecture search, knowledge distillation, and continued pretraining, followed by a post-training stage of supervised fine-tuning and large-scale reinforcement learning. Result: The models perform well in both inference efficiency and reasoning quality, support a dynamic reasoning toggle, and are released together with the dataset and training code. Conclusion: The Llama-Nemotron series gives the open-source community and enterprises efficient reasoning models and supports open research and model development. Abstract: We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes -- Nano (8B), Super (49B), and Ultra (253B) -- and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large-scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models -- LN-Nano, LN-Super, and LN-Ultra -- under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.

[61] A Character-based Diffusion Embedding Algorithm for Enhancing the Generation Quality of Generative Linguistic Steganographic Texts

Yingquan Chen,Qianmu Li,Xiaocong Wu,Huifeng Li,Qing Chang

Main category: cs.CL

TL;DR: Proposes a character-based diffusion embedding algorithm (CDEA) which, combined with the XLNet model, significantly improves the quality of generated steganographic text, particularly its perceptual imperceptibility.

Details

Motivation: Existing models have limited text generation capabilities, and embedding algorithms fail to mitigate the negative impact of the sensitive information's properties (such as semantic content or randomness), which degrades the semantic coherence and logical fluency of steganographic text. Method: Proposes CDEA, which leverages the properties of the sensitive information: using character-level statistical properties and power-law-based grouping, it raises the selection frequency of high-probability candidate words and lowers that of low-probability ones, and it is combined with XLNet to handle long sequences. Result: Experiments show that combining CDEA with XLNet significantly improves the quality of generated steganographic text, particularly in terms of perceptual imperceptibility. Conclusion: By exploiting the properties of the sensitive information, CDEA improves steganographic text quality and offers a new approach to generating high-quality steganographic text. Abstract: Generating high-quality steganographic text is a fundamental challenge in the field of generative linguistic steganography. This challenge arises primarily from two aspects: firstly, the capabilities of existing models in text generation are limited; secondly, embedding algorithms fail to effectively mitigate the negative impacts of sensitive information's properties, such as semantic content or randomness. Specifically, to ensure that the recipient can accurately extract hidden information, embedding algorithms often have to consider selecting candidate words with relatively low probabilities. This phenomenon leads to a decrease in the number of high-probability candidate words and an increase in low-probability candidate words, thereby compromising the semantic coherence and logical fluency of the steganographic text and diminishing the overall quality of the generated steganographic material. To address this issue, this paper proposes a novel embedding algorithm, character-based diffusion embedding algorithm (CDEA). Unlike existing embedding algorithms that strive to eliminate the impact of sensitive information's properties on the generation process, CDEA leverages sensitive information's properties. It enhances the selection frequency of high-probability candidate words in the candidate pool based on general statistical properties at the character level and grouping methods based on power-law distributions, while reducing the selection frequency of low-probability candidate words in the candidate pool. Furthermore, to ensure the effective transformation of sensitive information in long sequences, we also introduce the XLNet model. Experimental results demonstrate that the combination of CDEA and XLNet significantly improves the quality of generated steganographic text, particularly in terms of perceptual imperceptibility.

[62] Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models

Xuhui Jiang,Shengjie Ma,Chengjin Xu,Cehao Yang,Liyu Zhang,Jian Guo

Main category: cs.CL

TL;DR: The SoG framework builds a context graph to capture cross-document knowledge associations, improving the diversity and coherence of synthetic data and outperforming existing methods.

Details

Motivation: LLMs are data-inefficient on small, specialized corpora, and existing synthetic data methods overlook cross-document knowledge associations. Method: SoG extracts entities and concepts to build a context graph, then combines a graph walk sampling strategy with CoT and CC synthesis to generate high-quality synthetic data. Result: SoG outperforms the SOTA method on a multi-hop document Q&A dataset, performs comparably on reading comprehension datasets, and generalizes better. Conclusion: SoG offers an efficient solution for knowledge acquisition in domains with limited data and advances synthetic data generation. Abstract: Large Language Models (LLMs) have achieved remarkable success but remain data-inefficient, especially when learning from small, specialized corpora with limited and proprietary data. Existing synthetic data generation methods for continued pre-training focus on intra-document content and overlook cross-document knowledge associations, limiting content diversity and depth. We propose Synthesize-on-Graph (SoG), a synthetic data generation framework that incorporates cross-document knowledge associations for efficient corpus expansion. SoG constructs a context graph by extracting entities and concepts from the original corpus, representing cross-document associations, and employing a graph walk strategy for knowledge-associated sampling. This enhances synthetic data diversity and coherence, enabling models to learn complex knowledge structures and handle rare knowledge. To further improve synthetic data quality, we integrate Chain-of-Thought (CoT) and Contrastive Clarifying (CC) synthesis, enhancing reasoning processes and discriminative power. Experiments show that SoG outperforms the state-of-the-art (SOTA) method on a multi-hop document Q&A dataset while performing comparably to the SOTA method on the reading comprehension task datasets, which also underscores the better generalization capability of SoG. Our work advances synthetic data generation and provides practical solutions for efficient knowledge acquisition in LLMs, especially in domains with limited data availability.
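
The idea of sampling related documents by walking a context graph can be illustrated with a toy sketch. This is not the SoG implementation: the helpers `build_context_graph` and `graph_walk`, the uniform random walk, and the hand-written entity sets are all assumptions made for the example (the abstract does not specify the graph construction or walk strategy in this detail).

```python
import random
from collections import defaultdict


def build_context_graph(doc_entities):
    """Link two documents whenever they share at least one entity (toy construction)."""
    entity_to_docs = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            entity_to_docs[entity].add(doc_id)
    graph = {doc_id: set() for doc_id in doc_entities}
    for docs in entity_to_docs.values():
        for doc_id in docs:
            graph[doc_id] |= docs - {doc_id}
    return graph


def graph_walk(graph, start, steps, rng):
    """Uniform random walk; the visited documents form one sampling context."""
    path = [start]
    current = start
    for _ in range(steps):
        neighbours = sorted(graph[current])  # sorted for reproducibility
        if not neighbours:
            break  # isolated document: nothing to associate with
        current = rng.choice(neighbours)
        path.append(current)
    return path


# Hypothetical corpus: entity sets would come from an extraction step not shown here.
doc_entities = {
    "d1": {"BERT", "BM25"},
    "d2": {"BM25"},
    "d3": {"XLNet"},
}
graph = build_context_graph(doc_entities)
print(graph_walk(graph, "d1", steps=3, rng=random.Random(0)))
```

Documents visited along one walk would then be passed together to a generator model, so that the synthetic text can draw on facts from more than one source document.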

[63] Position: Enough of Scaling LLMs! Lets Focus on Downscaling

Ayan Sengupta,Yash Goel,Tanmoy Chakraborty

Main category: cs.CL

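TL;DR: Argues for shifting the development paradigm of large language models (LLMs) from neural scaling laws to downscaling, highlighting the limitations of traditional scaling and proposing a framework that drastically reduces resource demands.

Details

Motivation: Traditional scaling has significant problems with computational efficiency, environmental impact, and deployment constraints, so a more sustainable and efficient approach is needed. Method: Proposes a holistic framework for downscaling LLMs that seeks to maintain performance while reducing resource demands, and outlines practical strategies for moving away from the traditional scaling paradigm. Result: The authors argue that downscaled LLMs can maintain performance while consuming fewer resources. Conclusion: Downscaling is a more sustainable, efficient, and accessible approach to LLM development that merits further research. Abstract: We challenge the dominant focus on neural scaling laws and advocate for a paradigm shift toward downscaling in the development of large language models (LLMs). While scaling laws have provided critical insights into performance improvements through increasing model and dataset size, we emphasize the significant limitations of this approach, particularly in terms of computational inefficiency, environmental impact, and deployment constraints. To address these challenges, we propose a holistic framework for downscaling LLMs that seeks to maintain performance while drastically reducing resource demands. This paper outlines practical strategies for transitioning away from traditional scaling paradigms, advocating for a more sustainable, efficient, and accessible approach to LLM development.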

[64] VTS-LLM: Domain-Adaptive LLM Agent for Enhancing Awareness in Vessel Traffic Services through Natural Language

Sijin Sun,Liangbin Zhao,Ming Deng,Xiuju Fu

Main category: cs.CL

TL;DR: Proposes VTS-LLM Agent, the first domain-adaptive LLM agent for interactive decision support in VTS operations; it identifies risk-prone vessels through a knowledge-augmented Text-to-SQL task and performs well on queries in multiple linguistic styles.

Details

Motivation: Existing VTS systems are limited in spatiotemporal reasoning and intuitive human interaction, and must cope with increasingly complex traffic and multimodal data. Method: Combines structured vessel databases with external maritime knowledge, builds a curated dataset, and uses NER-based relational reasoning, domain knowledge injection, a semantic algebra intermediate representation, and a query rethink mechanism. Result: VTS-LLM outperforms general-purpose and SQL-focused baselines on command-style, operational-style, and formal natural language queries. Conclusion: Lays the foundation for natural language interfaces in VTS and advances proactive, LLM-driven real-time maritime traffic management. Abstract: Vessel Traffic Services (VTS) are essential for maritime safety and regulatory compliance through real-time traffic management. However, with increasing traffic complexity and the prevalence of heterogeneous, multimodal data, existing VTS systems face limitations in spatiotemporal reasoning and intuitive human interaction. In this work, we propose VTS-LLM Agent, the first domain-adaptive LLM agent tailored for interactive decision support in VTS operations. We formalize risk-prone vessel identification as a knowledge-augmented Text-to-SQL task, combining structured vessel databases with external maritime knowledge. To support this, we construct a curated benchmark dataset consisting of a custom schema, domain-specific corpus, and a query-SQL test set in multiple linguistic styles. Our framework incorporates NER-based relational reasoning, agent-based domain knowledge injection, semantic algebra intermediate representation, and query rethink mechanisms to enhance domain grounding and context-aware understanding. Experimental results show that VTS-LLM outperforms both general-purpose and SQL-focused baselines under command-style, operational-style, and formal natural language queries. Moreover, our analysis provides the first empirical evidence that linguistic style variation introduces systematic performance challenges in Text-to-SQL modeling. This work lays the foundation for natural language interfaces in vessel traffic services and opens new opportunities for proactive, LLM-driven maritime real-time traffic management.

[65] Token-free Models for Sarcasm Detection

Sumit Mamtani,Maitreya Sonawane,Kanika Agarwal,Nishanth Sanjeev

Main category: cs.CL

TL;DR: Evaluates two token-free models (ByT5 and CANINE) on sarcasm detection in social media and non-social media domains, finding that they outperform token-based models and set new state-of-the-art results.

Details

Motivation: Tokenization introduces vocabulary mismatch and out-of-vocabulary problems in natural language processing, and token-free models may address these limitations. Method: Fine-tunes ByT5 and CANINE and benchmarks them against token-based models on sarcasm detection. Result: ByT5-small and CANINE improve accuracy by 0.77% and 0.49% on the News Headlines and Twitter Sarcasm datasets, respectively. Conclusion: Token-free models show promise in noisy, informal domains such as social media. Abstract: Tokenization is a foundational step in most natural language processing (NLP) pipelines, yet it introduces challenges such as vocabulary mismatch and out-of-vocabulary issues. Recent work has shown that models operating directly on raw text at the byte or character level can mitigate these limitations. In this paper, we evaluate two token-free models, ByT5 and CANINE, on the task of sarcasm detection in both social media (Twitter) and non-social media (news headlines) domains. We fine-tune and benchmark these models against token-based baselines and state-of-the-art approaches. Our results show that ByT5-small and CANINE outperform token-based counterparts and achieve new state-of-the-art performance, improving accuracy by 0.77% and 0.49% on the News Headlines and Twitter Sarcasm datasets, respectively. These findings underscore the potential of token-free models for robust NLP in noisy and informal domains such as social media.

[66] Value Portrait: Understanding Values of LLMs with Human-aligned Benchmark

Jongwook Han,Dongmin Choi,Woojung Song,Eun-Ju Lee,Yohan Jo

Main category: cs.CL

TL;DR: Proposes Value Portrait, a new benchmark for evaluating the value orientations of language models that addresses the bias and ecological validity problems of existing benchmarks.

Details

Motivation: Existing benchmarks rely on human or machine annotations that are vulnerable to value-related biases, and their test scenarios diverge from real-world usage. Method: Designs evaluation items from real user-LLM interactions and ensures reliability through human ratings and psychometric validation. Result: An evaluation of 27 LLMs finds that they prioritize Benevolence, Security, and Self-Direction values while placing less emphasis on Tradition, Power, and Achievement, and reveals biases in how the models perceive certain demographic groups. Conclusion: The Value Portrait benchmark provides a more realistic and reliable way to assess values, revealing the value orientations of LLMs and their biases. Abstract: The importance of benchmarks for assessing the values of language models has become more pronounced due to the growing need for more authentic, human-aligned responses. However, existing benchmarks rely on human or machine annotations that are vulnerable to value-related biases. Furthermore, the tested scenarios often diverge from real-world contexts in which models are commonly used to generate text and express values. To address these issues, we propose the Value Portrait benchmark, a reliable framework for evaluating LLMs' value orientations with two key characteristics. First, the benchmark consists of items that capture real-life user-LLM interactions, enhancing the relevance of assessment results to real-world LLM usage and thus ecological validity. Second, each item is rated by human subjects based on its similarity to their own thoughts, and correlations between these ratings and the subjects' actual value scores are derived. This psychometrically validated approach ensures that items strongly correlated with specific values serve as reliable items for assessing those values. Through evaluating 27 LLMs with our benchmark, we find that these models prioritize Benevolence, Security, and Self-Direction values while placing less emphasis on Tradition, Power, and Achievement values. Also, our analysis reveals biases in how LLMs perceive various demographic groups, deviating from real human data.

[67] Do We Need a Detailed Rubric for Automated Essay Scoring using Large Language Models?

Lui Yoshida

Main category: cs.CL

TL;DR: Examines the necessity and impact of a detailed rubric in automated essay scoring (AES) with large language models (LLMs), finding that a simplified rubric preserves scoring accuracy in most cases while reducing computational cost.

Details

Motivation: Detailed rubrics are common in LLM-based AES but are time-consuming to create and increase token usage; the study tests whether a simplified rubric is viable. Method: Using the TOEFL11 dataset, compares the scoring accuracy of four LLMs (Claude 3.5 Haiku, Gemini 1.5 Flash, GPT-4o-mini, and Llama 3 70B Instruct) under three conditions: full rubric, simplified rubric, and no rubric. Result: Three of the four models keep similar accuracy with the simplified rubric as with the detailed one while using significantly fewer tokens; Gemini 1.5 Flash, however, performs worse with more detailed rubrics. Conclusion: A simplified rubric is sufficient and more efficient for most LLM-based AES applications, but model-specific evaluation is still needed because performance patterns vary across models. Abstract: This study investigates the necessity and impact of a detailed rubric in automated essay scoring (AES) using large language models (LLMs). While using rubrics is standard in LLM-based AES, creating detailed rubrics requires substantial effort and increases token usage. We examined how different levels of rubric detail affect scoring accuracy across multiple LLMs using the TOEFL11 dataset. Our experiments compared three conditions: a full rubric, a simplified rubric, and no rubric, using four different LLMs (Claude 3.5 Haiku, Gemini 1.5 Flash, GPT-4o-mini, and Llama 3 70B Instruct). Results showed that three out of four models maintained similar scoring accuracy with the simplified rubric compared to the detailed one, while significantly reducing token usage. However, one model (Gemini 1.5 Flash) showed decreased performance with more detailed rubrics. The findings suggest that simplified rubrics may be sufficient for most LLM-based AES applications, offering a more efficient alternative without compromising scoring accuracy. However, model-specific evaluation remains crucial as performance patterns vary across different LLMs.

Yijie Jin,Junjie Peng,Xuanchao Lin,Haochen Yuan,Lan Wang,Cangzhi Zheng

Main category: cs.CL

TL;DR: Proposes GsiT, a graph-structured model that uses an Interlaced Mask mechanism to make Multimodal Transformers more efficient, cutting parameters by two thirds while improving performance.

Details

Motivation: Multimodal Transformers (MulTs) are inefficient for multimodal sentiment analysis and need optimization. Method: Models MulTs as hierarchical modal-wise heterogeneous graphs (HMHGs) and proposes GsiT, which uses an Interlaced Mask mechanism to share weights. Result: GsiT has only 1/3 of the parameters of traditional MulTs with significantly higher performance, and its effectiveness is validated by integration into several SOTA models. Conclusion: GsiT and the HMHG concept are efficient and effective for multimodal sentiment analysis. Abstract: Multimodal Sentiment Analysis (MSA) is a rapidly developing field that integrates multimodal information to recognize sentiments, and existing models have made significant progress in this area. The central challenge in MSA is multimodal fusion, which is predominantly addressed by Multimodal Transformers (MulTs). Although they serve as the paradigm, MulTs suffer from efficiency concerns. In this work, from the perspective of efficiency optimization, we propose and prove that MulTs are hierarchical modal-wise heterogeneous graphs (HMHGs), and we introduce the graph-structured representation pattern of MulTs. Based on this pattern, we propose an Interlaced Mask (IM) mechanism to design the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT). It is formally equivalent to MulTs and achieves an efficient weight-sharing mechanism without information disorder through IM, enabling All-Modal-In-One fusion with only 1/3 of the parameters of pure MulTs. A Triton kernel called Decomposition is implemented to avoid additional computational overhead. Moreover, it achieves significantly higher performance than traditional MulTs. To further validate the effectiveness of GsiT itself and the HMHG concept, we integrate them into multiple state-of-the-art models and demonstrate notable performance improvements and parameter reduction on widely used MSA datasets.

[69] MateICL: Mitigating Attention Dispersion in Large-Scale In-Context Learning

Murtadha Ahmed,Wenbo,Liu yunfeng

Main category: cs.CL

TL;DR: MateICL splits the context into windows and recalibrates attention weights to address attention dispersion in in-context learning with large language models, improving performance.

Details

Motivation: Fixed position length constraints and attention dispersion limit the effectiveness of large-scale in-context learning. Method: Splits the context into windows that are processed separately, and adds an extra layer to recalibrate attention weights. Result: MateICL makes effective use of larger contexts to improve performance, outperforming retrieval-based baselines without requiring an external retrieval model. Conclusion: MateICL remains beneficial in resource-constrained settings, and the code is publicly available. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in In-Context Learning (ICL). However, the fixed position length constraints in pre-trained models limit the number of demonstration examples. Recent efforts to extend context suffer from attention dispersion as the number of demonstrations increases. In this paper, we introduce Mitigating Attention Dispersion in large-scale ICL (MateICL) that enables LLMs to maintain effective self-attention as the context size grows. We first split the context into multiple windows, each filled to the model's context capacity, which are processed separately. Then, we introduce an additional layer to recalibrate the attention weights, prioritizing the query tokens as the number of demonstrations increases. Our empirical results show that MateICL can effectively leverage larger contexts to improve ICL performance. Compared to retrieval-based baselines, MateICL consistently achieves better performance without requiring an externally trained retrieval model. Despite recent advances in inference strategies (e.g., 32k token contexts), our results demonstrate that MateICL remains beneficial in computationally resource-constrained settings. The code is publicly available at https://github.com/amurtadha/MateICL.
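
The first step described above, packing demonstrations into windows that each fit the model's context capacity, can be sketched as a greedy packing routine. This is an assumption-laden illustration rather than the MateICL code: `split_into_windows`, the greedy strategy, and the token counts are hypothetical, and the attention recalibration layer is not modelled.

```python
def split_into_windows(demo_lengths, capacity):
    """Greedily pack demonstrations (given as token counts) into windows.

    Returns a list of windows, each a list of demonstration indices whose
    total length does not exceed `capacity`.
    """
    windows, current, used = [], [], 0
    for index, length in enumerate(demo_lengths):
        if length > capacity:
            raise ValueError(f"demonstration {index} exceeds the window capacity")
        if used + length > capacity:
            windows.append(current)   # close the full window
            current, used = [], 0
        current.append(index)
        used += length
    if current:
        windows.append(current)
    return windows


# Hypothetical token counts for five demonstrations and a 100-token window.
print(split_into_windows([40, 50, 30, 60, 20], capacity=100))
```

Each window would then be encoded separately, which keeps every forward pass within the pre-trained position limit regardless of how many demonstrations are supplied.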

[70] On the Limitations of Steering in Language Model Alignment

Chebrolu Niranjan,Kokil Jaidka,Gerard Christopher Yeo

Main category: cs.CL

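TL;DR: This paper evaluates the limitations of steering vectors for language model alignment, finding them effective for specific tasks such as value alignment but possibly insufficient in complex scenarios.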

Details

Motivation: Investigates the limitations of steering vectors as an alignment mechanism and lays groundwork for future research on steering reasoning models. Method: Uses a framework of transformer hook interventions and antonym-based function vectors to evaluate how prompt structure and context complexity affect steering effectiveness. Result: Steering vectors perform well on specific alignment tasks but lack robustness in complex scenarios. Conclusion: Steering vectors suit specific alignment tasks, but further research is needed to improve their applicability in complex scenarios. Abstract: Steering vectors are a promising approach to aligning language model behavior at inference time. In this paper, we propose a framework to assess the limitations of steering vectors as alignment mechanisms. Using a framework of transformer hook interventions and antonym-based function vectors, we evaluate the role of prompt structure and context complexity in steering effectiveness. Our findings indicate that steering vectors are promising for specific alignment tasks, such as value alignment, but may not provide a robust foundation for general-purpose alignment in LLMs, particularly in complex scenarios. We establish a methodological foundation for future investigations into steering capabilities of reasoning models.

[71] Gender Bias in Explainability: Investigating Performance Disparity in Post-hoc Methods

Mahdi Dhaini,Ege Erdogan,Nils Feldhus,Gjergji Kasneci

Main category: cs.CL

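TL;DR: Widely used post-hoc feature attribution methods show significant gender disparities in faithfulness, robustness, and complexity, even when models are trained on unbiased datasets.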

Details

Motivation: Examines the fairness of explanation methods, in particular disparities in their performance across subgroups. Method: Analyzes gender disparity in post-hoc feature attribution methods across three tasks and five language models. Result: The disparities persist even with training on unbiased data and can lead to biased outcomes against certain subgroups. Conclusion: Fairness must be considered when developing and applying explanation methods and should be incorporated into regulatory frameworks. Abstract: While research on applications and evaluations of explanation methods continues to expand, fairness of the explanation methods concerning disparities in their performance across subgroups remains an often overlooked aspect. In this paper, we address this gap by showing that, across three tasks and five language models, widely used post-hoc feature attribution methods exhibit significant gender disparity with respect to their faithfulness, robustness, and complexity. These disparities persist even when the models are pre-trained or fine-tuned on particularly unbiased datasets, indicating that the disparities we observe are not merely consequences of biased training data. Our results highlight the importance of addressing disparities in explanations when developing and applying explainability methods, as these can lead to biased outcomes against certain subgroups, with particularly critical implications in high-stakes contexts. Furthermore, our findings underscore the importance of incorporating the fairness of explanations, alongside overall model fairness and explainability, as a requirement in regulatory frameworks.

[72] EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models

Mahdi Dhaini,Kafaite Zahra Hussain,Efstratios Zaradoukas,Gjergji Kasneci

Main category: cs.CL

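TL;DR: EvalxNLP is a Python framework for benchmarking explanation methods for NLP models; it integrates multiple XAI techniques, supports generating and evaluating explanations, and received high user satisfaction.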

Details

Motivation: As NLP models spread into high-stakes applications, interpretability becomes a key challenge, and explanation methods need to be tailored to different use cases. Method: EvalxNLP integrates eight XAI techniques, supports generating and evaluating explanations based on properties such as faithfulness, plausibility, and complexity, and provides interactive LLM-based textual explanations. Result: Human evaluation indicates high user satisfaction, suggesting EvalxNLP is a promising framework for benchmarking explanation methods across diverse user groups. Conclusion: EvalxNLP aims to democratize explainability tools and support the systematic comparison and advancement of XAI techniques. Abstract: As Natural Language Processing (NLP) models continue to evolve and become integral to high-stakes applications, ensuring their interpretability remains a critical challenge. Given the growing variety of explainability methods and diverse stakeholder requirements, frameworks that help stakeholders select appropriate explanations tailored to their specific use cases are increasingly important. To address this need, we introduce EvalxNLP, a Python framework for benchmarking state-of-the-art feature attribution methods for transformer-based NLP models. EvalxNLP integrates eight widely recognized explainability techniques from the Explainable AI (XAI) literature, enabling users to generate and evaluate explanations based on key properties such as faithfulness, plausibility, and complexity. Our framework also provides interactive, LLM-based textual explanations, facilitating user understanding of the generated explanations and evaluation outcomes. Human evaluation results indicate high user satisfaction with EvalxNLP, suggesting it is a promising framework for benchmarking explanation methods across diverse user groups. By offering a user-friendly and extensible platform, EvalxNLP aims at democratizing explainability tools and supporting the systematic comparison and advancement of XAI techniques in NLP.

[73] PREMISE: Matching-based Prediction for Accurate Review Recommendation

Wei Han,Hui Chen,Soujanya Poria

Main category: cs.CL

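TL;DR: PREMISE is a matching-based multimodal learning architecture that improves task performance through multi-scale, multi-field representations and matching scores.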

Details

Motivation: Addresses the underperformance of conventional fusion-based methods on multimodal tasks, caused by duplicated semantics and inefficient representations. Method: Computes multi-scale and multi-field representations, filters duplicated semantics, and produces matching scores as feature vectors. Result: Achieves strong performance on two publicly available datasets at lower computational cost. Conclusion: PREMISE outperforms existing fusion-based methods on multimodal tasks, especially when context matching is highly correlated with the task targets. Abstract: We present PREMISE (PREdict with Matching ScorEs), a new architecture for matching-based learning in multimodal fields for the multimodal review helpfulness (MRHP) task. Distinct from previous fusion-based methods, which obtain multimodal representations via cross-modal attention for downstream tasks, PREMISE computes the multi-scale and multi-field representations, filters duplicated semantics, and then obtains a set of matching scores as feature vectors for the downstream recommendation task. This new architecture significantly boosts the performance for such multimodal tasks whose context matching content is highly correlated to the targets of that task, compared to the state-of-the-art fusion-based methods. Experimental results on two publicly available datasets show that PREMISE achieves promising performance with less computational cost.

[74] Anti-adversarial Learning: Desensitizing Prompts for Large Language Models

Xuan Li,Zhe Yin,Xiaodong Gu,Beijun Shen

Main category: cs.CL

TL;DR: PromptObfus perturbs privacy words through anti-adversarial learning to protect privacy in LLM prompts while keeping model predictions stable.

Details

Motivation: With the widespread use of LLMs, protecting privacy in user prompts has become crucial, and traditional techniques are limited by heavy computational costs and user participation requirements. Method: Frames prompt desensitization as a masked language modeling task: privacy words are replaced with [MASK], a desensitization model is trained to generate candidate replacements, and the best replacement is selected using gradient feedback. Result: On three NLP tasks, PromptObfus effectively prevents privacy inference by remote LLMs while preserving task performance. Conclusion: PromptObfus is an efficient and practical method for protecting privacy in LLM prompts. Abstract: With the widespread use of LLMs, preserving privacy in user prompts has become crucial, as prompts risk exposing private and sensitive data to cloud LLMs. Traditional techniques like homomorphic encryption, secure multi-party computation, and federated learning face challenges due to heavy computational costs and user participation requirements, limiting their applicability in LLM scenarios. In this paper, we propose PromptObfus, a novel method for desensitizing LLM prompts. The core idea of PromptObfus is "anti-adversarial" learning, which perturbs privacy words in the prompt to obscure sensitive information while retaining the stability of model predictions. Specifically, PromptObfus frames prompt desensitization as a masked language modeling task, replacing privacy-sensitive terms with a [MASK] token. A desensitization model is trained to generate candidate replacements for each masked position. These candidates are subsequently selected based on gradient feedback from a surrogate model, ensuring minimal disruption to the task output. We demonstrate the effectiveness of our approach on three NLP tasks. Results show that PromptObfus effectively prevents privacy inference from remote LLMs while preserving task performance.
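
The first step of the pipeline, replacing privacy-sensitive terms with a [MASK] token, might look roughly like the sketch below. It assumes the list of privacy terms is already known; how PromptObfus identifies those terms is not stated in the abstract, and the function name `mask_privacy_terms` is hypothetical. The trained desensitization model and the gradient-based candidate selection are not shown.

```python
import re
from typing import Iterable


def mask_privacy_terms(prompt: str, privacy_terms: Iterable[str], mask_token: str = "[MASK]") -> str:
    """Replace each whole-word occurrence of a privacy term with `mask_token`.

    Hypothetical helper: the term list is assumed to be supplied by the caller.
    """
    # Longest terms first so that multi-word terms win over their substrings.
    terms = sorted({t for t in privacy_terms if t}, key=len, reverse=True)
    if not terms:
        return prompt
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(mask_token, prompt)
```

A masked language model would then propose replacements for each masked position, which is where the candidate generation described in the abstract takes over.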

[75] A Factorized Probabilistic Model of the Semantics of Vague Temporal Adverbials Relative to Different Event Types

Svenja Kenneweg,Jörg Deigmöller,Julian Eggert,Philipp Cimiano

Main category: cs.CL

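TL;DR: The paper proposes a factorized model that captures the semantics of vague temporal adverbials and composes them with event-specific distributions to yield context-dependent meanings. Compared with a non-factorized model, it offers similar predictive power while being simpler and more extensible.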

Details

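Motivation: Vague temporal adverbials (such as "recently", "just", and "a long time ago") describe the temporal distance between an event and the utterance time without specifying the exact duration. The study aims to capture their semantics more accurately with probability distributions. Method: Introduces a factorized model that represents the semantics of vague temporal adverbials as probability distributions and composes them with event-specific distributions. The parameters are fitted to existing data (native speakers' judgments of the adverbials' applicability). Result: Compared with a non-factorized model (based on Gaussian distributions), the factorized model has similar predictive power but is simpler and more extensible. Conclusion: The factorized model is preferable for capturing the semantics of vague temporal adverbials, is in line with Occam's razor, and is well suited to further extension. Abstract: Vague temporal adverbials, such as recently, just, and a long time ago, describe the temporal distance between a past event and the utterance time but leave the exact duration underspecified. In this paper, we introduce a factorized model that captures the semantics of these adverbials as probabilistic distributions. These distributions are composed with event-specific distributions to yield a contextualized meaning for an adverbial applied to a specific event. We fit the model's parameters using existing data capturing judgments of native speakers regarding the applicability of these vague temporal adverbials to events that took place a given time ago. Comparing our approach to a non-factorized model based on a single Gaussian distribution for each pair of event and temporal adverbial, we find that while both models have similar predictive power, our model is preferable in terms of Occam's razor, as it is simpler and has better extendability.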

[76] A Transformer-based Neural Architecture Search Method

Shang Wang,Huanrong Tang,Jianquan Ouyang

Main category: cs.CL

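TL;DR: Proposes a Transformer-based neural architecture search method that uses a multi-objective genetic algorithm to optimize network structures for translation, with BLEU score and perplexity as evaluation metrics.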

Details

Motivation: Finding better-performing network structures for translation requires more comprehensive evaluation metrics and an efficient search method. Method: Performs neural architecture search based on the Transformer architecture with a multi-objective genetic algorithm that jointly optimizes BLEU score and perplexity. Result: The discovered structures outperform all baseline models, and adding perplexity as an auxiliary metric finds better models than using BLEU alone. Conclusion: Combining a multi-objective genetic algorithm with Transformer architecture search effectively improves translation performance, and the auxiliary metric further improves model selection. Abstract: This paper presents a neural architecture search method based on the Transformer architecture, searching over ways of computing cross multi-head attention for different numbers of encoder and decoder combinations. In order to search for neural network structures with better translation results, we considered perplexity as an auxiliary evaluation metric for the algorithm in addition to BLEU scores and iteratively improved each individual neural network within the population by a multi-objective genetic algorithm. Experimental results show that the neural network structures searched by the algorithm outperform all the baseline models, and that the introduction of the auxiliary evaluation metric can find better models than considering only the BLEU score as an evaluation metric.
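
The abstract does not say which multi-objective genetic algorithm is used. A common building block of such algorithms is Pareto dominance, sketched below for the two metrics named in the abstract (BLEU, higher is better; perplexity, lower is better). The `Candidate` type and function names are assumptions for illustration, and the scores in the usage note are made-up values, not results from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one searched architecture."""
    name: str
    bleu: float        # higher is better
    perplexity: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both metrics and strictly better on one."""
    no_worse = a.bleu >= b.bleu and a.perplexity <= b.perplexity
    strictly_better = a.bleu > b.bleu or a.perplexity < b.perplexity
    return no_worse and strictly_better


def pareto_front(population: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in population if not any(dominates(o, c) for o in population)]
```

For example, with the illustrative candidates `Candidate("a", 27.0, 5.0)`, `Candidate("b", 26.0, 6.0)`, and `Candidate("c", 28.0, 7.0)`, candidate `b` is dominated by `a`, while `a` and `c` trade off BLEU against perplexity and both stay on the front.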

[77] Helping Big Language Models Protect Themselves: An Enhanced Filtering and Summarization System

Sheikh Samit Muhaimin,Spyridon Mastorakis

Main category: cs.CL

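TL;DR: Proposes a defense framework that requires no LLM retraining and uses prompt filtering and context summarization modules to identify and resist malicious inputs.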

Details

Motivation: Large language models are vulnerable to adversarial attacks and malicious inputs, and existing defenses require retraining the model, which is costly and impractical. Method: The framework has two parts: (1) a prompt filtering module that uses NLP techniques to detect harmful inputs; and (2) a summarization module that supplies context-aware defense knowledge. Result: Experiments show a 98.71% success rate in identifying harmful patterns and a marked improvement in the model's resistance to malicious inputs. Conclusion: The framework is an efficient, retraining-free defense that substantially strengthens LLM security. Abstract: The recent growth in the use of Large Language Models has made them vulnerable to sophisticated adversarial assaults, manipulative prompts, and encoded malicious inputs. Existing countermeasures frequently necessitate retraining models, which is computationally costly and impracticable for deployment. Without the need for retraining or fine-tuning, this study presents a unique defense paradigm that allows LLMs to recognize, filter, and defend against adversarial or malicious inputs on their own. There are two main parts to the suggested framework: (1) A prompt filtering module that uses sophisticated Natural Language Processing (NLP) techniques, including zero-shot classification, keyword analysis, and encoded content detection (e.g. base64, hexadecimal, URL encoding), to detect, decode, and classify harmful inputs; and (2) A summarization module that processes and summarizes adversarial research literature to give the LLM context-aware defense knowledge. This approach strengthens LLMs' resistance to adversarial exploitation by fusing text extraction, summarization, and harmful prompt analysis. According to experimental results, this integrated technique has a 98.71% success rate in identifying harmful patterns, manipulative language structures, and encoded prompts. By employing a modest amount of adversarial research literature as context, the methodology also allows the model to react correctly to harmful inputs with a larger percentage of jailbreak resistance and refusal rate. While maintaining the quality of LLM responses, the framework dramatically increases LLM's resistance to hostile misuse, demonstrating its efficacy as a quick and easy substitute for time-consuming, retraining-based defenses.
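
The encoded content detection mentioned in the abstract (base64, hexadecimal, URL encoding) could be approximated with standard-library decoders, as in the sketch below. This is not the authors' filter: the function name `decode_candidates`, the minimum-length threshold of 8 characters, and the heuristics for deciding what counts as encoded are all assumptions, and the classification of the decoded text as harmful or benign is not shown.

```python
import base64
import binascii
import re
from typing import Dict
from urllib.parse import unquote


def decode_candidates(text: str) -> Dict[str, str]:
    """Return plausible decodings of `text`, keyed by encoding name.

    Hypothetical helper: thresholds and heuristics are illustrative assumptions.
    """
    found: Dict[str, str] = {}
    stripped = text.strip()

    # URL encoding: at least one %XX escape is present.
    if re.search(r"%[0-9A-Fa-f]{2}", stripped):
        found["url"] = unquote(stripped)

    # Hexadecimal: an even-length run of hex digits that decodes to UTF-8 text.
    if len(stripped) >= 8 and len(stripped) % 2 == 0 and re.fullmatch(r"[0-9A-Fa-f]+", stripped):
        try:
            found["hex"] = bytes.fromhex(stripped).decode("utf-8")
        except UnicodeDecodeError:
            pass

    # Base64: correct alphabet and padding, and the payload decodes to UTF-8 text.
    if len(stripped) >= 8 and len(stripped) % 4 == 0 and re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", stripped):
        try:
            found["base64"] = base64.b64decode(stripped, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            pass

    return found
```

A string can match more than one heuristic (a hex string also uses only base64 characters), so the sketch returns every plausible decoding and leaves the final judgment to a downstream classifier.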

[78] TRAVELER: A Benchmark for Evaluating Temporal Reasoning across Vague, Implicit and Explicit References

Svenja Kenneweg,Jörg Deigmöller,Philipp Cimiano,Julian Eggert

Main category: cs.CL

TL;DR: TRAVELER is a new synthetic benchmark dataset for evaluating how well models resolve temporal references, including explicit, implicit, and vague ones.

Details

Motivation: Existing benchmarks offer only limited systematic evaluation of temporal references, and TRAVELER aims to close this gap. Method: Builds the dataset in a question-answering paradigm and evaluates how models handle different types of temporal references and different event set lengths. Result: The tested LLMs perform well with few events and explicit temporal references, but performance drops with longer event sets or vague temporal references. Conclusion: TRAVELER provides a systematic evaluation tool for temporal reference resolution and reveals the limitations of LLMs on this task. Abstract: Understanding and resolving temporal references is essential in Natural Language Understanding as we often refer to the past or future in daily communication. Although existing benchmarks address a system's ability to reason about and resolve temporal references, systematic evaluation of specific temporal references remains limited. Towards closing this gap, we introduce TRAVELER, a novel synthetic benchmark dataset that follows a Question Answering paradigm and consists of questions involving temporal references with the corresponding correct answers. TRAVELER assesses models' abilities to resolve explicit, implicit relative to speech time, and vague temporal references. Beyond investigating the performance of state-of-the-art LLMs depending on the type of temporal reference, our benchmark also allows evaluation of performance in relation to the length of the set of events. For the category of vague temporal references, ground-truth answers were established via human surveys on Prolific, following a procedure similar to the one from Kenneweg et al. To demonstrate the benchmark's applicability, we evaluate four state-of-the-art LLMs using a question-answering task encompassing 3,300 questions. Our findings show that while the benchmarked LLMs can answer questions over event sets with a handful of events and explicit temporal references successfully, performance clearly deteriorates with larger event set length and when temporal references get less explicit. Notably, the vague question category exhibits the lowest performance across all models. The benchmark is publicly available at: https://gitlab.ub.uni-bielefeld.de/s.kenneweg/TRAVELER

cs.RO [Back]

[79] SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered, LLM-Guided Distillation

Quang P. M. Pham,Khoi T. N. Nguyen,Nhi H. Doan,Cuong A. Pham,Kentaro Inui,Dezhen Song

Main category: cs.RO

TL;DR: SmallPlan uses LLMs as teacher models to train lightweight SLMs for efficient path planning, suitable for edge-device deployment.

Details

Motivation: Addresses the high computational cost and real-time constraints of robot path planning in large-scale, dynamic environments. Method: Trains SLMs with LLM-guided supervised fine-tuning and reinforcement learning to generate optimal action sequences. Result: The SLMs perform comparably to GPT-4o on path planning tasks while avoiding hallucination and overfitting. Conclusion: SmallPlan is resource-efficient, suits edge devices, and advances autonomous robotics. Abstract: Efficient path planning in robotics, particularly within large-scale, dynamic environments, remains a significant hurdle. While Large Language Models (LLMs) offer strong reasoning capabilities, their high computational cost and limited adaptability in dynamic scenarios hinder real-time deployment on edge devices. We present SmallPlan -- a novel framework leveraging LLMs as teacher models to train lightweight Small Language Models (SLMs) for high-level path planning tasks. In SmallPlan, the SLMs provide optimal action sequences to navigate across scene graphs that compactly represent full-scaled 3D scenes. The SLMs are trained in a simulation-powered, interleaved manner with LLM-guided supervised fine-tuning (SFT) and reinforcement learning (RL). This strategy not only enables SLMs to successfully complete navigation tasks but also makes them aware of important factors like travel distance and number of trials. Through experiments, we demonstrate that the fine-tuned SLMs perform competitively with larger models like GPT-4o on sequential path planning, without suffering from hallucination and overfitting. SmallPlan is resource-efficient, making it well-suited for edge-device deployment and advancing practical autonomous robotics.

[80] Autonomous Embodied Agents: When Robotics Meets Deep Learning Reasoning

Roberto Bigazzi

Main category: cs.RO

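TL;DR: The dissertation examines the development of Embodied AI against the backdrop of growing computing power and the deep learning revolution, focusing on the training and deployment of intelligent autonomous robots in indoor environments.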

Details

Motivation: With growing computing power and advances in deep learning, Embodied AI has emerged at the intersection of computer vision, robotics, and decision making, with the goal of advancing intelligent autonomous robots and their deployment in society. Method: Uses large collections of 3D models for photorealistic robotic simulation, training learning-based agents for millions of frames so that they learn continuous interaction with the environment, including gathering information, extracting task-relevant cues, and acting towards a goal. Result: Presents a detailed analysis of how intelligent embodied agents are implemented, including a literature review, technical explanations of the methods, and experimental studies on relevant robotic tasks. Conclusion: The work contributes to research in Embodied AI and autonomous agents and aims to foster future work in the field. Abstract: The increase in available computing power and the Deep Learning revolution have allowed the exploration of new topics and frontiers in Artificial Intelligence research. A new field called Embodied Artificial Intelligence, which sits at the intersection of Computer Vision, Robotics, and Decision Making, has been gaining importance during the last few years, as it aims to foster the development of smart autonomous robots and their deployment in society. The recent availability of large collections of 3D models for photorealistic robotic simulation has allowed faster and safer training of learning-based agents for millions of frames and a careful evaluation of their behavior before deploying the models on real robotic platforms. These intelligent agents are intended to perform a certain task in a possibly unknown environment. To this end, during the training in simulation, the agents learn to perform continuous interactions with the surroundings, such as gathering information from the environment, encoding and extracting useful cues for the task, and performing actions towards the final goal; where every action of the agent influences the interactions. This dissertation follows the complete creation process of embodied agents for indoor environments, from their concept to their implementation and deployment. We aim to contribute to research in Embodied AI and autonomous agents, in order to foster future work in this field. We present a detailed analysis of the procedure behind implementing an intelligent embodied agent, comprising a thorough description of the current state-of-the-art in literature, technical explanations of the proposed methods, and accurate experimental studies on relevant robotic tasks.

[81] Optimizing Indoor Farm Monitoring Efficiency Using UAV: Yield Estimation in a GNSS-Denied Cherry Tomato Greenhouse

Taewook Park,Jinwoo Lee,Hyondong Oh,Won-Jae Yun,Kyu-Wha Lee

Main category: cs.RO

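TL;DR: The paper proposes a lightweight UAV system for tomato yield estimation in greenhouses, addressing the limitations of ground robots in that setting.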

Details

Motivation: As the agricultural workforce shrinks and costs rise, robotic yield estimation is becoming important. Ground robots are hard to deploy in greenhouses, so an alternative is needed. Method: Develops a UAV equipped with an RGB-D camera, a 3D LiDAR, and an IMU, which navigates with a LiDAR-inertial odometry algorithm and uses a 3D multi-object tracking algorithm to estimate the count and weight of tomatoes. Result: On the harvesting-row dataset, the system reaches 94.4% counting accuracy and 87.5% weight estimation accuracy; on the growing-row dataset, tracking performance under occlusion is analyzed qualitatively. Conclusion: UAV systems show potential for greenhouse yield estimation, and further research is needed on perception in strongly occluded environments. Abstract: As the agricultural workforce declines and labor costs rise, robotic yield estimation has become increasingly important. While unmanned ground vehicles (UGVs) are commonly used for indoor farm monitoring, their deployment in greenhouses is often constrained by infrastructure limitations, sensor placement challenges, and operational inefficiencies. To address these issues, we develop a lightweight unmanned aerial vehicle (UAV) equipped with an RGB-D camera, a 3D LiDAR, and an IMU sensor. The UAV employs a LiDAR-inertial odometry algorithm for precise navigation in GNSS-denied environments and utilizes a 3D multi-object tracking algorithm to estimate the count and weight of cherry tomatoes. We evaluate the system using two datasets: one from a harvesting row and another from a growing row. In the harvesting-row dataset, the proposed system achieves 94.4% counting accuracy and 87.5% weight estimation accuracy within a 13.2-meter flight completed in 10.5 seconds. For the growing-row dataset, which consists of occluded unripened fruits, we qualitatively analyze tracking performance and highlight future research directions for improving perception in greenhouses with strong occlusions. Our findings demonstrate the potential of UAVs for efficient robotic yield estimation in commercial greenhouses.

Xun Li,Jian Yang,Fenli Jia,Muyu Wang,Qi Wu,Jun Wu,Jinpeng Mi,Jilin Hu,Peidong Liang,Xuan Tang,Ke Li,Xiong You,Xian Wei

Main category: cs.RO

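TL;DR: NeuroLoc is a camera localization method inspired by the navigation mechanisms of the biological brain. Using a Hebbian learning module, a direction-learning embedding, and 3D grid center prediction, it addresses scene ambiguity and orientation recovery and improves localization in complex environments.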

Details

Motivation: Autonomous navigation in unknown environments is often hampered by scene ambiguity, environmental disturbances, and dynamic object transformation, which calls for a more robust camera localization method. Method: Designs a place-cell-driven Hebbian learning module, a multi-head attention embedding inspired by head direction cells, and a 3D grid center prediction, and performs pose regression from a single image. Result: Validated on indoor and outdoor benchmark datasets, NeuroLoc improves robustness in complex environments and improves single-image pose regression. Conclusion: Through bio-inspired mechanisms, NeuroLoc addresses key problems in camera localization and offers a more reliable solution for autonomous navigation. Abstract: Recently, camera localization has been widely adopted in autonomous robotic navigation due to its efficiency and convenience. However, autonomous navigation in unknown environments often suffers from scene ambiguity, environmental disturbances, and dynamic object transformation in camera localization. To address this problem, inspired by the biological brain navigation mechanism (such as grid cells, place cells, and head direction cells), we propose a novel neurobiological camera localization method, namely NeuroLoc. Firstly, we designed a Hebbian learning module driven by place cells to save and replay historical information, aiming to restore the details of historical representations and solve the issue of scene fuzziness. Secondly, we utilized the head direction cell-inspired internal direction learning as multi-head attention embedding to help restore the true orientation in similar scenes. Finally, we added a 3D grid center prediction in the pose regression module to reduce the final wrong prediction. We evaluate the proposed NeuroLoc on commonly used benchmark indoor and outdoor datasets. The experimental results show that our NeuroLoc can enhance the robustness in complex environments and improve the performance of pose regression by using only a single image.

cs.NE [Back]

[83] A Neural Architecture Search Method using Auxiliary Evaluation Metric based on ResNet Architecture

Shang Wang,Huanrong Tang,Jianquan Ouyang

Main category: cs.NE

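TL;DR: Proposes a ResNet-based neural architecture search space whose search objectives cover the parameters of convolution, pooling, and fully connected layers as well as residual connectivity, with validation loss as a secondary optimization objective. Experiments show that the method finds competitive architectures on MNIST, Fashion-MNIST, and CIFAR100.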

Details

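Motivation: Combines the ResNet framework with neural architecture search to optimize network structure and parameters and thereby improve recognition accuracy and generalization. Method: Proposes a ResNet-based search space covering the parameters of convolution, pooling, and fully connected layers and the connectivity of the residual network, and introduces validation loss as a secondary optimization objective. Result: Finds competitive network architectures on the MNIST, Fashion-MNIST, and CIFAR100 datasets. Conclusion: Combining ResNet with neural architecture search effectively improves network performance, as validated on real datasets. Abstract: This paper proposes a neural architecture search space using ResNet as a framework, with search objectives including parameters for convolution, pooling, fully connected layers, and connectivity of the residual network. In addition to recognition accuracy, this paper uses the loss value on the validation set as a secondary objective for optimization. The experimental results demonstrate that the search space of this paper together with the optimisation approach can find competitive network architectures on the MNIST, Fashion-MNIST and CIFAR100 datasets.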

cs.CR [Back]

[84] Attack and defense techniques in large language models: A survey and new perspectives

Zhiyu Liao,Kang Chen,Yuanguo Lin,Kangkang Li,Yunxuan Liu,Hefeng Chen,Xingwang Huang,Yuanhui Yu

Main category: cs.CR

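TL;DR: This survey systematically reviews attack and defense techniques for large language models (LLMs), classifies attack types, analyzes defense strategies, and identifies current challenges and future research directions.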

Details

Motivation: LLMs are widely used in natural language processing tasks, but their security vulnerabilities and ethical issues pose major challenges that call for a systematic study of attack and defense techniques. Method: Systematically analyzes LLM security by classifying attack types (such as adversarial prompt attacks, optimized attacks, and model theft) and defense strategies (prevention-based and detection-based methods). Result: Despite progress, problems such as the dynamic threat landscape and resource constraints in defense implementation remain open. Conclusion: Future work should develop adaptive defenses, explainable security techniques, and standardized evaluation frameworks, with an emphasis on interdisciplinary collaboration and ethical considerations to reduce risks in real-world applications. Abstract: Large Language Models (LLMs) have become central to numerous natural language processing tasks, but their vulnerabilities present significant security and ethical challenges. This systematic survey explores the evolving landscape of attack and defense techniques in LLMs. We classify attacks into adversarial prompt attack, optimized attacks, model theft, as well as attacks on application of LLMs, detailing their mechanisms and implications. Consequently, we analyze defense strategies, including prevention-based and detection-based defense methods. Although advances have been made, challenges remain to adapt to the dynamic threat landscape, balance usability with robustness, and address resource constraints in defense implementation. We highlight open problems, including the need for adaptive scalable defenses, explainable security techniques, and standardized evaluation frameworks. This survey provides actionable insights and directions for developing secure and resilient LLMs, emphasizing the importance of interdisciplinary collaboration and ethical considerations to mitigate risks in real-world applications.

cs.MM [Back]

[85] Photoshop Batch Rendering Using Actions for Stylistic Video Editing

Tessa De La Fuente

Main category: cs.MM

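TL;DR: The paper presents an efficient workflow for creative image/video editing based on the Adobe Photoshop Actions tool and batch processing, using automation to improve consistency and productivity.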

Details

Motivation: Conventional video editing workflows are inefficient and lack consistency, so a new approach to managing creative workflows is needed. Method: Uses the Photoshop Actions tool and batch processing to apply visual edits consistently through systematic automation. Result: The approach simplifies the editing process, ensures uniform results across image collections, and improves productivity. Conclusion: The workflow offers an efficient and consistent alternative for creative image/video editing. Abstract: My project looks at an efficient workflow for creative image/video editing using the Adobe Photoshop Actions tool and Batch Processing System. This innovative approach to video editing through Photoshop creates a fundamental shift to creative workflow management through the integration of industry-leading image manipulation with video editing techniques. Through systematic automation of Actions, users can achieve a simple and consistent application of visual edits across a string of images. This approach provides an alternative method to optimize productivity while ensuring uniform results across image collections through a post-processing pipeline.

[86] CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

Edson Araujo,Andrew Rouditchenko,Yuan Gong,Saurabhchand Bhati,Samuel Thomas,Brian Kingsbury,Leonid Karlinsky,Rogerio Feris,James R. Glass

Main category: cs.MM

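TL;DR: CAV-MAE Sync extends the CAV-MAE framework by temporally aligning audio with video frames, separating the contrastive and reconstruction objectives, and introducing learnable register tokens. This addresses granularity mismatch and conflicting optimization objectives in audio-visual learning and achieves state-of-the-art performance on several tasks.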

Details

Motivation: Existing audio-visual learning methods struggle to capture fine-grained temporal correspondences and suffer from conflicting optimization objectives. Method: Treats audio as a temporal sequence aligned with video frames, separates the contrastive and reconstruction objectives, and introduces learnable register tokens. Result: Performs strongly on zero-shot retrieval, classification, and localization tasks on the AudioSet, VGG Sound, and ADE20K Sound datasets. Conclusion: With simple yet effective changes, CAV-MAE Sync addresses key challenges in audio-visual learning and delivers notable performance gains. Abstract: Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when trying to jointly learn reconstruction and cross-modal alignment. In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges: First, we tackle the granularity mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated global tokens. Third, we improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens. We evaluate the proposed approach on AudioSet, VGG Sound, and the ADE20K Sound dataset on zero-shot retrieval, classification and localization tasks demonstrating state-of-the-art performance and outperforming more complex architectures.

[87] FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing

Gaoxiang Cong,Liang Li,Jiadong Pan,Zhedong Zhang,Amin Beheshti,Anton van den Hengel,Yuankai Qi,Qingming Huang

Main category: cs.MM

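TL;DR: FlowDubber is an LLM-based movie dubbing method that combines a speech language model with dual contrastive aligning to achieve high-quality audio-visual sync and pronunciation, and improves acoustic quality through voice-enhanced flow matching.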

Details

Motivation: Existing methods focus mainly on reducing the word error rate and overlook lip-sync and acoustic quality. FlowDubber aims to address these issues. Method: Uses Qwen2.5 as the LLM backbone to learn the in-context sequence from movie scripts and reference audio, and proposes semantic-aware learning, dual contrastive aligning (DCA), and Flow-based Voice Enhancing (FVE). Result: Outperforms existing methods on multiple benchmarks. Conclusion: By combining an LLM with flow matching, FlowDubber markedly improves the synchronization and acoustic quality of movie dubbing. Abstract: Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model and dual contrastive aligning while achieving better acoustic quality via the proposed voice-enhanced flow matching than previous works. First, we introduce Qwen2.5 as the backbone of LLM to learn the in-context sequence from movie scripts and reference audio. Then, the proposed semantic-aware learning focuses on capturing LLM semantic knowledge at the phoneme level. Next, dual contrastive aligning (DCA) boosts mutual alignment with lip movement, reducing ambiguities where similar phonemes might be confused. Finally, the proposed Flow-based Voice Enhancing (FVE) improves acoustic quality in two aspects, which introduces an LLM-based acoustics flow matching guidance to strengthen clarity and uses affine style prior to enhance identity when recovering noise into mel-spectrograms via gradient vector field prediction. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks. The demos are available at https://galaxycong.github.io/LLM-Flow-Dubber/.

cs.LG [Back]

[88] A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i

Kola Ayonrinde,Louis Jaburi

Main category: cs.LG

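TL;DR: The paper proposes the Explanatory View Hypothesis, which holds that mechanistic interpretability research can extract the implicit explanations contained in neural networks, and defines a framework for explanatory faithfulness and mechanistic interpretability.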

Details

Motivation: Examines the theoretical foundations of mechanistic interpretability research and clarifies how it differs from other interpretability paradigms and what its inherent limits are. Method: Defines Explanatory Faithfulness and defines mechanistic interpretability as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations. Result: Establishes a framework for mechanistic interpretability and formulates the Principle of Explanatory Optimism as a precondition for its success. Conclusion: Mechanistic interpretability research is a principled approach to extracting and understanding the implicit explanations in neural networks, but its success depends on the Principle of Explanatory Optimism. Abstract: Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI's inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.

[89] NeMo-Inspector: A Visualization Tool for LLM Generation Analysis

Daria Gitman,Igor Gitman,Evelina Bakhturina

Main category: cs.LG

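TL;DR: NeMo-Inspector is an open-source tool that simplifies the analysis and cleaning of synthetic datasets, markedly improving data quality and model performance.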

Details

Motivation: Because the quality of synthetic data is hard to guarantee and manual inspection is slow and laborious, dedicated tools are needed to analyze and improve datasets efficiently. Method: Develops NeMo-Inspector, a tool with integrated inference capabilities for analyzing and cleaning synthetic datasets. Result: With the tool, low-quality samples in the GSM-Plus dataset fell from 46.99% to 19.51%, and the accuracy of OpenMath models improved by 1.92% on MATH and by 4.17% on GSM8K. Conclusion: NeMo-Inspector effectively improves the quality of synthetic datasets and, in turn, model performance. Abstract: Adapting Large Language Models (LLMs) to novel tasks and enhancing their overall capabilities often requires large, high-quality training datasets. Synthetic data, generated at scale, serves as a valuable alternative when real-world data is scarce or difficult to obtain. However, ensuring the quality of synthetic datasets is challenging, as developers must manually inspect and refine numerous samples to identify errors and areas for improvement. This process is time-consuming and requires specialized tools. We introduce NeMo-Inspector, an open-source tool designed to simplify the analysis of synthetic datasets with integrated inference capabilities. We demonstrate its effectiveness through two real-world cases. Analysis and cleaning of the synthetically generated GSM-Plus dataset with NeMo-Inspector led to a significant decrease in low-quality samples from 46.99% to 19.51%. The tool also helped identify and correct generation errors in OpenMath models, improving accuracy by 1.92% on the MATH dataset and by 4.17% on the GSM8K dataset for a Meta-Llama-3-8B model fine-tuned on synthetic data generated from Nemotron-4-340B.

[90] How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias

Ruiquan Huang,Yingbin Liang,Jing Yang

Main category: cs.LG

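TL;DR: The paper studies the learning dynamics of a one-layer transformer on regular language recognition tasks such as 'even pairs' and 'parity check', and finds that training proceeds in two phases: the attention layer quickly maps the data, then the linear layer gradually separates the samples.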

Details

Motivation: To explore how transformers learn regular language tasks, in particular how a one-layer model solves the 'even pairs' and 'parity check' tasks under gradient descent. Method: Theoretically analyzes the training dynamics of a one-layer transformer (an attention layer followed by a linear layer) under gradient descent, and validates the analysis experimentally. Result: Training has two phases: the attention layer quickly maps the data into separable vectors, then the linear layer gradually separates the samples; the loss decreases at a rate of $O(1/t)$. Conclusion: A one-layer transformer can solve 'even pairs' directly, whereas 'parity check' requires Chain-of-Thought; the training dynamics reveal the model's phased learning behavior. Abstract: Language recognition tasks are fundamental in natural language processing (NLP) and have been widely used to benchmark the performance of large language models (LLMs). These tasks also play a crucial role in explaining the working mechanisms of transformers. In this work, we focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check', the aim of which is to determine whether the occurrences of certain subsequences in a given sequence are even. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks by theoretically analyzing its training dynamics under gradient descent. While even pairs can be solved directly by a one-layer transformer, parity check needs to be solved by integrating Chain-of-Thought (CoT), either into the inference stage of a transformer well-trained for the even pairs task, or into the training of a one-layer transformer. For both problems, our analysis shows that the joint training of attention and linear layers exhibits two distinct phases. In the first phase, the attention layer grows rapidly, mapping data sequences into separable vectors. In the second phase, the attention layer becomes stable, while the linear layer grows logarithmically and approaches in direction a max-margin hyperplane that correctly separates the attention layer outputs into positive and negative samples, and the loss decreases at a rate of $O(1/t)$. Our experiments validate these theoretical results.
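To make the two tasks concrete, here is a minimal Python sketch of reference labelers for them. It is not taken from the paper: it assumes the formulation commonly used in the formal-language literature, with a two-symbol alphabet {a, b}, where 'even pairs' asks whether the combined number of adjacent `ab` and `ba` pairs is even and 'parity check' asks whether `b` occurs an even number of times. The function names are hypothetical, and the step-by-step variant only loosely mirrors the idea of computing parity through intermediate steps.

```python
from typing import List


def even_pairs(seq: str) -> bool:
    """True if the combined count of adjacent 'ab' and 'ba' pairs is even.

    Assumes an alphabet of {a, b}, so any adjacent pair of differing
    symbols is either 'ab' or 'ba'.
    """
    switches = sum(1 for x, y in zip(seq, seq[1:]) if x != y)
    return switches % 2 == 0


def parity_check(seq: str, symbol: str = "b") -> bool:
    """True if `symbol` occurs an even number of times in `seq`."""
    return seq.count(symbol) % 2 == 0


def parity_check_stepwise(seq: str, symbol: str = "b") -> List[bool]:
    """Running parity after each token (True means even so far)."""
    state = True  # zero occurrences so far counts as even
    trace = []
    for ch in seq:
        if ch == symbol:
            state = not state
        trace.append(state)
    return trace
```

Under this assumed formulation, `even_pairs` depends only on whether the first and last symbols match, while `parity_check` depends on every position, which gives some intuition for why the second task is treated as harder.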

[91] Towards the Resistance of Neural Network Watermarking to Fine-tuning

Ling Tang,Yuefeng Chen,Hui Xue,Quanshi Zhang

Main category: cs.LG

TL;DR: Proposes a new watermarking method that embeds ownership information into a deep neural network (DNN) and is robust to fine-tuning.

Details

Motivation: To protect the ownership of DNN models and prevent unauthorized fine-tuning from tampering with the watermark. Method: Uses a revised Fourier transform to extract specific frequency components of convolutional filters, and designs a watermark module that encodes information into those components. Result: Experiments indicate that the watermark information remains stable during fine-tuning. Conclusion: The method offers an effective and robust solution for protecting the ownership of DNN models. Abstract: This paper presents a new watermarking method for embedding ownership information into a deep neural network (DNN) that is robust to fine-tuning. Specifically, we prove that when the input feature of a convolutional layer only contains low-frequency components, specific frequency components of the convolutional filter will not be changed by gradient descent during the fine-tuning process, where we propose a revised Fourier transform to extract frequency components from the convolutional filter. We also prove that these frequency components are equivariant to weight scaling and weight permutations. In this way, we design a watermark module to encode the watermark information to specific frequency components in a convolutional filter. Preliminary experiments demonstrate the effectiveness of our method.
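The equivariance claim can be illustrated with a small sketch. The paper's revised Fourier transform is not specified in the abstract, so the code below swaps in the standard 2D discrete Fourier transform as an assumption, and the helper names (`dft2`, `scale`, `max_abs_diff`) are hypothetical. It only shows why a linear transform of the filter weights commutes with weight scaling and with reordering of filters; it says nothing about invariance under fine-tuning.

```python
import cmath


def dft2(kernel):
    """Naive 2D discrete Fourier transform of a small kernel (list of rows)."""
    rows, cols = len(kernel), len(kernel[0])
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for m in range(rows):
                for n in range(cols):
                    angle = -2 * cmath.pi * (u * m / rows + v * n / cols)
                    total += kernel[m][n] * cmath.exp(1j * angle)
            out[u][v] = total
    return out


def scale(kernel, c):
    """Multiply every entry of a kernel (or spectrum) by the scalar c."""
    return [[c * w for w in row] for row in kernel]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-shape grids."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def spectra(filters):
    """Apply dft2 to each filter independently, preserving filter order."""
    return [dft2(f) for f in filters]
```

Because `dft2` is linear, `dft2(scale(k, c))` matches `scale(dft2(k), c)` up to floating-point error, and because `spectra` treats each filter independently, permuting the filters permutes the spectra in the same way.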

[92] Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii

Kola Ayonrinde,Louis Jaburi

Main category: cs.LG

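TL;DR: The paper proposes a pluralist explanatory framework grounded in the Philosophy of Science for evaluating and improving neural network interpretability methods, and identifies Compact Proofs as a promising direction.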

Details

Motivation: Current interpretability methods for neural networks lack a unified standard for evaluating explanations, which has limited progress. Method: Introduces a pluralist explanatory framework based on four perspectives from the Philosophy of Science (Bayesian, Kuhnian, Deutschian, and Nomological) to systematically evaluate and improve explanations. Result: Compact Proofs are judged a promising direction because they take many explanatory virtues into account. Conclusion: Improved interpretability methods help us better monitor, predict, and steer AI systems. Abstract: Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question "What makes a good explanation?" We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives from the Philosophy of Science - the Bayesian, Kuhnian, Deutschian, and Nomological - to systematically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising approach. Fruitful research directions implied by our framework include (1) clearly defining explanatory simplicity, (2) focusing on unifying explanations and (3) deriving universal principles for neural networks. Improved MI methods enhance our ability to monitor, predict, and steer AI systems.

[93] On-demand Test-time Adaptation for Edge Devices

Xiao Ma,Young D. Kwon,Dong Ma

Main category: cs.LG

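TL;DR: The paper proposes OD-TTA, an on-demand test-time adaptation framework that uses lightweight domain shift detection, a source domain selection module, and a decoupled BN update scheme to cut computation and memory overhead substantially while maintaining high performance.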

Details

Motivation: Existing CTTA methods incur memory and energy costs on resource-constrained edge devices that limit their practical use. Method: 1) a lightweight domain shift detection mechanism; 2) a source domain selection module; 3) a decoupled BN update scheme. Result: OD-TTA reduces energy and computation overhead while matching or exceeding the performance of existing methods. Conclusion: OD-TTA makes TTA more practical on edge devices. Abstract: Continual Test-time adaptation (CTTA) continuously adapts the deployed model on every incoming batch of data. While achieving optimal accuracy, existing CTTA approaches present poor real-world applicability on resource-constrained edge devices, due to the substantial memory overhead and energy consumption. In this work, we first introduce a novel paradigm -- on-demand TTA -- which triggers adaptation only when a significant domain shift is detected. Then, we present OD-TTA, an on-demand TTA framework for accurate and efficient adaptation on edge devices. OD-TTA comprises three innovative techniques: 1) a lightweight domain shift detection mechanism to activate TTA only when it is needed, drastically reducing the overall computation overhead, 2) a source domain selection module that chooses an appropriate source model for adaptation, ensuring high and robust accuracy, 3) a decoupled Batch Normalization (BN) update scheme to enable memory-efficient adaptation with small batch sizes. Extensive experiments show that OD-TTA achieves comparable or even better performance while reducing the energy and computation overhead remarkably, making TTA a practical reality.
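The on-demand idea, adapting only when a shift is detected rather than on every batch, can be sketched in a few lines of Python. The abstract does not describe OD-TTA's actual detection mechanism, so this sketch assumes a generic stand-in: it compares each batch's mean against an exponentially averaged reference and triggers when the gap exceeds a threshold. The class `ShiftTrigger`, the function `run_stream`, and the `threshold` and `momentum` parameters are all hypothetical.

```python
from typing import List, Optional, Sequence


class ShiftTrigger:
    """Flags a domain shift when a batch mean drifts from a running reference.

    Generic mean-drift detector used for illustration only; it is not the
    detection mechanism described in the paper.
    """

    def __init__(self, threshold: float, momentum: float = 0.9):
        self.threshold = threshold
        self.momentum = momentum
        self.reference: Optional[float] = None

    def update(self, batch: Sequence[float]) -> bool:
        mean = sum(batch) / len(batch)
        if self.reference is None:
            # First batch defines the initial reference domain.
            self.reference = mean
            return False
        shifted = abs(mean - self.reference) > self.threshold
        if shifted:
            # Re-anchor on the new domain so later batches from the same
            # domain do not trigger adaptation again.
            self.reference = mean
        else:
            self.reference = (
                self.momentum * self.reference + (1 - self.momentum) * mean
            )
        return shifted


def run_stream(batches: Sequence[Sequence[float]], trigger: ShiftTrigger) -> List[int]:
    """Return the indices of batches on which adaptation would be triggered."""
    triggered = []
    for i, batch in enumerate(batches):
        if trigger.update(batch):
            triggered.append(i)
    return triggered
```

In a continual scheme every batch index would appear in the result; here only the batches at which the stream changes domain do, which is where the computation savings come from.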

cs.OH [Back]

[94] Wireless Communication as an Information Sensor for Multi-agent Cooperative Perception: A Survey

Zhiying Song,Tenghui Xie,Fuxi Wen,Jun Li

Main category: cs.OH

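TL;DR: This survey reviews recent advances in cooperative perception for autonomous driving, focusing on three dimensions (information representation, information fusion, and large-scale deployment), and proposes a new perspective that treats V2X communication as a dynamic information sensor.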

Details

Motivation: To extend the perception capabilities of autonomous vehicles through multi-agent information sharing, addressing the limitations of traditional onboard sensors. Method: From an information-centric perspective, categorizes information representation (data-level, feature-level, object-level), examines information fusion techniques (under ideal and non-ideal conditions), and summarizes system-level approaches that support large-scale deployment. Result: Highlights emerging methods for reducing data volume and compressing messages, fusion techniques that address heterogeneity, localization errors, latency, and packet loss, and approaches that support scalability in dense traffic scenarios. Conclusion: By treating V2X communication as an information sensor, the survey offers a new perspective on, and emphasizes the challenges of, deploying cooperative perception in real-world intelligent transportation systems. Abstract: Cooperative perception extends the perception capabilities of autonomous vehicles by enabling multi-agent information sharing via Vehicle-to-Everything (V2X) communication. Unlike traditional onboard sensors, V2X acts as a dynamic "information sensor" characterized by limited communication, heterogeneity, mobility, and scalability. This survey provides a comprehensive review of recent advancements from the perspective of information-centric cooperative perception, focusing on three key dimensions: information representation, information fusion, and large-scale deployment. We categorize information representation into data-level, feature-level, and object-level schemes, and highlight emerging methods for reducing data volume and compressing messages under communication constraints. In information fusion, we explore techniques under both ideal and non-ideal conditions, including those addressing heterogeneity, localization errors, latency, and packet loss. Finally, we summarize system-level approaches to support scalability in dense traffic scenarios. Compared with existing surveys, this paper introduces a new perspective by treating V2X communication as an information sensor and emphasizing the challenges of deploying cooperative perception in real-world intelligent transportation systems.

eess.IV [Back]

[95] Leveraging Depth and Attention Mechanisms for Improved RGB Image Inpainting

Jin Hyun Park,Harine Choi,Praewa Pitiphat

Main category: eess.IV

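TL;DR: Proposes a dual encoder architecture that combines RGB and depth images and fuses their features with an attention mechanism, significantly improving image inpainting quality.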

Details

Motivation: Existing methods rely only on RGB images and overlook the importance of depth information for spatial and structural understanding. Method: Uses a dual encoder to process RGB and depth images separately, fuses the features with an attention mechanism, and tests model robustness under different masking strategies. Result: The depth-integrated model outperforms the baseline in both qualitative and quantitative evaluations, and the attention mechanism improves performance further. Conclusion: Incorporating depth information and an attention mechanism effectively improves the accuracy and contextual awareness of image inpainting. Abstract: Existing deep learning-based image inpainting methods typically rely on convolutional networks with RGB images to reconstruct images. However, relying exclusively on RGB images may neglect important depth information, which plays a critical role in understanding the spatial and structural context of a scene. Just as human vision leverages stereo cues to perceive depth, incorporating depth maps into the inpainting process can enhance the model's ability to reconstruct images with greater accuracy and contextual awareness. In this paper, we propose a novel approach that incorporates both RGB and depth images for enhanced image inpainting. Our models employ a dual encoder architecture, where one encoder processes the RGB image and the other handles the depth image. The encoded features from both encoders are then fused in the decoder using an attention mechanism, effectively integrating the RGB and depth representations. We use two different masking strategies, line and square, to test the robustness of the model under different types of occlusions. To further analyze the effectiveness of our approach, we use Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations to examine the regions of interest the model focuses on during inpainting. We show that incorporating depth information alongside the RGB image significantly improves the reconstruction quality. Through both qualitative and quantitative comparisons, we demonstrate that the depth-integrated model outperforms the baseline, with attention mechanisms further enhancing inpainting performance, as evidenced by multiple evaluation metrics and visualization.

[96] A Survey on 3D Reconstruction Techniques in Plant Phenotyping: From Classical Methods to Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and Beyond

Jiajia Li,Xinda Qi,Seyed Hamidreza Nabaei,Meiqi Liu,Dong Chen,Xin Zhang,Xunyuan Yin,Zhaojian Li

Main category: eess.IV

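TL;DR: Surveys 3D reconstruction techniques for plant phenotyping, including classical methods, NeRF, and 3DGS, and discusses their strengths, weaknesses, and future prospects.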

Details

Motivation: Plant phenotyping is crucial for precision agriculture and crop improvement, and 3D reconstruction technologies offer new tools for automated phenotyping. Method: Reviews the technical principles, applications, and performance of classical reconstruction methods, NeRF, and 3DGS. Result: Classical methods are simple and flexible but face challenges with data density and noise; NeRF delivers high quality at high computational cost; 3DGS shows potential in efficiency and scalability. Conclusion: Each 3D reconstruction technique has its own strengths and weaknesses; future work should tailor them to the application scenario to advance agricultural technology. Abstract: Plant phenotyping plays a pivotal role in understanding plant traits and their interactions with the environment, making it crucial for advancing precision agriculture and crop improvement. 3D reconstruction technologies have emerged as powerful tools for capturing detailed plant morphology and structure, offering significant potential for accurate and automated phenotyping. This paper provides a comprehensive review of the 3D reconstruction techniques for plant phenotyping, covering classical reconstruction methods, emerging Neural Radiance Fields (NeRF), and the novel 3D Gaussian Splatting (3DGS) approach. Classical methods, which often rely on high-resolution sensors, are widely adopted due to their simplicity and flexibility in representing plant structures. However, they face challenges such as data density, noise, and scalability. NeRF, a recent advancement, enables high-quality, photorealistic 3D reconstructions from sparse viewpoints, but its computational cost and applicability in outdoor environments remain areas of active research. The emerging 3DGS technique introduces a new paradigm in reconstructing plant structures by representing geometry through Gaussian primitives, offering potential benefits in both efficiency and scalability. We review the methodologies, applications, and performance of these approaches in plant phenotyping and discuss their respective strengths, limitations, and future prospects (https://github.com/JiajiaLi04/3D-Reconstruction-Plants). Through this review, we aim to provide insights into how these diverse 3D reconstruction techniques can be effectively leveraged for automated and high-throughput plant phenotyping, contributing to the next generation of agricultural technology.

[97] Can Foundation Models Really Segment Tumors? A Benchmarking Odyssey in Lung CT Imaging

Elena Mulero Ayllón,Massimiliano Mantegna,Linlin Shen,Paolo Soda,Valerio Guarrasi,Matteo Tortora

Main category: eess.IV

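TL;DR: The study comprehensively benchmarks deep learning-based lung tumor segmentation models and finds that foundation models such as MedSAM 2 outperform traditional models in both accuracy and computational efficiency.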

Details

Motivation: Accurate lung tumor segmentation is crucial for diagnosis and treatment planning, but the complexity of tumor morphology, size, and location makes automated segmentation challenging. Method: Compares traditional architectures (such as U-Net and DeepLabV3), self-configuring models (such as nnUNet), and foundation models (such as MedSAM and MedSAM 2), evaluating performance on two lung tumor segmentation datasets. Result: Foundation models, particularly MedSAM 2, outperform traditional models in both accuracy and computational efficiency. Conclusion: Foundation models show potential for lung tumor segmentation and could improve clinical workflows and patient outcomes. Abstract: Accurate lung tumor segmentation is crucial for improving diagnosis, treatment planning, and patient outcomes in oncology. However, the complexity of tumor morphology, size, and location poses significant challenges for automated segmentation. This study presents a comprehensive benchmarking analysis of deep learning-based segmentation models, comparing traditional architectures such as U-Net and DeepLabV3, self-configuring models like nnUNet, and foundation models like MedSAM and MedSAM 2. Evaluating performance across two lung tumor segmentation datasets, we assess segmentation accuracy and computational efficiency under various learning paradigms, including few-shot learning and fine-tuning. The results reveal that while traditional models struggle with tumor delineation, foundation models, particularly MedSAM 2, outperform them in both accuracy and computational efficiency. These findings underscore the potential of foundation models for lung tumor segmentation, highlighting their applicability in improving clinical workflows and patient outcomes.

Table of Contents

cs.CV [Back]

[1] Unconstrained Large-scale 3D Reconstruction and Rendering across Altitudes

[2] MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection

[3] Fast2comm: Collaborative perception combined with prior knowledge

[4] Detection and Classification of Diseases in Multi-Crop Leaves using LSTM and CNN Models

[5] Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

[6] DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation

[7] Localizing Before Answering: A Benchmark for Grounded Medical Visual Question Answering

[8] Responsive DNN Adaptation for Video Analytics against Environment Shift via Hierarchical Mobile-Cloud Collaborations

[9] Entropy Heat-Mapping: Localizing GPT-Based OCR Errors with Sliding-Window Shannon Analysis

[10] InstructAttribute: Fine-grained Object Attributes editing with Instruction

[11] DARTer: Dynamic Adaptive Representation Tracker for Nighttime UAV Tracking

[12] P2P-Insole: Human Pose Estimation Using Foot Pressure Distribution and Motion Sensors

[13] Efficient On-Chip Implementation of 4D Radar-Based 3D Object Detection on Hailo-8L

[14] Multi-Modal Language Models as Text-to-Image Model Evaluators

[15] Person detection and re-identification in open-world settings of retail stores and public spaces

[16] AI-ready Snow Radar Echogram Dataset (SRED) for climate change monitoring

[17] SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

[18] Advancing Wheat Crop Analysis: A Survey of Deep Learning Approaches Using Hyperspectral Imaging

[19] The Comparability of Model Fusion to Measured Data in Confuser Rejection

[20] Are Minimal Radial Distortion Solvers Really Necessary for Relative Pose Estimation?

[21] CDFormer: Cross-Domain Few-Shot Object Detection Transformer Against Feature Confusion

[22] Generating Animated Layouts as Structured Text Representations

[23] LMDepth: Lightweight Mamba-based Monocular Depth Estimation for Real-World Deployment

[24] Deterministic-to-Stochastic Diverse Latent Feature Mapping for Human Motion Synthesis

[25] 3D Human Pose Estimation via Spatial Graph Order Attention and Temporal Body Aware Transformer

[26] Fine-Tuning Without Forgetting: Adaptation of YOLOv8 Preserves COCO Performance

[27] Edge-preserving Image Denoising via Multi-scale Adaptive Statistical Independence Testing

[28] Edge Detection based on Channel Attention and Inter-region Independence Test

[29] Transferable Adversarial Attacks on Black-Box Vision-Language Models

[30] GeloVec: Higher Dimensional Geometric Smoothing for Coherent Visual Feature Extraction in Image Segmentation

[31] Efficient Vocabulary-Free Fine-Grained Visual Recognition in the Age of Multimodal LLMs

[32] Improving Editability in Image Generation with Layer-wise Memory

[33] Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation

[34] Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages

[35] VSC: Visual Search Compositional Text-to-Image Diffusion Model

[36] Self-Supervision Enhances Instance-based Multiple Instance Learning Methods in Digital Pathology: A Benchmark Study

[37] FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis

[38] TSTMotion: Training-free Scene-aware Text-to-motion Generation

[39] Efficient Vision-based Vehicle Speed Estimation

[40] T-Graph: Enhancing Sparse-view Camera Pose Estimation by Pairwise Translation Graph

[41] High Dynamic Range Novel View Synthesis with Single Exposure

[42] RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement

[43] Core-Set Selection for Data-efficient Land Cover Segmentation

[44] Compensating Spatiotemporally Inconsistent Observations for Online Dynamic 3D Gaussian Splatting

[45] Fusing Foveal Fixations Using Linear Retinal Transformations and Bayesian Experimental Design

[46] CAMELTrack: Context-Aware Multi-cue ExpLoitation for Online Multi-Object Tracking

[47] Diffusion-based Adversarial Purification from the Perspective of the Frequency Domain

[48] FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors

[49] Monitoring morphometric drift in lifelong learning segmentation of the spinal cord

[50] Global Collinearity-aware Polygonizer for Polygonal Building Mapping in Remote Sensing

[51] Multimodal Doctor-in-the-Loop: A Clinically-Guided Explainable Framework for Predicting Pathological Response in Non-Small Cell Lung Cancer

[52] VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models

cs.GR [Back]

[53] Model See Model Do: Speech-Driven Facial Animation with Style Control

[54] GENMO: A GENeralist Model for Human MOtion

cs.CL [Back]

[55] FinBERT-QA: Financial Question Answering with pre-trained BERT Language Models

[56] A Survey on Large Language Model based Human-Agent Systems

[57] Reasoning Capabilities and Invariability of Large Language Models

[58] Knowledge-augmented Pre-trained Language Models for Biomedical Relation Extraction

[59] Large Language Model-Driven Dynamic Assessment of Grammatical Accuracy in English Language Learner Writing

[60] Llama-Nemotron: Efficient Reasoning Models

[61] A Character-based Diffusion Embedding Algorithm for Enhancing the Generation Quality of Generative Linguistic Steganographic Texts

[62] Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models

[63] Position: Enough of Scaling LLMs! Lets Focus on Downscaling

[64] VTS-LLM: Domain-Adaptive LLM Agent for Enhancing Awareness in Vessel Traffic Services through Natural Language

[65] Token-free Models for Sarcasm Detection

[66] Value Portrait: Understanding Values of LLMs with Human-aligned Benchmark

[67] Do We Need a Detailed Rubric for Automated Essay Scoring using Large Language Models?

[68] Multimodal Transformers are Hierarchical Modal-wise Heterogeneous Graphs

[69] MateICL: Mitigating Attention Dispersion in Large-Scale In-Context Learning

[70] On the Limitations of Steering in Language Model Alignment

[71] Gender Bias in Explainability: Investigating Performance Disparity in Post-hoc Methods

[72] EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models

[73] PREMISE: Matching-based Prediction for Accurate Review Recommendation

[74] Anti-adversarial Learning: Desensitizing Prompts for Large Language Models

[75] A Factorized Probabilistic Model of the Semantics of Vague Temporal Adverbials Relative to Different Event Types

[76] A Transformer-based Neural Architecture Search Method