cs.CV

[1] Multi-party Collaborative Attention Control for Image Customization

Han Yang,Chuanguang Yang,Qiuli Wang,Zhulin An,Weilun Feng,Libo Huang,Yongjun Xu

Main category: cs.CV

TL;DR: The paper proposes Multi-party Collaborative Attention Control (MCA-Ctrl), a tuning-free method that combines text and complex visual conditions for high-quality image customization, addressing subject leakage, inconsistent backgrounds, and high computational cost in complex scenes.

Details

Motivation: Current image customization methods accept only a single type of condition, suffer from subject leakage or confusion in complex scenes, produce inconsistent backgrounds, and are computationally expensive, so a more efficient approach is needed.

Method: MCA-Ctrl uses two key operations in the self-attention layer to coordinate multiple parallel diffusion processes, together with a Subject Localization Module that extracts precise subject and editable image layers.

Result: Experiments show that MCA-Ctrl outperforms existing methods on zero-shot image customization and effectively mitigates subject leakage and background inconsistency.

Conclusion: MCA-Ctrl offers an efficient, tuning-free solution for image customization under complex visual conditions, with broad application potential.

Abstract: The rapid advancement of diffusion models has increased the need for customized image generation. However, current customization methods face several limitations: 1) they typically accept either image or text conditions alone; 2) customization in complex visual scenarios often leads to subject leakage or confusion; 3) image-conditioned outputs tend to suffer from inconsistent backgrounds; and 4) computational costs are high. To address these issues, this paper introduces Multi-party Collaborative Attention Control (MCA-Ctrl), a tuning-free method that enables high-quality image customization using both text and complex visual conditions. Specifically, MCA-Ctrl leverages two key operations within the self-attention layer to coordinate multiple parallel diffusion processes and guide the target image generation. This approach allows MCA-Ctrl to capture the content and appearance of specific subjects while maintaining semantic consistency with the conditional input. Additionally, to mitigate subject leakage and confusion issues common in complex visual scenarios, we introduce a Subject Localization Module that extracts precise subject and editable image layers based on user instructions. Extensive quantitative and human evaluation experiments show that MCA-Ctrl outperforms existing methods in zero-shot image customization, effectively resolving the mentioned issues.

[2] Explainable AI-Driven Detection of Human Monkeypox Using Deep Learning and Vision Transformers: A Comprehensive Analysis

Md. Zahid Hossain,Md. Rakibul Islam,Most. Sharmin Sultana Samu

Main category: cs.CV

TL;DR: The study examines whether deep learning and vision transformer models can be trained from scratch on a publicly available skin lesion image dataset, finds dataset limitations to be the main obstacle, and turns to transfer learning, with MobileNet-v2 performing best.

Details

Motivation: Mpox symptoms closely resemble those of measles and chickenpox, which makes early clinical diagnosis difficult; combining medical imaging with deep learning could improve detection.

Method: Deep learning and vision transformer models are first trained from scratch on a public skin lesion image dataset; once dataset limitations become apparent, transfer learning with pre-trained models (MobileNet-v2, ViT B16, and ResNet-50) is used instead.

Result: MobileNet-v2 performs best, with 93.15% accuracy and a 93.09% weighted-average F1 score; ViT B16 and ResNet-50 reach 92.12% and 86.21% accuracy, respectively.

Conclusion: Transfer learning with pre-trained models effectively improves classifier performance, but the dataset limitations remain to be addressed.

Abstract: Mpox is a zoonotic viral illness that can spread from person to person and poses a significant public health concern. It is difficult to make an early clinical diagnosis because of how closely its symptoms match those of measles and chickenpox. Medical imaging combined with deep learning (DL) techniques has shown promise in improving disease detection by analyzing affected skin areas. Our study explores the feasibility of training deep learning and vision transformer-based models from scratch with a publicly available skin lesion image dataset. Our experimental results show dataset limitation as a major drawback to building better classifier models trained from scratch. We used transfer learning with the help of pre-trained models to get a better classifier. MobileNet-v2 outperformed other state-of-the-art pre-trained models with 93.15% accuracy and a 93.09% weighted average F1 score. ViT B16 and ResNet-50 also achieved satisfactory performance compared to already available studies, with accuracies of 92.12% and 86.21%, respectively. To further validate the performance of the models, we applied explainable AI techniques.

[3] Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models

Muna Numan Said,Aarib Zaidi,Rabia Usman,Sonia Okon,Praneeth Medepalli,Kevin Zhu,Vasu Sharma,Sean O'Brien

Main category: cs.CV

TL;DR: The paper benchmarks the Component Inclusion Score (CIS), a metric for evaluating how text-to-image models handle culturally diverse prompts, exposes a performance gap between Western and non-Western cultural prompts, and discusses possible interventions.

Details

Motivation: Text-to-image models carry cultural biases when generating culturally diverse images, leading to systemic misrepresentation; a way to evaluate and reduce this bias is needed.

Method: An analysis of 2,400 images quantifies model bias across cultural contexts in terms of compositional fragility and contextual misalignment, using the CIS metric.

Result: Data imbalance, attention entropy, and embedding superposition are found to affect model fairness, and a significant performance gap appears between Western and non-Western cultural prompts.

Conclusion: CIS provides a tool for diagnosing and mitigating bias in text-to-image generation and supports the development of more equitable AI systems.

Abstract: The transformative potential of text-to-image (T2I) models hinges on their ability to synthesize culturally diverse, photorealistic images from textual prompts. However, these models often perpetuate cultural biases embedded within their training data, leading to systemic misrepresentations. This paper benchmarks the Component Inclusion Score (CIS), a metric designed to evaluate the fidelity of image generation across cultural contexts. Through extensive analysis involving 2,400 images, we quantify biases in terms of compositional fragility and contextual misalignment, revealing significant performance gaps between Western and non-Western cultural prompts. Our findings underscore the impact of data imbalance, attention entropy, and embedding superposition on model fairness. By benchmarking models like Stable Diffusion with CIS, we provide insights into architectural and data-centric interventions for enhancing cultural inclusivity in AI-generated imagery. This work advances the field by offering a comprehensive tool for diagnosing and mitigating biases in T2I generation, advocating for more equitable AI systems.

[4] ZS-VCOS: Zero-Shot Outperforms Supervised Video Camouflaged Object Segmentation

Wenqi Guo,Shan Du

Main category: cs.CV

TL;DR: The paper proposes a zero-shot method that combines optical flow, a vision-language model, and SAM 2, substantially improving video camouflaged object segmentation and outperforming both existing zero-shot and supervised methods.

Details

Motivation: Camouflaged object segmentation is difficult because objects closely resemble their backgrounds, and existing zero-shot methods perform poorly. Optical flow is effective at detecting moving objects, so it is introduced to improve segmentation.

Method: Optical flow, a vision-language model, and SAM 2 are integrated into a sequential pipeline for zero-shot segmentation of camouflaged objects.

Result: On the MoCA-Mask dataset, the F-measure rises from 0.296 to 0.628, exceeding supervised methods (0.476). On the MoCA-Filter dataset, the success rate rises from 0.628 to 0.697.

Conclusion: The method substantially improves camouflaged object segmentation, demonstrates the value of optical flow in zero-shot settings, and offers a new approach for related fields.

Abstract: Camouflaged object segmentation presents unique challenges compared to traditional segmentation tasks, primarily due to the high similarity in patterns and colors between camouflaged objects and their backgrounds. Effective solutions to this problem have significant implications in critical areas such as pest control, defect detection, and lesion segmentation in medical imaging. Prior research has predominantly emphasized supervised or unsupervised pre-training methods, leaving zero-shot approaches significantly underdeveloped. Existing zero-shot techniques commonly utilize the Segment Anything Model (SAM) in automatic mode or rely on vision-language models to generate cues for segmentation; however, their performance remains unsatisfactory, likely due to the similarity of the camouflaged object and the background. Optical flow, commonly utilized for detecting moving objects, has demonstrated effectiveness even with camouflaged entities. Our method integrates optical flow, a vision-language model, and SAM 2 into a sequential pipeline. Evaluated on the MoCA-Mask dataset, our approach achieves outstanding performance improvements, significantly outperforming existing zero-shot methods by raising the F-measure ($F_\beta^w$) from 0.296 to 0.628. Remarkably, our approach also surpasses supervised methods, increasing the F-measure from 0.476 to 0.628. Additionally, evaluation on the MoCA-Filter dataset demonstrates an increase in the success rate from 0.628 to 0.697 when compared with FlowSAM, a supervised transfer method. A thorough ablation study further validates the individual contributions of each component. More details can be found at https://github.com/weathon/vcos.

Zongxia Li,Xiyang Wu,Yubin Qin,Guangyao Shi,Hongyang Du,Dinesh Manocha,Tianyi Zhou,Jordan Lee Boyd-Graber

Main category: cs.CV

TL;DR: The paper introduces VideoHallu, a benchmark for evaluating how well multi-modal large language models (MLLMs) detect commonsense and physics violations in synthetic videos, and improves model performance through GRPO fine-tuning.

Details

Motivation: Synthetic video generation models produce high-quality frames but often violate common sense and physical laws, while existing metrics such as VideoScore ignore these violations and lack interpretability.

Method: The VideoHallu benchmark pairs synthetic videos with expert-designed QA tasks, is used to evaluate several MLLMs (including GPT-4o and Gemini-2.5-Pro), and serves as a basis for fine-tuning models with GRPO.

Result: MLLMs still hallucinate on synthetic videos, but GRPO fine-tuning, especially with counterexample integration, yields notable accuracy gains.

Conclusion: VideoHallu provides an effective tool for evaluating and improving MLLMs' abnormality detection in synthetic videos and advances their reasoning capabilities.

Abstract: Synthetic video generation with foundation models has gained attention for its realism and wide applications. While these models produce high-quality frames, they often fail to respect common sense and physical laws, resulting in abnormal content. Existing metrics like VideoScore emphasize general quality but ignore such violations and lack interpretability. A more insightful approach is using multi-modal large language models (MLLMs) as interpretable evaluators, as seen in FactScore. Yet, MLLMs' ability to detect abnormalities in synthetic videos remains underexplored. To address this, we introduce VideoHallu, a benchmark featuring synthetic videos from models like Veo2, Sora, and Kling, paired with expert-designed QA tasks solvable via human-level reasoning across various categories. We assess several SoTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen-2.5-VL, and newer models like Video-R1 and VideoChat-R1. Despite strong real-world performance on MVBench and MovieChat, these models still hallucinate on basic commonsense and physics tasks in synthetic settings, underscoring the challenge of hallucination. We further fine-tune SoTA MLLMs using Group Relative Policy Optimization (GRPO) on real and synthetic commonsense/physics data. Results show notable accuracy gains, especially with counterexample integration, advancing MLLMs' reasoning capabilities. Our data is available at https://github.com/zli12321/VideoHallu.
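
The "group relative" part of GRPO can be illustrated with a short sketch. It assumes the commonly described formulation in which each sampled response's reward is standardized against the other responses to the same prompt; the function name, the use of the population standard deviation, and the sample rewards are illustrative assumptions, since the abstract does not describe the training setup in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one video question
# (1.0 = correct, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, without requiring a separate learned value model.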

[6] WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation

Daoan Zhang,Che Jiang,Ruoshi Xu,Biaoxiang Chen,Zijian Jin,Yutian Lu,Jianguo Zhang,Liang Yong,Jiebo Luo,Shengda Luo

Main category: cs.CV

TL;DR: WorldGenBench is a benchmark for evaluating the world knowledge and reasoning abilities of text-to-image (T2I) models, scored with the proposed Knowledge Checklist Score; diffusion models lead among open-source methods, but proprietary models such as GPT-4o reason more strongly.

Details

Motivation: Existing T2I models struggle with prompts that require rich world knowledge and implicit reasoning, which hurts the semantic accuracy and contextual coherence of generated images.

Method: The WorldGenBench benchmark and the Knowledge Checklist Score are introduced and used to evaluate 21 state-of-the-art T2I models.

Result: Diffusion models lead among open-source methods, but proprietary auto-regressive models such as GPT-4o show stronger reasoning and knowledge integration.

Conclusion: Next-generation T2I systems need deeper understanding and inference capabilities.

Abstract: Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models still struggle with prompts that require rich world knowledge and implicit reasoning, both of which are critical for producing semantically accurate, coherent, and contextually appropriate images in real-world scenarios. To address this gap, we introduce WorldGenBench, a benchmark designed to systematically evaluate T2I models' world knowledge grounding and implicit inferential capabilities, covering both the humanities and nature domains. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Experiments across 21 state-of-the-art models reveal that while diffusion models lead among open-source methods, proprietary auto-regressive models like GPT-4o exhibit significantly stronger reasoning and knowledge integration. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems. Project Page: https://dwanzhang-ai.github.io/WorldGenBench/

[7] Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer

Muhammad Tayyab Khan,Zane Yong,Lequn Chen,Jun Ming Tan,Wenhe Feng,Seung Ki Moon

Main category: cs.CV

TL;DR: A hybrid deep learning framework combining oriented bounding box (OBB) detection with the Donut model extracts structured information from 2D engineering drawings, improving precision and efficiency.

Details

Motivation: Manually extracting key information from 2D engineering drawings is time-consuming and error-prone, and traditional OCR struggles with complex layouts and overlapping symbols.

Method: YOLOv11 detects the key categories, Donut is fine-tuned to produce structured JSON output, and a single model trained on all categories is compared with category-specific models.

Result: The single model outperforms the category-specific models, reaching a precision of 94.77% for GD&T, a recall of 100% for most categories, an F1 score of 97.3%, and a hallucination rate of 5.23%.

Conclusion: The framework improves extraction accuracy, reduces manual effort, and suits precision-driven manufacturing industries.

Abstract: Accurate extraction of key information from 2D engineering drawings is crucial for high-precision manufacturing. Manual extraction is time-consuming and error-prone, while traditional Optical Character Recognition (OCR) techniques often struggle with complex layouts and overlapping symbols, resulting in unstructured outputs. To address these challenges, this paper proposes a novel hybrid deep learning framework for structured information extraction by integrating an oriented bounding box (OBB) detection model with a transformer-based document parsing model (Donut). An in-house annotated dataset is used to train YOLOv11 for detecting nine key categories: Geometric Dimensioning and Tolerancing (GD&T), General Tolerances, Measures, Materials, Notes, Radii, Surface Roughness, Threads, and Title Blocks. Detected OBBs are cropped into images and labeled to fine-tune Donut for structured JSON output. Fine-tuning strategies include a single model trained across all categories and category-specific models. Results show that the single model consistently outperforms category-specific ones across all evaluation metrics, achieving higher precision (94.77% for GD&T), recall (100% for most), and F1 score (97.3%), while reducing hallucination (5.23%). The proposed framework improves accuracy, reduces manual effort, and supports scalable deployment in precision-driven industries.
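
The reported figures can be cross-checked with the standard F1 definition. The sketch below assumes that the 97.3% F1 score corresponds to the GD&T precision of 94.77% combined with a recall of 100%, which the abstract does not state explicitly; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 0.9477  # GD&T precision reported in the abstract
recall = 1.0        # recall reported for most categories
f1 = f1_score(precision, recall)

print(f"F1 = {f1:.1%}")                        # F1 = 97.3%
print(f"1 - precision = {1 - precision:.2%}")  # 1 - precision = 5.23%
```

Under that assumption the numbers agree, and the 5.23% hallucination figure matches the complement of the precision, although the abstract does not give the definition of hallucination it uses.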

[8] Rethinking RGB-Event Semantic Segmentation with a Novel Bidirectional Motion-enhanced Event Representation

Zhen Yao,Xiaowen Ying,Mooi Choo Chuah

Main category: cs.CV

TL;DR: The paper proposes a novel event representation (MET) and two modules (BFAM and TFM) that address spatiotemporal and modal misalignment in RGB-Event fusion, significantly improving semantic segmentation.

Details

Motivation: RGB-Event fusion suffers from temporal, spatial, and modal misalignment that existing methods do not resolve effectively.

Method: The Motion-enhanced Event Tensor (MET) combines dense optical flow with event temporal features; the BFAM and TFM modules handle modal and spatiotemporal misalignment.

Result: On two large-scale datasets, the framework significantly outperforms existing RGB-Event semantic segmentation methods.

Conclusion: MET and the proposed modules effectively address key problems in RGB-Event fusion and improve performance.

Abstract: Event cameras capture motion dynamics, offering a unique modality with great potential in various computer vision tasks. However, RGB-Event fusion faces three intrinsic misalignments: (i) temporal, (ii) spatial, and (iii) modal misalignment. Existing voxel grid representations neglect temporal correlations between consecutive event windows, and their formulation with simple accumulation of asynchronous and sparse events is incompatible with the synchronous and dense nature of RGB modality. To tackle these challenges, we propose a novel event representation, Motion-enhanced Event Tensor (MET), which transforms sparse event voxels into a dense and temporally coherent form by leveraging dense optical flows and event temporal features. In addition, we introduce a Frequency-aware Bidirectional Flow Aggregation Module (BFAM) and a Temporal Fusion Module (TFM). BFAM leverages the frequency domain and MET to mitigate modal misalignment, while bidirectional flow aggregation and temporal fusion mechanisms resolve spatiotemporal misalignment. Experimental results on two large-scale datasets demonstrate that our framework significantly outperforms state-of-the-art RGB-Event semantic segmentation approaches. Our code is available at: https://github.com/zyaocoder/BRENet.
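
For context, the baseline that the abstract criticizes, a voxel grid built by simple accumulation, can be sketched as follows. This is not the proposed MET representation; the event tuple layout `(x, y, t, polarity)`, the nearest-bin assignment, and the function name are assumptions made for illustration.

```python
def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate (x, y, t, polarity) events into a [num_bins][height][width] grid."""
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid
    t_start = min(e[2] for e in events)
    t_end = max(e[2] for e in events)
    span = (t_end - t_start) or 1.0  # guard against a single shared timestamp
    for x, y, t, polarity in events:
        # Map the timestamp to a temporal bin; clamp the last event into range.
        b = min(int((t - t_start) / span * num_bins), num_bins - 1)
        grid[b][y][x] += polarity
    return grid


# Four hypothetical events in a 2x2 sensor window split into two temporal bins.
events = [(0, 0, 0.00, +1), (1, 0, 0.01, -1), (0, 0, 0.09, +1), (1, 1, 0.10, +1)]
grid = events_to_voxel_grid(events, num_bins=2, height=2, width=2)
```

Most cells stay at zero and each window is built independently of its neighbors, which is the sparsity and missing temporal correlation that the abstract describes.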

[9] A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation via Synergistic Pseudo-Labeling and Generative Learning

Anan Yaghmour,Melba M. Crawford,Saurabh Prasad

Main category: cs.CV

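TL;DR: The paper proposes a domain generalization approach that combines soft-alignment pseudo-labeling with source-to-target generative pre-training, using geospatial foundation models to improve the generalization of remote sensing segmentation models.

Details

Motivation: Remote sensing annotations are scarce and vary with sensor, illumination, and geography, while high-performance segmentation models depend on extensive labeled data, so domain adaptation is key to improving generalization.

Method: Soft-alignment pseudo-labeling is combined with source-to-target generative pre-training, with MAE-based generative learning used for domain-invariant feature learning.

Result: Experiments on hyperspectral and multispectral remote sensing datasets confirm the method's effectiveness in improving adaptability and segmentation performance.

Conclusion: By combining generative pre-training with pseudo-labeling, the proposed approach improves the generalization and performance of remote sensing segmentation models.

Abstract: Remote sensing enables a wide range of critical applications such as land cover and land use mapping, crop yield prediction, and environmental monitoring. Advances in satellite technology have expanded remote sensing datasets, yet high-performance segmentation models remain dependent on extensive labeled data, challenged by annotation scarcity and variability across sensors, illumination, and geography. Domain adaptation offers a promising solution to improve model generalization. This paper introduces a domain generalization approach to leveraging emerging geospatial foundation models by combining soft-alignment pseudo-labeling with source-to-target generative pre-training. We further provide new mathematical insights into MAE-based generative learning for domain-invariant feature learning. Experiments with hyperspectral and multispectral remote sensing datasets confirm our method's effectiveness in enhancing adaptability and segmentation.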

[10] PainFormer: a Vision Foundation Model for Automatic Pain Assessment

Stefanos Gkikas,Raul Fernandez Rojas,Manolis Tsiknakis

Main category: cs.CV

TL;DR: PainFormer is a vision foundation model based on multi-task learning for automatic pain assessment; it extracts high-quality embeddings from multimodal inputs and achieves state-of-the-art results in experiments.

Details

Motivation: Pain assessment is essential for developing effective pain management protocols, and automatic systems can provide continuous monitoring and decision support.

Method: PainFormer is trained with multi-task learning on 14 tasks/datasets totaling 10.9 million samples and is paired with the Embedding-Mixer module, which performs the final assessment.

Result: The model performs strongly across RGB, thermal, and depth video as well as physiological signals, and compares favorably with 73 existing methods.

Conclusion: PainFormer points toward general-purpose models for automatic pain assessment, with leading performance.

Abstract: Pain is a manifold condition that impacts a significant percentage of the population. Accurate and reliable pain evaluation for the people suffering is crucial to developing effective and advanced pain management protocols. Automatic pain assessment systems provide continuous monitoring and support decision-making processes, ultimately aiming to alleviate distress and prevent functionality decline. This study introduces PainFormer, a vision foundation model based on multi-task learning principles trained simultaneously on 14 tasks/datasets with a total of 10.9 million samples. Functioning as an embedding extractor for various input modalities, the foundation model provides feature representations to the Embedding-Mixer, a transformer-based module that performs the final pain assessment. Extensive experiments employing behavioral modalities (including RGB, synthetic thermal, and estimated depth videos) and physiological modalities such as ECG, EMG, GSR, and fNIRS revealed that PainFormer effectively extracts high-quality embeddings from diverse input modalities. The proposed framework is evaluated on two pain datasets, BioVid and AI4Pain, and directly compared to 73 different methodologies documented in the literature. Experiments conducted in unimodal and multimodal settings demonstrate state-of-the-art performances across modalities and pave the way toward general-purpose models for automatic pain assessment.

[11] Grounding Task Assistance with Multimodal Cues from a Single Demonstration

Gabriel Sarch,Balasaravanan Thoravi Kumaravel,Sahithya Ravi,Vibhav Vineet,Andrew D. Wilson

Main category: cs.CV

TL;DR: The MICA framework integrates eye gaze and speech cues to improve conversational agents for task assistance, compensating for what RGB video misses in fine-grained intent and user-specific cues.

Details

Motivation: RGB video fails to capture fine-grained intent, safety-critical environmental factors, and user preferences embedded in human behavior, which limits the reasoning of vision-language models.

Method: MICA uses gaze and speech cues to segment demonstrations into sub-tasks and extracts keyframes and captions, providing richer contextual grounding.

Result: Multimodal cues significantly improve response quality; gaze cues alone reach 93% of speech performance, and combining both works best.

Conclusion: Multimodal signals are valuable for real-world AI task assistance, and adaptable multimodal models are needed.

Abstract: A person's demonstration often serves as a key reference for others learning the same task. However, RGB video, the dominant medium for representing these demonstrations, often fails to capture fine-grained contextual cues such as intent, safety-critical environmental factors, and subtle preferences embedded in human behavior. This sensory gap fundamentally limits the ability of Vision Language Models (VLMs) to reason about why actions occur and how they should adapt to individual users. To address this, we introduce MICA (Multimodal Interactive Contextualized Assistance), a framework that improves conversational agents for task assistance by integrating eye gaze and speech cues. MICA segments demonstrations into meaningful sub-tasks and extracts keyframes and captions that capture fine-grained intent and user-specific cues, enabling richer contextual grounding for visual question answering. Evaluations on questions derived from real-time chat-assisted task replication show that multimodal cues significantly improve response quality over frame-based retrieval. Notably, gaze cues alone achieve 93% of speech performance, and their combination yields the highest accuracy. Task type determines the effectiveness of implicit (gaze) vs. explicit (speech) cues, underscoring the need for adaptable multimodal models. These results highlight the limitations of frame-based context and demonstrate the value of multimodal signals for real-world AI task assistance.

[12] TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

Jen-Hao Cheng,Vivian Wang,Huayu Wang,Huapeng Zhou,Yi-Hao Peng,Hou-I Liu,Hsiang-Wei Huang,Kuang-Ming Chen,Cheng-Yen Yang,Wenhao Chai,Yi-Ling Chen,Vibhav Vineet,Qin Cai,Jenq-Neng Hwang

Main category: cs.CV

TL;DR: TEMPURA is a two-stage training framework that strengthens video temporal understanding through masked event prediction and causal reasoning, combined with fine-grained segmentation and dense captioning.

Details

Motivation: Existing methods either reduce temporal resolution or obscure event boundaries, which limits the modeling of causal dependencies.

Method: TEMPURA is trained in two stages: masked event prediction with causal reasoning, followed by video segmentation and dense captioning.

Result: TEMPURA outperforms baseline models on temporal grounding and highlight detection tasks.

Conclusion: Integrating causal reasoning with fine-grained temporal segmentation improves video understanding.

Abstract: Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps. Experiments on temporal grounding and highlight detection benchmarks demonstrate that TEMPURA outperforms strong baseline models, confirming that integrating causal reasoning with fine-grained temporal segmentation leads to improved video understanding.

Dimitrios Dagdilelis,Panagiotis Grigoriadis,Roberto Galeazzi

Main category: cs.CV

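TL;DR: A cross-attention transformer-based method for multimodal sensor fusion builds a bird's-eye view of a vessel's surroundings, supporting safer autonomous marine navigation.

Details

Motivation: Fusing multi-view RGB and long-wave infrared images with sparse LiDAR point clouds, informed by X-band radar and electronic chart data, aims to improve navigational accuracy and robustness.

Method: A cross-attention transformer deeply fuses multi-view RGB and long-wave infrared images with sparse LiDAR point clouds; X-band radar and electronic chart data are integrated during training.

Result: The resulting bird's-eye view provides a detailed, reliable scene representation, and real-world sea trials confirm the method's effectiveness in adverse weather and complex maritime settings.

Conclusion: The method improves the safety and reliability of autonomous marine navigation and applies to a range of complex environments.

Abstract: We propose a cross-attention transformer-based method for multimodal sensor fusion to build a bird's-eye view of a vessel's surroundings, supporting safer autonomous marine navigation. The model deeply fuses multi-view RGB and long-wave infrared images with sparse LiDAR point clouds. Training also integrates X-band radar and electronic chart data to inform predictions. The resulting view provides a detailed, reliable scene representation, improving navigational accuracy and robustness. Real-world sea trials confirm the method's effectiveness even in adverse weather and complex maritime settings.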

[14] Toward Onboard AI-Enabled Solutions to Space Object Detection for Space Sustainability

Wenxuan Zhang,Peng Hu

Main category: cs.CV

TL;DR: The paper studies the feasibility of deep learning-based vision sensing for space object detection on low-Earth orbit satellites and proposes models combining SE layers, ViT, and GELAN that improve detection accuracy while reducing computational overhead.

Details

Motivation: With the rapid expansion of low-Earth orbit satellites, space object detection is essential for collision assessment and avoidance and requires high precision with low latency.

Method: Deep learning models based on the SE layer, ViT, and GELAN are proposed and evaluated on space object detection tasks.

Result: The proposed models reach mAP50 scores of up to 0.751 and mAP50:95 scores of up to 0.280; relative to the GELAN-t baseline, GELAN-ViT-SE also reduces computation and power consumption.

Conclusion: Models combining SE layers, ViT, and GELAN perform well on space object detection and offer an efficient option for future satellite missions.

Abstract: The rapid expansion of advanced low-Earth orbit (LEO) satellites in large constellations is positioning space assets as key to the future, enabling global internet access and relay systems for deep space missions. A solution to the challenge is effective space object detection (SOD) for collision assessment and avoidance. In SOD, an LEO satellite must detect other satellites and objects with high precision and minimal delay. This paper investigates the feasibility and effectiveness of employing vision sensors for SOD tasks based on deep learning (DL) models. It introduces models based on the Squeeze-and-Excitation (SE) layer, Vision Transformer (ViT), and the Generalized Efficient Layer Aggregation Network (GELAN) and evaluates their performance under SOD scenarios. Experimental results show that the proposed models achieve mean average precision at intersection over union threshold 0.5 (mAP50) scores of up to 0.751 and mean average precision averaged over intersection over union thresholds from 0.5 to 0.95 (mAP50:95) scores of up to 0.280. Compared to the baseline GELAN-t model, the proposed GELAN-ViT-SE model increases the average mAP50 from 0.721 to 0.751, improves the mAP50:95 from 0.266 to 0.274, reduces giga floating point operations (GFLOPs) from 7.3 to 5.6, and lowers peak power consumption from 2080.7 mW to 2028.7 mW (a 2.5% reduction).
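
Two quantities in this abstract lend themselves to a short worked example: the intersection over union (IoU) that underlies the mAP50 threshold, and the relative reductions implied by the reported GFLOPs and power figures. The sketch below uses the standard axis-aligned IoU definition; the box coordinates and function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Hypothetical prediction and ground-truth boxes overlapping by half their width.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
counts_at_map50 = overlap >= 0.5  # a detection needs IoU >= 0.5 to count for mAP50

power_drop = relative_reduction(2080.7, 2028.7)  # mW figures from the abstract
gflops_drop = relative_reduction(7.3, 5.6)       # GFLOPs figures from the abstract
print(f"power: {power_drop:.1%}, GFLOPs: {gflops_drop:.1%}")
```

The power figures reproduce the 2.5% reduction stated in the abstract, and the GFLOPs figures correspond to a reduction of roughly 23%.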

[15] A Novel WaveInst-based Network for Tree Trunk Structure Extraction and Pattern Analysis in Forest Inventory

Chenyang Fan,Xujie Zhu,Taige Luo,Sheng Xu,Zhulin Chen,Hongxin Yang

Main category: cs.CV

TL;DR: A new method based on the WaveInst instance segmentation framework extracts tree structure information from complex backgrounds and performs well on several datasets.

Details

Motivation: Existing LiDAR-based and UAV-based techniques for tree structure extraction are either costly or miss structural information.

Method: A discrete wavelet transform enhances multi-scale edge information within the WaveInst instance segmentation framework.

Result: The method performs well on several datasets, surpasses the existing state of the art by 9.9 in mean average precision, and extracts tree growth parameters directly from 2D images.

Conclusion: The work provides scientific data for tree phenotype research and supports applications in precision forestry and ecological monitoring.

Abstract: The pattern analysis of tree structure holds significant scientific value for genetic breeding and forestry management. Current trunk and branch extraction technologies are mainly LiDAR-based or UAV-based. LiDAR-based approaches obtain high-precision three-dimensional (3D) data, but the equipment is costly and the 3D data processing is complex. UAV-based approaches capture canopy information efficiently, but they miss the 3D structure of trees. To extract branch information despite complex background interference and occlusion, this work proposes WaveInst, a novel instance segmentation framework that uses a discrete wavelet transform to enhance multi-scale edge information and thereby improve tree structure extraction. Experimental results of the proposed model show superior performance on SynthTree43k, CaneTree100, Urban Street and our PoplarDataset. Moreover, we present a new phenotypic dataset, PoplarDataset, which is dedicated to tree structure extraction and pattern analysis in artificial forests. The proposed method achieves a mean average precision of 49.6 and 24.3 for the structure extraction of mature and juvenile trees, respectively, surpassing the existing state-of-the-art method by 9.9. Furthermore, by integrating the segmentation model within the regression model, we accurately obtain important tree growth parameters, such as the location of trees, the diameter at breast height of individual trees, and the plant height, directly from 2D images. This study provides scientific and plentiful data for tree structure analysis in phenotype research, offering a platform for applications in precision forestry, ecological monitoring, and intelligent breeding.
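
Why a discrete wavelet transform helps with edges can be seen in a one-dimensional sketch. The example assumes a single-level Haar wavelet, which the abstract does not specify, and the pixel row is hypothetical; it is not the WaveInst architecture itself.

```python
import math


def haar_dwt_1d(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


# Hypothetical row of pixel intensities: dark background, then a bright trunk.
row = [10, 10, 10, 200, 200, 200, 200, 200]
approx, detail = haar_dwt_1d(row)
```

The detail coefficients are zero in flat regions and large only where the intensity jumps, so they act as an edge signal. Applying the transform again to `approx` gives the same information at a coarser scale, which is the sense in which wavelets expose multi-scale edges.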

[16] Soft-Masked Semi-Dual Optimal Transport for Partial Domain Adaptation

Yi-Ming Zhai,Chuan-Xian Ren,Hong Yan

Main category: cs.CV

TL;DR: The paper proposes Soft-masked Semi-dual Optimal Transport (SSOT) for partial domain adaptation (PDA), which estimates class weights and builds a soft-masked transport distance matrix to learn feature representations across domains.

Details

Motivation: In partial domain adaptation the target label space is a subset of the source label space, so both domain shift and mismatched label spaces must be handled to learn domain-invariant representations.

Method: SSOT estimates class weights to build a reweighted source domain, uses a soft-masked transport distance matrix to strengthen class-oriented representation, and optimizes the semi-dual optimal transport formulation with neural networks.

Result: Experiments on four benchmark datasets demonstrate the effectiveness of SSOT.

Conclusion: By combining optimal transport with neural networks, SSOT addresses the PDA problem and improves cross-domain learning.

Abstract: Visual domain adaptation aims to learn discriminative and domain-invariant representation for an unlabeled target domain by leveraging knowledge from a labeled source domain. Partial domain adaptation (PDA) is a general and practical scenario in which the target label space is a subset of the source one. The challenges of PDA exist due to not only domain shift but also the non-identical label spaces of domains. In this paper, a Soft-masked Semi-dual Optimal Transport (SSOT) method is proposed to deal with the PDA problem. Specifically, the class weights of domains are estimated, and then a reweighted source domain is constructed, which is favorable in conducting class-conditional distribution matching with the target domain. A soft-masked transport distance matrix is constructed by category predictions, which will enhance the class-oriented representation ability of optimal transport in the shared feature space. To deal with large-scale optimal transport problems, the semi-dual formulation of the entropy-regularized Kantorovich problem is employed since it can be optimized by gradient-based algorithms. Further, a neural network is exploited to approximate the Kantorovich potential due to its strong fitting ability. This network parametrization also allows the generalization of the dual variable outside the supports of the input distribution. The SSOT model is built upon neural networks, which can be optimized alternately in an end-to-end manner. Extensive experiments are conducted on four benchmark datasets to demonstrate the effectiveness of SSOT.
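
The remark that the semi-dual formulation "can be optimized by gradient-based algorithms" can be made concrete with a small sketch for discrete measures. It assumes the commonly used semi-dual objective F(v) = Σ_j b_j v_j − ε Σ_i a_i log Σ_j b_j exp((v_j − C_ij)/ε) and replaces the paper's neural-network potential with a plain vector `v`; the weights, cost matrix, step size, and function names are hypothetical, and the soft mask and class reweighting are omitted.

```python
import math


def semi_dual_gradient(v, a, b, cost, eps):
    """Gradient of the entropic semi-dual objective with respect to the potential v."""
    grad = list(b)
    for i, a_i in enumerate(a):
        # Softmax over targets j of (v_j - C_ij) / eps, weighted by b_j.
        logits = [math.log(b[j]) + (v[j] - cost[i][j]) / eps for j in range(len(b))]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        for j, w in enumerate(weights):
            grad[j] -= a_i * w / z
    return grad


def solve_semi_dual(a, b, cost, eps=0.5, step=0.5, iters=2000):
    """Plain gradient ascent on the concave semi-dual objective."""
    v = [0.0] * len(b)
    for _ in range(iters):
        g = semi_dual_gradient(v, a, b, cost, eps)
        v = [v_j + step * g_j for v_j, g_j in zip(v, g)]
    return v


# Hypothetical problem: three source points, two target points.
a = [1 / 3, 1 / 3, 1 / 3]
b = [0.7, 0.3]
cost = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
v = solve_semi_dual(a, b, cost)
residual = max(abs(g) for g in semi_dual_gradient(v, a, b, cost, eps=0.5))
```

The gradient is the gap between the target weights `b` and the column marginal of the transport plan implied by `v`, so a vanishing `residual` means the plan matches the target distribution. Only the target-side potential is optimized, which is what makes the formulation "semi-dual".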

[17] Automated ARAT Scoring Using Multimodal Video Analysis, Multi-View Fusion, and Hierarchical Bayesian Models: A Clinician Study

Tamim Ahmed,Thanassis Rikakis

Main category: cs.CV

TL;DR: An automated ARAT scoring system based on multimodal video analysis combines SlowFast, I3D, and Transformer models with multi-view data and hierarchical Bayesian models to improve scoring accuracy and interpretability.

Details

Motivation: Manual ARAT scoring is time-consuming and variable, so an automated solution is needed to improve efficiency and consistency.

Method: OpenPose keypoints and object locations are combined across multiple views (ipsilateral, contralateral, and top) using early and late fusion, and hierarchical Bayesian models infer movement quality.

Result: On a stroke rehabilitation dataset, late fusion reaches 89.0% validation accuracy, and the hierarchical Bayesian models align closely with manual assessments.

Conclusion: The framework offers a scalable, clinically validated solution for automated rehabilitation assessment.

Abstract: Manual scoring of the Action Research Arm Test (ARAT) for upper extremity assessment in stroke rehabilitation is time-intensive and variable. We propose an automated ARAT scoring system integrating multimodal video analysis with SlowFast, I3D, and Transformer-based models using OpenPose keypoints and object locations. Our approach employs multi-view data (ipsilateral, contralateral, and top perspectives), applying early and late fusion to combine features across views and models. Hierarchical Bayesian Models (HBMs) infer movement quality components, enhancing interpretability. A clinician dashboard displays task scores, execution times, and quality assessments. We conducted a study with five clinicians who reviewed 500 video ratings generated by our system, providing feedback on its accuracy and usability. Evaluated on a stroke rehabilitation dataset, our framework achieves 89.0% validation accuracy with late fusion, with HBMs aligning closely with manual assessments. This work advances automated rehabilitation by offering a scalable, interpretable solution with clinical validation.
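
Late fusion across camera views can be sketched in a few lines. The example assumes the simplest variant, averaging per-view class probabilities before taking the most likely score; the abstract does not say which fusion rule the authors use, and the probability values are hypothetical.

```python
def late_fusion(view_probs):
    """Average per-view class probabilities; return (fused_probs, predicted_class)."""
    num_classes = len(view_probs[0])
    fused = [sum(p[c] for p in view_probs) / len(view_probs) for c in range(num_classes)]
    return fused, max(range(num_classes), key=fused.__getitem__)


# Hypothetical softmax outputs over ARAT item scores 0-3 from three camera views.
ipsilateral = [0.05, 0.15, 0.60, 0.20]
contralateral = [0.10, 0.40, 0.35, 0.15]
top = [0.05, 0.20, 0.55, 0.20]

fused, score = late_fusion([ipsilateral, contralateral, top])
```

Each view's model makes its own prediction and only the outputs are combined, whereas early fusion would merge the features before a single model sees them. Here the contralateral view alone favors score 1, but the fused prediction is score 2.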

[18] Topology-Aware CLIP Few-Shot Learning

Dazhi Huang

Main category: cs.CV

TL;DR: The paper proposes a topology-aware tuning approach that integrates RTD into the TR framework to improve the few-shot performance of vision-language models.

Details

Motivation: Existing adaptation methods for vision-language models overlook structural information in the latent space, leaving the balance between pre-trained knowledge and task-specific adaptation unresolved.

Method: A combined RTD and cross-entropy loss aligns the topological structures of visual and text representations while the base model encoders stay frozen and only lightweight Task Residual parameters are optimized.

Result: Across 6 benchmark datasets, average accuracy improves by 1-2% over baseline methods.

Conclusion: Topological alignment effectively improves the few-shot learning ability of vision-language models.

Abstract: Efficiently adapting large Vision-Language Models (VLMs) like CLIP for few-shot learning poses challenges in balancing pre-trained knowledge retention and task-specific adaptation. Existing methods often overlook valuable structural information within the VLM's latent space. We introduce a topology-aware tuning approach integrating Representation Topology Divergence (RTD) into the Task Residual (TR) framework. By explicitly aligning the topological structures of visual and text representations using a combined RTD and Cross-Entropy loss, while freezing base VLM encoders, our method enhances few-shot performance. We optimize only lightweight Task Residual parameters, effectively leveraging topological information. Across 6 diverse benchmark datasets, our approach demonstrates significant gains, achieving an average accuracy improvement of 1-2% over relevant baseline methods in few-shot settings. This work presents an effective strategy to boost VLM few-shot capabilities by incorporating topological alignment.

[19] Component-Based Fairness in Face Attribute Classification with Bayesian Network-informed Meta Learning

Yifan Liu,Ruichen Yao,Yaokun Liu,Ruohan Zong,Zelin Li,Yang Zhang,Dong Wang

Main category: cs.CV

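TL;DR: This paper proposes BNMR, a new method that optimizes face component fairness using a Bayesian network and meta-learning, addresses label scarcity and attribute inter-dependencies, and outperforms existing methods in experiments.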

Details

Motivation: Face recognition technology is widely deployed, but existing research focuses mainly on demographic fairness and overlooks the fairness of biological face components. This paper is the first to explore fairness defined by biological features. Method: Proposes BNMR, which combines a Bayesian Network calibrator with meta-learning-based sample reweighting to dynamically track model bias and encode prior probabilities. Result: Experiments show that BNMR outperforms existing methods and that face component fairness has a positive effect on demographic fairness. Conclusion: Face component fairness can serve as a surrogate objective for demographic fairness, opening new directions for related research. Abstract: The widespread integration of face recognition technologies into various applications (e.g., access control and personalized advertising) necessitates a critical emphasis on fairness. While previous efforts have focused on demographic fairness, the fairness of individual biological face components remains unexplored. In this paper, we focus on face component fairness, a fairness notion defined by biological face features. To our best knowledge, our work is the first work to mitigate bias of face attribute prediction at the biological feature level. In this work, we identify two key challenges in optimizing face component fairness: attribute label scarcity and attribute inter-dependencies, both of which limit the effectiveness of bias mitigation from previous approaches. To address these issues, we propose Bayesian Network-informed Meta Reweighting (BNMR), which incorporates a Bayesian Network calibrator to guide an adaptive meta-learning-based sample reweighting process. During the training process of our approach, the Bayesian Network calibrator dynamically tracks model bias and encodes prior probabilities for face component attributes to overcome the above challenges. To demonstrate the efficacy of our approach, we conduct extensive experiments on a large-scale real-world human face dataset. Our results show that BNMR is able to consistently outperform recent face bias mitigation baselines. Moreover, our results suggest a positive impact of face component fairness on the commonly considered demographic fairness (e.g., gender).
Our findings pave the way for new research avenues on face component fairness, suggesting that face component fairness could serve as a potential surrogate objective for demographic fairness. The code for our work is publicly available at https://github.com/yliuaa/BNMR-FairCompFace.git.

[20] Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings

Alexander Davis,Rafael Souza,Jia-Hao Lim

Main category: cs.CV

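TL;DR: CXR-TextInter uses text-centric large language models (LLMs) to interpret chest X-rays (CXR), improving performance through structured textual representations and a medical knowledge module, and performs strongly across multiple tasks.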

Details

Motivation: The potential of current multimodal foundation models for CXR interpretation has not been fully realized, and how to use LLMs effectively for this visual task remains to be explored. Method: Proposes the CXR-TextInter framework, which converts CXR content into a structured textual representation and adds a medical knowledge module to enhance clinical reasoning. Result: On the CXR-ClinEval benchmark, CXR-TextInter performs best in pathology detection, report generation, and visual question answering. Conclusion: CXR-TextInter demonstrates the potential of using LLMs for medical image AI when visual information is structured and domain knowledge is integrated. Abstract: Automated interpretation of chest X-rays (CXR) is a critical task with the potential to significantly improve clinical workflow and patient care. While recent advances in multimodal foundation models have shown promise, effectively leveraging the full power of large language models (LLMs) for this visual task remains an underexplored area. This paper introduces CXR-TextInter, a novel framework that repurposes powerful text-centric LLMs for CXR interpretation by operating solely on a rich, structured textual representation of the image content, generated by an upstream image analysis pipeline. We augment this LLM-centric approach with an integrated medical knowledge module to enhance clinical reasoning. To facilitate training and evaluation, we developed the MediInstruct-CXR dataset, containing structured image representations paired with diverse, clinically relevant instruction-response examples, and the CXR-ClinEval benchmark for comprehensive assessment across various interpretation tasks. Extensive experiments on CXR-ClinEval demonstrate that CXR-TextInter achieves state-of-the-art quantitative performance across pathology detection, report generation, and visual question answering, surpassing existing multimodal foundation models. Ablation studies confirm the critical contribution of the knowledge integration module. Furthermore, blinded human evaluation by board-certified radiologists shows a significant preference for the clinical quality of outputs generated by CXR-TextInter. Our work validates an alternative paradigm for medical image AI, showcasing the potential of harnessing advanced LLM capabilities when visual information is effectively structured and domain knowledge is integrated.

[21] Vision and Intention Boost Large Language Model in Long-Term Action Anticipation

Congqi Cao,Lanshu Hu,Yating Yu,Yanning Zhang

Main category: cs.CV

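TL;DR: Proposes ICVL, a multimodal method combining vision and language models that infers behavioral intentions, fuses them with visual features, and uses an LLM to predict future actions, significantly outperforming existing methods.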

Details

Motivation: Existing methods rely on either video data or text inputs alone; the former lack prior knowledge and the latter suffer severe information loss. ICVL aims to combine visual semantics with LLM reasoning to overcome the limitations of single-modality methods. Method: 1. Uses a VLM to infer behavioral intentions (as textual features) from video; 2. Fuses the intentions with visual features through multi-modality fusion; 3. Feeds the result, along with textual prompts, into an LLM to predict actions; 4. Proposes an example selection strategy based on visual and textual similarity. Result: Achieves state-of-the-art performance on the Ego4D, EPIC-Kitchens-55, and EGTEA GAZE+ datasets. Conclusion: Through multimodal fusion and intention inference, ICVL significantly improves the accuracy and robustness of long-term action anticipation. Abstract: Long-term action anticipation (LTA) aims to predict future actions over an extended period. Previous approaches primarily focus on learning exclusively from video data but lack prior knowledge. Recent researches leverage large language models (LLMs) by utilizing text-based inputs which suffer severe information loss. To tackle these limitations single-modality methods face, we propose a novel Intention-Conditioned Vision-Language (ICVL) model in this study that fully leverages the rich semantic information of visual data and the powerful reasoning capabilities of LLMs. Considering intention as a high-level concept guiding the evolution of actions, we first propose to employ a vision-language model (VLM) to infer behavioral intentions as comprehensive textual features directly from video inputs. The inferred intentions are then fused with visual features through a multi-modality fusion strategy, resulting in intention-enhanced visual representations. These enhanced visual representations, along with textual prompts, are fed into LLM for future action anticipation. Furthermore, we propose an effective example selection strategy jointly considers visual and textual similarities, providing more relevant and informative examples for in-context learning. Extensive experiments with state-of-the-art performance on Ego4D, EPIC-Kitchens-55, and EGTEA GAZE+ datasets fully demonstrate the effectiveness and superiority of the proposed method.
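
The example selection strategy described above, which jointly considers visual and textual similarity, could look roughly like the following sketch. The abstract does not say how the two similarities are combined, so the weighted sum of cosine similarities, the `w_visual` weight, and the dictionary layout of the examples are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_examples(query, pool, k=2, w_visual=0.5):
    """Return ids of the k pool examples most similar to the query.

    Hypothetical scoring: a weighted sum of visual and textual cosine
    similarity; the paper's actual rule may differ.
    """
    scored = []
    for example in pool:
        score = (
            w_visual * cosine(query["visual"], example["visual"])
            + (1.0 - w_visual) * cosine(query["text"], example["text"])
        )
        scored.append((score, example["id"]))
    scored.sort(reverse=True)  # highest combined similarity first
    return [example_id for _, example_id in scored[:k]]


# Toy 2-D embeddings, made up for the example.
query = {"visual": [1.0, 0.0], "text": [1.0, 0.0]}
pool = [
    {"id": "a", "visual": [1.0, 0.0], "text": [0.0, 1.0]},
    {"id": "b", "visual": [1.0, 1.0], "text": [1.0, 1.0]},
    {"id": "c", "visual": [0.0, 1.0], "text": [0.0, 1.0]},
]
print(select_examples(query, pool))
```

The selected examples would then be placed in the LLM prompt as in-context demonstrations.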

[22] Probabilistic Interactive 3D Segmentation with Hierarchical Neural Processes

Jie Liu,Pan Zhou,Zehao Xiao,Jiayi Shen,Wenzhe Yin,Jan-Jakob Sonke,Efstratios Gavves

Main category: cs.CV

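TL;DR: NPISeg3D is a probabilistic framework based on Neural Processes that addresses generalization from sparse clicks and predictive uncertainty quantification in interactive 3D segmentation through a hierarchical latent variable structure and a probabilistic prototype modulator.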

Details

Motivation: Interactive 3D segmentation must produce accurate segmentations from sparse user clicks and quantify predictive uncertainty to identify unreliable regions. Method: NPISeg3D uses a hierarchical latent variable structure (scene-specific and object-specific variables) and a probabilistic prototype modulator to strengthen few-shot generalization and uncertainty quantification. Result: Experiments on four 3D point cloud datasets show that NPISeg3D achieves better segmentation performance with fewer clicks and provides reliable uncertainty estimates. Conclusion: Through its probabilistic framework and hierarchical structure, NPISeg3D effectively addresses key challenges in interactive 3D segmentation. Abstract: Interactive 3D segmentation has emerged as a promising solution for generating accurate object masks in complex 3D scenes by incorporating user-provided clicks. However, two critical challenges remain underexplored: (1) effectively generalizing from sparse user clicks to produce accurate segmentation, and (2) quantifying predictive uncertainty to help users identify unreliable regions. In this work, we propose NPISeg3D, a novel probabilistic framework that builds upon Neural Processes (NPs) to address these challenges. Specifically, NPISeg3D introduces a hierarchical latent variable structure with scene-specific and object-specific latent variables to enhance few-shot generalization by capturing both global context and object-specific characteristics. Additionally, we design a probabilistic prototype modulator that adaptively modulates click prototypes with object-specific latent variables, improving the model's ability to capture object-aware context and quantify predictive uncertainty. Experiments on four 3D point cloud datasets demonstrate that NPISeg3D achieves superior segmentation performance with fewer clicks while providing reliable uncertainty estimations.

[23] PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth

Bu Jin,Weize Li,Baihan Yang,Zhenxin Zhu,Junpeng Jiang,Huan-ang Gao,Haiyang Sun,Kun Zhan,Hengtong Hu,Xueyang Zhang,Peng Jia,Hao Zhao

Main category: cs.CV

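TL;DR: PosePilot is a lightweight framework that improves camera pose controllability in generative world models through self-supervised depth estimation.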

Details

Motivation: Addresses the challenge of precise and flexible camera pose control in autonomous driving systems in order to simulate scene dynamics more accurately. Method: Combines self-supervised depth and pose readouts, and refines camera pose estimation with a photometric warping loss and a reverse warping step. Result: Significantly improves structural understanding and motion reasoning on autonomous driving and general-domain video datasets. Conclusion: PosePilot sets a new benchmark for camera pose control in generative world models, enabling physically consistent, reliable viewpoint synthesis. Abstract: Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
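
The pose-aware warping that the abstract above refers to rests on a standard geometric step from self-supervised depth estimation: back-project a pixel with its depth, move it by the relative camera pose, and project it into the other frame. The sketch below shows that step for a single pixel under a pinhole camera model, plus a plain L1 photometric error. It is a generic illustration, not PosePilot's implementation; the function names and the choice of an L1 error are assumptions.

```python
def reproject_pixel(u, v, depth, intrinsics, rotation, translation):
    """Map pixel (u, v) with known depth into another camera's image plane.

    intrinsics: (fx, fy, cx, cy) of a pinhole camera (assumed shared by both views).
    rotation: 3x3 nested list, translation: length-3 list (source -> target).
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the source camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the relative camera motion.
    xt, yt, zt = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    # Project back to pixel coordinates in the target camera.
    return fx * xt / zt + cx, fy * yt / zt + cy


def photometric_l1(target_pixels, warped_pixels):
    """Mean absolute intensity difference between target and warped samples."""
    return sum(abs(a - b) for a, b in zip(target_pixels, warped_pixels)) / len(target_pixels)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (100.0, 100.0, 50.0, 50.0)
# With a 0.1-unit sideways translation and depth 2, the pixel shifts by fx * 0.1 / 2 = 5.
print(reproject_pixel(60.0, 40.0, 2.0, K, identity, [0.1, 0.0, 0.0]))
```

In a training loop the warped frame would be sampled at the reprojected coordinates and compared with the real target frame, so that errors in depth or pose show up as photometric error.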

[24] Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes

Seong Hyeon Park,Jinwoo Shin

Main category: cs.CV

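TL;DR: The paper proposes a new model, MMP, that estimates the 3D geometry of dynamic scenes in a feed-forward manner and significantly improves the quality of dynamic pointmap prediction.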

Details

Motivation: Estimating the 3D geometry of dynamic scenes from monocular video is a fundamental challenge. Existing models predict only partial attributes (such as depth or pointmaps), these are noticeably noisy across multiple frames, and global optimization is costly and prone to failure. Method: Building on a Siamese architecture, introduces a trajectory encoding module that projects point-wise dynamics onto the representation of each frame, improving expressiveness for dynamic scenes. Result: MMP achieves state-of-the-art quality in feed-forward pointmap prediction, with regression error reduced by 15.1%. Conclusion: With its improved dynamic representation, MMP effectively addresses the challenge of geometry estimation in dynamic scenes and significantly improves performance. Abstract: In monocular videos that capture dynamic scenes, estimating the 3D geometry of video contents has been a fundamental challenge in computer vision. Specifically, the task is significantly challenged by the object motion, where existing models are limited to predict only partial attributes of the dynamic scenes, such as depth or pointmaps spanning only over a pair of frames. Since these attributes are inherently noisy under multiple frames, test-time global optimizations are often employed to fully recover the geometry, which is liable to failure and incurs heavy inference costs. To address the challenge, we present a new model, coined MMP, to estimate the geometry in a feed-forward manner, which produces a dynamic pointmap representation that evolves over multiple frames. Specifically, based on the recent Siamese architecture, we introduce a new trajectory encoding module to project point-wise dynamics on the representation for each frame, which can provide significantly improved expressiveness for dynamic scenes. In our experiments, we find MMP can achieve state-of-the-art quality in feed-forward pointmap prediction, e.g., 15.1% enhancement in the regression error.

[25] An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding

Siyang Jiang,Bufang Yang,Lilin Xu,Mu Yuan,Yeerzhati Abudunuer,Kaiwei Liu,Liekang Zeng,Hongkai Chen,Zhenyu Yan,Xiaofan Jiang,Guoliang Xing

Main category: cs.CV

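TL;DR: The paper proposes Llambda, a system that uses limited labeled data and a large amount of unlabeled data, together with contrastive learning and physical-knowledge-guided captioning, to improve the ability of large vision language models (LVLMs) to understand low-resolution videos.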

Details

Motivation: Existing large vision language models are designed mainly for high-resolution data (such as RGB images) and struggle to understand low-resolution data (such as depth, thermal, and infrared). Conventional approaches require extensive manual annotation, which is costly. Method: 1. Proposes a Contrastive-Oriented Data Labeler that generates high-quality pseudo labels via contrastive learning; 2. Proposes a Physical-Knowledge Guided Captioner that uses spatial and temporal consistency checks to reduce pseudo-label errors; 3. Uses LoRA-based efficient fine-tuning to adapt to low-resolution data. Result: On a region-scale real-world testbed and three low-resolution datasets, Llambda outperforms existing state-of-the-art systems by up to 40.03% in average Bert-Score. Conclusion: Through its labeling and fine-tuning methods, Llambda significantly improves LVLM understanding of low-resolution data while reducing the need for manual annotation. Abstract: The rapid advancements in Large Vision Language Models (LVLMs) offer the potential to surpass conventional labeling by generating richer, more detailed descriptions of on-device human behavior understanding (HBU) in low-resolution vision systems, such as depth, thermal, and infrared. However, existing large vision language model (LVLM) approaches are unable to understand low-resolution data well as they are primarily designed for high-resolution data, such as RGB images. A quick fixing approach is to caption a large amount of low-resolution data, but it requires a significant amount of labor-intensive annotation efforts. In this paper, we propose a novel, labor-saving system, Llambda, designed to support low-resolution HBU. The core idea is to leverage limited labeled data and a large amount of unlabeled data to guide LLMs in generating informative captions, which can be combined with raw data to effectively fine-tune LVLM models for understanding low-resolution videos in HBU. First, we propose a Contrastive-Oriented Data Labeler, which can capture behavior-relevant information from long, low-resolution videos and generate high-quality pseudo labels for unlabeled data via contrastive learning. Second, we propose a Physical-Knowledge Guided Captioner, which utilizes spatial and temporal consistency checks to mitigate errors in pseudo labels. Therefore, it can improve LLMs' understanding of sequential data and then generate high-quality video captions. Finally, to ensure on-device deployability, we employ LoRA-based efficient fine-tuning to adapt LVLMs for low-resolution data.
We evaluate Llambda using a region-scale real-world testbed and three distinct low-resolution datasets, and the experiments show that Llambda outperforms several state-of-the-art LVLM systems up to 40.03% on average Bert-Score.
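
For readers unfamiliar with the LoRA-based fine-tuning mentioned above, the following sketch shows the core idea in plain Python: the frozen weight matrix W is left untouched and a low-rank product B·A, scaled by alpha / r, is added to it. This is the generic LoRA formulation, not Llambda's specific configuration; the helper names, matrix sizes, and numbers are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(W, A, B, alpha):
    """Effective weight W + (alpha / r) * B @ A.

    W: d_out x d_in (frozen), A: r x d_in, B: d_out x r (both trainable).
    """
    r = len(A)  # LoRA rank is the number of rows of A
    delta = matmul(B, A)
    scale = alpha / r
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]


def lora_trainable_params(d_out, d_in, r):
    """Trainable parameter count of the adapter, versus d_out * d_in for full tuning."""
    return r * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]          # rank r = 1
B_zero = [[0.0], [0.0]]   # B is typically initialised to zero
print(lora_weight(W, A, B_zero, alpha=2.0))  # unchanged W at initialisation
```

Because only A and B are trained, the number of updated parameters grows with r rather than with the full matrix size, which is what makes the approach attractive for on-device adaptation.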

[26] Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

Xingqun Qi,Yatian Wang,Hengyuan Zhang,Jiahao Pan,Wei Xue,Shanghang Zhang,Wenhan Luo,Qifeng Liu,Yike Guo

Main category: cs.CV

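TL;DR: Proposes Co$^3$Gesture, a new framework for generating concurrent gestures in two-person interactive conversations, and builds a large-scale dataset, GES-Inter.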

Details

Motivation: Existing methods support gesture generation only for a single self-talking speaker, overlook the practical value of modeling concurrent gestures in two-person interactive conversations, and lack high-quality datasets. Method: Builds the GES-Inter dataset and proposes the Co$^3$Gesture framework, which comprises two generation branches and a Temporal Interaction Module (TIM), and uses a mutual attention mechanism to improve the coordination of interactive gestures. Result: Experiments show that the method outperforms existing models on the GES-Inter dataset and generates vivid, coherent interactive gestures. Conclusion: Co$^3$Gesture and GES-Inter fill a gap in two-person interactive gesture generation and provide data and tools for related research. Abstract: Generating gestures from human speech has gained tremendous progress in animating virtual avatars. While the existing methods enable synthesizing gestures cooperated by individual self-talking, they overlook the practicality of concurrent gesture modeling with two-person interactive conversations. Moreover, the lack of high-quality datasets with concurrent co-speech gestures also limits handling this issue. To fulfill this goal, we first construct a large-scale concurrent co-speech gesture dataset that contains more than 7M frames for diverse two-person interactive posture sequences, dubbed GES-Inter. Additionally, we propose Co$^3$Gesture, a novel framework that enables coherent concurrent co-speech gesture synthesis including two-person interactive movements. Considering the asymmetric body dynamics of two speakers, our framework is built upon two cooperative generation branches conditioned on separated speaker audio. Specifically, to enhance the coordination of human postures with respect to corresponding speaker audios while interacting with the conversational partner, we present a Temporal Interaction Module (TIM). TIM can effectively model the temporal association representation between two speakers' gesture sequences as interaction guidance and fuse it into the concurrent gesture generation. Then, we devise a mutual attention mechanism to further holistically boost learning dependencies of interacted concurrent motions, thereby enabling us to generate vivid and coherent gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected GES-Inter dataset.
The dataset and source code are publicly available at https://mattie-e.github.io/Co3/.

[27] Multimodal Graph Representation Learning for Robust Surgical Workflow Recognition with Adversarial Feature Disentanglement

Long Bai,Boyi Ma,Ruohan Wang,Guankun Wang,Beilei Cui,Zhongliang Jiang,Mobarakol Islam,Zhe Min,Jiewen Lai,Nassir Navab,Hongliang Ren

Main category: cs.CV

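TL;DR: The paper proposes a graph-based multimodal method (GRAD) that combines vision and kinematic data to improve the accuracy and robustness of surgical workflow recognition, particularly under data corruption or domain shift.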

Details

Motivation: Surgical workflow recognition is vital for automating tasks, supporting decisions, and training novice surgeons, but data corruption (such as occlusion or storage problems) degrades performance. Method: Proposes GRAD, which includes a Multimodal Disentanglement Graph Network that models the relationships between vision and kinematic data through a graph, and uses adversarial training to reduce modality gaps. A Contextual Calibrated Decoder is also designed to enhance robustness. Result: Experiments show that the method performs well under data corruption and domain shift, with high stability and robustness. Conclusion: The method offers an effective solution for workflow recognition in complex, dynamic surgical scenes and advances automated surgical technology. Abstract: Surgical workflow recognition is vital for automating tasks, supporting decision-making, and training novice surgeons, ultimately improving patient safety and standardizing procedures. However, data corruption can lead to performance degradation due to issues like occlusion from bleeding or smoke in surgical scenes and problems with data storage and transmission. In this case, we explore a robust graph-based multimodal approach to integrating vision and kinematic data to enhance accuracy and reliability. Vision data captures dynamic surgical scenes, while kinematic data provides precise movement information, overcoming limitations of visual recognition under adverse conditions. We propose a multimodal Graph Representation network with Adversarial feature Disentanglement (GRAD) for robust surgical workflow recognition in challenging scenarios with domain shifts or corrupted data. Specifically, we introduce a Multimodal Disentanglement Graph Network that captures fine-grained visual information while explicitly modeling the complex relationships between vision and kinematic embeddings through graph-based message modeling. To align feature spaces across modalities, we propose a Vision-Kinematic Adversarial framework that leverages adversarial training to reduce modality gaps and improve feature consistency. Furthermore, we design a Contextual Calibrated Decoder, incorporating temporal and contextual priors to enhance robustness against domain shifts and corrupted data. Extensive comparative and ablation experiments demonstrate the effectiveness of our model and proposed modules.
Moreover, our robustness experiments show that our method effectively handles data corruption during storage and transmission, exhibiting excellent stability and robustness. Our approach aims to advance automated surgical workflow recognition, addressing the complexities and dynamism inherent in surgical procedures.

[28] Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos

Markos Stamatakis,Joshua Berger,Christian Wartena,Ralph Ewerth,Anett Hoppe

Main category: cs.CV

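TL;DR: The study examines the ability of vision-language models to generate learning-oriented questions for educational videos, assessing out-of-the-box model performance, fine-tuning effects, the impact of video modalities, and question quality, and outlines future research directions.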

Details

Motivation: To improve user engagement and knowledge retention for educational videos by automatically generating questions that activate learners and help assess understanding. Method: Evaluates vision-language models on question generation for educational videos, covering out-of-the-box model performance, fine-tuning effects, the impact of video modalities, and a qualitative study of question quality. Result: Finds that current models need fine-tuning and that question diversity and relevance remain challenging; identifies requirements for future multimodal datasets and research directions. Conclusion: Vision-language models show potential for question generation on educational videos but need further optimization and larger datasets. Abstract: Web-based educational videos offer flexible learning opportunities and are becoming increasingly popular. However, improving user engagement and knowledge retention remains a challenge. Automatically generated questions can activate learners and support their knowledge acquisition. Further, they can help teachers and learners assess their understanding. While large language and vision-language models have been employed in various tasks, their application to question generation for educational videos remains underexplored. In this paper, we investigate the capabilities of current vision-language models for generating learning-oriented questions for educational video content. We assess (1) out-of-the-box models' performance; (2) fine-tuning effects on content-specific question generation; (3) the impact of different video modalities on question quality; and (4) in a qualitative study, question relevance, answerability, and difficulty levels of generated questions. Our findings delineate the capabilities of current vision-language models, highlighting the need for fine-tuning and addressing challenges in question diversity and relevance. We identify requirements for future multimodal datasets and outline promising research directions.

[29] AquaGS: Fast Underwater Scene Reconstruction with SfM-Free Gaussian Splatting

Junhao Shi,Jisheng Xu,Jianping He,Zhiliang Lin

Main category: cs.CV

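TL;DR: AquaGS is an underwater scene reconstruction model based on the SeaThru algorithm that combines multi-view stereo and 3D Gaussian Splatting to rapidly separate scene details from medium features and achieve high-precision reconstruction.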

Details

Motivation: Underwater image quality degrades because of medium interference, and traditional SfM methods are slow and limited in effectiveness, making real-time requirements hard to meet. Method: Initializes Gaussians with multi-view stereo, renders the translucent medium with implicit NeRF, and renders object surfaces with explicit 3DGS. Result: Completes high-precision reconstruction within 30 seconds from only 3 input images, significantly improving the algorithm's applicability on robotic platforms. Conclusion: AquaGS overcomes the limitations of traditional methods, accurately simulates underwater optical phenomena, and is suited to real-time scenarios. Abstract: Underwater scene reconstruction is a critical technology for underwater operations, enabling the generation of 3D models from images captured by underwater platforms. However, the quality of underwater images is often degraded due to medium interference, which limits the effectiveness of Structure-from-Motion (SfM) pose estimation, leading to subsequent reconstruction failures. Additionally, SfM methods typically operate at slower speeds, further hindering their applicability in real-time scenarios. In this paper, we introduce AquaGS, an SfM-free underwater scene reconstruction model based on the SeaThru algorithm, which facilitates rapid and accurate separation of scene details and medium features. Our approach initializes Gaussians by integrating state-of-the-art multi-view stereo (MVS) technology, employs implicit Neural Radiance Fields (NeRF) for rendering translucent media and utilizes the latest explicit 3D Gaussian Splatting (3DGS) technique to render object surfaces, which effectively addresses the limitations of traditional methods and accurately simulates underwater optical phenomena. Experimental results on the data set and the robot platform show that our model can complete high-precision reconstruction in 30 seconds with only 3 image inputs, significantly enhancing the practical application of the algorithm in robotic platforms.

[30] Efficient 3D Full-Body Motion Generation from Sparse Tracking Inputs with Temporal Windows

Georgios Fotios Angelis,Savas Ozkan,Sinan Mutlu,Paul Wisbey,Anastasios Drosou,Mete Ozay

Main category: cs.CV

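TL;DR: Proposes a new MLP-based method that splits long input sequences into small temporal windows, significantly improving the accuracy of 3D full-body generation while reducing computational cost and memory overhead.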

Details

Motivation: Existing neural network models are computationally expensive and rely on long input sequences, which degrades performance and adds noise, so a more efficient method is needed. Method: Uses a Multi-Layer Perceptron (MLP) mechanism that splits long input sequences into small temporal windows and merges the current motion with past context through latent representations. Result: Experiments show that the method significantly outperforms the state of the art in generation accuracy while greatly reducing computational cost and memory overhead. Conclusion: The method suits resource-constrained devices and provides an efficient 3D full-body generation solution for AR/VR applications. Abstract: To have a seamless user experience on immersive AR/VR applications, the importance of efficient and effective Neural Network (NN) models is undeniable, since missing body parts that cannot be captured by limited sensors should be generated using these models for a complete 3D full-body reconstruction in virtual environment. However, the state-of-the-art NN-models are typically computational expensive and they leverage longer sequences of sparse tracking inputs to generate full-body movements by capturing temporal context. Inevitably, longer sequences increase the computation overhead and introduce noise in longer temporal dependencies that adversely affect the generation performance. In this paper, we propose a novel Multi-Layer Perceptron (MLP)-based method that enhances the overall performance while balancing the computational cost and memory overhead for efficient 3D full-body generation. Precisely, we introduce a NN-mechanism that divides the longer sequence of inputs into smaller temporal windows. Later, the current motion is merged with the information from these windows through latent representations to utilize the past context for the generation. Our experiments demonstrate that generation accuracy of our method with this NN-mechanism is significantly improved compared to the state-of-the-art methods while greatly reducing computational costs and memory overhead, making our method suitable for resource-constrained devices.
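
The windowing idea in the abstract above can be sketched as follows. The split into fixed-size temporal windows follows the description directly; the per-window mean used as a stand-in for the learned latent representation, and the concatenation used to merge it with the current frame, are assumptions made only to keep the example self-contained.

```python
def split_into_windows(frames, window_size):
    """Split a sequence of per-frame feature vectors into consecutive windows."""
    return [frames[i:i + window_size] for i in range(0, len(frames), window_size)]


def summarize_window(window):
    """Mean-pool a window into one vector.

    Placeholder for the learned latent representation, which the abstract
    does not specify.
    """
    dim = len(window[0])
    return [sum(frame[d] for frame in window) / len(window) for d in range(dim)]


def build_mlp_input(frames, window_size):
    """Concatenate the current frame with the summaries of all windows."""
    current = frames[-1]
    summaries = [summarize_window(w) for w in split_into_windows(frames, window_size)]
    return current + [value for summary in summaries for value in summary]


# Five frames of 2-D sparse tracking features (made-up numbers).
frames = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0], [8.0, 8.0]]
print(build_mlp_input(frames, window_size=2))
```

The point of the design is that the MLP sees a short, fixed-size summary of the past rather than the full long sequence, which keeps compute and memory bounded.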

[31] Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing

Yuchang Jiang,Maxim Neumann

Main category: cs.CV

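TL;DR: ForTy is a global-scale benchmark for forest type mapping that uses multi-temporal satellite data to distinguish natural forest, planted forest, and tree crops, and improves performance with a novel transformer model.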

Details

Motivation: Accurate and reliable forest type models are critical for halting deforestation and conserving biodiversity, as required for example by the European Union Deforestation Regulation (EUDR). Method: Uses 200,000 time series of image patches drawn from Sentinel-2, Sentinel-1, climate, and elevation data with per-pixel annotations, and proposes a novel transformer model for multi-modal, multi-temporal data. Result: Experiments show that the proposed transformer model outperforms baseline models such as convolutional neural networks. Conclusion: The ForTy benchmark provides a new tool for global forest type mapping, and the novel transformer model performs well on multi-temporal data. Abstract: Developing accurate and reliable models for forest types mapping is critical to support efforts for halting deforestation and for biodiversity conservation (such as European Union Deforestation Regulation (EUDR)). This work introduces ForTy, a benchmark for global-scale FORest TYpes mapping using multi-temporal satellite data. The benchmark comprises 200,000 time series of image patches, each consisting of Sentinel-2, Sentinel-1, climate, and elevation data. Each time series captures variations at monthly or seasonal cadence. Per-pixel annotations, including forest types and other land use classes, support image segmentation tasks. Unlike most existing land use products that often categorize all forest areas into a single class, our benchmark differentiates between three forest types classes: natural forest, planted forest, and tree crops. By leveraging multiple public data sources, we achieve global coverage with this benchmark. We evaluate the forest types dataset using several baseline models, including convolution neural networks and transformer-based models. Additionally, we propose a novel transformer-based model specifically designed to handle multi-modal, multi-temporal satellite data for forest types mapping. Our experimental results demonstrate that the proposed model surpasses the baseline models in performance.

[32] 3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment

Xiaoqi Li,Jiaming Liu,Nuowei Han,Liang Heng,Yandong Guo,Hao Dong,Yang Liu

Main category: cs.CV

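TL;DR: The paper proposes a weakly supervised 3D visual grounding method that tackles grounding in sparse point clouds by distinguishing categories from instances, and performs strongly on multiple benchmarks.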

Details

Motivation: To resolve category-level ambiguity and instance-level complexity in weakly supervised grounding of natural language descriptions in 3D point clouds. Method: Uses a two-branch design: the category-level branch draws on a pre-trained detector to enhance category awareness, and the instance-level branch uses spatial relationship descriptions to distinguish instances. Result: Achieves state-of-the-art performance on the Nr3D, Sr3D, and ScanRef benchmarks. Conclusion: The proposed method effectively addresses key challenges in weakly supervised 3D visual grounding and significantly improves performance. Abstract: The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges: category-level ambiguity and instance-level complexity. Category-level ambiguity arises from representing objects of fine-grained categories in a highly sparse point cloud format, making category distinction challenging. Instance-level complexity stems from multiple instances of the same category coexisting in a scene, leading to distractions during grounding. To address these challenges, we propose a novel weakly-supervised grounding approach that explicitly differentiates between categories and instances. In the category-level branch, we utilize extensive category knowledge from a pre-trained external detector to align object proposal features with sentence-level category features, thereby enhancing category awareness. In the instance-level branch, we utilize spatial relationship descriptions from language queries to refine object proposal features, ensuring clear differentiation among objects. These designs enable our model to accurately identify target-category objects while distinguishing instances within the same category. Compared to previous methods, our approach achieves state-of-the-art performance on three widely used benchmarks: Nr3D, Sr3D, and ScanRef.

Nitin Rai,Arnold W. Schumann,Nathan Boyd

Main category: cs.CV

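TL;DR: The study explores a multi-modal text-to-image approach for generating synthetic crop disease images and is the first to provide computational benchmarking in this context. SD3.5M performs best in both performance and efficiency.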

Details

Motivation: Collecting large-scale in-field crop disease images is time-consuming and labor-intensive. Generative models (GMs) offer an alternative, but existing research lacks a comprehensive analysis of computational requirements in agriculture. Method: Trains three Stable Diffusion (SD) variants (SDXL, SD3.5M, and SD3.5L) and fine-tunes them with Dreambooth and LoRA to improve generalization. Result: SD3.5M performs best, with an average memory usage of 18 GB, power consumption of 180 W, and total energy use of 1.02 kWh per 500 images (0.002 kWh per image); it generates 500 synthetic images from 36 in-field samples in 1.5 hours. Conclusion: SD3.5M is recommended for efficient crop disease data generation. Abstract: Collecting large-scale crop disease images in the field is labor-intensive and time-consuming. Generative models (GMs) offer an alternative by creating synthetic samples that resemble real-world images. However, existing research primarily relies on Generative Adversarial Networks (GANs)-based image-to-image translation and lack a comprehensive analysis of computational requirements in agriculture. Therefore, this research explores a multi-modal text-to-image approach for generating synthetic crop disease images and is the first to provide computational benchmarking in this context. We trained three Stable Diffusion (SD) variants, namely SDXL, SD3.5M (medium), and SD3.5L (large), and fine-tuned them using Dreambooth and Low-Rank Adaptation (LoRA) fine-tuning techniques to enhance generalization. SD3.5M outperformed the others, with an average memory usage of 18 GB, power consumption of 180 W, and total energy use of 1.02 kWh/500 images (0.002 kWh per image) during inference task. Our results demonstrate SD3.5M's ability to generate 500 synthetic images from just 36 in-field samples in 1.5 hours. We recommend SD3.5M for efficient crop disease data generation.
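
The per-image figure quoted above follows directly from the reported totals, as the short calculation below shows. The average time per image assumes the full 1.5 hours is spread evenly over the 500 images, which the abstract does not state explicitly; the variable names are illustrative.

```python
total_energy_kwh = 1.02   # reported total energy for the batch
n_images = 500            # reported number of generated images
total_hours = 1.5         # reported wall-clock time

# Energy per image: 1.02 kWh / 500 = 0.00204 kWh, reported as 0.002 kWh.
energy_per_image_kwh = total_energy_kwh / n_images

# Average time per image, assuming the 1.5 hours is spread evenly.
seconds_per_image = total_hours * 3600 / n_images

print(round(energy_per_image_kwh, 5), seconds_per_image)
```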

[34] CVVNet: A Cross-Vertical-View Network for Gait Recognition

Xiangru Li,Wei Song,Yingda Huang,Wei Meng,Le Chang

Main category: cs.CV

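TL;DR: CVVNet is a frequency aggregation architecture for cross-vertical-view gait recognition that significantly improves recognition performance through multi-scale feature extraction and a dynamic gated aggregation mechanism.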

Details

Motivation: Existing methods perform poorly in cross-vertical-view scenarios, mainly because view changes cause deformation and self-occlusion of key anatomical features. Method: Proposes CVVNet, which includes a High-Low Frequency Extraction module (HLFE) and a Dynamic Gated Aggregation mechanism (DGA) for multi-scale feature extraction and adaptive fusion. Result: Improves performance by 8.6% on DroneGait and 2% on Gait3D. Conclusion: Through multi-frequency feature integration and dynamic fusion, CVVNet significantly improves the robustness of cross-vertical-view gait recognition. Abstract: Gait recognition enables contact-free, long-range person identification that is robust to clothing variations and non-cooperative scenarios. While existing methods perform well in controlled indoor environments, they struggle with cross-vertical view scenarios, where surveillance angles vary significantly in elevation. Our experiments show up to 60% accuracy degradation in low-to-high vertical view settings due to severe deformations and self-occlusions of key anatomical features. Current CNN and self-attention-based methods fail to effectively handle these challenges, due to their reliance on single-scale convolutions or simplistic attention mechanisms that lack effective multi-frequency feature integration. To tackle this challenge, we propose CVVNet (Cross-Vertical-View Network), a frequency aggregation architecture specifically designed for robust cross-vertical-view gait recognition. CVVNet employs a High-Low Frequency Extraction module (HLFE) that adopts parallel multi-scale convolution/max-pooling path and self-attention path as high- and low-frequency mixers for effective multi-frequency feature extraction from input silhouettes. We also introduce the Dynamic Gated Aggregation (DGA) mechanism to adaptively adjust the fusion ratio of high- and low-frequency features. The integration of our core Multi-Scale Attention Gated Aggregation (MSAGA) module, HLFE and DGA enables CVVNet to effectively handle distortions from view changes, significantly improving the recognition robustness across different vertical views. Experimental results show that our CVVNet achieves state-of-the-art performance, with 8.6% improvement on DroneGait and 2% on Gait3D compared with the best existing methods.
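
One plausible reading of "adaptively adjust the fusion ratio of high- and low-frequency features" is a sigmoid gate that blends the two feature streams element by element. The sketch below illustrates that generic gating pattern only. In CVVNet the gate would be computed from the features by learned layers; here the gate logits are passed in directly, and the function name `gated_fusion` is hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(high, low, gate_logits):
    """Blend high- and low-frequency features with a per-element gate.

    out = g * high + (1 - g) * low, with g = sigmoid(gate_logit).
    The gate logits are given as inputs here; a real model would predict them.
    """
    fused = []
    for h, l, z in zip(high, low, gate_logits):
        g = sigmoid(z)
        fused.append(g * h + (1.0 - g) * l)
    return fused


# A zero logit gives an even 50/50 blend of the two streams.
print(gated_fusion([2.0, 4.0], [0.0, 0.0], [0.0, 0.0]))
```

A large positive logit pushes the output toward the high-frequency stream and a large negative one toward the low-frequency stream, so the fusion ratio can vary per feature and per input.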

[35] MVHumanNet++: A Large-scale Dataset of Multi-view Daily Dressing Human Captures with Richer Annotations for 3D Human Digitization

Chenghong Li,Hongjie Liao,Yihao Zhi,Xihe Yang,Zhengwentai Sun,Jiahao Chang,Shuguang Cui,Xiaoguang Han

Main category: cs.CV

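TL;DR: MVHumanNet++ is a large-scale multi-view human action dataset that fills the data gap for human-centric tasks in 3D vision; it contains 4,500 identities, 9,000 daily outfits, and 645 million frames with rich annotations.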

Details

Motivation: Human-centric tasks in 3D vision have progressed slowly for lack of a large-scale dataset; MVHumanNet++ aims to fill this gap. Method: Diverse human data are collected with multi-view capture systems, covering identities, outfits, motion sequences, and multiple annotations (2D/3D keypoints, SMPL parameters, and others). Result: The dataset contains 4,500 identities, 9,000 outfits, 60,000 motion sequences, and 645 million frames, and is extended with normal maps and depth maps. Conclusion: MVHumanNet++ is currently the largest-scale 3D human dataset, is publicly available, and is expected to foster research on human-centric tasks. Abstract: In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while significant progress has been achieved in object-centric tasks through large-scale datasets like Objaverse and MVImgNet, human-centric tasks have seen limited advancement, largely due to the absence of a comparable large-scale human dataset. To bridge this gap, we present MVHumanNet++, a dataset that comprises multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using multi-view human capture systems, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. Additionally, the proposed MVHumanNet++ dataset is enhanced with newly processed normal maps and depth maps, significantly expanding its applicability and utility for advanced human-centric research. To explore the potential of our proposed MVHumanNet++ dataset in various 2D and 3D visual tasks, we conducted several pilot studies to demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet++. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet++ dataset with annotations will foster further innovations in the domain of 3D human-centric tasks at scale.
MVHumanNet++ is publicly available at https://kevinlee09.github.io/research/MVHumanNet++/.

[36] Mitigating Group-Level Fairness Disparities in Federated Visual Language Models

Chaomeng Chen,Zitong Yu,Junhao Dong,Sen Su,Linlin Shen,Shutao Xia,Xiaochun Cao

Main category: cs.CV

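TL;DR: The paper proposes FVL-FP, a framework that combines federated learning with fair prompt tuning to address group fairness in federated visual language models, significantly reducing demographic bias.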

Details

Motivation: Visual language models deployed in federated learning environments suffer from fairness disparities across demographic groups, calling for a solution that reduces bias while preserving model performance. Method: Proposes three components: Cross-Layer Demographic Fair Prompting (CDFP), Demographic Subspace Orthogonal Projection (DSOP), and Fair-aware Prompt Fusion (FPF), which respectively adjust embeddings, remove bias from image representations, and dynamically balance client contributions. Result: Evaluations on four benchmark datasets show that FVL-FP reduces demographic disparity by 45% on average while keeping task performance within 6% of state-of-the-art results. Conclusion: FVL-FP is a parameter-efficient solution for ensuring equitable performance across demographic groups in privacy-preserving multimodal systems. Abstract: Visual language models (VLMs) have shown remarkable capabilities in multimodal tasks but face challenges in maintaining fairness across demographic groups, particularly when deployed in federated learning (FL) environments. This paper addresses the critical issue of group fairness in federated VLMs by introducing FVL-FP, a novel framework that combines FL with fair prompt tuning techniques. We focus on mitigating demographic biases while preserving model performance through three innovative components: (1) Cross-Layer Demographic Fair Prompting (CDFP), which adjusts potentially biased embeddings through counterfactual regularization; (2) Demographic Subspace Orthogonal Projection (DSOP), which removes demographic bias in image representations by mapping fair prompt text to group subspaces; and (3) Fair-aware Prompt Fusion (FPF), which dynamically balances client contributions based on both performance and fairness metrics. Extensive evaluations across four benchmark datasets demonstrate that our approach reduces demographic disparity by an average of 45% compared to standard FL approaches, while maintaining task performance within 6% of state-of-the-art results. FVL-FP effectively addresses the challenges of non-IID data distributions in federated settings and introduces minimal computational overhead while providing significant fairness benefits. Our work presents a parameter-efficient solution to the critical challenge of ensuring equitable performance across demographic groups in privacy-preserving multimodal systems.
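
The general operation behind an orthogonal projection such as DSOP is removing the component of an embedding that lies in a given subspace. The sketch below shows only that generic linear-algebra step, under the assumption that the group subspace is spanned by a few direction vectors; how FVL-FP actually derives the subspace from fair prompt text is not described in the abstract, and the names `remove_subspace` and `group_dirs` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace(x, vectors):
    """Project x onto the orthogonal complement of span(vectors)."""
    out = list(x)
    for b in orthonormal_basis(vectors):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Hypothetical 3-D example: two directions span the "group" subspace.
group_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
embedding = [3.0, -2.0, 5.0]
debiased = remove_subspace(embedding, group_dirs)
```

After the projection, the embedding has zero inner product with every direction in the subspace, while the component outside it (here the third coordinate) is left unchanged.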

[37] DualDiff: Dual-branch Diffusion Model for Autonomous Driving with Semantic Fusion

Haoteng Li,Zhao Yang,Zezhong Qian,Gongpeng Zhao,Yuqi Huang,Jun Yu,Huazheng Zhou,Longjun Liu

Main category: cs.CV

TL;DR: DualDiff is a dual-branch conditional diffusion model that improves the quality of multi-view driving scene generation through Occupancy Ray Sampling and Semantic Fusion Attention.

Details

Motivation: Existing methods control foreground and background using only 3D bounding boxes and binary maps, which fails to capture scene complexity and multi-modal information. Method: Proposes DualDiff, which combines ORS, a semantic-rich 3D representation, with the SFA mechanism, and designs an FGM loss to improve the generation of small objects. Result: Achieves state-of-the-art FID scores and strong results on BEV segmentation and 3D object detection. Conclusion: Through multi-modal information fusion and fine-grained control, DualDiff significantly improves the accuracy and fidelity of driving scene reconstruction. Abstract: Accurate and high-fidelity driving scene reconstruction relies on fully leveraging scene information as conditioning. However, existing approaches, which primarily use 3D bounding boxes and binary maps for foreground and background control, fall short in capturing the complexity of the scene and integrating multi-modal information. In this paper, we propose DualDiff, a dual-branch conditional diffusion model designed to enhance multi-view driving scene generation. We introduce Occupancy Ray Sampling (ORS), a semantic-rich 3D representation, alongside numerical driving scene representation, for comprehensive foreground and background control. To improve cross-modal information integration, we propose a Semantic Fusion Attention (SFA) mechanism that aligns and fuses features across modalities. Furthermore, we design a foreground-aware masked (FGM) loss to enhance the generation of tiny objects. DualDiff achieves state-of-the-art performance in FID score, as well as consistently better results in downstream BEV segmentation and 3D object detection tasks.

[38] Visual enhancement and 3D representation for underwater scenes: a review

Guoxi Huang,Haoran Wang,Brett Seymour,Evan Kovacs,John Ellerbrock,Dave Blackham,Nantheera Anantrasirichai

Main category: cs.CV

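TL;DR: This paper reviews the challenges, methods, and future directions of underwater visual enhancement (UVE) and 3D reconstruction, filling the gap left by the absence of a systematic review in this area.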

Details

Motivation: Complex imaging conditions in underwater environments pose major challenges for computer vision and AI tasks, yet a systematic review covering both UVE and 3D reconstruction is lacking. Method: Starting from physical models, the paper surveys approaches ranging from traditional methods to data-driven techniques (such as NeRF and 3D Gaussian Splatting) and assesses their effectiveness. Result: Quantitative and qualitative evaluations compare multiple algorithms across different datasets. Conclusion: The paper summarizes the limitations of current research and identifies key directions for future underwater vision research. Abstract: Underwater visual enhancement (UVE) and underwater 3D reconstruction pose significant challenges in computer vision and AI-based tasks due to complex imaging conditions in aquatic environments. Despite the development of numerous enhancement algorithms, a comprehensive and systematic review covering both UVE and underwater 3D reconstruction remains absent. To advance research in these areas, we present an in-depth review from multiple perspectives. First, we introduce the fundamental physical models, highlighting the peculiarities that challenge conventional techniques. We survey advanced methods for visual enhancement and 3D reconstruction specifically designed for underwater scenarios. The paper assesses various approaches from non-learning methods to advanced data-driven techniques, including Neural Radiance Fields and 3D Gaussian Splatting, discussing their effectiveness in handling underwater distortions. We then conduct both quantitative and qualitative evaluations of state-of-the-art UVE and underwater 3D reconstruction algorithms across multiple benchmark datasets. Finally, we highlight key research directions for future advancements in underwater vision.

Trisanth Srinivasan,Santosh Patapati

Main category: cs.CV

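TL;DR: PhysNav-DG is a new framework that combines sensor fusion with vision-language models; its dual-branch architecture predicts navigation actions and generates detailed explanations, significantly improving navigation success rates and transparency.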

Details

Motivation: To achieve accurate state estimation and transparent decision making in diverse environments. Method: A dual-branch architecture combines classical sensor fusion with vision-language models (such as LLaMA 3.2 11B and BLIP-2) and uses a modified Adaptive Kalman Filter that dynamically adjusts its noise parameters. Result: On the MD-NEX Benchmark, navigation success rates improve by over 20%, and the generated explanations are both clear and well grounded. Conclusion: By connecting high-level semantic reasoning with geometric planning, this work offers a safer and more trustworthy solution for autonomous systems. Abstract: Robust navigation in diverse environments and domains requires both accurate state estimation and transparent decision making. We present PhysNav-DG, a novel framework that integrates classical sensor fusion with the semantic power of vision-language models. Our dual-branch architecture predicts navigation actions from multi-sensor inputs while simultaneously generating detailed chain-of-thought explanations. A modified Adaptive Kalman Filter dynamically adjusts its noise parameters based on environmental context. It leverages several streams of raw sensor data along with semantic insights from models such as LLaMA 3.2 11B and BLIP-2. To evaluate our approach, we introduce the MD-NEX Benchmark, a novel multi-domain dataset that unifies indoor navigation, autonomous driving, and social navigation tasks with ground-truth actions and human-validated explanations. Extensive experiments and ablations show that PhysNav-DG improves navigation success rates by over 20% and achieves high efficiency, with explanations that are both highly grounded and clear. This work connects high-level semantic reasoning and geometric planning for safer and more trustworthy autonomous systems.
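
To make the notion of a Kalman filter that adjusts its own noise parameters concrete, here is a minimal scalar sketch. It is not PhysNav-DG's filter: the abstract says the adaptation is driven by environmental context (including semantic cues), whereas this stand-in adapts the measurement-noise variance from the innovation sequence. The function name, the random-walk state model, and the parameters `q`, `r0`, and `alpha` are all assumptions.

```python
def adaptive_kalman_1d(measurements, q=1e-3, r0=1.0, alpha=0.9):
    """Scalar Kalman filter with innovation-based adaptation of R.

    State model: x_k = x_{k-1} + process noise (variance q).
    Measurement: z_k = x_k + measurement noise (variance r, adapted online).
    """
    x, p, r = measurements[0], 1.0, r0
    estimates, r_history = [], []
    for z in measurements:
        # Predict step for a random-walk state.
        p_pred = p + q
        innovation = z - x
        # Since E[innovation^2] = p_pred + r, innovation^2 - p_pred is a
        # noisy estimate of r; smooth it with an exponential moving average.
        r = alpha * r + (1.0 - alpha) * max(innovation ** 2 - p_pred, 1e-6)
        # Update step with the adapted measurement noise.
        k = p_pred / (p_pred + r)
        x = x + k * innovation
        p = (1.0 - k) * p_pred
        estimates.append(x)
        r_history.append(r)
    return estimates, r_history


# Calm readings followed by a burst of erratic ones.
readings = [0.0] * 10 + [5.0, -5.0, 5.0, -5.0]
estimates, r_history = adaptive_kalman_1d(readings)
```

When the readings become erratic, the estimated measurement noise rises and the Kalman gain falls, so the filter trusts the sensor less; during calm stretches the opposite happens.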

[40] CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture

Vladimir Frants,Sos Agaian,Karen Panetta,Peter Huang

Main category: cs.CV

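TL;DR: The paper proposes CMAWRNet, a unified quaternion neural architecture for efficiently removing multiple adverse weather effects from images, built on a novel texture-structure decomposition block, a lightweight encoder-decoder quaternion transformer architecture, and an attentive fusion block with low-light correction.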

Details

Motivation: Images in real-world applications are often degraded by adverse weather (haze, rain, snow), and existing methods struggle when several weather conditions occur together, so an efficient multi-weather removal solution is needed. Method: Proposes CMAWRNet, which combines a texture-structure decomposition block, a lightweight quaternion transformer architecture, and an attentive fusion block, and introduces a quaternion similarity loss to preserve color information. Result: Quantitative and qualitative evaluations on benchmark datasets and real-world images show that CMAWRNet outperforms existing methods and improves downstream tasks such as object detection. Conclusion: CMAWRNet is the first to apply the decomposition approach to universal weather removal, significantly improving image restoration under multiple weather conditions. Abstract: Images used in real-world applications such as image or video retrieval, outdoor surveillance, and autonomous driving suffer from poor weather conditions. When designing robust computer vision systems, removing adverse weather such as haze, rain, and snow is a significant problem. Recently, deep-learning methods offered a solution for a single type of degradation. Current state-of-the-art universal methods struggle with combinations of degradations, such as haze and rain-streak. Few algorithms have been developed that perform well when presented with images containing multiple adverse weather conditions. This work focuses on developing an efficient solution for multiple adverse weather removal using a unified quaternion neural architecture called CMAWRNet. It is based on a novel texture-structure decomposition block, a novel lightweight encoder-decoder quaternion transformer architecture, and an attentive fusion block with low-light correction. We also introduce a quaternion similarity loss function to preserve color information better. The quantitative and qualitative evaluation on current state-of-the-art benchmark datasets and real-world images shows the performance advantages of the proposed CMAWRNet compared to other state-of-the-art weather removal approaches dealing with multiple weather artifacts. Extensive computer simulations validate that CMAWRNet improves the performance of downstream applications such as object detection. This is the first time the decomposition approach has been applied to the universal weather removal task.

[41] Rethinking Score Distilling Sampling for 3D Editing and Generation

Xingyu Miao,Haoran Duan,Yang Long,Jungong Han

Main category: cs.CV

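TL;DR: UDS unifies the gradient terms of 3D generation and editing, outperforming baseline methods and supporting both richer detail generation and effective editing.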

Details

Motivation: To overcome the limitations of SDS in 3D generation and editing tasks by proposing a unified framework. Method: Refines the gradient terms so that generation and editing are unified in a single method, UDS. Result: Experiments show that UDS outperforms baseline methods in both generation and editing. Conclusion: UDS bridges the gap between 3D generation and editing; the code is open source. Abstract: Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often cannot generate new 3D assets effectively. In this work, we observe that the processes of generation and editing within SDS and its variants have unified underlying gradient terms. Building on this insight, we propose Unified Distillation Sampling (UDS), a method that seamlessly integrates both the generation and editing of 3D assets. Essentially, UDS refines the gradient terms used in vanilla SDS methods, unifying them to support both tasks. Extensive experiments demonstrate that UDS not only outperforms baseline methods in generating 3D assets with richer details but also excels in editing tasks, thereby bridging the gap between 3D generation and editing. The code is available on: https://github.com/xingy038/UDS.
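
For context, the gradient of vanilla SDS that the abstract refers to is usually written as below. This is the standard form from the text-to-3D literature, not the refined UDS terms, which the abstract does not spell out. The symbols are assumptions of this note: $\theta$ are the 3D parameters, $x = g(\theta)$ is a rendered image, $x_t$ is its noised version at timestep $t$, $\epsilon$ is the injected Gaussian noise, $\hat{\epsilon}_\phi$ is the pretrained 2D diffusion model's noise prediction conditioned on the prompt $y$, and $w(t)$ is a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta).
```

The residual $\hat{\epsilon}_\phi(x_t; y, t) - \epsilon$ is the gradient term that pushes renderings toward images the diffusion model considers likely under the prompt.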

[42] GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting

Anushka Agarwal,Muhammad Yusuf Hassan,Talha Chafekar

Main category: cs.CV

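TL;DR: GenSync is a multi-identity lip-synced video synthesis framework based on 3D Gaussian Splatting; a disentanglement module enables efficient multi-identity synthesis with 6.8x faster training.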

Details

Motivation: Existing 3D methods usually require training a separate model for each identity, which is inefficient; GenSync aims to synthesize multiple identities with one unified model. Method: Uses 3D Gaussian Splatting together with a disentanglement module that separates identity features from audio representations, enabling efficient multi-identity video synthesis. Result: Training is 6.8x faster while high lip-sync accuracy and visual quality are maintained. Conclusion: Through a unified network design, GenSync significantly improves the efficiency and performance of multi-identity lip-synced video synthesis. Abstract: We introduce GenSync, a novel framework for multi-identity lip-synced video synthesis using 3D Gaussian Splatting. Unlike most existing 3D methods that require training a new model for each identity, GenSync learns a unified network that synthesizes lip-synced videos for multiple speakers. By incorporating a Disentanglement Module, our approach separates identity-specific features from audio representations, enabling efficient multi-identity video synthesis. This design reduces computational overhead and achieves 6.8x faster training compared to state-of-the-art models, while maintaining high lip-sync accuracy and visual quality.

[43] GauS-SLAM: Dense RGB-D SLAM with Gaussian Surfels

Yongxin Su,Lin Chen,Kaiting Zhang,Zhongliang Zhao,Chenfeng Hou,Ziping Yu

Main category: cs.CV

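TL;DR: GauS-SLAM is a dense RGB-D SLAM system based on 2D Gaussian surfels that significantly improves tracking and mapping performance through better geometric accuracy and multi-view consistency.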

Details

Motivation: Gaussian-based scene representations exhibit geometry distortion under novel viewpoints, which degrades tracking accuracy; the main causes are the depth modeling of Gaussian primitives and mutual interference between surfaces. Method: Proposes a 2D Gaussian-based incremental reconstruction strategy and a surface-aware depth rendering mechanism, together with a local map that dynamically isolates visible surfaces to reduce misalignment from occluded regions. Result: Experiments on multiple datasets show that GauS-SLAM outperforms comparable methods in tracking precision and rendering fidelity. Conclusion: By improving geometry modeling and surface handling, GauS-SLAM delivers efficient SLAM performance. Abstract: We propose GauS-SLAM, a dense RGB-D SLAM system that leverages 2D Gaussian surfels to achieve robust tracking and high-fidelity mapping. Our investigations reveal that Gaussian-based scene representations exhibit geometry distortion under novel viewpoints, which significantly degrades the accuracy of Gaussian-based tracking methods. These geometry inconsistencies arise primarily from the depth modeling of Gaussian primitives and the mutual interference between surfaces during the depth blending. To address these, we propose a 2D Gaussian-based incremental reconstruction strategy coupled with a Surface-aware Depth Rendering mechanism, which significantly enhances geometry accuracy and multi-view consistency. Additionally, the proposed local map design dynamically isolates visible surfaces during tracking, mitigating misalignment caused by occluded regions in global maps while maintaining computational efficiency with increasing Gaussian density. Extensive experiments across multiple datasets demonstrate that GauS-SLAM outperforms comparable methods, delivering superior tracking precision and rendering fidelity. The project page will be made available at https://gaus-slam.github.io.

[44] HybridGS: High-Efficiency Gaussian Splatting Data Compression using Dual-Channel Sparse Representation and Point Cloud Encoder

Qi Yang,Le Yang,Geert Van Der Auwera,Zhu Li

Main category: cs.CV

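TL;DR: HybridGS is a new 3D Gaussian Splatting compression framework that combines compact generation with standardized point cloud encoding, significantly increasing encoding and decoding speed.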

Details

Motivation: Existing 3DGS compression schemes have long coding times and highly customized data formats, which hinders widespread deployment. Method: HybridGS first generates compact, explicit 3DGS data, introducing a dual-channel sparse representation to supervise primitive position and feature bit depth, and then uses a standard point cloud encoder to compress the data further. Result: Experiments show that HybridGS matches state-of-the-art methods in reconstruction performance with markedly faster encoding and decoding. Conclusion: HybridGS offers an efficient, standardized 3DGS compression solution suited to widespread deployment. Abstract: Most existing 3D Gaussian Splatting (3DGS) compression schemes focus on producing compact 3DGS representation via implicit data embedding. They have long coding times and highly customized data formats, making widespread deployment difficult. This paper presents a new 3DGS compression framework called HybridGS, which takes advantage of both compact generation and standardized point cloud data encoding. HybridGS first generates compact and explicit 3DGS data. A dual-channel sparse representation is introduced to supervise the primitive position and feature bit depth. It then utilizes a canonical point cloud encoder to perform further data compression and form standard output bitstreams. A simple and effective rate control scheme is proposed to pivot the interpretable data compression scheme. At the current stage, HybridGS does not include any modules aimed at improving 3DGS quality during generation. But experiment results show that it still provides comparable reconstruction performance against state-of-the-art methods, with evidently higher encoding and decoding speed. The code is publicly available at https://github.com/Qi-Yangsjtu/HybridGS.

[45] Segment Any RGB-Thermal Model with Language-aided Distillation

Dong Xing,Xianxun Zhu,Wei Zhou,Qika Lin,Hang Yang,Yuqing Wang

Main category: cs.CV

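TL;DR: The SARTM framework fine-tunes SAM and adds semantic understanding modules, successfully applying SAM to RGB-T semantic segmentation with significantly improved performance.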

Details

Motivation: SAM is trained solely on RGB data and cannot be applied directly to RGB-T semantic segmentation, yet RGB-T is advantageous under adverse weather and lighting, so a customized solution is needed. Method: Fine-tunes SAM by adding LoRA layers, introduces language information to guide training, uses a Cross-Modal Knowledge Distillation (CMKD) module to resolve cross-modal inconsistencies, and refines the segmentation head. Result: SARTM significantly outperforms existing methods on multiple RGB-T semantic segmentation benchmarks. Conclusion: SARTM extends SAM to the RGB-T domain, addressing the modality gap and improving semantic segmentation performance. Abstract: The recent Segment Anything Model (SAM) demonstrates strong instance segmentation performance across various downstream tasks. However, SAM is trained solely on RGB data, limiting its direct applicability to RGB-thermal (RGB-T) semantic segmentation. Given that RGB-T provides a robust solution for scene understanding in adverse weather and lighting conditions, such as low light and overexposure, we propose a novel framework, SARTM, which customizes the powerful SAM for RGB-T semantic segmentation. Our key idea is to unleash the potential of SAM while introducing semantic understanding modules for RGB-T data pairs. Specifically, our framework first involves fine-tuning the original SAM by adding extra LoRA layers, aiming at preserving SAM's strong generalization and segmentation capabilities for downstream tasks. Secondly, we introduce language information as guidance for training our SARTM. To address cross-modal inconsistencies, we introduce a Cross-Modal Knowledge Distillation (CMKD) module that effectively achieves modality adaptation while maintaining its generalization capabilities. This semantic module enables the minimization of modality gaps and alleviates semantic ambiguity, facilitating the combination of any modality under any visual conditions. Furthermore, we enhance the segmentation performance by adjusting the segmentation head of SAM and incorporating an auxiliary semantic segmentation head, which integrates multi-scale features for effective fusion. Extensive experiments are conducted across three multi-modal RGB-T semantic segmentation benchmarks: MFNET, PST900, and FMB.
Both quantitative and qualitative results consistently demonstrate that the proposed SARTM significantly outperforms state-of-the-art approaches across a variety of conditions.
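
The LoRA layers mentioned in the abstract follow a well-known pattern: a frozen weight matrix W is augmented with a trainable low-rank update, giving y = W x + (alpha / r) B A x. The sketch below shows that forward pass in plain Python with toy dimensions. Where SARTM inserts these layers inside SAM and which rank it uses are not stated in the abstract; the matrices and the `alpha` value here are purely illustrative.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w, a, b, x, alpha):
    """Frozen linear layer plus a low-rank update.

    w: frozen weights, shape (out, in)
    a: trainable down-projection, shape (r, in)
    b: trainable up-projection, shape (out, r)
    Returns W x + (alpha / r) * B (A x).
    """
    rank = len(a)
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * d for y, d in zip(base, delta)]


# Toy layer: 3 inputs, 2 outputs, rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 1.0, 1.0]]
b = [[0.5], [-0.5]]
x = [1.0, 2.0, 3.0]
adapted = lora_forward(w, a, b, x, alpha=2.0)

# With B initialised to zeros (the usual choice), the adapter is a no-op.
b_zero = [[0.0], [0.0]]
unchanged = lora_forward(w, a, b_zero, x, alpha=2.0)
```

Only `a` and `b` would be trained, which is why this style of fine-tuning tends to preserve the pretrained model's behaviour while adding few parameters.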

[46] A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models

Liqiang Jing,Guiming Hardy Chen,Ehsan Aghazadeh,Xin Eric Wang,Xinya Du

Main category: cs.CV

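TL;DR: The paper analyzes visual object hallucination in Large Vision-Language Models (LVLMs), proposes mitigation methods for each component (the language model, the vision backbone, and the projector), and develops two hallucination benchmarks.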

Details

Motivation: Visual object hallucination is a major problem in LVLMs that can lead to misinformation, but its root causes have not been studied thoroughly. Method: Analyzes each component of LLaVA-like LVLMs to identify sources of error, proposes mitigation methods, and designs two hallucination benchmarks. Result: Provides mitigation methods for each problematic component and two benchmarks, QA-VisualGenome and QA-FB15k. Conclusion: The study offers a systematic approach to understanding and mitigating visual object hallucination in LVLMs, improving model safety and reliability. Abstract: Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in multimodal tasks, but visual object hallucination remains a persistent issue. It refers to scenarios where models generate inaccurate visual object-related information based on the query input, potentially leading to misinformation and concerns about safety and reliability. Previous works focus on the evaluation and mitigation of visual hallucinations, but the underlying causes have not been comprehensively investigated. In this paper, we analyze each component of LLaVA-like LVLMs -- the large language model, the vision backbone, and the projector -- to identify potential sources of error and their impact. Based on our observations, we propose methods to mitigate hallucination for each problematic component. Additionally, we developed two hallucination benchmarks: QA-VisualGenome, which emphasizes attribute and relation hallucinations, and QA-FB15k, which focuses on cognition-based hallucinations.

[47] MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection

Jiayi Cheng,Can Gao,Jie Zhou,Jiajun Wen,Tao Dai,Jinbao Wang

Main category: cs.CV

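TL;DR: Proposes MC3D-AD, a unified model for multi-category 3D anomaly detection that combines local and global geometric information and significantly outperforms existing single-category methods.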

Details

Motivation: Existing 3D anomaly detection methods require training a separate model for each category, which is costly, inefficient, and generalizes poorly. Method: Proposes an adaptive geometry-aware masked attention module, a local geometry-aware encoder, and a global query decoder that uses point cloud position embeddings to improve reconstruction. Result: Achieves AUROC improvements of 3.1% on Real3D-AD and 9.3% on Anomaly-ShapeNet. Conclusion: MC3D-AD performs strongly on multi-category 3D anomaly detection, with high efficiency and strong generalization. Abstract: 3D Anomaly Detection (AD) is a promising means of controlling the quality of manufactured products. However, existing methods typically require carefully training a task-specific model for each category independently, leading to high cost, low efficiency, and weak generalization. Therefore, this paper presents a novel unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) that aims to utilize both local and global geometry-aware information to reconstruct normal representations of all categories. First, to learn robust and generalized features of different categories, we propose an adaptive geometry-aware masked attention module that extracts geometry variation information to guide mask attention. Then, we introduce a local geometry-aware encoder reinforced by the improved mask attention to encode group-level feature tokens. Finally, we design a global query decoder that utilizes point cloud position embeddings to improve the decoding process and reconstruction ability. This leads to local and global geometry-aware reconstructed feature tokens for the AD task. MC3D-AD is evaluated on two publicly available Real3D-AD and Anomaly-ShapeNet datasets, and exhibits significant superiority over current state-of-the-art single-category methods, achieving 3.1% and 9.3% improvement in object-level AUROC over Real3D-AD and Anomaly-ShapeNet, respectively. The source code will be released upon acceptance.

[48] Visual Dominance and Emerging Multimodal Approaches in Distracted Driving Detection: A Review of Machine Learning Techniques

Anthony Dontoh,Stephanie Ivey,Logan Sirbaugh,Andrews Danyo,Armstrong Aboah

Main category: cs.CV

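TL;DR: This paper reviews 74 studies from 2019 to 2024 on detecting distracted driving with ML/DL techniques, highlights the advantages of multimodal approaches, and calls for future research on lightweight, deployable multimodal frameworks.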

Details

Motivation: Distracted driving is a major cause of road traffic accidents worldwide, and existing techniques rely mostly on visual data, overlooking the complexity of driver behavior. Method: Systematically reviews studies on visual, sensor-based, multimodal, and emerging techniques, analyzing the strengths and weaknesses of each. Result: Multimodal architectures outperform unimodal methods in robustness and scalability but must balance computational demands. Conclusion: Future work should develop lightweight multimodal frameworks, incorporating personalized baselines and cross-modality benchmarks, to improve the reliability of ADAS and road safety interventions. Abstract: Distracted driving continues to be a significant cause of road traffic injuries and fatalities worldwide, even with advancements in driver monitoring technologies. Recent developments in machine learning (ML) and deep learning (DL) have primarily focused on visual data to detect distraction, often neglecting the complex, multimodal nature of driver behavior. This systematic review assesses 74 peer-reviewed studies from 2019 to 2024 that utilize ML/DL techniques for distracted driving detection across visual, sensor-based, multimodal, and emerging modalities. The review highlights a significant prevalence of visual-only models, particularly convolutional neural networks (CNNs) and temporal architectures, which achieve high accuracy but show limited generalizability in real-world scenarios. Sensor-based and physiological models provide complementary strengths by capturing internal states and vehicle dynamics, while emerging techniques, such as auditory sensing and radio frequency (RF) methods, offer privacy-aware alternatives. Multimodal architectures consistently surpass unimodal baselines, demonstrating enhanced robustness, context awareness, and scalability by integrating diverse data streams. These findings emphasize the need to move beyond visual-only approaches and adopt multimodal systems that combine visual, physiological, and vehicular cues while keeping computational requirements in check. Future research should focus on developing lightweight, deployable multimodal frameworks, incorporating personalized baselines, and establishing cross-modality benchmarks to ensure real-world reliability in advanced driver assistance systems (ADAS) and road safety interventions.

[49] Lifelong Whole Slide Image Analysis: Online Vision-Language Adaptation and Past-to-Present Gradient Distillation

Doanh C. Bui,Hoai Luan Pham,Vu Trung Duong Le,Tuan Hai Vu,Van Duy Tran,Khang Nguyen,Yasuhiko Nakashima

Main category: cs.CV

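TL;DR: ADaFGrad is a lifelong learning method for whole slide image (WSI) analysis that combines pathology vision-language foundation models with a gradient-distillation mechanism, significantly improving classification performance and reducing forgetting.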

Details

Motivation: Whole slide images (WSIs) are crucial for cancer diagnosis, but their enormous size creates challenges for storage, processing, and model training; lifelong learning methods are needed to support an online model shared across institutions. Method: Builds a framework on pathology vision-language foundation models and adds a gradient-distillation mechanism that mimics the gradient with respect to the classification-head parameters, optimizing the model in a continual-learning setting. Result: ADaFGrad outperforms existing methods after only a few training epochs, with gains of up to 5.068% in the class-incremental learning scenario and up to 40.084% higher accuracy than its baseline. Conclusion: Through its module design, ADaFGrad significantly improves lifelong learning for WSI analysis and is suited to clinical diagnostic settings. Abstract: Whole Slide Images (WSIs) play a crucial role in accurate cancer diagnosis and prognosis, as they provide tissue details at the cellular level. However, the rapid growth of computational tasks involving WSIs poses significant challenges. Given that WSIs are gigapixels in size, they present difficulties in terms of storage, processing, and model training. Therefore, it is essential to develop lifelong learning approaches for WSI analysis. In scenarios where slides are distributed across multiple institutes, we aim to leverage them to develop a unified online model as a computational tool for cancer diagnosis in clinical and hospital settings. In this study, we introduce ADaFGrad, a method designed to enhance lifelong learning for whole-slide image (WSI) analysis. First, we leverage pathology vision-language foundation models to develop a framework that enables interaction between a slide's regional tissue features and a predefined text-based prototype buffer. Additionally, we propose a gradient-distillation mechanism that mimics the gradient of a logit with respect to the classification-head parameters across past and current iterations in a continual-learning setting. We construct a sequence of six TCGA datasets for training and evaluation. Experimental results show that ADaFGrad outperforms both state-of-the-art WSI-specific and conventional continual-learning methods after only a few training epochs, exceeding them by up to +5.068% in the class-incremental learning scenario while exhibiting the least forgetting (i.e., retaining the most knowledge from previous tasks).
Moreover, ADaFGrad surpasses its baseline by as much as +40.084% in accuracy, further demonstrating the effectiveness of the proposed modules.

[50] Drug classification based on X-ray spectroscopy combined with machine learning

Yongming Li,Peng Wang,Bangdong Han

Main category: cs.CV

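TL;DR: The paper proposes a fast, high-accuracy drug detection method that combines X-ray absorption spectroscopy with CNN, PSO, and SVM; experiments show a classification accuracy of 99.14%.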

Details

Motivation: New types of drugs are proliferating, and traditional detection methods are complex and place high demands on instruments and environments, so faster and more accurate detection techniques are urgently needed. Method: A CNN extracts features from X-ray spectra, PSO optimizes the SVM parameters, and the resulting model classifies 14 chemical reagents that resemble drugs. Result: The model reaches 99.14% classification accuracy and runs quickly, avoiding the efficiency loss caused by directly fusing PSO with SVM. Conclusion: The method provides a fast, highly accurate, and reliable classification and identification approach for drug detection, with promising prospects for wide application. Abstract: The proliferation of new types of drugs necessitates the urgent development of faster and more accurate detection methods. Traditional detection methods have high requirements for instruments and environments, making the operation complex. X-ray absorption spectroscopy, a non-destructive detection technique, offers advantages such as ease of operation, penetrative observation, and strong substance differentiation capabilities, making it well-suited for application in the field of drug detection and identification. In this study, we constructed a classification model using Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and Particle Swarm Optimization (PSO) to classify and identify drugs based on their X-ray spectral profiles. In the experiments, we selected 14 chemical reagents with chemical formulas similar to drugs as samples. We utilized CNN to extract features from the spectral data of these 14 chemical reagents and used the extracted features to train an SVM model. We also utilized PSO to optimize two critical initial parameters of the SVM. The experimental results demonstrate that this model achieved higher classification accuracy compared to two other common methods, with a prediction accuracy of 99.14%. Additionally, the model exhibited fast execution speed, mitigating the drawback of a drastic increase in running time and efficiency reduction that may result from the direct fusion of PSO and SVM. Therefore, the combined approach of X-ray absorption spectroscopy with CNN, PSO, and SVM provides a rapid, highly accurate, and reliable classification and identification method for the field of drug detection, holding promising prospects for widespread application.
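
The PSO step, tuning two SVM parameters, can be sketched as a small particle swarm minimizing a validation-error function over a two-dimensional search box. Several things here are assumptions rather than details from the abstract: the two parameters are taken to be log C and log gamma of an RBF-kernel SVM (the abstract does not name them), the swarm settings are conventional defaults, and `toy_validation_error` is a smooth stand-in for a real cross-validated SVM error so the example stays self-contained.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer over a box-constrained space."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_validation_error(params):
    """Hypothetical stand-in for cross-validated SVM error.

    Minimum at log_c = 1, log_gamma = -2.
    """
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2


search_box = [(-3.0, 3.0), (-5.0, 1.0)]
best_params, best_error = pso_minimize(toy_validation_error, search_box)
```

In a real pipeline the objective would train and validate an SVM on the CNN-extracted features for each candidate, which is where most of the running time goes.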

[51] Learning Heterogeneous Mixture of Scene Experts for Large-scale Neural Radiance Fields

Zhenxing Mi,Ping Yin,Xue Xiao,Dan Xu

Main category: cs.CV

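TL;DR: Switch-NeRF++ proposes a Heterogeneous Mixture of Hash Experts (HMoHE) network for efficient, scalable NeRF modeling of large-scale scenes, addressing scene decomposition, heterogeneous modeling, and efficiency.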

Details

Motivation: Existing NeRF methods for large-scale scenes fall short in learnable decomposition, scene heterogeneity, and modeling efficiency, so a unified solution is needed. Method: A Heterogeneous Mixture of Hash Experts (HMoHE) network combines a hash-based gating network with heterogeneous hash experts to model large-scale scenes end to end. Result: State-of-the-art rendering accuracy on several large-scale scene datasets, with 8x faster training and 16x faster rendering. Conclusion: Switch-NeRF++ is an efficient, scalable NeRF solution for real-world large-scale scene modeling. Abstract: Recent NeRF methods on large-scale scenes have underlined the importance of scene decomposition for scalable NeRFs. Although these methods achieve reasonable scalability, several critical problems remain unexplored, i.e., learnable decomposition, modeling scene heterogeneity, and modeling efficiency. In this paper, we introduce Switch-NeRF++, a Heterogeneous Mixture of Hash Experts (HMoHE) network that addresses these challenges within a unified framework. It is a highly scalable NeRF that learns heterogeneous decomposition and heterogeneous NeRFs efficiently for large-scale scenes in an end-to-end manner. In our framework, a gating network learns to decompose scenes and allocate 3D points to specialized NeRF experts. This gating network is co-optimized with the experts through our proposed Sparsely Gated Mixture of Experts (MoE) NeRF framework. We incorporate a hash-based gating network and distinct heterogeneous hash experts. The hash-based gating efficiently learns the decomposition of the large-scale scene. The distinct heterogeneous hash experts consist of hash grids of different resolution ranges, enabling effective learning of the heterogeneous representation of different scene parts. These design choices make our framework an end-to-end and highly scalable NeRF solution for real-world large-scale scene modeling that achieves both quality and efficiency. We evaluate our accuracy and scalability on existing large-scale NeRF datasets and a new dataset with very large-scale scenes ($> 6.5\,\mathrm{km}^2$) from UrbanBIS. Extensive experiments demonstrate that our approach can be easily scaled to various large-scale scenes and achieves state-of-the-art scene rendering accuracy. Furthermore, our method exhibits significant efficiency, with an 8x acceleration in training and a 16x acceleration in rendering compared to Switch-NeRF. Code will be released at https://github.com/MiZhenxing/Switch-NeRF.
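
As a rough illustration of the sparsely gated routing the abstract describes, the sketch below sends each 3D point to the single expert with the highest gate probability and scales that expert's output by the probability. The linear gate, the toy scalar experts, and the names `gate_scores` and `route_point` are assumptions made for illustration only; the paper's hash-based gating network and heterogeneous hash experts are not reproduced here.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_scores(point, gate_weights):
    # Hypothetical linear gate: one weight vector per expert.
    return [sum(w * x for w, x in zip(row, point)) for row in gate_weights]

def route_point(point, gate_weights, experts):
    # Top-1 routing: only the selected expert is evaluated for this point.
    probs = softmax(gate_scores(point, gate_weights))
    k = max(range(len(probs)), key=probs.__getitem__)
    return k, probs[k] * experts[k](point)

# Two toy "experts" (stand-ins for per-region NeRF networks).
experts = [lambda p: sum(p), lambda p: 2.0 * sum(p)]
# The gate splits space by the sign of the x coordinate.
gate_weights = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]

idx_a, out_a = route_point([2.0, 0.0, 1.0], gate_weights, experts)
idx_b, out_b = route_point([-2.0, 0.0, 1.0], gate_weights, experts)
```

Because only one expert runs per point, the cost per sample stays roughly constant as experts are added, which is the usual motivation for sparse gating.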

[52] Efficient Noise Calculation in Deep Learning-based MRI Reconstructions

Onat Dalmaz,Arjun D. Desai,Reinhard Heckel,Tolga Çukur,Akshay S. Chaudhari,Brian A. Hargreaves

Main category: cs.CV

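TL;DR: Proposes a method for computing voxel-wise variance in accelerated MRI reconstruction to quantify uncertainty from noise propagation. It approximates the noise covariance through the DL network's Jacobian and substantially reduces compute and memory requirements.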

Details

Motivation: DL-based MRI reconstruction methods often ignore noise propagation, although it is critical to reconstruction quality and algorithm design. Method: The noise covariance is approximated using the DL network's Jacobian; an unbiased estimator and a Jacobian sketching technique compute the voxel-wise variance efficiently. Result: Validated on knee and brain MRI datasets, with performance close to Monte Carlo simulation and compute and memory demands reduced by an order of magnitude or more. Conclusion: The method provides an efficient and accurate noise analysis tool for DL-based MRI reconstruction and may reshape how such methods are evaluated and deployed. Abstract: Accelerated MRI reconstruction involves solving an ill-posed inverse problem where noise in acquired data propagates to the reconstructed images. Noise analyses are central to MRI reconstruction for providing an explicit measure of solution fidelity and for guiding the design and deployment of novel reconstruction methods. However, deep learning (DL)-based reconstruction methods have often overlooked noise propagation due to inherent analytical and computational challenges, despite its critical importance. This work proposes a theoretically grounded, memory-efficient technique to calculate voxel-wise variance for quantifying uncertainty due to acquisition noise in accelerated MRI reconstructions. Our approach approximates noise covariance using the DL network's Jacobian, which is intractable to calculate. To circumvent this, we derive an unbiased estimator for the diagonal of this covariance matrix (voxel-wise variance) and introduce a Jacobian sketching technique to efficiently implement it. We evaluate our method on knee and brain MRI datasets for both data- and physics-driven networks trained in supervised and unsupervised manners. Compared to empirical references obtained via Monte Carlo simulations, our technique achieves near-equivalent performance while reducing computational and memory demands by an order of magnitude or more. Furthermore, our method is robust across varying input noise levels, acceleration factors, and diverse undersampling schemes, highlighting its broad applicability. Our work reintroduces accurate and efficient noise analysis as a central tenet of reconstruction algorithms, holding promise to reshape how we evaluate and deploy DL-based MRI. Our code will be made publicly available upon acceptance.
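
To make the idea of estimating the covariance diagonal without forming the Jacobian more concrete, here is a minimal sketch under strong simplifying assumptions: the "network" is a fixed linear map `A` (so its Jacobian is `A` itself), the input noise is white with standard deviation `sigma`, and the estimator is a Hutchinson-style average over random sign probes. The abstract does not spell out the authors' estimator, so this is only one plausible instance; `sketched_variance` and `exact_variance` are hypothetical names.

```python
import random

def matvec(A, v):
    # Plain matrix-vector product; stands in for a Jacobian-vector product.
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def exact_variance(A, sigma):
    # diag(sigma^2 * A A^T): feasible only because A is tiny here.
    return [sigma ** 2 * sum(a * a for a in row) for row in A]

def sketched_variance(jvp, n_in, sigma, n_probes, rng):
    # For random sign vectors v, E[v v^T] = I, so E[(J v)_i^2] = (J J^T)_ii.
    acc = None
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(n_in)]
        jv = jvp(v)
        if acc is None:
            acc = [0.0] * len(jv)
        for i, y in enumerate(jv):
            acc[i] += y * y
    return [sigma ** 2 * a / n_probes for a in acc]

A = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]
rng = random.Random(0)
estimate = sketched_variance(lambda v: matvec(A, v), 3, 0.5, 20000, rng)
reference = exact_variance(A, 0.5)
```

Each probe needs only one Jacobian-vector product, so memory scales with the image size rather than with the full Jacobian, which is the property the abstract emphasizes.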

[53] MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution

Siran Peng,Zipei Wang,Li Gao,Xiangyu Zhu,Tianshuo Zhang,Ajian Liu,Haoyuan Zhang,Zhen Lei

Main category: cs.CV

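TL;DR: Proposes VLF-FFD, a vision-language fusion method that improves MLLM performance in face forgery detection, together with EFF++, an extended dataset that supports more effective training.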

Details

Motivation: Existing methods integrate the visual and textual modalities poorly, and a more effective solution is needed to counter the threat of deepfakes. Method: The authors propose VLF-Net, which enables bidirectional interaction between visual and textual features through a three-stage training pipeline, and build the EFF++ dataset as an extension of FF++. Result: VLF-FFD achieves SOTA performance in both cross-dataset and intra-dataset evaluations. Conclusion: VLF-FFD performs strongly in face forgery detection and offers an effective solution for deepfake detection. Abstract: Reliable face forgery detection algorithms are crucial for countering the growing threat of deepfake-driven disinformation. Previous research has demonstrated the potential of Multimodal Large Language Models (MLLMs) in identifying manipulated faces. However, existing methods typically depend on either the Large Language Model (LLM) alone or an external detector to generate classification results, which often leads to sub-optimal integration of visual and textual modalities. In this paper, we propose VLF-FFD, a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection. Our key contributions are twofold. First, we present EFF++, a frame-level, explainability-driven extension of the widely used FaceForensics++ (FF++) dataset. In EFF++, each manipulated video frame is paired with a textual annotation that describes both the forgery artifacts and the specific manipulation technique applied, enabling more effective and informative MLLM training. Second, we design a Vision-Language Fusion Network (VLF-Net) that promotes bidirectional interaction between visual and textual features, supported by a three-stage training pipeline to fully leverage its potential. VLF-FFD achieves state-of-the-art (SOTA) performance in both cross-dataset and intra-dataset evaluations, underscoring its exceptional effectiveness in face forgery detection.

[54] R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation

Meng-Hao Guo,Jiajun Xu,Yi Zhang,Jiaxi Song,Haoyang Peng,Yi-Xuan Deng,Xinzhi Dong,Kiyohiro Nakayama,Zhengyang Geng,Chen Wang,Bolin Ni,Guo-Wei Yang,Yongming Rao,Houwen Peng,Han Hu,Gordon Wetzstein,Shi-min Hu

Main category: cs.CV

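TL;DR: The paper introduces R-Bench, a multi-disciplinary, multimodal reasoning benchmark for evaluating language and multimodal models, and finds that current models perform poorly on complex reasoning tasks.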

Details

Motivation: Existing reasoning benchmarks do not adequately evaluate reasoning in complex, multi-disciplinary, and multimodal settings, so a more rigorous evaluation tool is needed. Method: The authors build a bilingual (English and Chinese) benchmark with 1,094 questions for language models and 665 questions for multimodal models, spanning many subjects and calibrated for difficulty and balance. Result: Even advanced models such as OpenAI o1 reach only 53.2% accuracy on the multimodal reasoning tasks. Conclusion: R-Bench is a highly challenging multi-disciplinary benchmark that exposes the limitations of current models on complex reasoning tasks. Abstract: Reasoning stands as a cornerstone of intelligence, enabling the synthesis of existing knowledge to solve complex problems. Despite remarkable progress, existing reasoning benchmarks often fail to rigorously evaluate the nuanced reasoning capabilities required for complex, real-world problem-solving, particularly in multi-disciplinary and multimodal contexts. In this paper, we introduce a graduate-level, multi-disciplinary, English-Chinese benchmark, dubbed Reasoning Bench (R-Bench), for assessing the reasoning capability of both language and multimodal models. R-Bench spans 1,094 questions across 108 subjects for language model evaluation and 665 questions across 83 subjects for multimodal model testing in both English and Chinese. These questions are meticulously curated to ensure rigorous difficulty calibration, subject balance, and cross-linguistic alignment, enabling the assessment to be an Olympiad-level multi-disciplinary benchmark. We evaluate widely used models, including OpenAI o1, GPT-4o, DeepSeek-R1, etc. Experimental results indicate that advanced models perform poorly on complex reasoning, especially multimodal reasoning. Even the top-performing model OpenAI o1 achieves only 53.2% accuracy on our multimodal evaluation. Data and code are made publicly available.

[55] A Birotation Solution for Relative Pose Problems

Hongbo Zhao,Ziwei Long,Mengtan Zhang,Hanli Wang,Qijun Chen,Rui Fan

Main category: cs.CV

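TL;DR: Proposes a novel birotation solution that introduces three basis transformations with associated geometric metrics and minimizes energy functions on a Riemannian manifold, achieving superior relative pose estimation performance.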

Details

Motivation: Relative pose estimation is a fundamental computer vision problem that is traditionally solved either by estimating and decomposing the essential matrix or by directly estimating rotation and translation. This work departs from that tradition with a birotation solution. Method: Three basis transformations are introduced, each associated with a geometric metric that quantifies the distance between the relative pose to be estimated and that basis transformation. Three energy functions built on these metrics are minimized on a Riemannian manifold by iteratively updating two rotation matrices, and the relative pose is recovered from the rotation matrices and the basis transformation with the minimum energy. Result: Quantitative and qualitative evaluations across a range of relative pose estimation tasks show superior performance for the proposed birotation solution. Conclusion: The birotation solution offers a novel and efficient approach to relative pose estimation, with experiments confirming its performance advantage. Abstract: Relative pose estimation, a fundamental computer vision problem, has been extensively studied for decades. Existing methods either estimate and decompose the essential matrix or directly estimate the rotation and translation to obtain the solution. In this article, we break the mold by tackling this traditional problem with a novel birotation solution. We first introduce three basis transformations, each associated with a geometric metric to quantify the distance between the relative pose to be estimated and its corresponding basis transformation. Three energy functions, designed based on these metrics, are then minimized on the Riemannian manifold $\mathrm{SO(3)}$ by iteratively updating the two rotation matrices. The two rotation matrices and the basis transformation corresponding to the minimum energy are ultimately utilized to recover the relative pose. Extensive quantitative and qualitative evaluations across diverse relative pose estimation tasks demonstrate the superior performance of our proposed birotation solution. Source code, demo video, and datasets will be available at https://mias.group/birotation-solution upon publication.

[56] Point2Primitive: CAD Reconstruction from Point Cloud by Direct Primitive Prediction

Cheng Wang,Xinzhu Ma,Bin Wang,Shixiang Tang,Yuan Meng,Ping Jiang

Main category: cs.CV

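TL;DR: Proposes Point2Primitive, a method that generates editable CAD models directly from point clouds by using an improved transformer to predict every element of the extrusion primitives, achieving accurate parameter prediction and topology reconstruction.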

Details

Motivation: Existing methods represent sketches with implicit fields, which reconstructs curved edges poorly. This work aims to improve the accuracy and editability of reconstructed CAD models by directly predicting the elements of the extrusion primitives. Method: The Point2Primitive network uses an improved transformer to detect and predict sketch curves (type and parameters) directly from point clouds, refining the parameters autoregressively. Topology is rebuilt by extrusion segmentation, and extrusion parameters are recovered from the predicted curves and the computed extrusion operation. Result: Experiments show superior primitive prediction accuracy and CAD reconstruction, with reconstructed shapes of high geometric fidelity. Conclusion: Point2Primitive efficiently produces accurate, editable CAD models from point clouds and offers a new direction for CAD reconstruction. Abstract: Recovering CAD models from point clouds, especially the sketch-extrusion process, can be seen as the process of rebuilding the topology and extrusion primitives. Previous methods utilize implicit fields for sketch representation, leading to poor shape reconstruction of curved edges. In this paper, we propose a CAD reconstruction network that produces editable CAD models from input point clouds (Point2Primitive) by directly predicting every element of the extrusion primitives. Point2Primitive can directly detect and predict sketch curves (type and parameter) from point clouds based on an improved transformer. The sketch curve parameters are formulated as position queries and optimized in an autoregressive way, leading to high parameter accuracy. The topology is rebuilt by extrusion segmentation, and each extrusion parameter (sketch and extrusion operation) is recovered by combining the predicted curves and the computed extrusion operation. Extensive experiments demonstrate that our method is superior in primitive prediction accuracy and CAD reconstruction. The reconstructed shapes are of high geometrical fidelity.

[57] A UNet Model for Accelerated Preprocessing of CRISM Hyperspectral Data for Mineral Identification on Mars

Priyanka Kumari,Sampriti Soor,Amba Shetty,Archana M. Nair

Main category: cs.CV

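TL;DR: Proposes a UNet-based autoencoder for efficient preprocessing of CRISM MTRDR hyperspectral data, sharply reducing preprocessing time while preserving mineral absorption features.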

Details

Motivation: Traditional hyperspectral preprocessing is computationally heavy and slow, which limits the efficiency of mineral identification on Mars. Method: A UNet-based autoencoder, trained on augmented spectra from the MICA spectral library that simulate MTRDR data conditions, automates preprocessing steps such as smoothing and continuum removal. Result: Preprocessing time drops from 1.5 hours to 5 minutes while classification accuracy is maintained. Conclusion: The framework markedly improves the speed and reliability of Martian mineral mapping and has broad application potential. Abstract: Accurate mineral identification on the Martian surface is critical for understanding the planet's geological history. This paper presents a UNet-based autoencoder model for efficient spectral preprocessing of CRISM MTRDR hyperspectral data, addressing the limitations of traditional methods that are computationally intensive and time-consuming. The proposed model automates key preprocessing steps, such as smoothing and continuum removal, while preserving essential mineral absorption features. Trained on augmented spectra from the MICA spectral library, the model introduces realistic variability to simulate MTRDR data conditions. By integrating this framework, preprocessing time for an 800x800 MTRDR scene is reduced from 1.5 hours to just 5 minutes on an NVIDIA T1600 GPU. The preprocessed spectra are subsequently classified using MICAnet, a deep learning model for Martian mineral identification. Evaluation on labeled CRISM TRDR data demonstrates that the proposed approach achieves competitive accuracy while significantly enhancing preprocessing efficiency. This work highlights the potential of the UNet-based preprocessing framework to improve the speed and reliability of mineral mapping on Mars.
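
For readers unfamiliar with continuum removal, one of the preprocessing steps the model is said to automate, the sketch below shows the conventional convex-hull version: fit the upper convex hull over the spectrum and divide the reflectance by it, so absorption features appear as dips below 1. This is the classical procedure and not the paper's UNet; the function names and the toy spectrum are assumptions, and wavelengths are assumed strictly increasing with positive reflectance.

```python
def upper_hull(xs, ys):
    # Monotone-chain upper convex hull over points sorted by x.
    hull = []
    for p in zip(xs, ys):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:
                hull.pop()  # the last hull point lies on or below the new edge
            else:
                break
        hull.append(p)
    return hull

def continuum_removed(xs, ys):
    # Divide each reflectance value by the hull (continuum) at that wavelength.
    hull = upper_hull(xs, ys)
    out, seg = [], 0
    for x, y in zip(xs, ys):
        while hull[seg + 1][0] < x:
            seg += 1
        (x1, y1), (x2, y2) = hull[seg], hull[seg + 1]
        continuum = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
        out.append(y / continuum)
    return out

# Toy spectrum with a single absorption dip at 2.0 micrometers.
wavelengths = [1.0, 1.5, 2.0, 2.5, 3.0]
reflectance = [0.8, 0.9, 0.45, 0.9, 0.8]
removed = continuum_removed(wavelengths, reflectance)
```

In this toy case the band at 2.0 is normalized to about half of its continuum while the hull points stay at 1, which is the kind of feature a mineral classifier keys on.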

[58] Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin

Yuchen Wang,Xuefeng Bai,Xiucheng Li,Weili Guan,Liqiang Nie,Xinyang Chen

Main category: cs.CV

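TL;DR: The paper proposes a framework that uses concept alignment and a confusion-aware calibrated margin to address imbalanced pseudolabels in vision-language models (VLMs), markedly improving pseudolabel accuracy and balance.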

Details

Motivation: Imbalanced pseudolabels degrade performance, and existing methods have not examined the root causes in depth. Method: Concept alignment and a confusion-aware calibrated margin mechanism strengthen underperforming classes and promote balanced predictions. Result: Experiments on six benchmark datasets show a relative improvement of 6.29% over the SoTA method. Conclusion: The framework effectively addresses pseudolabel imbalance and improves model performance. Abstract: Adapting vision-language models (VLMs) to downstream tasks with pseudolabels has gained increasing attention. A major obstacle is that the pseudolabels generated by VLMs tend to be imbalanced, leading to inferior performance. While existing methods have explored various strategies to address this, the underlying causes of imbalance remain insufficiently investigated. To fill this gap, we delve into imbalanced pseudolabels and identify two primary contributing factors: concept mismatch and concept confusion. To mitigate these two issues, we propose a novel framework incorporating concept alignment and confusion-aware calibrated margin mechanisms. The core of our approach lies in enhancing underperforming classes and promoting balanced predictions across categories, thus mitigating imbalance. Extensive experiments on six benchmark datasets with three learning paradigms demonstrate that the proposed method effectively enhances the accuracy and balance of pseudolabels, achieving a relative improvement of 6.29% over the SoTA method. Our code is available at https://anonymous.4open.science/r/CAP-C642/

[59] Transforming faces into video stories -- VideoFace2.0

Branko Brkljač,Vladimir Kalušev,Branislav Popović,Milan Sečujski

Main category: cs.CV

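TL;DR: The paper presents VideoFace2.0, a system for spatial and temporal localization of faces in video (face ReID) that supports cataloging and structured outputs, aimed at TV production and machine learning dataset creation.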

Details

Motivation: Inspired by the earlier Videoface device, the work aims to build an efficient video analytics tool that supplies structured video data for downstream tasks such as lip reading and multimodal speech recognition. Method: Face detection, face recognition, and passive tracking-by-detection are combined into a robust and efficient face ReID algorithm. Result: Experiments confirm the applicability of the algorithm; the system runs in near real time and suits a range of application scenarios. Conclusion: VideoFace2.0 offers a modular solution for video analysis and for producing high-quality multimodal datasets, and may encourage the development of similar tools. Abstract: Face detection and face recognition have been a focus of the vision community since its very beginnings. Inspired by the success of the original Videoface digitizer, a pioneering device that allowed users to capture video signals from any source, we have designed an advanced video analytics tool to efficiently create structured video stories, i.e., identity-based information catalogs. VideoFace2.0 is the name of the developed system for spatial and temporal localization of each unique face in the input video, i.e., face re-identification (ReID), which also allows their cataloging, characterization, and the creation of structured video outputs for later downstream tasks. The developed near real-time solution is primarily designed for application scenarios involving TV production and media analysis, and as an efficient tool for creating the large video datasets necessary for training machine learning (ML) models in challenging vision tasks such as lip reading and multimodal speech recognition. The conducted experiments confirm the applicability of the proposed face ReID algorithm, which combines the concepts of face detection, face recognition, and passive tracking-by-detection to achieve robust and efficient face ReID. The system is envisioned as a compact and modular extension of existing video production equipment. We hope that the presented work and shared code will stimulate further interest in the development of similar, application-specific video analysis tools, and lower the entry barrier for the production of high-quality multi-modal ML datasets in the future.

[60] RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Shuhang Xun,Sicheng Tao,Jungang Li,Yibo Shi,Zhixin Lin,Zhanhui Zhu,Yibo Yan,Hanqian Li,Linghao Zhang,Shikang Wang,Yixin Liu,Hanbo Zhang,Xuming Hu,Ying Ma

Main category: cs.CV

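TL;DR: RTV-Bench is a fine-grained benchmark for evaluating the real-time video analysis ability of multimodal large language models (MLLMs), containing 552 videos and 4,631 QA pairs. Experiments show that open-source real-time models outperform offline ones but still trail top proprietary models.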

Details

Motivation: Current benchmarks cannot adequately evaluate the continuous perception, understanding, and reasoning of MLLMs in dynamic real-time environments, so a finer-grained evaluation tool is needed. Method: RTV-Bench tests MLLMs under three principles: Multi-Timestamp Question Answering (MTQA), a hierarchical question structure, and multi-dimensional evaluation. Result: Open-source real-time models outperform offline models but fall short of proprietary ones; larger model size or higher frame sampling rates bring little improvement. Conclusion: Model architectures need to be optimized for real-time video analysis, and RTV-Bench provides tooling for that research. Abstract: Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experimental results show that open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: https://github.com/LJungang/RTV-Bench.

[61] Hierarchical Compact Clustering Attention (COCA) for Unsupervised Object-Centric Learning

Can Küçüksözen,Yücel Yemez

Main category: cs.CV

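TL;DR: COCA-Net is a hierarchical network of attention-based clustering modules for unsupervised object discovery; it uses the notion of compactness to locate object centroids and produces high-quality segmentation masks.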

Details

Motivation: To solve unsupervised object discovery on single images and improve object-centric representation learning. Method: An attention-based hierarchical clustering module (COCA) built on the notion of compactness is cascaded into the COCA-Net architecture. Result: Strong results on six datasets, superior to or competitive with the best existing models. Conclusion: COCA-Net performs well in unsupervised object discovery, offering flexibility and high-quality segmentation. Abstract: We propose the Compact Clustering Attention (COCA) layer, an effective building block that introduces a hierarchical strategy for object-centric representation learning, while solving the unsupervised object discovery task on single images. COCA is an attention-based clustering module capable of extracting object-centric representations from multi-object scenes when cascaded into a bottom-up hierarchical network architecture, referred to as COCA-Net. At its core, COCA utilizes a novel clustering algorithm that leverages the physical concept of compactness to highlight distinct object centroids in a scene, providing a spatial inductive bias. Thanks to this strategy, COCA-Net generates high-quality segmentation masks on both the decoder side and, notably, the encoder side of its pipeline. Additionally, COCA-Net is not bound by a predetermined number of object masks that it generates and handles the segmentation of background elements better than its competitors. We demonstrate COCA-Net's segmentation performance on six widely adopted datasets, achieving superior or competitive results against the state-of-the-art models across nine different evaluation metrics.

[62] Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation

Volodymyr Havrylov,Haiwen Huang,Dan Zhang,Andreas Geiger

Main category: cs.CV

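TL;DR: Studies how task-agnostic feature upsampling modules can improve vision foundation models (VFMs) on dense prediction tasks, using interactive segmentation (IS) as the benchmark.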

Details

Motivation: VFMs typically produce low-resolution features, which limits their direct use in dense prediction tasks, so a general feature upsampling approach is needed. Method: Task-agnostic feature upsampling modules are evaluated on interactive segmentation (IS), which is a demanding setting because of its multimodal input (an image and user clicks) and dense mask output. Result: Experiments show that choosing an appropriate upsampling strategy significantly improves the quality of VFM features. Conclusion: Task-agnostic feature upsampling effectively improves VFMs on dense prediction tasks, and interactive segmentation is a useful benchmark for evaluating it. Abstract: Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks. As VFMs' popularity grows, there is an increasing interest in understanding their effectiveness for dense prediction tasks. However, VFMs typically produce low-resolution features, limiting their direct applicability in this context. One way to tackle this limitation is by employing a task-agnostic feature upsampling module that refines the resolution of VFM features. To assess the effectiveness of this approach, we investigate Interactive Segmentation (IS) as a novel benchmark for evaluating feature upsampling methods on VFMs. Due to its inherent multimodal input, consisting of an image and a set of user-defined clicks, as well as its dense mask output, IS creates a challenging environment that demands comprehensive visual scene understanding. Our benchmarking experiments show that selecting appropriate upsampling strategies significantly improves the quality of VFM features. The code is released at https://github.com/havrylovv/iSegProbe

[63] HandOcc: NeRF-based Hand Rendering with Occupancy Networks

Maksym Ivashechkin,Oscar Mendez,Richard Bowden

Main category: cs.CV

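TL;DR: HandOcc is an occupancy-based hand rendering framework that combines NeRF with a convolutional model, removing the dependence on parametric meshes and delivering high fidelity, fast rendering, and strong hand appearance transfer.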

Details

Motivation: Traditional methods rely on parametric meshes, which ties them to mesh initialization and mesh resolution and prevents generalization to objects without a parametric model. HandOcc aims to remove these limits with a more flexible rendering approach. Method: A meshless 3D rendering pipeline takes only a 3D skeleton as input, extracts appearance with a convolutional model, and renders with an occupancy-conditioned NeRF renderer. Hand occupancy is used to resolve hand-to-hand interactions and improve results. Result: State-of-the-art performance on the InterHand2.6M dataset, with fast rendering and high-quality hand appearance transfer. Conclusion: By combining an occupancy representation with a NeRF renderer, HandOcc provides an efficient, high-fidelity hand rendering method that overcomes the limitations of traditional approaches. Abstract: We propose HandOcc, a novel framework for hand rendering based upon occupancy. Popular rendering methods such as NeRF are often combined with parametric meshes to provide deformable hand models. However, in doing so, such approaches present a trade-off between the fidelity of the mesh and the complexity and dimensionality of the parametric model. The simplicity of parametric mesh structures is appealing, but the underlying issue is that it binds methods to mesh initialization, making them unable to generalize to objects where a parametric model does not exist. It also means that estimation is tied to mesh resolution and the accuracy of mesh fitting. This paper presents a pipeline for meshless 3D rendering, which we apply to the hands. By providing only a 3D skeleton, the desired appearance is extracted via a convolutional model. We do this by exploiting a NeRF renderer conditioned upon an occupancy-based representation. The approach uses the hand occupancy to resolve hand-to-hand interactions, further improving results and allowing fast rendering and excellent hand appearance transfer. On the benchmark InterHand2.6M dataset, we achieved state-of-the-art results.

[64] SignSplat: Rendering Sign Language via Gaussian Splatting

Maksym Ivashechkin,Oscar Mendez,Richard Bowden

Main category: cs.CV

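TL;DR: Proposes a few-view human rendering method based on Gaussian splatting that targets subtle, complex motion such as sign language, improving rendering quality through sequence-level optimization and regularization.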

Details

Motivation: Existing methods mostly target simple motions such as dancing or walking, whereas subtle and complex motions such as sign language demand higher precision, and multi-view data is hard to capture. Method: Sequence data is used to improve model fitting, mesh parameters are constrained, regularization reduces overfitting, and an adaptive control method refines the Gaussians. Result: Strong results on sign language video rendering, significantly outperforming existing methods. Conclusion: The method models subtle, complex motion effectively from few views and provides high-quality rendering for applications such as sign language. Abstract: State-of-the-art approaches for conditional human body rendering via Gaussian splatting typically focus on simple body motions captured from many views. This is often in the context of dancing or walking. However, for more complex use cases, such as sign language, we care less about large body motion and more about subtle and complex motions of the hands and face. The problems of building high fidelity models are compounded by the complexity of capturing multi-view data of sign. The solution is to make better use of sequence data, ensuring that we can overcome the limited information from only a few views by exploiting temporal variability. Nevertheless, learning from sequence-level data requires extremely accurate and consistent model fitting to ensure that appearance is consistent across complex motions. We focus on how to achieve this, constraining mesh parameters to build an accurate Gaussian splatting framework from few views capable of modelling subtle human motion. We leverage regularization techniques on the Gaussian parameters to mitigate overfitting and rendering artifacts. Additionally, we propose a new adaptive control method to densify Gaussians and prune splat points on the mesh surface. To demonstrate the accuracy of our approach, we render novel sequences of sign language video, building on neural machine translation approaches to sign stitching. On benchmark datasets, our approach achieves state-of-the-art performance; and on highly articulated and complex sign language motion, we significantly outperform competing approaches.

[65] Unaligned RGB Guided Hyperspectral Image Super-Resolution with Spatial-Spectral Concordance

Yingkai Zhang,Zeqiang Lai,Tao Zhang,Ying Fu,Chenghu Zhou

Main category: cs.CV

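TL;DR: Proposes the Spatial-Spectral Concordance Hyperspectral Super-Resolution (SSC-HSR) framework, which uses two-stage image alignment and feature aggregation modules to address the inaccurate alignment and weak module interaction of existing methods.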

Details

Motivation: Existing hyperspectral image super-resolution methods are limited at high resolution ratios and cannot exploit reference images effectively, mainly because of inaccurate alignment and insufficient interaction between modules. Method: The framework comprises a two-stage image alignment module (a fine-tuned optical flow model and a texture-refining warp model), a feature aggregation module (an Iterative Deformable Feature Aggregation block), and an attention fusion module (spectral-wise attention blocks). Result: Experiments on three natural or remote-sensing datasets show that the method outperforms existing techniques in both quantitative and qualitative evaluations. Conclusion: By improving alignment and module interaction, the SSC-HSR framework markedly improves hyperspectral super-resolution performance. Abstract: Hyperspectral image super-resolution aims to improve spatial resolution, yet its performance is often limited at high resolution ratios. The recent adoption of high-resolution reference images for super-resolution is driven by the poor spatial detail found in low-resolution HSIs, presenting it as a favorable method. However, these approaches cannot effectively utilize information from the reference image, due to inaccurate alignment and inadequate interaction between the alignment and fusion modules. In this paper, we introduce a Spatial-Spectral Concordance Hyperspectral Super-Resolution (SSC-HSR) framework for unaligned reference RGB guided HSI SR to address the issues of inaccurate alignment and poor interactivity of the previous approaches. Specifically, to ensure spatial concordance, i.e., to align images more accurately across resolutions and refine textures, we construct a Two-Stage Image Alignment with a synthetic generation pipeline in the image alignment module, where the fine-tuned optical flow model produces a more accurate optical flow in the first stage and the warp model refines damaged textures in the second stage. To enhance the interaction between alignment and fusion modules and ensure spectral concordance during reconstruction, we propose a Feature Aggregation module and an Attention Fusion module. In the feature aggregation module, we introduce an Iterative Deformable Feature Aggregation block to achieve significant feature matching and texture aggregation under the guidance of multi-scale fusion results, iteratively generating learnable offsets. Besides, we introduce two basic spectral-wise attention blocks in the attention fusion module to model the inter-spectra interactions. Extensive experiments on three natural or remote-sensing datasets show that our method outperforms state-of-the-art approaches on both quantitative and qualitative evaluations.

[66] GarmentGS: Point-Cloud Guided Gaussian Splatting for High-Fidelity Non-Watertight 3D Garment Reconstruction

Zhihao Tang,Shenghao Yang,Hongtao Zhang,Mingbo Zhao

Main category: cs.CV

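TL;DR: GarmentGS uses dense point clouds to guide Gaussian primitives, enabling efficient, high-fidelity 3D garment reconstruction with much lower time and cost.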

Details

Motivation: Traditional 3D garment creation is slow and labor-intensive. Gaussian splatting has brought breakthroughs in 3D scene reconstruction, but the irregularity of its primitives makes high-fidelity, non-watertight garments hard to reconstruct. Method: GarmentGS uses dense point clouds to guide the movement, flattening, and rotation of Gaussian primitives, quickly reconstructing the garment point cloud and generating single-layer meshes. Result: Point cloud reconstruction completes in 10 minutes, compared with several hours for traditional methods, while rendering quality and geometric accuracy remain high. Conclusion: GarmentGS offers fast training and real-time rendering and opens a new route to 3D garment reconstruction. Abstract: Traditional 3D garment creation requires extensive manual operations, resulting in high time and labor costs. Recently, 3D Gaussian Splatting has achieved breakthrough progress in 3D scene reconstruction and rendering, attracting widespread attention and opening new pathways for 3D garment reconstruction. However, due to the unstructured and irregular nature of Gaussian primitives, it is difficult to reconstruct high-fidelity, non-watertight 3D garments. In this paper, we present GarmentGS, a dense point cloud-guided method that can reconstruct high-fidelity garment surfaces with high geometric accuracy and generate non-watertight, single-layer meshes. Our method introduces a fast dense point cloud reconstruction module that can complete garment point cloud reconstruction in 10 minutes, compared to traditional methods that require several hours. Furthermore, we use dense point clouds to guide the movement, flattening, and rotation of Gaussian primitives, enabling better distribution on the garment surface to achieve superior rendering effects and geometric accuracy. Through numerical and visual comparisons, our method achieves fast training and real-time rendering while maintaining competitive quality.

[67] HiLLIE: Human-in-the-Loop Training for Low-Light Image Enhancement

Xiaorui Zhao,Xinyue Zhou,Peibei Cao,Junyu Lou,Shuhang Gu

Main category: cs.CV

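TL;DR: Proposes HiLLIE, a human-in-the-loop framework for low-light image enhancement that improves output quality through iterative training and human visual preference annotations.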

Details

Motivation: In low-light image enhancement (LLIE), producing high-quality images that match human visual preferences remains challenging. Method: A human-in-the-loop training framework collects visual quality annotations and trains a tailored image quality assessment (IQA) model to learn human preferences, which then guides the training of the enhancement model. Result: Experiments show that the approach significantly improves unsupervised LLIE models both quantitatively and qualitatively. Conclusion: With a small number of annotations and iterative training, HiLLIE effectively improves the visual quality of low-light image enhancement. Abstract: Developing effective approaches to generate enhanced results that align well with human visual preferences for high-quality well-lit images remains a challenge in low-light image enhancement (LLIE). In this paper, we propose a human-in-the-loop LLIE training framework that improves the visual quality of unsupervised LLIE model outputs through iterative training stages, named HiLLIE. At each stage, we introduce human guidance into the training process through efficient visual quality annotations of enhanced outputs. Subsequently, we employ a tailored image quality assessment (IQA) model to learn human visual preferences encoded in the acquired labels, which is then utilized to guide the training process of an enhancement model. With only a small amount of pairwise ranking annotations required at each stage, our approach continually improves the IQA model's capability to simulate human visual assessment of enhanced outputs, thus leading to visually appealing LLIE results. Extensive experiments demonstrate that our approach significantly improves unsupervised LLIE model performance in terms of both quantitative and qualitative performance. The code and collected ranking dataset will be available at https://github.com/LabShuHangGU/HiLLIE.
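
The abstract says the IQA model learns from pairwise ranking annotations but does not state the loss. A common choice for such labels, assumed here purely for illustration, is the logistic (Bradley-Terry style) ranking loss on the difference between the scores of the preferred and the other image. The function name `pairwise_ranking_loss` and the example scores are hypothetical.

```python
import math

def pairwise_ranking_loss(score_preferred, score_other):
    # -log(sigmoid(margin)), written in a numerically stable form.
    margin = score_preferred - score_other
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The IQA model agrees with the annotator, is undecided, or disagrees.
loss_agree = pairwise_ranking_loss(2.0, 0.5)
loss_tie = pairwise_ranking_loss(1.0, 1.0)
loss_disagree = pairwise_ranking_loss(0.5, 2.0)
```

The loss shrinks as the model scores the human-preferred image higher, so minimizing it over annotated pairs pushes the IQA scores toward the annotators' ordering.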

[68] Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving

Alexey Nekrasov,Malcolm Burdorf,Stewart Worrall,Bastian Leibe,Julie Stephany Berrio Perez

Main category: cs.CV

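TL;DR: Presents a new dataset for anomaly segmentation in driving scenes, the first public one to combine dense 3D semantic labels, LiDAR and camera data, and sequential information to support anomaly detection across various ranges.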

Details

Motivation: Autonomous vehicles must detect and handle unexpected objects or anomalies on the road, but 3D anomaly detection is underexplored and existing datasets lack high-quality multimodal data. Method: The authors build a public dataset with dense 3D semantic labels and both LiDAR and camera data, and evaluate several baseline 3D segmentation models. Result: The dataset and evaluation code will be released to support testing and comparison of different approaches. Conclusion: The dataset fills a gap in 3D anomaly detection and supports safe navigation for autonomous driving. Abstract: To operate safely, autonomous vehicles (AVs) need to detect and handle unexpected objects or anomalies on the road. While significant research exists for anomaly detection and segmentation in 2D, the 3D setting remains underexplored. Existing datasets lack the high-quality multimodal data typically found in AVs. This paper presents a novel dataset for anomaly segmentation in driving scenarios. To the best of our knowledge, it is the first publicly available dataset focused on road anomaly segmentation with dense 3D semantic labeling, incorporating both LiDAR and camera data, as well as sequential information to enable anomaly detection across various ranges. This capability is critical for the safe navigation of autonomous vehicles. We adapted and evaluated several baseline models for 3D segmentation, highlighting the challenges of 3D anomaly detection in driving environments. Our dataset and evaluation code will be openly available, facilitating the testing and performance comparison of different approaches.

[69] Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution

Xingyu Zhou,Wei Long,Jingbo Lu,Shiyin Jiang,Weiyi You,Haifeng Wu,Shuhang Gu

Main category: cs.CV

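TL;DR: LRTI-VSR is a new training framework for video super-resolution that improves performance through long-range refocused temporal information and an improved attention mechanism.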

Details

Motivation: Existing recurrent video super-resolution models struggle to learn long-term dependencies in long videos. Method: The LRTI-VSR framework combines a training strategy that uses features from long video clips with a transformer block that refocuses temporal information. Result: State-of-the-art performance on long-video test sets while maintaining training and computational efficiency. Conclusion: By using long-range temporal information efficiently, LRTI-VSR markedly improves video super-resolution. Abstract: Video super-resolution (VSR) can achieve better performance compared to single image super-resolution by additionally leveraging temporal information. In particular, the recurrent-based VSR model exploits long-range temporal information during inference and achieves superior detail restoration. However, effectively learning these long-term dependencies within long videos remains a key challenge. To address this, we propose LRTI-VSR, a novel training framework for recurrent VSR that efficiently leverages Long-Range Refocused Temporal Information. Our framework includes a generic training strategy that utilizes temporal propagation features from long video clips while training on shorter video clips. Additionally, we introduce a refocused intra&inter-frame transformer block which allows the VSR model to selectively prioritize useful temporal information through its attention module while further improving inter-frame information utilization in the FFN module. We evaluate LRTI-VSR on both CNN and transformer-based VSR architectures, conducting extensive ablation studies to validate the contribution of each component. Experiments on long-video test sets demonstrate that LRTI-VSR achieves state-of-the-art performance while maintaining training and computational efficiency.

[70] Focus What Matters: Matchability-Based Reweighting for Local Feature Matching

Dongyue Li

Main category: cs.CV

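TL;DR: Proposes a novel attention reweighting mechanism that classifies pixels as matchable or non-matchable and dynamically adjusts attention weights and output representations, significantly improving semi-dense matching performance.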

Details

Motivation: Conventional attention treats all pixels or keypoints equally, which can introduce redundancy and noise. Inspired by keypoint selection, the paper classifies pixels in order to optimize attention weights. Method: Combines a learnable bias term with matchability-informed rescaling of the input features to dynamically adjust attention scores and output representations. Result: The method is validated on three benchmark datasets and outperforms existing state-of-the-art approaches. Conclusion: The proposed mechanism effectively reduces redundancy and noise, improving the accuracy and robustness of semi-dense matching. Abstract: Since the rise of Transformers, many semi-dense matching methods have adopted attention mechanisms to extract feature descriptors. However, the attention weights, which capture dependencies between pixels or keypoints, are often learned from scratch. This approach can introduce redundancy and noisy interactions from irrelevant regions, as it treats all pixels or keypoints equally. Drawing inspiration from keypoint selection processes, we propose to first classify all pixels into two categories: matchable and non-matchable. Matchable pixels are expected to receive higher attention weights, while non-matchable ones are down-weighted. In this work, we propose a novel attention reweighting mechanism that simultaneously incorporates a learnable bias term into the attention logits and applies a matchability-informed rescaling to the input value features. The bias term, injected prior to the softmax operation, selectively adjusts attention scores based on the confidence of query-key interactions. Concurrently, the feature rescaling acts post-attention by modulating the influence of each value vector in the final output. This dual design allows the attention mechanism to dynamically adjust both its internal weighting scheme and the magnitude of its output representations. Extensive experiments conducted on three benchmark datasets validate the effectiveness of our method, consistently outperforming existing state-of-the-art approaches.
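
Editor's sketch: the dual design described above (a bias added to the attention logits before the softmax, plus a rescaling of the value vectors) can be illustrated for a single query as follows. This is a minimal sketch and not the authors' implementation: the per-key `matchability` scores, the use of `log(matchability)` as the bias, and the `bias_scale` parameter are all assumptions made for illustration, since the abstract only says the bias is learnable.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_attention(query, keys, values, matchability, bias_scale=1.0):
    """Single-query attention with a logit bias and value rescaling.

    matchability[j] in (0, 1] is a hypothetical per-key matchability score.
    """
    d = len(query)
    logits = []
    for key, m in zip(keys, matchability):
        dot = sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        # Assumed bias form: low matchability lowers the logit before softmax.
        logits.append(dot + bias_scale * math.log(m))
    weights = softmax(logits)
    # Value rescaling: each value vector is scaled by its matchability.
    scaled = [[m * v for v in value] for value, m in zip(values, matchability)]
    dim = len(values[0])
    output = [sum(w * sv[i] for w, sv in zip(weights, scaled)) for i in range(dim)]
    return weights, output


weights, output = reweighted_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
    matchability=[1.0, 0.1],
)
```

In this toy call the two keys are identical, so any difference between the two attention weights comes from the bias alone, and the second value vector is further damped by the rescaling.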

[71] SparSplat: Fast Multi-View Reconstruction with Generalizable 2D Gaussian Splatting

Shubhendu Jena,Shishir Reddy Vutukur,Adnane Boukhayma

Main category: cs.CV

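TL;DR: The paper proposes a multi-view stereo (MVS) based learning framework that regresses 2D Gaussian surface element parameters to perform sparse-view 3D reconstruction and novel view synthesis (NVS), achieving leading speed and accuracy.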

Details

Motivation: Sparse-view 3D reconstruction and NVS are challenging, and existing methods fall short in real-time performance and accuracy. The paper aims to improve sparse-view performance by combining 2D Gaussian rasterization with an MVS framework. Method: Proposes an MVS learning pipeline that regresses 2D Gaussian surface element parameters for sparse-view 3D reconstruction and NVS, and uses pretrained multi-view deep visual features to boost performance. Result: On the DTU sparse 3D reconstruction benchmark the model achieves the best Chamfer distance and NVS results, generalizes strongly to the BlendedMVS and Tanks and Temples datasets, and runs inference nearly two orders of magnitude faster. Conclusion: The method delivers efficient sparse-view 3D reconstruction and NVS, surpassing prior work in both speed and accuracy and offering a viable option for real-time applications. Abstract: Recovering 3D information from scenes via multi-view stereo reconstruction (MVS) and novel view synthesis (NVS) is inherently challenging, particularly in scenarios involving sparse-view setups. The advent of 3D Gaussian Splatting (3DGS) enabled real-time, photorealistic NVS. Following this, 2D Gaussian Splatting (2DGS) leveraged perspective-accurate 2D Gaussian primitive rasterization to achieve accurate geometry representation during rendering, improving 3D scene reconstruction while maintaining real-time performance. Recent approaches have tackled the problem of sparse real-time NVS using 3DGS within a generalizable, MVS-based learning framework to regress 3D Gaussian parameters. Our work extends this line of research by addressing the challenge of generalizable sparse 3D reconstruction and NVS jointly, and manages to perform successfully at both tasks. We propose an MVS-based learning pipeline that regresses 2DGS surface element parameters in a feed-forward fashion to perform 3D shape reconstruction and NVS from sparse-view images. We further show that our generalizable pipeline can benefit from preexisting foundational multi-view deep visual features. The resulting model attains state-of-the-art results on the DTU sparse 3D reconstruction benchmark in terms of Chamfer distance to ground truth, as well as state-of-the-art NVS. It also demonstrates strong generalization on the BlendedMVS and Tanks and Temples datasets. We note that our model outperforms the prior state-of-the-art in feed-forward sparse view reconstruction based on volume rendering of implicit representations, while offering an almost 2 orders of magnitude higher inference speed.
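
Editor's sketch: the Chamfer distance used as the reconstruction metric above can be illustrated with the generic symmetric definition on two small point sets. This is a minimal sketch under assumptions: it uses brute-force nearest neighbours and averages the two directions, and it does not reproduce any benchmark-specific sampling, masking, or thresholding that an evaluation protocol may add.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets (brute force)."""

    def mean_nearest(src, dst):
        # Average distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a))


reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
score = chamfer_distance(reconstruction, ground_truth)
```

Lower is better, and identical point sets score zero. The brute-force search is quadratic in the number of points, so a spatial index would be needed for real scans.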

[72] Saliency-Guided Training for Fingerprint Presentation Attack Detection

Samuel Webster,Adam Czajka

Main category: cs.CV

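TL;DR: The paper applies saliency-guided training to fingerprint presentation attack detection (PAD) for the first time, validates its effectiveness experimentally with both limited and large-scale data, and achieves the top result on the LivDet-2021 benchmark.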

Details

Motivation: Explore the potential of saliency-guided training for fingerprint PAD, especially for improving model generalization when data is limited. Method: Combines human-annotated fingerprint saliency maps with algorithmically generated pseudo-saliency maps and designs five training scenarios to assess the impact of saliency-guided training on fingerprint PAD. Result: Saliency-guided training significantly improves the accuracy and generalization of fingerprint PAD and ranks first on the LivDet-2021 benchmark. Conclusion: Saliency-guided training performs well for fingerprint PAD, is particularly effective when data is limited, and has the potential to scale to larger datasets. Abstract: Saliency-guided training, which directs model learning to important regions of images, has demonstrated generalization improvements across various biometric presentation attack detection (PAD) tasks. This paper presents its first application to fingerprint PAD. We conducted a 50-participant study to create a dataset of 800 human-annotated fingerprint perceptually-important maps, explored alongside algorithmically-generated "pseudosaliency," including minutiae-based, image quality-based, and autoencoder-based saliency maps. Evaluating on the 2021 Fingerprint Liveness Detection Competition testing set, we explore various configurations within five distinct training scenarios to assess the impact of saliency-guided training on accuracy and generalization. Our findings demonstrate the effectiveness of saliency-guided training for fingerprint PAD in both limited and large data contexts, and we present a configuration capable of earning the first place on the LivDet-2021 benchmark. Our results highlight saliency-guided training's promise for increased model generalization capabilities, its effectiveness when data is limited, and its potential to scale to larger datasets in fingerprint PAD. All collected saliency data and trained models are released with the paper to support reproducible research.

[73] Sparfels: Fast Reconstruction from Sparse Unposed Imagery

Shubhendu Jena,Amine Ouasfi,Mae Younes,Adnane Boukhayma

Main category: cs.CV

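TL;DR: Proposes a sparse-view reconstruction method based on surface element splatting that completes within 3 minutes on a consumer-grade GPU. It uses the task heads of a 3D foundation model (such as point maps and camera initializations), together with a 2D Gaussian Splatting model and camera optimization guided by image correspondences, to reconstruct sparse uncalibrated scenes efficiently.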

Details

Motivation: Sparse-view radiance field learning and shape recovery remain little studied when the sparse cameras are noisy or unposed, and existing methods mostly rely on data priors or external monocular geometry priors. The paper aims for a simpler and more efficient solution. Method: Uses the task heads of a 3D foundation model (point maps, camera initializations) to instantiate a 2D Gaussian Splatting model and optimizes the cameras through image correspondences. Introduces a novel formulation of splatted color variance along rays that improves shape reconstruction accuracy. Result: Achieves state-of-the-art performance on reconstruction and novel view synthesis in the sparse uncalibrated setting. Conclusion: By efficiently exploiting a 3D foundation model and its optimization strategy, the method significantly improves the accuracy and efficiency of sparse-view reconstruction. Abstract: We present a method for sparse-view reconstruction with surface element splatting that runs within 3 minutes on a consumer-grade GPU. While few methods address sparse radiance field learning from noisy or unposed sparse cameras, shape recovery remains relatively underexplored in this setting. Several radiance and shape learning test-time optimization methods address the sparse posed setting by learning data priors or using combinations of external monocular geometry priors. Differently, we propose an efficient and simple pipeline harnessing a single recent 3D foundation model. We leverage its various task heads, notably point maps and camera initializations, to instantiate a bundle-adjusting 2D Gaussian Splatting (2DGS) model, and image correspondences to guide camera optimization during 2DGS training. Key to our contribution is a novel formulation of splatted color variance along rays, which can be computed efficiently. Reducing this moment in training leads to more accurate shape reconstructions. We demonstrate state-of-the-art performance in the sparse uncalibrated setting in reconstruction and novel view benchmarks based on established multi-view datasets.

[74] ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications

Tao Zhu,Qi Yu,Xinru Dong,Shiyu Li,Yue Liu,Jinlong Jiang,Lei Shu

Main category: cs.CV

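TL;DR: ProDisc-VAD addresses label ambiguity in weakly-supervised video anomaly detection with a Prototype Interaction Layer and a Pseudo-Instance Discriminative Enhancement loss, achieving efficient, high-performance detection.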

Details

Motivation: Weakly-supervised video anomaly detection (WS-VAD) based on Multiple Instance Learning (MIL) suffers from label ambiguity, which hinders the learning of discriminative features. Method: Proposes the ProDisc-VAD framework, consisting of a Prototype Interaction Layer (PIL) and a Pseudo-Instance Discriminative Enhancement (PIDE) loss. The former models normality with a small set of learnable prototypes, and the latter improves separability through contrastive learning. Result: Reaches AUCs of 97.98% on ShanghaiTech and 87.12% on UCF-Crime using only 0.4M parameters, far more efficient than ViT-based methods. Conclusion: ProDisc-VAD reaches an advanced level in both efficiency and performance, and the code is open source. Abstract: Weakly-supervised video anomaly detection (WS-VAD) using Multiple Instance Learning (MIL) suffers from label ambiguity, hindering discriminative feature learning. We propose ProDisc-VAD, an efficient framework tackling this via two synergistic components. The Prototype Interaction Layer (PIL) provides controlled normality modeling using a small set of learnable prototypes, establishing a robust baseline without being overwhelmed by dominant normal data. The Pseudo-Instance Discriminative Enhancement (PIDE) loss boosts separability by applying targeted contrastive learning exclusively to the most reliable extreme-scoring instances (highest/lowest scores). ProDisc-VAD achieves strong AUCs (97.98% ShanghaiTech, 87.12% UCF-Crime) using only 0.4M parameters, over 800x fewer than recent ViT-based methods like VadCLIP, demonstrating exceptional efficiency alongside state-of-the-art performance. Code is available at https://github.com/modadundun/ProDisc-VAD.
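
Editor's sketch: the idea of restricting a loss to the most reliable extreme-scoring instances can be illustrated by the selection step alone. This is a minimal sketch under assumptions: the function name, the parameter `k`, and the reading of the highest-scoring snippets as pseudo-anomalous and the lowest-scoring ones as pseudo-normal are hypothetical, and the contrastive term that PIDE applies to the selected instances is not specified in the abstract and is not shown.

```python
def select_extreme_instances(scores, k):
    """Return (indices of the k highest scores, indices of the k lowest scores)."""
    if not 0 < k <= len(scores) // 2:
        raise ValueError("k must be positive and at most half the number of instances")
    # Sort instance indices by ascending anomaly score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    lowest = order[:k]
    highest = order[-k:][::-1]  # most anomalous first
    return highest, lowest


# Hypothetical per-snippet anomaly scores for one video (one MIL bag).
snippet_scores = [0.1, 0.9, 0.5, 0.05, 0.8, 0.4]
pseudo_anomalous, pseudo_normal = select_extreme_instances(snippet_scores, k=2)
```

Capping `k` at half the bag keeps the two groups disjoint, so no snippet is treated as both pseudo-anomalous and pseudo-normal.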

[75] Robust AI-Generated Face Detection with Imbalanced Data

Yamini Sri Krubha,Aryana Hou,Braden Vester,Web Walker,Xin Wang,Li Lin,Shu Hu

Main category: cs.CV

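TL;DR: The paper proposes a framework that combines dynamic loss reweighting with ranking-based optimization to address distribution shift and class imbalance in deepfake detection.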

Details

Motivation: Deepfake technology has evolved from research and entertainment uses into a malicious tool that threatens digital trust. Existing detectors handle distribution shifts from emerging generative models and class imbalance poorly. Method: Proposes a framework of dynamic loss reweighting and ranking-based optimization to improve detector generalization and performance. Result: The framework shows superior generalization and detection performance under imbalanced dataset conditions. Conclusion: The framework offers an effective solution to distribution shift and class imbalance in deepfake detection. Abstract: Deepfakes, created using advanced AI techniques such as Variational Autoencoders and Generative Adversarial Networks, have evolved from research and entertainment applications into tools for malicious activities, posing significant threats to digital trust. Current deepfake detection techniques have evolved from CNN-based methods focused on local artifacts to more advanced approaches using vision transformers and multimodal models like CLIP, which capture global anomalies and improve cross-domain generalization. Despite recent progress, state-of-the-art deepfake detectors still face major challenges in handling distribution shifts from emerging generative models and addressing severe class imbalance between authentic and fake samples in deepfake datasets, which limits their robustness and detection accuracy. To address these challenges, we propose a framework that combines dynamic loss reweighting and ranking-based optimization, which achieves superior generalization and performance under imbalanced dataset conditions. The code is available at https://github.com/Purdue-M2/SP_CUP.

[76] DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization

Wenchuan Wang,Mengqi Huang,Yijing Tu,Zhendong Mao

Main category: cs.CV

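TL;DR: DualReal proposes a joint training framework that resolves conflicts between identity and motion through dynamic selection and regularization strategies, significantly improving the quality of generated videos.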

Details

Motivation: Existing methods customize identity and motion in isolation, ignoring the mutual constraints and synergistic dependencies between the two, which leads to conflicts during generation. Method: DualReal comprises two units: Dual-aware Adaptation dynamically selects the training phase and avoids knowledge leakage, and StageBlender Controller guides the different dimensions according to denoising stage and Transformer depth. Result: Experiments show that DualReal improves the CLIP-I and DINO-I metrics by 21.7% and 31.8% respectively and performs strongly on motion quality metrics. Conclusion: Through joint training and adaptive granularity control, DualReal achieves lossless fusion of identity and motion patterns and significantly outperforms existing methods. Abstract: Customized text-to-video generation with pre-trained large-scale models has recently garnered significant attention through focusing on identity and motion consistency. Existing works typically follow the isolated customized paradigm, where the subject identity or motion dynamics are customized exclusively. However, this paradigm completely ignores the intrinsic mutual constraints and synergistic interdependencies between identity and motion, resulting in identity-motion conflicts throughout the generation process that systematically degrade generation quality. To address this, we introduce DualReal, a novel framework that employs adaptive joint training to collaboratively construct interdependencies between dimensions. Specifically, DualReal is composed of two units: (1) Dual-aware Adaptation dynamically selects a training phase (i.e., identity or motion), learns the current information guided by the frozen dimension prior, and employs a regularization strategy to avoid knowledge leakage; (2) StageBlender Controller leverages the denoising stages and Diffusion Transformer depths to guide different dimensions with adaptive granularity, avoiding conflicts at various stages and ultimately achieving lossless fusion of identity and motion patterns. We constructed a more comprehensive benchmark than existing methods. The experimental results show that DualReal improves CLIP-I and DINO-I metrics by 21.7% and 31.8% on average, and achieves top performance on nearly all motion quality metrics.

[77] Improving Physical Object State Representation in Text-to-Image Generative Systems

Tianle Chen,Chaitanya Chakka,Deepti Ghadiyaram

Main category: cs.CV

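TL;DR: The paper proposes a fully automatic pipeline for generating high-quality synthetic data to improve how text-to-image models represent object states, and fine-tuned models gain more than 8% on average on a public dataset.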

Details

Motivation: Current text-to-image generative models struggle to represent object states accurately (e.g., "a table without a bottle," "an empty tumbler") and need improvement. Method: Designs a fully automatic pipeline to generate synthetic data, fine-tunes several open-source text-to-image models, and uses GPT4o-mini to evaluate how well generated images align with their prompts. Result: An average improvement of more than 8% on the public GenAI-Bench dataset and more than 24% on a self-curated set of 200 prompts. Conclusion: Synthetic data and fine-tuning significantly improve the ability of text-to-image models to represent object states, and the evaluation prompts and code are released. Abstract: Current text-to-image generative models struggle to accurately represent object states (e.g., "a table without a bottle," "an empty tumbler"). In this work, we first design a fully-automatic pipeline to generate high-quality synthetic data that accurately captures objects in varied states. Next, we fine-tune several open-source text-to-image models on this synthetic data. We evaluate the performance of the fine-tuned models by quantifying the alignment of the generated images to their prompts using GPT4o-mini, and achieve an average absolute improvement of 8+% across four models on the public GenAI-Bench dataset. We also curate a collection of 200 prompts with a specific focus on common objects in various physical states. We demonstrate a significant improvement of an average of 24+% over the baseline on this dataset. We release all evaluation prompts and code.

[78] Quantizing Diffusion Models from a Sampling-Aware Perspective

Qian Zeng,Jie Song,Yuanyu Wan,Huiqiong Wang,Mingli Song

Main category: cs.CV

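TL;DR: The paper proposes a sampling-aware quantization strategy that uses a Mixed-Order Trajectory Alignment technique to accelerate diffusion models while preserving generation quality.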

Details

Motivation: Diffusion models excel at visual generation, but their lengthy denoising chains and high computational cost limit their use in low-latency and resource-constrained environments. Method: Proposes a sampling-aware quantization strategy with a Mixed-Order Trajectory Alignment technique that tightly constrains the error bound at each sampling step and encourages a more linear probability flow. Result: In sparse-step fast sampling experiments across multiple datasets, the method preserves the rapid convergence of high-speed samplers while delivering superior generation quality. Conclusion: The method achieves dual acceleration with high fidelity, offering an effective route to deploying diffusion models in resource-constrained environments. Abstract: Diffusion models have recently emerged as the dominant approach in visual generation tasks. However, the lengthy denoising chains and the computationally intensive noise estimation networks hinder their applicability in low-latency and resource-limited environments. Previous research has endeavored to address these limitations in a decoupled manner, utilizing either advanced samplers or efficient model quantization techniques. In this study, we uncover that quantization-induced noise disrupts directional estimation at each sampling step, further distorting the precise directional estimations of higher-order samplers when solving the sampling equations through discretized numerical methods, thereby altering the optimal sampling trajectory. To attain dual acceleration with high fidelity, we propose a sampling-aware quantization strategy, wherein a Mixed-Order Trajectory Alignment technique is devised to impose a more stringent constraint on the error bounds at each sampling step, facilitating a more linear probability flow. Extensive experiments on sparse-step fast sampling across multiple datasets demonstrate that our approach preserves the rapid convergence characteristics of high-speed samplers while maintaining superior generation quality. Code will be made publicly available soon.

[79] Using Knowledge Graphs to harvest datasets for efficient CLIP model training

Simon Ging,Sebastian Walter,Jelena Bratulić,Johannes Dienert,Hannah Bast,Thomas Brox

Main category: cs.CV

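TL;DR: Uses smart web search strategies enhanced with knowledge graphs to train high-quality CLIP models with less data, and introduces the EntityNet dataset.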

Details

Motivation: Address the constraints that very large datasets place on CLIP training, lowering cost and improving domain specificity. Method: Uses smart web search and knowledge graphs to optimize data collection and builds the EntityNet dataset. Result: An expert model is trained successfully with 10M images, and the EntityNet dataset significantly shortens training time for a generic model. Conclusion: Smart data strategies and EntityNet offer a practical route to training CLIP models efficiently. Abstract: Training high-quality CLIP models typically requires enormous datasets, which limits the development of domain-specific models (especially in areas that even the largest CLIP models do not cover well) and drives up training costs. This poses challenges for scientific research that needs fine-grained control over the training procedure of CLIP models. In this work, we show that by employing smart web search strategies enhanced with knowledge graphs, a robust CLIP model can be trained from scratch with considerably less data. Specifically, we demonstrate that an expert foundation model for living organisms can be built using just 10M images. Moreover, we introduce EntityNet, a dataset comprising 33M images paired with 46M text descriptions, which enables the training of a generic CLIP model in significantly reduced time.

[80] Cricket: A Self-Powered Chirping Pixel

Shree K. Nayar,Jeremy Klotz,Nikhil Nanda,Mikhail Fridberg

Main category: cs.CV

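TL;DR: Introduces a battery-free sensor called 'cricket' that is powered by light and transmits its measurements wirelessly, with applications in solar tracking and energy-saving lighting control.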

Details

Motivation: Remove the dependence of conventional sensors on external power or batteries by developing a self-powered sensor that communicates wirelessly. Method: The sensor harvests energy from light, sleeps most of the time, and transmits data as a short, strong radio frequency signal. The carrier frequency is fixed and identifies the sensor, and the interval between signals reflects the light level. Result: The radiometric response, signal-to-noise ratio and dynamic range of the sensor are characterized, and applications in solar tracking and energy-saving lighting are demonstrated. Conclusion: The 'cricket' sensor is self-powered, communicates wirelessly and has broad application potential, particularly in energy conservation and solar energy. Abstract: We present a sensor that can measure light and wirelessly communicate the measurement, without the need for an external power source or a battery. Our sensor, called cricket, harvests energy from incident light. It is asleep for most of the time and transmits a short and strong radio frequency chirp when its harvested energy reaches a specific level. The carrier frequency of each cricket is fixed and reveals its identity, and the duration between consecutive chirps is a measure of the incident light level. We have characterized the radiometric response function, signal-to-noise ratio and dynamic range of cricket. We have experimentally verified that cricket can be miniaturized at the expense of increasing the duration between chirps. We show that a cube with a cricket on each of its sides can be used to estimate the centroid of any complex illumination, which has value in applications such as solar tracking. We also demonstrate the use of crickets for creating untethered sensor arrays that can produce video and control lighting for energy conservation. Finally, we modified cricket's circuit to develop battery-free electronic sunglasses that can instantly adapt to environmental illumination.
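
Editor's sketch: the statement that the duration between consecutive chirps measures the incident light level can be illustrated with an idealized model. This is a minimal sketch under a loud assumption: it supposes that harvested power is proportional to the light level and that a chirp fires at a fixed energy threshold, so that light is proportional to the reciprocal of the chirp interval. The radiometric response characterized in the paper may differ, and the function name is hypothetical.

```python
def relative_light_levels(chirp_times):
    """Convert chirp timestamps (seconds) into relative light levels, one per interval.

    Assumed model: energy threshold E is reached after T = E / (k * light),
    so light is proportional to 1 / T (up to the unknown constant E / k).
    """
    intervals = [t1 - t0 for t0, t1 in zip(chirp_times, chirp_times[1:])]
    if any(dt <= 0 for dt in intervals):
        raise ValueError("chirp times must be strictly increasing")
    return [1.0 / dt for dt in intervals]


# Chirps arriving faster and faster indicate increasing illumination.
levels = relative_light_levels([0.0, 2.0, 3.0, 3.5])
```

Under this model a receiver needs only the arrival times of chirps on a given carrier frequency to recover a relative brightness trace for that sensor.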

[81] Enhancing AI Face Realism: Cost-Efficient Quality Improvement in Distilled Diffusion Models with a Fully Synthetic Dataset

Jakub Wąsala,Bartłomiej Wrzalski,Kornelia Noculak,Yuliia Tarasenko,Oliwer Krupa,Jan Kocoń,Grzegorz Chodak

Main category: cs.CV

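TL;DR: Proposes a new approach that trains an image-to-image translation model to lift the output of a low-cost distilled model to a quality comparable with a high-cost baseline model, significantly reducing computational cost.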

Details

Motivation: The study aims to improve the cost-to-quality ratio of image generation with diffusion models by exploring the learnable differences between the outputs of low-cost and high-quality models. Method: Generates a synthetic paired dataset and trains a fast image-to-image translation model that refines the output of a low-cost distilled model (e.g., FLUX.1-schnell) to a level comparable with a high-cost baseline model (e.g., FLUX.1-dev). Result: Experiments show that the pipeline combining the distilled model with the enhancement layer produces photorealistic portraits similar to the baseline with up to an 82% reduction in computational cost. Conclusion: The study demonstrates the potential for improving the efficiency of AI solutions in large-scale image generation. Abstract: This study presents a novel approach to enhance the cost-to-quality ratio of image generation with diffusion models. We hypothesize that differences between distilled (e.g. FLUX.1-schnell) and baseline (e.g. FLUX.1-dev) models are consistent and, therefore, learnable within a specialized domain, like portrait generation. We generate a synthetic paired dataset and train a fast image-to-image translation head. Using two sets of low- and high-quality synthetic images, our model is trained to refine the output of a distilled generator (e.g., FLUX.1-schnell) to a level comparable to a baseline model like FLUX.1-dev, which is more computationally intensive. Our results show that the pipeline, which combines a distilled version of a large generative model with our enhancement layer, delivers similar photorealistic portraits to the baseline version with up to an 82% decrease in computational cost compared to FLUX.1-dev. This study demonstrates the potential for improving the efficiency of AI solutions involving large-scale image generation.

[82] Compositional Image-Text Matching and Retrieval by Grounding Entities

Madhukar Reddy Vongala,Saurabh Srivastava,Jana Košecká

Main category: cs.CV

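TL;DR: Proposes a learning-free, zero-shot augmentation of CLIP embeddings that improves entity grounding and compositional matching, significantly boosting image-text matching and retrieval performance.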

Details

Motivation: Existing vision-language pretrained models such as CLIP are weak at entity grounding and compositional matching, so their embeddings need improvement. Method: Uses open-vocabulary detectors to localize sub-images of object entities and relations, embeds them, dynamically adjusts the global image embedding, and uses the weighted combination for similarity computation. Result: Image-text matching accuracy improves by 1.5% on average on the Visual Genome and SVO Probes datasets, and Recall@1 improves by 12% and 0.4% on the Flickr30K and MS-COCO retrieval tasks respectively. Conclusion: The proposed method significantly improves CLIP on compositional matching and retrieval tasks, and the code is open source. Abstract: Vision-language pretraining on large datasets of image-text pairs is one of the main building blocks of current Vision-Language Models. With additional training, these models excel in various downstream tasks, including visual question answering, image captioning, and visual commonsense reasoning. However, a notable weakness of pretrained models like CLIP is their inability to perform entity grounding and compositional image and text matching. In this work we propose a novel learning-free zero-shot augmentation of CLIP embeddings that has favorable compositional properties. We compute separate embeddings of sub-images of object entities and relations that are localized by state-of-the-art open-vocabulary detectors and dynamically adjust the baseline global image embedding. The final embedding is obtained by computing a weighted combination of the sub-image embeddings. The resulting embedding is then utilized for similarity computation with the text embedding, resulting in an average 1.5% improvement in image-text matching accuracy on the Visual Genome and SVO Probes datasets. Notably, the enhanced embeddings demonstrate superior retrieval performance, thus achieving significant gains on the Flickr30K and MS-COCO retrieval benchmarks, improving the state-of-the-art Recall@1 by 12% and 0.4%, respectively. Our code is available at https://github.com/madhukarreddyvongala/GroundingCLIP.
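
Editor's sketch: the weighted combination of sub-image embeddings with the global image embedding, followed by a similarity computation against a text embedding, can be illustrated with toy vectors. This is a minimal sketch under assumptions: the mixing parameter `alpha`, the per-region `weights`, and the two-dimensional embeddings are hypothetical, and the abstract does not state how the authors set the weights or adjust the global embedding.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def compose_embedding(global_emb, sub_embs, weights, alpha=0.5):
    """Blend a global image embedding with a weighted mean of sub-image embeddings."""
    total = sum(weights)
    dim = len(global_emb)
    local = [sum(w * e[i] for w, e in zip(weights, sub_embs)) / total for i in range(dim)]
    # alpha is an assumed mixing weight between global and localized evidence.
    mixed = [alpha * g + (1 - alpha) * l
             for g, l in zip(normalize(global_emb), normalize(local))]
    return normalize(mixed)


def cosine_similarity(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


text_emb = [0.0, 1.0]        # toy text embedding
global_emb = [1.0, 0.0]      # toy whole-image embedding, unrelated to the text
sub_embs = [[0.0, 1.0]]      # toy embedding of one detected entity crop
before = cosine_similarity(global_emb, text_emb)
after = cosine_similarity(compose_embedding(global_emb, sub_embs, [1.0]), text_emb)
```

In this toy case the global embedding is orthogonal to the text, and mixing in the entity crop raises the similarity, which is the intuition behind grounding entities before matching.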

[83] AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation

Qingqiu Li,Zihang Cui,Seongsu Bae,Jilan Xu,Runtian Yuan,Yuejie Zhang,Rui Feng,Quanli Shen,Xiaobo Zhang,Junjun He,Shujun Wang

Main category: cs.CV

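TL;DR: The paper proposes an Anatomical Ontology-Guided Reasoning (AOR) framework that strengthens the region-level understanding and multi-step reasoning of medical large multimodal models (MLMMs), improving their interactivity and explainability.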

Details

Motivation: Current medical large multimodal models (MLMMs) have insufficient region-level understanding and rely on single-step reasoning, which hurts diagnostic accuracy and interpretability. Method: Proposes the Anatomical Ontology-Guided Reasoning (AOR) framework and, with expert guidance, develops the AOR-Instruction dataset for MLMM training. Result: Experiments show that AOR performs strongly on visual question answering (VQA) and report generation tasks. Conclusion: The AOR framework significantly improves the region-level understanding and multi-step reasoning of MLMMs, making the models more practical and interpretable. Abstract: Chest X-rays (CXRs) are the most frequently performed imaging examinations in clinical settings. Recent advancements in Large Multimodal Models (LMMs) have enabled automated CXR interpretation, enhancing diagnostic accuracy and efficiency. However, despite their strong visual understanding, current Medical LMMs (MLMMs) still face two major challenges: (1) insufficient region-level understanding and interaction, and (2) limited accuracy and interpretability due to single-step reasoning. In this paper, we empower MLMMs with anatomy-centric reasoning capabilities to enhance their interactivity and explainability. Specifically, we first propose an Anatomical Ontology-Guided Reasoning (AOR) framework, which centers on cross-modal region-level information to facilitate multi-step reasoning. Next, under the guidance of expert physicians, we develop AOR-Instruction, a large instruction dataset for MLMMs training. Our experiments demonstrate AOR's superior performance in both VQA and report generation tasks.

[84] Continuous Normalizing Flows for Uncertainty-Aware Human Pose Estimation

Shipeng Liu,Ziliang Xiong,Bastian Wandt,Per-Erik Forssén

Main category: cs.CV

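TL;DR: Proposes a new method called CFRE that combines Continuous Normalizing Flows (CNFs) with regression models to adapt the distribution dynamically, improving the accuracy and uncertainty quantification of human pose estimation while retaining computational efficiency.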

Details

Motivation: Current human pose estimation methods struggle to balance accuracy, computational efficiency and uncertainty quantification. Traditional regression methods assume fixed distributions, which can lead to poor uncertainty quantification, while heatmap-based methods are effective but resource-intensive. Method: Proposes Continuous Flow Residual Estimation (CFRE), which integrates Continuous Normalizing Flows (CNFs) into regression models to enable dynamic distribution adaptation. Result: Experiments show that CFRE achieves better accuracy and uncertainty quantification on 2D and 3D human pose estimation tasks while retaining computational efficiency. Conclusion: Through dynamic distribution adaptation, CFRE addresses the limitations of traditional methods and offers a better solution for human pose estimation. Abstract: Human Pose Estimation (HPE) is increasingly important for applications like virtual reality and motion analysis, yet current methods struggle with balancing accuracy, computational efficiency, and reliable uncertainty quantification (UQ). Traditional regression-based methods assume fixed distributions, which might lead to poor UQ. Heatmap-based methods effectively model the output distribution using likelihood heatmaps; however, they demand significant resources. To address this, we propose Continuous Flow Residual Estimation (CFRE), an integration of Continuous Normalizing Flows (CNFs) into regression-based models, which allows for dynamic distribution adaptation. Through extensive experiments, we show that CFRE leads to better accuracy and uncertainty quantification with retained computational efficiency on both 2D and 3D human pose estimation tasks.

[85] R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

Yi-Fan Zhang,Xingyu Lu,Xiao Hu,Chaoyou Fu,Bin Wen,Tianke Zhang,Changyi Liu,Kaiyu Jiang,Kaibing Chen,Kaiyu Tang,Haojie Ding,Jiankang Chen,Fan Yang,Zhang Zhang,Tingting Gao,Liang Wang

Main category: cs.CV

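TL;DR: The paper proposes a reinforcement learning algorithm called StableReinforce that improves the training stability and performance of multimodal reward models (MRMs), with significant gains on multimodal reward modeling benchmarks.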

Details

Motivation: Although multimodal reward models (MRMs) are crucial for multimodal large language models (MLLMs), existing work has explored their long-term reasoning capabilities only to a limited extent. The paper aims to improve reward modeling through reinforcement learning (RL). Method: Reformulates reward modeling as a rule-based RL task and proposes the StableReinforce algorithm, which refines the training loss, advantage estimation strategy and reward design to address the instability of existing RL algorithms. Result: The R1-Reward model, trained on 200K collected preference samples, improves by 8.4% on the VL Reward-Bench and 14.3% on the Multimodal Reward Bench. Conclusion: StableReinforce significantly improves MRM performance and demonstrates the potential of reinforcement learning for optimizing multimodal reward models. Abstract: Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.

[86] TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment

Zhichuan Wang,Yang Zhou,Jinhai Xiang,Yulong Wang,Xinwei He

Main category: cs.CV

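TL;DR: The paper proposes a new framework called TeDA that uses test-time distribution alignment to adapt the pretrained 2D vision-language model CLIP to retrieval of unknown 3D objects, significantly improving performance.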

Details

Motivation: Existing methods generalize poorly to unknown categories because 3D training data is scarce, while pretrained vision-language models such as CLIP are limited by the gap between 2D and 3D distributions. Method: TeDA projects 3D objects into multi-view images, extracts features with CLIP, and enhances the 3D representation through an iterative optimization strategy and textual descriptions. Result: On four open-set 3D object retrieval benchmarks, TeDA significantly outperforms existing methods, including those that require extensive training. Conclusion: TeDA is the first work to study test-time adaptation of a vision-language model for 3D feature learning, and it demonstrates efficiency and strong generalization. Abstract: Learning discriminative 3D representations that generalize well to unknown testing categories is an emerging requirement for many real-world 3D applications. Existing well-established methods often struggle to attain this goal due to insufficient 3D training data from broader concepts. Meanwhile, pre-trained large vision-language models (e.g., CLIP) have shown remarkable zero-shot generalization capabilities. Yet, they are limited in extracting suitable 3D representations due to substantial gaps between their 2D training and 3D testing distributions. To address these challenges, we propose Testing-time Distribution Alignment (TeDA), a novel framework that adapts a pretrained 2D vision-language model, CLIP, for unknown 3D object retrieval at test time. To our knowledge, it is the first work that studies the test-time adaptation of a vision-language model for 3D feature learning. TeDA projects 3D objects into multi-view images, extracts features using CLIP, and refines 3D query embeddings with an iterative optimization strategy using confident query-target sample pairs in a self-boosting manner. Additionally, TeDA integrates textual descriptions generated by a multimodal language model (InternVL) to enhance 3D object understanding, leveraging CLIP's aligned feature space to fuse visual and textual cues. Extensive experiments on four open-set 3D object retrieval benchmarks demonstrate that TeDA greatly outperforms state-of-the-art methods, even those requiring extensive training. We also experimented with depth maps on Objaverse-LVIS, further validating its effectiveness. Code is available at https://github.com/wangzhichuan123/TeDA.

[87] VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection

Hao Cheng,Zhiwei Zhao,Yichao He,Zhenzhen Hu,Jia Li,Meng Wang,Richang Hong

Main category: cs.CV

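TL;DR: VAEmo is a two-stage framework that uses external knowledge injection and unified cross-modal encoding to address the modality gap and fine-grained emotional semantic modeling in audiovisual emotion recognition.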

Details

Motivation: Audiovisual emotion recognition (AVER) faces ambiguous emotional expressions, cross-modal disparities and scarce annotated data, and existing methods rely on modality-specific encoders and coarse alignment, which limits fine-grained emotion modeling. Method: VAEmo has two stages: (1) a unified lightweight representation network is pre-trained with masked reconstruction and contrastive objectives; (2) multimodal large language models generate emotion descriptions, and dual-path contrastive learning aligns the text with the visual-audio representations. Result: On multiple AVER benchmarks, VAEmo achieves the best performance with a compact design, validating unified cross-modal encoding and emotion-aware semantic guidance. Conclusion: Through unified encoding and external knowledge injection, VAEmo significantly improves the efficiency and generalization of audiovisual emotion recognition. Abstract: Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and contrastive objectives, mitigating the modality gap and learning expressive, complementary representations without emotion labels. In Stage 2, multimodal large language models automatically generate detailed affective descriptions according to our well-designed chain-of-thought prompting for only a small subset of VA samples; these rich textual semantics are then injected by aligning their corresponding embeddings with VA representations through dual-path contrastive learning, further bridging the emotion gap. Extensive experiments on multiple downstream AVER benchmarks show that VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance for efficient, generalizable VA emotion representations.

[88] 6D Pose Estimation on Spoons and Hands

Kevin Tan,Fan Yang,Yuhao Chen

Main category: cs.CV

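TL;DR: The paper presents a system that tracks hands and spoons via 6D pose estimation to analyze eating behavior, and evaluates the performance of two video object segmentation models.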

Details

Motivation: Accurate dietary monitoring is essential for promoting healthy eating habits, yet existing methods such as self-reporting are not reliable enough. Method: The system analyzes stationary video, uses 6D pose estimation to track the position and orientation of hands and spoons, and evaluates two SOTA video object segmentation models. Result: The system captures eating behavior and identifies the main sources of error. Conclusion: The approach offers a more reliable solution for dietary monitoring and points out directions for future improvement. Abstract: Accurate dietary monitoring is essential for promoting healthier eating habits. A key area of research is how people interact with and consume food using utensils and hands. By tracking their position and orientation, it is possible to estimate the volume of food being consumed or to monitor eating behaviours, yielding highly useful insights into nutritional intake that can be more reliable than popular methods such as self-reporting. Hence, this paper implements a system that analyzes a stationary video feed of people eating, using 6D pose estimation to track hand and spoon movements to capture spatial position and orientation. In doing so, we examine the performance of two state-of-the-art (SOTA) video object segmentation (VOS) models, both quantitatively and qualitatively, and identify the main sources of error within the system.

[89] Quaternion Infrared Visible Image Fusion

Weihua Yang,Yicong Zhou

Main category: cs.CV

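TL;DR: Proposes a quaternion-based infrared-visible image fusion framework (QIVIF) that operates in the quaternion domain and addresses the performance degradation of existing methods on low-quality visible inputs.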

Details

Motivation: Existing methods for fusing infrared and visible images often neglect the color structure information of the visible image and perform poorly on low-quality inputs. Method: The QIVIF framework comprises a quaternion low-visibility feature learning model, a quaternion adaptive unsharp masking method, and a quaternion hierarchical Bayesian fusion model. Result: Experiments show that QIVIF outperforms existing methods under low-visibility conditions. Conclusion: By processing in the quaternion domain, QIVIF significantly improves fused image quality, especially under low-visibility conditions. Abstract: Visible images provide rich details and color information only under well-lit conditions, while infrared images effectively highlight thermal targets under challenging conditions such as low visibility and adverse weather. Infrared-visible image fusion aims to integrate complementary information from infrared and visible images to generate a high-quality fused image. Existing methods exhibit critical limitations such as neglecting color structure information in visible images and performance degradation when processing low-quality color-visible inputs. To address these issues, we propose a quaternion infrared-visible image fusion (QIVIF) framework to generate high-quality fused images completely in the quaternion domain. QIVIF proposes a quaternion low-visibility feature learning model to adaptively extract salient thermal targets and fine-grained texture details from input infrared and visible images, respectively, under diverse degraded conditions. QIVIF then develops a quaternion adaptive unsharp masking method to adaptively improve high-frequency feature enhancement with balanced illumination. QIVIF further proposes a quaternion hierarchical Bayesian fusion model to integrate infrared saliency and enhanced visible details to obtain high-quality fused images. Extensive experiments across diverse datasets demonstrate that our QIVIF surpasses state-of-the-art methods under challenging low-visibility conditions.
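For readers unfamiliar with unsharp masking, the following is a minimal sketch of the classic real-valued, fixed-gain version on a 1D signal. It is not the quaternion adaptive method of the paper; the function names, the box blur, and the `amount` and `radius` parameters are illustrative assumptions.

```python
def box_blur(signal, radius=1):
    """Average each sample with its neighbours (window shrinks at the borders)."""
    n = len(signal)
    blurred = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        blurred.append(sum(window) / len(window))
    return blurred


def unsharp_mask(signal, amount=1.0, radius=1):
    """Classic unsharp masking: add back the high-frequency residual.

    sharpened = x + amount * (x - blur(x))
    The fixed `amount` is an assumption; an adaptive method would vary it.
    """
    blurred = box_blur(signal, radius)
    return [x + amount * (x - b) for x, b in zip(signal, blurred)]


# A step edge: the output overshoots on both sides of the step,
# which is what makes the edge look sharper.
edge = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
sharpened = unsharp_mask(edge)
```

Flat regions are left unchanged because the residual `x - blur(x)` is zero there; only samples near the edge are modified.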

[90] Quaternion Multi-focus Color Image Fusion

Weihua Yang,Yicong Zhou

Main category: cs.CV

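TL;DR: This paper proposes a quaternion multi-focus color image fusion framework that combines a quaternion sparse decomposition model, a base-detail fusion strategy, and a structural similarity refinement strategy to significantly improve fusion quality in complex scenes.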

Details

Motivation: Existing methods struggle to handle color information and intricate textures in complex real-world scenes, which leads to poor fusion results. Method: 1) A quaternion sparse decomposition model iteratively learns image detail and structure information; 2) a base-detail fusion strategy fuses base-scale and detail-scale results separately; 3) a structural similarity refinement strategy adaptively selects optimal patches. Result: Experiments show that the framework outperforms existing state-of-the-art methods. Conclusion: The framework offers an effective solution for high-quality color image fusion in complex scenes. Abstract: Multi-focus color image fusion refers to integrating multiple partially focused color images to create a single all-in-focus color image. However, existing methods struggle with complex real-world scenarios due to limitations in handling color information and intricate textures. To address these challenges, this paper proposes a quaternion multi-focus color image fusion framework to perform high-quality color image fusion completely in the quaternion domain. This framework introduces 1) a quaternion sparse decomposition model to jointly learn fine-scale image details and structure information of color images in an iterative fashion for high-precision focus detection, 2) a quaternion base-detail fusion strategy to individually fuse base-scale and detail-scale results across multiple color images for preserving structure and detail information, and 3) a quaternion structural similarity refinement strategy to adaptively select optimal patches from initial fusion results and obtain the final fused result for preserving fine details and ensuring spatially consistent outputs. Extensive experiments demonstrate that the proposed framework outperforms state-of-the-art methods.

[91] SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

Ming Li,Xin Gu,Fan Chen,Xiaoying Xing,Longyin Wen,Chen Chen,Sijie Zhu

Main category: cs.CV

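TL;DR: This paper proposes a new approach that addresses the mismatch between editing instructions and image pairs by constructing more effective editing instructions, including rectified and contrastive instructions, and significantly improves performance.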

Details

Motivation: Existing datasets are built with automated methods, so editing instructions often do not match the image pairs, producing noisy supervision signals; existing methods do not resolve this fundamental issue. Method: By analyzing the generation attributes of editing models at different inference steps, the authors define a unified guide for rectifying instructions, and they introduce contrastive supervision signals (positive and negative instructions) into training via a triplet loss. Result: The method significantly outperforms existing approaches on multiple benchmarks, for example a 9.19% improvement over the SOTA SmartEdit on the Real-Edit benchmark with 30x less training data and a 13x smaller model. Conclusion: The method does not depend on VLM modules or pre-training tasks, provides a more direct and efficient way to obtain supervision signals, and offers a novel, simple, and effective solution for instruction-based image editing. Abstract: Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs. Recent efforts attempt to improve editing models through generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs) but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. This includes rectifying the editing instructions to better align with the original-edited image pairs and using contrastive editing instructions to further enhance their effectiveness. Specifically, we find that editing models exhibit specific generation attributes at different inference steps, independent of the text. Based on these prior attributes, we define a unified guide for VLMs to rectify editing instructions. However, there are some challenging editing scenarios that cannot be resolved solely with rectified instructions. To this end, we further construct contrastive supervision signals with positive and negative instructions and introduce them into the model training using triplet loss, thereby further facilitating supervision effectiveness. Our method does not require the VLM modules or pre-training tasks used in previous work, offering a more direct and efficient way to provide better supervision signals, and providing a novel, simple, and effective solution for instruction-based image editing. 
Results on multiple benchmarks demonstrate that our method significantly outperforms existing approaches. Compared with previous SOTA SmartEdit, we achieve 9.19% improvements on the Real-Edit benchmark with 30x less training data and 13x smaller model size.
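The triplet loss mentioned in the abstract can be illustrated in its generic margin-based form. This is a sketch only: how SuperEdit builds the anchor, positive, and negative terms from rectified and contrastive instructions is not specified here, so the vectors, the Euclidean distance, and the `margin` value below are assumptions.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss.

    Zero once the anchor is closer to the positive than to the
    negative by at least `margin`; positive otherwise.
    """
    d_pos = l2_distance(anchor, positive)
    d_neg = l2_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical embeddings: the anchor could stand for the target of the
# edit, the positive for the output under the correct instruction, and
# the negative for the output under a contrastive (wrong) instruction.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
easy_negative = [3.0, 4.0]   # far away: already satisfies the margin
hard_negative = [0.0, 1.5]   # nearly as close as the positive

loss_easy = triplet_loss(anchor, positive, easy_negative)
loss_hard = triplet_loss(anchor, positive, hard_negative)
```

Only the hard negative contributes a non-zero loss, which is why contrastive instructions that differ subtly from the correct one are the informative ones.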

[92] MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans

Huangyue Yu,Baoxiong Jia,Yixin Chen,Yandan Yang,Puhao Li,Rongpeng Su,Jiaxin Li,Qing Li,Wei Liang,Song-Chun Zhu,Tengyu Liu,Siyuan Huang

Main category: cs.CV

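TL;DR: MetaScenes and Scan2Sim provide a high-quality, scalable 3D scene dataset and an automated asset replacement method for EAI research, removing the reliance of existing datasets on manual design.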

Details

Motivation: Existing 3D scene datasets rely on artist-driven design, which is labor-intensive and hard to scale, limiting progress in EAI research. Method: The authors present the MetaScenes dataset (built from real-world scans) and the Scan2Sim model (automated asset replacement), and design two benchmark tasks to validate them. Result: MetaScenes supports more generalizable agent learning and sim-to-real applications, and Scan2Sim achieves high-quality asset replacement. Conclusion: MetaScenes and Scan2Sim open new possibilities for EAI research and advance the field. Abstract: Embodied AI (EAI) research requires high-quality, diverse 3D scenes to effectively support skill acquisition, sim-to-real transfer, and generalization. Achieving these quality standards, however, necessitates the precise replication of real-world object diversity. Existing datasets demonstrate that this process heavily relies on artist-driven designs, which demand substantial human effort and present significant scalability challenges. To scalably produce realistic and interactive 3D scenes, we first present MetaScenes, a large-scale, simulatable 3D scene dataset constructed from real-world scans, which includes 15366 objects spanning 831 fine-grained categories. Then, we introduce Scan2Sim, a robust multi-modal alignment model, which enables the automated, high-quality replacement of assets, thereby eliminating the reliance on artist-driven designs for scaling 3D scenes. We further propose two benchmarks to evaluate MetaScenes: a detailed scene synthesis task focused on small item layouts for robotic manipulation and a domain transfer task in vision-and-language navigation (VLN) to validate cross-domain transfer. Results confirm MetaScenes' potential to enhance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research. Project website: https://meta-scenes.github.io/.

[93] Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection

Sungheon Jeong,Jihong Park,Mohsen Imani

Main category: cs.CV

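TL;DR: IEF-VAD is a video anomaly detection framework that synthesizes event representations and fuses them with RGB features; it needs no dedicated event sensor or frame-level labels and achieves state-of-the-art results on multiple benchmarks.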

Details

Motivation: Existing video anomaly detectors rely only on RGB frames, which lack the temporal resolution to capture transient motion cues, even though such cues are key indicators of anomalous events. Method: IEF-VAD models sensor noise with a Student's-t likelihood, derives weights via a Laplace approximation, applies Kalman-style frame-wise updates to balance the modalities, and iteratively refines the fused latent state to remove cross-modal noise. Result: IEF-VAD achieves state-of-the-art performance on multiple real-world anomaly detection benchmarks. Conclusion: Synthetic event representations effectively highlight motion cues that are often overlooked in RGB frames, enabling accurate and robust video understanding without dedicated event sensors. Abstract: Most existing video anomaly detectors rely solely on RGB frames, which lack the temporal resolution needed to capture abrupt or transient motion cues, key indicators of anomalous events. To address this limitation, we propose Image-Event Fusion for Video Anomaly Detection (IEF-VAD), a framework that synthesizes event representations directly from RGB videos and fuses them with image features through a principled, uncertainty-aware process. The system (i) models heavy-tailed sensor noise with a Student's-t likelihood, deriving value-level inverse-variance weights via a Laplace approximation; (ii) applies Kalman-style frame-wise updates to balance modalities over time; and (iii) iteratively refines the fused latent state to erase residual cross-modal noise. Without any dedicated event sensor or frame-level labels, IEF-VAD sets a new state of the art across multiple real-world anomaly detection benchmarks. These findings highlight the utility of synthetic event representations in emphasizing motion cues that are often underrepresented in RGB frames, enabling accurate and robust video understanding across diverse applications without requiring dedicated event sensors. Code and models are available at https://github.com/EavnJeong/IEF-VAD.
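The idea of inverse-variance weighting and its Kalman-style form can be sketched for two scalar estimates. This is a simplified Gaussian illustration, not the paper's Student's-t likelihood with Laplace approximation; the function names and the example numbers are assumptions.

```python
def inverse_variance_fuse(mu_a, var_a, mu_b, var_b):
    """Fuse two noisy estimates, weighting each by 1 / variance."""
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    fused_var = 1.0 / (w_a + w_b)
    fused_mu = fused_var * (w_a * mu_a + w_b * mu_b)
    return fused_mu, fused_var


def kalman_style_update(mu_prior, var_prior, mu_obs, var_obs):
    """Same fusion written as a Kalman update of a prior by an observation."""
    gain = var_prior / (var_prior + var_obs)
    mu_post = mu_prior + gain * (mu_obs - mu_prior)
    var_post = (1.0 - gain) * var_prior
    return mu_post, var_post


# Hypothetical per-value features: the image branch is uncertain
# (variance 1.0), the event branch is more confident (variance 0.25).
fused_mu, fused_var = inverse_variance_fuse(1.0, 1.0, 3.0, 0.25)
kalman_mu, kalman_var = kalman_style_update(1.0, 1.0, 3.0, 0.25)
```

Both functions give the same result; the fused value is pulled toward the lower-variance modality, and the fused variance is smaller than either input variance.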

[94] Token Coordinated Prompt Attention is Needed for Visual Prompting

Zichen Liu,Xu Zou,Gang Hua,Jiahuan Zhou

Main category: cs.CV

TL;DR: The paper proposes a module called TCPA that assigns specific prompts to different tokens to improve the representational capacity of ViT.

Details

Motivation: Existing visual prompting methods use the same prompts for all tokens and overlook the distinct roles of different tokens, which limits the representational capacity of ViT. Method: The proposed TCPA module splits prompts into CLS prompts and image prompts and uses a matching function to assign coordinated prompts to different tokens. Result: Experiments show that TCPA significantly improves the diversity and discriminative power of the features. Conclusion: TCPA is an effective plug-in module that strengthens the feature extraction ability of ViT. Abstract: Visual prompting techniques are widely used to efficiently fine-tune pretrained Vision Transformers (ViT) by learning a small set of shared prompts for all tokens. However, existing methods overlook the unique roles of different tokens in conveying discriminative information and interact with all tokens using the same prompts, thereby limiting the representational capacity of ViT. This often leads to indistinguishable and biased prompt-extracted features, hindering performance. To address this issue, we propose a plug-and-play Token Coordinated Prompt Attention (TCPA) module, which assigns specific coordinated prompts to different tokens for attention-based interactions. Firstly, recognizing the distinct functions of CLS and image tokens (global information aggregation and local feature extraction, respectively), we disentangle the prompts into CLS Prompts and Image Prompts, which interact exclusively with CLS tokens and image tokens through attention mechanisms. This enhances their respective discriminative abilities. Furthermore, as different image tokens correspond to distinct image patches and contain diverse information, we employ a matching function to automatically assign coordinated prompts to individual tokens. This enables more precise attention interactions, improving the diversity and representational capacity of the extracted features. Extensive experiments across various benchmarks demonstrate that TCPA significantly enhances the diversity and discriminative power of the extracted features. The code is available at https://github.com/zhoujiahuan1991/ICML2025-TCPA.

[95] Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey

Chaohua Li,Enhao Zhang,Chuanxing Geng,Songcan Chen

Main category: cs.CV

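TL;DR: The survey proposes a new categorization framework, based on both image and text modalities, for OOD detection with vision-language models such as CLIP, and discusses future research directions.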

Details

Motivation: Existing categorizations of OOD detection methods still follow a unimodal (image) paradigm and do not make full use of the multimodal nature of vision-language models such as CLIP. Method: The authors propose a categorization framework based on image and text modalities that divides methods into four groups (whether OOD images are seen, and whether OOD texts are known) and discusses two training strategies. Result: The new framework fits CLIP's multimodal nature better and offers a more complete view of OOD detection methods. Conclusion: Future research should focus on cross-domain integration, practical applications, and theoretical understanding to advance CLIP-like OOD detection. Abstract: Out-of-distribution (OOD) detection is a pivotal task for real-world applications that trains models to identify samples that are distributionally different from the in-distribution (ID) data during testing. Recent advances in AI, particularly Vision-Language Models (VLMs) like CLIP, have revolutionized OOD detection by shifting from traditional unimodal image detectors to multimodal image-text detectors. This shift has inspired extensive research; however, existing categorization schemes (e.g., few- or zero-shot types) still rely solely on the availability of ID images, adhering to a unimodal paradigm. To better align with CLIP's cross-modal nature, we propose a new categorization framework rooted in both image and text modalities. Specifically, we categorize existing methods based on how visual and textual information of OOD data is utilized within image + text modalities, and further divide them into four groups: OOD Images (i.e., outliers) Seen or Unseen, and OOD Texts (i.e., learnable vectors or class names) Known or Unknown, across two training strategies (i.e., train-free or training-required). More importantly, we discuss open problems in CLIP-like OOD detection and highlight promising directions for future research, including cross-domain integration, practical applications, and theoretical understanding.

[96] Timing Is Everything: Finding the Optimal Fusion Points in Multimodal Medical Imaging

Valerio Guarrasi,Klara Mogensen,Sara Tassinari,Sara Qvarlander,Paolo Soda

Main category: cs.CV

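TL;DR: Proposes a sequential forward search algorithm that efficiently determines where to insert fusion modules in a multimodal network, significantly improving diagnostic accuracy while reducing computational overhead.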

Details

Motivation: In current multimodal deep learning for medical imaging, the fusion timing is chosen by manual tuning or exhaustive search, which is computationally expensive and not guaranteed to be optimal. Method: A sequential forward search algorithm incrementally activates and evaluates fusion modules at different layers and compares validation loss to determine the best configuration. Result: Validated on two multimodal MRI datasets, the algorithm outperforms unimodal baselines, late fusion, and exhaustive fusion, with significantly lower computational overhead. Conclusion: The method provides an efficient optimization framework for multimodal fusion in medical imaging and may improve clinical decision-making and the scalability of AI architectures. Abstract: Multimodal deep learning harnesses diverse imaging modalities, such as MRI sequences, to enhance diagnostic accuracy in medical imaging. A key challenge is determining the optimal timing for integrating these modalities, specifically, identifying the network layers where fusion modules should be inserted. Current approaches often rely on manual tuning or exhaustive search, which are computationally expensive without any guarantee of converging to optimal results. We propose a sequential forward search algorithm that incrementally activates and evaluates candidate fusion modules at different layers of a multimodal network. At each step, the algorithm retrains from previously learned weights and compares validation loss to identify the best-performing configuration. This process systematically reduces the search space, enabling efficient identification of the optimal fusion timing without exhaustively testing all possible module placements. The approach is validated on two multimodal MRI datasets, each addressing different classification tasks. Our algorithm consistently identified configurations that outperformed unimodal baselines, late fusion, and a brute-force ensemble of all potential fusion placements. These architectures demonstrated superior accuracy, F-score, and specificity while maintaining competitive or improved AUC values. Furthermore, the sequential nature of the search significantly reduced computational overhead, making the optimization process more practical. By systematically determining the optimal timing to fuse imaging modalities, our method advances multimodal deep learning for medical imaging. 
It provides an efficient and robust framework for fusion optimization, paving the way for improved clinical decision-making and more adaptable, scalable architectures in medical AI applications.
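The sequential forward search described in the abstract can be sketched as a greedy loop over candidate fusion layers. This is an illustration under stated assumptions: `evaluate` stands in for "retrain from previously learned weights and return validation loss", the stopping rule (stop when no single addition lowers the loss) is assumed, and `toy_validation_loss` is a made-up scoring function.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def sequential_forward_search(
    candidate_layers: Iterable[int],
    evaluate: Callable[[FrozenSet[int]], float],
) -> Tuple[FrozenSet[int], float]:
    """Greedily activate fusion modules while validation loss improves."""
    remaining = set(candidate_layers)
    selected: FrozenSet[int] = frozenset()
    best_loss = evaluate(selected)  # baseline: no fusion module active
    while remaining:
        # Try adding exactly one more fusion module to the current set.
        trials = {layer: evaluate(selected | {layer}) for layer in remaining}
        layer, loss = min(trials.items(), key=lambda item: item[1])
        if loss >= best_loss:
            break  # no single addition helps any more
        selected, best_loss = selected | {layer}, loss
        remaining.remove(layer)
    return selected, best_loss


def toy_validation_loss(config: FrozenSet[int]) -> float:
    """Hypothetical loss: each layer has a gain, each module a fixed cost."""
    gains = {1: 0.05, 2: 0.20, 3: 0.10, 4: 0.00}
    return 1.0 - sum(gains[layer] for layer in config) + 0.06 * len(config)


best_config, best_loss = sequential_forward_search([1, 2, 3, 4], toy_validation_loss)
```

With four candidate layers this greedy loop evaluates 10 configurations (1 + 4 + 3 + 2) instead of the 16 an exhaustive search over all subsets would need, and the gap grows quickly with the number of layers.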

[97] Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

Biao Gong,Cheng Zou,Dandan Zheng,Hu Yu,Jingdong Chen,Jianxin Sun,Junbo Zhao,Jun Zhou,Kaixiang Ji,Lixiang Ru,Libin Wang,Qingpei Guo,Rui Liu,Weilong Chai,Xinyu Xiao,Ziyuan Huang

Main category: cs.CV

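TL;DR: Ming-Lite-Uni is an open-source multimodal framework that combines a unified visual generator with a native multimodal autoregressive model and supports text-to-image generation and instruction-based image editing.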

Details

Motivation: The work aims to unify vision and language tasks and extend model capabilities beyond pure visual understanding. Method: It builds on the MetaQueries and M2-omni framework, introduces multi-scale learnable tokens and a multi-scale representation alignment strategy, and combines a fixed MLLM with a learnable diffusion model. Result: Experiments show strong performance and a fluid interactive process. Conclusion: The framework contributes to unified models on the path toward AGI; code and model weights are open-sourced. Abstract: We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing the novel multi-scale learnable tokens and multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing tasks, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and illustrate the impressive fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones, such as ChatGPT-4o with native image generation updated on March 25, 2025, underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is in alpha stage and will soon be further refined.

[98] Finger Pose Estimation for Under-screen Fingerprint Sensor

Xiongjun Guan,Zhiyu Pan,Jianjiang Feng,Jie Zhou

Main category: cs.CV

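TL;DR: Proposes a dual-modal input network that addresses large-angle and small-area inputs in fingerprint pose estimation, significantly improving accuracy and stability.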

Details

Motivation: Existing methods perform poorly on large-angle or small-area inputs, a problem that is especially pronounced for fingerprints captured by under-screen sensors in smartphones. Method: The network combines two modalities, texture details and rough contours, uses a probability distribution prediction task, and adopts an MoE feature fusion mechanism and a cross-domain knowledge transfer strategy. Result: Experiments on multiple datasets show that the method significantly outperforms existing SOTA methods and improves the performance of fingerprint recognition algorithms. Conclusion: The proposed method effectively addresses the challenges of fingerprint pose estimation and offers a better solution for fingerprint recognition. Abstract: Two-dimensional pose estimation plays a crucial role in fingerprint recognition by facilitating global alignment and reducing pose-induced variations. However, existing methods are still unsatisfactory when handling large-angle or small-area inputs. These limitations are particularly pronounced on fingerprints captured by under-screen fingerprint sensors in smartphones. In this paper, we present a novel dual-modal input based network for under-screen fingerprint pose estimation. Our approach effectively integrates two distinct yet complementary modalities: texture details extracted from ridge patches through the under-screen fingerprint sensor, and rough contours derived from capacitive images obtained via the touch screen. This collaborative integration endows our network with more comprehensive and discriminative information, substantially improving the accuracy and stability of pose estimation. A decoupled probability distribution prediction task is designed, instead of the traditional supervised forms of numerical regression or heatmap voting, to facilitate the training process. Additionally, we incorporate a Mixture of Experts (MoE) based feature fusion mechanism and a relationship driven cross-domain knowledge transfer strategy to further strengthen feature extraction and fusion capabilities. Extensive experiments are conducted on several public datasets and two private datasets. The results indicate that our method is significantly superior to previous state-of-the-art (SOTA) methods and remarkably boosts the recognition ability of fingerprint recognition algorithms. Our code is available at https://github.com/XiongjunGuan/DRACO.

[99] Corr2Distrib: Making Ambiguous Correspondences an Ally to Predict Reliable 6D Pose Distributions

Asma Brazi,Boris Meden,Fabrice Mayran de Chamisso,Steve Bourgeois,Vincent Lepetit

Main category: cs.CV

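TL;DR: Corr2Distrib is a correspondence-based method and the first of its kind to estimate a 6D camera pose distribution from an RGB image, addressing visual ambiguities.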

Details

Motivation: Symmetries and occlusions cause visual ambiguities that yield multiple valid poses, and existing methods do not make full use of local correspondences. Method: The method learns a symmetry-aware representation of 3D points, generates rotation hypotheses, and refines them into a 6DoF pose distribution using PnP and pose scoring. Result: On complex non-synthetic scenes, Corr2Distrib outperforms existing methods on both pose distribution estimation and single pose estimation. Conclusion: Corr2Distrib demonstrates the potential of correspondence-based approaches and offers a new solution to visual ambiguity. Abstract: We introduce Corr2Distrib, the first correspondence-based method which estimates a 6D camera pose distribution from an RGB image, explaining the observations. Indeed, symmetries and occlusions introduce visual ambiguities, leading to multiple valid poses. While a few recent methods tackle this problem, they do not rely on local correspondences which, according to the BOP Challenge, are currently the most effective way to estimate a single 6DoF pose solution. Using correspondences to estimate a pose distribution is not straightforward, since ambiguous correspondences induced by visual ambiguities drastically decrease the performance of PnP. With Corr2Distrib, we turn these ambiguities into an advantage to recover all valid poses. Corr2Distrib first learns a symmetry-aware representation for each 3D point on the object's surface, characterized by a descriptor and a local frame. This representation enables the generation of 3DoF rotation hypotheses from single 2D-3D correspondences. Next, we refine these hypotheses into a 6DoF pose distribution using PnP and pose scoring. Our experimental evaluations on complex non-synthetic scenes show that Corr2Distrib outperforms state-of-the-art solutions for both pose distribution estimation and single pose estimation from an RGB image, demonstrating the potential of correspondence-based approaches.

[100] Text to Image Generation and Editing: A Survey

Pengfei Yang,Ngai-Man Cheung,Xinda Ma

Main category: cs.CV

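TL;DR: This survey reviews 141 works on text-to-image generation (T2I) published from 2021 to 2024, covering foundation model architectures, key technologies, performance comparisons, and social impact, and proposes future directions.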

Details

Motivation: T2I has developed rapidly in recent years but lacks a systematic survey; this paper aims to fill that gap and guide future research. Method: The survey introduces four foundation model architectures (autoregressive, non-autoregressive, GAN, and diffusion) and key technologies (such as autoencoders and attention), and systematically compares methods and performance metrics for T2I generation and editing. Result: It reviews the performance of many T2I models and discusses their social impact and potential solutions. Conclusion: The paper presents itself as the first systematic survey of T2I, offering insights into improving model performance and future directions as a reference for researchers. Abstract: Text-to-image generation (T2I) refers to the text-guided generation of high-quality images. In the past few years, T2I has attracted widespread attention and numerous works have emerged. In this survey, we comprehensively review 141 works conducted from 2021 to 2024. First, we introduce four foundation model architectures of T2I (autoregression, non-autoregression, GAN and diffusion) and the commonly used key technologies (autoencoder, attention and classifier-free guidance). Secondly, we systematically compare the methods of these studies in two directions, T2I generation and T2I editing, including the encoders and the key technologies they use. In addition, we also compare the performance of these studies side by side in terms of datasets, evaluation metrics, training resources, and inference speed. In addition to the four foundation models, we survey other works on T2I, such as energy-based models and recent Mamba and multimodality. We also investigate the potential social impact of T2I and provide some solutions. Finally, we propose unique insights into improving the performance of T2I models and possible future development directions. In summary, this survey is the first systematic and comprehensive overview of T2I, aiming to provide a valuable guide for future researchers and stimulate continued progress in this field.

[101] Marker-Based Extrinsic Calibration Method for Accurate Multi-Camera 3D Reconstruction

Nahuel Garcia-D'Urso,Bernabe Sanchez-Sos,Jorge Azorin-Lopez,Andres Fuster-Guillo,Antonio Macia-Lillo,Higinio Mora-Mora

Main category: cs.CV

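TL;DR: Proposes an iterative extrinsic calibration method based on a three-dimensional marker that significantly improves the calibration accuracy of multi-camera RGB-D systems.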

Details

Motivation: Precise extrinsic calibration of multi-camera RGB-D systems is critical for 3D reconstruction, but existing methods may perform poorly in complex environments. Method: Marker planes are systematically segmented and refined through clustering, regression analysis, and iterative reassignment, ensuring geometric correspondence across camera views. Result: Validated within the Tech4Diet project, the experiments show substantially reduced alignment errors and more accurate, reliable 3D reconstructions. Conclusion: The method performs well in complex environments and provides a high-precision extrinsic calibration solution for 3D reconstruction. Abstract: Accurate 3D reconstruction using multi-camera RGB-D systems critically depends on precise extrinsic calibration to achieve proper alignment between captured views. In this paper, we introduce an iterative extrinsic calibration method that leverages the geometric constraints provided by a three-dimensional marker to significantly improve calibration accuracy. Our proposed approach systematically segments and refines marker planes through clustering, regression analysis, and iterative reassignment techniques, ensuring robust geometric correspondence across camera views. We validate our method comprehensively in both controlled environments and practical real-world settings within the Tech4Diet project, aimed at modeling the physical progression of patients undergoing nutritional treatments. Experimental results demonstrate substantial reductions in alignment errors, facilitating accurate and reliable 3D reconstructions.

[102] Robust Duality Learning for Unsupervised Visible-Infrared Person Re-Identification

Yongxiang Li,Yuan Sun,Yang Qin,Dezhong Peng,Xi Peng,Peng Hu

Main category: cs.CV

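TL;DR: The paper proposes RoDE, a new learning paradigm that addresses pseudo-label noise in unsupervised visible-infrared person re-identification by dynamically emphasizing clean samples, alternately training two models, and matching clusters for consistency.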

Details

Motivation: Unsupervised visible-infrared person re-identification (UVI-ReID) is challenged by the modality gap and the lack of supervision; existing methods assume pseudo-labels are always correct, but in practice they are noisy, which hurts model performance. Method: The RoDE framework includes: 1) Robust Adaptive Learning (RAL), which dynamically adjusts sample weights; 2) alternating training of two models to avoid error accumulation; 3) Cluster Consistency Matching (CCM), which aligns clusters across models and modalities. Result: Experiments on three benchmarks confirm the effectiveness of RoDE. Conclusion: By handling three key challenges of pseudo-label noise, RoDE significantly improves unsupervised visible-infrared person re-identification. Abstract: Unsupervised visible-infrared person re-identification (UVI-ReID) aims to retrieve pedestrian images across different modalities without costly annotations, but faces challenges due to the modality gap and lack of supervision. Existing methods often adopt self-training with clustering-generated pseudo-labels but implicitly assume these labels are always correct. In practice, however, this assumption fails due to inevitable pseudo-label noise, which hinders model learning. To address this, we introduce a new learning paradigm that explicitly considers Pseudo-Label Noise (PLN), characterized by three key challenges: noise overfitting, error accumulation, and noisy cluster correspondence. To this end, we propose a novel Robust Duality Learning framework (RoDE) for UVI-ReID to mitigate the effects of noisy pseudo-labels. First, to combat noise overfitting, a Robust Adaptive Learning mechanism (RAL) is proposed to dynamically emphasize clean samples while down-weighting noisy ones. Second, to alleviate error accumulation, where the model reinforces its own mistakes, RoDE employs dual distinct models that are alternately trained using pseudo-labels from each other, encouraging diversity and preventing collapse. However, this dual-model strategy introduces misalignment between clusters across models and modalities, creating noisy cluster correspondence. To resolve this, we introduce Cluster Consistency Matching (CCM), which aligns clusters across models and modalities by measuring cross-cluster similarity. Extensive experiments on three benchmarks demonstrate the effectiveness of RoDE.

[103] Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

Xinjie Zhang,Jintao Guo,Shanshan Zhao,Minghao Fu,Lunhao Duan,Guo-Hua Wang,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang

Main category: cs.CV

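TL;DR: This survey reviews research on unified frameworks for multimodal understanding and image generation, analyzing existing methods, datasets, and challenges to guide future research.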

Details

Motivation: Multimodal understanding models and image generation models have each advanced rapidly, but their architectures differ markedly, so a unified framework that combines the strengths of both is needed. Method: The survey categorizes and reviews diffusion-based, autoregressive-based, and hybrid approaches and analyzes their structural designs and innovations. Result: It summarizes progress on existing unified models and compiles related datasets and benchmarks. Conclusion: The field of unified models still faces many challenges but is promising; future research should address key issues such as cross-modal attention. Abstract: Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: while autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. 
The references associated with this survey will be available on GitHub soon.

Eliraz Orfaig,Inna Stainvas,Igal Bilik

Main category: cs.CV

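TL;DR: RGBX-DiffusionDet extends the DiffusionDet model by fusing RGB with heterogeneous 2D data (X) through an adaptive multimodal encoder, uses the DCR-CBAM and DMLAB modules to strengthen cross-modal interaction and spatial feature representation, and improves feature embedding quality with novel loss functions. Experiments show that it outperforms the RGB-only baseline on multimodal datasets.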

Details

Motivation: RGB-only models detect poorly on multimodal data (such as RGB-Depth, RGB-Polarimetric, and RGB-Infrared); the work explores how to fuse heterogeneous 2D data effectively to improve detection. Method: The authors propose an adaptive multimodal encoder, design the DCR-CBAM module to dynamically refine channel features, introduce the DMLAB module for multiscale spatial feature fusion, and use regularization losses to strengthen feature embeddings. Result: The model outperforms the RGB-only model on the KITTI, RGB-Polarimetric, and M$^3$FD datasets while keeping decoding efficiency unchanged. Conclusion: RGBX-DiffusionDet is a flexible and efficient solution for multimodal object detection and offers new ideas for integrating diverse 2D sensing data into diffusion-based detection frameworks. Abstract: This work introduces RGBX-DiffusionDet, an object detection framework extending the DiffusionDet model to fuse the heterogeneous 2D data (X) with RGB imagery via an adaptive multimodal encoder. To enable cross-modal interaction, we design the dynamic channel reduction within a convolutional block attention module (DCR-CBAM), which facilitates cross-talk between subnetworks by dynamically highlighting salient channel features. Furthermore, the dynamic multi-level aggregation block (DMLAB) is proposed to refine spatial feature representations through adaptive multiscale fusion. Finally, novel regularization losses that enforce channel saliency and spatial selectivity are introduced, leading to compact and discriminative feature embeddings. Extensive experiments using RGB-Depth (KITTI), a novel annotated RGB-Polarimetric dataset, and RGB-Infrared (M$^3$FD) benchmark dataset were conducted. We demonstrate consistent superiority of the proposed approach over the baseline RGB-only DiffusionDet. The modular architecture maintains the original decoding complexity, ensuring efficiency. These results establish the proposed RGBX-DiffusionDet as a flexible multimodal object detection approach, providing new insights into integrating diverse 2D sensing modalities into diffusion-based detection pipelines.

[105] DELTA: Dense Depth from Events and LiDAR using Transformer's Attention

Vincent Brebion,Julien Moreau,Franck Davoine

Main category: cs.CV

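TL;DR: DELTA is a neural-network-based method that fuses event camera and LiDAR data to estimate dense depth maps. It models spatial and temporal relations with self- and cross-attention and substantially improves event-based depth estimation.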

Details

Motivation: Event cameras and LiDARs provide complementary but distinct data, yet few works have explored combining them. Method: The DELTA architecture uses self- and cross-attention to model the spatial and temporal relations within and between event and LiDAR data. Result: DELTA sets a new state of the art in event-based depth estimation, reducing close-range errors by up to a factor of four. Conclusion: Multimodal fusion lets DELTA markedly improve depth estimation accuracy, especially at close range. Abstract: Event cameras and LiDARs provide complementary yet distinct data: respectively, asynchronous detections of changes in lighting versus sparse but accurate depth information at a fixed rate. To this day, few works have explored the combination of these two modalities. In this article, we propose a novel neural-network-based method for fusing event and LiDAR data in order to estimate dense depth maps. Our architecture, DELTA, exploits the concepts of self- and cross-attention to model the spatial and temporal relations within and between the event and LiDAR data. Following a thorough evaluation, we demonstrate that DELTA sets a new state of the art in the event-based depth estimation problem, and that it is able to reduce errors by up to a factor of four at close range compared to the previous SOTA.
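
As a rough illustration of the cross-attention idea mentioned in the abstract, the sketch below implements generic scaled dot-product attention in NumPy, with event tokens as queries and LiDAR tokens as keys and values. This is not DELTA's architecture: the token shapes, the random features, and the function name `cross_attention` are assumptions made for the example.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention: every query row attends over all key rows."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Hypothetical token sets: 6 event tokens and 4 LiDAR tokens, 16 features each.
rng = np.random.default_rng(0)
event_tokens = rng.normal(size=(6, 16))
lidar_tokens = rng.normal(size=(4, 16))

# Event tokens query the LiDAR tokens; the output has one fused row per event token.
fused, attn = cross_attention(event_tokens, lidar_tokens, lidar_tokens)
```

Self-attention is the special case in which queries, keys, and values all come from the same token set.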

Sassan Mokhtar,Arian Mousakhan,Silvio Galesso,Jawad Tayyub,Thomas Brox

Main category: cs.CV

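TL;DR: VELM is a new LLM-based method for anomaly classification that combines unsupervised anomaly detection with LLM classification and performs strongly on MVTec-AD and MVTec-AC.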

Details

Motivation: Existing industrial anomaly detection methods pay little attention to anomaly classification, even though classification is critical in real-world inspection tasks. Method: VELM combines unsupervised anomaly detection with LLM classification, and the authors introduce the precisely labeled datasets MVTec-AC and VisA-AC. Result: VELM reaches a classification accuracy of 80.4% on MVTec-AD, exceeding prior baselines by 5%, and 84% on MVTec-AC. Conclusion: VELM provides an effective approach to anomaly classification and advances research in this area. Abstract: Recent advances in visual industrial anomaly detection have demonstrated exceptional performance in identifying and segmenting anomalous regions while maintaining fast inference speeds. However, anomaly classification (distinguishing different types of anomalies) remains largely unexplored despite its critical importance in real-world inspection tasks. To address this gap, we propose VELM, a novel LLM-based pipeline for anomaly classification. Given the critical importance of inference speed, we first apply an unsupervised anomaly detection method as a vision expert to assess the normality of an observation. If an anomaly is detected, the LLM then classifies its type. A key challenge in developing and evaluating anomaly classification models is the lack of precise annotations of anomaly classes in existing datasets. To address this limitation, we introduce MVTec-AC and VisA-AC, refined versions of the widely used MVTec-AD and VisA datasets, which include accurate anomaly class labels for rigorous evaluation. Our approach achieves a state-of-the-art anomaly classification accuracy of 80.4% on MVTec-AD, exceeding the prior baselines by 5%, and 84% on MVTec-AC, demonstrating the effectiveness of VELM in understanding and categorizing anomalies. We hope our methodology and benchmark inspire further research in anomaly classification, helping bridge the gap between detection and comprehensive anomaly characterization.
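
The two-stage control flow described in the abstract (a fast detector first, the LLM only when an anomaly is found) can be sketched as follows. The callables `detector` and `llm_classifier`, the score convention, and the `threshold` value are hypothetical placeholders, not VELM's actual components.

```python
def classify_observation(image, detector, llm_classifier, threshold=0.5):
    """Run the cheap anomaly detector first; call the LLM only for anomalous inputs.

    `detector` is assumed to return an anomaly score (higher means more anomalous)
    and `llm_classifier` is assumed to return an anomaly-type label.
    """
    score = detector(image)
    if score < threshold:
        # Normal observations never reach the slower LLM stage.
        return "normal"
    return llm_classifier(image)
```

The gating keeps average inference cost close to that of the detector whenever most inspected parts are normal.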

[107] MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

Mingcheng Li,Xiaolu Hou,Ziyang Liu,Dingkang Yang,Ziyun Qian,Jiawei Chen,Jinjie Wei,Yue Jiang,Qingyao Xu,Lihua Zhang

Main category: cs.CV

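TL;DR: Proposes MCCD, a multi-agent collaboration-based compositional diffusion method for text-to-image generation of complex scenes, which significantly improves the performance of baseline models.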

Details

Motivation: Existing methods hit performance bottlenecks on complex prompts that involve multiple objects, characteristics, and relations. Method: A multi-agent collaboration-based scene parsing module uses MLLMs to extract scene elements, and hierarchical compositional diffusion refines regions with a Gaussian mask and filtering. Result: Experiments show that MCCD significantly improves the accuracy and fidelity of complex scene generation without any training. Conclusion: MCCD offers an efficient, training-free solution for complex scene generation. Abstract: Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose a Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation for complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, Hierarchical Compositional diffusion utilizes a Gaussian mask and filtering to refine bounding box regions and enhance objects through region enhancement, resulting in the accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, providing a substantial advantage in complex scene generation.
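
The abstract mentions a Gaussian mask over bounding box regions without giving its form. One plausible reading, sketched here purely as an assumption, is a soft weight map that peaks at the box centre and decays with a width proportional to the box size. The function name, the `(x0, y0, x1, y1)` box convention, and `sigma_scale` are all hypothetical.

```python
import numpy as np

def gaussian_box_mask(height, width, box, sigma_scale=0.5):
    """Soft mask in (0, 1] centred on a bounding box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Standard deviations scale with the box size; the floor avoids division by zero.
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))

# A 64x64 map with a box spanning x in [16, 48] and y in [24, 40].
mask = gaussian_box_mask(64, 64, box=(16, 24, 48, 40))
```

Compared with a hard rectangular mask, a soft mask of this kind avoids sharp seams at the region boundary when per-region results are blended.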

[108] Sim2Real in endoscopy segmentation with a novel structure aware image translation

Clara Tomasini,Luis Riazuelo,Ana C. Murillo

Main category: cs.CV

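TL;DR: Proposes a novel image translation model that adds realistic texture to simulated endoscopic images while preserving the key scene layout, so that models can be trained without real labeled data.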

Details

Motivation: Real endoscopic images are hard to annotate, while models trained on synthetic data generalize poorly. Method: An image translation model is developed that adds realistic texture while keeping the structure of the original scene. Result: The generated images look realistic across several endoscopy scenarios and were used successfully to train a model for fold segmentation. Conclusion: The approach outperforms existing methods, and the generated data and metadata are being released to support further research. Abstract: Automatic segmentation of anatomical landmarks in endoscopic images can provide assistance to doctors and surgeons for diagnosis, treatments or medical training. However, obtaining the annotations required to train commonly used supervised learning methods is a tedious and difficult task, in particular for real images. While ground truth annotations are easier to obtain for synthetic data, models trained on such data often do not generalize well to real data. Generative approaches can add realistic texture to synthetic images, but struggle to maintain the structure of the original scene. The main contribution in this work is a novel image translation model that adds realistic texture to simulated endoscopic images while keeping the key scene layout information. Our approach produces realistic images in different endoscopy scenarios. We demonstrate that these images can be used to train a model for a challenging end task without any real labeled data. In particular, we demonstrate our approach for the task of fold segmentation in colonoscopy images. Folds are key anatomical landmarks that can occlude parts of the colon mucosa and possible polyps. Our approach generates realistic images that, after the image-style translation, maintain the shape and location of the original folds better than existing methods. We run experiments both on a novel simulated dataset for fold segmentation and on real data from the EndoMapper (EM) dataset. All our newly generated data and new EM metadata are being released to facilitate further research, as no public benchmark is currently available for the task of fold segmentation.

[109] Dance of Fireworks: An Interactive Broadcast Gymnastics Training System Based on Pose Estimation

Haotian Chen,Ziyu Liu,Xi Cheng,Chuangqi Li

Main category: cs.CV

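TL;DR: Dance of Fireworks is an interactive system that motivates users to take part in radio calisthenics through real-time feedback and dynamic fireworks animations, reducing the health risks of sedentary behaviour.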

Details

Motivation: Address the health problems caused by sedentary lifestyles by using technology to make radio calisthenics more engaging and enjoyable. Method: Using a mobile device camera and lightweight pose estimation (PoseNet/TensorFlow Lite), the system extracts body keypoints, computes joint angles, and compares them with standard motions to give real-time feedback, while mapping the user's movements to fireworks animations as an incentive. Result: Across 136 participants, the average joint angle error fell from 21.3 degrees to 9.8 degrees (p < 0.01); 93.4% of users said the system promoted exercise and 85.4% praised its entertainment value. Conclusion: The system needs no predefined motion templates or specialised hardware, giving sedentary people a low-cost, enjoyable way to exercise; future work will improve pose recognition and add features such as multiplayer interaction. Abstract: This study introduces Dance of Fireworks, an interactive system designed to combat sedentary health risks by enhancing engagement in radio calisthenics. Leveraging mobile device cameras and lightweight pose estimation (PoseNet/TensorFlow Lite), the system extracts body keypoints, computes joint angles, and compares them with standardized motions to deliver real-time corrective feedback. To incentivize participation, it dynamically maps users' movements (such as joint angles and velocity) to customizable fireworks animations, rewarding improved accuracy with richer visual effects. Experiments involving 136 participants demonstrated a significant reduction in average joint angle errors from 21.3 degrees to 9.8 degrees (p < 0.01) over four sessions, with 93.4 percent of users affirming its exercise-promoting efficacy and 85.4 percent praising its entertainment value. The system operates without predefined motion templates or specialised hardware, enabling seamless integration into office environments. Future enhancements will focus on improving pose recognition accuracy, reducing latency, and adding features such as multiplayer interaction and music synchronisation. This work presents a cost-effective, engaging solution to promote physical activity in sedentary populations.
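
The joint-angle step in the abstract can be illustrated with basic vector geometry: the angle at a joint is the angle between the two limb vectors that meet there. The sketch below assumes 2D keypoints in image coordinates and a mean-absolute-difference error; the function names and the error definition are assumptions, since the abstract does not specify them.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at keypoint b, formed by the segments b->a and b->c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip guards against values slightly outside [-1, 1] caused by rounding.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def mean_angle_error(user_angles, reference_angles):
    """Mean absolute difference in degrees between user and reference joint angles."""
    diff = np.asarray(user_angles, dtype=float) - np.asarray(reference_angles, dtype=float)
    return float(np.mean(np.abs(diff)))

# Example: shoulder, elbow, wrist keypoints forming a right angle at the elbow.
elbow = joint_angle((0, 1), (0, 0), (1, 0))
```

For an elbow, the three points would be the shoulder, elbow, and wrist keypoints returned by the pose estimator.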

[110] Structure Causal Models and LLMs Integration in Medical Visual Question Answering

Zibo Xu,Qiang Li,Weizhi Nie,Weijie Wang,Anan Liu

Main category: cs.CV

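TL;DR: Proposes a causal-inference-based MedVQA framework that improves answer accuracy by eliminating confounding effects between the image and the question.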

Details

Motivation: The complexity of medical data introduces confounders between image and question that are hard to observe and that hurt answer accuracy. Method: A causal graph represents the interaction between visual and textual elements, mutual information is used to discover spurious correlations, a multi-variable resampling front-door adjustment eliminates the confounding effect, and a multi-prompt strategy improves the model's understanding. Result: Accuracy improves significantly on three MedVQA datasets, and the method captures true causal correlations. Conclusion: The method effectively addresses confounding in MedVQA and improves both answer precision and causal inference. Abstract: Medical Visual Question Answering (MedVQA) aims to answer medical questions according to medical images. However, the complexity of medical data leads to confounders that are difficult to observe, so bias between images and questions is inevitable. Such cross-modal bias makes it challenging to infer medically meaningful answers. In this work, we propose a causal inference framework for the MedVQA task, which effectively eliminates the relative confounding effect between the image and the question to ensure the precision of the question-answering (QA) session. We are the first to introduce a novel causal graph structure that represents the interaction between visual and textual elements, explicitly capturing how different questions influence visual features. During optimization, we apply mutual information to discover spurious correlations and propose a multi-variable resampling front-door adjustment method to eliminate the relative confounding effect, which aims to align features based on their true causal relevance to the question-answering task. In addition, we introduce a prompt strategy that combines multiple prompt forms to improve the model's ability to understand complex medical data and answer accurately. Extensive experiments on three MedVQA datasets demonstrate that 1) our method significantly improves the accuracy of MedVQA, and 2) our method achieves true causal correlations in the face of complex medical data.
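
For orientation, the textbook front-door adjustment that the method's name refers to is given below. It assumes a cause X, an outcome Y, an unobserved confounder of X and Y, and a mediator M that carries the whole effect of X on Y. The abstract does not describe the paper's multi-variable resampling variant, so this is only the standard single-mediator form, not the authors' formulation.

```latex
P\bigl(y \mid \mathrm{do}(x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P\bigl(y \mid m, x'\bigr)\, P(x')
```

The inner sum averages over the values x' of the cause to block the back-door path from M to Y through X and the confounder; the outer sum propagates the intervention through the mediator.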

[111] Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery

Bojin Wu,Jing Chen

Main category: cs.CV

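TL;DR: Proposes VGLD, a robust method for monocular depth scale recovery that resolves the ambiguity of textual descriptions by incorporating high-level semantic information from the image, producing depth predictions with metric-scale accuracy.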

Details

Motivation: Monocular depth estimation must recover absolute scale to support practical downstream tasks, but variation in textual descriptions disturbs the scale recovery process. Method: VGLD combines high-level semantic information from the image with the textual description to stabilize the influence of the text, and outputs linear transformation parameters that globally adjust the relative depth map. Result: Validated on several datasets (NYUv2, KITTI) and models (MiDaS, DepthAnything), VGLD performs well and can serve as a universal alignment module. Conclusion: VGLD effectively resolves textual ambiguity, produces accurate metric-scale depth predictions, and works in zero-shot scenarios. Abstract: We propose a robust method for monocular depth scale recovery. Monocular depth estimation can be divided into two main directions: (1) relative depth estimation, which provides normalized or inverse depth without scale information, and (2) metric depth estimation, which involves recovering depth with absolute scale. To obtain absolute scale information for practical downstream tasks, utilizing textual information to recover the scale of a relative depth map is a highly promising approach. However, since a single image can have multiple descriptions from different perspectives or with varying styles, it has been shown that different textual descriptions can significantly affect the scale recovery process. To address this issue, our method, VGLD, stabilizes the influence of textual information by incorporating high-level semantic information from the corresponding image alongside the textual description. This approach resolves textual ambiguities and robustly outputs a set of linear transformation parameters (scalars) that can be globally applied to the relative depth map, ultimately generating depth predictions with metric-scale accuracy. We validate our method across several popular relative depth models (MiDaS, DepthAnything), using both indoor scenes (NYUv2) and outdoor scenes (KITTI). Our results demonstrate that VGLD functions as a universal alignment module when trained on multiple datasets, achieving strong performance even in zero-shot scenarios. Code is available at: https://github.com/pakinwu/VGLD.
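
The "linear transformation parameters (scalars) that can be globally applied" can be pictured as one scale and one shift shared by every pixel. The sketch below assumes exactly that two-scalar form; in VGLD the scalars are predicted from image and text, whereas `fit_scale_shift` here is a hypothetical least-squares reference that needs ground-truth depth and is not part of the method.

```python
import numpy as np

def apply_global_alignment(relative_depth, scale, shift):
    """Map a relative depth map to metric depth with one scale and one shift for all pixels."""
    return scale * np.asarray(relative_depth, dtype=float) + shift

def fit_scale_shift(relative_depth, metric_depth):
    """Least-squares scale and shift against ground truth (reference only, hypothetical helper)."""
    rel = np.asarray(relative_depth, dtype=float).ravel()
    met = np.asarray(metric_depth, dtype=float).ravel()
    scale, shift = np.polyfit(rel, met, deg=1)
    return float(scale), float(shift)

# Toy 2x2 relative depth map and a metric map generated with scale 4 and shift 1.
relative = np.array([[0.0, 0.5], [1.0, 0.25]])
metric = apply_global_alignment(relative, scale=4.0, shift=1.0)
scale, shift = fit_scale_shift(relative, metric)
```

Because the transformation is global, it preserves the ordering of depths in the relative map and only fixes units and offset.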

[112] A Rate-Quality Model for Learned Video Coding

Sang NguyenQuang,Cheng-Wei Chen,Xiem HoangVan,Wen-Hsiao Peng

Main category: cs.CV

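TL;DR: This paper proposes a neural-network-based R-Q model (RQNet) that predicts the relationship between bitrate and quality in learned video coding and fits the model parameters on the fly with least squares, yielding much smaller bitrate deviations than the baseline.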

Details

Motivation: Conventional approaches lack flexibility and precision when adjusting the bitrate-quality relationship, so a method is needed that adapts to the video content and coding context. Method: A neural network (RQNet) models the relationship between bitrate and quality, and least squares is used to update the model parameters dynamically. Result: Experiments on commonly used datasets show significantly smaller bitrate deviations with minimal additional complexity. Conclusion: The proposed R-Q model adapts dynamically to the demands of video coding and improves both flexibility and precision. Abstract: Learned video coding (LVC) has recently achieved superior coding performance. In this paper, we model the rate-quality (R-Q) relationship for learned video coding by a parametric function. We learn a neural network, termed RQNet, to characterize the relationship between the bitrate and quality level according to video content and coding context. The predicted (R,Q) results are further integrated with those from previously coded frames using the least-squares method to determine the parameters of our R-Q model on-the-fly. Compared to the conventional approaches, our method accurately estimates the R-Q relationship, enabling the online adaptation of model parameters to enhance both flexibility and precision. Experimental results show that our R-Q model achieves significantly smaller bitrate deviations than the baseline method on commonly used datasets with minimal additional complexity.
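
To make the least-squares step concrete, the sketch below fits a two-parameter R-Q curve to a handful of (R, Q) points. The abstract does not give the parametric function, so the logarithmic form Q = a + b ln(R) is purely an assumed stand-in, and the sample points are synthetic rather than RQNet predictions.

```python
import numpy as np

def fit_rq_model(rates, qualities):
    """Least-squares fit of the assumed model Q = a + b * ln(R)."""
    rates = np.asarray(rates, dtype=float)
    qualities = np.asarray(qualities, dtype=float)
    # Design matrix with a constant column and a log-rate column.
    design = np.column_stack([np.ones_like(rates), np.log(rates)])
    (a, b), *_ = np.linalg.lstsq(design, qualities, rcond=None)
    return float(a), float(b)

def rate_for_quality(a, b, quality):
    """Invert the assumed model to get the rate expected for a target quality."""
    return float(np.exp((quality - a) / b))

# Synthetic (R, Q) samples that follow Q = 20 + 3 ln(R) exactly.
sample_rates = np.array([100.0, 200.0, 400.0, 800.0])
sample_qualities = 20.0 + 3.0 * np.log(sample_rates)
a, b = fit_rq_model(sample_rates, sample_qualities)
```

In an online setting, new predicted points and points from previously coded frames would simply be appended before each refit, which is what makes the parameters adapt as coding proceeds.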

[113] Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models

Yankai Jiang,Peng Zhang,Donglin Yang,Yuan Tian,Hai Lin,Xiaosong Wang

Main category: cs.CV

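TL;DR: DiffuGTS uses the internal representations of frozen medical foundation diffusion models to generate anomaly-aware open-vocabulary attention maps from text prompts, enabling zero-shot tumor segmentation, and improves segmentation quality through latent-space inpainting and residual learning.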

Details

Motivation: Existing methods are limited in segmentation quality, scalability, and the range of applicable imaging modalities, which motivates the search for a generalizable tumor segmentation model. Method: The DiffuGTS framework uses a diffusion model to generate anomaly-aware attention maps and refines the segmentation masks with latent-space inpainting and residual learning. Result: It performs strongly on four datasets and seven tumor categories, surpassing existing zero-shot segmentation methods. Conclusion: DiffuGTS demonstrates the potential of diffusion models for generalizable tumor segmentation and offers a new direction for medical image analysis. Abstract: We explore Generalizable Tumor Segmentation, aiming to train a single model for zero-shot tumor segmentation across diverse anatomical regions. Existing methods face limitations related to segmentation quality, scalability, and the range of applicable imaging modalities. In this paper, we uncover the potential of the internal representations within frozen medical foundation diffusion models as highly efficient zero-shot learners for tumor segmentation by introducing a novel framework named DiffuGTS. DiffuGTS creates anomaly-aware open-vocabulary attention maps based on text prompts to enable generalizable anomaly segmentation without being restricted by a predefined training category list. To further improve and refine anomaly segmentation masks, DiffuGTS leverages the diffusion model, transforming pathological regions into high-quality pseudo-healthy counterparts through latent space inpainting, and applies a novel pixel-level and feature-level residual learning approach, resulting in segmentation masks with significantly enhanced quality and generalization. Comprehensive experiments on four datasets and seven tumor categories demonstrate the superior performance of our method, surpassing current state-of-the-art models across multiple zero-shot settings. Codes are available at https://github.com/Yankai96/DiffuGTS.

[114] Unsupervised Deep Learning-based Keypoint Localization Estimating Descriptor Matching Performance

David Rivas-Villar,Álvaro S. Hervella,José Rouco,Jorge Novo

Main category: cs.CV

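TL;DR: Proposes an unsupervised retinal image registration method that needs no labeled data. Its descriptor outperforms supervised descriptors, its keypoint detector outperforms existing unsupervised detectors, and the full pipeline is comparable to leading supervised methods.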

Details

Motivation: Retinal image registration is essential in clinical practice, but existing methods rely on labeled data, which is scarce in the medical domain. Method: An unsupervised descriptor learning method and a keypoint detector network are proposed; the detector finds keypoints directly from estimated descriptor performance, with no labels. Result: On four datasets, the unsupervised descriptor and detector outperform existing methods, and registration performance approaches that of supervised methods. Conclusion: The method requires no labeled data, performs strongly, and can be adapted to other domains and modalities. Abstract: Retinal image registration, particularly for color fundus images, is a challenging yet essential task with diverse clinical applications. Existing registration methods for color fundus images typically rely on keypoints and descriptors for alignment; however, a significant limitation is their reliance on labeled data, which is particularly scarce in the medical domain. In this work, we present a novel unsupervised registration pipeline that entirely eliminates the need for labeled data. Our approach is based on the principle that locations with distinctive descriptors constitute reliable keypoints. This fully inverts the conventional state-of-the-art approach, conditioning the detector on the descriptor rather than the opposite. First, we propose an innovative descriptor learning method that operates without keypoint detection or any labels, generating descriptors for arbitrary locations in retinal images. Next, we introduce a novel, label-free keypoint detector network which works by estimating descriptor performance directly from the input image. We validate our method through a comprehensive evaluation on four hold-out datasets, demonstrating that our unsupervised descriptor outperforms state-of-the-art supervised descriptors and that our unsupervised detector significantly outperforms existing unsupervised detection methods. Finally, our full registration pipeline achieves performance comparable to the leading supervised methods, while not employing any labeled data. Additionally, the label-free nature and design of our method enable direct adaptation to other domains and modalities.

[115] Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge

Vladyslav Zalevskyi,Thomas Sanchez,Misha Kaandorp,Margaux Roulet,Diego Fajardo-Rojas,Liu Li,Jana Hutter,Hongwei Bran Li,Matthew Barkovich,Hui Ji,Luca Wilhelmi,Aline Dändliker,Céline Steger,Mériam Koob,Yvan Gomez,Anton Jakovčić,Melita Klaić,Ana Adžić,Pavel Marković,Gracia Grabarić,Milan Rados,Jordina Aviles Verdera,Gregor Kasprian,Gregor Dovjak,Raphael Gaubert-Rachmühl,Maurice Aschwanden,Qi Zeng,Davood Karimi,Denis Peruzzo,Tommaso Ciceri,Giorgio Longari,Rachika E. Hamadache,Amina Bouzid,Xavier Lladó,Simone Chiarella,Gerard Martí-Juan,Miguel Ángel González Ballester,Marco Castellaro,Marco Pinamonti,Valentina Visani,Robin Cremese,Keïn Sam,Fleur Gaudfernau,Param Ahir,Mehul Parikh,Maximilian Zenk,Michael Baumgartner,Klaus Maier-Hein,Li Tianhong,Yang Hong,Zhao Longfei,Domen Preloznik,Žiga Špiclin,Jae Won Choi,Muyang Li,Jia Fu,Guotai Wang,Jingwen Jiang,Lyuyang Tong,Bo Du,Andrea Gondova,Sungmin You,Kiho Im,Abdul Qayyum,Moona Mazher,Steven A Niederer,Maya Yanko,Bella Specktor-Fadida,Dafna Ben Bashat,Andras Jakab,Roxane Licandro,Kelly Payette,Meritxell Bach Cuadra

Main category: cs.CV

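TL;DR: The FeTA Challenge 2024 focused on automated segmentation and biometry for fetal brain MRI, introducing low-field MRI data and a topological evaluation metric. Segmentation accuracy appears to be approaching a ceiling, biometry methods performed poorly, and the results underline the importance of data quality and diversity for AI tools.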

Details

Motivation: Studying fetal brain development requires accurate tissue segmentation and biometric analysis; the FeTA Challenge aims to advance automated MRI analysis. Method: The challenge comprised tissue segmentation and biometry tasks, introduced low-field MRI data and a topological metric (ED), and evaluated segmentation methods from 16 teams and biometry methods from 7 teams. Result: Segmentation methods performed consistently on high- and low-field MRI, but accuracy is approaching a ceiling; most biometry methods did not beat a simple baseline based on gestational age. ED revealed topological differences that conventional metrics miss. Conclusion: FeTA 2024 provides a comprehensive benchmark for fetal brain MRI analysis and highlights the importance of data quality, topological evaluation, and dataset diversity for clinical AI tools. Abstract: Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics were also expanded to include the topology-specific Euler characteristic difference (ED). Sixteen teams submitted segmentation methods, most of which performed consistently across both high- and low-field scans. However, longitudinal trends indicate that segmentation accuracy may be reaching a plateau, with results now approaching inter-rater variability. The ED metric uncovered topological differences that were missed by conventional metrics, while the low-field dataset achieved the highest segmentation scores, highlighting the potential of affordable imaging systems when paired with high-quality reconstruction. Seven teams participated in the biometry task, but most methods failed to outperform a simple baseline that predicted measurements based solely on gestational age, underscoring the challenge of extracting reliable biometric estimates from image data alone. Domain shift analysis identified image quality as the most significant factor affecting model generalization, with super-resolution pipelines also playing a substantial role. Other factors, such as gestational age, pathology, and acquisition site, had smaller, though still measurable, effects. Overall, FeTA 2024 offers a comprehensive benchmark for multi-class segmentation and biometry estimation in fetal brain MRI, underscoring the need for data-centric approaches, improved topological evaluation, and greater dataset diversity to enable clinically robust and generalizable AI tools.
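
The Euler characteristic behind the ED metric can be computed for a binary mask by treating every foreground voxel as a closed unit cube and counting vertices minus edges plus faces minus cubes of the resulting complex. The sketch below does this with NumPy; the closed-cube convention (which corresponds to 26-connectivity of the foreground in 3D) and the use of an absolute difference for ED are assumptions, as the abstract does not state the challenge's exact definition.

```python
from itertools import combinations, product

import numpy as np

def _cells_touching_foreground(padded, axes):
    """Mark cells shared across `axes` that touch at least one foreground voxel."""
    out = None
    for offsets in product((0, 1), repeat=len(axes)):
        index = [slice(None)] * padded.ndim
        for axis, off in zip(axes, offsets):
            index[axis] = slice(off, padded.shape[axis] - 1 + off)
        view = padded[tuple(index)]
        out = view.copy() if out is None else (out | view)
    return out

def euler_characteristic(mask):
    """Euler characteristic of a binary mask viewed as a union of closed unit cubes."""
    padded = np.pad(np.asarray(mask, dtype=bool), 1)
    ndim = padded.ndim
    chi = 0
    for k in range(ndim + 1):
        # A cell shared across k axes has dimension ndim - k (k = ndim gives vertices).
        for axes in combinations(range(ndim), k):
            count = int(_cells_touching_foreground(padded, axes).sum())
            chi += (-1) ** (ndim - k) * count
    return chi

def euler_characteristic_difference(prediction, reference):
    """Assumed ED definition: absolute difference of the two Euler characteristics."""
    return abs(euler_characteristic(prediction) - euler_characteristic(reference))

# A one-voxel-thick ring has one component and one tunnel, so its characteristic is 0.
ring = np.ones((3, 3, 1), dtype=bool)
ring[1, 1, 0] = False
solid = np.ones((3, 3, 1), dtype=bool)
```

Here the ring and the solid slab differ by a single voxel, so overlap-based scores would barely change, yet their Euler characteristics differ; this is the kind of topological error such a metric is meant to expose.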

[116] Unsupervised training of keypoint-agnostic descriptors for flexible retinal image registration

David Rivas-Villar,Álvaro S. Hervella,José Rouco,Jorge Novo

Main category: cs.CV

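TL;DR: Proposes an unsupervised descriptor learning method that needs no keypoint detection, suited to medical image registration; it performs on par with supervised methods and does not depend on any particular keypoint detector.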

Details

Motivation: The lack of labeled data in the medical domain limits existing color fundus image registration methods, which motivates unsupervised learning. Method: An unsupervised descriptor learning method is developed that does not rely on keypoint detection, making the descriptor network agnostic to the keypoint detector used at registration time. Result: On the public reference retinal image registration dataset, the method registers accurately, matches supervised methods, and works regardless of the keypoint detector. Conclusion: The work is a notable step towards applying unsupervised learning in the medical domain. Abstract: Current color fundus image registration approaches are limited, among other things, by the lack of labeled data, which is even more significant in the medical domain, motivating the use of unsupervised learning. Therefore, in this work, we develop a novel unsupervised descriptor learning method that does not rely on keypoint detection. This enables the resulting descriptor network to be agnostic to the keypoint detector used during the registration inference. To validate this approach, we perform an extensive and comprehensive comparison on the reference public retinal image registration dataset. Additionally, we test our method with multiple keypoint detectors of varied nature, even proposing some novel ones. Our results demonstrate that the proposed approach offers accurate registration, incurring no performance loss versus supervised methods. Additionally, it demonstrates accurate performance regardless of the keypoint detector used. Thus, this work represents a notable step towards leveraging unsupervised learning in the medical domain.

[117] DPNet: Dynamic Pooling Network for Tiny Object Detection

Luqi Gong,Haotian Chen,Yikun Chen,Tianliang Yao,Chao Li,Shuai Zhao,Guangjie Han

Main category: cs.CV

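TL;DR: Proposes the Dynamic Pooling Network (DPNet) to address the efficiency and accuracy of tiny object detection in unmanned aerial systems. By dynamically adjusting feature map resolution and using an adaptive normalization module, it allocates computing resources efficiently.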

Details

Motivation: Tiny object detection in complex environments is crucial for unmanned aerial systems, but the conventional strategy of enlarging images raises computational cost and the number of negative samples, which degrades detection performance. Method: DPNet introduces a dynamic downsampling factor (df) and an Adaptive Normalization Module (ANM); a lightweight predictor predicts df for each image, giving input-aware downsampling. Result: On the TinyCOCO and TinyPerson datasets, DPNet saves over 35% and 25% of the computation (GFLOPs), respectively, while maintaining detection performance. Conclusion: Through dynamic resource allocation, DPNet balances detection efficiency and accuracy and suits tiny object detection in unmanned aerial systems. Abstract: In unmanned aerial systems, especially in complex environments, accurately detecting tiny objects is crucial. Resizing images is a common strategy to improve detection accuracy, particularly for small objects. However, simply enlarging images significantly increases computational costs and the number of negative samples, severely degrading detection performance and limiting its applicability. This paper proposes a Dynamic Pooling Network (DPNet) for tiny object detection to mitigate these issues. DPNet employs a flexible down-sampling strategy by introducing a factor (df) to relax the fixed downsampling process of the feature map to an adjustable one. Furthermore, we design a lightweight predictor to predict df for each input image, which is used to decrease the resolution of feature maps in the backbone. Thus, we achieve input-aware downsampling. We also design an Adaptive Normalization Module (ANM) to make a unified detector compatible with different df values. A guidance loss supervises the predictor's training. DPNet dynamically allocates computing resources to trade off between detection accuracy and efficiency. Experiments on the TinyCOCO and TinyPerson datasets show that DPNet can save over 35% and 25% GFLOPs, respectively, while maintaining comparable detection performance. The code will be made publicly available.

[118] Database-Agnostic Gait Enrollment using SetTransformers

Nicoleta Basoc,Adrian Cosma,Andy Cǎtrunǎ,Emilian Rǎdoi

Main category: cs.CV

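TL;DR: This paper proposes a transformer-based framework for open-set gait recognition that performs gait enrollment without depending on a specific dataset or recognition architecture.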

Details

Motivation: Real-world gait recognition has to handle open-set enrollment, that is, deciding whether a new sample belongs to a known identity or to a new individual. Existing methods work well under closed-set conditions but do not adapt to open-set scenarios. Method: A SetTransformer makes the enrollment decision from the embeddings of a probe sample and a context set, with no task-specific thresholds or retraining. The method decouples enrollment from the main recognition pipeline and applies across datasets and identity distributions. Result: Evaluated on two benchmark datasets (CASIA-B and PsyMo) with embeddings from three state-of-the-art recognition models, the method is flexible and accurate and outperforms traditional approaches. Conclusion: The proposed method is general and scalable for open-set gait enrollment; the code and dataset scenarios will be released. Abstract: Gait recognition has emerged as a powerful tool for unobtrusive and long-range identity analysis, with growing relevance in surveillance and monitoring applications. Although recent advances in deep learning and large-scale datasets have enabled highly accurate recognition under closed-set conditions, real-world deployment demands open-set gait enrollment, which means determining whether a new gait sample corresponds to a known identity or represents a previously unseen individual. In this work, we introduce a transformer-based framework for open-set gait enrollment that is both dataset-agnostic and recognition-architecture-agnostic. Our method leverages a SetTransformer to make enrollment decisions based on the embedding of a probe sample and a context set drawn from the gallery, without requiring task-specific thresholds or retraining for new environments. By decoupling enrollment from the main recognition pipeline, our model generalizes across different datasets, gallery sizes, and identity distributions. We propose an evaluation protocol that uses existing datasets in different ratios of identities and walks per identity. We instantiate our method using skeleton-based gait representations and evaluate it on two benchmark datasets (CASIA-B and PsyMo), using embeddings from three state-of-the-art recognition models (GaitGraph, GaitFormer, and GaitPT). We show that our method is flexible, is able to accurately perform enrollment in different scenarios, and scales better with data compared to traditional approaches. We will make the code and dataset scenarios publicly available.

[119] MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing

Zinan Guo,Pengze Zhang,Yanze Wu,Chong Mou,Songtao Zhao,Qian He

Main category: cs.CV

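TL;DR: The MUSAR framework achieves multi-subject customization from single-subject training data, addressing data diversity and attribute entanglement and outperforming existing methods.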

Details

Motivation: Current multi-subject customization methods struggle with data acquisition and attribute entanglement, so a solution that relies only on single-subject data is needed. Method: The MUSAR framework combines debiased diptych learning with a dynamic attention routing mechanism to achieve multi-subject learning and decoupling. Result: Experiments show that MUSAR outperforms existing methods in image quality, subject consistency, and interaction naturalness. Conclusion: MUSAR's design addresses the core problems of multi-subject customization efficiently and scalably. Abstract: Current multi-subject customization approaches encounter two critical challenges: the difficulty in acquiring diverse multi-subject training data, and attribute entanglement across different subjects. To bridge these gaps, we propose MUSAR, a simple yet effective framework to achieve robust multi-subject customization while requiring only single-subject training data. Firstly, to break the data limitation, we introduce debiased diptych learning. It constructs diptych training pairs from single-subject images to facilitate multi-subject learning, while actively correcting the distribution bias introduced by diptych construction via static attention routing and dual-branch LoRA. Secondly, to eliminate cross-subject entanglement, we introduce a dynamic attention routing mechanism, which adaptively establishes bijective mappings between generated images and conditional subjects. This design not only achieves decoupling of multi-subject representations but also maintains scalable generalization performance with increasing reference subjects. Comprehensive experiments demonstrate that our MUSAR outperforms existing methods, even those trained on multi-subject datasets, in image quality, subject consistency, and interaction naturalness, despite requiring only a single-subject dataset.

[120] Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models

Kuofeng Gao,Yufei Zhu,Yiming Li,Jiawang Bai,Yong Yang,Zhifeng Li,Shu-Tao Xia

Main category: cs.CV

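TL;DR: This paper proposes a copyright evasion attack (CEAT2I) against text-to-image (T2I) diffusion models. By detecting watermarked samples, identifying trigger tokens, and efficiently removing the watermark, it bypasses dataset ownership verification (DOV) mechanisms.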

Details

Motivation: As personalized fine-tuning of T2I diffusion models grows, unauthorized dataset use is becoming a serious problem. DOV protects dataset ownership with watermarking, but its robustness against copyright evasion attacks has not been studied. Method: CEAT2I has three stages: watermarked sample detection, trigger identification, and watermark removal. It exploits the faster convergence of watermarked samples during fine-tuning to detect them through feature deviation, identifies trigger tokens by iteratively ablating prompt tokens, and removes the watermark with a closed-form concept erasure method. Result: Experiments show that CEAT2I effectively evades DOV mechanisms while preserving model performance. Conclusion: The paper exposes a potential weakness of DOV for T2I diffusion models and proposes the first targeted copyright evasion attack, providing a basis for more robust watermarking in the future. Abstract: Text-to-image (T2I) diffusion models have rapidly advanced, enabling high-quality image generation conditioned on textual prompts. However, the growing trend of fine-tuning pre-trained models for personalization raises serious concerns about unauthorized dataset usage. To combat this, dataset ownership verification (DOV) has emerged as a solution, embedding watermarks into the fine-tuning datasets using backdoor techniques. These watermarks remain inactive under benign samples but produce owner-specified outputs when triggered. Despite the promise of DOV for T2I diffusion models, its robustness against copyright evasion attacks (CEA) remains unexplored. In this paper, we explore how attackers can bypass these mechanisms through CEA, allowing models to circumvent watermarks even when trained on watermarked datasets. We propose the first copyright evasion attack (i.e., CEAT2I) specifically designed to undermine DOV in T2I diffusion models. Concretely, our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation. A key insight driving our approach is that T2I models exhibit faster convergence on watermarked samples during the fine-tuning, evident through intermediate feature deviation. Leveraging this, CEAT2I can reliably detect the watermarked samples. Then, we iteratively ablate tokens from the prompts of detected watermarked samples and monitor shifts in intermediate features to pinpoint the exact trigger tokens. Finally, we adopt a closed-form concept erasure method to remove the injected watermark. Extensive experiments show that our CEAT2I effectively evades DOV mechanisms while preserving model performance.

[121] Towards Application-Specific Evaluation of Vision Models: Case Studies in Ecology and Biology

Alex Hoi Hang Chan,Otto Brookes,Urs Waldmann,Hemal Naik,Iain D. Couzin,Majid Mirmehdi,Noël Adiko Houa,Emmanuelle Normand,Christophe Boesch,Lukas Boesch,Mimi Arandjelovic,Hjalmar Kühl,Tilo Burghardt,Fumihiro Kano

Main category: cs.CV

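TL;DR: The paper argues that computer vision models used in ecological and biological research should be evaluated with metrics tied directly to the application, not with machine learning metrics alone. Two case studies show that models with strong machine learning performance can still bias the resulting data in practice.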

Details

Motivation: Computer vision models in ecological and biological research are evaluated mainly with machine learning metrics, which overlooks their actual effect on downstream analysis. The authors want application-specific metrics to assess model performance more accurately. Method: Two case studies test how models behave in practice: (1) estimating chimpanzee abundance and density with a video-based behaviour classifier, and (2) estimating head rotation in pigeons with a 3D posture estimator. Result: Even models with strong machine learning performance (e.g., 87% mAP) can produce estimates that deviate from expert-derived data, and the best-performing posture estimators do not give the most accurate inferences of pigeon gaze direction. Conclusion: The authors call for application-specific metrics to be integrated into ecological and biological datasets so that models can be benchmarked in their downstream context and integrated more easily into real workflows. Abstract: Computer vision methods have demonstrated considerable potential to streamline ecological and biological workflows, with a growing number of datasets and models becoming available to the research community. However, these resources focus predominantly on evaluation using machine learning metrics, with relatively little emphasis on how their application impacts downstream analysis. We argue that models should be evaluated using application-specific metrics that directly represent model performance in the context of its final use case. To support this argument, we present two disparate case studies: (1) estimating chimpanzee abundance and density with camera trap distance sampling when using a video-based behaviour classifier and (2) estimating head rotation in pigeons using a 3D posture estimator. We show that even models with strong machine learning performance (e.g., 87% mAP) can yield data that leads to discrepancies in abundance estimates compared to expert-derived data. Similarly, the highest-performing models for posture estimation do not produce the most accurate inferences of gaze direction in pigeons. Motivated by these findings, we call for researchers to integrate application-specific metrics in ecological/biological datasets, allowing for models to be benchmarked in the context of their downstream application and to facilitate better integration of models into application workflows.

[122] No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

Dengyang Jiang,Mengmeng Wang,Liuzhuozheng Li,Lei Zhang,Haoyu Wang,Wei Wei,Guang Dai,Yanning Zhang,Jingdong Wang

Main category: cs.CV

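TL;DR: The paper proposes Self-Representation Alignment (SRA), a self-distillation method that improves representation learning during the generative training of diffusion transformers without relying on external representation components.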

Details

Motivation: Existing methods either introduce a complex external representation training framework or depend on large-scale pre-trained models, whereas the discriminative process of diffusion transformers may itself provide representation guidance. Method: SRA uses self-distillation to align the latent representations of the diffusion transformer across noise levels, progressively strengthening representation learning. Result: Experiments show that SRA yields consistent improvements on DiTs and SiTs, outperforms complex external frameworks, and performs close to methods that rely on strong external priors. Conclusion: SRA is a simple and effective way to improve representation learning during generative training without external components. Abstract: Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance generation quality of the diffusion transformers. However, existing approaches need to either introduce an additional and complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation guidance during the original generative training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We therefore propose Self-Representation Alignment (SRA), a simple yet straightforward method that obtains representation guidance in a self-distillation manner. Specifically, SRA aligns the output latent representation of the diffusion transformer in an earlier layer with higher noise to that in a later layer with lower noise to progressively enhance the overall representation learning during only the generative training process. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA not only significantly outperforms approaches relying on auxiliary, complex representation training frameworks but also achieves performance comparable to methods that heavily depend on powerful external representation priors.

[123] Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Lu Ling,Chen-Hsuan Lin,Tsung-Yi Lin,Yifan Ding,Yu Zeng,Yichen Sheng,Yunhao Ge,Ming-Yu Liu,Aniket Bera,Zhaoshuo Li

Main category: cs.CV

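TL;DR: Scenethesis is a training-free framework that combines an LLM with vision modules to generate diverse, realistic, and physically plausible interactive 3D scenes.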

Details

Motivation: Existing methods depend on small-scale indoor datasets or lack spatial realism, making it hard to generate scenes that are both diverse and consistent with common sense. Method: Scenethesis drafts a coarse layout with an LLM, refines it with a vision module, enforces physical plausibility with an optimization module, and finally verifies spatial coherence with a judge module. Result: Experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D scenes. Conclusion: The framework is valuable for virtual content creation, simulation environments, and embodied AI research. Abstract: Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting the scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating an image guidance and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts like object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.

cs.GR [Back]

[124] Discrete Spatial Diffusion: Intensity-Preserving Diffusion Modeling

Javier E. Santos,Agnese Marcato,Roman Colman,Nicholas Lubbers,Yen Ting Lin

Main category: cs.GR

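TL;DR: The paper proposes Discrete Spatial Diffusion (DSD), a framework that addresses the limitations of generative diffusion models on discrete spaces and mass conservation.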

Details

Motivation: Generative diffusion models perform well in continuous intensity spaces but are ill-suited to scientific applications involving discrete quantities (such as particle counts or material units) and mass conservation. Method: DSD is built on a continuous-time, discrete-state jump stochastic process that operates directly in discrete spatial domains and strictly preserves mass. Result: DSD's expressive flexibility is demonstrated on image synthesis, class conditioning, and inpainting, and its applicability is shown on scientific data such as materials microstructure. Conclusion: DSD bridges the gap between diffusion models and scientific applications that require mass conservation, offering an effective approach for problems on discrete spaces. Abstract: Generative diffusion models have achieved remarkable success in producing high-quality images. However, because these models typically operate in continuous intensity spaces - diffusing independently per pixel and color channel - they are fundamentally ill-suited for applications where quantities such as particle counts or material units are inherently discrete and governed by strict conservation laws such as mass preservation, limiting their applicability in scientific workflows. To address this limitation, we propose Discrete Spatial Diffusion (DSD), a framework based on a continuous-time, discrete-state jump stochastic process that operates directly in discrete spatial domains while strictly preserving mass in both forward and reverse diffusion processes. By using spatial diffusion to achieve mass preservation, we introduce stochasticity naturally through a discrete formulation. We demonstrate the expressive flexibility of DSD by performing image synthesis, class conditioning, and image inpainting across widely-used image benchmarks, with the ability to condition on image intensity. Additionally, we highlight its applicability to domain-specific scientific data for materials microstructure, bridging the gap between diffusion models and mass-conditioned scientific applications.
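The mass-preserving forward process described in this entry can be illustrated with a small sketch. It is not the authors' implementation: it simulates only the embedded jump chain (one unit of intensity hops to a neighbouring pixel per step), and the uniform choice of unit, the 4-neighbourhood, and the periodic boundary are all assumptions made for illustration. The learned reverse process is not shown.

```python
import random


def pick_unit(grid, rng):
    """Pick one unit of intensity uniformly at random; return its (row, col)."""
    total = sum(map(sum, grid))
    r = rng.randrange(total)
    for y, row in enumerate(grid):
        for x, count in enumerate(row):
            if r < count:
                return y, x
            r -= count
    raise RuntimeError("unreachable when total > 0")


def forward_jump_steps(grid, n_jumps, rng):
    """Apply n_jumps single-unit hops to a copy of an integer intensity grid.

    Each hop moves exactly one unit to a random 4-neighbour (periodic
    boundary, an assumption), so the total intensity never changes.
    """
    height, width = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # leave the input untouched
    if sum(map(sum, out)) == 0:
        return out  # nothing to move
    for _ in range(n_jumps):
        y, x = pick_unit(out, rng)
        dy, dx = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
        out[y][x] -= 1
        out[(y + dy) % height][(x + dx) % width] += 1
    return out


if __name__ == "__main__":
    image = [[3, 0, 0], [0, 5, 0], [0, 0, 2]]
    noised = forward_jump_steps(image, 50, random.Random(0))
    print(sum(map(sum, image)), sum(map(sum, noised)))  # both totals are equal
```

Because every step subtracts one unit from a pixel and adds it to another, conservation holds by construction rather than by a penalty term, which is the property the abstract emphasizes.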

[125] OT-Talk: Animating 3D Talking Head with Optimal Transportation

Xinmu Wang,Xiang Gao,Xiyun Song,Heather Yu,Zongfang Lin,Liang Peng,Xianfeng Gu

Main category: cs.GR

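TL;DR: OT-Talk uses optimal transport to optimize the learning model, extracts geometric features with Chebyshev graph convolution, and models mesh variations with the Wasserstein distance to produce more natural facial animation.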

Details

Motivation: Bridge the modality gap between speech signals and facial dynamics to improve lip sync and the naturalness of facial motion. Method: A pre-trained Hubert model extracts audio features, a transformer processes temporal sequences, Chebyshev graph convolution extracts geometric features from the mesh, and the Wasserstein distance models mesh variations. Result: On two public datasets, OT-Talk outperforms prior methods in mesh reconstruction accuracy and temporal alignment; a user perception study further supports its effectiveness. Conclusion: With its geometric feature extraction and distance measure, OT-Talk significantly improves the naturalness and accuracy of facial animation. Abstract: Animating 3D head meshes using audio inputs has significant applications in AR/VR, gaming, and entertainment through 3D avatars. However, bridging the modality gap between speech signals and facial dynamics remains a challenge, often resulting in incorrect lip syncing and unnatural facial movements. To address this, we propose OT-Talk, the first approach to leverage optimal transportation to optimize the learning model in talking head animation. Building on existing learning frameworks, we utilize a pre-trained Hubert model to extract audio features and a transformer model to process temporal sequences. Unlike previous methods that focus solely on vertex coordinates or displacements, we introduce Chebyshev Graph Convolution to extract geometric features from triangulated meshes. To measure mesh dissimilarities, we go beyond traditional mesh reconstruction errors and velocity differences between adjacent frames. Instead, we represent meshes as probability measures and approximate their surfaces. This allows us to leverage the sliced Wasserstein distance for modeling mesh variations. This approach facilitates the learning of smooth and accurate facial motions, resulting in coherent and natural facial animations. Our experiments on two public audio-mesh datasets demonstrate that our method outperforms state-of-the-art techniques both quantitatively and qualitatively in terms of mesh reconstruction accuracy and temporal alignment. In addition, we conducted a user perception study with 20 volunteers to further assess the effectiveness of our approach.
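As a rough illustration of the sliced Wasserstein distance mentioned in the abstract, the sketch below compares two equally sized 3D point sets. Treating mesh vertices as uniformly weighted samples, the number of projections, and the function names are assumptions for this example; how OT-Talk builds its probability measures from mesh surfaces is not specified here.

```python
import math
import random


def random_unit_vector(rng):
    """Sample a direction uniformly on the unit sphere via normalized Gaussians."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:
            return [c / norm for c in v]


def sliced_wasserstein2(points_a, points_b, n_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    Both inputs are lists of (x, y, z) tuples with equal length and are
    treated as uniform empirical measures (an assumption of this sketch).
    """
    if len(points_a) != len(points_b):
        raise ValueError("this sketch assumes equally sized point sets")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_projections):
        d = random_unit_vector(rng)
        # Project onto the direction; in 1D, optimal transport between two
        # uniform empirical measures matches the sorted samples.
        proj_a = sorted(sum(p[i] * d[i] for i in range(3)) for p in points_a)
        proj_b = sorted(sum(p[i] * d[i] for i in range(3)) for p in points_b)
        total += sum((a - b) ** 2 for a, b in zip(proj_a, proj_b)) / len(proj_a)
    return math.sqrt(total / n_projections)


if __name__ == "__main__":
    mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    shifted = [(x + 1.0, y, z) for x, y, z in mesh]
    print(sliced_wasserstein2(mesh, mesh))     # identical sets give 0.0
    print(sliced_wasserstein2(mesh, shifted))  # positive, at most the shift length
```

The distance depends only on the point distributions, not on vertex ordering, which is one reason a transport-based measure can complement per-vertex reconstruction error.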

[126] Aokana: A GPU-Driven Voxel Rendering Framework for Open World Games

Yingrong Fang,Qitong Wang,Wei Wang

Main category: cs.GR

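TL;DR: Aokana is a GPU-driven voxel rendering framework for open-world games based on a Sparse Voxel Directed Acyclic Graph (SVDAG); it substantially reduces memory usage and speeds up rendering.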

Details

Motivation: The high storage cost and rendering time of voxels limit the development of open-world voxel games, so an efficient rendering solution is needed. Method: Aokana combines an LOD mechanism with a streaming system and a high-performance GPU-driven rendering pipeline that supports real-time rendering of scenes with tens of billions of voxels. Result: Experiments show that Aokana reduces memory usage by up to ninefold and renders up to 4.8 times faster than previous state-of-the-art approaches. Conclusion: Aokana is practically applicable and can be integrated directly into existing game engines. Abstract: Voxels are among the most popular 3D geometric representations today. Due to their intuitiveness and ease-of-editing, voxels have been widely adopted in stylized games and low-cost independent games. However, the high storage cost of voxels, along with the significant time overhead associated with large-scale voxel rendering, limits the further development of open-world voxel games. In this paper, we introduce Aokana, a GPU-Driven Voxel Rendering Framework for Open World Games. Aokana is based on a Sparse Voxel Directed Acyclic Graph (SVDAG). It incorporates a Level-of-Details (LOD) mechanism and a streaming system, enabling seamless map loading as players traverse the open-world game environment. We also designed a corresponding high-performance GPU-driven voxel rendering pipeline to support real-time rendering of the voxel scenes that contain tens of billions of voxels. Aokana can be directly applied to existing game engines and easily integrated with mesh-based rendering methods, demonstrating its practical applicability in game development. Experimental evaluations show that, with increasing voxel scene resolution, Aokana can reduce memory usage by up to ninefold and achieves rendering speeds up to 4.8 times faster than those of previous state-of-the-art approaches.
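To show why an SVDAG can save memory compared with a plain sparse voxel octree, here is a minimal CPU-side sketch that merges identical subtrees through a lookup table. The node layout and the names `build_svdag` and `count_tree_nodes` are hypothetical; Aokana's actual GPU encoding, LOD mechanism, and streaming system are not represented.

```python
def build_svdag(occupied, size):
    """Build a DAG over a size^3 voxel cube (size must be a power of two).

    occupied is a set of (x, y, z) voxel coordinates. Returns (root, table),
    where table[i] is either "leaf" or a tuple of 8 child ids (None = empty).
    Identical subtrees share one table entry.
    """
    ids = {}    # canonical node description -> node id
    table = []  # node id -> node description

    def intern(key):
        if key not in ids:
            ids[key] = len(table)
            table.append(key)
        return ids[key]

    def build(x, y, z, s):
        if s == 1:
            return intern("leaf") if (x, y, z) in occupied else None
        h = s // 2
        children = tuple(
            build(x + dx * h, y + dy * h, z + dz * h, h)
            for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)
        )
        if all(c is None for c in children):
            return None  # empty regions are not stored (the "sparse" part)
        return intern(children)

    return build(0, 0, 0, size), table


def count_tree_nodes(node, table):
    """Count the nodes an equivalent octree without sharing would need."""
    if node is None:
        return 0
    if table[node] == "leaf":
        return 1
    return 1 + sum(count_tree_nodes(c, table) for c in table[node])


if __name__ == "__main__":
    full = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
    root, table = build_svdag(full, 4)
    print(len(table), count_tree_nodes(root, table))  # DAG nodes vs. tree nodes
```

For a fully occupied 4x4x4 cube every level collapses to a single shared node, so the DAG stores 3 nodes where an unshared octree would store 73. Real scenes are less regular, and the ninefold figure in the abstract refers to the authors' measurements, not to this toy case.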

[127] Holographic Radiance Cascades for 2D Global Illumination

Rouli Freeman,Alexander Sannikov,Adrian Margel

Main category: cs.GR

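TL;DR: The paper presents Holographic Radiance Cascades, a new algorithm for computing 2D global illumination in real time, with results close to the reference solution and fast run times.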

Details

Motivation: Current global illumination algorithms struggle to run at high quality in real time for dynamic scenes, and the same holds even in 2D. Method: A multi-level radiance probe system that combines short ray intervals as a replacement for conventional raytracing. Result: On an RTX 3080 Laptop, a 512x512 image takes 1.85ms and a 1024x1024 image takes 7.67ms. Conclusion: The algorithm achieves 2D global illumination visually indistinguishable from the reference solution at real-time frame rates. Abstract: Efficiently calculating global illumination has always been one of the greatest challenges in computer graphics. Algorithms for approximating global illumination have always struggled to run in realtime for fully dynamic scenes, and have had to rely heavily on stochastic raytracing, spatiotemporal denoising, or undersampled representations, resulting in much lower quality of lighting compared to reference solutions. Even though the problem of calculating global illumination in 2D is significantly simpler than that of 3D, most contemporary approaches still struggle to accurately approximate 2D global illumination under realtime constraints. We present Holographic Radiance Cascades: a new single-shot scene-agnostic radiance transfer algorithm for global illumination, which is capable of achieving results visually indistinguishable from the 2D reference solution at realtime framerates. Our method uses a multi-level radiance probe system, and computes rays via combining short ray intervals as a replacement for conventional raytracing. It runs at constant cost for a given scene size, taking 1.85ms for a 512x512 pixel image and 7.67ms for 1024x1024 on an RTX 3080 Laptop.

[128] Diffeomorphic Reconstruction Of A 2D Simple Non Parametric Manifold From Level Set Data Via Shape Gradients

Shafeequdheen P,Jyotiranjan Nayak,Vijayakrishna Rowthu

Main category: cs.GR

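TL;DR: The paper presents a variational, shape-gradient-based method for reconstructing a triangulated surface of a 2D simple manifold, using an energy functional minimized by gradient descent to achieve boundary matching and smoothness.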

Details

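Motivation: Reconstructing a shape (a 2D simple manifold) from a given level set requires a method that matches the target boundary while keeping the boundary smooth. Method: An energy functional that depends on local shape characteristics is minimized iteratively by gradient descent, yielding a triangulated surface mesh. Result: The reconstructed triangulated surface mesh accurately matches the target boundary and ensures boundary smoothness. Conclusion: The method achieves efficient and smooth shape reconstruction through a variational formulation and shape gradients. Abstract: A variational approach to the reconstruction of a shape (2D simple manifolds) as a triangulated surface from a given level set using shape gradients is presented. It involves an energy functional that depends on the local shape characteristics of the surface. Minimization of the energy through an iterative procedure using the gradient descent method yields a triangulated surface mesh which matches the boundary of the object of interest and this model ensures the smoothness of the boundary.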

[129] Sparse Ellipsoidal Radial Basis Function Network for Point Cloud Surface Representation

Bobo Lian,Dandan Wang,Chenjian Wu,Minxin Chen

Main category: cs.GR

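TL;DR: The paper proposes a machine learning method based on sparse ellipsoidal radial basis function networks to approximate the signed distance function (SDF) of a point cloud, giving a compact and accurate surface representation. A dynamic multi-objective optimization strategy balances sparsity and accuracy, and CUDA parallel computation improves efficiency.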

Details

Motivation: Point cloud surface representation is a fundamental problem in computer graphics and vision, and existing methods fall short in accuracy and computational efficiency. Method: Sparse ellipsoidal radial basis functions (ERBFs) approximate the SDF, combined with a dynamic multi-objective optimization strategy, parallel computation, and a hierarchical octree-based refinement strategy. Result: On multiple benchmark datasets, the method outperforms existing sparse representation methods in accuracy, robustness, and computational efficiency. Conclusion: Through sparse ERBFs and an efficient optimization strategy, the method achieves accurate and computationally efficient point cloud surface representation. Abstract: Point cloud surface representation is a fundamental problem in computer graphics and vision. This paper presents a machine learning approach for approximating the signed distance function (SDF) of a point cloud using sparse ellipsoidal radial basis function networks, enabling a compact and accurate surface representation. Given the SDF values defined on the grid points constructed from the point cloud, our method approximates the SDF accurately with as few ellipsoidal radial basis functions (ERBFs) as possible, i.e., represent the SDF of a point cloud by sparse ERBFs. To balance sparsity and approximation precision, a dynamic multi-objective optimization strategy is introduced, which adaptively adds the regularization terms and jointly optimizes the weights, centers, shapes, and orientations of ERBFs. To improve computational efficiency, a nearest-neighbor-based data structure is employed, restricting function calculations to points near each Gaussian kernel center. The computations for each kernel are further parallelized on CUDA, which significantly improves the optimization speed. Additionally, a hierarchical octree-based refinement strategy is designed for training. Specifically, the initialization and optimization of network parameters are conducted using coarse grid points in the octree lattice structure. Subsequently, fine lattice points are progressively incorporated to accelerate model convergence and enhance training efficiency. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms previous sparse representation approaches in terms of accuracy, robustness, and computational efficiency. The corresponding code is publicly available at https://github.com/lianbobo/SE-RBFNet.git.
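The abstract lists weights, centers, shapes, and orientations as the optimized ERBF parameters. A minimal sketch of how such a network could be evaluated at a query point follows; the exact parameterization (a rotation matrix plus per-axis scales inside a Gaussian) and the names `erbf_value` and `sdf_approx` are assumptions, not taken from the SE-RBFNet code.

```python
import math


def erbf_value(x, center, rotation, scales):
    """Evaluate one ellipsoidal Gaussian kernel at a 3D point x.

    rotation is a 3x3 matrix (list of rows) mapping world offsets into the
    kernel's local frame; scales are the ellipsoid's per-axis widths.
    """
    offset = [x[i] - center[i] for i in range(3)]
    local = [sum(rotation[r][i] * offset[i] for i in range(3)) for r in range(3)]
    q = sum((local[r] / scales[r]) ** 2 for r in range(3))
    return math.exp(-q)


def sdf_approx(x, kernels):
    """Approximate the SDF at x as a weighted sum of ERBF kernels."""
    return sum(
        k["weight"] * erbf_value(x, k["center"], k["rotation"], k["scales"])
        for k in kernels
    )


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kernels = [{"weight": 2.0, "center": (0.0, 0.0, 0.0),
                "rotation": identity, "scales": (1.0, 2.0, 4.0)}]
    # One scale-length along any local axis gives the same falloff, exp(-1).
    print(sdf_approx((1.0, 0.0, 0.0), kernels), sdf_approx((0.0, 0.0, 4.0), kernels))
```

Because each kernel decays quickly away from its center, restricting evaluation to nearby points (the nearest-neighbor structure mentioned in the abstract) loses little accuracy while skipping most of the work.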

[130] GarmentImage: Raster Encoding of Garment Sewing Patterns with Diverse Topologies

Yuki Tatsukawa,Anran Qi,I-Chao Shen,Takeo Igarashi

Main category: cs.GR

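TL;DR: The paper proposes GarmentImage, a unified raster-based representation of sewing patterns that addresses the discontinuity and limited generalization of conventional vector representations in machine learning.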

Details

Motivation: Conventional vector representations suffer from discontinuities in latent space and limited generalization to unseen topologies when used in machine learning. Method: GarmentImage is a raster representation on multi-channel regular grids that encodes the geometry, topology, and placement of a sewing pattern. Result: GarmentImage outperforms conventional vector representations in latent space exploration, text-based editing, and image-to-pattern prediction. Conclusion: GarmentImage offers a better representation for machine learning on sewing patterns and significantly improves model performance. Abstract: Garment sewing patterns are the design language behind clothing, yet their current vector-based digital representations weren't built with machine learning in mind. Vector-based representation encodes a sewing pattern as a discrete set of panels, each defined as a sequence of lines and curves, stitching information between panels and the placement of each panel around a body. However, this representation causes two major challenges for neural networks: discontinuity in latent space between patterns with different topologies and limited generalization to garments with unseen topologies in the training data. In this work, we introduce GarmentImage, a unified raster-based sewing pattern representation. GarmentImage encodes a garment sewing pattern's geometry, topology and placement into multi-channel regular grids. Machine learning models trained on GarmentImage achieve seamless transitions between patterns with different topologies and show better generalization capabilities compared to models trained on vector-based representation. We demonstrate the effectiveness of GarmentImage across three applications: pattern exploration in latent space, text-based pattern editing, and image-to-pattern prediction. The results show that GarmentImage achieves superior performance on these applications using only simple convolutional networks.

cs.CL [Back]

[131] Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

Vaidehi Patil,Yi-Lin Sung,Peter Hase,Jie Peng,Tianlong Chen,Mohit Bansal

Main category: cs.CL

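TL;DR: The paper examines the challenge of unlearning sensitive information in multimodal large language models (MLLMs) and introduces a new benchmark, UnLOK-VQA, together with an attack-and-defense framework for evaluating multimodal unlearning methods.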

Details

Motivation: Because MLLMs may learn sensitive information from multimodal data, how to make them forget it effectively has become an important research question. Method: A high-quality dataset of image-text pairs is built through automated generation followed by manual filtering, and six defense objectives are evaluated against seven attack methods. Result: Multimodal attacks outperform single-modality attacks, the most effective defense removes answer information from the model's internal states, and larger models are more robust. Conclusion: UnLOK-VQA provides a rigorous benchmark for unlearning research in MLLMs and suggests that model scale helps safety. Abstract: LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs as they integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates varying-proximity samples for testing generalization and specificity, followed by manual filtering for maintaining high quality. We then evaluate six defense objectives against seven attacks (four whitebox, three blackbox), including a novel whitebox method leveraging interpretability of hidden states. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing unlearning in MLLMs.

[132] MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling

Abdoul Majid O. Thiombiano,Brahim Hnich,Ali Ben Mrad,Mohamed Wiem Mkaouer

Main category: cs.CL

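TL;DR: MoxE combines xLSTM with the MoE framework and uses entropy-aware routing and auxiliary losses to improve the efficiency and scalability of LLMs.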

Details

Motivation: Address the scalability and efficiency challenges of large language models. Method: Combine xLSTM with MoE and introduce an entropy-aware routing mechanism and auxiliary losses. Result: Significant gains in efficiency and performance over existing approaches. Conclusion: MoxE is a notable advance for scalable LLM architectures. Abstract: This paper introduces MoxE, a novel architecture that synergistically combines the Extended Long Short-Term Memory (xLSTM) with the Mixture of Experts (MoE) framework to address critical scalability and efficiency challenges in large language models (LLMs). The proposed method effectively leverages xLSTM's innovative memory structures while strategically introducing sparsity through MoE to substantially reduce computational overhead. At the heart of our approach is a novel entropy-based routing mechanism, designed to dynamically route tokens to specialized experts, thereby ensuring efficient and balanced resource utilization. This entropy awareness enables the architecture to effectively manage both rare and common tokens, with mLSTM blocks being favored to handle rare tokens. To further enhance generalization, we introduce a suite of auxiliary losses, including entropy-based and group-wise balancing losses, ensuring robust performance and efficient training. Theoretical analysis and empirical evaluations rigorously demonstrate that MoxE achieves significant efficiency gains and enhanced effectiveness compared to existing approaches, marking a notable advancement in scalable LLM architectures.

[133] SymPlanner: Deliberate Planning in Language Models with Symbolic Representation

Siheng Xiong,Jieyu Zhou,Zhangding Liu,Yusen Su

Main category: cs.CL

TL;DR: SymPlanner is a new framework that improves language models on multi-step planning tasks by coupling them with a symbolic environment.

Details

Motivation: Address the lack of coherence and of grounding in external constraints when language models plan over multiple steps. Method: A symbolic environment serves as an explicit world model, and Iterative Correction (IC) and Contrastive Ranking (CR) refine the planning process. Result: On PlanBench, SymPlanner produces plans that are more coherent, diverse, and verifiable than pure natural language baselines. Conclusion: By grounding planning in a symbolic environment, SymPlanner strengthens the planning ability of language models and yields more reliable solutions. Abstract: Planning remains a core challenge for language models (LMs), particularly in domains that require coherent multi-step action sequences grounded in external constraints. We introduce SymPlanner, a novel framework that equips LMs with structured planning capabilities by interfacing them with a symbolic environment that serves as an explicit world model. Rather than relying purely on natural language reasoning, SymPlanner grounds the planning process in a symbolic state space, where a policy model proposes actions and a symbolic environment deterministically executes and verifies their effects. To enhance exploration and improve robustness, we introduce Iterative Correction (IC), which refines previously proposed actions by leveraging feedback from the symbolic environment to eliminate invalid decisions and guide the model toward valid alternatives. Additionally, Contrastive Ranking (CR) enables fine-grained comparison of candidate plans by evaluating them jointly. We evaluate SymPlanner on PlanBench, demonstrating that it produces more coherent, diverse, and verifiable plans than pure natural language baselines.

[134] On the effectiveness of Large Language Models in the mechanical design domain

Daniele Grandi,Fabian Riquelme

Main category: cs.CL

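TL;DR: The study examines how large language models perform in the mechanical engineering domain, evaluating them on unsupervised tasks built from the ABC dataset, and finds that the models outperform the baselines on specific tasks.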

Details

Motivation: Understand how large language models perform in the mechanical engineering domain and how well they apply to domain-specific data. Method: Using semantic data from the ABC dataset, the authors develop two unsupervised tasks, a binary sentence-pair classification task and a zero-shot classification task, and tune the model by adjusting learning rates, dropout values, and sequence length and by adding a multi-head attention layer. Result: Accuracy is 0.62 on the binary sentence-pair classification task and 0.386 (top-1) on the zero-shot classification task, where the model outperforms the baselines by a wide margin. Conclusion: The study reveals specific failure modes that arise when learning from language in this domain, pointing to directions for future improvement. Abstract: In this work, we seek to understand the performance of large language models in the mechanical engineering domain. We leverage the semantic data found in the ABC dataset, specifically the assembly names that designers assigned to the overall assemblies, and the individual semantic part names that were assigned to each part. After pre-processing the data we developed two unsupervised tasks to evaluate how different model architectures perform on domain-specific data: a binary sentence-pair classification task and a zero-shot classification task. We achieved a 0.62 accuracy for the binary sentence-pair classification task with a fine-tuned model that focuses on fighting over-fitting: 1) modifying learning rates, 2) dropout values, 3) sequence length, and 4) adding a multi-head attention layer. Our model on the zero-shot classification task outperforms the baselines by a wide margin, and achieves a top-1 classification accuracy of 0.386. The results shed some light on the specific failure modes that arise when learning from language in this domain.

[135] AI agents may be worth the hype but not the resources (yet): An initial exploration of machine translation quality and costs in three language pairs in the legal and news domains

Vicent Briva Iglesias,Gokhan Dogru

Main category: cs.CL

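TL;DR: The paper compares five machine translation paradigms and finds that conventional NMT performs best in automatic evaluation, while a reasoning-enhanced LLM does better in human evaluation; multi-agent workflows, however, are costly.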

Details

Motivation: Examine how LLMs and multi-agent orchestration actually perform in machine translation compared with conventional NMT. Method: Five translation paradigms (Google Translate, GPT-4o, o1-preview, and two multi-agent workflows) are evaluated automatically and by human experts on test data drawn from a legal contract and news prose. Result: NMT leads in automatic evaluation, o1-preview performs best in human evaluation, and the multi-agent workflows cost significantly more. Conclusion: The authors recommend multidimensional, cost-aware evaluation and point to leaner coordination strategies and hybrid pipelines as research directions. Abstract: Large language models (LLMs) and multi-agent orchestration are touted as the next leap in machine translation (MT), but their benefits relative to conventional neural MT (NMT) remain unclear. This paper offers an empirical reality check. We benchmark five paradigms, Google Translate (strong NMT baseline), GPT-4o (general-purpose LLM), o1-preview (reasoning-enhanced LLM), and two GPT-4o-powered agentic workflows (sequential three-stage and iterative refinement), on test data drawn from a legal contract and news prose in three English-source pairs: Spanish, Catalan and Turkish. Automatic evaluation is performed with COMET, BLEU, chrF2 and TER; human evaluation is conducted with expert ratings of adequacy and fluency; efficiency with total input-plus-output token counts mapped to April 2025 pricing. Automatic scores still favour the mature NMT system, which ranks first in seven of twelve metric-language combinations; o1-preview ties or places second in most remaining cases, while both multi-agent workflows trail. Human evaluation reverses part of this narrative: o1-preview produces the most adequate and fluent output in five of six comparisons, and the iterative agent edges ahead once, indicating that reasoning layers capture semantic nuance undervalued by surface metrics. Yet these qualitative gains carry steep costs. The sequential agent consumes roughly five times, and the iterative agent fifteen times, the tokens used by NMT or single-pass LLMs. We advocate multidimensional, cost-aware evaluation protocols and highlight research directions that could tip the balance: leaner coordination strategies, selective agent activation, and hybrid pipelines combining single-pass LLMs with targeted agent intervention.

[136] PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents

Takyoung Kim,Janvijay Singh,Shuhaib Mehri,Emre Can Acikgoz,Sagnik Mukherjee,Nimet Beyza Bozdag,Sumuk Shashidhar,Gokhan Tur,Dilek Hakkani-Tür

Main category: cs.CL

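TL;DR: The paper proposes PIPA, an evaluation protocol that uses the POMDP paradigm to assess the full behavioral process of task planning agents, stressing that user satisfaction depends on intermediate behaviors as well as task completion.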

Details

Motivation: Existing benchmarks evaluate agents mainly by task completion and overlook user satisfaction with the interaction as a whole. Method: The PIPA protocol models the behavioral process of interactive task planning agents as a POMDP and provides a set of atomic evaluation criteria. Result: Agents perform differently across behavioral stages, and user satisfaction is shaped by both outcomes and intermediate behaviors. Conclusion: PIPA offers a comprehensive assessment of agent performance; future work could explore multi-agent systems and the limitations of user simulators. Abstract: The growing capabilities of large language models (LLMs) in instruction-following and context-understanding lead to the era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process and not only the end result. To address this gap, we propose PIPA, a unified evaluation protocol that conceptualizes the behavioral process of interactive task planning agents within a partially observable Markov Decision Process (POMDP) paradigm. The proposed protocol offers a comprehensive assessment of agent performance through a set of atomic evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within the agent's decision-making pipeline. Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.

[137] Always Tell Me The Odds: Fine-grained Conditional Probability Estimation

Liaoyaqi Wang,Zhengping Jiang,Anqi Liu,Benjamin Van Durme

Main category: cs.CL

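TL;DR: The paper presents a state-of-the-art model for fine-grained probability estimation; by combining human and synthetic data, scaling to larger models, and improving supervision, it substantially improves LLM probability predictions under uncertainty.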

Details

Motivation: LLMs still struggle to make accurate, well-calibrated probabilistic predictions under uncertainty or partial information, and reliable estimation of that uncertainty remains understudied. Method: Combine human and synthetic data creation and assessment, scale to larger models, and improve supervision to produce a set of strong and precise probability estimation models. Result: On tasks that rely on conditional probability estimation, the approach outperforms existing fine-tuned and prompting-based methods by a large margin. Conclusion: These systematic improvements substantially strengthen the models' probability predictions under uncertainty. Abstract: We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context. Recent advances in large language models (LLMs) have significantly enhanced their reasoning capabilities, particularly on well-defined tasks with complete information. However, LLMs continue to struggle with making accurate and well-calibrated probabilistic predictions under uncertainty or partial information. While incorporating uncertainty into model predictions often boosts performance, obtaining reliable estimates of that uncertainty remains understudied. In particular, LLM probability estimates tend to be coarse and biased towards more frequent numbers. Through a combination of human and synthetic data creation and assessment, scaling to larger models, and better supervision, we propose a set of strong and precise probability estimation models. We conduct systematic evaluations across tasks that rely on conditional probability estimation and show that our approach consistently outperforms existing fine-tuned and prompting-based methods by a large margin.

[138] A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

Sihyeong Park,Sungryeol Jeon,Chaelyn Lee,Seokhun Jeon,Byung-Soo Kim,Jemin Lee

Main category: cs.CL

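TL;DR: The paper comprehensively evaluates 25 open-source and commercial LLM inference engines, analyzing their ease of use, ease of deployment, general-purpose support, scalability, and performance, and discusses future research directions.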

Details

Motivation: As LLMs are used across many applications, inference costs have risen significantly, yet a systematic study of inference engines is lacking. Method: Evaluate 25 inference engines, analyzing their design goals, optimization techniques, ecosystem maturity, and performance and cost policies. Result: A detailed assessment of inference engines, with future research directions identified. Conclusion: The survey gives researchers and developers practical guidance for selecting and designing optimized LLM inference engines, and a public repository tracks ongoing developments. Abstract: Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open source inference engines and handle the performance and cost policy of commercial solutions. We outline future research directions that include support for complex LLM-based services, support of various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/sihyeong/Awesome-LLM-Inference-Engine

[139] High-Fidelity Pseudo-label Generation by Large Language Models for Training Robust Radiology Report Classifiers

Brian Wong,Kaito Tanaka

Main category: cs.CL

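TL;DR: DeBERTa-RAD is a two-stage framework that combines LLM pseudo-labeling with DeBERTa-based knowledge distillation for efficient and accurate labeling of chest X-ray reports.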

Details

Motivation: Labeling chest X-ray reports is essential for downstream tasks, but traditional NLP methods are limited by the reports' high variability, complexity, and expressions of negation and uncertainty, while applying LLMs directly is constrained by computational cost and speed. Method: The DeBERTa-RAD framework (1) uses an advanced LLM to generate high-quality pseudo-labels and (2) trains a DeBERTa-Base model through knowledge distillation. Result: On the MIMIC-500 benchmark, it reaches a Macro F1 of 0.9120, significantly outperforming rule-based systems, fine-tuned transformers, and direct LLM inference, with fast inference. Conclusion: Combining LLM capabilities with an efficient distilled model can overcome the data annotation bottleneck and deliver high-performance medical text processing. Abstract: Automated labeling of chest X-ray reports is essential for enabling downstream tasks such as training image-based diagnostic models, population health studies, and clinical decision support. However, the high variability, complexity, and prevalence of negation and uncertainty in these free-text reports pose significant challenges for traditional Natural Language Processing methods. While large language models (LLMs) demonstrate strong text understanding, their direct application for large-scale, efficient labeling is limited by computational cost and speed. This paper introduces DeBERTa-RAD, a novel two-stage framework that combines the power of state-of-the-art LLM pseudo-labeling with efficient DeBERTa-based knowledge distillation for accurate and fast chest X-ray report labeling. We leverage an advanced LLM to generate high-quality pseudo-labels, including certainty statuses, for a large corpus of reports. Subsequently, a DeBERTa-Base model is trained on this pseudo-labeled data using a tailored knowledge distillation strategy. Evaluated on the expert-annotated MIMIC-500 benchmark, DeBERTa-RAD achieves a state-of-the-art Macro F1 score of 0.9120, significantly outperforming established rule-based systems, fine-tuned transformer models, and direct LLM inference, while maintaining a practical inference speed suitable for high-throughput applications. Our analysis shows particular strength in handling uncertain findings. This work demonstrates a promising path to overcome data annotation bottlenecks and achieve high-performance medical text processing through the strategic combination of LLM capabilities and efficient student models trained via distillation.

[140] Efficient Shapley Value-based Non-Uniform Pruning of Large Language Models

Chuan Sun,Han Yu,Lizhen Cui

Main category: cs.CL

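TL;DR: The paper proposes a Shapley value-based non-uniform pruning method (SVNP) that quantifies each transformer layer's contribution to model performance and assigns tailored pruning budgets to different layers, significantly improving the performance of pruned models.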

Details

Motivation: Traditional layer-wise uniform pruning ignores differences in importance among transformer layers, which degrades performance. SVNP aims to retain critical parameters through non-uniform pruning. Method: SVNP uses Shapley values to quantify each layer's contribution and introduces a sliding-window approximation to reduce computational overhead. Experiments cover models including LLaMA and OPT. Result: At 70% sparsity, SVNP lowers perplexity (PPL) relative to SparseGPT by 18.01% on LLaMA-7B and 19.55% on LLaMA-13B. Conclusion: Non-uniform pruning effectively improves the performance of pruned models and offers a new direction for efficient LLM compression. Abstract: Pruning large language models (LLMs) is a promising solution for reducing model sizes and computational complexity while preserving performance. Traditional layer-wise pruning methods often adopt a uniform sparsity approach across all layers, which leads to suboptimal performance because it does not account for the varying significance of individual transformer layers within the model. To this end, we propose the Shapley Value-based Non-Uniform Pruning method for LLMs. This approach quantifies the contribution of each transformer layer to the overall model performance, enabling the assignment of tailored pruning budgets to different layers to retain critical parameters. To further improve efficiency, we design the Sliding Window-based Shapley Value approximation method. It substantially reduces computational overhead compared to exact SV calculation methods. Extensive experiments on various LLMs including LLaMA-v1, LLaMA-v2 and OPT demonstrate the effectiveness of the proposed approach. The results reveal that non-uniform pruning significantly enhances the performance of pruned models. Notably, the proposed method achieves a reduction in perplexity (PPL) of 18.01% and 19.55% on LLaMA-7B and LLaMA-13B, respectively, compared to SparseGPT at 70% sparsity.
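To make the idea concrete, here is a rough sketch of how layer-level Shapley values could be estimated inside small windows and then turned into per-layer sparsity targets. Everything in it is an assumption for illustration: the function names, the averaging over overlapping windows, the linear allocation rule, and the toy `value_fn` are not taken from the paper, whose actual approximation and budget rule may differ.

```python
from itertools import combinations
from math import factorial


def shapley_in_window(layers, value_fn):
    """Exact Shapley value of each layer, treating `layers` as the whole player set."""
    n = len(layers)
    phi = {}
    for layer in layers:
        others = [l for l in layers if l != layer]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {layer}) - value_fn(s))
        phi[layer] = total
    return phi


def sliding_window_shapley(num_layers, window, value_fn):
    """Average each layer's Shapley value over every window that contains it."""
    assert 1 <= window <= num_layers
    sums = [0.0] * num_layers
    counts = [0] * num_layers
    for start in range(num_layers - window + 1):
        layers = list(range(start, start + window))
        for layer, v in shapley_in_window(layers, value_fn).items():
            sums[layer] += v
            counts[layer] += 1
    return [s / c for s, c in zip(sums, counts)]


def allocate_sparsity(importance, target, spread=0.1):
    """Hypothetical rule: prune important layers less, keeping the mean at `target`."""
    mean = sum(importance) / len(importance)
    scale = max(abs(v - mean) for v in importance) or 1.0
    return [target - spread * (v - mean) / scale for v in importance]


# Toy stand-in for "model quality when these layers are kept" (hypothetical)
toy_weights = [4.0, 2.0, 1.0, 1.0]
toy_value = lambda kept: sum(toy_weights[i] for i in kept)
importance = sliding_window_shapley(4, 2, toy_value)
sparsity = allocate_sparsity(importance, target=0.7)
```

The window keeps the cost at roughly 2^window evaluations per position instead of 2^L for all L layers, which is the kind of saving a sliding-window approximation is meant to provide. In a real setting `value_fn` would require a model evaluation (for example, negative perplexity on calibration data), so it dominates the runtime.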

[141] Same evaluation, more tokens: On the effect of input length for machine translation evaluation using Large Language Models

Tobias Domhan,Dawei Zhu

Main category: cs.CL

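TL;DR: The study finds that text length significantly affects LLM-based translation quality evaluation: longer texts yield fewer annotated error spans and lower system ranking accuracy. Focus Sentence Prompting (FSP) and fine-tuning mitigate this bias.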

Details

Motivation: Evaluating translation quality for long documents is challenging, and LLMs may be biased when evaluating long texts. Method: The authors evaluate granularity-aligned prompting, FSP, and fine-tuning as ways to address length bias in LLM-based evaluation. Result: FSP and fine-tuning largely mitigate the effect of text length and make LLMs more reliable for long-document evaluation. Conclusion: With targeted adjustments, LLMs can be used more reliably for long-document translation quality evaluation. Abstract: Accurately evaluating machine-translated text remains a long-standing challenge, particularly for long documents. Recent work has shown that large language models (LLMs) can serve as reliable and interpretable sentence-level translation evaluators via MQM error span annotations. With modern LLMs supporting larger context windows, a natural question arises: can we feed entire document translations into an LLM for quality assessment? Ideally, evaluation should be invariant to text length, producing consistent error spans regardless of input granularity. However, our analysis shows that text length significantly impacts evaluation: longer texts lead to fewer error spans and reduced system ranking accuracy. To address this limitation, we evaluate several strategies, including granularity-aligned prompting, Focus Sentence Prompting (FSP), and a fine-tuning approach to better align LLMs with the evaluation task. The latter two methods largely mitigate this length bias, making LLMs more reliable for long-form translation evaluation.

[142] A Multimodal Framework for Explainable Evaluation of Soft Skills in Educational Environments

Jared D. T. Guerrero-Sosa,Francisco P. Romero,Víctor Hugo Menéndez-Domínguez,Jesus Serrano-Guerrero,Andres Montoro-Montarroso,Jose A. Olivas

Main category: cs.CL

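TL;DR: This paper proposes a multimodal analysis approach based on fuzzy logic and a linguistic model of phenomena to assess undergraduates' soft skills. By using computational perceptions to capture the nuances of complex behaviour, it improves the interpretability and reliability of the assessment.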

Details

Motivation: In higher education, unbiased assessment of soft skills is a major challenge, and traditional methods struggle to capture their complexity and uncertainty. Method: Using fuzzy logic and a linguistic model of phenomena combined with multimodal analysis (such as facial expression and gesture recognition), the authors developed an assessment tool that quantifies subtle expressions of soft skills. Result: Experiments show that the approach effectively integrates multimodal data to produce consistent, meaningful soft skill assessments and significantly improves score quality. Conclusion: Multimodal integration significantly improves the transparency and understandability of soft skills assessment, giving educational stakeholders a reliable evaluation tool. Abstract: In the rapidly evolving educational landscape, the unbiased assessment of soft skills is a significant challenge, particularly in higher education. This paper presents a fuzzy logic approach that employs a Granular Linguistic Model of Phenomena integrated with multimodal analysis to evaluate soft skills in undergraduate students. By leveraging computational perceptions, this approach enables a structured breakdown of complex soft skill expressions, capturing nuanced behaviours with high granularity and addressing their inherent uncertainties, thereby enhancing interpretability and reliability. Experiments were conducted with undergraduate students using a developed tool that assesses soft skills such as decision-making, communication, and creativity. This tool identifies and quantifies subtle aspects of human interaction, such as facial expressions and gesture recognition. The findings reveal that the framework effectively consolidates multiple data inputs to produce meaningful and consistent assessments of soft skills, showing that integrating multiple modalities into the evaluation process significantly improves the quality of soft skills scores, making the assessment work transparent and understandable to educational stakeholders.

[143] Distinguishing AI-Generated and Human-Written Text Through Psycholinguistic Analysis

Chidimma Opara

Main category: cs.CL

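TL;DR: This paper proposes a framework that combines stylometry with psycholinguistic theory to distinguish AI-generated text from human writing, aiming to provide a reliable authorship verification tool for educational settings.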

Details

Motivation: As AI-generated text grows more sophisticated, education urgently needs accurate, transparent detection tools to verify authorship. Method: The study integrates 31 stylometric features with psycholinguistic theory, mapping them to cognitive processes (such as lexical retrieval and discourse planning), and proposes an interpretable detection framework. Result: The framework highlights psycholinguistic patterns unique to human writing, offering a reliable way to distinguish AI-generated from human-written text. Conclusion: By combining computational linguistics and cognitive science, the study lays groundwork for tools that preserve academic integrity in the era of generative AI. Abstract: The increasing sophistication of AI-generated texts highlights the urgent need for accurate and transparent detection tools, especially in educational settings, where verifying authorship is essential. Existing literature has demonstrated that the application of stylometric features with machine learning classifiers can yield excellent results. Building on this foundation, this study proposes a comprehensive framework that integrates stylometric analysis with psycholinguistic theories, offering a clear and interpretable approach to distinguishing between AI-generated and human-written texts. This research specifically maps 31 distinct stylometric features to cognitive processes such as lexical retrieval, discourse planning, cognitive load management, and metacognitive self-monitoring. In doing so, it highlights the unique psycholinguistic patterns found in human writing. Through the intersection of computational linguistics and cognitive science, this framework contributes to the development of reliable tools aimed at preserving academic integrity in the era of generative AI.

[144] $\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge

Core Francisco Park,Zechen Zhang,Hidenori Tanaka

Main category: cs.CL

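TL;DR: The paper introduces $\textit{New News}$, a dataset for evaluating a model's ability to perform downstream tasks after understanding news. It finds a substantial gap between fine-tuning and in-context learning (the FT-ICL gap) and proposes $\textit{System-2 Fine-tuning}$ (Sys2-FT), which uses self-generated data (such as self-QA) to improve in-weight learning.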

Details

Motivation: Large language models (LLMs) struggle to internalize news into their weights through fine-tuning. Method: The authors build the $\textit{New News}$ dataset, propose Sys2-FT (for example, the self-QA protocol), and evaluate it on the Qwen 2.5 model family. Result: Sys2-FT significantly improves models' in-weight learning of news; the authors also discover the $\textit{contextual shadowing effect}$. Conclusion: Sys2-FT is effective, and there is preliminary evidence of a scaling law. Abstract: Humans and intelligent animals can effortlessly internalize new information ("news") and accurately extract the implications for performing downstream tasks. While large language models (LLMs) can achieve this through in-context learning (ICL) when the news is explicitly given as context, fine-tuning remains challenging for the models to consolidate learning in weights. In this paper, we introduce $\textit{New News}$, a dataset composed of hypothetical yet plausible news spanning multiple domains (mathematics, coding, discoveries, leaderboards, events), accompanied by downstream evaluation questions whose correct answers critically depend on understanding and internalizing the news. We first demonstrate a substantial gap between naive fine-tuning and in-context learning (FT-ICL gap) on our news dataset. To address this gap, we explore a suite of self-play data generation protocols -- paraphrases, implications and Self-QAs -- designed to distill the knowledge from the model with context into the weights of the model without the context, which we term $\textit{System-2 Fine-tuning}$ (Sys2-FT). We systematically evaluate ICL and Sys2-FT performance across data domains and model scales with the Qwen 2.5 family of models. Our results demonstrate that the self-QA protocol of Sys2-FT significantly improves models' in-weight learning of the news. Furthermore, we discover the $\textit{contextual shadowing effect}$, where training with the news $\textit{in context}$ followed by its rephrases or QAs degrades learning of the news. Finally, we show preliminary evidence of an emerging scaling law of Sys2-FT.

[145] Intra-Layer Recurrence in Transformers for Language Modeling

Anthony Nguyen,Wenjun Lin

Main category: cs.CL

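TL;DR: The paper proposes Intra-Layer Recurrence (ILR), which improves the parameter efficiency of transformer models by selectively applying recurrence to specific layers within a single forward pass. Experiments show that allocating more iterations to earlier layers works best.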

Details

Motivation: Transformer models excel at natural language processing, but increasing depth sharply increases parameter counts. Existing methods typically apply recurrence indiscriminately across entire blocks of layers, which is inefficient. Method: Intra-Layer Recurrence (ILR) applies recurrence only to specific layers within a single forward pass, rather than to entire blocks. Result: Experiments show that allocating more iterations to earlier layers works best. Conclusion: ILR offers a promising direction for optimizing recurrent structures in transformer architectures. Abstract: Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by reprocessing layers multiple times, they often apply recurrence indiscriminately across entire blocks of layers. In this work, we investigate Intra-Layer Recurrence (ILR), a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results. These findings suggest that ILR offers a promising direction for optimizing recurrent structures in transformer architectures.
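The core idea in this entry, repeating individual layers a chosen number of times within one forward pass, can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: `ilr_forward` and `reuse_map` are hypothetical names, and plain Python callables stand in for transformer layers.

```python
def ilr_forward(x, layers, reuse_map):
    """Apply layers[i] to x reuse_map[i] times in a row, within one forward pass."""
    if len(layers) != len(reuse_map):
        raise ValueError("need one repeat count per layer")
    for layer, repeats in zip(layers, reuse_map):
        for _ in range(repeats):
            # The same layer (and so the same parameters) is reused on its own output
            x = layer(x)
    return x


# Toy "layers": simple functions standing in for transformer blocks (hypothetical)
toy_layers = [lambda h: h + 1, lambda h: h * 2]
early_heavy = ilr_forward(1, toy_layers, [2, 1])  # first layer applied twice
late_heavy = ilr_forward(1, toy_layers, [1, 2])   # second layer applied twice
```

Both configurations perform three layer applications with the same two sets of parameters; they differ only in where the extra iteration is spent, which is the allocation question the abstract studies.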

[146] Positional Attention for Efficient BERT-Based Named Entity Recognition

Mo Sun,Siheng Xiong,Yuankai Cai,Bowen Zuo

Main category: cs.CL

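TL;DR: The paper proposes a BERT-based NER framework that introduces positional attention and reuses pre-trained parameters, reducing training cost while maintaining high accuracy.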

Details

Motivation: BERT performs well on NER, but fine-tuning it from scratch is costly and time-consuming, so a more efficient solution is needed. Method: The framework combines positional attention mechanisms with pre-trained parameters to optimize BERT for NER. Result: It achieves strong performance on a Kaggle dataset with fewer training epochs. Conclusion: The framework offers a practical way to reduce the training cost of BERT-based NER systems while maintaining high accuracy. Abstract: This paper presents a framework for Named Entity Recognition (NER) leveraging the Bidirectional Encoder Representations from Transformers (BERT) model in natural language processing (NLP). NER is a fundamental task in NLP with broad applicability across downstream applications. While BERT has established itself as a state-of-the-art model for entity recognition, fine-tuning it from scratch for each new application is computationally expensive and time-consuming. To address this, we propose a cost-efficient approach that integrates positional attention mechanisms into the entity recognition process and enables effective customization using pre-trained parameters. The framework is evaluated on a Kaggle dataset derived from the Groningen Meaning Bank corpus and achieves strong performance with fewer training epochs. This work contributes to the field by offering a practical solution for reducing the training cost of BERT-based NER systems while maintaining high accuracy.

[147] Humans can learn to detect AI-generated texts, or at least learn when they can't

Jiří Milička,Anna Marklová,Ondřej Drobil,Eva Pospíšilová

Main category: cs.CL

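TL;DR: The study finds that training with immediate feedback helps people distinguish human writing from AI-generated text more accurately and calibrate their self-perceived competence. The feedback group improved significantly in accuracy and confidence calibration.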

Details

Motivation: To examine whether people can learn from feedback to distinguish human-written from AI-generated text, and which criteria they rely on (such as textual style and readability). Method: Texts generated with GPT-4o were compared with human-written texts; 255 participants were randomly assigned to a feedback group or a no-feedback group, and accuracy, confidence, and other data were recorded. Result: Accuracy and confidence calibration improved significantly in the feedback group, while the no-feedback group made the most errors when most confident. Conclusion: Training with feedback effectively improves the ability to tell the two apart and corrects misconceptions about AI-generated text, which is especially relevant for education. Abstract: This study investigates whether individuals can learn to accurately discriminate between human-written and AI-produced texts when provided with immediate feedback, and if they can use this feedback to recalibrate their self-perceived competence. We also explore the specific criteria individuals rely upon when making these decisions, focusing on textual style and perceived readability. We used GPT-4o to generate several hundred texts across various genres and text types comparable to Koditex, a multi-register corpus of human-written texts. We then presented randomized text pairs to 255 Czech native speakers who identified which text was human-written and which was AI-generated. Participants were randomly assigned to two conditions: one receiving immediate feedback after each trial, the other receiving no feedback until experiment completion. We recorded accuracy in identification, confidence levels, response times, and judgments about text readability along with demographic data and participants' engagement with AI technologies prior to the experiment. Participants receiving immediate feedback showed significant improvement in accuracy and confidence calibration. Participants initially held incorrect assumptions about AI-generated text features, including expectations about stylistic rigidity and readability. Notably, without feedback, participants made the most errors precisely when feeling most confident -- an issue largely resolved among the feedback group. The ability to differentiate between human and AI-generated texts can be effectively learned through targeted training with explicit feedback, which helps correct misconceptions about AI stylistic features and readability, as well as potential other variables that were not explored, while facilitating more accurate self-assessment. This finding might be particularly important in educational contexts.

Yiwen Lu,Siheng Xiong,Zhaowei Li

Main category: cs.CL

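TL;DR: The paper proposes a framework for large-scale sentiment and topic analysis of Twitter, covering data collection, sentiment labeling, topic modeling, and visualization.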

Details

Motivation: To study how sentiment and topics are distributed on social media in dynamic geopolitical contexts and to provide a scalable analysis method. Method: Data is collected with conflict-specific keywords, labeled for sentiment with pre-trained models, analyzed against contextual features, topic-modeled with LDA, and explored through an interactive visualization tool. Result: The work relates sentiment to contextual features, identifies latent topics, and provides a visualization tool to support exploration. Conclusion: The framework offers a scalable methodology for social media analysis in dynamic geopolitical contexts. Abstract: We present a framework for large-scale sentiment and topic analysis of Twitter discourse. Our pipeline begins with targeted data collection using conflict-specific keywords, followed by automated sentiment labeling via multiple pre-trained models to improve annotation robustness. We examine the relationship between sentiment and contextual features such as timestamp, geolocation, and lexical content. To identify latent themes, we apply Latent Dirichlet Allocation (LDA) on partitioned subsets grouped by sentiment and metadata attributes. Finally, we develop an interactive visualization interface to support exploration of sentiment trends and topic distributions across time and regions. This work contributes a scalable methodology for social media analysis in dynamic geopolitical contexts.

[149] CAMOUFLAGE: Exploiting Misinformation Detection Systems Through LLM-driven Adversarial Claim Transformation

Mazal Bethany,Nishant Vishwamitra,Cho-Yu Jason Chiang,Peyman Najafirad

Main category: cs.CL

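TL;DR: CAMOUFLAGE is an LLM-based adversarial attack that uses a prompt optimization agent and an attacker agent to bypass evidence retrieval and comparison modules, successfully deceiving evidence-based misinformation detection systems.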

Details

Motivation: Existing adversarial attacks are ineffective against multi-component misinformation detection systems, so a new method is needed to disrupt their evidence retrieval and comparison functions. Method: A two-agent system (a prompt optimization agent and an attacker agent) iteratively produces semantically equivalent adversarial rewrites that mislead the detection system. Result: Across four systems, the average attack success rate is 46.92%, while textual coherence and semantic equivalence are preserved. Conclusion: CAMOUFLAGE shows that adversarial attacks are effective against multi-component misinformation detection systems without relying on gradients or classifier outputs. Abstract: Automated evidence-based misinformation detection systems, which evaluate the veracity of short claims against evidence, lack comprehensive analysis of their adversarial vulnerabilities. Existing black-box text-based adversarial attacks are ill-suited for evidence-based misinformation detection systems, as these attacks primarily focus on token-level substitutions involving gradient or logit-based optimization strategies, which are incapable of fooling the multi-component nature of these detection systems. These systems incorporate both retrieval and claim-evidence comparison modules, which requires attacks to break the retrieval of evidence and/or the comparison module so that it draws incorrect inferences. We present CAMOUFLAGE, an iterative, LLM-driven approach that employs a two-agent system, a Prompt Optimization Agent and an Attacker Agent, to create adversarial claim rewritings that manipulate evidence retrieval and mislead claim-evidence comparison, effectively bypassing the system without altering the meaning of the claim. The Attacker Agent produces semantically equivalent rewrites that attempt to mislead detectors, while the Prompt Optimization Agent analyzes failed attack attempts and refines the prompt of the Attacker to guide subsequent rewrites. This enables larger structural and stylistic transformations of the text rather than token-level substitutions, adapting the magnitude of changes based on previous outcomes. Unlike existing approaches, CAMOUFLAGE optimizes its attack solely based on binary model decisions to guide its rewriting process, eliminating the need for classifier logits or extensive querying. We evaluate CAMOUFLAGE on four systems, including two recent academic systems and two real-world APIs, with an average attack success rate of 46.92% while preserving textual coherence and semantic equivalence to the original claims.

Jiatao Li,Yanheng Li,Xiaojun Wan

Main category: cs.CL

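TL;DR: The paper proposes the Social Worldview Taxonomy (SWT), a framework for quantifying the socio-cognitive attitudes of LLMs. It finds differences in their implicit worldviews and, through social feedback experiments, shows that these attitudes are malleable.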

Details

Motivation: To study the implicit socio-cognitive attitudes of LLMs (toward authority, equality, autonomy, and so on), filling a gap left by existing research, which has not explored these broader dimensions. Method: The authors build the SWT framework on Cultural Theory, quantify four canonical worldviews, and empirically analyze the cognitive profiles of 28 LLMs; experiments inspired by Social Referencing Theory test how social feedback affects attitudes. Result: LLMs show distinct and interpretable worldview differences, and social feedback systematically shapes their attitudes, with subtle variation between models. Conclusion: The study reveals implicit socio-cognitive biases in LLMs and their responsiveness to social feedback, guiding the development of more transparent and responsible AI technologies. Abstract: Large Language Models (LLMs) have become integral to daily life, widely adopted in communication, decision-making, and information retrieval, raising critical questions about how these systems implicitly form and express socio-cognitive attitudes or "worldviews". While existing research extensively addresses demographic and ethical biases, broader dimensions, such as attitudes toward authority, equality, autonomy, and fate, remain under-explored. In this paper, we introduce the Social Worldview Taxonomy (SWT), a structured framework grounded in Cultural Theory, operationalizing four canonical worldviews (Hierarchy, Egalitarianism, Individualism, Fatalism) into measurable sub-dimensions. Using SWT, we empirically identify distinct and interpretable cognitive profiles across 28 diverse LLMs. Further, inspired by Social Referencing Theory, we experimentally demonstrate that explicit social cues systematically shape these cognitive attitudes, revealing both general response patterns and nuanced model-specific variations. Our findings enhance the interpretability of LLMs by revealing implicit socio-cognitive biases and their responsiveness to social feedback, thus guiding the development of more transparent and socially responsible language technologies.

[151] LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load

Theo Guidroz,Diego Ardila,Jimmy Li,Adam Mansour,Paul Jhun,Nina Gonzalez,Xiang Ji,Mike Sanchez,Sujay Kakarmath,Mathias MJ Bellaiche,Miguel Ángel Garrido,Faruk Ahmed,Divyansh Choudhary,Jay Hartford,Chenwei Xu,Henry Javier Serrano Echeverria,Yifan Wang,Jeff Shaffer,Eric,Cao,Yossi Matias,Avinatan Hassidim,Dale R Webster,Yun Liu,Sho Fujiwara,Peggy Bui,Quang Duong

Main category: cs.CL

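TL;DR: The paper presents a self-refinement-based LLM approach to text simplification with minimal information loss and validates it in a large-scale randomized study.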

Details

Motivation: Information on the web (such as scientific literature and Wikipedia) often exceeds users' reading level; the goal is to make it more accessible. Method: A self-refinement approach is used to develop an LLM text simplification capability, validated in a randomized study with 4563 participants across 6 subject areas. Result: Simplified texts significantly improved participants' comprehension (3.9% absolute increase), particularly in the biomedical domain (14.6%). Conclusion: LLMs show potential for simplifying complex information, helping a broader audience understand and use expert knowledge. Abstract: Information on the web, such as scientific publications and Wikipedia, often surpasses users' reading level. To help address this, we used a self-refinement approach to develop an LLM capability for minimally lossy text simplification. To validate our approach, we conducted a randomized study involving 4563 participants and 31 texts spanning 6 broad subject areas: PubMed (biomedical scientific articles), biology, law, finance, literature/philosophy, and aerospace/computer science. Participants were randomized to viewing original or simplified texts in a subject area, and answered multiple-choice questions (MCQs) that tested their comprehension of the text. The participants were also asked to provide qualitative feedback such as task difficulty. Our results indicate that participants who read the simplified text answered more MCQs correctly than their counterparts who read the original text (3.9% absolute increase, p<0.05). This gain was most striking with PubMed (14.6%), while more moderate gains were observed for finance (5.5%), aerospace/computer science (3.8%) domains, and legal (3.5%). Notably, the results were robust to whether participants could refer back to the text while answering MCQs. The absolute accuracy decreased by up to ~9% for both original and simplified setups where participants could not refer back to the text, but the ~4% overall improvement persisted. Finally, participants' self-reported perceived ease based on a simplified NASA Task Load Index was greater for those who read the simplified text (absolute change on a 5-point scale 0.33, p<0.05). This randomized study, involving an order of magnitude more participants than prior works, demonstrates the potential of LLMs to make complex information easier to understand. Our work aims to enable a broader audience to better learn and make use of expert knowledge available on the web, improving information accessibility.

[152] Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs

Sai Krishna Mendu,Harish Yenala,Aditi Gulati,Shanu Kumar,Parag Agrawal

Main category: cs.CL

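TL;DR: The paper analyzes harmful content in the pretraining data of large language models (LLMs) and proposes a taxonomy, evaluation datasets, and a filtering model, with the aim of improving LLM safety and compliance.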

Details

Motivation: Harmful content in pretraining data can lead LLMs to spread bias and misinformation and raises ethical concerns, so it needs to be analyzed and filtered systematically. Method: The paper proposes a taxonomy of harmful content and introduces a prompt evaluation dataset, a high-accuracy Topical and Toxic Prompt (TTP), a transformer-based filtering model (HarmFormer), and a new toxicity benchmark (HAVOC). Result: The study provides a comprehensive analysis of harmful content along with tools and datasets that support safer LLM pretraining. Conclusion: The work provides important resources and solutions for safe LLM pretraining and Responsible AI (RAI) compliance. Abstract: Large language models (LLMs) have become integral to various real-world applications, leveraging massive, web-sourced datasets like Common Crawl, C4, and FineWeb for pretraining. While these datasets provide linguistic data essential for high-quality natural language generation, they often contain harmful content, such as hate speech, misinformation, and biased narratives. Training LLMs on such unfiltered data risks perpetuating toxic behaviors, spreading misinformation, and amplifying societal biases, which can undermine trust in LLM-driven applications and raise ethical concerns about their use. This paper presents a large-scale analysis of inappropriate content across these datasets, offering a comprehensive taxonomy that categorizes harmful webpages into Topical and Toxic based on their intent. We also introduce a prompt evaluation dataset, a high-accuracy Topical and Toxic Prompt (TTP), and a transformer-based model (HarmFormer) for content filtering. Additionally, we create a new multi-harm open-ended toxicity benchmark (HAVOC) and provide crucial insights into how models respond to adversarial toxic inputs. Upon publishing, we will also open-source our model signal on the entire C4 dataset. Our work offers insights into ensuring safer LLM pretraining and serves as a resource for Responsible AI (RAI) compliance.

[153] An overview of artificial intelligence in computer-assisted language learning

Anisia Katinskaia

Main category: cs.CL

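TL;DR: This paper surveys applications of artificial intelligence in computer-assisted language learning (CALL), discussing why it is needed, the limitations of existing systems, and directions for future work.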

Details

Motivation: Demand for language learning is growing while human teachers are scarce and costly; together with factors such as pandemics and migration, this creates an urgent need for intelligent language learning tools. Method: The paper reviews existing research and practice, analyzes how AI is applied across the many components of CALL, and builds bridges for interdisciplinary collaboration. Result: Most existing systems are prototypes or partial implementations, and complete solutions are scarce; recent advances in AI should lead to improvements in CALL. Conclusion: The paper gives CALL developers a perspective on applicable AI methods and calls for interdisciplinary collaboration to advance the field. Abstract: Computer-assisted language learning -- CALL -- is an established research field. We review how artificial intelligence can be applied to support language learning and teaching. The need for intelligent agents that assist language learners and teachers is increasing: the human teacher's time is a scarce and costly resource, which does not scale with growing demand. Further factors contribute to the need for CALL: pandemics and increasing demand for distance learning, migration of large populations, the need for sustainable and affordable support for learning, etc. CALL systems are made up of many components that perform various functions, and AI is applied to many different aspects in CALL, corresponding to their own expansive research areas. Most of what we find in the research literature and in practical use are prototypes or partial implementations -- systems that perform some aspects of the overall desired functionality. Complete solutions -- most of them commercial -- are few, because they require massive resources. Recent advances in AI should result in improvements in CALL, yet there is a lack of surveys that focus on AI in the context of this research field. This paper aims to present a perspective on the AI methods that can be employed for language learning from the position of a developer of a CALL system. We also aim to connect work from different disciplines, to build bridges for interdisciplinary work.

[154] What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction

Eitan Wagner,Omri Abend

Main category: cs.CL

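TL;DR: The paper examines the shift of language models from distributions over finite-length strings to general-purpose prediction models, analyzes the distinction between distribution estimation and response prediction and their conflicting goals, identifies three distinct intended output distributions, and points out that NLP research often misinterprets them.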

Details

Motivation: To clarify how the output distributions of language models differ across training phases and usage scenarios, so that experimental results are not misinterpreted. Method: By analyzing the training phases of LLMs (pretraining, in-context learning, preference tuning) and common use cases for output probabilities (completion probabilities, explicit probabilities), the paper identifies three distinct intended output distributions. Result: NLP work often assumes these distributions are similar, which leads to misinterpreted experimental findings; the paper gives the interpretation of LLMs a firmer theoretical footing. Conclusion: The paper stresses the importance of more formal foundations for interpreting LLMs and using their induced distributions. Abstract: The notion of language modeling has gradually shifted in recent years from a distribution over finite-length strings to general-purpose prediction models for textual inputs and outputs, following appropriate alignment phases. This paper analyzes the distinction between distribution estimation and response prediction in the context of LLMs, and their often conflicting goals. We examine the training phases of LLMs, which include pretraining, in-context learning, and preference tuning, and also the common use cases for their output probabilities, which include completion probabilities and explicit probabilities as output. We argue that the different settings lead to three distinct intended output distributions. We demonstrate that NLP works often assume that these distributions should be similar, which leads to misinterpretations of their experimental findings. Our work sets firmer formal foundations for the interpretation of LLMs, which will inform ongoing work on the interpretation and use of LLMs' induced distributions.

[155] LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning

Joy Lim Jia Yin,Daniel Zhang-Li,Jifan Yu,Haoxuan Li,Shangqing Tu,Yuanchun Wang,Zhiyuan Liu,Huiqin Liu,Lei Hou,Juanzi Li,Bin Xu

Main category: cs.CL

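TL;DR: LecEval is an automated metric grounded in Mayer's cognitive theory for evaluating multimodal knowledge acquisition in slide-based instruction; it outperforms existing methods.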

Details

Motivation: Existing evaluation methods are limited in scalability, context capture, or bias, so a more effective automated evaluation tool is needed. Method: LecEval evaluates slide-based instruction along four dimensions (content relevance, expressive clarity, logical structure, audience engagement), and a large-scale dataset is built for model training. Result: LecEval is more accurate and adaptable than existing methods and approaches human-level assessment. Conclusion: LecEval provides an efficient, automated solution for evaluating slide-based instruction; the dataset and toolkits are open source. Abstract: Evaluating the quality of slide-based multimedia instruction is challenging. Existing methods like manual assessment, reference-based metrics, and large language model evaluators face limitations in scalability, context capture, or bias. In this paper, we introduce LecEval, an automated metric grounded in Mayer's Cognitive Theory of Multimedia Learning, to evaluate multimodal knowledge acquisition in slide-based learning. LecEval assesses effectiveness using four rubrics: Content Relevance (CR), Expressive Clarity (EC), Logical Structure (LS), and Audience Engagement (AE). We curate a large-scale dataset of over 2,000 slides from more than 50 online course videos, annotated with fine-grained human ratings across these rubrics. A model trained on this dataset demonstrates superior accuracy and adaptability compared to existing metrics, bridging the gap between automated and human assessments. We release our dataset and toolkits at https://github.com/JoylimJY/LecEval.

[156] LLM-OptiRA: LLM-Driven Optimization of Resource Allocation for Non-Convex Problems in Wireless Communications

Xinyue Peng,Yanming Liu,Yihan Cang,Chaoqun Cao,Ming Chen

Main category: cs.CL

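TL;DR: LLM-OptiRA uses large language models to automatically solve non-convex resource allocation problems in wireless communication systems, significantly outperforming traditional methods.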

Details

Motivation: Traditional optimization techniques struggle with non-convex resource allocation problems, so an automated and efficient method is needed. Method: LLM-OptiRA uses LLMs to automatically detect non-convex components and transform them into solvable forms, and integrates error correction and feasibility validation mechanisms. Result: Experiments show that LLM-OptiRA achieves an execution rate of 96% and a success rate of 80% on GPT-4, outperforming baseline approaches. Conclusion: LLM-OptiRA provides an efficient, automated solution for complex optimization tasks. Abstract: Solving non-convex resource allocation problems poses significant challenges in wireless communication systems, often beyond the capability of traditional optimization techniques. To address this issue, we propose LLM-OptiRA, the first framework that leverages large language models (LLMs) to automatically detect and transform non-convex components into solvable forms, enabling fully automated resolution of non-convex resource allocation problems in wireless communication systems. LLM-OptiRA not only simplifies problem-solving by reducing reliance on expert knowledge, but also integrates error correction and feasibility validation mechanisms to ensure robustness. Experimental results show that LLM-OptiRA achieves an execution rate of 96% and a success rate of 80% on GPT-4, significantly outperforming baseline approaches in complex optimization tasks across diverse scenarios.

[157] Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study

Xiaoyu Tian,Sitong Zhao,Haotian Wang,Shuaiting Chen,Yiping Peng,Yunjie Ji,Han Zhao,Xiangang Li

Main category: cs.CL

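TL;DR: The paper studies how well offline reinforcement learning methods (such as DPO and LD-DPO) improve the reasoning ability of large language models, finding an average performance gain of 3.3% and a notable 10.1% gain on the Arena-Hard benchmark.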

Details

Motivation: Online reinforcement learning methods have advanced long-context reasoning, but they are computationally expensive and complex, while offline reinforcement learning methods remain underexplored. Method: The authors apply offline RL methods (DPO and LD-DPO) and validate them with experiments on multiple reasoning benchmarks. Result: Offline RL methods substantially improve model performance, with an average gain of 3.3% and a 10.1% gain on Arena-Hard. Conclusion: Offline RL is an effective and economical way to improve LLM reasoning, provided that output length is matched to semantic richness. Abstract: Despite significant advances in long-context reasoning by large language models (LLMs), primarily through Online Reinforcement Learning (RL) methods, these approaches incur substantial computational costs and complexity. In contrast, simpler and more economical Offline RL methods remain underexplored. To address this gap, we investigate the effectiveness of Offline RL methods, specifically Direct Preference Optimization (DPO) and its length-desensitized variant LD-DPO, in enhancing the reasoning capabilities of LLMs. Extensive experiments across multiple reasoning benchmarks demonstrate that these simpler Offline RL methods substantially improve model performance, achieving an average enhancement of 3.3%, with a particularly notable increase of 10.1% on the challenging Arena-Hard benchmark. Furthermore, we analyze DPO's sensitivity to output length, emphasizing that increasing reasoning length should align with semantic richness, as indiscriminate lengthening may adversely affect model performance. We provide comprehensive descriptions of our data processing and training methodologies, offering empirical evidence and practical insights for developing more cost-effective Offline RL approaches.
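For reference, the standard DPO objective that this entry builds on scores a preferred response against a rejected one, relative to a frozen reference model: the loss is -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]). The sketch below computes it for a single preference pair, assuming sequence-level log-probabilities are already available; the function and argument names are illustrative, the default `beta` is an arbitrary choice, and the LD-DPO length desensitization is not implemented here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit rewards: how far the policy has moved from the reference on each response
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss is log(2)
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raises the chosen response's log-prob: the loss falls below log(2)
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

Because the inputs are summed token log-probabilities, longer responses can shift the margins simply by having more tokens, which is one way to see why the abstract examines DPO's sensitivity to output length.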

[158] QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach

Shouyang Dong,Yuanbo Wen,Jun Bi,Di Huang,Jiaming Guo,Jianxing Xu,Ruibai Xu,Xinkai Song,Yifan Hao,Xuehai Zhou,Tianshi Chen,Qi Guo,Yunji Chen

Main category: cs.CL

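TL;DR: QiMeng-Xpiler is a novel cross-platform transcompiler for tensor programs that combines large language models (LLMs) with symbolic program synthesis, substantially improving programming productivity and performance.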

Details

Motivation: Reduce the high cost of developing tensor programs for multiple platforms in heterogeneous deep learning systems, and realize "Write Once, Run Anywhere". Method: Combines LLMs with symbolic program synthesis, transforming programs through pre-defined meta-prompts and optimizing performance with a hierarchical auto-tuning approach. Result: Across 4 different platforms, translation accuracy reaches 95% on average, performance improves by up to 2.0x, and programming productivity improves by up to 96.0x. Conclusion: QiMeng-Xpiler effectively addresses cross-platform tensor program translation and substantially improves productivity and performance. Abstract: Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires developing multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques struggle with either tremendous manual efforts or functional incorrectness, rendering "Write Once, Run Anywhere" of tensor programs an open question. We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is leveraging the powerful code generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets with a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach to systematically explore both the parameters and sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs with an accuracy of 95% on average, and the performance of translated programs achieves up to 2.0x over vendor-provided manually-optimized libraries. As a result, the programming productivity of DLS is improved by up to 96.0x via transcompiling legacy tensor programs.

Minzheng Wang,Yongbin Li,Haobo Wang,Xinghua Zhang,Nan Xu,Bingli Wu,Fei Huang,Haiyang Yu,Wenji Mao

Main category: cs.CL

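TL;DR: The paper proposes Adaptive Mode Learning (AML), a framework that uses the AMPO algorithm to dynamically select among four thinking modes, markedly improving performance on social intelligence tasks while shortening reasoning chains.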

Details

Motivation: Existing methods for social intelligence simulation cannot dynamically adjust reasoning depth, which wastes resources or hurts performance. Method: Proposes the AML framework and the AMPO algorithm, comprising multi-granular thinking mode design, context-aware mode switching, and depth-adaptive processing. Result: Experiments show that AML achieves 15.6% higher task performance than existing methods, with reasoning chains 32.8% shorter. Conclusion: AMPO's dynamic thinking mode selection is closer to human reasoning than fixed-depth methods and markedly improves social intelligence simulation. Abstract: Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current approaches. Existing methods either lack this kind of reasoning capability or enforce uniform long chain-of-thought reasoning across all scenarios, resulting in excessive token usage and inappropriate social simulation. In this paper, we propose Adaptive Mode Learning (AML), which strategically selects from four thinking modes (ranging from intuitive reaction to deep contemplation) based on real-time context. Our framework's core innovation, the Adaptive Mode Policy Optimization (AMPO) algorithm, introduces three key advancements over existing methods: (1) multi-granular thinking mode design, (2) context-aware mode switching across social interaction, and (3) token-efficient reasoning via depth-adaptive processing. Extensive experiments on social intelligence tasks confirm that AML achieves 15.6% higher task performance than state-of-the-art methods. Notably, our method outperforms GRPO by 7.0% with 32.8% shorter reasoning chains. These results demonstrate that context-sensitive thinking mode selection, as implemented in AMPO, enables more human-like adaptive reasoning than GRPO's fixed-depth approach.

[160] Incorporating Legal Structure in Retrieval-Augmented Generation: A Case Study on Copyright Fair Use

Justin Ho,Alexandra Colby,William Fisher

Main category: cs.CL

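TL;DR: The paper presents a domain-specific RAG implementation for the Fair Use Doctrine in U.S. copyright law, intended to help content creators respond to DMCA takedowns; it combines semantic search, legal knowledge graphs, and court citation networks to improve retrieval quality and reasoning reliability.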

Details

Motivation: Because DMCA takedown notices are increasingly common and content creators lack accessible legal support, the paper aims to provide a more effective legal assistance tool. Method: A structured approach that combines semantic search, legal knowledge graphs, and court citation networks, using Chain-of-Thought reasoning and interleaved retrieval steps to emulate legal reasoning. Result: Preliminary testing suggests that the method improves the doctrinal relevance of retrieved material. Conclusion: The work lays groundwork for future evaluation and deployment of LLM-based legal assistance tools. Abstract: This paper presents a domain-specific implementation of Retrieval-Augmented Generation (RAG) tailored to the Fair Use Doctrine in U.S. copyright law. Motivated by the increasing prevalence of DMCA takedowns and the lack of accessible legal support for content creators, we propose a structured approach that combines semantic search with legal knowledge graphs and court citation networks to improve retrieval quality and reasoning reliability. Our prototype models legal precedents at the statutory factor level (e.g., purpose, nature, amount, market effect) and incorporates citation-weighted graph representations to prioritize doctrinally authoritative sources. We use Chain-of-Thought reasoning and interleaved retrieval steps to better emulate legal reasoning. Preliminary testing suggests this method improves doctrinal relevance in the retrieval process, laying groundwork for future evaluation and deployment of LLM-based legal assistance tools.

[161] A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking

Henrik Brådland,Morten Goodwin,Per-Arne Andersen,Alexander S. Nossum,Aditya Gupta

Main category: cs.CL

TL;DR: The paper proposes HOPE, a new method for evaluating how document chunking affects RAG system performance, and validates it experimentally.

Details

Motivation: Existing research lacks a systematic framework for analyzing the impact of document chunking methods, yet LLMs are sensitive to the layout and structure of data, so a quantitative evaluation method is needed. Method: Proposes the HOPE metric, which quantifies chunking characteristics at three levels (intrinsic properties, extrinsic properties, and passages-document coherence), and evaluates it empirically across domains. Result: HOPE correlates significantly with RAG performance indicators; semantic independence yields large gains (up to 56.2% in factual correctness and 21.1% in answer correctness), whereas the traditional concept-unity assumption has little effect. Conclusion: HOPE offers practical guidance for optimizing chunking strategies and helps in designing more accurate RAG systems. Abstract: Document chunking fundamentally impacts Retrieval-Augmented Generation (RAG) by determining how source materials are segmented before indexing. Despite evidence that Large Language Models (LLMs) are sensitive to the layout and structure of retrieved data, there is currently no framework to analyze the impact of different chunking methods. In this paper, we introduce a novel methodology that defines essential characteristics of the chunking process at three levels: intrinsic passage properties, extrinsic passage properties, and passages-document coherence. We propose HOPE (Holistic Passage Evaluation), a domain-agnostic, automatic evaluation metric that quantifies and aggregates these characteristics. Our empirical evaluations across seven domains demonstrate that the HOPE metric correlates significantly (p > 0.13) with various RAG performance indicators, revealing contrasts between the importance of extrinsic and intrinsic properties of passages. Semantic independence between passages proves essential for system performance with a performance gain of up to 56.2% in factual correctness and 21.1% in answer correctness. On the contrary, traditional assumptions about maintaining concept unity within passages show minimal impact. These findings provide actionable insights for optimizing chunking strategies, thus improving RAG system design to produce more factually correct responses.

[162] Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization

Chuck Arvin

Main category: cs.CL

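TL;DR: The study evaluates LLMs of different sizes on the CaseHOLD legal benchmark and finds that performance improves with scale without complex training or fine-tuning. An anonymization test confirms that the performance does not rely on memorization.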

Details

Motivation: Assess how LLMs perform on legal tasks and explore their potential and limitations, to inform automated legal analytics and benchmark development. Method: Tests LLMs of different sizes on the CaseHOLD dataset and designs an anonymization test to rule out memorization effects. Result: Performance improves with model size; GPT4o and AmazonNovaPro reach macro F1 scores of 0.744 and 0.720 respectively, and performance stays at 0.728 under the anonymization test. Conclusion: LLMs perform strongly on legal tasks without relying on memorization, although limitations remain; the findings matter for the development of legal automation. Abstract: As large language models (LLMs) continue to advance in capabilities, it is essential to assess how they perform on established benchmarks. In this study, we present a suite of experiments to assess the performance of modern LLMs (ranging from 3B to 90B+ parameters) on CaseHOLD, a legal benchmark dataset for identifying case holdings. Our experiments demonstrate "scaling effects": performance on this task improves with model size, with more capable models like GPT4o and AmazonNovaPro achieving macro F1 scores of 0.744 and 0.720 respectively. These scores are competitive with the best published results on this dataset, and do not require any technically sophisticated model training, fine-tuning or few-shot prompting. To ensure that these strong results are not due to memorization of judicial opinions contained in the training data, we develop and utilize a novel citation anonymization test that preserves semantic meaning while ensuring case names and citations are fictitious. Models maintain strong performance under these conditions (macro F1 of 0.728), suggesting the performance is not due to rote memorization. These findings demonstrate both the promise and current limitations of LLMs for legal tasks with important implications for the development and measurement of automated legal analytics and legal benchmarks.

[163] Measuring Hong Kong Massive Multi-Task Language Understanding

Chuxue Cao,Zhenghao Zhu,Junqi Zhu,Guoying Lu,Siyu Peng,Juntao Dai,Weijie Shi,Sirui Han,Yike Guo

Main category: cs.CL

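TL;DR: HKMMLU is a multi-task language understanding benchmark for Hong Kong's distinctive linguistic and cultural context, comprising 26,698 multi-choice questions and 90,550 translation tasks. Experiments show that existing LLMs perform poorly in Hong Kong-specific domains and need further improvement.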

Details

Motivation: Hong Kong's linguistic and cultural context is distinctive (Traditional Chinese combined with Cantonese), but existing evaluation benchmarks are inadequate, so HKMMLU is developed to fill this gap. Method: Builds 26,698 multi-choice questions across 66 subjects covering STEM, social sciences, humanities, and other areas, adds 90,550 Mandarin-Cantonese translation tasks, and evaluates a range of LLMs. Result: The best model, DeepSeek-V3, reaches an accuracy of only 75%, well below its results on MMLU and CMMLU, highlighting the weakness of LLMs in Hong Kong-specific domains. Conclusion: HKMMLU will advance LLMs in multilingual and cross-cultural contexts and broaden their applicability and impact. Abstract: Multilingual understanding is crucial for the cross-cultural applicability of Large Language Models (LLMs). However, evaluation benchmarks designed for Hong Kong's unique linguistic landscape, which combines Traditional Chinese script with Cantonese as the spoken form and its cultural context, remain underdeveloped. To address this gap, we introduce HKMMLU, a multi-task language understanding benchmark that evaluates Hong Kong's linguistic competence and socio-cultural knowledge. The HKMMLU includes 26,698 multi-choice questions across 66 subjects, organized into four categories: Science, Technology, Engineering, and Mathematics (STEM), Social Sciences, Humanities, and Other. To evaluate the multilingual understanding ability of LLMs, 90,550 Mandarin-Cantonese translation tasks were additionally included. We conduct comprehensive experiments on GPT-4o, Claude 3.7 Sonnet, and 18 open-source LLMs of varying sizes on HKMMLU. The results show that the best-performing model, DeepSeek-V3, struggles to achieve an accuracy of 75%, significantly lower than that of MMLU and CMMLU. This performance gap highlights the need to improve LLMs' capabilities in Hong Kong-specific language and knowledge domains. Furthermore, we investigate how question language, model size, prompting strategies, and question and reasoning token lengths affect model performance. We anticipate that HKMMLU will significantly advance the development of LLMs in multilingual and cross-cultural contexts, thereby enabling broader and more impactful applications.

[164] SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation

Tanguy Herserant,Vincent Guigue

Main category: cs.CL

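TL;DR: SEval-Ex is a new framework for evaluating text summarization that decomposes evaluation into atomic statements, achieving both high performance and explainability and outperforming existing methods.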

Details

Motivation: Resolve the trade-off between performance and interpretability in existing summarization evaluation methods. Method: A two-stage pipeline: (1) an LLM extracts atomic statements from the source text and the summary; (2) statement-level matching produces detailed evidence. Result: On the SummEval benchmark, SEval-Ex reaches a correlation of 0.580, surpassing GPT-4 based evaluators (0.521), and is robust against hallucination. Conclusion: SEval-Ex performs well on both accuracy and interpretability, offering a new direction for summarization evaluation. Abstract: Evaluating text summarization quality remains a critical challenge in Natural Language Processing. Current approaches face a trade-off between performance and interpretability. We present SEval-Ex, a framework that bridges this gap by decomposing summarization evaluation into atomic statements, enabling both high performance and explainability. SEval-Ex employs a two-stage pipeline: it first extracts atomic statements from the source text and the summary using an LLM, then matches the generated statements against each other. Unlike existing approaches that provide only summary-level scores, our method generates detailed evidence for its decisions through statement-level alignments. Experiments on the SummEval benchmark demonstrate that SEval-Ex achieves state-of-the-art performance with a correlation of 0.580 with human consistency judgments, surpassing GPT-4 based evaluators (0.521) while maintaining interpretability. Finally, our framework shows robustness against hallucination.

[165] Personalisation or Prejudice? Addressing Geographic Bias in Hate Speech Detection using Debias Tuning in Large Language Models

Paloma Piot,Patricia Martín-Rodilla,Javier Parapar

Main category: cs.CL

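TL;DR: The paper studies how the personalised memory features of commercial large language models (LLMs) affect hate speech detection, finds that personalised context significantly influences model behaviour, and reduces the resulting bias through fine-tuning.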

Details

Motivation: Examine how personalised information influences LLM behaviour, particularly on sensitive topics such as hate speech. Method: Tests several state-of-the-art LLMs in different personalisation scenarios and fine-tunes them to reduce the bias introduced by personalised context. Result: Personalised context significantly influences hate speech detection by LLMs; after fine-tuning, the models improve both with and without personalised context. Conclusion: Personalised context has an important effect on LLM behaviour, and technical measures such as fine-tuning are needed to reduce bias and improve robustness. Abstract: Commercial Large Language Models (LLMs) have recently incorporated memory features to deliver personalised responses. This memory retains details such as user demographics and individual characteristics, allowing LLMs to adjust their behaviour based on personal information. However, the impact of integrating personalised information into the context has not been thoroughly assessed, leading to questions about its influence on LLM behaviour. Personalisation can be challenging, particularly with sensitive topics. In this paper, we examine various state-of-the-art LLMs to understand their behaviour in different personalisation scenarios, specifically focusing on hate speech. We prompt the models to assume country-specific personas and use different languages for hate speech detection. Our findings reveal that context personalisation significantly influences LLMs' responses in this sensitive area. To mitigate these unwanted biases, we fine-tune the LLMs by penalising inconsistent hate speech classifications made with and without country or language-specific context. The refined models demonstrate improved performance in both personalised contexts and when no context is provided.

[166] Parameter-Efficient Transformer Embeddings

Henry Ndubuaku,Mouad Talhi

Main category: cs.CL

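TL;DR: Proposes an alternative embedding method based on a Fourier expansion and a lightweight MLP that substantially reduces the number of parameters while maintaining performance.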

Details

Motivation: Conventional embedding layers hold many parameters without a proportional performance gain, so a more efficient approach is needed. Method: Generates deterministic embedding vectors via a Fourier expansion, then uses a lightweight MLP to capture higher-order interactions. Result: Competitive performance on natural language inference tasks with fewer parameters, faster training, and no need for dropout. Conclusion: The method shows the potential of efficient, scalable language models and merits further study. Abstract: Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs using a Fourier expansion of their normalized values, followed by a lightweight multilayer perceptron (MLP) that captures higher-order interactions. We train standard transformers and our architecture on natural language inference tasks (SNLI and MNLI), and evaluate zero-shot performance on sentence textual similarity (STS-B). Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout. This proof-of-concept study highlights the potential for scalable, memory-efficient language models and motivates further large-scale experimentation based on our findings.
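
To make the idea of a table-free embedding concrete, here is a minimal sketch under stated assumptions: the token ID is normalized by the vocabulary size, expanded into sine and cosine harmonics, and passed through a small two-layer MLP. The exact normalization, the choice of harmonics, the MLP shape, and all names (`fourier_features`, `TinyMLP`) are assumptions for illustration and may differ from the paper.

```python
import math
import random


def fourier_features(token_id: int, vocab_size: int, dim: int) -> list[float]:
    """Deterministic features: sin/cos harmonics of the normalized token ID."""
    if dim % 2 != 0:
        raise ValueError("dim must be even (one sin and one cos per harmonic)")
    # Assumed normalization to [0, 1); dividing by vocab_size (not
    # vocab_size - 1) keeps the first and last token from aliasing.
    x = token_id / vocab_size
    feats = []
    for k in range(1, dim // 2 + 1):
        feats.append(math.sin(2 * math.pi * k * x))
        feats.append(math.cos(2 * math.pi * k * x))
    return feats


class TinyMLP:
    """Two-layer ReLU MLP with random weights (stand-in for a trained one)."""

    def __init__(self, dim: int, hidden: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0, 1 / math.sqrt(dim)) for _ in range(dim)]
                   for _ in range(hidden)]
        self.w2 = [[rng.gauss(0, 1 / math.sqrt(hidden)) for _ in range(hidden)]
                   for _ in range(dim)]

    def __call__(self, x: list[float]) -> list[float]:
        h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w1]
        return [sum(w * v for w, v in zip(row, h)) for row in self.w2]

    def param_count(self) -> int:
        return sum(len(r) for r in self.w1) + sum(len(r) for r in self.w2)


vocab_size, dim = 1000, 8
mlp = TinyMLP(dim=dim, hidden=16)
embedding = mlp(fourier_features(42, vocab_size, dim))
table_params = vocab_size * dim   # size of a conventional embedding table
mlp_params = mlp.param_count()    # independent of vocabulary size
```

The point of the comparison at the end is that the MLP's parameter count does not grow with the vocabulary, whereas a lookup table's does.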

[167] Demystifying optimized prompts in language models

Rimon Melamed,Lucas H. McCabe,H. Howie Huang

Main category: cs.CL

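TL;DR: Modern language models are not robust to out-of-distribution inputs, and optimized prompts can modulate model outputs while being hard to interpret. The paper studies the composition of optimized prompts and how models parse them, finding that they consist mainly of rare punctuation and noun tokens and that their activation patterns differ from those of natural language.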

Details

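Motivation: Investigate how language models parse optimized prompts and represent them internally, in order to understand their lack of robustness. Method: Analyzes the composition of optimized prompts (e.g., punctuation and rare nouns) and studies model activation patterns and representation paths. Result: Optimized prompts consist mainly of rare punctuation and noun tokens; they are distinguishable from natural language based on sparse subsets of activations, and their representation paths are similar across models. Conclusion: Optimized prompts influence model outputs through rare tokens and punctuation, and their consistent internal representation paths reveal the sensitivity of models to non-natural inputs. Abstract: Modern language models (LMs) are not robust to out-of-distribution inputs. Machine generated ("optimized") prompts can be used to modulate LM outputs and induce specific behaviors while appearing completely uninterpretable. In this work, we investigate the composition of optimized prompts, as well as the mechanisms by which LMs parse and build predictions from optimized prompts. We find that optimized prompts primarily consist of punctuation and noun tokens which are more rare in the training data. Internally, optimized prompts are clearly distinguishable from natural language counterparts based on sparse subsets of the model's activations. Across various families of instruction-tuned models, optimized prompts follow a similar path in how their representations form through the network.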

[168] Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition

Siyu Liang,Yunan Li,Wentian Xin,Huizhou Chen,Xujie Liu,Kang Liu,Qiguang Miao

Main category: cs.CL

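TL;DR: Proposes GSP-MC, a sign language recognition method that incorporates generative large language models (LLMs); using retrieval-augmented generation and multi-step prompt engineering, it produces multi-level sign descriptions and aligns them with skeleton features, reaching SOTA performance on Chinese and Turkish sign language datasets.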

Details

Motivation: Address the annotation difficulty in sign language recognition caused by the complexity of simultaneous manual and non-manual signals. Method: Proposes GSP-MC, which combines retrieval-augmented generation (RAG) with domain-specific LLMs, generates precise descriptions through multi-step prompt engineering, and uses a dual-encoder architecture to align text with skeleton features. Result: Accuracy of 97.1% on the Chinese SLR500 dataset and 97.07% on the Turkish AUTSL dataset. Conclusion: GSP-MC demonstrates cross-lingual effectiveness and may help advance inclusive communication technologies. Abstract: Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method that leverages retrieval-augmented generation (RAG) with domain-specific LLMs, incorporating multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions (global, synonym, and part level) through probabilistic matching. Our approach combines global and part-level losses, optimizing KL divergence to ensure robust alignment across all relevant text-skeleton pairs while capturing both sign-level semantics and detailed part dynamics. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (reaching 97.1%) and Turkish AUTSL datasets (97.07% accuracy). The method's cross-lingual effectiveness highlights its potential for developing inclusive communication technologies.

[169] Invoke Interfaces Only When Needed: Adaptive Invocation for Large Language Models in Question Answering

Jihao Zhao,Chunlai Zhou,Biao Qin

Main category: cs.CL

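TL;DR: Proposes the AttenHScore metric, which dynamically adjusts when a large model is invoked and improves hallucination detection for small models.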

Details

Motivation: Address hallucination during small-model generation and optimize invocation timing to reduce computational cost. Method: Detects hallucinations dynamically with AttenHScore and assists small models with uncertainty-aware knowledge reorganization. Result: AttenHScore performs strongly across multiple QA datasets, especially on complex queries. Conclusion: The approach needs no additional training, applies to a variety of transformer-based models, and is flexible and efficient. Abstract: The collaborative paradigm of large and small language models (LMs) effectively balances performance and cost, yet its pivotal challenge lies in precisely pinpointing the moment of invocation when hallucinations arise in small LMs. Previous optimization efforts primarily focused on post-processing techniques, which were separate from the reasoning process of LMs, resulting in high computational costs and limited effectiveness. In this paper, we propose a practical invocation evaluation metric called AttenHScore, which calculates the accumulation and propagation of hallucinations during the generation process of small LMs, continuously amplifying potential reasoning errors. By dynamically adjusting the detection threshold, we achieve more accurate real-time invocation of large LMs. Additionally, considering the limited reasoning capacity of small LMs, we leverage uncertainty-aware knowledge reorganization to help them better capture critical information from different text chunks. Extensive experiments reveal that our AttenHScore outperforms most baselines in enhancing real-time hallucination detection capabilities across multiple QA datasets, especially when addressing complex queries. Moreover, our strategies eliminate the need for additional model training and display flexibility in adapting to various transformer-based LMs.

[170] SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning

Tianjian Li,Daniel Khashabi

Main category: cs.CL

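TL;DR: The paper examines the complementarity of on-policy and off-policy data in preference learning and proposes SIMPLEMIX, a simple mixing method that substantially improves language model alignment.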

Details

Motivation: Prior studies disagree about the advantages of on-policy versus off-policy data in preference learning, so their interplay needs systematic exploration. Method: Proposes SIMPLEMIX, which combines the strengths of both by simply mixing on-policy and off-policy data. Result: On Alpaca Eval 2.0, SIMPLEMIX improves upon on-policy DPO and off-policy DPO by an average of 6.03% and outperforms more complex mixing methods. Conclusion: On-policy and off-policy data are complementary in preference learning, and SIMPLEMIX is an efficient and simple method for alignment optimization. Abstract: Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data may be task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths in preference optimization: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on open-ended tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SIMPLEMIX, an approach to combine the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SIMPLEMIX substantially improves language model alignment. Specifically, SIMPLEMIX improves upon on-policy DPO and off-policy DPO by an average of 6.03% on Alpaca Eval 2.0. Moreover, it outperforms prior approaches that are much more complex in combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05%.
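
The mixing step described in this entry can be sketched in a few lines. The version below samples a fixed fraction of preference pairs from each source and shuffles them; the function name, the `on_fraction` parameter, its 0.5 default, and the record layout are assumptions for illustration, since the abstract does not state the mixing ratio or data format.

```python
import random


def mix_preference_data(on_policy: list[dict],
                        off_policy: list[dict],
                        size: int,
                        on_fraction: float = 0.5,
                        seed: int = 0) -> list[dict]:
    """Build one preference dataset by sampling from two sources.

    on_fraction is a hypothetical knob: the share of the mixed dataset
    drawn from on-policy pairs (responses sampled from the current model).
    """
    if not 0.0 <= on_fraction <= 1.0:
        raise ValueError("on_fraction must be in [0, 1]")
    n_on = round(size * on_fraction)
    n_off = size - n_on
    if n_on > len(on_policy) or n_off > len(off_policy):
        raise ValueError("not enough examples in one of the sources")
    rng = random.Random(seed)
    # Sample without replacement from each source, then shuffle together.
    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)
    return mixed


# Toy records: each pair holds a prompt, a chosen and a rejected response.
on_pairs = [{"prompt": f"p{i}", "chosen": "a", "rejected": "b", "source": "on"}
            for i in range(20)]
off_pairs = [{"prompt": f"q{i}", "chosen": "c", "rejected": "d", "source": "off"}
             for i in range(20)]
mixed = mix_preference_data(on_pairs, off_pairs, size=10)
```

The mixed list can then be fed to any pairwise preference objective such as DPO; the sketch covers only the data side.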

[171] JTCSE: Joint Tensor-Modulus Constraints and Cross-Attention for Unsupervised Contrastive Learning of Sentence Embeddings

Tianyu Zong,Hongzhu Yi,Bingkang Shi,Yuanxiang Wang,Jungang Xu

Main category: cs.CL

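TL;DR: The paper proposes JTCSE, an unsupervised contrastive learning framework that constrains the modulus of semantic representation tensors and introduces a cross-attention mechanism, improving the quality of sentence embeddings and reaching SOTA performance on multiple tasks.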

Details

Motivation: Existing contrastive learning methods consider only the orientation distribution of representations in the high-dimensional semantic space and ignore the modulus feature, which leads to insufficient contrastive learning; in addition, BERT-like models suffer from sinking attention, which impairs semantic aggregation by the CLS token. Method: Proposes a modulus-constrained training objective to strengthen the alignment of positive samples, and designs a cross-attention structure to improve the attention allocated to the CLS token; the two are combined into the JTCSE framework. Result: On seven semantic textual similarity tasks and more than 130 zero-shot downstream tasks, JTCSE's twin-tower ensemble model and single-tower distillation model both outperform the baselines. Conclusion: Through modulus constraints and cross-attention, JTCSE markedly improves unsupervised contrastive learning and becomes the current best method. Abstract: Unsupervised contrastive learning has become a hot research topic in natural language processing. Existing works usually aim at constraining the orientation distribution of the representations of positive and negative samples in the high-dimensional semantic space in contrastive learning, but the semantic representation tensor possesses both modulus and orientation features, and the existing works ignore the modulus feature of the representations and cause insufficient contrastive learning. Therefore, we first propose a training objective that is designed to impose modulus constraints on the semantic representation tensor, to strengthen the alignment between positive samples in contrastive learning. In addition, BERT-like models suffer from the phenomenon of sinking attention, leading to a lack of attention to the CLS tokens that aggregate semantic information. In response, we propose a cross-attention structure among the twin-tower ensemble models to enhance the model's attention to the CLS token and optimize the quality of CLS pooling. Combining the above two motivations, we propose a new Joint Tensor representation modulus constraint and Cross-attention unsupervised contrastive learning Sentence Embedding representation framework, JTCSE, which we evaluate in seven semantic text similarity computation tasks, and the experimental results show that JTCSE's twin-tower ensemble model and single-tower distillation model outperform the other baselines and become the current SOTA. In addition, we have conducted an extensive zero-shot downstream task evaluation, which shows that JTCSE outperforms other baselines overall on more than 130 tasks.

[172] RM-R1: Reward Modeling as Reasoning

Xiusi Chen,Gaotang Li,Ziqi Wang,Bowen Jin,Cheng Qian,Yu Wang,Hongru Wang,Yu Zhang,Denghui Zhang,Tong Zhang,Hanghang Tong,Heng Ji

Main category: cs.CL

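TL;DR: The paper proposes a new class of generative reward models, Reasoning Reward Models (ReasRMs), which recast reward modeling as a reasoning task and markedly improve interpretability and performance.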

Details

Motivation: Existing reward models (RMs) typically produce opaque scalar scores or directly predict the preferred answer, so they lack interpretability and struggle to integrate natural language critiques. Method: Proposes a reasoning-oriented training pipeline with two stages, distillation of high-quality reasoning chains and reinforcement learning with verifiable rewards, and trains RM-R1, a family of ReasRMs. Result: Experiments show that ReasRMs reach state-of-the-art or near state-of-the-art performance on multiple reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Conclusion: Reasoning reward models markedly improve the interpretability and performance of reward modeling and open a new direction for future research; models, code, and data are released. Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques, thus lacking interpretability. Inspired by recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM's interpretability and performance. In this work, we introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs) -- which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance of generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.

[173] Bielik 11B v2 Technical Report

Krzysztof Ociepa,Łukasz Flis,Krzysztof Wróbel,Adrian Gwoździej,Remigiusz Kinas

Main category: cs.CL

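TL;DR: Bielik 11B v2 is an advanced language model optimized for Polish. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters via depth up-scaling, it performs exceptionally well on Polish tasks while retaining cross-lingual capabilities.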

Details

Motivation: Improve Polish text processing while preserving cross-lingual capabilities, setting a new standard for efficient modeling of less-represented languages. Method: Uses depth up-scaling and introduces two innovations: Weighted Instruction Cross-Entropy Loss and Adaptive Learning Rate. Result: Outperforms larger models on multiple benchmarks, significantly surpasses other specialized Polish language models, and supports deployment across various hardware configurations. Conclusion: Bielik 11B v2 provides an efficient solution for Polish language AI and sets a new benchmark for resource-efficient modeling. Abstract: We present Bielik 11B v2, a state-of-the-art language model optimized for Polish text processing. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling, this model demonstrates exceptional performance across Polish language benchmarks while maintaining strong cross-lingual capabilities. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss, which optimizes learning across diverse instruction types by assigning quality-based weights to training examples, and Adaptive Learning Rate, which dynamically adjusts based on context length. Comprehensive evaluation across multiple benchmarks demonstrates that Bielik 11B v2 outperforms many larger models, including those with 2-6 times more parameters, and significantly surpasses other specialized Polish language models on tasks ranging from linguistic understanding to complex reasoning. The model's parameter efficiency and extensive quantization options enable deployment across various hardware configurations, advancing Polish language AI capabilities and establishing new benchmarks for resource-efficient language modeling in less-represented languages.
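
One plausible reading of a quality-weighted cross-entropy is sketched below: each training example's mean token negative log-likelihood is scaled by a quality weight, and the batch loss is normalized by the total weight. This is an assumption-laden illustration, not the Bielik implementation; in particular the per-example averaging, the normalization by the weight sum, and the function name are guesses.

```python
import math


def weighted_instruction_ce(token_logprobs: list[list[float]],
                            weights: list[float]) -> float:
    """Quality-weighted cross-entropy over a batch of examples.

    token_logprobs[i] holds the model's log-probabilities of the target
    tokens of example i; weights[i] is that example's quality weight
    (assumed here to lie in [0, 1]).
    """
    if len(token_logprobs) != len(weights):
        raise ValueError("one weight per example is required")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    loss = 0.0
    for logps, w in zip(token_logprobs, weights):
        # Mean negative log-likelihood of this example's target tokens.
        nll = -sum(logps) / len(logps)
        loss += w * nll
    # Normalizing by the weight sum keeps the loss scale comparable
    # across batches with different quality mixes (an assumed choice).
    return loss / total_weight


batch = [[math.log(0.5), math.log(0.5)],  # higher-quality example
         [math.log(0.25)]]                # lower-quality example
equal = weighted_instruction_ce(batch, [1.0, 1.0])
downweighted = weighted_instruction_ce(batch, [1.0, 0.0])
```

With equal weights this reduces to the ordinary mean of per-example losses; setting a weight to zero removes that example's influence entirely.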

[174] Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs

Elisa Forcada Rodríguez,Olatz Perez-de-Viñaspre,Jon Ander Campos,Dietrich Klakow,Vagrant Gautam

Main category: cs.CL

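TL;DR: The study examines intersecting gender and country biases in multilingual models and finds that even when gender or country bias is low in isolation, significant intersectional bias remains.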

Details

Motivation: Existing fairness research mostly focuses on single axes of bias (such as gender) and on English; this study fills the gap on multilingual intersectional bias. Method: Builds a benchmark in English, Spanish, and German and evaluates occupation recommendation bias in 5 Llama-based models across 25 countries and 4 pronoun sets. Result: The models show significant intersecting gender and country biases; the prompting language affects the degree of bias, and instruction-tuned models show the lowest and most stable bias. Conclusion: Fairness research needs to adopt intersectional and multilingual perspectives. Abstract: One of the goals of fairness research in NLP is to measure and mitigate stereotypical biases that are propagated by NLP systems. However, such work tends to focus on single axes of bias (most often gender) and the English language. Addressing these limitations, we contribute the first study of multilingual intersecting country and gender biases, with a focus on occupation recommendations generated by large language models. We construct a benchmark of prompts in English, Spanish and German, where we systematically vary country and gender, using 25 countries and four pronoun sets. Then, we evaluate a suite of 5 Llama-based models on this benchmark, finding that LLMs encode significant gender and country biases. Notably, we find that even when models show parity for gender or country individually, intersectional occupational biases based on both country and gender persist. We also show that the prompting language significantly affects bias, and instruction-tuned models consistently demonstrate the lowest and most stable levels of bias. Our findings highlight the need for fairness researchers to use intersectional and multilingual lenses in their work.

[175] Data Augmentation With Back translation for Low Resource languages: A case of English and Luganda

Richard Kimera,Dongnyeong Heo,Daniela N. Rim,Heeyoul Choi

Main category: cs.CL

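TL;DR: The paper explores back translation (BT) as a semi-supervised technique for improving English-Luganda neural machine translation (NMT) models, addressing the challenges of low-resource languages.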

Details

Motivation: Show how back translation can mitigate the scarcity of bilingual data by generating synthetic data from monolingual corpora. Method: Develops custom NMT models using publicly available and web-crawled data and applies iterative and incremental back translation. Result: Translation performance improves significantly, exceeding previous benchmarks by more than 10 BLEU points, with quality assessed through several evaluation metrics. Conclusion: The study confirms the effectiveness of back translation with strategically selected datasets and sets new benchmarks for NMT models in low-resource languages. Abstract: In this paper, we explore the application of Back translation (BT) as a semi-supervised technique to enhance Neural Machine Translation (NMT) models for the English-Luganda language pair, specifically addressing the challenges faced by low-resource languages. The purpose of our study is to demonstrate how BT can mitigate the scarcity of bilingual data by generating synthetic data from monolingual corpora. Our methodology involves developing custom NMT models using both publicly available and web-crawled data, and applying Iterative and Incremental Back translation techniques. We strategically select datasets for incremental back translation across multiple small datasets, which is a novel element of our approach. The results of our study show significant improvements, with translation performance for the English-Luganda pair exceeding previous benchmarks by more than 10 BLEU score units across all translation directions. Additionally, our evaluation incorporates comprehensive assessment metrics such as SacreBLEU, ChrF2, and TER, providing a nuanced understanding of translation quality. The conclusion drawn from our research confirms the efficacy of BT when strategically curated datasets are utilized, establishing new performance benchmarks and demonstrating the potential of BT in enhancing NMT models for low-resource languages.

[176] Bemba Speech Translation: Exploring a Low-Resource African Language

Muhammad Hazim Al Farouq,Aman Kassahun Wassie,Yasmin Moslem

Main category: cs.CL

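TL;DR: This paper presents a system for low-resource Bemba-to-English speech translation: a cascaded system built on Whisper and NLLB-200 that uses data augmentation techniques such as back-translation.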

Details

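Motivation: Studies speech translation for a low-resource language pair (Bemba-to-English) and explores how synthetic data can improve performance. Method: Builds cascaded speech translation systems on Whisper and NLLB-200 and uses data augmentation techniques such as back-translation to generate synthetic data. Result: The experiments examine the effect of using synthetic data, and the experimental setup is discussed in detail. Conclusion: Cascaded systems combined with data augmentation offer a workable approach to low-resource speech translation. Abstract: This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2025), low-resource languages track, namely for Bemba-to-English speech translation. We built cascaded speech translation systems based on Whisper and NLLB-200, and employed data augmentation techniques, such as back-translation. We investigate the effect of using synthetic data and discuss our experimental setup.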

[177] EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning

Lingxiao Kong,Cong Yang,Susanne Neufang,Oya Deniz Beyan,Zeyd Boukhers

Main category: cs.CL

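TL;DR: The paper proposes an ensemble-learning-based multi-objective reinforcement learning framework (EMORL) for fine-tuning large language models, addressing objective balancing, training efficiency, and explainability.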

Details

Motivation: Existing reinforcement learning methods for multi-objective tasks face complex objective balancing, low training efficiency, poor scalability, and limited explainability. Method: Introduces the EMORL framework, which fine-tunes one model per objective, optimizes their aggregation after training, and uses a hierarchical grid search algorithm to find the best weighted combination. Result: On counselor reflection generation tasks, EMORL shows lower and more stable training consumption and better scalability, with performance comparable to the baselines. Conclusion: EMORL offers an efficient, flexible, and explainable approach to multi-objective reinforcement learning. Abstract: Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including complex objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the training to improve efficiency and flexibility. Our method is the first to aggregate the last hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text-scoring LLMs to evaluate the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption ($17,529\pm 1,650$ data points and $6,573\pm 147.43$ seconds), improved scalability and explainability, and comparable performance across multiple objectives.
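
The abstract names two ingredients, a weighted aggregation of per-model hidden states and a hierarchical grid search over the weights, without giving their details. The sketch below is a generic coarse-to-fine grid search written under stated assumptions, not the authors' algorithm: `aggregate`, `hierarchical_grid_search`, `toy_score`, the two-dimensional "hidden states", and the target vector are all hypothetical stand-ins (in the paper the score would come from reward models rather than a distance to a known target).

```python
from itertools import product


def aggregate(hidden_states, weights):
    """Weighted sum of per-model hidden-state vectors (one vector per model)."""
    dim = len(hidden_states[0])
    return [sum(w * h[i] for w, h in zip(weights, hidden_states)) for i in range(dim)]


def hierarchical_grid_search(score_fn, n_models, levels=4, points=5):
    """Coarse-to-fine search over one weight in [0, 1] per model.

    Each level scores a points**n_models grid, then shrinks the search box
    to one grid step on either side of the best point found so far.
    """
    lows, highs = [0.0] * n_models, [1.0] * n_models
    best_w, best_s = None, float("-inf")
    for _ in range(levels):
        steps = [(hi - lo) / (points - 1) for lo, hi in zip(lows, highs)]
        axes = [[lo + k * st for k in range(points)] for lo, st in zip(lows, steps)]
        for w in product(*axes):
            s = score_fn(w)
            if s > best_s:
                best_w, best_s = w, s
        lows = [max(0.0, b - st) for b, st in zip(best_w, steps)]
        highs = [min(1.0, b + st) for b, st in zip(best_w, steps)]
    return best_w, best_s


# Hypothetical toy setup: two "models", each contributing one hidden vector.
hidden = [[1.0, 0.0], [0.0, 1.0]]
target = [0.3, 0.8]


def toy_score(weights):
    """Stand-in for a reward: negative squared distance to a known target."""
    mixed = aggregate(hidden, weights)
    return -sum((m - t) ** 2 for m, t in zip(mixed, target))


best_weights, best_score = hierarchical_grid_search(toy_score, n_models=2)
```

Refining the grid level by level keeps the number of evaluations at `levels * points**n_models` instead of the cost of one very fine grid, which is the usual reason to prefer a hierarchical search when each evaluation is expensive.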

[178] Ensemble Kalman filter for uncertainty in human language comprehension

Diksha Bhandari,Alessandro Lopopolo,Milena Rabovsky,Sebastian Reich

Main category: cs.CL

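TL;DR: The paper proposes a Bayesian framework for sentence comprehension that quantifies uncertainty with an ensemble Kalman filter, addressing the deterministic behavior of conventional artificial neural networks in language processing.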

Details

Motivation: Conventional artificial neural networks (such as the Sentence Gestalt model) handle uncertainty poorly and cannot mimic how humans process ambiguous or unexpected input. Method: Adopts a Bayesian framework, applying an extension of the ensemble Kalman filter for Bayesian inference and framing language comprehension as an inverse problem. Result: Numerical experiments show that the Bayesian approach represents uncertainty better than maximum likelihood estimation and comes closer to human cognition. Conclusion: The Bayesian approach improves the model's handling of linguistic ambiguity and better matches the cognitive mechanisms of human sentence comprehension. Abstract: Artificial neural networks (ANNs) are widely used in modeling sentence processing but often exhibit deterministic behavior, contrasting with human sentence comprehension, which manages uncertainty during ambiguous or unexpected inputs. This is exemplified by reversal anomalies (sentences with unexpected role reversals that challenge syntax and semantics), highlighting the limitations of traditional ANN models, such as the Sentence Gestalt (SG) Model. To address these limitations, we propose a Bayesian framework for sentence comprehension, applying an extension of the ensemble Kalman filter (EnKF) for Bayesian inference to quantify uncertainty. By framing language comprehension as a Bayesian inverse problem, this approach enhances the SG model's ability to reflect human sentence processing with respect to the representation of uncertainty. Numerical experiments and comparisons with maximum likelihood estimation (MLE) demonstrate that Bayesian methods improve uncertainty representation, enabling the model to better approximate human cognitive processing when dealing with linguistic ambiguities.
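
For readers unfamiliar with the ensemble Kalman filter, a scalar toy problem shows the core update. The sketch below is a textbook stochastic (perturbed-observation) EnKF analysis step, not the authors' extension and not tied to the Sentence Gestalt model; the function name `enkf_update`, the linear observation operator `h`, and every number are assumptions chosen for illustration.

```python
import random
from statistics import mean


def enkf_update(ensemble, y_obs, h, obs_var, rng):
    """One stochastic EnKF analysis step for a scalar state.

    ensemble: prior samples of the state x
    y_obs:    observed value, assumed to follow y = h * x + noise
    h:        scalar linear observation operator (illustrative assumption)
    obs_var:  variance of the observation noise
    rng:      random.Random instance used to perturb the observation
    """
    n = len(ensemble)
    predicted = [h * x for x in ensemble]
    x_bar, y_bar = mean(ensemble), mean(predicted)
    # Covariances are estimated from the ensemble itself.
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ensemble, predicted)) / (n - 1)
    cov_yy = sum((y - y_bar) ** 2 for y in predicted) / (n - 1)
    gain = cov_xy / (cov_yy + obs_var)  # Kalman gain
    # Each member assimilates its own noisy copy of the observation.
    return [
        x + gain * (y_obs + rng.gauss(0.0, obs_var ** 0.5) - y)
        for x, y in zip(ensemble, predicted)
    ]


rng = random.Random(0)
prior = [rng.gauss(0.0, 2.0) for _ in range(500)]
posterior = enkf_update(prior, y_obs=3.0, h=1.0, obs_var=0.5, rng=rng)
```

The updated ensemble remains a set of samples rather than a single point estimate, so its spread can be read as the remaining uncertainty, which is the property the abstract contrasts with maximum likelihood estimation.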

[179] Automatic Proficiency Assessment in L2 English Learners

Armita Mohammadi,Alessandro Lameiras Koerich,Laureano Moro-Velazquez,Patrick Cardinal

Main category: cs.CL

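TL;DR: This paper explores deep learning techniques (such as CNN, ResNet, wav2vec 2.0, and BERT) for automatically assessing English second-language (L2) proficiency from both speech and text, and shows the potential of the pretrained wav2vec 2.0 model.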

Details

Motivation: Conventional L2 proficiency assessment relies on human raters and suffers from inter- and intra-rater variability, so a more objective, automated method is needed. Method: Analyzes speech data with several deep learning architectures (2D CNN, frequency-based CNN, ResNet, wav2vec 2.0), fine-tunes a BERT model on text data, and also explores dialogue assessment. Result: Experiments on EFCamDat, ANGLISH, and a private dataset show that the pretrained wav2vec 2.0 model performs strongly in automated assessment. Conclusion: Deep learning, and wav2vec 2.0 in particular, offers an efficient and reliable automated approach to L2 proficiency assessment. Abstract: Second language (L2) proficiency in English is usually perceptually evaluated by English teachers or expert evaluators, with inherent intra- and inter-rater variability. This paper explores deep learning techniques for comprehensive L2 proficiency assessment, addressing both the speech signal and its corresponding transcription. We analyze spoken proficiency classification prediction using diverse architectures, including 2D CNN, frequency-based CNN, ResNet, and a pretrained wav2vec 2.0 model. Additionally, we examine text-based proficiency assessment by fine-tuning a BERT language model within resource constraints. Finally, we tackle the complex task of spontaneous dialogue assessment, managing long-form audio and speaker interactions through separate applications of wav2vec 2.0 and BERT models. Results from experiments on EFCamDat and ANGLISH datasets and a private dataset highlight the potential of deep learning, especially the pretrained wav2vec 2.0 model, for robust automated L2 proficiency evaluation.

[180] LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

Qingkai Fang,Yan Zhou,Shoutao Guo,Shaolei Zhang,Yang Feng

Main category: cs.CL

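TL;DR: LLaMA-Omni 2 is a series of speech language models built on the Qwen2.5 series, ranging from 0.5B to 14B parameters, capable of high-quality real-time speech interaction.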

Details

Motivation: Next-generation human-computer interaction needs real-time, intelligent, and natural speech interaction, and spoken chatbots built on large language models have shown promise. Method: LLaMA-Omni 2 integrates a speech encoder and an autoregressive streaming speech decoder and is trained on only 200K multi-turn speech dialogue samples. Result: It performs strongly on several spoken question answering and speech instruction following benchmarks, surpassing GLM-4-Voice, which was trained on millions of hours of speech data. Conclusion: LLaMA-Omni 2 shows that high-performance speech interaction is achievable with limited data. Abstract: Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.

[181] Proper Name Diacritization for Arabic Wikipedia: A Benchmark Dataset

Rawan Bondok,Mayar Nassar,Salam Khalifa,Kurt Micallaf,Nizar Habash

Main category: cs.CL

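TL;DR: The paper studies undiacritized proper names in Arabic Wikipedia, introduces a new manually diacritized dataset, and benchmarks GPT-4o on the task of restoring diacritics.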

Details

Motivation: Proper names in Arabic Wikipedia are often undiacritized, which creates ambiguity in pronunciation and interpretation, especially for transliterated names of foreign origin. Transliteration and diacritization have each been studied in Arabic NLP, but their intersection remains underexplored. Method: The authors create a manually diacritized dataset of Arabic proper names with their English Wikipedia equivalents, document the annotation challenges and guidelines, and then benchmark GPT-4o on recovering full diacritization from the undiacritized form. Result: GPT-4o reaches 73% accuracy on the task, which indicates that the task is difficult and that better models and resources are needed. Conclusion: The paper highlights how challenging proper name diacritization in Arabic is and releases the dataset to support further research. Abstract: Proper names in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper names of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. Achieving 73% accuracy, our results underscore both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper name diacritization.

[182] A Survey on Progress in LLM Alignment from the Perspective of Reward Design

Miaomiao Ji,Yanqiu Wu,Zhibin Wu,Shoujin Wang,Jian Yang,Mark Dras,Usman Naseem

Main category: cs.CL

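TL;DR: This paper systematically examines reward mechanism design in large language model (LLM) alignment, proposes a three-phase theoretical framework, and analyzes evolutionary trends in reward modeling.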

Details

Motivation: Addresses the core challenge of aligning LLMs with human values and intentions and aims to advance the design of reward mechanisms. Method: Builds a classification framework through a four-dimensional analysis (construction basis, format, expression, and granularity) and studies the three-phase development of reward mechanisms. Result: Reveals evolutionary trends in reward modeling and identifies a shift from reinforcement learning to novel optimization paradigms. Conclusion: Outlines future research directions for LLM alignment through innovative reward design strategies. Abstract: The alignment of large language models (LLMs) with human values and intentions represents a core challenge in current AI research, where reward mechanism design has become a critical factor in shaping model behavior. This study conducts a comprehensive investigation of reward mechanisms in LLM alignment through a systematic theoretical framework, categorizing their development into three key phases: (1) feedback (diagnosis), (2) reward design (prescription), and (3) optimization (treatment). Through a four-dimensional analysis encompassing construction basis, format, expression, and granularity, this research establishes a systematic classification framework that reveals evolutionary trends in reward modeling. The field of LLM alignment faces several persistent challenges, while recent advances in reward design are driving significant paradigm shifts. Notable developments include the transition from reinforcement learning-based frameworks to novel optimization paradigms, as well as enhanced capabilities to address complex alignment scenarios involving multimodal integration and concurrent task coordination. Finally, this survey outlines promising future research directions for LLM alignment through innovative reward design strategies.

[183] Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models

Xiaobao Wu

Main category: cs.CL

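TL;DR: The paper surveys the paradigm of learning from rewards for large language models (LLMs), covering techniques at the training, inference, and post-inference stages, discusses benchmarks and applications of reward models, and points out challenges and future directions.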

Details

Motivation: Explores how reward signals can steer LLM behavior, moving from learning on static data to learning from dynamic feedback, in order to give models aligned preferences and deep reasoning capabilities. Method: Categorizes and analyzes reward-based learning strategies, including reinforcement learning (as in RLHF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Result: Summarizes how reward models are applied at each stage and provides a collection of related papers. Conclusion: Learning from rewards is a key paradigm in LLM development, but open challenges remain and future directions need to be explored. Abstract: Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unified paradigm has arisen: Learning from Rewards, where reward signals act as the guiding stars to steer LLM behavior. It has underpinned a wide range of prevalent techniques, such as reinforcement learning (in RLHF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, this paradigm enables the transition from passive learning from static data to active learning from dynamic feedback. This endows LLMs with aligned preferences and deep reasoning capabilities. In this survey, we present a comprehensive overview of the paradigm of learning from rewards. We categorize and analyze the strategies under this paradigm across training, inference, and post-inference stages. We further discuss the benchmarks for reward models and the primary applications. Finally, we highlight the challenges and future directions. We maintain a paper collection at https://github.com/bobxwu/learning-from-rewards-llm-papers.

[184] fastabx: A library for efficient computation of ABX discriminability

Maxime Poli,Emmanuel Chemla,Emmanuel Dupoux

Main category: cs.CL

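TL;DR: fastabx is a high-performance Python library for building ABX discrimination tasks; it fills a tooling gap and supports rapid development and efficient computation.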

Details

Motivation: ABX tasks are widely used to evaluate self-supervised speech representations, but the lack of adequate tools has limited their broader adoption. Method: fastabx provides a framework that can construct any type of ABX task and efficiently compute distances between representations. Result: fastabx gives the representation learning community a valuable resource and supports research across domains. Conclusion: fastabx is expected to advance research on representation learning, and its source code is openly available. Abstract: We introduce fastabx, a high-performance Python library for building ABX discrimination tasks. ABX is a measure of the separation between generic categories of interest. It has been used extensively to evaluate phonetic discriminability in self-supervised speech representations. However, its broader adoption has been limited by the absence of adequate tools. fastabx addresses this gap by providing a framework capable of constructing any type of ABX task while delivering the efficiency necessary for rapid development cycles, both in task creation and in calculating distances between representations. We believe that fastabx will serve as a valuable resource for the broader representation learning community, enabling researchers to systematically investigate what information can be directly extracted from learned representations across several domains beyond speech processing. The source code is available at https://github.com/bootphon/fastabx.
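
To make the ABX measure concrete: given items A and X from one category and B from another, a triple counts as a success when X lies closer to A than to B. The sketch below is a plain-Python illustration of that idea and does not use or reproduce the fastabx API; the function name `abx_score`, the Euclidean distance, the tie-handling rule, and the toy 2-D points are all assumptions.

```python
from itertools import product
from math import dist


def abx_score(a_items, b_items, x_items, distance=dist):
    """Fraction of (A, B, X) triples in which X is closer to A than to B.

    a_items, x_items: representations drawn from the same category
    b_items:          representations drawn from a different category
    Ties count as half a success, so 1.0 means the categories are
    perfectly separated and 0.5 corresponds to chance.
    """
    total = 0.0
    count = 0
    for a, b, x in product(a_items, b_items, x_items):
        d_ax, d_bx = distance(a, x), distance(b, x)
        if d_ax < d_bx:
            total += 1.0
        elif d_ax == d_bx:
            total += 0.5
        count += 1
    return total / count


# Toy 2-D "representations" for two made-up categories.
cat_a = [(0.0, 0.0), (0.2, 0.1)]
cat_a_probe = [(0.1, 0.0)]  # X items, same category as A
cat_b = [(1.0, 1.0), (0.9, 1.2)]

score = abx_score(cat_a, cat_b, cat_a_probe)
```

The triple loop grows with the product of the three set sizes, which hints at why an efficient implementation matters once real speech representations with many frames per item are involved.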

[185] Bye-bye, Bluebook? Automating Legal Procedure with Large Language Models

Matthew Dahl

Main category: cs.CL

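TL;DR: The paper studies how well large language models (LLMs) follow complex legal citation rules (such as The Bluebook) and finds that their compliance is limited and that in-context learning brings little improvement.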

Details

Motivation: Evaluates whether LLMs can strictly follow complex legal citation rules, in order to gauge their potential for automation in legal practice. Method: Constructs an original dataset of 866 Bluebook tasks and tests the compliance of several flagship LLMs. Result: LLMs produce fully compliant citations only 69%-74% of the time, and in-context learning raises accuracy only to 77%. Conclusion: Off-the-shelf LLMs should be used with caution for automation in areas of law where procedural fidelity is paramount. Abstract: Legal practice requires careful adherence to procedural rules. In the United States, few are more complex than those found in The Bluebook: A Uniform System of Citation. Compliance with this system's 500+ pages of byzantine formatting instructions is the raison d'être of thousands of student law review editors and the bête noire of lawyers everywhere. To evaluate whether large language models (LLMs) are able to adhere to the procedures of such a complicated system, we construct an original dataset of 866 Bluebook tasks and test flagship LLMs from OpenAI, Anthropic, Google, Meta, and DeepSeek. We show (1) that these models produce fully compliant Bluebook citations only 69%-74% of the time and (2) that in-context learning on the Bluebook's underlying system of rules raises accuracy only to 77%. These results caution against using off-the-shelf LLMs to automate aspects of the law where fidelity to procedure is paramount.

[186] ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations

Dmitriy Shopkhoev,Ammar Ali,Magauiya Zhussip,Valentin Malykh,Stamatios Lefkimmiatis,Nikos Komodakis,Sergey Zagoruyko

Main category: cs.CL

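TL;DR: ReplaceMe is a general training-free depth pruning method that replaces transformer blocks with a linear operation while maintaining high performance at low compression ratios.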

Details

Motivation: Conventional pruning methods require additional training or fine-tuning, whereas ReplaceMe needs only a small amount of calibration data to estimate a linear transformation and adds no extra parameters. Method: Uses a calibration dataset to estimate a linear mapping that approximates the pruned blocks and merges it seamlessly into the remaining transformer blocks. Result: Across several large language models, ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original performance, with minimal computational overhead. Conclusion: Without any training, ReplaceMe outperforms other training-free pruning methods and is competitive with state-of-the-art methods that require retraining. Abstract: We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation to approximate the pruned blocks. This estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, without any training or healing steps, resulting in minimal computational overhead (see Fig. 1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at this repository.
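
The central step, estimating a linear map from calibration activations so that it stands in for the removed blocks, can be sketched with ordinary least squares. This is an illustration under assumptions rather than the paper's procedure: the least-squares objective, the helper names (`matmul`, `transpose`, `solve`, `fit_linear_replacement`), and the tiny two-dimensional "activations" are hypothetical, and the merging of the map into the remaining blocks is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def solve(a, b):
    """Solve a @ t = b for t by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_replacement(x, y):
    """Least-squares T minimizing ||x @ T - y||^2 via the normal equations.

    x: calibration activations entering the blocks to be pruned (n x d)
    y: activations leaving those blocks on the same inputs (n x d)
    """
    xt = transpose(x)
    return solve(matmul(xt, x), matmul(xt, y))


# Hypothetical calibration data: y was produced by a known linear map,
# so the fit should recover that map.
true_map = [[2.0, 0.0], [1.0, 3.0]]
x_calib = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
y_calib = matmul(x_calib, true_map)

estimated = fit_linear_replacement(x_calib, y_calib)
```

Real transformer blocks are nonlinear, so a fitted map only approximates them; the toy data here is exactly linear purely so that the recovered matrix can be checked against a known answer.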

cs.RO [Back]

[187] Aerial Path Online Planning for Urban Scene Updation

Mingfeng Tang,Ziyuan Xie,Ke Xie,Hui Huang,Jianwei Hu,Ningna Wang,Xiaohu Guo

Main category: cs.RO

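TL;DR: Proposes a UAV path planning algorithm for detecting and updating changed areas in urban environments; by using prior reconstructions and change probability statistics, it significantly reduces flight time and computational overhead.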

Details

Motivation: Existing methods for large-scale 3D urban scene reconstruction are inefficient when scenes need periodic updates, because they usually re-explore and reconstruct the entire scene and waste resources on unchanged areas. Method: Combines prior reconstructions with change probability statistics, introduces a novel changeability heuristic, plans a static prior path and a dynamic real-time path, and integrates surface sampling and candidate view generation strategies. Result: Experiments on real-world urban datasets show that the method significantly reduces flight time and computational overhead while maintaining update quality comparable to full-scene reconstruction. Conclusion: The method opens a path to efficient, scalable, and adaptive UAV-based scene updates in complex urban environments. Abstract: We present the first scene-update aerial path planning algorithm specifically designed for detecting and updating change areas in urban environments. While existing methods for large-scale 3D urban scene reconstruction focus on achieving high accuracy and completeness, they are inefficient for scenarios requiring periodic updates, as they often re-explore and reconstruct entire scenes, wasting significant time and resources on unchanged areas. To address this limitation, our method leverages prior reconstructions and change probability statistics to guide UAVs in detecting and focusing on areas likely to have changed. Our approach introduces a novel changeability heuristic to evaluate the likelihood of changes, driving the planning of two flight paths: a prior path informed by static priors and a dynamic real-time path that adapts to newly detected changes. The framework integrates surface sampling and candidate view generation strategies, ensuring efficient coverage of change areas with minimal redundancy. Extensive experiments on real-world urban datasets demonstrate that our method significantly reduces flight time and computational overhead, while maintaining high-quality updates comparable to full-scene re-exploration and reconstruction. These contributions pave the way for efficient, scalable, and adaptive UAV-based scene updates in complex urban environments.

[188] RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation

Kaidong Zhang,Rongtao Xu,Pengzhen Ren,Junfan Lin,Hefeng Wu,Liang Lin,Xiaodan Liang

Main category: cs.RO

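TL;DR: RoBridge is a hierarchical intelligent architecture that combines cognitive planning with reinforcement learning and significantly improves robot manipulation in open environments.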

Details

Motivation: Addresses the procedural skill dilemma and the declarative skill dilemma that robots face in open environments, while preserving both cognitive and executive capabilities. Method: Proposes the RoBridge architecture, which consists of a high-level cognitive planner based on a large-scale pre-trained vision-language model, an invariant operable representation, and a generalist embodied agent. Result: Achieves a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. Conclusion: RoBridge offers a new paradigm for integrating cognitive reasoning with physical execution in robotic systems and advances general robotic manipulation. Abstract: Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.

[189] Estimating Commonsense Scene Composition on Belief Scene Graphs

Mario A. V. Saucedo,Vignesh Kottayam Viswanathan,Christoforos Kanellakis,George Nikolakopoulos

Main category: cs.RO

TL;DR: Proposes commonsense scene composition, which extends Belief Scene Graphs by estimating the spatial distribution of unseen objects.

Details

Motivation: Studies how to understand the spatial relationships among related objects in a scene, modeled as a joint probability distribution over all possible locations of a semantic object class. Method: Proposes two variants of a Correlation Information (CECI) model: a baseline based on a Graph Convolutional Network and a neuro-symbolic extension that incorporates spatial knowledge from LLMs. Result: The framework is validated on simulated data and in a real-world indoor environment, demonstrating its ability to spatially interpret scenes across different room types. Conclusion: The framework achieves commonsense scene composition and provides a new approach to scene understanding. Abstract: This work establishes the concept of commonsense scene composition, with a focus on extending Belief Scene Graphs by estimating the spatial distribution of unseen objects. Specifically, the commonsense scene composition capability refers to the understanding of the spatial relationships among related objects in the scene, which in this article is modeled as a joint probability distribution for all possible locations of the semantic object class. The proposed framework includes two variants of a Correlation Information (CECI) model for learning probability distributions: (i) a baseline approach based on a Graph Convolutional Network, and (ii) a neuro-symbolic extension that integrates a spatial ontology based on Large Language Models (LLMs). Furthermore, this article provides a detailed description of the dataset generation process for such tasks. Finally, the framework has been validated through multiple runs on simulated data, as well as in a real-world indoor environment, demonstrating its ability to spatially interpret scenes across different room types.

[190] Point Cloud Recombination: Systematic Real Data Augmentation Using Robotic Targets for LiDAR Perception Validation

Hubert Padusinski,Christian Steinhauser,Christian Scherl,Julian Gaal,Jacob Langner

Main category: cs.RO

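TL;DR: Proposes a point cloud recombination method that systematically augments real point cloud scenes with physical target objects measured in a controlled laboratory environment, producing large numbers of repeatable, physically accurate test scenes.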

Details

Motivation: Addresses two problems in LiDAR perception validation: real-world data offers too little control, and virtual simulation lacks physical sensor characteristics. Method: Proposes Point Cloud Recombination, which integrates point clouds of target objects measured in the laboratory into real scenes to generate repeatable test scenes. Result: The recombined scenes closely match real sensor outputs, enabling targeted testing, scalable failure analysis, and improved system safety. Conclusion: The method provides controlled yet sensor-realistic data that supports reliable assessment of the limitations of specific sensors together with their algorithms. Abstract: The validation of LiDAR-based perception of intelligent mobile systems operating in open-world applications remains a challenge due to the variability of real environmental conditions. Virtual simulations allow the generation of arbitrary scenes under controlled conditions but lack physical sensor characteristics, such as intensity responses or material-dependent effects. In contrast, real-world data offers true sensor realism but provides less control over influencing factors, hindering sufficient validation. Existing approaches address this problem with augmentation of real-world point cloud data by transferring objects between scenes. However, these methods do not consider validation and remain limited in controllability because they rely on empirical data. We solve these limitations by proposing Point Cloud Recombination, which systematically augments captured point cloud scenes by integrating point clouds acquired from physical target objects measured in controlled laboratory environments. This enables the creation of vast amounts and varieties of repeatable, physically accurate test scenes with respect to phenomena-aware occlusions with registered 3D meshes. Using the Ouster OS1-128 Rev7 sensor, we demonstrate the augmentation of real-world urban and rural scenes with humanoid targets featuring varied clothing and poses, for repeatable positioning. We show that the recombined scenes closely match real sensor outputs, enabling targeted testing, scalable failure analysis, and improved system safety. By providing controlled yet sensor-realistic data, our method enables trustworthy conclusions about the limitations of specific sensors in combination with their algorithms, e.g., object detection.

[191] Grasp the Graph (GtG) 2.0: Ensemble of GNNs for High-Precision Grasp Pose Detection in Clutter

Ali Rashidi Moghadam,Sayedmohammadreza Rastegari,Mehdi Tale Masouleh,Ahmad Kalhor

Main category: cs.RO

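TL;DR: Grasp the Graph 2.0 (GtG 2.0) is an efficient robotic grasping framework that uses an ensemble of graph neural networks and 7-DoF grasp candidate generation to significantly improve grasping performance in cluttered environments.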

Details

Motivation: Grasp pose detection in cluttered, real-world environments remains challenging because of noisy and incomplete sensory data and complex object geometries. Method: GtG 2.0 combines a conventional grasp pose generator with an ensemble of graph neural networks and uses both inside and outside points to improve geometric reasoning. Result: On the GraspNet-1Billion benchmark, Average Precision improves by up to 35%; real-robot experiments show a 91% success rate and a 100% clutter completion rate. Conclusion: GtG 2.0 performs strongly in both accuracy and reliability, ranking among the top grasping frameworks on the benchmark. Abstract: Grasp pose detection in cluttered, real-world environments remains a significant challenge due to noisy and incomplete sensory data combined with complex object geometries. This paper introduces Grasp the Graph 2.0 (GtG 2.0) method, a lightweight yet highly effective hypothesis-and-test robotics grasping framework which leverages an ensemble of Graph Neural Networks for efficient geometric reasoning from point cloud data. Building on the success of GtG 1.0, which demonstrated the potential of Graph Neural Networks for grasp detection but was limited by assumptions of complete, noise-free point clouds and 4-DoF grasping, GtG 2.0 employs a conventional Grasp Pose Generator to efficiently produce 7-DoF grasp candidates. Candidates are assessed with an ensemble Graph Neural Network model which includes points within the gripper jaws (inside points) and surrounding contextual points (outside points). This improved representation boosts grasp detection performance over previous methods using the same generator. GtG 2.0 shows up to a 35% improvement in Average Precision on the GraspNet-1Billion benchmark compared to hypothesis-and-test and Graph Neural Network-based methods, ranking it among the top three frameworks. Experiments with a 3-DoF Delta Parallel robot and Kinect-v1 camera show a success rate of 91% and a clutter completion rate of 100%, demonstrating its flexibility and reliability.

[192] TWIST: Teleoperated Whole-Body Imitation System

Yanjie Ze,Zixuan Chen,João Pedro Araújo,Zi-ang Cao,Xue Bin Peng,Jiajun Wu,C. Karen Liu

Main category: cs.RO

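TL;DR: TWIST teleoperates humanoid robots through whole-body motion imitation, combining reinforcement learning and behavior cloning to significantly improve the robot's coordination and versatility.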

Details

Motivation: Current humanoid teleoperation systems are usually limited to isolated tasks and lack whole-body coordination, which TWIST aims to address. Method: Retargets human motion capture data to the robot and develops an adaptive controller using a combination of reinforcement learning and behavior cloning (RL+BC). Result: TWIST achieves unprecedented coordinated whole-body motor skills, including whole-body manipulation, legged manipulation, locomotion, and expressive movement. Conclusion: TWIST lays important groundwork for general-purpose robotic intelligence and demonstrates the potential of whole-body teleoperation. Abstract: Teleoperating humanoid robots in a whole-body manner marks a fundamental step toward developing general-purpose robotic intelligence, with human motion providing an ideal interface for controlling all degrees of freedom. Yet, most current humanoid teleoperation systems fall short of enabling coordinated whole-body behavior, typically limiting themselves to isolated locomotion or manipulation tasks. We present the Teleoperated Whole-Body Imitation System (TWIST), a system for humanoid teleoperation through whole-body motion imitation. We first generate reference motion clips by retargeting human motion capture data to the humanoid robot. We then develop a robust, adaptive, and responsive whole-body controller using a combination of reinforcement learning and behavior cloning (RL+BC). Through systematic analysis, we demonstrate how incorporating privileged future motion frames and real-world motion capture (MoCap) data improves tracking accuracy. TWIST enables real-world humanoid robots to achieve unprecedented, versatile, and coordinated whole-body motor skills (spanning whole-body manipulation, legged manipulation, locomotion, and expressive movement) using a single unified neural network controller. Our project website: https://humanoid-teleop.github.io

q-bio.QM [Back]

[193] Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations

Cong Qi,Hanzhang Fang,Siqi jiang,Tianxing Hu,Wei Zhi

Main category: q-bio.QM

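TL;DR: LANTERN is a deep learning framework that combines protein language models with chemical representations of peptides to predict TCR-pMHC binding specificity, and it performs strongly in zero-shot and few-shot learning.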

Details

Motivation: Current predictive models generalize poorly in data-scarce settings and on novel epitopes, which limits immunotherapy and vaccine development. Method: Encodes TCR β-chain sequences with ESM-1b and processes peptide SMILES strings with MolFormer, combining biological and chemical features. Result: Outperforms existing models (such as ChemBERTa, TITAN, and NetTCR) in zero-shot and few-shot learning, and embedding analysis shows significant clustering improvements. Conclusion: LANTERN has the potential to advance TCR-pMHC binding prediction and support the development of personalized immunotherapies. Abstract: Understanding the binding specificity between T-cell receptors (TCRs) and peptide-major histocompatibility complexes (pMHCs) is central to immunotherapy and vaccine development. However, current predictive models struggle with generalization, especially in data-scarce settings and when faced with novel epitopes. We present LANTERN (Large lAnguage model-powered TCR-Enhanced Recognition Network), a deep learning framework that combines large-scale protein language models with chemical representations of peptides. By encoding TCR β-chain sequences using ESM-1b and transforming peptide sequences into SMILES strings processed by MolFormer, LANTERN captures rich biological and chemical features critical for TCR-peptide recognition. Through extensive benchmarking against existing models such as ChemBERTa, TITAN, and NetTCR, LANTERN demonstrates superior performance, particularly in zero-shot and few-shot learning scenarios. Our model also benefits from a robust negative sampling strategy and shows significant clustering improvements via embedding analysis. These results highlight the potential of LANTERN to advance TCR-pMHC binding prediction and support the development of personalized immunotherapies.

cs.SD [Back]

[194] Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network

Junyan Wu,Wenbo Xu,Wei Lu,Xiangyang Luo,Rui Yang,Shize Guo

Main category: cs.SD

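TL;DR: The paper proposes LOCO, a weakly supervised method for audio temporal forgery localization that improves localization performance through audio-language co-learning and self-supervision, without fine-grained annotations.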

Details

Motivation: Existing methods depend on costly fine-grained annotations and adapt poorly to real-world scenarios, which motivates a weakly supervised solution. Method: Designs an audio-language co-learning module and a forgery localization module, combined with a progressive refinement strategy that generates pseudo labels and optimizes features. Result: Achieves SOTA performance on three public benchmarks. Conclusion: LOCO effectively improves audio forgery localization under weak supervision and has practical potential. Abstract: Audio temporal forgery localization (ATFL) aims to find the precise forgery regions of the partial spoof audio that is purposefully modified. Existing ATFL methods rely on training efficient networks using fine-grained annotations, which are costly and challenging to obtain in real-world scenarios. To meet this challenge, in this paper, we propose a progressive audio-language co-learning network (LOCO) that adopts co-learning and self-supervision to promote localization performance under weak supervision scenarios. Specifically, an audio-language co-learning module is first designed to capture forgery consensus features by aligning semantics from temporal and global perspectives. In this module, forgery-aware prompts are constructed by using utterance-level annotations together with learnable prompts, which can incorporate semantic priors into temporal content features dynamically. In addition, a forgery localization module is applied to produce forgery proposals based on fused forgery-class activation sequences. Finally, a progressive refinement strategy is introduced to generate pseudo frame-level labels and leverage supervised semantic contrastive learning to amplify the semantic distinction between real and fake content, thereby continuously optimizing forgery-aware features. Extensive experiments show that the proposed LOCO achieves SOTA performance on three public benchmarks.

cs.IR [Back]

[195] AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine

Carlo Siebenschuh,Kyle Hippe,Ozan Gokdemir,Alexander Brace,Arham Khan,Khalid Hossain,Yadu Babuji,Nicholas Chia,Venkatram Vishwanath,Rick Stevens,Arvind Ramanathan,Ian Foster,Robert Underwood

Main category: cs.IR

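TL;DR: AdaParse is an adaptive parallel PDF parsing engine that uses a data-driven strategy to choose the most suitable parser for each document, taking human preferences and hardware requirements into account to significantly improve parsing throughput at comparable accuracy.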

Details

Motivation: Most scientific documents are distributed as PDFs; many parsing methods exist, but choosing among them is hard and requires balancing computational cost against accuracy. Method: Introduces AdaParse, which combines direct preference optimization (DPO) with resource scheduling to select parsers dynamically. Result: On 1000 scientific documents, AdaParse improves throughput 17-fold with slightly better accuracy (0.2 percent). Conclusion: AdaParse's efficiency and scalability make it suitable for parsing large-scale scientific document corpora and support the construction of high-quality text datasets. Abstract: Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by $17\times$ while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at https://github.com/7shoe/AdaParse/

[196] Exploring new Approaches for Information Retrieval through Natural Language Processing

Manak Raj,Nidhi Mishra

Main category: cs.IR

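TL;DR: This paper reviews recent advances and emerging approaches in Information Retrieval (IR) applied to Natural Language Processing (NLP), covering traditional models and modern techniques, and discusses related tools, applications, and future challenges.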

Details

Motivation: Explore recent developments in information retrieval within natural language processing in order to improve retrieval efficiency and broaden its applications. Method: Analyzes traditional IR models (such as Boolean and vector space models) and modern techniques (such as deep learning, reinforcement learning, and BERT), and compares sparse, dense, and hybrid retrieval methods. Result: Summarizes how IR performs across several application areas and discusses the practical value of existing tools such as Lucene and Anserini. Conclusion: Identifies future research directions for IR concerning accuracy, scalability, and ethical issues. Abstract: This review paper explores recent advancements and emerging approaches in Information Retrieval (IR) applied to Natural Language Processing (NLP). We examine traditional IR models such as Boolean, vector space, probabilistic, and inference network models, and highlight modern techniques including deep learning, reinforcement learning, and pretrained transformer models like BERT. We discuss key tools and libraries - Lucene, Anserini, and Pyserini - for efficient text indexing and search. A comparative analysis of sparse, dense, and hybrid retrieval methods is presented, along with applications in web search engines, cross-language IR, argument mining, private information retrieval, and hate speech detection. Finally, we identify open challenges and future research directions to enhance retrieval accuracy, scalability, and ethical considerations.
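One common way to build a hybrid retriever from a sparse ranking and a dense ranking is reciprocal rank fusion (RRF). The abstract does not say which fusion method the review covers, so the sketch below is only an illustration of the hybrid idea; the document IDs are made up and `k=60` is a conventional default rather than a value taken from the paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with RRF.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending fused score,
    with ties broken by document ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a sparse (e.g. BM25) and a dense retriever.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Documents that rank well in both lists (`d1`, `d3`) rise above those found by only one retriever.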

[197] Predicting Movie Hits Before They Happen with LLMs

Shaghayegh Agah,Yejin Kim,Neeraj Sharma,Mayur Nankani,Kevin Foley,H. Howie Huang,Sardar Hamidian

Main category: cs.IR

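TL;DR: The paper proposes using Large Language Models (LLMs) to predict the popularity of cold-start movies, addressing the cold-start problem in content recommendation.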

Details

Motivation: Address the cold-start problem for movie recommendation on an entertainment platform and ensure that potentially overlooked movies receive fair promotion. Method: Uses movie metadata and Large Language Models (LLMs) to predict the popularity of cold-start movies. Result: In experiments, the method outperforms established baselines and baselines developed by the authors. Conclusion: The method handles the cold-start problem effectively and can be integrated into recommendation systems or used as a tool by editorial teams. Abstract: Addressing the cold-start issue in content recommendation remains a critical ongoing challenge. In this work, we focus on tackling the cold-start problem for movies on a large entertainment platform. Our primary goal is to forecast the popularity of cold-start movies using Large Language Models (LLMs) leveraging movie metadata. This method could be integrated into retrieval systems within the personalization pipeline or could be adopted as a tool for editorial teams to ensure fair promotion of potentially overlooked movies that may be missed by traditional or algorithmic solutions. Our study validates the effectiveness of this approach compared to established baselines and those we developed.

[198] A Multi-Granularity Multimodal Retrieval Framework for Multimodal Document Tasks

Mingjun Xu,Zehui Wang,Hengxing Cai,Renxin Zhong

Main category: cs.IR

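TL;DR: Proposes a unified multi-granularity multimodal retrieval framework for visually rich documents that combines hierarchical encoding with modality-aware retrieval and achieves strong performance without task-specific fine-tuning.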

Details

Motivation: Address the limitations of existing retrieval-augmented generation systems on visually rich documents (containing text, images, tables, and charts). Method: Uses hierarchical encoding strategies, a modality-aware retrieval mechanism, and reranking modules, combined with off-the-shelf vision-language models and a training-free hybrid retrieval strategy. Result: Experiments show that layout-aware search and the reranking modules significantly improve retrieval accuracy, with a top performance score of 65.56. Conclusion: The framework demonstrates the potential of scalable and reproducible solutions for multimodal document retrieval systems. Abstract: Retrieval-augmented generation (RAG) systems have predominantly focused on text-based retrieval, limiting their effectiveness in handling visually-rich documents that encompass text, images, tables, and charts. To bridge this gap, we propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and reranking modules to effectively capture and utilize the complex interdependencies between textual and visual modalities. By leveraging off-the-shelf vision-language models and implementing a training-free hybrid retrieval strategy, our framework demonstrates robust performance without the need for task-specific fine-tuning. Experimental evaluations reveal that incorporating layout-aware search and reranking modules significantly enhances retrieval accuracy, achieving a top performance score of 65.56. This work underscores the potential of scalable and reproducible solutions in advancing multimodal document retrieval systems.

[199] RAGAR: Retrieval Augment Personalized Image Generation Guided by Recommendation

Run Ling,Wenji Wang,Yuting Liu,Guibing Guo,Linying Jiang,Xingwei Wang

Main category: cs.IR

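TL;DR: The paper proposes RAGAR, which uses a retrieval mechanism and a ranking task to optimize personalized image generation, addressing weaknesses of existing methods in user preference extraction and their reliance on generation consistency.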

Details

Motivation: Existing methods ignore the semantic similarity between historical items and the reference item when extracting user preferences, and they rely too heavily on generation consistency, which leads to poor personalization. Method: RAGAR uses a retrieval mechanism to assign weights to historical items and introduces a multi-modal ranking model to optimize personalized generation. Result: On three real-world datasets, RAGAR significantly outperforms five baselines on personalization and semantic metrics. Conclusion: By improving preference extraction and generation optimization, RAGAR markedly improves the personalization of generated images. Abstract: Personalized image generation is crucial for improving the user experience, as it renders reference images into preferred ones according to user visual preferences. Although effective, existing methods face two main issues. First, existing methods treat all items in the user historical sequence equally when extracting user preferences, overlooking the varying semantic similarities between historical items and the reference item. Disproportionately high weights for low-similarity items distort users' visual preferences for the reference item. Second, existing methods heavily rely on consistency between generated and reference images to optimize the generation, which leads to underfitting user preferences and hinders personalization. To address these issues, we propose Retrieval Augment Personalized Image GenerAtion guided by Recommendation (RAGAR). Our approach uses a retrieval mechanism to assign different weights to historical items according to their similarities to the reference item, thereby extracting more refined users' visual preferences for the reference item. Then we introduce a novel rank task based on the multi-modal ranking model to optimize the personalization of the generated images instead of forcing dependence on consistency. Extensive experiments and human evaluations on three real-world datasets demonstrate that RAGAR achieves significant improvements in both personalization and semantic metrics compared to five baselines.
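The similarity-based weighting of historical items could look roughly like the sketch below. The cosine similarity, softmax weighting, and `temperature` parameter are assumptions chosen for illustration; the abstract does not specify how RAGAR computes its weights.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def weighted_preference(history, reference, temperature=0.1):
    """Hypothetical weighting: softmax over similarity to the reference item.

    Returns the per-item weights and the weighted average of the history
    embeddings, so low-similarity items contribute little to the preference.
    """
    sims = [cosine(h, reference) for h in history]
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]  # numerically stable
    total = sum(exps)
    weights = [e / total for e in exps]
    pref = [sum(w * h[j] for w, h in zip(weights, history))
            for j in range(len(reference))]
    return weights, pref
```

A lower `temperature` concentrates the weight on the most similar historical items, while a higher one approaches the uniform weighting that the abstract criticizes.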

cs.LG [Back]

[200] DNAZEN: Enhanced Gene Sequence Representations via Mixed Granularities of Coding Units

Lei Mao,Yuanhe Tian,Yan Song

Main category: cs.LG

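TL;DR: DNAZEN is an enhanced genomic representation framework that improves genome modeling by learning from multiple granularities in gene sequences, such as small polymers and G-grams.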

Details

Motivation: Existing methods apply language modeling techniques directly to gene sequences and overlook how units at different granularities contribute to the representation. Method: Proposes the DNAZEN framework, which builds a G-gram vocabulary through an unsupervised approach and uses a Transformer-based G-gram encoder, trained with a whole G-gram masking mechanism. Result: Experiments on benchmark datasets confirm DNAZEN's effectiveness on various downstream tasks. Conclusion: DNAZEN markedly improves the quality of genomic representations through multi-granularity learning. Abstract: Genome modeling conventionally treats gene sequence as a language, reflecting its structured motifs and long-range dependencies analogous to linguistic units and organization principles such as words and syntax. Recent studies utilize advanced neural networks, ranging from convolutional and recurrent models to Transformer-based models, to capture contextual information of gene sequence, with the primary goal of obtaining effective gene sequence representations and thus enhancing the models' understanding of various running gene samples. However, these approaches often directly apply language modeling techniques to gene sequences and do not fully consider the intrinsic information organization in them, where they do not consider how units at different granularities contribute to representation. In this paper, we propose DNAZEN, an enhanced genomic representation framework designed to learn from various granularities in gene sequences, including small polymers and G-grams that are combinations of several contiguous polymers. Specifically, we extract the G-grams from large-scale genomic corpora through an unsupervised approach to construct the G-gram vocabulary, which is used to provide G-grams in the learning process of DNA sequences through dynamically matching from running gene samples. A Transformer-based G-gram encoder is also proposed and the matched G-grams are fed into it to compute their representations and integrated into the encoder for basic unit (E4BU), which is responsible for encoding small units and maintaining the learning and inference process. To further enhance the learning process, we propose whole G-gram masking to train DNAZEN, where the model largely favors the selection of each entire G-gram to mask rather than an ordinary masking mechanism performed on basic units. Experiments on benchmark datasets demonstrate the effectiveness of DNAZEN on various downstream tasks.
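G-gram matching and whole G-gram masking can be sketched as follows. The greedy longest-match rule, the function names, the toy vocabulary, and the single `mask_prob` are assumptions for illustration; the abstract does not describe DNAZEN's matching or masking schedule at this level of detail.

```python
import random


def match_ggrams(tokens, vocab, max_len=4):
    """Greedy left-to-right longest match (assumed rule) against a G-gram vocabulary.

    `vocab` is a set of tuples of basic units; returns (start, end) spans
    covering every matched G-gram of length >= 2.
    """
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in vocab:
                spans.append((i, i + n))
                i += n
                break
        else:
            i += 1  # no G-gram starts here; move to the next basic unit
    return spans


def whole_ggram_mask(tokens, spans, mask_prob, rng, mask_token="[MASK]"):
    """Mask each matched G-gram as a unit: all of its basic units or none."""
    out = list(tokens)
    for start, end in spans:
        if rng.random() < mask_prob:
            out[start:end] = [mask_token] * (end - start)
    return out


# Hypothetical basic units and vocabulary.
tokens = ["ACG", "TTA", "GGC", "CAT", "ACG", "TTA"]
vocab = {("ACG", "TTA"), ("GGC", "CAT", "ACG")}
spans = match_ggrams(tokens, vocab)
masked = whole_ggram_mask(tokens, spans, mask_prob=0.5, rng=random.Random(0))
```

Masking whole spans prevents the model from recovering a masked unit trivially from the other units of the same G-gram.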

[201] Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques

Sanjay Surendranath Girija,Shashank Kapoor,Lakshit Arora,Dipen Pradhan,Aman Raj,Ankit Shetgaonkar

Main category: cs.LG

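TL;DR: This survey examines three main approaches to compressing Large Language Models (LLMs), namely knowledge distillation, model quantization, and model pruning, with the goal of efficient inference in resource-constrained environments.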

Details

Motivation: The resource requirements of LLMs limit their deployment on mobile and edge devices, so compression techniques are needed to optimize the models. Method: Analyzes knowledge distillation, model quantization, and model pruning, and discusses their principles, variants, and example applications. Result: Provides examples of successful applications for each technique and briefly covers complementary techniques such as mixture-of-experts and early-exit strategies. Conclusion: Summarizes future research directions and offers a useful resource for optimizing LLMs for edge deployment. Abstract: Large Language Models (LLMs) have revolutionized many areas of artificial intelligence (AI), but their substantial resource requirements limit their deployment on mobile and edge devices. This survey paper provides a comprehensive overview of techniques for compressing LLMs to enable efficient inference in resource-constrained environments. We examine three primary approaches: Knowledge Distillation, Model Quantization, and Model Pruning. For each technique, we discuss the underlying principles, present different variants, and provide examples of successful applications. We also briefly discuss complementary techniques such as mixture-of-experts and early-exit strategies. Finally, we highlight promising future directions, aiming to provide a valuable resource for both researchers and practitioners seeking to optimize LLMs for edge deployment.
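As a concrete instance of model quantization, the sketch below applies symmetric per-tensor int8 quantization to a list of weights. It is a generic textbook scheme shown for illustration, not a method taken from the survey; real toolchains add per-channel scales, calibration, and packed storage.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    The scale maps the largest absolute weight to 127, so the rounding
    error per weight is at most scale / 2.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Storing `q` as 8-bit integers plus one float scale uses roughly a quarter of the memory of 32-bit floats, at the cost of the bounded rounding error.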

[202] Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL

Jiarui Yao,Yifan Hao,Hanning Zhang,Hanze Dong,Wei Xiong,Nan Jiang,Tong Zhang

Main category: cs.LG

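TL;DR: GVM-RAFT proposes a dynamic sample allocation strategy that optimizes how compute is distributed across prompts, markedly improving the efficiency and accuracy of CoT reasoning training.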

Details

Motivation: Existing CoT reasoning methods do not adjust the allocation of compute dynamically, which makes training inefficient. Method: Proposes GVM-RAFT, which monitors prompt acceptance rates and gradient norms dynamically to minimize gradient variance. Result: Experiments show a 2-4x speedup over RAFT and considerable accuracy improvements. Conclusion: The dynamic sampling strategy is general and can be applied to other reinforcement learning algorithms to improve convergence speed and accuracy. Abstract: Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty and convergence behavior. This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation due to static sampling strategies. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy designed to minimize stochastic gradient variance under a computational budget constraint. The method dynamically allocates computational resources by monitoring prompt acceptance rates and stochastic gradient norms, ensuring that the resulting gradient variance is minimized. Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions. Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy. Our code is available at https://github.com/RLHFlow/GVM.
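Prompt-specific budget allocation can be illustrated with a proportional scheme. The scoring rule below (gradient norm divided by the square root of the acceptance rate) is a placeholder assumed for this sketch, not the allocation rule derived in the paper; only the overall shape, more samples for prompts with low acceptance or large gradients under a fixed budget, follows the abstract.

```python
import math


def allocate_samples(accept_rates, grad_norms, budget, eps=1e-3):
    """Split `budget` samples across prompts in proportion to an assumed score.

    Every prompt gets at least one sample; the remainder is distributed by
    the largest-remainder method so the allocation sums exactly to `budget`.
    """
    n = len(accept_rates)
    if budget < n:
        raise ValueError("budget must allow at least one sample per prompt")
    # Hypothetical score: large gradients and low acceptance need more samples.
    scores = [g / math.sqrt(p + eps) for p, g in zip(accept_rates, grad_norms)]
    total = sum(scores)
    if total == 0:
        scores, total = [1.0] * n, float(n)
    remaining = budget - n
    raw = [remaining * s / total for s in scores]
    alloc = [1 + int(r) for r in raw]
    leftover = budget - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc
```

A uniform strategy would give each of three prompts 10 samples from a budget of 30; with acceptance rates 0.9, 0.1, and 0.5 and equal gradient norms, this sketch shifts most of the budget to the hardest prompt.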

[203] Bielik v3 Small: Technical Report

Krzysztof Ociepa,Łukasz Flis,Remigiusz Kinas,Krzysztof Wróbel,Adrian Gwoździej

Main category: cs.LG

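TL;DR: Bielik v3 is a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish that match the performance of much larger models while requiring substantially fewer computational resources.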

Details

Motivation: Provide high-quality Polish language AI models for resource-constrained applications and fill the gap in parameter-efficient modeling for less-represented languages. Method: Uses a custom Polish tokenizer (APT4), Weighted Instruction Cross-Entropy Loss, and an Adaptive Learning Rate, with training on a curated corpus of 292 billion tokens. Result: The 4.5B model is competitive with much larger models, and the 1.5B model performs strongly for its compact size; both excel across multiple benchmarks. Conclusion: Bielik v3 sets a new standard for parameter-efficient Polish language modeling and advances high-quality AI for resource-constrained settings. Abstract: We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.

[204] Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning

Xuan Lin,Qingrui Liu,Hongxin Xiang,Daojian Zeng,Xiangxiang Zeng

Main category: cs.LG

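TL;DR: ChemDual is a novel LLM framework that builds a large-scale instruction dataset and uses a dual-task learning strategy to markedly improve the accuracy of chemical reaction and retrosynthesis prediction, and it shows potential for drug design.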

Details

Motivation: Chemical reaction and retrosynthesis prediction are key tasks in drug discovery, but existing methods suffer from a lack of datasets and ignore the correlation between the two tasks. Method: ChemDual treats reaction and retrosynthesis as recombination and fragmentation processes, constructs a dataset of 4.4 million instructions, and optimizes the model with a multi-scale tokenizer and a dual-task learning strategy. Result: On the Mol-Instruction and USPTO-50K datasets, ChemDual achieves state-of-the-art performance in both reaction and retrosynthesis prediction and generates compounds with diverse and strong protein binding affinity. Conclusion: ChemDual addresses these challenges with a novel approach and provides an efficient tool for drug design. Abstract: Chemical reaction and retrosynthesis prediction are fundamental tasks in drug discovery. Recently, large language models (LLMs) have shown potential in many domains. However, directly applying LLMs to these tasks faces two major challenges: (i) lacking a large-scale chemical synthesis-related instruction dataset; (ii) ignoring the close correlation between reaction and retrosynthesis prediction for the existing fine-tuning strategies. To address these challenges, we propose ChemDual, a novel LLM framework for accurate chemical synthesis. Specifically, considering the high cost of data acquisition for reaction and retrosynthesis, ChemDual regards the reaction-and-retrosynthesis of molecules as a related recombination-and-fragmentation process and constructs a large-scale instruction dataset of 4.4 million instructions. Furthermore, ChemDual introduces an enhanced LLaMA, equipped with a multi-scale tokenizer and dual-task learning strategy, to jointly optimize the process of recombination and fragmentation as well as the tasks between reaction and retrosynthesis prediction. Extensive experiments on Mol-Instruction and USPTO-50K datasets demonstrate that ChemDual achieves state-of-the-art performance in both predictions of reaction and retrosynthesis, outperforming the existing conventional single-task approaches and the general open-source LLMs. Through molecular docking analysis, ChemDual generates compounds with diverse and strong protein binding affinity, further highlighting its strong potential in drug design.

[205] Always Skip Attention

Yiping Ji,Hemanth Saratchandran,Peyman Moghaddam,Simon Lucey

Main category: cs.LG

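TL;DR: In modern Vision Transformers (ViTs), self-attention fails to train entirely without skip connections, whereas other components still work. The paper theoretically analyzes the ill-conditioning of self-attention and proposes Token Graying as a complement.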

Details

Motivation: Investigate the unique dependence of self-attention on skip connections in ViTs and study the theoretical reasons behind it. Method: Theoretically analyzes the ill-conditioning of the self-attention mechanism and proposes Token Graying as a complement. Result: Validates the effectiveness of Token Graying in both supervised and self-supervised training. Conclusion: The ill-conditioning of self-attention makes it dependent on skip connections, and Token Graying further improves the conditioning of the input tokens. Abstract: We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying -- a simple yet effective complement (to skip connections) that further improves the condition of input tokens. We validate our approach in both supervised and self-supervised training methods.

[206] SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations

Runyi Yu,Yinhuai Wang,Qihan Zhao,Hok Wai Tsui,Jingbo Wang,Ping Tan,Qifeng Chen

Main category: cs.LG

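TL;DR: The paper proposes two data augmentation techniques (STG and STF) to address demonstration noise and coverage limitations in RLID. Combined with an ATS strategy and a history encoding mechanism, they markedly improve the generalization of skill learning.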

Details

Motivation: Existing interaction demonstration data is sparse, disconnected, and noisy, and fails to cover the full range of skill variations and transitions. The paper aims to fill these gaps through data augmentation. Method: Proposes two data augmentation techniques, the Stitched Trajectory Graph (STG) and the State Transition Field (STF), combined with an Adaptive Trajectory Sampling (ATS) strategy and a history encoding mechanism. Result: Experiments show that the method significantly outperforms existing techniques in convergence stability, generalization capability, and recovery robustness. Conclusion: Through data augmentation and dynamic curriculum generation, the paper achieves robust skill learning that goes beyond the reference demonstrations. Abstract: We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID): demonstration noise and coverage limitations. While existing data collection approaches provide valuable interaction demonstrations, they often yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions. Our key insight is that despite noisy and sparse demonstrations, there exist infinite physically feasible trajectories that naturally bridge between demonstrated skills or emerge from their neighboring states, forming a continuous space of possible skill variations and transitions. Building upon this insight, we present two data augmentation techniques: a Stitched Trajectory Graph (STG) that discovers potential transitions between demonstration skills, and a State Transition Field (STF) that establishes unique connections for arbitrary states within the demonstration neighborhood. To enable effective RLID with augmented data, we develop an Adaptive Trajectory Sampling (ATS) strategy for dynamic curriculum generation and a historical encoding mechanism for memory-dependent skill learning. Our approach enables robust skill acquisition that significantly generalizes beyond the reference demonstrations. Extensive experiments across diverse interaction tasks demonstrate substantial improvements over state-of-the-art methods in terms of convergence stability, generalization capability, and recovery robustness.

[207] Local Herb Identification Using Transfer Learning: A CNN-Powered Mobile Application for Nepalese Flora

Prajwal Thapa,Mridul Sharma,Jinu Nyachhyon,Yagya Raj Pandeya

Main category: cs.LG

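TL;DR: This study presents a deep learning approach to herb classification that uses CNNs and transfer learning to classify 60 herb species in the setting of Nepal's rich biodiversity.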

Details

Motivation: Herb classification is a significant challenge in botanical research, especially in biodiversity-rich regions such as Nepal. Existing methods have limitations, and more efficient solutions are needed. Method: The study uses several deep learning architectures (such as DenseNet121, ResNet50, and VGG16) with data augmentation and regularization, trained on a dataset of 12,000 herb images. Result: DenseNet121 performed best, and data augmentation and regularization effectively reduced overfitting and improved generalization. Conclusion: The study advances herb classification techniques and supports the preservation and sustainable use of traditional botanical knowledge. Abstract: Herb classification presents a critical challenge in botanical research, particularly in regions with rich biodiversity such as Nepal. This study introduces a novel deep learning approach for classifying 60 different herb species using Convolutional Neural Networks (CNNs) and transfer learning techniques. Using a manually curated dataset of 12,000 herb images, we developed a robust machine learning model that addresses existing limitations in herb recognition methodologies. Our research employed multiple model architectures, including DenseNet121, 50-layer Residual Network (ResNet50), 16-layer Visual Geometry Group Network (VGG16), InceptionV3, EfficientNetV2, and Vision Transformer (ViT), with DenseNet121 ultimately demonstrating superior performance. Data augmentation and regularization techniques were applied to mitigate overfitting and enhance the generalizability of the model. This work advances herb classification techniques, preserving traditional botanical knowledge and promoting sustainable herb utilization.

[208] Sharpness-Aware Minimization with Z-Score Gradient Filtering for Neural Networks

Juyoung Yun

Main category: cs.LG

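TL;DR: ZSharp is an improved SAM method that applies layer-wise Z-score normalization and percentile filtering to retain only significant gradient components, improving generalization.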

Details

Motivation: Deep neural networks tend to converge to sharp minima, which hurts generalization. SAM mitigates this, but its parameter perturbation includes statistically insignificant directions. Method: ZSharp extends SAM with layer-wise Z-score normalization and percentile filtering so that only significant gradient components are retained. Result: On CIFAR-10, CIFAR-100, and Tiny-ImageNet, ZSharp outperforms SAM and its variants in test accuracy, especially on deeper and Transformer-based models. Conclusion: ZSharp is a lightweight and effective improvement that raises generalization performance without architectural changes. Abstract: Generalizing well in deep neural networks remains a core challenge, particularly due to their tendency to converge to sharp minima that degrade robustness. Sharpness-Aware Minimization (SAM) mitigates this by seeking flatter minima but perturbs parameters using the full gradient, which can include statistically insignificant directions. We propose ZSharp, a simple yet effective extension to SAM that applies layer-wise Z-score normalization followed by percentile-based filtering to retain only statistically significant gradient components. This selective perturbation aligns updates with curvature-sensitive directions, enhancing generalization without requiring architectural changes. ZSharp introduces only one additional hyperparameter, the percentile threshold, and remains fully compatible with existing SAM variants. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet using ResNet, VGG, and Vision Transformers show that ZSharp consistently outperforms SAM and its variants in test accuracy, particularly on deeper and transformer-based models. These results demonstrate that ZSharp is a principled and lightweight improvement for sharpness-aware optimization.
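The filtering step described above can be sketched for a single layer as follows. The nearest-rank percentile, the use of absolute Z-scores, and the choice to keep the original gradient values (rather than the normalized ones) are assumptions of this sketch; the abstract does not settle these details.

```python
import math
import statistics


def zscore_filter(layer_grad, percentile=70.0):
    """Keep only gradient components whose |z-score| reaches the percentile cut.

    Sketch of layer-wise Z-score filtering: components below the threshold
    are zeroed, so a SAM-style perturbation built from the result would only
    follow the statistically prominent directions of this layer.
    """
    mean = statistics.fmean(layer_grad)
    std = statistics.pstdev(layer_grad)
    if std == 0:
        return list(layer_grad)  # all components identical; nothing to filter
    z = [abs((g - mean) / std) for g in layer_grad]
    ranked = sorted(z)
    # Nearest-rank percentile of |z| (assumed definition).
    k = max(0, min(len(ranked) - 1, math.ceil(percentile / 100 * len(ranked)) - 1))
    threshold = ranked[k]
    return [g if zi >= threshold else 0.0 for g, zi in zip(layer_grad, z)]
```

In a full optimizer this function would run once per layer before computing the SAM ascent step; the percentile is the single extra hyperparameter the abstract mentions.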

eess.IV [Back]

[209] Regression is all you need for medical image translation

Sebastian Rassmann,David Kügler,Christian Ewert,Martin Reuter

Main category: eess.IV

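TL;DR: YODA is a 2.5D diffusion-based framework for medical image translation that combines diffusion and regression approaches to generate high-quality images and challenges the presumed advantages of diffusion models in the medical domain.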

Details

Motivation: Medical image translation requires highly accurate anatomical information, but existing generative models (such as GANs and diffusion models) may introduce noise or hallucinated content that reduces clinical utility. Method: Proposes the YODA framework, which unites the diffusion and regression paradigms, and introduces ExpA sampling to suppress generated noise. Result: YODA outperforms existing GAN and diffusion model methods on multiple datasets, and its generated images are interchangeable with, or even superior to, physically acquired images. Conclusion: YODA challenges the presumed advantages of diffusion models in medical image translation and opens a path toward practical application. Abstract: The acquisition of information-rich images within a limited time budget is crucial in medical imaging. Medical image translation (MIT) can help enhance and supplement existing datasets by generating synthetic images from acquired data. While Generative Adversarial Nets (GANs) and Diffusion Models (DMs) have achieved remarkable success in natural image generation, their benefits - creativity and image realism - do not necessarily transfer to medical applications where highly accurate anatomical information is required. In fact, the imitation of acquisition noise or content hallucination hinders clinical utility. Here, we introduce YODA (You Only Denoise once - or Average), a novel 2.5D diffusion-based framework for volumetric MIT. YODA unites diffusion and regression paradigms to produce realistic or noise-free outputs. Furthermore, we propose Expectation-Approximation (ExpA) DM sampling, which draws inspiration from MRI signal averaging. ExpA-sampling suppresses generated noise and, thus, eliminates noise from biasing the evaluation of image quality. Through extensive experiments on four diverse multi-modal datasets - comprising multi-contrast brain MRI and pelvic MRI-CT - we show that diffusion and regression sampling yield similar results in practice. As such, the computational overhead of diffusion sampling does not provide systematic benefits in medical information translation. Building on these insights, we demonstrate that YODA outperforms several state-of-the-art GAN and DM methods. Notably, YODA-generated images are shown to be interchangeable with, or even superior to, physical acquisitions for several downstream tasks. Our findings challenge the presumed advantages of DMs in MIT and pave the way for the practical application of MIT in medical imaging.

Aiman Farooq,Azad Singh,Deepak Mishra,Santanu Chaudhury

Main category: eess.IV

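TL;DR: RobSurv is a deep learning framework with a dual-path architecture that uses vector quantization and Transformer-based processing to deliver robust cancer survival prediction from multi-modal medical imaging.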

Details

Motivation: Address the vulnerability of deep learning models for cancer survival prediction to noise and to differences in imaging protocols. Method: Uses a dual-path architecture: one path learns discrete codebooks through vector quantization to resist noise, the other preserves fine-grained continuous features, and a Transformer-based fusion mechanism integrates the two. Result: Performs strongly on three datasets (C-index of 0.771, 0.742, and 0.734, respectively), with performance degradation under noise of only 3.8-4.5%, better than baseline methods. Conclusion: RobSurv is robust across multi-modal imaging and noisy conditions and offers a reliable tool for clinical prognosis. Abstract: Cancer survival prediction using multi-modal medical imaging presents a critical challenge in oncology, mainly due to the vulnerability of deep learning models to noise and protocol variations across imaging centers. Current approaches struggle to extract consistent features from heterogeneous CT and PET images, limiting their clinical applicability. We address these challenges by introducing RobSurv, a robust deep-learning framework that leverages vector quantization for resilient multi-modal feature learning. The key innovation of our approach lies in its dual-path architecture: one path maps continuous imaging features to learned discrete codebooks for noise-resistant representation, while the parallel path preserves fine-grained details through continuous feature processing. This dual representation is integrated through a novel patch-wise fusion mechanism that maintains local spatial relationships while capturing global context via Transformer-based processing. In extensive evaluations across three diverse datasets (HECKTOR, H&N1, and NSCLC Radiogenomics), RobSurv demonstrates superior performance, achieving concordance indices of 0.771, 0.742, and 0.734, respectively - significantly outperforming existing methods. Most notably, our model maintains robust performance even under severe noise conditions, with performance degradation of only 3.8-4.5% compared to 8-12% in baseline methods. These results, combined with strong generalization across different cancer types and imaging protocols, establish RobSurv as a promising solution for reliable clinical prognosis that can enhance treatment planning and patient care.
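The noise resistance of the discrete path comes from snapping each continuous feature to its nearest codebook entry. The lookup below is a generic vector quantization step shown for illustration; the codebook values and feature vectors are made up, and RobSurv's learned codebooks and training losses are not reproduced here.

```python
def nearest_code(feature, codebook):
    """Return the index of the codebook vector closest to `feature` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


# Hypothetical learned codebook and two views of the same patch feature.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
clean = [0.9, 1.1]
noisy = [1.05, 0.95]  # same feature with a small perturbation

clean_code = nearest_code(clean, codebook)
noisy_code = nearest_code(noisy, codebook)
```

Both features map to the same code index, so downstream layers that consume the code see identical input despite the perturbation; the continuous path is what preserves the fine detail this discards.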

[211] Multimodal Deep Learning for Stroke Prediction and Detection using Retinal Imaging and Clinical Data

Saeed Shurrab,Aadim Nepal,Terrence J. Lee-St. John,Nicola G. Ghazi,Bartlomiej Piechowski-Jozwiak,Farah E. Shamout

Main category: eess.IV

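TL;DR: The study explores the potential of combining retinal images and clinical data in a multimodal deep neural network for stroke detection and risk prediction, with notable gains over unimodal and foundation-model baselines.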

Details

Motivation: Stroke is a major global public health problem, and existing diagnostic methods rely on costly medical imaging. Because the retina and the brain share clinical pathways, retinal imaging may offer a cost-effective alternative. Method: Proposes a multimodal deep neural network that combines Optical Coherence Tomography (OCT), infrared reflectance retinal scans, and clinical data, pretrained with self-supervised learning and then fine-tuned and evaluated. Result: Experiments show that the multimodal approach improves AUROC by 5% over the unimodal baseline and by 8% over an existing state-of-the-art foundation model, confirming the predictive ability of retinal imaging. Conclusion: The study confirms the potential of retinal imaging for identifying high-risk patients and improving long-term outcomes. Abstract: Stroke is a major public health problem, affecting millions worldwide. Deep learning has recently demonstrated promise for enhancing the diagnosis and risk prediction of stroke. However, existing methods rely on costly medical imaging modalities, such as computed tomography. Recent studies suggest that retinal imaging could offer a cost-effective alternative for cerebrovascular health assessment due to the shared clinical pathways between the retina and the brain. Hence, this study explores the impact of leveraging retinal images and clinical data for stroke detection and risk prediction. We propose a multimodal deep neural network that processes Optical Coherence Tomography (OCT) and infrared reflectance retinal scans, combined with clinical data, such as demographics, vital signs, and diagnosis codes. We pretrained our model using a self-supervised learning framework using a real-world dataset consisting of 37k scans, and then fine-tuned and evaluated the model using a smaller labeled subset. Our empirical findings establish the predictive ability of the considered modalities in detecting lasting effects in the retina associated with acute stroke and forecasting future risk within a specific time horizon. The experimental results demonstrate the effectiveness of our proposed framework by achieving 5% AUROC improvement as compared to the unimodal image-only baseline, and 8% improvement compared to an existing state-of-the-art foundation model. In conclusion, our study highlights the potential of retinal imaging in identifying high-risk patients and improving long-term outcomes.

[212] Platelet enumeration in dense aggregates

H. Martin Gillis,Yogeshwar Shendye,Paul Hollensen,Alan Fine,Thomas Trappenberg

Main category: eess.IV

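TL;DR: This paper proposes an improved deep learning approach that markedly increases platelet identification accuracy by optimizing convolutional kernels and class design.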

Details

Motivation: Platelets vary in size and morphology, which makes them hard for conventional CNN methods to identify accurately, so improved methods are needed. Method: Performs semantic segmentation with U-Net architectures, assigns separate classes to single platelets and platelet aggregates, and compares the pixel area method with connected component analysis. Result: Experiments show that optimizing convolutional operations and class design significantly improves platelet identification, and the new method outperforms the conventional pixel area method. Conclusion: The study highlights the importance of convolutional kernel optimization and class design and offers a more accurate solution for platelet identification. Abstract: Identifying and counting blood components such as red blood cells, various types of white blood cells, and platelets is a critical task for healthcare practitioners. Deep learning approaches, particularly convolutional neural networks (CNNs) using supervised learning strategies, have shown considerable success for such tasks. However, CNN-based architectures such as U-Net often struggle to accurately identify platelets due to their sizes and high variability of features. To address these challenges, researchers have commonly employed strategies such as class weighted loss functions, which have demonstrated some success. However, this does not address the more significant challenge of platelet variability in size and tendency to form aggregates and associations with other blood components. In this study, we explored an alternative approach by investigating the role of convolutional kernels in mitigating these issues. We also assigned separate classes to singular platelets and platelet aggregates and performed semantic segmentation using various U-Net architectures for identifying platelets. We then evaluated and compared two common methods (pixel area method and connected component analysis) for counting platelets and proposed an alternative approach specialized for single platelets and platelet aggregates. Our experiments provided results that showed significant improvements in the identification of platelets, highlighting the importance of optimizing convolutional operations and class designations. We show that the common practice of pixel area-based counting often overestimates platelet counts, whereas the proposed method presented in this work offers significant improvements. We discuss in detail how these methods derive counts from segmentation masks.
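The two common counting methods compared in the abstract can be contrasted on a toy segmentation mask. This is a generic sketch: the mask, the 4-connectivity choice, and the nominal platelet area are made-up values, and the paper's specialized method for aggregates is not reproduced.

```python
from collections import deque


def count_components(mask):
    """Count 4-connected foreground regions in a binary mask (list of rows)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one region
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def pixel_area_count(mask, mean_area):
    """Estimate the count as total foreground area / assumed mean platelet area."""
    return round(sum(map(sum, mask)) / mean_area)


mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
```

Here connected component analysis finds 2 regions, while the pixel area method with a nominal area of 2 pixels reports 3, because the larger region is counted as two platelets. Connected components has the opposite weakness on aggregates, where several touching platelets merge into one region.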

[213] CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering

Zhe Zhang,Mingxiu Cai,Hanxiao Wang,Gaochang Wu,Tianyou Chai,Xiatian Zhu

Main category: eess.IV

TL;DR: The paper proposes CostFilter-AD, which introduces the concept of cost filtering to improve the matching process in unsupervised anomaly detection.

Details

Motivation: Existing unsupervised anomaly detection methods rely on image-level or feature-level matching, but the matching process is inaccurate and often overlooked, which leads to sub-optimal detection. Method: Constructs a matching cost volume and refines it with a cost volume filtering network that uses the input observation as an attention query. Result: Experiments on the MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for single-class and multi-class tasks. Conclusion: As a generic post-processing plug-in, CostFilter-AD can markedly improve unsupervised anomaly detection performance. Abstract: Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach CostFilter-AD. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
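A matching cost volume can be sketched in a few lines. For brevity the two spatial dimensions are flattened into one position index, cosine distance is assumed as the matching cost, and the score is the raw minimum cost with no learned filtering; the paper's filtering network is exactly the part this sketch leaves out.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def cost_volume(input_feats, normal_feats):
    """cost[p][m]: matching cost between input position p and normal candidate m."""
    return [[cosine_distance(f, g) for g in normal_feats] for f in input_feats]


def anomaly_scores(volume):
    """Unfiltered score per position: the lowest cost along the matching dimension."""
    return [min(row) for row in volume]


# Hypothetical feature vectors: position 1 of the input has no good normal match.
normal_feats = [[1.0, 0.0], [0.9, 0.1]]
input_feats = [[1.0, 0.05], [0.0, 1.0], [0.95, 0.1]]
scores = anomaly_scores(cost_volume(input_feats, normal_feats))
```

Positions that resemble some normal feature get a score near zero, while the unmatched position stands out. Noise in individual costs carries straight into these scores, which is the motivation the abstract gives for filtering the volume before scoring.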

[214] Seeing Heat with Color -- RGB-Only Wildfire Temperature Inference from SAM-Guided Multimodal Distillation using Radiometric Ground Truth

Michael Marinaccio,Fatemeh Afghah

Main category: eess.IV

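TL;DR: SAM-TIFF is a teacher-student distillation framework that predicts and segments wildfire temperature from RGB input alone, without thermal sensors.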

Details

Motivation: Reduce the hardware cost and power consumption of UAV wildfire monitoring by achieving high-fidelity monitoring from RGB input alone. Method: A multimodal teacher network (RGB and thermal) distills knowledge to a unimodal RGB student network; automatic segmentation combines SAM, TOPSIS, Canny edge detection, and Otsu's thresholding. Result: The method generalizes well on the FLAME 3 dataset and is the first to perform per-pixel temperature regression from RGB data. Conclusion: The work lays the foundation for lightweight, low-cost UAV wildfire monitoring systems without thermal sensors. Abstract: High-fidelity wildfire monitoring using Unmanned Aerial Vehicles (UAVs) typically requires multimodal sensing, especially RGB and thermal imagery, which increases hardware cost and power consumption. This paper introduces SAM-TIFF, a novel teacher-student distillation framework for pixel-level wildfire temperature prediction and segmentation using RGB input only. A multimodal teacher network trained on paired RGB-Thermal imagery and radiometric TIFF ground truth distills knowledge to a unimodal RGB student network, enabling thermal-sensor-free inference. Segmentation supervision is generated by a hybrid approach: Segment Anything Model (SAM)-guided mask generation with mask selection via TOPSIS, together with a Canny edge detection and Otsu's thresholding pipeline for automatic point prompt selection. Our method is the first to perform per-pixel temperature regression from RGB UAV data, demonstrating strong generalization on the recent FLAME 3 dataset. This work lays the foundation for lightweight, cost-effective UAV-based wildfire monitoring systems without thermal sensors.
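Otsu's thresholding, one ingredient of the prompt-selection pipeline mentioned above, picks the gray level that maximizes between-class variance. The sketch below is a generic textbook version on a flat list of 8-bit intensities; how the authors combine it with Canny edges and SAM prompts is not specified in the abstract, and the sample pixel values are made up.

```python
def otsu_threshold(pixels, levels=256):
    """Return the level t that maximizes between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # intensity sum of the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Hypothetical bimodal intensities: a dark background and a bright region.
pixels = [10] * 50 + [12] * 50 + [200] * 40 + [210] * 60
threshold = otsu_threshold(pixels)
bright_mask = [p > threshold for p in pixels]
```

Because ties are not replaced, the function returns the lowest level among equally good thresholds, which here sits just above the dark cluster.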

[215] A Dual-Task Synergy-Driven Generalization Framework for Pancreatic Cancer Segmentation in CT Scans

Jun Li,Yijue Zhang,Haibo Shi,Minhong Li,Qiwei Li,Xiaohua Qian

Main category: eess.IV

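TL;DR: Proposes a dual-task framework combining pixel-level classification and regression for accurate segmentation of pancreatic cancer lesions, with dual self-supervised learning to improve generalization and stability.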

Details

Motivation: Existing pancreatic cancer lesion segmentation methods generalize poorly because of imaging variability and lesion heterogeneity, and must improve to support accurate diagnosis and treatment. Method: A dual-task framework combines segmentation and regression, uses the reciprocal transformation of task outputs to strengthen generalization, and applies dual self-supervised learning in the feature and output spaces. Result: Experiments on three datasets show a Dice score of 84.07%, on par with in-domain validation, and a 9.51% improvement on the cross-lesion generalization task. Conclusion: The model provides efficient and robust technical support for pancreatic disease management; the code is to be released. Abstract: Pancreatic cancer, characterized by its notable prevalence and mortality rates, demands accurate lesion delineation for effective diagnosis and therapeutic interventions. The generalizability of extant methods is frequently compromised due to the pronounced variability in imaging and the heterogeneous characteristics of pancreatic lesions, which may mimic normal tissues and exhibit significant inter-patient variability. Thus, we propose a generalization framework that synergizes pixel-level classification and regression tasks to accurately delineate lesions and improve model stability. This framework not only seeks to align segmentation contours with actual lesions but also uses regression to elucidate spatial relationships between diseased and normal tissues, thereby improving tumor localization and morphological characterization. Enhanced by the reciprocal transformation of task outputs, our approach integrates additional regression supervision within the segmentation context, bolstering the model's generalization ability from a dual-task perspective. Besides, dual self-supervised learning in feature spaces and output spaces augments the model's representational capability and stability across different imaging views. Experiments on 594 samples composed of three datasets with significant imaging differences demonstrate that our generalized pancreas segmentation achieves results comparable to mainstream in-domain validation performance (Dice: 84.07%). More importantly, it successfully improves the results of the highly challenging cross-lesion generalized pancreatic cancer segmentation task by 9.51%. Thus, our model constitutes a resilient and efficient foundational technological support for pancreatic disease management and wider medical applications. The codes will be released at https://github.com/SJTUBME-QianLab/Dual-Task-Seg.

[216] Efficient Multi Subject Visual Reconstruction from fMRI Using Aligned Representations

Christos Zangos,Danish Ebadulla,Thomas Christopher Sprague,Ambuj Singh

Main category: eess.IV

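TL;DR: Proposes a new fMRI-based method for visual image reconstruction that uses a subject-agnostic common representation space and markedly improves efficiency in low-data scenarios.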

Details

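Motivation: Conventional methods align brain signals across subjects inefficiently, especially when data are scarce. Method: Subjects' brain signals are aligned to a common space during training to form a semantically aligned common brain, and lightweight subject-specific modules are aligned to a reference subject. Result: Evaluation on multiple datasets shows that the common space is subject-agnostic and dataset-agnostic and that the approach is more efficient than conventional end-to-end training. Conclusion: The method performs well in low-data scenarios and offers an efficient solution for brain signal alignment. Abstract: This work introduces a novel approach to fMRI-based visual image reconstruction using a subject-agnostic common representation space. We show that the brain signals of the subjects can be aligned in this common space during training to form a semantically aligned common brain. This is leveraged to demonstrate that aligning subject-specific lightweight modules to a reference subject is significantly more efficient than traditional end-to-end training methods. Our approach excels in low-data scenarios. We evaluate our methods on different datasets, demonstrating that the common space is subject- and dataset-agnostic.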

[217] CLOG-CD: Curriculum Learning based on Oscillating Granularity of Class Decomposed Medical Image Classification

Asmaa Abbas,Mohamed Gaber,Mohammed M. Abdelsamea

Main category: eess.IV

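TL;DR: The paper proposes CLOG-CD, a new CNN training method that combines a curriculum learning strategy with class decomposition to improve medical image classification, and validates it on four imbalanced medical datasets.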

Details

Motivation: Irregularities in medical image data make classification more challenging, and conventional methods are prone to misclassification; curriculum learning and class decomposition have shown promise for this problem. Method: CLOG-CD combines a curriculum learning strategy with class decomposition, trains with weights learnt from the class decomposition, and follows an anti-curriculum (hard-to-easy) schedule. Result: Across four datasets, CLOG-CD markedly improves classification accuracy, reaching up to 99.45% on the CRC dataset. Conclusion: CLOG-CD performs well on medical image classification and offers an effective solution for imbalanced data. Abstract: Curriculum learning strategies have been proven to be effective in various applications and have gained significant interest in the field of machine learning. They can improve the final model's performance and accelerate the training process. However, in the medical imaging domain, data irregularities can make the recognition task more challenging and usually result in misclassification between the different classes in the dataset. Class-decomposition approaches have shown promising results in solving such a problem by learning the boundaries within the classes of the data set. In this paper, we present a novel convolutional neural network (CNN) training method based on the curriculum learning strategy and the class decomposition approach, which we call CLOG-CD, to improve the performance of medical image classification. We evaluated our method on four different imbalanced medical image datasets: chest X-ray (CXR), brain tumour, digital knee X-ray, and histopathology colorectal cancer (CRC). CLOG-CD utilises the learnt weights from the decomposition granularity of the classes, and the training is accomplished from descending to ascending order (i.e., anti-curriculum technique). We also investigated the classification performance of our proposed method based on different acceleration factors and pace function curricula. We used two pre-trained networks, ResNet-50 and DenseNet-121, as the backbone for CLOG-CD. The results with ResNet-50 show that CLOG-CD can improve classification performance, with an accuracy of 96.08% for the CXR dataset, 96.91% for the brain tumour dataset, 79.76% for the digital knee X-ray dataset, and 99.17% for the CRC dataset, compared to other training strategies. In addition, with DenseNet-121, CLOG-CD achieved 94.86%, 94.63%, 76.19%, and 99.45% for the CXR, brain tumour, digital knee X-ray, and CRC datasets, respectively.

[218] LensNet: An End-to-End Learning Framework for Empirical Point Spread Function Modeling and Lensless Imaging Reconstruction

Jiesong Bai,Yuhao Yin,Yihang Dong,Xiaofeng Zhang,Chi-Man Pun,Xuhang Chen

Main category: eess.IV

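TL;DR: LensNet is an end-to-end deep learning framework that improves the quality and adaptability of lensless imaging by dynamically estimating the PSF and embedding Wiener filtering.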

Details

Motivation: Conventional lensless imaging relies on static PSF models and laborious pre-processing, and copes poorly with noise and dynamic scene variations. Method: LensNet combines spatial-domain and frequency-domain representations, uses a learnable Coded Mask Simulator (CMS) to estimate the PSF dynamically, and embeds Wiener filtering. Result: Experiments show that LensNet outperforms existing methods in reconstruction quality and in preserving high-frequency details. Conclusion: LensNet offers a more accurate and flexible solution for lensless imaging, with applications from miniature sensors to medical diagnostics. Abstract: Lensless imaging stands out as a promising alternative to conventional lens-based systems, particularly in scenarios demanding ultracompact form factors and cost-effective architectures. However, such systems are fundamentally governed by the Point Spread Function (PSF), which dictates how a point source contributes to the final captured signal. Traditional lensless techniques often require explicit calibrations and extensive pre-processing, relying on static or approximate PSF models. These rigid strategies can result in limited adaptability to real-world challenges, including noise, system imperfections, and dynamic scene variations, thus impeding high-fidelity reconstruction. In this paper, we propose LensNet, an end-to-end deep learning framework that integrates spatial-domain and frequency-domain representations in a unified pipeline. Central to our approach is a learnable Coded Mask Simulator (CMS) that enables dynamic, data-driven estimation of the PSF during training, effectively mitigating the shortcomings of fixed or sparsely calibrated kernels. By embedding a Wiener filtering component, LensNet refines global structure and restores fine-scale details, thus alleviating the dependency on multiple handcrafted pre-processing steps. Extensive experiments demonstrate LensNet's robust performance and superior reconstruction quality compared to state-of-the-art methods, particularly in preserving high-frequency details and attenuating noise. The proposed framework establishes a novel convergence between physics-based modeling and data-driven learning, paving the way for more accurate, flexible, and practical lensless imaging solutions for applications ranging from miniature sensors to medical diagnostics. The code is available at https://github.com/baijiesong/Lensnet.
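For readers unfamiliar with the Wiener filtering step, a classical frequency-domain version can be sketched in one dimension. This is not LensNet's component: it assumes a known, fixed PSF, circular convolution, and a scalar noise-to-signal ratio `nsr`, whereas the paper learns the PSF and embeds the filter in a network. The signal and PSF values are made up for illustration.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (fine for tiny examples)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_convolve(signal, kernel):
    """Circular convolution of two equal-length sequences."""
    n = len(signal)
    return [sum(signal[(i - j) % n] * kernel[j] for j in range(n))
            for i in range(n)]


def wiener_deconvolve(measurement, psf, nsr):
    """Estimate X as conj(H) * Y / (|H|^2 + nsr) in the frequency domain."""
    Y = dft(measurement)
    H = dft(psf)
    X_hat = [h.conjugate() * y / (abs(h) ** 2 + nsr) for h, y in zip(H, Y)]
    return [v.real for v in dft(X_hat, inverse=True)]


# Hypothetical scene and blur kernel (PSF), both of length 8.
scene = [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
psf = [0.6, 0.3, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]
measurement = circular_convolve(scene, psf)
restored = wiener_deconvolve(measurement, psf, nsr=1e-6)
```

With a noise-free measurement and a small `nsr`, the restored signal is close to the original scene; a larger `nsr` trades sharpness for noise suppression.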

[219] Continuous Filtered Backprojection by Learnable Interpolation Network

Hui Lin,Dong Zeng,Qi Xie,Zerui Mao,Jianhua Ma,Deyu Meng

Main category: eess.IV

TL;DR: Proposes LInFBP, a deep learning model that improves CT image reconstruction quality through learnable interpolation.

Details

Motivation: Interpolation errors in conventional CT reconstruction degrade image accuracy and need to be addressed. Method: A deep network predicts linear combination coefficients, from which a continuous function is built and used for interpolation. Result: Experiments show that LInFBP markedly improves image quality and offers plug-and-play use and good generalization. Conclusion: LInFBP is the first to apply deep learning to interpolation in FBP and effectively reduces interpolation errors. Abstract: Accurate reconstruction of computed tomography (CT) images is crucial in the medical imaging field. However, there are unavoidable interpolation errors in the backprojection step of conventional reconstruction methods, i.e., filtered-backprojection-based methods, which are detrimental to accurate reconstruction. In this study, to address this issue, we propose a novel deep learning model, named Learnable-Interpolation-based FBP, or LInFBP for short, to enhance the reconstructed CT image quality. It achieves learnable interpolation in the backprojection step of filtered backprojection (FBP) and alleviates the interpolation errors. Specifically, in the proposed LInFBP, we formulate every local piece of the latent continuous function of discrete sinogram data as a linear combination of selected basis functions, and learn this continuous function by exploiting a deep network to predict the linear combination coefficients. Then, the learned latent continuous function is exploited for interpolation in the backprojection step, which, for the first time, takes advantage of deep learning for the interpolation in FBP. Extensive experiments, which encompass diverse CT scenarios, demonstrate the effectiveness of the proposed LInFBP in terms of enhanced reconstructed image quality, plug-and-play ability and generalization capability.
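The "linear combination of basis functions" view of interpolation can be made concrete with a small sketch. Standard linear interpolation of a detector row is the special case with two fixed coefficients; the `coefficient_fn` hook below is a hypothetical stand-in for wherever a network's predicted coefficients would go, and is not the authors' interface. Positions are assumed to lie within the sampled range.

```python
import math


def linear_coefficients(frac):
    """Fixed coefficients of conventional linear interpolation."""
    return [1.0 - frac, frac]


def interpolate(samples, position, coefficient_fn=linear_coefficients):
    """Evaluate a sampled detector row at a fractional position.

    The value is a linear combination of the two neighboring samples;
    coefficient_fn maps the fractional offset to the combination weights.
    Assumes 0 <= position <= len(samples) - 1.
    """
    left = max(0, min(int(math.floor(position)), len(samples) - 2))
    frac = position - left
    coeffs = coefficient_fn(frac)
    return sum(c * samples[left + i] for i, c in enumerate(coeffs))


def nearest_coefficients(frac):
    """Alternative fixed rule: nearest-neighbor interpolation."""
    return [1.0, 0.0] if frac < 0.5 else [0.0, 1.0]


row = [0.0, 2.0, 4.0, 3.0]  # hypothetical filtered sinogram samples
linear_value = interpolate(row, 1.25)
nearest_value = interpolate(row, 1.25, nearest_coefficients)
```

Swapping `coefficient_fn` changes the interpolation rule without touching the backprojection loop that would call `interpolate`, which is the sense in which a learned rule could be plugged in.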

[220] Multi-Scale Target-Aware Representation Learning for Fundus Image Enhancement

Haofan Wu,Yin Huang,Yuqing Wu,Qiuyu Yang,Bingfang Wang,Li Zhang,Muhammad Fahadullah Khan,Ali Zia,M. Saleh Memon,Syed Sohail Bukhari,Abdul Fattah Memon,Daizong Ji,Ya Zhang,Ghulam Mustafa,Yin Fang

Main category: eess.IV

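TL;DR: Proposes a multi-scale target-aware representation learning framework (MTRL-FIE) for efficient fundus image enhancement, addressing the lack of a unified framework and of targeted enhancement in existing methods.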

Details

Motivation: Fundus images often suffer from low resolution and low signal-to-noise ratio because of hardware limitations and operational variability, and existing methods neither recover multi-scale information in a unified way nor specifically enhance lesion regions. Method: A multi-scale feature encoder (MFE) embeds low-frequency structure and high-frequency details, a structure-preserving hierarchical decoder (SHD) fuses multi-scale features, and a target-aware feature aggregation (TFA) module enhances pathological regions. Result: Experiments on multiple datasets confirm the effectiveness and generalizability of MTRL-FIE, which outperforms existing methods with a more lightweight architecture. Conclusion: MTRL-FIE improves fundus image enhancement and generalizes to other ophthalmic image processing tasks, showing potential for clinical use. Abstract: High-quality fundus images provide essential anatomical information for clinical screening and ophthalmic disease diagnosis. Yet, due to hardware limitations, operational variability, and patient compliance, fundus images often suffer from low resolution and low signal-to-noise ratio. Recent years have witnessed promising progress in fundus image enhancement. However, existing works usually focus on restoring structural details or global characteristics of fundus images, lacking a unified image enhancement framework to recover comprehensive multi-scale information. Moreover, few methods pinpoint the target of image enhancement, e.g., lesions, which is crucial for medical image-based diagnosis. To address these challenges, we propose a multi-scale target-aware representation learning framework (MTRL-FIE) for efficient fundus image enhancement. Specifically, we propose a multi-scale feature encoder (MFE) that employs wavelet decomposition to embed both low-frequency structural information and high-frequency details. Next, we design a structure-preserving hierarchical decoder (SHD) to fuse multi-scale feature embeddings for real fundus image restoration. SHD integrates hierarchical fusion and group attention mechanisms to achieve adaptive feature fusion while retaining local structural smoothness. Meanwhile, a target-aware feature aggregation (TFA) module is used to enhance pathological regions and reduce artifacts. Experimental results on multiple fundus image datasets demonstrate the effectiveness and generalizability of MTRL-FIE for fundus image enhancement. Compared to state-of-the-art methods, MTRL-FIE achieves superior enhancement performance with a more lightweight architecture. Furthermore, our approach generalizes to other ophthalmic image processing tasks without supervised fine-tuning, highlighting its potential for clinical applications.

[221] Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2

Yuwen Chen,Zafer Yildiz,Qihang Li,Yaqian Chen,Haoyu Dong,Hanxue Gu,Nicholas Konz,Maciej A. Mazurowski

Main category: eess.IV

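TL;DR: SLM-SAM 2 improves SAM 2 with short-term and long-term memory modules, raising medical image segmentation accuracy and reducing error propagation.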

Details

Motivation: Manual annotation of medical images is time-consuming and labor-intensive, and SAM 2 performs inconsistently on volumetric segmentation, especially at boundary regions. Method: SLM-SAM 2 integrates short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. Result: On three public datasets, SLM-SAM 2 markedly outperforms SAM 2, with Dice coefficient improvements of 0.14 and 0.11. Conclusion: SLM-SAM 2 offers a more accurate solution for automated medical image annotation. Abstract: Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on three public datasets covering organs, bones, and muscles across MRI and CT modalities. We show that the proposed method markedly outperforms the default SAM 2, achieving average Dice Similarity Coefficient improvements of 0.14 and 0.11 in the scenarios where 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
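Since the reported gains are stated in Dice Similarity Coefficient units, a short reference implementation may help in reading them. The sketch assumes flattened binary masks of equal length and the common convention that two empty masks score 1.0; the sample masks are made up and unrelated to the paper's data.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for 0/1 masks."""
    intersection = sum(p and t for p, t in zip(pred, target))
    size_sum = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum


# Hypothetical flattened masks for one slice.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
score = dice_coefficient(pred, target)
```

A volume can be scored the same way after flattening all of its slices into one list, so an improvement of 0.14 corresponds to 14 points on the 0 to 1 Dice scale.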

[222] Adversarial Robustness of Deep Learning Models for Inland Water Body Segmentation from SAR Images

Siddharth Kothari,Srinivasan Murali,Sankalp Kothari,Ujjwal Verma,Jaya Sreevalsan-Nair

Main category: eess.IV

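TL;DR: The paper studies inland water body segmentation in SAR images, simulates the effect of manual annotation errors on a U-Net model, and finds that the model is robust to a certain level of annotation error.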

Details

Motivation: Inland water body segmentation in SAR images is challenging because of complex geometry and noise in manual annotations; the study aims to assess the robustness of U-Net to annotation errors. Method: Manual annotation errors are simulated as adversarial attacks, and U-Net performance is tested under the resulting noisy labels. Result: U-Net tolerates a certain level of annotation error without a significant performance drop. Conclusion: The quality of manual annotation is crucial to segmentation model effectiveness; the study releases adversarial examples and a dataset to support robust training. Abstract: Inland water body segmentation from Synthetic Aperture Radar (SAR) images is an important task needed for several applications, such as flood mapping. While SAR sensors capture data in all-weather conditions as high-resolution images, differentiating water and water-like surfaces from SAR images is not straightforward. Inland water bodies, such as large river basins, have complex geometry, which adds to the challenge of segmentation. U-Net is a widely used deep learning model for land-water segmentation of SAR images. In practice, manual annotation is often used to generate the corresponding water masks as ground truth. Manual annotation of the images is prone to label noise owing to data poisoning attacks, especially due to complex geometry. In this work, we simulate manual errors in the form of adversarial attacks on the U-Net model and study the robustness of the model to human errors in annotation. Our results indicate that U-Net can tolerate a certain level of corruption before its performance drops significantly. This finding highlights the crucial role that the quality of manual annotations plays in determining the effectiveness of the segmentation model. The code and the new dataset, along with adversarial examples for robust training, are publicly available (GitHub link: https://github.com/GVCL/IWSeg-SAR-Poison.git).

[223] Hybrid Image Resolution Quality Metric (HIRQM): A Comprehensive Perceptual Image Quality Assessment Framework

Vineesh Kumar Reddy Mondem

Main category: eess.IV

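TL;DR: Proposes the Hybrid Image Resolution Quality Metric (HIRQM), which combines statistical, multi-scale, and deep learning methods, outperforms traditional metrics, and aligns better with human perception under complex distortions.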

Details

Motivation: Traditional image quality metrics such as MSE and SSIM fail to reflect perceptual quality under complex distortions, so a more comprehensive assessment method is needed. Method: HIRQM integrates three components: local pixel distribution analysis (PDF), multi-scale feature similarity (structural integrity), and deep features from a pre-trained VGG16 network (semantic alignment), with a dynamic weighting mechanism that adapts to image characteristics. Result: On the TID2013 and LIVE datasets, HIRQM reaches Pearson and Spearman correlations of 0.92 and 0.90, outperforming traditional metrics, and handles noise, blur, and compression artifacts particularly well. Conclusion: HIRQM provides a more flexible, perceptually aligned quality assessment tool for image processing applications such as compression and restoration. Abstract: Traditional image quality assessment metrics like Mean Squared Error and Structural Similarity Index often fail to reflect perceptual quality under complex distortions. We propose the Hybrid Image Resolution Quality Metric (HIRQM), integrating statistical, multi-scale, and deep learning-based methods for a comprehensive quality evaluation. HIRQM combines three components: Probability Density Function for local pixel distribution analysis, Multi-scale Feature Similarity for structural integrity across resolutions, and Hierarchical Deep Image Features using a pre-trained VGG16 network for semantic alignment with human perception. A dynamic weighting mechanism adapts component contributions based on image characteristics like brightness and variance, enhancing flexibility across distortion types. Our contributions include a unified metric and dynamic weighting for better perceptual alignment. Evaluated on TID2013 and LIVE datasets, HIRQM achieves Pearson and Spearman correlations of 0.92 and 0.90, outperforming traditional metrics. It excels in handling noise, blur, and compression artifacts, making it valuable for image processing applications like compression and restoration.
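The Pearson and Spearman figures quoted above measure agreement between a metric's scores and human opinion scores. A minimal stdlib sketch of both coefficients follows; the score lists are invented for illustration and are not from TID2013 or LIVE, and inputs are assumed to have non-zero variance.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical metric outputs and human opinion scores for four images.
metric_scores = [0.2, 0.4, 0.5, 0.9]
human_scores = [1.0, 2.0, 4.0, 8.0]
plcc = pearson(metric_scores, human_scores)
srocc = spearman(metric_scores, human_scores)
```

Spearman only checks that the metric orders images the way people do, while Pearson also penalizes a non-linear relationship between the two scales, which is why the two numbers are usually reported together.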

[224] CSASN: A Multitask Attention-Based Framework for Heterogeneous Thyroid Carcinoma Classification in Ultrasound Images

Peiqi Li,Yincheng Gao,Renxing Li,Haojie Yang,Yunyun Liu,Boji Liu,Jiahui Ni,Ying Zhang,Yulu Wu,Xiaowei Fang,Lehang Guo,Liping Sun,Jiangang Chen

Main category: eess.IV

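TL;DR: Proposes CSASN, a multitask learning framework that combines a dual-branch EfficientNet and ViT feature extractor with a channel-spatial attention module and a dynamically weighted loss function, improving the accuracy and stability of rare thyroid carcinoma classification.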

Details

Motivation: Address heterogeneous morphological features and data imbalance in classifying rare thyroid carcinomas from ultrasound images. Method: A dual-branch feature extractor (EfficientNet and ViT), a channel-spatial attention module, a residual multiscale classifier, and a dynamically weighted loss function. Result: Validated on a multicenter dataset of more than 2000 patients, CSASN outperforms existing single-stream CNN or Transformer models on rare subtypes such as FTC and MTC. Conclusion: CSASN offers a promising strategy for AI-assisted thyroid cancer diagnosis. Abstract: Heterogeneous morphological features and data imbalance pose significant challenges in rare thyroid carcinoma classification using ultrasound imaging. To address this issue, we propose a novel multitask learning framework, Channel-Spatial Attention Synergy Network (CSASN), which integrates a dual-branch feature extractor, combining EfficientNet for local spatial encoding and ViT for global semantic modeling, with a cascaded channel-spatial attention refinement module. A residual multiscale classifier and dynamically weighted loss function further enhance classification stability and accuracy. The framework is trained on a multicenter dataset comprising more than 2000 patients from four clinical institutions. Extensive ablation studies demonstrate that each module contributes significantly to model performance, particularly in recognizing rare subtypes such as FTC and MTC carcinomas. Experimental results show that CSASN outperforms existing single-stream CNN or Transformer-based models, achieving a superior balance between precision and recall under class-imbalanced conditions. This framework provides a promising strategy for AI-assisted thyroid cancer diagnosis.

Lei Xie,Huajun Zhou,Junxiong Huang,Jiahao Huang,Qingrun Zeng,Jianzhong He,Jiawei Zhang,Baohua Fan,Mingchu Li,Guoqiang Xie,Hao Chen,Yuanjing Feng

Main category: eess.IV

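TL;DR: Proposes CNTSeg-v2, a new arbitrary-modal fusion network for cranial nerve tract segmentation that uses T1-weighted images to supervise information selection from auxiliary modalities, markedly improving segmentation performance.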

Details

Motivation: Complete multimodal data are hard to obtain in clinical practice, which limits the use of existing methods. Method: T1-weighted images serve as the primary modality, and an Arbitrary-Modal Collaboration Module (ACM) and a Deep Distance-guided Multi-stage (DDM) decoder are designed to improve feature extraction and segmentation accuracy. Result: CNTSeg-v2 achieves state-of-the-art segmentation performance on the HCP and MDM datasets. Conclusion: By flexibly handling different modality combinations, CNTSeg-v2 markedly improves the accuracy and practicality of cranial nerve tract segmentation. Abstract: The segmentation of cranial nerve (CN) tracts provides a valuable quantitative tool for the analysis of the morphology and trajectory of individual CNs. Multimodal CN tract segmentation networks, e.g., CNTSeg, which combine structural Magnetic Resonance Imaging (MRI) and diffusion MRI, have achieved promising segmentation performance. However, it is laborious or even infeasible to collect complete multimodal data in clinical practice due to limitations in equipment, user privacy, and working conditions. In this work, we propose a novel arbitrary-modal fusion network for volumetric CN tract segmentation, called CNTSeg-v2, which trains one model to handle different combinations of available modalities. Instead of directly combining all the modalities, we select T1-weighted (T1w) images as the primary modality, because they are simple to acquire and contribute most to the results, and use them to supervise the information selection of the other auxiliary modalities. Our model encompasses an Arbitrary-Modal Collaboration Module (ACM) designed to effectively extract informative features from other auxiliary modalities, guided by the supervision of T1w images. Meanwhile, we construct a Deep Distance-guided Multi-stage (DDM) decoder to correct small errors and discontinuities through signed distance maps to improve segmentation accuracy. We evaluate our CNTSeg-v2 on the Human Connectome Project (HCP) dataset and the clinical Multi-shell Diffusion MRI (MDM) dataset. Extensive experimental results show that our CNTSeg-v2 achieves state-of-the-art segmentation performance, outperforming all competing methods.

[226] Diagnostic Uncertainty in Pneumonia Detection using CNN MobileNetV2 and CNN from Scratch

Kennard Norbert Sudiardjo,Islam Nur Alam,Wilson Wijaya,Lili Ayu Wulandhari

Main category: eess.IV

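TL;DR: The study proposes diagnosing pneumonia with CNNs (MobileNetV2 and ResNet101V2 architectures); results show that MobileNetV2 is highly stable but slow to train, while the Scratch model is accurate but prone to overfitting.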

Details

Motivation: Pneumonia diagnosis is often uncertain because of atypical presentations, the limitations of chest X-rays, and co-existing respiratory conditions, so a more reliable method is needed. Method: A pre-trained MobileNetV2 model with ResNet101V2 architecture and a Scratch model built with the Keras API are trained on a Kaggle dataset. Result: MobileNetV2 is stable on validation (accuracy of 84.87%, later falling to 78.95%); the Scratch model has higher validation accuracy but overfits (78.12%). Conclusion: MobileNetV2 suits settings that require stability and the Scratch model suits settings that require high accuracy; each has strengths and weaknesses. Abstract: Pneumonia diagnosis, though crucial for effective treatment, can be hampered by uncertainty. This uncertainty arises from factors such as atypical presentations, limitations of diagnostic tools such as chest X-rays, and the presence of co-existing respiratory conditions. This research proposes a supervised learning method, the CNN, for identifying lung diseases, especially pneumonia, using MobileNetV2 as the pre-trained model with ResNet101V2 architecture and a model built from scratch with the Keras API. The datasets used in this research were obtained from Kaggle. The results of implementing CNN MobileNetV2 and a CNN from scratch are promising. On validation data, MobileNetV2 performs with stability and minimal overfitting: training accuracy increased to 84.87% and later decreased slightly to 78.95%, while validation loss increased from 0.499 to 0.6345. Nonetheless, MobileNetV2 is more stable, although it takes more time to train each epoch. Meanwhile, after the 10th epoch, the Scratch model displayed more instability and overfitting despite having higher validation accuracy: training accuracy decreased significantly to 78.12% and the validation loss increased from 0.5698 to 1.1809. With these results, ResNet101V2 offers stability, and the Scratch model offers high accuracy.

[227] DeepSparse: A Foundation Model for Sparse-View CBCT Reconstruction

Yiqun Lin,Hualiang Wang,Jixiang Chen,Jiewen Yang,Jiarong Guo,Xiaomeng Li

Main category: eess.IV

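TL;DR: DeepSparse is a foundation model for sparse-view CBCT reconstruction that, through the DiCE network and the HyViP framework, markedly improves image quality and lowers computational demands.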

Details

Motivation: High radiation exposure is a major concern in CBCT imaging, and existing sparse-view reconstruction methods are computationally heavy and generalize poorly. Method: DeepSparse combines the DiCE network (Dual-Dimensional Cross-Scale Embedding) with the HyViP pretraining framework and a two-step finetuning strategy. Result: Experiments show that DeepSparse surpasses existing methods in reconstruction quality. Conclusion: DeepSparse opens a path toward safer and more efficient CBCT imaging. Abstract: Cone-beam computed tomography (CBCT) is a critical 3D imaging technology in the medical field, but the high radiation exposure required for high-quality imaging raises significant concerns, particularly for vulnerable populations. Sparse-view reconstruction reduces radiation by using fewer X-ray projections while maintaining image quality, yet existing methods face challenges such as high computational demands and poor generalizability to different datasets. To overcome these limitations, we propose DeepSparse, the first foundation model for sparse-view CBCT reconstruction, featuring DiCE (Dual-Dimensional Cross-Scale Embedding), a novel network that integrates multi-view 2D features and multi-scale 3D features. Additionally, we introduce the HyViP (Hybrid View Sampling Pretraining) framework, which pretrains the model on large datasets with both sparse-view and dense-view projections, and a two-step finetuning strategy to adapt and refine the model for new datasets. Extensive experiments and ablation studies demonstrate that our proposed DeepSparse achieves superior reconstruction quality compared to state-of-the-art methods, paving the way for safer and more efficient CBCT imaging.

[228] Multi-View Learning with Context-Guided Receptance for Image Denoising

Binghong Chen,Tingting Chai,Wei Jiang,Yuanrong Xu,Guanglu Zhou,Xiangqian Wu

Main category: eess.IV

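TL;DR: Proposes an image denoising model that combines multi-view feature integration with efficient sequence modeling, improving performance through context-guided token shift and a frequency mix module while using a bidirectional WKV mechanism to reduce computational cost.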

Details

Motivation: Existing methods struggle to distinguish complex noise patterns in real scenes, and Transformer-based models consume substantial computational resources. Method: A Context-guided Token Shift (CTS) paradigm captures local spatial dependencies, a Frequency Mix (FMix) module extracts frequency-domain features, and a Bidirectional WKV (BiWKV) mechanism provides efficient sequence modeling. Result: The model outperforms existing methods on multiple real-world image denoising datasets, cuts inference time by up to 40%, and restores fine details. Conclusion: The model surpasses existing methods in both performance and efficiency and is suited to real-world image denoising. Abstract: Image denoising is essential in low-level vision applications such as photography and automated driving. Existing methods struggle with distinguishing complex noise patterns in real-world scenes and consume significant computational resources due to reliance on Transformer-based models. In this work, the Context-guided Receptance Weighted Key-Value (\M) model is proposed, combining enhanced multi-view feature integration with efficient sequence modeling. Our approach introduces the Context-guided Token Shift (CTS) paradigm, which effectively captures local spatial dependencies and enhances the model's ability to model real-world noise distributions. Additionally, the Frequency Mix (FMix) module, which extracts frequency-domain features, is designed to isolate noise in high-frequency spectra and is integrated with spatial representations through a multi-view learning process. To improve computational efficiency, the Bidirectional WKV (BiWKV) mechanism is adopted, enabling full pixel-sequence interaction with linear complexity while overcoming the causal selection constraints. The model is validated on multiple real-world image denoising datasets, outperforming the existing state-of-the-art methods quantitatively and reducing inference time by up to 40%. Qualitative results further demonstrate the ability of our model to restore fine details in various scenes.

cs.LO [Back]

[229] Explainability by design: an experimental analysis of the legal coding process

Matteo Cristani,Guido Governatori,Francesco Olivieri,Monica Palmirani,Gabriele Buriola

Main category: cs.LO

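TL;DR: This paper proposes a methodology for encoding text fragments into Deontic Defeasible Logic rules and experimentally examines the efficiency of the coding process and the factors that affect it.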

Details

Motivation: Study how to encode legal text fragments accurately as Deontic Defeasible Logic rules and provide test scenarios to verify their correctness. Method: The Houdini technology is used to process the encodings, and experiments measure how coding effort relates to features of the text. Result: The experiments indicate that coding efficiency depends on factors such as knowledge of the legal domain, familiarity with the coding process, text length, and reference depth. Conclusion: A technique for forecasting coding time is proposed, offering a practical tool for legal coding. Abstract: Behind a set of rules in Deontic Defeasible Logic, there is a mapping process of normative background fragments. This process goes from text to rules and implicitly encompasses an explanation of the coded fragments. In this paper we deliver a methodology for legal coding that starts with a fragment and goes on to a set of Deontic Defeasible Logic rules, involving a set of scenarios to test the correctness of the coded fragments. The methodology is illustrated by the coding process of an example text. We then show the results of a series of experiments conducted with humans encoding a variety of normative backgrounds and corresponding cases, in which we have measured the effort made in the coding process as related to some measurable features. To process these examples, a recently developed technology, Houdini, that allows reasoning in Deontic Defeasible Logic, has been employed. Finally, we provide a technique to forecast the time required in coding, which depends on factors such as knowledge of the legal domain, knowledge of the coding processes, length of the text, and a measure of depth that refers to the length of the paths of legal references.

cs.AI [Back]

[230] CHORUS: Zero-shot Hierarchical Retrieval and Orchestration for Generating Linear Programming Code

Tasnim Ahmed,Salimur Choudhury

Main category: cs.AI

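TL;DR: The CHORUS framework uses retrieval-augmented generation to markedly improve open-source LLMs at generating Gurobi linear programming code, matching or even outperforming GPT3.5 and GPT4.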

Details

Motivation: Linear programming problems require specialist knowledge and are hard for non-experts to solve; the study explores how well LLMs generate solver-specific code. Method: The CHORUS framework combines a hierarchical chunking strategy, two-stage retrieval, and expert prompting to produce self-contained, semantically coherent code. Result: On the NL4Opt-Code benchmark, CHORUS markedly improves open-source LLMs and outperforms baseline methods. Conclusion: CHORUS markedly improves code generation while requiring fewer computational resources, confirming the importance of expert prompting and hierarchical chunking. Abstract: Linear Programming (LP) problems aim to find the optimal solution to an objective under constraints. These problems typically require domain knowledge, mathematical skills, and programming ability, presenting significant challenges for non-experts. This study explores the efficiency of Large Language Models (LLMs) in generating solver-specific LP code. We propose CHORUS, a retrieval-augmented generation (RAG) framework for synthesizing Gurobi-based LP code from natural language problem statements. CHORUS incorporates a hierarchical tree-like chunking strategy for theoretical contents and generates additional metadata based on code examples from documentation to facilitate self-contained, semantically coherent retrieval. The two-stage retrieval approach of CHORUS, followed by cross-encoder reranking, further ensures contextual relevance. Finally, an expertly crafted prompt and a structured parser with reasoning steps improve code generation performance significantly. Experiments on the NL4Opt-Code benchmark show that CHORUS improves the performance of open-source LLMs such as Llama3.1 (8B), Llama3.3 (70B), Phi4 (14B), Deepseek-r1 (32B), and Qwen2.5-coder (32B) by a significant margin compared to baseline and conventional RAG. It also allows these open-source LLMs to outperform or match the performance of much stronger baselines, GPT3.5 and GPT4, while requiring far fewer computational resources. Ablation studies further demonstrate the importance of expert prompting, hierarchical chunking, and structured reasoning.
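To make the problem class concrete, here is a tiny two-variable LP solved by enumerating the vertices of the feasible region. This is a teaching sketch only: it is not Gurobi code and not what CHORUS generates, the objective and constraints are invented, and the approach assumes a bounded, non-empty feasible region in two dimensions.

```python
from itertools import combinations


def solve_lp_2d(c, constraints, eps=1e-9):
    """Maximize c[0]*x + c[1]*y subject to a*x + b*y <= r for each (a, b, r).

    Checks every intersection of two constraint boundaries and keeps the
    feasible one with the best objective value. Assumes the feasible
    region is bounded and non-empty.
    """
    best_point, best_value = None, float("-inf")
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel boundaries never intersect in a vertex
        # Cramer's rule for the 2 x 2 system of boundary equations.
        x = (r1 * b2 - r2 * b1) / det
        y = (a1 * r2 - a2 * r1) / det
        if all(a * x + b * y <= r + eps for a, b, r in constraints):
            value = c[0] * x + c[1] * y
            if value > best_value:
                best_point, best_value = (x, y), value
    return best_point, best_value


# Hypothetical problem: maximize 3x + 2y with the constraints below.
constraints = [
    (1.0, 1.0, 4.0),   # x + y <= 4
    (1.0, 3.0, 6.0),   # x + 3y <= 6
    (1.0, 0.0, 3.0),   # x <= 3
    (-1.0, 0.0, 0.0),  # x >= 0
    (0.0, -1.0, 0.0),  # y >= 0
]
point, value = solve_lp_2d((3.0, 2.0), constraints)
```

A production solver such as the one targeted by the paper handles many variables, unbounded or infeasible models, and numerical issues that this enumeration ignores.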

[231] Structured Prompting and Feedback-Guided Reasoning with LLMs for Data Interpretation

Amit Rath

Main category: cs.AI

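TL;DR: The STROT framework uses structured prompting and feedback-driven transformation logic to improve the reliability and semantic alignment of LLMs in structured data analysis.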

Details

Motivation: LLMs applied to structured data analysis suffer from inconsistent schema interpretation, misalignment between user intent and model output, and a lack of self-correction mechanisms. Method: STROT combines lightweight schema introspection, sample-based field classification, and dynamic context construction, and refines outputs through an iterative revision mechanism. Result: STROT improves the stability, interpretability, and correctness of LLMs on structured data tasks. Conclusion: STROT provides a reproducible and robust framework for LLM-based structured data analysis. Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and task generalization. However, their application to structured data analysis remains fragile due to inconsistencies in schema interpretation, misalignment between user intent and model output, and limited mechanisms for self-correction when failures occur. This paper introduces the STROT Framework (Structured Task Reasoning and Output Transformation), a method for structured prompting and feedback-driven transformation logic generation aimed at improving the reliability and semantic alignment of LLM-based analytical workflows. STROT begins with lightweight schema introspection and sample-based field classification, enabling dynamic context construction that captures both the structure and statistical profile of the input data. This contextual information is embedded in structured prompts that guide the model toward generating task-specific, interpretable outputs. To address common failure modes in complex queries, STROT incorporates a refinement mechanism in which the model iteratively revises its outputs based on execution feedback and validation signals. Unlike conventional approaches that rely on static prompts or single-shot inference, STROT treats the LLM as a reasoning agent embedded within a controlled analysis loop -- capable of adjusting its output trajectory through planning and correction. The result is a robust and reproducible framework for reasoning over structured data with LLMs, applicable to diverse data exploration and analysis tasks where interpretability, stability, and correctness are essential.

[232] Inducing Robustness in a 2 Dimensional Direct Preference Optimization Paradigm

Sarvesh Shashidhar,Ritik,Nachiketa Patil,Suraj Racha,Ganesh Ramakrishnan

Main category: cs.AI

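TL;DR: This paper examines Direct Preference Optimisation (DPO) and its refinement 2D-DPO for aligning large language models (LLMs) with human preferences, and proposes a way to make 2D-DPO more robust to noise.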

Details

Motivation: DPO is efficient and stable but cannot handle fine-grained scoring, whereas human preferences usually rate different parts of a response differently. 2D-DPO addresses this but is sensitive to noise. Method: Explores the previously proposed 2D-DPO, which introduces 2-dimensional scoring, and adds a segment-level score-noise robustness mechanism to it. Result: 2D-DPO achieves better win rates than standard DPO but is not robust to noise. The improved algorithm is supported both theoretically and empirically. Conclusion: 2D-DPO with the noise-robustness extension offers a better approach to LLM alignment, though other noise models remain to be explored. Abstract: Direct Preference Optimisation (DPO) has emerged as a powerful method for aligning Large Language Models (LLMs) with human preferences, offering a stable and efficient alternative to approaches that use Reinforcement learning via Human Feedback. In this work, we investigate the performance of DPO using open-source preference datasets. One of the major drawbacks of DPO is that it doesn't induce granular scoring and treats all the segments of the responses with equal propensity. However, this is not practically true for human preferences since even "good" responses have segments that may not be preferred by the annotator. To resolve this, a 2-dimensional scoring for DPO alignment called 2D-DPO was proposed. We explore the 2D-DPO alignment paradigm and the advantages it provides over the standard DPO by comparing their win rates. It is observed that these methods, even though effective, are not robust to label/score noise. To counter this, we propose an approach of incorporating segment-level score noise robustness to the 2D-DPO algorithm. Along with theoretical backing, we also provide empirical verification in favour of the algorithm and introduce other noise models that can be present.
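
For orientation, the standard DPO objective that this line of work starts from can be written for a single preference pair as the negative log-sigmoid of a scaled log-probability margin. The sketch below covers only that standard loss. It does not implement 2D-DPO's segment-level scoring or the paper's noise-robust variant, and the function name, argument names, and the default `beta=0.1` are assumptions.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under the frozen reference model.
    """
    # How much more the policy favours the chosen response over the
    # rejected one, measured relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# The loss falls as the policy shifts probability toward the chosen response.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
print(dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

Because the whole response enters through a single log-probability, every segment contributes with equal weight, which is the limitation that segment-level scoring is meant to address.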

[233] Unraveling Media Perspectives: A Comprehensive Methodology Combining Large Language Models, Topic Modeling, Sentiment Analysis, and Ontology Learning to Analyse Media Bias

Orlando Jähde,Thorsten Weber,Rüdiger Buchkremer

Main category: cs.AI

TL;DR: Proposes a new natural language processing methodology for analysing media bias in political news and validates it through case studies.

Details

Motivation: Biased news reporting threatens democratic decision-making, so a scalable, minimally biased method for analysing media bias is needed. Method: Uses natural language processing techniques (hierarchical topic modeling, sentiment analysis, and ontology learning) to analyse event selection, labeling, word choice, and commission and omission biases. Result: Three case studies on political events show that the methodology identifies biases across news sources at various levels of granularity. Conclusion: The work lays the groundwork for tools that help news consumers navigate a complex media landscape. Abstract: Biased news reporting poses a significant threat to informed decision-making and the functioning of democracies. This study introduces a novel methodology for scalable, minimally biased analysis of media bias in political news. The proposed approach examines event selection, labeling, word choice, and commission and omission biases across news sources by leveraging natural language processing techniques, including hierarchical topic modeling, sentiment analysis, and ontology learning with large language models. Through three case studies related to current political events, we demonstrate the methodology's effectiveness in identifying biases across news sources at various levels of granularity. This work represents a significant step towards scalable, minimally biased media bias analysis, laying the groundwork for tools to help news consumers navigate an increasingly complex media landscape.

[234] Attention Mechanisms Perspective: Exploring LLM Processing of Graph-Structured Data

Zhong Guan,Likang Wu,Hongke Zhao,Ming He,Jianpin Fan

Main category: cs.AI

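TL;DR: The study examines the limitations of attention mechanisms on graph-structured data, finds that LLMs fall short in modeling inter-node relationships and adapting to graph topology, and proposes intermediate-state attention windows as an improvement.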

Details

Motivation: Existing attention mechanisms perform poorly on graph-structured data, particularly in capturing inter-node relationships and adapting to graph topology, which calls for a closer study of how LLMs process such data. Method: An empirical study of LLM attention behavior on graph-structured data that identifies its limitations and proposes an improvement. Result: LLMs struggle to model inter-node relationships, and their attention distribution does not match ideal structural patterns. Intermediate-state attention windows improve performance. Conclusion: Intermediate-state attention windows improve training performance and transition seamlessly to fully connected windows at inference, offering a new direction for LLM processing of graph-structured data. Abstract: Attention mechanisms are critical to the success of large language models (LLMs), driving significant advancements in multiple fields. However, for graph-structured data, which requires emphasis on topological connections, they fall short compared to message-passing mechanisms on fixed links, such as those employed by Graph Neural Networks (GNNs). This raises a question: "Does attention fail for graphs in natural language settings?" Motivated by these observations, we embarked on an empirical study from the perspective of attention mechanisms to explore how LLMs process graph-structured data. The goal is to gain deeper insights into the attention behavior of LLMs over graph structures. We uncovered unique phenomena regarding how LLMs apply attention to graph-structured data and analyzed these findings to improve the modeling of such data by LLMs. The primary findings of our research are: 1) While LLMs can recognize graph data and capture text-node interactions, they struggle to model inter-node relationships within graph structures due to inherent architectural constraints. 2) The attention distribution of LLMs across graph nodes does not align with ideal structural patterns, indicating a failure to adapt to graph topology nuances. 3) Neither fully connected attention nor fixed connectivity is optimal; each has specific limitations in its application scenarios. Instead, intermediate-state attention windows improve LLM training performance and seamlessly transition to fully connected windows during inference. Source code: https://github.com/millioniron/LLM_exploration (LLM4Exploration)

[235] Interpretable Emergent Language Using Inter-Agent Transformers

Mannan Bhardwaj

Main category: cs.AI

TL;DR: The paper proposes DIAT, which uses self-attention to learn interpretable multi-agent communication protocols.

Details

Motivation: Existing methods (such as RIAL, DIAL, and CommNet) lack interpretability, and DIAT aims to address this. Method: Differentiable Inter-Agent Transformers (DIAT), built on self-attention. Result: DIAT produces interpretable vocabularies and embeddings and effectively solves cooperative tasks. Conclusion: DIAT shows potential for interpretable communication in complex multi-agent environments. Abstract: This paper explores the emergence of language in multi-agent reinforcement learning (MARL) using transformers. Existing methods such as RIAL, DIAL, and CommNet enable agent communication but lack interpretability. We propose Differentiable Inter-Agent Transformers (DIAT), which leverage self-attention to learn symbolic, human-understandable communication protocols. Through experiments, DIAT demonstrates the ability to encode observations into interpretable vocabularies and meaningful embeddings, effectively solving cooperative tasks. These results highlight the potential of DIAT for interpretable communication in complex multi-agent environments.

Enpei Zhang,Jingyi Chai,Rui Ye,Yanfeng Wang,Siheng Chen

Main category: cs.AI

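TL;DR: This paper proposes an incentivized personalized federated learning framework (iPFL) that solves a graph-based training optimization problem and adds a game-theoretic incentive mechanism, motivating data holders to collaboratively train personalized models without revealing raw data.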

Details

Motivation: Public data is about to run out, while decentralized private data remains underused because of its privacy sensitivity and the lack of incentive mechanisms. Method: iPFL constructs a model-sharing market and combines it with a game-theoretic incentive mechanism that guarantees individual rationality and truthfulness. Result: Across eleven AI tasks, iPFL achieves the highest economic utility and better or comparable model performance. Conclusion: iPFL could become a key technique for using decentralized private data to improve future AI models while meeting the needs of all parties. Abstract: While data plays a crucial role in training contemporary AI models, it is acknowledged that valuable public data will be exhausted in a few years, directing the world's attention towards the massive decentralized private data. However, the privacy-sensitive nature of raw data and the lack of an incentive mechanism prevent these valuable data from being fully exploited. Addressing these challenges, this paper proposes inclusive and incentivized personalized federated learning (iPFL), which incentivizes data holders with diverse purposes to collaboratively train personalized models without revealing raw data. iPFL constructs a model-sharing market by solving a graph-based training optimization and incorporates an incentive mechanism based on game theory principles. Theoretical analysis shows that iPFL adheres to two key incentive properties: individual rationality and truthfulness. Empirical studies on eleven AI tasks (e.g., large language models' instruction-following tasks) demonstrate that iPFL consistently achieves the highest economic utility, and better or comparable model performance compared to baseline methods. We anticipate that our iPFL can serve as a valuable technique for boosting future AI models on decentralized private data while making everyone satisfied.

[237] Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Yemin Shi,Yu Shu,Siwei Dong,Guangyi Liu,Jaward Sesay,Jingwen Li,Zhiting Hu

Main category: cs.AI

TL;DR: Voila is a new end-to-end voice AI model family that supports low-latency, emotionally expressive conversation and handles multiple voice tasks.

Details

Motivation: The goal is a voice AI that blends seamlessly into daily life and supports emotional expression and proactive interaction. Method: A hierarchical multi-scale Transformer architecture that combines large language models with acoustic modeling to enable full-duplex, low-latency conversation. Result: Response latency is just 195 milliseconds. The system supports over one million pre-built voices and fast customization, and applies to a range of voice tasks. Conclusion: Voila provides an open research foundation for next-generation human-machine interaction. Abstract: A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.

[238] Knowing You Don't Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing

Diji Yang,Linda Zeng,Jinmeng Rao,Yi Zhang

Main category: cs.AI

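TL;DR: The SIM-RAG framework strengthens RAG systems' self-awareness and multi-round retrieval capabilities, addressing the shortcomings of existing methods on complex tasks without requiring expensive human-annotated data.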

Details

Motivation: Existing multi-round RAG systems either over-retrieve or under-retrieve, and existing fixes either depend on expensive human-labeled data or perform poorly. Method: Proposes the SIM-RAG framework, which trains a lightweight information sufficiency Critic on self-generated synthetic training data to guide retrieval decisions. Result: Experiments show that SIM-RAG performs well on multi-round RAG tasks while being system-efficient and data-efficient. Conclusion: SIM-RAG offers an efficient solution for multi-round RAG that needs no human annotation. Abstract: Retrieval Augmented Generation (RAG) has shown strong capability in enhancing language models' knowledge and reducing AI generative hallucinations, driving its widespread use. However, complex tasks requiring multi-round retrieval remain challenging, and early attempts tend to be overly optimistic without a good sense of self-skepticism. Current multi-round RAG systems may continue searching even when enough information has already been retrieved, or they may provide incorrect answers without having sufficient information or knowledge. Existing solutions either require large amounts of expensive human-labeled process supervision data or lead to subpar performance. This paper aims to address these limitations by introducing a new framework, SIM-RAG, to explicitly enhance RAG systems' self-awareness and multi-round retrieval capabilities. To train SIM-RAG, we first let a RAG system self-practice multi-round retrieval, augmenting existing question-answer pairs with intermediate inner monologue reasoning steps to generate synthetic training data. For each pair, the system may explore multiple retrieval paths, which are labeled as successful if they reach the correct answer and unsuccessful otherwise. Using this data, we train a lightweight information sufficiency Critic. At inference time, the Critic evaluates whether the RAG system has retrieved sufficient information at each round, guiding retrieval decisions and improving system-level self-awareness through in-context reinforcement learning. Experiments across multiple prominent RAG benchmarks show that SIM-RAG is an effective multi-round RAG solution. Furthermore, this framework is system-efficient, adding a lightweight component to RAG without requiring modifications to existing LLMs or search engines, and data-efficient, eliminating the need for costly human-annotated mid-step retrieval process supervision data.
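
The inference-time loop described in the abstract, in which a Critic decides after each round whether enough has been retrieved, can be sketched as follows. This is a minimal illustration rather than the SIM-RAG implementation: the function signatures, the `max_rounds` cap, and the toy retriever, answerer, critic, and query rewriter are hypothetical stand-ins for a search engine, an LLM, and a trained Critic.

```python
from typing import Callable, Dict, List, Tuple


def multi_round_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
    critic: Callable[[str, List[str], str], bool],
    rewrite_query: Callable[[str, List[str], str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Retrieve and draft answers until the critic accepts or rounds run out."""
    context: List[str] = []
    query = question
    draft = ""
    for round_no in range(1, max_rounds + 1):
        context.extend(retrieve(query))
        draft = answer(question, context)
        # The critic judges whether the gathered context is sufficient.
        if critic(question, context, draft):
            return draft, round_no
        # Otherwise issue a follow-up query and search again.
        query = rewrite_query(question, context, draft)
    return draft, max_rounds


# Toy stand-ins so the loop runs end to end.
QUESTION = "How many people live in the capital of France?"
DOCS: Dict[str, List[str]] = {
    QUESTION: ["Paris is the capital of France."],
    "population of Paris": ["Paris has about 2.1 million residents."],
}


def toy_retrieve(query: str) -> List[str]:
    return DOCS.get(query, [])


def toy_answer(question: str, context: List[str]) -> str:
    return " ".join(context)


def toy_critic(question: str, context: List[str], draft: str) -> bool:
    return "million" in draft


def toy_rewrite(question: str, context: List[str], draft: str) -> str:
    return "population of Paris"


result, rounds = multi_round_rag(
    QUESTION, toy_retrieve, toy_answer, toy_critic, toy_rewrite
)
print(rounds, result)
```

Keeping the critic as a separate callable mirrors the abstract's point that the component is added alongside an existing LLM and search engine without modifying either.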

[239] AutoLibra: Agent Metric Induction from Open-Ended Feedback

Hao Zhu,Phil Cuvin,Xinkai Yu,Charlotte Ka Yee Yan,Jason Zhang,Diyi Yang

Main category: cs.AI

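TL;DR: AutoLibra is a framework that turns open-ended human feedback into concrete behavioral evaluation metrics for evaluating and improving language agents.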

Details

Motivation: Conventional agent evaluation relies on task success metrics, which are coarse, depend on expert design, and fail to capture intermediate behaviors. Method: AutoLibra grounds feedback to agent behavior, clusters positive and negative behaviors, generates concrete metrics, and evaluates with an LLM-as-a-Judge. It also proposes two meta-metrics, "coverage" and "redundancy", to optimize against. Result: Experiments show that AutoLibra induces more concrete metrics than those in previous agent evaluation benchmarks and discovers new ones. Used as prompt-engineering targets on text game tasks, the induced metrics improve agent performance over baseline by a mean of 20%, and AutoLibra can also iteratively select high-quality fine-tuning data for web navigation agents. Conclusion: AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents. Abstract: Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation that transforms open-ended human feedback, e.g., "If you find that the button is disabled, don't click it again", or "This agent has too much autonomy to decide what to do on its own", into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate on a wide range of text game tasks, improving agent performance over baseline by a mean of 20%. Second, we show that AutoLibra can iteratively select high-quality fine-tuning data for web navigation agents. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.

[240] TxP: Reciprocal Generation of Ground Pressure Dynamics and Activity Descriptions for Improving Human Activity Recognition

Lala Shakti Swarup Ray,Lars Krupp,Vitor Fortes Rey,Bo Zhou,Sungho Suh,Paul Lukowicz

Main category: cs.AI

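TL;DR: The paper proposes a bidirectional Text×Pressure model (TxP) that uses generative foundation models to connect pressure data with natural language for human activity recognition (HAR), substantially improving performance.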

Details

Motivation: Pressure sensors have great potential in HAR but remain underutilized, mainly because of limited datasets. The paper aims to fill this gap. Method: Proposes the TxP model, which builds on pre-trained models such as CLIP and LLaMA 2 13B Chat and converts between pressure data and text through the Text2Pressure and Pressure2Text tasks. Result: Validated on real-world data (such as yoga and daily tasks), TxP improves HAR performance by up to 12.4% in macro F1 score. Conclusion: TxP offers a new approach to using pressure sensors in HAR and broadens the options for data augmentation and classification. Abstract: Sensor-based human activity recognition (HAR) has predominantly focused on Inertial Measurement Units and vision data, often overlooking the capabilities unique to pressure sensors, which capture subtle body dynamics and shifts in the center of mass. Despite their potential for postural and balance-based activities, pressure sensors remain underutilized in the HAR domain due to limited datasets. To bridge this gap, we propose to exploit generative foundation models with pressure-specific HAR techniques. Specifically, we present a bidirectional Text×Pressure model that uses generative foundation models to interpret pressure data as natural language. TxP accomplishes two tasks: (1) Text2Pressure, converting activity text descriptions into pressure sequences, and (2) Pressure2Text, generating activity descriptions and classifications from dynamic pressure maps. Leveraging pre-trained models like CLIP and LLaMA 2 13B Chat, TxP is trained on our synthetic PressLang dataset, containing over 81,100 text-pressure pairs. Validated on real-world data for activities such as yoga and daily tasks, TxP provides novel approaches to data augmentation and classification grounded in atomic actions. This consequently improved HAR performance by up to 12.4% in macro F1 score compared to the state-of-the-art, advancing pressure-based HAR with broader applications and deeper insights into human movement.

physics.comp-ph [Back]

[241] Polar Interpolants for Thin-Shell Microstructure Homogenization

Antoine Chan-Lock,Miguel Otaduy

Main category: physics.comp-ph

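TL;DR: This paper proposes a new material homogenization method for thin-shell microstructures that overcomes the limitations of previous methods in energy response, stress response, and the interplay of deformation modes.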

Details

Motivation: Existing methods that fit the energy response neglect visual impact, methods that fit the stress response are not conservative, and all of them are limited to a low-dimensional interplay between deformation modes. Method: Designs conservative material energy functions on the high-dimensional membrane and bending domain, uses a new high-order RBF interpolant, and formulates the parameters and optimization on stress rather than energy. Result: The new method outperforms previous work both quantitatively and qualitatively and accurately fits diverse microstructure behaviors. Conclusion: The proposed method achieves higher accuracy and better correlation with visual impact in material homogenization, providing a new tool for thin-shell microstructure design. Abstract: This paper introduces a new formulation for material homogenization of thin-shell microstructures. It addresses important challenges that limit the quality of previous approaches: methods that fit the energy response neglect visual impact, methods that fit the stress response are not conservative, and all of them are limited to a low-dimensional interplay between deformation modes. The new formulation is rooted in the following design principles: the material energy functions are conservative by definition, they are formulated on the high-dimensional membrane and bending domain to capture the complex interplay of the different deformation modes, the material function domain is maximally aligned with the training data, and the material parameters and the optimization are formulated on stress instead of energy for better correlation with visual impact. The key novelty of our formulation is a new type of high-order RBF interpolant for polar coordinates, which allows us to fulfill all the design principles. We design a material function using this novel interpolant, as well as an overall homogenization workflow. Our results demonstrate very accurate fitting of diverse microstructure behaviors, both quantitatively and qualitatively superior to previous work.

Table of Contents

cs.CV [Back]

[1] Multi-party Collaborative Attention Control for Image Customization

[2] Explainable AI-Driven Detection of Human Monkeypox Using Deep Learning and Vision Transformers: A Comprehensive Analysis

[3] Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models

[4] ZS-VCOS: Zero-Shot Outperforms Supervised Video Camouflaged Object Segmentation

[5] VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos

[6] WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation

[7] Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer

[8] Rethinking RGB-Event Semantic Segmentation with a Novel Bidirectional Motion-enhanced Event Representation

[9] A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation via Synergistic Pseudo-Labeling and Generative Learning

[10] PainFormer: a Vision Foundation Model for Automatic Pain Assessment

[11] Grounding Task Assistance with Multimodal Cues from a Single Demonstration

[12] TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

[13] Multimodal and Multiview Deep Fusion for Autonomous Marine Navigation

[14] Toward Onboard AI-Enabled Solutions to Space Object Detection for Space Sustainability

[15] A Novel WaveInst-based Network for Tree Trunk Structure Extraction and Pattern Analysis in Forest Inventory

[16] Soft-Masked Semi-Dual Optimal Transport for Partial Domain Adaptation

[17] Automated ARAT Scoring Using Multimodal Video Analysis, Multi-View Fusion, and Hierarchical Bayesian Models: A Clinician Study

[18] Topology-Aware CLIP Few-Shot Learning

[19] Component-Based Fairness in Face Attribute Classification with Bayesian Network-informed Meta Learning

[20] Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings

[21] Vision and Intention Boost Large Language Model in Long-Term Action Anticipation

[22] Probabilistic Interactive 3D Segmentation with Hierarchical Neural Processes

[23] PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth

[24] Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes

[25] An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding

[26] Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

[27] Multimodal Graph Representation Learning for Robust Surgical Workflow Recognition with Adversarial Feature Disentanglement

[28] Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos

[29] AquaGS: Fast Underwater Scene Reconstruction with SfM-Free Gaussian Splatting

[30] Efficient 3D Full-Body Motion Generation from Sparse Tracking Inputs with Temporal Windows

[31] Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing

[32] 3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment

[33] PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach

[34] CVVNet: A Cross-Vertical-View Network for Gait Recognition

[35] MVHumanNet++: A Large-scale Dataset of Multi-view Daily Dressing Human Captures with Richer Annotations for 3D Human Digitization

[36] Mitigating Group-Level Fairness Disparities in Federated Visual Language Models

[37] DualDiff: Dual-branch Diffusion Model for Autonomous Driving with Semantic Fusion

[38] Visual enhancement and 3D representation for underwater scenes: a review

[39] PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications

[40] CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture

[41] Rethinking Score Distilling Sampling for 3D Editing and Generation

[42] GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting

[43] GauS-SLAM: Dense RGB-D SLAM with Gaussian Surfels

[44] HybridGS: High-Efficiency Gaussian Splatting Data Compression using Dual-Channel Sparse Representation and Point Cloud Encoder

[45] Segment Any RGB-Thermal Model with Language-aided Distillation

[46] A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models

[47] MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection

[48] Visual Dominance and Emerging Multimodal Approaches in Distracted Driving Detection: A Review of Machine Learning Techniques

[49] Lifelong Whole Slide Image Analysis: Online Vision-Language Adaptation and Past-to-Present Gradient Distillation

[50] Drug classification based on X-ray spectroscopy combined with machine learning

[51] Learning Heterogeneous Mixture of Scene Experts for Large-scale Neural Radiance Fields

[52] Efficient Noise Calculation in Deep Learning-based MRI Reconstructions

[53] MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution

[54] R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation

[55] A Birotation Solution for Relative Pose Problems

[56] Point2Primitive: CAD Reconstruction from Point Cloud by Direct Primitive Prediction

[57] A UNet Model for Accelerated Preprocessing of CRISM Hyperspectral Data for Mineral Identification on Mars

[58] Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin

[59] Transforming faces into video stories -- VideoFace2.0

[60] RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

[61] Hierarchical Compact Clustering Attention (COCA) for Unsupervised Object-Centric Learning

[62] Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation

[63] HandOcc: NeRF-based Hand Rendering with Occupancy Networks

[64] SignSplat: Rendering Sign Language via Gaussian Splatting

[65] Unaligned RGB Guided Hyperspectral Image Super-Resolution with Spatial-Spectral Concordance

[66] GarmentGS: Point-Cloud Guided Gaussian Splatting for High-Fidelity Non-Watertight 3D Garment Reconstruction

[67] HiLLIE: Human-in-the-Loop Training for Low-Light Image Enhancement

[68] Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving

[69] Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution

[70] Focus What Matters: Matchability-Based Reweighting for Local Feature Matching

[71] SparSplat: Fast Multi-View Reconstruction with Generalizable 2D Gaussian Splatting

[72] Saliency-Guided Training for Fingerprint Presentation Attack Detection

[73] Sparfels: Fast Reconstruction from Sparse Unposed Imagery

[74] ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications

[75] Robust AI-Generated Face Detection with Imbalanced Data

[76] DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization

[77] Improving Physical Object State Representation in Text-to-Image Generative Systems

[78] Quantizing Diffusion Models from a Sampling-Aware Perspective