cs.CV

[1] Learning to Borrow Features for Improved Detection of Small Objects in Single-Shot Detectors

Richard Schmit

Main category: cs.CV

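TL;DR: Proposes a new framework that improves small-object detection by "borrowing" features from larger objects of the same class. It comprises three key modules and significantly improves detection accuracy.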

Details

Motivation: Addresses the difficulty of detecting small objects in single-shot detectors, which stems from the trade-off between spatial resolution and semantic richness.

Method: Introduces a Feature Matching Block (FMB), a Feature Representing Block (FRB), and a Feature Fusion Block (FFB), which enhance shallow features through weighted aggregation and feature fusion.

Result: Experiments show that the method significantly improves small-object detection accuracy while maintaining real-time performance.

Conclusion: Offers a promising direction for robust object detection in complex visual environments.

Abstract: Detecting small objects remains a significant challenge in single-shot object detectors due to the inherent trade-off between spatial resolution and semantic richness in convolutional feature maps. To address this issue, we propose a novel framework that enables small object representations to "borrow" discriminative features from larger, semantically richer instances within the same class. Our architecture introduces three key components: the Feature Matching Block (FMB) to identify semantically similar descriptors across layers, the Feature Representing Block (FRB) to generate enhanced shallow features through weighted aggregation, and the Feature Fusion Block (FFB) to refine feature maps by integrating original, borrowed, and context information. Built upon the SSD framework, our method improves the descriptive capacity of shallow layers while maintaining real-time detection performance. Experimental results demonstrate that our approach significantly boosts small object detection accuracy over baseline methods, offering a promising direction for robust object detection in complex visual environments.
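
The abstract describes matching semantically similar descriptors and then building enhanced shallow features by weighted aggregation, but it does not say how similarity or the weights are computed. The sketch below is one plausible reading under stated assumptions: cosine similarity for matching and a softmax over similarities for the weights. The function names `cosine` and `borrow_features` and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def borrow_features(shallow, deep_bank, temperature=0.1):
    """Return (borrowed_descriptor, weights) for one shallow descriptor.

    Assumption: matching = cosine similarity, aggregation = softmax-weighted
    average of the deep descriptors. This is an illustrative stand-in, not
    the paper's actual FMB/FRB.
    """
    sims = [cosine(shallow, d) for d in deep_bank]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(shallow)
    borrowed = [sum(w * d[i] for w, d in zip(weights, deep_bank)) for i in range(dim)]
    return borrowed, weights


borrowed, weights = borrow_features([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
```

In this toy call the first deep descriptor points in the same direction as the shallow one, so it receives almost all of the weight and the borrowed descriptor stays close to it.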

[2] Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design

Vasudev Sharma,Ahmed Alagha,Abdelhakim Khellaf,Vincent Quoc-Huy Trinh,Mahdi S. Hosseini

Main category: cs.CV

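TL;DR: This paper studies three vision-language models (Quilt-Net, Quilt-LLAVA, and CONCH) on a digestive pathology dataset and finds that prompt engineering significantly affects model performance, with anatomical precision mattering most.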

Details

Motivation: Explores how sensitive vision-language models in computational pathology are to large-scale clinical data, task formulation, and prompt design, with the aim of improving diagnostic accuracy.

Method: Through a structured ablation study, develops a comprehensive prompt engineering framework and tests the models on cancer invasiveness and dysplasia status tasks.

Result: CONCH performs best when given precise anatomical references. Anatomical context is critical to performance, and model complexity alone does not determine performance.

Conclusion: Prompt engineering plays an important role in computational pathology, and appropriate domain alignment and training can significantly improve the diagnostic accuracy of vision-language models.

Abstract: Vision-language models (VLMs) have gained significant attention in computational pathology due to their multimodal learning capabilities that enhance big-data analytics of giga-pixel whole slide images (WSIs). However, their sensitivity to large-scale clinical data, task formulations, and prompt design remains an open question, particularly in terms of diagnostic accuracy. In this paper, we present a systematic investigation and analysis of three state-of-the-art VLMs for histopathology, namely Quilt-Net, Quilt-LLAVA, and CONCH, on an in-house digestive pathology dataset comprising 3,507 WSIs, each in giga-pixel form, across distinct tissue types. Through a structured ablative study on cancer invasiveness and dysplasia status, we develop a comprehensive prompt engineering framework that systematically varies domain specificity, anatomical precision, instructional framing, and output constraints. Our findings demonstrate that prompt engineering significantly impacts model performance, with the CONCH model achieving the highest accuracy when provided with precise anatomical references. Additionally, we identify the critical importance of anatomical context in histopathological image analysis, as performance consistently degraded when reducing anatomical precision. We also show that model complexity alone does not guarantee superior performance, as effective domain alignment and domain-specific training are critical. These results establish foundational guidelines for prompt engineering in computational pathology and highlight the potential of VLMs to enhance diagnostic accuracy when properly instructed with domain-appropriate prompts.

[3] Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis

Michal Geyer,Omer Tov,Linyi Jin,Richard Tucker,Inbar Mosseri,Tali Dekel,Noah Snavely

Main category: cs.CV

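TL;DR: Proposes a simple method for converting a text-to-video generator into a video-to-stereo generator that synthesizes the new viewpoint directly, avoiding the limitations of traditional multi-stage pipelines.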

Details

Motivation: Stereoscopic 3D video generation is challenging because data is scarce, and traditional methods perform poorly on complex scenes (for example, with specular or transparent objects).

Method: Uses the priors of a pre-trained video model to synthesize the new viewpoint directly, with no intermediate steps such as depth estimation or inpainting.

Result: Demonstrates the method's advantages on complex real-world scenes with diverse object materials and compositions.

Conclusion: By synthesizing the new viewpoint directly, the method significantly improves the quality and applicability of stereo video generation.

Abstract: The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases, first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model's priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions. See videos at https://video-eye2eye.github.io

[4] Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models

Minh-Hao Van,Xintao Wu

Main category: cs.CV

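TL;DR: The paper proposes a vision-language model (VLM) based approach for detecting and transforming hateful memes, comprising a definition-guided prompting technique and the UnHateMeme framework.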

Details

Motivation: Misuse of hateful memes on social media is a growing problem. Existing research focuses mostly on detection, while methods for transforming hateful content remain lacking.

Method: Uses a definition-guided prompting technique to detect hateful memes and develops the UnHateMeme framework, which transforms hateful content by replacing textual or visual components.

Result: VLMs perform strongly on the detection task, and the UnHateMeme framework effectively converts hateful memes while preserving multimodal coherence.

Conclusion: VLMs show potential for handling hateful content and offer a new approach to building safe online environments.

Abstract: The rapid evolution of social media has provided enhanced communication channels for individuals to create online content, enabling them to express their thoughts and opinions. Multimodal memes, often utilized for playful or humorous expressions with visual and textual elements, are sometimes misused to disseminate hate speech against individuals or groups. While the detection of hateful memes is well-researched, developing effective methods to transform hateful content in memes remains a significant challenge. Leveraging the powerful generation and reasoning capabilities of Vision-Language Models (VLMs), we address the tasks of detecting and mitigating hateful content. This paper presents two key contributions: first, a definition-guided prompting technique for detecting hateful memes, and second, a unified framework for mitigating hateful content in memes, named UnHateMeme, which works by replacing hateful textual and/or visual components. With our definition-guided prompts, VLMs achieve impressive performance on the hateful meme detection task. Furthermore, our UnHateMeme framework, integrated with VLMs, demonstrates a strong capability to convert hateful memes into non-hateful forms that meet human-level criteria for hate speech and maintain multimodal coherence between image and text. Through empirical experiments, we show the effectiveness of state-of-the-art pretrained VLMs such as LLaVA, Gemini and GPT-4o on the proposed tasks, providing a comprehensive analysis of their respective strengths and limitations for these tasks. This paper aims to shed light on important applications of VLMs for ensuring safe and respectful online environments.

[5] V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving

Jannik Lübberstedt,Esteban Rivera,Nico Uhlemann,Markus Lienkamp

Main category: cs.CV

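TL;DR: V3LMA combines large language models (LLMs) with large vision-language models (LVLMs) to improve 3D scene understanding in autonomous driving, significantly boosting performance without fine-tuning.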

Details

Motivation: Existing large vision-language models (LVLMs) have a limited understanding of 3D environments in autonomous driving, which hampers a complete and safe understanding of dynamic surroundings.

Method: V3LMA extracts 3D object data through a preprocessing pipeline, combines textual descriptions with video inputs, and uses a fusion strategy across LLMs and LVLMs to improve scene understanding.

Result: Achieves a score of 0.56 on the LingoQA benchmark, improving situational awareness and decision-making in complex traffic scenarios.

Conclusion: Through its fusion strategy and 3D data extraction, V3LMA advances traffic scene understanding and opens a path toward safer autonomous driving systems.

Abstract: Large Vision Language Models (LVLMs) have shown strong capabilities in understanding and analyzing visual scenes across various domains. However, in the context of autonomous driving, their limited comprehension of 3D environments restricts their effectiveness in achieving a complete and safe understanding of dynamic surroundings. To address this, we introduce V3LMA, a novel approach that enhances 3D scene understanding by integrating Large Language Models (LLMs) with LVLMs. V3LMA leverages textual descriptions generated from object detections and video inputs, significantly boosting performance without requiring fine-tuning. Through a dedicated preprocessing pipeline that extracts 3D object data, our method improves situational awareness and decision-making in complex traffic scenarios, achieving a score of 0.56 on the LingoQA benchmark. We further explore different fusion strategies and token combinations with the goal of advancing the interpretation of traffic scenes, ultimately enabling safer autonomous driving systems.

[6] Direct Motion Models for Assessing Generated Videos

Kelsey Allen,Carl Doersch,Guangyao Zhou,Mohammed Suhail,Danny Driess,Ignacio Rocco,Yulia Rubanova,Thomas Kipf,Mehdi S. M. Sajjadi,Kevin Murphy,Joao Carreira,Sjoerd van Steenkiste

Main category: cs.CV

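TL;DR: The paper proposes a new metric based on auto-encoding point tracks that evaluates motion quality in generated videos more accurately than existing methods such as FVD.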

Details

Motivation: Current video generation models produce good individual frames but poor motion, and existing evaluation methods (such as FVD) fail to capture this problem.

Method: Extracts motion features by auto-encoding point tracks, uses them to compare video distributions or evaluate single videos, and localizes spatiotemporal inconsistencies in generated videos.

Result: The new metric is more sensitive to temporal distortions in synthetic data and better predicts human judgments of temporal consistency and realism in generated videos.

Conclusion: Point-track representations not only improve the accuracy of motion evaluation but also allow errors in generated videos to be localized in space and time, improving interpretability.

Abstract: A current limitation of generative video models is that they generate plausible-looking frames, but poor motion -- an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used not only to compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), but also to evaluate the motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and can predict human evaluations of temporal consistency and realism in generated videos obtained from open-source models better than a wide range of alternatives. We also show that by using a point track representation, we can spatiotemporally localize generative video inconsistencies, providing extra interpretability of generated video errors relative to prior work. An overview of the results and link to the code can be found on the project page: http://trajan-paper.github.io.

[7] Towards Robust and Generalizable Gerchberg Saxton based Physics Inspired Neural Networks for Computer Generated Holography: A Sensitivity Analysis Framework

Ankit Amrutkar,Björn Kampa,Volkmar Schulz,Johannes Stegmaier,Markus Rothermel,Dorit Merhof

Main category: cs.CV

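TL;DR: The paper proposes a sensitivity analysis framework based on Saltelli's extension of Sobol's method to quantify the impact of forward model hyperparameters on GS-PINN performance, and identifies SLM pixel resolution as the key factor.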

Details

Motivation: Addresses the inverse problem of phase retrieval in computer-generated holography (CGH) and seeks to improve the generalization and performance of physics-inspired neural networks (PINNs).

Method: Applies a sensitivity analysis based on Saltelli's extension of Sobol's method to assess how forward model hyperparameters affect GS-PINN performance.

Result: SLM pixel resolution is the primary factor affecting neural network sensitivity, and free-space propagation forward models outperform Fourier holography.

Conclusion: The study provides concrete guidelines for forward model selection, neural network architecture, and performance evaluation in CGH, advancing robust, interpretable, and generalizable neural networks.

Abstract: Computer-generated holography (CGH) enables applications in holographic augmented reality (AR), 3D displays, systems neuroscience, and optical trapping. The fundamental challenge in CGH is solving the inverse problem of phase retrieval from intensity measurements. Physics-inspired neural networks (PINNs), especially Gerchberg-Saxton-based PINNs (GS-PINNs), have advanced phase retrieval capabilities. However, their performance strongly depends on forward models (FMs) and their hyperparameters (FMHs), limiting generalization, complicating benchmarking, and hindering hardware optimization. We present a systematic sensitivity analysis framework based on Saltelli's extension of Sobol's method to quantify FMH impacts on GS-PINN performance. Our analysis demonstrates that SLM pixel-resolution is the primary factor affecting neural network sensitivity, followed by pixel-pitch, propagation distance, and wavelength. Free space propagation forward models demonstrate superior neural network performance compared to Fourier holography, providing enhanced parameterization and generalization. We introduce a composite evaluation metric combining performance consistency, generalization capability, and hyperparameter perturbation resilience, establishing a unified benchmarking standard across CGH configurations. Our research connects physics-inspired deep learning theory with practical CGH implementations through concrete guidelines for forward model selection, neural network architecture, and performance evaluation. Our contributions advance the development of robust, interpretable, and generalizable neural networks for diverse holographic applications, supporting evidence-based decisions in CGH research and implementation.
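
To make the variance-based sensitivity analysis concrete, the sketch below estimates first-order Sobol indices with the two-matrix (A, B) sampling scheme associated with Saltelli. It is a minimal illustration under several assumptions: plain pseudo-random sampling on the unit hypercube instead of a quasi-random sequence, and a toy linear function standing in for "GS-PINN performance as a function of forward model hyperparameters". The function name `sobol_first_order` and the toy model are hypothetical and not taken from the paper.

```python
import random


def sobol_first_order(f, dim, n=20000, seed=0):
    """Estimate first-order Sobol indices of f on the unit hypercube.

    Uses two independent sample matrices A and B. For each input i, the
    matrix AB_i equals A with column i taken from B, and
    V_i ~ mean((f(B) - mean) * (f(AB_i) - f(A))).
    """
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(dim)] for _ in range(n)]
    B = [[rng.random() for _ in range(dim)] for _ in range(n)]
    fA = [f(row) for row in A]
    fB = [f(row) for row in B]
    all_y = fA + fB
    mean = sum(all_y) / len(all_y)
    var = sum((y - mean) ** 2 for y in all_y) / len(all_y)
    indices = []
    for i in range(dim):
        fABi = [f(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # Centring f(B) keeps the estimator unbiased and lowers its variance.
        v_i = sum((yb - mean) * (yab - ya) for yb, yab, ya in zip(fB, fABi, fA)) / n
        indices.append(v_i / var)
    return indices


# Toy model (assumption): y = x0 + 2*x1, so the exact indices are 0.2 and 0.8.
print(sobol_first_order(lambda x: x[0] + 2 * x[1], dim=2))
```

A ranking such as the one reported in the abstract (pixel resolution first, then pixel pitch, propagation distance, and wavelength) would correspond to sorting inputs by indices of this kind.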

[8] ReXGradient-160K: A Large-Scale Publicly Available Dataset of Chest Radiographs with Free-text Reports

Xiaoman Zhang,Julián N. Acosta,Josh Miller,Ouwen Huang,Pranav Rajpurkar

Main category: cs.CV

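TL;DR: ReXGradient-160K is the largest publicly available chest X-ray dataset to date by number of patients, containing 160,000 studies from 109,487 patients, and aims to advance medical imaging AI research.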

Details

Motivation: Provide a large-scale, diverse chest X-ray dataset to accelerate research and development in medical imaging AI and automated report generation.

Method: The dataset contains multiple images per study and detailed radiology reports, is split into training, validation, and test sets, and reserves a private test set for evaluation on the ReXrank benchmark.

Result: The dataset offers a rich resource for medical imaging AI and automated report generation models.

Conclusion: The release of ReXGradient-160K is expected to advance the field of medical imaging AI, and the dataset is to be open-sourced.

Abstract: We present ReXGradient-160K, representing the largest publicly available chest X-ray dataset to date in terms of the number of patients. This dataset contains 160,000 chest X-ray studies with paired radiological reports from 109,487 unique patients across 3 U.S. health systems (79 medical sites). This comprehensive dataset includes multiple images per study and detailed radiology reports, making it particularly valuable for the development and evaluation of AI systems for medical imaging and automated report generation models. The dataset is divided into training (140,000 studies), validation (10,000 studies), and public test (10,000 studies) sets, with an additional private test set (10,000 studies) reserved for model evaluation on the ReXrank benchmark. By providing this extensive dataset, we aim to accelerate research in medical imaging AI and advance the state-of-the-art in automated radiological analysis. Our dataset will be open-sourced at https://huggingface.co/datasets/rajpurkarlab/ReXGradient-160K.

[9] Empowering Agentic Video Analytics Systems with Video Language Models

Yuxuan Yan,Shiqi Jiang,Ting Cao,Yifan Yang,Qianqian Yang,Yuanchao Shu,Yuqing Yang,Lili Qiu

Main category: cs.CV

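TL;DR: AVA is a system built on video-language models (VLMs) that uses Event Knowledge Graphs (EKGs) and a retrieval-generation mechanism to address the challenge of processing ultra-long video content, and it performs strongly on multiple benchmarks.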

Details

Motivation: Existing video analytics systems are limited to predefined tasks and adapt poorly to open-ended scenarios, while the limited context windows of VLMs hinder the processing of ultra-long videos.

Method: AVA builds Event Knowledge Graphs (EKGs) in near real time to index video streams and combines them with an agentic retrieval-generation mechanism to handle complex queries.

Result: Achieves 62.3% and 64.1% accuracy on LVBench and VideoMME-Long, respectively, and 75.8% on AVA-100.

Conclusion: AVA performs well in open-ended scenarios and ultra-long video analytics, offering a new direction for video analytics.

Abstract: AI-driven video analytics has become increasingly pivotal across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Video-Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%.

[10] Pack-PTQ: Advancing Post-training Quantization of Neural Networks by Pack-wise Reconstruction

Changjun Li,Runqing Jiang,Zhuo Song,Pengpeng Yu,Ye Zhang,Yulan Guo

Main category: cs.CV

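TL;DR: This paper proposes Pack-PTQ, a new post-training quantization method that uses adaptive packing and mixed-precision quantization to address two weaknesses of existing methods: neglect of cross-block dependency and loss of accuracy at low bit-widths.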

Details

Motivation: Existing post-training quantization methods use block-wise reconstruction, which ignores cross-block dependency and leads to a notable accuracy drop in low-bit settings.

Method: Designs a Hessian-guided adaptive packing mechanism that partitions blocks into non-overlapping packs serving as the base unit for reconstruction, and proposes a mixed-precision quantization approach that assigns different bit-widths to packs according to their sensitivity.

Result: Experiments on 2D image and 3D point cloud classification tasks show that the method outperforms existing post-training quantization methods.

Conclusion: By preserving cross-block dependency and using mixed-precision quantization, Pack-PTQ significantly improves low-bit quantization performance.

Abstract: Post-training quantization (PTQ) has evolved as a prominent solution for compressing complex models, which advocates a small calibration dataset and avoids end-to-end retraining. However, most existing PTQ methods employ block-wise reconstruction, which neglects cross-block dependency and exhibits a notable accuracy drop in low-bit cases. To address these limitations, this paper presents a novel PTQ method, dubbed Pack-PTQ. First, we design a Hessian-guided adaptive packing mechanism to partition blocks into non-overlapping packs, which serve as the base unit for reconstruction, thereby preserving the cross-block dependency and enabling accurate estimation of quantization parameters. Second, based on the pack configuration, we propose a mixed-precision quantization approach to assign varied bit-widths to packs according to their distinct sensitivities, thereby further enhancing performance. Extensive experiments on 2D image and 3D point cloud classification tasks, using various network architectures, demonstrate the superiority of our method over the state-of-the-art PTQ methods.

[11] AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care

Md Asaduzzaman Jabin,Hanqi Jiang,Yiwei Li,Patrick Kaggwa,Eugene Douglass,Juliet N. Sekandi,Tianming Liu

Main category: cs.CV

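TL;DR: AdCare-VLM is a video-language multimodal model that detects medication adherence from patient videos and outperforms existing methods on tuberculosis monitoring.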

Details

Motivation: Poor medication adherence among patients with chronic diseases leads to disease progression and higher mortality, so technical means of improving monitoring are urgently needed.

Method: Uses a dataset of 806 tuberculosis medication monitoring videos to fine-tune a VLM for adherence detection through visual-linguistic alignment and multimodal interaction.

Result: The model improves performance by 3.1% to 3.54% across several configurations, and ablation studies support its effectiveness.

Conclusion: AdCare-VLM offers an efficient and interpretable solution for medication adherence monitoring.

Abstract: Chronic diseases, including diabetes, hypertension, asthma, HIV-AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized Video-LLaVA-based multimodal large vision language model (LVLM) aimed at visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, which have been labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient's face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT) enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.

[12] Fine-grained spatial-temporal perception for gas leak segmentation

Xinlong Zhao,Shan Du

Main category: cs.CV

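TL;DR: Proposes FGSTP, an algorithm for efficient and accurate gas leak detection and segmentation that captures inter-frame motion information and combines it with refined object features, outperforming existing methods on the GasVid dataset.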

Details

Motivation: Gas leaks pose major risks to human health and the environment, but their concealed appearance and random shapes make them hard for existing methods to detect and segment efficiently and accurately.

Method: FGSTP builds a correlation volume to capture inter-frame motion information, progressively refines object features, and uses a decoder to optimize boundary segmentation.

Result: On the manually labeled GasVid dataset, FGSTP performs best at segmenting non-rigid objects such as gas leaks and produces the most accurate masks.

Conclusion: FGSTP performs well on gas leak segmentation and offers an effective solution for detecting leaks with concealed appearance and random shapes.

Abstract: Gas leaks pose significant risks to human health and the environment. Despite long-standing concerns, there are limited methods that can efficiently and accurately detect and segment leaks due to their concealed appearance and random shapes. In this paper, we propose a Fine-grained Spatial-Temporal Perception (FGSTP) algorithm for gas leak segmentation. FGSTP captures critical motion clues across frames and integrates them with refined object features in an end-to-end network. Specifically, we first construct a correlation volume to capture motion information between consecutive frames. Then, the fine-grained perception progressively refines the object-level features using previous outputs. Finally, a decoder is employed to optimize boundary segmentation. Because there is no highly precise labeled dataset for gas leak segmentation, we manually label a gas leak video dataset, GasVid. Experimental results on GasVid demonstrate that our model excels in segmenting non-rigid objects such as gas leaks, generating the most accurate mask compared to other state-of-the-art (SOTA) models.

[13] AI-Assisted Decision-Making for Clinical Assessment of Auto-Segmented Contour Quality

Biling Wang,Austen Maniscalco,Ti Bai,Siqiu Wang,Michael Dohopolski,Mu-Han Lin,Chenyang Shen,Dan Nguyen,Junzhou Huang,Steve Jiang,Xinlei Wang

Main category: cs.CV

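TL;DR: Proposes a deep-learning-based quality assessment method for auto-generated contours in radiotherapy that uses Bayesian ordinal classification and calibrated uncertainty thresholds, without relying on ground-truth contours or extensive manual labeling.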

Details

Motivation: Improve the efficiency of contouring in online adaptive radiotherapy, reduce manual workload, and ensure the safety and reliability of the radiotherapy workflow.

Method: Develops a Bayesian ordinal classification model that classifies auto-contour quality and quantifies prediction uncertainty, optimizes uncertainty thresholds through a calibration step, and covers three data scenarios: no labels, limited labels, and extensive labels.

Result: The model is robust in all scenarios. With only 30 manual labels and calibration on 34 subjects it exceeds 90% test accuracy, and with the calibrated threshold the quality of over 93% of auto-contours is predicted accurately in over 98% of cases.

Conclusion: The quality assessment model significantly improves contouring efficiency, reduces the need for manual review, supports fast clinical decisions, and uses uncertainty quantification to make the radiotherapy workflow safer and more reliable.

Abstract: Purpose: This study presents a Deep Learning (DL)-based quality assessment (QA) approach for evaluating auto-generated contours (auto-contours) in radiotherapy, with emphasis on Online Adaptive Radiotherapy (OART). Leveraging Bayesian Ordinal Classification (BOC) and calibrated uncertainty thresholds, the method enables confident QA predictions without relying on ground truth contours or extensive manual labeling. Methods: We developed a BOC model to classify auto-contour quality and quantify prediction uncertainty. A calibration step was used to optimize uncertainty thresholds that meet clinical accuracy needs. The method was validated under three data scenarios: no manual labels, limited labels, and extensive labels. For rectum contours in prostate cancer, we applied geometric surrogate labels when manual labels were absent, transfer learning when limited, and direct supervision when ample labels were available. Results: The BOC model delivered robust performance across all scenarios. Fine-tuning with just 30 manual labels and calibrating with 34 subjects yielded over 90% accuracy on test data. Using the calibrated threshold, over 93% of the auto-contours' qualities were accurately predicted in over 98% of cases, reducing unnecessary manual reviews and highlighting cases needing correction. Conclusion: The proposed QA model enhances contouring efficiency in OART by reducing manual workload and enabling fast, informed clinical decisions. Through uncertainty quantification, it ensures safer, more reliable radiotherapy workflows.
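
The abstract mentions a calibration step that picks uncertainty thresholds meeting a clinical accuracy target, without giving the procedure. One simple way such a step could work is sketched below: on a calibration set, choose the largest uncertainty threshold such that predictions at or below it reach the target accuracy, and send everything above it to manual review. The function name `calibrate_threshold`, its inputs, and the selection rule are assumptions for illustration, not the authors' method.

```python
def calibrate_threshold(uncertainties, correct, target_accuracy=0.98):
    """Pick the largest uncertainty threshold whose accepted set meets the target.

    uncertainties: per-case prediction uncertainty on a calibration set.
    correct: per-case booleans, True if the quality prediction was right.
    Returns (threshold, coverage), or (None, 0.0) if no threshold qualifies.
    Hypothetical procedure; the paper's calibration may differ.
    """
    pairs = sorted(zip(uncertainties, correct))
    best = (None, 0.0)
    hits = 0
    for k, (u, ok) in enumerate(pairs, start=1):
        hits += 1 if ok else 0
        # Evaluate only at the last sample of a run of tied uncertainties.
        if k < len(pairs) and pairs[k][0] == u:
            continue
        if hits / k >= target_accuracy:
            best = (u, k / len(pairs))
    return best


threshold, coverage = calibrate_threshold(
    [0.5, 0.1, 0.4, 0.2, 0.3],
    [True, True, False, True, True],
    target_accuracy=0.9,
)
```

Here `coverage` is the fraction of calibration cases accepted without manual review, which is the quantity a clinic would trade off against the accuracy target.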

[14] AWARE-NET: Adaptive Weighted Averaging for Robust Ensemble Network in Deepfake Detection

Muhammad Salman,Iqra Tariq,Mishal Zulfiqar,Muqadas Jalal,Sami Aujla,Sumbal Fatima

Main category: cs.CV

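TL;DR: The paper proposes a novel two-tier ensemble framework for deep-learning-based deepfake detection that combines several state-of-the-art architectures through a dynamic weighting mechanism, significantly improving detection performance.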

Details

Motivation: The rise of synthetic media threatens digital identity and cybersecurity, making deepfake detection critical, yet existing methods still struggle to perform consistently across diverse datasets and manipulation types.

Method: Uses a two-tier ensemble framework that combines multiple instances of three architectures (Xception, Res2Net101, and EfficientNet-B7) and refines predictions through a dynamic weighting mechanism.

Result: Achieves AUC scores of 99.22% and 100.00% and F1 scores of 98.06% and 99.94% on FF++ and CelebDF-v2, respectively, and demonstrates strong cross-dataset generalization.

Conclusion: The framework significantly improves deepfake detection performance and generalization, offering a new direction for future research.

Abstract: Deepfake detection has become increasingly important due to the rise of synthetic media, which poses significant risks to digital identity and cyber presence for security and trust. While multiple approaches have improved detection accuracy, challenges remain in achieving consistent performance across diverse datasets and manipulation types. In response, we propose a novel two-tier ensemble framework for deepfake detection based on deep learning that hierarchically combines multiple instances of three state-of-the-art architectures: Xception, Res2Net101, and EfficientNet-B7. Our framework employs a unique approach where each architecture is instantiated three times with different initializations to enhance model diversity, followed by a learnable weighting mechanism that dynamically combines their predictions. Unlike traditional fixed-weight ensembles, our first tier averages predictions within each architecture family to reduce model variance, while the second tier learns optimal contribution weights through backpropagation, automatically adjusting each architecture's influence based on its detection reliability. Our experiments achieved state-of-the-art intra-dataset performance with AUC scores of 99.22% (FF++) and 100.00% (CelebDF-v2), and F1 scores of 98.06% (FF++) and 99.94% (CelebDF-v2) without augmentation. With augmentation, we achieve AUC scores of 99.47% (FF++) and 100.00% (CelebDF-v2), and F1 scores of 98.43% (FF++) and 99.95% (CelebDF-v2). The framework demonstrates robust cross-dataset generalization, achieving AUC scores of 88.20% and 72.52%, and F1 scores of 93.16% and 80.62% in cross-dataset evaluations.
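
The two-tier combination described in the abstract can be illustrated with a forward pass alone. In the sketch below, tier one averages the predictions of the instances within each architecture family, and tier two mixes the family averages with normalized weights. Normalizing the weights with a softmax over per-family scores is an assumption, as are the names `two_tier_predict` and `family_logits`; in the paper the weights are learned by backpropagation, whereas here they are fixed numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def two_tier_predict(family_probs, family_logits):
    """Combine per-instance fake-probabilities into one ensemble score.

    family_probs: {family_name: [p_instance1, p_instance2, ...]}
    family_logits: {family_name: float}, stand-ins for learnable weights.
    """
    names = sorted(family_probs)
    # Tier 1: average within each architecture family to reduce variance.
    tier1 = [sum(family_probs[n]) / len(family_probs[n]) for n in names]
    # Tier 2: weight the families (assumed softmax normalization).
    weights = softmax([family_logits[n] for n in names])
    return sum(w * p for w, p in zip(weights, tier1))


score = two_tier_predict(
    {"xception": [0.9, 0.8, 0.7], "res2net101": [0.6, 0.6, 0.6], "efficientnet_b7": [0.3, 0.3, 0.3]},
    {"xception": 0.0, "res2net101": 0.0, "efficientnet_b7": 0.0},
)
```

With all scores equal, the second tier reduces to a plain average of the family means; raising one family's score shifts the ensemble output toward that family's prediction.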

[15] Quaternion Wavelet-Conditioned Diffusion Models for Image Super-Resolution

Luigi Sigillo,Christian Bianchi,Danilo Comminiello

Main category: cs.CV

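TL;DR: Proposes ResQu, a new super-resolution framework that combines quaternion wavelet preprocessing with latent diffusion models and improves image reconstruction quality by dynamically integrating quaternion wavelet embeddings.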

Details

Motivation: Super-resolution is widely used in computer vision, but existing methods struggle to balance perceptual quality and structural fidelity at high upscaling factors. Diffusion models are promising but still need improvement.

Method: Integrates quaternion wavelet preprocessing with latent diffusion models, introduces a quaternion wavelet- and time-aware encoder, dynamically integrates wavelet embeddings into the denoising process, and leverages the generative priors of foundation models such as Stable Diffusion.

Result: Performs strongly on domain-specific datasets, surpassing existing methods in perceptual quality and standard evaluation metrics in many cases.

Conclusion: By combining quaternion wavelets with diffusion models, the ResQu framework significantly improves super-resolution results.

Abstract: Image Super-Resolution is a fundamental problem in computer vision with broad applications ranging from medical imaging to satellite analysis. The ability to reconstruct high-resolution images from low-resolution inputs is crucial for enhancing downstream tasks such as object detection and segmentation. While deep learning has significantly advanced SR, achieving high-quality reconstructions with fine-grained details and realistic textures remains challenging, particularly at high upscaling factors. Recent approaches leveraging diffusion models have demonstrated promising results, yet they often struggle to balance perceptual quality with structural fidelity. In this work, we introduce ResQu, a novel SR framework that integrates a quaternion wavelet preprocessing framework with latent diffusion models, incorporating a new quaternion wavelet- and time-aware encoder. Unlike prior methods that simply apply wavelet transforms within diffusion models, our approach enhances the conditioning process by exploiting quaternion wavelet embeddings, which are dynamically integrated at different stages of denoising. Furthermore, we also leverage the generative priors of foundation models such as Stable Diffusion. Extensive experiments on domain-specific datasets demonstrate that our method achieves outstanding SR results, in many cases outperforming existing approaches in perceptual quality and standard evaluation metrics. The code will be available after the revision process.

[16] Efficient Neural Video Representation with Temporally Coherent Modulation

Seungjun Shin,Suji Kim,Dokwan Oh

Main category: cs.CV

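TL;DR: NVTM is a new video representation framework that decomposes spatio-temporal 3D video data into 2D grids with flow information, enabling fast learning and efficient parameter use and significantly improving encoding speed and video quality.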

Details

Motivation: Existing grid-type parametric encoding methods suffer from parameter redundancy and inefficiency in video applications. NVTM aims to solve this while better capturing video dynamics and improving encoding efficiency.

Method: NVTM decomposes video data into 2D grids with flow information and uses temporally coherent modulation to capture the dynamic characteristics of video, enabling efficient parameter use and fast encoding.

Result: NVTM encodes more than 3 times faster than NeRV-style methods, improves PSNR/LPIPS significantly, and performs comparably to H.264 and HEVC on compression tasks.

Conclusion: Through its design, NVTM performs well on video representation and compression tasks and has broad application potential.

Abstract: Implicit neural representations (INR) have found successful applications across diverse domains. To employ INR in real-life settings, it is important to speed up training. In the field of INR for video applications, the state-of-the-art approach employs grid-type parametric encoding and successfully achieves a faster encoding speed in comparison to its predecessors. However, the grid usage, which does not consider the video's dynamic nature, leads to redundant use of trainable parameters. As a result, it has significantly lower parameter efficiency and higher bitrate compared to NeRV-style methods that do not use a parametric encoding. To address the problem, we propose Neural Video representation with Temporally coherent Modulation (NVTM), a novel framework that can capture dynamic characteristics of video. By decomposing the spatio-temporal 3D video data into a set of 2D grids with flow information, NVTM enables learning video representation rapidly and uses parameters efficiently. Our framework processes temporally corresponding pixels at once, resulting in the fastest encoding speed for a reasonable video quality, especially when compared to the NeRV-style method, with a speed increase of over 3 times. It also achieves an average improvement of 1.54dB/0.019 in PSNR/LPIPS on UVG (Dynamic) (even with 10% fewer parameters) and an average improvement of 1.84dB/0.013 in PSNR/LPIPS on MCL-JCV (Dynamic), compared to previous grid-type works. By expanding this to compression tasks, we demonstrate comparable performance to video compression standards (H.264, HEVC) and recent INR approaches for video compression. Additionally, we perform extensive experiments demonstrating the superior performance of our algorithm across diverse tasks, encompassing super resolution, frame interpolation and video inpainting. Project page is https://sujiikim.github.io/NVTM/.

M. A. D. Buser,D. C. Simons,M. Fitski,M. H. W. A. Wijnen,A. S. Littooij,A. H. ter Brugge,I. N. Vos,M. H. A. Janse,M. de Boer,R. ter Maat,J. Sato,S. Kido,S. Kondo,S. Kasai,M. Wodzinski,H. Muller,J. Ye,J. He,Y. Kirchhoff,M. R. Rokkus,G. Haokai,S. Zitong,M. Fernández-Patón,D. Veiga-Canuto,D. G. Ellis,M. R. Aizenberg,B. H. M. van der Velden,H. Kuijf,A. De Luca,A. F. W. van der Steeg

Main category: cs.CV

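TL;DR: The SPPIN challenge aims to advance automatic MRI segmentation of neuroblastoma. The top-ranked team used the pretrained network STU-Net, but segmentation of small tumors still needs improvement.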

Details

Motivation: Surgical planning for neuroblastoma relies on MRI-based 3D models that are time-consuming to create and user dependent, so automatic segmentation methods are needed. Method: Organize the SPPIN challenge, provide a multi-modal MRI dataset, and evaluate the teams' segmentation performance with the Dice, HD95, and VS metrics. Result: The highest-ranking team achieved a Dice score of 0.82, an HD95 of 7.69 mm, and a VS of 0.91, but segmentation of small tumors was poor. Conclusion: Pretrained networks are effective on small datasets, but more reliable methods are needed to support clinical surgical planning. Abstract: Surgery plays an important role within the treatment for neuroblastoma, a common pediatric cancer. This requires careful planning, often via magnetic resonance imaging (MRI)-based anatomical 3D models. However, creating these models is often time-consuming and user dependent. We organized the Surgical Planning in Pediatric Neuroblastoma (SPPIN) challenge to stimulate developments on this topic and set a benchmark for fully automatic segmentation of neuroblastoma on multi-modal MRI. The challenge started with a training phase, where teams received 78 sets of MRI scans from 34 patients, consisting of both diagnostic and post-chemotherapy MRI scans. The final test phase, consisting of 18 MRI sets from 9 patients, determined the ranking of the teams. Ranking was based on the Dice similarity coefficient (Dice score), the 95th percentile of the Hausdorff distance (HD95) and the volumetric similarity (VS). The SPPIN challenge was hosted at MICCAI 2023. The final leaderboard consisted of 9 teams. The highest-ranking team achieved a median Dice score of 0.82, a median HD95 of 7.69 mm and a VS of 0.91, utilizing a large, pretrained network called STU-Net. A significant difference for the segmentation results between diagnostic and post-chemotherapy MRI scans was observed (Dice = 0.89 vs Dice = 0.59, P = 0.01) for the highest-ranking team. SPPIN is the first medical segmentation challenge in extracranial pediatric oncology. The highest-ranking team used a large pre-trained network, suggesting that pretraining can be of use in small, heterogeneous datasets. Although the results of the highest-ranking team were high for most patients, segmentation, especially in small, pre-treated tumors, was insufficient. Therefore, more reliable segmentation methods are needed to create clinically applicable models to aid surgical planning in pediatric neuroblastoma.
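The two overlap-style ranking metrics named in the SPPIN abstract can be illustrated with a minimal sketch. It assumes binary masks flattened into 0/1 lists and uses the common textbook definitions of the Dice coefficient and volumetric similarity; the function names are hypothetical and the challenge's own evaluation code may differ in details such as empty-mask handling.

```python
def dice_score(pred, truth):
    """Dice similarity coefficient for two flat binary masks (0/1 values)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def volumetric_similarity(pred, truth):
    """Volumetric similarity: 1 - | |A| - |B| | / (|A| + |B|)."""
    vol_pred = sum(1 for p in pred if p)
    vol_truth = sum(1 for t in truth if t)
    total = vol_pred + vol_truth
    return 1.0 if total == 0 else 1.0 - abs(vol_pred - vol_truth) / total


pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
print(dice_score(pred, truth), volumetric_similarity(pred, truth))
```

In this toy case the two masks have equal volume but only partly overlap, so volumetric similarity is perfect while Dice is not. This is one reason a ranking would combine several metrics rather than rely on volume alone.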

[18] Cues3D: Unleashing the Power of Sole NeRF for Consistent and Unique Instances in Open-Vocabulary 3D Panoptic Segmentation

Feng Xue,Wenzhuang Xu,Guofeng Zhong,Anlong Minga,Nicu Sebe

Main category: cs.CV

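TL;DR: Cues3D is a compact NeRF-based approach to open-vocabulary 3D panoptic segmentation that obtains globally consistent geometry from NeRF's implicit 3D field, without explicit cross-view supervision.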

Details

Motivation: Existing methods depend on high-fidelity 3D point clouds or cross-view association pre-processing, whereas Cues3D aims to achieve global consistency more efficiently through NeRF's implicit 3D field. Method: A three-phase training framework (initialization, disambiguation, refinement), combined with an instance disambiguation method that matches NeRF-rendered 3D masks to ensure globally unique 3D instance identities. Result: Across several datasets, Cues3D outperforms 2D image-based methods and competes with the latest 2D-3D merging methods, even surpassing them when additional 3D point clouds are used. Conclusion: Cues3D shows the potential of NeRF's implicit 3D field for global consistency and efficient object distinction, offering a new direction for open-vocabulary 3D segmentation. Abstract: Open-vocabulary 3D panoptic segmentation has recently emerged as a significant trend. Top-performing methods currently integrate 2D segmentation with geometry-aware 3D primitives. However, this advantage is lost without high-fidelity 3D point clouds, as is the case for methods based on Neural Radiance Field (NeRF). These methods are limited by the insufficient capacity to maintain consistency across partial observations. To address this, recent works have utilized contrastive loss or cross-view association pre-processing for view consensus. In contrast to them, we present Cues3D, a compact approach that relies solely on NeRF instead of pre-associations. The core idea is that NeRF's implicit 3D field inherently establishes a globally consistent geometry, enabling effective object distinction without explicit cross-view supervision. We propose a three-phase training framework for NeRF, initialization-disambiguation-refinement, whereby the instance IDs are corrected using the initially-learned knowledge. Additionally, an instance disambiguation method is proposed to match NeRF-rendered 3D masks and ensure globally unique 3D instance identities. With the aid of Cues3D, we obtain a highly consistent and unique 3D instance ID for each object across views with a balanced version of NeRF. Our experiments are conducted on ScanNet v2, ScanNet200, ScanNet++, and Replica datasets for 3D instance, panoptic, and semantic segmentation tasks. Cues3D outperforms other 2D image-based methods and competes with the latest 2D-3D merging based methods, while even surpassing them when using additional 3D point clouds. The code link can be found in the appendix, and the code will be released on GitHub at https://github.com/mRobotit/Cues3D.

[19] The Invisible Threat: Evaluating the Vulnerability of Cross-Spectral Face Recognition to Presentation Attacks

Anjith George,Sebastien Marcel

Main category: cs.CV

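TL;DR: This paper studies the vulnerability of near-infrared (NIR) to visible-spectrum (VIS) cross-spectral face recognition systems to presentation attacks and finds that, although these systems show a certain degree of reliability, they remain vulnerable.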

Details

Motivation: Cross-spectral face recognition systems have advantages under challenging operational conditions, but their robustness to presentation attacks has not been studied systematically. Method: A comprehensive evaluation of the vulnerability of NIR-VIS cross-spectral face recognition systems to presentation attacks. Result: The systems show a certain degree of reliability but remain vulnerable to specific attacks. Conclusion: Further research is needed to improve the security of cross-spectral face recognition systems. Abstract: Cross-spectral face recognition systems are designed to enhance the performance of facial recognition systems by enabling cross-modal matching under challenging operational conditions. A particularly relevant application is the matching of near-infrared (NIR) images to visible-spectrum (VIS) images, enabling the verification of individuals by comparing acquired NIR facial captures with VIS reference images. The use of NIR imaging offers several advantages, including greater robustness to illumination variations, better visibility through glasses and glare, and greater resistance to presentation attacks. Despite these claimed benefits, the robustness of NIR-based systems against presentation attacks has not been systematically studied in the literature. In this work, we conduct a comprehensive evaluation of the vulnerability of NIR-VIS cross-spectral face recognition systems to presentation attacks. Our empirical findings indicate that, although these systems exhibit a certain degree of reliability, they remain vulnerable to specific attacks, emphasizing the need for further research in this area.

[20] SOTA: Spike-Navigated Optimal TrAnsport Saliency Region Detection in Composite-bias Videos

Wenxuan Liu,Yao Deng,Kang Chen,Xian Zhong,Zhaofei Yu,Tiejun Huang

Main category: cs.CV

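TL;DR: Proposes the SOTA framework, which exploits the strengths of spike cameras to handle motion blur and occlusion and removes noise-induced bias through micro-debiasing and global debiasing.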

Details

Motivation: Existing saliency detection methods perform poorly under motion blur and occlusion. The high temporal resolution of spike cameras can improve detection, but their inherent noise introduces saliency bias. Method: The SOTA framework combines Spike-based Micro-debias (SM) and Spike-based Global-debias (SG), which handle subtle frame-to-frame variations and global inconsistencies respectively. Result: Experiments on real and synthetic datasets show that SOTA outperforms existing methods and effectively removes composite noise bias. Conclusion: The SOTA framework markedly improves saliency detection accuracy, especially in complex scenes. Abstract: Existing saliency detection methods struggle in real-world scenarios due to motion blur and occlusions. In contrast, spike cameras, with their high temporal resolution, significantly enhance visual saliency maps. However, the composite noise inherent to spike camera imaging introduces discontinuities in saliency detection. Low-quality samples further distort model predictions, leading to saliency bias. To address these challenges, we propose Spike-navigated Optimal TrAnsport Saliency Region Detection (SOTA), a framework that leverages the strengths of spike cameras while mitigating biases in both spatial and temporal dimensions. Our method introduces Spike-based Micro-debias (SM) to capture subtle frame-to-frame variations and preserve critical details, even under minimal scene or lighting changes. Additionally, Spike-based Global-debias (SG) refines predictions by reducing inconsistencies across diverse conditions. Extensive experiments on real and synthetic datasets demonstrate that SOTA outperforms existing methods by eliminating composite noise bias. Our code and dataset will be released at https://github.com/lwxfight/sota.

[21] Real-Time Animatable 2DGS-Avatars with Detail Enhancement from Monocular Videos

Xia Yuan,Hai Yuan,Wenyi Ge,Ying Fu,Xi Wu,Guanyu Xing

Main category: cs.CV

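TL;DR: Proposes a real-time framework based on 2D Gaussian Splatting (2DGS) for reconstructing animatable 3D human avatars, addressing the shortcomings of existing methods in capturing detail and keeping animation stable.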

Details

Motivation: Reduce reliance on complex hardware and improve practicality for game development, augmented reality, and similar applications. Method: Combine 2DGS with global SMPL pose parameters and introduce a Rotation Compensation Network (RCN) to handle non-rigid deformations. Result: High-quality animatable avatars are reconstructed from monocular videos, with better detail preservation and animation stability than existing methods. Conclusion: The method surpasses the current state of the art in reconstruction quality and animation robustness. Abstract: High-quality, animatable 3D human avatar reconstruction from monocular videos offers significant potential for reducing reliance on complex hardware, making it highly practical for applications in game development, augmented reality, and social media. However, existing methods still face substantial challenges in capturing fine geometric details and maintaining animation stability, particularly under dynamic or complex poses. To address these issues, we propose a novel real-time framework for animatable human avatar reconstruction based on 2D Gaussian Splatting (2DGS). By leveraging 2DGS and global SMPL pose parameters, our framework not only aligns positional and rotational discrepancies but also enables robust and natural pose-driven animation of the reconstructed avatars. Furthermore, we introduce a Rotation Compensation Network (RCN) that learns rotation residuals by integrating local geometric features with global pose parameters. This network significantly improves the handling of non-rigid deformations and ensures smooth, artifact-free pose transitions during animation. Experimental results demonstrate that our method successfully reconstructs realistic and highly animatable human avatars from monocular videos, effectively preserving fine-grained details while ensuring stable and natural pose variation. Our approach surpasses current state-of-the-art methods in both reconstruction quality and animation robustness on public benchmarks.

[22] Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly

Ruiyuan Zhang,Qi Wang,Jiaxiang Liu,Yu Zhang,Yuchi Huo,Chao Wu

Main category: cs.CV

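TL;DR: Proposes a zero-shot 3D part assembly method that uses a pretrained point cloud diffusion model as a discriminator, casts part assembly as an Iterative Closest Point (ICP) process, and adds a pushing-away strategy to improve robustness.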

Details

Motivation: Traditional methods need large amounts of manually labeled data, which is costly and adapts poorly to the variability of real-world shapes and parts. A zero-shot method is proposed to address this. Method: Use a pretrained point cloud diffusion model as a discriminator, recast part assembly as an ICP process, and propose a pushing-away strategy to handle overlapping parts. Result: Experiments show that the method, in the zero-shot setting, outperforms a supervised learning method. Conclusion: The proposed zero-shot method is effective and robust for 3D part assembly and suitable for large-scale applications. Abstract: 3D part assembly aims to understand part relationships and predict their 6-DoF poses to construct realistic 3D shapes, addressing the growing demand for autonomous assembly, which is crucial for robots. Existing methods mainly estimate the transformation of each part by training neural networks under supervision, which requires a substantial quantity of manually labeled data. However, the high cost of data collection and the immense variability of real-world shapes and parts make traditional methods impractical for large-scale applications. In this paper, we propose the first zero-shot part assembly method that utilizes pre-trained point cloud diffusion models as discriminators in the assembly process, guiding the manipulation of parts to form realistic shapes. Specifically, we theoretically demonstrate that utilizing a diffusion model for zero-shot part assembly can be transformed into an Iterative Closest Point (ICP) process. Then, we propose a novel pushing-away strategy to address overlapping parts, thereby further enhancing the robustness of the method. To verify our work, we conduct extensive experiments and quantitative comparisons to several strong baseline methods, demonstrating the effectiveness of the proposed approach, which even surpasses the supervised learning method. The code has been released on https://github.com/Ruiyuan-Zhang/Zero-Shot-Assembly.

[23] ClearLines - Camera Calibration from Straight Lines

Gregory Schroeder,Mohamed Sabry,Cristina Olaverri-Monreal

Main category: cs.CV

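TL;DR: The paper presents a small dataset named "ClearLines" for calibration from straight lines in outdoor scenes and describes how the dataset was created in order to guide algorithm development.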

Details

Motivation: Calibration from straight lines in outdoor scenes is hard because of cluttered environments, varying lighting, and similar factors, and no dedicated dataset exists to support algorithm development. Method: Create the "ClearLines" dataset and describe its construction in detail to give practical guidance for line detection algorithms. Result: A practical dataset and creation procedure that support the development and refinement of line detection algorithms. Conclusion: The "ClearLines" dataset fills a gap in outdoor line-based calibration and is an important resource for algorithm development. Abstract: The problem of calibration from straight lines is fundamental in geometric computer vision, with well-established theoretical foundations. However, its practical applicability remains limited, particularly in real-world outdoor scenarios. These environments pose significant challenges due to diverse and cluttered scenes, interrupted reprojections of straight 3D lines, and varying lighting conditions, making the task notoriously difficult. Furthermore, the field lacks a dedicated dataset encouraging the development of corresponding detection algorithms. In this study, we present a small dataset named "ClearLines", and by detailing its creation process, provide practical insights that can serve as a guide for developing and refining straight 3D line detection algorithms.

[24] JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

Kwon Byung-Ki,Qi Dai,Lee Hyoseok,Chong Luo,Tae-Hyun Oh

Main category: cs.CV

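TL;DR: JointDiT is a diffusion transformer that models the joint distribution of RGB and depth. With adaptive scheduling weights and an unbalanced timestep sampling strategy, it generates high-quality images and depth maps.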

Details

Motivation: Explore how the architectural benefits and image prior of diffusion transformers can be used to model the joint distribution of RGB and depth in support of several generation tasks. Method: Propose adaptive scheduling weights and an unbalanced timestep sampling strategy, and train the model across all noise levels to enable joint generation, depth estimation, and depth-conditioned image generation. Result: JointDiT performs strongly on joint generation and achieves comparable results on depth estimation and depth-conditioned image generation. Conclusion: Joint distribution modeling can stand in for conditional generation, offering a new approach to multimodal generation tasks. Abstract: We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose, i.e., adaptive scheduling weights, which depend on the noise levels of each modality, and the unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation by simply controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a replaceable alternative to conditional generation. The project page is available at https://byungki-k.github.io/JointDiT/.

[25] KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution

Antoni Bigata,Rodrigo Mira,Stella Bounareli,Michał Stypułkowski,Konstantinos Vougioukas,Stavros Petridis,Maja Pantic

Main category: cs.CV

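TL;DR: KeySync is a two-stage framework that addresses temporal consistency, expression leakage, and facial occlusion in lip synchronization, and achieves state-of-the-art lip reconstruction and cross-synchronization through a carefully designed masking strategy.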

Details

Motivation: Existing lip synchronization methods often neglect expression leakage and facial occlusions, which hurts real-world applications such as automated dubbing, so a more complete solution is needed. Method: The KeySync framework uses a two-stage design and a masking strategy to address temporal consistency, expression leakage, and occlusions. Result: KeySync performs strongly in lip reconstruction and cross-synchronization, improves visual quality, reduces expression leakage (measured with the new LipLeak metric), and handles occlusions effectively. Conclusion: Through its design and masking strategy, KeySync markedly improves lip synchronization and addresses the shortcomings of existing methods. Abstract: Lip synchronization, known as the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, as well as suffering from the usual issues in talking head generation (e.g., temporal consistency), lip synchronization presents significant new challenges such as expression leakage from the input video and facial occlusions, which can severely impact real-world applications like automated dubbing, but are often neglected in existing works. To address these shortcomings, we present KeySync, a two-stage framework that succeeds in solving the issue of temporal consistency, while also incorporating solutions for leakage and occlusions using a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Code and model weights can be found at https://antonibigata.github.io/KeySync.

[26] Towards Scalable Human-aligned Benchmark for Text-guided Image Editing

Suho Ryu,Kihyun Kim,Eugene Baek,Dongsoo Shin,Joonseok Lee

Main category: cs.CV

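TL;DR: The paper proposes HATIE, a new benchmark for text-guided image editing that addresses the problem of subjective evaluation by providing an automated, multi-faceted evaluation method.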

Details

Motivation: Because text-guided image editing is subjective, there is no widely accepted evaluation standard, and researchers rely on manual user studies. Method: Propose the HATIE benchmark, which includes a large-scale benchmark set and a multi-dimensional scoring system that combines several metrics to align with human perception. Result: Experiments verify that HATIE's evaluation agrees with human perception, and benchmark results are reported for several state-of-the-art models. Conclusion: HATIE provides a reliable, automated evaluation standard for text-guided image editing. Abstract: A variety of text-guided image editing models have been proposed recently. However, there is no widely-accepted standard evaluation method, mainly due to the subjective nature of the task, leaving researchers to rely on manual user studies. To address this, we introduce a novel Human-Aligned benchmark for Text-guided Image Editing (HATIE). Providing a large-scale benchmark set covering a wide range of editing tasks, it allows reliable evaluation, not limited to specific easy-to-evaluate cases. Also, HATIE provides a fully-automated and omnidirectional evaluation pipeline. Particularly, we combine multiple scores measuring various aspects of editing so as to align with human perception. We empirically verify that the evaluation of HATIE is indeed human-aligned in various aspects, and provide benchmark results on several state-of-the-art models to provide deeper insights on their performance.

[27] HeAL3D: Heuristical-enhanced Active Learning for 3D Object Detection

Esteban Rivera,Surya Prabhakaran,Markus Lienkamp

Main category: cs.CV

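TL;DR: HeAL (Heuristical-enhanced Active Learning) combines heuristic features such as object distance and point quantity with localization and classification to select the most useful training samples for 3D object detection models, substantially reducing the number of samples needed.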

Details

Motivation: Existing active learning methods for 3D object detection neglect practical heuristic features, so sample selection works poorly in uncontrolled scenarios. Method: HeAL integrates heuristic features such as object distance and point quantity with localization and classification to estimate sample uncertainty and select the most valuable samples. Result: On KITTI, HeAL reaches the same mAP as the fully supervised baseline with only 24% of the samples. Conclusion: By incorporating heuristic features, HeAL makes sample selection markedly more efficient and offers a practical active learning solution for 3D object detection. Abstract: Active Learning has proved to be a relevant approach to perform sample selection for training models for Autonomous Driving. Particularly, previous works on active learning for 3D object detection have shown that selection of samples in uncontrolled scenarios is challenging. Furthermore, current approaches focus exclusively on the theoretical aspects of the sample selection problem but neglect the practical insights that can be obtained from the extensive literature and application of 3D detection models. In this paper, we introduce HeAL (Heuristical-enhanced Active Learning for 3D Object Detection) which integrates those heuristical features together with Localization and Classification to deliver the most contributing samples to the model's training. In contrast to previous works, our approach integrates heuristical features such as object distance and point-quantity to estimate the uncertainty, which enhances the usefulness of selected samples to train detection models. Our quantitative evaluation on KITTI shows that HeAL presents competitive mAP with respect to the State-of-the-Art, and achieves the same mAP as the fully supervised baseline with only 24% of the samples.

[28] Inconsistency-based Active Learning for LiDAR Object Detection

Esteban Rivera,Loic Stratil,Markus Lienkamp

Main category: cs.CV

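TL;DR: The paper explores active learning strategies for optimizing LiDAR data labeling in autonomous driving. With an inconsistency-based sample selection method, 50% of the labeled data is enough to match the performance of random sampling.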

Details

Motivation: Deep learning models perform well on object detection for autonomous driving but need large amounts of costly labeled data. Active learning is a promising way to optimize labeling but has not yet been widely applied in the LiDAR domain. Method: Extend active learning to the LiDAR domain, develop inconsistency-based sample selection strategies, and evaluate their effectiveness. Result: A simple inconsistency method based on the number of detected boxes reaches the same mAP as random sampling with only 50% of the labeled data. Conclusion: Active learning shows potential for LiDAR data labeling and can substantially reduce labeling cost. Abstract: Deep learning models for object detection in autonomous driving have recently achieved impressive performance gains and are already being deployed in vehicles worldwide. However, current models require increasingly large datasets for training. Acquiring and labeling such data is costly, necessitating the development of new strategies to optimize this process. Active learning is a promising approach that has been extensively researched in the image domain. In our work, we extend this concept to the LiDAR domain by developing several inconsistency-based sample selection strategies and evaluate their effectiveness in various settings. Our results show that using a naive inconsistency approach based on the number of detected boxes, we achieve the same mAP as the random sampling strategy with 50% of the labeled data.
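A box-count inconsistency criterion like the one mentioned in this abstract can be sketched in a few lines. The abstract does not say which predictions are compared, so this sketch assumes that each unlabeled LiDAR sample has been run through the detector several times (for example, under different augmentations) and that only the number of detected boxes per run is kept. The function names and the max-minus-min score are hypothetical illustrations, not the authors' formulation.

```python
def count_inconsistency(box_counts):
    """Spread of detected-box counts across several runs on one sample."""
    return max(box_counts) - min(box_counts)


def select_for_labeling(counts_per_sample, budget):
    """Return the `budget` sample ids whose box counts disagree the most."""
    ranked = sorted(
        counts_per_sample,
        key=lambda sample_id: count_inconsistency(counts_per_sample[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical box counts from three detector runs per unlabeled sample.
counts = {"scan_a": [5, 5, 5], "scan_b": [3, 7, 4], "scan_c": [2, 3, 2]}
print(select_for_labeling(counts, budget=2))
```

Samples on which the detector's box count fluctuates are sent to annotators first, while stable samples such as `scan_a` are deferred.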

[29] InterLoc: LiDAR-based Intersection Localization using Road Segmentation with Automated Evaluation Method

Nguyen Hoang Khoi Tran,Julie Stephany Berrio,Mao Shan,Zhenxing Ming,Stewart Worrall

Main category: cs.CV

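TL;DR: Proposes a LiDAR-based intersection detection method that fuses semantic road segmentation with vehicle localization and refines intersection candidates with a least squares formulation, outperforming the latest learning-based baseline.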

Details

Motivation: Intersections are key points in road networks, but existing detectors either ignore semantic information already computed onboard or depend on scarce hand-labeled data. Method: Two steps: (i) fuse semantic segmentation with localization in a bird's eye view to detect intersection candidates, and (ii) refine the candidates by analyzing branch topology with a least squares formulation. Result: On SemanticKITTI, the method achieves a mean localization error of 1.9 m, 89% precision, and 77% recall at a 5 m tolerance, outperforming the baseline. Conclusion: The method is robust to segmentation errors and applicable to real-world scenarios. Abstract: Intersections are geometric and functional key points in every road network. They offer strong landmarks to correct GNSS dropouts and anchor new sensor data in up-to-date maps. Despite that importance, intersection detectors either ignore the rich semantic information already computed onboard or depend on scarce, hand-labeled intersection datasets. To close that gap, this paper presents a LiDAR-based method for intersection detection that (i) fuses semantic road segmentation with vehicle localization to detect intersection candidates in a bird's eye view (BEV) representation and (ii) refines those candidates by analyzing branch topology with a least squares formulation. To evaluate our method, we introduce an automated benchmarking pipeline that pairs detections with OpenStreetMap (OSM) intersection nodes using precise GNSS/INS ground-truth poses. Tested on eight SemanticKITTI sequences, the approach achieves a mean localization error of 1.9 m, 89% precision, and 77% recall at a 5 m tolerance, outperforming the latest learning-based baseline. Moreover, the method is robust to segmentation errors higher than those of the benchmark model, demonstrating its applicability in the real world.
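The evaluation described in this abstract (pairing detections with reference intersection nodes and reporting precision, recall, and mean localization error at a distance tolerance) can be sketched as below. The abstract does not specify the pairing rule, so the greedy nearest-pair matching, the 2D metric coordinates, and the function name are all assumptions made for illustration.

```python
import math


def evaluate_detections(detections, nodes, tolerance=5.0):
    """Greedily pair detections with reference nodes within `tolerance` metres.

    Both inputs are lists of (x, y) positions in a shared metric frame.
    Returns (precision, recall, mean_error_of_matched_pairs).
    """
    # All candidate pairs, closest first.
    pairs = sorted(
        (math.dist(d, n), i, j)
        for i, d in enumerate(detections)
        for j, n in enumerate(nodes)
    )
    used_detections, used_nodes, errors = set(), set(), []
    for distance, i, j in pairs:
        if distance > tolerance:
            break  # remaining pairs are even farther apart
        if i in used_detections or j in used_nodes:
            continue  # each detection and node is matched at most once
        used_detections.add(i)
        used_nodes.add(j)
        errors.append(distance)
    true_positives = len(errors)
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(nodes) if nodes else 0.0
    mean_error = sum(errors) / true_positives if true_positives else float("nan")
    return precision, recall, mean_error


detections = [(0.0, 0.0), (10.0, 0.0), (100.0, 100.0)]
nodes = [(0.0, 3.0), (10.0, 4.0), (50.0, 50.0), (200.0, 200.0)]
print(evaluate_detections(detections, nodes))
```

One-to-one matching keeps a single reference node from validating several nearby detections, which would otherwise inflate precision.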

[30] A Robust Deep Networks based Multi-Object MultiCamera Tracking System for City Scale Traffic

Muhammad Imran Zaman,Usama Ijaz Bajwa,Gulshan Saleem,Rana Hammad Raza

Main category: cs.CV

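TL;DR: Proposes a deep learning-based multi-object multi-camera tracking framework for vehicle tracking in city-scale traffic scenes, with strong reported performance.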

Details

Motivation: As the number of network cameras grows, manually tracking and matching vehicles across multiple cameras is challenged by vehicle diversity, occlusions, illumination changes, and more. Method: Use Mask R-CNN for object detection with NMS, apply transfer learning for re-identification, and use ResNet-152 with Deep SORT for feature extraction and tracking. Result: On the AI City Challenge dataset, the IDF1 score is 0.8289, with precision and recall of 0.9026 and 0.8527 respectively. Conclusion: The framework is robust and accurate in complex urban traffic scenes and suitable for large-scale vehicle tracking. Abstract: Vision sensors are becoming more important in Intelligent Transportation Systems (ITS) for traffic monitoring, management, and optimization as the number of network cameras continues to rise. However, manual object tracking and matching across multiple non-overlapping cameras pose significant challenges in city-scale urban traffic scenarios. These challenges include handling diverse vehicle attributes, occlusions, illumination variations, shadows, and varying video resolutions. To address these issues, we propose an efficient and cost-effective deep learning-based framework for Multi-Object Multi-Camera Tracking (MO-MCT). The proposed framework utilizes Mask R-CNN for object detection and employs Non-Maximum Suppression (NMS) to select target objects from overlapping detections. Transfer learning is employed for re-identification, enabling the association and generation of vehicle tracklets across multiple cameras. Moreover, we leverage appropriate loss functions and distance measures to handle occlusion, illumination, and shadow challenges. The final solution identification module performs feature extraction using ResNet-152 coupled with Deep SORT based vehicle tracking. The proposed framework is evaluated on the 5th AI City Challenge dataset (Track 3), comprising 46 camera feeds. Among these 46 camera streams, 40 are used for model training and validation, while the remaining six are utilized for model testing. The proposed framework achieves competitive performance with an IDF1 score of 0.8289, and precision and recall scores of 0.9026 and 0.8527 respectively, demonstrating its effectiveness in robust and accurate vehicle tracking.

[31] X-ray illicit object detection using hybrid CNN-transformer neural network architectures

Jorgen Cani,Christos Diou,Spyridon Evangelatos,Panagiotis Radoglou-Grammatikis,Vasileios Argyriou,Panagiotis Sarigiannidis,Iraklis Varlamis,Georgios Th. Papadopoulos

Main category: cs.CV

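TL;DR: The paper examines how hybrid CNN-transformer architectures perform at detecting occluded or concealed objects in X-ray security applications and compares them with a conventional CNN approach (YOLOv8).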

Details

Motivation: Detecting occluded or concealed objects is a major challenge in X-ray security imaging. CNNs and transformers each have strengths, but their combination has not been sufficiently studied. Method: Evaluate combinations of backbones (the CNN HGNetV2 and the hybrid CNN-transformer Next-ViT-S) with YOLOv8 and RT-DETR detection heads on three public datasets (EDS, HiXray, PIDray). Result: YOLOv8 performs well on HiXray and PIDray, but on EDS, which includes a domain distribution shift, hybrid architectures are more robust. Conclusion: The paper summarizes the strengths and weaknesses of each architecture and suggests guidelines for future research. The code and model weights are publicly available. Abstract: In the field of X-ray security applications, even the smallest details can significantly impact outcomes. Objects that are heavily occluded or intentionally concealed pose a great challenge for detection, whether by human observation or through advanced technological applications. While certain Deep Learning (DL) architectures demonstrate strong performance in processing local information, such as Convolutional Neural Networks (CNNs), others excel in handling distant information, e.g., transformers. In X-ray security imaging the literature has been dominated by the use of CNN-based methods, while the integration of the two aforementioned leading architectures has not been sufficiently explored. In this paper, various hybrid CNN-transformer architectures are evaluated against a common CNN object detection baseline, namely YOLOv8. In particular, a CNN (HGNetV2) and a hybrid CNN-transformer (Next-ViT-S) backbone are combined with different CNN/transformer detection heads (YOLOv8 and RT-DETR). The resulting architectures are comparatively evaluated on three challenging public X-ray inspection datasets, namely EDS, HiXray, and PIDray. Interestingly, while the YOLOv8 detector with its default backbone (CSP-DarkNet53) is generally shown to be advantageous on the HiXray and PIDray datasets, when a domain distribution shift is incorporated in the X-ray images (as happens in the EDS dataset), hybrid CNN-transformer architectures exhibit increased robustness. Detailed comparative evaluation results, including object-level detection performance and object-size error analysis, demonstrate the strengths and weaknesses of each architectural combination and suggest guidelines for future research. The source code and network weights of the models employed in this study are available at https://github.com/jgenc/xray-comparative-evaluation.

[32] Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities

Lucas Robinet,Ahmad Berjaoui,Elizabeth Cohen-Jonathan Moyal

Main category: cs.CV

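TL;DR: BM-MAE is a pre-training strategy for multimodal MRI data that adapts to any combination of modalities without training a separate model for each combination.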

Details

Motivation: Address missing modalities in multimodal medical imaging and avoid the cost and impracticality of training a separate model for each modality combination. Method: Propose BM-MAE, a pre-training strategy based on masked image modeling that extracts intra- and inter-modal information and adapts to any modality combination. Result: BM-MAE outperforms or matches baselines on several downstream tasks and can efficiently reconstruct missing modalities. Conclusion: BM-MAE offers an efficient and flexible pre-training solution for multimodal MRI data with practical value. Abstract: Multimodal magnetic resonance imaging (MRI) constitutes the first line of investigation for clinicians in the care of brain tumors, providing crucial insights for surgery planning, treatment monitoring, and biomarker identification. Pre-training on large datasets has been shown to help models learn transferable representations and adapt with minimal labeled data. This behavior is especially valuable in medical imaging, where annotations are often scarce. However, applying this paradigm to multimodal medical data introduces a challenge: most existing approaches assume that all imaging modalities are available during both pre-training and fine-tuning. In practice, missing modalities often occur due to acquisition issues, specialist unavailability, or specific experimental designs on small in-house datasets. Consequently, a common approach involves training a separate model for each desired modality combination, making the process both resource-intensive and impractical for clinical use. Therefore, we introduce BM-MAE, a masked image modeling pre-training strategy tailored for multimodal MRI data. The same pre-trained model seamlessly adapts to any combination of available modalities, extracting rich representations that capture both intra- and inter-modal information. This allows fine-tuning on any subset of modalities without requiring architectural changes, while still benefiting from a model pre-trained on the full set of modalities. Extensive experiments show that the proposed pre-training strategy outperforms or remains competitive with baselines that require separate pre-training for each modality subset, while substantially surpassing training from scratch on several downstream tasks. Additionally, it can quickly and efficiently reconstruct missing modalities, highlighting its practical value. Code and trained models are available at: https://github.com/Lucas-rbnt/bmmae

[33] AnimalMotionCLIP: Embedding motion in CLIP for Animal Behavior Analysis

Enmin Zhong,Carlos R. del-Blanco,Daniel Berjón,Fernando Jaureguizar,Narciso García

Main category: cs.CV

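TL;DR: AnimalMotionCLIP combines the CLIP framework with optical flow information and proposes several temporal modeling schemes, markedly improving animal behavior recognition.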

Details

Motivation: Pre-trained vision-language models such as CLIP show promise for animal behavior recognition, but integrating motion information and modeling time remain open challenges. Method: Extend the CLIP framework by interleaving video frames with optical flow information, and compare three classifier aggregation schemes: dense, semi-dense, and sparse. Result: On the Animal Kingdom dataset, the method outperforms existing approaches and correctly recognizes fine-grained temporal actions. Conclusion: AnimalMotionCLIP provides an effective solution for animal behavior analysis, particularly for recognizing temporal actions. Abstract: Recently, there has been a surge of interest in applying deep learning techniques to animal behavior recognition, particularly leveraging pre-trained visual language models, such as CLIP, due to their remarkable generalization capacity across various downstream tasks. However, adapting these models to the specific domain of animal behavior recognition presents two significant challenges: integrating motion information and devising an effective temporal modeling scheme. In this paper, we propose AnimalMotionCLIP to address these challenges by interleaving video frames and optical flow information in the CLIP framework. Additionally, several temporal modeling schemes using an aggregation of classifiers are proposed and compared: dense, semi-dense, and sparse. As a result, fine temporal actions can be correctly recognized, which is of vital importance in animal behavior analysis. Experiments on the Animal Kingdom dataset demonstrate that AnimalMotionCLIP achieves superior performance compared to state-of-the-art approaches.

[34] Synthesizing and Identifying Noise Levels in Autonomous Vehicle Camera Radar Datasets

Mathis Morales,Golnaz Habibi

Main category: cs.CV

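TL;DR: The paper proposes a synthetic data augmentation method for autonomous vehicle camera-radar datasets that simulates sensor failures and data degradation, and tests a lightweight noise recognition neural network.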

Details

Motivation: Current object detection methods focus mostly on performance metrics and overlook the robustness of detection and tracking systems, especially to sensor failures. This paper addresses the problem through synthetic data augmentation. Method: Build a realistic synthetic data augmentation pipeline that simulates sensor failures and data degradation, and train a lightweight noise recognition neural network. Result: On the augmented dataset, the noise recognition network reaches 54.4% recognition accuracy over 11 categories, covering 10086 images and 2145 radar point clouds. Conclusion: Synthetic data augmentation together with a noise recognition network offers a feasible way to improve the robustness of autonomous driving detection and tracking systems. Abstract: Detecting and tracking objects is a crucial component of any autonomous navigation method. For the past decades, object detection has yielded promising results using neural networks on various datasets. While many methods focus on performance metrics, few projects focus on improving the robustness of these detection and tracking pipelines, notably to sensor failures. In this paper we attempt to address this issue by creating a realistic synthetic data augmentation pipeline for camera-radar Autonomous Vehicle (AV) datasets. Our goal is to accurately simulate sensor failures and data deterioration due to real-world interferences. We also present our results of a baseline lightweight Noise Recognition neural network trained and tested on our augmented dataset, reaching an overall recognition accuracy of 54.4% on 11 categories across 10086 images and 2145 radar point-clouds.

[35] Uncertainty-Aware Multi-Expert Knowledge Distillation for Imbalanced Disease Grading

Shuo Tong,Shangde Gao,Ke Liu,Zihang Huang,Hongxia Xu,Haochao Ying,Jian Wu

Main category: cs.CV

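TL;DR: Proposes an uncertainty-aware multi-expert knowledge distillation framework (UMKD) that addresses domain shift and data imbalance in medical image grading, transferring knowledge through feature decoupling and dynamic weight adjustment.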

Details

Motivation: In automatic medical image grading, domain shift and data imbalance bias the model and hinder clinical deployment. Method: The UMKD framework decouples task-agnostic and task-specific features and uses an uncertainty-aware dynamic distillation mechanism to improve knowledge transfer. Result: Validated on several datasets, UMKD achieves the best performance in both source-imbalanced and target-imbalanced scenarios. Conclusion: UMKD offers a robust and practical solution for real-world medical image grading. Abstract: Automatic disease image grading is a significant application of artificial intelligence for healthcare, enabling faster and more accurate patient assessments. However, domain shifts, which are exacerbated by data imbalance, introduce bias into the model, posing deployment difficulties in clinical applications. To address the problem, we propose a novel Uncertainty-aware Multi-experts Knowledge Distillation (UMKD) framework to transfer knowledge from multiple expert models to a single student model. Specifically, to extract discriminative features, UMKD decouples task-agnostic and task-specific features with shallow and compact feature alignment in the feature space. At the output space, an uncertainty-aware decoupled distillation (UDD) mechanism dynamically adjusts knowledge transfer weights based on expert model uncertainties, ensuring robust and reliable distillation. Additionally, UMKD also tackles the problems of model architecture heterogeneity and distribution discrepancies between source and target domains, which are inadequately tackled by previous KD approaches. Extensive experiments on histology prostate grading (SICAPv2) and fundus image grading (APTOS) demonstrate that UMKD achieves a new state-of-the-art in both source-imbalanced and target-imbalanced scenarios, offering a robust and practical solution for real-world disease image grading.

Alexander Puzicha,Konstantin Wüstefeld,Kathrin Wilms,Frank Weichert

Main category: cs.CV

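TL;DR: The paper studies video-based vessel trajectory prediction for inland navigation, combining object detection, Kalman filtering, and spline interpolation to improve prediction accuracy.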

Details

Motivation: The future of inland navigation depends on autonomous systems and remote operation, which require accurate vessel trajectory prediction, yet existing detection systems often misclassify objects in complex surroundings. Method: Integrates object detection, Kalman filters, and spline interpolation, and comparatively evaluates tracking algorithms including BoT-SORT, Deep OC-SORT, and ByteTrack. Result: Experiments show that the Kalman filter yields smooth trajectories and improves vessel motion prediction accuracy, supporting collision avoidance and situational awareness. Conclusion: Customized datasets and models are needed; future work will expand the datasets and add vessel classification to refine predictions, supporting both autonomous systems and human operators in complex environments. Abstract: The future of inland navigation increasingly relies on autonomous systems and remote operations, emphasizing the need for accurate vessel trajectory prediction. This study addresses the challenges of video-based vessel tracking and prediction by integrating advanced object detection methods, Kalman filters, and spline-based interpolation. However, existing detection systems often misclassify objects in inland waterways due to complex surroundings. A comparative evaluation of tracking algorithms, including BoT-SORT, Deep OC-SORT, and ByteTrack, highlights the robustness of the Kalman filter in providing smoothed trajectories. Experimental results from diverse scenarios demonstrate improved accuracy in predicting vessel movements, which is essential for collision avoidance and situational awareness. The findings underline the necessity of customized datasets and models for inland navigation. Future work will expand the datasets and incorporate vessel classification to refine predictions, supporting both autonomous systems and human operators in complex environments.
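
To make the role of the Kalman filter concrete, here is a minimal sketch of a constant-velocity filter along one image axis. It is an illustration under stated assumptions, not the study's tracker: the state is (position, velocity), only position is measured, and the noise parameters `q` and `r` and the simplified diagonal process noise are arbitrary choices made for this example.

```python
def kalman_track(measurements, dt=1.0, q=0.01, r=4.0):
    """Smooth 1D positions with a constant-velocity Kalman filter.

    Returns the filtered positions and the final velocity estimate.
    q and r are assumed process / measurement noise variances.
    """
    x, v = measurements[0], 0.0
    # Covariance entries; velocity starts out highly uncertain.
    p00, p01, p10, p11 = r, 0.0, 0.0, 100.0
    filtered = [x]
    for z in measurements[1:]:
        # Predict: x' = x + dt * v, P' = F P F^T + Q with F = [[1, dt], [0, 1]].
        x = x + dt * v
        n00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
        n01 = p01 + dt * p11
        n10 = p10 + dt * p11
        n11 = p11 + q
        p00, p01, p10, p11 = n00, n01, n10, n11
        # Update with a position-only measurement (H = [1, 0]).
        s = p00 + r
        k0, k1 = p00 / s, p10 / s
        residual = z - x
        x += k0 * residual
        v += k1 * residual
        # P = (I - K H) P, evaluated from the predicted covariance.
        p00, p01, p10, p11 = ((1 - k0) * p00, (1 - k0) * p01,
                              p10 - k1 * p00, p11 - k1 * p01)
        filtered.append(x)
    return filtered, v

# A vessel centroid moving 2 pixels per frame along one axis.
positions = [2.0 * t for t in range(12)]
smoothed, velocity = kalman_track(positions)
```

The velocity estimate is what allows short-horizon extrapolation of the track; a full tracker would run this in two dimensions and feed it detector outputs.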

[37] Dietary Intake Estimation via Continuous 3D Reconstruction of Food

Wallace Lee,YuHao Chen

Main category: cs.CV

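TL;DR: Proposes building 3D food models from monocular 2D video to monitor eating behaviour precisely, addressing the inaccuracy of traditional self-reported data.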

Details

Motivation: Traditional dietary monitoring relies on self-reported data, which is inaccurate and cannot track food intake in real time. Method: Uses COLMAP and pose estimation algorithms to generate 3D food models from monocular 2D video, observes changes in food volume, and proposes an automated state recognition method. Result: Experiments on toy models and real food show that the approach has potential and can capture comprehensive information about eating behaviour. Conclusion: 3D reconstruction opens the way to automated, accurate dietary monitoring tools. Abstract: Monitoring dietary habits is crucial for preventing health risks associated with overeating and undereating, including obesity, diabetes, and cardiovascular diseases. Traditional methods for tracking food intake rely on self-reported data collected before or after eating, which is prone to inaccuracies. This study proposes an approach to accurately monitor ingestion behaviours by leveraging 3D food models constructed from monocular 2D video. Using COLMAP and pose estimation algorithms, we generate detailed 3D representations of food, allowing us to observe changes in food volume as it is consumed. Experiments with toy models and real food items demonstrate the approach's potential. We also propose a new methodology for automated state recognition that detects state changes accurately and maintains model fidelity. The 3D reconstruction approach shows promise in capturing comprehensive dietary behaviour insights, ultimately contributing to the development of automated and accurate dietary monitoring tools.

[38] Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction

Simon Giebenhain,Tobias Kirschstein,Martin Rünz,Lourdes Agapito,Matthias Nießner

Main category: cs.CV

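TL;DR: Proposes Pixel3DMM, which reconstructs 3D faces from a single RGB image using DINO features and optimized 3DMM parameters, with a marked gain in geometric accuracy.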

Details

Motivation: Address the challenge of reconstructing a 3D face from a single RGB image, improving geometric accuracy and robustness to diversity. Method: Builds on DINO features with dedicated surface normal and uv-coordinate prediction heads, and optimizes 3DMM parameters on the FLAME mesh topology. Result: Outperforms the strongest baselines by over 15% in geometric accuracy for posed facial expressions. Conclusion: Pixel3DMM performs well in single-image 3D face reconstruction, particularly for complex expressions and diverse scenarios. Abstract: We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly-generalized vision transformers which predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model, and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features highly diverse facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the most competitive baselines by over 15% in terms of geometric accuracy for posed facial expressions.

[39] Diverse Semantics-Guided Feature Alignment and Decoupling for Visible-Infrared Person Re-Identification

Neng Dong,Shuanglin Yan,Liyan Zhang,Jinhui Tang

Main category: cs.CV

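TL;DR: The paper proposes DSFAD, a network that aligns visible and infrared image features in a textual embedding space and decouples identity-irrelevant features, addressing modality discrepancy and style noise in VI-ReID.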

Details

Motivation: VI-ReID is challenging because of the large modality gap between visible and infrared images, and because style noise (such as illumination and color contrast) reduces the discriminability and modality invariance of features. Method: Proposes the DSFAD network, comprising a DSFA module (aligns features using descriptions with diverse sentence structures), an SMFD module (decouples features and constrains their similarities), and an SCFR module (restores lost semantic information). Result: Experiments on three VI-ReID datasets demonstrate the superiority of DSFAD. Conclusion: Through feature alignment and decoupling, DSFAD improves VI-ReID performance. Abstract: Visible-Infrared Person Re-Identification (VI-ReID) is a challenging task due to the large modality discrepancy between visible and infrared images, which complicates the alignment of their features into a suitable common space. Moreover, style noise, such as illumination and color contrast, reduces the identity discriminability and modality invariance of features. To address these challenges, we propose a novel Diverse Semantics-guided Feature Alignment and Decoupling (DSFAD) network to align identity-relevant features from different modalities into a textual embedding space and disentangle identity-irrelevant features within each modality. Specifically, we develop a Diverse Semantics-guided Feature Alignment (DSFA) module, which generates pedestrian descriptions with diverse sentence structures to guide the cross-modality alignment of visual features. Furthermore, to filter out style information, we propose a Semantic Margin-guided Feature Decoupling (SMFD) module, which decomposes visual features into pedestrian-related and style-related components, and then constrains the similarity between the former and the textual embeddings to be at least a margin higher than that between the latter and the textual embeddings. Additionally, to prevent the loss of pedestrian semantics during feature decoupling, we design a Semantic Consistency-guided Feature Restitution (SCFR) module, which further excavates useful information for identification from the style-related features and restores it back into the pedestrian-related features, and then constrains the similarity between the features after restitution and the textual embeddings to be consistent with that between the features before decoupling and the textual embeddings. Extensive experiments on three VI-ReID datasets demonstrate the superiority of our DSFAD.
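
The SMFD constraint (pedestrian-to-text similarity at least a margin above style-to-text similarity) has the shape of a hinge loss. The sketch below shows that shape only; the use of cosine similarity, the margin value, and the function names are assumptions for illustration, since the abstract does not specify them.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def margin_decoupling_loss(pedestrian, style, text, margin=0.2):
    # Zero when sim(pedestrian, text) exceeds sim(style, text) by at least `margin`;
    # otherwise the loss grows linearly with the size of the violation.
    gap = cosine(pedestrian, text) - cosine(style, text)
    return max(0.0, margin - gap)

text_embedding = [1.0, 0.0]
satisfied = margin_decoupling_loss([1.0, 0.0], [0.0, 1.0], text_embedding)
violated = margin_decoupling_loss([0.0, 1.0], [1.0, 0.0], text_embedding)
```

In the first call the pedestrian feature matches the text and the style feature is orthogonal to it, so the constraint holds and the loss is zero; swapping the roles produces a positive penalty.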

[40] Brain Foundation Models with Hypergraph Dynamic Adapter for Brain Disease Analysis

Zhongying Deng,Haoyu Wang,Ziyan Huang,Lipei Zhang,Angelica I. Aviles-Rivero,Chaoyu Liu,Junjun He,Zoe Kourtzi,Carola-Bibiane Schönlieb

Main category: cs.CV

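TL;DR: SAM-Brain3D and HyDA form a new foundation model and adapter for brain diseases, addressing the limits of existing models in task and data homogeneity, generalization, and adaptation to clinical tasks.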

Details

Motivation: The complexity and societal impact of brain diseases call for more effective models, and existing brain foundation models fall short in task diversity, generalization, and adaptability. Method: Proposes SAM-Brain3D (trained on over 66,000 brain image-label pairs across 14 MRI sub-modalities) and HyDA (Hypergraph Dynamic Adapter) for multi-modal data fusion and personalized adaptation. Result: Experiments show the method outperforms existing techniques on a range of brain disease segmentation and classification tasks. Conclusion: The framework offers a new paradigm for brain disease analysis through multi-modal, multi-scale, and dynamic modeling. Abstract: Brain diseases, such as Alzheimer's disease and brain tumors, present profound challenges due to their complexity and societal impact. Recent advancements in brain foundation models have shown significant promise in addressing a range of brain-related tasks. However, current brain foundation models are limited by task and data homogeneity, restricted generalization beyond segmentation or classification, and inefficient adaptation to diverse clinical tasks. In this work, we propose SAM-Brain3D, a brain-specific foundation model trained on over 66,000 brain image-label pairs across 14 MRI sub-modalities, and Hypergraph Dynamic Adapter (HyDA), a lightweight adapter for efficient and effective downstream adaptation. SAM-Brain3D captures detailed brain-specific anatomical and modality priors for segmenting diverse brain targets and broader downstream tasks. HyDA leverages hypergraphs to fuse complementary multi-modal data and dynamically generate patient-specific convolutional kernels for multi-scale feature fusion and personalized patient-wise adaptation. Together, our framework excels across a broad spectrum of brain disease segmentation and classification tasks. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art approaches, offering a new paradigm for brain disease analysis through multi-modal, multi-scale, and dynamic foundation modeling.

[41] Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook

Muyi Bao,Shuchang Lyu,Zhaoyang Xu,Huiyu Zhou,Jinchang Ren,Shiming Xiang,Xiangtai Li,Guangliang Cheng

Main category: cs.CV

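TL;DR: The paper surveys Mamba-based deep learning methods for remote sensing, analyzes about 120 studies, organizes its contributions along five dimensions, and maintains an open-source repository.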

Details

Motivation: Address the limitations of CNNs and ViTs on remote sensing data and explore the potential of the Mamba architecture. Method: Systematically reviews about 120 studies, builds a taxonomy, and analyzes how Mamba is applied in remote sensing. Result: Mamba performs well on remote sensing tasks and opens new research directions. Conclusion: Mamba is a transformative framework for remote sensing analysis and provides a foundation for future research. Abstract: Deep learning has profoundly transformed remote sensing, yet prevailing architectures like Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) remain constrained by critical trade-offs: CNNs suffer from limited receptive fields, while ViTs grapple with quadratic computational complexity, hindering their scalability for high-resolution remote sensing data. State Space Models (SSMs), particularly the recently proposed Mamba architecture, have emerged as a paradigm-shifting solution, combining linear computational scaling with global context modeling. This survey presents a comprehensive review of Mamba-based methodologies in remote sensing, systematically analyzing about 120 studies to construct a holistic taxonomy of innovations and applications. Our contributions are structured across five dimensions: (i) foundational principles of vision Mamba architectures, (ii) micro-architectural advancements such as adaptive scan strategies and hybrid SSM formulations, (iii) macro-architectural integrations, including CNN-Transformer-Mamba hybrids and frequency-domain adaptations, (iv) rigorous benchmarking against state-of-the-art methods in multiple application tasks, such as object detection, semantic segmentation, change detection, etc., and (v) critical analysis of unresolved challenges with actionable future directions. By bridging the gap between SSM theory and remote sensing practice, this survey establishes Mamba as a transformative framework for remote sensing analysis. To our knowledge, this paper is the first systematic review of Mamba architectures in remote sensing. Our work provides a structured foundation for advancing research in remote sensing systems through SSM-based methods. We curate an open-source repository (https://github.com/BaoBao0926/Awesome-Mamba-in-Remote-Sensing) to foster community-driven advancements.

[42] Deep Reinforcement Learning for Urban Air Quality Management: Multi-Objective Optimization of Pollution Mitigation Booth Placement in Metropolitan Environments

Kirtan Rajesh,Suvidha Rupesh Kumar

Main category: cs.CV

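TL;DR: The study proposes a deep reinforcement learning (DRL) framework that optimizes the placement of air purification booths in Delhi to improve the air quality index (AQI).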

Details

Motivation: Delhi is one of the most polluted cities in the world, and traditional static air purification strategies have limited effect, so a dynamic optimization method is needed. Method: Uses the Proximal Policy Optimization (PPO) algorithm with spatial and environmental factors (population density, traffic patterns, and others) to optimize booth locations. Result: The DRL framework outperforms conventional methods on AQI improvement, spatial coverage, and related metrics, yielding a balanced and effective distribution of purification infrastructure. Conclusion: AI-driven spatial optimization has considerable potential for smart city development and air quality management. Abstract: Urban air pollution remains a pressing global concern, particularly in densely populated and traffic-intensive metropolitan areas like Delhi, where exposure to harmful pollutants severely impacts public health. Delhi, being one of the most polluted cities globally, experiences chronic air quality issues due to vehicular emissions, industrial activities, and construction dust, which exacerbate its already fragile atmospheric conditions. Traditional pollution mitigation strategies, such as static air purifying installations, often fail to maximize their impact due to suboptimal placement and limited adaptability to dynamic urban environments. This study presents a novel deep reinforcement learning (DRL) framework to optimize the placement of air purification booths to improve the air quality index (AQI) in the city of Delhi. We employ Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm, to iteratively learn and identify high-impact locations based on multiple spatial and environmental factors, including population density, traffic patterns, industrial influence, and green space constraints. Our approach is benchmarked against conventional placement strategies, including random and greedy AQI-based methods, using multi-dimensional performance evaluation metrics such as AQI improvement, spatial coverage, population and traffic impact, and spatial entropy. Experimental results demonstrate that the RL-based approach outperforms baseline methods by achieving a balanced and effective distribution of air purification infrastructure. Notably, the DRL framework achieves an optimal trade-off between AQI reduction and high-coverage deployment, ensuring equitable environmental benefits across urban regions. The findings underscore the potential of AI-driven spatial optimization in advancing smart city initiatives and data-driven urban air quality management.
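
Two of the ingredients named in the abstract, the greedy AQI-based baseline and the spatial entropy metric, are simple enough to sketch. The abstract does not define either one precisely, so this example assumes that the greedy baseline picks the zones with the highest AQI and that spatial entropy is Shannon entropy over the zones in which booths are placed. Zone names and AQI values are made up.

```python
import math
from collections import Counter

def greedy_placement(aqi_by_zone, k):
    # Assumed baseline: place booths in the k zones with the worst (highest) AQI.
    ranked = sorted(aqi_by_zone, key=aqi_by_zone.get, reverse=True)
    return ranked[:k]

def spatial_entropy(zone_ids):
    # Shannon entropy of the booth distribution over zones:
    # 0 when all booths share one zone, log(n) when n booths sit in n distinct zones.
    counts = Counter(zone_ids)
    total = len(zone_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

aqi = {"north": 300, "centre": 420, "south": 150}  # hypothetical readings
chosen = greedy_placement(aqi, 2)
clustered = spatial_entropy(["centre", "centre", "centre", "centre"])
spread = spatial_entropy(["north", "centre", "south", "east"])
```

Higher entropy indicates a more even spread of booths, which is one way to quantify the equitable coverage that the abstract sets against pure AQI reduction.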

[43] Visual Test-time Scaling for GUI Agent Grounding

Tiange Luo,Lajanugen Logeswaran,Justin Johnson,Honglak Lee

Main category: cs.CV

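TL;DR: RegionFocus is a visual test-time scaling method that dynamically zooms in on relevant regions to reduce background clutter, improving the accuracy of vision language model agents in webpage understanding.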

Details

Motivation: The visual complexity of webpages and their large number of interface elements make accurate action selection difficult, so a method is needed to reduce clutter and improve grounding accuracy. Method: Proposes a visual test-time scaling method that dynamically zooms in on relevant regions, combined with an image-as-map mechanism that visualizes key landmarks to support action selection. Result: Gains of more than 28% on ScreenSpot-Pro and more than 24% on WebVoyager, and a new state-of-the-art grounding performance of 61.6% with Qwen2.5-VL-72B. Conclusion: RegionFocus markedly improves vision language model agents through visual test-time scaling, demonstrating its effectiveness in interactive settings. Abstract: We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. To support this process, we propose an image-as-map mechanism that visualizes key landmarks at each step, providing a transparent action record and enabling the agent to effectively choose among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 28+% on ScreenSpot-Pro and 24+% on WebVoyager benchmarks on top of two state-of-the-art open vision language model agents, UI-TARS and Qwen2.5-VL, highlighting the effectiveness of visual test-time scaling in interactive settings. We achieve a new state-of-the-art grounding performance of 61.6% on the ScreenSpot-Pro benchmark by applying RegionFocus to a Qwen2.5-VL-72B model. Our code will be released publicly at https://github.com/tiangeluo/RegionFocus.
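
Zooming in on a region means that any point the agent predicts on the magnified crop must be mapped back to full-screen coordinates before an action is issued. The sketch below shows that bookkeeping only. It is not taken from the RegionFocus code; the function names, the fixed zoom factor, and the clamping policy are assumptions for illustration.

```python
def crop_box(center, zoom, screen_size):
    # Region of the screen shown after zooming by `zoom` around `center`,
    # shifted if necessary so it stays inside the screen.
    cx, cy = center
    width, height = screen_size
    crop_w, crop_h = width / zoom, height / zoom
    left = min(max(cx - crop_w / 2, 0), width - crop_w)
    top = min(max(cy - crop_h / 2, 0), height - crop_h)
    return left, top, crop_w, crop_h

def to_screen(point_in_crop, box, render_size):
    # Map a point predicted on the upscaled crop back to full-screen pixels.
    px, py = point_in_crop
    left, top, crop_w, crop_h = box
    render_w, render_h = render_size
    return left + px * crop_w / render_w, top + py * crop_h / render_h

screen = (1920, 1080)
box = crop_box(center=(960, 540), zoom=2, screen_size=screen)
click = to_screen((960, 540), box, render_size=screen)
corner_box = crop_box(center=(0, 0), zoom=2, screen_size=screen)
```

Here the centre of the magnified crop maps back to the centre of the screen, and a zoom requested at the screen corner is clamped so the crop never leaves the screenshot.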

[44] Towards Autonomous Micromobility through Scalable Urban Simulation

Wayne Wu,Honglin He,Chaoyuan Zhang,Jack He,Seth Z. Zhao,Ran Gong,Quanyi Li,Bolei Zhou

Main category: cs.CV

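TL;DR: The paper presents URBAN-SIM, a scalable urban simulation solution, and URBAN-BENCH, a task suite, to improve the safety and efficiency of autonomous micromobility devices.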

Details

Motivation: Micromobility devices currently depend on human operation, which raises safety and efficiency concerns; AI assistance is a potential solution. Method: Builds the URBAN-SIM platform (three key modules) and the URBAN-BENCH task suite (eight tasks), and evaluates four robots. Result: Experiments reveal the strengths and limitations of different robots across diverse terrains and urban structures. Conclusion: URBAN-SIM and URBAN-BENCH provide effective tools for training and evaluating AI agents for autonomous micromobility. Abstract: Micromobility, which utilizes lightweight mobile machines moving in urban public spaces, such as delivery robots and mobility scooters, emerges as a promising alternative to vehicular mobility. Current micromobility depends mostly on human manual operation (in-person or remote control), which raises safety and efficiency concerns when navigating busy urban environments full of unpredictable obstacles and pedestrians. Assisting humans with AI agents in maneuvering micromobility devices presents a viable solution for enhancing safety and efficiency. In this work, we present a scalable urban simulation solution to advance autonomous micromobility. First, we build URBAN-SIM - a high-performance robot learning platform for large-scale training of embodied agents in interactive urban scenes. URBAN-SIM contains three critical modules: Hierarchical Urban Generation pipeline, Interactive Dynamics Generation strategy, and Asynchronous Scene Sampling scheme, to improve the diversity, realism, and efficiency of robot learning in simulation. Then, we propose URBAN-BENCH - a suite of essential tasks and benchmarks to gauge various capabilities of the AI agents in achieving autonomous micromobility. URBAN-BENCH includes eight tasks based on three core skills of the agents: Urban Locomotion, Urban Navigation, and Urban Traverse. We evaluate four robots with heterogeneous embodiments, such as wheeled and legged robots, across these tasks. Experiments on diverse terrains and urban structures reveal each robot's strengths and limitations.

[45] RayZer: A Self-supervised Large View Synthesis Model

Hanwen Jiang,Hao Tan,Peng Wang,Haian Jin,Yue Zhao,Sai Bi,Kai Zhang,Fujun Luan,Kalyan Sunkavalli,Qixing Huang,Georgios Pavlakos

Main category: cs.CV

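TL;DR: RayZer is a self-supervised multi-view 3D vision model trained without 3D supervision that recovers camera parameters from unposed, uncalibrated images and synthesizes novel views.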

Details

Motivation: Develop a model that acquires 3D awareness without 3D annotations such as camera poses and scene geometry. Method: A self-supervised framework and a transformer-based model that disentangle camera and scene representations, with the ray structure as the only 3D prior. Result: RayZer's novel view synthesis is comparable or even superior to methods that rely on pose annotations. Conclusion: RayZer shows that effective 3D awareness is achievable without 3D supervision. Abstract: We present RayZer, a self-supervised multi-view 3D Vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates novel view synthesis performance comparable or even superior to that of "oracle" methods that rely on pose annotations in both training and testing. Project: https://hwjiang1510.github.io/RayZer/

[46] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

Dongzhi Jiang,Ziyu Guo,Renrui Zhang,Zhuofan Zong,Hao Li,Le Zhuo,Shilin Yan,Pheng-Ann Heng,Hongsheng Li

Main category: cs.CV

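TL;DR: T2I-R1 is a text-to-image generation model that combines chain-of-thought (CoT) and reinforcement learning (RL), using bi-level CoT reasoning to improve the generation process and its performance.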

Details

Motivation: Chain-of-thought and reinforcement learning work well for language models, but their use in visual generation remains largely unexplored. Method: Proposes T2I-R1, which uses semantic-level CoT for prompt planning and token-level CoT for pixel processing, coordinated by BiCoT-GRPO. Result: Improvements of 13% on T2I-CompBench and 19% on WISE, surpassing the state-of-the-art model FLUX.1. Conclusion: T2I-R1 demonstrates the potential of CoT and RL for visual generation and points to new research directions. Abstract: Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1

cs.GR [Back]

[47] Controllable Weather Synthesis and Removal with Video Diffusion Models

Chih-Hao Lin,Zian Wang,Ruofan Liang,Yuxuan Zhang,Sanja Fidler,Shenlong Wang,Zan Gojcic

Main category: cs.GR

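TL;DR: WeatherWeaver is a video diffusion model that synthesizes diverse weather effects (rain, snow, fog, clouds) directly into input videos without 3D modeling, with precise control over weather intensity.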

Details

Motivation: Physics-based weather simulation needs precise reconstruction and is hard to scale to in-the-wild videos, while existing video editing methods lack realism and control. Method: Proposes the WeatherWeaver model with a data strategy that combines synthetic videos, generative image editing, and auto-labeled real-world videos. Result: Outperforms existing methods in weather simulation and removal, producing high-quality, physically plausible results that preserve scene identity. Conclusion: WeatherWeaver synthesizes diverse weather effects well, offering both realism and control. Abstract: Generating realistic and controllable weather effects in videos is valuable for many applications. Physics-based weather simulation requires precise reconstructions that are hard to scale to in-the-wild videos, while current video editing often lacks realism and control. In this work, we introduce WeatherWeaver, a video diffusion model that synthesizes diverse weather effects (including rain, snow, fog, and clouds) directly into any input video without the need for 3D modeling. Our model provides precise control over weather effect intensity and supports blending various weather types, ensuring both realism and adaptability. To overcome the scarcity of paired training data, we propose a novel data strategy combining synthetic videos, generative image editing, and auto-labeled real-world videos. Extensive evaluations show that our method outperforms state-of-the-art methods in weather simulation and removal, providing high-quality, physically plausible, and scene-identity-preserving results over various real-world videos.

cs.CL [Back]

[48] Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning

Shaun Baek,Shaun Esua-Mensah,Cyrus Tsui,Sejan Vigneswaralingam,Abdullah Alali,Michael Lu,Vasu Sharma,Kevin Zhu

Main category: cs.CL

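TL;DR: Rosetta-PL is a benchmark for evaluating the logical reasoning and generalization of LLMs; it translates a dataset of logical propositions and fine-tunes an LLM on it, finding that the translation method strongly affects performance.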

Details

Motivation: LLMs perform well in high-resource languages but are limited in low-resource settings and logical reasoning tasks; Rosetta-PL targets this problem. Method: Translates logical propositions from Lean into a custom logical language, uses them to fine-tune an LLM (e.g., GPT-4o), and analyzes how dataset size and translation method affect performance. Result: Preserving logical relationships during translation significantly boosts precision, and accuracy plateaus beyond roughly 20,000 training samples. Conclusion: The study offers guidance for optimizing LLM training on formal reasoning tasks and for improving performance in low-resource language applications. Abstract: Large Language Models (LLMs) are primarily trained on high-resource natural languages, limiting their effectiveness in low-resource settings and in tasks requiring deep logical reasoning. This research introduces Rosetta-PL, a benchmark designed to evaluate LLMs' logical reasoning and generalization capabilities in a controlled environment. We construct Rosetta-PL by translating a dataset of logical propositions from Lean into a custom logical language, which is then used to fine-tune an LLM (e.g., GPT-4o). Our experiments analyze the impact of the size of the dataset and the translation methodology on the performance of the model. Our results indicate that preserving logical relationships in the translation process significantly boosts precision, with accuracy plateauing beyond roughly 20,000 training samples. These insights provide valuable guidelines for optimizing LLM training in formal reasoning tasks and improving performance in various low-resource language applications.

[49] Symbol grounding in computational systems: A paradox of intentions

Vincent C. Müller

Main category: cs.CL

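TL;DR: Computationalism cannot explain symbol grounding: whether computation runs over meaningful or meaningless symbols, it leads to semantic nativism.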

Details

Motivation: Examine whether computationalism can explain symbol grounding and expose its latent commitment to semantic nativism. Method: Through logical analysis, lays out the two possibilities under computationalism (computing over meaningful or over meaningless symbols) and their consequences. Result: In either case computationalism implies semantic nativism and cannot solve the symbol grounding problem. Conclusion: Computationalism has a fundamental flaw in explaining symbol grounding, and its theoretical basis needs to be re-examined. Abstract: The paper presents a paradoxical feature of computational systems that suggests that computationalism cannot explain symbol grounding. If the mind is a digital computer, as computationalism claims, then it can be computing either over meaningful symbols or over meaningless symbols. If it is computing over meaningful symbols, its functioning presupposes the existence of meaningful symbols in the system, i.e. it implies semantic nativism. If the mind is computing over meaningless symbols, no intentional cognitive processes are available prior to symbol grounding. In this case, no symbol grounding could take place since any grounding presupposes intentional cognitive processes. So, whether computing in the mind is over meaningless or over meaningful symbols, computationalism implies semantic nativism.
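
The argument in the abstract has the form of a dilemma: a conclusion that follows from a claim and from its negation follows unconditionally. A minimal Lean 4 sketch of that form is given below. The proposition names are placeholders chosen for this example, and the two hypotheses simply encode the abstract's two conditional claims (both understood under the assumption that computationalism holds); the sketch checks the validity of the inference, not the truth of the premises.

```lean
-- Meaningful : the mind computes over meaningful symbols.
-- Nativism   : semantic nativism holds.
theorem dilemma_implies_nativism
    (Meaningful Nativism : Prop)
    (h₁ : Meaningful → Nativism)    -- horn 1: meaningful symbols presuppose meaning
    (h₂ : ¬Meaningful → Nativism)   -- horn 2: no grounding without prior intentionality
    : Nativism :=
  (Classical.em Meaningful).elim h₁ h₂
```

The proof uses the law of the excluded middle, which mirrors the abstract's claim that computation is over either meaningful or meaningless symbols.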

[50] The Mind in the Machine: A Survey of Incorporating Psychological Theories in LLMs

Zizhou Liu,Ziwei Gong,Lin Ai,Zheng Hui,Run Chen,Colin Wayne Leach,Michelle R. Greene,Julia Hirschberg

Main category: cs.CL

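TL;DR: Psychological theory matters for the development of large language models (LLMs); this survey reviews how psychology can strengthen LLMs across data, pre-training, post-training, and evaluation and application, and analyzes current trends and gaps.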

Details

Motivation: As LLMs grow in scale and complexity, psychology is increasingly seen as key to capturing human-like cognition, behavior, and interaction. The paper examines how psychology can give theoretical support to LLM development. Method: Integrates theories from cognitive, developmental, behavioral, social, and personality psychology and psycholinguistics, and analyzes how psychology is applied at each stage of LLM development. Result: Identifies current trends and gaps in how psychological theories are used in LLM research and offers suggestions for cross-domain integration. Conclusion: Deeper integration of psychology and NLP will help future research, and disciplinary divides need to be bridged further. Abstract: Psychological insights have long shaped pivotal NLP breakthroughs, including the cognitive underpinnings of attention mechanisms, formative reinforcement learning, and Theory of Mind-inspired social modeling. As Large Language Models (LLMs) continue to grow in scale and complexity, there is a rising consensus that psychology is essential for capturing human-like cognition, behavior, and interaction. This paper reviews how psychological theories can inform and enhance stages of LLM development, including data, pre-training, post-training, and evaluation and application. Our survey integrates insights from cognitive, developmental, behavioral, social, personality psychology, and psycholinguistics. Our analysis highlights current trends and gaps in how psychological theories are applied. By examining both cross-domain connections and points of tension, we aim to bridge disciplinary divides and promote more thoughtful integration of psychology into future NLP research.

[51] LangVAE and LangSpace: Building and Probing for Language Model VAEs

Danilo S. Carvalho,Yingji Zhang,Harriet Unsworth,André Freitas

Main category: cs.CL

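TL;DR: LangVAE is a new framework for building variational autoencoders (VAEs) on top of pre-trained large language models (LLMs) to obtain more compact and semantically disentangled representations. Its companion tool LangSpace provides a range of analysis methods. Experiments show its flexibility and potential.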

Details

Motivation: Use the knowledge in pre-trained language models to build more efficient textual representations, and systematize experimentation and analysis. Method: Builds language model VAEs with the LangVAE framework and analyzes the representations with LangSpace (vector traversal, disentanglement measures, and so on). Result: Experiments show a wide range of interactions across encoder-decoder combinations and annotated inputs, confirming the framework's flexibility and potential. Conclusion: LangVAE is a promising framework for systematic experimentation on, and understanding of, textual representations. Abstract: We present LangVAE, a novel framework for modular construction of variational autoencoders (VAEs) on top of pre-trained large language models (LLMs). Such language model VAEs can encode the knowledge of their pre-trained components into more compact and semantically disentangled representations. The representations obtained in this way can be analysed with the LangVAE companion framework: LangSpace, which implements a collection of probing methods, such as vector traversal and interpolation, disentanglement measures, and cluster visualisations. LangVAE and LangSpace offer a flexible, efficient and scalable way of building and analysing textual representations, with simple integration for models available on the HuggingFace Hub. Additionally, we conducted a set of experiments with different encoder and decoder combinations, as well as annotated inputs, revealing a wide range of interactions across architectural families and sizes w.r.t. generalisation and disentanglement. Our findings demonstrate a promising framework for systematising the experimentation and understanding of textual representations.

[52] Toward a digital twin of U.S. Congress

Hayden Helm,Tianyi Chen,Harvey McGuinness,Paige Lee,Brandon Duderstadt,Carey E. Priebe

Main category: cs.CL

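TL;DR: The paper presents a language-model-based digital twin of U.S. congresspersons that generates Tweets largely indistinguishable from real ones and is used to predict voting behavior and the likelihood of crossing party lines.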

Details

Motivation: Explore digital twin technology in politics by simulating the language behavior of congresspersons, to support decisions about resource allocation and legislative dynamics. Method: Uses a daily-updated dataset of congresspersons' Tweets to equip a language model with congressperson-specific data, generates virtual Tweets, and analyzes how they relate to real behavior. Result: Generated Tweets closely resemble real ones and can be used to predict roll-call votes and party-line crossing, giving stakeholders decision support. Conclusion: The study shows the potential of digital twins for political analysis, notes the model's limitations, and outlines directions for extension. Abstract: In this paper we provide evidence that a virtual model of U.S. congresspersons based on a collection of language models satisfies the definition of a digital twin. In particular, we introduce and provide high-level descriptions of a daily-updated dataset that contains every Tweet from every U.S. congressperson during their respective terms. We demonstrate that a modern language model equipped with congressperson-specific subsets of this data is capable of producing Tweets that are largely indistinguishable from actual Tweets posted by their physical counterparts. We illustrate how generated Tweets can be used to predict roll-call vote behaviors and to quantify the likelihood of congresspersons crossing party lines, thereby assisting stakeholders in allocating resources and potentially impacting real-world legislative dynamics. We conclude with a discussion of the limitations and important extensions of our analysis.

[53] A Scoping Review of Natural Language Processing in Addressing Medically Inaccurate Information: Errors, Misinformation, and Hallucination

Zhaoyi Sun,Wen-Wai Yim,Ozlem Uzuner,Fei Xia,Meliha Yetisgen

Main category: cs.CL

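TL;DR: The review examines the potential and challenges of NLP for detecting, correcting, and mitigating medically inaccurate information, stressing its importance for patient safety and public health.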

Details

Motivation: Advance patient safety, improve public health communication, and make NLP applications in healthcare more reliable and transparent. Method: A scoping review following PRISMA guidelines that analyzes studies from 2020 to 2024 across five databases, categorized by topic, task, document type, and other attributes. Result: NLP shows potential for error detection and correction, misinformation handling, and hallucination management in medical information, but challenges remain in data privacy, context dependency, and related areas. Conclusion: Future work should focus on real-world datasets, better contextual methods, and hallucination management to ensure reliable use of NLP in healthcare. Abstract: Objective: This review aims to explore the potential and challenges of using Natural Language Processing (NLP) to detect, correct, and mitigate medically inaccurate information, including errors, misinformation, and hallucination. By unifying these concepts, the review emphasizes their shared methodological foundations and their distinct implications for healthcare. Our goal is to advance patient safety, improve public health communication, and support the development of more reliable and transparent NLP applications in healthcare. Methods: A scoping review was conducted following PRISMA guidelines, analyzing studies from 2020 to 2024 across five databases. Studies were selected based on their use of NLP to address medically inaccurate information and were categorized by topic, tasks, document types, datasets, models, and evaluation metrics. Results: NLP has shown potential in addressing medically inaccurate information on the following tasks: (1) error detection, (2) error correction, (3) misinformation detection, (4) misinformation correction, (5) hallucination detection, and (6) hallucination mitigation. However, challenges remain with data privacy, context dependency, and evaluation standards. Conclusion: This review highlights the advancements in applying NLP to tackle medically inaccurate information while underscoring the need to address persistent challenges. Future efforts should focus on developing real-world datasets, refining contextual methods, and improving hallucination management to ensure reliable and transparent healthcare applications.

[54] Efficient Knowledge Transfer in Multi-Task Learning through Task-Adaptive Low-Rank Representation

Xiao Zhang,Kangsheng Wang,Tianyu Hu,Huimin Ma

Main category: cs.CL

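TL;DR: TA-LoRA is a multi-task learning method built on prompt tuning that handles task heterogeneity with a low-rank representation and a fast-slow weights mechanism, improving performance.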

Details

Motivation: Pre-trained language models struggle on new tasks; multi-task learning can transfer knowledge, but prompt tuning has difficulty capturing task heterogeneity. Method: Proposes TA-LoRA, which combines a low-rank representation with a fast-slow weights mechanism to keep shared and task-specific knowledge from mixing, and introduces a zero-initialized attention mechanism. Result: On 16 tasks, TA-LoRA reaches state-of-the-art performance in both full-data and few-shot settings while remaining parameter-efficient. Conclusion: TA-LoRA handles task heterogeneity effectively and improves the performance and efficiency of multi-task learning. Abstract: Pre-trained language models (PLMs) demonstrate remarkable intelligence but struggle with emerging tasks unseen during training in real-world applications. Training separate models for each new task is usually impractical. Multi-task learning (MTL) addresses this challenge by transferring shared knowledge from source tasks to target tasks. As a dominant parameter-efficient fine-tuning method, prompt tuning (PT) enhances MTL by introducing an adaptable vector that captures task-specific knowledge, which acts as a prefix to the original prompt that preserves shared knowledge, while keeping PLM parameters frozen. However, PT struggles to effectively capture the heterogeneity of task-specific knowledge due to its limited representational capacity. To address this challenge, we propose Task-Adaptive Low-Rank Representation (TA-LoRA), an MTL method built on PT, employing the low-rank representation to model task heterogeneity and a fast-slow weights mechanism where the slow weight encodes shared knowledge, while the fast weight captures task-specific nuances, avoiding the mixing of shared and task-specific knowledge caused by training low-rank representations from scratch. Moreover, a zero-initialized attention mechanism is introduced to minimize the disruption of immature low-rank components on original prompts during warm-up epochs. Experiments on 16 tasks demonstrate that TA-LoRA achieves state-of-the-art performance in full-data and few-shot settings while maintaining superior parameter efficiency.

[55] Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models

Tri Nguyen,Lohith Srikanth Pentapalli,Magnus Sieverding,Laurah Turner,Seth Overla,Weibing Zheng,Chris Zhou,David Furniss,Danielle Weber,Michael Gharib,Matt Kelleher,Michael Shukis,Cameron Pawlik,Kelly Cohen

Main category: cs.CL

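TL;DR: The study detects jailbreaks in large language models (LLMs) using linguistic features; on the clinical education platform 2-Sigma, feature-based predictive models outperform prompt engineering, with the Fuzzy Decision Tree performing best.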

Details

Motivation: Jailbreaking threatens the safe use of large language models (LLMs) in sensitive domains such as education, so effective detection methods are needed. Method: Annotates over 2,300 prompts, extracts four linguistic features, and trains several predictive models (Decision Trees, Fuzzy Logic-based classifiers, Boosting methods, Logistic Regression). Result: Feature-based predictive models outperform prompt engineering, with the Fuzzy Decision Tree performing best. Conclusion: Linguistic-feature-based models are an effective approach to jailbreak detection; future work could explore hybrid frameworks that combine prompt-based flexibility with rule-based robustness. Abstract: Jailbreaking in Large Language Models (LLMs) threatens their safe use in sensitive domains like education by allowing users to bypass ethical safeguards. This study focuses on detecting jailbreaks in 2-Sigma, a clinical education platform that simulates patient interactions using LLMs. We annotated over 2,300 prompts across 158 conversations using four linguistic variables shown to correlate strongly with jailbreak behavior. The extracted features were used to train several predictive models, including Decision Trees, Fuzzy Logic-based classifiers, Boosting methods, and Logistic Regression. Results show that feature-based predictive models consistently outperformed Prompt Engineering, with the Fuzzy Decision Tree achieving the best overall performance. Our findings demonstrate that linguistic-feature-based models are effective and explainable alternatives for jailbreak detection. We suggest future work explore hybrid frameworks that integrate prompt-based flexibility with rule-based robustness for real-time, spectrum-based jailbreak monitoring in educational LLMs.

[56] The AI Co-Ethnographer: How Far Can Automation Take Qualitative Research?

Fabian Retkowski,Andreas Sudmann,Alexander Waibel

Main category: cs.CL

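TL;DR: The paper presents AICoE, a new end-to-end pipeline that addresses the difficulty of scaling qualitative research while preserving analytical depth.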

Details

Motivation: Qualitative research typically involves labor-intensive processes that are hard to scale while preserving analytical depth. Method: AICoE is an end-to-end pipeline covering open coding, code consolidation, code application, and pattern discovery, offering a more complete analysis of qualitative data. Result: AICoE enables comprehensive analysis of qualitative data that goes beyond simply automating code assignments. Conclusion: AICoE offers a more integrated approach to qualitative research and eases the tension between scale and analytical depth. Abstract: Qualitative research often involves labor-intensive processes that are difficult to scale while preserving analytical depth. This paper introduces The AI Co-Ethnographer (AICoE), a novel end-to-end pipeline developed for qualitative research and designed to move beyond the limitations of simply automating code assignments, offering a more integrated approach. AICoE organizes the entire process, encompassing open coding, code consolidation, code application, and even pattern discovery, leading to a comprehensive analysis of qualitative data.

[57] Performance Evaluation of Emotion Classification in Japanese Using RoBERTa and DeBERTa

Yoichi Takenaka

Main category: cs.CL

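TL;DR: The study aims to build a high-accuracy model for predicting eight Plutchik emotions in Japanese text; DeBERTa-v3-large performs best, and future work should improve the data and models.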

Details

Motivation: Emotion detection for Japanese text is in high demand for applications such as social media monitoring, but resource scarcity and class imbalance limit model performance. Method: Using the WRIME corpus, converts reader-averaged intensity scores into binary labels, fine-tunes four pre-trained models (BERT, RoBERTa, DeBERTa-v3-base, DeBERTa-v3-large), and evaluates two large language models. Result: DeBERTa-v3-large achieves the best mean accuracy (0.860) and F1-score (0.662), outperforming the other models, while the large language models perform worse. Conclusion: The fine-tuned DeBERTa-v3-large is currently the most reliable solution for binary emotion classification in Japanese; future work should improve the data, model size, and prompt engineering. Abstract: Background: Practical applications such as social media monitoring and customer-feedback analysis require accurate emotion detection for Japanese text, yet resource scarcity and class imbalance hinder model performance. Objective: This study aims to build a high-accuracy model for predicting the presence or absence of eight Plutchik emotions in Japanese sentences. Methods: Using the WRIME corpus, we transform reader-averaged intensity scores into binary labels and fine-tune four pre-trained language models (BERT, RoBERTa, DeBERTa-v3-base, DeBERTa-v3-large). For context, we also assess two large language models (TinySwallow-1.5B-Instruct and ChatGPT-4o). Accuracy and F1-score serve as evaluation metrics. Results: DeBERTa-v3-large attains the best mean accuracy (0.860) and F1-score (0.662), outperforming all other models. It maintains robust F1 across both high-frequency emotions (e.g., Joy, Anticipation) and low-frequency emotions (e.g., Anger, Trust). The LLMs lag, with ChatGPT-4o and TinySwallow-1.5B-Instruct scoring 0.527 and 0.292 in mean F1, respectively. Conclusion: The fine-tuned DeBERTa-v3-large model currently offers the most reliable solution for binary emotion classification in Japanese. We release this model as a pip-installable package (pip install deberta-emotion-predictor). Future work should augment data for rare emotions, reduce model size, and explore prompt engineering to improve LLM performance. This manuscript is under review for possible publication in New Generation Computing.
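
The labeling and scoring steps described above can be illustrated with a small sketch. The abstract does not state the cutoff used to turn reader-averaged intensities into binary labels, so the `threshold` value below is purely an assumption, and the function names are hypothetical rather than part of the released package.

```python
def binarize(scores, threshold=1.0):
    """Map averaged intensity scores to 0/1 labels.

    The threshold is an assumed value for illustration; the paper's actual
    cutoff is not given in the abstract.
    """
    return [1 if s >= threshold else 0 for s in scores]


def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = binarize([0.0, 0.7, 1.0, 2.3])
print(labels, f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

Reporting F1 alongside accuracy matters here because, with rare emotions, a model that always predicts "absent" can score high accuracy while its F1 for that emotion is zero.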

[58] Manifold-Constrained Sentence Embeddings via Triplet Loss: Projecting Semantics onto Spheres, Tori, and Möbius Strips

Vinit K. Chavan

Main category: cs.CL

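TL;DR: The paper proposes a new framework that constrains sentence embeddings to continuous manifolds (such as the unit sphere, torus, and Möbius strip), trained with triplet loss, which markedly improves the discriminability and topological structure of the embeddings.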

Details

Motivation: Traditional sentence embeddings usually live in unconstrained Euclidean space, which may not fully reflect the complex relationships in language. Method: Uses triplet loss to constrain sentence embeddings to the unit sphere, torus, and Möbius strip, applying differential geometric constraints to the output space. Result: On the AG News and MBTI datasets, manifold-constrained embeddings (especially on the sphere and Möbius strip) clearly outperform traditional approaches in clustering and classification. Conclusion: Embedding in manifold space complements semantic separation with topological structure, offering a new, mathematically grounded direction for geometric representation learning in NLP. Abstract: Recent advances in representation learning have emphasized the role of embedding geometry in capturing semantic structure. Traditional sentence embeddings typically reside in unconstrained Euclidean spaces, which may limit their ability to reflect complex relationships in language. In this work, we introduce a novel framework that constrains sentence embeddings to lie on continuous manifolds -- specifically the unit sphere, torus, and Möbius strip -- using triplet loss as the core training objective. By enforcing differential geometric constraints on the output space, our approach encourages the learning of embeddings that are both discriminative and topologically structured. We evaluate our method on benchmark datasets (AG News and MBTI) and compare it to classical baselines including TF-IDF, Word2Vec, and unconstrained Keras-derived embeddings. Our results demonstrate that manifold-constrained embeddings, particularly those projected onto spheres and Möbius strips, significantly outperform traditional approaches in both clustering quality (Silhouette Score) and classification performance (Accuracy). These findings highlight the value of embedding in manifold space -- where topological structure complements semantic separation -- offering a new and mathematically grounded direction for geometric representation learning in NLP.
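
As a rough illustration of the simplest case above, the sketch below projects vectors onto the unit sphere and evaluates a standard triplet loss on the projected points. It is not the authors' implementation: the `margin` value and function names are assumptions, and the torus and Möbius strip parameterizations are not attempted here.

```python
import math


def project_to_sphere(v):
    """Rescale a vector to unit length so it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot project the zero vector")
    return [x / norm for x in v]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: pull the positive closer than the negative by `margin`.

    The margin of 0.5 is an assumed value for illustration.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


a = project_to_sphere([3.0, 4.0])
p = project_to_sphere([6.0, 8.0])    # same direction as the anchor
n = project_to_sphere([-3.0, -4.0])  # opposite direction
print(triplet_loss(a, p, n))
```

On the sphere only direction carries information, so the anchor and a rescaled copy of it collapse to the same point and the loss for this triplet is zero.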

[59] Design and Application of Multimodal Large Language Model Based System for End to End Automation of Accident Dataset Generation

MD Thamed Bin Zaman Chowdhury,Moazzem Hossain

Main category: cs.CL

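TL;DR: The study proposes a fully automated system based on large language models (LLMs) and web scraping to address unreliable traffic accident data collection in Bangladesh. By automatically scraping, classifying, and deduplicating news, the system collected data on 705 unique accidents, providing a basis for data-driven road safety policymaking.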

Details

Motivation: In developing countries such as Bangladesh, traffic accident data collection is manual, fragmented, and unreliable, leading to missing and inconsistent records. The study aims to solve these problems with an automated system. Method: The system has four modules: automated web scraping code generation, news collection, accident news classification with structured data extraction, and deduplication. It uses the multimodal generative LLM Gemini-2.0-Flash for automation. Result: Over 111 days the system scraped more than 15,000 news articles and identified 705 unique accidents. The code generation module reached 91.3% calibration accuracy and 80% validation accuracy. Chittagong had the most accidents. Conclusion: The study shows that an LLM-based system can collect traffic accident data efficiently and accurately, offering a workable basis for data-driven road safety policymaking. Abstract: Road traffic accidents remain a major public safety and socio-economic issue in developing countries like Bangladesh. Existing accident data collection is largely manual, fragmented, and unreliable, resulting in underreporting and inconsistent records. This research proposes a fully automated system using Large Language Models (LLMs) and web scraping techniques to address these challenges. The pipeline consists of four components: automated web scraping code generation, news collection from online sources, accident news classification with structured data extraction, and duplicate removal. The system uses the multimodal generative LLM Gemini-2.0-Flash for seamless automation. The code generation module classifies webpages into pagination, dynamic, or infinite scrolling categories and generates suitable Python scripts for scraping. LLMs also classify and extract key accident information such as date, time, location, fatalities, injuries, road type, vehicle types, and pedestrian involvement. A deduplication algorithm ensures data integrity by removing duplicate reports. The system scraped 14 major Bangladeshi news sites over 111 days (Oct 1, 2024 - Jan 20, 2025), processing over 15,000 news articles and identifying 705 unique accidents. The code generation module achieved 91.3% calibration and 80% validation accuracy. Chittagong reported the highest number of accidents (80), fatalities (70), and injuries (115), followed by Dhaka, Faridpur, Gazipur, and Cox's Bazar. Peak accident times were morning (8-9 AM), noon (12-1 PM), and evening (6-7 PM). A public repository was also developed with usage instructions. This study demonstrates the viability of an LLM-powered, scalable system for accurate, low-effort accident data collection, providing a foundation for data-driven road safety policymaking in Bangladesh.

[60] Sparks of Tabular Reasoning via Text2SQL Reinforcement Learning

Josefa Lia Stoisser,Marc Boubnovski Martell,Julien Fauqueur

Main category: cs.CL

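TL;DR: The paper proposes a two-stage framework that uses SQL supervision to improve the reasoning ability of large language models (LLMs) over tabular data, and introduces a GRPO reinforcement learning objective to improve generalization. Experiments show that the method substantially improves model performance on Text-to-SQL tasks.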

Details

Motivation: Traditional Text-to-SQL work focuses only on query generation; this paper aims to use SQL supervision to improve LLM reasoning over tabular data and give models more general table-handling ability. Method: (1) Generate detailed chain-of-thought (CoT) traces from real SQL queries to provide step-by-step supervision; (2) introduce a GRPO reinforcement learning objective that ties SQL execution accuracy to generalizable reasoning. Result: Performance improves on standard Text-to-SQL benchmarks, with especially large gains on BIRD and CRT-QA. The LLaMA model's accuracy rises by 20% and Qwen's by 5%. Conclusion: SQL is not only a target formalism but also an effective scaffold for learning to reason over structured data, improving generalization and interpretability. Abstract: This work reframes the Text-to-SQL task as a pathway for teaching large language models (LLMs) to reason over and manipulate tabular data--moving beyond the traditional focus on query generation. We propose a two-stage framework that leverages SQL supervision to develop transferable table reasoning capabilities. First, we synthesize detailed chain-of-thought (CoT) traces from real-world SQL queries, providing step-by-step, clause-level supervision that teaches the model how to traverse, filter, and aggregate table fields. Second, we introduce a Group Relative Policy Optimization (GRPO) reinforcement learning objective that connects SQL execution accuracy to generalizable reasoning by encouraging steps that extend beyond task-specific syntax and transfer across datasets. Empirically, our approach improves performance on standard Text-to-SQL benchmarks and achieves substantial gains on reasoning-intensive datasets such as BIRD and CRT-QA, demonstrating enhanced generalization and interpretability. Specifically, the distilled-quantized LLaMA model achieved a 20% increase in accuracy when trained on Text-to-SQL tasks, while Qwen achieved a 5% increase. These results suggest that SQL can serve not only as a target formalism but also as an effective scaffold for learning robust, transferable reasoning over structured data.
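
The "group relative" part of GRPO can be sketched in a few lines: several responses are sampled for the same prompt, and each response's reward is normalized against the group's mean and standard deviation. The sketch below shows only that normalization step under assumed choices (population standard deviation, a small `eps`); the paper's reward design and the rest of the policy update are not reproduced.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group.

    `rewards` holds one scalar per sampled response to the same prompt,
    for example 1.0 if the generated SQL executes to the right answer.
    Using the population standard deviation and eps=1e-8 are assumptions.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because advantages are relative to the group, a prompt on which every sample succeeds (or every sample fails) yields all-zero advantages and so contributes no learning signal from this term.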

[61] ReCellTy: Domain-specific knowledge graph retrieval-augmented LLMs workflow for single-cell annotation

Dezheng Han,Yibin Jia,Ruxiao Chen,Wenjie Han,Shuaishuai Guo,Jianbo Wang

Main category: cs.CL

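TL;DR: Proposes a method based on a graph-structured feature marker database and a multi-task workflow for automated cell type annotation, markedly improving annotation quality.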

Details

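Motivation: Address the limited precision and automation of existing large language models in cell type annotation. Method: Builds a graph-structured feature marker database and designs a multi-task workflow to optimize the annotation process. Result: Across 11 tissue types, human evaluation scores improve by up to 0.21 and semantic similarity by 6.1%, with results closer to the logic of manual annotation. Conclusion: The method improves the precision and automation of cell type annotation and aligns better with the logic of manual annotation. Abstract: To enable precise and fully automated cell type annotation with large language models (LLMs), we developed a graph-structured feature marker database to retrieve entities linked to differential genes for cell reconstruction. We further designed a multi-task workflow to optimize the annotation process. Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across 11 tissue types, while more closely aligning with the cognitive logic of manual annotation.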

[62] An Empirical Study on Prompt Compression for Large Language Models

Zheng Zhang,Jinyi Li,Yihuai Lan,Xiang Wang,Hao Wang

Main category: cs.CL

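TL;DR: The paper studies six prompt compression methods for large language models (LLMs), aiming to reduce prompt length while maintaining response quality. A multi-dimensional evaluation finds that compression affects long-context tasks more, and that moderate compression can even improve performance.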

Details

Motivation: Long prompts increase computational complexity and economic cost, so it is worth studying how to compress prompts without hurting LLM response quality. Method: Evaluates six prompt compression methods across generation performance, model hallucination, multimodal tasks, and other aspects, with experiments on 13 datasets. Result: Prompt compression affects long-context tasks more than short ones, and moderate compression even improves performance in the Longbench evaluation. Conclusion: Prompt compression is feasible, with its effect most pronounced on long-context tasks, and offers a way to reduce the cost of using LLMs. Abstract: Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression methods for LLMs, aiming to reduce prompt length while maintaining LLM response quality. In this paper, we present a comprehensive analysis covering aspects such as generation performance, model hallucinations, efficacy in multimodal tasks, word omission analysis, and more. We evaluate these methods across 13 datasets, including news, scientific articles, commonsense QA, math QA, long-context QA, and VQA datasets. Our experiments reveal that prompt compression has a greater impact on LLM performance in long contexts compared to short ones. In the Longbench evaluation, moderate compression even enhances LLM performance. Our code and data are available at https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression.

[63] Beyond Public Access in LLM Pre-Training Data

Sruly Rosenblat,Tim O'Reilly,Ilan Strauss

Main category: cs.CL

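TL;DR: The paper uses the DE-COP method to test whether OpenAI models were trained on copyrighted O'Reilly Media content without permission, finding that GPT-4o strongly recognizes paywalled content, while GPT-3.5 Turbo is more sensitive to publicly accessible content.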

Details

Motivation: Investigate whether OpenAI's large language models were trained on copyrighted O'Reilly Media book content without authorization. Method: Applies the DE-COP membership inference attack to test how well GPT-4o, GPT-3.5 Turbo, and GPT-4o Mini recognize O'Reilly book content. Result: GPT-4o shows strong recognition of paywalled content (AUROC = 82%), GPT-3.5 Turbo is relatively more sensitive to publicly accessible content, and GPT-4o Mini shows no meaningful recognition. Conclusion: The study calls for greater corporate transparency about pre-training data sources as a step toward formal licensing frameworks for AI content training. Abstract: Using a legally obtained dataset of 34 copyrighted O'Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI's large language models were trained on copyrighted content without consent. Our AUROC scores show that GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content (AUROC = 82%), compared to OpenAI's earlier model GPT-3.5 Turbo. In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples. GPT-4o Mini, as a much smaller model, shows no knowledge of public or non-public O'Reilly Media content when tested (AUROC ≈ 50%). Testing multiple models, with the same cutoff date, helps us account for potential language shifts over time that might bias our findings. These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training.
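
To make the AUROC figures above easier to read, here is a minimal sketch of the metric itself: the probability that a randomly chosen "member" sample receives a higher score than a randomly chosen "non-member" sample, with ties counted as half. The scores below are made up for illustration, and nothing here reproduces DE-COP's multiple-choice scoring procedure.

```python
def auroc(member_scores, nonmember_scores):
    """AUROC as the fraction of (member, non-member) pairs ranked correctly.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of
    the metric and is quadratic in the number of samples.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical recognition scores, not data from the paper.
print(auroc([0.9, 0.3], [0.5, 0.1]))
```

Under this reading, a value near 50% means the scores carry no information about membership, which is how the GPT-4o Mini result is interpreted above.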

[64] Ustnlp16 at SemEval-2025 Task 9: Improving Model Performance through Imbalance Handling and Focal Loss

Zhuoang Cai,Zhenghao Li,Yang Liu,Liyuan Guo,Yangqiu Song

Main category: cs.CL

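TL;DR: The paper presents a solution to data imbalance in food hazard detection, improving classification performance through data augmentation and several balancing strategies (such as random oversampling, EDA, and focal loss).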

Details

Motivation: Food hazard detection faces imbalanced data distributions, short and unstructured text, and overlapping semantic categories, so effective methods are needed to improve classification performance. Method: Uses transformer-based models (BERT and RoBERTa) as classifiers, combined with data balancing strategies including random oversampling, EDA, and focal loss. Result: Experiments show that EDA effectively mitigates class imbalance and substantially improves accuracy and F1 scores; combining focal loss with oversampling and EDA further improves robustness. Conclusion: The study contributes more effective NLP classification models for food hazard detection, particularly for hard-to-classify examples. Abstract: Classification tasks often suffer from imbalanced data distribution, which presents challenges in food hazard detection due to severe class imbalances, short and unstructured text, and overlapping semantic categories. In this paper, we present our system for SemEval-2025 Task 9: Food Hazard Detection, which addresses these issues by applying data augmentation techniques to improve classification performance. We utilize transformer-based models, BERT and RoBERTa, as backbone classifiers and explore various data balancing strategies, including random oversampling, Easy Data Augmentation (EDA), and focal loss. Our experiments show that EDA effectively mitigates class imbalance, leading to significant improvements in accuracy and F1 scores. Furthermore, combining focal loss with oversampling and EDA further enhances model robustness, particularly for hard-to-classify examples. These findings contribute to the development of more effective NLP-based classification models for food hazard detection.
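
For readers unfamiliar with focal loss, the standard per-example form is FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below evaluates that formula directly; the `gamma` and `alpha` defaults are common choices and are assumptions here, not the values used by the system described above.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one example, given the predicted probability of its true class.

    With gamma=0 and alpha=1 this reduces to ordinary cross-entropy.
    The defaults are assumed for illustration only.
    """
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


easy = focal_loss(0.9)  # confidently correct example
hard = focal_loss(0.1)  # badly misclassified example
print(easy, hard, hard / easy)
```

The `(1 - p) ** gamma` factor shrinks the loss of well-classified examples far more than that of hard ones, which is why the technique is often paired with imbalanced data where the majority class is easy.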

[65] Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

Thomas F Burns,Letitia Parcalabescu,Stephan Wäldchen,Michael Barlow,Gregor Ziegltrum,Volker Stampa,Bastian Harren,Björn Deiseroth

Main category: cs.CL

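TL;DR: The paper proposes a German dataset curation pipeline that combines heuristic and model-based filtering, improves data quality through synthetic data generation, and markedly improves the performance of large language models.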

Details

Motivation: Data quality is critical to the performance and training efficiency of large language models, but existing German datasets are not good enough and need improvement. Method: Combines heuristic and model-based filtering to build the German dataset Aleph-Alpha-GermanWeb from Common Crawl, FineWeb2, and synthetic data. Result: On German benchmarks, Aleph-Alpha-GermanWeb clearly outperforms FineWeb2, even when the latter is enriched with high-quality human-curated data such as Wikipedia. Conclusion: Model-based data curation and synthetic data generation can substantially improve the quality of pre-training datasets. Abstract: Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and training efficiency. We introduce a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a large-scale German pre-training dataset which draws from: (1) Common Crawl web data, (2) FineWeb2, and (3) synthetically-generated data conditioned on actual, organic web data. We evaluate our dataset by pre-training both a 1B Llama-style model and an 8B tokenizer-free hierarchical autoregressive transformer (HAT). A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone. This advantage holds at the 8B scale even when FineWeb2 is enriched by human-curated high-quality data sources such as Wikipedia. Our findings support the growing body of evidence that model-based data curation and synthetic data generation can significantly enhance LLM pre-training datasets.

[66] CORG: Generating Answers from Complex, Interrelated Contexts

Hyunji Lee,Franck Dernoncourt,Trung Bui,Seunghyun Yoon

Main category: cs.CL

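TL;DR: The paper proposes the Context Organizer (CORG) framework for efficiently handling complex and inconsistent knowledge relationships across multiple documents, outperforming existing methods.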

Details

Motivation: In real-world corpora, knowledge recurs but contains inconsistencies, and existing language models struggle with complex relationships between contexts. Method: The CORG framework comprises a graph constructor, a reranker, and an aggregator, and organizes contexts into independently processed groups. Result: CORG outperforms existing grouping methods in performance and efficiency, and comes close to the results of computationally intensive single-context approaches. Conclusion: CORG handles complex relationships across multiple documents effectively while balancing performance and efficiency. Abstract: In a real-world corpus, knowledge frequently recurs across documents but often contains inconsistencies due to ambiguous naming, outdated information, or errors, leading to complex interrelationships between contexts. Previous research has shown that language models struggle with these complexities, typically focusing on single factors in isolation. We classify these relationships into four types: distracting, ambiguous, counterfactual, and duplicated. Our analysis reveals that no single approach effectively addresses all these interrelationships simultaneously. Therefore, we introduce Context Organizer (CORG), a framework that organizes multiple contexts into independently processed groups. This design allows the model to efficiently find all relevant answers while ensuring disambiguation. CORG consists of three key components: a graph constructor, a reranker, and an aggregator. Our results demonstrate that CORG balances performance and efficiency effectively, outperforming existing grouping methods and achieving comparable results to more computationally intensive, single-context approaches.

[67] Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning

Shaokun Zhang,Yi Dong,Jieyu Zhang,Jan Kautz,Bryan Catanzaro,Andrew Tao,Qingyun Wu,Zhiding Yu,Guilin Liu

Main category: cs.CL

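TL;DR: The paper presents the Nemotron-Research-Tool-N1 series, language models whose tool-calling ability is optimized with rule-based reinforcement learning; lightweight supervision alone lets the models internalize reasoning strategies on their own, and they surpass GPT-4o on the BFCL and API-Bank benchmarks.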

Details

Motivation: Existing methods for improving tool use in language models either ignore the reasoning process or rely on imitated reasoning, which limits generalization. Inspired by DeepSeek-R1, the authors aim to learn reasoning strategies more efficiently through rule-based reinforcement learning. Method: Trains the Nemotron-Research-Tool-N1 series with rule-based reinforcement learning, using only a binary reward that evaluates the structural validity and functional correctness of tool calls, with no annotated reasoning trajectories. Result: Nemotron-Research-Tool-N1-7B and 14B perform strongly on the BFCL and API-Bank benchmarks, surpassing GPT-4o. Conclusion: Lightweight supervision combined with rule-based reinforcement learning can effectively improve tool-calling ability without relying on annotated reasoning trajectories. Abstract: Enabling large language models with external tools has become a pivotal strategy for extending their functionality beyond text generation tasks. Prior work typically enhances tool-use abilities by either applying supervised fine-tuning (SFT) to enforce tool-call correctness or distilling reasoning traces from stronger models for SFT. However, both approaches fall short, either omitting reasoning entirely or producing imitative reasoning that limits generalization. Inspired by the success of DeepSeek-R1 in eliciting reasoning through rule-based reinforcement learning, we develop the Nemotron-Research-Tool-N1 series of tool-using language models using a similar training paradigm. Instead of restrictively supervising intermediate reasoning traces distilled from stronger models, Nemotron-Research-Tool-N1 is optimized with a binary reward that evaluates only the structural validity and functional correctness of tool invocations. This lightweight supervision allows the model to autonomously internalize reasoning strategies, without the need for annotated reasoning trajectories. Experiments on the BFCL and API-Bank benchmarks show that Nemotron-Research-Tool-N1-7B and Nemotron-Research-Tool-N1-14B, built on Qwen-2.5-7B/14B-Instruct, achieve state-of-the-art results, outperforming GPT-4o on both evaluations.
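
A binary reward of the kind described above might look roughly like the following sketch. The JSON call format with `name` and `arguments` keys, the exact-match comparison, and the function name `binary_tool_reward` are all assumptions made for illustration; the abstract does not specify the output format or the matching rule the authors use.

```python
import json


def binary_tool_reward(model_output, expected_call):
    """Return 1 only if the output is a well-formed tool call that matches the reference.

    Structural validity: the text parses as a JSON object with exactly the
    keys "name" and "arguments" (an assumed format), and "arguments" is an object.
    Functional correctness: the parsed call equals the expected call (assumed rule).
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return 0  # not parseable: structurally invalid
    if not isinstance(call, dict) or set(call) != {"name", "arguments"}:
        return 0
    if not isinstance(call["arguments"], dict):
        return 0
    return 1 if call == expected_call else 0


expected = {"name": "get_weather", "arguments": {"city": "Paris"}}
good = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
bad = '{"name": "get_weather", "arguments": {"city": "Rome"}}'
print(binary_tool_reward(good, expected), binary_tool_reward(bad, expected))
```

Note that the reward inspects only the final call, never the reasoning text that preceded it, which matches the abstract's point that intermediate reasoning is left unsupervised.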

[68] A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1

Mingda Zhang,Jianglong Qin

Main category: cs.CL

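TL;DR: Proposes a lightweight medical vertical large language model architecture that tackles the use of medical large models in resource-constrained environments along three dimensions: knowledge acquisition, model compression, and computational optimization.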

Details

Motivation: Address the difficulty of applying foundation models in medical settings due to professional knowledge barriers, computational resource requirements, and deployment environment limits. Method: Designs a knowledge transfer pipeline (from a 70B teacher model to a 7B student model) and applies LoRA, 4-bit weight quantization, Flash Attention acceleration, continuous batching, and related optimizations. Result: On medical question-answering datasets, memory consumption falls by 64.7% and inference latency by 12.4% while professional accuracy is maintained. Conclusion: Provides an effective solution for applying medical large models in resource-constrained environments such as edge computing devices. Abstract: In recent years, despite foundation models like DeepSeek-R1 and ChatGPT demonstrating significant capabilities in general tasks, professional knowledge barriers, computational resource requirements, and deployment environment limitations have severely hindered their application in actual medical scenarios. Addressing these challenges, this paper proposes an efficient lightweight medical vertical large language model architecture method, systematically solving the lightweight problem of medical large models from three dimensions: knowledge acquisition, model compression, and computational optimization. At the knowledge acquisition level, a knowledge transfer pipeline is designed from the fine-tuned DeepSeek-R1-Distill-70B teacher model to the DeepSeek-R1-Distill-7B student model, and Low-Rank Adaptation (LoRA) technology is adopted to precisely adjust key attention layers. At the model compression level, compression techniques including 4-bit weight quantization are implemented while preserving the core representation ability for medical reasoning. At the computational optimization level, inference optimization techniques such as Flash Attention acceleration and continuous batching are integrated, and a professional prompt template system is constructed to adapt to different types of medical problems. Experimental results on medical question-answering datasets show that the method proposed in this paper maintains professional accuracy while reducing memory consumption by 64.7% and inference latency by 12.4%, providing an effective solution for the application of medical large models in resource-constrained environments such as edge computing devices.
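
To give a feel for what 4-bit weight quantization does, the sketch below applies a simple symmetric, per-tensor scheme: each weight is mapped to one of 16 signed integer codes and later reconstructed by multiplying with a shared scale. This is a generic textbook scheme chosen for illustration; the abstract does not say which quantization method the paper uses, and practical systems typically quantize per group or per channel.

```python
def quantize_4bit(weights):
    """Symmetric 4-bit quantization with one scale for the whole list.

    Codes are integers in [-8, 7]; the scale maps the largest magnitude to 7.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from integer codes and the shared scale."""
    return [c * scale for c in codes]


weights = [0.42, -0.17, 0.03, -0.40]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
print(codes, [round(r, 3) for r in restored])
```

Each reconstructed weight differs from the original by at most about half the scale, which is the accuracy traded for storing 4 bits per weight instead of 16 or 32.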

[69] Theory of Mind in Large Language Models: Assessment and Enhancement

Ruirui Chen,Weifeng Jiang,Chengwei Qin,Cheston Tan

Main category: cs.CL

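TL;DR: This survey examines the Theory of Mind (ToM) capabilities of large language models (LLMs), covering evaluation benchmarks and improvement strategies, and outlines future research directions.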

Details

Motivation: Assess and improve the ability of LLMs to understand human mental states, in order to strengthen their social intelligence. Method: Examines the ToM capabilities of LLMs in depth by analyzing story-based benchmarks and improvement methods. Result: Summarizes existing benchmarks and methods, providing a resource for improving ToM in LLMs. Conclusion: The survey offers a useful reference for researchers and points to promising directions for future work. Abstract: Theory of Mind (ToM)-the ability to infer and reason about others' mental states-is fundamental to human social intelligence. As Large Language Models (LLMs) become increasingly integrated into daily life, it is crucial to assess and enhance their capacity to interpret and respond to human mental states. In this paper, we review LLMs' ToM capabilities by examining both evaluation benchmarks and the strategies designed to improve them. We focus on widely adopted story-based benchmarks and provide an in-depth analysis of methods aimed at enhancing ToM in LLMs. Furthermore, we outline promising future research directions informed by recent benchmarks and state-of-the-art approaches. Our survey serves as a valuable resource for researchers interested in advancing LLMs' ToM capabilities.

[70] Extracting Abstraction Dimensions by Identifying Syntax Pattern from Texts

Jian Zhou,Jiazheng Li,Sirui Zhuge,Hai Zhuge

Main category: cs.CL

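TL;DR: Proposes a method for automatically discovering subject, action, object, and adverbial dimensions from text, in order to operate on text efficiently and support natural language queries.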

Details

Motivation: Supporting natural language queries and operating on text efficiently requires extracting structured information from text. Method: Builds high-quality abstraction trees that represent subjects, actions, objects, and adverbials together with their subclass relations, ensuring the trees are independent and expressive. Result: Experiments show that the average precision, recall, and F1-score of the abstraction trees all exceed 80%, supporting efficient natural language queries. Conclusion: The method effectively supports natural language queries and, by searching multiple trees, quickly locates target sentences for precise text operations. Abstract: This paper proposes an approach to automatically discovering the subject dimension, action dimension, object dimension, and adverbial dimension from texts in order to operate on texts efficiently and support querying in natural language. The high quality of trees guarantees that all subjects, actions, objects and adverbials and their subclass relations within texts can be represented. The independence of trees ensures that there is no redundant representation between trees. The expressiveness of trees ensures that the majority of sentences can be accessed from each tree and the rest of the sentences can be accessed from at least one tree so that the tree-based search mechanism can support querying in natural language. Experiments show that the average precision, recall and F1-score of the abstraction trees constructed by the subclass relations of subject, action, object and adverbial are all greater than 80%. The application of the proposed approach to supporting query in natural language demonstrates that different types of question patterns for querying subject or object have high coverage of texts, and searching multiple trees on subject, action, object and adverbial according to the question pattern can quickly reduce the search space to locate target sentences, which can support precise operation on texts.

[71] Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation

Pengchao Feng,Ziyang Ma,Wenxi Chen,Yao Li,Sheng Wang,Kai Yu,Xie Chen

Main category: cs.CL

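TL;DR: Proposes a new end-to-end RAG framework that retrieves textual knowledge directly from speech queries without ASR conversion, markedly improving the performance of end-to-end S2S dialogue systems.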

Details

Motivation: End-to-end S2S systems have difficulty incorporating external knowledge, particularly because of the modality gap between speech and textual knowledge. Method: Proposes an end-to-end RAG framework that retrieves relevant textual knowledge directly from speech queries, avoiding intermediate ASR conversion. Result: Experiments show that the method substantially improves system performance and retrieval efficiency, though it still lags behind cascaded models. Conclusion: The framework offers a promising direction for knowledge integration in end-to-end S2S systems; the code and dataset will be released. Abstract: In recent years, end-to-end speech-to-speech (S2S) dialogue systems have garnered increasing research attention due to their advantages over traditional cascaded systems, including achieving lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these end-to-end systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries, eliminating the need for intermediate speech-to-text conversion via techniques like ASR. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. We will release the code and dataset to support reproducibility and promote further research in this area.

[72] Keep the General, Inject the Specific: Structured Dialogue Fine-Tuning for Knowledge Injection without Catastrophic Forgetting

Yijie Hong,Xiaofei Yin,Xinzhong Wang,Yi Tu,Ya Guo,Sufeng Duan,Weiqiang Wang,Lingyong Fang,Depeng Wang,Huijia Zhu

Main category: cs.CL

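TL;DR: The paper proposes SDFT, which uses structured dialogue fine-tuning to address catastrophic forgetting when large vision language models incorporate specialized knowledge.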

Details

Motivation: Large vision language models tend to forget foundational abilities when they incorporate specialized knowledge, so a balanced approach is needed. Method: SDFT uses a three-phase dialogue structure (Foundation Preservation, Contrastive Disambiguation, and Knowledge Specialization) together with a multi-turn supervision framework. Result: Experiments show that SDFT effectively balances specialized knowledge acquisition with retention of general capabilities. Conclusion: SDFT offers an effective solution for applying vision language models in specialized knowledge domains. Abstract: Large Vision Language Models have demonstrated impressive versatile capabilities through extensive multimodal pre-training, but face significant limitations when incorporating specialized knowledge domains beyond their training distribution. These models struggle with a fundamental dilemma: direct adaptation approaches that inject domain-specific knowledge often trigger catastrophic forgetting of foundational visual-linguistic abilities. We introduce Structured Dialogue Fine-Tuning (SDFT), an approach that effectively injects domain-specific knowledge while minimizing catastrophic forgetting. Drawing inspiration from supervised fine-tuning in LLMs and subject-driven personalization in text-to-image diffusion models, our method employs a three-phase dialogue structure: Foundation Preservation reinforces pre-trained visual-linguistic alignment through caption tasks; Contrastive Disambiguation introduces carefully designed counterfactual examples to maintain semantic boundaries; and Knowledge Specialization embeds specialized information through chain-of-thought reasoning. Experimental results across multiple domains confirm SDFT's effectiveness in balancing specialized knowledge acquisition with general capability retention. Our key contributions include a data-centric dialogue template that balances foundational alignment with targeted knowledge integration, a weighted multi-turn supervision framework, and comprehensive evaluation across diverse knowledge types.

[73] Can Language Models Represent the Past without Anachronism?

Ted Underwood,Laura K. Nelson,Matthew Wilkens

Main category: cs.CL

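TL;DR: The study finds that contemporary language models struggle to accurately simulate historical prose style; fine-tuning can fool an automated judge, but human evaluators still detect the outputs.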

Details

Motivation: Examine the limits of language models in simulating historical prose style, in order to understand the risk of anachronism. Method: Prompts and fine-tunes contemporary language models and compares their outputs with authentic historical text. Result: Fine-tuned models can fool an automated judge, but humans can still distinguish their outputs from authentic historical text. Conclusion: Pretraining on period prose may be required to reliably simulate historical perspectives. Abstract: Before researchers can use language models to simulate the past, they need to understand the risk of anachronism. We find that prompting a contemporary model with examples of period prose does not produce output consistent with period style. Fine-tuning produces results that are stylistically convincing enough to fool an automated judge, but human evaluators can still distinguish fine-tuned model outputs from authentic historical text. We tentatively conclude that pretraining on period prose may be required in order to reliably simulate historical perspectives for social research.

[74] Learning to Plan Before Answering: Self-Teaching LLMs to Learn Abstract Plans for Problem Solving

Jin Zhang,Flood Sung,Zhilin Yang,Yang Gao,Chongjie Zhang

Main category: cs.CL

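TL;DR: The paper proposes LEPA, a new self-training algorithm that has the LLM generate an abstract plan (meta-knowledge) before solving a problem, improving generalization and reasoning performance.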

Details

Motivation: Existing methods generate only step-by-step solutions and lack the abstract meta-knowledge needed to generalize to similar problems. Drawing on cognitive science, the paper proposes generating an abstract plan before solving a problem. Method: LEPA works in two steps: (1) generate an abstract plan based on the problem; (2) generate a solution that follows the plan, refining the plan through self-reflection. Result: LEPA clearly outperforms conventional methods on several natural language reasoning benchmarks. Conclusion: By extracting and using abstract plans, LEPA improves the reasoning ability and generalization of LLMs. Abstract: In the field of large language model (LLM) post-training, the effectiveness of utilizing synthetic data generated by the LLM itself has been well demonstrated. However, a key question remains unaddressed: what essential information should such self-generated data encapsulate? Existing approaches only produce step-by-step problem solutions, and fail to capture the abstract meta-knowledge necessary for generalization across similar problems. Drawing insights from cognitive science, where humans employ high-level abstraction to simplify complex problems before delving into specifics, we introduce a novel self-training algorithm: LEarning to Plan before Answering (LEPA). LEPA trains the LLM to formulate anticipatory plans, which serve as abstract meta-knowledge for problem-solving, before engaging with the intricacies of problems. This approach not only outlines the solution generation path but also shields the LLM from the distraction of irrelevant details. During data generation, LEPA first crafts an anticipatory plan based on the problem, and then generates a solution that aligns with both the plan and the problem. LEPA refines the plan through self-reflection, aiming to acquire plans that are instrumental in yielding correct solutions. During model optimization, the LLM is trained to predict both the refined plans and the corresponding solutions. By efficiently extracting and utilizing the anticipatory plans, LEPA demonstrates remarkable superiority over conventional algorithms on various challenging natural language reasoning benchmarks.

[75] MDD-LLM: Towards Accuracy Large Language Models for Major Depressive Disorder Diagnosis

Yuyang Sha,Hongxin Pan,Wei Xu,Weiyu Meng,Gang Luo,Xinyu Du,Xiaobing Zhai,Henry H. Y. Tong,Caijuan Shi,Kefeng Li

Main category: cs.CL

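TL;DR: The paper presents MDD-LLM, a high-performance tool for diagnosing major depressive disorder that uses fine-tuned large language models (LLMs) and real-world samples to significantly improve diagnostic accuracy.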

Details

Motivation: Depression affects more than 300 million people worldwide, but the uneven distribution of medical resources and the complexity of diagnostic methods leave it under-addressed in many regions. Method: Select 274,348 individual records from the UK Biobank cohort, design a tabular data transformation method to build the training corpus, and perform diagnosis with fine-tuned LLMs. Result: MDD-LLM (70B) reaches an accuracy of 0.8378 and an AUC of 0.8919 in the experiments, significantly outperforming existing methods. Conclusion: MDD-LLM offers an efficient solution for depression diagnosis, and the paper examines key factors that affect its performance. Abstract: Major depressive disorder (MDD) impacts more than 300 million people worldwide, highlighting a significant public health issue. However, the uneven distribution of medical resources and the complexity of diagnostic methods have resulted in inadequate attention to this disorder in numerous countries and regions. This paper introduces a high-performance MDD diagnosis tool named MDD-LLM, an AI-driven framework that utilizes fine-tuned large language models (LLMs) and extensive real-world samples to tackle challenges in MDD diagnosis. Specifically, we select 274,348 individual records from the UK Biobank cohort and design a tabular data transformation method to create a large corpus for training and evaluating the proposed approach. To illustrate the advantages of MDD-LLM, we perform comprehensive experiments and provide several comparative analyses against existing model-based solutions across multiple evaluation metrics. Experimental results show that MDD-LLM (70B) achieves an accuracy of 0.8378 and an AUC of 0.8919 (95% CI: 0.8799 - 0.9040), significantly outperforming existing machine learning and deep learning frameworks for MDD diagnosis. Given the limited exploration of LLMs in MDD diagnosis, we examine numerous factors that may influence the performance of our proposed method, such as tabular data transformation techniques and different fine-tuning strategies.
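
The abstract does not describe the tabular data transformation itself, so the following is only a rough sketch of one way a tabular record could be turned into a text example for fine-tuning. The function name `record_to_example`, the field names, and the prompt wording are all hypothetical and are not taken from the paper or from UK Biobank.

```python
from typing import Any, Dict


def record_to_example(record: Dict[str, Any], label_field: str = "mdd_diagnosis") -> Dict[str, str]:
    """Serialize one tabular row into a prompt/answer pair (illustrative template only)."""
    # Skip the label column and missing values; turn snake_case names into plain words.
    features = [
        f"{name.replace('_', ' ')} is {value}"
        for name, value in record.items()
        if name != label_field and value is not None
    ]
    prompt = (
        "Participant profile: "
        + "; ".join(features)
        + ". Does this participant have major depressive disorder?"
    )
    answer = "yes" if record[label_field] else "no"
    return {"prompt": prompt, "answer": answer}
```

For instance, a row such as `{"age": 54, "sleep_hours": 6, "mdd_diagnosis": 1}` would become a short profile sentence paired with the answer "yes".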

[76] From Attention to Atoms: Spectral Dictionary Learning for Fast, Interpretable Language Models

Andrew Kiruluta

Main category: cs.CL

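TL;DR: Proposes a spectral generative modeling framework for NLP that replaces the self-attention mechanism in Transformers by jointly learning a global time-varying Fourier dictionary and per-token mixing coefficients, achieving linear computational complexity.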

Details

Motivation: Address the quadratic computational complexity of self-attention in Transformers while preserving language modeling performance. Method: Combine reconstruction losses in the time domain (embedding reconstruction) and the frequency domain (Short-Time Fourier Transform magnitude matching), and fit a Gaussian Mixture Model (GMM) prior over the learned mixing vectors. Result: Performance competitive with Transformer baselines on benchmarks such as WikiText2 and Penn Treebank, with significantly lower inference latency and memory footprint. Conclusion: Spectral dictionary models offer an efficient, high-performing alternative for scalable language modeling. Abstract: We propose a novel spectral generative modeling framework for natural language processing that jointly learns a global time-varying Fourier dictionary and per-token mixing coefficients, replacing the ubiquitous self-attention mechanism in transformer architectures. By enforcing reconstruction losses in both the time domain (embedding reconstruction) and the frequency domain (via Short-Time Fourier Transform magnitude matching) alongside a standard language modeling objective, and fitting a Gaussian Mixture Model (GMM) prior over the learned mixing vectors, our approach achieves competitive perplexity and generation quality on standard benchmarks such as WikiText2 and Penn Treebank. In contrast to the quadratic computational complexity of self-attention, our method operates with linear complexity, delivering substantial efficiency gains. We demonstrate that spectral dictionary models can achieve competitive performance compared to transformer baselines while significantly reducing inference latency and memory footprint, offering a compelling alternative for scalable language modeling.

[77] Improving Phishing Email Detection Performance of Small Large Language Models

Zijie Lin,Zikang Liu,Hanbo Fan

Main category: cs.CL

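TL;DR: The study examines the effectiveness of small-parameter large language models (LLMs) for phishing email detection and significantly improves their performance through Prompt Engineering, Explanation Augmented Fine-tuning, and Model Ensemble.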

Details

Motivation: Large-parameter LLMs perform well on phishing email detection but demand excessive computational resources, so the study investigates whether small LLMs are a viable, cheaper alternative. Method: Optimize small LLMs with three techniques: Prompt Engineering, Explanation Augmented Fine-tuning, and Model Ensemble. Result: The methods significantly improve performance, raising accuracy from around 0.5 for baseline models to 0.976. Conclusion: With these optimizations, small LLMs can reach high performance on phishing email detection while reducing computational cost. Abstract: Large language models (LLMs) have demonstrated remarkable performance on many natural language processing (NLP) tasks and have been employed in phishing email detection research. However, in current studies, well-performing LLMs typically contain billions or even tens of billions of parameters, requiring enormous computational resources. To reduce computational costs, we investigated the effectiveness of small-parameter LLMs for phishing email detection. These LLMs have around 3 billion parameters and can run on consumer-grade GPUs. However, small LLMs often perform poorly on the phishing email detection task. To address these issues, we designed a set of methods including Prompt Engineering, Explanation Augmented Fine-tuning, and Model Ensemble to improve the phishing email detection capabilities of small LLMs. We validated the effectiveness of our approach through experiments, significantly improving accuracy on the SpamAssassin dataset from around 0.5 for baseline models like Qwen2.5-1.5B-Instruct to 0.976.
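
The abstract names Model Ensemble as one of the three techniques without saying how the individual predictions are combined. As a minimal sketch, assuming plain majority voting over the labels produced by several small models (the function name, the label strings, and the tie-breaking rule are illustrative assumptions):

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str], tie_break: str = "phishing") -> str:
    """Combine per-model labels by majority vote.

    Assumption: ties resolve to `tie_break`, here the more cautious label.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    ranked = Counter(predictions).most_common(2)
    # A tie exists when the two most frequent labels have equal counts.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]
```

Breaking ties toward "phishing" trades some false positives for fewer missed attacks; a deployment with different costs might choose the opposite.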

[78] Linguistic Complexity and Socio-cultural Patterns in Hip-Hop Lyrics

Aayam Bansal,Raghav Agarwal,Kaashvi Jain

Main category: cs.CL

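TL;DR: The paper presents a computational framework for analyzing linguistic complexity and socio-cultural trends in hip-hop lyrics. It finds that vocabulary diversity increased by 23.7%, rhyme density rose by 34.2%, thematic content shifted from social justice toward introspection, and lyric sentiment became more negative during political crises.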

Details

Motivation: Study the linguistic complexity and socio-cultural trends of hip-hop lyrics to reveal the genre's evolution as an art form and as a reflection of social dynamics. Method: Apply natural language processing techniques to 3,814 songs, quantifying vocabulary diversity, rhyme density, thematic content, and sentiment polarity. Result: Vocabulary diversity increased by 23.7%, rhyme density rose by 34.2%, thematic content shifted from social justice toward introspection, and sentiment became more negative during political crises. Conclusion: The study provides quantitative evidence for the evolution of hip-hop as an art form and a reflection of social dynamics, revealing the interplay between linguistic innovation and cultural context. Abstract: This paper presents a comprehensive computational framework for analyzing linguistic complexity and socio-cultural trends in hip-hop lyrics. Using a dataset of 3,814 songs from 146 influential artists spanning four decades (1980-2020), we employ natural language processing techniques to quantify multiple dimensions of lyrical complexity. Our analysis reveals a 23.7% increase in vocabulary diversity over the study period, with East Coast artists demonstrating 17.3% higher lexical variation than other regions. Rhyme density increased by 34.2% across all regions, with Midwest artists exhibiting the highest technical complexity (3.04 rhymes per line). Topic modeling identified significant shifts in thematic content, with social justice themes decreasing from 28.5% to 13.8% of content while introspective themes increased from 7.6% to 26.3%. Sentiment analysis demonstrated that lyrics became significantly more negative during sociopolitical crises, with polarity decreasing by 0.31 following major social unrest. Multi-dimensional analysis revealed four distinct stylistic approaches that correlate strongly with geographic origin (r=0.68, p<0.001) and time period (r=0.59, p<0.001). These findings establish quantitative evidence for the evolution of hip-hop as both an art form and a reflection of societal dynamics, providing insights into the interplay between linguistic innovation and cultural context in popular music.
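
The abstract reports a percentage increase in vocabulary diversity but does not say which diversity measure was used. As an illustration only, the sketch below assumes the simple type-token ratio (distinct words divided by total words) and shows how a relative change between two periods would be computed; the tokenizer and function names are assumptions.

```python
import re


def type_token_ratio(lyrics: str) -> float:
    """Distinct word forms divided by total words (assumed diversity measure)."""
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def relative_change_percent(old: float, new: float) -> float:
    """Percentage change from `old` to `new`, e.g. between two decades."""
    return (new - old) / old * 100.0
```

Type-token ratio is sensitive to text length, so comparisons of this kind are usually made on samples of equal size.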

[79] A Framework to Assess the Persuasion Risks Large Language Model Chatbots Pose to Democratic Societies

Zhongren Chen,Joshua Kalla,Quan Le,Shinpei Nakamura-Sakai,Jasjeet Sekhon,Ruixiao Wang

Main category: cs.CL

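TL;DR: The study finds that although LLMs are about as persuasive as traditional campaign ads, real-world political persuasion also depends on message exposure and acceptance. LLM-based persuasion is cheaper ($48-$74 per persuaded voter), but traditional methods are easier to scale.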

Details

Motivation: Examine the potential threat LLMs pose to democratic societies, in particular whether their persuasive capabilities are more cost-effective than traditional campaign methods. Method: Two survey experiments (N = 10,417) and a real-world simulation exercise assess human-LLM interactions and persuasive effects. Result: LLM-based persuasion costs less than traditional methods ($48-$74 vs. $100 per persuaded voter), but traditional methods are easier to scale. Conclusion: LLMs do not currently show substantially greater potential for large-scale political persuasion than traditional methods, though this may change as the technology develops. Abstract: In recent years, significant concern has emerged regarding the potential threat that Large Language Models (LLMs) pose to democratic societies through their persuasive capabilities. We expand upon existing research by conducting two survey experiments and a real-world simulation exercise to determine whether it is more cost-effective to persuade a large number of voters using LLM chatbots compared to standard political campaign practice, taking into account both the "receive" and "accept" steps in the persuasion process (Zaller 1992). These experiments improve upon previous work by assessing extended interactions between humans and LLMs (instead of using single-shot interactions) and by assessing both short- and long-run persuasive effects (rather than simply asking users to rate the persuasiveness of LLM-produced content). In two survey experiments (N = 10,417) across three distinct political domains, we find that while LLMs are about as persuasive as actual campaign ads once voters are exposed to them, political persuasion in the real world depends on both exposure to a persuasive message and its impact conditional on exposure. Through simulations based on real-world parameters, we estimate that LLM-based persuasion costs between $48-$74 per persuaded voter compared to $100 for traditional campaign methods, when accounting for the costs of exposure. However, it is currently much easier to scale traditional campaign persuasion methods than LLM-based persuasion. While LLMs do not currently appear to have substantially greater potential for large-scale political persuasion than existing non-LLM methods, this may change as LLM capabilities continue to improve and it becomes easier to scalably encourage exposure to persuasive LLMs.
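
The abstract's point that persuasion depends on both exposure ("receive") and impact conditional on exposure ("accept") implies a simple cost calculation, sketched below. The formula is a plausible reading rather than the authors' simulation model, and the numbers in the usage note are made-up illustrations, not parameters from the paper.

```python
def cost_per_persuaded_voter(
    cost_per_attempt: float,          # spend per voter targeted (hypothetical input)
    exposure_rate: float,             # P(receive): share of targeted voters who see the message
    persuasion_given_exposure: float, # P(accept | receive): share of exposed voters persuaded
) -> float:
    """Expected spend needed to persuade one voter under a two-step receive/accept model."""
    persuaded_per_attempt = exposure_rate * persuasion_given_exposure
    if persuaded_per_attempt <= 0:
        raise ValueError("exposure and persuasion rates must both be positive")
    return cost_per_attempt / persuaded_per_attempt
```

With purely illustrative inputs of $1.00 per attempt, a 50% exposure rate, and a 4% persuasion rate among exposed voters, the function returns about $50 per persuaded voter. It also shows why a highly persuasive channel can still be expensive when exposure is hard to obtain.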

[80] HyPerAlign: Hypotheses-driven Personalized Alignment

Cristina Garbacea,Chenhao Tan

Main category: cs.CL

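TL;DR: The paper proposes HyPerAlign, a hypotheses-driven personalization approach that infers a user's preferences and style from a few examples and uses them to generate customized LLM outputs, outperforming preference-based fine-tuning methods.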

Details

Motivation: Current LLMs are typically aligned to the preferences of an "average user", but real users have diverse and specific needs that call for personalized outputs. Method: HyPerAlign infers hypotheses about a user's communication strategies, personality, and writing style from a few examples, then prompts the LLM with these hypotheses to generate customized outputs. Result: Strong results on authorship attribution and deliberative alignment: helpfulness improves by up to 70% on average for deliberative alignment, and win-rates are commonly above 90% for authorship attribution. Conclusion: HyPerAlign is an efficient and interpretable strategy for personalizing LLMs to diverse user needs. Abstract: Alignment algorithms are widely used to align large language models (LLMs) to human users based on preference annotations that reflect their intended real-world use cases. Typically these (often divergent) preferences are aggregated over a diverse set of users, resulting in fine-tuned models that are aligned to the "average-user" preference. Nevertheless, current models are used by individual users in very specific contexts and situations, emphasizing the need for user-dependent preference control. In this work we address the problem of personalizing LLM outputs to their users, aiming to generate customized responses tailored to individual users, instead of generic outputs that emulate the collective voices of diverse populations. We propose a novel interpretable and sample-efficient hypotheses-driven personalization approach (HyPerAlign) where given few-shot examples written by a particular user, we first infer hypotheses about their communication strategies, personality and writing style, then prompt LLM models with these hypotheses and user specific attributes to generate customized outputs. We conduct experiments on two different personalization tasks, authorship attribution and deliberative alignment, with datasets from diverse domains (news articles, blog posts, emails, jailbreaking benchmarks), and demonstrate the superiority of the hypotheses-driven personalization approach when compared to preference-based fine-tuning methods. For deliberative alignment, the helpfulness of LLM models is improved by up to 70% on average. For authorship attribution, results indicate consistently high win-rates (commonly >90%) against state-of-the-art preference fine-tuning approaches for LLM personalization across diverse user profiles and LLM models. Overall, our approach represents an interpretable and sample-efficient strategy for the personalization of LLM models to individual users.

[81] Graph RAG for Legal Norms: A Hierarchical and Temporal Approach

Hudson de Martim

Main category: cs.CL

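TL;DR: The article proposes a Graph RAG approach designed specifically for analyzing legal norms, combining knowledge graphs with text segments to handle the complexity and volume of legal data.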

Details

Motivation: Legal norms have a hierarchical structure, internal and external references, and multiple temporal versions, and traditional methods struggle with their complexity and scale. Method: Combine knowledge graphs with Text Units to build richer representations of legal knowledge that integrate hierarchical structure and temporal evolution. Result: Applying Graph RAG to legal norm datasets is presented as a way to significantly advance the field of AI applied to law. Conclusion: The approach creates opportunities for more effective tools in legal research, legislative analysis, and decision support. Abstract: This article proposes an adaptation of Graph Retrieval Augmented Generation (Graph RAG) specifically designed for the analysis and comprehension of legal norms, which are characterized by their predefined hierarchical structure, extensive network of internal and external references and multiple temporal versions. By combining structured knowledge graphs with contextually enriched text segments, Graph RAG offers a promising solution to address the inherent complexity and vast volume of legal data. The integration of hierarchical structure and temporal evolution into knowledge graphs - along with the concept of comprehensive Text Units - facilitates the construction of richer, interconnected representations of legal knowledge. Through a detailed analysis of Graph RAG and its application to legal norm datasets, this article aims to significantly advance the field of Artificial Intelligence applied to Law, creating opportunities for more effective systems in legal research, legislative analysis, and decision support.

[82] Base Models Beat Aligned Models at Randomness and Creativity

Peter West,Christopher Potts

Main category: cs.CL

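TL;DR: Alignment techniques such as RLHF are widely used in LLM development, but the study finds that base models outperform aligned models on certain tasks (random number generation, mixed-strategy games, creative writing), revealing a capability trade-off.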

Details

Motivation: Examine whether alignment techniques should be applied universally, given that aligned models perform poorly on tasks requiring unpredictable outputs. Method: Study random number generation, mixed-strategy games, and creative writing, comparing base models with their aligned counterparts. Result: Aligned models do worse on tasks that require unpredictable outputs and tend toward narrow behaviors; performance on common benchmarks correlates negatively with performance on these tasks. Conclusion: Alignment is not suited to every task, and model capabilities must be weighed against the specific need. Abstract: Alignment has quickly become a default ingredient in LLM development, with techniques such as reinforcement learning from human feedback making models act safely, follow instructions, and perform ever-better on complex tasks. While these techniques are certainly useful, we propose that they should not be universally applied and demonstrate a range of tasks on which base language models consistently outperform their popular aligned forms. Particularly, we study tasks that require unpredictable outputs, such as random number generation, mixed strategy games (rock-paper-scissors and hide-and-seek), and creative writing. In each case, aligned models tend towards narrow behaviors that result in distinct disadvantages, for instance, preferring to generate "7" over other uniformly random numbers, becoming almost fully predictable in some game states, or prioritizing pleasant writing over creative originality. Across models tested, better performance on common benchmarks tends to correlate with worse performance on our tasks, suggesting an effective trade-off in the required capabilities.
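
The abstract's example of a model preferring "7" over other numbers suggests measuring how far sampled outputs deviate from a uniform distribution. The abstract does not say which statistic the authors use; the sketch below assumes Pearson's chi-square statistic against a uniform expectation, implemented by hand so that it needs no third-party library.

```python
from collections import Counter
from typing import Hashable, Sequence


def chi_square_uniform(samples: Sequence[Hashable], categories: Sequence[Hashable]) -> float:
    """Pearson chi-square statistic of `samples` against a uniform distribution.

    0.0 means perfectly uniform counts; larger values mean stronger bias.
    Samples outside `categories` are ignored in the per-category counts.
    """
    if not samples or not categories:
        raise ValueError("samples and categories must be non-empty")
    expected = len(samples) / len(categories)
    counts = Counter(samples)
    return sum((counts.get(c, 0) - expected) ** 2 / expected for c in categories)
```

A model that answers "7" on every one of 100 requests for a number from 1 to 10 yields a statistic of 900, while perfectly even counts yield 0. Turning the statistic into a p-value requires the chi-square distribution with `len(categories) - 1` degrees of freedom.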

Aayam Bansal,Agneya Tharun

Main category: cs.CL

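TL;DR: By analyzing Twitter data, the study examines the relationship between social media sentiment and fashion trends, finding that sentiment patterns can serve as predictors of fashion trends.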

Details

Motivation: Explore the potential of social media sentiment analysis for predicting fashion trends, filling a gap in existing research. Method: Using the T4SA dataset with natural language processing and machine learning techniques, perform sentiment classification, time series decomposition, causal relationship modeling, and related analyses. Result: Sentiment patterns correlate with the popularity of fashion themes, with accessories and streetwear showing significant rising trends; the improved predictive model reaches 78.35% balanced accuracy in sentiment classification. Conclusion: Social media sentiment analysis can serve as an effective early indicator of fashion trends when combined with statistical validation. Abstract: This study explores the intersection of fashion trends and social media sentiment through computational analysis of Twitter data using the T4SA (Twitter for Sentiment Analysis) dataset. By applying natural language processing and machine learning techniques, we examine how sentiment patterns in fashion-related social media conversations can serve as predictors for emerging fashion trends. Our analysis involves the identification and categorization of fashion-related content, sentiment classification with improved normalization techniques, time series decomposition, statistically validated causal relationship modeling, cross-platform sentiment comparison, and brand-specific sentiment analysis. Results indicate correlations between sentiment patterns and fashion theme popularity, with accessories and streetwear themes showing statistically significant rising trends. The Granger causality analysis establishes sustainability and streetwear as primary trend drivers, showing bidirectional relationships with several other themes. The findings demonstrate that social media sentiment analysis can serve as an effective early indicator of fashion trend trajectories when proper statistical validation is applied. Our improved predictive model achieved 78.35% balanced accuracy in sentiment classification, establishing a reliable foundation for trend prediction across positive, neutral, and negative sentiment categories.
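
The reported metric, balanced accuracy, is the average of per-class recall, which keeps a dominant class from inflating the score. The sketch below implements that standard definition directly; the label strings in the usage note are illustrative and the data are not from the paper.

```python
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes present in `y_true`."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    recalls = []
    for cls in set(y_true):
        # Indices of all examples whose true label is `cls`.
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```

For example, with four positive tweets (three classified correctly), one neutral tweet (correct), and one negative tweet (missed), plain accuracy is 4/6, while balanced accuracy is (0.75 + 1 + 0) / 3, because the missed minority class counts as much as the majority class.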

[84] Clustering Internet Memes Through Template Matching and Multi-Dimensional Similarity

Tygo Bloem,Filip Ilievski

Main category: cs.CL

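TL;DR: The paper proposes a method for clustering Internet memes based on template matching and multi-dimensional similarity features, addressing existing methods' reliance on databases, neglect of semantics, and difficulty handling diverse dimensions of similarity.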

Details

Motivation: Meme clustering matters for toxicity detection, virality modeling, and typing, but has received little research attention and faces challenges from multimodality, cultural context, and adaptability. Method: Template-based matching combined with multi-dimensional similarity features (form, visual content, text, and identity), with no predefined database required and support for adaptive matching. Result: The method outperforms existing clustering methods, producing more consistent and coherent clusters, and the similarity-based feature sets support adaptability and alignment with human intuition. Conclusion: The proposed method addresses the meme clustering problem effectively, and the code is publicly available to support follow-up research. Abstract: Meme clustering is critical for toxicity detection, virality modeling, and typing, but it has received little attention in previous research. Clustering similar Internet memes is challenging due to their multimodality, cultural context, and adaptability. Existing approaches rely on databases, overlook semantics, and struggle to handle diverse dimensions of similarity. This paper introduces a novel method that uses template-based matching with multi-dimensional similarity features, thus eliminating the need for predefined databases and supporting adaptive matching. Memes are clustered using local and global features across similarity categories such as form, visual content, text, and identity. Our combined approach outperforms existing clustering methods, producing more consistent and coherent clusters, while similarity-based feature sets enable adaptability and align with human intuition. We make all supporting code publicly available to support subsequent research. Code: https://github.com/tygobl/meme-clustering

[85] A Report on the llms evaluating the high school questions

Zhu Jiawei,Chen Wei

Main category: cs.CL

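TL;DR: Evaluates the performance of large language models (LLMs) in answering high school science questions and explores their potential applications in education.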

Details

Motivation: With the rapid development of LLMs in natural language processing, their application in education has attracted widespread attention. Method: Use mathematics questions from the 2019-2023 college entrance examinations as evaluation data, obtain answers from at least eight LLM APIs, and assess them on accuracy, response time, logical reasoning, and creativity. Result: LLMs perform excellently in certain aspects, but logical reasoning and creative problem-solving still leave room for improvement. Conclusion: The report provides an empirical foundation for further research and application of LLMs in education and offers suggestions for improvement. Abstract: This report aims to evaluate the performance of large language models (LLMs) in solving high school science questions and to explore their potential applications in the educational field. With the rapid development of LLMs in the field of natural language processing, their application in education has attracted widespread attention. This study selected mathematics exam questions from the college entrance examinations (2019-2023) as evaluation data and utilized at least eight LLM APIs to provide answers. A comprehensive assessment was conducted based on metrics such as accuracy, response time, logical reasoning, and creativity. Through an in-depth analysis of the evaluation results, this report reveals the strengths and weaknesses of LLMs in handling high school science questions and discusses their implications for educational practice. The findings indicate that although LLMs perform excellently in certain aspects, there is still room for improvement in logical reasoning and creative problem-solving. This report provides an empirical foundation for further research and application of LLMs in the educational field and offers suggestions for improvement.

[86] BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition

Paige Tuttösí,Mantaj Dhillon,Luna Sang,Shane Eastwood,Poorvi Bhatia,Quang Minh Dinh,Avni Kapoor,Yewon Jin,Angelica Lim

Main category: cs.CL

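TL;DR: The paper introduces the BERSt dataset for evaluating speech recognition tasks such as ASR and SER in complex environments, and shows the challenges it poses for distanced speech and emotion recognition.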

Details

Motivation: Although speech recognition tasks approach human performance on some metrics, they perform poorly in complex real-world conditions such as distanced speech. Existing datasets mostly rely on multi-microphone arrays, and BERSt aims to fill this gap. Method: Collect almost 4 hours of English speech from 98 actors in their own homes, covering varied accents, emotions, and vocal modes (shouted and spoken). Smartphones were placed in 19 different positions to simulate realistic scenarios. Result: ASR performance degrades with distance and shout level, and varies with the intended emotion. The dataset is challenging for both ASR and SER. Conclusion: BERSt provides a new benchmark for speech recognition in complex environments; further work is needed to make such systems robust enough for real-world use. Abstract: Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance in many reported metrics. Yet, they continue to struggle in complex, real-world situations, such as with distanced speech. Previous challenges have released datasets to address the issue of distanced ASR; however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors' homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were placed in 19 different positions, including obstructions and being in a different room than the actor. This data is publicly available for use and can be used to evaluate a variety of speech recognition tasks, including: ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades both with an increase in distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks and continued work is needed to improve the robustness of such systems for more accurate real-world use.

[87] Fact-Consistency Evaluation of Text-to-SQL Generation for Business Intelligence Using Exaone 3.5

Jeho Choi

Main category: cs.CL

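TL;DR: The paper proposes a fact-consistency evaluation framework for assessing the semantic accuracy of LLM-generated SQL outputs and builds a domain-specific benchmark. Experiments show that the LLM performs poorly on complex tasks, underscoring the need for fact-consistency validation and hybrid reasoning approaches.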

Details

Motivation: The use of LLMs in Business Intelligence (BI) settings is limited by semantic hallucinations, structural errors, and the lack of domain-specific evaluation frameworks. Method: Propose a fact-consistency evaluation framework using the Exaone 3.5 LLM and build a benchmark of 219 natural language business questions, evaluated by answer accuracy, execution success rate, semantic error rate, and non-response rate. Result: Exaone 3.5 performs well on simple tasks (93% accuracy in L1) but degrades substantially on arithmetic reasoning (4% accuracy in H1) and grouped ranking tasks (31% accuracy in H4). Conclusion: LLMs have limitations in business-critical environments and need fact-consistency validation layers and hybrid reasoning approaches. The paper contributes a reproducible benchmark and evaluation methodology. Abstract: Large Language Models (LLMs) have shown promise in enabling natural language interfaces for structured data querying through text-to-SQL generation. However, their application in real-world Business Intelligence (BI) contexts remains limited due to semantic hallucinations, structural errors, and a lack of domain-specific evaluation frameworks. In this study, we propose a Fact-Consistency Evaluation Framework for assessing the semantic accuracy of LLM-generated SQL outputs using Exaone 3.5, an instruction-tuned, bilingual LLM optimized for enterprise tasks. We construct a domain-specific benchmark comprising 219 natural language business questions across five SQL complexity levels, derived from actual sales data in LG Electronics' internal BigQuery environment. Each question is paired with a gold-standard SQL query and a validated ground-truth answer. We evaluate model performance using answer accuracy, execution success rate, semantic error rate, and non-response rate. Experimental results show that while Exaone 3.5 performs well on simple aggregation tasks (93% accuracy in L1), it exhibits substantial degradation in arithmetic reasoning (4% accuracy in H1) and grouped ranking tasks (31% in H4), with semantic errors and non-responses concentrated in complex cases. Qualitative error analysis further identifies common failure types such as misapplied arithmetic logic, incomplete filtering, and incorrect grouping operations. Our findings highlight the current limitations of LLMs in business-critical environments and underscore the need for fact-consistency validation layers and hybrid reasoning approaches. This work contributes a reproducible benchmark and evaluation methodology for advancing reliable natural language interfaces to structured enterprise data systems.
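
The four metrics named in the abstract can be illustrated with a small evaluation loop. The sketch below is an assumption-laden stand-in: it uses Python's built-in `sqlite3` instead of BigQuery, and it assumes that a "semantic error" means a query that executes but returns rows different from the ground truth, which the abstract does not define precisely. The function name and case format are hypothetical.

```python
import sqlite3
from typing import Dict, List, Optional, Sequence, Tuple

Case = Tuple[Optional[str], List[tuple]]  # (model-generated SQL or None, ground-truth rows)


def evaluate_text_to_sql(conn: sqlite3.Connection, cases: Sequence[Case]) -> Dict[str, float]:
    """Return answer accuracy, execution success, semantic error, and non-response rates."""
    if not cases:
        raise ValueError("at least one case is required")
    counts = {"answer_accuracy": 0, "execution_success": 0, "semantic_error": 0, "non_response": 0}
    for sql, expected_rows in cases:
        if not sql:
            counts["non_response"] += 1  # the model produced no query
            continue
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # structural error: the query did not run
        counts["execution_success"] += 1
        if rows == expected_rows:
            counts["answer_accuracy"] += 1
        else:
            counts["semantic_error"] += 1  # assumed definition: runs, but wrong answer
    return {name: value / len(cases) for name, value in counts.items()}
```

Comparing result rows rather than SQL strings is what makes this a fact-consistency check: two differently written queries count as equally correct if they return the same answer.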

[88] Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems

Sahar Yarmohammadtoosky,Yiyun Zhou,Victoria Yaneva,Peter Baldwin,Saed Rezayi,Brian Clauser,Polina Harikeo

Main category: cs.CL

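TL;DR: The study finds that transformer-based automated grading systems used in medical education have vulnerabilities that adversarial strategies can exploit. Adversarial training and ensemble techniques significantly improve robustness, and large models such as GPT-4 show promise in recognizing these strategies.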

Details

Motivation: Expose vulnerabilities in automated grading systems to ensure their reliability and fairness in high-stakes settings such as medical education. Method: Identify three adversarial gaming strategies, strengthen system defenses with adversarial training and ensemble techniques (such as majority voting and ridge regression), and test GPT-4 with varied prompting techniques. Result: Adversarial training and ensemble techniques significantly reduce the risk of manipulation, and GPT-4 shows promise in recognizing gaming strategies. Conclusion: AI-driven educational tools need continuous improvement to raise their robustness and fairness in high-stakes environments. Abstract: This study examines vulnerabilities in transformer-based automated short-answer grading systems used in medical education, with a focus on how these systems can be manipulated through adversarial gaming strategies. Our research identifies three main types of gaming strategies that exploit the system's weaknesses, potentially leading to false positives. To counteract these vulnerabilities, we implement several adversarial training methods designed to enhance the systems' robustness. Our results indicate that these methods significantly reduce the susceptibility of grading systems to such manipulations, especially when combined with ensemble techniques like majority voting and ridge regression, which further improve the system's defense against sophisticated adversarial inputs. Additionally, employing large language models such as GPT-4 with varied prompting techniques has shown promise in recognizing and scoring gaming strategies effectively. The findings underscore the importance of continuous improvements in AI-driven educational tools to ensure their reliability and fairness in high-stakes settings.

[89] GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

Siqi Li,Yufan Shen,Xiangnan Chen,Jiayi Chen,Hengwei Ju,Haodong Duan,Song Mao,Hongbin Zhou,Bo Zhang,Pinlong Cai,Licheng Wen,Botian Shi,Yong Liu,Xinyu Cai,Yu Qiao

Main category: cs.CL

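TL;DR: GDI-Bench is a comprehensive document intelligence benchmark with 1.9k images and 19 tasks for evaluating multimodal large language models (MLLMs) and identifying their weaknesses.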

Details

Motivation: Existing benchmarks fail to locate model weaknesses or guide systematic improvement, so a more comprehensive evaluation tool is needed. Method: By decoupling visual complexity from reasoning complexity, GDI-Bench defines graded tasks and analyzes models separately in the visual and reasoning domains. The paper also proposes a GDI Model that uses an intelligence-preserving training strategy to mitigate catastrophic forgetting. Result: The GDI-Bench evaluation shows that GPT-4o excels in reasoning tasks but has limited visual capabilities. The GDI Model achieves state-of-the-art performance on previous benchmarks and on GDI-Bench. Conclusion: GDI-Bench and the GDI Model provide effective evaluation and improvement tools for document intelligence and will be open-sourced. Abstract: The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce a General Document Intelligence Benchmark (GDI-Bench), featuring 1.9k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity and reasoning complexity, the GDI-Bench structures graded tasks that allow performance assessment by difficulty, aiding in model weakness identification and optimization guidance. We evaluate the GDI-Bench on various open-source and closed-source models, conducting decoupled analyses in the visual and reasoning domains. For instance, the GPT-4o model excels in reasoning tasks but exhibits limitations in visual capabilities. To address the diverse tasks and domains in the GDI-Bench, we propose a GDI Model that mitigates the issue of catastrophic forgetting during the supervised fine-tuning (SFT) process through an intelligence-preserving training strategy. Our model achieves state-of-the-art performance on previous benchmarks and the GDI-Bench. Both our benchmark and model will be open source.

[90] ConSens: Assessing context grounding in open-book question answering

Ivan Vankov,Matyo Ivanov,Adriana Correia,Victor Botev

Main category: cs.CL

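TL;DR: Proposes a new metric that contrasts the perplexity of a model response with and without the context, quantifying how much the answer relies on the context and addressing the limitations of existing evaluation methods.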

Details

Motivation: Existing evaluation methods suffer from biases, poor scalability, and dependence on costly external systems, so a more efficient and interpretable metric is needed to assess context utilization in open-book QA systems. Method: Propose a new metric that contrasts the perplexity of the model response with and without the context, producing a score that quantifies how much the answer relies on the context. Result: Experiments show that the metric effectively identifies whether an answer is grounded in the context, and that it is computationally efficient, interpretable, and adaptable. Conclusion: The metric offers a scalable and practical way to assess context utilization in open-book QA systems. Abstract: Large Language Models (LLMs) have demonstrated considerable success in open-book question answering (QA), where the task requires generating answers grounded in a provided external context. A critical challenge in open-book QA is to ensure that model responses are based on the provided context rather than its parametric knowledge, which can be outdated, incomplete, or incorrect. Existing evaluation methods, primarily based on the LLM-as-a-judge approach, face significant limitations, including biases, scalability issues, and dependence on costly external systems. To address these challenges, we propose a novel metric that contrasts the perplexity of the model response under two conditions: when the context is provided and when it is not. The resulting score quantifies the extent to which the model's answer relies on the provided context. The validity of this metric is demonstrated through a series of experiments that show its effectiveness in identifying whether a given answer is grounded in the provided context. Unlike existing approaches, this metric is computationally efficient, interpretable, and adaptable to various use cases, offering a scalable and practical solution to assess context utilization in open-book QA systems.
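
The abstract describes the metric only as a contrast between the response's perplexity with and without the context; the exact formula is not given. The sketch below therefore assumes one plausible form, the log of the perplexity ratio, computed from per-token log-probabilities that a language model would supply. The function names and the choice of a log ratio are assumptions, not the published ConSens definition.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities of each response token."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def context_reliance_score(
    logprobs_with_context: Sequence[float],     # response tokens scored given question + context
    logprobs_without_context: Sequence[float],  # same response tokens scored given question only
) -> float:
    """Assumed contrast: log(PPL without context / PPL with context).

    Positive values mean the context made the response more predictable,
    which suggests the answer is grounded in it; values near zero suggest
    the model could have produced the answer from parametric knowledge.
    """
    return math.log(perplexity(logprobs_without_context) / perplexity(logprobs_with_context))
```

Because both perplexities are computed over the same response tokens, the score needs only two forward passes of a scoring model and no external judge.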

[91] Fine-Tuning LLMs for Low-Resource Dialect Translation: The Case of Lebanese

Silvana Yakhni,Ali Chehab

Main category: cs.CL

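TL;DR: The paper studies how well large language models (LLMs) translate the low-resource Lebanese dialect and finds that culturally authentic data is more effective than larger translated datasets. Among three fine-tuning approaches, models trained on the small, culturally aware dataset (LW) perform best, with contrastive fine-tuning paired with contrastive prompting giving the best results.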

Details

Motivation: Examine the effect of cultural authenticity on low-resource dialect translation and challenge the "more data is better" paradigm. Method: Using open-source Aya23 models, compare three fine-tuning approaches (basic, contrastive, and grammar-hint tuning) and introduce a new benchmark, LebEval. Result: The small, culturally aware dataset (LW) outperforms larger non-native data, and contrastive fine-tuning paired with contrastive prompting works best. Conclusion: Cultural authenticity is crucial in dialect translation, challenging the conventional priority on data scale. Abstract: This paper examines the effectiveness of Large Language Models (LLMs) in translating the low-resource Lebanese dialect, focusing on the impact of culturally authentic data versus larger translated datasets. We compare three fine-tuning approaches: basic, contrastive, and grammar-hint tuning, using open-source Aya23 models. Experiments reveal that models fine-tuned on a smaller but culturally aware Lebanese dataset (LW) consistently outperform those trained on larger, non-native data. The best results were achieved through contrastive fine-tuning paired with contrastive prompting, which indicates the benefits of exposing translation models to bad examples. In addition, to ensure authentic evaluation, we introduce LebEval, a new benchmark derived from native Lebanese content, and compare it to the existing FLoRes benchmark. Our findings challenge the "More Data is Better" paradigm and emphasize the crucial role of cultural authenticity in dialectal translation. We made our datasets and code available on GitHub.

[92] Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs

Jinyan Su,Jennifer Healey,Preslav Nakov,Claire Cardie

Main category: cs.CL

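TL;DR: The study finds a mismatch between reasoning length and answer correctness in large language models (LLMs): they overthink simple problems and underthink hard ones. Reducing generation length with a preference optimization algorithm significantly shortens outputs while maintaining acceptable accuracy.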

Details

Motivation: Examine the relationship between reasoning length and answer correctness in LLMs, showing that models may misjudge problem difficulty and fail to calibrate response length. Method: Analyze the relationship through a systematic empirical study and test the effect of length reduction with a preference optimization algorithm. Result: LLMs overthink simple problems and underthink hard ones; preference optimization significantly reduces generation length while maintaining acceptable accuracy. Conclusion: Generation length is a meaningful signal of reasoning behavior, and further work should explore LLMs' self-awareness in adapting reasoning length. Abstract: Large language models (LLMs) are increasingly optimized for long reasoning, under the assumption that more reasoning leads to better performance. However, emerging evidence suggests that longer responses can sometimes degrade accuracy rather than improve it. In this paper, we conduct a systematic empirical study of the relationship between reasoning length and answer correctness. We find that LLMs tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones, failing to extend their reasoning when it is most needed. This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately. Furthermore, we investigate the effects of length reduction with a preference optimization algorithm when simply preferring the shorter responses regardless of answer correctness. Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy. Our findings highlight generation length as a meaningful signal for reasoning behavior and motivate further exploration into LLMs' self-awareness in reasoning length adaptation.
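
The length-reduction experiment prefers shorter responses regardless of correctness. One way to picture the data such a preference optimization algorithm would consume is sketched below. The abstract does not name the algorithm or describe how pairs are built, so the pairing rule (shortest sample as "chosen", longest as "rejected"), the character-based length, and the function name are all assumptions.

```python
from typing import Dict, List, Sequence


def build_length_preference_pairs(samples: Dict[str, Sequence[str]]) -> List[Dict[str, str]]:
    """Build (prompt, chosen, rejected) records that always prefer the shorter response.

    `samples` maps each prompt to several sampled responses. Correctness is
    deliberately ignored, mirroring the setting described in the abstract.
    Length is measured in characters here; a tokenizer count would be the
    more usual choice in practice.
    """
    pairs = []
    for prompt, responses in samples.items():
        if len(responses) < 2:
            continue  # a preference pair needs two candidates
        ordered = sorted(responses, key=len)
        shortest, longest = ordered[0], ordered[-1]
        if len(shortest) == len(longest):
            continue  # no length signal to learn from
        pairs.append({"prompt": prompt, "chosen": shortest, "rejected": longest})
    return pairs
```

Records in this prompt/chosen/rejected shape are the usual input to pairwise preference objectives.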

[93] AdaptMI: Adaptive Skill-based In-context Math Instruction for Small Language Models

Yinghui He,Abhishek Panigrahi,Yong Lin,Sanjeev Arora

Main category: cs.CL

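TL;DR: The paper investigates the performance gap of small language models (SLMs) in in-context learning (ICL) and proposes the adaptive method AdaptMI and its extension AdaptMI+, which improve SLM accuracy on math tasks by selectively introducing skill-based examples.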

Details

Motivation: To examine how skill-based prompting affects SLM performance, in particular the cognitive overload it can cause on easy questions. Method: Proposes AdaptMI and AdaptMI+, which draw on cognitive load theory to select skill-based examples adaptively, introducing them only when the model performs poorly. Result: In 5-shot evaluations across several math benchmarks and five SLMs, AdaptMI+ improves accuracy by up to 6% over naive skill-based strategies. Conclusion: Adaptive selection of skill-based examples addresses the ICL performance gap of SLMs and suggests a new approach to efficient learning in small models. Abstract: In-context learning (ICL) allows a language model to improve its problem-solving capability when provided with suitable information in context. Since the choice of in-context information can be determined based on the problem itself, in-context learning is analogous to human learning from teachers in a classroom. Recent works (Didolkar et al., 2024a; 2024b) show that ICL performance can be improved by leveraging a frontier large language model's (LLM) ability to predict required skills to solve a problem, popularly referred to as an LLM's metacognition, and using the recommended skills to construct necessary in-context examples. While this skill-based strategy boosts ICL performance in larger models, its gains on small language models (SLMs) have been minimal, highlighting a performance gap in ICL capabilities. We investigate this gap and show that skill-based prompting can hurt SLM performance on easy questions by introducing unnecessary information, akin to cognitive overload. To address this, we introduce AdaptMI, an adaptive approach to selecting skill-based in-context Math Instructions for SLMs. Inspired by cognitive load theory from human pedagogy, our method only introduces skill-based examples when the model performs poorly. We further propose AdaptMI+, which adds examples targeted to the specific skills missing from the model's responses. On 5-shot evaluations across popular math benchmarks and five SLMs (1B--7B; Qwen, Llama), AdaptMI+ improves accuracy by up to 6% over naive skill-based strategies.
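The adaptive selection rule described in this abstract can be illustrated with a short sketch. Everything named here is hypothetical: the abstract does not say how "performs poorly" is detected or how missing skills are identified, so both are passed in as plain inputs (`model_struggles`, `missing_skills`), and `skill_bank` and `generic_examples` stand in for whatever example pools the authors use.

```python
def choose_in_context_examples(model_struggles, missing_skills, skill_bank,
                               generic_examples, k=5):
    """Pick k in-context examples for one question.

    model_struggles: bool, hypothetical signal that the model did poorly.
    missing_skills: skills judged absent from the model's response.
    skill_bank: dict mapping a skill name to a list of worked examples.
    generic_examples: fallback examples that are not tied to any skill.
    """
    if not model_struggles:
        # Easy question: avoid the extra skill-based material entirely.
        return generic_examples[:k]

    chosen = []
    for skill in missing_skills:
        for example in skill_bank.get(skill, []):
            if example not in chosen:
                chosen.append(example)
            if len(chosen) == k:
                return chosen

    # Pad with generic examples if the targeted skills did not fill k slots.
    for example in generic_examples:
        if len(chosen) == k:
            break
        if example not in chosen:
            chosen.append(example)
    return chosen
```

The design point the sketch tries to capture is the branch on `model_struggles`: skill-targeted examples are added only for questions the model gets wrong, which is the behavior the abstract attributes to AdaptMI and AdaptMI+.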

[94] IP-CRR: Information Pursuit for Interpretable Classification of Chest Radiology Reports

Yuyan Ge,Kwan Ho Ryan Chan,Pablo Messina,René Vidal

Main category: cs.CL

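TL;DR: Proposes an interpretable AI framework that classifies radiology reports by extracting key queries and their answers, improving the trustworthiness and usability of diagnoses.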

Details

Motivation: Existing AI methods for analyzing radiology reports lack interpretability, which has hindered their clinical adoption. Method: Uses the Information Pursuit framework to select key queries, the Flan-T5 model to determine whether facts are present in a report, and a classifier to predict the disease. Result: Experiments on the MIMIC-CXR dataset confirm the effectiveness of the method. Conclusion: The framework has the potential to improve the trustworthiness and clinical usability of medical AI. Abstract: The development of AI-based methods for analyzing radiology reports could lead to significant advances in medical diagnosis--from improving diagnostic accuracy to enhancing efficiency and reducing workload. However, the lack of interpretability in these methods has hindered their adoption in clinical settings. In this paper, we propose an interpretable-by-design framework for classifying radiology reports. The key idea is to extract a set of most informative queries from a large set of reports and use these queries and their corresponding answers to predict a diagnosis. Thus, the explanation for a prediction is, by construction, the set of selected queries and answers. We use the Information Pursuit framework to select informative queries, the Flan-T5 model to determine if facts are present in the report, and a classifier to predict the disease. Experiments on the MIMIC-CXR dataset demonstrate the effectiveness of the proposed method, highlighting its potential to enhance trust and usability in medical AI.

[95] Enriching the Korean Learner Corpus with Multi-reference Annotations and Rubric-Based Scoring

Jayoung Song,KyungTae Lim,Jungyeul Park

Main category: cs.CL

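TL;DR: The paper enhances the KoLLA Korean learner corpus with multiple grammatical error correction references and rubric-based scores to support more flexible research on Korean L2 writing.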

Details

Motivation: Global interest in Korean language education is growing, yet learner corpora tailored to Korean L2 writing are lacking. Method: Enhances the KoLLA corpus by adding multiple grammatical error correction references and rubric-based scores. Result: The corpus becomes a standardized resource for Korean L2 education, supporting language learning, assessment, and automated error correction. Conclusion: The enhanced KoLLA offers a more flexible and standardized tool for research in Korean L2 education. Abstract: Despite growing global interest in Korean language education, there remains a significant lack of learner corpora tailored to Korean L2 writing. To address this gap, we enhance the KoLLA Korean learner corpus by adding multiple grammatical error correction (GEC) references, thereby enabling more nuanced and flexible evaluation of GEC systems and reflecting the variability of human language. Additionally, we enrich the corpus with rubric-based scores aligned with guidelines from the Korean National Language Institute, capturing grammatical accuracy, coherence, and lexical diversity. These enhancements make KoLLA a robust and standardized resource for research in Korean L2 education, supporting advancements in language learning, assessment, and automated error correction.

[96] Consistency in Language Models: Current Landscape, Challenges, and Future Directions

Jekaterina Novikova,Carol Anderson,Borhane Blili-Hamelin,Subhabrata Majumdar

Main category: cs.CL

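TL;DR: The paper surveys the state of consistency research in AI language systems, covering formal and informal consistency, analyzes the shortcomings of current evaluation methods, and calls for more robust benchmarks and interdisciplinary approaches.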

Details

Motivation: Human language use is consistent, whereas current language models struggle to stay consistent across scenarios, a gap that calls for research. Method: Analyzes approaches to evaluating formal consistency (such as logical rule adherence) and informal consistency (such as moral and factual coherence), and identifies research gaps. Result: Finds gaps in the standardization of definitions, multilingual assessment, and methods to improve consistency. Conclusion: Robust benchmarks and interdisciplinary approaches are needed to ensure the consistency of language models on domain-specific tasks while preserving their utility and adaptability. Abstract: The hallmark of effective language use lies in consistency -- expressing similar meanings in similar contexts and avoiding contradictions. While human communication naturally demonstrates this principle, state-of-the-art language models struggle to maintain reliable consistency across different scenarios. This paper examines the landscape of consistency research in AI language systems, exploring both formal consistency (including logical rule adherence) and informal consistency (such as moral and factual coherence). We analyze current approaches to measure aspects of consistency, identify critical research gaps in standardization of definitions, multilingual assessment, and methods to improve consistency. Our findings point to an urgent need for robust benchmarks to measure and interdisciplinary approaches to ensure consistency in the application of language models on domain-specific tasks while preserving the utility and adaptability.

[97] Enhancing AI-Driven Education: Integrating Cognitive Frameworks, Linguistic Feedback Analysis, and Ethical Considerations for Improved Content Generation

Antoun Yaacoub,Sansiri Tarnpradab,Phattara Khumprom,Zainab Assaghir,Lionel Prevost,Jérôme Da-Rugna

Main category: cs.CL

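TL;DR: The paper proposes a comprehensive framework that integrates cognitive assessment, linguistic analysis, and ethical design to raise the quality and ethical standards of AI educational tools.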

Details

Motivation: AI holds great potential in education, but the quality, cognitive depth, and ethical implications of generated content need attention. Method: Integrates Bloom's Taxonomy, SOLO Taxonomy, linguistic analysis, and ethical principles into a three-phase approach (cognitive alignment, linguistic feedback integration, ethical safeguards). Result: The framework is applied to the OneClickQuiz plugin, demonstrating its practical use. Conclusion: Provides educators, researchers, and developers with a practical guide that balances AI's potential with pedagogical ethics. Abstract: Artificial intelligence (AI) is rapidly transforming education, presenting unprecedented opportunities for personalized learning and streamlined content creation. However, realizing the full potential of AI in educational settings necessitates careful consideration of the quality, cognitive depth, and ethical implications of AI-generated materials. This paper synthesizes insights from four related studies to propose a comprehensive framework for enhancing AI-driven educational tools. We integrate cognitive assessment frameworks (Bloom's Taxonomy and SOLO Taxonomy), linguistic analysis of AI-generated feedback, and ethical design principles to guide the development of effective and responsible AI tools. We outline a structured three-phase approach encompassing cognitive alignment, linguistic feedback integration, and ethical safeguards. The practical application of this framework is demonstrated through its integration into OneClickQuiz, an AI-powered Moodle plugin for quiz generation. This work contributes a comprehensive and actionable guide for educators, researchers, and developers aiming to harness AI's potential while upholding pedagogical and ethical standards in educational content generation.

[98] KoACD: The First Korean Adolescent Dataset for Cognitive Distortion Analysis

JunSeo Kim,HyeHyeon Kim

Main category: cs.CL

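TL;DR: The study presents KoACD, the first large-scale dataset of cognitive distortions in Korean adolescents, and uses a multi-LLM negotiation method to refine classification and generate synthetic data. Validation shows that LLMs classify distortions with explicit markers well but fall short of human evaluators on context-dependent reasoning.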

Details

Motivation: Existing work focuses mainly on small-scale adult datasets; large-scale research on cognitive distortions in adolescents, and Korean adolescents in particular, is lacking. Method: Applies a multi-LLM negotiation method with two strategies, cognitive clarification and cognitive balancing, to refine classification and generate synthetic data. Result: LLMs classify distortions with explicit markers well, but are less accurate than human evaluators on context-dependent reasoning. Conclusion: KoACD is an important resource for future research on cognitive distortion detection, and it also exposes the limitations of LLMs on complex reasoning tasks. Abstract: Cognitive distortion refers to negative thinking patterns that can lead to mental health issues like depression and anxiety in adolescents. Previous studies using natural language processing (NLP) have focused mainly on small-scale adult datasets, with limited research on adolescents. This study introduces KoACD, the first large-scale dataset of cognitive distortions in Korean adolescents, containing 108,717 instances. We applied a multi-Large Language Model (LLM) negotiation method to refine distortion classification and generate synthetic data using two approaches: cognitive clarification for textual clarity and cognitive balancing for diverse distortion representation. Validation through LLMs and expert evaluations showed that while LLMs classified distortions with explicit markers, they struggled with context-dependent reasoning, where human evaluators demonstrated higher accuracy. KoACD aims to enhance future research on cognitive distortion detection.

[99] CSE-SFP: Enabling Unsupervised Sentence Representation Learning via a Single Forward Pass

Bowen Zhang,Zixin Song,Chunping Li

Main category: cs.CL

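TL;DR: Proposes CSE-SFP, a method that exploits the structural characteristics of generative models to perform efficient contrastive learning in a single forward pass, improving sentence representation quality while reducing training time and memory consumption.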

Details

Motivation: Generative pre-trained language models (PLMs) dominate academia and industry, yet an unsupervised text representation framework tailored to them is missing, so an efficient method is needed to fill the gap. Method: CSE-SFP exploits the structural characteristics of generative models to perform unsupervised contrastive learning with only a single forward pass, and introduces two ratio metrics for evaluating properties of the semantic space. Result: Experiments show that CSE-SFP produces higher-quality embeddings and significantly reduces training time and memory consumption. Conclusion: CSE-SFP provides an efficient unsupervised sentence representation framework for generative PLMs and improves evaluation of the semantic space through new metrics. Abstract: As a fundamental task in Information Retrieval and Computational Linguistics, sentence representation has profound implications for a wide range of practical applications such as text clustering, content analysis, question-answering systems, and web search. Recent advances in pre-trained language models (PLMs) have driven remarkable progress in this field, particularly through unsupervised embedding derivation methods centered on discriminative PLMs like BERT. However, due to time and computational constraints, few efforts have attempted to integrate unsupervised sentence representation with generative PLMs, which typically possess much larger parameter sizes. Given that state-of-the-art models in both academia and industry are predominantly based on generative architectures, there is a pressing need for an efficient unsupervised text representation framework tailored to decoder-only PLMs. To address this concern, we propose CSE-SFP, an innovative method that exploits the structural characteristics of generative models. Compared to existing strategies, CSE-SFP requires only a single forward pass to perform effective unsupervised contrastive learning. Rigorous experimentation demonstrates that CSE-SFP not only produces higher-quality embeddings but also significantly reduces both training time and memory consumption. Furthermore, we introduce two ratio metrics that jointly assess alignment and uniformity, thereby providing a more robust means for evaluating the semantic spatial properties of encoding models.

[100] Red Teaming Large Language Models for Healthcare

Vahid Balazadeh,Michael Cooper,David Pellow,Atousa Assadi,Jennifer Bell,Jim Fackler,Gabriel Funingana,Spencer Gable-Cook,Anirudh Gangadhar,Abhishek Jaiswal,Sumanth Kaja,Christopher Khoury,Randy Lin,Kaden McKeen,Sara Naimimohasses,Khashayar Namdar,Aviraj Newatia,Allan Pang,Anshul Pattoo,Sameer Peesapati,Diana Prepelita,Bogdana Rakova,Saba Sadatamin,Rafael Schulman,Ajay Shah,Syed Azhar Shah,Syed Ahmar Shah,Babak Taati,Balagopal Unnikrishnan,Stephanie Williams,Rahul G Krishnan

Main category: cs.CL

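TL;DR: The paper presents the design and findings of the workshop "Red Teaming Large Language Models for Healthcare" at the 2024 Machine Learning for Healthcare Conference, which aimed to identify model vulnerabilities with the participation of clinical experts.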

Details

Motivation: To combine computational and clinical expertise in order to find LLM vulnerabilities that could cause clinical harm in medical settings, compensating for developers' lack of clinical experience. Method: Organizes a workshop in which clinical and computational experts red-team LLMs together, producing clinical prompts that may lead to harm, then categorizes the vulnerabilities found and runs a replication study. Result: LLM vulnerabilities are identified and categorized, and a replication study assesses them across the LLMs provided. Conclusion: Involving clinical experts in red teaming effectively identifies potential risks of LLMs in healthcare and provides an important basis for model improvement. Abstract: We present the design process and findings of the pre-conference workshop at the Machine Learning for Healthcare Conference (2024) entitled Red Teaming Large Language Models for Healthcare, which took place on August 15, 2024. Conference participants, comprising a mix of computational and clinical expertise, attempted to discover vulnerabilities -- realistic clinical prompts for which a large language model (LLM) outputs a response that could cause clinical harm. Red-teaming with clinicians enables the identification of LLM vulnerabilities that may not be recognised by LLM developers lacking clinical expertise. We report the vulnerabilities found, categorise them, and present the results of a replication study assessing the vulnerabilities across all LLMs provided.

[101] Computational Identification of Regulatory Statements in EU Legislation

Gijs Jan Brandsma,Jens Blom-Hansen,Christiaan Meijer,Kody Moodley

Main category: cs.CL

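TL;DR: The paper proposes two approaches (dependency parsing and a transformer-based machine learning model) for automatically identifying regulatory statements in EU legislation; both perform similarly (accuracies of 80% and 84%), and the paper discusses the potential of combining them.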

Details

Motivation: Identifying regulatory statements in legislation helps measure regulatory density and strictness, and computational methods are key to handling the large body of EU legislation (approximately 180,000 legal acts between 1952 and 2023). Method: Defines regulatory statements based on the institutional grammar tool, then develops and compares two approaches, dependency parsing and a transformer model. Result: The two approaches perform similarly (accuracies of 80% and 84%, K alpha of 0.58), suggesting potential for combining them. Conclusion: High accuracy together with moderate agreement indicates that the strengths of both approaches can be combined to further improve the identification of regulatory statements. Abstract: Identifying regulatory statements in legislation is useful for developing metrics to measure the regulatory density and strictness of legislation. A computational method is valuable for scaling the identification of such statements from a growing body of EU legislation, constituting approximately 180,000 published legal acts between 1952 and 2023. Past work on extraction of these statements varies in the permissiveness of their definitions for what constitutes a regulatory statement. In this work, we provide a specific definition for our purposes based on the institutional grammar tool. We develop and compare two contrasting approaches for automatically identifying such statements in EU legislation, one based on dependency parsing, and the other on a transformer-based machine learning model. We found both approaches performed similarly well with accuracies of 80% and 84% respectively and a K alpha of 0.58. The high accuracies and not exceedingly high agreement suggest potential for combining strengths of both approaches.
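The abstract reports agreement as "K alpha" without expanding the name. Assuming this refers to Krippendorff's alpha for nominal labels, and assuming exactly two label sources with no missing values (both are assumptions of this sketch, not statements from the paper), the statistic can be computed as follows. The function name is hypothetical.

```python
from collections import Counter


def krippendorff_alpha_nominal(labels_a, labels_b):
    """Krippendorff's alpha for two coders, nominal labels, no missing data.

    alpha = 1 - D_o / D_e, where D_o is the observed disagreement and D_e the
    disagreement expected from the pooled label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = 2 * len(labels_a)  # total number of pairable values
    # Each disagreeing unit contributes two off-diagonal coincidences.
    observed = sum(2 for a, b in zip(labels_a, labels_b) if a != b)
    counts = Counter(labels_a) + Counter(labels_b)
    expected = sum(counts[c] * counts[k]
                   for c in counts for k in counts if c != k)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - (n - 1) * observed / expected
```

For example, with labels `[1, 1, 0, 0]` and `[1, 0, 0, 0]` the coders disagree on one of four units, and the function returns `1 - 14/30`, roughly 0.53.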

[102] HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection

Deanna Emery,Michael Goitia,Freddie Vargus,Iulia Neagu

Main category: cs.CL

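TL;DR: The paper introduces the HalluMix Benchmark for detecting hallucinated content in large language models and evaluates seven detection systems, finding substantial performance differences, particularly between short and long contexts.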

Details

Motivation: As large language models are widely deployed in high-stakes domains, detecting hallucinated content (text not grounded in supporting evidence) has become a key challenge, and existing benchmarks have limitations. Method: Introduces the HalluMix Benchmark, a diverse, task-agnostic dataset, and evaluates seven hallucination detection systems on it. Result: Performance differs substantially across systems; Quotient Detections performs best (accuracy 0.82, F1 score 0.84). Conclusion: The HalluMix Benchmark offers a more comprehensive evaluation tool for hallucination detection and reveals performance differences with important implications for real-world use. Abstract: As large language models (LLMs) are increasingly deployed in high-stakes domains, detecting hallucinated content -- text that is not grounded in supporting evidence -- has become a critical challenge. Existing benchmarks for hallucination detection are often synthetically generated, narrowly focused on extractive question answering, and fail to capture the complexity of real-world scenarios involving multi-document contexts and full-sentence outputs. We introduce the HalluMix Benchmark, a diverse, task-agnostic dataset that includes examples from a range of domains and formats. Using this benchmark, we evaluate seven hallucination detection systems -- both open and closed source -- highlighting differences in performance across tasks, document lengths, and input representations. Our analysis highlights substantial performance disparities between short and long contexts, with critical implications for real-world Retrieval Augmented Generation (RAG) implementations. Quotient Detections achieves the best overall performance, with an accuracy of 0.82 and an F1 score of 0.84.
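The two headline numbers in this abstract, accuracy and F1, are standard binary classification metrics. As a reminder of how they relate, here is a small sketch; it assumes label 1 means "hallucinated" and label 0 means "faithful", which is a convention chosen for illustration rather than one stated in the abstract.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, treating 1 as the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # F1 is the harmonic mean of precision and recall; define it as 0.0
    # when there are no true positives at all.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1
```

Because F1 ignores true negatives, it can exceed accuracy when the positive class is common and well detected, which is consistent with a system reporting an F1 above its accuracy.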

[103] 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models

Chong Zhang,Yue Deng,Xiang Lin,Bin Wang,Dianwen Ng,Hai Ye,Xingxuan Li,Yao Xiao,Zhanfeng Mo,Qi Zhang,Lidong Bing

Main category: cs.CL

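TL;DR: This survey summarizes recent replication studies of reasoning language models (RLMs), focusing on supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), and discusses data construction, method design, and training procedures.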

Details

Motivation: The success of models such as DeepSeek-R1 has sparked interest in the explicit reasoning paradigm, but their implementation details have not been fully open-sourced, which has prompted many replication studies. This survey summarizes those studies to inform future research. Method: Analyzes the data construction, method design, and training procedures of replication studies, and summarizes key implementation details and experimental results for SFT and RLVR. Result: Using similar training procedures and open-source data, replication studies reach performance comparable to DeepSeek-R1 and offer feasible strategies for data preparation and method design. Conclusion: The survey gives researchers and developers of RLMs a summary of the latest progress and discusses potential techniques and challenges in enhancing RLMs, with the aim of inspiring new ideas. Abstract: The recent development of reasoning language models (RLMs) represents a novel evolution in large language models. In particular, the recent release of DeepSeek-R1 has generated widespread social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. However, the implementation details of the released models have not been fully open-sourced by DeepSeek, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models. As a result, many replication studies have emerged aiming to reproduce the strong performance achieved by DeepSeek-R1, reaching comparable performance through similar training procedures and fully open-source data resources. These works have investigated feasible strategies for supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), focusing on data preparation and method design, yielding various valuable insights. In this report, we provide a summary of recent replication studies to inspire future research. We primarily focus on SFT and RLVR as two main directions, introducing the details for data construction, method design and training procedure of current replication studies. Moreover, we conclude key findings from the implementation details and experimental results reported by these studies, hoping to inspire future research. We also discuss additional techniques of enhancing RLMs, highlighting the potential of expanding the application scope of these models, and discussing the challenges in development. With this survey, we aim to help researchers and developers of RLMs stay updated with the latest advancements, and seek to inspire new ideas to further enhance RLMs.

[104] Triggering Hallucinations in LLMs: A Quantitative Study of Prompt-Induced Hallucination in Large Language Models

Makoto Sato

Main category: cs.CL

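TL;DR: The paper proposes a prompt-based framework (HIP and HQP) to systematically trigger and quantify hallucination in large language models (LLMs), and finds that hallucination behavior differs across models.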

Details

Motivation: To address the problem of LLMs producing fluent but untrue outputs in real-world applications, and to understand the cognitive dynamics behind hallucination. Method: Uses a Hallucination-Inducing Prompt (HIP) that misleadingly fuses semantically distant concepts, and a Hallucination Quantifying Prompt (HQP) that scores the plausibility, confidence, and coherence of the output. Result: Across multiple LLMs, HIPs consistently produce less coherent and more hallucinated responses, with effects varying by model. Conclusion: The framework provides a reproducible testbed for studying hallucination vulnerability and supports the development of safer, more introspective LLMs. Abstract: Hallucinations in large language models (LLMs) present a growing challenge across real-world applications, from healthcare to law, where factual reliability is essential. Despite advances in alignment and instruction tuning, LLMs can still generate outputs that are fluent yet fundamentally untrue. Understanding the cognitive dynamics that underlie these hallucinations remains an open problem. In this study, we propose a prompt-based framework to systematically trigger and quantify hallucination: a Hallucination-Inducing Prompt (HIP), which synthetically fuses semantically distant concepts (e.g., periodic table of elements and tarot divination) in a misleading way, and a Hallucination Quantifying Prompt (HQP), which scores the plausibility, confidence, and coherence of the output. Controlled experiments across multiple LLMs revealed that HIPs consistently produced less coherent and more hallucinated responses than their null-fusion controls. These effects varied across models, with reasoning-oriented LLMs showing distinct profiles from general-purpose ones. Our framework provides a reproducible testbed for studying hallucination vulnerability, and opens the door to developing safer, more introspective LLMs that can detect and self-regulate the onset of conceptual instability.

[105] FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension

Jushi Kai,Boyi Zeng,Yixuan Wang,Haoli Bai,Bo Jiang,Zhouhan Lin

Main category: cs.CL

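TL;DR: The paper proposes FreqKV, a method that compresses the KV cache in the frequency domain to extend the context window of LLMs efficiently, addressing memory and computational complexity.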

Details

Motivation: Extending the context window of LLMs is essential for long-form generation, but the memory requirements of the KV cache and the quadratic complexity of self-attention pose challenges, and existing methods degrade when extended to longer contexts. Method: Exploits the observation that, in the frequency domain, the energy of the KV cache is concentrated in low-frequency components, and filters out high-frequency components to compress the cache. The proposed technique, FreqKV, needs no additional parameters or architectural modifications. Result: Experiments show that FreqKV is efficient and effective on long-context tasks. Conclusion: By compressing the KV cache in the frequency domain, FreqKV makes context window extension in LLMs markedly more efficient. Abstract: Extending the context window in large language models (LLMs) is essential for applications involving long-form content generation. However, the linear increase in key-value (KV) cache memory requirements and the quadratic complexity of self-attention with respect to sequence length present significant challenges during fine-tuning and inference. Existing methods suffer from performance degradation when extending to longer contexts. In this work, we introduce a novel context extension method that optimizes both fine-tuning and inference efficiency. Our method exploits a key observation: in the frequency domain, the energy distribution of the KV cache is primarily concentrated in low-frequency components. By filtering out the high-frequency components, the KV cache can be effectively compressed with minimal information loss. Building on this insight, we propose an efficient compression technique, FreqKV, that iteratively compresses the increasing KV cache to a fixed size in the frequency domain, applicable to both fine-tuning and inference. FreqKV introduces no additional parameters or architectural modifications. With minimal fine-tuning, LLMs can learn to leverage the limited cache that is compressed in the frequency domain and extend the context window efficiently. Experiments on various long context language modeling and understanding tasks demonstrate the efficiency and efficacy of the proposed method.
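The core idea in this abstract, dropping high-frequency components along the sequence axis to shorten the cache, can be illustrated with one compression step. This is only a sketch under stated assumptions: the abstract does not name the transform, so an orthonormal DCT-II is assumed here, and the function names and the amplitude rescaling are choices made for the illustration, not the authors' implementation.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix of shape (n, n); its inverse is its transpose."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # position index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)
    return m


def compress_kv(states, keep):
    """Compress key or value states of shape (seq_len, dim) to (keep, dim).

    Steps: transform along the sequence axis, keep the `keep` lowest
    frequencies, then invert with a shorter transform. The sqrt(keep / n)
    factor preserves the amplitude of slowly varying signals.
    """
    n = states.shape[0]
    if not 1 <= keep <= n:
        raise ValueError("keep must be between 1 and the sequence length")
    coeffs = dct_matrix(n) @ states       # sequence axis -> frequency axis
    low = coeffs[:keep]                   # discard high-frequency rows
    return np.sqrt(keep / n) * (dct_matrix(keep).T @ low)
```

A constant sequence is a useful sanity check: all of its energy sits in the zero-frequency coefficient, so compressing it from 8 positions to 4 returns the same constant at every remaining position. The abstract describes applying such a step iteratively as the cache grows; the sketch shows a single step only.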

[106] Block Circulant Adapter for Large Language Models

Xinyu Ding,Meiqi Wang,Siyu Liao,Zhongfeng Wang

Main category: cs.CL

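TL;DR: Proposes a block circulant matrix-based fine-tuning method that exploits the properties of circulant matrices and one-dimensional Fourier transforms to substantially reduce storage and computation costs while maintaining task performance.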

Details

Motivation: Fine-tuning large language models (LLMs) is costly because of their huge size, so a more efficient method is needed. Method: Uses block circulant matrices with a stable training heuristic, combined with one-dimensional Fourier transforms, to reduce storage and computation overhead. Result: Experiments show the method uses 14 times fewer parameters than VeRA, is 16 times smaller than LoRA, and needs 32 times fewer FLOPs than FourierFT, with close or better performance. Conclusion: The method offers a promising frequency-domain route to fine-tuning large models on downstream tasks. Abstract: Fine-tuning large language models (LLMs) is difficult due to their huge model size. Recent Fourier domain-based methods show potential for reducing fine-tuning costs. We propose a block circulant matrix-based fine-tuning method with a stable training heuristic to leverage the properties of circulant matrices and one-dimensional Fourier transforms to reduce storage and computation costs. Experiments show that our method uses $14\times$ fewer parameters than VeRA, is $16\times$ smaller than LoRA, and needs $32\times$ fewer FLOPs than FourierFT, while maintaining close or better task performance. Our approach presents a promising way in the frequency domain to fine-tune large models on downstream tasks.
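The storage and compute savings mentioned in this abstract rest on a standard property: a circulant matrix is fully determined by its first column, and multiplying by it is a circular convolution that the FFT diagonalizes. The sketch below demonstrates that property for a block circulant weight; the function names and the `(rows, cols, b)` layout are assumptions made for illustration, and the paper's training heuristic is not reproduced.

```python
import numpy as np


def block_circulant_matvec(blocks, x):
    """Multiply a block circulant matrix by x using 1-D FFTs.

    blocks: array of shape (rows, cols, b); blocks[r, c] is the first column
            of the b x b circulant block at block position (r, c).
    x:      vector of length cols * b.
    """
    rows, cols, b = blocks.shape
    x_freq = np.fft.fft(x.reshape(cols, b), axis=-1)
    block_freq = np.fft.fft(blocks, axis=-1)
    # Per frequency bin, each circulant block acts as a scalar multiplier;
    # sum the contributions over the column blocks.
    y_freq = np.einsum("rcb,cb->rb", block_freq, x_freq)
    return np.real(np.fft.ifft(y_freq, axis=-1)).reshape(rows * b)


def dense_from_blocks(blocks):
    """Materialize the equivalent dense matrix (for checking only)."""
    rows, cols, b = blocks.shape
    # Entry (i, j) of a circulant block with first column c is c[(i - j) mod b].
    idx = (np.arange(b)[:, None] - np.arange(b)[None, :]) % b
    return np.block([[blocks[r, c][idx] for c in range(cols)]
                     for r in range(rows)])
```

With `blocks` of shape `(2, 3, 4)` the adapter stores 24 numbers, whereas the equivalent dense matrix of shape `(8, 12)` stores 96, which is the kind of parameter reduction the abstract refers to.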

[107] FineScope : Precision Pruning for Domain-Specialized Large Language Models Using SAE-Guided Self-Data Cultivation

Chaitali Bhattacharyya,Yeseong Kim

Main category: cs.CL

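TL;DR: The FineScope framework uses sparse autoencoders and structured pruning to derive compact, domain-optimized LLMs from large pretrained models, and improves their performance through self-data distillation.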

Details

Motivation: Training large language models requires substantial compute, which motivates smaller, domain-specific LLMs that keep both efficiency and performance. Method: FineScope uses a sparse autoencoder to extract domain-specific subsets, applies structured pruning to retain essential knowledge, and restores performance through self-data distillation. Result: FineScope performs strongly on domain-specific tasks, outperforming several large LLMs, and SAE-curated datasets help pruned models recover performance. Conclusion: FineScope is an efficient way both to produce compact domain-optimized models and to improve the domain accuracy of pretrained models. Abstract: Training large language models (LLMs) from scratch requires significant computational resources, driving interest in developing smaller, domain-specific LLMs that maintain both efficiency and strong task performance. Medium-sized models such as LLaMA have served as starting points for domain-specific adaptation, but they often suffer from accuracy degradation when tested on specialized datasets. We introduce FineScope, a framework for deriving compact, domain-optimized LLMs from larger pretrained models. FineScope leverages the Sparse Autoencoder (SAE) framework, inspired by its ability to produce interpretable feature representations, to extract domain-specific subsets from large datasets. We apply structured pruning with domain-specific constraints, ensuring that the resulting pruned models retain essential knowledge for the target domain. To further enhance performance, these pruned models undergo self-data distillation, leveraging SAE-curated datasets to restore key domain-specific information lost during pruning. Extensive experiments and ablation studies demonstrate that FineScope achieves highly competitive performance, outperforming several large-scale state-of-the-art LLMs in domain-specific tasks. Additionally, our results show that FineScope enables pruned models to regain a substantial portion of their original performance when fine-tuned with SAE-curated datasets. Furthermore, applying these datasets to fine-tune pretrained LLMs without pruning also improves their domain-specific accuracy, highlighting the robustness of our approach. The code will be released.

[108] The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)

Zihao Wang,Yibo Jiang,Jiahao Yu,Heqing Huang

Main category: cs.CL

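TL;DR: The paper explores how adjusting token-wise cues in the input encoding helps large language models (LLMs) distinguish roles (such as system instructions and user queries) more reliably, rather than relying on superficial proxies.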

Details

Motivation: Ensuring that LLMs accurately distinguish inputs from different roles (role separation) is crucial for consistent multi-role behavior, yet existing methods may merely memorize known triggers rather than truly separate roles. Method: Proposes a simple, controlled experimental framework, which shows that fine-tuned models rely on task type and proximity to begin-of-text as proxies, and reinforces role boundaries by adjusting token-wise cues in the input encoding (such as position IDs). Result: Manipulating position IDs and related cues helps the model distinguish roles more clearly and reduces reliance on superficial proxies. Conclusion: From a mechanism-centered perspective, the paper shows how LLMs can maintain multi-role behavior more reliably instead of merely memorizing known prompts or triggers. Abstract: Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role -- a concept we call role separation -- is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine role-separation learning: the process of teaching LLMs to robustly distinguish system and user tokens. Through a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation can partially mitigate these shortcuts, it generally leads to iterative patching rather than a deeper fix. To address this, we propose reinforcing invariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, manipulating position IDs helps the model learn clearer distinctions and reduces reliance on superficial proxies. By focusing on this mechanism-centered perspective, our work illuminates how LLMs can more reliably maintain consistent multi-role behavior without merely memorizing known prompts or triggers.
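The abstract says that manipulating position IDs helps mark role boundaries but does not spell out the manipulation. One simple possibility, shown here purely as an assumption, is to leave a fixed gap in the position IDs between the system tokens and the user tokens, so that the boundary is visible in the positional signal itself. The function name and the `gap` parameter are hypothetical.

```python
def position_ids_with_gap(num_system_tokens, num_user_tokens, gap=0):
    """Build position IDs with an optional jump at the system/user boundary.

    With gap=0 this reduces to the usual consecutive IDs 0, 1, 2, ...
    With gap > 0 the user tokens start `gap` positions later, which gives
    the model a positional cue for where the role changes.
    """
    if min(num_system_tokens, num_user_tokens, gap) < 0:
        raise ValueError("token counts and gap must be non-negative")
    system_ids = list(range(num_system_tokens))
    user_start = num_system_tokens + gap
    user_ids = list(range(user_start, user_start + num_user_tokens))
    return system_ids + user_ids
```

Many transformer implementations accept explicit position IDs alongside the input tokens, so a list like this could be supplied during fine-tuning and inference; whether the authors' scheme matches this one is not confirmed by the abstract.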

[109] Large Language Models Understanding: an Inherent Ambiguity Barrier

Daniel N. Nissani

Main category: cs.CL

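TL;DR: Using a thought experiment and semi-formal argument, the paper claims that LLMs face an inherent ambiguity barrier that prevents them from truly understanding the meaning of dialogues.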

Details

Motivation: Amid the debate over whether LLMs are capable of understanding, the paper offers a theoretical argument against that capability. Method: Uses a thought experiment and semi-formal considerations to expose an ambiguity barrier in LLMs. Result: The argument indicates that the ambiguity barrier prevents LLMs from truly understanding dialogues. Conclusion: Fluent dialogue does not imply understanding; LLMs have an inherent limitation. Abstract: A lively ongoing debate is taking place, since the extraordinary emergence of Large Language Models (LLMs), with regard to their capability to understand the world and capture the meaning of the dialogues in which they are involved. Arguments and counter-arguments have been proposed based upon thought experiments, anecdotal conversations between LLMs and humans, statistical linguistic analysis, philosophical considerations, and more. In this brief paper we present a counter-argument based upon a thought experiment and semi-formal considerations leading to an inherent ambiguity barrier which prevents LLMs from having any understanding of what their amazingly fluent dialogues mean.

[110] On the generalization of language models from in-context learning and finetuning: a controlled study

Andrew K. Lampinen,Arslan Chaudhry,Stephanie C. Y. Chan,Cody Wild,Diane Wan,Alex Ku,Jörg Bornschein,Razvan Pascanu,Murray Shanahan,James L. McClelland

Main category: cs.CL

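TL;DR: The paper examines differences in how large language models generalize from fine-tuning versus in-context learning, and proposes a fine-tuning method augmented with in-context inferences to improve generalization.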

Details

Motivation: To study how generalization differs between fine-tuning and in-context learning in large language models, addressing the limited generalization seen in practical use. Method: Constructs new datasets, compares generalization under fine-tuning and in-context learning, and proposes fine-tuning on data augmented with in-context inferences. Result: In-context learning generalizes more flexibly than fine-tuning in some cases, and fine-tuning with added in-context inferences improves generalization. Conclusion: The study reveals differences in the inductive biases of different learning modes and offers a practical method for improvement. Abstract: Large language models exhibit exciting capabilities, yet can show surprisingly narrow generalization from finetuning -- from failing to generalize to simple reversals of relations they are trained on, to missing logical deductions that can be made from trained information. These failures to generalize from fine-tuning can hinder practical application of these models. However, language models' in-context learning shows different inductive biases, and can generalize better in some of these cases. Here, we explore these differences in generalization between in-context- and fine-tuning-based learning. To do so, we constructed several novel datasets to evaluate and improve models' ability to generalize from finetuning data. The datasets are constructed to isolate the knowledge in the dataset from that in pretraining, to create clean tests of generalization. We expose pretrained large models to controlled subsets of the information in these datasets -- either in context, or through fine-tuning -- and evaluate their performance on test sets that require various types of generalization. We find overall that in data-matched settings, in-context learning can generalize more flexibly than fine-tuning (though we also find some qualifications of prior findings, such as cases when fine-tuning can generalize to reversals embedded in a larger structure of knowledge). We build on these findings to propose a method to enable improved generalization from fine-tuning: adding in-context inferences to finetuning data. We show that this method improves generalization across various splits of our datasets and other benchmarks. Our results have implications for understanding the inductive biases of different modes of learning in language models, and practically improving their performance.

[111] DeepCritic: Deliberate Critique with Large Language Models

Wenkai Yang,Jingwen Chen,Yankai Lin,Ji-Rong Wen

Main category: cs.CL

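TL;DR: The paper proposes a two-stage framework that improves the critique ability of LLMs by generating detailed step-wise math critiques and applying reinforcement learning, significantly outperforming existing models.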

Details

Motivation: As LLMs evolve rapidly, accurate feedback and scalable oversight are urgently needed, and using LLMs as critique models for automated supervision is a promising solution. Method: A two-stage framework: (1) use Qwen2.5-72B-Instruct to generate 4.5K long-form critiques for supervised fine-tuning; (2) further strengthen critique ability through reinforcement learning, using PRM800K or automatically annotated data. Result: The critique model built on Qwen2.5-7B-Instruct significantly outperforms existing LLM critics (including DeepSeek-R1-distill models and GPT-4o) on error identification benchmarks and gives more detailed feedback that helps correct mistakes. Conclusion: The framework effectively improves the math critique ability of LLMs and offers a more reliable approach to automated supervision. Abstract: As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and struggling to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing on each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.
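The abstract mentions automatically annotated data obtained via "Monte Carlo sampling-based correctness estimation" without giving the procedure. A common way to realize that idea, shown here only as an assumed sketch, is to sample several completions from each solution prefix and treat a step as the first error when no completion from that prefix reaches the reference answer. The names `rollout_fn`, `estimate_step_correctness`, and `first_error_index`, and the zero-hit labeling rule, are hypothetical.

```python
import random


def estimate_step_correctness(prefix_steps, rollout_fn, reference_answer,
                              num_rollouts, rng):
    """Fraction of sampled completions from this prefix that reach the answer.

    rollout_fn(prefix_steps, rng) is a hypothetical callable that completes
    the solution from the given prefix and returns its final answer.
    """
    hits = 0
    for _ in range(num_rollouts):
        if rollout_fn(prefix_steps, rng) == reference_answer:
            hits += 1
    return hits / num_rollouts


def first_error_index(steps, rollout_fn, reference_answer,
                      num_rollouts=8, seed=0):
    """Return the index of the first step after which no rollout succeeds.

    Returns None when every prefix still leads to the reference answer in at
    least one rollout, i.e. no erroneous step is detected.
    """
    rng = random.Random(seed)
    for i in range(len(steps)):
        score = estimate_step_correctness(
            steps[: i + 1], rollout_fn, reference_answer, num_rollouts, rng)
        if score == 0.0:
            return i
    return None
```

In practice `rollout_fn` would call a generator model; more rollouts give a less noisy estimate at higher cost, and a threshold above zero could be used instead of the strict zero-hit rule.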

[112] Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions

Yiming Du,Wenyu Huang,Danna Zheng,Zhaowei Wang,Sebastien Montella,Mirella Lapata,Kam-Fai Wong,Jeff Z. Pan

Main category: cs.CL

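TL;DR: This survey reviews the dynamic operations and representation types of memory in AI systems, proposes six fundamental memory operations, and systematically maps them to related research areas, giving a structured perspective on memory research for LLM-based agents.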

Details

Motivation: Prior work has focused on applications of memory with LLMs while overlooking the underlying dynamic operations. This survey aims to fill that gap by reframing memory systems in terms of atomic operations and representation types. Method: Memory representations are categorized as parametric, contextual structured, and contextual unstructured, and six fundamental operations (consolidation, updating, indexing, forgetting, retrieval, compression) are introduced and systematically mapped to related research topics. Result: The survey provides a structured, dynamic perspective on memory research, clarifies the functional interplay of memory in LLM-based agents, and outlines future research directions. Conclusion: Through its framework of atomic operations and representation types, the survey offers systematic tools and directions for research on memory in AI. Abstract: Memory is a fundamental component of AI systems, underpinning large language model (LLM) based agents. While prior surveys have focused on memory applications with LLMs, they often overlook the atomic operations that underlie memory dynamics. In this survey, we first categorize memory representations into parametric, contextual structured, and contextual unstructured and then introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression. We systematically map these operations to the most relevant research topics across long-term, long-context, parametric modification, and multi-source memory. By reframing memory systems through the lens of atomic operations and representation types, this survey provides a structured and dynamic perspective on research, benchmark datasets, and tools related to memory in AI, clarifying the functional interplay in LLM-based agents while outlining promising directions for future research. The paper list, datasets, methods and tools are available at https://github.com/Elvin-Yiming-Du/Survey_Memory_in_AI.

[113] Steering Large Language Models with Register Analysis for Arbitrary Style Transfer

Xinchen Yang,Marine Carpuat

Main category: cs.CL

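TL;DR: This paper proposes a prompting method based on register analysis that guides large language models (LLMs) to perform high-quality example-based arbitrary style transfer.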

Details

Motivation: Although LLMs rewrite text well across many styles, using this ability effectively for example-based style transfer remains a challenge. The key question is how to describe the exemplar's style so that it guides LLMs toward high-quality rewrites. Method: A prompting method based on register analysis guides LLMs through the style transfer task. Result: Experiments across multiple style transfer tasks show that the method increases style transfer strength while better preserving the meaning of the original text. Conclusion: Register-analysis prompting outperforms existing prompting strategies for style transfer and suggests a new way to apply LLMs. Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in rewriting text across various styles. However, effectively leveraging this ability for example-based arbitrary style transfer, where an input text is rewritten to match the style of a given exemplar, remains an open challenge. A key question is how to describe the style of the exemplar to guide LLMs toward high-quality rewrites. In this work, we propose a prompting method based on register analysis to guide LLMs to perform this task. Empirical evaluations across multiple style transfer tasks show that our prompting approach enhances style transfer strength while preserving meaning more effectively than existing prompting strategies.

eess.IV [Back]

[114] SR-NeRV: Improving Embedding Efficiency of Neural Video Representation via Super-Resolution

Taiga Hayami,Kakeru Koizumi,Hiroshi Watanabe

Main category: eess.IV

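TL;DR: Proposes an INR-based video representation method that integrates a super-resolution network, improving the reconstruction quality of high-frequency details.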

Details

Motivation: Conventional INR methods struggle to reconstruct high-frequency details under strict model size constraints, yet these details are critical in video compression. Method: A general-purpose super-resolution network is integrated, and the reconstruction of high-frequency details is delegated to it. Result: Experiments show that the method outperforms conventional INR baselines in reconstruction quality while keeping a comparable model size. Conclusion: Combining INR with a super-resolution network effectively addresses high-frequency detail reconstruction and suits video compression scenarios. Abstract: Implicit Neural Representations (INRs) have garnered significant attention for their ability to model complex signals across a variety of domains. Recently, INR-based approaches have emerged as promising frameworks for neural video compression. While conventional methods primarily focus on embedding video content into compact neural networks for efficient representation, they often struggle to reconstruct high-frequency details under stringent model size constraints, which are critical in practical compression scenarios. To address this limitation, we propose an INR-based video representation method that integrates a general-purpose super-resolution (SR) network. Motivated by the observation that high-frequency components exhibit low temporal redundancy across frames, our method entrusts the reconstruction of fine details to the SR network. Experimental results demonstrate that the proposed method outperforms conventional INR-based baselines in terms of reconstruction quality, while maintaining comparable model sizes.

[115] Rootlets-based registration to the spinal cord PAM50 template

Sandrine Bédard,Jan Valošek,Valeria Oliva,Kenneth A. Weber II,Julien Cohen-Adad

Main category: eess.IV

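TL;DR: Proposes a new registration method based on spinal nerve rootlets that markedly improves registration accuracy and reproducibility for spinal cord functional MRI studies.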

Details

Motivation: Traditional disc-based registration is inaccurate because anatomy varies widely across individuals. This study aims to improve registration by using spinal nerve rootlets. Method: A method that segments dorsal cervical rootlets and aligns them non-linearly with the PAM50 spinal cord template. Result: In validation across multiple sites and across neck positions, rootlet-based registration outperformed the traditional method and increased both activation cluster size and Z scores in task-based fMRI. Conclusion: Rootlet-based registration improves the accuracy and reliability of spinal cord functional MRI registration and yields better spatial normalization for group analyses. Abstract: Spinal cord functional MRI studies require precise localization of spinal levels for reliable voxelwise group analyses. Traditional template-based registration of the spinal cord uses intervertebral discs for alignment. However, substantial anatomical variability across individuals exists between vertebral and spinal levels. This study proposes a novel registration approach that leverages spinal nerve rootlets to improve alignment accuracy and reproducibility across individuals. We developed a registration method leveraging dorsal cervical rootlets segmentation and aligning them non-linearly with the PAM50 spinal cord template. Validation was performed on a multi-subject, multi-site dataset (n=267, 44 sites) and a multi-subject dataset with various neck positions (n=10, 3 sessions). We further validated the method on task-based functional MRI (n=23) to compare group-level activation maps using rootlet-based registration to traditional disc-based methods. Rootlet-based registration showed superior alignment across individuals compared to the traditional disc-based method. Notably, rootlet positions were more stable across neck positions. Group-level analysis of task-based functional MRI using rootlet-based registration increased Z scores and activation cluster size compared to disc-based registration (number of active voxels from 3292 to 7978). Rootlet-based registration enhances both inter- and intra-subject anatomical alignment and yields better spatial normalization for group-level fMRI analyses. Our findings highlight the potential of rootlet-based registration to improve the precision and reliability of spinal cord neuroimaging group analysis.

[116] Efficient and robust 3D blind harmonization for large domain gaps

Hwihun Jeong,Hayeon Lee,Se Young Chun,Jongho Lee

Main category: eess.IV

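TL;DR: BlindHarmonyDiff is a new blind 3D image harmonization framework that uses an edge-to-image model and a 3D rectified flow to address limitations of existing methods in 3D harmonization, such as inter-slice heterogeneity and large domain gaps.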

Details

Motivation: Existing blind harmonization methods suffer from inter-slice heterogeneity in 3D images, moderate image quality, and limited performance under large domain gaps. Method: The BlindHarmonyDiff framework uses an edge-to-image model and a 3D rectified flow, combined with multi-stride patch training and a refinement module, for efficient 3D training and robust inference. Result: Experiments show that BlindHarmonyDiff outperforms existing methods, harmonizes images from diverse source domains to the target domain more faithfully, and performs well on downstream tasks such as tissue segmentation and age prediction. Conclusion: BlindHarmonyDiff is an efficient, robust, and generalizable blind harmonization method applicable to diverse MR scanners. Abstract: Blind harmonization has emerged as a promising technique for MR image harmonization to achieve scale-invariant representations, requiring only target domain data (i.e., no source domain data necessary). However, existing methods face limitations such as inter-slice heterogeneity in 3D, moderate image quality, and limited performance for a large domain gap. To address these challenges, we introduce BlindHarmonyDiff, a novel blind 3D harmonization framework that leverages an edge-to-image model tailored specifically to harmonization. Our framework employs a 3D rectified flow trained on target domain images to reconstruct the original image from an edge map, then yielding a harmonized image from the edge of a source domain image. We propose multi-stride patch training for efficient 3D training and a refinement module for robust inference by suppressing hallucination. Extensive experiments demonstrate that BlindHarmonyDiff outperforms prior art by harmonizing diverse source domain images to the target domain, achieving higher correspondence to the target domain characteristics. Downstream task-based quality assessments such as tissue segmentation and age prediction on diverse MR scanners further confirm the effectiveness of our approach and demonstrate the capability of our robust and generalizable blind harmonization.

[117] Towards Lightweight Hyperspectral Image Super-Resolution with Depthwise Separable Dilated Convolutional Network

Usman Muhammad,Jorma Laaksonen,Lyudmila Mihaylova

Main category: eess.IV

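TL;DR: Proposes a lightweight depthwise separable dilated convolutional network (DSDCN) for hyperspectral image super-resolution, trained with a loss that combines several terms to preserve spectral and spatial details.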

Details

Motivation: Hyperspectral super-resolution is ill-posed because of the high dimensionality of the data and the scarcity of training samples; existing methods depend on large models or on fusion with additional images, which is impractical. Method: MobileNet-style depthwise separable convolutions plus a dilated convolution fusion block, trained with a loss that combines MSE, an L2 regularization constraint, and a spectral angle loss. Result: Very competitive performance on two public hyperspectral datasets. Conclusion: DSDCN is well suited to hyperspectral image super-resolution, and the code is open source. Abstract: Deep neural networks have demonstrated highly competitive performance in super-resolution (SR) for natural images by learning mappings from low-resolution (LR) to high-resolution (HR) images. However, hyperspectral super-resolution remains an ill-posed problem due to the high spectral dimensionality of the data and the scarcity of available training samples. Moreover, existing methods often rely on large models with a high number of parameters or require the fusion with panchromatic or RGB images, both of which are often impractical in real-world scenarios. Inspired by the MobileNet architecture, we introduce a lightweight depthwise separable dilated convolutional network (DSDCN) to address the aforementioned challenges. Specifically, our model leverages multiple depthwise separable convolutions, similar to the MobileNet architecture, and further incorporates a dilated convolution fusion block to make the model more flexible for the extraction of both spatial and spectral features. In addition, we propose a custom loss function that combines mean squared error (MSE), an L2 norm regularization-based constraint, and a spectral angle-based loss, ensuring the preservation of both spectral and spatial details. The proposed model achieves very competitive performance on two publicly available hyperspectral datasets, making it well-suited for hyperspectral image super-resolution tasks. The source codes are publicly available at: https://github.com/Usman1021/lightweight.
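
The three-term loss described in the abstract can be sketched as follows. This is a minimal pure-Python illustration rather than the authors' code: it assumes the terms are combined as a weighted sum, and the weights `lam_l2` and `lam_sam`, like all function names, are hypothetical choices made for the example.

```python
import math


def mse(pred, target):
    """Mean squared error over two equal-length spectra."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def spectral_angle(pred, target, eps=1e-12):
    """Angle in radians between two spectral vectors (0 means same shape)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    cos = max(-1.0, min(1.0, dot / (norm + eps)))  # clamp against rounding
    return math.acos(cos)


def l2_penalty(weights):
    """Sum of squared model weights, a standard L2 regularizer."""
    return sum(w * w for w in weights)


def combined_loss(pred, target, weights, lam_l2=1e-4, lam_sam=0.1):
    """Weighted sum of the three terms; lam_l2 and lam_sam are assumed values."""
    return (mse(pred, target)
            + lam_l2 * l2_penalty(weights)
            + lam_sam * spectral_angle(pred, target))
```

The spectral angle ignores overall brightness and penalizes only changes in the shape of a pixel's spectrum, which is why it complements MSE when the goal is to preserve spectral as well as spatial detail.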

[118] CORSTITCH - A free, open source software for stitching and georeferencing underwater coral reef videos

Julian Christopher L. Maya,Johnenn R. Manalang,Maricor N. Soriano

Main category: eess.IV

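TL;DR: CorStitch is open-source software that automatically creates accurate georeferenced coral reef mosaics from video transects, ready for analysis in geographic information systems.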

Details

Motivation: CorStitch was developed to automate the processing of reef assessment video data and produce accurate mosaics suitable for spatial analysis. Method: Video frames are stitched with a Fourier-based image correlation algorithm and georeferenced using GNSS timestamps. Result: The resulting Keyhole Markup Language files are compatible with geographic information systems, and validation shows consistent and reliable performance. Conclusion: CorStitch efficiently produces georeferenced reef mosaics and offers a reliable tool for coral reef research. Abstract: CorStitch is an open-source software developed to automate the creation of accurate georeferenced reef mosaics from video transects obtained through Automated Rapid Reef Assessment System surveys. We utilized a Fourier-based image correlation algorithm to stitch sequential video frames, aligning them with synchronized GNSS timestamps. The resulting compressed Keyhole Markup Language files, compatible with geographic information systems such as Google Earth, enable detailed spatial analysis. Validation through comparative analysis of mosaics from two temporally distinct surveys of the same reef demonstrated the software's consistent and reliable performance.
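
The abstract does not say which Fourier-based correlation CorStitch uses. As one common instance, the sketch below shows phase correlation on a 1-D signal, which recovers the displacement between two overlapping samples; this choice, the naive DFT, and all names are assumptions for illustration, and a real stitcher would apply the same idea in 2-D with an FFT library.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (fine for a short demo)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def estimate_shift(reference, moved):
    """Return s such that moved[i] == reference[(i - s) % n] (circular shift)."""
    ref_f, mov_f = dft(reference), dft(moved)
    cross = []
    for a, b in zip(ref_f, mov_f):
        product = b * a.conjugate()
        magnitude = abs(product)
        # Keep only the phase; skip frequency bins that carry no energy.
        cross.append(product / magnitude if magnitude > 1e-12 else 0j)
    correlation = dft(cross, inverse=True)
    # The correlation peaks at the displacement between the two signals.
    return max(range(len(correlation)), key=lambda i: correlation[i].real)


row = [0, 1, 3, 7, 2, 0, 0, 0]
shifted_row = row[-3:] + row[:-3]   # the same row moved 3 samples to the right
offset = estimate_shift(row, shifted_row)
```

Here `offset` is the estimated displacement between the two rows; in a stitching pipeline, such per-frame offsets would determine where each new frame is placed in the mosaic.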

[119] A Methodological and Structural Review of Parkinsons Disease Detection Across Diverse Data Modalities

Abu Saleh Musa Miah,taro Suzuki,Jungpil Shin

Main category: eess.IV

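TL;DR: This paper reviews multimodal approaches to Parkinson's Disease (PD) recognition systems, addressing the narrow scope of existing surveys and aiming to give researchers a comprehensive resource.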

Details

Motivation: Early and accurate diagnosis of PD is crucial for improving patient outcomes, yet existing surveys are mostly limited to single data modalities and do not capture the potential of multimodal approaches. Method: Drawing on 347 articles, the review analyzes data modalities including MRI, gait analysis, handwriting analysis, and speech tests, together with multimodal fusion techniques. Result: It surveys data collection methods, feature representations, and system performance, with a focus on recognition accuracy and robustness. Conclusion: By combining multimodal approaches with current machine learning techniques, the review aims to advance PD diagnostics and support improved patient care. Abstract: Parkinson's Disease (PD) is a progressive neurological disorder that primarily affects motor functions and can lead to mild cognitive impairment (MCI) and dementia in its advanced stages. With approximately 10 million people diagnosed globally (1 to 1.8 per 1,000 individuals, according to reports by the Japan Times and the Parkinson Foundation), early and accurate diagnosis of PD is crucial for improving patient outcomes. While numerous studies have utilized machine learning (ML) and deep learning (DL) techniques for PD recognition, existing surveys are limited in scope, often focusing on single data modalities and failing to capture the potential of multimodal approaches. To address these gaps, this study presents a comprehensive review of PD recognition systems across diverse data modalities, including Magnetic Resonance Imaging (MRI), gait-based pose analysis, gait sensory data, handwriting analysis, speech test data, Electroencephalography (EEG), and multimodal fusion techniques. Based on over 347 articles from leading scientific databases, this review examines key aspects such as data collection methods, settings, feature representations, and system performance, with a focus on recognition accuracy and robustness. This survey aims to serve as a comprehensive resource for researchers, providing actionable guidance for the development of next generation PD recognition systems. By leveraging diverse data modalities and cutting-edge machine learning paradigms, this work contributes to advancing the state of PD diagnostics and improving patient care through innovative, multimodal approaches.

[120] Deep Learning Assisted Outer Volume Removal for Highly-Accelerated Real-Time Dynamic MRI

Merve Gülle,Sebastian Weingärtner,Mehmet Akçakaya

Main category: eess.IV

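TL;DR: Proposes a new outer volume removal (OVR) method that uses a deep learning model to eliminate artifacts in real-time dynamic MRI and improve image quality.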

Details

Motivation: Real-time dynamic MRI is essential for capturing rapid physiological processes, but at high acceleration rates it is prone to artifacts that hamper cardiac function assessment. Method: The outer volume signal is estimated from composite temporal images of time-interleaved undersampling patterns, a deep learning model is trained to remove the artifacts, and the images are reconstructed with a physics-driven DL method. Result: At high acceleration rates, image quality is comparable to clinical baseline images and better than conventional reconstruction techniques. Conclusion: The method reduces artifacts effectively without modifying the acquisition, offering a practical solution for real-time dynamic MRI. Abstract: Real-time (RT) dynamic MRI plays a vital role in capturing rapid physiological processes, offering unique insights into organ motion and function. Among these applications, RT cine MRI is particularly important for functional assessment of the heart with high temporal resolution. RT imaging enables free-breathing, ungated imaging of cardiac motion, making it a crucial alternative for patients who cannot tolerate conventional breath-hold, ECG-gated acquisitions. However, achieving high acceleration rates in RT cine MRI is challenging due to aliasing artifacts from extra-cardiac tissues, particularly at high undersampling factors. In this study, we propose a novel outer volume removal (OVR) method to address this challenge by eliminating aliasing contributions from non-cardiac regions in a post-processing framework. Our approach estimates the outer volume signal for each timeframe using composite temporal images from time-interleaved undersampling patterns, which inherently contain pseudo-periodic ghosting artifacts. A deep learning (DL) model is trained to identify and remove these artifacts, producing a clean outer volume estimate that is subsequently subtracted from the corresponding k-space data. The final reconstruction is performed with a physics-driven DL (PD-DL) method trained using an OVR-specific loss function to restore high spatio-temporal resolution images. Experimental results show that the proposed method at high accelerations achieves image quality that is visually comparable to clinical baseline images, while outperforming conventional reconstruction techniques, both qualitatively and quantitatively. The proposed approach provides a practical and effective solution for artifact reduction in RT cine MRI without requiring acquisition modifications, offering a pathway to higher acceleration rates while preserving diagnostic quality.

[121] GuideSR: Rethinking Guidance for One-Step High-Fidelity Diffusion-Based Super-Resolution

Aditya Arora,Zhengzhong Tu,Yufei Wang,Ruizheng Bai,Jian Wang,Sizhuo Ma

Main category: eess.IV

TL;DR: GuideSR is a new single-step diffusion-based image super-resolution model that improves image fidelity through a dual-branch architecture.

Details

Motivation: Existing diffusion-based super-resolution methods adapt pre-trained generative models to image restoration but often sacrifice structural fidelity. GuideSR aims to solve this problem. Method: A dual-branch architecture: a guidance branch preserves high-fidelity structures, and a diffusion branch uses a pre-trained latent diffusion model to improve perceptual quality. The guidance branch combines Full Resolution Blocks with an Image Guidance Network. Result: Strong results on benchmark datasets, with PSNR gains of up to 1.39 dB at low computational cost, outperforming existing methods. Conclusion: GuideSR delivers efficient, high-quality image restoration with practical value. Abstract: In this paper, we propose GuideSR, a novel single-step diffusion-based image super-resolution (SR) model specifically designed to enhance image fidelity. Existing diffusion-based SR approaches typically adapt pre-trained generative models to image restoration tasks by adding extra conditioning on a VAE-downsampled representation of the degraded input, which often compromises structural fidelity. GuideSR addresses this limitation by introducing a dual-branch architecture comprising: (1) a Guidance Branch that preserves high-fidelity structures from the original-resolution degraded input, and (2) a Diffusion Branch, which leverages a pre-trained latent diffusion model to enhance perceptual quality. Unlike conventional conditioning mechanisms, our Guidance Branch features a tailored structure for image restoration tasks, combining Full Resolution Blocks (FRBs) with channel attention and an Image Guidance Network (IGN) with guided attention. By embedding detailed structural information directly into the restoration pipeline, GuideSR produces sharper and more visually consistent results. Extensive experiments on benchmark datasets demonstrate that GuideSR achieves state-of-the-art performance while maintaining the low computational cost of single-step approaches, with up to 1.39dB PSNR gain on challenging real-world datasets. Our approach consistently outperforms existing methods across various reference-based metrics including PSNR, SSIM, LPIPS, DISTS and FID, further representing a practical advancement for real-world image restoration.
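
To put the reported PSNR gain in perspective, the short sketch below uses the standard PSNR definition to convert a gain in dB into the corresponding reduction in mean squared error. The function names are illustrative, and the calculation is generic arithmetic rather than anything taken from the GuideSR evaluation code.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to raise PSNR by gain_db."""
    return 10.0 ** (gain_db / 10.0)


ratio = mse_ratio_for_gain(1.39)      # MSE shrink factor for a 1.39 dB gain
baseline = psnr(0.01)                 # 20 dB for an MSE of 0.01
improved = psnr(0.01 / ratio)         # the same image with the smaller error
```

Because PSNR is logarithmic, a 1.39 dB gain corresponds to dividing the mean squared error by about 1.38, that is, roughly 27% less squared error.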

cs.MA [Back]

[122] Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Shaokun Zhang,Ming Yin,Jieyu Zhang,Jiale Liu,Zhiguang Han,Jingyang Zhang,Beibin Li,Chi Wang,Huazheng Wang,Yiran Chen,Qingyun Wu

Main category: cs.MA

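TL;DR: This paper proposes and formulates a new research area, automated failure attribution in LLM multi-agent systems, and introduces the Who&When dataset to support it. An evaluation of three methods shows that the task is hard and that existing methods have limited effectiveness.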

Details

Motivation: Failure attribution in LLM multi-agent systems is crucial for debugging, but it remains underexplored and labor-intensive. Method: The paper formulates the automated failure attribution task, introduces the Who&When dataset, and develops and evaluates three automated methods. Result: The best method reaches 53.5% accuracy in identifying the responsible agent but only 14.2% in pinpointing the failure step, and some methods perform below random. Even SOTA models fall short of practical usability. Conclusion: The task is complex, existing methods have limited effectiveness, and further research is needed. Abstract: Failure attribution in LLM multi-agent systems, that is, identifying the agent and step responsible for task failures, provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution
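
The two accuracy figures quoted above measure different things, which the sketch below tries to make concrete. It assumes, hypothetically, that each annotated failure log reduces to an (agent name, step index) pair and that a step prediction counts only on an exact match; the real Who&When schema and scoring rules may differ.

```python
def attribution_accuracy(predictions, annotations):
    """Return (agent_accuracy, step_accuracy) over a list of failure logs.

    Each item is a hypothetical (agent_name, step_index) pair. A step
    prediction counts as correct only if it matches the annotation exactly.
    """
    total = len(annotations)
    agent_hits = sum(p[0] == a[0] for p, a in zip(predictions, annotations))
    step_hits = sum(p[1] == a[1] for p, a in zip(predictions, annotations))
    return agent_hits / total, step_hits / total


# Toy data, invented for illustration only.
annotations = [("planner", 3), ("coder", 7), ("coder", 2), ("browser", 5)]
predictions = [("planner", 3), ("coder", 4), ("planner", 2), ("browser", 1)]
agent_acc, step_acc = attribution_accuracy(predictions, annotations)
```

A log usually contains far more steps than agents, so the step-level target has many more candidate answers; that is consistent with the large gap between the reported agent-level and step-level accuracies.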

cs.LG [Back]

[123] Recursive KL Divergence Optimization: A Dynamic Framework for Representation Learning

Anthony D Martin

Main category: cs.LG

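TL;DR: The paper proposes Recursive KL Divergence Optimization (RKDO), which improves representation learning efficiency by dynamically adjusting the KL divergence of localized conditional distributions. Experiments show about 30% lower loss values and a 60-80% reduction in computational resources.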

Details

Motivation: Existing frameworks such as I-Con unify learning paradigms through KL divergence between fixed neighborhood conditional distributions but overlook the recursive structure of the learning process. Method: The RKDO framework models representation learning as the dynamic evolution of KL divergences across data neighborhoods, covering contrastive learning, clustering, and dimensionality reduction methods. Result: Experiments show that RKDO lowers loss values by about 30% on three datasets and reduces the computational resources needed by 60-80%. Conclusion: RKDO's recursive updating mechanism offers a more efficient optimization path for representation learning, especially in resource-constrained settings. Abstract: We propose a generalization of modern representation learning objectives by reframing them as recursive divergence alignment processes over localized conditional distributions. While recent frameworks like Information Contrastive Learning (I-Con) unify multiple learning paradigms through KL divergence between fixed neighborhood conditionals, we argue this view underplays a crucial recursive structure inherent in the learning process. We introduce Recursive KL Divergence Optimization (RKDO), a dynamic formalism where representation learning is framed as the evolution of KL divergences across data neighborhoods. This formulation captures contrastive, clustering, and dimensionality reduction methods as static slices while offering a new path to model stability and local adaptation. Our experiments demonstrate that RKDO offers dual efficiency advantages: approximately 30 percent lower loss values compared to static approaches across three different datasets, and a 60 to 80 percent reduction in computational resources needed to achieve comparable results. This suggests that RKDO's recursive updating mechanism provides a fundamentally more efficient optimization landscape for representation learning, with significant implications for resource-constrained applications.

[124] T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation

Xuyang Guo,Jiayan Huo,Zhenmei Shi,Zhao Song,Jiahao Zhang,Jiale Zhao

Main category: cs.LG

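TL;DR: T2VPhysBench is a new benchmark that evaluates whether text-to-video generation models obey core physical laws; the results show that current models generally perform poorly.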

Details

Motivation: Although text-to-video models have advanced in quality and aesthetics, whether they respect basic physical laws remains largely untested, so generated content can be unrealistic or misleading. Method: The T2VPhysBench benchmark combines human evaluation with first-principles physics to systematically assess model compliance with 12 core physical laws. Result: All models score below 0.60 on average in every law category, detailed hints do not remedy physics violations, and models will even generate videos that break physical rules when instructed to. Conclusion: Current models show significant gaps in physical awareness, and future research should focus more on integrating physical laws. Abstract: Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce T2VPhysBench, a first-principled benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.

[125] MINERVA: Evaluating Complex Video Reasoning

Arsha Nagrani,Sachit Menon,Ahmet Iscen,Shyamal Buch,Ramin Mehran,Nilpa Jha,Anja Hauth,Yukun Zhu,Carl Vondrick,Mikhail Sirotenko,Cordelia Schmid,Tobias Weyand

Main category: cs.LG

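TL;DR: MINERVA is a new video reasoning dataset that addresses the lack of intermediate reasoning steps in existing video benchmarks by providing detailed, hand-crafted reasoning traces for multimodal models.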

Details

Motivation: Existing video benchmarks provide only outcome supervision, so they cannot tell whether a model truly combines perceptual and temporal information to reason or simply reaches the right answer by chance or through linguistic biases. Method: The MINERVA dataset contains multimodal, diverse video questions with complex multi-step reasoning, each with 5 answer choices and a detailed reasoning trace. Result: The dataset challenges frontier open-source and proprietary models, and error analysis shows that the main failure modes are temporal localization and visual perception errors. Conclusion: MINERVA offers a more complete evaluation tool for video reasoning and exposes the main weaknesses of models in temporal and visual reasoning. Abstract: Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess if models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.

[126] Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

Vishnu Sarukkai,Zhiqiang Xie,Kayvon Fatahalian

Main category: cs.LG

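TL;DR: LLM agents can improve automatically by learning from their own successful experiences, without task-specific knowledge engineering. Building and refining a database of self-generated examples raises performance substantially, and database-level and exemplar-level selection improve it further.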

Details

Motivation: To reduce dependence on task-specific knowledge engineering and explore how LLM agents can improve through self-learning. Method: Build and refine a database of self-generated examples, and introduce database-level and exemplar-level selection mechanisms. Result: Performance improves on the ALFWorld, Wordcraft, and InterCode-SQL benchmarks, reaching up to 91% on ALFWorld. Conclusion: Automatic trajectory database construction is an effective alternative to manual knowledge engineering. Abstract: Many methods for improving Large Language Model (LLM) agents for sequential decision-making tasks depend on task-specific knowledge engineering--such as prompt tuning, curated in-context examples, or customized observation and action spaces. Using these approaches, agent performance improves with the quality or amount of knowledge engineering invested. Instead, we investigate how LLM agents can automatically improve their performance by learning in-context from their own successful experiences on similar tasks. Rather than relying on task-specific knowledge engineering, we focus on constructing and refining a database of self-generated examples. We demonstrate that even a naive accumulation of successful trajectories across training tasks boosts test performance on three benchmarks: ALFWorld (73% to 89%), Wordcraft (55% to 64%), and InterCode-SQL (75% to 79%)--matching the performance the initial agent achieves if allowed two to three attempts per task. We then introduce two extensions: (1) database-level selection through population-based training to identify high-performing example collections, and (2) exemplar-level selection that retains individual trajectories based on their empirical utility as in-context examples. These extensions further enhance performance, achieving 91% on ALFWorld--matching more complex approaches that employ task-specific components and prompts. Our results demonstrate that automatic trajectory database construction offers a compelling alternative to labor-intensive knowledge engineering.
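
The "naive accumulation" baseline described in the abstract can be sketched as a small store of successful trajectories that is queried for in-context examples. This is a minimal illustration, not the authors' implementation: the class name, the word-overlap retrieval score, and the toy tasks are all assumptions, since the abstract does not say how similar tasks are matched.

```python
class TrajectoryDB:
    """Stores successful trajectories and retrieves them as in-context examples."""

    def __init__(self):
        self.entries = []  # list of (task_description, trajectory)

    def add(self, task, trajectory, success):
        # Naive accumulation: keep every successful attempt, drop failures.
        if success:
            self.entries.append((task, trajectory))

    @staticmethod
    def _similarity(a, b):
        # Hypothetical retrieval score: Jaccard overlap of lowercase words.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    def retrieve(self, task, k=2):
        # Return the trajectories of the k most similar stored tasks.
        ranked = sorted(self.entries,
                        key=lambda e: self._similarity(task, e[0]),
                        reverse=True)
        return [trajectory for _, trajectory in ranked[:k]]


db = TrajectoryDB()
db.add("put a clean mug on the desk", ["find mug", "clean mug", "place on desk"], True)
db.add("heat an egg in the microwave", ["find egg", "heat egg"], True)
db.add("cool a tomato in the fridge", ["find tomato"], False)  # failed, not stored
examples = db.retrieve("put a clean plate on the desk", k=1)
```

The two extensions in the abstract would sit on top of such a store: database-level selection compares whole collections of examples, while exemplar-level selection prunes individual trajectories according to how useful they prove as examples.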

[127] Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing

Piotr Piękos,Róbert Csordás,Jürgen Schmidhuber

Main category: cs.LG

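TL;DR: MoSA is a new method based on dynamic sparse attention that lowers computational complexity by selecting key tokens for each head and can outperform dense attention.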

Details

Motivation: To reduce the high computational cost of self-attention while preserving performance. Method: Mixture of Sparse Attention (MoSA) dynamically selects tokens for each attention head, lowering computational complexity. Result: Under the same compute budget, MoSA outperforms the dense baseline, with up to 27% better perplexity. Conclusion: Perplexity-matched MoSA models are faster, use less training memory, and have a much smaller KV-cache than dense attention. Abstract: Recent advances in large language models highlighted the excessive quadratic cost of self-attention. Despite the significant research efforts, subquadratic attention methods still suffer from inferior performance in practice. We hypothesize that dynamic, learned content-based sparsity can lead to more efficient attention mechanisms. We present Mixture of Sparse Attention (MoSA), a novel approach inspired by Mixture of Experts (MoE) with expert choice routing. MoSA dynamically selects tokens for each attention head, allowing arbitrary sparse attention patterns. By selecting $k$ tokens from a sequence of length $T$, MoSA reduces the computational complexity of each attention head from $O(T^2)$ to $O(k^2 + T)$. This enables using more heads within the same computational budget, allowing higher specialization. We show that among the tested sparse attention variants, MoSA is the only one that can outperform the dense baseline, sometimes with up to 27% better perplexity for an identical compute budget. MoSA can also reduce the resource usage compared to dense self-attention. Despite using a torch implementation without an optimized kernel, perplexity-matched MoSA models are simultaneously faster in wall-clock time, require less memory for training, and drastically reduce the size of the KV-cache compared to the dense transformer baselines.
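
To make the complexity claim concrete, the sketch below illustrates expert-choice token selection for a single head and the resulting cost trade-off. It is a simplified pure-Python illustration, not the authors' implementation: the per-token router scores, the function names, and the unit-cost model (which ignores constant factors) are assumptions.

```python
def select_tokens(router_scores, k):
    """Expert-choice routing for one head: the head picks its top-k tokens.

    router_scores is a hypothetical list of per-token scores for this head.
    Returns the chosen token indices in their original sequence order.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])


def dense_head_cost(T):
    """Pairwise-interaction count of one dense head: O(T^2)."""
    return T * T


def mosa_head_cost(T, k):
    """Cost model from the abstract: O(k^2) attention plus O(T) routing."""
    return k * k + T


def heads_within_budget(T, k):
    """How many sparse heads fit in the budget of a single dense head."""
    return dense_head_cost(T) // mosa_head_cost(T, k)


scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
chosen = select_tokens(scores, k=3)              # indices of the 3 best tokens
budget_ratio = heads_within_budget(T=1024, k=64)
```

Under this simplified model, with T = 1024 and k = 64, about 204 sparse heads fit in the budget of one dense head, which illustrates why MoSA can afford more, and more specialized, heads at equal compute.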

[128] R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training

Albert Ge,Tzu-Heng Huang,John Cooper,Avi Trost,Ziyi Chu,Satya Sai Srinath Namburi GNVV,Ziyang Cai,Kendall Park,Nicholas Roberts,Frederic Sala

Main category: cs.LG

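TL;DR: The R&B framework regroups training data by semantic similarity and optimizes the data mixture, addressing shortcomings of conventional data mixing methods; it matches or outperforms existing methods at very low compute overhead.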

Details

Motivation: Conventional data mixing methods rely on predefined data domains, which can miss semantic nuances, and they are computationally expensive. Method: R&B regroups data by semantic similarity (Regroup) and efficiently optimizes the data composition using a Gram matrix induced by domain gradients (Balance). Result: On five diverse datasets, R&B matches or exceeds the performance of existing methods with as little as 0.01% additional compute overhead. Conclusion: R&B is an efficient, high-performing data mixing strategy applicable to a variety of tasks. Abstract: Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity (Regroup) to create finer-grained domains, and efficiently optimizes the data composition (Balance) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify R&B's effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of R&B on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01% additional compute overhead, R&B matches or exceeds the performance of state-of-the-art data mixing strategies.

[129] Toward Automated Regulatory Decision-Making: Trustworthy Medical Device Risk Classification with Multimodal Transformers and Self-Training

Yu Han,Aaron Ceross,Jeroen H. M. Bergmann

Main category: cs.LG

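TL;DR: Proposes a Transformer-based multimodal framework that combines textual and visual information to predict medical device risk classification, significantly outperforming single-modality baselines.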

Details

Motivation: Medical device risk classification is essential for regulation and clinical safety, but existing methods generalize poorly under limited supervision. Method: A cross-modal attention mechanism captures dependencies between modalities, and a self-training strategy improves generalization. Result: On a real-world regulatory dataset, the approach reaches 90.4% accuracy and 97.9% AUROC, outperforming text-only (77.2%) and image-only (54.8%) baselines. Conclusion: Cross-modal attention and self-training are complementary and effectively improve classification under limited supervision. Abstract: Accurate classification of medical device risk levels is essential for regulatory oversight and clinical safety. We present a Transformer-based multimodal framework that integrates textual descriptions and visual information to predict device regulatory classification. The model incorporates a cross-attention mechanism to capture intermodal dependencies and employs a self-training strategy for improved generalization under limited supervision. Experiments on a real-world regulatory dataset demonstrate that our approach achieves up to 90.4% accuracy and 97.9% AUROC, significantly outperforming text-only (77.2%) and image-only (54.8%) baselines. Compared to standard multimodal fusion, the self-training mechanism improved SVM performance by 3.3 percentage points in accuracy (from 87.1% to 90.4%) and 1.4 points in macro-F1, suggesting that pseudo-labeling can effectively enhance generalization under limited supervision. Ablation studies further confirm the complementary benefits of both cross-modal attention and self-training.

cs.CY

[130] Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications

Wenhan Dong,Yuemeng Zhao,Zhen Sun,Yule Liu,Zifan Peng,Jingyi Zheng,Zongmin Zhang,Ziyi Zhang,Jun Wu,Ruiming Wang,Shengmin Xu,Xinyi Huang,Xinlei He

Main category: cs.CY

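TL;DR: This paper systematically reviews six key dimensions of applying psychological theories to large language models (LLMs): assessment tools, datasets, evaluation metrics, empirical findings, personality simulation methods, and behavior simulation. It also identifies the strengths and limitations of current methods.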

Details

Motivation: As LLMs are increasingly used in human-centered tasks, assessing their psychological traits is crucial for understanding their social impact and ensuring trustworthy AI alignment. Existing reviews have not systematically covered several important areas, such as diverse psychological tests, LLM-specific psychological datasets, and their applications. Method: Systematically reviews six dimensions: assessment tools, LLM-specific datasets, evaluation metrics (consistency and stability), empirical findings, personality simulation methods, and behavior simulation. Result: Some LLMs exhibit reproducible personality patterns under specific prompting schemes, but significant variability remains across tasks and settings. Current methods suffer from problems such as mismatches between psychological tools and LLM capabilities. Conclusion: The study proposes future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs. Abstract: As large language models (LLMs) are increasingly used in human-centered tasks, assessing their psychological traits is crucial for understanding their social impact and ensuring trustworthy AI alignment. While existing reviews have covered some aspects of related research, several important areas have not been systematically discussed, including detailed discussions of diverse psychological tests, LLM-specific psychological datasets, and the applications of LLMs with psychological traits. To address this gap, we systematically review six key dimensions of applying psychological theories to LLMs: (1) assessment tools; (2) LLM-specific datasets; (3) evaluation metrics (consistency and stability); (4) empirical findings; (5) personality simulation methods; and (6) LLM-based behavior simulation. Our analysis highlights both the strengths and limitations of current methods. While some LLMs exhibit reproducible personality patterns under specific prompting schemes, significant variability remains across tasks and settings. Recognizing methodological challenges such as mismatches between psychological tools and LLMs' capabilities, as well as inconsistencies in evaluation practices, this study aims to propose future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs.

cs.RO

[131] AI-Enhanced Automatic Design of Efficient Underwater Gliders

Peter Yichen Chen,Pingchuan Ma,Niklas Hagemann,John Romanishin,Wei Wang,Daniela Rus,Wojciech Matusik

Main category: cs.RO

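TL;DR: The paper proposes an AI-enhanced automated computational framework for designing autonomous underwater gliders with complex shapes, improving energy efficiency by co-optimizing shape and control signals.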

Details

Motivation: Conventional underwater glider design relies on manual trial and error, offers limited shape diversity, and faces high computational costs when modeling complex fluid interactions. Method: Uses a reduced-order geometry representation and a neural-network-based fluid surrogate model to co-optimize shape and control signals. Result: Wind tunnel experiments and swimming pool tests show that the computationally designed gliders are more energy efficient than manually designed ones. Conclusion: The framework offers a new route to developing efficient underwater gliders, with implications for long-range ocean exploration and environmental monitoring. Abstract: The development of novel autonomous underwater gliders has been hindered by limited shape diversity, primarily due to the reliance on traditional design tools that depend heavily on manual trial and error. Building an automated design framework is challenging due to the complexities of representing glider shapes and the high computational costs associated with modeling complex solid-fluid interactions. In this work, we introduce an AI-enhanced automated computational framework designed to overcome these limitations by enabling the creation of underwater robots with non-trivial hull shapes. Our approach involves an algorithm that co-optimizes both shape and control signals, utilizing a reduced-order geometry representation and a differentiable neural-network-based fluid surrogate model. This end-to-end design workflow facilitates rapid iteration and evaluation of hydrodynamic performance, leading to the discovery of optimal and complex hull shapes across various control settings. We validate our method through wind tunnel experiments and swimming pool gliding tests, demonstrating that our computationally designed gliders surpass manually designed counterparts in terms of energy efficiency. By addressing challenges in efficient shape representation and neural fluid surrogate models, our work paves the way for the development of highly efficient underwater gliders, with implications for long-range ocean exploration and environmental monitoring.

[132] Robotic Visual Instruction

Yanbang Li,Ziyang Gong,Haoyang Li,Haoyang Li,Xiaoqi Huang,Haolan Kang,Guangping Bai,Xianzheng Ma

Main category: cs.RO

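TL;DR: RoVI is a new paradigm that guides robotic tasks through hand-drawn symbolic representations. Paired with the VIEW pipeline, it uses vision-language models to interpret instructions and generate precise actions, and experiments confirm its generalization capability.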

Details

Motivation: Natural language lacks spatial precision for robotic control and suffers from ambiguity and verbosity; RoVI aims to address these problems with visual symbols. Method: Proposes the RoVI paradigm, which encodes spatial-temporal information in 2D sketches; designs the VIEW pipeline, which uses vision-language models to interpret RoVI instructions and generate 3D action sequences; and builds a 15K-instance dataset to fine-tune models. Result: Validated on 11 novel tasks, VIEW achieves an 87.5% success rate in real-world scenarios and supports multi-step, disturbance, and trajectory-following tasks. Conclusion: RoVI and VIEW provide an efficient visual instruction solution for robotic tasks that generalizes well; the code and datasets will be released. Abstract: Recently, natural language has been the primary medium for human-robot interaction. However, its inherent lack of spatial precision for robotic control introduces challenges such as ambiguity and verbosity. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm to guide robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to understand RoVI better and generate precise actions based on RoVI, we present Visual Instruction Embodied Workflow (VIEW), a pipeline formulated for RoVI-conditioned policies. This approach leverages Vision-Language Models (VLMs) to interpret RoVI inputs, decode spatial and temporal constraints from 2D pixel space via keypoint extraction, and then transform them into executable 3D action sequences. We additionally curate a specialized dataset of 15K instances to fine-tune small VLMs for edge deployment, enabling them to effectively learn RoVI capabilities. Our approach is rigorously validated across 11 novel tasks in both real and simulated environments, demonstrating significant generalization capability. Notably, VIEW achieves an 87.5% success rate in real-world scenarios involving unseen tasks that feature multi-step actions, with disturbances, and trajectory-following requirements. Code and Datasets in this paper will be released soon.

cs.IR

[133] Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques

Naamán Huerga-Pérez,Rubén Álvarez,Rubén Ferrero-Guillén,Alberto Martínez-Gutiérrez,Javier Díez-González

Main category: cs.IR

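TL;DR: The paper studies quantization and dimensionality reduction as ways to reduce embedding storage for retrieval-augmented generation, and finds that combining float8 quantization with PCA achieves 8x compression with minimal performance loss.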

Details

Motivation: High-dimensional vector embeddings (float32 precision) pose memory challenges when stored at scale, so storage strategies need to be optimized. Method: Systematically evaluates quantization (float16, int8, binary, float8) and dimensionality reduction (PCA, Kernel PCA, UMAP, and others). Result: float8 quantization achieves a 4x storage reduction with less than 0.3% performance degradation, and combining it with PCA reaches 8x compression. Conclusion: Combining float8 quantization with PCA is the best option, and a visualization-based method is proposed to help select the optimal configuration. Abstract: Retrieval-Augmented Generation enhances language models by retrieving relevant information from external knowledge bases, relying on high-dimensional vector embeddings typically stored in float32 precision. However, storing these embeddings at scale presents significant memory challenges. To address this issue, we systematically investigate on MTEB benchmark two complementary optimization strategies: quantization, evaluating standard formats (float16, int8, binary) and low-bit floating-point types (float8), and dimensionality reduction, assessing methods like PCA, Kernel PCA, UMAP, Random Projections and Autoencoders. Our results show that float8 quantization achieves a 4x storage reduction with minimal performance degradation (<0.3%), significantly outperforming int8 quantization at the same compression level, being simpler to implement. PCA emerges as the most effective dimensionality reduction technique. Crucially, combining moderate PCA (e.g., retaining 50% dimensions) with float8 quantization offers an excellent trade-off, achieving 8x total compression with less performance impact than using int8 alone (which provides only 4x compression). To facilitate practical application, we propose a methodology based on visualizing the performance-storage trade-off space to identify the optimal configuration that maximizes performance within their specific memory constraints.
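
The 4x and 8x figures in the abstract follow from simple byte arithmetic, which the sketch below reproduces. It only counts bytes per stored value and the fraction of dimensions kept; it does not perform quantization or PCA, and it ignores side costs such as storing the PCA projection matrix or any quantization scale factors. The function and table names are illustrative, not from the paper.

```python
# Bytes needed per stored value for each format named in the abstract.
BYTES_PER_VALUE = {
    "float32": 4.0,
    "float16": 2.0,
    "float8": 1.0,
    "int8": 1.0,
    "binary": 0.125,  # one bit per dimension
}


def compression_ratio(dtype: str, kept_fraction: float = 1.0) -> float:
    """Storage reduction relative to full-dimension float32 embeddings.

    `kept_fraction` is the share of dimensions retained after dimensionality
    reduction (1.0 means no reduction).
    """
    if not 0.0 < kept_fraction <= 1.0:
        raise ValueError("kept_fraction must be in (0, 1]")
    return BYTES_PER_VALUE["float32"] / (BYTES_PER_VALUE[dtype] * kept_fraction)


float8_only = compression_ratio("float8")              # quantization alone
float8_with_pca = compression_ratio("float8", 0.5)     # plus keeping 50% of dims
int8_only = compression_ratio("int8")                  # same size as float8 alone
```

Halving the dimensions doubles the ratio obtained from quantization alone, which is why the combined configuration reaches twice the compression of int8 by itself.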

[134] EnronQA: Towards Personalized RAG over Private Documents

Michael J. Ryan,Danmei Xu,Chris Nivera,Daniel Campos

Main category: cs.IR

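TL;DR: The paper introduces the EnronQA benchmark dataset for improving the evaluation of retrieval-augmented generation (RAG) over private data, and explores the tradeoff between memorization and retrieval when reasoning over private documents.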

Details

Motivation: Current RAG benchmarks are built mainly on public data and offer little private or personalized context, which limits the application of RAG to private data. Method: Releases the EnronQA dataset, containing 103,638 emails and 528,304 question-answer pairs across 150 user inboxes, for evaluating and optimizing RAG pipelines. Result: EnronQA provides a more realistic benchmark for RAG over private data and supports experiments with personalized retrieval settings. Conclusion: EnronQA fills a gap in RAG evaluation over private data and sheds light on the tradeoff between memorization and retrieval when reasoning over private documents. Abstract: Retrieval Augmented Generation (RAG) has become one of the most popular methods for bringing knowledge-intensive context to large language models (LLM) because of its ability to bring local context at inference time without the cost or data leakage risks associated with fine-tuning. A clear separation of private information from the LLM training has made RAG the basis for many enterprise LLM workloads as it allows the company to augment LLM's understanding using customers' private documents. Despite its popularity for private documents in enterprise deployments, current RAG benchmarks for validating and optimizing RAG pipelines draw their corpora from public data such as Wikipedia or generic web pages and offer little to no personal context. Seeking to empower more personal and private RAG we release the EnronQA benchmark, a dataset of 103,638 emails with 528,304 question-answer pairs across 150 different user inboxes. EnronQA enables better benchmarking of RAG pipelines over private data and allows for experimentation on the introduction of personalized retrieval settings over realistic data. Finally, we use EnronQA to explore the tradeoff in memorization and retrieval when reasoning over private documents.

[135] Investigating Task Arithmetic for Zero-Shot Information Retrieval

Marco Braga,Pranav Kasela,Alessandro Raganato,Gabriella Pasi

Main category: cs.IR

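TL;DR: The paper investigates Task Arithmetic, a technique that combines the weights of pre-trained LLMs through simple mathematical operations (such as addition and subtraction) to adapt to different tasks and domains without additional fine-tuning, significantly improving zero-shot retrieval performance.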

Details

Motivation: Large language models (LLMs) perform well on zero-shot tasks, but their performance degrades on unseen tasks and domains, largely due to shifts in vocabulary and word distributions. Method: Applies Task Arithmetic, which combines the weights of LLMs pre-trained on different tasks or domains through mathematical operations, synthesizing diverse task and domain knowledge into a single model. Result: On publicly available scientific, biomedical, and multilingual datasets, NDCG@10 and P@10 improve by up to 18% and 15%, respectively. Conclusion: Task Arithmetic is an effective strategy for zero-shot learning and model adaptation, and the analysis also reveals its strengths and limitations. Abstract: Large Language Models (LLMs) have shown impressive zero-shot performance across a variety of Natural Language Processing tasks, including document re-ranking. However, their effectiveness degrades on unseen tasks and domains, largely due to shifts in vocabulary and word distributions. In this paper, we investigate Task Arithmetic, a technique that combines the weights of LLMs pre-trained on different tasks or domains via simple mathematical operations, such as addition or subtraction, to adapt retrieval models without requiring additional fine-tuning. Our method is able to synthesize diverse tasks and domain knowledge into a single model, enabling effective zero-shot adaptation in different retrieval contexts. Extensive experiments on publicly available scientific, biomedical, and multilingual datasets show that our method improves state-of-the-art re-ranking performance by up to 18% in NDCG@10 and 15% in P@10. In addition to these empirical gains, our analysis provides insights into the strengths and limitations of Task Arithmetic as a practical strategy for zero-shot learning and model adaptation. We make our code publicly available at https://github.com/DetectiveMB/Task-Arithmetic-for-ZS-IR.
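
As a rough illustration of combining weights "via simple mathematical operations, such as addition or subtraction", the sketch below uses the common task-vector formulation: subtract the base weights from each fine-tuned model to get a task vector, then add scaled task vectors back onto the base. This formulation, the function names, and the `scale` parameter are assumptions for illustration; the abstract does not spell out the exact recipe the authors use. Weights are plain lists of floats keyed by parameter name so the example needs no external libraries.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Element-wise difference between fine-tuned and base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def apply_task_vectors(
    base: Weights, vectors: List[Weights], scale: float = 1.0
) -> Weights:
    """Add each scaled task vector to a copy of the base weights."""
    merged = {name: list(values) for name, values in base.items()}
    for vector in vectors:
        for name, delta in vector.items():
            merged[name] = [w + scale * d for w, d in zip(merged[name], delta)]
    return merged


# Toy two-parameter "models": a shared base and two hypothetical specializations.
base_model = {"w": [1.0, 2.0]}
domain_model = {"w": [1.5, 2.0]}
retrieval_model = {"w": [1.0, 3.0]}

vectors = [task_vector(domain_model, base_model),
           task_vector(retrieval_model, base_model)]
merged_model = apply_task_vectors(base_model, vectors)
```

No gradient step is involved: the merged model is obtained purely from arithmetic on existing checkpoints, which matches the abstract's claim that no additional fine-tuning is required.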

cs.NE

[136] Neuroevolution of Self-Attention Over Proto-Objects

Rafael C. Pinto,Anderson R. Tavares

Main category: cs.NE

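TL;DR: The paper proposes an attention mechanism based on proto-objects as a replacement for conventional attention over rectangular image patches, significantly reducing representational complexity and computational cost.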

Details

Motivation: Conventional attention mechanisms based on rectangular image patches perform well in visual reinforcement learning but carry high representational complexity and computational cost. Proto-objects, as higher-level image features, promise a more efficient alternative. Method: Uses image segmentation to extract proto-objects, encodes them as compact feature vectors, and builds a smaller self-attention module that processes richer semantic information. Result: Experiments show that the proto-object-based approach matches or exceeds conventional methods in performance, with 62% fewer parameters and 2.6 times less training time. Conclusion: Proto-objects are an efficient alternative for attention mechanisms, substantially improving computational efficiency while preserving performance. Abstract: Proto-objects - image regions that share common visual properties - offer a promising alternative to traditional attention mechanisms based on rectangular-shaped image patches in neural networks. Although previous work demonstrated that evolving a patch-based hard-attention module alongside a controller network could achieve state-of-the-art performance in visual reinforcement learning tasks, our approach leverages image segmentation to work with higher-level features. By operating on proto-objects rather than fixed patches, we significantly reduce the representational complexity: each image decomposes into fewer proto-objects than regular patches, and each proto-object can be efficiently encoded as a compact feature vector. This enables a substantially smaller self-attention module that processes richer semantic information. Our experiments demonstrate that this proto-object-based approach matches or exceeds the state-of-the-art performance of patch-based implementations with 62% less parameters and 2.6 times less training time.

Table of Contents

cs.CV

[1] Learning to Borrow Features for Improved Detection of Small Objects in Single-Shot Detectors

[2] Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design

[3] Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis

[4] Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models

[5] V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving

[6] Direct Motion Models for Assessing Generated Videos

[7] Towards Robust and Generalizable Gerchberg Saxton based Physics Inspired Neural Networks for Computer Generated Holography: A Sensitivity Analysis Framework

[8] ReXGradient-160K: A Large-Scale Publicly Available Dataset of Chest Radiographs with Free-text Reports

[9] Empowering Agentic Video Analytics Systems with Video Language Models

[10] Pack-PTQ: Advancing Post-training Quantization of Neural Networks by Pack-wise Reconstruction

[11] AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care

[12] Fine-grained spatial-temporal perception for gas leak segmentation

[13] AI-Assisted Decision-Making for Clinical Assessment of Auto-Segmented Contour Quality

[14] AWARE-NET: Adaptive Weighted Averaging for Robust Ensemble Network in Deepfake Detection

[15] Quaternion Wavelet-Conditioned Diffusion Models for Image Super-Resolution

[16] Efficient Neural Video Representation with Temporally Coherent Modulation

[17] Automated segmentation of pediatric neuroblastoma on multi-modal MRI: Results of the SPPIN challenge at MICCAI 2023

[18] Cues3D: Unleashing the Power of Sole NeRF for Consistent and Unique Instances in Open-Vocabulary 3D Panoptic Segmentation

[19] The Invisible Threat: Evaluating the Vulnerability of Cross-Spectral Face Recognition to Presentation Attacks

[20] SOTA: Spike-Navigated Optimal TrAnsport Saliency Region Detection in Composite-bias Videos

[21] Real-Time Animatable 2DGS-Avatars with Detail Enhancement from Monocular Videos

[22] Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly

[23] ClearLines - Camera Calibration from Straight Lines

[24] JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

[25] KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution

[26] Towards Scalable Human-aligned Benchmark for Text-guided Image Editing

[27] HeAL3D: Heuristical-enhanced Active Learning for 3D Object Detection

[28] Inconsistency-based Active Learning for LiDAR Object Detection

[29] InterLoc: LiDAR-based Intersection Localization using Road Segmentation with Automated Evaluation Method

[30] A Robust Deep Networks based Multi-Object MultiCamera Tracking System for City Scale Traffic

[31] X-ray illicit object detection using hybrid CNN-transformer neural network architectures

[32] Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities

[33] AnimalMotionCLIP: Embedding motion in CLIP for Animal Behavior Analysis

[34] Synthesizing and Identifying Noise Levels in Autonomous Vehicle Camera Radar Datasets

[35] Uncertainty-Aware Multi-Expert Knowledge Distillation for Imbalanced Disease Grading

[36] Visual Trajectory Prediction of Vessels for Inland Navigation

[37] Dietary Intake Estimation via Continuous 3D Reconstruction of Food

[38] Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction

[39] Diverse Semantics-Guided Feature Alignment and Decoupling for Visible-Infrared Person Re-Identification

[40] Brain Foundation Models with Hypergraph Dynamic Adapter for Brain Disease Analysis

[41] Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook

[42] Deep Reinforcement Learning for Urban Air Quality Management: Multi-Objective Optimization of Pollution Mitigation Booth Placement in Metropolitan Environments

[43] Visual Test-time Scaling for GUI Agent Grounding

[44] Towards Autonomous Micromobility through Scalable Urban Simulation

[45] RayZer: A Self-supervised Large View Synthesis Model

[46] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

cs.GR

[47] Controllable Weather Synthesis and Removal with Video Diffusion Models

cs.CL

[48] Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning

[49] Symbol grounding in computational systems: A paradox of intentions

[50] The Mind in the Machine: A Survey of Incorporating Psychological Theories in LLMs

[51] LangVAE and LangSpace: Building and Probing for Language Model VAEs

[52] Toward a digital twin of U.S. Congress

[53] A Scoping Review of Natural Language Processing in Addressing Medically Inaccurate Information: Errors, Misinformation, and Hallucination

[54] Efficient Knowledge Transfer in Multi-Task Learning through Task-Adaptive Low-Rank Representation

[55] Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models

[56] The AI Co-Ethnographer: How Far Can Automation Take Qualitative Research?

[57] Performance Evaluation of Emotion Classification in Japanese Using RoBERTa and DeBERTa

[58] Manifold-Constrained Sentence Embeddings via Triplet Loss: Projecting Semantics onto Spheres, Tori, and Möbius Strips

[59] Design and Application of Multimodal Large Language Model Based System for End to End Automation of Accident Dataset Generation

[60] Sparks of Tabular Reasoning via Text2SQL Reinforcement Learning

[61] ReCellTy: Domain-specific knowledge graph retrieval-augmented LLMs workflow for single-cell annotation

[62] An Empirical Study on Prompt Compression for Large Language Models

[63] Beyond Public Access in LLM Pre-Training Data

[64] Ustnlp16 at SemEval-2025 Task 9: Improving Model Performance through Imbalance Handling and Focal Loss

[65] Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

[66] CORG: Generating Answers from Complex, Interrelated Contexts

[67] Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning

[68] A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1

[69] Theory of Mind in Large Language Models: Assessment and Enhancement

[70] Extracting Abstraction Dimensions by Identifying Syntax Pattern from Texts

[71] Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation

[72] Keep the General, Inject the Specific: Structured Dialogue Fine-Tuning for Knowledge Injection without Catastrophic Forgetting

[73] Can Language Models Represent the Past without Anachronism?

[74] Learning to Plan Before Answering: Self-Teaching LLMs to Learn Abstract Plans for Problem Solving

[75] MDD-LLM: Towards Accuracy Large Language Models for Major Depressive Disorder Diagnosis

[76] From Attention to Atoms: Spectral Dictionary Learning for Fast, Interpretable Language Models