Papers with Code - 2025-07-16

Paper title: LifelongPR: Lifelong knowledge fusion for point cloud place recognition based on replay and prompt learning

Summary (translated from Chinese): Point cloud place recognition (PCPR) plays a key role in photogrammetry and robotics applications such as autonomous driving, intelligent transportation, and augmented reality. In real-world large-scale deployments of positioning systems, however, existing PCPR models often degrade because of catastrophic forgetting, which leads to poor scalability, higher maintenance costs, and deployment difficulties. To address this, the authors propose LifelongPR, a new continual learning framework that extracts and fuses knowledge from sequential point cloud data. First, it mitigates knowledge loss by dynamically allocating sample sizes and selecting spatially diverse samples. Second, it uses a prompt learning-based continual learning framework, with a lightweight prompt module and a two-stage training strategy, to handle domain shift while minimizing forgetting. Comprehensive experiments on large-scale public and self-collected datasets show that, compared with state-of-the-art methods, the approach improves mIR@1 by 6.50% and mR@1 by 7.96% and reduces F by 8.95%. The code and pre-trained models are publicly available on GitHub.
English abstract: Point cloud place recognition (PCPR) plays a crucial role in photogrammetry and robotics applications such as autonomous driving, intelligent transportation, and augmented reality. In real-world large-scale deployments of a positioning system, PCPR models must continuously acquire, update, and accumulate knowledge to adapt to diverse and dynamic environments, an ability known as continual learning (CL). However, existing PCPR models often suffer from catastrophic forgetting, leading to significant performance degradation in previously learned scenes when adapting to new environments or sensor types. This results in poor model scalability, increased maintenance costs, and system deployment difficulties, undermining the practicality of PCPR. To address these issues, we propose LifelongPR, a novel continual learning framework for PCPR, which effectively extracts and fuses knowledge from sequential point cloud data. First, to alleviate the knowledge loss, we propose a replay sample selection method that dynamically allocates sample sizes according to each dataset's information quantity and selects spatially diverse samples for maximal representativeness. Second, to handle domain shifts, we design a prompt learning-based CL framework with a lightweight prompt module and a two-stage training strategy, enabling domain-specific feature adaptation while minimizing forgetting. Comprehensive experiments on large-scale public and self-collected datasets are conducted to validate the effectiveness of the proposed method. Compared with state-of-the-art (SOTA) methods, our method achieves 6.50% improvement in mIR@1, 7.96% improvement in mR@1, and an 8.95% reduction in F. The code and pre-trained models are publicly available at https://github.com/zouxianghong/LifelongPR.
Paper link: https://arxiv.org/pdf/2507.10034v1.pdf
Code link: https://github.com/zouxianghong/LifelongPR
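
The replay sample selection described in the abstract can be illustrated roughly as follows. This is only a sketch, not the authors' code: the per-dataset `info_scores` values are hypothetical (the abstract does not say how information quantity is measured), and farthest point sampling is swapped in as one common way to pick spatially diverse samples.

```python
import math

def allocate_budget(info_scores, total_budget):
    """Split a replay budget across datasets in proportion to a score.

    `info_scores` is a hypothetical per-dataset "information quantity".
    """
    total = sum(info_scores)
    sizes = [int(total_budget * s / total) for s in info_scores]
    # Hand out the remainder left by rounding down, largest score first.
    order = sorted(range(len(sizes)), key=lambda i: -info_scores[i])
    for i in order[: total_budget - sum(sizes)]:
        sizes[i] += 1
    return sizes

def farthest_point_sampling(points, k):
    """Greedily pick up to k indices so the chosen points are spread out."""
    chosen = [0]  # start from the first point (an arbitrary choice)
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        # Each point keeps its distance to the nearest chosen point.
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen
```

For example, `allocate_budget([1, 3], 8)` gives the second dataset three times the replay slots of the first, and `farthest_point_sampling` then fills each dataset's slots with positions that are far apart.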

Paper title: REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Summary (translated from Chinese): Recent large reasoning models (LRMs) have made remarkable progress on task-specific benchmarks, but their evaluation is still confined to an isolated problem-solving paradigm. Existing benchmarks mainly assess single-question reasoning through sequential testing, which brings key limitations: they are vulnerable to data contamination and insufficiently challenging (for example, DeepSeek-R1 reaches 97.0% accuracy on MATH500), so new questions must be created continually at considerable human cost; and they cannot evaluate how models behave under multi-context pressure, a key requirement in real-world use.

To address these issues, the authors propose REST (Reasoning Evaluation through Simultaneous Testing), a framework that stress-tests models on multiple problems at once. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities, including contextual priority allocation, resistance to cross-problem interference, and dynamic cognitive load management. The evaluation yields several notable findings: even state-of-the-art models such as DeepSeek-R1 show substantial performance degradation under stress testing. REST is also more discriminative than existing benchmarks, exposing pronounced performance differences among models that perform similarly under single-question evaluation.

The analysis shows that the overthinking trap is a key factor behind the degradation, and that models trained with the "long2short" technique retain more of their single-problem accuracy under REST than standard-trained models. These results suggest that REST is a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands and reduces reliance on continuous human annotation.

English abstract: Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) vulnerability to data contamination and insufficient difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing costly and perpetual creation of new questions with large human efforts, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key mechanistic insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) the models trained with the "long2short" technique preserve more accuracy of their single-problem performance under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.
Paper link: https://arxiv.org/pdf/2507.10541v1.pdf
Code link: https://github.com/opendatalab/REST
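
The idea of asking several problems in one prompt can be sketched as below. The prompt wording, the `level` parameter (number of problems per prompt), and both function names are assumptions made for illustration; the abstract does not specify REST's actual template or scoring code.

```python
def build_stress_prompts(questions, level):
    """Group questions into prompts that each contain up to `level` problems."""
    prompts = []
    for start in range(0, len(questions), level):
        batch = questions[start:start + level]
        lines = [f"Q{i}: {q}" for i, q in enumerate(batch, start=1)]
        # Hypothetical instruction line; the real benchmark may phrase it differently.
        lines.append("Answer every question in order, labelling each answer.")
        prompts.append("\n".join(lines))
    return prompts

def per_question_accuracy(predictions, answers):
    """Fraction of questions answered correctly across all prompts."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Comparing `per_question_accuracy` at `level=1` against higher levels would give the kind of single-question versus stress-test gap the abstract describes.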

Paper title: Bridging Robustness and Generalization Against Word Substitution Attacks in NLP via the Growth Bound Matrix Approach

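Summary (translated from Chinese): Despite significant progress in natural language processing (NLP), models remain vulnerable to adversarial attacks such as synonym substitution. Prior work has mainly focused on improving the robustness of feed-forward and convolutional architectures, while the robustness of recurrent networks and modern state space models (such as S4) has received comparatively little study. These architectures pose unique challenges because of their sequential processing and complex parameter dynamics.

The paper proposes a new regularization technique based on Growth Bound Matrices (GBM) that aims to improve the robustness of NLP models by reducing the effect of input perturbations on model outputs. The authors focus on computing the GBM for three architectures: long short-term memory networks (LSTM), state space models (S4), and convolutional neural networks (CNN). The method aims to (1) strengthen resistance to word substitution attacks, (2) improve generalization on clean text, and (3) provide the first systematic analysis of S4 robustness.

Extensive experiments show that the method improves adversarial robustness by up to 8.8% over existing baselines across multiple architectures and benchmark datasets. These results highlight the effectiveness of the method, which outperforms several state-of-the-art approaches in adversarial defense. Code is available at https://github.com/BouriMohammed/GBM.

English abstract: Despite advancements in Natural Language Processing (NLP), models remain vulnerable to adversarial attacks, such as synonym substitutions. While prior work has focused on improving robustness for feed-forward and convolutional architectures, the robustness of recurrent networks and modern state space models (SSMs), such as S4, remains understudied. These architectures pose unique challenges due to their sequential processing and complex parameter dynamics. In this paper, we introduce a novel regularization technique based on Growth Bound Matrices (GBM) to improve NLP model robustness by reducing the impact of input perturbations on model outputs. We focus on computing the GBM for three architectures: Long Short-Term Memory (LSTM), State Space models (S4), and Convolutional Neural Networks (CNN). Our method aims to (1) enhance resilience against word substitution attacks, (2) improve generalization on clean text, and (3) provide the first systematic analysis of SSM (S4) robustness. Extensive experiments across multiple architectures and benchmark datasets demonstrate that our method improves adversarial robustness by up to 8.8% over existing baselines. These results highlight the effectiveness of our approach, outperforming several state-of-the-art methods in adversarial defense. Code is available at https://github.com/BouriMohammed/GBM.
Paper link: https://arxiv.org/pdf/2507.10330v1.pdf
Code link: https://github.com/BouriMohammed/GBM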

Paper title: Text-Visual Semantic Constrained AI-Generated Image Quality Assessment

Summary (translated from Chinese): With the rapid development of Artificial Intelligence Generated Image (AGI) technology, accurately assessing image quality is becoming increasingly important. Common methods rely on cross-modal models such as CLIP or BLIP to evaluate text-image alignment and visual quality. When applied to AGIs, however, these methods face two main challenges: semantic misalignment and missing perception of detail. To address these limitations, the authors propose the Text-Visual Semantic Constrained AI-Generated Image Quality Assessment framework (SC-AGIQA), which uses text-visual semantic constraints to improve the joint evaluation of text-image consistency and perceptual distortion. The method integrates key capabilities from multiple models and introduces two core modules. The Text-assisted Semantic Alignment Module (TSAM) uses multimodal large language models (MLLMs) to generate an image description and compare it with the original prompt for a finer consistency check. The Frequency-domain Fine-Grained Degradation Perception Module (FFDPM) draws on properties of the Human Visual System (HVS), combining frequency-domain analysis with perceptual sensitivity weighting to better quantify subtle visual distortions and capture fine-grained quality details. Extensive experiments on multiple benchmark datasets show that SC-AGIQA outperforms existing state-of-the-art methods. The code is publicly available at https://github.com/mozhu1/SC-AGIQA.
English abstract: With the rapid advancements in Artificial Intelligence Generated Image (AGI) technology, the accurate assessment of AGI quality has become an increasingly vital requirement. Prevailing methods typically rely on cross-modal models like CLIP or BLIP to evaluate text-image alignment and visual quality. However, when applied to AGIs, these methods encounter two primary challenges: semantic misalignment and missing detail perception. To address these limitations, we propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (SC-AGIQA), a unified framework that leverages text-visual semantic constraints to significantly enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules: the Text-assisted Semantic Alignment Module (TSAM), which leverages Multimodal Large Language Models (MLLMs) to bridge the semantic gap by generating an image description and comparing it against the original prompt for a refined consistency check, and the Frequency-domain Fine-Grained Degradation Perception Module (FFDPM), which draws inspiration from Human Visual System (HVS) properties by employing frequency domain analysis combined with perceptual sensitivity weighting to better quantify subtle visual distortions and enhance the capture of fine-grained visual quality details in images. Extensive experiments conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods. The code is publicly available at https://github.com/mozhu1/SC-AGIQA.
Paper link: https://arxiv.org/pdf/2507.10432v1.pdf
Code link: https://github.com/mozhu1/SC-AGIQA

Paper title: VoTranhAbyssCoreMicro and PoliticalCore: A Unified Framework for Simulating Complex Economic and Political Dynamics

Summary (translated from Chinese): The VoTranhAbyssCoreMicro framework, combined with the PoliticalCore module, provides a platform for simulating complex economic and political dynamics, with a validated accuracy of 90% across multiple tested economies and political scenarios. By combining agent-based modeling, advanced neural networks, and statistical methods, the framework captures phenomena such as shadow economies, cultural inertia, propaganda, systemic collapse, and political unrest. The whitepaper details the system's architecture, key components, validation results, and applications, and stresses that a specialized computational environment is needed for optimal performance. Setup instructions are available from the author free of charge. The framework is presented as a tool for researchers, policymakers, and analysts who want to forecast macroeconomic and political events driven by entropy and systemic interactions.
English abstract: The VoTranhAbyssCoreMicro framework, integrated with the PoliticalCore module, provides a robust platform for simulating complex economic and political dynamics with a validated accuracy of 90% across multiple tested economies and political scenarios. By combining agent-based modeling, advanced neural networks, and statistical methods, the framework captures phenomena such as shadow economies, cultural inertia, propaganda, systemic collapse, and political unrest. This whitepaper details the architecture, key components, validation results, and applications of the unified system. It emphasizes the necessity of a specialized computational environment to achieve optimal performance, with setup instructions available upon request from the author (provided free of charge). The framework is a powerful tool for researchers, policymakers, and analysts seeking to forecast macro-economic and political events driven by entropy and systemic interactions.
Paper link: https://paperswithcode.com/paper/votranhabysscoremicro-and-politicalcore-a
Code link: https://github.com/vinhatson/votranhabysscoremicro

Paper title: AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs)

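Summary (translated from Chinese): This study presents a machine learning-based pediatric chest pneumonia classification system intended to help healthcare professionals diagnose pneumonia from chest X-ray images. The system uses a convolutional neural network (CNN) trained on 5,863 labeled chest X-ray images of children aged 0-5 years from the Guangzhou Women and Children's Medical Center. To cope with limited data, the study applies several data augmentation techniques (rotation, zooming, shear, and horizontal flipping) and uses generative adversarial networks (GANs) to produce synthetic images that address class imbalance. The system performed best, in terms of accuracy and F1 score, when trained on the combination of original, augmented, and GAN-generated data. The final model is deployed through a Flask web application that performs real-time classification and returns probability estimates. The results indicate that deep learning and GANs have considerable potential to improve the diagnostic accuracy and efficiency of pediatric pneumonia classification, which is especially valuable in resource-limited clinical settings. The code and models are publicly available on GitHub.
English abstract: Pneumonia is a leading cause of mortality in children under five, requiring accurate chest X-ray diagnosis. This study presents a machine learning-based Pediatric Chest Pneumonia Classification System to assist healthcare professionals in diagnosing pneumonia from chest X-ray images. The CNN-based model was trained on 5,863 labeled chest X-ray images from children aged 0-5 years from the Guangzhou Women and Children's Medical Center. To address limited data, we applied augmentation techniques (rotation, zooming, shear, horizontal flipping) and employed GANs to generate synthetic images, addressing class imbalance. The system achieved optimal performance using combined original, augmented, and GAN-generated data, evaluated through accuracy and F1 score metrics. The final model was deployed via a Flask web application, enabling real-time classification with probability estimates. Results demonstrate the potential of deep learning and GANs in improving diagnostic accuracy and efficiency for pediatric pneumonia classification, particularly valuable in resource-limited clinical settings. Code: https://github.com/AbdulManaf12/Pediatric-Chest-Pneumonia-Classification
Paper link: https://arxiv.org/pdf/2507.09759v1.pdf
Code link: https://github.com/AbdulManaf12/Pediatric-Chest-Pneumonia-Classification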

Paper title: PanoDiff-SR: Synthesizing Dental Panoramic Radiographs using Diffusion and Super-resolution

Summary (translated from Chinese): Interest in generating high-quality, realistic synthetic medical images has grown in recent years. Such synthetic datasets can ease the scarcity of public datasets for artificial intelligence research and can also be used for education. The paper proposes combining diffusion-based generation (PanoDiff) with super-resolution (SR) to produce synthetic dental panoramic radiographs. PanoDiff first generates a low-resolution (256 × 128) seed radiograph, which the SR model then turns into a high-resolution (1024 × 512) panoramic radiograph. For SR, the authors propose a state-of-the-art transformer that learns local-global relationships and yields sharper edges and textures. Experiments report a Fréchet inception distance of 40.69 between 7243 real and synthetic high-resolution images. Inception scores for real high-resolution, synthetic high-resolution, real low-resolution, and synthetic low-resolution images were 2.55, 2.30, 2.90, and 2.98, respectively. A group of six clinical experts each evaluated a mixture of 100 synthetic and 100 real panoramic radiographs under a time limit; their average accuracy in telling real from synthetic images was 68.5% (50% corresponds to random guessing).
English abstract: There has been increasing interest in the generation of high-quality, realistic synthetic medical images in recent years. Such synthetic datasets can mitigate the scarcity of public datasets for artificial intelligence research, and can also be used for educational purposes. In this paper, we propose a combination of diffusion-based generation (PanoDiff) and Super-Resolution (SR) for generating synthetic dental panoramic radiographs (PRs). The former generates a low-resolution (LR) seed of a PR (256 × 128) which is then processed by the SR model to yield a high-resolution (HR) PR of size 1024 × 512. For SR, we propose a state-of-the-art transformer that learns local-global relationships, resulting in sharper edges and textures. Experimental results demonstrate a Fréchet inception distance score of 40.69 between 7243 real and synthetic images (in HR). Inception scores were 2.55, 2.30, 2.90 and 2.98 for real HR, synthetic HR, real LR and synthetic LR images, respectively. Among a diverse group of six clinical experts, all evaluating a mixture of 100 synthetic and 100 real PRs in a time-limited observation, the average accuracy in distinguishing real from synthetic images was 68.5% (with 50% corresponding to random guessing).
Paper link: https://arxiv.org/pdf/2507.09227v1.pdf
Code link: https://github.com/s4nyam/panodiff

Paper title: Car Object Counting and Position Estimation via Extension of the CLIP-EBC Framework

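Summary (translated from Chinese): This paper studies whether the CLIP-EBC framework, originally designed for crowd counting, can be applied to car object counting, using the CARPK dataset. Experimental results show that the model ranks second among existing methods. The authors also propose a K-means weighted clustering method that estimates object positions from predicted density maps, suggesting that the framework could be extended to localization tasks.
English abstract: In this paper, we investigate the applicability of the CLIP-EBC framework, originally designed for crowd counting, to car object counting using the CARPK dataset. Experimental results show that our model achieves second-best performance compared to existing methods. In addition, we propose a K-means weighted clustering method to estimate object positions based on predicted density maps, indicating the framework's potential extension to localization tasks.
Paper link: https://arxiv.org/pdf/2507.08240v1.pdf
Code link: https://github.com/jungseoik/CLIP-LOCAR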
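
A minimal sketch of density-weighted K-means for position estimation is given below. It is an illustration under stated assumptions, not the authors' implementation: the density map is a plain nested list, centres are initialised at the heaviest pixels for determinism, and in practice `k` would presumably come from the predicted count (for instance the rounded sum of the density map).

```python
def weighted_kmeans(density, k, iters=20):
    """Cluster pixel coordinates of a 2D density map, weighting by density."""
    pts = [(r, c, w) for r, row in enumerate(density)
           for c, w in enumerate(row) if w > 0]
    # Initialise centres at the k heaviest pixels (simple and deterministic).
    centres = [(r, c) for r, c, _ in sorted(pts, key=lambda p: -p[2])[:k]]
    for _ in range(iters):
        sums = [[0.0, 0.0, 0.0] for _ in centres]
        for r, c, w in pts:
            # Assign the pixel to its nearest centre.
            j = min(range(len(centres)),
                    key=lambda i: (r - centres[i][0]) ** 2 + (c - centres[i][1]) ** 2)
            sums[j][0] += w * r
            sums[j][1] += w * c
            sums[j][2] += w
        # Move each centre to the density-weighted mean of its pixels.
        centres = [(s[0] / s[2], s[1] / s[2]) if s[2] > 0 else centres[i]
                   for i, s in enumerate(sums)]
    return centres
```

Each returned centre is a (row, column) estimate of one object's position; pixels with higher predicted density pull the centre more strongly.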

Summary (translated from Chinese): In safety-critical applications such as autonomous driving, existing mask-based methods for out-of-distribution (OoD) segmentation often suffer from imprecise boundaries, inconsistent anomaly scores inside objects, and false positives caused by background noise. To address this, the authors propose Objectomaly, an objectness-aware refinement framework that incorporates object-level priors. Objectomaly has three stages: (1) Coarse Anomaly Scoring (CAS) with an existing OoD backbone, (2) Objectness-Aware Score Calibration (OASC) using SAM-generated instance masks, and (3) Meticulous Boundary Precision (MBP), which applies Laplacian filtering and Gaussian smoothing to improve boundary accuracy. Experiments show that Objectomaly achieves state-of-the-art performance on key OoD segmentation benchmarks, with marked gains on SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly in both pixel-level metrics (AuPRC up to 96.99, FPR95 down to 0.07) and component-level metrics (F1-score up to 83.44). Ablation studies and qualitative results on real driving videos further support the method's robustness and generalizability. The code will be released after the paper is published.
English abstract: Out-of-Distribution (OoD) segmentation is critical for safety-sensitive applications like autonomous driving. However, existing mask-based methods often suffer from boundary imprecision, inconsistent anomaly scores within objects, and false positives from background noise. We propose Objectomaly, an objectness-aware refinement framework that incorporates object-level priors. Objectomaly consists of three stages: (1) Coarse Anomaly Scoring (CAS) using an existing OoD backbone, (2) Objectness-Aware Score Calibration (OASC) leveraging SAM-generated instance masks for object-level score normalization, and (3) Meticulous Boundary Precision (MBP) applying Laplacian filtering and Gaussian smoothing for contour refinement. Objectomaly achieves state-of-the-art performance on key OoD segmentation benchmarks, including SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly, improving both pixel-level (AuPRC up to 96.99, FPR95 down to 0.07) and component-level (F1-score up to 83.44) metrics. Ablation studies and qualitative results on real-world driving videos further validate the robustness and generalizability of our method. Code will be released upon publication.
Paper link: https://arxiv.org/pdf/2507.07460v1.pdf
Code link: https://github.com/hon121215/Objectomaly
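
One plausible reading of the object-level score normalization in stage (2) is sketched below: every pixel inside an instance mask receives that mask's mean anomaly score, so scores within an object become consistent. This is an assumption for illustration only; the abstract does not give the actual calibration rule, and the function name and data layout are hypothetical.

```python
def calibrate_scores(scores, masks):
    """Replace per-pixel scores inside each mask by the mask's mean score.

    `scores` is a 2D list of coarse anomaly scores; each mask is a list of
    (row, col) pixels, standing in for an instance mask from a segmenter.
    """
    out = [row[:] for row in scores]  # copy so the input stays unchanged
    for mask in masks:
        if not mask:
            continue
        mean = sum(scores[r][c] for r, c in mask) / len(mask)
        for r, c in mask:
            out[r][c] = mean
    return out
```

Pixels outside every mask keep their coarse score, which is why a boundary refinement stage would still be needed afterwards.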

Paper title: SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples

Summary (translated from Chinese): SCOOTER is an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Unrestricted adversarial attacks are not bound by ℓ_p-norm constraints and can fool computer vision models by, for example, changing an object's color, which lets them bypass traditional defense strategies. Because they are unrestricted, however, norm-based imperceptibility cannot be guaranteed, so human evaluation is needed to verify how realistic these adversarial examples look. SCOOTER provides best-practice guidelines covering crowd-study statistical power, compensation, and Likert equivalence bounds for measuring imperceptibility. A large study with 346 participants found that three color-space attacks and three diffusion-based attacks failed to produce imperceptible images. The study also shows that GPT-4o can serve as a preliminary test, but it consistently detects adversarial examples for only four of the six tested attacks. SCOOTER additionally provides open-source software tools, including a browser-based task template for collecting annotations and analysis scripts in Python and R, and a benchmark dataset with 3,000 real images, 7,000 adversarial examples, and more than 34,000 human ratings. The results show that automated vision systems do not align with human perception, which further underlines the need for a SCOOTER-based benchmark.
English abstract: Unrestricted adversarial attacks aim to fool computer vision models without being constrained by ℓ_p-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER, an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: (i) best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; (ii) the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; (iii) open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; (iv) an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.
Paper link: https://arxiv.org/pdf/2507.07776v1.pdf
Code link: https://github.com/DrenFazlija/Scooter

Paper title: GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation

Summary (translated from Chinese): The study proposes a new model architecture that combines graph neural networks (GNNs) and convolutional neural networks (CNNs) to improve time, cost, and energy efficiency when processing long texts. Current state-of-the-art Transformer models have high computational complexity on long documents and are therefore inefficient. The new model uses a real-time, end-to-end graph generation mechanism and processes compact batches of character-level inputs without padding or truncation. To improve performance while staying fast and efficient, it brings in information from large language models (LLMs), such as token embeddings and sentiment polarities, through efficient dictionary lookups. Specifically, the model captures local contextual patterns with CNNs, expands local receptive fields through lattice-based graph structures, and uses small-world graphs to aggregate document-level information. The generated graphs show meaningful semantic organization, with an average clustering coefficient of about 0.45 and an average shortest path length between 4 and 5. The model is evaluated on several text classification tasks, including sentiment analysis and news categorization, and compared with state-of-the-art models. Experimental results show that it is competitive in both efficiency and performance.
English abstract: Time, cost, and energy efficiency are critical considerations in deep learning (DL), particularly when processing long texts. Transformers, which represent the current state of the art, exhibit quadratic computational complexity relative to input length, making them inefficient for extended documents. This study introduces a novel model architecture that combines Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), integrated with a real-time, end-to-end graph generation mechanism. The model processes compact batches of character-level inputs without requiring padding or truncation. To enhance performance while maintaining high speed and efficiency, the model incorporates information from Large Language Models (LLMs), such as token embeddings and sentiment polarities, through efficient dictionary lookups. It captures local contextual patterns using CNNs, expands local receptive fields via lattice-based graph structures, and employs small-world graphs to aggregate document-level information. The generated graphs exhibit structural properties indicative of meaningful semantic organization, with an average clustering coefficient of approximately 0.45 and an average shortest path length ranging between 4 and 5. The model is evaluated across multiple text classification tasks, including sentiment analysis and news categorization, and is compared against state-of-the-art models. Experimental results confirm the proposed model's efficiency and competitive performance.
Paper link: https://paperswithcode.com/paper/gnn-cnn-an-efficient-hybrid-model-of
Code link: https://github.com/FardinRastakhiz/CNN-GNN-TEXT
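
The two graph statistics quoted in the abstract (average clustering coefficient and average shortest path length) can be computed as sketched below. The ring lattice used here is only a stand-in, the regular starting point of a Watts-Strogatz-style small-world graph; it is not the paper's graph generator, and all function names are hypothetical.

```python
from collections import deque

def ring_lattice(n, k):
    """Undirected ring where each node links to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def average_clustering(adj):
    """Mean over nodes of (links among neighbours) / (possible such links)."""
    total = 0.0
    for nbrs in adj.values():
        d = len(nbrs)
        if d < 2:
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2 * links / (d * (d - 1))
    return total / len(adj)

def average_shortest_path(adj):
    """Mean hop count over all reachable ordered node pairs, via BFS."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs
```

A regular lattice has high clustering but long paths; rewiring a few edges at random shortens the paths while keeping clustering fairly high, which is the small-world regime the reported values (about 0.45 and 4 to 5) point to.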

Paper title: PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency

Summary (translated from Chinese): PacGDC is a label-efficient technique for generalizable depth completion that increases data diversity with minimal annotation effort. It builds on new insights into the inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, which make it possible to synthesize many pseudo geometries for the same visual scene. PacGDC greatly broadens the available geometries by manipulating the scene scale of the corresponding depth maps. To this end, the authors propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators; these models provide pseudo depth labels with different scene scales while preserving the projection consistency that supports generalization. Interpolation and relocation strategies, together with unlabeled images, further increase geometric diversity and extend data coverage. Extensive experiments show that PacGDC performs strongly on multiple benchmarks and generalizes well across diverse scene semantics/scales and depth sparsity/patterns in both zero-shot and few-shot settings.
English abstract: Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances data diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on novel insights into inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. These models robustly provide pseudo depth labels with varied scene scales, affecting both local objects and global layouts, while ensuring projection consistency that supports generalization. To further diversify geometries, we incorporate interpolation and relocation strategies, as well as unlabeled images, extending the data coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings. Code: https://github.com/Wang-xjtu/PacGDC.
Paper link: https://arxiv.org/pdf/2507.07374v1.pdf
Code link: https://github.com/Wang-xjtu/PacGDC

Paper title: SAMO: A Lightweight Sharpness-Aware Approach for Multi-Task Optimization with Joint Global-Local Perturbation

Summary (translated from Chinese): Multi-task learning (MTL) uses a joint model to capture what multiple tasks have in common, reducing computation costs and improving data efficiency. A major challenge in MTL optimization, however, is task conflict: task gradients differ in direction or magnitude, so the model underperforms single-task models. Sharpness-aware minimization (SAM) minimizes the task loss while also reducing the sharpness of the loss landscape, and empirical observations show that SAM effectively mitigates task conflicts in MTL. Motivated by these findings, the authors integrate SAM into MTL but face two key challenges: how to combine global and local information, and the significant computational and memory overhead of directly computing each task gradient. To address them, they propose SAMO, a lightweight sharpness-aware multi-task optimization method that uses a joint global-local perturbation. The local perturbations are approximated using only forward passes and are normalized layer by layer to improve efficiency. Extensive experiments on a suite of multi-task benchmarks show that the method is both effective and efficient. Code is available at https://github.com/OptMN-Lab/SAMO.
English abstract: Multi-task learning (MTL) enables a joint model to capture commonalities across multiple tasks, reducing computation costs and improving data efficiency. However, a major challenge in MTL optimization is task conflicts, where the task gradients differ in direction or magnitude, limiting model performance compared to single-task counterparts. Sharpness-aware minimization (SAM) minimizes task loss while simultaneously reducing the sharpness of the loss landscape. Our empirical observations show that SAM effectively mitigates task conflicts in MTL. Motivated by these findings, we explore integrating SAM into MTL but face two key challenges. While both the average loss gradient and individual task gradients (referred to as global and local information, respectively) contribute to SAM, how to combine them remains unclear. Moreover, directly computing each task gradient introduces significant computational and memory overheads. To address these challenges, we propose SAMO, a lightweight Sharpness-Aware Multi-task Optimization approach that leverages a joint global-local perturbation. The local perturbations are approximated using only forward passes and are layerwise normalized to improve efficiency. Extensive experiments on a suite of multi-task benchmarks demonstrate both the effectiveness and efficiency of our method. Code is available at https://github.com/OptMN-Lab/SAMO.
Paper link: https://arxiv.org/pdf/2507.07883v1.pdf
Code link: https://github.com/OptMN-Lab/SAMO
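
For readers unfamiliar with SAM, the basic two-step update is sketched below on a toy single-task loss. This is plain SAM only; it does not implement SAMO's joint global-local perturbation or its forward-pass approximation, and the toy loss, curvatures, and hyperparameter values are assumptions chosen for illustration.

```python
import math

CURVATURES = [1.0, 10.0]  # hypothetical toy problem: L(w) = sum(a_i * w_i^2)

def toy_loss(w):
    return sum(a * x * x for a, x in zip(CURVATURES, w))

def loss_grad(w):
    return [2 * a * x for a, x in zip(CURVATURES, w)]

def sam_step(w, lr=0.02, rho=0.1):
    """One sharpness-aware minimization step."""
    g = loss_grad(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0  # avoid division by zero
    # Ascent step: move to the approximate worst point within radius rho.
    w_adv = [x + rho * gx / norm for x, gx in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point to w.
    g_adv = loss_grad(w_adv)
    return [x - lr * gx for x, gx in zip(w, g_adv)]
```

Starting from `w = [1.0, 1.0]` and calling `sam_step` repeatedly drives the toy loss down; in a multi-task setting the open question the abstract raises is which gradient (average or per-task) should define the ascent direction.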

Paper title: Rethinking Query-based Transformer for Continual Image Segmentation

Summary (translated from Chinese): Class-incremental/continual image segmentation (CIS) aims to train an image segmenter in stages, with a different set of available categories at each stage. Existing methods usually exploit the built-in objectness of query-based Transformers to mitigate catastrophic forgetting of mask proposals, and decouple mask generation from the continual learning process. This decoupled framework has two key problems: loss of plasticity and heavy reliance on input data order. To address them, the study examines built-in objectness in depth and finds that highly aggregated image features give queries a shortcut for generating masks through simple feature alignment. Based on this, it proposes SimCIS, a simple yet powerful baseline. The core idea is to select image features directly for query assignment, ensuring "perfect alignment" to preserve objectness while letting queries select new classes to promote plasticity. To further counter catastrophic forgetting of categories, the method introduces cross-stage consistency in selection and an innovative "visual query"-based replay mechanism. Experiments show that SimCIS outperforms existing methods across segmentation tasks, settings, splits, and input data orders. All models and code will be made public at https://github.com/SooLab/SimCIS.
English abstract: Class-incremental/Continual image segmentation (CIS) aims to train an image segmenter in stages, where the set of available categories differs at each stage. To leverage the built-in objectness of query-based transformers, which mitigates catastrophic forgetting of mask proposals, current methods often decouple mask generation from the continual learning process. This study, however, identifies two key issues with decoupled frameworks: loss of plasticity and heavy reliance on input data order. To address these, we conduct an in-depth investigation of the built-in objectness and find that highly aggregated image features provide a shortcut for queries to generate masks through simple feature alignment. Based on this, we propose SimCIS, a simple yet powerful baseline for CIS. Its core idea is to directly select image features for query assignment, ensuring "perfect alignment" to preserve objectness, while simultaneously allowing queries to select new classes to promote plasticity. To further combat catastrophic forgetting of categories, we introduce cross-stage consistency in selection and an innovative "visual query"-based replay mechanism. Experiments demonstrate that SimCIS consistently outperforms state-of-the-art methods across various segmentation tasks, settings, splits, and input data orders. All models and codes will be made publicly available at https://github.com/SooLab/SimCIS.
Paper link: https://arxiv.org/pdf/2507.07831v1.pdf
Code link: https://github.com/soolab/simcis

Paper title: Towards Interpretable Time Series Foundation Models

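Summary (translated from Chinese): This paper studies how to distill time series reasoning capabilities into small, instruction-tuned language models, as a step toward interpretable time series foundation models. Using a synthetic dataset of mean-reverting time series with systematically varied trends and noise levels, the authors generate natural language annotations with a large multimodal model and use them to supervise the fine-tuning of compact Qwen models. They introduce evaluation metrics for the quality of the distilled reasoning, focusing on trend direction, noise intensity, and extremum localization. The results show that the post-trained models acquire meaningful interpretive capabilities. The work demonstrates that time series understanding can feasibly be compressed into lightweight, language-capable models suited to on-device or privacy-sensitive deployment, and it lays a foundation for small, interpretable models that explain temporal patterns in natural language.
English abstract: In this paper, we investigate the distillation of time series reasoning capabilities into small, instruction-tuned language models as a step toward building interpretable time series foundation models. Leveraging a synthetic dataset of mean-reverting time series with systematically varied trends and noise levels, we generate natural language annotations using a large multimodal model and use these to supervise the fine-tuning of compact Qwen models. We introduce evaluation metrics that assess the quality of the distilled reasoning, focusing on trend direction, noise intensity, and extremum localization, and show that the post-trained models acquire meaningful interpretive capabilities. Our results highlight the feasibility of compressing time series understanding into lightweight, language-capable models suitable for on-device or privacy-sensitive deployment. This work contributes a concrete foundation toward developing small, interpretable models that explain temporal patterns in natural language.
Paper link: https://arxiv.org/pdf/2507.07439v1.pdf
Code link: https://github.com/svitlana-outsampler/ITS_ICML2025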
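
A synthetic mean-reverting series with a controllable trend and noise level could be generated roughly as follows. The discrete Ornstein-Uhlenbeck-style update, the parameter names (`trend`, `noise`, `theta`), and the slope-sign labelling are assumptions for illustration; the abstract does not describe the authors' generator or metrics in this detail.

```python
import random

def mean_reverting_series(n, trend=0.0, noise=0.1, theta=0.2, seed=0):
    """Series that is pulled toward a linear trend line and perturbed by noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for t in range(n):
        target = trend * t  # level the series reverts toward at step t
        x += theta * (target - x) + rng.gauss(0.0, noise)
        out.append(x)
    return out

def trend_direction(series):
    """Label the overall direction from the sign of the least-squares slope."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    # Numerator of the slope; the denominator is positive, so the sign matches.
    slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    return "up" if slope > 0 else "down" if slope < 0 else "flat"
```

A label such as the output of `trend_direction` is the kind of ground truth against which a model's natural language description of trend direction could be checked.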

Paper title: Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery

Summary (translated from Chinese): The authors present cmbagent, a multi-agent system for automating scientific research tasks. The system consists of about 30 large language model (LLM) agents and uses a Planning & Control strategy to orchestrate the agentic workflow, with no human intervention at any point. Each agent specializes in a different task, such as retrieving scientific papers and codebases, writing code, interpreting results, and critiquing the output of other agents. The system can execute code locally. The authors successfully apply cmbagent to a PhD-level cosmology task, measuring cosmological parameters from supernova data, and evaluate it on two benchmark sets, where it outperforms current state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are available, and the system is deployed on HuggingFace and will soon be offered in the cloud.
English abstract: We present a multi-agent system for automation of scientific research tasks, cmbagent. The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.
Paper link: https://arxiv.org/pdf/2507.07257v1.pdf
Code link: https://github.com/cmbagents/cmbagent

Paper title: MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval

中文摘要： 在文本到图像检索中，结果多样化（RD）是提高实际应用效率的关键技术。传统方法仅关注增加图像外观的多样性指标，但这种单一的方法限制了RD的应用范围，因为不同的应用场景对多样性指标的需求不同。本文提出了一种新的任务——复合属性的上下文多样性优化（CDR-CA），旨在根据应用场景细化多个属性的多样性。为此，研究者提出了多源确定性点过程（MS-DPPs），这是一种简单而强大的基线方法，将确定性点过程（DPP）扩展到了多源场景。通过基于流形表示的统一相似矩阵建模MS-DPP，并引入切线归一化以反映上下文信息。广泛的实验验证了所提方法的有效性，相关代码已公开发布。
Abstract: Result diversification (RD) is a crucial technique in Text-to-Image Retrieval for enhancing the efficiency of a practical application. Conventional methods focus solely on increasing the diversity metric of image appearances. However, the diversity metric and its desired value vary depending on the application, which limits the applications of RD. This paper proposes a novel task called CDR-CA (Contextual Diversity Refinement of Composite Attributes). CDR-CA aims to refine the diversities of multiple attributes, according to the application's context. To address this task, we propose Multi-Source DPPs, a simple yet strong baseline that extends the Determinantal Point Process (DPP) to multi-sources. We model MS-DPP as a single DPP model with a unified similarity matrix based on a manifold representation. We also introduce Tangent Normalization to reflect contexts. Extensive experiments demonstrate the effectiveness of the proposed method. Our code is publicly available at https://github.com/NEC-N-SOGI/msdpp.
Paper link https://arxiv.org/pdf/2507.06654v1.pdf
Code link https://github.com/nec-n-sogi/msdpp
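
For readers new to determinantal point processes, the sketch below illustrates the general DPP idea the abstract builds on: a subset is scored by the determinant of its similarity submatrix, so near-duplicate items lower the score. It is not the authors' MS-DPP implementation; the function names, the greedy selection rule, and the toy similarity matrix are all assumptions made for illustration.

```python
def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def submatrix(kernel, idx):
    """Rows and columns of `kernel` restricted to the indices in `idx`."""
    return [[kernel[i][j] for j in idx] for i in idx]


def greedy_dpp(kernel, k):
    """Greedily add the item that maximizes the determinant of the selection."""
    selected = []
    for _ in range(k):
        candidates = [i for i in range(len(kernel)) if i not in selected]
        best = max(candidates, key=lambda i: det(submatrix(kernel, selected + [i])))
        selected.append(best)
    return selected


# Toy similarity matrix (hypothetical): items 0 and 1 are near-duplicates,
# item 2 is dissimilar to both.
KERNEL = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

selection = greedy_dpp(KERNEL, 2)
```

With this toy matrix the greedy selector keeps item 0 and then prefers item 2 over the near-duplicate item 1, because the pair (0, 1) has a much smaller determinant than the pair (0, 2).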

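Paper title MoFE-Time: Mixture of Frequency Domain Experts for Time-Series Forecasting Models

Summary: Time series forecasting plays a key role in many applications. With the rapid progress of large language models (LLMs), using them as the foundational architecture for time series modeling has attracted wide attention. Although existing models have had some success, they rarely model both time and frequency characteristics within a pretraining-finetuning paradigm, which leads to poor performance on complex time series forecasting that requires both periodicity and prior pattern knowledge of the signal.

To address this, the authors propose MoFE-Time, a time series forecasting model that integrates time-domain and frequency-domain features within a Mixture of Experts (MoE) network. A pretraining-finetuning paradigm serves as the training framework, effectively transferring prior pattern knowledge between pretraining and finetuning datasets with different periodicity distributions. The method introduces frequency and time cells as experts after the attention modules and uses the MoE routing mechanism to build multidimensional sparse representations of the input signal.

On six public benchmarks, MoFE-Time reduces MSE and MAE by 6.95% and 6.02%, respectively, compared with the representative method Time-MoE, achieving new state-of-the-art performance. Beyond existing benchmarks, the authors also built a proprietary dataset, NEV-sales, derived from real-world business scenarios. MoFE-Time achieves strong results on this dataset, further demonstrating its effectiveness in practical commercial applications.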

Abstract: As a prominent data modality task, time series forecasting plays a pivotal role in diverse applications. With the remarkable advancements in Large Language Models (LLMs), the adoption of LLMs as the foundational architecture for time series modeling has gained significant attention. Although existing models achieve some success, they rarely model both time and frequency characteristics in a pretraining-finetuning paradigm, leading to suboptimal performance in predictions of complex time series, which require modeling both periodicity and prior pattern knowledge of signals. We propose MoFE-Time, an innovative time series forecasting model that integrates time and frequency domain features within a Mixture of Experts (MoE) network. Moreover, we use the pretraining-finetuning paradigm as our training framework to effectively transfer prior pattern knowledge across pretraining and finetuning datasets with different periodicity distributions. Our method introduces both frequency and time cells as experts after attention modules and leverages the MoE routing mechanism to construct multidimensional sparse representations of input signals. In experiments on six public benchmarks, MoFE-Time has achieved new state-of-the-art performance, reducing MSE and MAE by 6.95% and 6.02% compared to the representative method Time-MoE. Beyond the existing evaluation benchmarks, we have developed a proprietary dataset, NEV-sales, derived from real-world business scenarios. Our method achieves outstanding results on this dataset, underscoring the effectiveness of the MoFE-Time model in practical commercial applications.
Paper link https://arxiv.org/pdf/2507.06502v1.pdf
Code link https://github.com/alg-znsy-li/mofe-time

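Paper title InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior

Summary: In behavioral finance, aligning large language models (LLMs) with investor decision-making processes, particularly under herd behavior, faces a key challenge: the scarcity of the real-user data needed for supervised fine-tuning (SFT). Although SFT can bridge the gap between LLM outputs and human behavioral patterns, it relies on large-scale real data, which brings high data collection costs and privacy risks. To address this, the authors propose InvestAlign, a new framework that builds high-quality SFT datasets by leveraging theoretical solutions to similar, simpler optimal investment problems rather than handling complex real investment scenarios. Theoretical analysis shows that training LLMs on InvestAlign-generated data achieves faster parameter convergence than training on real-user data, indicating higher learning efficiency. The authors also develop InvestAgent, an LLM agent fine-tuned with InvestAlign, which aligns significantly more closely with real-user data than pre-SFT models on both simple and complex investment problems. This suggests that InvestAlign has the potential to address complex optimal investment problems and to better align LLMs with investor decision-making processes. The code is publicly available on GitHub.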
Abstract: Aligning Large Language Models (LLMs) with investor decision-making processes under herd behavior is a critical challenge in behavioral finance, which grapples with a fundamental limitation: the scarcity of real-user data needed for Supervised Fine-Tuning (SFT). While SFT can bridge the gap between LLM outputs and human behavioral patterns, its reliance on massive authentic data imposes substantial collection costs and privacy risks. We propose InvestAlign, a novel framework that constructs high-quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems rather than complex scenarios. Our theoretical analysis demonstrates that training LLMs with InvestAlign-generated data achieves faster parameter convergence than using real-user data, suggesting superior learning efficiency. Furthermore, we develop InvestAgent, an LLM agent fine-tuned with InvestAlign, which demonstrates significantly closer alignment to real-user data than pre-SFT models in both simple and complex investment problems. This highlights our proposed InvestAlign as a promising approach with the potential to address complex optimal investment problems and align LLMs with investor decision-making processes under herd behavior. Our code is publicly available at https://github.com/thu-social-network-research-group/InvestAlign.
Paper link https://arxiv.org/pdf/2507.06528v1.pdf
Code link https://github.com/thu-social-network-research-group/InvestAlign

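Paper title UQLM: A Python Package for Uncertainty Quantification in Large Language Models

Summary: UQLM is a Python package for detecting hallucinations produced by large language models (LLMs) using state-of-the-art uncertainty quantification (UQ) techniques. Hallucinations, cases in which LLMs generate false or misleading content, pose a significant challenge to the safety and trustworthiness of downstream applications. The toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. As an off-the-shelf solution, UQLM can be easily integrated to improve the reliability of LLM outputs.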
Abstract: Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.
Paper link https://arxiv.org/pdf/2507.06196v1.pdf
Code link https://github.com/cvs-health/uqlm
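
To make "response-level confidence scores ranging from 0 to 1" concrete, here is a minimal sketch of one simple family of such scores: resample the model several times and measure agreement with the original answer. This is not the UQLM API. The function names are hypothetical, and exact-match agreement is a crude stand-in for the scorers the package actually provides.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def consistency_confidence(original, samples):
    """Fraction of resampled responses that match the original answer.

    Returns a value in [0, 1]; 0.0 when no samples are available.
    """
    if not samples:
        return 0.0
    target = normalize(original)
    matches = sum(1 for s in samples if normalize(s) == target)
    return matches / len(samples)


# Hypothetical responses to the same prompt, sampled at non-zero temperature.
score = consistency_confidence("Paris", ["paris", "Paris ", "Lyon", "PARIS"])
```

Here three of the four resampled answers agree with the original, so the score is 0.75. A low score suggests the model is unsure and the answer deserves a closer look.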

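Paper title Predicting Graph Structure via Adapted Flux Balance Analysis

Summary: Many dynamic processes, such as telecommunication and transport networks, can be described as discrete time series of graphs. Modeling these time series makes it possible to predict graph structure at future time steps, which is useful for applications such as anomaly detection. Existing approaches to graph prediction have limitations, for example assuming that the vertices do not change between consecutive graphs. To address this, the authors propose combining time series prediction methods with an adapted form of flux balance analysis (FBA). FBA is a linear programming method originating in biochemistry, adapted here to incorporate various constraints applicable to growing graphs. Empirical evaluations on synthetic datasets (built with the Preferential Attachment model) and real datasets (UCI Message, HePH, Facebook, Bitcoin) demonstrate the effectiveness of the proposed approach.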
Abstract: Many dynamic processes such as telecommunication and transport networks can be described through discrete time series of graphs. Modelling the dynamics of such time series enables prediction of graph structure at future time steps, which can be used in applications such as detection of anomalies. Existing approaches for graph prediction have limitations such as assuming that the vertices do not change between consecutive graphs. To address this, we propose to exploit time series prediction methods in combination with an adapted form of flux balance analysis (FBA), a linear programming method originating from biochemistry. FBA is adapted to incorporate various constraints applicable to the scenario of growing graphs. Empirical evaluations on synthetic datasets (constructed via Preferential Attachment model) and real datasets (UCI Message, HePH, Facebook, Bitcoin) demonstrate the efficacy of the proposed approach.
Paper link https://arxiv.org/pdf/2507.05806v1.pdf
Code link https://github.com/sevvandi/netseer

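Paper title Is Diversity All You Need for Scalable Robotic Manipulation?

Summary: Data scaling has achieved remarkable success in natural language processing and computer vision, but the principles of effective data scaling in robotic manipulation remain unclear. This study examines the nuanced role of data diversity in robot learning along three key dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates), and challenges the conventional intuition that "more diverse is better". Through extensive experiments on multiple robot platforms, the authors find that (1) task diversity matters more than the number of demonstrations per task, and helps transfer from diverse pre-training tasks to new downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer: models trained on high-quality single-embodiment data transfer efficiently to different platforms and scale better during fine-tuning; and (3) expert diversity, which arises from individual operating preferences and stochastic variation in human demonstrations, can confound policy learning, with velocity multimodality as a key factor. Based on this insight, the authors propose a distribution debiasing method that reduces velocity ambiguity; the resulting GO-1-Pro improves performance by 15%, equivalent to using 2.5 times as much pre-training data. These findings offer new perspectives and practical guidance for scaling robotic manipulation datasets effectively.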
Abstract: Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates), challenging the conventional intuition of "more diverse is better". Through extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer: models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing a more desirable scaling property during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity; the resulting GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.
Paper link https://arxiv.org/pdf/2507.06219v1.pdf
Code link https://github.com/opendrivelab/agibot-world

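Paper title Automated Neuron Labelling Enables Generative Steering and Interpretability in Protein Language Models

Summary: This paper introduces an automated framework that labels every neuron in a protein language model (PLM) with biologically grounded natural language descriptions. Unlike prior approaches that rely on sparse autoencoders or manual annotation, the method scales to hundreds of thousands of neurons and reveals that individual neurons are selectively sensitive to diverse biochemical and structural properties. The researchers also develop a neuron activation-guided steering method for generating proteins with desired traits, enabling convergence to target biochemical properties (such as molecular weight and instability index) as well as secondary and tertiary structural motifs (including alpha helices and canonical zinc fingers). Finally, analysis of the labeled neurons across models of different sizes reveals PLM scaling laws and a structured distribution in neuron space.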
Abstract: Protein language models (PLMs) encode rich biological information, yet their internal neuron representations are poorly understood. We introduce the first automated framework for labeling every neuron in a PLM with biologically grounded natural language descriptions. Unlike prior approaches relying on sparse autoencoders or manual annotation, our method scales to hundreds of thousands of neurons, revealing that individual neurons are selectively sensitive to diverse biochemical and structural properties. We then develop a novel neuron activation-guided steering method to generate proteins with desired traits, enabling convergence to target biochemical properties like molecular weight and instability index as well as secondary and tertiary structural motifs, including alpha helices and canonical Zinc Fingers. We finally show that analysis of labeled neurons in different model sizes reveals PLM scaling laws and a structured neuron space distribution.
Paper link https://arxiv.org/pdf/2507.06458v1.pdf
Code link https://github.com/arjun-banerjee/plmneuron

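Paper title eegFloss: A Python package for refining sleep EEG recordings using machine learning models

Summary: Electroencephalography (EEG) is an important tool for monitoring brain activity, especially in sleep research, where it is the primary modality of polysomnography. However, EEG signals are prone to artifacts caused by internal and external interference, which is a particular problem for automatic sleep staging and can lead to erroneous sleep scores. To address this, the paper introduces eegFloss, an open-source Python package that uses eegUsability, a new machine learning model, to detect segments with artifacts in sleep EEG recordings. eegUsability was trained and evaluated on Zmax headband EEG data from 15 participants over 127 nights. It shows solid overall classification performance (F1-score of approximately 0.85, Cohen's kappa of 0.78), achieves a high recall of approximately 94% in identifying channel-wise usable EEG data, and is applicable beyond the Zmax device. eegFloss also offers other features, such as automatic time-in-bed detection using another machine learning model named eegMobility, filtering of certain artifacts, and generation of hypnograms and sleep statistics. By addressing a fundamental challenge faced by most sleep studies, eegFloss can improve the precision and rigor of their analyses and the accuracy and reliability of their results.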
Abstract: Electroencephalography (EEG) allows monitoring of brain activity, providing insights into the functional dynamics of various brain regions and their roles in cognitive processes. EEG is a cornerstone in sleep research, serving as the primary modality of polysomnography, the gold standard in the field. However, EEG signals are prone to artifacts caused by both internal (device-specific) factors and external (environmental) interferences. As sleep studies are becoming larger, most rely on automatic sleep staging, a process highly susceptible to artifacts, leading to erroneous sleep scores. This paper addresses this challenge by introducing eegFloss, an open-source Python package to utilize eegUsability, a novel machine learning (ML) model designed to detect segments with artifacts in sleep EEG recordings. eegUsability has been trained and evaluated on manually artifact-labeled EEG data collected from 15 participants over 127 nights using the Zmax headband. It demonstrates solid overall classification performance (F1-score is approximately 0.85, Cohen's kappa is 0.78), achieving a high recall rate of approximately 94% in identifying channel-wise usable EEG data, and extends beyond Zmax. Additionally, eegFloss offers features such as automatic time-in-bed detection using another ML model named eegMobility, filtering out certain artifacts, and generating hypnograms and sleep statistics. By addressing a fundamental challenge faced by most sleep studies, eegFloss can enhance the precision and rigor of their analysis as well as the accuracy and reliability of their outcomes.
Paper link https://arxiv.org/pdf/2507.06433v1.pdf
Code link https://github.com/Niloy333/eegFloss
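
The abstract reports an F1-score and Cohen's kappa. For readers less familiar with kappa, the sketch below shows how both metrics are computed from per-segment labels. The labels are toy values invented for illustration (1 = usable, 0 = artifact), not data from the paper, and the helper names are hypothetical rather than part of eegFloss.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(y_true, y_pred):
    """Agreement between two label sequences, corrected for chance agreement."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels for eight EEG segments (hypothetical).
Y_TRUE = [1, 1, 1, 1, 0, 0, 0, 0]
Y_PRED = [1, 1, 1, 0, 0, 0, 0, 1]

f1 = f1_score(Y_TRUE, Y_PRED)
kappa = cohen_kappa(Y_TRUE, Y_PRED)
```

In this toy case six of eight segments agree (75%), but half of that agreement would be expected by chance, so kappa is 0.5 while F1 is 0.75. This is why kappa is usually reported alongside accuracy-style metrics.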

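Paper title PaddleOCR 3.0 Technical Report

Summary: PaddleOCR 3.0 is an Apache-licensed open-source toolkit for OCR and document parsing. To meet the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 presents three major solutions: (1) PP-OCRv5 for multilingual text recognition, (2) PP-StructureV3 for hierarchical document parsing, and (3) PP-ChatOCRv4 for key information extraction. These models have fewer than 100 million parameters, yet they are competitive with mainstream vision-language models (VLMs) in accuracy and efficiency, rivaling even billion-parameter VLMs. Beyond its high-quality OCR model library, PaddleOCR 3.0 provides efficient tools for training, inference, and deployment, supports heterogeneous hardware acceleration, and lets developers easily build intelligent document applications.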
Abstract: This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. To address the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 presents three major solutions: (1) PP-OCRv5 for multilingual text recognition, (2) PP-StructureV3 for hierarchical document parsing, and (3) PP-ChatOCRv4 for key information extraction. Compared to mainstream vision-language models (VLMs), these models with fewer than 100 million parameters achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs. In addition to offering a high-quality OCR model library, PaddleOCR 3.0 provides efficient tools for training, inference, and deployment, supports heterogeneous hardware acceleration, and enables developers to easily build intelligent document applications.
Paper link https://arxiv.org/pdf/2507.05595v1.pdf
Code link https://github.com/PaddlePaddle/PaddleOCR

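Paper title SE(3)-Equivariant Diffusion Policy in Spherical Fourier Space

Summary: The paper proposes Spherical Diffusion Policy (SDP), which achieves SE(3) equivariance by embedding the states, actions, and denoising process in spherical Fourier space, so that trajectories adapt to transformations of the 3D scene. To condition the action denoising process, it introduces novel spherical FiLM layers, and it designs a spherical denoising temporal U-Net that achieves spatiotemporal equivariance efficiently. As a result, SDP is end-to-end SE(3) equivariant and generalizes robustly across transformed 3D scenes. Experiments show that SDP delivers large performance improvements over strong baselines on 20 simulation tasks and 5 physical robot tasks, including single-arm and bimanual manipulation. The code is publicly available on GitHub.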
Abstract: Diffusion Policies are effective at learning closed-loop manipulation policies from human demonstrations but generalize poorly to novel arrangements of objects in 3D space, hurting real-world performance. To address this issue, we propose Spherical Diffusion Policy (SDP), an SE(3) equivariant diffusion policy that adapts trajectories according to 3D transformations of the scene. Such equivariance is achieved by embedding the states, actions, and the denoising process in spherical Fourier space. Additionally, we employ novel spherical FiLM layers to condition the action denoising process equivariantly on the scene embeddings. Lastly, we propose a spherical denoising temporal U-net that achieves spatiotemporal equivariance with computational efficiency. In the end, SDP is end-to-end SE(3) equivariant, allowing robust generalization across transformed 3D scenes. SDP demonstrates a large performance improvement over strong baselines in 20 simulation tasks and 5 physical robot tasks including single-arm and bi-manual embodiments. Code is available at https://github.com/amazon-science/Spherical_Diffusion_Policy.
Paper link https://paperswithcode.com/paper/se-3-equivariant-diffusion-policy-in
Code link https://github.com/amazon-science/Spherical_Diffusion_Policy

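Paper title Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

Summary: This paper explores a constructive approach to building models that differs from the conventional way of scaling large language models (LLMs). Built on non-trainable, deterministic input embeddings, the approach uses a fixed representational substrate to enable seamless modular composition and progressive layer-wise growth. First, the study shows that specialist models trained on different datasets (such as Russian and Chinese text) can be merged after training into a single, more capable Mixture-of-Experts (MoE) model simply by averaging their output logits; the merged model improves on reasoning benchmarks such as MMLU, surpassing its constituent experts without catastrophic forgetting. Second, it proposes a layer-wise constructive training method in which a deep Transformer is "grown" by progressively stacking and training one layer at a time. The method shows stable convergence and a clear correlation between model depth and complex reasoning abilities, such as those required for SQuAD. These findings suggest that a shift from monolithic optimization toward a more biological or constructive paradigm of AI development is feasible, one in which complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. All code and models are publicly released to facilitate further research.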
Abstract: The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
Paper link https://arxiv.org/pdf/2507.07129v1.pdf
Code link https://github.com/AVBochkov/Embeddings
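
The abstract states that expert models are merged "by simply averaging their output logits". A minimal sketch of that operation follows, assuming the experts share one vocabulary so their logit vectors line up index by index. The function names and the toy logits are hypothetical and not taken from the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_logits(expert_logits):
    """Element-wise mean of the experts' next-token logit vectors."""
    count = len(expert_logits)
    return [sum(column) / count for column in zip(*expert_logits)]


# Toy next-token logits over a 3-token vocabulary (hypothetical values).
expert_a = [2.0, 0.0, -1.0]
expert_b = [0.0, 1.0, 3.0]

merged = average_logits([expert_a, expert_b])
probs = softmax(merged)
```

Averaging happens before the softmax, so the merged distribution is not the same as averaging the two experts' probabilities. No expert weights are modified, which is consistent with the abstract's claim of zero architectural modification.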

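Paper title Skywork-R1V3 Technical Report

Summary: Skywork-R1V3 is an advanced open-source vision-language model that breaks new ground in visual reasoning by effectively transferring reasoning ability from text-only large language models to visual tasks. Its strong performance stems mainly from a carefully designed post-training reinforcement learning (RL) framework, which activates and enhances the model's reasoning ability without additional continued pre-training. The study also reveals the important role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, it introduces a new indicator of reasoning capability, the entropy of critical reasoning tokens, which proves highly effective for checkpoint selection during RL training. On MMMU, Skywork-R1V3 improves from 64.3% to 76.0%, matching entry-level human performance. Even at 38B parameters, the model rivals top closed-source vision-language models. The approach also transfers mathematical reasoning to reasoning tasks in other subjects. The study further includes an analysis of curriculum learning and reinforcement fine-tuning strategies, along with a broader discussion of multimodal reasoning. Skywork-R1V3 demonstrates the strong potential of RL for advancing open-source vision-language model capabilities.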
Abstract: We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training reinforcement learning (RL) framework, which effectively activates and enhances the model's reasoning ability, without the need for additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
Paper link https://arxiv.org/pdf/2507.06167v1.pdf
Code link https://github.com/SkyworkAI/Skywork-R1V

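Paper title CLIP-Guided Backdoor Defense through Entropy-Based Poisoned Dataset Separation

Summary: Deep neural networks (DNNs) are vulnerable to backdoor attacks, in which attackers poison the training data to implant a backdoor into the victim model. Existing backdoor defenses for poisoned data often suffer from high computational cost or low effectiveness against advanced attacks such as clean-label and clean-image backdoors. To address these problems, the authors propose CLIP-Guided backdoor Defense (CGD), an efficient and effective defense. The method uses a publicly available CLIP model to identify inputs that are likely clean or poisoned, then retrains the model on these inputs with CLIP's logits as guidance, effectively neutralizing the backdoor. Experiments on 4 datasets and 11 attack types show that CGD reduces attack success rates to below 1% while maintaining high clean accuracy, with a maximum drop of only 0.3%, outperforming existing defenses. The study also shows that clean-data-based defenses can be adapted to poisoned data using CGD. CGD maintains low attack success rates even with a weaker CLIP model or when CLIP itself has been backdoored, demonstrating its efficiency, effectiveness, and applicability in real-world backdoor defense scenarios. The code is open source: https://github.com/binyxu/CGD.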
Abstract: Deep Neural Networks (DNNs) are susceptible to backdoor attacks, where adversaries poison training data to implant a backdoor into the victim model. Current backdoor defenses on poisoned data often suffer from high computational costs or low effectiveness against advanced attacks like clean-label and clean-image backdoors. To address these issues, we introduce CLIP-Guided backdoor Defense (CGD), an efficient and effective method that mitigates various backdoor attacks. CGD utilizes a publicly accessible CLIP model to identify inputs that are likely to be clean or poisoned. It then retrains the model with these inputs, using CLIP's logits as guidance to effectively neutralize the backdoor. Experiments on 4 datasets and 11 attack types demonstrate that CGD reduces attack success rates (ASRs) to below 1% while maintaining clean accuracy (CA) with a maximum drop of only 0.3%, outperforming existing defenses. Additionally, we show that clean-data-based defenses can be adapted to poisoned data using CGD. Also, CGD exhibits strong robustness, maintaining low ASRs even when employing a weaker CLIP model or when CLIP itself is compromised by a backdoor. These findings underscore CGD's exceptional efficiency, effectiveness, and applicability for real-world backdoor defense scenarios. Code: https://github.com/binyxu/CGD.
Paper link https://arxiv.org/pdf/2507.05113v1.pdf
Code link https://github.com/binyxu/CGD

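Paper title AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics

Summary: In medical diagnostics, biomedical datasets often suffer from sample imbalance and strict privacy constraints, which hinder the development of accurate machine learning models. Generating synthetic images can improve data availability while preserving patient privacy, but producing synthetic images of sufficient quality for training robust classifiers remains challenging. This study focuses on the classification of single white blood cells, a key component in diagnosing hematological diseases such as acute myeloid leukemia (AML). The authors show how synthetic images, generated with a Stable Diffusion model fine-tuned using LoRA weights and guided by real few-shot samples, can improve classifier performance when data is limited. On a small and highly imbalanced real dataset, adding 5000 synthetic images per class raised the accuracy of a ResNet classifier from 27.3% to 78.4% (+51.1%) and that of a CLIP-based classifier from 61.8% to 76.8% (+15.0%). The synthetic images closely resemble real images and help overcome dataset limitations and improve model generalization. The results indicate that synthetic images are a useful tool in biomedical research, improving machine learning models and facilitating medical diagnosis and research.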
Abstract: Biomedical datasets often contain a large sample imbalance and are subject to strict privacy constraints, which together hinder the development of accurate machine learning models. One potential solution is to generate synthetic images, as this can improve data availability while preserving patient privacy. However, it remains difficult to generate synthetic images of sufficient quality for training robust classifiers. In this work, we focus on the classification of single white blood cells, a key component in the diagnosis of hematological diseases such as acute myeloid leukemia (AML), a severe blood cancer. We demonstrate how synthetic images generated with a fine-tuned stable diffusion model using LoRA weights, when guided by real few-shot samples of the target white blood cell classes, can enhance classifier performance for limited data. When training a ResNet classifier, accuracy increased from 27.3% to 78.4% (+51.1%) by adding 5000 synthetic images per class to a small and highly imbalanced real dataset. For a CLIP-based classifier, the accuracy improved from 61.8% to 76.8% (+15.0%). The synthetic images are highly similar to real images, and they can help overcome dataset limitations, enhancing model generalization. Our results establish synthetic images as a tool in biomedical research, improving machine learning models, and facilitating medical diagnosis and research.
Paper link https://arxiv.org/pdf/2507.05063v1.pdf
Code link https://github.com/jancarreras24/final-degree-project

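Paper title Differential Attention for Multimodal Crisis Event Analysis

Summary: Social media can be a valuable source of information during crisis events, and the multimodal data streams that users post can be critical for real-time humanitarian response. However, extracting meaningful information from this large, noisy data stream and integrating heterogeneous data remains a major challenge. This study explores vision language models (VLMs) and advanced fusion strategies to improve the classification of crisis data across three different tasks. The authors use LLaVA-generated text to improve text-image alignment and use CLIP-based vision and text embeddings, which outperform traditional models without task-specific fine-tuning. To further refine multimodal fusion, they employ Guided Cross Attention (Guided CA) combined with the Differential Attention mechanism, which emphasizes critical information and filters out irrelevant content. The results show that Differential Attention improves classification performance, while Guided CA remains highly effective at aligning multimodal features. Extensive experiments on the CrisisMMD benchmark dataset show that the combination of pretrained VLMs, enriched textual descriptions, and adaptive fusion strategies consistently outperforms state-of-the-art models in classification accuracy, yielding more reliable and interpretable models for three tasks that are key to disaster response. The code is available at https://github.com/Munia03/Multimodal_Crisis_Event.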
Abstract: Social networks can be a valuable source of information during crisis events. In particular, users can post a stream of multimodal data that can be critical for real-time humanitarian response. However, effectively extracting meaningful information from this large and noisy data stream and integrating heterogeneous data remains a formidable challenge. In this work, we explore vision language models (VLMs) and advanced fusion strategies to enhance the classification of crisis data in three different tasks. We incorporate LLaVA-generated text to improve text-image alignment. Additionally, we leverage Contrastive Language-Image Pretraining (CLIP)-based vision and text embeddings, which, without task-specific fine-tuning, outperform traditional models. To further refine multimodal fusion, we employ Guided Cross Attention (Guided CA) and combine it with the Differential Attention mechanism to enhance feature alignment by emphasizing critical information while filtering out irrelevant content. Our results show that while Differential Attention improves classification performance, Guided CA remains highly effective in aligning multimodal features. Extensive experiments on the CrisisMMD benchmark data set demonstrate that the combination of pretrained VLMs, enriched textual descriptions, and adaptive fusion strategies consistently outperforms state-of-the-art models in classification accuracy, contributing to more reliable and interpretable models for three different tasks that are crucial for disaster response. Our code is available at https://github.com/Munia03/Multimodal_Crisis_Event.
Paper link https://arxiv.org/pdf/2507.05165v1.pdf
Code link https://github.com/munia03/multimodal_crisis_event

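Paper title Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Summary: This paper investigates where semantic representation resides in large language models (LLMs) and challenges the conventional view that trainable input embeddings serve as foundational "meaning vectors". The authors build Transformer models whose embedding layer is entirely frozen, with vectors derived not from data but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings stay fixed throughout training. The method is compatible with any tokenizer, and the authors introduce a new Unicode-centric tokenizer to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, the models converge, generate coherent text, and outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. The authors attribute this to "representational interference" in conventional models, where the embedding layer must learn both structural and semantic features. The results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. All code and models are publicly released to foster further research.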
Abstract: Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
Paper link https://arxiv.org/pdf/2507.04886v1.pdf
Code link https://github.com/AVBochkov/Embeddings

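Paper title Estimating Object Physical Properties from RGB-D Vision and Depth Robot Sensors Using Deep Learning

Summary: Inertial mass plays an important role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object's mass can significantly improve performance on a range of robotic tasks. However, mass estimation using only vision sensors remains relatively underexplored. This paper proposes a new approach that combines sparse point-cloud data from depth images with RGB images to estimate object mass. The authors evaluate several point-cloud processing architectures as well as RGB-only methods. To address the limited training data, they build a synthetic dataset from ShapeNetSem 3D models, simulating RGBD images via a Kinect camera. The synthetic data is used to train an image generation model that produces dense depth maps, which are then used to augment an existing dataset of images paired with mass values. The approach significantly outperforms existing benchmarks on all evaluated metrics. Code for data generation, depth estimator training, and mass estimator training is available online.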
Abstract: Inertial mass plays a crucial role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object's mass before interaction can significantly enhance the performance of various robotic tasks. However, mass estimation using only vision sensors is a relatively underexplored area. This paper proposes a novel approach combining sparse point-cloud data from depth images with RGB images to estimate the mass of objects. We evaluate a range of point-cloud processing architectures, alongside RGB-only methods. To overcome the limited availability of training data, we create a synthetic dataset using ShapeNetSem 3D models, simulating RGBD images via a Kinect camera. This synthetic data is used to train an image generation model for estimating dense depth maps, which we then use to augment an existing dataset of images paired with mass values. Our approach significantly outperforms existing benchmarks across all evaluated metrics. The data generation (https://github.com/RavineWindteer/ShapenetSem-to-RGBD) as well as the training of the depth estimator (https://github.com/RavineWindteer/GLPDepth-Edited) and the mass estimator (https://github.com/RavineWindteer/Depth-mass-estimator) are available online.
Paper link https://arxiv.org/pdf/2507.05029v1.pdf
Code link https://github.com/RavineWindteer/Depth-mass-estimator, https://github.com/ravinewindteer/glpdepth-edited, https://github.com/ravinewindteer/shapenetsem-to-rgbd

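Paper title any4: Learned 4-bit Numeric Representation for LLMs

Summary: The authors present any4, a learned 4-bit weight quantization solution for large language models (LLMs) that provides arbitrary numeric representations without requiring pre-processing of weights or activations. Compared with other 4-bit numeric representation types such as int4, fp4, and nf4, any4 yields higher accuracy across a range of model sizes, generations, and families (such as Llama 2, Llama 3, Mistral, and Mixtral). Although any4 requires no preprocessing of weights or activations, it is also competitive with techniques that do, such as AWQ and GPTQ. The authors also experiment with any3 and any2 and show competitiveness at lower bit widths. They further show that calibration can use a single curated diverse sample rather than hundreds of samples from a dataset, as most quantization methods do. They also open-source tinygemm, a latency-optimized GPU matrix multiplication library that implements any4 and other common quantization methods using a GPU-efficient lookup table strategy. The code is open source at https://github.com/facebookresearch/any4.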
Abstract: We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4 and nf4, as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bits. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset as done in most quantization approaches. We also open source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs, that implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods. We open source our code at https://github.com/facebookresearch/any4.
Paper link https://arxiv.org/pdf/2507.04610v1.pdf
Code link https://github.com/facebookresearch/any4
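
To illustrate what a learned 4-bit numeric representation with a lookup table can look like, the sketch below stores each weight as an index into a 16-entry codebook whose values are fitted to the data. The fitting procedure shown is plain 1-D k-means, which is an assumption made for illustration; the abstract does not describe the authors' learning procedure, and none of these function names come from the any4 or tinygemm code.

```python
def nearest(codebook, value):
    """Index of the codebook entry closest to `value`."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def learn_codebook(weights, levels=16, iterations=20):
    """Fit `levels` representative values with 1-D k-means (Lloyd's algorithm).

    With iterations=0 this returns evenly spaced values, similar in spirit
    to a uniform integer grid.
    """
    lo, hi = min(weights), max(weights)
    codebook = [lo + (hi - lo) * i / (levels - 1) for i in range(levels)]
    for _ in range(iterations):
        buckets = [[] for _ in range(levels)]
        for w in weights:
            buckets[nearest(codebook, w)].append(w)
        # Move each entry to the mean of its bucket; keep unused entries as-is.
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return codebook


def quantize(weights, codebook):
    """Replace each weight by a 4-bit index (0..15) into the codebook."""
    return [nearest(codebook, w) for w in weights]


def dequantize(indices, codebook):
    """Recover approximate weights with a table lookup."""
    return [codebook[i] for i in indices]


def squared_error(weights, codebook):
    restored = dequantize(quantize(weights, codebook), codebook)
    return sum((w - r) ** 2 for w, r in zip(weights, restored))


# Toy weights for one row of a weight matrix (hypothetical values).
WEIGHTS = [-1.0, -0.9, -0.1, 0.0, 0.05, 0.8, 2.5]

uniform = learn_codebook(WEIGHTS, iterations=0)
learned = learn_codebook(WEIGHTS)
indices = quantize(WEIGHTS, learned)
```

Because the codebook values are free to sit wherever the weights cluster, the reconstruction error of the learned table is never worse than that of the evenly spaced table it started from. Dequantization is a single table lookup per weight, which is the operation a lookup-table GPU kernel would accelerate.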

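Paper title The Extended SONICOM HRTF Dataset and Spatial Audio Metrics Toolbox

Summary: Headphone-based spatial audio uses head-related transfer functions (HRTFs) to simulate real-world acoustic environments. HRTFs differ between individuals because of personal morphology, which shapes how sound waves interact with the body before reaching the eardrums. The extended SONICOM HRTF dataset builds on the version released in 2023: the number of measured subjects has increased to 300, and demographic information is provided for a subset of participants, giving context for the dataset's population and relevance. The dataset includes synthesized HRTFs for 200 of the 300 subjects, generated with Mesh2HRTF, along with pre-processed 3D scans of the head and ears optimized for HRTF synthesis. This rich dataset supports rapid, iterative optimization of HRTF synthesis algorithms and allows large amounts of data to be generated automatically. The optimized scans make morphological modifications seamless, helping to show how anatomical changes affect HRTFs, and the larger sample size makes machine learning approaches more effective. To support analysis, the authors also introduce the Spatial Audio Metrics (SAM) Toolbox, a Python package for efficient analysis and visualization of HRTF data that offers customizable tools for advanced research. Together, the extended dataset and toolbox provide a comprehensive resource for advancing personalized spatial audio research and development.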
英文摘要： Headphone-based spatial audio uses head-related transfer functions (HRTFs) to simulate real-world acoustic environments. HRTFs are unique to everyone, due to personal morphology, shaping how sound waves interact with the body before reaching the eardrums. Here we present the extended SONICOM HRTF dataset which expands on the previous version released in 2023. The total number of measured subjects has now been increased to 300, with demographic information for a subset of the participants, providing context for the dataset's population and relevance. The dataset incorporates synthesised HRTFs for 200 of the 300 subjects, generated using Mesh2HRTF, alongside pre-processed 3D scans of the head and ears, optimised for HRTF synthesis. This rich dataset facilitates rapid and iterative optimisation of HRTF synthesis algorithms, allowing the automatic generation of large data. The optimised scans enable seamless morphological modifications, providing insights into how anatomical changes impact HRTFs, and the larger sample size enhances the effectiveness of machine learning approaches. To support analysis, we also introduce the Spatial Audio Metrics (SAM) Toolbox, a Python package designed for efficient analysis and visualisation of HRTF data, offering customisable tools for advanced research. Together, the extended dataset and toolbox offer a comprehensive resource for advancing personalised spatial audio research and development.
论文链接 https://arxiv.org/pdf/2507.05053v1.pdf
代码链接 https://github.com/Katarina-Poole/Spatial-Audio-Metrics

论文标题 EduCoder: An Open-Source Annotation System for Education Transcript Data

中文摘要： EduCoder是一款专为教育对话文本注释设计的开源工具。它针对教育场景中教师与学生、学生间复杂互动的特点，提供了一个支持话语级注释的平台。该工具解决了定义复杂教学特征代码本、支持开放式和分类编码以及结合外部特征（如课程目标和教学价值）来上下文化话语等常见挑战。EduCoder允许研究人员和领域专家协作定义基于观察数据的复杂代码本，同时支持多种注释类型，并提供了多注释者响应的并排比较功能，以提高数据可靠性。此外，该系统是开源的，并配有演示视频。
英文摘要： We introduce EduCoder, a domain-specialized tool designed to support utterance-level annotation of educational dialogue. While general-purpose text annotation tools for NLP and qualitative research abound, few address the complexities of coding education dialogue transcripts -- with diverse teacher-student and peer interactions. Common challenges include defining codebooks for complex pedagogical features, supporting both open-ended and categorical coding, and contextualizing utterances with external features, such as the lesson's purpose and the pedagogical value of the instruction. EduCoder is designed to address these challenges by providing a platform for researchers and domain experts to collaboratively define complex codebooks based on observed data. It incorporates both categorical and open-ended annotation types along with contextual materials. Additionally, it offers a side-by-side comparison of multiple annotators' responses, allowing comparison and calibration of annotations with others to improve data reliability. The system is open-source, with a demo video available.
论文链接 https://arxiv.org/pdf/2507.05385v1.pdf
代码链接 https://github.com/ArthurP-351/EduCoder

论文标题 VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems

中文摘要： 随着生成对抗网络（GANs）和扩散模型等AI生成内容的广泛应用，数字媒体领域经历了革命性的变化，这些模型能够高效且创造性地生成内容。然而，这些技术也模糊了真实图像与AI生成合成图像之间的界限，引发了关于内容真实性和完整性的担忧。现有的许多检测虚假图像的方法主要集中在分类和高分辨率图像上，但往往缺乏决策透明度，使得用户难以理解图像被判定为虚假的原因。为此，研究者提出了VERITAS框架，该系统不仅能够准确判断小尺寸（32x32像素）图像是否由AI生成，还能通过定位人工痕迹及语义分析来解释其分类依据。VERITAS提供了人类可读的解释，描述了合成图像中的关键特征，并展示了零样本合成图像检测任务基础的清晰说明。相关代码和提示可在https://github.com/V-i-g-n-e-s-h-N/VERITAS获取。
英文摘要： The widespread and rapid adoption of AI-generated content, created by models such as Generative Adversarial Networks (GANs) and Diffusion Models, has revolutionized the digital media landscape by allowing efficient and creative content generation. However, these models also blur the difference between real images and AI-generated synthetic images, raising concerns regarding content authenticity and integrity. While many existing solutions to detect fake images focus solely on classification and higher-resolution images, they often lack transparency in their decision-making, making it difficult for users to understand why an image is classified as fake. In this paper, we present VERITAS, a comprehensive framework that not only accurately detects whether a small (32x32) image is AI-generated but also explains why it was classified that way through artifact localization and semantic reasoning. VERITAS produces human-readable explanations that describe key artifacts in synthetic images. We show that this architecture offers clear explanations of the basis of zero-shot synthetic image detection tasks. Code and relevant prompts can be found at https://github.com/V-i-g-n-e-s-h-N/VERITAS .
论文链接 https://arxiv.org/pdf/2507.05146v1.pdf
代码链接 https://github.com/V-i-g-n-e-s-h-N/VERITAS

论文标题 Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

中文摘要： 最近，基于学习的立体匹配网络取得了显著进展，但在跨领域性能方面仍缺乏鲁棒性，主要原因是不同数据集之间的域偏移和不均衡的视差分布。为了解决这一问题，提出了一种名为SMoEStereo的新框架，该框架通过低秩适应（LoRA）和专家混合（MoE）模块的定制化、场景特定融合，将视觉基础模型（VFMs）应用于立体匹配中。SMoEStereo引入了具有自适应秩的MoE-LoRA和具有自适应内核大小的MoE-Adapter。前者能够动态选择最合适的专家以适应跨域的不同场景，后者则向冻结的VFMs中注入归纳偏差，以提高几何特征提取能力。为了减轻计算开销，还设计了一个轻量级决策网络，根据输入复杂度选择性激活MoE模块，在效率与准确性之间取得平衡。大量实验表明，该方法在多个基准测试上展示了最先进的跨域和联合泛化性能，无需针对特定数据集进行调整。代码可在相关链接获取。
英文摘要： Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such models into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt to varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at https://github.com/cocowy1/SMoE-Stereo.
论文链接 https://arxiv.org/pdf/2507.04631v1.pdf
代码链接 https://github.com/cocowy1/smoe-stereo

论文标题 SV-DRR: High-Fidelity Novel View X-Ray Synthesis Using Diffusion Model

中文摘要： X射线成像是快速且成本效益高的工具，用于可视化人体内部解剖结构。多视角X射线成像提供互补信息，增强诊断、干预和教育效果，但采集多个角度的图像会增加辐射暴露并复杂化临床工作流程。为解决这些问题，我们提出了一种新的视图条件扩散模型，可以从单个视角合成多视角X射线图像。与先前方法相比，我们的方法利用扩散变换器保留细节，并采用从弱到强的训练策略以实现稳定的高分辨率图像生成。实验结果表明，我们的方法可以生成更高分辨率的输出，并更好地控制视角。这不仅对临床应用有重要意义，还对医学教育和数据扩展有益，能够创建多样化的高质量数据集，用于训练和分析。代码已在GitHub上公开。
英文摘要： X-ray imaging is a rapid and cost-effective tool for visualizing internal human anatomy. While multi-view X-ray imaging provides complementary information that enhances diagnosis, intervention, and education, acquiring images from multiple angles increases radiation exposure and complicates clinical workflows. To address these challenges, we propose a novel view-conditioned diffusion model for synthesizing multi-view X-ray images from a single view. Unlike prior methods, which are limited in angular range, resolution, and image quality, our approach leverages the Diffusion Transformer to preserve fine details and employs a weak-to-strong training strategy for stable high-resolution image generation. Experimental results demonstrate that our method generates higher-resolution outputs with improved control over viewing angles. This capability has significant implications not only for clinical applications but also for medical education and data extension, enabling the creation of diverse, high-quality datasets for training and analysis. Our code is available at GitHub.
论文链接 https://arxiv.org/pdf/2507.05148v1.pdf
代码链接 https://github.com/xiechun298/sv-drr

论文标题 DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

中文摘要： DreamVLA是一种新的视觉-语言-动作模型，旨在通过整合全面的世界知识来改进机器人操作中的图像生成和动作预测。该模型通过动态区域引导的世界知识预测，结合空间和语义线索，提供了紧凑且全面的表示，以支持动作规划。为了在训练过程中减少动态、空间和语义信息之间的干扰，DreamVLA采用了一种块结构化的注意力机制，这种机制可以屏蔽它们之间的相互注意力，防止信息泄露并保持每种表示的清晰和独立。此外，DreamVLA利用基于扩散的Transformer来解耦动作表示与共享潜在特征，从而建模未来动作的条件分布。实验证明，DreamVLA在真实机器人任务中达到了76.7%的成功率，在CALVIN ABC-D基准测试中平均长度为4.44。
英文摘要： Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
论文链接 https://paperswithcode.com/paper/dreamvla-a-vision-language-action-model
代码链接 https://github.com/Zhangwenyao1/DreamVLA
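
The block-wise structured attention described in the DreamVLA abstract, which masks mutual attention among the dynamic, spatial and semantic representations, can be sketched as a boolean mask. The token layout is an assumption made for illustration: shared context tokens come first, followed by one group of query tokens per knowledge type. The function name `build_block_mask` and the rule that context tokens attend only to context are hypothetical choices, not details taken from the paper.

```python
def build_block_mask(n_context, group_sizes):
    """Build allowed[i][j], which is True when token i may attend to token j.

    Assumed layout: n_context shared context tokens, then one block of tokens
    per entry in group_sizes (for example dynamic, spatial, semantic).
    """
    labels = ["context"] * n_context
    for name, size in group_sizes.items():
        labels += [name] * size
    n = len(labels)
    allowed = [[False] * n for _ in range(n)]
    for i, query_group in enumerate(labels):
        for j, key_group in enumerate(labels):
            if key_group == "context":
                allowed[i][j] = True  # every token may read the shared context
            elif query_group == key_group:
                allowed[i][j] = True  # a group may attend within itself
            # otherwise: cross-group attention stays masked
    return labels, allowed
```

For example, `build_block_mask(2, {"dynamic": 2, "spatial": 1, "semantic": 1})` gives a 6 by 6 mask in which the dynamic tokens see the context and each other but not the spatial or semantic tokens. In an attention layer, the `False` positions would be set to negative infinity before the softmax.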

论文标题 VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting

中文摘要： 最近的大规模视觉语言动作（VLA）模型在自然语言引导的机器人操作任务中表现出色，但在处理训练分布外的新对象或不熟悉环境时，其泛化能力仍然有限。现有方法通过增加深度估计、分割甚至扩散等组件来提高泛化能力，但增加了大量计算开销，导致效率低下。为解决这一问题，本文提出了一种高效且通用的框架VOTE，用于优化和加速VLA模型。具体来说，我们提出了一种无需分词器的微调方法，实现并行精确的动作预测，从而减少计算开销并加快推理速度。此外，我们采用了一种集合投票策略来进行动作采样，显著提高了模型性能和泛化能力。实验结果表明，我们的方法达到了最先进的性能，推理速度提高了35倍，吞吐量达到145 Hz。所有细节和代码将开源提供。
英文摘要： Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, their generalization remains limited when applied to novel objects or unfamiliar environments that lie outside the training distribution. To address this, many existing approaches integrate additional components such as depth estimation, segmentation, or even diffusion to improve generalization, at the cost of adding significant computation overhead, resulting in low efficiency. This motivates the exploration of efficient action prediction methods, which are independent of additional high-level visual representations or diffusion techniques. In this work, we propose VOTE, an efficient and general framework for the optimization and acceleration of VLA models. In detail, we propose a novel tokenizer-free fine-tuning approach for parallel accurate action prediction, which reduces computational overhead and accelerates inference speed. Additionally, we adopt an ensemble voting strategy for the action sampling, which significantly improves model performance and enhances generalization. Experimental results show that our method achieves state-of-the-art performance with 35× faster inference and 145 Hz throughput. All the details and code will be open-sourced.
论文链接 https://arxiv.org/pdf/2507.05116v1.pdf
代码链接 https://github.com/LukeLIN-web/VOTE
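
The VOTE abstract mentions an ensemble voting strategy for action sampling without giving its details. The sketch below shows one plausible form of such a vote, under stated assumptions: earlier predictions for the same time step are compared with the current prediction by cosine similarity, split into an agreeing and a disagreeing set, and the larger set is averaged. The function names, the threshold value and the tie rule are all hypothetical and are not taken from the paper or its repository.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def vote(current, previous, threshold=0.5):
    """Average the majority set of action predictions.

    current:  the newest predicted action vector for this time step
    previous: earlier predictions for the same time step
    Assumption: ties are resolved in favour of the set containing `current`.
    """
    agree = [current]
    disagree = []
    for p in previous:
        if cosine(current, p) >= threshold:
            agree.append(p)
        else:
            disagree.append(p)
    winners = agree if len(agree) >= len(disagree) else disagree
    dim = len(current)
    return [sum(v[d] for v in winners) / len(winners) for d in range(dim)]
```

The intended effect is that a single outlier prediction is outvoted by consistent earlier predictions, while a consistent current prediction is smoothed by averaging.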

论文标题 Exploring Remote Physiological Signal Measurement under Dynamic Lighting Conditions at Night: Dataset, Experiment, and Analysis

中文摘要： 远程光电容积描记术（rPPG）是一种非接触式测量人体生理信号的技术，因其便利性和无创性，在健康监测和情绪识别等领域展现出广泛的应用潜力。近年来，随着大量公开数据集的发布，rPPG算法在理想光照条件下的性能得到了显著提升。然而，当前rPPG方法在夜间动态光照变化的真实场景中的有效性仍不明确，并且缺乏专门针对这种挑战环境的数据集，这严重阻碍了相关研究的进展。为解决这一问题，我们发布了名为DLCN的大规模rPPG数据集，该数据集包含来自98名参与者、约13小时的视频数据及相应的同步生理信号，覆盖了四种代表性的夜间照明场景。DLCN数据集具有高度多样性和真实性，是评估算法在复杂条件下鲁棒性的宝贵资源。基于提出的Happy-rPPG工具包，我们进行了广泛的实验，并全面分析了现有rPPG方法在应用于DLCN时面临的挑战。数据集和代码可在https://github.com/dalaoplan/Happp-rPPG-Toolkit获取。
英文摘要： Remote photoplethysmography (rPPG) is a non-contact technique for measuring human physiological signals. Due to its convenience and non-invasiveness, it has demonstrated broad application potential in areas such as health monitoring and emotion recognition. In recent years, the release of numerous public datasets has significantly advanced the performance of rPPG algorithms under ideal lighting conditions. However, the effectiveness of current rPPG methods in realistic nighttime scenarios with dynamic lighting variations remains largely unknown. Moreover, there is a severe lack of datasets specifically designed for such challenging environments, which has substantially hindered progress in this area of research. To address this gap, we present and release a large-scale rPPG dataset collected under dynamic lighting conditions at night, named DLCN. The dataset comprises approximately 13 hours of video data and corresponding synchronized physiological signals from 98 participants, covering four representative nighttime lighting scenarios. DLCN offers high diversity and realism, making it a valuable resource for evaluating algorithm robustness in complex conditions. Built upon the proposed Happy-rPPG Toolkit, we conduct extensive experiments and provide a comprehensive analysis of the challenges faced by state-of-the-art rPPG methods when applied to DLCN. The dataset and code are publicly available at https://github.com/dalaoplan/Happp-rPPG-Toolkit.
论文链接 https://arxiv.org/pdf/2507.04306v1.pdf
代码链接 https://github.com/dalaoplan/happp-rppg-toolkit, https://github.com/dalaoplan/Happy-rPPG-Toolkit

论文标题 FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection

中文摘要： 本文提出了一种基于CLIP的新型框架，称为强迫提示学习（FA），旨在充分利用分布内（ID）知识以提高分布外（OOD）检测的效果。不同于现有方法主要关注于学习OOD相关的知识，FA通过学习包含更丰富和多样化ID类描述的提示（即强迫提示），来增强对ID图像的识别能力。此外，引入了强迫系数，促使强迫提示能够学习到更加全面且细致的ID类描述。实验结果表明，即使在没有外部辅助数据集的情况下，FA也能显著提高OOD检测性能，并且与CoOp相比，保持了相同的可训练参数数量。大量实验证明，该方法在OOD检测方面优于当前最先进的方法。代码可在https://github.com/0xFAFA/FA获取。
英文摘要： Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the In-Distribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. Our key insight is to learn a prompt (i.e., forced prompt) that contains more diversified and richer descriptions of the ID classes beyond the textual semantics of class labels. Specifically, it promotes better discernment for ID images, by forcing more notable semantic similarity between ID images and the learnable forced prompt. Moreover, we introduce a forced coefficient, encouraging the forced prompt to learn more comprehensive and nuanced descriptions of the ID classes. In this way, FA is capable of achieving notable improvements in OOD detection, even when trained without any external auxiliary datasets, while maintaining an identical number of trainable parameters as CoOp. Extensive empirical evaluations confirm our method consistently outperforms current state-of-the-art methods. Code is available at https://github.com/0xFAFA/FA.
论文链接 https://arxiv.org/pdf/2507.04511v1.pdf
代码链接 https://github.com/0xfafa/fa

论文标题 LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization

中文摘要： LoSiA是一种创新的参数高效微调方法，通过动态定位和优化关键参数来提高计算效率和微调性能。它利用梯度稀疏性分析识别子网络，并将其作为可训练目标进行优化，从而实现有效的高秩适应，同时减少额外的矩阵乘法。此外，还提出了一个更快的实现版本LoSiA-Pro，其训练延迟比LoRA减少了约27%。大量评估表明，LoSiA在领域专业化和常识推理任务中几乎不损失性能，且所需的训练时间最少。进一步分析显示，LoSiA还能减少持续训练中的遗忘现象。
英文摘要： Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, significantly reduce the number of trainable parameters by introducing low-rank decomposition matrices. However, existing methods perform extensive matrix multiplications in domain specialization tasks, resulting in computational inefficiency and sub-optimal fine-tuning performance. Hence, we propose LoSiA (Low-Resources Subnet Integration Adaptation), an innovative method that dynamically localizes and optimizes critical parameters during the training process. Specifically, it identifies a sub-network using gradient sparsity analysis and optimizes it as the trainable target. This design enables effective high-rank adaptation by updating only the sub-network parameters, reducing the additional matrix multiplication. We also present LoSiA-Pro, a faster implementation of LoSiA, which reduces the training latency by about 27% compared to LoRA. Extensive evaluations show that our method achieves minimal performance drop compared to full fine-tuning, while requiring the least training time across domain specialization and common-sense reasoning tasks. Further analysis shows that LoSiA also reduces forgetting during continued training.
论文链接 https://arxiv.org/pdf/2507.04487v1.pdf
代码链接 https://github.com/KlozeWang/LoSiA
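
The LoSiA abstract describes locating a sub-network from gradient information and updating only its parameters. A minimal sketch of that idea follows, with simplifying assumptions: importance is the summed absolute gradient per row and per column of one weight matrix, the sub-network is the intersection of the top-scoring rows and columns, and the update is plain SGD. The function names `select_subnet` and `sgd_step_on_subnet` are hypothetical, and the selection criterion in the paper may differ.

```python
def select_subnet(grad, n_rows, n_cols):
    """Pick the rows and columns with the largest summed absolute gradient.

    grad is a matrix given as a list of equal-length rows.
    """
    row_score = [sum(abs(g) for g in row) for row in grad]
    col_score = [sum(abs(row[j]) for row in grad) for j in range(len(grad[0]))]
    rows = sorted(sorted(range(len(row_score)), key=lambda i: -row_score[i])[:n_rows])
    cols = sorted(sorted(range(len(col_score)), key=lambda j: -col_score[j])[:n_cols])
    return rows, cols


def sgd_step_on_subnet(weight, grad, rows, cols, lr=0.1):
    """Return a copy of weight in which only the selected sub-matrix is updated."""
    new_weight = [row[:] for row in weight]
    for i in rows:
        for j in cols:
            new_weight[i][j] -= lr * grad[i][j]
    return new_weight
```

Because the selected sub-matrix is updated directly, the change to the weight matrix is not constrained to a low rank, and no extra low-rank matrices have to be multiplied in the forward pass.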

中文摘要： 本文介绍了MambaFusion，这是首个展示纯Mamba块能够实现高效密集全局融合的方法，同时在相机-激光雷达多模态3D目标检测中保持顶级性能的工作。当前的融合策略受限于无法同时实现高效、长距离建模和保留完整场景信息。受到状态空间模型（SSMs）和线性注意力机制最新进展的启发，研究者利用其线性复杂度和长距离建模能力来应对这些挑战。然而，实验表明简单地采用高效的线性复杂度方法并不一定带来改进，甚至可能降低性能，这主要是由于在多模态对齐过程中高度信息的丢失导致序列顺序偏差。为了解决这个问题，研究者提出了保真度高的激光雷达编码，通过连续空间中的体素压缩来保留精确的高度信息，从而增强相机-激光雷达的对齐。随后，引入了混合Mamba块，利用增强的高度信息特征进行局部和全局上下文学习。综合这些组件，该方法在nuScenes验证基准上达到了75.0的顶级NDS分数，甚至超过了使用高分辨率输入的方法，并且在保持效率的同时实现了比大多数最新方法更快的推理速度。
英文摘要： We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion while maintaining top performance for camera-LiDAR multi-modal 3D object detection. Our motivation stems from the observation that existing fusion strategies are constrained by their inability to simultaneously achieve efficiency, long-range modeling, and retention of complete scene information. Inspired by recent advances in state-space models (SSMs) and linear attention, we leverage their linear complexity and long-range modeling capabilities to address these challenges. However, this is non-trivial since our experiments reveal that simply adopting efficient linear-complexity methods does not necessarily yield improvements and may even degrade performance. We attribute this degradation to the loss of height information during multi-modal alignment, leading to deviations in sequence order. To resolve this, we propose height-fidelity LiDAR encoding that preserves precise height information through voxel compression in continuous space, thereby enhancing camera-LiDAR alignment. Subsequently, we introduce the Hybrid Mamba Block, which leverages the enriched height-informed features to conduct local and global contextual learning. By integrating these components, our method achieves state-of-the-art performance with a top-tier NDS score of 75.0 on the nuScenes validation benchmark, even surpassing methods that utilize high-resolution inputs. Meanwhile, our method maintains efficiency, achieving faster inference speed than most recent state-of-the-art methods.
论文链接 https://arxiv.org/pdf/2507.04369v1.pdf
代码链接 https://github.com/AutoLab-SAI-SJTU/MambaFusion

论文标题 MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone Architecture

中文摘要： 本文提出了一种新的MVNet网络架构，旨在解决高光谱图像分类中面临的高维数据、有限训练样本和光谱冗余等问题。MVNet结合了3D-CNN的局部特征提取、Transformer的全局建模以及Mamba的线性复杂度序列建模能力，实现了高效的空-谱特征提取与融合。MVNet设计了一个重新定义的双分支Mamba模块，包括一个状态空间模型（SSM）分支和一个采用1D卷积及SiLU激活函数的非SSM分支，增强了对短程和长程依赖关系的建模，同时减少了传统Mamba中的计算延迟。优化后的HSI-MambaVision Mixer模块克服了因果卷积的单向限制，在单次前向传递中通过解耦注意力机制捕捉双向空-谱依赖关系，从而减轻参数冗余和维度诅咒问题。实验结果表明，在IN、UP和KSC数据集上，MVNet在分类准确率和计算效率方面均优于主流的高光谱图像分类方法，展示了其在处理复杂高光谱图像数据方面的强大能力。
英文摘要： Hyperspectral image (HSI) classification faces challenges such as high-dimensional data, limited training samples, and spectral redundancy, which often lead to overfitting and insufficient generalization capability. This paper proposes a novel MVNet network architecture that integrates 3D-CNN's local feature extraction, Transformer's global modeling, and Mamba's linear complexity sequence modeling capabilities, achieving efficient spatial-spectral feature extraction and fusion. MVNet features a redesigned dual-branch Mamba module, including a State Space Model (SSM) branch and a non-SSM branch employing 1D convolution with SiLU activation, enhancing modeling of both short-range and long-range dependencies while reducing computational latency in traditional Mamba. The optimized HSI-MambaVision Mixer module overcomes the unidirectional limitation of causal convolution, capturing bidirectional spatial-spectral dependencies in a single forward pass through decoupled attention that focuses on high-value features, alleviating parameter redundancy and the curse of dimensionality. On IN, UP, and KSC datasets, MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency, demonstrating robust capability in processing complex HSI data.
论文链接 https://arxiv.org/pdf/2507.04409v1.pdf
代码链接 https://github.com/leeguandong/MVNet-for-HSI

论文标题 PresentAgent: Multimodal Agent for Presentation Video Generation

中文摘要： PresentAgent是一种多模态代理，能够将长文档转换为带有旁白的演示视频。与现有方法仅生成静态幻灯片或文本摘要不同，该方法通过生成高度同步的视觉和口语内容来模仿人类风格的演示。为了实现这一集成，PresentAgent采用了一个模块化流程，系统地分割输入文档，规划并渲染幻灯片样式的视觉帧，使用大型语言模型和文本转语音模型生成上下文相关的旁白，并最终无缝合成具有精确音视频对齐的完整视频。鉴于评估此类多模态输出的复杂性，研究者引入了PresentEval，这是一个由视觉-语言模型支持的统一评估框架，可以从内容保真度、视觉清晰度和观众理解三个关键维度进行全面评分。在30个文档-演示配对的精选数据集上的实验验证表明，PresentAgent在所有评估指标上都接近人类水平的质量。这些结果突显了可控多模态代理在将静态文本材料转化为动态、有效且易于访问的演示格式方面的巨大潜力。代码将在https://github.com/AIGeeksGroup/PresentAgent提供。
英文摘要： We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document-presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at https://github.com/AIGeeksGroup/PresentAgent.
论文链接 https://arxiv.org/pdf/2507.04036v1.pdf
代码链接 https://github.com/AIGeeksGroup/PresentAgent

论文标题 Temporal Continual Learning with Prior Compensation for Human Motion Prediction

中文摘要： 本文提出了一种新的多阶段训练框架，称为时间持续学习（TCL），用于解决人类运动预测中的两个主要限制：短期预测的学习受到长期预测的干扰，以及过去预测的先验信息在后续预测中的利用不足。为了更好地保留先验信息，引入了先验补偿因子（PCF），并将其整合到模型训练中以补偿丢失的先验信息。此外，通过理论推导得出了一个更为合理的优化目标。值得注意的是，TCL框架可以轻松地与不同的人类运动预测主干模型结合，并适应各种数据集和应用。在四个基准数据集上的广泛实验表明了TCL的有效性和灵活性。代码可在https://github.com/hyqlat/TCL获取。
英文摘要： Human Motion Prediction (HMP) aims to predict future poses at different moments according to past motion sequences. Previous approaches have treated the prediction of various moments equally, resulting in two main limitations: the learning of short-term predictions is hindered by the focus on long-term predictions, and the incorporation of prior information from past predictions into subsequent predictions is limited. In this paper, we introduce a novel multi-stage training framework called Temporal Continual Learning (TCL) to address the above challenges. To better preserve prior information, we introduce the Prior Compensation Factor (PCF). We incorporate it into the model training to compensate for the lost prior information. Furthermore, we derive a more reasonable optimization objective through theoretical derivation. It is important to note that our TCL framework can be easily integrated with different HMP backbone models and adapted to various datasets and applications. Extensive experiments on four HMP benchmark datasets demonstrate the effectiveness and flexibility of TCL. The code is available at https://github.com/hyqlat/TCL.
论文链接 https://arxiv.org/pdf/2507.04060v1.pdf
代码链接 https://github.com/hyqlat/tcl

论文标题 Graph Collaborative Attention Network for Link Prediction in Knowledge Graphs

中文摘要： 本文对传统基于规则的方法和现代深度学习方法在知识图谱链接预测中的表现进行了系统性比较。特别关注了KBGAT模型，该模型利用多头注意力机制来联合编码实体和关系特征。为了进一步推进这一研究方向，提出了GCAT（Graph Collaborative Attention Network），这是一种改进的模型，能够增强异构节点之间的上下文聚合和交互。在四个广泛使用的基准数据集上的实验结果表明，GCAT不仅持续优于基于规则的方法，而且在与现有神经嵌入模型相比时，也表现出竞争性或更优的性能。研究结果强调了基于注意力机制的架构在捕获复杂关系模式以完成知识图谱补全任务方面的优势。
英文摘要： Knowledge graphs offer a structured representation of real-world entities and their relationships, enabling a wide range of applications from information retrieval to automated reasoning. In this paper, we conduct a systematic comparison between traditional rule-based approaches and modern deep learning methods for link prediction. We focus on KBGAT, a graph neural network model that leverages multi-head attention to jointly encode both entity and relation features within local neighborhood structures. To advance this line of research, we introduce GCAT (Graph Collaborative Attention Network), a refined model that enhances context aggregation and interaction between heterogeneous nodes. Experimental results on four widely-used benchmark datasets demonstrate that GCAT not only consistently outperforms rule-based methods but also achieves competitive or superior performance compared to existing neural embedding models. Our findings highlight the advantages of attention-based architectures in capturing complex relational patterns for knowledge graph completion tasks.
论文链接 https://arxiv.org/pdf/2507.03947v1.pdf
代码链接 https://github.com/hmthanh/GCAT

论文标题 Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic

中文摘要： 针对基于动作驱动的随机人体运动预测任务，即根据给定的过去执行非目标动作的序列来生成预定义目标动作的未来运动序列，本文提出了一种新的方法。该任务主要面临两大挑战：一是由于不同动作之间的过渡速度差异较大，生成平滑过渡动作较为困难；二是由于某些动作的相似性，学习动作特征也颇具挑战。为了解决这些问题，研究者提出了两个记忆库：软过渡动作库（STAB）和动作特征库（ACB）。STAB存储了动作之间的转换信息，并采用了一种新颖的软搜索方法，使模型能够关注到观察到的动作中的多个可能类别。ACB则记录了动作特征，提供了更多的先验信息用于特定动作的预测。为了更好地融合从这两个库中检索到的特征，进一步提出了一种自适应注意力调整（AAA）策略。在四个运动预测数据集上的广泛实验表明，此方法在性能上优于先前最先进的技术。相关演示和代码已在https://hyqlat.github.io/STABACB.github.io/ 公开。
英文摘要： Action-driven stochastic human motion prediction aims to generate future motion sequences of a pre-defined target action based on given past observed sequences performing non-target actions. This task primarily presents two challenges. Firstly, generating smooth transition motions is hard due to the varying transition speeds of different actions. Secondly, action characteristics are difficult to learn because of the similarity of some actions. These issues cause the predicted results to be unreasonable and inconsistent. To tackle these problems, we propose two memory banks, the Soft-transition Action Bank (STAB) and the Action Characteristic Bank (ACB). The STAB stores the action transition information. It is equipped with a novel soft searching approach, which encourages the model to focus on multiple possible action categories of observed motions. The ACB records action characteristics, which provide more prior information for predicting certain actions. To better fuse the features retrieved from the two banks, we further propose the Adaptive Attention Adjustment (AAA) strategy. Extensive experiments on four motion prediction datasets demonstrate that our approach consistently outperforms the previous state-of-the-art. The demo and code are available at https://hyqlat.github.io/STABACB.github.io/.
论文链接 https://arxiv.org/pdf/2507.04062v1.pdf
代码链接 https://github.com/hyqlat/ATACB