Papers with Code - 2025-07-02

Paper title: JointRank: Rank Large Set with Single Pass

Summary: This paper proposes JointRank, a method for efficiently ranking relevant items drawn from a large candidate pool. It targets modern information retrieval systems such as web search, recommendation, and retrieval-augmented generation. JointRank partitions the candidates into overlapping blocks, ranks each block independently and in parallel, and then derives implicit pairwise comparisons from these local rankings. Finally, an algorithm such as Winrate or PageRank aggregates the comparisons into a global ranking. On TREC DL-2019, JointRank reaches an nDCG@10 of 70.88, compared with 57.68 for full-context listwise ranking with gpt-4.1-mini as the long-context model, and cuts latency from 21 seconds to 8 seconds. The implementation and experiment code are available in the GitHub repository.
Abstract: Efficiently ranking relevant items from large candidate pools is a cornerstone of modern information retrieval systems -- such as web search, recommendation, and retrieval-augmented generation. Listwise rerankers, which improve relevance by jointly considering multiple candidates, are often limited in practice: either by model input size constraints, or by degraded quality when processing large sets. We propose a model-agnostic method for fast reranking of large sets that exceed a model's input limits. The method first partitions candidate items into overlapping blocks, each of which is ranked independently in parallel. Implicit pairwise comparisons are then derived from these local rankings. Finally, these comparisons are aggregated to construct a global ranking using algorithms such as Winrate or PageRank. Experiments on TREC DL-2019 show that our method achieves an nDCG@10 of 70.88 compared to the 57.68 for full-context listwise approach using gpt-4.1-mini as long-context model, while reducing latency from 21 to 8 seconds. The implementation of the algorithm and the experiments is available in the repository: https://github.com/V3RGANz/jointrank
Paper link: https://arxiv.org/pdf/2506.22262v1.pdf
Code link: https://github.com/V3RGANz/jointrank
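
The block-rank-aggregate pipeline described in the JointRank entry can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the shuffle-and-chunk block design, the function names, and the toy ranker are all assumptions made here, and only the Winrate-style aggregation is shown.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_blocks(items, block_size, rounds, seed=0):
    """Hypothetical block design: each round shuffles the items and chunks them,
    so every item lands in `rounds` different blocks that overlap across rounds."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(rounds):
        shuffled = items[:]
        rng.shuffle(shuffled)
        blocks.extend(shuffled[i:i + block_size]
                      for i in range(0, len(shuffled), block_size))
    return blocks


def winrate_aggregate(items, blocks, rank_block):
    """Rank every block locally, turn each local ranking into implicit
    pairwise wins, and sort items by their fraction of comparisons won."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for block in blocks:
        ranked = rank_block(block)  # best first; blocks are independent, so this could run in parallel
        for better, worse in combinations(ranked, 2):
            wins[better] += 1
            games[better] += 1
            games[worse] += 1

    def rate(item):
        return wins[item] / games[item] if games[item] else 0.0

    return sorted(items, key=rate, reverse=True)


# Toy usage: a stand-in "reranker" that knows the true relevance of each document.
relevance = {f"doc{i}": i for i in range(20)}


def toy_ranker(block):
    return sorted(block, key=relevance.get, reverse=True)


docs = list(relevance)
blocks = make_blocks(docs, block_size=5, rounds=4)
ranking = winrate_aggregate(docs, blocks, toy_ranker)
```

In a real system `rank_block` would be a call to a listwise reranker, and more rounds give more pairwise evidence at the cost of more calls.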

Paper title: Out-of-Distribution Semantic Occupancy Prediction

Summary: 3D semantic occupancy prediction is crucial for autonomous driving because it provides a dense, semantically rich representation of the environment. Existing methods, however, focus mainly on in-distribution scenes and are vulnerable to out-of-distribution (OoD) objects and long-tail distributions, which raises the risk of undetected anomalies and misinterpretations and therefore creates safety hazards. To address this, the work introduces Out-of-Distribution Semantic Occupancy Prediction, which targets OoD detection in 3D voxel space. To fill the dataset gap, the authors propose a Synthetic Anomaly Integration Pipeline that injects synthetic anomalies while preserving realistic spatial and occlusion patterns, yielding two datasets: VAA-KITTI and VAA-KITTI-360. They also introduce OccOoD, a new framework that integrates OoD detection into 3D semantic occupancy prediction and uses an RWKV-based branch with geometry-semantic fusion to strengthen OoD detection. Experiments show that OccOoD achieves state-of-the-art OoD detection within a 1.2 m region, with an AuROC of 67.34% and an AuPRCr of 29.21%, while keeping occupancy prediction performance competitive. The datasets and source code will be released at https://github.com/7uHeng/OccOoD.
Abstract: 3D Semantic Occupancy Prediction is crucial for autonomous driving, providing a dense, semantically rich environmental representation. However, existing methods focus on in-distribution scenes, making them susceptible to Out-of-Distribution (OoD) objects and long-tail distributions, which increases the risk of undetected anomalies and misinterpretations, posing safety hazards. To address these challenges, we introduce Out-of-Distribution Semantic Occupancy Prediction, targeting OoD detection in 3D voxel space. To fill the gaps in the dataset, we propose a Synthetic Anomaly Integration Pipeline that injects synthetic anomalies while preserving realistic spatial and occlusion patterns, enabling the creation of two datasets: VAA-KITTI and VAA-KITTI-360. We introduce OccOoD, a novel framework integrating OoD detection into 3D semantic occupancy prediction, with Voxel-BEV Progressive Fusion (VBPF) leveraging an RWKV-based branch to enhance OoD detection via geometry-semantic fusion. Experimental results demonstrate that OccOoD achieves state-of-the-art OoD detection with an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m region, while maintaining competitive occupancy prediction performance. The established datasets and source code will be made publicly available at https://github.com/7uHeng/OccOoD.
Paper link: https://arxiv.org/pdf/2506.21185v1.pdf
Code link: https://github.com/7uheng/occood

Paper title: Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation

Summary: As language-model-based agents gain influence over high-stakes societal decisions, from public policy to healthcare, ensuring that their impact is beneficial requires understanding the far-reaching implications of their suggestions. This paper proposes a proof-of-concept framework that projects how model-generated advice could propagate through societal systems over long time horizons, enabling more robust alignment. To assess the long-term safety awareness of language models, the authors also introduce a dataset of 100 indirect harm scenarios that tests whether models can foresee adverse, non-obvious outcomes of seemingly harmless user prompts. The approach achieves over 20% improvement on the new dataset and an average win rate above 70% on existing safety benchmarks (AdvBench, SafeRLHF, WildGuardMix), suggesting a promising direction for building safer agents.
Abstract: Given the growing influence of language model-based agents on high-stakes societal decisions, from public policy to healthcare, ensuring their beneficial impact requires understanding the far-reaching implications of their suggestions. We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems on a macroscopic scale over time, enabling more robust alignment. To assess the long-term safety awareness of language models, we also introduce a dataset of 100 indirect harm scenarios, testing models' ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts. Our approach achieves not only over 20% improvement on the new dataset but also an average win rate exceeding 70% against strong baselines on existing safety benchmarks (AdvBench, SafeRLHF, WildGuardMix), suggesting a promising direction for safer agents.
Paper link: https://arxiv.org/pdf/2506.20949v1.pdf
Code link: https://github.com/chenkaisun/proactivealignment

Paper title: Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?

Summary: Although large language models (LLMs) show some ability to understand contextual causality and to give responses that obey the laws of causality, it remains unclear whether they truly reason causally the way humans do. Current evidence suggests that LLMs perform only shallow (level-1) causal reasoning, which is attributed mainly to causal knowledge embedded in their parameters, and that they lack genuine human-like (level-2) causal reasoning. To test this hypothesis, the study examines the autoregressive mechanism of Transformer-based LLMs and finds that it is not inherently causal. Using a new causal Q&A benchmark, CausalProbe-2024, the authors observe a significant performance drop, which further supports the claim that LLMs mainly perform shallow causal reasoning. To address this, they propose G^2-Reasoner, a method that brings general knowledge and goal-oriented prompts into the causal reasoning process of LLMs. Experiments show that G^2-Reasoner significantly improves causal reasoning in fresh and counterfactual contexts, opening a path for LLMs toward higher-level causal reasoning.
Abstract: Causal reasoning capability is critical in advancing large language models (LLMs) toward strong artificial intelligence. While versatile LLMs appear to have demonstrated capabilities in understanding contextual causality and providing responses that obey the laws of causality, it remains unclear whether they perform genuine causal reasoning akin to humans. However, current evidence indicates the contrary. Specifically, LLMs are only capable of performing shallow (level-1) causal reasoning, primarily attributed to the causal knowledge embedded in their parameters, but they lack the capacity for genuine human-like (level-2) causal reasoning. To support this hypothesis, methodologically, we delve into the autoregression mechanism of transformer-based LLMs, revealing that it is not inherently causal. Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024, whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating the fact that they primarily engage in level-1 causal reasoning. To bridge the gap towards level-2 causal reasoning, we draw inspiration from the fact that human reasoning is usually facilitated by general knowledge and intended goals. We propose G^2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into LLMs' causal reasoning processes. Experiments demonstrate that G^2-Reasoner significantly enhances LLMs' causal reasoning capability, particularly in fresh and counterfactual contexts. This work sheds light on a new path for LLMs to advance towards genuine causal reasoning, going beyond level-1 and making strides towards level-2.
Paper link: https://arxiv.org/pdf/2506.21215v1.pdf
Code link: https://github.com/haoang97/causalprobe-2024

Paper title: RecCoT: Enhancing Recommendation via Chain-of-Thought

Summary: In real-world applications, users typically interact with items in several ways, for example through implicit binary feedback (clicks, dislikes, long views) and explicit feedback (comments, reviews). Modern recommendation systems (RecSys) learn user-item collaborative signals from the implicit feedback, treating it as a large-scale binary data stream, and then recommend other highly similar items based on each user's personalized interaction history. From this collaborative-connection perspective, however, a recommender does not attend to the actual content of the items; it prioritizes high-probability signals of behavioral co-occurrence among items. Under this binary learning paradigm, the recommender therefore struggles to understand why a user likes or dislikes a given item. To address this, some studies use content-based reviews to capture semantic knowledge and enhance recommendation models. Most of these methods, however, focus on predicting review ratings and do not provide human-understandable explanations.
Abstract: In real-world applications, users always interact with items in multiple aspects, such as through implicit binary feedback (e.g., clicks, dislikes, long views) and explicit feedback (e.g., comments, reviews). Modern recommendation systems (RecSys) learn user-item collaborative signals from these implicit feedback signals as a large-scale binary data-streaming, subsequently recommending other highly similar items based on users' personalized historical interactions. However, from this collaborative-connection perspective, the RecSys does not focus on the actual content of the items themselves but instead prioritizes higher-probability signals of behavioral co-occurrence among items. Consequently, under this binary learning paradigm, the RecSys struggles to understand why a user likes or dislikes certain items. To alleviate this, some works attempt to utilize the content-based reviews to capture the semantic knowledge to enhance recommender models. However, most of these methods focus on predicting the ratings of reviews, but do not provide a human-understandable explanation.
Paper link: https://arxiv.org/pdf/2506.21032v1.pdf
Code link: https://github.com/shuoyang2/reccot

Paper title: ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation

Summary: Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given arbitrary textual categories without costly model fine-tuning. Existing solutions usually exploit the attention mechanisms of pre-trained models such as CLIP, or generate synthetic data and design complex retrieval procedures. Their performance, however, is limited by the capability of the models they rely on or by the quality of the reference set. This study focuses on the overlooked data quality problem in this dense scene understanding task and finds that a high-quality reference set can significantly improve training-free OVS. Based on this observation, the authors propose a data-quality-oriented framework that consists of a data pipeline for building a reference set of well-paired segment-text embeddings and a simple similarity-based retrieval step, which together expose the key role of data. Extensive evaluation shows that the method outperforms all existing training-free OVS methods on ten benchmark datasets, underscoring the importance of data-centric design for training-free OVS. Code is available at https://github.com/xiweix/ReME .
Abstract: Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of reliant models or the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training. Our code is available at https://github.com/xiweix/ReME .
Paper link: https://arxiv.org/pdf/2506.21233v1.pdf
Code link: https://github.com/xiweix/reme
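
The "simple similarity-based retrieval" over a reference set of segment-text pairs, mentioned in the ReME entry, might look something like the sketch below. The abstract does not specify the retrieval rule, so the cosine similarity with a top-k majority vote, the function names, and the hand-written 2-D embeddings are all assumptions for illustration; real embeddings would come from a pretrained encoder.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_segment(query_embedding, reference_set, k=3):
    """Assign a text label to a segment by majority vote over its k most
    similar reference entries. `reference_set` is a list of (embedding, label)."""
    nearest = sorted(reference_set,
                     key=lambda entry: cosine(query_embedding, entry[0]),
                     reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy reference set with made-up embeddings (hypothetical data).
reference_set = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.0, 1.0], "grass"),
    ([0.1, 0.9], "grass"),
]

cat_like = label_segment([0.8, 0.2], reference_set)
grass_like = label_segment([0.2, 0.8], reference_set)
```

With a retrieval step this simple, accuracy depends almost entirely on how well the reference pairs are matched, which is the data-centric point the entry makes.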

Paper title: Class-Agnostic Region-of-Interest Matching in Document Images

Summary: This paper defines a new task, "Class-Agnostic Region-of-Interest Matching" ("RoI-Matching" for short), which aims to match user-defined document regions in a flexible, efficient, multi-granularity, and open-set manner. Existing document analysis solutions usually work only with fixed category definitions and granularities and cannot support flexible, user-customized applications. To meet these needs, the paper builds a benchmark, RoI-Matching-Bench, with three difficulty levels, and proposes macro and micro metrics for evaluation. It also proposes a new framework, RoI-Matcher, which uses a siamese network to extract multi-level features in the reference and target domains and cross-attention layers to integrate and align similar semantics across domains. Experiments show that the method is effective on RoI-Matching-Bench and can serve as a baseline for further research. Code is available at https://github.com/pd162/RoI-Matching.
Abstract: Document understanding and analysis have received a lot of attention due to their widespread application. However, existing document analysis solutions, such as document layout analysis and key information extraction, are only suitable for fixed category definitions and granularities, and cannot achieve flexible applications customized by users. Therefore, this paper defines a new task named "Class-Agnostic Region-of-Interest Matching" ("RoI-Matching" for short), which aims to match the customized regions in a flexible, efficient, multi-granularity, and open-set manner. The visual prompt of the reference document and target document images are fed into our model, while the output is the corresponding bounding boxes in the target document images. To meet the above requirements, we construct a benchmark RoI-Matching-Bench, which sets three levels of difficulties following real-world conditions, and propose the macro and micro metrics to evaluate. Furthermore, we also propose a new framework RoI-Matcher, which employs a siamese network to extract multi-level features both in the reference and target domains, and cross-attention layers to integrate and align similar semantics in different domains. Experiments show that our method with a simple procedure is effective on RoI-Matching-Bench, and serves as the baseline for further research. The code is available at https://github.com/pd162/RoI-Matching.
Paper link: https://arxiv.org/pdf/2506.21055v1.pdf
Code link: https://github.com/pd162/roi-matching

Paper title: Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends

Summary: Vision-language-action (VLA) models extend vision-language models (VLMs) with action generation modules. They perform well at visual perception and instruction understanding and can handle a variety of manipulation tasks. In applications that demand high precision and accuracy, however, these models still show performance gaps and need further adaptation. Evidence from multiple domains indicates that post-training is essential for aligning foundation models with downstream applications, which has prompted extensive research on post-training VLA models.

The paper reviews post-training strategies for VLA models from the perspective of human motor learning, focusing on three dimensions: environments, embodiments, and tasks. It introduces a structured taxonomy aligned with human learning mechanisms: (1) enhancing environmental perception; (2) improving embodiment awareness; (3) deepening task comprehension; (4) multi-component integration.

Finally, the paper identifies key challenges and trends in post-training VLA models and proposes a conceptual framework to guide future research. The work offers both a comprehensive overview of current VLA post-training methods from a human motor learning perspective and practical insights for VLA model development.

Abstract: Vision-language-action (VLA) models extend vision-language models (VLM) by integrating action generation modules for robotic manipulation. Leveraging strengths of VLM in vision perception and instruction understanding, VLA models exhibit promising generalization across diverse manipulation tasks. However, applications demanding high precision and accuracy reveal performance gaps without further adaptation. Evidence from multiple domains highlights the critical role of post-training to align foundational models with downstream applications, spurring extensive research on post-training VLA models. VLA model post-training aims to address the challenge of improving an embodiment's ability to interact with the environment for the given tasks, analogous to the process of human motor skill acquisition. Accordingly, this paper reviews post-training strategies for VLA models through the lens of human motor learning, focusing on three dimensions: environments, embodiments, and tasks. A structured taxonomy is introduced aligned with human learning mechanisms: (1) enhancing environmental perception, (2) improving embodiment awareness, (3) deepening task comprehension, and (4) multi-component integration. Finally, key challenges and trends in post-training VLA models are identified, establishing a conceptual framework to guide future research. This work delivers both a comprehensive overview of current VLA model post-training methods from a human motor learning perspective and practical insights for VLA model development. (Project website: https://github.com/AoqunJin/Awesome-VLA-Post-Training)
Paper link: https://arxiv.org/pdf/2506.20966v1.pdf
Code link: https://github.com/aoqunjin/awesome-vla-post-training

Paper title: Transformer-Based Spatial-Temporal Counterfactual Outcomes Estimation

Summary: This paper proposes a new Transformer-based framework for estimating counterfactual outcomes with spatial-temporal attributes. Compared with classical statistical models, the method improves performance and generalization. Under certain assumptions, the estimator in this framework is consistent and asymptotically normal. The method is validated with simulation experiments and real-data experiments. The simulations show that the proposed estimator has stronger estimation ability than the baseline methods. The real-data experiments yield a valuable conclusion about the causal effect of conflicts on forest loss in Colombia. Source code is available at https://github.com/lihe-maxsize/DeppSTCI_Release_Version-master.
Abstract: The real world naturally has dimensions of time and space. Therefore, estimating the counterfactual outcomes with spatial-temporal attributes is a crucial problem. However, previous methods are based on classical statistical models, which still have limitations in performance and generalization. This paper proposes a novel framework for estimating counterfactual outcomes with spatial-temporal attributes using the Transformer, exhibiting stronger estimation ability. Under mild assumptions, the proposed estimator within this framework is consistent and asymptotically normal. To validate the effectiveness of our approach, we conduct simulation experiments and real data experiments. Simulation experiments show that our estimator has a stronger estimation capability than baseline methods. Real data experiments provide a valuable conclusion to the causal effect of conflicts on forest loss in Colombia. The source code is available at https://github.com/lihe-maxsize/DeppSTCI_Release_Version-master.
Paper link: https://arxiv.org/pdf/2506.21154v1.pdf
Code link: https://github.com/lihe-maxsize/deppstci_release_version-master

Paper title: OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography

Summary: As an ancient writing system, Oracle Bone Script (OBS) carries the cultural records and intellectual expressions of ancient civilizations. About 4,500 OBS characters have been discovered, but only about 1,600 have been deciphered. The undeciphered characters are hard to interpret because of their complex structure and abstract imagery. To address this, the paper proposes OracleFusion, a new two-stage semantic typography framework. In the first stage, a multimodal large language model (MLLM) with enhanced spatial awareness reasoning analyzes the glyph structure of an OBS character and visually localizes its key components. In the second stage, Oracle Structural Vector Fusion (OSVF) combines glyph structure constraints with glyph maintenance constraints to ensure that semantically enriched vector fonts are generated. The approach preserves the objective integrity of the glyph structure and provides visually enhanced representations that help experts decipher OBS. Extensive qualitative and quantitative experiments show that OracleFusion outperforms existing baseline models in semantics, visual appeal, and glyph maintenance, significantly improving readability and aesthetic quality. OracleFusion can also provide expert-like insights on unseen oracle characters, making it a valuable tool for advancing OBS decipherment.
Abstract: As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named OracleFusion. In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, we introduce Oracle Structural Vector Fusion (OSVF), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.
Paper link: https://arxiv.org/pdf/2506.21101v1.pdf
Code link: https://github.com/lcs0215/oraclefusion

Paper title: Discovering multiple antibiotic resistance phenotypes using diverse top-k subgroup list discovery

Summary: Antibiotic resistance is one of the major global threats to human health; it occurs when antibiotics lose their ability to combat bacterial infections. To help clinicians recognize emerging patterns of antibiotic resistance in patients, a clinical decision support system can use phenotypes to raise alerts. Patient phenotyping is the task of finding a set of patient characteristics related to a specific medical problem, such as the one addressed in this paper. A single explanation of a medical phenomenon, however, may look useless to a clinical expert and be discarded. Discovering multiple patient phenotypes for the same phenomenon is therefore useful. In this work, the authors define the problem of mining diverse top-k phenotypes and propose the EDSLM algorithm, which is based on the Subgroup Discovery technique, the subgroup list model, and the Minimum Description Length principle. The method gives clinicians a way to obtain multiple, diverse phenotypes for a set of patients. A real use case of antimicrobial resistance phenotyping is shown on the well-known MIMIC-III dataset.
Abstract: Antibiotic resistance is one of the major global threats to human health and occurs when antibiotics lose their ability to combat bacterial infections. In this problem, a clinical decision support system could use phenotypes in order to alert clinicians of the emergence of patterns of antibiotic resistance in patients. Patient phenotyping is the task of finding a set of patient characteristics related to a specific medical problem such as the one described in this work. However, a single explanation of a medical phenomenon might be useless in the eyes of a clinical expert and be discarded. The discovery of multiple patient phenotypes for the same medical phenomenon would be useful in such cases. Therefore, in this work, we define the problem of mining diverse top-k phenotypes and propose the EDSLM algorithm, which is based on the Subgroup Discovery technique, the subgroup list model, and the Minimum Description Length principle. Our proposal provides clinicians with a method with which to obtain multiple and diverse phenotypes of a set of patients. We show a real use case of phenotyping in antimicrobial resistance using the well-known MIMIC-III dataset.
Paper link: https://paperswithcode.com/paper/discovering-multiple-antibiotic-resistance
Code link: https://github.com/antoniolopezmc/Discovering-multiple-antibiotic-resistance-phenotypes-using-diverse-top-k-subgroup-list-discovery

Paper title: Robust Deep Learning for Myocardial Scar Segmentation in Cardiac MRI with Noisy Labels

Summary: This study proposes a robust deep learning approach for fully automated detection and segmentation of myocardial scars in cardiac MRI. The method fine-tunes state-of-the-art models and uses a Kullback-Leibler loss together with extensive data augmentation to handle label noise from semi-automatic annotation, data heterogeneity, and class imbalance. Evaluation on acute and chronic cases shows that it produces accurate and smooth segmentations even when the labels are noisy. Compared with existing state-of-the-art models such as nnU-Net, it performs better and generalizes well across imaging conditions and clinical tasks. These results provide a reliable foundation for automated myocardial scar quantification and support broader adoption of deep learning in cardiac imaging.
Abstract: The accurate segmentation of myocardial scars from cardiac MRI is essential for clinical assessment and treatment planning. In this study, we propose a robust deep-learning pipeline for fully automated myocardial scar detection and segmentation by fine-tuning state-of-the-art models. The method explicitly addresses challenges of label noise from semi-automatic annotations, data heterogeneity, and class imbalance through the use of Kullback-Leibler loss and extensive data augmentation. We evaluate the model's performance on both acute and chronic cases and demonstrate its ability to produce accurate and smooth segmentations despite noisy labels. In particular, our approach outperforms state-of-the-art models like nnU-Net and shows strong generalizability in an out-of-distribution test set, highlighting its robustness across various imaging conditions and clinical tasks. These results establish a reliable foundation for automated myocardial scar quantification and support the broader clinical adoption of deep learning in cardiac imaging.
Paper link: https://arxiv.org/pdf/2506.21151v1.pdf
Code link: https://github.com/danialmoa/yolosam
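
To make the Kullback-Leibler loss mentioned in the entry above concrete, here is a minimal per-pixel sketch. The abstract does not say how the target distributions are built, so softening hard labels with label smoothing (`eps`) is purely an assumption here, as are the function names; the KL formula itself is the standard one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_one_hot(label, n_classes, eps=0.1):
    """Assumed way to soften a possibly noisy hard label into a distribution."""
    return [1.0 - eps if c == label else eps / (n_classes - 1)
            for c in range(n_classes)]


def kl_divergence(target, pred, tiny=1e-12):
    """KL(target || pred) = sum_c target_c * log(target_c / pred_c)."""
    return sum(t * math.log((t + tiny) / (p + tiny))
               for t, p in zip(target, pred) if t > 0)


def kl_loss(pixel_logits, pixel_labels, n_classes=2, eps=0.1):
    """Mean KL divergence between smoothed labels and predicted distributions."""
    losses = [kl_divergence(smooth_one_hot(y, n_classes, eps), softmax(z))
              for z, y in zip(pixel_logits, pixel_labels)]
    return sum(losses) / len(losses)


# One "scar" pixel (class 1): a confident correct prediction vs. a confident wrong one.
loss_correct = kl_loss([[0.0, 3.0]], [1])
loss_wrong = kl_loss([[3.0, 0.0]], [1])
```

Because the target keeps some mass on the other class, a single mislabeled pixel pulls the prediction less hard than it would under a one-hot target, which is one common motivation for soft-target losses with noisy annotations.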

Paper title: EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora

Summary: EraRAG is a new multi-layered Graph-RAG framework designed to improve retrieval-augmented generation for large language models (LLMs) over dynamically growing corpora. Unlike existing methods, which usually assume a static corpus, EraRAG uses hyperplane-based Locality-Sensitive Hashing (LSH) to partition the original corpus and organize it into a hierarchical graph structure. This enables efficient, localized insertion of new data without retraining or costly recomputation, while preserving high retrieval accuracy and low latency. Experiments show that EraRAG reduces update time and token consumption by up to an order of magnitude compared with existing Graph-RAG systems and delivers better accuracy. The work offers a practical path for RAG systems that must handle continually growing corpora, balancing retrieval efficiency and adaptability. Code and data are available at https://github.com/EverM0re/EraRAG-Official.
Abstract: Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large language models (LLMs) by structuring retrieval over an external corpus. However, existing approaches typically assume a static corpus, requiring expensive full-graph reconstruction whenever new documents arrive, limiting their scalability in dynamic, evolving environments. To address these limitations, we introduce EraRAG, a novel multi-layered Graph-RAG framework that supports efficient and scalable dynamic updates. Our method leverages hyperplane-based Locality-Sensitive Hashing (LSH) to partition and organize the original corpus into hierarchical graph structures, enabling efficient and localized insertions of new data without disrupting the existing topology. The design eliminates the need for retraining or costly recomputation while preserving high retrieval accuracy and low latency. Experiments on large-scale benchmarks demonstrate that EraRAG achieves up to an order of magnitude reduction in update time and token consumption compared to existing Graph-RAG systems, while providing superior accuracy performance. This work offers a practical path forward for RAG systems that must operate over continually growing corpora, bridging the gap between retrieval efficiency and adaptability. Our code and data are available at https://github.com/EverM0re/EraRAG-Official.
Paper link: https://arxiv.org/pdf/2506.20963v1.pdf
Code link: https://github.com/everm0re/erarag-official
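
Hyperplane-based LSH, which the EraRAG entry relies on, can be illustrated with a short sketch. This shows only the generic technique (sign of the dot product with random hyperplanes) and a flat bucket index; the hierarchical graph construction is not reproduced, and the class and function names are hypothetical rather than taken from the EraRAG repository.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_planes, seed=0):
    """Draw random Gaussian normal vectors, one per hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def lsh_code(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(1 if sum(v * w for v, w in zip(vec, plane)) >= 0 else 0
                 for plane in planes)


class LSHIndex:
    """Flat bucket index; the hyperplanes are fixed once, so inserting a new
    document touches only its own bucket and leaves the rest unchanged."""

    def __init__(self, dim, n_planes=8, seed=0):
        self.planes = make_hyperplanes(dim, n_planes, seed)
        self.buckets = defaultdict(list)

    def insert(self, doc_id, vec):
        code = lsh_code(vec, self.planes)
        self.buckets[code].append(doc_id)
        return code


index = LSHIndex(dim=3)
code_a = index.insert("chunk-a", [1.0, 2.0, 3.0])
code_b = index.insert("chunk-b", [2.0, 4.0, 6.0])     # same direction as chunk-a
code_c = index.insert("chunk-c", [-1.0, -2.0, -3.0])  # opposite direction
```

Vectors pointing in similar directions tend to share a code, so a newly arrived chunk can be placed next to related chunks without recomputing anything for the existing ones.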

Paper title: Learning to Skip the Middle Layers of Transformers

Summary: This paper proposes a new architecture that aims to improve efficiency by dynamically skipping the middle layers of a Transformer. It builds on findings from interpretability research that the middle layers of Transformers are more redundant, while early layers aggregate information into token positions. A learned gating mechanism decides, based on the input, whether to bypass a symmetric span of central blocks, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme, and gate sparsity is controlled with an adaptive regularization loss. The method was intended to reduce compute for "simpler" tokens and possibly to encourage a multi-level representational hierarchy, but at the scales studied it does not improve the trade-off between validation cross-entropy and estimated FLOPs compared with dense baselines that have fewer layers. Code is released at https://github.com/tim-lawson/skip-middle.
Abstract: Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme and gate sparsity with an adaptive regularization loss. We had aimed to reduce compute requirements for 'simpler' tokens and potentially foster an emergent multi-level representational hierarchy but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
Paper link: https://arxiv.org/pdf/2506.21103v1.pdf
Code link: https://github.com/tim-lawson/skip-middle
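
The idea of bypassing a symmetric span of central blocks, as described in the entry above, can be sketched as an index computation. This is only an illustration under strong assumptions: the number of skipped block pairs is passed in directly rather than produced by a learned gate, the block count is assumed even, and the gated attention and normalization schemes are left out. The names `blocks_to_run` and `forward` are hypothetical.

```python
def blocks_to_run(n_layers, skip_pairs):
    """Indices of blocks that still execute when the `2 * skip_pairs` central
    blocks are bypassed, growing the skipped span from the middle outward."""
    if n_layers % 2:
        raise ValueError("this sketch assumes an even number of blocks")
    half = n_layers // 2
    if not 0 <= skip_pairs <= half:
        raise ValueError("skip_pairs must be between 0 and n_layers // 2")
    skipped = set(range(half - skip_pairs, half + skip_pairs))
    return [i for i in range(n_layers) if i not in skipped]


def forward(x, blocks, skip_pairs):
    """Residual stack in which skipped blocks contribute nothing to the stream."""
    for i in blocks_to_run(len(blocks), skip_pairs):
        x = x + blocks[i](x)  # residual update from block i
    return x


# Toy stack: eight "blocks" that each add 1.0 to the residual stream.
toy_blocks = [lambda x: 1.0] * 8
full_depth = forward(0.0, toy_blocks, skip_pairs=0)
skip_two_central = forward(0.0, toy_blocks, skip_pairs=1)
```

In the architecture the entry describes, the decision is made per token by a learned gate, so "simple" tokens could skip more of the middle while harder ones use the full depth.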

Paper title: Task-Aware KV Compression For Cost-Effective Long Video Understanding

Summary: To address the high computational cost that multimodal large language models (MLLMs) face in long-video understanding (LVU), this paper proposes Video-X²L. The method flexibly preserves the critical video information needed for each LVU task through two key operations. The first is bi-level KV compression: during the MLLM's pre-filling stage, it generates two types of compressed KVs, low-compression KVs (L-KVs) that capture fine-grained video details and high-compression KVs (H-KVs) that provide compact video representations. The second is selective KV re-loading during the decoding stage: L-KVs are re-loaded for the most critical parts, while H-KVs are used for the less important ones. This lets the MLLM make full use of task-specific information while staying compact overall. Video-X²L needs no additional training and is directly compatible with existing KV-compressible MLLMs. Experiments show that Video-X²L performs well on several popular LVU benchmarks, significantly outperforming existing KV-compression methods while substantially reducing computation cost.
Abstract: Long-video understanding (LVU) remains a severe challenge for existing multimodal large language models (MLLMs), primarily due to the prohibitive computational cost. Recent approaches have explored KV compression to mitigate this issue, but they often suffer from significant information loss at high compression ratios. In this paper, we introduce Video-X^2L, which flexibly preserves critical video information for each LVU task. Video-X^2L involves two key operations. The first one is called bi-level KV compression. During the MLLM's pre-filling stage, Video-X^2L generates two types of compressed KVs: low-compression KVs (L-KVs) to capture fine-grained video details and high-compression KVs (H-KVs) to offer compact video representations. The second one is called selective KV re-loading. During the MLLM's decoding stage, Video-X^2L selectively re-loads L-KVs for the most critical video chunks while using H-KVs for other less important ones. This allows the MLLM to fully utilize task-specific information while maintaining the overall compactness. Video-X^2L is simple yet effective: it is free from additional training and directly compatible with existing KV-compressible MLLMs. We evaluate Video-X^2L with a variety of popular LVU benchmarks, including VideoMME, MLVU, LongVideoBench, and VNBench. Our experiment result shows that Video-X^2L outperforms existing KV-compression methods by a huge advantage while substantially saving the computation cost.
Paper link: https://arxiv.org/pdf/2506.21184v1.pdf
Code link: https://github.com/unabletousegit/videox22l
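
The two operations named in the entry above, bi-level KV compression and selective KV re-loading, can be mimicked on toy data as follows. Everything concrete here is an assumption for illustration: KV entries are plain floats, compression is average pooling with made-up ratios, and chunk relevance scores are supplied by hand, since the abstract does not state how either step is actually implemented.

```python
def pool(kvs, ratio):
    """Average-pool a chunk's KV entries in groups of `ratio` (stand-in compressor)."""
    groups = [kvs[i:i + ratio] for i in range(0, len(kvs), ratio)]
    return [sum(group) / len(group) for group in groups]


def bi_level_compress(chunks, low_ratio=2, high_ratio=8):
    """Pre-filling stage: keep a low-compression (L) and a high-compression (H)
    version of every video chunk."""
    return [{"L": pool(chunk, low_ratio), "H": pool(chunk, high_ratio)}
            for chunk in chunks]


def select_kvs(compressed, relevance, top_k):
    """Decoding stage: load L-KVs for the `top_k` most relevant chunks and
    H-KVs for the rest, preserving the original chunk order."""
    order = sorted(range(len(compressed)), key=lambda i: relevance[i], reverse=True)
    critical = set(order[:top_k])
    context = []
    for i, chunk in enumerate(compressed):
        context.extend(chunk["L"] if i in critical else chunk["H"])
    return context


# Four toy chunks of 16 KV entries each; chunk 1 is most relevant to the task.
chunks = [[float(j) for j in range(16)] for _ in range(4)]
compressed = bi_level_compress(chunks)
context = select_kvs(compressed, relevance=[0.1, 0.9, 0.2, 0.3], top_k=1)
```

Here the decoding context holds 14 entries instead of the original 64, with the extra detail spent only on the chunk judged most relevant.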

Paper title: Learning to See in the Extremely Dark

Summary: Learning-based methods have made notable progress in low-light RAW image enhancement, but their capability in extremely dark scenes, where ambient illuminance drops as low as 0.0001 lux, remains unexplored, mainly because suitable datasets are lacking. To this end, the authors propose a paired-to-paired data synthesis pipeline that generates well-calibrated, extremely low-light RAW images in three precise illuminance ranges (0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux) together with high-quality sRGB references. The result is a large-scale paired dataset, See-in-the-Extremely-Dark (SIED), for benchmarking low-light RAW image enhancement methods. They also propose a diffusion-based framework that uses the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing images from RAW inputs with extremely low signal-to-noise ratio. The framework introduces an Adaptive Illumination Correction Module (AICM) and a color consistency loss to ensure accurate exposure correction and color restoration. Extensive experiments on SIED and other public benchmarks demonstrate the effectiveness of the method. Code and dataset are available at https://github.com/JianghaiSCU/SIED.
Abstract: Learning-based methods have made promising advances in low-light RAW image enhancement, while their capability in extremely dark scenes where the environmental illuminance drops as low as 0.0001 lux remains to be explored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset are available at https://github.com/JianghaiSCU/SIED.
Paper link: https://arxiv.org/pdf/2506.21132v1.pdf
Code link: https://github.com/jianghaiscu/sied

Paper title: TableMoE: Neuro-Symbolic Routing for Structured Expert Reasoning in Multimodal Table Understanding

Summary: Multimodal understanding of real-world tables is hard because of structural complexity, high symbolic density, and visual degradation (blur, skew, watermarks, incomplete structures or fonts, multi-span or hierarchically nested layouts), and existing multimodal large language models perform poorly under these conditions. The paper proposes TableMoE, a neuro-symbolic Mixture-of-Connector-Experts (MoCE) architecture designed for robust, structured reasoning over multimodal table data. TableMoE uses a Neuro-Symbolic Routing mechanism that predicts latent semantic token roles (for example header, data cell, axis, formula) and dynamically routes table elements to specialized expert modules (Table-to-HTML, Table-to-JSON, Table-to-Code) through a confidence-aware gating strategy informed by symbolic reasoning graphs. To support alignment-driven pretraining, the authors introduce the large-scale TableMoE-Align dataset, which contains 1.2M table-HTML-JSON-code quadruples across finance, science, biomedicine, and industry. They also curate and release four challenging WildStruct benchmarks (WMMFinQA, WMMTatQA, WMMTabDialog, and WMMFinanceMath) designed to stress-test models under real-world multimodal degradation and structural complexity. Experiments show that TableMoE significantly surpasses existing state-of-the-art models. Extensive ablation studies validate each core component and highlight the critical role of Neuro-Symbolic Routing and structured expert alignment. Qualitative analysis further demonstrates TableMoE's interpretability and improved robustness, underscoring the value of integrating neuro-symbolic reasoning into multimodal table understanding.
Abstract: Multimodal understanding of tables in real-world contexts is challenging due to the complexity of structure, symbolic density, and visual degradation (blur, skew, watermarking, incomplete structures or fonts, multi-span or hierarchically nested layouts). Existing multimodal large language models (MLLMs) struggle with such WildStruct conditions, resulting in limited performance and poor generalization. To address these challenges, we propose TableMoE, a neuro-symbolic Mixture-of-Connector-Experts (MoCE) architecture specifically designed for robust, structured reasoning over multimodal table data. TableMoE features an innovative Neuro-Symbolic Routing mechanism, which predicts latent semantic token roles (e.g., header, data cell, axis, formula) and dynamically routes table elements to specialized experts (Table-to-HTML, Table-to-JSON, Table-to-Code) using a confidence-aware gating strategy informed by symbolic reasoning graphs. To facilitate effective alignment-driven pretraining, we introduce the large-scale TableMoE-Align dataset, consisting of 1.2M table-HTML-JSON-code quadruples across finance, science, biomedicine and industry, utilized exclusively for model pretraining. For evaluation, we curate and release four challenging WildStruct benchmarks: WMMFinQA, WMMTatQA, WMMTabDialog, and WMMFinanceMath, designed specifically to stress-test models under real-world multimodal degradation and structural complexity. Experimental results demonstrate that TableMoE significantly surpasses existing state-of-the-art models. Extensive ablation studies validate each core component, emphasizing the critical role of Neuro-Symbolic Routing and structured expert alignment. Through qualitative analyses, we further showcase TableMoE's interpretability and enhanced robustness, underscoring the effectiveness of integrating neuro-symbolic reasoning for multimodal table understanding.
Paper link: https://arxiv.org/pdf/2506.21393v1.pdf
Code link: https://github.com/ai-agi/tablemoe

Paper title: FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

Summary: The authors develop a cost-efficient neurosymbolic agent for multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow." The agent combines fast, high-level subtask planning by large language models (LLMs) with slow, accurate tool use and local A$^*$ search per subtask to find a cost-efficient toolpath, that is, a sequence of calls to AI tools. To save the cost of running A$^*$ search on similar subtasks, LLMs perform inductive reasoning over previously successful toolpaths to continuously extract and refine frequently used subroutines, which are then reused as new tools in future tasks. This gives an adaptive fast-slow planning scheme in which higher-level subroutines are explored first and the low-level A$^*$ search is activated only when they fail. The reusable symbolic subroutines greatly reduce the exploration cost for subtasks of the same type applied to similar images, yielding a human-like fast-slow toolpath agent, "FaSTA$^*$": LLMs first attempt fast subtask planning and rule-based subroutine selection, which is expected to cover most tasks, and the slow A$^*$ search is triggered only for novel and challenging subtasks. In comparisons with recent image editing approaches, FaSTA$^*$ is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in success rate.
Abstract: We develop a cost-efficient neurosymbolic agent to address challenging multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow." It combines the fast, high-level subtask planning by large language models (LLMs) with the slow, accurate, tool-use, and local A$^*$ search per subtask to find a cost-efficient toolpath -- a sequence of calls to AI tools. To save the cost of A$^*$ on similar subtasks, we perform inductive reasoning on previously successful toolpaths via LLMs to continuously extract/refine frequently used subroutines and reuse them as new tools for future tasks in an adaptive fast-slow planning, where the higher-level subroutines are explored first, and only when they fail, the low-level A$^*$ search is activated. The reusable symbolic subroutines considerably save exploration cost on the same types of subtasks applied to similar images, yielding a human-like fast-slow toolpath agent "FaSTA$^*$": fast subtask planning followed by rule-based subroutine selection per subtask is attempted by LLMs at first, which is expected to cover most tasks, while slow A$^*$ search is only triggered for novel and challenging subtasks. By comparing with recent image editing approaches, we demonstrate FaSTA$^*$ is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in terms of success rate.
Paper link: https://arxiv.org/pdf/2506.20911v1.pdf
Code link: https://github.com/tianyi-lab/fastar

论文标题 Amortizing personalization in virtual brain twins

中文摘要： 虚拟脑孪生是针对个体人类或患者大脑的个性化数字模型，能够对神经影像数据特征进行机制性解读。然而，在使用这些模型进行训练和推断时面临两个挑战：大型共享基础设施不允许使用个人数据，且临床应用中的推断不应需要大量资源。为解决这两个问题，我们引入了“匿名个性化”方法，通过扩展模型先验来包含个性化信息，在摊销推断下允许匿名训练，同时推断过程既个性化又轻量化。我们展示了这一基本方法，并在一个示例中证明了其可靠性，还讨论了它对实验和计算神经科学的影响。相关代码可在 https://github.com/ins-amu/apvbt 获取。
英文摘要： Virtual brain twins are personalized digital models of an individual human subject's or patient's brain, allowing for mechanistic interpretation of neuroimaging data features. Training and inference with these models, however, present a pair of challenges: large shared infrastructure does not allow for use of personal data, and inference in clinical applications should not require significant resources. We introduce "anonymized personalization" to address both by expanding model priors to include personalization, which under amortized inference allows training to be performed anonymously, while inference is both personalized and lightweight. We illustrate the basic approach, demonstrate reliability in an example, and discuss the impact on both experimental and computational neuroscience. Code is available at https://github.com/ins-amu/apvbt.
论文链接 https://arxiv.org/pdf/2506.21155v1.pdf
代码链接 https://github.com/ins-amu/apvbt

论文标题 FedSC: Federated Learning with Semantic-Aware Collaboration

中文摘要： 联邦学习（FL）旨在在不共享数据的情况下跨客户端协作训练模型，以保护隐私。然而，主要挑战之一是数据异质性问题，即多个客户端存在偏差的标签偏好。现有的一些FL方法试图通过局部（如正则化本地模型）或全局（如微调全局模型）方式解决数据异质性，但往往忽略了每个客户端中固有的语义信息。为探索利用客户端内具有语义意义的知识处理数据异质性的可能性，本文提出了基于语义感知协作的联邦学习（FedSC），以捕捉异构客户端中的客户端特定和类相关知识。FedSC的核心思想是在语义层面构建关系原型和一致原型，旨在以原型协作的方式提供丰富的类别底层知识和稳定的收敛信号。一方面，FedSC引入了一种跨对比学习策略，使实例级嵌入更接近具有相同语义的关系原型，并远离不同类别。另一方面，FedSC通过差异聚合方式设计了一致原型，作为正则化惩罚来约束本地模型的优化区域。此外，还提供了FedSC的理论分析，确保其收敛性。实验结果表明，在各种具有挑战性的场景下，FedSC的有效性和关键组件的效率得到了验证。
英文摘要： Federated learning (FL) aims to train models collaboratively across clients without sharing data for privacy-preserving. However, one major challenge is the data heterogeneity issue, which refers to the biased labeling preferences at multiple clients. A number of existing FL methods attempt to tackle data heterogeneity locally (e.g., regularizing local models) or globally (e.g., fine-tuning global model), often neglecting inherent semantic information contained in each client. To explore the possibility of using intra-client semantically meaningful knowledge in handling data heterogeneity, in this paper, we propose Federated Learning with Semantic-Aware Collaboration (FedSC) to capture client-specific and class-relevant knowledge across heterogeneous clients. The core idea of FedSC is to construct relational prototypes and consistent prototypes at semantic-level, aiming to provide fruitful class underlying knowledge and stable convergence signals in a prototype-wise collaborative way. On the one hand, FedSC introduces an inter-contrastive learning strategy to bring instance-level embeddings closer to relational prototypes with the same semantics and away from distinct classes. On the other hand, FedSC devises consistent prototypes via a discrepancy aggregation manner, as a regularization penalty to constrain the optimization region of the local model. Moreover, a theoretical analysis for FedSC is provided to ensure a convergence guarantee. Experimental results on various challenging scenarios demonstrate the effectiveness of FedSC and the efficiency of crucial components.
论文链接 https://arxiv.org/pdf/2506.21012v1.pdf
代码链接 https://github.com/hwang52/fedsc

论文标题 Homogenization of Multi-agent Learning Dynamics in Finite-state Markov Games

中文摘要： 本文提出了一种新的方法，用于近似多个强化学习（RL）代理在有限状态马尔可夫游戏中交互的学习动态。该方法通过同时降低学习率和增加更新频率来重新调整学习过程，从而将代理的参数视为受快速混合游戏状态影响的缓慢演变变量。在温和假设下（即状态过程的遍历性和更新的连续性），我们证明了这种重新调整的过程收敛于一个常微分方程（ODE）。这个ODE为代理的学习动态提供了一个易于处理的、确定性的近似。该框架的实现可在以下网址获取：https://github.com/yannKerzreho/MarkovGameApproximation
英文摘要： This paper introduces a new approach for approximating the learning dynamics of multiple reinforcement learning (RL) agents interacting in a finite-state Markov game. The idea is to rescale the learning process by simultaneously reducing the learning rate and increasing the update frequency, effectively treating the agent's parameters as a slow-evolving variable influenced by the fast-mixing game state. Under mild assumptions -- ergodicity of the state process and continuity of the updates -- we prove the convergence of this rescaled process to an ordinary differential equation (ODE). This ODE provides a tractable, deterministic approximation of the agent's learning dynamics. An implementation of the framework is available at: https://github.com/yannKerzreho/MarkovGameApproximation
论文链接 https://arxiv.org/pdf/2506.21079v1.pdf
代码链接 https://github.com/yannkerzreho/markovgameapproximation

论文标题 Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability

中文摘要： 检测器在训练和测试数据之间存在领域差异时，性能往往会下降。最近的方法探索了将扩散模型应用于领域泛化（DG）和领域自适应（DA）任务，但仍面临较大的推理成本问题，并未充分利用扩散模型的能力。本文提出通过从单步扩散过程中提取中间特征来解决这些问题，改进特征收集与融合，减少了75%的推理时间，同时提高了源领域的性能。接着，通过应用带有类别提示的框掩码图像构建以对象为中心的辅助分支，提取鲁棒且领域不变的特征。此外，还应用了一致性损失来对齐辅助分支和普通分支，在防止过拟合的同时平衡了源领域和目标领域的性能。在一个统一框架下，标准检测器通过源领域（针对DG）和未标记的目标领域（针对DA）上的特征级和对象级对齐，受到扩散检测器的引导，从而提高了跨领域检测性能。该方法在3个DA基准和5个DG基准上取得了具有竞争力的结果。此外，在COCO泛化基准上的实验表明，该方法在大领域偏移和低数据场景中保持了显著优势并展示了出色的效率。本研究展示了将扩散模型应用于领域泛化和自适应检测任务的优势，并为跨领域的视觉感知任务提供了有价值的见解。代码可在GitHub上获取。
英文摘要： Detectors often suffer from performance drop due to domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on objects. We also apply consistency loss to align the auxiliary and ordinary branch, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on COCO generalization benchmark demonstrate that our method maintains significant advantages and shows remarkable efficiency in large domain shifts and low-data scenarios. Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains. The code is available at https://github.com/heboyong/Fitness-Generalization-Transferability.
论文链接 https://arxiv.org/pdf/2506.21042v1.pdf
代码链接 https://github.com/heboyong/fitness-generalization-transferability

论文标题 AGTCNet: A Graph-Temporal Approach for Principled Motor Imagery EEG Classification

中文摘要： AGTCNet是一种新的图-时序模型，用于运动想象脑电（MI-EEG）分类。该模型利用脑电电极的拓扑配置作为归纳偏置，并结合图卷积注意力网络（GCAT），共同学习表达性强的时空脑电信号表示。AGTCNet在BCI Competition IV Dataset 2a上实现了66.82%的移动平均准确率，在特定被试者微调后进一步提高到82.88%。在EEG Motor Movement/Imagery数据集上，AGTCNet在4类和2类非特定被试者分类中分别达到了64.14%和85.22%的移动平均准确率，在特定被试者分类中进一步提高到72.13%和90.54%。相比现有方法，AGTCNet不仅在性能上取得了显著提升，还具有更紧凑的架构、更快的推理速度和更短的输入信号长度，减少了49.87%的模型大小，提高了64.65%的推理速度。这些优势使得AGTCNet在实际BCI应用中更具有效性和实用性。
英文摘要： Brain-computer interface (BCI) technology utilizing electroencephalography (EEG) marks a transformative innovation, empowering motor-impaired individuals to engage with their environment on equal footing. Despite its promising potential, developing subject-invariant and session-invariant BCI systems remains a significant challenge due to the inherent complexity and variability of neural activity across individuals and over time, compounded by EEG hardware constraints. While prior studies have sought to develop robust BCI systems, existing approaches remain ineffective in capturing the intricate spatiotemporal dependencies within multichannel EEG signals. This study addresses this gap by introducing the attentive graph-temporal convolutional network (AGTCNet), a novel graph-temporal model for motor imagery EEG (MI-EEG) classification. Specifically, AGTCNet leverages the topographic configuration of EEG electrodes as an inductive bias and integrates graph convolutional attention network (GCAT) to jointly learn expressive spatiotemporal EEG representations. The proposed model significantly outperformed existing MI-EEG classifiers, achieving state-of-the-art performance while utilizing a compact architecture, underscoring its effectiveness and practicality for BCI deployment. With a 49.87% reduction in model size, 64.65% faster inference time, and shorter input EEG signal, AGTCNet achieved a moving average accuracy of 66.82% for subject-independent classification on the BCI Competition IV Dataset 2a, which further improved to 82.88% when fine-tuned for subject-specific classification. On the EEG Motor Movement/Imagery Dataset, AGTCNet achieved moving average accuracies of 64.14% and 85.22% for 4-class and 2-class subject-independent classifications, respectively, with further improvements to 72.13% and 90.54% for subject-specific classifications.
论文链接 https://arxiv.org/pdf/2506.21338v1.pdf
代码链接 https://github.com/galvinlim/agtcnet

论文标题 Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation

中文摘要： 全景图像处理对于全方位环境感知至关重要，但面临着诸如失真、视角遮挡和标注有限等挑战。以往的无监督领域自适应方法通常从带有标签的针孔数据转移到未标注的全景图像，但这需要访问源针孔数据。为了解决这些问题，研究引入了一个更实际的任务——无源遮挡感知无缝分割（SFOASS），并提出了首个解决方案UNconstrained Learning Omni-Context Knowledge（UNLOCK）。该框架包括两个关键模块：全向伪标签学习和非模态驱动上下文学习。在不依赖源数据或目标标签的情况下，该框架增强了模型以实现360度视点覆盖和遮挡感知推理。此外，通过真实到真实和合成到真实的适应设置对提出的SFOASS任务进行了基准测试。实验结果显示，该无源方法在mAAP和mAP指标上分别达到了10.9和11.6的最新成绩，并且在mAPQ指标上比仅使用源数据的方法绝对提高了4.3。所有数据和代码将在https://github.com/yihong-97/UNLOCK公开提供。
英文摘要： Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.
论文链接 https://arxiv.org/pdf/2506.21198v1.pdf
代码链接 https://github.com/yihong-97/unlock

论文标题 Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference

中文摘要： 尽管大型语言模型（LLMs）被广泛使用，但它们有时会产生错误信息，并且校准不佳。这使得这些模型的不确定性量化在高风险领域（如自主性和医疗保健）中变得尤为重要。先前的工作通过在微调模型的低秩适应（LoRA）参数上进行推理，使基于贝叶斯深度学习的方法更加可行。然而，这些方法在扩展到更大规模的语言模型时遇到了困难，因为它们需要比LoRA更多的额外参数。本文提出了一种名为ScalaBL的方法，即通过随机变分子空间推断实现可扩展贝叶斯低秩适应。该方法在一个r维子空间内执行贝叶斯推理，其中r是LoRA的秩。通过将LoRA参数重新用作投影矩阵，可以将来自此子空间的样本映射到LLM的完整权重空间中。这样就可以使用随机变分推断来学习我们方法中的所有参数。尽管我们的子空间维度较低，但仍能与最先进的方法达到竞争性表现，同时仅需约1000个额外参数。此外，这种方法还允许我们将贝叶斯LLM扩展至迄今为止最大的规模，其基础参数数量是之前工作的四倍。
英文摘要： Despite their widespread use, large language models (LLMs) are known to hallucinate incorrect information and be poorly calibrated. This makes the uncertainty quantification of these models of critical importance, especially in high-stakes domains, such as autonomy and healthcare. Prior work has made Bayesian deep learning-based approaches to this problem more tractable by performing inference over the low-rank adaptation (LoRA) parameters of a fine-tuned model. While effective, these approaches struggle to scale to larger LLMs because they require additional parameters beyond those of LoRA. In this work we present Scalable Bayesian Low-Rank Adaptation via Stochastic Variational Subspace Inference (ScalaBL). We perform Bayesian inference in an r-dimensional subspace, for LoRA rank r. By repurposing the LoRA parameters as projection matrices, we are able to map samples from this subspace into the full weight space of the LLM. This allows us to learn all the parameters of our approach using stochastic variational inference. Despite the low dimensionality of our subspace, we are able to achieve competitive performance with state-of-the-art approaches while only requiring ~1000 additional parameters. Furthermore, it allows us to scale up to the largest Bayesian LLM to date, with four times as many base parameters as prior work.
论文链接 https://arxiv.org/pdf/2506.21408v1.pdf
代码链接 https://github.com/sri-csl/bayesadapt

论文标题 A Hierarchical Deep Learning Approach for Minority Instrument Detection

中文摘要： 在音乐信息检索中，识别音频片段中的乐器活动至关重要，对音乐编目和发现具有重要意义。以往的深度学习研究主要集中在数据充足的主要乐器类别上。最近的研究表明，即使在乐器层面的细粒度标注有限的情况下，分层分类方法也能有效检测管弦乐中的乐器活动。基于霍恩博斯特尔-萨克斯分类法，该研究使用MedleyDB数据集（以其多样性和丰富性著称）评估了这种分层分类系统。本工作提出了多种将分层结构集成到模型中的策略，并测试了一类新的分层音乐预测模型。通过在详细乐器识别和组级别识别之间架起桥梁，这项研究展示了更可靠的粗略级别的乐器检测，为该领域进一步的发展铺平了道路。
英文摘要： Identifying instrument activities within audio excerpts is vital in music information retrieval, with significant implications for music cataloging and discovery. Prior deep learning endeavors in musical instrument recognition have predominantly emphasized instrument classes with ample data availability. Recent studies have demonstrated the applicability of hierarchical classification in detecting instrument activities in orchestral music, even with limited fine-grained annotations at the instrument level. Based on the Hornbostel-Sachs classification, such a hierarchical classification system is evaluated using the MedleyDB dataset, renowned for its diversity and richness concerning various instruments and music genres. This work presents various strategies to integrate hierarchical structures into models and tests a new class of models for hierarchical music prediction. This study showcases more reliable coarse-level instrument detection by bridging the gap between detailed instrument identification and group-level recognition, paving the way for further advancements in this domain.
论文链接 https://arxiv.org/pdf/2506.21167v1.pdf
代码链接 https://github.com/seon82/musedetect

论文标题 FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

中文摘要： 这项工作介绍了一种基于FineWeb的新预训练数据集整理管道，可以自动适应任何语言。通过在九种不同语言上广泛测试管道设计选择，并采用一套有意义且具有信息量的评估任务（这些任务是通过基于可测量标准的新选择过程确定的），研究者展示了该管道能够创建出比先前数据集更有效的非英语语料库。此外，研究者还提出了一种简单且有原则的方法来重新平衡数据集，同时考虑了重复计数和质量，进一步提升了性能。最终，他们将管道扩展到超过1000种语言，使用近100个Common Crawl快照生成了FineWeb2，这是一个新的20太字节（50亿文档）多语言数据集，并公开发布了他们的管道、训练和评估代码库。
英文摘要： Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.
论文链接 https://arxiv.org/pdf/2506.20920v1.pdf
代码链接 https://github.com/huggingface/fineweb-2

论文标题 Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents

中文摘要： 随着多模态大型语言模型（MLLMs）的发展，多模态代理在现实任务中展现出潜力，如网页导航和具身智能。然而，由于缺乏外部反馈，这些代理在自我纠正和泛化方面存在困难。使用奖励模型作为外部反馈是一种有前景的方法，但目前尚不清楚如何为代理选择合适的奖励模型。因此，迫切需要建立一个针对代理的奖励基准。为应对这些挑战，研究者提出了Agent-RewardBench，这是一个旨在评估MLLMs奖励建模能力的基准。该基准具有三个关键特点：(1) 多维度和真实代理场景评估，涵盖感知、规划和安全7个场景；(2) 步骤级奖励评估，允许在任务的每个步骤中评估代理的能力，提供更细致的性能视图；(3) 适当难度和高质量，从10个不同的模型中精心采样，控制难度以保持任务挑战性，并通过人工验证确保数据完整性。实验表明，即使是最先进的多模态模型也表现出有限的性能，强调了在代理奖励建模方面进行专门训练的必要性。代码可在GitHub上获取。
英文摘要： As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but it is unclear how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriate difficulty and high quality. We carefully sample from 10 diverse models, apply difficulty control to keep the tasks challenging, and use manual verification to ensure the integrity of the data. Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available on GitHub.
论文链接 https://arxiv.org/pdf/2506.21252v1.pdf
代码链接 https://github.com/quester-one/agent-rewardbench

论文标题 PsyLite Technical Report

中文摘要： 随着数字技术的快速发展，基于人工智能的心理咨询服务已成为心理健康领域的一个重要研究方向。然而，现有的模型在对话安全性、详细场景处理和轻量化部署方面仍存在不足。为了解决这些问题，本研究提出了PsyLite，一个基于基础模型InternLM2.5-7B-chat开发的轻量级心理辅导大型语言模型代理。通过两阶段训练策略（混合蒸馏数据微调和ORPO偏好优化），PsyLite增强了模型的深度推理能力、心理辅导能力和安全对话能力。利用Ollama和Open WebUI进行部署，并通过Pipelines创建自定义工作流程。设计了一种创新的条件RAG，在适当的时候引入相声幽默元素，以增强用户体验并拒绝危险请求，从而加强对话的安全性。评估结果显示，PsyLite在中文通用评估（CEval）、心理辅导专业评估（CPsyCounE）和对话安全性评估（SafeDialBench）中均优于基线模型，特别是在心理辅导专业性（CPsyCounE评分提高47.6%）和对话安全性（SafeDialBench评分提高2.4%）方面表现突出。此外，该模型使用量化技术（GGUF q4_k_m）实现了低硬件部署（5GB内存即可运行），为资源受限环境中的心理辅导应用提供了一个可行的解决方案。
英文摘要： With the rapid development of digital technology, AI-driven psychological counseling has gradually become an important research direction in the field of mental health. However, existing models still have deficiencies in dialogue safety, detailed scenario handling, and lightweight deployment. To address these issues, this study proposes PsyLite, a lightweight psychological counseling large language model agent developed based on the base model InternLM2.5-7B-chat. Through a two-stage training strategy (hybrid distillation data fine-tuning and ORPO preference optimization), PsyLite enhances the model's deep-reasoning ability, psychological counseling ability, and safe dialogue ability. After deployment using Ollama and Open WebUI, a custom workflow is created with Pipelines. An innovative conditional RAG is designed to introduce crosstalk humor elements at appropriate times during psychological counseling to enhance user experience and decline dangerous requests to strengthen dialogue safety. Evaluations show that PsyLite outperforms the baseline models in the Chinese general evaluation (CEval), psychological counseling professional evaluation (CPsyCounE), and dialogue safety evaluation (SafeDialBench), particularly in psychological counseling professionalism (CPsyCounE score improvement of 47.6%) and dialogue safety (SafeDialBench score improvement of 2.4%). Additionally, the model uses quantization technology (GGUF q4_k_m) to achieve low hardware deployment (5GB memory is sufficient for operation), providing a feasible solution for psychological counseling applications in resource-constrained environments.
论文链接 https://arxiv.org/pdf/2506.21536v1.pdf
代码链接 https://github.com/Jundifang/PsyLite

论文标题 Model State Arithmetic for Machine Unlearning

中文摘要： 大型语言模型在庞大的网络数据集上进行训练，这些数据可能包含私人数据、受版权保护的材料、不准确的事实信息或降低模型性能的数据。通过完全重新训练来消除这些问题数据点的影响（即反复在排除特定实例的数据集上预训练模型）在计算上是不可行的。因此，出现了旨在以较低计算成本消除特定数据点影响的遗忘算法。然而，精确估计和撤销单个数据点的影响一直是一个挑战。在这项工作中，我们提出了一种新的算法MSA，通过利用模型检查点（即捕捉模型在预训练不同阶段状态的工件）来估计和撤销数据点的影响。实验结果表明，MSA在多个基准测试、模型和评估指标上始终优于现有的机器遗忘算法，这表明MSA可能是实现更灵活的大规模语言模型的有效方法，使其能够进行数据擦除。
英文摘要： Large language models are trained on massive corpora of web data, which may include private data, copyrighted material, factually inaccurate data, or data that degrades model performance. Eliminating the influence of such problematic datapoints through complete retraining -- by repeatedly pretraining the model on datasets that exclude these specific instances -- is computationally prohibitive. For this reason, unlearning algorithms have emerged that aim to eliminate the influence of particular datapoints, while otherwise preserving the model -- at a low computational cost. However, precisely estimating and undoing the influence of individual datapoints has proved to be challenging. In this work, we propose a new algorithm, MSA, for estimating and undoing the influence of datapoints -- by leveraging model checkpoints i.e. artifacts capturing model states at different stages of pretraining. Our experimental results demonstrate that MSA consistently outperforms existing machine unlearning algorithms across multiple benchmarks, models, and evaluation metrics, suggesting that MSA could be an effective approach towards more flexible large language models that are capable of data erasure.
论文链接 https://arxiv.org/pdf/2506.20941v1.pdf
代码链接 https://github.com/mehrdadsaberi/msa_unlearning

论文标题 Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts

中文摘要： 混合专家（MoE）架构已成为高效扩展大型语言模型的关键策略，但当前的MoE系统存在严重的负载不平衡问题，导致模型容量和计算资源的严重浪费。本文从聚类的角度重新审视了专家路由，并提出了一种新的路由框架——潜原型路由（LPR），该方法在不牺牲下游性能的前提下，促进了平衡的专家利用率。通过在多个开源MoE模型上的广泛实验，包括DeepSeek-V3、Qwen3-MoE和Mixtral，LPR将专家负载的基尼系数从0.70降低到0.035，最小-最大专家负载比率从1e-6提高到0.70，实现了近乎完美的负载平衡。
英文摘要： Mixture-of-Experts (MoE) architectures have emerged as a key strategy for scaling large language models (LLMs) efficiently. However, current MoE systems suffer from severe load imbalance, where only a small subset of experts is consistently activated during training and inference, leading to significant underutilization of model capacity and computational resources. In this work, we revisit expert routing through a clustering perspective and propose Latent Prototype Routing (LPR), a novel routing framework that generalizes existing approaches while promoting balanced expert utilization without compromising downstream performance. Extensive experiments across multiple open-source MoE models -- including DeepSeek-V3, Qwen3-MoE, and Mixtral -- demonstrate that LPR reduces the Gini coefficient of expert load from 0.70 to 0.035 on average, improves the min-max expert load ratio from 1e-6 to 0.70, achieving near-perfect load balancing.
论文链接 https://arxiv.org/pdf/2506.21328v1.pdf
代码链接 https://github.com/rando11199/latentprototyperouter
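
The two load-balance metrics quoted in the abstract above can be illustrated with a short sketch. It assumes the standard Gini coefficient over per-expert token counts and a plain min/max ratio; the abstract does not spell out the exact formulas used in the paper, and the function names `gini` and `min_max_ratio` are hypothetical.

```python
def gini(loads):
    """Gini coefficient of non-negative per-expert loads.

    0.0 means every expert receives the same load; values near 1.0 mean
    a few experts receive almost all tokens. Standard textbook formula,
    assumed here rather than taken from the paper.
    """
    xs = sorted(loads)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted loads (ranks start at 1).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


def min_max_ratio(loads):
    """Least-loaded expert divided by most-loaded expert (1.0 is perfectly even)."""
    hi = max(loads)
    return min(loads) / hi if hi > 0 else 0.0


if __name__ == "__main__":
    balanced = [100, 100, 100, 100]   # hypothetical token counts per expert
    skewed = [0, 0, 0, 400]
    print(gini(balanced), min_max_ratio(balanced))
    print(gini(skewed), min_max_ratio(skewed))
```

Under these definitions, a lower Gini coefficient and a min-max ratio closer to 1 both indicate more even expert utilization, which matches the direction of the improvements reported in the abstract.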

论文标题 Complexity-aware fine-tuning

中文摘要： 通用的大规模语言模型（LLMs）通常通过监督微调（SFT）来提升特定领域的性能。通过蒸馏大型模型的思维链可以取得更好的结果，但需要大量的昂贵调用和更多数据。本文提出了一种高效的微调方法，该方法仅对通过熵识别出的复杂数据进行推理。具体来说，我们使用单个标记答案熵（ROC AUC 0.73）将训练数据分为复杂度类别，然后通过SFT和蒸馏对两个小型开放模型（约3B参数）进行微调。实验表明，我们的方法显著优于标准SFT方法（平均准确率0.55 vs 0.43），并且在使用62%更少的数据的情况下，其性能与蒸馏方法相当（平均准确率均为0.55）。我们发布了代码和数据，以促进这一方向的进一步研究。
英文摘要： General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across two small open models (≈3B parameters) we split the training data into complexity categories by single-token answer entropy (ROC AUC 0.73), fine-tune the models via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach (0.55 vs 0.43 average accuracy) and performs comparably to distillation while using 62% less data (0.55 average accuracy for both). We publish our code and data to facilitate further research in this direction.
论文链接 https://arxiv.org/pdf/2506.21220v1.pdf
代码链接 https://github.com/labarss/complexity-aware-fine-tuning
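
As a rough illustration of splitting data by single-token answer entropy, the sketch below computes the Shannon entropy of a model's probability distribution over candidate answer tokens and assigns a complexity bucket. The bucket names, the number of buckets, and the thresholds `low` and `high` are assumptions made for illustration; the abstract does not give the values used in the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a distribution over answer tokens."""
    return sum(-p * math.log(p) for p in probs if p > 0)


def complexity_bucket(probs, low=0.5, high=1.0):
    """Assign a complexity category from answer-token entropy.

    `low` and `high` are hypothetical thresholds, not values from the paper.
    """
    h = token_entropy(probs)
    if h < low:
        return "easy"
    if h < high:
        return "medium"
    return "hard"


if __name__ == "__main__":
    # Hypothetical probabilities over four multiple-choice answer tokens.
    print(complexity_bucket([0.97, 0.01, 0.01, 0.01]))  # confident model
    print(complexity_bucket([0.25, 0.25, 0.25, 0.25]))  # maximally uncertain model
```

In a pipeline of the kind the abstract describes, low-entropy examples would go through plain SFT while high-entropy examples would be paired with distilled reasoning; that routing step is not shown here.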

论文标题 "What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets

中文摘要： 随着人们越来越多地通过交互式聊天机器人向大型语言模型（LLMs）寻求医疗信息，这些对话的性质和潜在风险仍待深入研究。本文通过对大规模对话AI数据集进行筛选，构建了HealthChat-11K数据集，包含11,000个真实世界的对话，共计25,000条用户消息。利用这一数据集和由临床医生制定的分类法，系统地研究了用户在21个不同健康专业领域中与LLMs互动的方式。分析揭示了用户如何以及为何寻求健康信息的特征，包括常见的互动模式、不完整上下文的情况、情感行为，以及可能导致谄媚行为的互动（如引导性问题），强调了改进LLMs作为对话AI在医疗支持方面能力的必要性。相关代码和工具可在GitHub上获取。
英文摘要： People are increasingly seeking healthcare information from large language models (LLMs) via interactive chatbots, yet the nature and inherent risks of these conversations remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to achieve HealthChat-11K, a curated dataset of 11K real-world conversations composed of 25K user messages. We use HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study user interactions across 21 distinct health specialties. Our analysis reveals insights into the nature of how and why users seek health information, such as common interactions, instances of incomplete context, affective behaviors, and interactions (e.g., leading questions) that can induce sycophancy, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. Code and artifacts to retrieve our analyses and combine them into a curated dataset can be found here: https://github.com/yahskapar/HealthChat
论文链接 https://arxiv.org/pdf/2506.21532v1.pdf
代码链接 https://github.com/yahskapar/healthchat

中文摘要： 随着多模态大型语言模型的快速发展，深入理解和解释人类意图已成为关键能力，这需要细致周到的推理。最近的研究表明，强化学习（RL）在增强大型语言模型（LLMs）的推理能力方面具有潜力。然而，将RL适应于多模态数据和格式的挑战尚未得到充分解决。本文识别了现有多模态推理模型中的两个问题：全局上下文理解不足和捷径问题。全局上下文理解不足会导致模型误解多模态上下文，从而产生错误答案；而捷径问题则发生在模型忽略多模态输入中的关键线索，直接回答查询而不考虑多模态信息时。为了解决这些问题，我们强调模型需要在清晰理解多模态输入的全局上下文的基础上进行推理，这可以有效防止模型忽视关键多模态线索，并确保全面的推理过程。为了确保准确解释多模态上下文信息，我们实施了由大型语言模型评判的上下文奖励，以及格式和准确性奖励。此外，为了提高复杂推理能力，我们使用LLM评估逻辑奖励，判断推理过程是否成功地将多模态信息与逻辑方法结合。我们还引入了一个全模态推理基准IntentBench，旨在评估模型对复杂人类意图和情感的理解能力。我们的方法在多个全模态基准测试中表现出优于其他开源全模态模型的性能。
英文摘要： With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.
论文链接 https://arxiv.org/pdf/2506.21277v1.pdf
代码链接 https://github.com/humanmllm/humanomniv2

论文标题 HyperSORT: Self-Organising Robust Training with hyper-networks

中文摘要： HyperSORT是一种利用超网络从代表图像和标注变异性的潜在向量中预测UNet参数的框架。该框架通过联合学习超网络参数和每个训练样本对应的潜在向量集合，从而不再优化单一神经网络来拟合整个数据集，而是学习一个复杂的UNet参数分布。在该分布中，低密度区域可以捕捉特定噪声模式，而较大的模式则能够以差异化但有意义的方式稳健地分割器官。研究者在两个3D腹部CT公共数据集上验证了这一方法：一个是合成扰动版本的AMOS数据集，另一个是包含真实未知偏差和错误的大规模TotalSegmentator数据集。实验表明，HyperSORT能够创建一个结构化的数据集映射，有助于识别相关系统性偏差和错误样本。潜在空间中的聚类生成了与所学系统性偏差一致的UNet参数，从而执行分割任务。代码和对TotalSegmentator数据集的分析已公开。
英文摘要： Medical imaging datasets often contain heterogeneous biases ranging from erroneous labels to inconsistent labeling styles. Such biases can negatively impact the performance of deep segmentation networks. Yet, the identification and characterization of such biases is a particularly tedious and challenging task. In this paper, we introduce HyperSORT, a framework using a hyper-network predicting UNets' parameters from latent vectors representing both the image and annotation variability. The hyper-network parameters and the latent vector collection corresponding to each data sample from the training set are jointly learned. Hence, instead of optimizing a single neural network to fit a dataset, HyperSORT learns a complex distribution of UNet parameters where low density areas can capture noise-specific patterns while larger modes robustly segment organs in differentiated but meaningful manners. We validate our method on two 3D abdominal CT public datasets: a synthetically perturbed version of the AMOS dataset, and TotalSegmentator, a large-scale dataset containing real unknown biases and errors. Our experiments show that HyperSORT creates a structured mapping of the dataset allowing the identification of relevant systematic biases and erroneous samples. Latent space clusters yield UNet parameters performing the segmentation task in accordance with the underlying learned systematic bias. The code and our analysis of the TotalSegmentator dataset are made available: https://github.com/ImFusionGmbH/HyperSORT
论文链接 https://arxiv.org/pdf/2506.21430v1.pdf
代码链接 https://github.com/imfusiongmbh/hypersort
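
The core idea in the abstract above, a hyper-network that maps a per-sample latent vector to the parameters of a segmentation network, can be illustrated with a toy sketch. This is not the HyperSORT implementation: the UNet is replaced by a one-parameter-pair linear model, the hyper-network is a fixed linear map, and all names and values (`hyper_network`, `target_model`, `hyper_weights`, the latents) are hypothetical. In the paper's setting the hyper-network weights and the latents would be learned jointly.

```python
def hyper_network(latent, hyper_weights):
    """Map a latent vector to the parameters (w, b) of a tiny target model."""
    return [sum(h * z for h, z in zip(row, latent)) for row in hyper_weights]


def target_model(x, params):
    """Stand-in for the UNet: a linear model whose parameters come from the hyper-network."""
    w, b = params
    return w * x + b


# Hypothetical fixed values; during training these would be optimised together with the latents.
hyper_weights = [[1.0, 0.0], [0.0, 2.0]]
latent_a, latent_b = [1.0, 0.5], [-1.0, 0.0]

# Two latents give two different target models, hence two different outputs for the same input.
out_a = target_model(3.0, hyper_network(latent_a, hyper_weights))
out_b = target_model(3.0, hyper_network(latent_b, hyper_weights))
print(out_a, out_b)
```

The point of the sketch is only that different regions of the latent space select different networks, which is what lets clusters of latents correspond to different annotation styles.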

论文标题 Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology

中文摘要： 在细胞病理学中，高质量标注的缺乏、长尾数据分布和染色风格不一致等问题给神经网络检测异常细胞带来了重大挑战。本文提出了一种风格对齐图像合成（SAIC）方法，通过合成高保真且风格保持的病理图像来增强检测模型的有效性和鲁棒性。首先，SAIC根据属性指导从异常细胞库中选择合适的候选对象。然后，利用高频特征重建技术实现异常细胞与病理背景的风格对齐和高保真合成。最后，引入大规模视觉-语言模型来筛选高质量的合成图像。实验结果表明，将SAIC合成的图像纳入训练能够有效提升对尾部类别和不同风格的异常细胞检测性能，从而提高整体检测效果。全面的质量评估进一步证实了SAIC在临床应用场景中的通用性和实用性。代码将在https://github.com/Joey-Qi/SAIC发布。
英文摘要： Challenges such as the lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles pose significant obstacles to training neural networks to detect abnormal cells in cytopathology robustly. This paper proposes a style-aligned image composition (SAIC) method that composes high-fidelity and style-preserved pathological images to enhance the effectiveness and robustness of detection models. Without additional training, SAIC first selects an appropriate candidate from the abnormal cell bank based on attribute guidance. Then, it employs a high-frequency feature reconstruction to achieve a style-aligned and high-fidelity composition of abnormal cells and pathological backgrounds. Finally, it introduces a large vision-language model to filter high-quality synthesis images. Experimental results demonstrate that incorporating SAIC-synthesized images effectively enhances the performance and robustness of abnormal cell detection for tail categories and styles, thereby improving overall detection performance. The comprehensive quality evaluation further confirms the generalizability and practicality of SAIC in clinical application scenarios. Our code will be released at https://github.com/Joey-Qi/SAIC.
论文链接 https://arxiv.org/pdf/2506.21001v1.pdf
代码链接 https://github.com/joey-qi/saic

论文标题 XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

中文摘要： XVerse是一种新的多主体控制生成模型，通过将参考图像转换为特定令牌文本流调制的偏移量，实现了对特定主体身份和语义属性（如姿态、风格、光照）的精细且独立控制。这种方法避免了破坏图像潜在特征或引入属性纠缠的问题，从而在保持高保真度的同时，提供了可编辑的多主体图像合成能力，并能够稳健地控制每个主体的特征和语义属性。这一进步显著提升了个性化及复杂场景生成的能力。
英文摘要： Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model, XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control of specific subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.
论文链接 https://arxiv.org/pdf/2506.21416v1.pdf
代码链接 https://github.com/bytedance/xverse

论文标题 Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks

中文摘要： MTEB（大规模文本嵌入基准）已成为文本嵌入模型的标准评估平台。本文重点讨论了确保MTEB长期可重复性和可扩展性的工程方面。我们介绍了维护稳健的持续集成管道的方法，这些管道验证数据集完整性、自动化测试执行，并评估基准结果的泛化能力。详细阐述了增强可重复性和可用性的设计选择。此外，还讨论了处理社区贡献和通过新任务和数据集扩展基准的策略。这些工程实践在扩大MTEB覆盖面的同时，保持了其质量和相关性。我们的经验为面临类似挑战的基准维护者提供了宝贵的见解，帮助他们在机器学习评估框架中确保可重复性和可用性。MTEB的代码库可在https://github.com/embeddings-benchmark/mteb 获取。
英文摘要： The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB's continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results' generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: https://github.com/embeddings-benchmark/mteb
论文链接 https://arxiv.org/pdf/2506.21182v1.pdf
代码链接 https://github.com/embeddings-benchmark/mteb

论文标题 DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images

中文摘要： DrishtiKon是一种多粒度视觉定位框架，旨在提高复杂多语言文档中视觉问答系统的可解释性和可信度。该方法结合了强大的多语言OCR、大型语言模型以及一种新颖的区域匹配算法，能够在块、行、词和点等多个层级上准确地定位答案范围。研究团队从CircularsVQA测试集中策划了一个新的基准数据集，提供了多粒度的人工验证注释。广泛的实验表明，该方法达到了最先进的定位精度，在行级别上的精度和召回率之间取得了最佳平衡。消融研究表明，多块和多行推理具有显著优势。与领先的视觉-语言模型相比，DrishtiKon在精确定位方面表现更优，突显了其结构化、基于对齐的方法的有效性。这些发现为实际文本密集型场景中的更稳健和可解释的文档理解系统铺平了道路。代码和数据集已在GitHub上公开。
英文摘要： Visual grounding in text-rich document images is a critical yet underexplored challenge for document intelligence and visual question answering (VQA) systems. We present DrishtiKon, a multi-granular visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates robust multi-lingual OCR, large language models, and a novel region matching algorithm to accurately localize answer spans at block, line, word, and point levels. We curate a new benchmark from the CircularsVQA test set, providing fine-grained, human-verified annotations across multiple granularities. Extensive experiments demonstrate that our method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations with leading vision-language models reveal the limitations of current VLMs in precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios. Code and dataset have been made available at https://github.com/kasuba-badri-vishal/DhrishtiKon.
论文链接 https://arxiv.org/pdf/2506.21316v1.pdf
代码链接 https://github.com/kasuba-badri-vishal/dhrishtikon

论文标题 LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning

中文摘要： 当前的视觉-语言模型在处理一般视觉理解任务时表现出色，但在处理与人体姿态和动作相关的复杂视觉任务时表现不佳，主要是因为缺乏专门的视觉-语言指令跟随数据。为了解决这一问题，研究者提出了一种方法，通过将人体关键点与传统的视觉特征（如字幕和边界框）结合，生成更精确的人体中心场景理解数据。该方法构建了一个包含200,328个样本的数据集，专门用于微调模型以处理人体中心任务，重点关注对话、详细描述和复杂推理三个领域。此外，还建立了一个扩展的人体姿态和动作理解基准（E-HPAUB），用于评估模型在这方面的性能。通过对LLaVA-1.5-7B模型进行微调，得到的LLaVA-Pose模型在基准测试中取得了显著提升，整体性能提高了33.2%。这些结果表明，集成关键点的数据在增强多模态模型对人体中心视觉理解方面具有显著效果。代码可在https://github.com/Ody-trek/LLaVA-Pose获取。
英文摘要： Current vision-language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 33.2% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding. Code is available at https://github.com/Ody-trek/LLaVA-Pose.
论文链接 https://arxiv.org/pdf/2506.21317v1.pdf
代码链接 https://github.com/ody-trek/llava-pose

论文标题 Antibody Design and Optimization with Multi-scale Equivariant Graph Diffusion Models for Accurate Complex Antigen Binding

中文摘要： 抗体设计在治疗和诊断开发中仍然是一个关键挑战，特别是对于具有多样化结合界面的复杂抗原。当前的计算方法面临两个主要限制：（1）在保持对称性的同时捕捉几何特征；（2）泛化到新的抗原界面。尽管最近有所进展，但这些方法往往无法准确捕捉分子相互作用并保持结构完整性。为了解决这些问题，研究人员提出了AbMEGD，这是一个端到端框架，集成了多尺度等变图扩散模型，用于抗体序列和结构的共同设计。通过先进的几何深度学习，AbMEGD结合了原子级几何特征和残基级嵌入，捕捉局部原子细节和全局序列-结构相互作用。其E(3)等变扩散方法确保了几何精度、计算效率和对复杂抗原的强大泛化能力。基于SAbDab数据库的实验表明，与领先的抗体设计模型DiffAb相比，AbMEGD在氨基酸恢复率上提高了10.13%，改进百分比增加了3.32%，并在关键CDR-H3区域的均方根偏差减少了0.062埃。这些结果突显了AbMEGD在平衡结构完整性和功能改进方面的能力，为序列-结构共同设计和亲和力优化树立了新标准。代码可在https://github.com/Patrick221215/AbMEGD获取。
英文摘要： Antibody design remains a critical challenge in therapeutic and diagnostic development, particularly for complex antigens with diverse binding interfaces. Current computational methods face two main limitations: (1) capturing geometric features while preserving symmetries, and (2) generalizing to novel antigen interfaces. Despite recent advancements, these methods often fail to accurately capture molecular interactions and maintain structural integrity. To address these challenges, we propose AbMEGD, an end-to-end framework integrating Multi-scale Equivariant Graph Diffusion for antibody sequence and structure co-design. Leveraging advanced geometric deep learning, AbMEGD combines atomic-level geometric features with residue-level embeddings, capturing local atomic details and global sequence-structure interactions. Its E(3)-equivariant diffusion method ensures geometric precision, computational efficiency, and robust generalizability for complex antigens. Furthermore, experiments using the SAbDab database demonstrate a 10.13% increase in amino acid recovery, a 3.32% rise in improvement percentage, and a 0.062 Å reduction in root mean square deviation within the critical CDR-H3 region compared to DiffAb, a leading antibody design model. These results highlight AbMEGD's ability to balance structural integrity with improved functionality, establishing a new benchmark for sequence-structure co-design and affinity optimization. The code is available at: https://github.com/Patrick221215/AbMEGD.
论文链接 https://arxiv.org/pdf/2506.20957v1.pdf
代码链接 https://github.com/patrick221215/abmegd
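
For readers unfamiliar with the structural metric quoted above, root mean square deviation (RMSD) is the square root of the mean squared distance between corresponding atoms. The following is a minimal sketch, assuming the two structures are already superimposed and have the same atoms in the same order. The coordinates are made up, and the abstract does not say which atoms or alignment procedure the authors use.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD in the units of the input coordinates (e.g. angstroms)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = [
        sum((p - q) ** 2 for p, q in zip(a, b))  # squared distance for one atom pair
        for a, b in zip(coords_a, coords_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical two-atom example: one atom is displaced by 1 unit, the other matches exactly.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
value = rmsd(reference, predicted)
print(round(value, 4))
```

No superposition step (such as the Kabsch algorithm) is performed here; in practice the structures are aligned before the deviation is measured.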

论文标题 Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection

中文摘要： 本文提出了一种基于空间统计的对象检测模型，旨在提高对空旷区域的可靠检测。传统的深度神经网络在目标检测和语义分割等计算机视觉任务中表现优异，但其预测的置信度往往校准不佳，且无法量化未检测到物体的区域是否真正无障碍。为解决这一问题，研究者将边界框数据视为标记点过程的实现，该方法常用于描述空间点事件（如边界框中心）的概率分布，其中标记用于描述边界框的空间扩展和类别。通过这种统计框架，模型能够进行基于似然性的训练，并提供明确的置信度估计，以判断某一区域是否可行驶。实验结果表明，该方法在置信度校准和性能评估方面均表现出色。
英文摘要： Deep neural networks have set the state-of-the-art in computer vision tasks such as bounding box detection and semantic segmentation. Object detectors and segmentation models assign confidence scores to predictions, reflecting the model's uncertainty in object detection or pixel-wise classification. However, these confidence estimates are often miscalibrated, as their architectures and loss functions are tailored to task performance rather than probabilistic foundation. Even with well calibrated predictions, object detectors fail to quantify uncertainty outside detected bounding boxes, i.e., the model does not make a probability assessment of whether an area without detected objects is truly free of obstacles. This poses a safety risk in applications such as automated driving, where uncertainty in empty areas remains unexplored. In this work, we propose an object detection model grounded in spatial statistics. Bounding box data matches realizations of a marked point process, commonly used to describe the probabilistic occurrence of spatial point events identified as bounding box centers, where marks are used to describe the spatial extension of bounding boxes and classes. Our statistical framework enables a likelihood-based training and provides well-defined confidence estimates for whether a region is drivable, i.e., free of objects. We demonstrate the effectiveness of our method through calibration assessments and evaluation of performance.
论文链接 https://arxiv.org/pdf/2506.21486v1.pdf
代码链接 https://github.com/tobiasriedlinger/cmppp-object-detection

论文标题 G²D: Boosting Multimodal Learning with Gradient-Guided Distillation

中文摘要： 多模态学习旨在利用多种数据模态的信息以实现更全面的性能。然而，传统的多模态模型常常存在模态不平衡问题，即一个或几个模态在模型优化过程中占据主导地位，导致特征表示不佳和弱模态的利用率不足。为了解决这一挑战，引入了梯度引导蒸馏（G²D），这是一种知识蒸馏框架，通过自定义损失函数融合单模态和多模态目标来优化多模态模型。G²D还在学习过程中引入了动态顺序模态优先级（SMP）技术，确保每个模态都能引领学习过程，避免强模态掩盖弱模态的问题。在多个真实世界数据集上的验证表明，G²D在训练过程中放大了弱模态的重要性，并在分类和回归任务中超越了现有最先进的方法。代码可在https://github.com/rAIson-Lab/G2D获取。
英文摘要： Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G²D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. G²D further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate G²D on multiple real-world datasets and show that G²D amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks. Our code is available at https://github.com/rAIson-Lab/G2D.
论文链接 https://arxiv.org/pdf/2506.21514v1.pdf
代码链接 https://github.com/raison-lab/g2d

论文标题 Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration

中文摘要： 大型视觉-语言模型（LVLMs）在多模态理解方面取得了显著进展，但它们常常会生成与视觉输入相矛盾的文本，即所谓的幻觉问题。现有的无训练解码策略存在关键局限性，如使用静态约束无法适应生成过程中的语义漂移、需要多次前向传递导致效率低下以及过度严格的干预规则导致细节损失。为解决这些问题，本文提出了一种新的无训练解码框架——动态logits校准（Dynamic Logits Calibration，DLC），旨在在推理时动态地使文本生成与视觉证据保持一致。在解码阶段，DLC逐步使用CLIP评估输入图像与生成文本序列之间的语义一致性。然后，通过动态更新的上下文基线评估候选词的相对视觉优势（RVA），自适应地调整输出logits以偏好有视觉依据的词。此外，一种基于实时上下文对齐分数的自适应加权机制，在施加视觉指导的同时兼顾文本输出的整体质量。在多种基准测试和不同LVLM架构（如LLaVA、InstructBLIP和MiniGPT-4）上的广泛实验表明，DLC显著减少了幻觉现象，优于现有方法，同时通过避免多次前向传递保持了高推理效率。总体而言，我们提出了一种有效且高效的解码阶段解决方案，以减轻幻觉问题，从而提高LVLMs在实际应用中的可靠性。代码将在GitHub上发布。
英文摘要： Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding, yet they are frequently hampered by hallucination: the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations, including the use of static constraints that do not adapt to semantic drift during generation, inefficiency stemming from the need for multiple forward passes, and degradation of detail due to overly rigid intervention rules. To overcome these challenges, this paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to dynamically align text generation with visual evidence at inference time. At the decoding phase, DLC employs CLIP step by step to assess the semantic alignment between the input image and the generated text sequence. Then, the Relative Visual Advantage (RVA) of candidate tokens is evaluated against a dynamically updated contextual baseline, adaptively adjusting output logits to favor tokens that are visually grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time context alignment score, carefully balances the visual guidance while ensuring the overall quality of the textual output. Extensive experiments conducted across diverse benchmarks and various LVLM architectures (such as LLaVA, InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces hallucinations, outperforming current methods while maintaining high inference efficiency by avoiding multiple forward passes. Overall, we present an effective and efficient decoding-time solution to mitigate hallucinations, thereby enhancing the reliability of LVLMs in practical applications. Code will be released on GitHub.
论文链接 https://arxiv.org/pdf/2506.21509v1.pdf
代码链接 https://github.com/jiahechen2002/dlc
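
The calibration step described in the abstract above (score each candidate token's visual alignment, compare it with a contextual baseline, and shift the logits accordingly) can be sketched as follows. This is an illustration under stated assumptions, not the authors' formulation. The abstract gives no formulas, so the additive update, the fixed `weight`, and the function name `calibrate_logits` are all hypothetical, and the alignment scores are supplied as plain numbers rather than computed with CLIP.

```python
def calibrate_logits(logits, align_scores, baseline, weight):
    """Shift each candidate's logit by its relative visual advantage (assumed additive form).

    logits:       candidate token -> raw logit from the language model
    align_scores: candidate token -> image-text alignment score if that token is appended
    baseline:     alignment score of the current context (updated every step in DLC)
    weight:       guidance strength (adaptive in the paper, a constant in this sketch)
    """
    adjusted = {}
    for token, logit in logits.items():
        relative_visual_advantage = align_scores[token] - baseline
        adjusted[token] = logit + weight * relative_visual_advantage
    return adjusted


# Made-up numbers: "cat" has the higher raw logit, but "dog" aligns better with the image.
logits = {"dog": 2.0, "cat": 2.1}
align_scores = {"dog": 0.31, "cat": 0.22}
adjusted = calibrate_logits(logits, align_scores, baseline=0.25, weight=5.0)
best = max(adjusted, key=adjusted.get)
print(best)
```

Only the top candidates would need to be scored this way at each step, which is consistent with the abstract's claim that no additional forward passes of the LVLM are required.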

论文标题 Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation

中文摘要： 研究针对乳腺癌中的非典型有丝分裂象（AMF）分类问题，提出了一种全面的基准测试，比较了多种深度学习方法，包括基线模型、线性探测的基础模型以及使用低秩适应（LoRA）微调的基础模型。为了进行严格的评估，研究引入了两个新的保留数据集：AtNorM-Br（来自TCGA乳腺癌队列的有丝分裂数据集）和AtNorM-MD（来自MIDOG++训练集的多领域有丝分裂数据集）。实验结果显示，在域内的AMi-Br以及域外的AtNorM-Br和AtNorM-MD数据集上，平均平衡准确率最高分别达到0.8135、0.7696和0.7705，其中基于LoRA微调的Virchow系列基础模型表现尤为出色。研究证明，尽管非典型有丝分裂分类是一个具有挑战性的问题，但通过利用最新的迁移学习和模型微调技术可以有效解决。所有代码和数据均可在GitHub仓库中获取。
英文摘要： Atypical mitoses mark a deviation in the cell division process that can be an independent prognostically relevant marker for tumor malignancy. However, their identification remains challenging due to low prevalence, at times subtle morphological differences from normal mitoses, low inter-rater agreement among pathologists, and class imbalance in datasets. Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including baseline models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new hold-out AMF datasets: AtNorM-Br, a dataset of mitoses from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitoses from the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7696, and 0.7705 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and AtNorM-MD datasets, respectively, with the results being particularly good for LoRA-based adaptation of the Virchow line of foundation models. Our work shows that atypical mitosis classification, while being a challenging problem, can be effectively addressed through the use of recent advances in transfer learning and model fine-tuning techniques. We make available all code and data used in this paper in this GitHub repository: https://github.com/DeepMicroscopy/AMi-Br_Benchmark.
论文链接 https://arxiv.org/pdf/2506.21444v1.pdf
代码链接 https://github.com/deepmicroscopy/ami-br_benchmark
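
The results above are reported as balanced accuracy, which is the mean of the per-class recalls and is therefore less flattering than plain accuracy when one class (here, atypical mitoses) is rare. A minimal sketch with made-up labels, not data from the paper:

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally regardless of its size."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical toy labels: 1 = atypical, 0 = normal (2 atypical vs. 8 normal).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.6875
```

In this toy case the classifier misses half of the atypical figures, which plain accuracy hides (0.8) and balanced accuracy exposes (recall 0.5 and 0.875 averaged to 0.6875).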

论文标题 How Good Are Synthetic Requirements? Evaluating LLM-Generated Datasets for AI4RE

中文摘要： 针对需求工程中人工智能（AI4RE）的发展，公开可用的标记需求数据集的短缺是一个主要障碍。虽然大型语言模型在合成数据生成方面展示了潜力，但控制和优化生成需求质量的系统方法仍待深入探索。本文介绍了Synthline v1，这是一种改进的产品线方法，用于生成合成需求数据，通过高级生成策略和管理技术扩展了之前的v0版本。研究探讨了四个问题：提示策略、自动提示优化以及后生成管理如何影响四个分类任务的数据质量，包括缺陷检测、功能与非功能性、质量与非质量、安全与非安全性。评估结果显示，多样本提示显著提高了数据的实用性和多样性，F1分数提高了6到44点。使用PACE（Prompt Actor-Critic Editing）进行自动提示优化的效果因任务而异，在功能分类上表现尤为突出（+32.5分），但在其他任务上则有所下降。基于相似性的管理虽然提高了多样性，却往往损害了分类性能，表明一定程度的冗余可能有助于机器学习模型。最重要的是，结果表明，对于特定任务，合成需求可以匹配甚至超越人工编写的需求，尤其是在安全（+7.8分）和缺陷分类（+15.4分）方面。这些发现为AI4RE提供了实用见解，并指出通过系统的合成数据生成来缓解数据集稀缺性是可行的路径。
英文摘要： The shortage of publicly available, labeled requirements datasets remains a major barrier to advancing Artificial Intelligence for Requirements Engineering (AI4RE). While Large Language Models offer promising capabilities for synthetic data generation, systematic approaches to control and optimize the quality of generated requirements remain underexplored. This paper presents Synthline v1, an enhanced Product Line approach for generating synthetic requirements data that extends our earlier v0 version with advanced generation strategies and curation techniques. We investigate four research questions assessing how prompting strategies, automated prompt optimization, and post-generation curation affect data quality across four classification tasks: defect detection, functional vs. non-functional, quality vs. non-quality, and security vs. non-security. Our evaluation shows that multi-sample prompting significantly boosts both utility and diversity over single-sample generation, with F1-score gains from 6 to 44 points. The use of PACE (Prompt Actor-Critic Editing) for automated prompt optimization yields task-dependent results, greatly improving functional classification (+32.5 points) but reducing performance on others. Interestingly, similarity-based curation improves diversity but often harms classification performance, indicating that some redundancy may help ML models. Most importantly, our results show that synthetic requirements can match or outperform human-authored ones for specific tasks, with synthetic data surpassing human data for security (+7.8 points) and defect classification (+15.4 points). These findings offer practical insights for AI4RE and chart a viable path to mitigating dataset scarcity through systematic synthetic generation.
论文链接 https://arxiv.org/pdf/2506.21138v1.pdf
代码链接 https://github.com/abdelkarim-elhajjami/synthline

论文标题 FairyGen: Storied Cartoon Video from a Single Child-Drawn Character

中文摘要： FairyGen是一种自动系统，能够从单个儿童绘制的角色生成故事驱动的卡通视频，并忠实保留其独特的艺术风格。与以往主要关注角色一致性和基础动作的故事讲述方法不同，FairyGen将角色建模与风格化背景生成分离，并结合电影镜头设计来支持富有表现力和连贯的故事叙述。给定一个单一的角色草图，系统首先使用多模态大语言模型（MLLM）生成结构化的分镜脚本，其中包含环境设定、角色动作和摄像机视角的镜头级描述。为了确保视觉一致性，引入了一种风格传播适配器，捕捉角色的视觉风格并将其应用于背景，从而在合成风格一致的场景时忠实保留角色的整体视觉特征。镜头设计模块通过基于分镜脚本的帧裁剪和多视角合成进一步增强视觉多样性和电影质感。为了使故事动起来，系统重建了角色的3D代理以生成物理上合理的运动序列，这些序列用于微调基于MMDiT的图像到视频扩散模型。此外，还提出了一种两阶段的运动定制适配器：第一阶段从时间无序的帧中学习外观特征，解耦身份与动作；第二阶段在冻结身份权重的情况下，利用时间步偏移策略建模时间动态。训练完成后，FairyGen可以直接渲染与分镜脚本对齐的各种连贯视频场景。广泛的实验表明，该系统生成的动画不仅在风格上忠实于原作、叙事结构清晰，而且具有自然流畅的动作，展示了其在个性化和引人入胜的故事动画方面的潜力。相关代码将在GitHub上公开。
英文摘要： We propose FairyGen, an automatic system for generating story-driven cartoon videos from a single child's drawing, while faithfully preserving its unique artistic style. Unlike previous storytelling methods that primarily focus on character consistency and basic motion, FairyGen explicitly disentangles character modeling from stylized background generation and incorporates cinematic shot design to support expressive and coherent storytelling. Given a single character sketch, we first employ an MLLM to generate a structured storyboard with shot-level descriptions that specify environment settings, character actions, and camera perspectives. To ensure visual consistency, we introduce a style propagation adapter that captures the character's visual style and applies it to the background, faithfully retaining the character's full visual identity while synthesizing style-consistent scenes. A shot design module further enhances visual diversity and cinematic quality through frame cropping and multi-view synthesis based on the storyboard. To animate the story, we reconstruct a 3D proxy of the character to derive physically plausible motion sequences, which are then used to fine-tune an MMDiT-based image-to-video diffusion model. We further propose a two-stage motion customization adapter: the first stage learns appearance features from temporally unordered frames, disentangling identity from motion; the second stage models temporal dynamics using a timestep-shift strategy with frozen identity weights. Once trained, FairyGen directly renders diverse and coherent video scenes aligned with the storyboard. Extensive experiments demonstrate that our system produces animations that are stylistically faithful and narratively structured, with natural motion, highlighting its potential for personalized and engaging story animation. The code will be available at https://github.com/GVCLab/FairyGen
论文链接 https://arxiv.org/pdf/2506.21272v1.pdf
代码链接 https://github.com/gvclab/fairygen

论文标题 LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection

中文摘要： 针对多模态目标检测中深度特征提取的问题，提出了一种新的融合检测基线，通过单一的特征级融合单元实现高性能检测，简化了训练过程。基于此方法，设计了一种轻量级注意力引导自调制特征融合网络（LASFNet），该网络引入了一种新颖的注意力引导自调制特征融合模块（ASFF），能够根据不同模态的注意力信息自适应调整全局和局部层面的融合特征响应，从而促进全面且丰富的特征生成。此外，在LASFNet的颈部设计了一个轻量级特征注意力转换模块（FATM），以增强对融合特征的关注并减少信息损失。在三个代表性数据集上的广泛实验表明，与现有最先进方法相比，该方法在效率和准确性之间取得了良好的平衡，参数数量和计算成本分别减少了90%和85%，同时提高了1%-3%的检测准确率（mAP）。代码将在https://github.com/leileilei2000/LASFNet开源。
英文摘要： Effective deep feature extraction via feature-level fusion is crucial for multimodal object detection. However, previous studies often involve complex training processes that integrate modality-specific features by stacking multiple feature-level fusion units, leading to significant computational overhead. To address this issue, we propose a new fusion detection baseline that uses a single feature-level fusion unit to enable high-performance detection, thereby simplifying the training process. Based on this approach, we propose a lightweight attention-guided self-modulation feature fusion network (LASFNet), which introduces a novel attention-guided self-modulation feature fusion (ASFF) module that adaptively adjusts the responses of fusion features at both global and local levels based on attention information from different modalities, thereby promoting comprehensive and enriched feature generation. Additionally, a lightweight feature attention transformation module (FATM) is designed at the neck of LASFNet to enhance the focus on fused features and minimize information loss. Extensive experiments on three representative datasets demonstrate that, compared to state-of-the-art methods, our approach achieves a favorable efficiency-accuracy trade-off, reducing the number of parameters and computational cost by as much as 90% and 85%, respectively, while improving detection accuracy (mAP) by 1%-3%. The code will be open-sourced at https://github.com/leileilei2000/LASFNet.
论文链接 https://arxiv.org/pdf/2506.21018v1.pdf
代码链接 https://github.com/leileilei2000/lasfnet

论文标题 WorldVLA: Towards Autoregressive Action World Model

中文摘要： WorldVLA 是一种自回归动作世界模型，它将动作和图像的理解与生成统一在一个框架中。该模型结合了视觉-语言-动作（VLA）模型和世界模型，通过利用动作和图像理解来预测未来图像，从而学习环境的物理特性以改进动作生成。同时，动作模型基于图像观察生成后续动作，有助于视觉理解并反过来帮助世界模型的视觉生成。研究表明，WorldVLA 在性能上优于独立的动作模型和世界模型，突显了两者之间的相互增强作用。此外，研究发现当以自回归方式生成一系列动作时，动作模型的性能会下降。这归因于模型在动作预测方面的泛化能力有限，导致早期动作的误差传播到后续动作。为解决这一问题，提出了一种注意力掩码策略，在生成当前动作时选择性地掩盖先前的动作，显著提高了动作块生成任务的性能。
英文摘要： We present WorldVLA, an autoregressive action world model that unifies action and image understanding and generation. WorldVLA integrates a Vision-Language-Action (VLA) model and a world model in a single framework. The world model predicts future images by leveraging both action and image understanding, with the purpose of learning the underlying physics of the environment to improve action generation. Meanwhile, the action model generates the subsequent actions based on image observations, which aids visual understanding and in turn helps the visual generation of the world model. We demonstrate that WorldVLA outperforms standalone action and world models, highlighting the mutual enhancement between the world model and the action model. In addition, we find that the performance of the action model deteriorates when generating sequences of actions in an autoregressive manner. This phenomenon can be attributed to the model's limited generalization capability for action prediction, leading to the propagation of errors from earlier actions to subsequent ones. To address this issue, we propose an attention mask strategy that selectively masks prior actions during the generation of the current action, which shows significant performance improvement in the action chunk generation task.
论文链接 https://arxiv.org/pdf/2506.21539v1.pdf
代码链接 https://github.com/alibaba-damo-academy/worldvla
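
The attention mask strategy mentioned above can be pictured as a causal mask with one extra rule: tokens of the action currently being generated may not attend to tokens of earlier actions, so errors in those earlier actions cannot propagate. The sketch below is one plausible reading of the abstract, not the authors' code. The token layout, the `("obs", None)` / `("act", i)` encoding, and the function name are hypothetical.

```python
def action_attention_mask(token_kinds):
    """Build mask[query][key] = True where the query token may attend to the key token.

    token_kinds[i] is ("obs", None) for an observation token (image or text),
    or ("act", a) for a token belonging to the a-th action in the chunk.
    """
    n = len(token_kinds)
    mask = [[False] * n for _ in range(n)]
    for query in range(n):
        q_kind, q_action = token_kinds[query]
        for key in range(query + 1):  # causal: never attend to future tokens
            k_kind, k_action = token_kinds[key]
            if q_kind == "act" and k_kind == "act" and k_action != q_action:
                continue  # hide earlier actions from the current action's tokens
            mask[query][key] = True
    return mask


# Two observation tokens followed by two actions of two tokens each.
tokens = [("obs", None), ("obs", None), ("act", 0), ("act", 0), ("act", 1), ("act", 1)]
mask = action_attention_mask(tokens)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With this mask each action in the chunk is conditioned on the observations and on its own already-generated tokens only, rather than on previously generated actions.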

论文标题 DBConformer: Dual-Branch Convolutional Transformer for EEG Decoding

中文摘要： DBConformer是一种专为脑电图（EEG）解码设计的双分支卷积Transformer网络。该模型结合了时间Conformer和空间Conformer，分别用于捕捉长范围的时间依赖性和通道间的交互作用，从而同时提取EEG信号中的时间动态和空间模式。此外，轻量级的通道注意力模块通过数据驱动的方式对EEG通道的重要性进行调整，进一步优化了空间表示。在五个运动想象（MI）数据集和两个癫痫检测数据集上的广泛实验表明，DBConformer在三种评估设置下均优于10种竞争基线模型，并且参数数量仅为高容量EEG Conformer基线模型的八分之一。可视化结果还证实，DBConformer提取的特征具有生理可解释性，并与MI中的感觉运动先验一致。这使得DBConformer在鲁棒性和可解释性方面表现出色，适用于可靠的EEG解码任务。代码已公开发布。
英文摘要： Electroencephalography (EEG)-based brain-computer interfaces (BCIs) transform spontaneous/evoked neural activity into control commands for external communication. While convolutional neural networks (CNNs) remain the mainstream backbone for EEG decoding, their inherently short receptive field makes it difficult to capture long-range temporal dependencies and global inter-channel relationships. Recent CNN-Transformer hybrids (Conformers) partially address this issue, but most adopt a serial design, resulting in suboptimal integration of local and global features, and often overlook explicit channel-wise modeling. To address these limitations, we propose DBConformer, a dual-branch convolutional Transformer network tailored for EEG decoding. It integrates a temporal Conformer to model long-range temporal dependencies and a spatial Conformer to extract inter-channel interactions, capturing both temporal dynamics and spatial patterns in EEG signals. A lightweight channel attention module further refines spatial representations by assigning data-driven importance to EEG channels. Extensive experiments on five motor imagery (MI) datasets and two seizure detection datasets under three evaluation settings demonstrate that DBConformer consistently outperforms 10 competitive baseline models, with over eight times fewer parameters than the high-capacity EEG Conformer baseline. Further, the visualization results confirm that the features extracted by DBConformer are physiologically interpretable and aligned with sensorimotor priors in MI. The superior performance and interpretability of DBConformer make it reliable for robust and explainable EEG decoding. Code is publicly available at https://github.com/wzwvv/DBConformer.
论文链接 https://arxiv.org/pdf/2506.21140v1.pdf
代码链接 https://github.com/wzwvv/DBConformer