2025 03 20

Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM

Yazeed Alnumay,Alexandre Barbet,Anna Bialas,William Darling,Shaan Desai,Joan Devassy,Kyle Duffy,Stephanie Howe,Olivia Lasche,Justin Lee,Anirudh Shrinivason,Jennifer Tracey

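Task: Build high-quality large language models (LLMs) for enterprise Arabic applications.

Motivation: Building high-quality LLMs for enterprise Arabic applications remains challenging because digitized Arabic data is scarce.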

Details

Method: Proposes a data synthesis and refinement strategy that expands the Arabic training corpus through synthetic data generation and human-in-the-loop annotation, together with an iterative post-training recipe for achieving state-of-the-art alignment with human preferences.

Result: Releases a small, 7B, open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, RAG, and contextual faithfulness.

Conclusion: The data synthesis and refinement strategy, combined with iterative post-training, yields a high-quality LLM for enterprise Arabic applications that performs strongly across multiple benchmarks.

Abstract: Building high-quality large language models (LLMs) for enterprise Arabic applications remains challenging due to the limited availability of digitized Arabic data. In this work, we present a data synthesis and refinement strategy to help address this problem, namely, by leveraging synthetic data generation and human-in-the-loop annotation to expand our Arabic training corpus. We further present our iterative post-training recipe, which is essential to achieving state-of-the-art performance in aligning the model with human preferences, a critical aspect of enterprise use cases. The culmination of this effort is the release of a small, 7B, open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, RAG, and contextual faithfulness.

Retrieval-Augmented Simulacra: Generative Agents for Up-to-date and Knowledge-Adaptive Simulations

Hikaru Shimadzu,Takehito Utsuro,Daisuke Kitayama

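Task: Evaluate how the search extension generation mechanism used in a virtual social networking environment affects the ability to generate posts and replies.

Motivation: The influence of social networking services (SNS) in Japan has grown significantly, and research on SNS marketing and on the propagation of emotions and information is active, creating the need for a system that predicts trends in SNS interactions.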

Details

Method: Builds a virtual SNS environment in which LLM-driven agents post and reply to one another in a chat community, simulating the behavior of various communities on SNS, and evaluates how the search extension generation mechanism affects the ability to generate posts and replies.

Result: Confirms that the search extension generation mechanism, which mimics human search behavior, produces the most natural exchanges.

Conclusion: The proposed search extension generation mechanism effectively generates natural posts and replies in a virtual SNS environment.

Abstract: In the 2023 edition of the White Paper on Information and Communications, it is estimated that the population of social networking service users in Japan would exceed 100 million by 2022, and the influence of social networking services in Japan is growing significantly. In addition, marketing using SNS and research on the propagation of emotions and information on SNS are being actively conducted, creating the need for a system for predicting trends in SNS interactions. We have already created a system that simulates the behavior of various communities on SNS by building a virtual SNS environment in which agents post and reply to each other in a chat community created by agents using LLMs. In this paper, we use this simulation system to evaluate the impact of the search extension generation mechanism, which is used to create posts and replies in the virtual SNS environment, on the ability to generate posts and replies. As a result of the evaluation, we confirmed that the proposed search extension generation mechanism, which mimics human search behavior, generates the most natural exchange.

An Explainable Framework for Misinformation Identification via Critical Question Answering

Ramon Ruiz-Dolz,John Lawrence

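Task: Propose an explainable framework, based on argumentation schemes and critical questions, for detecting factual and rational misinformation.

Motivation: Existing natural language misinformation detection approaches rely mainly on sequence classification, producing opaque systems in which the reasons behind a classification are unclear.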

Details

Method: Creates and releases the NLAS-CQ corpus, which combines 3,566 textbook-like natural language argumentation scheme instances with 4,687 answers to critical questions related to those arguments. On this corpus, implements and validates a new framework that combines classification with question answering.

Result: The framework analyses arguments to detect misinformation and presents explanations to the user in the form of critical questions.

Conclusion: The framework offers an explainable approach to factual and rational misinformation detection, validated on the NLAS-CQ corpus.

Abstract: Natural language misinformation detection approaches have been, to date, largely dependent on sequence classification methods, producing opaque systems in which the reasons behind classification as misinformation are unclear. While an effort has been made in the area of automated fact-checking to propose explainable approaches to the problem, this is not the case for automated reason-checking systems. In this paper, we propose a new explainable framework for both factual and rational misinformation detection based on the theory of Argumentation Schemes and Critical Questions. For that purpose, we create and release NLAS-CQ, the first corpus combining 3,566 textbook-like natural language argumentation scheme instances and 4,687 corresponding answers to critical questions related to these arguments. On the basis of this corpus, we implement and validate our new framework, which combines classification with question answering to analyse arguments in search of misinformation, and provides the explanations in the form of critical questions to the human user.

ConQuer: A Framework for Concept-Based Quiz Generation

Yicheng Fu,Zikui Wang,Liuxin Yang,Meiqing Huo,Zhongdongming Dai

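Task: Propose ConQuer, a concept-based quiz generation framework that uses external knowledge sources to generate high-quality quizzes.

Motivation: Although LLMs have made quiz generation more efficient, concerns remain about the quality and educational impact of AI-generated quizzes.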

Details

Method: Introduces the ConQuer framework, which generates quizzes from external knowledge sources, and evaluates them along multiple dimensions using LLMs as judges.

Result: Experiments show a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons.

Conclusion: ConQuer performs strongly at generating high-quality quizzes, and the effectiveness of each component in the framework is verified.

Abstract: Quizzes play a crucial role in education by reinforcing students' understanding of key concepts and encouraging self-directed exploration. However, compiling high-quality quizzes can be challenging and requires deep expertise and insight into specific subject matter. Although LLMs have greatly enhanced the efficiency of quiz generation, concerns remain regarding the quality of these AI-generated quizzes and their educational impact on students. To address these issues, we introduce ConQuer, a concept-based quiz generation framework that leverages external knowledge sources. We employ comprehensive evaluation dimensions to assess the quality of the generated quizzes, using LLMs as judges. Our experimental results demonstrate a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons against baseline quiz sets. Ablation studies further underscore the effectiveness of each component in our framework. Code available at https://github.com/sofyc/ConQuer.

Synthetic Data Generation of Body Motion Data by Neural Gas Network for Emotion Recognition

Seyed Muhammad Hossein Mousavi

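Task: Use the Neural Gas Network (NGN) algorithm to generate diverse body motion data for emotion recognition.

Motivation: Address the scarcity and limited diversity of body motion data in emotion recognition.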

Details

Method: Uses the Neural Gas Network (NGN) algorithm to generate body motion data, learning the skeletal structure topology to optimize diversity and generation speed.

Result: Body motion data generated by NGN is more realistic and more emotionally distinct than that of existing methods, and is generated faster.

Conclusion: NGN has a clear advantage in generating diverse, emotionally distinct body motion data and can effectively improve emotion recognition performance.

Abstract: In the domain of emotion recognition using body motion, the primary challenge lies in the scarcity of diverse and generalizable datasets. Automatic emotion recognition uses machine learning and artificial intelligence techniques to recognize a person's emotional state from various data types, such as text, images, sound, and body motion. Body motion poses unique challenges as many factors, such as age, gender, ethnicity, personality, and illness, affect its appearance, leading to a lack of diverse and robust datasets specifically for emotion recognition. To address this, employing Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks (GANs) and Variational Auto Encoders (VAEs), offers potential solutions, though these methods are often complex. This research introduces a novel application of the Neural Gas Network (NGN) algorithm for synthesizing body motion data and optimizing diversity and generation speed. By learning skeletal structure topology, the NGN fits the neurons or gas particles on body joints. Generated gas particles, which form the skeletal structure later on, will be used to synthesize the new body posture. By attaching body postures over frames, the final synthetic body motion appears. We compared our generated dataset against others generated by GANs, VAEs, and another benchmark algorithm, using benchmark metrics such as Fréchet Inception Distance (FID), Diversity, and a few more. Furthermore, we continued evaluation using classification metrics such as accuracy, precision, recall, and a few others. Joint-related features or kinematic parameters were extracted, and the system assessed model performance against unseen data. Our findings demonstrate that the NGN algorithm produces more realistic and emotionally distinct body motion data and does so at a higher synthesis speed than existing methods.
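
The abstract describes fitting neural gas units ("gas particles") to body joints. As a rough illustration only, the sketch below applies the classic rank-based neural gas update to a handful of toy 2D points. The function name `fit_neural_gas`, the toy joint coordinates, and every hyperparameter are assumptions made for this example; the step size and neighbourhood range are held constant rather than annealed, and this is not the author's implementation.

```python
import math
import random


def fit_neural_gas(points, n_units=4, epochs=50, eps=0.3, lam=1.0, seed=0):
    """Classic neural gas: every unit moves toward each sample,
    with a step that decays with the unit's distance rank."""
    rng = random.Random(seed)
    # Initialise units at randomly chosen data points.
    units = [list(rng.choice(points)) for _ in range(n_units)]
    for _ in range(epochs):
        for x in rng.sample(points, len(points)):  # one shuffled pass
            # Rank units from closest to farthest from the sample.
            order = sorted(range(n_units), key=lambda i: math.dist(units[i], x))
            for rank, i in enumerate(order):
                step = eps * math.exp(-rank / lam)
                units[i] = [w + step * (xi - w) for w, xi in zip(units[i], x)]
    return units


# Toy "joint" positions (hypothetical 2D coordinates, not real motion data).
joints = [(0.0, 1.8), (0.0, 1.4), (-0.4, 1.0), (0.4, 1.0), (0.0, 0.9)]
units = fit_neural_gas(joints)
```

Because each update is a convex combination of a unit and a sample, the fitted units always stay inside the region spanned by the data, which is what lets them settle on the skeleton-like structure of the input.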

Generating Medically-Informed Explanations for Depression Detection using LLMs

Xiangyong Chen,Xiaochuan Lin

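Task: Use a pre-trained large language model for multi-task depression detection that also generates textual explanations grounded in medical diagnostic criteria.

Motivation: Early detection of depression from social media data offers a valuable opportunity for timely intervention, but the task requires professional medical knowledge and accurate, explainable models.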

Details

Method: Proposes LLM-MTD, which uses a multi-task learning framework with a combined loss function to optimize classification accuracy and explanation quality at the same time.

Result: Evaluated on the Reddit Self-Reported Depression Dataset (RSDD), LLM-MTD achieves state-of-the-art performance in depression detection, with significant improvements in AUPRC and other key metrics.

Conclusion: The work contributes a new approach to depression detection that combines large language models with explainability.

Abstract: Early detection of depression from social media data offers a valuable opportunity for timely intervention. However, this task poses significant challenges, requiring both professional medical knowledge and the development of accurate and explainable models. In this paper, we propose LLM-MTD (Large Language Model for Multi-Task Depression Detection), a novel approach that leverages a pre-trained large language model to simultaneously classify social media posts for depression and generate textual explanations grounded in medical diagnostic criteria. We train our model using a multi-task learning framework with a combined loss function that optimizes both classification accuracy and explanation quality. We evaluate LLM-MTD on the benchmark Reddit Self-Reported Depression Dataset (RSDD) and compare its performance against several competitive baseline methods, including traditional machine learning and fine-tuned BERT. Our experimental results demonstrate that LLM-MTD achieves state-of-the-art performance in depression detection, showing significant improvements in AUPRC and other key metrics. Furthermore, human evaluation of the generated explanations reveals their relevance, completeness, and medical accuracy, highlighting the enhanced interpretability of our approach. This work contributes a novel methodology for depression detection that combines the power of large language models with the crucial aspect of explainability.

Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control

Hejia Chen,Haoxian Zhang,Shoulong Zhang,Xiaoqiang Liu,Sisi Zhuang,Yuan Zhang,Pengfei Wan,Di Zhang,Shuai Li

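Task: Propose Cafe-Talk, a diffusion-transformer-based 3D talking face generation model, to achieve accurate lip synchronization and controllable expressions.

Motivation: Existing methods use only discrete emotion labels to control expressions globally, which limits flexible fine-grained facial control in the spatiotemporal domain.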

Details

Method: Proposes a two-stage training pipeline that first trains with speech audio and coarse-grained conditions and then gradually adds fine-grained control conditions. A swap-label training mechanism and a mask-based CFG technique are designed to disentangle coarse- and fine-grained conditions.

Result: Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness, and is widely accepted in user studies.

Conclusion: Through multimodal control conditions, Cafe-Talk achieves accurate lip synchronization and flexible expression control, demonstrating its strength in fine-grained control.

Abstract: Speech-driven 3D talking face methods should offer both accurate lip synchronization and controllable expressions. Previous methods solely adopt discrete emotion labels to globally control expressions throughout sequences while limiting flexible fine-grained facial control within the spatiotemporal domain. We propose a diffusion-transformer-based 3D talking face generation model, Cafe-Talk, which simultaneously incorporates coarse- and fine-grained multimodal control conditions. Nevertheless, the entanglement of multiple conditions makes it challenging to achieve satisfactory performance. To disentangle speech audio and fine-grained conditions, we employ a two-stage training pipeline. Specifically, Cafe-Talk is initially trained using only speech audio and coarse-grained conditions. Then, a proposed fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs), preventing unfavorable speech-lip synchronization. To disentangle coarse- and fine-grained conditions, we design a swap-label training mechanism, which enables the dominance of the fine-grained conditions. We also devise a mask-based CFG technique to regulate the occurrence and intensity of fine-grained control. In addition, a text-based detector is introduced with text-AU alignment to enable natural language user input and further support multimodal control. Extensive experimental results prove that Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness performance and receives wide acceptance in fine-grained control in user studies. Project page: https://harryxd2018.github.io/cafe-talk/
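
The abstract mentions a mask-based CFG technique that regulates when and how strongly fine-grained control applies, without giving its formulation. The sketch below shows only the generic idea: standard classifier-free guidance, `uncond + scale * (cond - uncond)`, gated per frame by a mask. The function `masked_cfg`, the scalar per-frame outputs, and the guidance scale are all hypothetical, and the paper's actual technique may differ.

```python
def masked_cfg(uncond, cond, mask, scale=2.0):
    """Per-frame classifier-free guidance.

    uncond, cond: model outputs without / with the fine-grained condition.
    mask: 1.0 where fine-grained control is active, 0.0 elsewhere.
    """
    return [u + m * scale * (c - u) for u, c, m in zip(uncond, cond, mask)]


# Hypothetical scalar outputs for four frames; control is active on frames 1-2.
uncond = [0.0, 0.1, 0.2, 0.3]
cond = [0.5, 0.6, 0.7, 0.8]
mask = [0.0, 1.0, 1.0, 0.0]
guided = masked_cfg(uncond, cond, mask)
```

In this toy setup, frames where the mask is zero fall back to the unconditional output, while the scale sets the intensity of control on the masked frames.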

Rui Yang,Lin Song,Yicheng Xiao,Runhui Huang,Yixiao Ge,Ying Shan,Hengshuang Zhao

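Task: Propose a simple yet efficient method for building a baseline for end-to-end large multi-modal models based on a single transformer.

Motivation: Native large multi-modal models (LMMs) are resource-intensive and lag in performance, yet they hold great promise, so a more efficient way to build them is needed.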

Details

Method: Proposes a new early-fusion LMM that fuses multi-modal inputs at an early stage and responds to visual instructions auto-regressively, along with an efficient training recipe that exploits the prior knowledge of pre-trained models to address both the performance limitations and the resource consumption.

Result: The proposed model outperforms other single-transformer LMMs and significantly narrows the performance gap with compositional LMMs.

Conclusion: The method offers an effective way to build efficient native, end-to-end large multi-modal models, markedly improving performance and reducing resource consumption.

Abstract: Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.

Salient Temporal Encoding for Dynamic Scene Graph Generation

Zhihao Zhu

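Task: Propose a new spatial-temporal scene graph generation method that selectively builds temporal connections between temporally relevant object pairs and represents temporal relations as explicit edges in the scene graph.

Motivation: Existing spatial-temporal scene graph generation methods build dense, abstract temporal connections among all objects, but not all temporal connections encode meaningful temporal dynamics.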

Details

Method: Proposes a new spatial-temporal scene graph generation method that selectively builds temporal connections between temporally relevant object pairs and represents temporal relations as explicit edges in the scene graph.

Result: Improves on strong baselines by up to 4.4% in Scene Graph Detection and yields a 0.6% gain in mAP on action recognition.

Conclusion: The sparse, explicit temporal representation improves both scene graph generation and downstream vision tasks.

Abstract: Representing a dynamic scene using a structured spatial-temporal scene graph is a novel and particularly challenging task. To tackle this task, it is crucial to learn the temporal interactions between objects in addition to their spatial relations. Due to the lack of explicitly annotated temporal relations in current benchmark datasets, most of the existing spatial-temporal scene graph generation methods build dense and abstract temporal connections among all objects across frames. However, not all temporal connections encode meaningful temporal dynamics. We propose a novel spatial-temporal scene graph generation method that selectively builds temporal connections only between temporally relevant object pairs and represents the temporal relations as explicit edges in the scene graph. The resulting sparse and explicit temporal representation allows us to improve upon strong scene graph generation baselines by up to 4.4% in Scene Graph Detection. In addition, we show that our approach can be leveraged to improve downstream vision tasks. In particular, applying our approach to action recognition shows a 0.6% gain in mAP compared with the state of the art.

Hakyung Sung,Gyu-Ho Shin

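Task: Expand the second language (L2) Korean Universal Dependencies (UD) treebank and evaluate model performance on in-domain and out-of-domain datasets.

Motivation: To align better with the UD framework and to highlight the importance of fine-tuning first-language-based, general-purpose language models on well-tailored L2 datasets.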

Details

Method: Expands the L2 Korean UD treebank, revises the annotation guidelines, and fine-tunes three Korean language models.

Result: Fine-tuning significantly improves model performance across various metrics.

Conclusion: Fine-tuning first-language-based, general-purpose language models on well-tailored L2 datasets is important for the morphosyntactic analysis of L2 data.

Abstract: We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. The annotation guidelines are also revised to better align with the UD framework. Using this enhanced treebank, we fine-tune three Korean language models and evaluate their performance on in-domain and out-of-domain L2-Korean datasets. The results show that fine-tuning significantly improves their performance across various metrics, thus highlighting the importance of using well-tailored L2 datasets for fine-tuning first-language-based, general-purpose language models for the morphosyntactic analysis of L2 data.

ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis

Yu Fang,Yue Yang,Xinghao Zhu,Kaiyuan Zheng,Gedas Bertasius,Daniel Szafir,Mingyu Ding

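Task: Propose ReBot, a real-to-sim-to-real approach for scaling real robot datasets and adapting vision-language-action (VLA) models to target domains.

Motivation: The high cost of real-world data collection limits the generalizability of VLA models; ReBot aims to scale datasets through simulation while reducing the sim-to-real gap.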

Details

Method: ReBot replays real-world robot trajectories in simulation to diversify manipulated objects (real-to-sim) and combines the simulated movements with inpainted real-world backgrounds to synthesize physically realistic, temporally consistent robot videos (sim-to-real).

Result: Extensive experiments in simulation and real-world environments show that ReBot significantly improves the performance and robustness of VLA models. For example, in SimplerEnv, ReBot improves the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. In real-world evaluation with a Franka robot, ReBot raises the success rates of Octo by 17% and OpenVLA by 20%.

Conclusion: With its real-to-sim-to-real approach, ReBot effectively scales real robot datasets and significantly improves the performance and generalization of VLA models.

Abstract: Vision-language-action (VLA) models present a promising paradigm by training policies directly on real robot datasets like Open X-Embodiment. However, the high cost of real-world data collection hinders further data scaling, thereby restricting the generalizability of VLAs. In this paper, we introduce ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains, which is the last-mile deployment challenge in robot manipulation. Specifically, ReBot replays real-world robot trajectories in simulation to diversify manipulated objects (real-to-sim), and integrates the simulated movements with inpainted real-world background to synthesize physically realistic and temporally consistent robot videos (sim-to-real). Our approach has several advantages: 1) it enjoys the benefit of real data to minimize the sim-to-real gap; 2) it leverages the scalability of simulation; and 3) it can generalize a pretrained VLA to a target domain with fully automated data pipelines. Extensive experiments in both simulation and real-world environments show that ReBot significantly enhances the performance and robustness of VLAs. For example, in SimplerEnv with the WidowX robot, ReBot improved the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. For real-world evaluation with a Franka robot, ReBot increased the success rates of Octo by 17% and OpenVLA by 20%. More information can be found at: https://yuffish.github.io/rebot/

Strategic resource allocation in memory encoding: An efficiency principle shaping language processing

Weijie Xu,Richard Futrell

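Task: Investigate the strategic allocation of working memory resources in sentence processing.

Motivation: Examine how the limited capacity of working memory is used efficiently to support human linguistic behavior.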

Details

Method: Develops a theoretical hypothesis from a resource-rational perspective and tests it empirically on naturalistic corpus data.

Result: Finds evidence for strategic resource allocation in dependency locality and reveals cross-linguistic variability.

Conclusion: How strategic resource allocation, as a universal efficiency principle, interacts with language-specific phrase structures requires further study.

Abstract: How is the limited capacity of working memory efficiently used to support human linguistic behaviors? In this paper, we investigate strategic resource allocation as an efficiency principle for memory encoding in sentence processing. The idea is that working memory resources are dynamically and strategically allocated to prioritize novel and unexpected information, enhancing their representations to make them less susceptible to memory decay and interference. Theoretically, from a resource-rational perspective, we argue that this efficiency principle naturally arises from two functional assumptions about working memory, namely, its limited capacity and its noisy representation. Empirically, through naturalistic corpus data, we find converging evidence for strategic resource allocation in the context of dependency locality from both the production and the comprehension side, where non-local dependencies with less predictable antecedents are associated with a reduced locality effect. However, our results also reveal considerable cross-linguistic variability, highlighting the need for a closer examination of how strategic resource allocation, as a universal efficiency principle, interacts with language-specific phrase structures.

SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders

Qing Li,Jiahui Geng,Derui Zhu,Fengyu Cai,Chenyang Lyu,Fakhri Karray

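Task: Propose SAUCE, a new method for fine-grained, selective concept unlearning in vision-language models (VLMs).

Motivation: Existing unlearning methods for VLMs mostly borrow techniques from large language models (LLMs); they require extensive annotated forget sets and unlearn at a coarse granularity, leading to excessive forgetting and reduced model utility.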

Details

Method: SAUCE uses sparse autoencoders (SAEs) to capture high-dimensional, semantically rich sparse features and identifies the features most relevant to the target concept. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information.

Result: SAUCE is evaluated on two VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across 60 concepts. Experiments show that it outperforms existing methods by 18.04% in unlearning quality while maintaining comparable model utility.

Conclusion: SAUCE is an effective and scalable solution for selective concept unlearning in VLMs.

Abstract: Unlearning methods for vision-language models (VLMs) have primarily adapted techniques from large language models (LLMs), relying on weight updates that demand extensive annotated forget sets. Moreover, these methods perform unlearning at a coarse granularity, often leading to excessive forgetting and reduced model utility. To address this issue, we introduce SAUCE, a novel method that leverages sparse autoencoders (SAEs) for fine-grained and selective concept unlearning in VLMs. Briefly, SAUCE first trains SAEs to capture high-dimensional, semantically rich sparse features. It then identifies the features most relevant to the target concept for unlearning. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information. We evaluate SAUCE on two distinct VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across two types of tasks: concrete concept unlearning (objects and sports scenes) and abstract concept unlearning (emotions, colors, and materials), encompassing a total of 60 concepts. Extensive experiments demonstrate that SAUCE outperforms state-of-the-art methods by 18.04% in unlearning quality while maintaining comparable model utility. Furthermore, we investigate SAUCE's robustness against widely used adversarial attacks, its transferability across models, and its scalability in handling multiple simultaneous unlearning requests. Our findings establish SAUCE as an effective and scalable solution for selective concept unlearning in VLMs.
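
The abstract outlines three steps: encode activations into sparse features, pick the features most relevant to the target concept, and modify them at inference. The toy sketch below walks through those steps with a hand-set linear "SAE". The weights, the relevance score (mean activation gap between concept and non-concept samples), and the suppression rule (scaling by a factor) are assumptions chosen for illustration; the abstract does not specify how SAUCE scores or edits features.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def column_mean(rows, j):
    return sum(r[j] for r in rows) / len(rows)


def top_concept_features(concept_feats, other_feats, k=1):
    """Rank features by mean activation gap (concept minus other)."""
    n = len(concept_feats[0])
    gap = [column_mean(concept_feats, j) - column_mean(other_feats, j)
           for j in range(n)]
    return sorted(range(n), key=lambda j: gap[j], reverse=True)[:k]


def suppress(feats, selected, factor=0.0):
    """Scale the selected features; factor=0.0 removes them entirely."""
    return [f * factor if j in selected else f for j, f in enumerate(feats)]


# Hypothetical 2-d activations and a hand-set 3-feature encoder/decoder.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy decoder; ignores feature 2

concept = [relu(matvec(W_enc, x)) for x in ([2.0, 0.1], [1.8, 0.0])]
other = [relu(matvec(W_enc, x)) for x in ([0.1, 1.0], [0.0, 1.2])]
selected = top_concept_features(concept, other, k=1)

feats = relu(matvec(W_enc, [2.0, 1.0]))
edited = matvec(W_dec, suppress(feats, selected))
```

Here feature 0 fires mainly on the "concept" samples, so it is the one selected; zeroing it removes that direction from the reconstructed activation while the unrelated second coordinate passes through unchanged.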

Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence

Sophia Hager,David Mueller,Kevin Duh,Nicholas Andrews

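Task: Develop a method that enables large language models (LLMs) to express the probability that their answers are correct.

Motivation: As LLMs are increasingly used for factual question answering, it becomes more important for them to communicate the likelihood that an answer is correct.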

Details

Method: Proposes a simple procedure, uncertainty distillation, which uses held-out data to map initial uncertainty estimates to meaningful probabilities and creates annotated examples for supervised fine-tuning.

Result: The method yields verbalized confidences that correlate with observed error rates, and semantic uncertainty correlates well with lexical uncertainty on short answers.

Conclusion: Uncertainty distillation effectively teaches LLMs to verbalize calibrated semantic confidence, improving their reliability in practical applications.

Abstract: As large language models (LLMs) are increasingly used for factual question-answering, it becomes more important for LLMs to have the capability to communicate the likelihood that their answer is correct. For these verbalized expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence. However, when prompted to express confidence, the error rates of current LLMs are inconsistent with their communicated confidences, highlighting the need for uncertainty quantification methods. Many prior methods calculate lexical uncertainty, estimating a model's confidence in the specific string it generated. In some cases, however, it may be more useful to estimate semantic uncertainty, or the model's confidence in the answer regardless of how it is verbalized. We propose a simple procedure, uncertainty distillation, to teach an LLM to verbalize calibrated semantic confidences. Using held-out data to map initial uncertainty estimates to meaningful probabilities, we create examples annotated with verbalized probabilities for supervised fine-tuning. We demonstrate our method yields verbalized confidences that correlate with observed error rates with a small fine-tuned language model as well as with larger instruction-tuned models, and find that our semantic uncertainty correlates well with lexical uncertainty on short answers.
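
The abstract's key step is mapping initial uncertainty estimates to meaningful probabilities using held-out data, then annotating examples with verbalized probabilities. One simple way to do such a mapping is histogram binning, sketched below. The choice of histogram binning, the functions `fit_bins` and `annotate`, the annotation format, and the toy data are all assumptions for illustration; the abstract does not say which calibration map the authors use.

```python
def fit_bins(scores, correct, n_bins=4):
    """Histogram binning: observed accuracy per equal-width score bin."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for s, c in zip(scores, correct):
        b = min(int(s * n_bins), n_bins - 1)
        hits[b] += c
        counts[b] += 1
    # Empty bins yield None so callers can decide how to handle them.
    return [h / n if n else None for h, n in zip(hits, counts)]


def annotate(answer, score, bin_acc):
    """Attach a verbalized, calibrated probability to an answer.

    Assumes the bin that `score` falls into was non-empty during fitting.
    """
    b = min(int(score * len(bin_acc)), len(bin_acc) - 1)
    return f"{answer} (confidence: {bin_acc[b]:.0%})"


# Hypothetical held-out data: raw confidence scores and 0/1 correctness.
held_out_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
held_out_correct = [0, 0, 1, 0, 1, 0, 1, 1, 1]
bin_acc = fit_bins(held_out_scores, held_out_correct)
example = annotate("Paris", 0.65, bin_acc)
```

Strings like `example` would then serve as supervised fine-tuning targets, so the stated confidence reflects the accuracy observed on held-out answers with similar raw scores.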

Interpretable Unsupervised Joint Denoising and Enhancement for Real-World low-light Scenarios

Huaqiu Li,Xiaowan Hu,Haoqian Wang

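Task: Propose an interpretable, zero-reference joint denoising and low-light enhancement framework for real-world scenarios.

Motivation: Real-world low-light images often suffer from complex degradations such as local overexposure, low brightness, noise, and uneven illumination. Supervised methods tend to overfit to specific scenarios, while unsupervised methods struggle to model these degradations because they lack reference images.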

Details

Method: Grounded in physical imaging principles and retinex theory, proposes a training strategy based on paired sub-images, uses the Discrete Cosine Transform (DCT) for frequency domain decomposition in the sRGB space, and introduces an implicit-guided hybrid representation strategy that effectively separates intricate compounded degradations. For the backbone, develops a retinal decomposition network guided by implicit degradation representation mechanisms.

Result: Extensive experiments demonstrate the superiority of the method.

Conclusion: The method performs well at joint denoising and low-light enhancement in real-world scenarios. Code will be available at https://github.com/huaqlili/unsupervised-light-enhance-ICLR2025.

Abstract: Real-world low-light images often suffer from complex degradations such as local overexposure, low brightness, noise, and uneven illumination. Supervised methods tend to overfit to specific scenarios, while unsupervised methods, though better at generalization, struggle to model these degradations due to the lack of reference images. To address this issue, we propose an interpretable, zero-reference joint denoising and low-light enhancement framework tailored for real-world scenarios. Our method derives a training strategy based on paired sub-images with varying illumination and noise levels, grounded in physical imaging principles and retinex theory. Additionally, we leverage the Discrete Cosine Transform (DCT) to perform frequency domain decomposition in the sRGB space, and introduce an implicit-guided hybrid representation strategy that effectively separates intricate compounded degradations. In the backbone network design, we develop a retinal decomposition network guided by implicit degradation representation mechanisms. Extensive experiments demonstrate the superiority of our method. Code will be available at https://github.com/huaqlili/unsupervised-light-enhance-ICLR2025.
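
To make the idea of DCT-based frequency decomposition concrete, the sketch below splits a 1-D row of pixel intensities into a low-frequency part and a high-frequency part with an orthonormal DCT-II. The 1-D setting, the cutoff, the function names, and the sample values are assumptions for illustration; the paper works on sRGB images and its exact decomposition is not given in the abstract.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(x))
        out.append(s * math.sqrt((1 if k == 0 else 2) / n))
    return out


def idct(c):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(
            v * math.sqrt((1 if k == 0 else 2) / n)
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k, v in enumerate(c)
        )
        for i in range(n)
    ]


def split_bands(x, cutoff):
    """Return (low, high) signals whose sum reconstructs x."""
    c = dct(x)
    low = idct([v if k < cutoff else 0.0 for k, v in enumerate(c)])
    high = idct([v if k >= cutoff else 0.0 for k, v in enumerate(c)])
    return low, high


# Hypothetical row of intensities with one noisy spike at index 3.
row = [0.2, 0.25, 0.22, 0.8, 0.21, 0.2, 0.19, 0.2]
low, high = split_bands(row, cutoff=2)
```

The low band carries the smooth, illumination-like trend and the high band carries fine detail and noise-like variation; because the transform is linear, the two bands add back to the original row.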

Language Independent Named Entity Recognition via Orthogonal Transformation of Word Vectors

Omar E. Rakha,Hazem M. Abbas

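Task: Perform cross-lingual named entity recognition using a Bidirectional LSTM/CRF model with word embeddings.

Motivation: Word embeddings are a key building block in NLP. The paper aims to achieve cross-lingual named entity recognition by training a model on a source language (English) and mapping target-language word embeddings into the source-language embedding space with an orthogonal linear transformation matrix.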

Details

Method: Proposes a model based on Bidirectional LSTM/CRF with word embeddings, in which an orthogonal linear transformation matrix maps target-language word embeddings to source-language word embeddings.

Result: After training on an English dataset, the model detects named entities in an Arabic dataset without any training or fine-tuning on Arabic data.

Conclusion: The model demonstrates the potential of cross-lingual named entity recognition without additional training or fine-tuning on the target language.

Abstract: Word embeddings have been a key building block for NLP in which models relied heavily on word embeddings in many different tasks. In this paper, a model is proposed based on using Bidirectional LSTM/CRF with word embeddings to perform named entity recognition for any language. This is done by training a model on a source language (English) and transforming word embeddings from the target language into word embeddings of the source language by using an orthogonal linear transformation matrix. Evaluation of the model shows that by training a model on an English dataset the model was capable of detecting named entities in an Arabic dataset without either training or fine-tuning the model on an Arabic language dataset.
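
The abstract relies on an orthogonal linear map from target-language embeddings into the source-language space, but does not say how the matrix is obtained. The sketch below fits the simplest possible case, a least-squares 2-D rotation, from a small seed dictionary of paired vectors. The 2-D restriction, the function names, and the toy vectors are assumptions for illustration; in higher dimensions such a map is usually fitted with the orthogonal Procrustes solution via SVD.

```python
import math


def fit_rotation(src, dst):
    """Least-squares 2-D rotation mapping src vectors onto dst vectors."""
    dot = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(src, dst))
    cross = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(src, dst))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def apply(matrix, v):
    """Multiply a 2x2 matrix by a 2-D vector."""
    return [matrix[0][0] * v[0] + matrix[0][1] * v[1],
            matrix[1][0] * v[0] + matrix[1][1] * v[1]]


# Hypothetical seed dictionary: target-language vectors and their
# source-language (English) counterparts, here an exact 90-degree rotation.
target_vecs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
english_vecs = [(0.0, 1.0), (-1.0, 0.0), (-1.0, 1.0)]

W = fit_rotation(target_vecs, english_vecs)
mapped = apply(W, (2.0, 0.0))  # an unseen target-language vector
```

Because the map is orthogonal, it preserves lengths and angles between word vectors, so a tagger trained on English embeddings sees mapped target-language vectors with the same geometry they had in their original space.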

Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey

Liewen Liao,Weihao Yan,Ming Yang,Songan Zhang

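Task: Survey learning-based 3D reconstruction in autonomous driving and its recent advances.

Motivation: 3D reconstruction plays an important role in autonomous driving: it enables precise modeling of dynamic and static environments and advances key tasks such as scene understanding and closed-loop simulation.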

Details

Method: Through a multi-perspective, in-depth analysis, systematically introduces the preliminaries of learning-based 3D reconstruction, including data formats, benchmarks, and technical foundations, and categorizes methods and analyzes them along multiple dimensions.

Result: Summarizes the development trends and existing challenges of learning-based 3D reconstruction in autonomous driving.

Conclusion: The survey aims to serve as a technical reference and to inspire future research.

Abstract: Learning-based 3D reconstruction has emerged as a transformative technique in autonomous driving, enabling precise modeling of both dynamic and static environments through advanced neural representations. Beyond augmenting perception, 3D reconstruction inspires pioneering solutions for vital tasks in the field of autonomous driving, such as scene understanding and closed-loop simulation. Commencing with an examination of input modalities, we investigate the details of 3D reconstruction and conduct a multi-perspective, in-depth analysis of recent advancements. Specifically, we first provide a systematic introduction of preliminaries, including data formats, benchmarks and technical preliminaries of learning-based 3D reconstruction, facilitating instant identification of suitable methods based on hardware configurations and sensor suites. Then, we systematically review learning-based 3D reconstruction methods in autonomous driving, categorizing approaches by subtasks and conducting multi-dimensional analysis and summary to establish a comprehensive technical reference. The development trends and existing challenges are summarized in the context of learning-based 3D reconstruction in autonomous driving. We hope that our review will inspire future research.

FACTS&EVIDENCE: An Interactive Tool for Transparent Fine-Grained Factual Verification of Machine-Generated Text

Varich Boonsanong,Vidhisha Balachandran,Xiaochuang Han,Shangbin Feng,Lucy Lu Wang,Yulia Tsvetkov

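Task: Develop an interactive, transparent tool for user-driven verification of complex text.

Motivation: Existing automated fact verification tools lack transparency and diversity in evidence sources, so they cannot provide a trustworthy user experience.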

Details

Method: Develops the Facts&Evidence tool, which breaks down complex input text, visualizes the credibility of individual claims, and provides explanations of model decisions along with attribution to multiple evidence sources.

Result: Facts&Evidence helps users understand, verify, selectively trust, and use machine-generated text.

Conclusion: Facts&Evidence aims to empower consumers of machine-generated text, enabling them to understand, verify, selectively trust, and use such text.

Abstract: With the widespread consumption of AI-generated content, there has been an increased focus on developing automated tools to verify the factual accuracy of such content. However, prior research and tools developed for fact verification treat it as a binary classification or a linear regression problem. Although this is a useful mechanism as part of automatic guardrails in systems, we argue that such tools lack transparency in the prediction reasoning and diversity in source evidence to provide a trustworthy user experience. We develop Facts&Evidence, an interactive and transparent tool for user-driven verification of complex text. The tool facilitates the intricate decision-making involved in fact-verification, presenting its users a breakdown of complex input texts to visualize the credibility of individual claims along with an explanation of model decisions and attribution to multiple, diverse evidence sources. Facts&Evidence aims to empower consumers of machine-generated text and give them agency to understand, verify, selectively trust and use such text.

Matching Skeleton-based Activity Representations with Heterogeneous Signals for HAR

Shuheng Li,Jiayun Zhang,Xiaohan Fu,Xiyuan Zhang,Jingbo Shang,Rajesh K. Gupta

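Task: Propose SKELAR, a new framework that pretrains activity representations from skeleton data and matches them with heterogeneous HAR signals.

Motivation: In human activity recognition (HAR), activity labels have typically been encoded in one-hot format, with a recent shift towards textual representations that provide contextual knowledge. Text is inherently limited, however; HAR should be anchored to physical motion data, because motion forms the basis of activity and applies across sensing systems.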

Details

Method: Proposes the SKELAR framework, which captures core motion knowledge through a self-supervised coarse angle reconstruction task and dynamically prioritizes relevant body parts through a self-attention matching module.

Result: SKELAR achieves state-of-the-art performance in both full-shot and few-shot settings and can effectively use synthetic skeleton data to extend its applicability.

Conclusion: By pretraining representations from skeleton data and matching them with heterogeneous HAR signals, SKELAR addresses core challenges in HAR and shows superior performance in experiments.

Abstract: In human activity recognition (HAR), activity labels have typically been encoded in one-hot format, with a recent shift towards using textual representations to provide contextual knowledge. Here, we argue that HAR should be anchored to physical motion data, as motion forms the basis of activity and applies effectively across sensing systems, whereas text is inherently limited. We propose SKELAR, a novel HAR framework that pretrains activity representations from skeleton data and matches them with heterogeneous HAR signals. Our method addresses two major challenges: (1) capturing core motion knowledge without context-specific details. We achieve this through a self-supervised coarse angle reconstruction task that recovers joint rotation angles, invariant to both users and deployments; (2) adapting the representations to downstream tasks with varying modalities and focuses. To address this, we introduce a self-attention matching module that dynamically prioritizes relevant body parts in a data-driven manner. Given the lack of corresponding labels in existing skeleton data, we establish MASD, a new HAR dataset with IMU, WiFi, and skeleton, collected from 20 subjects performing 27 activities. This is the first broadly applicable HAR dataset with time-synchronized data across three modalities. Experiments show that SKELAR achieves state-of-the-art performance in both full-shot and few-shot settings. We also demonstrate that SKELAR can effectively leverage synthetic skeleton data to extend its use in scenarios without skeleton collections.

MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models

Chejian Xu,Jiawei Zhang,Zhaorun Chen,Chulin Xie,Mintong Kang,Yujin Potter,Zhun Wang,Zhuowen Yuan,Alexander Xiong,Zidi Xiong,Chenhui Zhang,Lingzhi Yuan,Yi Zeng,Peiyang Xu,Chengquan Guo,Andy Zhou,Jeffrey Ziwei Tan,Xuandong Zhao,Francesco Pinto,Zhen Xiang,Yu Gai,Zinan Lin,Dan Hendrycks,Bo Li,Dawn Song

Task: Present MMDT, a unified platform for comprehensively evaluating the safety and trustworthiness of multimodal foundation models.

Motivation: Existing benchmarks for multimodal models mainly assess helpfulness or focus only on limited perspectives such as fairness and privacy, and lack a comprehensive safety and trustworthiness evaluation.

Details

Method: Designs various evaluation scenarios and red teaming algorithms to generate challenging data, forming a high-quality benchmark. Result: Evaluates a range of multimodal models and reveals vulnerabilities and areas for improvement across multiple perspectives. Conclusion: MMDT is the first comprehensive and unique safety and trustworthiness evaluation platform for multimodal foundation models, paving the way for developing safer and more reliable multimodal foundation models and systems. Abstract: Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as text-to-image models generating unsafe content. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at https://mmdecodingtrust.github.io/.

Fire and Smoke Datasets in 20 Years: An In-depth Review

Sayed Pedram Haeri Boroujeni,Niloufar Mehrabi,Fatemeh Afghah,Connor Peter McGrath,Danish Bhatkar,Mithilesh Anil Biradar,Abolfazl Razi

Task: Systematically analyze and evaluate fire and smoke datasets collected over the past 20 years.

Motivation: Fire and smoke phenomena pose a significant threat to the natural environment, ecosystems, the global economy, and human lives, so more advanced technologies are needed for early detection, real-time monitoring, and minimizing the impact of fires on ecological balance and public safety.

Details

Method: Conducts an in-depth review of fire and smoke datasets collected over the past 20 years, analyzing each dataset's characteristics, including type, size, format, collection methods, and geographical diversity, and summarizing each dataset's strengths and weaknesses. Result: Extensive experimental analyses are conducted across different datasets using state-of-the-art algorithms such as ResNet-50, DeepLab-V3, and YoloV8. Conclusion: The study provides valuable insights and data support for research and technological progress in fire management. Abstract: Fire and smoke phenomena pose a significant threat to the natural environment, ecosystems, and global economy, as well as human lives and wildlife. In this context, there is a demand for more sophisticated and advanced technologies to implement an effective strategy for early detection, real-time monitoring, and minimizing the overall impacts of fires on ecological balance and public safety. Recently, the rapid advancement of Artificial Intelligence (AI) and Computer Vision (CV) frameworks has substantially accelerated the development of efficient fire management systems. However, these systems rely extensively on the availability of adequate and high-quality fire and smoke data to create proficient Machine Learning (ML) methods for various tasks, such as detection and monitoring. Although fire and smoke datasets play a critical role in training, evaluating, and testing advanced Deep Learning (DL) models, a comprehensive review of the existing datasets is still lacking. For this purpose, we provide an in-depth review to systematically analyze and evaluate fire and smoke datasets collected over the past 20 years. We investigate the characteristics of each dataset, including type, size, format, collection methods, and geographical diversity. We also review and highlight the unique features of each dataset, such as imaging modalities (RGB, thermal, infrared) and their applicability for different fire management tasks (classification, segmentation, detection). Furthermore, we summarize the strengths and weaknesses of each dataset and discuss their potential for advancing research and technology in fire management. Ultimately, we conduct extensive experimental analyses across different datasets using several state-of-the-art algorithms, such as ResNet-50, DeepLab-V3, and YoloV8.

The CLEF-2025 CheckThat! Lab: Subjectivity, Fact-Checking, Claim Normalization, and Retrieval

Firoj Alam,Julia Maria Struß,Tanmoy Chakraborty,Stefan Dietze,Salim Hafid,Katerina Korre,Arianna Muti,Preslav Nakov,Federico Ruggeri,Sebastian Schellhammer,Vinay Setty,Megha Sundriyal,Konstantin Todorov,Venktesh V

Task: Identify and counteract online disinformation and manipulation, covering subjectivity identification, claim normalization, fact-checking of numerical claims, and scientific web discourse processing.

Motivation: Advance the development of innovative technologies to counter online disinformation and manipulation across languages and platforms.

Details

Method: Addresses core and auxiliary tasks in the information verification pipeline through the CheckThat! lab's tasks: subjectivity identification, claim normalization, fact-checking numerical claims, and scientific web discourse processing. Result: The tasks pose challenging classification and retrieval problems at both the document and span levels, including multilingual settings. Conclusion: By continually expanding and updating its tasks, the CheckThat! lab effectively addresses the challenges of online disinformation and manipulation. Abstract: The CheckThat! lab aims to advance the development of innovative technologies designed to identify and counteract online disinformation and manipulation efforts across various languages and platforms. The first five editions focused on key tasks in the information verification pipeline, including check-worthiness, evidence retrieval and pairing, and verification. Since the 2023 edition, the lab has expanded its scope to address auxiliary tasks that support research and decision-making in verification. In the 2025 edition, the lab revisits core verification tasks while also considering auxiliary challenges. Task 1 focuses on the identification of subjectivity (a follow-up from CheckThat! 2024), Task 2 addresses claim normalization, Task 3 targets fact-checking numerical claims, and Task 4 explores scientific web discourse processing. These tasks present challenging classification and retrieval problems at both the document and span levels, including multilingual settings.

Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions

Kasra Borazjani,Payam Abdisarabshali,Naji Khosravan,Seyyedali Hosseinalipour

Task: Study how data heterogeneity affects performance in federated learning (FL) and propose a new embedding-based definition of data heterogeneity.

Motivation: Existing FL methods mainly rely on label distribution skew to simulate data heterogeneity, which fails to fully capture real-world data heterogeneity, especially in computer vision tasks.

Details

Method: Uses pre-trained deep neural networks to extract task-specific data embeddings, clusters the data points, and assigns them to clients using the Dirichlet distribution, thereby defining task-specific data heterogeneity. Result: Extensive experiments evaluate different FL methods under the new notion of data heterogeneity and introduce new benchmark performance measures. Conclusion: Existing FL approaches overestimate performance with respect to data heterogeneity; the paper proposes an embedding-based definition of data heterogeneity and reveals a series of future research directions. Abstract: Federated Learning (FL) represents a paradigm shift in distributed machine learning (ML), enabling clients to train models collaboratively while keeping their raw data private. This paradigm shift from traditional centralized ML introduces challenges due to the non-iid (non-independent and identically distributed) nature of data across clients, significantly impacting FL's performance. Existing literature predominantly models data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the real-world data heterogeneity among clients in computer vision tasks beyond classification. Subsequently, we demonstrate that current approaches overestimate FL's performance by relying on label/class distribution skew, exposing an overlooked gap in the literature. By utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. We further unveil a series of open research directions that can be pursued.
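
The partitioning step named in this abstract (cluster data points by embedding, then distribute each cluster among clients with a Dirichlet distribution) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `dirichlet_partition`, the concentration parameter `alpha`, and the assumption that cluster labels have already been computed (for example by k-means on the embeddings) are all hypothetical choices made here.

```python
import numpy as np


def dirichlet_partition(cluster_ids, num_clients, alpha, seed=0):
    """Assign sample indices to clients, one embedding cluster at a time.

    cluster_ids: 1-D integer array with one cluster label per sample
                 (assumed to come from clustering the embeddings).
    alpha:       Dirichlet concentration; smaller values give more skew.
    """
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(cluster_ids):
        idx = np.flatnonzero(cluster_ids == c)
        rng.shuffle(idx)
        # Share of this cluster that each client receives.
        props = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn the shares into split points over the shuffled indices.
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices


# Hypothetical usage: 100 samples in three clusters, four clients.
cluster_ids = np.array([0] * 50 + [1] * 30 + [2] * 20)
parts = dirichlet_partition(cluster_ids, num_clients=4, alpha=0.5, seed=1)
```

With a small `alpha`, each client tends to receive most of its data from only a few clusters, which is the kind of skew the embedding-based definition is meant to produce.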

MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer

Honglin Lin,Zhuoshi Pan,Yu Li,Qizhi Pei,Xin Gao,Mengzhang Cai,Conghui He,Lijun Wu

Task: Propose MetaLadder, a new framework that prompts large language models to recall and reflect on meta-problems and their solutions, improving accuracy on mathematical reasoning tasks.

Motivation: Humans often solve problems by recalling analogous cases and using their solutions to reason about the task at hand; the work aims to mimic this cognitive process to improve the reasoning ability of large language models.

Details

Method: Proposes the MetaLadder framework, which prompts the model to recall and reflect on meta-problems and their solutions, and introduces a problem-restating mechanism to enhance the model's comprehension of the target problem. Result: On mathematical benchmarks, MetaLadder significantly improves LLM problem-solving accuracy, with a 10.3% accuracy gain over standard CoT-based methods. Conclusion: By mimicking human analogical reasoning and learning, MetaLadder significantly improves the performance of large language models on mathematical reasoning tasks. Abstract: Large Language Models (LLMs) have demonstrated promising capabilities in solving mathematical reasoning tasks, leveraging Chain-of-Thought (CoT) data as a vital component in guiding answer generation. Current paradigms typically generate CoT and answers directly for a given problem, diverging from human problem-solving strategies to some extent. Humans often solve problems by recalling analogous cases and leveraging their solutions to reason about the current task. Inspired by this cognitive process, we propose MetaLadder, a novel framework that explicitly prompts LLMs to recall and reflect on meta-problems, those structurally or semantically analogous problems, alongside their CoT solutions before addressing the target problem. Additionally, we introduce a problem-restating mechanism to enhance the model's comprehension of the target problem by regenerating the original question, which further improves reasoning accuracy. Therefore, the model can achieve reasoning transfer from analogical problems, mimicking human-like "learning from examples" and generalization abilities. Extensive experiments on mathematical benchmarks demonstrate that our MetaLadder significantly boosts LLMs' problem-solving accuracy, largely outperforming standard CoT-based methods (10.3% accuracy gain) and other methods. Our code and data have been released at https://github.com/LHL3341/MetaLadder.

SuperPC: A Single Diffusion Model for Point Cloud Completion, Upsampling, Denoising, and Colorization

Yi Du,Zhipeng Zhao,Shaoshu Su,Sharath Golluri,Haoze Zheng,Runmao Yao,Chen Wang

Task: Propose SuperPC, a unified diffusion model that handles point cloud completion, upsampling, denoising, and colorization simultaneously.

Motivation: Existing methods usually handle each task independently, ignoring the mutual influence and correlation among these defects, which leads to error accumulation and increased computational cost.

Details

Method: Employs a three-level-conditioned diffusion framework with a novel spatial-mix-fusion strategy to leverage the correlations among the four defects for simultaneous, efficient processing. Result: SuperPC outperforms existing specialized models and their combination on all four tasks. Conclusion: With a single unified model, SuperPC effectively handles multiple point cloud processing tasks and shows superior performance on complex, co-occurring defects. Abstract: Point cloud (PC) processing tasks, such as completion, upsampling, denoising, and colorization, are crucial in applications like autonomous driving and 3D reconstruction. Despite substantial advancements, prior approaches often address each of these tasks independently, with separate models focused on individual issues. However, this isolated approach fails to account for the fact that defects like incompleteness, low resolution, noise, and lack of color frequently coexist, with each defect influencing and correlating with the others. Simply applying these models sequentially can lead to error accumulation from each model, along with increased computational costs. To address these challenges, we introduce SuperPC, the first unified diffusion model capable of concurrently handling all four tasks. Our approach employs a three-level-conditioned diffusion framework, enhanced by a novel spatial-mix-fusion strategy, to leverage the correlations among these four defects for simultaneous, efficient processing. We show that SuperPC outperforms the state-of-the-art specialized models as well as their combination on all four individual tasks.

Deep Contrastive Unlearning for Language Models

Estrid He,Tabinda Sarwar,Ibrahim Khalil,Xun Yi,Ke Wang

Task: Propose DeepCUT, a machine unlearning framework for fine-tuning language models that achieves unlearning by optimizing the model's latent space.

Motivation: With the success of large language models, their training data may include copyrighted content and user-generated knowledge, creating risks of privacy leakage and copyright infringement. A method is therefore needed to protect users' "right to be forgotten", i.e., to remove the information carried by particular training samples without degrading the model's predictive quality.

Details

Method: Proposes a machine unlearning framework named Deep Contrastive Unlearning for fine-Tuning (DeepCUT), which achieves machine unlearning by directly optimizing the latent space of the model. Result: Comprehensive experiments on real-world datasets show that DeepCUT outperforms baseline methods in both effectiveness and efficiency, with consistent and significant improvements. Conclusion: By optimizing the model's latent space, DeepCUT achieves machine unlearning effectively and addresses the geometric distribution of samples in the latent space, which existing methods do not consider. Abstract: The past few years have witnessed the great success of large language models, demonstrating powerful capabilities in comprehending textual data and generating human-like language. Large language models achieve success by being trained on vast amounts of textual data, including online sources with copyrighted content and user-generated knowledge. However, this comes at a cost: the potential risk of exposing users' privacy and violating copyright protections. Thus, to safeguard individuals' "right to be forgotten", there has been increasing interest in machine unlearning, the process of removing information carried by particular training samples from a model while not deteriorating its predictive quality. This is a challenging task due to the black-box nature of language models. Most existing studies focus on mitigating the impact of those forgotten samples upon a model's outputs, and do not explicitly consider the geometric distributions of samples in the latent space of a model. To address this issue, we propose a machine unlearning framework, named Deep Contrastive Unlearning for fine-Tuning (DeepCUT) language models. Our proposed model achieves machine unlearning by directly optimizing the latent space of a model. Comprehensive experiments on real-world datasets demonstrate the effectiveness and efficiency of DeepCUT with consistent and significant improvement over baseline methods.

Effortless Active Labeling for Long-Term Test-Time Adaptation

Guowei Wang,Changxing Ding

Task: Investigate how to achieve effortless active labeling in long-term test-time adaptation (TTA), selecting at most one sample per batch for annotation.

Motivation: Long-term test-time adaptation is challenging because of error accumulation. Existing methods address this by actively labeling a small proportion of samples in each batch, but the annotation burden grows quickly as the number of batches increases.

Details

Method: First, annotates the most valuable sample in each batch based on the single-step optimization perspective in the TTA context, and introduces an efficient strategy to identify these samples using feature perturbation. Second, observing that the gradient magnitudes produced by annotated and unannotated samples differ significantly, proposes two dynamic weights to balance their impact on model optimization. Result: Extensive experiments on the popular ImageNet-C, -R, -K, -A and PACS databases show that the approach consistently outperforms state-of-the-art methods at significantly lower annotation cost. Conclusion: The proposed method achieves effortless active labeling in long-term test-time adaptation, significantly reducing annotation cost while outperforming existing methods on multiple databases. Abstract: Long-term test-time adaptation (TTA) is a challenging task due to error accumulation. Recent approaches tackle this issue by actively labeling a small proportion of samples in each batch, yet the annotation burden quickly grows as the batch number increases. In this paper, we investigate how to achieve effortless active labeling so that a maximum of one sample is selected for annotation in each batch. First, we annotate the most valuable sample in each batch based on the single-step optimization perspective in the TTA context. In this scenario, the samples that border between the source- and target-domain data distributions are considered the most feasible for the model to learn in one iteration. Then, we introduce an efficient strategy to identify these samples using feature perturbation. Second, we discover that the gradient magnitudes produced by the annotated and unannotated samples have significant variations. Therefore, we propose balancing their impact on model optimization using two dynamic weights. Extensive experiments on the popular ImageNet-C, -R, -K, -A and PACS databases demonstrate that our approach consistently outperforms state-of-the-art methods with significantly lower annotation costs.

MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models

Jiazheng Li,Lu Yu,Qing Cui,Zhiqiang Zhang,Jun Zhou,Yanfang Ye,Chuxu Zhang

Task: Propose MASS, a skill-graph-based mathematical data selection framework for pretraining large language models (LLMs) in the mathematical reasoning domain.

Motivation: High-quality data plays a key role in pretraining and fine-tuning large language models and to some degree even determines their performance ceiling. However, most data selection methods overlook the specific nuances of domain-related data.

Details

Method: Constructs a skill graph that captures mathematical skills and their interrelations, assigns quality scores to the target dataset based on the skill graph, and selects the top-ranked subset for pretraining LLMs. Result: Experiments show that MASS is efficient and effective across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Models trained on subsets selected by MASS reach performance similar to models trained on the original datasets with significantly fewer training tokens (50% to 70% fewer). With the same number of tokens, models trained on data selected by MASS outperform those trained on the original datasets by 3.3% to 5.9%. Conclusion: MASS has the potential to improve both the efficiency and effectiveness of pretraining LLMs. Abstract: High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to identify subsets of data that can effectively and efficiently enhance model performance. However, most of these methods focus on general data selection and tend to overlook the specific nuances of domain-related data. In this paper, we introduce MASS, a MAthematical data Selection framework using the Skill graph for pretraining LLMs in the mathematical reasoning domain. By taking into account the unique characteristics of mathematics and reasoning, we construct a skill graph that captures the mathematical skills and their interrelations from a reference dataset. This skill graph guides us in assigning quality scores to the target dataset, enabling us to select the top-ranked subset which is further used to pretrain LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Specifically, in terms of efficiency, models trained on subsets selected by MASS can achieve similar performance to models trained on the original datasets, with a significant reduction in the number of trained tokens, ranging from 50% to 70% fewer tokens. In terms of effectiveness, when trained on the same amount of tokens, models trained on the data selected by MASS outperform those trained on the original datasets by 3.3% to 5.9%. These results underscore the potential of MASS to improve both the efficiency and effectiveness of pretraining LLMs.

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Sara Sarto,Marcella Cornia,Rita Cucchiara

Task: Evaluate machine-generated image captions.

Motivation: With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics.

Details

Method: Provides a comprehensive survey of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. Result: The analysis reveals some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment. Conclusion: The survey highlights the limitations of existing evaluation metrics and proposes directions for future research. Abstract: The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.

Covering Cracks in Content Moderation: Delexicalized Distant Supervision for Illicit Drug Jargon Detection

Minkyoo Song,Eugene Jang,Jaehan Kim,Seungwon Shin

Task: Detect illicit drug jargon on social media.

Motivation: With rising drug-related concerns and the spread of social media, sales and discussions of illicit drugs have become commonplace online. Existing drug jargon detection methods are easily evaded and cannot distinguish benign uses of a term.

Details

Method: Proposes the JEDIS framework, which detects illicit drug jargon by analyzing context; it combines distant supervision and delexicalization and can be trained without human-labeled data. Result: On two manually annotated datasets, JEDIS significantly outperforms existing word-based baselines in F1-score and detection coverage. Conclusion: JEDIS performs well at detecting illicit drug jargon and effectively addresses the shortcomings of existing approaches. Abstract: In light of rising drug-related concerns and the increasing role of social media, sales and discussions of illicit drugs have become commonplace online. Social media platforms hosting user-generated content must therefore perform content moderation, which is a difficult task due to the vast amount of jargon used in drug discussions. Previous works on drug jargon detection were limited to extracting a list of terms, but these approaches have fundamental problems in practical application. First, they are trivially evaded using word substitutions. Second, they cannot distinguish whether euphemistic terms such as "pot" or "crack" are being used as drugs or in their benign meanings. We argue that drug content moderation should be done using contexts rather than relying on a banlist. However, manually annotated datasets for training such a task are not only expensive but also prone to becoming obsolete. We present JEDIS, a framework for detecting illicit drug jargon terms by analyzing their contexts. JEDIS utilizes a novel approach that combines distant supervision and delexicalization, which allows JEDIS to be trained without human-labeled data while being robust to new terms and euphemisms. Experiments on two manually annotated datasets show JEDIS significantly outperforms state-of-the-art word-based baselines in terms of F1-score and detection coverage in drug jargon detection. We also conduct qualitative analysis that demonstrates JEDIS is robust against pitfalls faced by existing approaches.
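
As a rough illustration of the delexicalization idea mentioned in this abstract, the sketch below replaces a candidate term with a placeholder so that a downstream classifier would have to rely on the surrounding context rather than the word itself. It is an assumption-laden toy, not JEDIS: the function name `delexicalize`, the `[MASK]` placeholder, and the whole-word matching rule are hypothetical choices made here.

```python
import re


def delexicalize(sentence, term, mask="[MASK]"):
    """Replace whole-word occurrences of `term` with a placeholder token."""
    # \b keeps "pot" from matching inside "potato"; matching ignores case.
    pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
    return pattern.sub(mask, sentence)


# Hypothetical usage: the same term in a drug context and a benign one.
drug_context = delexicalize("He sold some pot behind the club", "pot")
benign_context = delexicalize("The pot is on the stove next to a potato", "pot")
```

After masking, the two sentences differ only in their context words, which is the signal a context-based detector has to learn from.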

Can Large Vision Language Models Read Maps Like a Human?

Shuo Xing,Zezhou Sun,Shuangyu Xie,Kaiyuan Chen,Yanjia Huang,Yuping Wang,Jiachen Li,Dezhen Song,Zhengzhong Tu

Task: Introduce MapBench, a dataset designed specifically for human-readable, pixel-based outdoor navigation.

Motivation: Address navigation in complex path-finding scenarios and evaluate the spatial reasoning and structured decision-making capabilities of LVLMs.

Details

Method: Builds the MapBench dataset, containing over 1600 pixel-space map path-finding problems, and provides the Map Space Scene Graph (MSSG) as an indexing data structure. Result: MapBench significantly challenges state-of-the-art LVLMs, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. Conclusion: MapBench is a challenging dataset that can effectively evaluate LVLMs on complex navigation tasks. Abstract: In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based, map-based outdoor navigation, curated from complex path finding scenarios. MapBench comprises over 1600 pixel space map path finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure to convert between natural language and evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all the code and dataset at https://github.com/taco-group/MapBench.

ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming

Dewei Wang,Wei Zhu,Liyang Ling,Ettore Tiotto,Quintin Wang,Whitney Tsang,Julian Opperman,Jacky Deng

Task: Propose ML-Triton, which provides a multi-level compilation flow and programming interface to better exploit the hierarchical structure of GPUs.

Motivation: The conventional Triton compiler lowers directly from the workgroup level to the per-thread level; this premature lowering may fail to fully exploit the hierarchical structure and SIMD units of modern GPUs.

Details

Method: Proposes ML-Triton, which lowers progressively from the workgroup level to the warp and intrinsic levels, implementing a multi-level lowering aligned with the GPU hierarchy, and extends the Triton language to support user-set compiler hints and warp-level programming. Result: Experiments show that ML-Triton achieves above 95% of the performance of expert-written kernels on Intel GPUs. Conclusion: Through its multi-level compilation flow and programming interface, ML-Triton lets researchers obtain good out-of-the-box performance without waiting for compiler updates. Abstract: In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tile-based approach. While traditional GPU programming often relies on low-level interfaces like CUDA or SYCL, Triton has emerged as a DSL that offers a more user-friendly and portable alternative by programming at a higher level. The current Triton starts at the workgroup (aka threadblock) level and lowers directly to the per-thread level. It then attempts to coalesce and amend through a series of passes, promoting information from the low-level representation. We believe this is premature lowering based on the following observations. 1. GPUs have a hierarchical structure both physically and logically. Modern GPUs often feature SIMD units capable of directly operating on tiles on a warp or warpgroup basis, such as blocked load and blocked MMA. 2. Multi-level gradual lowering can make the compiler decoupled and clean by separating considerations between and within logical layers. 3. Kernel developers often need fine control to get good performance on the latest hardware. FlashAttention2 advocates explicit data partition between warps to boost performance. In this context, we propose ML-Triton, which features a multi-level compilation flow and programming interface. Our approach begins at the workgroup level and progressively lowers to the warp and intrinsic level, implementing a multi-level lowering aligned with the hierarchical nature of GPUs. Additionally, we extend the Triton language to support user-set compiler hints and warp-level programming, enabling researchers to get good out-of-the-box performance without awaiting compiler updates. Experimental results demonstrate that our approach achieves performance above 95% of expert-written kernels on Intel GPUs, as measured by the geometric mean.
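
The abstract above reports its headline number as a geometric mean. A minimal sketch of how such a summary could be computed from per-kernel performance ratios is shown below; the function name `geometric_mean` and the ratios in the usage example are made up for illustration and are not results from the paper.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-kernel ratios (compiled kernel speed / expert kernel speed).
example_ratios = [0.97, 0.93, 1.01, 0.96]
summary = geometric_mean(example_ratios)
```

The geometric mean is the usual choice for averaging ratios because a kernel at half speed and a kernel at double speed cancel out, which an arithmetic mean would not reflect.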

Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer

Yi Liao,Yongsheng Gao,Weichuan Zhang

Task: Propose a new visual explanation method, the Dynamic Accumulated Attention Map (DAAM), to visualize the attention flow inside Vision Transformer (ViT) models.

Motivation: Existing visual explanation methods cannot display the attention flow inside the internal structure of ViT models, and so cannot explain how the final attention regions are formed during a ViT's decision-making.

Details

Method: Proposes a novel decomposition module that constructs and stores spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block, and obtains channel importance coefficients by decomposing the classification score. For self-supervised ViT models, proposes dimension-wise importance weights to compute the channel importance coefficients. Result: Quantitative and qualitative analyses consistently validate the effectiveness and superiority of DAAM for interpreting both ViT models and self-supervised ViT models. Conclusion: DAAM can visualize the dynamics of decision-making attention at any intermediate block inside a ViT model, contributing a novel decomposition module and dimension-wise importance weights. Abstract: Various Vision Transformer (ViT) models have been widely used for image recognition tasks. However, existing visual explanation methods cannot display the attention flow hidden inside the inner structure of ViT models, which explains how the final attention regions are formed inside a ViT for its decision-making. In this paper, a novel visual explanation approach, Dynamic Accumulated Attention Map (DAAM), is proposed to provide a tool that can visualize, for the first time, the attention flow from the top to the bottom through ViT networks. To this end, a novel decomposition module is proposed to construct and store the spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block. The module can also obtain the channel importance coefficients by decomposing the classification score for supervised ViT models. Because of the lack of classification score in self-supervised ViT models, we propose dimension-wise importance weights to compute the channel importance coefficients. Such spatial features are linearly combined with the corresponding channel importance coefficients, forming the attention map for each block. The dynamic attention flow is revealed by block-wise accumulation of the attention maps. The contribution of this work focuses on visualizing the evolution dynamics of the decision-making attention for any intermediate block inside a ViT model by proposing a novel decomposition module and dimension-wise importance weights. The quantitative and qualitative analyses consistently validate the effectiveness and superior capacity of the proposed DAAM for not only interpreting ViT models with the fully-connected layers as the classifier but also self-supervised ViT models. The code is available at https://github.com/ly9802/DynamicAccumulatedAttentionMap.
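
The two arithmetic steps stated in this abstract (a per-block attention map formed as a linear combination of spatial features weighted by channel importance coefficients, followed by block-wise accumulation) can be sketched as below. This is only an illustration under stated assumptions: how the spatial features and coefficients are actually extracted by the decomposition module is not reproduced, the array shapes are assumed, and the function name `accumulated_attention_maps` is hypothetical.

```python
import numpy as np


def accumulated_attention_maps(features, coeffs):
    """Combine features per block, then accumulate the maps across blocks.

    features: list of arrays of shape (C, H, W), one per ViT block (assumed).
    coeffs:   list of arrays of shape (C,), channel importance per block.
    Returns an array of shape (num_blocks, H, W); entry k is the sum of the
    attention maps of blocks 0..k.
    """
    # Weighted sum over the channel axis gives one (H, W) map per block.
    maps = [np.tensordot(c, f, axes=1) for f, c in zip(features, coeffs)]
    return np.cumsum(np.stack(maps), axis=0)


# Hypothetical usage with random stand-ins for three blocks.
rng = np.random.default_rng(0)
feats = [rng.random((8, 4, 4)) for _ in range(3)]
coefs = [rng.random(8) for _ in range(3)]
acc = accumulated_attention_maps(feats, coefs)
```

Viewing `acc[0]`, `acc[1]`, `acc[2]` in sequence is what would show the attention evolving from block to block.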

Inspecting the Representation Manifold of Differentially-Private Text

Stefan Arnold

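Task: Study differential privacy (DP) for text, in particular the effects of paraphrasing text with language models and temperature sampling.

Motivation: Examine the geometric distortion that DP paraphrasing introduces to structure and complexity in the representation space.

Details

Method: Estimates the intrinsic dimension of paraphrased text across varying privacy budgets and compares word-level and sentence-level methods. Result: Word-level methods severely raise the representation manifold, while sentence-level methods produce paraphrases that are topologically closer to human-written paraphrases. Among sentence-level methods, masked paraphrasing preserves structural complexity better than causal paraphrasing. Conclusion: Autoregressive generation propagates distortions from unnatural word choices, inflating the representation space, whereas masked paraphrasing better preserves structural complexity. Abstract: Differential Privacy (DP) for text has recently taken the form of text paraphrasing using language models and temperature sampling to better balance privacy and utility. However, the geometric distortion of DP regarding the structure and complexity in the representation space remains unexplored. By estimating the intrinsic dimension of paraphrased text across varying privacy budgets, we find that word-level methods severely raise the representation manifold, while sentence-level methods produce paraphrases whose manifolds are topologically more consistent with human-written paraphrases. Among sentence-level methods, masked paraphrasing, compared to causal paraphrasing, demonstrates superior preservation of structural complexity, suggesting that autoregressive generation propagates distortions from unnatural word choices that cascade and inflate the representation space.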

A Simple Combination of Diffusion Models for Better Quality Trade-Offs in Image Denoising

Jonas Dornbusch,Emanuel Pfarr,Florin-Alexandru Vasluianu,Frank Werner,Radu Timofte

Task: Propose a new Linear Combination Diffusion Denoiser (LCDD) that balances high visual quality and low distortion in image denoising.

Motivation: Existing diffusion models can generate high-quality images in image reconstruction tasks, but in denoising they fail to balance visual quality and distortion effectively.

Details

Method: Proposes the Linear Combination Diffusion Denoiser (LCDD), which combines two complementary inference procedures: one that leverages the model's generative potential and another that ensures accurate signal recovery. Result: LCDD achieves state-of-the-art denoising performance and offers a controllable trade-off through a simple scalar hyperparameter adjustment. Conclusion: LCDD performs well on image denoising, effectively balancing visual quality and distortion, with broad application potential. Abstract: Diffusion models have garnered considerable interest in computer vision, owing both to their capacity to synthesize photorealistic images and to their proven effectiveness in image reconstruction tasks. However, existing approaches fail to efficiently balance the high visual quality of diffusion models with the low distortion achieved by previous image reconstruction methods. Specifically, for the fundamental task of additive Gaussian noise removal, we first illustrate an intuitive method for leveraging pretrained diffusion models. Further, we introduce our proposed Linear Combination Diffusion Denoiser (LCDD), which unifies two complementary inference procedures: one that leverages the model's generative potential and another that ensures faithful signal recovery. By exploiting the inherent structure of the denoising samples, LCDD achieves state-of-the-art performance and offers controlled, well-behaved trade-offs through a simple scalar hyperparameter adjustment.
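
The abstract above describes a trade-off controlled by a single scalar. One simple way to picture this is a convex combination of two denoised estimates, sketched below. The abstract does not give the exact combination rule, so the convex form, the function name `combine_denoised`, and the parameter name `lam` are assumptions made purely for illustration.

```python
import numpy as np


def combine_denoised(x_generative, x_faithful, lam):
    """Blend two denoised images with one scalar weight in [0, 1].

    x_generative: estimate from a sampling procedure that favors visual quality.
    x_faithful:   estimate from a procedure that favors low distortion.
    lam:          0 returns x_faithful, 1 returns x_generative.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * x_generative + (1.0 - lam) * x_faithful


# Hypothetical usage with random stand-ins for the two estimates.
rng = np.random.default_rng(0)
x_gen = rng.random((3, 8, 8))
x_mse = rng.random((3, 8, 8))
blended = combine_denoised(x_gen, x_mse, lam=0.3)
```

Sweeping `lam` from 0 to 1 would trace out a curve between the low-distortion and high-perceptual-quality ends of the trade-off.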

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

Francesco Maria Molfese,Luca Moroni,Luca Gioffrè,Alessandro Scirè,Simone Conia,Roberto Navigli

Task: Evaluate how large language models (LLMs) are assessed on Multiple-Choice Question Answering (MCQA) tasks.

Motivation: Although open-ended question answering tasks are more challenging to evaluate, MCQA tasks are in principle easier to assess, because the model's answer is thought to be simple to extract and compare directly against predefined choices. However, recent studies have begun to question the reliability of MCQA evaluation, showing that multiple factors can significantly affect the reported performance of LLMs, especially when the model generates free-form text before selecting an answer.

Details

Method: Systematically analyzes whether existing answer extraction methods are aligned with human judgment, and how they are influenced by answer constraints in the prompt. Result: Traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. The study also reveals a fundamental trade-off between including format constraints in the prompt to simplify answer extraction and allowing models to generate free-form text to improve reasoning. Conclusion: Standardized evaluation methodologies are needed, along with more reliable and consistent MCQA evaluation practices. Abstract: One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to assess, as the model's answer is thought to be simple to extract and is directly compared to a set of predefined choices. However, recent studies have started to question the reliability of MCQA evaluation, showing that multiple factors can significantly impact the reported performance of LLMs, especially when the model generates free-form text before selecting one of the answer choices. In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons. We systematically analyze whether existing answer extraction methods are aligned with human judgment, and how they are influenced by answer constraints in the prompt across different domains. Our experiments demonstrate that traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, we reveal a fundamental trade-off between including format constraints in the prompt to simplify answer extraction and allowing models to generate free-form text to improve reasoning. Our findings call for standardized evaluation methodologies and highlight the need for more reliable and consistent MCQA evaluation practices.
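
To make the extraction problem discussed in this abstract concrete, the sketch below contrasts a strict extractor that accepts only a bare option letter with a more lenient one that also looks for an "answer is X" phrase in free-form text. Both functions, their names, and the regular expressions are hypothetical illustrations, not the extraction methods studied in the paper.

```python
import re


def extract_strict(response):
    """Accept only a bare option letter such as 'B', '(B)' or 'B.'."""
    m = re.fullmatch(r"\s*\(?([A-D])\)?[.)]?\s*", response)
    return m.group(1) if m else None


def extract_lenient(response):
    """Use the last 'answer is X' mention, else fall back to the strict rule."""
    # (?i:...) makes only the phrase case-insensitive; the option letter
    # must be an upper-case A-D followed by a word boundary.
    matches = re.findall(r"(?i:answer\s+is)\s*:?\s*\(?([A-D])\b", response)
    return matches[-1] if matches else extract_strict(response)


# Hypothetical model output that reasons before answering.
free_form = "Paris is the capital of France, so the answer is B."
strict_result = extract_strict(free_form)
lenient_result = extract_lenient(free_form)
```

Here the strict rule scores a correct free-form response as unanswered while the lenient rule recovers it, a small-scale version of how the choice of extractor can change reported accuracy.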

These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models

Parker Ewen,Hao Chen,Seth Isaacson,Joey Wilson,Katherine A. Skinner,Ram Vasudevan

Task: Propose a new method for uncertainty quantification of radiance fields that uses higher-order moments of the rendering equation.

Motivation: Uncertainty quantification is crucial for downstream tasks such as view planning and scene understanding, particularly where safety and robustness matter. However, the high dimensionality and complexity of radiance fields make uncertainty quantification difficult, which limits the use of these methods in high-speed decision-making.

Details

Method: Exploit the probabilistic nature of the rendering process to compute, efficiently and differentiably, higher-order moments of radiance field outputs, including color, depth, and semantic predictions. Result: In extensive experiments on synthetic and real-world scenes, the method achieves state-of-the-art performance while remaining simple. Conclusion: The method outperforms existing uncertainty quantification techniques and also proves useful in downstream applications such as next-best-view selection and active ray sampling for neural radiance field training. Abstract: This paper introduces a novel approach to uncertainty quantification for radiance fields by leveraging higher-order moments of the rendering equation. Uncertainty quantification is crucial for downstream tasks including view planning and scene understanding, where safety and robustness are paramount. However, the high dimensionality and complexity of radiance fields pose significant challenges for uncertainty quantification, limiting the use of these uncertainty quantification methods in high-speed decision-making. We demonstrate that the probabilistic nature of the rendering process enables efficient and differentiable computation of higher-order moments for radiance field outputs, including color, depth, and semantic predictions. Our method outperforms existing radiance field uncertainty estimation techniques while offering a more direct, computationally efficient, and differentiable formulation without the need for post-processing. Beyond uncertainty quantification, we also illustrate the utility of our approach in downstream applications such as next-best-view (NBV) selection and active ray sampling for neural radiance field training. Extensive experiments on synthetic and real-world scenes confirm the efficacy of our approach, which achieves state-of-the-art performance while maintaining simplicity.
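
To make the idea of moments of the rendering process concrete, here is a minimal sketch for a single ray. It assumes the standard discrete volume rendering weights and treats them as a probability distribution over the samples; the abstract does not give the paper's exact formulation, so the function name `depth_moments` and the renormalisation step are assumptions made for illustration.

```python
from typing import List, Tuple


def depth_moments(alphas: List[float], depths: List[float]) -> Tuple[float, float]:
    """Mean and variance of depth along one ray under the rendering weights.

    alphas[i] is the opacity of sample i and depths[i] its distance along the ray.
    The weight w_i = T_i * alpha_i, with transmittance T_i = prod_{j<i} (1 - alpha_j),
    is treated as the probability that the ray terminates at sample i.
    """
    transmittance = 1.0
    weights = []
    for alpha in alphas:
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all rendering weights are zero for this ray")
    probs = [w / total for w in weights]  # renormalise so the weights sum to 1
    first = sum(p * d for p, d in zip(probs, depths))       # first moment E[d]
    second = sum(p * d * d for p, d in zip(probs, depths))  # second moment E[d^2]
    return first, second - first * first                    # mean, variance


if __name__ == "__main__":
    mean, variance = depth_moments([0.5, 1.0], [1.0, 3.0])
    print(mean, variance)
```

Every step is a sum or product of the opacities, so the same computation stays differentiable when written in an automatic differentiation framework, and the variance can serve as a per-ray uncertainty estimate.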

LLM Alignment for the Arabs: A Homogenous Culture or Diverse Ones?

Amr Keleg

Task: Discuss the limitations of Arabic large language models (LLMs) in capturing the cultural diversity of the Arab world.

Motivation: Existing Arabic LLMs assume that the Arab world is culturally homogeneous and overlook its internal cultural diversity.

Details

Method: In a position paper, discuss the limitations of the cultural homogeneity assumption and offer preliminary thoughts on how to build systems that represent this diversity better. Result: The paper argues that the cultural homogeneity assumption is invalid and calls on the NLP community to consider cultural diversity when developing multilingual and Arabic-specific LLMs. Conclusion: The author hopes the paper will encourage the NLP community to consider the cultural diversity within communities that speak the same language. Abstract: Large language models (LLMs) have the potential of being useful tools that can automate tasks and assist humans. However, these models are more fluent in English and more aligned with Western cultures, norms, and values. Arabic-specific LLMs are being developed to better capture the nuances of the Arabic language, as well as the views of the Arabs. Yet, Arabs are sometimes assumed to share the same culture. In this position paper, I discuss the limitations of this assumption and provide preliminary thoughts for how to build systems that can better represent the cultural diversity within the Arab world. The invalidity of the cultural homogeneity assumption might seem obvious, yet, it is widely adopted in developing multilingual and Arabic-specific LLMs. I hope that this paper will encourage the NLP community to be considerate of the cultural diversity within various communities speaking the same language.

Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs

Liu Jing,Amirul Rahman

Task: Improve the performance of large vision-language models (LVLMs) on complex visual reasoning tasks.

Motivation: Existing LVLMs perform well on multimodal tasks but struggle with complex visual reasoning that requires multi-step inference.

Details

Method: Propose MF-SQ-LLaVA, which enables implicit self-questioning through end-to-end training, augments visual question answering datasets with reasoning chains, and trains the model with a multi-task loss. Result: Experiments on ScienceQA and VQAv2 show that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models, including the base LLaVA and the original SQ-LLaVA. Conclusion: Through implicit self-questioning and multi-task training, MF-SQ-LLaVA significantly improves LVLM performance on complex visual reasoning tasks. Abstract: Large Vision-Language Models (LVLMs) have shown remarkable progress in various multimodal tasks, yet they often struggle with complex visual reasoning that requires multi-step inference. To address this limitation, we propose MF-SQ-LLaVA, a novel approach that enhances LVLMs by enabling implicit self-questioning through end-to-end training. Our method involves augmenting visual question answering datasets with reasoning chains consisting of sub-question and answer pairs, and training the LVLM with a multi-task loss that encourages the generation and answering of these intermediate steps, as well as the prediction of the final answer. We conduct extensive experiments on the ScienceQA and VQAv2 datasets, demonstrating that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models, including the base LLaVA and the original SQ-LLaVA. Ablation studies further validate the contribution of each component of our approach, and human evaluation confirms the improved accuracy and coherence of the reasoning process enabled by our method.

SPADE: Systematic Prompt Framework for Automated Dialogue Expansion in Machine-Generated Text Detection

Haoyi Li,Angela Yifei Yuan,Soyeon Caren Han,Christopher Leckie

Task: Develop models for detecting machine-generated text (MGT) and propose five new data augmentation frameworks for generating synthetic user dialogues.

Motivation: Existing MGT detection models are held back by the lack of systematically generated, high-quality datasets; data collection costs need to fall and data quality needs to improve.

Details

Method: Propose five data augmentation frameworks based on structured prompting, generate 14 new dialogue datasets, and benchmark them on seven MGT detection models. Result: A mixed dataset produced by the proposed augmentation frameworks improves generalization performance. The work also simulates online dialogue detection and examines the relationship between chat history length and detection accuracy. Conclusion: The proposed data augmentation frameworks reduce data collection costs and improve MGT detection performance; the open-source datasets are available for download. Abstract: The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of systematically generated, high-quality datasets for training. To address this issue, we propose five novel data augmentation frameworks for synthetic user dialogue generation through a structured prompting approach, reducing the costs associated with traditional data collection methods. Our proposed method yields 14 new dialogue datasets, which we benchmark against seven MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by our proposed augmentation framework. Furthermore, considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. We also benchmark online detection performance with limited chat history on our frameworks. Our open-source datasets can be downloaded from https://github.com/AngieYYF/SPADE-customer-service-dialogue.

SplatVoxel: History-Aware Novel View Streaming without Temporal Training

Yiming Wang,Lucy Chai,Xuan Luo,Michael Niemeyer,Manuel Lagunas,Stephen Lombardi,Siyu Tang,Tiancheng Sun

Task: Generate high-quality, temporally consistent novel view sequences from sparse-view videos.

Motivation: Existing novel view synthesis methods struggle with temporal coherence and visual fidelity, which leads to flickering and inconsistency. To address this, the work introduces a history-aware mechanism that uses previous frames to reconstruct the scene and improve quality and stability.

Details

Method: Propose a hybrid splat-voxel feed-forward scene reconstruction approach that uses Gaussian Splatting to propagate information over time and a hierarchical voxel grid for temporal fusion. Gaussian primitives are warped efficiently over time using a motion graph that extends 2D tracking models to 3D motion, while a sparse voxel transformer integrates new temporal observations in an error-aware manner. Result: The approach achieves state-of-the-art performance in static and streaming scene reconstruction, reducing temporal and visual artifacts while running at interactive rates (15 fps with 350 ms delay) on a single H100 GPU. Conclusion: The method requires no training on multi-view video datasets, can be applied directly to sparse-view video streams, and operates in a history-aware manner at inference time, significantly improving the temporal consistency and visual quality of novel view synthesis. Abstract: We study the problem of novel view streaming from sparse-view videos, which aims to generate a continuous sequence of high-quality, temporally consistent novel views as new input frames arrive. However, existing novel view synthesis methods struggle with temporal coherence and visual fidelity, leading to flickering and inconsistency. To address these challenges, we introduce history-awareness, leveraging previous frames to reconstruct the scene and improve quality and stability. We propose a hybrid splat-voxel feed-forward scene reconstruction approach that combines Gaussian Splatting to propagate information over time, with a hierarchical voxel grid for temporal fusion. Gaussian primitives are efficiently warped over time using a motion graph that extends 2D tracking models to 3D motion, while a sparse voxel transformer integrates new temporal observations in an error-aware manner. Crucially, our method does not require training on multi-view video datasets, which are currently limited in size and diversity, and can be directly applied to sparse-view video streams in a history-aware manner at inference time. Our approach achieves state-of-the-art performance in both static and streaming scene reconstruction, effectively reducing temporal artifacts and visual artifacts while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU. Project Page: https://19reborn.github.io/SplatVoxel/

ELTEX: A Framework for Domain-Driven Synthetic Data Generation

Arina Razmyslovich,Kseniia Murasheva,Sofia Sedlova,Julien Capitaine,Eugene Dmitriev

Task: Propose ELTEX, a framework for generating high-quality synthetic training data in specialized domains.

Motivation: Address the limited performance of large language models in specialized domains such as cybersecurity, which stems from the scarcity of domain-specific training data.

Details

Method: Systematically integrate explicit domain indicator extraction with dynamic prompting to preserve critical domain knowledge during generation. Result: For blockchain-related cyberattack detection, the ELTEX-enhanced model is competitive with GPT-4 on standard classification metrics and uncertainty calibration while requiring significantly fewer computational resources. Conclusion: Domain-driven synthetic data generation can effectively bridge the performance gap between resource-efficient models and larger architectures in specialized domains. Abstract: We present ELTEX (Efficient LLM Token Extraction), a domain-driven framework for generating high-quality synthetic training data in specialized domains. While Large Language Models (LLMs) have shown impressive general capabilities, their performance in specialized domains like cybersecurity remains limited by the scarcity of domain-specific training data. ELTEX addresses this challenge by systematically integrating explicit domain indicator extraction with dynamic prompting to preserve critical domain knowledge throughout the generation process. We demonstrate ELTEX's effectiveness in the context of blockchain-related cyberattack detection, where we fine-tune Gemma-2B using various combinations of real and ELTEX-generated data. Our results show that the ELTEX-enhanced model achieves performance competitive with GPT-4 across both standard classification metrics and uncertainty calibration, while requiring significantly fewer computational resources. We release a curated synthetic dataset of social media texts for cyberattack detection in blockchain. Our work demonstrates that domain-driven synthetic data generation can effectively bridge the performance gap between resource-efficient models and larger architectures in specialized domains.

Construction Site Scaffolding Completeness Detection Based on Mask R-CNN and Hough Transform

Pei-Hsin Lin,Jacob J. Lin,Shang-Hsien Hsieh

Task: Propose a deep learning-based computer vision approach to detect scaffolding and its cross braces.

Motivation: Ensure the safety and integrity of scaffolding, prevent accidents, and reduce the time and cost of manual inspection.

Details

Method: Train a convolutional neural network (CNN) model on a scaffold image dataset with annotated labels. Result: The approach automatically detects the completeness of cross braces from images taken at construction sites, without manual inspection. Conclusion: This non-invasive and efficient solution for detecting scaffolding completeness can help improve safety at construction sites. Abstract: Construction site scaffolding is essential for many building projects, and ensuring its safety is crucial to prevent accidents. The safety inspector must check the scaffolding's completeness and integrity, where most violations occur. The inspection process includes ensuring all the components are in the right place since workers often compromise safety for convenience and disassemble parts such as cross braces. This paper proposes a deep learning-based approach to detect the scaffolding and its cross braces using computer vision. A scaffold image dataset with annotated labels is used to train a convolutional neural network (CNN) model. With the proposed approach, we can automatically detect the completeness of cross braces from images taken at construction sites, without the need for manual inspection, saving a significant amount of time and labor costs. This non-invasive and efficient solution for detecting scaffolding completeness can help improve safety in construction sites.

A Data-driven Investigation of Euphemistic Language: Comparing the usage of "slave" and "servant" in 19th century US newspapers

Jaihyun Park,Ryan Cordell

Task: Study the usage of "slave" and "servant" in 19th-century US newspapers.

Motivation: Explore how the two words were used differently in 19th-century US newspapers and the sociocultural meaning behind those differences.

Details

Method: Use FastText embeddings to account for OCR errors, exclude reprinted texts, use Word2vec embeddings to find words semantically close to "slave" and "servant", and compute log-odds ratios to identify over-represented words in Southern and Northern newspapers. Result: "Slave" is associated with socio-economic, legal, and administrative words, whereas "servant" is linked to religious words in Northern newspapers and to domestic and familial words in Southern newspapers. Slave discourse words from Southern newspapers are more prevalent in Northern newspapers, while servant discourse words from each region are more prevalent in their own region. Conclusion: The study contributes to understanding how 19th-century US newspapers created different discourses around enslaved African Americans. Abstract: This study investigates the usage of "slave" and "servant" in the 19th century US newspapers using computational methods. While both terms were used to refer to enslaved African Americans, they were used in distinct ways. In the Chronicling America corpus, we included possible OCR errors by using FastText embedding and excluded text reprints to consider text reprint culture in the 19th century. Word2vec embedding was used to find semantically close words to "slave" and "servant" and log-odds ratio was calculated to identify over-represented discourse words in the Southern and Northern newspapers. We found that "slave" is associated with socio-economic, legal, and administrative words, however, "servant" is linked to religious words in the Northern newspapers while Southern newspapers associated "servant" with domestic and familial words. We further found that slave discourse words in Southern newspapers are more prevalent in Northern newspapers while servant discourse words from each side are prevalent in their own region. This study contributes to the understanding of how newspapers created different discourses around enslaved African Americans in the 19th century US.
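
The log-odds ratio step can be sketched as follows. The abstract does not say which variant the authors used, so the additive smoothing constant and the function name `log_odds_ratio` are assumptions, and the token lists are toy data rather than material from the Chronicling America corpus.

```python
import math
from collections import Counter
from typing import Dict, List


def log_odds_ratio(corpus_a: List[str], corpus_b: List[str], smoothing: float = 0.5) -> Dict[str, float]:
    """Log-odds ratio of each word in corpus_a relative to corpus_b.

    Positive scores mean the word is over-represented in corpus_a,
    negative scores mean it is over-represented in corpus_b.
    The additive smoothing constant is an assumption made for this sketch.
    """
    counts_a, counts_b = Counter(corpus_a), Counter(corpus_b)
    total_a, total_b = len(corpus_a), len(corpus_b)
    scores = {}
    for word in set(counts_a) | set(counts_b):
        # Odds = (occurrences of the word) / (occurrences of all other words).
        odds_a = (counts_a[word] + smoothing) / (total_a - counts_a[word] + smoothing)
        odds_b = (counts_b[word] + smoothing) / (total_b - counts_b[word] + smoothing)
        scores[word] = math.log(odds_a) - math.log(odds_b)
    return scores


if __name__ == "__main__":
    # Toy token lists, invented for illustration.
    southern = ["servant", "family", "house", "servant"]
    northern = ["servant", "church", "faith", "church"]
    for word, score in sorted(log_odds_ratio(southern, northern).items(), key=lambda kv: kv[1]):
        print(f"{word:10s} {score:+.2f}")
```

Sorting the scores gives two ranked lists, one per region, which is the form in which over-represented discourse words are usually inspected.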

ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints

Vihaan Misra,Peter Schaldenbrand,Jean Oh

Task: Address text-guided image generation using a fixed set of rigid shapes.

Motivation: Generating images that match a semantic description using only a fixed set of rigid shapes is akin to solving tangram puzzles or arranging real-world objects.

Details

Method: Propose ShapeShift, which explicitly parameterizes each shape within a differentiable vector graphics pipeline and iteratively optimizes placement and orientation through score distillation sampling from pretrained diffusion models. A content-aware collision resolution mechanism applies minimal, semantically coherent adjustments when shapes overlap. Result: The method shows compelling results across diverse scenarios, with quantitative and qualitative advantages over alternative techniques. Conclusion: By combining diffusion-based semantic guidance with explicit geometric constraints, ShapeShift yields interpretable compositions whose spatial relationships clearly embody the text prompt. Abstract: While diffusion-based models excel at generating photorealistic images from text, a more nuanced challenge emerges when constrained to using only a fixed set of rigid shapes, akin to solving tangram puzzles or arranging real-world objects to match semantic descriptions. We formalize this problem as shape-based image generation, a new text-guided image-to-image translation task that requires rearranging the input set of rigid shapes into non-overlapping configurations and visually communicating the target concept. Unlike pixel-manipulation approaches, our method, ShapeShift, explicitly parameterizes each shape within a differentiable vector graphics pipeline, iteratively optimizing placement and orientation through score distillation sampling from pretrained diffusion models. To preserve arrangement clarity, we introduce a content-aware collision resolution mechanism that applies minimal semantically coherent adjustments when overlaps occur, ensuring smooth convergence toward physically valid configurations. By bridging diffusion-based semantic guidance with explicit geometric constraints, our approach yields interpretable compositions where spatial relationships clearly embody the textual prompt. Extensive experiments demonstrate compelling results across diverse scenarios, with quantitative and qualitative advantages over alternative techniques.

Exploring Model Editing for LLM-based Aspect-Based Sentiment Classification

Shichen Li,Zhongqing Wang,Zheyu Zhao,Yue Zhang,Peifeng Li

Task: Investigate model editing as an efficient way to adapt large language models (LLMs) to aspect-based sentiment classification.

Motivation: Model editing selectively updates a small subset of a neural model's parameters, which reduces the computational cost of adapting LLMs.

Details

Method: Use causal interventions to trace and determine which neuron hidden states are essential to the model's predictions, performing interventions and restorations on each component of the LLM. Result: A distinct set of mid-layer representations is essential for detecting the sentiment polarity of given aspect words; based on this finding, the authors develop a model editing approach that focuses on these critical parts of the LLM. Conclusion: In in-domain and out-of-domain experiments, the approach achieves results competitive with the currently strongest methods while using significantly fewer trainable parameters, demonstrating a more efficient and interpretable fine-tuning strategy. Abstract: Model editing aims at selectively updating a small subset of a neural model's parameters with an interpretable strategy to achieve desired modifications. It can significantly reduce computational costs to adapt to large language models (LLMs). Given its ability to precisely target critical components within LLMs, model editing shows great potential for efficient fine-tuning applications. In this work, we investigate model editing to serve as an efficient method for adapting LLMs to solve aspect-based sentiment classification. Through causal interventions, we trace and determine which neuron hidden states are essential for the prediction of the model. By performing interventions and restorations on each component of an LLM, we identify the importance of these components for aspect-based sentiment classification. Our findings reveal that a distinct set of mid-layer representations is essential for detecting the sentiment polarity of given aspect words. Leveraging these insights, we develop a model editing approach that focuses exclusively on these critical parts of the LLM, leading to a more efficient method for adapting LLMs. Our in-domain and out-of-domain experiments demonstrate that this approach achieves competitive results compared to the currently strongest methods with significantly fewer trainable parameters, highlighting a more efficient and interpretable fine-tuning strategy.

HandSplat: Embedding-Driven Gaussian Splatting for High-Fidelity Hand Rendering

Yilan Dong,Haohe Liu,Qing Wang,Jiahao Yang,Wenqing Wang,Gregory Slabaugh,Shanxin Yuan

Task: Propose HandSplat, a Gaussian Splatting-based hand rendering framework that improves rendering fidelity and stability.

Motivation: Existing 3D Gaussian Splatting methods for hand rendering rely on rigid skeletal motion with an oversimplified non-rigid motion model that fails to capture fine geometric and appearance details, leading to loss of geometric detail, temporal instability, and inefficient point distribution.

Details

Method: Extend the standard 3D Gaussian Splatting attributes with implicit geometry and appearance embeddings to improve non-rigid motion modeling, and propose a local gradient-aware densification strategy and pose-conditioned attribute regularization. Result: Experiments on InterHand2.6M show that HandSplat surpasses existing methods in fidelity and stability while achieving real-time performance. Conclusion: By improving non-rigid motion modeling and the densification strategy, HandSplat significantly improves the fidelity and stability of hand rendering. Abstract: Existing 3D Gaussian Splatting (3DGS) methods for hand rendering rely on rigid skeletal motion with an oversimplified non-rigid motion model, which fails to capture fine geometric and appearance details. Additionally, they perform densification based solely on per-point gradients and process poses independently, ignoring spatial and temporal correlations. These limitations lead to geometric detail loss, temporal instability, and inefficient point distribution. To address these issues, we propose HandSplat, a novel Gaussian Splatting-based framework that enhances both fidelity and stability for hand rendering. To improve fidelity, we extend standard 3DGS attributes with implicit geometry and appearance embeddings for finer non-rigid motion modeling while preserving the static hand characteristic modeled by original 3DGS attributes. Additionally, we introduce a local gradient-aware densification strategy that dynamically refines Gaussian density in high-variation regions. To improve stability, we incorporate pose-conditioned attribute regularization to encourage attribute consistency across similar poses, mitigating temporal artifacts. Extensive experiments on InterHand2.6M demonstrate that HandSplat surpasses existing methods in fidelity and stability while achieving real-time performance. We will release the code and pre-trained models upon acceptance.

Increasing the Robustness of the Fine-tuned Multilingual Machine-Generated Text Detectors

Dominik Macko,Robert Moro,Ivan Srba

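Task: Develop an automated method to accurately detect machine-generated content.

Motivation: The proliferation of LLMs has raised concerns about their misuse for creating and spreading harmful content. Humans can no longer distinguish high-quality machine-generated text from authentic human-written text, so automated means of detecting machine-generated content are needed.

Details

Method: Propose a robust fine-tuning process of LLMs for the detection task, which makes detectors more robust against obfuscation and more generalizable to out-of-distribution data. Result: The method makes detectors of machine-generated content more robust and more general. Conclusion: The proposed robust fine-tuning process enables more effective detection of machine-generated content, providing additional information about its credibility. Abstract: Since the proliferation of LLMs, there have been concerns about their misuse for harmful content creation and spreading. Recent studies justify such fears, providing evidence of LLM vulnerabilities and high potential of their misuse. Humans are no longer able to distinguish between high-quality machine-generated and authentic human-written texts. Therefore, it is crucial to develop automated means to accurately detect machine-generated content. It would enable the identification of such content in the online information space, thus providing additional information about its credibility. This work addresses the problem by proposing a robust fine-tuning process of LLMs for the detection task, making the detectors more robust against obfuscation and more generalizable to out-of-distribution data.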

RETHINED: A New Benchmark and Baseline for Real-Time High-Resolution Image Inpainting On Edge Devices

Marcelo Sanchez,Gil Triginer,Ignacio Sarasua,Lara Raad,Coloma Ballester

Task: Propose a baseline method for real-time high-resolution image inpainting on edge devices.

Motivation: Existing image inpainting methods perform well on low-resolution images but fail at high resolutions and require powerful hardware, which limits their deployment on edge devices.

Details

Method: Propose a simple yet effective method consisting of a lightweight convolutional neural network (CNN) and a resolution-agnostic patch replacement mechanism. Result: An extensive analysis on various mobile devices shows inpainting performance similar to existing state-of-the-art methods while being 100 times faster. Conclusion: The method performs real-time high-resolution image inpainting on edge devices, and the authors release the first free-form mask UHD inpainting dataset. Abstract: Existing image inpainting methods have shown impressive completion results for low-resolution images. However, most of these algorithms fail at high resolutions and require powerful hardware, limiting their deployment on edge devices. Motivated by this, we propose the first baseline for REal-Time High-resolution image INpainting on Edge Devices (RETHINED) that is able to inpaint at ultra-high-resolution and can run in real-time ($\leq$ 30ms) on a wide variety of mobile devices. Our simple yet effective method consists of a lightweight Convolutional Neural Network (CNN) to recover structure, followed by a resolution-agnostic patch replacement mechanism to provide detailed texture. Specifically, our pipeline leverages the structural capacity of CNNs and the high-level detail of patch-based methods, which is a key component for high-resolution image inpainting. To demonstrate the real application of our method, we conduct an extensive analysis on various mobile-friendly devices and demonstrate similar inpainting performance while being $100\times$ faster than existing state-of-the-art methods. Furthermore, we release DF8K-Inpainting, the first free-form mask UHD inpainting dataset.

Christina Zorenböhmer,Sebastian Schmidt,Bernd Resch

Task: Generate the first training dataset for Aspect-based Emotion Analysis (ABEA) and fine-tune a BERT-based model for the ABEA sub-tasks of Aspect Term Extraction (ATE) and Aspect Emotion Classification (AEC).

Motivation: Address the dataset bottleneck and the increased complexity of emotion classes that the ABEA field faces.

Details

Method: Create an ABEA training dataset of 2,621 English Tweets, annotated according to the hierarchical emotion theory of Shaver et al., and fine-tune the GRACE model on it. Result: The model reaches an F1-score of 70.1% for ATE and 46.9% for joint ATE and AEC. Conclusion: Model performance is limited mainly by the small training dataset and the increased task complexity, which cause overfitting and limited generalization. Abstract: While sentiment analysis has advanced from sentence to aspect-level, i.e., the identification of concrete terms related to a sentiment, the equivalent field of Aspect-based Emotion Analysis (ABEA) is faced with dataset bottlenecks and the increased complexity of emotion classes in contrast to binary sentiments. This paper addresses these gaps, by generating a first ABEA training dataset, consisting of 2,621 English Tweets, and fine-tuning a BERT-based model for the ABEA sub-tasks of Aspect Term Extraction (ATE) and Aspect Emotion Classification (AEC). The dataset annotation process was based on the hierarchical emotion theory by Shaver et al. [1] and made use of group annotation and majority voting strategies to facilitate label consistency. The resulting dataset contained aspect-level emotion labels for Anger, Sadness, Happiness, Fear, and a None class. Using the new ABEA training dataset, the state-of-the-art ABSA model GRACE by Luo et al. [2] was fine-tuned for ABEA. The results reflected a performance plateau at an F1-score of 70.1% for ATE and 46.9% for joint ATE and AEC extraction. The limiting factors for model performance were broadly identified as the small training dataset size coupled with the increased task complexity, causing model overfitting and limited abilities to generalize well on new data.

Validation of Human Pose Estimation and Human Mesh Recovery for Extracting Clinically Relevant Motion Data from Videos

Kai Armstrong,Alexander Rodrigues,Alexander P. Willmott,Lei Zhang,Xujiong Ye

Task: Compare marker-less motion capture with traditional motion capture techniques for use in clinical settings.

Motivation: Validate the feasibility of marker-less motion capture for kinematic analysis and show its advantages over traditional techniques.

Details

Method: Compare inertial measurement units (IMUs) and retroreflective marker-based optical motion capture (MoCap) with marker-less motion capture techniques such as human pose estimation and human mesh recovery. Result: Marker-less motion capture produces results in line with those of IMU and MoCap, with shorter set-up time and lower expertise requirements. Conclusion: Although the data quality of marker-less motion capture still has room for improvement, it is acceptable within the margin of error of clinical tests. Abstract: This work aims to discuss the current landscape of kinematic analysis tools, ranging from the state-of-the-art in sports biomechanics such as inertial measurement units (IMUs) and retroreflective marker-based optical motion capture (MoCap) to more novel approaches from the field of computing such as human pose estimation and human mesh recovery. Primarily, this comparative analysis aims to validate the use of marker-less MoCap techniques in a clinical setting by showing that these marker-less techniques are within a reasonable range for kinematics analysis compared to the more cumbersome and less portable state-of-the-art tools. Not only does marker-less motion capture using human pose estimation produce results in line with the results of both the IMU and MoCap kinematics, but it also benefits from a reduced set-up time and reduced practical knowledge and expertise to set up. Overall, while there is still room for improvement when it comes to the quality of the data produced, we believe that this compromise is within the margin of error for the low-speed actions that are used in small clinical tests.

Comparing Llama3 and DeepSeekR1 on Biomedical Text Classification Tasks

Yuting Guo,Abeed Sarker

Task: Compare two open-source large language models (Llama3-70B and DeepSeekR1-distill-Llama3-70B) on six biomedical text classification tasks.

Motivation: Evaluate how different large language models perform on biomedical text classification, particularly in zero-shot settings.

Details

Method: Run experiments on six biomedical text classification tasks, four involving social media data and two involving clinical notes from electronic health records. All experiments use zero-shot settings and measure precision, recall, and F1 scores with 95% confidence intervals. Result: DeepSeekR1-distill-Llama3-70B achieves better precision on most tasks, with mixed results on recall. Zero-shot LLMs reach high F1 scores on some tasks but perform poorly on others. Conclusion: Model selection should follow the specific requirements of the health-related text classification task, particularly regarding precision-recall trade-offs. When annotated data is available, supervised classification may be more reliable than zero-shot LLMs. Abstract: This study compares the performance of two open-source large language models (LLMs), Llama3-70B and DeepSeekR1-distill-Llama3-70B, on six biomedical text classification tasks. Four tasks involve data from social media, while two tasks focus on clinical notes from electronic health records, and all experiments were performed in zero-shot settings. Performance metrics, including precision, recall, and F1 scores, were measured for each task, along with their 95% confidence intervals. Results demonstrated that DeepSeekR1-distill-Llama3-70B generally performs better in terms of precision on most tasks, with mixed results on recall. While the zero-shot LLMs demonstrated high F1 scores for some tasks, they grossly underperformed on others, for data from both sources. The findings suggest that model selection should be guided by the specific requirements of the health-related text classification tasks, particularly when considering the precision-recall trade-offs, and that, in the presence of annotated data, supervised classification approaches may be more reliable than zero-shot LLMs.
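
The reported metrics can be reproduced in outline with the sketch below. The abstract does not say how the 95% confidence intervals were obtained, so the percentile bootstrap used here is an assumption, as are the function names and the toy labels.

```python
import random
from typing import List, Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def bootstrap_f1_interval(y_true: Sequence[int], y_pred: Sequence[int],
                          n_resamples: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap 95% confidence interval for F1 (an assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores: List[float] = []
    for _ in range(n_resamples):
        # Resample examples with replacement and recompute F1 on the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(precision_recall_f1([y_true[i] for i in idx],
                                          [y_pred[i] for i in idx])[2])
    scores.sort()
    lower = scores[int(0.025 * (n_resamples - 1))]
    upper = scores[int(0.975 * (n_resamples - 1))]
    return lower, upper


if __name__ == "__main__":
    # Toy labels invented for illustration.
    gold = [1, 1, 1, 0, 0, 0, 1, 0]
    pred = [1, 1, 0, 0, 1, 0, 1, 0]
    print(precision_recall_f1(gold, pred))
    print(bootstrap_f1_interval(gold, pred))
```

With small test sets the interval is wide, which is why comparing two models on point estimates of F1 alone can be misleading.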

Revisiting Image Fusion for Multi-Illuminant White-Balance Correction

David Serrano-Lozano,Aditya Arora,Luis Herranz,Konstantinos G. Derpanis,Michael S. Brown,Javier Vazquez-Corral

Task: Propose an efficient transformer-based model for white-balance correction in multi-illuminant scenes.

Motivation: Existing fusion-based methods are suboptimal in multi-illuminant scenes, and dedicated multi-illuminant image datasets are lacking.

Details

Method: Propose a transformer-based model that captures spatial dependencies across sRGB white-balance presets, and introduce a large-scale multi-illuminant dataset of over 16,000 sRGB images. Result: The new method achieves up to 100% improvement over existing techniques on the proposed multi-illuminant image fusion dataset. Conclusion: The proposed transformer-based model and the new multi-illuminant dataset significantly improve white-balance correction in multi-illuminant scenes. Abstract: White balance (WB) correction in scenes with multiple illuminants remains a persistent challenge in computer vision. Recent methods explored fusion-based approaches, where a neural network linearly blends multiple sRGB versions of an input image, each processed with predefined WB presets. However, we demonstrate that these methods are suboptimal for common multi-illuminant scenarios. Additionally, existing fusion-based methods rely on sRGB WB datasets lacking dedicated multi-illuminant images, limiting both training and evaluation. To address these challenges, we introduce two key contributions. First, we propose an efficient transformer-based model that effectively captures spatial dependencies across sRGB WB presets, substantially improving upon linear fusion techniques. Second, we introduce a large-scale multi-illuminant dataset comprising over 16,000 sRGB images rendered with five different WB settings, along with WB-corrected images. Our method achieves up to 100% improvement over existing techniques on our new multi-illuminant image fusion dataset.

Entity-aware Cross-lingual Claim Detection for Automated Fact-checking

Rrubaa Panchendrarajan,Arkaitz Zubiaga

Task: Identify claims that require verification, especially given the proliferation of misinformation on social media platforms.

Motivation: Despite significant progress in identifying claims that require verification, open challenges remain, such as handling the multilingual and multimodal data common in online discourse.

Details

Method: Introduce EX-Claim, an entity-aware cross-lingual claim detection model that uses entity information derived from named entity recognition and entity linking to improve language-level performance for languages both seen and unseen during training. Result: Extensive experiments on three datasets from different social media platforms show that the model significantly outperforms the baselines across 27 languages and achieves the highest rate of knowledge transfer, even with limited training data. Conclusion: EX-Claim handles claims written in any language and performs well in multilingual settings. Abstract: Identifying claims requiring verification is a critical task in automated fact-checking, especially given the proliferation of misinformation on social media platforms. Despite significant progress in the task, there remain open challenges such as dealing with multilingual and multimodal data prevalent in online discourse. Addressing the multilingual challenge, recent efforts have focused on fine-tuning pre-trained multilingual language models. While these models can handle multiple languages, their ability to effectively transfer cross-lingual knowledge for detecting claims spreading on social media remains under-explored. In this paper, we introduce EX-Claim, an entity-aware cross-lingual claim detection model that generalizes well to handle claims written in any language. The model leverages entity information derived from named entity recognition and entity linking techniques to improve the language-level performance of both seen and unseen languages during training. Extensive experiments conducted on three datasets from different social media platforms demonstrate that our proposed model significantly outperforms the baselines, across 27 languages, and achieves the highest rate of knowledge transfer, even with limited training data.

RAT: Boosting Misclassification Detection Ability without Extra Data

Ge Yan,Tsui-Wei Weng

Task: Detect inputs misclassified by image classification models.

Motivation: As deep neural networks are increasingly used in high-stakes areas such as autonomous driving and healthcare, detecting incorrect model predictions and intervening accordingly becomes crucial.

Details

Method: Use robust radius (input-space margin) as a confidence metric and design two efficient estimation algorithms, RR-BS and RR-Fast, for misclassification detection. In addition, design a training method called Radius Aware Training (RAT) to improve the model's ability to identify mistakes. Result: Experiments show up to a 29.3% reduction in AURC and a 21.62% reduction in FPR@95TPR compared with previous methods. Conclusion: The proposed method detects misclassified inputs effectively and significantly improves model reliability. Abstract: As deep neural networks (DNNs) become increasingly prevalent, particularly in high-stakes areas such as autonomous driving and healthcare, the ability to detect incorrect predictions of models and intervene accordingly becomes crucial for safety. In this work, we investigate the detection of misclassified inputs for image classification models through the lens of adversarial perturbation: we propose to use robust radius (a.k.a. input-space margin) as a confidence metric and design two efficient estimation algorithms, RR-BS and RR-Fast, for misclassification detection. Furthermore, we design a training method called Radius Aware Training (RAT) to boost models' ability to identify mistakes. Extensive experiments show our method could achieve up to 29.3% reduction in AURC and 21.62% reduction in FPR@95TPR, compared with previous methods.
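
The notion of a robust radius as a confidence score can be illustrated with a small sketch. This is not the paper's RR-BS or RR-Fast algorithm, whose details the abstract does not give; it is a generic binary search along one fixed direction, which yields an upper bound on the distance to the decision boundary. The function `flip_distance` and the toy linear classifier are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def flip_distance(predict: Callable[[Sequence[float]], int], x: Sequence[float],
                  direction: Sequence[float], max_scale: float = 10.0, steps: int = 40) -> float:
    """Binary-search the smallest step along `direction` that changes predict(x).

    Returns math.inf if the label does not change within max_scale.
    Assumes the label flips only once along the ray, which holds for the toy model below.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    label = predict(x)

    def moved(scale: float) -> List[float]:
        return [xi + scale * ui for xi, ui in zip(x, unit)]

    if predict(moved(max_scale)) == label:
        return math.inf
    low, high = 0.0, max_scale  # invariant: label unchanged at low, flipped at high
    for _ in range(steps):
        mid = (low + high) / 2
        if predict(moved(mid)) == label:
            low = mid
        else:
            high = mid
    return high


def toy_classifier(x: Sequence[float]) -> int:
    """Hypothetical linear classifier: class 1 if x0 + x1 > 1, otherwise class 0."""
    return 1 if x[0] + x[1] > 1.0 else 0


if __name__ == "__main__":
    far = flip_distance(toy_classifier, [0.0, 0.0], [1.0, 1.0])
    near = flip_distance(toy_classifier, [0.4, 0.4], [1.0, 1.0])
    print(far, near)
```

An input close to the decision boundary gets a small distance and would be flagged as a likely misclassification, while an input far from the boundary gets a large one.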

Model Hubs and Beyond: Analyzing Model Popularity, Performance, and Documentation

Pritam Kadasi,Sriman Reddy,Srivathsa Vamsi Chaturvedula,Rudranshu Sen,Agnish Saha,Soumavo Sikdar,Sayani Sarkar,Suhani Mittal,Rohit Jindal,Mayank Singh

Task: Study the relationship between model popularity and actual performance on Hugging Face, and how the comprehensiveness of model documentation correlates with popularity and performance.

Motivation: With the surge of machine learning models on Hugging Face, users choosing a model often rely on popularity (download counts, likes, or recency) rather than actual performance.

Details

Method: Evaluate 500 sentiment analysis models on Hugging Face, with large-scale human annotation (nearly 80,000 annotations) and extensive model training and evaluation. Result: Model popularity does not necessarily correlate with performance. About 80% of the models lack detailed information about the model, training, and evaluation processes, and about 88% of model authors overstate their models' performance in the model cards. Conclusion: The authors provide a checklist of guidelines to help users choose good models for downstream tasks. Abstract: With the massive surge in ML models on platforms like Hugging Face, users often lose track and struggle to choose the best model for their downstream tasks, frequently relying on model popularity indicated by download counts, likes, or recency. We investigate whether this popularity aligns with actual model performance and how the comprehensiveness of model documentation correlates with both popularity and performance. In our study, we evaluated a comprehensive set of 500 Sentiment Analysis models on Hugging Face. This evaluation involved massive annotation efforts, with human annotators completing nearly 80,000 annotations, alongside extensive model training and evaluation. Our findings reveal that model popularity does not necessarily correlate with performance. Additionally, we identify critical inconsistencies in model card reporting: approximately 80% of the models analyzed lack detailed information about the model, training, and evaluation processes. Furthermore, about 88% of model authors overstate their models' performance in the model cards. Based on our findings, we provide a checklist of guidelines for users to choose good models for downstream tasks.

SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting

Haiyang Ying,Matthias Zwicker

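Task: Reconstruct parametric 3D edges from calibrated multi-view images.

Motivation: Existing methods usually reconstruct a 3D edge point set from multi-view 2D edge images and then fit 3D edges to that point set. However, noise in the point set can cause gaps between fitted edges, and because edge fitting depends only on the reconstructed 3D point set, the recovered edges may not align with the input multi-view images.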

Details

Method: Proposes SketchSplat, a method that reconstructs accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. 3D edges are represented as sketches: parametric lines and curves defined by attributes such as control points, scales, and opacity. During edge reconstruction, Gaussian points are iteratively sampled from a set of sketches and rasterized onto 2D edge images. The gradient of the image error with respect to the input 2D edge images is then back-propagated to optimize the sketch attributes. Result: Experiments show state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset. Conclusion: The method bridges 2D edge images and 3D edges in a differentiable manner, ensuring that 3D edges align with the 2D images and yielding accurate and complete results. A series of adaptive topological operations, applied during sketch optimization, reduces the number of sketches required while maintaining high accuracy, producing a more compact reconstruction. Abstract: Edges are one of the most basic parametric primitives to describe structural information in 3D. In this paper, we study parametric 3D edge reconstruction from calibrated multi-view images. Previous methods usually reconstruct a 3D edge point set from multi-view 2D edge images, and then fit 3D edges to the point set. However, noise in the point set may cause gaps among fitted edges, and the recovered edges may not align with input multi-view images since the edge fitting depends only on the reconstructed 3D point set. To mitigate these problems, we propose SketchSplat, a method to reconstruct accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. We represent 3D edges as sketches, which are parametric lines and curves defined by attributes including control points, scales, and opacity. During edge reconstruction, we iteratively sample Gaussian points from a set of sketches and rasterize the Gaussians onto 2D edge images. Then the gradient of the image error with respect to the input 2D edge images can be back-propagated to optimize the sketch attributes. Our method bridges 2D edge images and 3D edges in a differentiable manner, which ensures that 3D edges align well with 2D images and leads to accurate and complete results. We also propose a series of adaptive topological operations and apply them along with the sketch optimization. The topological operations help reduce the number of sketches required while ensuring high accuracy, yielding a more compact reconstruction. Finally, we contribute an accurate 2D edge detector that improves the performance of both ours and existing methods. Experiments show that our method achieves state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset.

Exploring Large Language Models for Word Games: Who is the Spy?

Chentian Wei,Jiewei Chen,Jinzhu Xu

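Task: Explore how large language models (LLMs) can participate effectively in word games, and propose a training-free framework.

Motivation: Word games hold significant research value for natural language processing (NLP), game theory, and related fields because of their rule-based and situational nature.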

Details

Method: Proposes a Chain-of-Thought (CoT)-based scheduling framework, using the classic word game "Who is the Spy" as an example, that enables LLMs to perform well in tasks such as inferring role words and disguising their identities. Result: Experimental results demonstrate the framework's effectiveness, with notable improvements in LLM performance across multiple datasets. Conclusion: The work shows the potential of LLMs to master situational reasoning and social interaction within structured game environments. Abstract: Word games hold significant research value for natural language processing (NLP), game theory, and related fields due to their rule-based and situational nature. This study explores how large language models (LLMs) can be effectively involved in word games and proposes a training-free framework. "Shei Shi Wo Di", or "Who is the Spy" in English, is a classic word game. Using this game as an example, we introduce a Chain-of-Thought (CoT)-based scheduling framework to enable LLMs to achieve excellent performance in tasks such as inferring role words and disguising their identities. We evaluate the framework's performance based on game success rates and the accuracy of the LLM agents' analytical results. Experimental results affirm the framework's effectiveness, demonstrating notable improvements in LLM performance across multiple datasets. This work highlights the potential of LLMs in mastering situational reasoning and social interactions within structured game environments. Our code is publicly available at https://github.com/ct-wei/Who-is-The-Spy.

Prototype Perturbation for Relaxing Alignment Constraints in Backward-Compatible Learning

Zikun Zhou,Yushuai Sun,Wenjie Pei,Xin Li,Yaowei Wang

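Task: Propose a new method that avoids re-computing gallery embeddings when a retrieval model is updated.

Motivation: The traditional way to update a retrieval model requires re-computing the embeddings of the gallery data, a time-consuming and computationally intensive process. Backward-Compatible Learning (BCL) was proposed to avoid this, but existing methods compromise the new model's discriminative ability when enhancing backward compatibility.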

Details

Method: Relaxes the alignment constraints by perturbing the old feature prototypes, so that the new feature space is aligned with a pseudo-old feature space defined by the perturbed prototypes; this preserves the new model's discriminative ability in backward-compatible learning. Two approaches for calculating the perturbations are proposed: Neighbor-Driven Prototype Perturbation (NDPP) and Optimization-Driven Prototype Perturbation (ODPP). Result: Experiments on multiple datasets show that the proposed approaches outperform existing backward-compatible learning algorithms. Conclusion: Relaxing the constraints through perturbation achieves backward-compatible learning while preserving the new model's discriminative ability, and the proposed approaches perform well in experiments. Abstract: The traditional paradigm to update retrieval models requires re-computing the embeddings of the gallery data, a time-consuming and computationally intensive process known as backfilling. To circumvent backfilling, Backward-Compatible Learning (BCL) has been widely explored, which aims to train a new model compatible with the old one. Many previous works focus on effectively aligning the embeddings of the new model with those of the old one to enhance the backward-compatibility. Nevertheless, such strong alignment constraints would compromise the discriminative ability of the new model, particularly when different classes are closely clustered and hard to distinguish in the old feature space. To address this issue, we propose to relax the constraints by introducing perturbations to the old feature prototypes. This allows us to align the new feature space with a pseudo-old feature space defined by these perturbed prototypes, thereby preserving the discriminative ability of the new model in backward-compatible learning. We have developed two approaches for calculating the perturbations: Neighbor-Driven Prototype Perturbation (NDPP) and Optimization-Driven Prototype Perturbation (ODPP). Particularly, they take into account the feature distributions of not only the old but also the new models to obtain proper perturbations along with new model updating. Extensive experiments on the landmark and commodity datasets demonstrate that our approaches perform favorably against state-of-the-art BCL algorithms.

BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?

Pierre Chambon,Baptiste Roziere,Benoit Sagot,Gabriel Synnaeve

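Task: Evaluate the ability of generative language models to understand and generate code with specified time and space complexities.

Motivation: Current evaluations often overlook the ability of models to comprehend and produce code constrained by computational complexity; BigO(Bench) aims to fill this gap.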

Details

Method: BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, together with a set of 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred (synthetic) time and space complexity labels. Result: Evaluates multiple state-of-the-art language models and identifies their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding. Conclusion: BigO(Bench) provides a new benchmark for evaluating language models on code generation and complexity understanding, and reveals the shortcomings of existing models in complexity understanding. Abstract: We introduce BigO(Bench), a novel coding benchmark designed to evaluate the capabilities of generative language models in understanding and generating code with specified time and space complexities. This benchmark addresses the gap in current evaluations that often overlook the ability of models to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as corresponding runtime and memory footprint values for a large set of input sizes. We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.
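
To illustrate what inferring complexity from measurements can look like, here is a minimal sketch that fits the slope of log(cost) against log(input size); for a polynomial cost the slope approximates the exponent. This is not the BigO(Bench) complexity framework. The helper names are hypothetical, and the example uses deterministic operation counts in place of wall-clock profiling so that the result is reproducible.

```python
import math


def loglog_slope(sizes, costs):
    """Least-squares slope of log(cost) vs. log(size).

    For cost ~ c * n**k the slope is approximately k. Logarithmic factors
    and non-polynomial growth are not separated by this simple fit.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in costs]
    count = len(xs)
    mean_x, mean_y = sum(xs) / count, sum(ys) / count
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


def count_pair_ops(n):
    """Stand-in for a profiled function: counts the steps of a double loop."""
    ops = 0
    for _ in range(n):
        for _ in range(n):
            ops += 1
    return ops
```

Fitting `loglog_slope` on the counts from `count_pair_ops` over doubling input sizes gives a slope close to 2, which would be read as quadratic time.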

Decompositional Neural Scene Reconstruction with Generative Diffusion Prior

Junfeng Ni,Yu Liu,Ruijie Lu,Zirui Zhou,Song-Chun Zhu,Yixin Chen,Siyuan Huang

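Task: Achieve decompositional reconstruction of 3D scenes, with complete shapes and detailed textures, from sparse-view input.

Motivation: With sparse-view input, existing methods recover underconstrained and occluded regions poorly.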

Details

Method: Proposes DP-Recon, which uses a diffusion prior in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each object under novel views, and introduces a visibility-guided approach that dynamically adjusts the per-pixel SDS loss weights. Result: Experiments on Replica and ScanNet++ show that the method significantly outperforms existing methods; notably, its object reconstruction from 10 views is better than that of the baselines from 100 views. Conclusion: DP-Recon enables seamless text-based editing of geometry and appearance through SDS optimization and produces decomposed object meshes with detailed UV maps that support photorealistic visual effects (VFX) editing. Abstract: Decompositional reconstruction of 3D scenes, with complete shapes and detailed texture of all objects within, is intriguing for downstream applications but remains challenging, particularly with sparse views as input. Recent approaches incorporate semantic or geometric regularization to address this issue, but they suffer significant degradation in underconstrained areas and fail to recover occluded regions. We argue that the key to solving this problem lies in supplementing missing information for these areas. To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. This provides additional information for the underconstrained areas, but directly incorporating a diffusion prior raises potential conflicts between the reconstruction and generative guidance. Therefore, we further introduce a visibility-guided approach to dynamically adjust the per-pixel SDS loss weights. Together these components enhance both geometry and appearance recovery while remaining faithful to input images. Extensive experiments across Replica and ScanNet++ demonstrate that our method significantly outperforms SOTA methods. Notably, it achieves better object reconstruction under 10 views than the baselines under 100 views. Our method enables seamless text-based editing for geometry and appearance through SDS optimization and produces decomposed object meshes with detailed UV maps that support photorealistic visual effects (VFX) editing. The project page is available at https://dp-recon.github.io/.

MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration

David Wan,Justin Chih-Yao Chen,Elias Stengel-Eskin,Mohit Bansal

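Task: Extend multi-agent multi-model reasoning to generation tasks, specifically to improving faithfulness by refining generated outputs.

Motivation: Multi-agent collaboration has shown promise in reasoning tasks but remains underexplored in long-form generation tasks such as summarization and question answering.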

Details

Method: Investigates iterative collaboration among multiple instances and types of large language models (LLMs) on the subtasks of the refinement process, such as error detection, critiquing unfaithful sentences, and making corrections based on critiques. Result: Both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing. Reframing critiquing and refinement as reranking rather than generation tasks improves multi-agent performance. Conclusion: Proposes a final "recipe" called Multi-Agent Multi-Model Refinement (MAMM-Refine), in which multi-agent and multi-model collaboration significantly improves performance on three summarization datasets and on long-form question answering, demonstrating the effectiveness and generalizability of the approach. Abstract: Multi-agent collaboration among models has shown promise in reasoning tasks but is underexplored in long-form generation tasks like summarization and question-answering. We extend multi-agent multi-model reasoning to generation, specifically to improving faithfulness through refinement, i.e., revising model-generated outputs to remove factual inconsistencies. We investigate how iterative collaboration among multiple instances and types of large language models (LLMs) enhances subtasks in the refinement process, such as error detection, critiquing unfaithful sentences, and making corrections based on critiques. We design intrinsic evaluations for each subtask, with our findings indicating that both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing. Additionally, reframing critiquing and refinement as reranking rather than generation tasks improves multi-agent performance. We consolidate these insights into a final "recipe" called Multi-Agent Multi-Model Refinement (MAMM-Refine), where multi-agent and multi-model collaboration significantly boosts performance on three summarization datasets as well as on long-form question answering, demonstrating the effectiveness and generalizability of our recipe.

H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection

Yuhang Liu,Wenjie Zhao,Yunhui Guo

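Task: Propose a new continual OOD detection method, Hierarchical Two-sample Tests (H2ST), for open-world Task Incremental Learning (TIL) scenarios.

Motivation: Existing TIL methods operate under a closed-world assumption, presuming that incoming data is always in-distribution (ID). In an open-world setting, however, incoming samples may come from out-of-distribution (OOD) sources whose task identities are unknown. Current OOD detection methods face several challenges when detecting OOD samples continually.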

Details

Method: Proposes a new method called Hierarchical Two-sample Tests (H2ST), which eliminates the need for threshold selection through hypothesis testing and uses feature maps to better exploit model capabilities without excessive dependence on model performance. Result: Extensive experiments and analysis validate the effectiveness of H2ST in open-world TIL scenarios and show that it outperforms existing methods. Conclusion: H2ST performs well in open-world TIL scenarios, with lower overhead and superior performance, and enables task-level detection. Abstract: Task Incremental Learning (TIL) is a specialized form of Continual Learning (CL) in which a model incrementally learns from non-stationary data streams. Existing TIL methodologies operate under the closed-world assumption, presuming that incoming data remains in-distribution (ID). However, in an open-world setting, incoming samples may originate from out-of-distribution (OOD) sources, with their task identities inherently unknown. Continually detecting OOD samples presents several challenges for current OOD detection methods: reliance on model outputs leads to excessive dependence on model performance, selecting suitable thresholds is difficult, hindering real-world deployment, and binary ID/OOD classification fails to provide task-level identification. To address these issues, we propose a novel continual OOD detection method called the Hierarchical Two-sample Tests (H2ST). H2ST eliminates the need for threshold selection through hypothesis testing and utilizes feature maps to better exploit model capabilities without excessive dependence on model performance. The proposed hierarchical architecture enables task-level detection with superior performance and lower overhead compared to non-hierarchical classifier two-sample tests. Extensive experiments and analysis validate the effectiveness of H2ST in open-world TIL scenarios and its superiority to the existing methods. Code is available at https://github.com/YuhangLiuu/H2ST.
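
For readers unfamiliar with two-sample testing, the sketch below shows a generic permutation test on a scalar statistic: it asks whether two samples plausibly come from the same distribution and returns a p-value. This is not H2ST, which the abstract describes as a hierarchy of classifier two-sample tests on feature maps. The function name and the choice of the mean-difference statistic are assumptions made for illustration.

```python
import random


def permutation_test(sample_a, sample_b, num_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of sample means.

    Returns the fraction of random relabelings whose mean difference is at
    least as large as the observed one (with the usual +1 correction).
    """
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    split = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:split]) - mean(pooled[split:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (num_permutations + 1)
```

A small p-value for a batch of incoming feature statistics against a reference in-distribution sample would suggest that the batch is out-of-distribution for that task.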

TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification

Junnan Zhu,Min Xiao,Yining Wang,Feifei Zhai,Yu Zhou,Chengqing Zong

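Task: Design and evaluate the Text pROVEnance (TROVE) challenge, which traces each sentence of a target text back to its source sentences and annotates fine-grained relationships.

Motivation: With the widespread adoption of large language models (LLMs), especially in high-stakes domains such as healthcare, law, and news, it is crucial to understand where content comes from and how it is created.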

Details

Method: Builds the TROVE dataset from three public datasets covering 11 diverse scenarios, using a three-stage annotation process (sentence retrieval, GPT provenance, and human provenance), and evaluates 11 LLMs under direct prompting and retrieval-augmented paradigms. Result: Retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, while open-source models show significant promise with retrieval augmentation. Conclusion: The TROVE challenge provides an effective way to trace text provenance, particularly in multi-document and long-document settings, where retrieval augmentation significantly improves model performance. Abstract: LLMs have achieved remarkable fluency and coherence in text generation, yet their widespread adoption has raised concerns about content reliability and accountability. In high-stakes domains such as healthcare, law, and news, it is crucial to understand where and how the content is created. To address this, we introduce the Text pROVEnance (TROVE) challenge, designed to trace each sentence of a target text back to specific source sentences within potentially lengthy or multi-document inputs. Beyond identifying sources, TROVE annotates the fine-grained relationships (quotation, compression, inference, and others), providing a deep understanding of how each target sentence is formed. To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios (e.g., QA and summarization) in English and Chinese, spanning source texts of varying lengths (0-5k, 5-10k, 10k+), emphasizing the multi-document and long-document settings essential for provenance. To ensure high-quality data, we employ a three-stage annotation process: sentence retrieval, GPT provenance, and human provenance. We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms, revealing that retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, yet open-source models show significant promise, particularly with retrieval augmentation.

SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments

Yinqi Chen,Meiying Zhang,Qi Hao,Guang Zhou

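Task: Propose a multi-task SemanticFlow framework that simultaneously predicts scene flow and instance segmentation for full-resolution point clouds.

Motivation: Traditional methods usually treat object motion estimation and instance segmentation in dynamic traffic scenes as separate tasks, leading to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios.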

Details

Method: Develops a coarse-to-fine prediction-based multi-task scheme in which a shared feature processing module provides contextual information for refining motion and semantic information; a set of loss functions that enhance scene flow estimation and instance segmentation; and a self-supervised learning scheme that uses coarse segmentation to detect rigid objects and compute their transformation matrices to generate self-supervised labels. Result: The framework is validated on the Argoverse and Waymo datasets, showing superior performance in instance segmentation accuracy, scene flow estimation, and computational efficiency. Conclusion: The framework establishes a new benchmark for self-supervised methods in dynamic scene understanding. Abstract: Accurate perception of dynamic traffic scenes is crucial for high-level autonomous driving systems, requiring robust object motion estimation and instance segmentation. However, traditional methods often treat them as separate tasks, leading to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios due to the absence of information sharing. This paper proposes a multi-task SemanticFlow framework to simultaneously predict scene flow and instance segmentation of full-resolution point clouds. The novelty of this work is threefold: 1) developing a coarse-to-fine prediction-based multi-task scheme, where an initial coarse segmentation of static backgrounds and dynamic objects is used to provide contextual information for refining motion and semantic information through a shared feature processing module; 2) developing a set of loss functions to enhance the performance of scene flow estimation and instance segmentation, which also help ensure spatial and temporal consistency of both static and dynamic objects within traffic scenes; 3) developing a self-supervised learning scheme, which utilizes coarse segmentation to detect rigid objects and compute their transformation matrices between sequential frames, enabling the generation of self-supervised labels. The proposed framework is validated on the Argoverse and Waymo datasets, demonstrating superior performance in instance segmentation accuracy, scene flow estimation, and computational efficiency, establishing a new benchmark for self-supervised methods in dynamic scene understanding.

Inside-Out: Hidden Factual Knowledge in LLMs

Zorik Gekhman,Eyal Ben David,Hadas Orgad,Eran Ofek,Yonatan Belinkov,Idan Szpector,Jonathan Herzig,Roi Reichart

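Task: Assess whether large language models (LLMs) encode more factual knowledge in their parameters than they express in their outputs.

Motivation: Existing studies hint at this possibility, but none has clearly defined or demonstrated the phenomenon.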

Details

Method: Proposes a formal definition of knowledge, quantified for a given question as the fraction of correct-incorrect answer pairs in which the correct answer is ranked higher. Depending on the information used for scoring, this yields external knowledge and internal knowledge. Hidden knowledge arises when internal knowledge exceeds external knowledge. Result: LLMs encode more factual knowledge internally than they express externally, with an average gap of 40%. Some knowledge is hidden so deeply that a model can know an answer perfectly internally yet fail to generate it even once across large-scale repeated sampling of 1,000 answers. Conclusion: LLM generation capabilities have fundamental limitations, which constrain the practical value of scaling test-time compute via repeated answer sampling in closed-book QA: some answers are practically never sampled, yet if they were, they would be guaranteed to rank first. Abstract: This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model's observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) puts a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.
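
The knowledge definition in the abstract can be written down directly: for one question, score every candidate answer, then count the fraction of (correct, incorrect) pairs in which the correct answer scores higher. The sketch below is a minimal reading of that sentence. The function name is hypothetical, and treating ties as "not ranked higher" is an assumption the abstract does not settle.

```python
def knowledge_score(correct_scores, incorrect_scores):
    """Fraction of (correct, incorrect) answer pairs where the correct one scores higher.

    Scores may come from any scorer: token-level probabilities give an
    external score, a probe on intermediate computations an internal one.
    Ties count as failures (an assumption).
    """
    pairs = [(c, w) for c in correct_scores for w in incorrect_scores]
    if not pairs:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1 for c, w in pairs if c > w)
    return wins / len(pairs)
```

For example, `knowledge_score([0.9, 0.2], [0.5, 0.1])` evaluates to 0.75, because three of the four pairs rank the correct answer higher. Computing the score once with an external scorer and once with an internal scorer, then subtracting, gives the gap that the abstract calls hidden knowledge.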

Unlocking the Capabilities of Vision-Language Models for Generalizable and Explainable Deepfake Detection

Peipeng Yu,Jianwei Fei,Hui Gao,Xuan Feng,Zhihua Xia,Chip Hong Chang

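Task: Propose a new paradigm that uses vision-language models (VLMs) for deepfake detection.

Motivation: Existing VLMs excel at understanding multimodal data, but their potential for deepfake detection remains underexplored, mainly because their knowledge is misaligned with forensic patterns.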

Details

Method: Unlocks the potential of VLMs through three components: (1) a knowledge-guided forgery adaptation module that aligns the VLM's semantic space with forensic features through contrastive learning; (2) a multi-modal prompt tuning framework that jointly optimizes visual-textual embeddings for localization and explainability; (3) an iterative refinement strategy that supports multi-turn dialogue for evidence-based reasoning. Result: Extensive experiments on multiple benchmarks (FF++, CDF2, DFD, DFDCP, and DFDC) show that the scheme surpasses existing methods in generalization performance while supporting multi-turn dialogue. Conclusion: The study proposes a new paradigm that unlocks the potential of VLMs for deepfake detection and achieves strong performance on multiple benchmarks. Abstract: Current vision-language models (VLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment between their knowledge and forensic patterns. To this end, we present a novel paradigm that unlocks VLMs' potential capabilities through three components: (1) A knowledge-guided forgery adaptation module that aligns the VLM's semantic space with forensic features through contrastive learning with external manipulation knowledge; (2) A multi-modal prompt tuning framework that jointly optimizes visual-textual embeddings for both localization and explainability; (3) An iterative refinement strategy enabling multi-turn dialog for evidence-based reasoning. Our framework includes a VLM-based Knowledge-guided Forgery Detector (KFD), a VLM image encoder, and a Large Language Model (LLM). The VLM image encoder extracts visual prompt embeddings from images, while the LLM receives visual and question prompt embeddings for inference. The KFD is used to calculate correlations between image features and pristine/deepfake class embeddings, enabling forgery classification and localization. The outputs from these components are used to construct forgery prompt embeddings. Finally, we feed these prompt embeddings into the LLM to generate textual detection responses to assist judgment. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, and DFDC, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.

SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models

I-Fan Lin,Faegheh Hasibi,Suzan Verberne

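Task: Propose SPILL, a domain-adaptive intent clustering method that requires no fine-tuning.

Motivation: Existing embedding-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them generalize poorly across datasets.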

Details

Method: Proposes a two-stage approach. In the first stage, an embedding is derived for each utterance (called the seed), and a distance metric is used to select a pool of candidates close to the seed. In the second stage, an LLM selects from these candidates the utterances that share the seed's intent, and the selected candidates are pooled with the seed to produce a refined embedding for the seed. Result: The method generally outperforms using the embedder directly and is comparable to other state-of-the-art studies, even those that use much larger models and require fine-tuning. Conclusion: The method lets existing embedders be improved further without additional fine-tuning, making them more adaptable to new domain datasets. Viewing the clustering task as a small-scale selection problem makes it possible to use LLMs to customize clustering tasks according to the user's goals. Abstract: In this paper, we propose Selection and Pooling with Large Language Models (SPILL), an intuitive and domain-adaptive method for intent clustering without fine-tuning. Existing embedding-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them less generalizable to multiple datasets. Our goal is to make these existing embedders more generalizable to new domain datasets without further fine-tuning. Inspired by our theoretical derivation and simulation results on the effectiveness of sampling and pooling techniques, we view the clustering task as a small-scale selection problem. A good solution to this problem is associated with better clustering performance. Accordingly, we propose a two-stage approach: First, for each utterance (referred to as the seed), we derive its embedding using an existing embedder. Then, we apply a distance metric to select a pool of candidates close to the seed. Because the embedder is not optimized for new datasets, in the second stage, we use an LLM to further select utterances from these candidates that share the same intent as the seed. Finally, we pool these selected candidates with the seed to derive a refined embedding for the seed. We found that our method generally outperforms directly using an embedder, and it achieves comparable results to other state-of-the-art studies, even those that use much larger models and require fine-tuning, showing its strength and efficiency. Our results indicate that our method enables existing embedders to be further improved without additional fine-tuning, making them more adaptable to new domain datasets. Additionally, viewing the clustering task as a small-scale selection problem gives the potential of using LLMs to customize clustering tasks according to the user's goals.
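
The two-stage procedure described in the abstract can be sketched as follows. This is an illustrative reading, not the authors' implementation: cosine distance and mean pooling are assumptions (the abstract only says "a distance metric" and "pool"), and `same_intent` is a hypothetical callback standing in for the LLM selection step.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def refine_embedding(seed_index, embeddings, utterances, same_intent, pool_size=3):
    """Return a refined embedding for the seed utterance.

    Stage 1: pick the `pool_size` nearest utterances by embedding distance.
    Stage 2: keep only candidates that `same_intent` (the LLM stand-in)
    accepts, then mean-pool them together with the seed.
    """
    seed = embeddings[seed_index]
    others = [i for i in range(len(embeddings)) if i != seed_index]
    others.sort(key=lambda i: cosine_distance(seed, embeddings[i]))
    candidates = others[:pool_size]
    selected = [
        i for i in candidates
        if same_intent(utterances[seed_index], utterances[i])
    ]
    members = [seed_index] + selected
    return [
        sum(embeddings[i][d] for i in members) / len(members)
        for d in range(len(seed))
    ]
```

The refined embeddings would then be passed to any standard clustering algorithm in place of the raw ones. When the selector rejects every candidate, the seed embedding is returned unchanged.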

Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task, Dataset and Benchmark

Ying Liu,Yijing Hua,Haojiang Chai,Yanbo Wang,TengQi Ye

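Task: Propose the 3F-OVD task, which extends supervised fine-grained object detection to the open-vocabulary setting.

Motivation: Existing open-vocabulary detectors are evaluated unfairly and unreliably, mainly because of variations in the vision-aware language vocabulary data.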

Details

Method: Introduces the 3F-OVD task, creates a new dataset, NEU-171K, and proposes a simple yet effective post-processing technique. Result: State-of-the-art object detectors are benchmarked on the NEU-171K dataset. Conclusion: The 3F-OVD task is challenging: accurately detecting fine-grained objects requires a deep understanding of fine-grained captions and careful attention to fine-grained details in images. Abstract: Open-vocabulary detectors are proposed to locate and recognize objects in novel classes. However, variations in vision-aware language vocabulary data used for open-vocabulary learning can lead to unfair and unreliable evaluations. Recent evaluation methods have attempted to address this issue by incorporating object properties or adding locations and characteristics to the captions. Nevertheless, since these properties and locations depend on the specific details of the images instead of classes, detectors cannot make accurate predictions without precise descriptions provided through human annotation. This paper introduces 3F-OVD, a novel task that extends supervised fine-grained object detection to the open-vocabulary setting. Our task is intuitive and challenging, requiring a deep understanding of Fine-grained captions and careful attention to Fine-grained details in images in order to accurately detect Fine-grained objects. Additionally, due to the scarcity of qualified fine-grained object detection datasets, we have created a new dataset, NEU-171K, tailored for both supervised and open-vocabulary settings. We benchmark state-of-the-art object detectors on our dataset for both settings. Furthermore, we propose a simple yet effective post-processing technique.

Optimizing Decomposition for Optimal Claim Verification

Yining Lu,Noah Ziems,Hy Dang,Meng Jiang

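Task: Optimize the decomposition and verification strategy used to evaluate the factuality of long-form text.

Motivation: Existing research usually treats decomposition and verification as independent processes, overlooking their interactions and potential misalignment, which leads to suboptimal verification results.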

Details

Method: Formulates a bilevel optimization problem and approximates the optimal decomposition policy with dynamic decomposition, a reinforcement learning framework. Result: Dynamic decomposition outperforms existing decomposition policies in both verification confidence and accuracy, improving confidence by 0.07 and accuracy by 0.12 on average. Conclusion: Dynamic decomposition effectively improves the evaluation of long-form text factuality. Abstract: Current research on the Decompose-Then-Verify paradigm for evaluating the factuality of long-form text typically treats decomposition and verification in isolation, overlooking their interactions and potential misalignment. We find that existing decomposition policies, typically hand-crafted demonstrations, do not align well with downstream verifiers in terms of atomicity -- a novel metric quantifying information density -- leading to suboptimal verification results. We formulate finding the optimal decomposition policy for optimal verification as a bilevel optimization problem. To approximate a solution for this strongly NP-hard problem, we propose dynamic decomposition, a reinforcement learning framework that leverages verifier feedback to learn a policy for dynamically decomposing claims to verifier-preferred atomicity. Experimental results show that dynamic decomposition outperforms existing decomposition policies, improving verification confidence by 0.07 and accuracy by 0.12 (on a 0-1 scale) on average across varying verifiers, datasets, and atomicities of input claims.

Temporal-Consistent Video Restoration with Pre-trained Diffusion Models

Hengkang Wang,Yang Liu,Huidong Liu,Chien-Chih Wang,Yanhui Guo,Hongdong Li,Bryan Wang,Ju Sun

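Task: Recover high-quality videos from degraded ones.

Motivation: Existing zero-shot video restoration methods that use pre-trained diffusion models suffer from approximation errors during reverse diffusion and from insufficient temporal consistency, and processing 3D video data is computationally intensive.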

Details

Method: Proposes a new Maximum a Posteriori (MAP) framework that parameterizes video frames directly in the seed space of the diffusion model, eliminating approximation errors, and promotes bilevel temporal consistency by exploiting clustering structures in the seed space and by progressive warping with optical flow refinements. Result: Extensive experiments on multiple VR tasks show that the method outperforms the state of the art in visual quality and temporal consistency. Conclusion: By directly parameterizing video frames and promoting temporal consistency, the method significantly improves video restoration quality. Abstract: Video restoration (VR) aims to recover high-quality videos from degraded ones. Although recent zero-shot VR methods using pre-trained diffusion models (DMs) show good promise, they suffer from approximation errors during reverse diffusion and insufficient temporal consistency. Moreover, dealing with 3D video data, VR is inherently computationally intensive. In this paper, we advocate viewing the reverse process in DMs as a function and present a novel Maximum a Posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors. We also introduce strategies to promote bilevel temporal consistency: semantic consistency by leveraging clustering structures in the seed space, and pixel-level consistency by progressive warping with optical flow refinements. Extensive experiments on multiple VR tasks demonstrate superior visual quality and temporal consistency achieved by our method compared to the state-of-the-art.

SemEval-2025 Task 1: AdMIRe -- Advancing Multimodal Idiomaticity Representation

Thomas Pickard,Aline Villavicencio,Maggie Mi,Wei He,Dylan Phelps,Carolina Scarton,Marco Idiart

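Task: Assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages.

Motivation: Idiomatic expressions pose a unique challenge in natural language processing because their meanings often cannot be inferred directly from their constituent words. Despite advances in large language models (LLMs), idiomaticity remains a significant obstacle to semantic representation.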

Details

Method: Presents the datasets and tasks for SemEval-2025 Task 1: AdMiRe (Advancing Multimodal Idiomaticity Representation), with two subtasks: ranking images by their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. Result: The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, using multiple queries to smooth over weaknesses in these models' representations of idiomaticity. Conclusion: Tasks set in multimodal and multilingual contexts can effectively assess and improve models' understanding of idiomatic expressions. Abstract: Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in Large Language Models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and tasks for SemEval-2025 Task 1: AdMiRe (Advancing Multimodal Idiomaticity Representation), which challenges the community to assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages. Participants competed in two subtasks: ranking images based on their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models' representations of idiomaticity.

DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition

Caoshuo Li,Tanzhe Li,Xiaobin Hu,Donghao Luo,Taisong Jin

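Task: Propose a new vision architecture, Dilated Vision HyperGraph Neural Network (DVHGNN), to address the computational complexity and pairwise-relation limitations of Vision Graph Neural Network (ViG).

Motivation: Vision Graph Neural Network (ViG) has made notable progress in computer vision, but it suffers from the quadratic computational complexity of its K-Nearest Neighbor (KNN) graph construction and from the pairwise-relation limitation of ordinary graphs.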

Details

Method: Proposes a new vision architecture, DVHGNN, which uses multi-scale hypergraphs to efficiently capture high-order correlations among objects. Specifically, it uses Clustering and Dilated HyperGraph Construction (DHGC) to adaptively capture multi-scale dependencies among data samples, and a dynamic hypergraph convolution mechanism to facilitate adaptive feature exchange and fusion at the hypergraph level. Result: Extensive qualitative and quantitative evaluations on benchmark image datasets show that DVHGNN significantly outperforms existing vision backbones. For example, DVHGNN-S achieves 83.1% top-1 accuracy on ImageNet-1K, surpassing ViG-S by 1.0% and ViHGNN-S by 0.6%. Conclusion: Through multi-scale hypergraphs and dynamic hypergraph convolution, DVHGNN addresses the computational complexity and pairwise-relation limitations of ViG and significantly improves image classification performance. Abstract: Recently, Vision Graph Neural Network (ViG) has gained considerable attention in computer vision. Despite its groundbreaking innovation, Vision Graph Neural Network encounters key issues including the quadratic computational complexity caused by its K-Nearest Neighbor (KNN) graph construction and the limitation of pairwise relations of normal graphs. To address the aforementioned challenges, we propose a novel vision architecture, termed Dilated Vision HyperGraph Neural Network (DVHGNN), which is designed to leverage multi-scale hypergraph to efficiently capture high-order correlations among objects. Specifically, the proposed method tailors Clustering and Dilated HyperGraph Construction (DHGC) to adaptively capture multi-scale dependencies among the data samples. Furthermore, a dynamic hypergraph convolution mechanism is proposed to facilitate adaptive feature exchange and fusion at the hypergraph level. Extensive qualitative and quantitative evaluations on benchmark image datasets demonstrate that the proposed DVHGNN significantly outperforms the state-of-the-art vision backbones. For instance, our DVHGNN-S achieves an impressive top-1 accuracy of 83.1% on ImageNet-1K, surpassing ViG-S by +1.0% and ViHGNN-S by +0.6%.

Real-world validation of a multimodal LLM-powered pipeline for High-Accuracy Clinical Trial Patient Matching leveraging EHR data

Anatole Callies,Quentin Bodinier,Philippe Ravaud,Kourosh Davarpanah

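Task: Automate the matching of patients to clinical trials.

Motivation: Patient recruitment in clinical trials is hindered by complex eligibility criteria and labor-intensive chart reviews, and existing text-only models have struggled to solve the problem reliably and at scale.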

Details

Method: Introduces a broadly applicable, integration-free, LLM-powered pipeline that automates patient-trial matching using unprocessed EHR documents. The approach combines the new reasoning-LLM paradigm, the visual capabilities of the latest LLMs, and multimodal embeddings. Result: On the n2c2 dataset, the method reached a criterion-level accuracy of 93%; in real-world trials, accuracy was 87%, and users reviewed each patient in under 9 minutes on average, an 80% improvement over traditional manual chart review. Conclusion: The pipeline performs robustly in clinical trial patient matching without custom integration with site systems or trial-specific tailoring, which enables scalable deployment. Abstract: Background: Patient recruitment in clinical trials is hindered by complex eligibility criteria and labor-intensive chart reviews. Prior research using text-only models has struggled to address this problem in a reliable and scalable way due to (1) limited reasoning capabilities, (2) information loss from converting visual records to text, and (3) lack of a generic EHR integration to extract patient data. Methods: We introduce a broadly applicable, integration-free, LLM-powered pipeline that automates patient-trial matching using unprocessed documents extracted from EHRs. Our approach leverages (1) the new reasoning-LLM paradigm, enabling the assessment of even the most complex criteria, (2) visual capabilities of latest LLMs to interpret medical records without lossy image-to-text conversions, and (3) multimodal embeddings for efficient medical record search. The pipeline was validated on the n2c2 2018 cohort selection dataset (288 diabetic patients) and a real-world dataset composed of 485 patients from 30 different sites matched against 36 diverse trials. Results: On the n2c2 dataset, our method achieved a new state-of-the-art criterion-level accuracy of 93%. In real-world trials, the pipeline yielded an accuracy of 87%, undermined by the difficulty of replicating human decision-making when medical records lack sufficient information. Nevertheless, users were able to review overall eligibility in under 9 minutes per patient on average, representing an 80% improvement over traditional manual chart reviews.
Conclusion: This pipeline demonstrates robust performance in clinical trial patient matching without requiring custom integration with site systems or trial-specific tailoring, thereby enabling scalable deployment across sites seeking to leverage AI for patient matching.

Efficient Personalization of Quantized Diffusion Model without Backpropagation

Hoigi Seo,Wongi Jeong,Kyungryeol Lee,Se Young Chun

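Task: Achieve memory-efficient fine-tuning of diffusion models for personalization through quantization and zeroth-order optimization.

Motivation: Diffusion models perform remarkably well in image synthesis, but training, fine-tuning, and inference demand extensive compute and memory. Memory-efficient fine-tuning matters most for personalization applications that run on edge devices.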

Details

Method: Quantizes a diffusion model personalized via Textual Inversion and optimizes the personalization tokens with zeroth-order optimization, avoiding the gradient and activation storage that backpropagation requires. Proposes Subspace Gradient to denoise the estimated gradient and, after studying the influence of text embedding on image generation, proposes Partial Uniform Timestep Sampling. Result: When personalizing Stable Diffusion, the method achieves image and text alignment scores comparable to prior methods using only forward passes, while reducing training memory demand by up to 8.2×. Conclusion: The method substantially reduces memory demand while maintaining performance, making it suitable for personalization on edge devices. Abstract: Diffusion models have shown remarkable performance in image synthesis, but they demand extensive computational and memory resources for training, fine-tuning and inference. Although advanced quantization techniques have successfully minimized memory usage for inference, training and fine-tuning these quantized models still require large memory possibly due to dequantization for accurate computation of gradients and/or backpropagation for gradient-based algorithms. However, memory-efficient fine-tuning is particularly desirable for applications such as personalization that often must be run on edge devices like mobile phones with private data. In this work, we address this challenge by quantizing a diffusion model with personalization via Textual Inversion and by leveraging a zeroth-order optimization on personalization tokens without dequantization so that it does not require gradient and activation storage for backpropagation that consumes considerable memory. Since a gradient estimation using zeroth-order optimization is quite noisy for a single or a few images in personalization, we propose to denoise the estimated gradient by projecting it onto a subspace that is constructed with the past history of the tokens, dubbed Subspace Gradient. In addition, we investigated the influence of text embedding in image generation, leading to our proposed time steps sampling, dubbed Partial Uniform Timestep Sampling for sampling with effective diffusion timesteps. Our method achieves comparable performance to prior methods in image and text alignment scores for personalizing Stable Diffusion with only forward passes while reducing training memory demand by up to 8.2×.
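
As a rough illustration of how a gradient can be estimated from forward passes alone and then denoised by projection, the sketch below uses a standard two-point zeroth-order estimator followed by projection onto an orthonormal basis. The names `zo_gradient` and `project_onto_subspace`, the Gaussian perturbation, and the assumption that the basis is already orthonormal are illustrative choices, not details taken from the paper; in particular, how the paper builds its subspace from the token history is not reproduced here.

```python
import random
from typing import Callable, List, Sequence


def zo_gradient(
    loss: Callable[[List[float]], float],
    x: Sequence[float],
    rng: random.Random,
    eps: float = 1e-3,
) -> List[float]:
    """Two-point zeroth-order gradient estimate using two forward evaluations."""
    u = [rng.gauss(0.0, 1.0) for _ in x]  # random probe direction
    plus = loss([xi + eps * ui for xi, ui in zip(x, u)])
    minus = loss([xi - eps * ui for xi, ui in zip(x, u)])
    scale = (plus - minus) / (2.0 * eps)  # directional derivative along u
    return [scale * ui for ui in u]


def project_onto_subspace(
    g: Sequence[float], basis: Sequence[Sequence[float]]
) -> List[float]:
    """Project g onto the span of `basis` (assumed orthonormal)."""
    out = [0.0] * len(g)
    for b in basis:
        coeff = sum(gi * bi for gi, bi in zip(g, b))
        out = [oi + coeff * bi for oi, bi in zip(out, b)]
    return out
```

Only `loss` evaluations are needed, so no activations have to be stored for a backward pass; the price is a noisy estimate, which is why restricting it to a low-dimensional subspace can help.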

VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning

Yang Tan,Chen Liu,Jingyuan Gao,Banghao Wu,Mingchen Li,Ruilin Wang,Lingrong Zhang,Huiqun Yu,Guisheng Fan,Liang Hong,Bingxin Zhou

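Task: Develop an engine, VenusFactory, that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of protein language models (PLMs).

Motivation: Interdisciplinary adoption of pre-trained protein language models (PLMs) remains limited because of challenges in data collection, task benchmarking, and application.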

Details

Method: The VenusFactory engine integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs, and supports both command-line execution and a Gradio-based no-code interface. Result: VenusFactory integrates 40+ protein-related datasets and 40+ popular PLMs, and all implementations are open-sourced. Conclusion: VenusFactory gives the computer science and biology communities a versatile tool that promotes interdisciplinary use of PLMs. Abstract: Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory supports both computer science and biology communities with choices of both a command-line execution and a Gradio-based no-code interface, integrating 40+ protein-related datasets and 40+ popular PLMs. All implementations are open-sourced on https://github.com/tyang816/VenusFactory.

DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework

Henrique Morimitsu,Xiaobin Zhu,Roberto M. Cesar Jr.,Xiangyang Ji,Xu-Cheng Yin

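Task: Propose DPFlow, an optical flow estimation method that adapts to inputs of up to 8K resolution, and introduce Kubric-NK, a new high-resolution optical flow benchmark.

Motivation: Existing optical flow methods are usually designed for low resolution and do not generalize to large inputs, and there is no benchmark with high-resolution samples for judging their actual performance.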

Details

Method: Proposes DPFlow, an adaptive optical flow architecture that generalizes to inputs of up to 8K resolution while being trained only on low-resolution samples, and introduces the Kubric-NK benchmark. Result: DPFlow achieves state-of-the-art results on MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks. Conclusion: DPFlow handles high-resolution inputs effectively, and Kubric-NK provides a new standard for high-resolution optical flow evaluation. Abstract: Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks.

SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Tongyao Zhu,Qian Liu,Haonan Wang,Shiqi Chen,Xiangming Gu,Tianyu Pang,Min-Yen Kan

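Task: Explore an optimal context window scheduling strategy that better balances long-context capability with pretraining efficiency.

Motivation: A pilot study finds that, under a fixed token budget, models pretrained with shorter context windows consistently outperform their long-context counterparts.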

Details

Method: Proposes SkyLadder, which implements a short-to-long context window transition. Result: 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) were pretrained on 100B tokens, yielding gains of up to 3.7% on common benchmarks and up to 22% faster training than baselines. Conclusion: SkyLadder preserves strong standard benchmark performance while matching or exceeding baseline results on long-context tasks. Abstract: Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance, while matching or exceeding baseline results on long context tasks. Through extensive experiments, we pre-train 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. The code is at https://github.com/sail-sg/SkyLadder.
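
A short-to-long context window transition can be pictured as a schedule that maps the training step to a window length. The sketch below is one possible schedule, not SkyLadder's actual one: the linear ramp, the function name `context_window`, and every default value (`start`, `end`, `ramp_fraction`, `multiple`) are assumptions chosen for illustration.

```python
def context_window(
    step: int,
    total_steps: int,
    start: int = 32,
    end: int = 8192,
    ramp_fraction: float = 0.5,
    multiple: int = 32,
) -> int:
    """Return the context window (in tokens) to use at a given training step.

    Hypothetical linear ramp: grow from `start` to `end` over the first
    `ramp_fraction` of training, then hold at `end`.
    """
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    if step >= ramp_steps:
        return end
    raw = start + (end - start) * step / ramp_steps
    # Round down to a hardware-friendly multiple, never below the start value.
    return max(start, int(raw) // multiple * multiple)
```

Early steps then spend the token budget on many short sequences, where attention is cheapest, and the long window is only paid for once the ramp has finished.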

Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations

Shuo Li,Jiajun Sun,Guodong Zheng,Xiaoran Fan,Yujiong Shen,Yi Lu,Zhiheng Xi,Yuming Yang,Wenming Tan,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang

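Task: Propose Multi-Frequency Perturbations (MFP), a method for reducing object hallucinations in multimodal large language models (MLLMs) on visual-language tasks.

Motivation: MLLMs perform well on visual-language tasks, but their responses are often distorted by object hallucinations. A key cause identified by the authors is the model's over-susceptibility to specific image frequency features when detecting objects.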

Details

Method: Introduces Multi-Frequency Perturbations (MFP), which uses both low-frequency and high-frequency image features to perturb visual feature representations and explicitly suppresses redundant frequency-domain features during inference, thereby reducing hallucinations. Result: Experiments show that the method significantly reduces object hallucinations across various model architectures. In addition, as a training-time method, MFP can be combined with inference-time methods to reach state-of-the-art performance on the CHAIR benchmark. Conclusion: MFP is a simple, cost-effective, and pluggable method that effectively reduces object hallucinations in MLLMs and improves their performance on visual-language tasks. Abstract: Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.

Evaluating Bias in Retrieval-Augmented Medical Question-Answering Systems

Yuelyu Ji,Hang Zhang,Yanshan Wang

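Task: Systematically evaluate bias in medical question-answering systems built on Retrieval-Augmented Generation (RAG) models.

Motivation: Medical QA systems that support clinical decision-making may introduce biases related to race, gender, and social determinants of health.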

Details

Method: Examines demographic-sensitive queries and measures retrieval discrepancies, using datasets such as MMLU and MedMCQA to analyze retrieval overlap and correctness disparities. Result: The findings reveal substantial demographic disparities within RAG pipelines. Conclusion: Retrieval methods need to account explicitly for fairness to ensure equitable clinical decision-making. Abstract: Medical QA systems powered by Retrieval-Augmented Generation (RAG) models support clinical decision-making but may introduce biases related to race, gender, and social determinants of health. We systematically evaluate biases in RAG-based LLMs by examining demographic-sensitive queries and measuring retrieval discrepancies. Using datasets like MMLU and MedMCQA, we analyze retrieval overlap and correctness disparities. Our findings reveal substantial demographic disparities within RAG pipelines, emphasizing the critical need for retrieval methods that explicitly account for fairness to ensure equitable clinical decision-making.
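
The two quantities named in the abstract, retrieval overlap and correctness disparity, can be sketched as follows. The abstract does not define either metric, so the Jaccard overlap, the max-minus-min accuracy gap, and the function names `retrieval_overlap` and `accuracy_gap` are assumptions for illustration.

```python
from typing import Dict, Sequence


def retrieval_overlap(docs_a: Sequence[str], docs_b: Sequence[str]) -> float:
    """Jaccard overlap between the documents retrieved for two query variants.

    For example, the same clinical question phrased for two demographic groups.
    """
    set_a, set_b = set(docs_a), set(docs_b)
    if not set_a and not set_b:
        return 1.0  # nothing retrieved for either variant: treat as identical
    return len(set_a & set_b) / len(set_a | set_b)


def accuracy_gap(correct_by_group: Dict[str, Sequence[bool]]) -> float:
    """Largest difference in answer accuracy between any two groups."""
    accuracies = [sum(flags) / len(flags) for flags in correct_by_group.values()]
    return max(accuracies) - min(accuracies)
```

A low overlap means the retriever hands the generator different evidence depending only on the demographic wording, and a large gap means that difference carries through to answer quality.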

When Domain Generalization meets Generalized Category Discovery: An Adaptive Task-Arithmetic Driven Approach

Vaibhav Rathore,Shubhranil B,Saikat Dutta,Sarthak Mehrotra,Zsolt Kira,Biplab Banerjee

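Task: Cluster base and novel classes in a target domain using supervision from a source domain that contains only base classes.

Motivation: Current methods often falter under distribution shift and typically require access to target data during training, which is sometimes impractical. To address this, the authors introduce a new paradigm, Domain Generalization in GCD (DG-GCD).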

Details

Method: Proposes DG2CD-Net, which strengthens cross-domain generalization through an episodic training strategy that combines open-set domain adaptation with a novel margin loss and representation learning to refine the feature space progressively. Result: Experiments on three datasets confirm that DG2CD-Net outperforms existing GCD methods customized for DG-GCD. Conclusion: The episodic update mechanism of DG2CD-Net improves the base model's adaptability to unseen targets, demonstrating its effectiveness for domain-generalized GCD. Abstract: Generalized Class Discovery (GCD) clusters base and novel classes in a target domain using supervision from a source domain with only base classes. Current methods often falter with distribution shifts and typically require access to target data during training, which can sometimes be impractical. To address this issue, we introduce the novel paradigm of Domain Generalization in GCD (DG-GCD), where only source data is available for training, while the target domain, with a distinct data distribution, remains unseen until inference. To this end, our solution, DG2CD-Net, aims to construct a domain-independent, discriminative embedding space for GCD. The core innovation is an episodic training strategy that enhances cross-domain generalization by adapting a base model on tasks derived from source and synthetic domains generated by a foundation model. Each episode focuses on a cross-domain GCD task, diversifying task setups over episodes and combining open-set domain adaptation with a novel margin loss and representation learning for optimizing the feature space progressively. To capture the effects of fine-tuning on the base model, we extend task arithmetic by adaptively weighting the local task vectors concerning the fine-tuned models based on their GCD performance on a validation distribution. This episodic update mechanism boosts the adaptability of the base model to unseen targets. Experiments across three datasets confirm that DG2CD-Net outperforms existing GCD methods customized for DG-GCD.
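
The adaptive weighting of task vectors described in the abstract can be sketched as a weighted sum of the differences between each fine-tuned model and the base model. The abstract does not give the weighting rule, so the softmax over validation scores, the `temperature` parameter, and the function name `merge_task_vectors` are assumptions; parameters are flattened to plain lists for brevity.

```python
import math
from typing import List, Sequence


def merge_task_vectors(
    base: Sequence[float],
    finetuned: Sequence[Sequence[float]],
    val_scores: Sequence[float],
    temperature: float = 1.0,
) -> List[float]:
    """Add score-weighted task vectors (finetuned - base) to the base model.

    Hypothetical weighting: a softmax over each fine-tuned model's
    validation score, so better-performing episodes contribute more.
    """
    top = max(val_scores)
    exps = [math.exp((s - top) / temperature) for s in val_scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    merged = list(base)
    for weight, model in zip(weights, finetuned):
        for i, (param, base_param) in enumerate(zip(model, base)):
            merged[i] += weight * (param - base_param)  # weighted task vector
    return merged
```

With equal validation scores this reduces to averaging the task vectors; as the temperature shrinks, the merge approaches selecting the single best fine-tuned model.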

From 1,000,000 Users to Every User: Scaling Up Personalized Preference for User-level Alignment

Jia-Nan Li,Jian Guan,Songhao Wu,Wei Wu,Rui Yan

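Task: Propose a framework for scalable personalized alignment of large language models (LLMs).

Motivation: Traditional one-size-fits-all alignment overlooks the diversity of user values and needs and cannot meet personalized requirements.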

Details

Method: Establishes a systematic preference space covering psychological and behavioral dimensions, together with diverse persona representations, and develops two complementary alignment approaches: in-context alignment and preference-bridged alignment. Result: Experiments show an average accuracy gain of 17.06% across four benchmarks, along with strong adaptation to novel preferences, robustness to limited user data, and precise preference controllability. Conclusion: The framework effectively advances the development of truly user-adaptive AI systems. Abstract: Large language models (LLMs) have traditionally been aligned through one-size-fits-all approaches that assume uniform human preferences, fundamentally overlooking the diversity in user values and needs. This paper introduces a comprehensive framework for scalable personalized alignment of LLMs. We establish a systematic preference space characterizing psychological and behavioral dimensions, alongside diverse persona representations for robust preference inference in real-world scenarios. Building upon this foundation, we introduce AlignX, a large-scale dataset of over 1.3 million personalized preference examples, and develop two complementary alignment approaches: in-context alignment directly conditioning on persona representations and preference-bridged alignment modeling intermediate preference distributions. Extensive experiments demonstrate substantial improvements over existing methods, with an average 17.06% accuracy gain across four benchmarks while exhibiting a strong adaptation capability to novel preferences, robustness to limited user data, and precise preference controllability. These results validate our framework's effectiveness, advancing toward truly user-adaptive AI systems.

Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation

Siwei Wen,Junyan Ye,Peilin Feng,Hengrui Kang,Zichen Wen,Yize Chen,Jiang Wu,Wenjun Wu,Conghui He,Weijia Li

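Task: Propose FakeVLM, a specialized large multimodal model for synthetic image and DeepFake detection that provides natural language explanations to improve interpretability.

Motivation: With the rapid development of AIGC technologies, synthetic images are increasingly common in everyday life and pose new challenges for authenticity assessment and detection. Existing methods are effective at evaluating image authenticity and locating forgeries, but they often lack human interpretability and do not fully address the growing complexity of synthetic data.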

Details

Method: Introduces FakeVLM, a large multimodal model specialized for synthetic image and DeepFake detection that provides natural language explanations. Also presents FakeClue, a comprehensive dataset of over 100,000 images annotated with fine-grained artifact clues in natural language. Result: In extensive evaluations across multiple datasets, FakeVLM performs strongly on authenticity classification and sets a new benchmark on artifact explanation. Conclusion: FakeVLM performs strongly on synthetic image detection, offering a robust solution that needs no additional classifiers and improves interpretability. Abstract: With the rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, synthetic images have become increasingly prevalent in everyday life, posing new challenges for authenticity assessment and detection. Despite the effectiveness of existing methods in evaluating image authenticity and locating forgeries, these approaches often lack human interpretability and do not fully address the growing complexity of synthetic data. To tackle these challenges, we introduce FakeVLM, a specialized large multimodal model designed for both general synthetic image and DeepFake detection tasks. FakeVLM not only excels in distinguishing real from fake images but also provides clear, natural language explanations for image artifacts, enhancing interpretability. Additionally, we present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. FakeVLM demonstrates performance comparable to expert models while eliminating the need for additional classifiers, making it a robust solution for synthetic data detection. Extensive evaluations across multiple datasets confirm the superiority of FakeVLM in both authenticity classification and artifact explanation tasks, setting a new benchmark for synthetic image detection. The dataset and code will be released in: https://github.com/opendatalab/FakeVLM.

Dynamic Bi-Elman Attention Networks (DBEAN): Dual-Directional Context-Aware Representation Learning for Enhanced Text Classification

ZhengLin Lai,MengYao Liao,Dong Xu

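Task: Propose a new text classification model, the Dynamic Bidirectional Elman with Attention Network (DBEAN).

Motivation: Traditional methods struggle with complex linguistic structures and semantic dependencies, and existing models are limited in balancing interpretability, computational efficiency, and long-range contextual understanding.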

Details

Method: DBEAN integrates bidirectional temporal modelling with self-attention mechanisms and dynamically assigns weights to critical segments of the input. Result: DBEAN improves contextual representation while maintaining computational efficiency. Conclusion: DBEAN performs well on text classification and balances interpretability, computational efficiency, and long-range contextual understanding. Abstract: Text classification, a fundamental task in natural language processing (NLP), aims to categorize textual data into predefined labels. Traditional methods struggled with complex linguistic structures and semantic dependencies. The advent of deep learning, particularly recurrent neural networks (RNNs) and Transformer-based models, has significantly advanced the field by enabling nuanced feature extraction and context-aware predictions. Despite improvements, existing models exhibit limitations in balancing interpretability, computational efficiency, and long-range contextual understanding. This paper proposes the Dynamic Bidirectional Elman with Attention Network (DBEAN), which integrates bidirectional temporal modelling with self-attention mechanisms. DBEAN dynamically assigns weights to critical segments of input, improving contextual representation while maintaining computational efficiency.

Robust Distribution Alignment for Industrial Anomaly Detection under Distribution Shift

Jingyi Liao,Xun Xu,Yongyi Su,Rong-Cheng Tu,Yifan Liu,Dacheng Tao,Xulei Yang

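Task: Improve the generalization of anomaly detection methods to unseen target domains by optimizing a Sinkhorn distance.

Motivation: Anomaly detection plays a crucial role in quality control for industrial applications, but ensuring robustness under unseen domain shifts (such as lighting variations or sensor drift) remains a significant challenge. Existing methods try to address domain shift by training generalizable models, but they usually rely on prior knowledge of the target distribution and generalize poorly to backbones designed for other data modalities.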

Details

Method: Builds on memory-bank-based anomaly detection methods and optimizes a robust Sinkhorn distance on limited target training data. Result: Evaluated on 2D and 3D anomaly detection benchmarks with simulated distribution shifts, the proposed method shows superior results compared with anomaly detection and domain adaptation methods. Conclusion: The proposed method generalizes to unseen target domains better than existing state-of-the-art methods. Abstract: Anomaly detection plays a crucial role in quality control for industrial applications. However, ensuring robustness under unseen domain shifts such as lighting variations or sensor drift remains a significant challenge. Existing methods attempt to address domain shifts by training generalizable models but often rely on prior knowledge of target distributions and can hardly generalise to backbones designed for other data modalities. To overcome these limitations, we build upon memory-bank-based anomaly detection methods, optimizing a robust Sinkhorn distance on limited target training data to enhance generalization to unseen target domains. We evaluate the effectiveness on both 2D and 3D anomaly detection benchmarks with simulated distribution shifts. Our proposed method demonstrates superior results compared with state-of-the-art anomaly detection and domain adaptation methods.
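
For readers unfamiliar with the Sinkhorn distance, the sketch below computes the standard entropy-regularised optimal transport cost between two uniformly weighted point sets, given a pairwise cost matrix (for example, distances between memory-bank features and target features). It shows only the plain Sinkhorn iteration; the robust variant the paper optimizes is not reproduced, and the function name `sinkhorn_distance` and the defaults for `eps` and `iters` are assumptions.

```python
import math
from typing import List, Sequence


def sinkhorn_distance(
    cost: Sequence[Sequence[float]], eps: float = 0.1, iters: int = 200
) -> float:
    """Entropy-regularised transport cost between two uniform distributions."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform source weights
    b = [1.0 / m] * m  # uniform target weights
    kernel: List[List[float]] = [[math.exp(-c / eps) for c in row] for row in cost]

    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match both marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]

    # Transport plan is diag(u) @ kernel @ diag(v); return its total cost.
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m)
    )
```

Smaller `eps` tracks the unregularised transport cost more closely but makes the kernel entries tiny, so practical implementations usually work in log space; that refinement is omitted here to keep the iteration visible.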

Value Profiles for Encoding Human Variation

Taylor Sorensen,Pushkar Mishra,Roma Patel,Michael Henry Tessler,Michiel Bakker,Georgina Evans,Iason Gabriel,Noah Goodman,Verena Rieser

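Task: Model human variation in rating tasks to support personalization, pluralistic model alignment, and computational social science.

Motivation: Enabling AI systems for personalization, pluralistic model alignment, and computational social science requires modelling how humans vary in rating tasks.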

Details

Method: Proposes representing individuals with value profiles (natural language descriptions), paired with a steerable decoder model that estimates ratings conditioned on a value profile or other rater information. Introduces an information-theoretic methodology for measuring the predictive information in rater representations. Result: Demonstrations carry the most information, followed by value profiles and then demographics. Value profiles offer advantages in scrutability, interpretability, and steerability, and they effectively compress the useful information in demonstrations (over 70% information preservation). Clustering value profiles explains rater variation better than the most predictive demographic groupings. The decoder models change ratings interpretably according to semantic profile differences, are well calibrated, and can help explain instance-level disagreement by simulating an annotator population. Conclusion: Value profiles offer a novel, predictive way to describe individual variation beyond demographics or group information. Abstract: Modelling human variation in rating tasks is crucial for enabling AI systems for personalization, pluralistic model alignment, and computational social science. We propose representing individuals using value profiles -- natural language descriptions of underlying values compressed from in-context demonstrations -- along with a steerable decoder model to estimate ratings conditioned on a value profile or other rater information. To measure the predictive information in rater representations, we introduce an information-theoretic methodology. We find that demonstrations contain the most information, followed by value profiles and then demographics. However, value profiles offer advantages in terms of scrutability, interpretability, and steerability due to their compressed natural language format. Value profiles effectively compress the useful information from demonstrations (>70% information preservation). Furthermore, clustering value profiles to identify similarly behaving individuals better explains rater variation than the most predictive demographic groupings. Going beyond test set performance, we show that the decoder models interpretably change ratings according to semantic profile differences, are well-calibrated, and can help explain instance-level disagreement by simulating an annotator population. These results demonstrate that value profiles offer novel, predictive ways to describe individual variation beyond demographics or group information.

Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology

Siyuan Yan,Ming Hu,Yiwen Jiang,Xieji Li,Hao Fei,Philipp Tschandl,Harald Kittler,Zongyuan Ge

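Task: Build Derm1M, a large-scale vision-language dataset for AI research and clinical applications in dermatology.

Motivation: Existing dermatological datasets are limited in scale and depth and lack rich textual descriptions and clinical context, which holds back dermatology AI.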

Details

Method: Builds the Derm1M dataset from diverse educational resources, comprising 1,029,761 image-text pairs structured around a standard ontology developed collaboratively by experts. Result: Derm1M covers over 390 skin conditions and 130 clinical concepts with rich contextual information. The DermLIP models pretrained on this dataset significantly outperform existing foundation models on multiple tasks. Conclusion: The Derm1M dataset and the DermLIP models have strong potential for AI research and clinical application in dermatology. Abstract: The emergence of vision-language models has transformed medical AI, enabling unprecedented advances in diagnostic capability and clinical applications. However, progress in dermatology has lagged behind other medical domains due to the lack of standard image-text pairs. Existing dermatological datasets are limited in both scale and depth, offering only single-label annotations across a narrow range of diseases instead of rich textual descriptions, and lacking the crucial clinical context needed for real-world applications. To address these limitations, we present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. Built from diverse educational resources and structured around a standard ontology collaboratively developed by experts, Derm1M provides comprehensive coverage for over 390 skin conditions across four hierarchical levels and 130 clinical concepts with rich contextual information such as medical history, symptoms, and skin tone. To demonstrate Derm1M's potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset. The DermLIP family significantly outperforms state-of-the-art foundation models on eight diverse datasets across multiple tasks, including zero-shot skin disease classification, clinical and artifacts concept identification, few-shot/full-shot learning, and cross-modal retrieval. Our dataset and code will be public.

Policy Frameworks for Transparent Chain-of-Thought Reasoning in Large Language Models

Yihang Chen,Haikang Deng,Kaiqiao Han,Qingyue Zhao

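Task: Analyze the dual-edged implications of fully disclosing Chain-of-Thought (CoT) reasoning and propose a tiered-access policy framework.

Motivation: Current CoT disclosure policies vary across models and lack a unified policy framework, which can lead to intellectual property violations, misuse, and higher operational costs.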

Details

Method: Proposes a tiered-access policy framework that balances transparency, accountability, and security through ethical licensing, structured reasoning outputs, and cross-tier safeguards. Result: By harmonizing accessibility with ethical and operational considerations, the framework aims to advance responsible AI deployment while reducing the risk of misuse or misinterpretation. Conclusion: A tiered-access policy framework can strike a balance among transparency, accountability, and security, thereby promoting responsible AI deployment. Abstract: Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by decomposing complex problems into step-by-step solutions, improving performance on reasoning tasks. However, current CoT disclosure policies vary widely across different models in frontend visibility, API access, and pricing strategies, lacking a unified policy framework. This paper analyzes the dual-edged implications of full CoT disclosure: while it empowers small-model distillation, fosters trust, and enables error diagnosis, it also risks violating intellectual property, enabling misuse, and incurring operational costs. We propose a tiered-access policy framework that balances transparency, accountability, and security by tailoring CoT availability to academic, business, and general users through ethical licensing, structured reasoning outputs, and cross-tier safeguards. By harmonizing accessibility with ethical and operational considerations, this framework aims to advance responsible AI deployment while mitigating risks of misuse or misinterpretation.

Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes

Gahye Lee,Hyejeong Yoon,Jungeon Kim,Seungyong Lee

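Task: Propose a new deep learning-based framework for compactly representing 3D indoor scenes.

Motivation: Indoor scenes consist mainly of man-made objects such as furniture, which often have rectilinear geometry. They can therefore be represented as combinations of polycuboids, giving a compact representation that benefits downstream applications such as furniture rearrangement.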

Details

Method: The framework first detects six types of cuboid faces with a transformer network, then uses a graph neural network to validate the spatial relationships of the detected faces and form candidate polycuboids, and finally reconstructs each polycuboid instance from the aggregated face labels. Result: The framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. Conclusion: The versatility of the method is demonstrated through practical applications such as virtual room tours and scene editing. Abstract: This paper presents a novel framework for compactly representing a 3D indoor scene using a set of polycuboids through a deep learning-based fitting method. Indoor scenes mainly consist of man-made objects, such as furniture, which often exhibit rectilinear geometry. This property allows indoor scenes to be represented using combinations of polycuboids, providing a compact representation that benefits downstream applications like furniture rearrangement. Our framework takes a noisy point cloud as input and first detects six types of cuboid faces using a transformer network. Then, a graph neural network is used to validate the spatial relationships of the detected faces to form potential polycuboids. Finally, each polycuboid instance is reconstructed by forming a set of boxes based on the aggregated face labels. To train our networks, we introduce a synthetic dataset encompassing a diverse range of cuboid and polycuboid shapes that reflect the characteristics of indoor scenes. Our framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. The versatility of our method is demonstrated through practical applications, such as virtual room tours and scene editing.

Threefold model for AI Readiness: A Case Study with Finnish Healthcare SMEs

Mohammed Alnajjar,Khalid Alnajjar,Mika Hämäläinen

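Task: Study AI adoption among small and medium-sized enterprises (SMEs) in Finnish healthcare.

Motivation: Understand the current state of AI adoption among healthcare SMEs and the challenges they face.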

Details

Method: Identifies three AI engagement categories through semi-structured interviews with six health-tech companies. Result: Proposes a threefold model that highlights key barriers to AI adoption, including regulatory complexities, technical expertise gaps, and financial constraints. Conclusion: Provides actionable recommendations for accelerating AI integration, focusing on regulatory reforms, talent development, and inter-company collaboration, with useful insights for healthcare organizations, policymakers, and researchers. Abstract: This study examines AI adoption among Finnish healthcare SMEs through semi-structured interviews with six health-tech companies. We identify three AI engagement categories: AI-curious (exploring AI), AI-embracing (integrating AI), and AI-catering (providing AI solutions). Our proposed threefold model highlights key adoption barriers, including regulatory complexities, technical expertise gaps, and financial constraints. While SMEs recognize AI's potential, most remain in early adoption stages. We provide actionable recommendations to accelerate AI integration, focusing on regulatory reforms, talent development, and inter-company collaboration, offering valuable insights for healthcare organizations, policymakers, and researchers.

GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation

Junyu Shi,Lijiang Liu,Yong Sun,Zhiyuan Zhang,Jinni Zhou,Qiang Nie

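Task: Propose the Generative Pretrained Multi-path Motion Model (GenM$^3$) framework to address data heterogeneity in large-scale multi-source datasets and learn unified motion representations.

Motivation: Improving motion generation requires scaling up motion datasets, but training on large-scale multi-source datasets introduces data heterogeneity challenges.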

Details

Method: The GenM$^3$ framework has two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations through separate modality-specific pathways and improves inter-modal alignment through a shared text-motion pathway. Result: On the HumanML3D benchmark, GenM$^3$ reaches an FID of 0.035, surpassing existing methods by a large margin. It also shows strong zero-shot generalization on the IDEA400 dataset. Conclusion: The GenM$^3$ framework is effective and adaptable across diverse motion scenarios. Abstract: Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose Generative Pretrained Multi-path Motion Model (GenM$^3$), a comprehensive framework designed to learn unified motion representations. GenM$^3$ comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment by the text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment it with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin. It also demonstrates strong zero-shot generalization on the IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.

Squeeze Out Tokens from Sample for Finer-Grained Data Governance

Weixiong Lin,Chen Ju,Haicheng Wang,Shengchao Hu,Shuai Xiao,Mengting Chen,Yuheng Jiao,Mingshuai Yao,Jinsong Lan,Qingwen Liu,Ying Chen

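Task: Propose DataJuicer, a dual-branch data governance method that improves dataset quality through finer-grained intra-sample governance.

Motivation: Existing data governance methods shrink datasets by filtering out low-value samples, but the retained samples still contain many undesired tokens, leaving room for further compression and purification.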

Details

Method: DataJuicer uses a dual-branch structure: the vision branch retains salient image patches and extracts relevant object classes, while the text branch uses these classes to enhance captions, achieving finer-grained intra-sample governance. Result: Experiments show that DataJuicer significantly outperforms the existing DataSieve on image-text retrieval, classification, and dense visual reasoning. Conclusion: Through finer-grained data governance, DataJuicer significantly improves dataset quality and model performance. Abstract: Widely observed data scaling laws, in which error falls off as a power of the training size, demonstrate the diminishing returns of unselective data expansion. Hence, data governance is proposed to downsize datasets through pruning non-informative samples. Yet, isolating the impact of a specific sample on overall model performance is challenging, due to the vast computation required to try out all sample combinations. Current data governors circumvent this complexity by estimating sample contributions through heuristic-derived scalar scores, thereby discarding low-value ones. Despite thorough sample sieving, retained samples contain substantial undesired tokens intrinsically, underscoring the potential for further compression and purification. In this work, we upgrade data governance from a 'sieving' approach to a 'juicing' one. Instead of scanning for least-flawed samples, our dual-branch DataJuicer applies finer-grained intra-sample governance. It squeezes out informative tokens and boosts image-text alignments. Specifically, the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes to enhance captions. Consequently, DataJuicer yields more refined datasets through finer-grained governance. Extensive experiments across datasets demonstrate that DataJuicer significantly outperforms existing DataSieve in image-text retrieval, classification, and dense visual reasoning.

Shushing! Let's Imagine an Authentic Speech from the Silent Video

Jiaxin Ye,Hongming Shan

Task: Generate authentic speech from visual input without relying on auditory signals.

Motivation: Existing methods struggle to align semantics, timbre, and emotional prosody across modalities, so the authors propose the CV2S task to enhance cross-modal consistency.

Details

Method: Proposes ImaginTalk, a novel cross-modal diffusion framework that predicts discrete speech tokens with a discrete lip aligner, detects misaligned tokens and refines them through masked language modeling with BERT, and uses a style diffusion transformer to enhance the expressiveness of the generated speech. Result: Experiments show that ImaginTalk generates high-fidelity speech with more accurate semantic details and greater expressiveness in timbre and emotion. Conclusion: ImaginTalk performs well on vision-guided speech generation, produces high-quality speech, and has broad application potential. Abstract: Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals, offering significant potential for applications such as dubbing in filmmaking and assisting individuals with aphonia. Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody from visual cues, prompting us to propose Consistent Video-to-Speech (CV2S) as an extended task to enhance cross-modal consistency. To tackle emerging challenges, we introduce ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech using only visual input, operating within a discrete space. Specifically, we propose a discrete lip aligner that predicts discrete speech tokens from lip videos to capture semantic information, while an error detector identifies misaligned tokens, which are subsequently refined through masked language modeling with BERT. To further enhance the expressiveness of the generated speech, we develop a style diffusion transformer equipped with a face-style adapter that adaptively customizes identity and prosody dynamics across both the channel and temporal dimensions while ensuring synchronization with lip-aware semantic features. Extensive experiments demonstrate that ImaginTalk can generate high-fidelity speech with more accurate semantic details and greater expressiveness in timbre and emotion compared to state-of-the-art baselines. Demos are shown at our project page: https://imagintalk.github.io.

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Sara Sarto,Marcella Cornia,Rita Cucchiara

Task: Evaluate machine-generated image captions.

Motivation: With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics.

Details

Method: This survey gives a comprehensive overview of advances in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. Result: The metrics are assessed across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Conclusion: The analysis reveals some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning evaluation. Abstract: The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.

FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

Chongjun Tu,Lin Zhang,Pengtao Chen,Peng Ye,Xianfang Zeng,Wei Cheng,Gang Yu,Tao Chen

Task: Evaluate the fine-grained motion understanding of Multimodal Large Language Models (MLLMs) in video content understanding.

Motivation: Existing MLLMs perform well at video content understanding but still struggle with fine-grained motion comprehension.

Details

Method: Introduces FAVOR-Bench, comprising 1,776 videos with structured manual annotations, 8,184 multiple-choice question-answer pairs, and evaluation methods for open-ended tasks. Result: The 21 state-of-the-art MLLMs tested show significant limitations in comprehending and describing the detailed temporal dynamics of video motions. Fine-tuning Qwen2.5-VL on the FAVOR-Train dataset yields consistent improvements on motion-related tasks of TVBench, MotionBench, and FAVOR-Bench. Conclusion: FAVOR-Bench and FAVOR-Train provide valuable tools for developing more powerful video understanding models. Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel cost-efficient LLM-free and a GPT-assisted caption assessment method, where the former can enhance benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. The results of finetuning Qwen2.5-VL on FAVOR-Train yield consistent improvements on motion-related tasks of TVBench, MotionBench and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: https://favor-bench.github.io/.

Unique Hard Attention: A Tale of Two Sides

Selim Jerad,Anej Svete,Jiaoda Li,Ryan Cotterell

Task: Analyze the expressive power of leftmost-hard attention transformers and its relation to Linear Temporal Logic (LTL).

Motivation: Understanding the expressive power of transformers offers insight into their abilities and limitations, in particular how attention directionality affects expressivity.

Details

Method: Compares leftmost-hard and rightmost-hard attention transformers and analyzes their equivalence with Linear Temporal Logic (LTL). Result: Leftmost-hard attention transformers correspond to a strictly weaker fragment of LTL and are equivalent to soft attention transformers. Conclusion: Leftmost-hard attention transformers may approximate real-world transformers better than rightmost-hard ones, which underscores the role of attention directionality in expressivity. Abstract: Understanding the expressive power of transformers has recently attracted attention, as it offers insights into their abilities and limitations. Many studies analyze unique hard attention transformers, where attention selects a single position that maximizes the attention scores. When multiple positions achieve the maximum score, either the rightmost or the leftmost of those is chosen. In this paper, we highlight the importance of this seeming triviality. Recently, finite-precision transformers with both leftmost- and rightmost-hard attention were shown to be equivalent to Linear Temporal Logic (LTL). We show that this no longer holds with only leftmost-hard attention: in that case, they correspond to a strictly weaker fragment of LTL. Furthermore, we show that models with leftmost-hard attention are equivalent to soft attention, suggesting they may better approximate real-world transformers than right-attention models. These findings refine the landscape of transformer expressivity and underscore the role of attention directionality.
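
The tie-breaking rule that the abstract calls a "seeming triviality" can be illustrated with a minimal sketch. The function name `unique_hard_attention` and the example scores are hypothetical and chosen only for illustration; the code shows the selection rule as the abstract describes it, not the paper's formal construction.

```python
from typing import Sequence


def unique_hard_attention(scores: Sequence[float], tie_break: str = "leftmost") -> int:
    """Return the single position selected by unique hard attention.

    All attention goes to one position with the maximum score; ties are
    resolved by taking the leftmost or the rightmost maximizer.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    best = max(scores)
    # Collect every position that attains the maximum score, in order.
    maximizers = [i for i, s in enumerate(scores) if s == best]
    if tie_break == "leftmost":
        return maximizers[0]
    if tie_break == "rightmost":
        return maximizers[-1]
    raise ValueError("tie_break must be 'leftmost' or 'rightmost'")


# Hypothetical scores with a tie between positions 1 and 3.
scores = [0.2, 0.9, 0.1, 0.9]
leftmost = unique_hard_attention(scores, "leftmost")
rightmost = unique_hard_attention(scores, "rightmost")
print(leftmost, rightmost)
```

The two variants differ only when several positions tie for the maximum, which is exactly the case the paper studies.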

Optimal Transport Adapter Tuning for Bridging Modality Gaps in Few-Shot Remote Sensing Scene Classification

Zhong Ji,Ci Liu,Jingren Liu,Chen Tang,Yanwei Pang,Xuelong Li

Task: Few-Shot Remote Sensing Scene Classification (FS-RSSC) with limited labeled samples.

Motivation: Existing methods typically emphasize single-modal feature learning, neglecting the potential benefits of optimizing multi-modal representations.

Details

Method: Propose a novel Optimal Transport Adapter Tuning (OTAT) framework aimed at constructing an ideal Platonic representational space through optimal transport (OT) theory. The framework includes an Optimal Transport Adapter (OTA) and a sample-level Entropy-Aware Weighted (EAW) loss. Result: OTAT achieves state-of-the-art performance in FS-RSSC, significantly improving the model performance and generalization. Conclusion: The OTAT framework offers a scalable and efficient solution for advancing multimodal learning in remote sensing applications. Abstract: Few-Shot Remote Sensing Scene Classification (FS-RSSC) presents the challenge of classifying remote sensing images with limited labeled samples. Existing methods typically emphasize single-modal feature learning, neglecting the potential benefits of optimizing multi-modal representations. To address this limitation, we propose a novel Optimal Transport Adapter Tuning (OTAT) framework aimed at constructing an ideal Platonic representational space through optimal transport (OT) theory. This framework seeks to harmonize rich visual information with less dense textual cues, enabling effective cross-modal information transfer and complementarity. Central to this approach is the Optimal Transport Adapter (OTA), which employs a cross-modal attention mechanism to enrich textual representations and facilitate subsequent better information interaction. By transforming the network optimization into an OT optimization problem, OTA establishes efficient pathways for balanced information exchange between modalities. Moreover, we introduce a sample-level Entropy-Aware Weighted (EAW) loss, which combines difficulty-weighted similarity scores with entropy-based regularization. This loss function provides finer control over the OT optimization process, enhancing its solvability and stability. Our framework offers a scalable and efficient solution for advancing multimodal learning in remote sensing applications. Extensive experiments on benchmark datasets demonstrate that OTAT achieves state-of-the-art performance in FS-RSSC, significantly improving the model performance and generalization.
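
For readers unfamiliar with optimal transport, the following is a minimal sketch of generic entropy-regularized OT solved with Sinkhorn iterations. It is not the paper's OTA or EAW loss: the abstract does not say which solver is used, and the cost matrix, marginals, and `eps` value below are hypothetical placeholders for a small visual-to-text matching problem.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, n_iters=100):
    """Entropy-regularized OT plan between marginals a (rows) and b (columns).

    cost[i][j] is the cost of moving mass from source i to target j.
    Returns a matrix whose rows sum to a and whose columns sum to b.
    """
    n, m = len(a), len(b)
    # Gibbs kernel: low cost gives high affinity.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical example: three visual features matched to two text features.
cost = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
a = [1 / 3, 1 / 3, 1 / 3]
b = [0.5, 0.5]
plan = sinkhorn(cost, a, b)
```

Each row of `plan` says how one visual feature's mass is split across the text features; cheaper pairs receive more mass, subject to both marginals being respected.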

RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving

Wenqi Jiang,Suvinay Subramanian,Cat Graves,Gustavo Alonso,Amir Yazdanbakhsh,Vidushi Dadu

Task: Propose an approach for efficient serving of retrieval-augmented generation (RAG).

Motivation: Efficient RAG serving remains an open challenge because of the rapid emergence of RAG variants and the substantial differences in their workload characteristics.

Details

Method: Introduces RAGSchema as a structured abstraction of RAG algorithms, analyzes representative RAG workloads with distinct RAGSchema, and proposes RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework. Result: RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency. Conclusion: The RAGO framework effectively improves RAG serving performance and meets diverse performance requirements. Abstract: Retrieval-augmented generation (RAG), which combines large language models (LLMs) with retrievals from external knowledge databases, is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. In this paper, we make three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. Our evaluation shows that RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.

VisNumBench: Evaluating Number Sense of Multimodal Large Language Models

Tengjin Weng,Jingyi Wang,Wenhao Jiang,Zhong Ming

Task: Evaluate the number sense of Multimodal Large Language Models (MLLMs) on visual numerical tasks.

Motivation: Investigate whether MLLMs can develop an intuitive number sense similar to that of humans.

Details

Method: Introduces the Visual Number Benchmark (VisNumBench), consisting of about 1,900 multiple-choice question-answer pairs that cover seven visual numerical attributes and four types of visual numerical estimation tasks. Result: The 17 MLLMs tested perform significantly below human levels on number sense tasks; multimodal mathematical models and multimodal chain-of-thought (CoT) models show no significant improvement in number sense; MLLMs with larger parameter sizes and stronger general abilities show modest gains. Conclusion: VisNumBench is intended as a valuable resource for the research community, encouraging further improvement of MLLMs' number sense. Abstract: Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested, including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash, perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs' number sense abilities. All benchmark resources, including code and datasets, will be publicly available at https://wwwtttjjj.github.io/VisNumBench/.

Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations

Shuo Li,Jiajun Sun,Guodong Zheng,Xiaoran Fan,Yujiong Shen,Yi Lu,Zhiheng Xi,Yuming Yang,Wenming Tan,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang

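Task: Propose Multi-Frequency Perturbations (MFP), a method for reducing object hallucinations of Multimodal Large Language Models (MLLMs) in visual-language tasks.

Motivation: MLLMs perform well on visual-language tasks, but the authenticity of their responses is often compromised by object hallucinations. The authors find that a key cause is the model's over-susceptibility to specific image frequency features.

Details

Method: Introduces Multi-Frequency Perturbations (MFP), which uses both low-frequency and high-frequency image features to perturb visual feature representations and explicitly suppresses redundant frequency-domain features during inference, thereby mitigating hallucinations. Result: Experiments show that the method significantly reduces object hallucinations across various model architectures. In addition, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark. Conclusion: MFP is a simple, cost-effective, and pluggable method that effectively reduces object hallucinations in MLLMs and improves model performance. Abstract: Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.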

UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

Qihui Zhang,Munan Ning,Zheyuan Liu,Yanbo Wang,Jiayi Ye,Yue Huang,Shuo Yang,Xiao Chen,Yibing Song,Li Yuan

Task: Propose an unsupervised peer-review framework for evaluating Multimodal Large Language Models (MLLMs) that addresses the heavy human workload and the bias of existing evaluation methods.

Motivation: Existing evaluation methods require substantial human effort to design Q&A pairs for images, which limits the scale and scope of evaluation, while automated MLLM-as-judge approaches reduce that effort but introduce biases.

Details

Method: Proposes an unsupervised peer-review MLLM evaluation framework that uses only image data: models automatically generate questions and peer-review the answers of other models, and a vision-language scoring system is introduced to mitigate bias. Result: UPME achieves a Pearson correlation with human evaluations of 0.944 on the MMstar dataset and 0.814 on the ScienceQA dataset, indicating close alignment with human-designed benchmarks and human preferences. Conclusion: The UPME framework reduces both human workload and bias, aligns closely with human evaluation, and offers a new approach to evaluating MLLMs. Abstract: Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models. Existing evaluation methods face limitations due to the significant human workload required to design Q&A pairs for visual images, which inherently restricts the scale and scope of evaluations. Although automated MLLM-as-judge approaches attempt to reduce the human workload through automatic evaluations, they often introduce biases. To address these problems, we propose an Unsupervised Peer review MLLM Evaluation framework. It utilizes only image data, allowing models to automatically generate questions and conduct peer review assessments of answers from other models, effectively alleviating the reliance on human workload. Additionally, we introduce the vision-language scoring system to mitigate the bias issues, which focuses on three aspects: (i) response correctness; (ii) visual understanding and reasoning; and (iii) image-text correlation. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMstar dataset and 0.814 on the ScienceQA dataset, indicating that our framework closely aligns with human-designed benchmarks and inherent human preferences.
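
The headline numbers above are Pearson correlations between framework scores and human evaluations. As a minimal sketch of how such an agreement score is computed, the snippet below implements the standard Pearson coefficient; the per-model scores are hypothetical and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores: automatic framework vs. human evaluation.
framework_scores = [0.62, 0.71, 0.55, 0.80]
human_scores = [0.60, 0.75, 0.50, 0.82]
r = pearson(framework_scores, human_scores)
```

A value close to 1 means the automatic scores order and space the models almost the same way the human scores do.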

Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU

Àlex Pujol Vidal,Sergio Escalera,Kamal Nasrollahi,Thomas B. Moeslund

Task: Study machine unlearning in hyperbolic contrastive learning for selectively removing concepts from pre-trained models.

Motivation: Explore the effectiveness of concept removal in hyperbolic space, since existing work has focused on unlearning in Euclidean contrastive vision-language models.

Details

Method: Adapts Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Result: Experiments show that hyperbolic geometry offers distinct advantages for concept removal, achieving near-perfect forgetting while keeping reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Conclusion: Hyperbolic unlearning reorganizes the semantic hierarchy and thus differs fundamentally from Euclidean approaches; the findings advance machine unlearning techniques and give insight into how geometric properties influence concept representation and removal in multimodal models. Abstract: Machine unlearning methods have become increasingly important for selective concept removal in large pre-trained models. While recent work has explored unlearning in Euclidean contrastive vision-language models, the effectiveness of concept removal in hyperbolic spaces remains unexplored. This paper investigates machine unlearning in hyperbolic contrastive learning by adapting Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Through systematic experiments and ablation studies, we demonstrate that hyperbolic geometry offers distinct advantages for concept removal, achieving near perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space. Comparative analysis with Euclidean models reveals fundamental differences in unlearning dynamics, with hyperbolic unlearning reorganizing the semantic hierarchy while Euclidean approaches merely disconnect cross-modal associations. These findings not only advance machine unlearning techniques but also provide insights into the geometric properties that influence concept representation and removal in multimodal models. Source code available at https://github.com/alex-pv01/HAC

3D Engine-ready Photorealistic Avatars via Dynamic Textures

Yifan Wang,Ivan Molodetskikh,Ondrej Texler,Dimitar Dinev

Task: Propose an end-to-end pipeline that builds explicitly represented photorealistic 3D avatars using standard 3D assets.

Motivation: As the digital and physical worlds become more intertwined, interest in digital avatars that resemble their real-world counterparts has grown. However, current digitization methods require costly capture setups and are impractical for mass use by ordinary consumers.

Details

Method: Uses dynamically generated textures to enhance realism and visually mask deficiencies in the underlying mesh geometry. Result: The method integrates seamlessly with current graphics pipelines while achieving visual quality comparable to state-of-the-art 3D avatar generation methods. Conclusion: The proposed method generates photorealistic 3D avatars while remaining compatible with traditional rendering pipelines, and has broad application prospects. Abstract: As the digital and physical worlds become more intertwined, there has been a lot of interest in digital avatars that closely resemble their real-world counterparts. Current digitization methods used in 3D production pipelines require costly capture setups, making them impractical for mass usage among common consumers. Recent academic literature has found success in reconstructing humans from limited data using implicit representations (e.g., voxels used in NeRFs), which are able to produce impressive videos. However, these methods are incompatible with traditional rendering pipelines, making it difficult to use them in applications such as games. In this work, we propose an end-to-end pipeline that builds explicitly-represented photorealistic 3D avatars using standard 3D assets. Our key idea is the use of dynamically-generated textures to enhance the realism and visually mask deficiencies in the underlying mesh geometry. This allows for seamless integration with current graphics pipelines while achieving comparable visual quality to state-of-the-art 3D avatar generation methods.

A Review on Large Language Models for Visual Analytics

Navya Sonal Agarwal,Sanjay Kumar Sonbhadra

Task: Comprehensively review the integration of Large Language Models (LLMs) with visual analytics, covering foundational concepts, capabilities, and wide-ranging applications.

Motivation: Examine the transformative potential of LLMs in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations, and how they enhance data interpretation, visualization techniques, and interactive exploration.

Details

Method: Systematically explores a taxonomy of LLM tasks by evaluating key tools and platforms (LIDA, Chat2VIS, Julius AI, and Zoho Analytics) along with specialized multimodal models (ChartLlama and CharXIV). Result: Provides a SWOT analysis of integrating LLMs with visual analytics, highlighting strengths (accessibility and flexibility), weaknesses (computational demands and biases), opportunities (multimodal integration and user collaboration), and threats (privacy concerns and skill degradation). Conclusion: Emphasizes the importance of addressing ethical considerations and methodological improvements for effective integration. Abstract: This paper provides a comprehensive review of the integration of Large Language Models (LLMs) with visual analytics, addressing their foundational concepts, capabilities, and wide-ranging applications. It begins by outlining the theoretical underpinnings of visual analytics and the transformative potential of LLMs, specifically focusing on their roles in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations. The review further investigates how the synergy between LLMs and visual analytics enhances data interpretation, visualization techniques, and interactive exploration capabilities. Key tools and platforms including LIDA, Chat2VIS, Julius AI, and Zoho Analytics, along with specialized multimodal models such as ChartLlama and CharXIV, are critically evaluated. The paper discusses their functionalities, strengths, and limitations in supporting data exploration, visualization enhancement, automated reporting, and insight extraction. The taxonomy of LLM tasks, ranging from natural language understanding (NLU), natural language generation (NLG), to dialogue systems and text-to-media transformations, is systematically explored. This review provides a SWOT analysis of integrating Large Language Models (LLMs) with visual analytics, highlighting strengths like accessibility and flexibility, weaknesses such as computational demands and biases, opportunities in multimodal integration and user collaboration, and threats including privacy concerns and skill degradation. It emphasizes addressing ethical considerations and methodological improvements for effective integration.

MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Zihan Cao,Yu Zhong,Ziqi Wang,Liang-Jian Deng

Task: Propose a unified framework for multi-task, multi-degradation, and language-guided image fusion.

Motivation: Existing methods have several significant limitations: they require task- or dataset-specific models, neglect real-world image degradations, operate in pixel space at high computational cost, and lack user interaction capabilities.

Details

Method: Proposes a unified framework consisting of a practical degradation pipeline and a Diffusion Transformer (DiT) operating in latent space. Result: Experiments show that the approach effectively addresses these limitations and outperforms previous restoration+fusion and all-in-one pipelines. Conclusion: The framework performs well on image fusion tasks and resolves several limitations of existing methods. Abstract: Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (e.g., noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines. Codes are available at https://github.com/294coder/MMAIF.

When Pigs Get Sick: Multi-Agent AI for Swine Disease Detection

Tittaya Mairittha,Tanakon Sawanglok,Panuwit Raden,Sorrawit Treesuk

Task: Develop an AI-based multi-agent diagnostic system for swine disease surveillance.

Motivation: Address the challenges that swine disease surveillance poses for the sustainability of global agriculture, including limited veterinary resources, delayed identification of cases, and variability in diagnostic accuracy.

Details

Method: Uses Retrieval-Augmented Generation (RAG), automatically classifies user inputs as knowledge retrieval queries or symptom-based diagnostic queries, collects relevant clinical signs through an adaptive questioning protocol, and integrates multiple diagnostic hypotheses with a confidence-weighted decision fusion mechanism. Result: The system shows high accuracy, rapid response times, and consistent reliability in query classification, disease diagnosis, and knowledge retrieval. Conclusion: The AI-driven diagnostic framework enhances veterinary decision-making, advances sustainable livestock management practices, and contributes substantively to global food security. Abstract: Swine disease surveillance is critical to the sustainability of global agriculture, yet its effectiveness is frequently undermined by limited veterinary resources, delayed identification of cases, and variability in diagnostic accuracy. To overcome these barriers, we introduce a novel AI-powered, multi-agent diagnostic system that leverages Retrieval-Augmented Generation (RAG) to deliver timely, evidence-based disease detection and clinical guidance. By automatically classifying user inputs into either Knowledge Retrieval Queries or Symptom-Based Diagnostic Queries, the system ensures targeted information retrieval and facilitates precise diagnostic reasoning. An adaptive questioning protocol systematically collects relevant clinical signs, while a confidence-weighted decision fusion mechanism integrates multiple diagnostic hypotheses to generate robust disease predictions and treatment recommendations. Comprehensive evaluations encompassing query classification, disease diagnosis, and knowledge retrieval demonstrate that the system achieves high accuracy, rapid response times, and consistent reliability. By providing a scalable, AI-driven diagnostic framework, this approach enhances veterinary decision-making, advances sustainable livestock management practices, and contributes substantively to the realization of global food security.

Generating Multimodal Driving Scenes via Next-Scene Prediction

Yanhao Wu,Haoyang Zhang,Tianwei Lin,Lichao Huang,Shujie Luo,Rui Wu,Congpei Qiu,Wei Ke,Tong Zhang

Task: Propose a multimodal generation framework for generating controllable autonomous driving scenes.

Motivation: Existing generative methods capture only a limited range of modalities, which restricts their ability to generate controllable scenes and prevents comprehensive evaluation of autonomous driving systems.

Details

Method: Introduces a multimodal generation framework covering four major data modalities and generates scene sequences with a two-stage approach consisting of a Temporal AutoRegressive (TAR) component and an Ordered AutoRegressive (OAR) component, plus an Action-aware Map Alignment (AMA) module. Result: The framework effectively generates complex, realistic driving scenes, ensures multimodal consistency, and offers fine-grained control over scene elements. Conclusion: The framework generates complex and controllable driving scenes suitable for comprehensive evaluation of autonomous driving systems. Abstract: Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by only capturing a limited range of modalities, restricting the capability of generating controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including a novel addition of map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation based on the ego-action to maintain coherence between these modalities. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.

Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context

Junyi Ao,Dekun Chen,Xiaohai Tian,Wenjie Feng,Jun Zhang,Lu Lu,Yuxuan Wang,Haizhou Li,Zhizheng Wu

Task: Propose Solla, a new framework designed to understand speech instructions and audio content at the same time.

Motivation: Existing models mainly analyze input signals using text instructions and overlook scenarios in which speech instructions and audio are mixed as input.

Details

Method: Solla combines an audio tagging module with an ASR-assisted prediction method to identify and represent audio events effectively and to improve comprehension of spoken content. Result: Experiments show that Solla performs on par with or better than baseline models on both the easy and hard test sets. Conclusion: The Solla framework is effective at jointly understanding speech and audio. Abstract: Large Language Models (LLMs) have recently shown remarkable ability to process not only text but also multimodal inputs such as speech and audio. However, most existing models primarily focus on analyzing input signals using text instructions, overlooking scenarios in which speech instructions and audio are mixed and serve as inputs to the model. To address these challenges, we introduce Solla, a novel framework designed to understand speech-based questions and hear the acoustic context concurrently. Solla incorporates an audio tagging module to effectively identify and represent audio events, as well as an ASR-assisted prediction method to improve comprehension of spoken content. To rigorously evaluate Solla and other publicly available models, we propose a new benchmark dataset called SA-Eval, which includes three tasks: audio event classification, audio captioning, and audio question answering. SA-Eval has diverse speech instructions with various speaking styles, encompassing two difficulty levels, easy and hard, to capture the range of real-world acoustic conditions. Experimental results show that Solla performs on par with or outperforms baseline models on both the easy and hard test sets, underscoring its effectiveness in jointly understanding speech and audio.

ChatStitch: Visualizing Through Structures via Surround-View Unsupervised Deep Image Stitching with Collaborative LLM-Agents

Hao Liang,Zhipeng Dong,Yi Yang,Mengyin Fu

Task: Propose a collaborative perception system that can reveal obscured blind-spot information through natural language commands integrated with external digital assets.

Motivation: Existing collaborative perception systems are limited in user interaction efficiency and in multi-camera photorealistic visualization.

Details

Method: Proposes ChatStitch, which uses a multi-agent collaborative framework based on Large Language Models, together with SV-UDIS, a surround-view unsupervised deep image stitching method for the non-global-overlapping condition. Result: On the UDIS-D dataset, SV-UDIS achieves state-of-the-art performance on 3, 4, and 5 image stitching tasks, with PSNR improvements of 9%, 17%, and 21% and SSIM improvements of 8%, 18%, and 26%, respectively. Conclusion: ChatStitch addresses the limitations of existing collaborative perception systems through natural language commands and a multi-agent collaborative framework, and SV-UDIS performs strongly on image stitching tasks. Abstract: Collaborative perception has garnered significant attention for its ability to enhance the perception capabilities of individual vehicles through the exchange of information with surrounding vehicle-agents. However, existing collaborative perception systems are limited by inefficiencies in user interaction and the challenge of multi-camera photorealistic visualization. To address these challenges, this paper introduces ChatStitch, the first collaborative perception system capable of unveiling obscured blind spot information through natural language commands integrated with external digital assets. To adeptly handle complex or abstract commands, ChatStitch employs a multi-agent collaborative framework based on Large Language Models. For achieving the most intuitive perception for humans, ChatStitch proposes SV-UDIS, the first surround-view unsupervised deep image stitching method under the non-global-overlapping condition. We conducted extensive experiments on the UDIS-D, MCOV-SLAM open datasets, and our real-world dataset. Specifically, our SV-UDIS method achieves state-of-the-art performance on the UDIS-D dataset for 3, 4, and 5 image stitching tasks, with PSNR improvements of 9%, 17%, and 21%, and SSIM improvements of 8%, 18%, and 26%, respectively.

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Noam Razin,Zixuan Wang,Hubert Strauss,Stanley Wei,Jason D. Lee,Sanjeev Arora

Task: Examine, from an optimization perspective, what makes a reward model effective in Reinforcement Learning from Human Feedback (RLHF).

Motivation: Reward model quality is evaluated primarily through accuracy, but it is unclear whether accuracy fully captures what makes a reward model an effective teacher.

Details

Method: Taking an optimization perspective, proves that no matter how accurate a reward model is, if it induces low reward variance then the RLHF objective has a flat landscape. Result: Experiments show that even a highly accurate reward model can lead to extremely slow optimization, while a less accurate model that induces higher reward variance can perform better. Conclusion: Beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization. Abstract: The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
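
The contrast between accuracy and reward variance can be made concrete with a small sketch. The definitions below (policy-weighted variance of rewards for one prompt, and pairwise ranking accuracy against a ground-truth quality score) are assumptions made for illustration, since the abstract gives no formulas; all probabilities and reward values are hypothetical.

```python
def reward_variance(probs, rewards):
    """Variance of the reward under the policy's output distribution."""
    mean = sum(p * r for p, r in zip(probs, rewards))
    return sum(p * (r - mean) ** 2 for p, r in zip(probs, rewards))


def pairwise_accuracy(rewards, ground_truth):
    """Fraction of output pairs that the reward model ranks like the ground truth."""
    n = len(rewards)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if ground_truth[i] != ground_truth[j]]
    correct = sum(
        1 for i, j in pairs
        if (rewards[i] - rewards[j]) * (ground_truth[i] - ground_truth[j]) > 0
    )
    return correct / len(pairs)


# Hypothetical policy over three candidate outputs for one prompt.
policy_probs = [0.7, 0.2, 0.1]
true_quality = [0.0, 0.5, 1.0]

# Reward model A ranks every pair correctly but barely separates the outputs.
rm_a = [0.49, 0.50, 0.51]
# Reward model B swaps the top two outputs but separates outputs strongly.
rm_b = [0.0, 1.0, 0.9]

acc_a = pairwise_accuracy(rm_a, true_quality)
acc_b = pairwise_accuracy(rm_b, true_quality)
var_a = reward_variance(policy_probs, rm_a)
var_b = reward_variance(policy_probs, rm_b)
```

Here model A is the more accurate of the two yet induces a far smaller variance than model B, which is the regime the abstract associates with a flat objective landscape and slow optimization.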

USAM-Net: A U-Net-based Network for Improved Stereo Correspondence and Scene Depth Estimation using Features from a Pre-trained Image Segmentation network

Joseph Emmanuel DL Dayo,Prospero C. Naval Jr

Task: Propose USAM-Net, a new convolutional neural network for improving depth estimation performance.

Motivation: Growing demand for high-accuracy depth estimation in autonomous driving and augmented reality calls for advanced neural architectures that can effectively leverage multiple data modalities.

Details

Method: USAM-Net uses a dual-pathway architecture that combines a pre-trained segmentation model (SAM) with a depth estimation model; the semantic masks produced by the segmentation pathway are concatenated with the stereo images as input to the depth estimation pathway. Result: Experiments on the DrivingStereo dataset show that USAM-Net outperforms traditional models such as CFNet, SegStereo, and iResNet in Global Difference (GD) and End-Point Error (EPE). Conclusion: USAM-Net shows potential for applications that require high-precision depth data and demonstrates the effectiveness of integrating segmentation information into stereo depth estimation. Abstract: The increasing demand for high-accuracy depth estimation in autonomous driving and augmented reality applications necessitates advanced neural architectures capable of effectively leveraging multiple data modalities. In this context, we introduce the Unified Segmentation Attention Mechanism Network (USAM-Net), a novel convolutional neural network that integrates stereo image inputs with semantic segmentation maps and attention to enhance depth estimation performance. USAM-Net employs a dual-pathway architecture, which combines a pre-trained segmentation model (SAM) and a depth estimation model. The segmentation pathway preprocesses the stereo images to generate semantic masks, which are then concatenated with the stereo images as inputs to the depth estimation pathway. This integration allows the model to focus on important features such as object boundaries and surface textures which are crucial for accurate depth perception. Empirical evaluation on the DrivingStereo dataset demonstrates that USAM-Net achieves superior performance metrics, including a Global Difference (GD) of 3.61% and an End-Point Error (EPE) of 0.88, outperforming traditional models such as CFNet, SegStereo, and iResNet. These results underscore the effectiveness of integrating segmentation information into stereo depth estimation tasks, highlighting the potential of USAM-Net in applications demanding high-precision depth data.
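
As a reference point for the EPE figure above, here is a minimal sketch of how end-point error is commonly computed for stereo disparity maps: the mean absolute difference between predicted and ground-truth disparity over valid pixels. The convention that a ground-truth value of 0 marks an invalid pixel is an assumption of this sketch, as is the function name; the paper's own evaluation code may differ.

```python
def end_point_error(predicted, target):
    """Mean absolute disparity error over pixels with valid ground truth.

    Assumption: disparity maps are nested lists, and target == 0 marks
    pixels without ground truth (common for sparse LiDAR-derived labels).
    """
    total_error = 0.0
    valid_pixels = 0
    for pred_row, target_row in zip(predicted, target):
        for pred, true in zip(pred_row, target_row):
            if true > 0:
                total_error += abs(pred - true)
                valid_pixels += 1
    if valid_pixels == 0:
        raise ValueError("no valid ground-truth pixels")
    return total_error / valid_pixels

predicted = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.5, 0.0], [3.0, 5.0]]
print(end_point_error(predicted, target))
```

Masking invalid pixels matters on driving datasets, where ground-truth disparity is typically sparse.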

TULIP: Towards Unified Language-Image Pretraining

Zineng Tang,Long Lian,Seun Eisape,XuDong Wang,Roei Herzig,Adam Yala,Alane Suhr,Trevor Darrell,David M. Chan

Task: Propose TULIP to address the shortcomings of existing image-text contrastive models on vision-centric tasks.

Motivation: Existing image-text contrastive models such as CLIP and SigLIP perform poorly on tasks that demand high-fidelity image understanding (counting, depth estimation, and fine-grained object recognition), while vision-focused models are limited in language understanding.

Details

Method: Uses generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Result: TULIP outperforms existing state-of-the-art models across multiple benchmarks, sets a new state of the art for zero-shot performance on ImageNet-1K, delivers up to a 2× improvement over SigLIP on RxRx1 in linear probing for few-shot classification, and scores over 3× higher than SigLIP on MMVP. Conclusion: TULIP performs strongly on both vision and language understanding tasks and offers an effective alternative to existing CLIP-like models. Abstract: Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. These models, by performing language alignment, tend to prioritize high-level semantics over visual understanding, weakening their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a $2\times$ enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over $3\times$ higher scores than SigLIP on MMVP. Our code/checkpoints are available at https://tulip-berkeley.github.io

Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching

Yang Liu,Wentao Feng,Zhuoyao Liu,Shudong Huang,Jiancheng Lv

Task: Propose a new visual semantic embedding model (D2S-VSE) for multi-view description matching.

Motivation: Existing methods learn a set of embeddings to find the best match for each view's text and compute similarity, but the visual and text embeddings they learn have limited information capacity and are prone to interference from locally similar negative samples.

Details

Method: Proposes Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE), which enhances the information capacity of sparse text through dense text distillation. D2S-VSE is a two-stage framework: in the pre-training stage, images are aligned with dense text to enhance the information capacity of visual semantic embeddings; in the fine-tuning stage, two tasks are optimized simultaneously, distilling dense text embeddings into sparse text embeddings while aligning images with sparse text, which enhances the information capacity of sparse text embeddings. Result: Extensive evaluation on MS-COCO and Flickr30K shows that D2S-VSE outperforms recent state-of-the-art methods. Conclusion: By enhancing the information capacity of sparse text embeddings, D2S-VSE effectively addresses multi-view description matching and performs strongly on large-scale datasets. Abstract: Enabling Visual Semantic Models to effectively handle multi-view description matching has been a longstanding challenge. Existing methods typically learn a set of embeddings to find the optimal match for each view's text and compute similarity. However, the visual and text embeddings learned through these approaches have limited information capacity and are prone to interference from locally similar negative samples. To address this issue, we argue that the information capacity of embeddings is crucial and propose Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE), which enhances the information capacity of sparse text by leveraging dense text distillation. Specifically, D2S-VSE is a two-stage framework. In the pre-training stage, we align images with dense text to enhance the information capacity of visual semantic embeddings. In the fine-tuning stage, we optimize two tasks simultaneously, distilling dense text embeddings to sparse text embeddings while aligning images and sparse texts, enhancing the information capacity of sparse text embeddings. Our proposed D2S-VSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.

Depth-Aware Range Image-Based Model for Point Cloud Segmentation

Bike Chen,Antti Tikanmäki,Juha Röning

Task: Point cloud segmentation (PCS), which aims to separate points into different and meaningful groups.

Motivation: PCS plays an important role in robotics because it lets robots understand their physical environments directly. Range image-based models are commonly adopted to process sparse, large-scale outdoor point clouds, but they lack explicit depth information, so objects that are separate in 3D space touch each other in the image, which makes segmentation harder. In addition, existing PCS models are usually derived from color image-based models and cannot fully use the implicit but ordered depth information in the range image, leading to inferior performance.

Details

Method: Proposes the Depth-Aware Module (DAM) and Fast FMVNet V3. DAM perceives the ordered depth information in the range image by explicitly modelling the interdependence among channels. Fast FMVNet V3 incorporates DAM by integrating it into the last block of each architecture stage. Result: Extensive experiments on SemanticKITTI, nuScenes, and SemanticPOSS show that DAM brings a significant improvement for Fast FMVNet V3 at negligible computational cost. Conclusion: DAM and Fast FMVNet V3 effectively address the use of depth information in range image-based point cloud segmentation and significantly improve segmentation performance. Abstract: Point cloud segmentation (PCS) aims to separate points into different and meaningful groups. The task plays an important role in robotics because PCS enables robots to understand their physical environments directly. To process sparse and large-scale outdoor point clouds in real time, range image-based models are commonly adopted. However, in a range image, the lack of explicit depth information inevitably causes some separate objects in 3D space to touch each other, bringing difficulty for the range image-based models in correctly segmenting the objects. Moreover, previous PCS models are usually derived from the existing color image-based models and unable to make full use of the implicit but ordered depth information inherent in the range image, thereby achieving inferior performance. In this paper, we propose the Depth-Aware Module (DAM) and Fast FMVNet V3. DAM perceives the ordered depth information in the range image by explicitly modelling the interdependence among channels. Fast FMVNet V3 incorporates DAM by integrating it into the last block in each architecture stage. Extensive experiments conducted on SemanticKITTI, nuScenes, and SemanticPOSS demonstrate that DAM brings a significant improvement for Fast FMVNet V3 with negligible computational cost.

Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering

Thanh-Son Nguyen,Hong Yang,Tzeh Yuan Neoh,Hao Zhang,Ee Yeo Keat,Basura Fernando

Task: Introduce a new video question-answering (VQA) dataset that requires models to use procedural knowledge for complex reasoning.

Motivation: Challenge models on recognizing visual entities, generating hypotheses, and performing contextual, causal, and counterfactual reasoning.

Details

Method: Proposes a neuro-symbolic reasoning module that integrates neural networks with LLM-driven constrained reasoning to generate interpretable answers. Result: Combining LLMs with structured knowledge reasoning and logic enhances procedural reasoning on the STAR benchmark and the authors' dataset. Conclusion: Combining LLMs with structured knowledge reasoning performs well on procedural reasoning tasks; the code and dataset will be released at https://github.com/LUNAProject22/KML. Abstract: This paper introduces a new video question-answering (VQA) dataset that challenges models to leverage procedural knowledge for complex reasoning. It requires recognizing visual entities, generating hypotheses, and performing contextual, causal, and counterfactual reasoning. To address this, we propose a neuro-symbolic reasoning module that integrates neural networks and LLM-driven constrained reasoning over variables for interpretable answer generation. Results show that combining LLMs with structured knowledge reasoning with logic enhances procedural reasoning on the STAR benchmark and our dataset. Code and dataset will be available at https://github.com/LUNAProject22/KML soon.

Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning

Zixuan Zheng,Yilei Shi,Chunlei Li,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou

Task: Study few-shot video object segmentation in an extremely low-data regime, using annotations from only a few video frames together with existing labeled images to reduce video annotation cost.

Motivation: Dense frame annotation is costly in the medical domain; existing labeled images can be used to minimize the need for video annotation.

Details

Method: Proposes a two-phase framework: first, a few-shot segmentation model is learned from labeled images; then a spatiotemporal consistency relearning approach improves performance on medical videos, with consistency also enforced between the image model and the relearning model at both feature and prediction levels. Result: Experiments show that the approach outperforms existing state-of-the-art few-shot segmentation methods and achieves strong video segmentation performance in the low-data regime. Conclusion: The approach bridges medical images and sparsely labeled medical videos, achieving strong video segmentation performance in the low-data regime. Abstract: Few-shot video object segmentation aims to reduce annotation costs; however, existing methods still require abundant dense frame annotations for training, which are scarce in the medical domain. We investigate an extremely low-data regime that utilizes annotations from only a few video frames and leverages existing labeled images to minimize costly video annotations. Specifically, we propose a two-phase framework. First, we learn a few-shot segmentation model using labeled images. Subsequently, to improve performance without full supervision, we introduce a spatiotemporal consistency relearning approach on medical videos that enforces consistency between consecutive frames. Constraints are also enforced between the image model and relearning model at both feature and prediction levels. Experiments demonstrate the superiority of our approach over state-of-the-art few-shot segmentation methods. Our model bridges the gap between abundant annotated medical images and scarce, sparsely labeled medical videos to achieve strong video segmentation performance in this low data regime. Code is available at https://github.com/MedAITech/RAB.

Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition

Seungyeon Cho,Tae-Kyun Kim

Task: Propose BHaRNet, a new framework for skeleton-based human action recognition with particular attention to subtle hand motions.

Motivation: Existing methods concentrate on full-body movements and often overlook subtle hand motions, which are critical for distinguishing fine-grained actions.

Details

Method: BHaRNet augments a typical body-expert model with a hand-expert model, using joint training and cross-attention to enable feature-level interaction and selective fusion of complementary information. Result: On large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA), BHaRNet improves accuracy on hand-intensive actions from 86.4% to 93.0% while using fewer GFLOPs and parameters than comparable unified methods. Conclusion: BHaRNet performs strongly in skeleton-based human action recognition, especially on hand motions, with high accuracy and low computational complexity. Abstract: Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human-robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet, these models often blur fine hand details due to the disparity between body and hand action characteristics and the loss of subtle features during spatial pooling. In this paper, we propose BHaRNet (Body-Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves SOTA accuracies, improving from 86.4% to 93.0% in hand-intensive actions, while maintaining fewer GFLOPs and parameters than the relevant unified methods.

Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models

Tingxiu Chen,Yilei Shi,Zixuan Zheng,Bingcong Yan,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou

Task: Address the scarcity of ultrasound video datasets by synthesizing ultrasound videos.

Motivation: Publicly available ultrasound video datasets are scarce, which hinders the development of effective video classification models.

Details

Method: Proposes a latent dynamic diffusion model (LDDM) that efficiently translates static images into dynamic sequences with realistic video characteristics. Result: The method shows strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone. Conclusion: The image-to-video approach provides an effective data augmentation solution for advancing ultrasound video analysis. Abstract: Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we introduce a latent dynamic diffusion model (LDDM) to efficiently translate static images to dynamic sequences with realistic video characteristics. We demonstrate strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Notably, training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone, indicating our method successfully emulates dynamics critical for discrimination. Our image-to-video approach provides an effective data augmentation solution to advance ultrasound video analysis. Code is available at https://github.com/MedAITech/U_I2V.

Language-based Image Colorization: A Benchmark and Beyond

Yifan Li,Shuai Yang,Jiaying Liu

Task: Provide a comprehensive review and benchmark of language-based image colorization methods.

Motivation: The language-based colorization literature lacks a comprehensive review; this work aims to fill that gap and offer meaningful insights.

Details

Method: First summarizes existing automatic colorization methods, then focuses on language-based methods and divides them into two categories: those that train a cross-modality network from scratch, and those that use a pre-trained cross-modality model to establish textual-visual correspondence. Based on the limitations of existing methods, proposes a simple yet effective method based on a distilled diffusion model. Result: Experiments show that the proposed simple baseline produces better results than previous complex methods with a 14-times speedup. Conclusion: To the best of the authors' knowledge, this is the first comprehensive review and benchmark of language-based image colorization, providing meaningful insights for the community. Abstract: Image colorization aims to bring colors back to grayscale images. Automatic image colorization methods, which require no additional guidance, struggle to generate high-quality images due to color ambiguity, and provide limited user controllability. Thanks to the emergence of cross-modality datasets and models, language-based colorization methods are proposed to fully utilize the efficiency and flexibility of text descriptions to guide colorization. In view of the lack of a comprehensive review of language-based colorization literature, we conduct a thorough analysis and benchmarking. We first briefly summarize existing automatic colorization methods. Then, we focus on language-based methods and point out their core challenge on cross-modal alignment. We further divide these methods into two categories: one attempts to train a cross-modality network from scratch, while the other utilizes the pre-trained cross-modality model to establish the textual-visual correspondence. Based on the analyzed limitations of existing language-based methods, we propose a simple yet effective method based on a distilled diffusion model. Extensive experiments demonstrate that our simple baseline can produce better results than previous complex methods with a 14-times speedup. To the best of our knowledge, this is the first comprehensive review and benchmark on the language-based image colorization field, providing meaningful insights for the community. The code is available at https://github.com/lyf1212/Color-Turbo.

Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening

Zihan Cao,Yu Zhong,Liang-Jian Deng

Task: Propose a one-step, high-quality pansharpening method based on the Optimal Transport Flow Matching (OTFM) framework.

Motivation: Diffusion models based on stochastic differential equations (SDEs) perform well on pansharpening, but their multi-step sampling process imposes substantial computational overhead that limits practical use.

Details

Method: Proposes the OTFM framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Result: Across multiple datasets, OTFM matches or exceeds previous regression-based models and leading diffusion-based methods while needing only one sampling step. Conclusion: OTFM enables simulation-free training and single-step inference while adhering to pansharpening constraints, significantly improving computational efficiency. Abstract: Pansharpening, a pivotal task in remote sensing for fusing high-resolution panchromatic and multispectral imagery, has garnered significant research interest. Recent advancements employing diffusion models based on stochastic differential equations (SDEs) have demonstrated state-of-the-art performance. However, the inherent multi-step sampling process of SDEs imposes substantial computational overhead, hindering practical deployment. While existing methods adopt efficient samplers, knowledge distillation, or retraining to reduce sampling steps (e.g., from 1,000 to fewer steps), such approaches often compromise fusion quality. In this work, we propose the Optimal Transport Flow Matching (OTFM) framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Unlike conventional OT formulations that enforce rigid distribution alignment, UOT relaxes marginal constraints to enhance modeling flexibility, accommodating the intrinsic spectral and spatial disparities in remote sensing data. Furthermore, we incorporate task-specific regularization into the UOT objective, enhancing the robustness of the flow model. The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints. Experimental evaluations across multiple datasets demonstrate that OTFM matches or exceeds the performance of previous regression-based models and leading diffusion-based methods while only needing one sampling step. Codes are available at https://github.com/294coder/PAN-OTFM.

One-Shot Medical Video Object Segmentation via Temporal Contrastive Memory Networks

Yaxiong Chen,Junjian Hu,Chunlei Li,Zixuan Zheng,Jingliang Hu,Yilei Shi,Shengwu Xiong,Xiao Xiang Zhu,Lichao Mou

Task: Introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame.

Motivation: Efficient video object segmentation is needed for analyzing complex medical video data, but it faces challenges in data availability and annotation.

Details

Method: Proposes a temporal contrastive memory network comprising image and mask encoders, a temporal contrastive memory bank, and a decoder; together these learn feature representations, align embeddings from adjacent frames, and store the resulting features. Result: The method achieves state-of-the-art performance in segmenting both seen and unseen structures, showing the ability to generalize from scarce labels. Conclusion: The method has the potential to alleviate annotation burdens for medical video analysis. Abstract: Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart distant ones to explicitly model inter-frame relationships and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi-source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, showing the ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at https://github.com/MedAITech/TCMN.

Semi-KAN: KAN Provides an Effective Representation for Semi-Supervised Learning in Medical Image Segmentation

Zanting Ye,Xiaolong Niu,Xuanbin Wu,Wenxiang Yi,Yuan Chang,Lijun Lu

Task: Propose Semi-KAN, a semi-supervised medical image segmentation method based on Kolmogorov-Arnold Networks (KANs).

Motivation: Existing semi-supervised medical image segmentation methods typically rely on single fixed activation functions and linear modeling patterns, which limits their ability to learn robust representations.

Details

Method: Proposes Semi-KAN, which applies KANs at the encoder's bottleneck and the decoder's top layers within a U-Net pipeline to extract high-level semantic features, and lowers computational overhead by reducing feature dimensions and using horizontal scaling. Result: Experiments on four public datasets show that Semi-KAN surpasses baseline networks with fewer KAN layers and lower computational cost. Conclusion: KANs show promise for semi-supervised medical image segmentation and can effectively improve representation learning. Abstract: Deep learning-based medical image segmentation has shown remarkable success; however, it typically requires extensive pixel-level annotations, which are both expensive and time-intensive. Semi-supervised medical image segmentation (SSMIS) offers a viable alternative, driven by advancements in CNNs and ViTs. However, these networks often rely on single fixed activation functions and linear modeling patterns, limiting their ability to effectively learn robust representations. Given the limited availability of labeled data, achieving robust representation learning becomes crucial. Inspired by Kolmogorov-Arnold Networks (KANs), we propose Semi-KAN, which leverages the untapped potential of KANs to enhance backbone architectures for representation learning in SSMIS. Our findings indicate that: (1) compared to networks with fixed activation functions, KANs exhibit superior representation learning capabilities with fewer parameters, and (2) KANs excel in high-semantic feature spaces. Building on these insights, we integrate KANs into tokenized intermediate representations, applying them selectively at the encoder's bottleneck and the decoder's top layers within a U-Net pipeline to extract high-level semantic features. Although learnable activation functions improve feature expansion, they introduce significant computational overhead with only marginal performance gains. To mitigate this, we reduce the feature dimensions and employ horizontal scaling to capture multiple pattern representations. Furthermore, we design a multi-branch U-Net architecture with uncertainty estimation to effectively learn diverse pattern representations. Extensive experiments on four public datasets demonstrate that Semi-KAN surpasses baseline networks, utilizing fewer KAN layers and lower computational cost, thereby underscoring the potential of KANs as a promising approach for SSMIS.

Disentangling Modes and Interference in the Spectrogram of Multicomponent Signals

Kévin Polisano,Sylvain Meignen,Nils Laurent,Hubert Leterme

Task: Investigate how the spectrogram of multicomponent signals can be decomposed into a mode part and an interference part.

Motivation: Improve the accuracy of time-frequency analysis in the presence of strong interference.

Details

Method: Explores two approaches: (i) a variational method inspired by texture-geometry decomposition in image processing, and (ii) a supervised learning approach using a U-Net architecture, trained on a dataset covering diverse interference patterns and noise conditions. Result: Numerical experiments illustrate the advantages and limitations of both approaches for spectrogram decomposition. Conclusion: Both approaches show potential for enhancing time-frequency analysis in the presence of strong interference. Abstract: In this paper, we investigate how the spectrogram of multicomponent signals can be decomposed into a mode part and an interference part. We explore two approaches: (i) a variational method inspired by texture-geometry decomposition in image processing, and (ii) a supervised learning approach using a U-Net architecture, trained on a dataset encompassing diverse interference patterns and noise conditions. Once the interference component is identified, we explain how it enables us to define a criterion to locally adapt the window length used in the definition of the spectrogram, for the sake of improving ridge detection in the presence of close modes. Numerical experiments illustrate the advantages and limitations of both approaches for spectrogram decomposition, highlighting their potential for enhancing time-frequency analysis in the presence of strong interference.

TGV: Tabular Data-Guided Learning of Visual Cardiac Representations

Marta Hasny,Maxime Di Folco,Keno Bressem,Julia Schnabel

Task: Use clinically relevant tabular data to identify distinct patient phenotypes and form more meaningful pairs in a contrastive learning framework.

Motivation: In medical imaging, the goal is often to compare entire patients with different phenotypes rather than multiple augmentations of a single scan.

Details

Method: Uses tabular attributes to guide the training of visual representations, without requiring a joint embedding space. Result: On short-axis cardiac MR images and clinical attributes from the UK Biobank, tabular data helps distinguish patient subgroups more effectively. On downstream tasks, including fine-tuning and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes, incorporating tabular data yields stronger visual representations than conventional methods that rely solely on image augmentations or combined image-tabular embeddings. Conclusion: Image encoders trained with tabular guidance can embed demographic information in their representations, allowing them to use insights from tabular data for unimodal predictions; this suits real-world medical settings where extensive clinical annotations may not be available at inference time. Abstract: Contrastive learning methods in computer vision typically rely on different views of the same image to form pairs. However, in medical imaging, we often seek to compare entire patients with different phenotypes rather than just multiple augmentations of one scan. We propose harnessing clinically relevant tabular data to identify distinct patient phenotypes and form more meaningful pairs in a contrastive learning framework. Our method uses tabular attributes to guide the training of visual representations, without requiring a joint embedding space. We demonstrate its strength using short-axis cardiac MR images and clinical attributes from the UK Biobank, where tabular data helps to more effectively distinguish between patient subgroups. Evaluation on downstream tasks, including fine-tuning and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes, shows that incorporating tabular data yields stronger visual representations than conventional methods that rely solely on image augmentations or combined image-tabular embeddings. Furthermore, we demonstrate that image encoders trained with tabular guidance are capable of embedding demographic information in their representations, allowing them to use insights from tabular data for unimodal predictions, making them well-suited to real-world medical settings where extensive clinical annotations may not be routinely available at inference time. The code will be available on GitHub.

Low-Complexity Patch-based No-Reference Point Cloud Quality Metric exploiting Weighted Structure and Texture Features

Michael Neri,Federica Battisti

Task: Propose PST-PCQA, a no-reference point cloud quality metric for assessing how distortions introduced during compression, transmission, and rendering affect overall quality.

Motivation: Compression, transmission, and rendering of point clouds introduce various artifacts that affect the quality perceived by the end user, and evaluating the impact of these distortions on overall quality is challenging.

Details

Method: PST-PCQA is a no-reference point cloud quality metric based on a low-complexity, learning-based framework; it analyzes individual patches and integrates local and global features to predict the Mean Opinion Score. Result: Tests on three state-of-the-art datasets show good prediction capabilities, assessed through the analysis of different feature pooling strategies and generalization across datasets. Conclusion: The lightweight structure of PST-PCQA makes it well suited to real-time applications and devices with limited computational capacity, and evaluating quality patch by patch offers clear benefits. Abstract: During the compression, transmission, and rendering of point clouds, various artifacts are introduced, affecting the quality perceived by the end user. However, evaluating the impact of these distortions on the overall quality is a challenging task. This study introduces PST-PCQA, a no-reference point cloud quality metric based on a low-complexity, learning-based framework. It evaluates point cloud quality by analyzing individual patches, integrating local and global features to predict the Mean Opinion Score. In summary, the process involves extracting features from patches, combining them, and using correlation weights to predict the overall quality. This approach allows us to assess point cloud quality without relying on a reference point cloud, making it particularly useful in scenarios where reference data is unavailable. Experimental tests on three state-of-the-art datasets show good prediction capabilities of PST-PCQA, through the analysis of different feature pooling strategies and its ability to generalize across different datasets. The ablation study confirms the benefits of evaluating quality on a patch-by-patch basis. Additionally, PST-PCQA's lightweight structure, with a small number of parameters to learn, makes it well-suited for real-time applications and devices with limited computational capacity. For reproducibility purposes, we made code, model, and pretrained weights available at https://github.com/michaelneri/PST-PCQA.

Semantic Segmentation of Transparent and Opaque Drinking Glasses with the Help of Zero-shot Learning

Annalena Blänsdorf,Tristan Wirth,Arne Rak,Thomas Pöllabauer,Volker Knauthe,Arjan Kuijper

Task: Propose TransCaGNet, a model for segmenting transparent structures in images.

Motivation: Transparent structures, such as common drinking glasses, are difficult to distinguish from the background in images.

Details

Method: Replaces the segmentation backbone of CaGNet with the Trans4Trans architecture and uses zero-shot learning to segment glass categories not provided during training. Proposes a new synthetic dataset and captures a real-world evaluation dataset. Result: With glasses assigned multiple possible categories, mean IoU and mean accuracy improve by up to 13.68% and 17.88% on the synthetic dataset; on the real-world dataset, mean IoU improves by up to 5.55% and mean accuracy by up to 5.72%. Conclusion: TransCaGNet performs well at segmenting transparent structures, and models trained on the synthetic dataset also improve on the real-world dataset. Abstract: Segmenting transparent structures in images is challenging since they are difficult to distinguish from the background. Common examples are drinking glasses, which are a ubiquitous part of our lives and appear in many different shapes and sizes. In this work, we propose TransCaGNet, a modified version of the zero-shot model CaGNet. We exchange the segmentation backbone with the architecture of Trans4Trans to be capable of segmenting transparent objects. Since some glasses are rarely captured, we use zero-shot learning to be able to create semantic segmentations of glass categories not given during training. We propose a novel synthetic dataset covering a diverse set of different environmental conditions. Additionally, we capture a real-world evaluation dataset since most applications take place in the real world. Comparing our model with ZegClip, we are able to show that TransCaGNet produces better mean IoU and accuracy values while ZegClip outperforms it mostly for unseen classes. To improve the segmentation results, we combine the semantic segmentation of the models with the segmentation results of SAM 2. Our evaluation emphasizes that distinguishing between different classes is challenging for the models due to similarity, points of view, or coverings. Taking this behavior into account, we assign glasses multiple possible categories. The modification leads to an improvement up to 13.68% for the mean IoU and up to 17.88% for the mean accuracy values on the synthetic dataset. Using our difficult synthetic dataset for training, the models produce even better results on the real-world dataset. The mean IoU is improved up to 5.55% and the mean accuracy up to 5.72% on the real-world dataset.
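
The two metrics reported above, mean IoU and mean accuracy, can be sketched as follows for flattened label maps. This is a generic illustration of the standard per-class definitions, not the evaluation code used for TransCaGNet; the function name and the choice to skip classes absent from both prediction and ground truth are assumptions of this sketch.

```python
def mean_iou_and_accuracy(predicted, target, num_classes):
    """Per-class IoU and per-class accuracy, averaged over classes that occur.

    Assumption: predicted and target are flat sequences of integer class labels.
    """
    intersection = [0] * num_classes
    predicted_count = [0] * num_classes
    target_count = [0] * num_classes
    for pred, true in zip(predicted, target):
        predicted_count[pred] += 1
        target_count[true] += 1
        if pred == true:
            intersection[true] += 1

    ious, accuracies = [], []
    for c in range(num_classes):
        union = predicted_count[c] + target_count[c] - intersection[c]
        if union > 0:
            ious.append(intersection[c] / union)
        if target_count[c] > 0:
            accuracies.append(intersection[c] / target_count[c])
    return sum(ious) / len(ious), sum(accuracies) / len(accuracies)

predicted = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
mean_iou, mean_accuracy = mean_iou_and_accuracy(predicted, target, num_classes=3)
print(mean_iou, mean_accuracy)
```

Because both metrics average over classes rather than pixels, rare glass categories weigh as much as common ones, which is why confusions between visually similar classes affect the scores noticeably.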

Universal Scene Graph Generation

Shengqiong Wu,Hao Fei,Tat-Seng Chua

Task: Propose Universal Scene Graph (USG), a representation that can fully characterize semantic scenes from any given combination of modality inputs.

Motivation: Current scene graph research is largely confined to single-modality scene modeling and cannot fully exploit the complementary strengths of different modality scene graph representations in depicting holistic scene semantics.

Details

Method: Designs a niche-targeting USG parser, USG-Par, with a modular architecture for end-to-end USG generation, including an object associator that relieves the modality gap and a text-centric scene contrastive learning mechanism. Result: Experiments show that USG expresses scene semantics more strongly than standalone scene graphs and that USG-Par achieves higher efficacy and performance. Conclusion: USG and USG-Par effectively address cross-modal object alignment and out-of-domain challenges, providing a more comprehensive representation of scene semantics. Abstract: Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained intensive research in SG generation. In the real world, multiple modalities often coexist, with different types, such as images, text, video, and 3D data, expressing distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing the full utilization of the complementary strengths of different modality SG representations in depicting holistic scene semantics. To this end, we introduce Universal SG (USG), a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks of cross-modal object alignment and out-of-domain challenges. We design the USG-Par with modular architecture for end-to-end USG generation, in which we devise an object associator to relieve the modality gap for cross-modal object alignment. Further, we propose a text-centric scene contrasting learning mechanism to mitigate domain imbalances by aligning multimodal objects and relations with textual SGs. Through extensive experiments, we demonstrate that USG offers a stronger capability for expressing scene semantics than standalone SGs, and also that our USG-Par achieves higher efficacy and performance.

Manifold Learning for Hyperspectral Images

Fethi Harkat,Tiphaine Deuberet,Guillaume Gey,Valérie Perrier,Kévin Polisano

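Task: Propose a method that approximates the dataset topology by constructing adjacency graphs, in order to improve the feature representation of X-ray transmission multi-energy images.

Motivation: Traditional feature extraction and projection techniques, such as Principal Component Analysis, represent X-ray transmission multi-energy images poorly, which limits the performance of neural networks in decision-making processes.

Details

Method: Adjacency graphs are constructed with Uniform Manifold Approximation and Projection (UMAP) to capture nonlinear correlations in the data. Result: The method significantly improves the performance of machine learning algorithms, particularly on hyperspectral images from X-ray transmission spectroscopy; it preserves the global structure of the data and also enhances feature separability. Conclusion: The proposed method yields more accurate and robust classification and significantly improves the processing of X-ray transmission multi-energy images. Abstract: Traditional feature extraction and projection techniques, such as Principal Component Analysis, struggle to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, limiting the performance of neural networks in decision-making processes. To address this issue, we propose a method that approximates the dataset topology by constructing adjacency graphs using the Uniform Manifold Approximation and Projection (UMAP). This approach captures nonlinear correlations within the data, significantly improving the performance of machine learning algorithms, particularly in processing Hyperspectral Images (HSI) from X-ray transmission spectroscopy. This technique not only preserves the global structure of the data but also enhances feature separability, leading to more accurate and robust classification results.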
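The adjacency-graph idea can be illustrated with a plain k-nearest-neighbour graph. This is only a minimal sketch of the graph-construction step, not UMAP itself (which additionally weights edges and optimizes a low-dimensional layout); the function name `knn_adjacency` and the toy `spectra` points are assumptions made for illustration.

```python
import math


def knn_adjacency(points, k):
    """Return a symmetric k-nearest-neighbour adjacency map for `points`."""
    n = len(points)
    neighbours = {i: set() for i in range(n)}
    for i in range(n):
        # Rank every other point by Euclidean distance to point i.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in order[:k]:
            neighbours[i].add(j)
            neighbours[j].add(i)  # symmetrize the edge
    return neighbours


# Hypothetical toy data: two well-separated clusters of 2D feature vectors.
spectra = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
graph = knn_adjacency(spectra, k=1)
```

In this toy case no edge crosses between the two clusters, which is the kind of local structure a graph-based projection can exploit and a purely linear projection may blur.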

Exploiting Diffusion Prior for Real-World Image Dehazing with Unpaired Training

Yunwei Lan,Zhigao Cui,Chang Liu,Jialun Peng,Nian Wang,Xin Luo,Dong Liu

Task: Exploit diffusion priors and physical priors for unpaired real-world image dehazing.

Motivation: Current methods generalize poorly across varied real scenes because of limited feature representation and insufficient use of real-world priors.

Details

Method: An unpaired framework named Diff-Dehazer is proposed, which uses diffusion priors as bijective mapping learners within CycleGAN and integrates physical priors to mine real-world knowledge. Result: Extensive experiments on multiple real-world datasets demonstrate the method's superior performance. Conclusion: By exploiting diffusion and physical priors, Diff-Dehazer significantly improves real-scene dehazing. Abstract: Unpaired training has been verified as one of the most effective paradigms for real scene dehazing by learning from unpaired real-world hazy and clear images. Although numerous studies have been proposed, current methods demonstrate limited generalization for various real scenes due to limited feature representation and insufficient use of real-world priors. Inspired by the strong generative capabilities of diffusion models in producing both hazy and clear images, we exploit diffusion prior for real-world image dehazing, and propose an unpaired framework named Diff-Dehazer. Specifically, we leverage diffusion prior as bijective mapping learners within the CycleGAN, a classic unpaired learning framework. Considering that physical priors contain pivotal statistical information of real-world data, we further excavate real-world knowledge by integrating physical priors into our framework. Furthermore, we introduce a new perspective for adequately leveraging the representation ability of diffusion models by removing degradation in image and text modalities, so as to improve the dehazing effect. Extensive experiments on multiple real-world datasets demonstrate the superior performance of our method. Our code is available at https://github.com/ywxjm/Diff-Dehazer.

Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Shengqiong Wu,Hao Fei,Jingkang Yang,Xiangtai Li,Juncheng Li,Hanwang Zhang,Tat-seng Chua

Task: Propose a novel framework that leverages rich 2D visual scene annotations to enhance 4D scene learning, addressing data scarcity in 4D Panoptic Scene Graph (4D-PSG) generation.

Motivation: Current 4D-PSG research suffers from data scarcity and out-of-vocabulary problems, and the pipeline nature of the benchmark generation method leads to suboptimal performance.

Details

Method: A 4D Large Language Model (4D-LLM) is integrated with a 3D mask decoder, a chained scene graph inference mechanism is designed, and a 2D-to-4D visual scene transfer learning framework is proposed. Result: Extensive experiments on the benchmark data show that the method significantly outperforms baseline models. Conclusion: The proposed method effectively addresses data scarcity in 4D-PSG generation and significantly improves performance. Abstract: The recently introduced 4D Panoptic Scene Graph (4D-PSG) provides the most advanced representation to date for comprehensively modeling the dynamic 4D visual real world. Unfortunately, current pioneering 4D-PSG research suffers severely from data scarcity, as well as the resulting out-of-vocabulary problems; also, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to infer accurate and comprehensive object and relation labels iteratively. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that we outperform baseline models by a large margin, highlighting the effectiveness of our method.

Saad Lahlali,Sandra Kara,Hejer Ammar,Florian Chabot,Nicolas Granger,Hervé Le Borgne,Quoc-Cuong Pham

Task: Propose a novel framework that leverages 2D motion cues for multi-object discovery in 3D data.

Motivation: Although object discovery has received wide attention in 2D image analysis, it remains under-explored in 3D data, where existing approaches rely on 3D motion despite its many challenges.

Details

Method: Two frameworks, DIOD-3D and xMOD, are proposed: DIOD-3D uses 2D motion for multi-object discovery in 3D data, while xMOD is a cross-modal training framework that integrates 2D and 3D data. Result: Extensive evaluation on a synthetic dataset (TRIP-PD) and real-world datasets (KITTI and Waymo) shows F1@50 gains of +8.7 to +15.1 over the state of the art in 2D object discovery. Conclusion: The proposed method delivers substantial performance gains in 3D object discovery, particularly through its use of 2D motion cues. Abstract: Object discovery, which refers to the task of localizing objects without human annotations, has gained significant attention in 2D image analysis. However, despite this growing interest, it remains under-explored in 3D data, where approaches rely exclusively on 3D motion, despite its several challenges. In this paper, we present a novel framework that leverages advances in 2D object discovery which are based on 2D motion to exploit the advantages of such motion cues being more flexible and generalizable and to bridge the gap between 2D and 3D modalities. Our primary contributions are twofold: (i) we introduce DIOD-3D, the first baseline for multi-object discovery in 3D data using 2D motion, incorporating scene completion as an auxiliary task to enable dense object localization from sparse input data; (ii) we develop xMOD, a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues. xMOD employs a teacher-student training paradigm across the two modalities to mitigate confirmation bias by leveraging the domain gap. During inference, the model supports both RGB-only and point cloud-only inputs. Additionally, we propose a late-fusion technique tailored to our pipeline that further enhances performance when both modalities are available at inference. We evaluate our approach extensively on synthetic (TRIP-PD) and challenging real-world datasets (KITTI and Waymo). Notably, our approach yields a substantial performance improvement compared with the 2D object discovery state-of-the-art on all datasets with gains ranging from +8.7 to +15.1 in F1@50 score. The code is available at https://github.com/CEA-LIST/xMOD
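For readers unfamiliar with the F1@50 metric, the sketch below shows one common way to compute it: a prediction counts as a true positive when its IoU with a not-yet-matched ground-truth object is at least 0.5. The abstract does not state the exact protocol, so the use of axis-aligned boxes, the greedy matching order, and the names `iou` and `f1_at_iou` are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds, gts, thr=0.5):
    """F1 score where a prediction matches a ground truth if IoU >= thr."""
    matched = set()
    true_pos = 0
    for p in preds:
        best_j, best_iou = None, thr
        for j, g in enumerate(gts):
            if j in matched:
                continue  # each ground truth can be matched only once
            value = iou(p, g)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(gts) if gts else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one good detection, one false positive, one miss.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]
score = f1_at_iou(preds, gts)  # precision = recall = 0.5
```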

Bridging the Gap: Fusing CNNs and Transformers to Decode the Elegance of Handwritten Arabic Script

Chaouki Boufenar,Mehdi Ayoub Rabiai,Boualem Nadjib Zahaf,Khelil Rafik Ouaras

Task: Propose a hybrid approach combining convolutional neural networks (CNNs) and Transformer architectures for handwritten Arabic script recognition.

Motivation: Handwritten Arabic script recognition is challenging because of the script's dynamic letter forms and contextual variations.

Details

Method: CNN and Transformer architectures are combined; custom and fine-tuned models, including EfficientNet-B7 and Vision Transformer (ViT-B16), are evaluated, and an ensemble model based on confidence-based fusion is introduced. Result: On the IFN/ENIT dataset, the ensemble achieves 96.38% accuracy for letter classification and 97.22% for positional classification. Conclusion: The complementary nature of CNNs and Transformers demonstrates their potential for Arabic handwriting recognition and offers a scalable solution for OCR systems. Abstract: Handwritten Arabic script recognition is a challenging task due to the script's dynamic letter forms and contextual variations. This paper proposes a hybrid approach combining convolutional neural networks (CNNs) and Transformer-based architectures to address these complexities. We evaluated custom and fine-tuned models, including EfficientNet-B7 and Vision Transformer (ViT-B16), and introduced an ensemble model that leverages confidence-based fusion to integrate their strengths. Our ensemble achieves remarkable performance on the IFN/ENIT dataset, with 96.38% accuracy for letter classification and 97.22% for positional classification. The results highlight the complementary nature of CNNs and Transformers, demonstrating their combined potential for robust Arabic handwriting recognition. This work advances OCR systems, offering a scalable solution for real-world applications.
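One simple form of confidence-based fusion is to trust whichever model is more confident on a given sample. The abstract does not specify the fusion rule, so the sketch below is an assumption: `fuse_by_confidence` and the example probability vectors are hypothetical, and the actual ensemble may weight or combine the outputs differently.

```python
def fuse_by_confidence(prob_a, prob_b):
    """Return the class index predicted by the more confident model.

    `prob_a` and `prob_b` are per-class probability lists from two models.
    Ties are resolved in favour of the first model.
    """
    conf_a, conf_b = max(prob_a), max(prob_b)
    chosen = prob_a if conf_a >= conf_b else prob_b
    return chosen.index(max(chosen))


# Hypothetical outputs for a three-class letter classifier.
cnn_probs = [0.55, 0.40, 0.05]  # CNN is unsure between class 0 and class 1
vit_probs = [0.05, 0.90, 0.05]  # ViT is confident in class 1
fused = fuse_by_confidence(cnn_probs, vit_probs)
```

Here the fused prediction follows the ViT because its top probability (0.90) exceeds the CNN's (0.55).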

Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

Jin Wang,Chenghui Lv,Xian Li,Shichao Dong,Huadong Li,kelu Yao,Chao Li,Wenqi Shao,Ping Luo

Task: Design a comprehensive benchmark suite to evaluate the ability of Large Vision Language Models (LVLMs) to detect forged media.

Motivation: With the rapid development of AIGC, the diversity of fake media has increased significantly, posing unprecedented threats to social security, politics, law, and other areas. Detecting such diverse malicious fake media calls for a comprehensive benchmark suite that evaluates the capabilities of LVLMs.

Details

Method: Forensics-Bench is proposed, a forgery detection evaluation benchmark suite comprising 63,292 meticulously curated multi-choice visual questions that cover 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models. Result: Thorough evaluations of 22 open-source LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet) highlight the significant challenges that Forensics-Bench poses for comprehensive forgery detection. Conclusion: Forensics-Bench is expected to motivate the community to advance the frontier of LVLMs toward all-around forgery detectors in the AIGC era. Abstract: Recently, the rapid development of AIGC has significantly boosted the diversity of fake media spread on the Internet, posing unprecedented threats to social security, politics, law, etc. To detect the increasingly diverse malicious fake media in the new era of AIGC, recent studies have proposed to exploit Large Vision Language Models (LVLMs) to design robust forgery detectors due to their impressive performance on a wide range of multimodal tasks. However, a comprehensive benchmark designed to assess LVLMs' discerning capabilities on forgery media is still lacking. To fill this gap, we present Forensics-Bench, a new forgery detection evaluation benchmark suite to assess LVLMs across massive forgery detection tasks, requiring comprehensive recognition, location and reasoning capabilities on diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types and forgery models. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC. The deliverables will be updated at https://Forensics-Bench.github.io/.

Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation

Suhyeon Lee,Kwanyoung Kim,Jong Chul Ye

Task: Propose a novel framework, Implicit Bridge Consistency Distillation (IBCD), for single-step bidirectional unpaired image translation.

Motivation: Methods based on diffusion models or Schrödinger bridges have not been widely adopted in real-world applications because of their iterative sampling nature.

Details

Method: IBCD uses a diffusion implicit bridge model to connect PF-ODE trajectories, and introduces distribution matching and an adaptive weighting method based on distillation difficulty. Result: Experiments show that IBCD achieves state-of-the-art performance on benchmark datasets in a single generation step. Conclusion: IBCD performs well in single-step bidirectional unpaired image translation and has broad application potential. Abstract: Unpaired image-to-image translation has seen significant progress since the introduction of CycleGAN. However, methods based on diffusion models or Schrödinger bridges have yet to be widely adopted in real-world applications due to their iterative sampling nature. To address this challenge, we propose a novel framework, Implicit Bridge Consistency Distillation (IBCD), which enables single-step bidirectional unpaired translation without using adversarial loss. IBCD extends consistency distillation by using a diffusion implicit bridge model that connects PF-ODE trajectories between distributions. Additionally, we introduce two key improvements: 1) distribution matching for consistency distillation and 2) adaptive weighting method based on distillation difficulty. Experimental results demonstrate that IBCD achieves state-of-the-art performance on benchmark datasets in a single generation step. Project page available at https://hyn2028.github.io/project_page/IBCD/index.html

Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis

Imanol G. Estepa,Jesús M. Rodríguez-de-Vera,Ignacio Sarasúa,Bhalaji Nagarajan,Petia Radeva

Task: Propose Sorcen, a novel unified self-supervised learning framework with a Contrastive-Reconstruction objective, to unify representation learning and generative modeling of visual data.

Motivation: Existing unified self-supervised learning methods rely on semantic token reconstruction, which requires an external tokenizer and adds training overhead. Sorcen aims to reduce computational overhead and improve performance by introducing a synergic Contrastive-Reconstruction objective.

Details

Method: Sorcen combines a contrastive objective (Echo Contrast) with a reconstruction objective, using its generative capabilities to generate an echo sample in the semantic token space that forms the contrastive positive pair. Sorcen operates only on precomputed tokens and needs no online token transformation, which significantly reduces computational overhead. Result: Experiments on ImageNet-1k show that Sorcen outperforms the previous unified SSL SoTA on linear probing, unconditional image generation, few-shot learning, and transfer learning, while being 60.8% more efficient. Conclusion: Sorcen achieves significant improvements in unified SSL models, performing especially well in unconditional image generation and linear probing. Abstract: While representation learning and generative modeling seek to understand visual data, unifying both domains remains unexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they rely solely on semantic token reconstruction, which requires an external tokenizer during training -- introducing a significant overhead. In this work, we introduce Sorcen, a novel unified SSL framework, incorporating a synergic Contrastive-Reconstruction objective. Our Contrastive objective, "Echo Contrast", leverages the generative capabilities of Sorcen, eliminating the need for additional image crops or augmentations during training. Sorcen "generates" an echo sample in the semantic token space, forming the contrastive positive pair. Sorcen operates exclusively on precomputed tokens, eliminating the need for an online token transformation during training, thereby significantly reducing computational overhead. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Additionally, Sorcen surpasses previous single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.

MultiBARF: Integrating Imagery of Different Wavelength Regions by Using Neural Radiance Fields

Kana Kurata,Hitoshi Niigaki,Xiaojun Wu,Ryuichi Tanida

Task: Develop a method named MultiBARF that simplifies data preparation for images from different sensors.

Motivation: Make data preparation easier for users unfamiliar with sensing and image processing by reducing the need for high sensing and image processing expertise.

Details

Method: Pairs of two different sensor images and depth images are synthesized in place of co-registration and geometric calibration, extending Bundle Adjusting Neural Radiance Fields (BARF), a deep neural network-based method. Result: Experiments on visible light and thermographic images show that the method superimposes the color channels of the two sensor images on NeRF. Conclusion: MultiBARF effectively simplifies data preparation for images from different sensors and shows potential for visible light and thermographic imagery. Abstract: Optical sensor applications have become popular through digital transformation. Linking observed data to real-world locations and combining different image sensors is essential to make the applications practical and efficient. However, data preparation to try different sensor combinations requires high sensing and image processing expertise. To make data preparation easier for users unfamiliar with sensing and image processing, we have developed MultiBARF. This method replaces the co-registration and geometric calibration by synthesizing pairs of two different sensor images and depth images at assigned viewpoints. Our method extends Bundle Adjusting Neural Radiance Fields (BARF), a deep neural network-based novel view synthesis method, for the two imagers. Through experiments on visible light and thermographic images, we demonstrate that our method superimposes two color channels of those sensor images on NeRF.

An Investigation of Beam Density on LiDAR Object Detection Performance

Christoph Griesbacher,Christian Fruhwirth-Reisinger

Task: Investigate how 3D object detection performance differs across LiDAR sensor settings, in particular the impact of beam density in cross-domain scenarios.

Motivation: Autonomous vehicles need to perceive their surroundings precisely to make informed decisions, and LiDAR sensors are key to this capability. However, variations between training and inference data can cause significant performance drops across sensor settings, with beam density being one critical factor.

Details

Method: Different object detection architectures are evaluated, including ones that combine voxel- and point-based approaches; the beam-density-induced domain gap is studied and its relation to other domain shifts is analyzed. Result: Experiments show that combining voxel- and point-based approaches yields superior cross-domain performance, and that detectors trained on denser data are robust to beam density variations. Conclusion: Beam-density-induced domain gaps must be evaluated in conjunction with other domain shifts, and detectors trained on denser data are robust to beam density variations. Abstract: Accurate 3D object detection is a critical component of autonomous driving, enabling vehicles to perceive their surroundings with precision and make informed decisions. LiDAR sensors, widely used for their ability to provide detailed 3D measurements, are key to achieving this capability. However, variations between training and inference data can cause significant performance drops when object detection models are employed in different sensor settings. One critical factor is beam density, as inference on sparse, cost-effective LiDAR sensors is often preferred in real-world applications. Despite previous work addressing the beam-density-induced domain gap, substantial knowledge gaps remain, particularly concerning dense 128-beam sensors in cross-domain scenarios. To gain a better understanding of the impact of beam density on domain gaps, we conduct a comprehensive investigation that includes an evaluation of different object detection architectures. Our architecture evaluation reveals that combining voxel- and point-based approaches yields superior cross-domain performance by leveraging the strengths of both representations. Building on these findings, we analyze beam-density-induced domain gaps and argue that these domain gaps must be evaluated in conjunction with other domain shifts. Contrary to conventional beliefs, our experiments reveal that detectors benefit from training on denser data and exhibit robustness to beam density variations during inference.
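A common way to study beam density is to simulate a sparser sensor by dropping beams from a dense scan. The abstract does not describe the authors' procedure, so the following is only a hedged sketch: the point layout `(x, y, z, ring)`, the function name `subsample_beams`, and the keep-every-k-th-ring rule are all assumptions.

```python
def subsample_beams(points, factor):
    """Keep only points whose beam (ring) index is a multiple of `factor`.

    Each point is a tuple (x, y, z, ring), where `ring` is the integer index
    of the laser beam that produced it.
    """
    return [p for p in points if p[3] % factor == 0]


# Hypothetical dense scan: one point per beam of a 128-beam sensor.
dense_scan = [(float(ring), 0.0, 0.0, ring) for ring in range(128)]

# Keeping every 4th beam emulates a 32-beam sensor.
sparse_scan = subsample_beams(dense_scan, factor=4)
```

Evaluating a detector trained on `dense_scan`-like data against `sparse_scan`-like data isolates the beam-density shift from other domain differences, which is the kind of separation the paper argues is needed.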

When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning

Yang Liu,Qianqian Xu,Peisong Wen,Siran Dai,Qingming Huang

Task: Propose T-CoRe, a self-supervised framework for video representation learning.

Motivation: Address two key challenges in self-supervised video learning: 1) the uncertainty introduced by random temporal sampling; 2) the insufficient information compression that results from previous MVM methods recovering masked patches in pixel space.

Details

Method: A sandwich sampling strategy is proposed to reduce reconstruction uncertainty, and an auxiliary branch is introduced into a self-distillation architecture to restore representations in the latent space. Result: T-CoRe shows superior performance across several downstream tasks. Conclusion: T-CoRe performs well in video representation learning, and its code is open source. Abstract: The past decade has witnessed notable achievements in self-supervised learning for video tasks. Recent efforts typically adopt the Masked Video Modeling (MVM) paradigm, leading to significant progress on multiple video tasks. However, two critical challenges remain: 1) Without human annotations, the random temporal sampling introduces uncertainty, increasing the difficulty of model training. 2) Previous MVM methods primarily recover the masked patches in the pixel space, leading to insufficient information compression for downstream tasks. To address these challenges jointly, we propose a self-supervised framework that leverages Temporal Correspondence for video Representation learning (T-CoRe). For challenge 1), we propose a sandwich sampling strategy that selects two auxiliary frames to reduce reconstruction uncertainty in a two-side-squeezing manner. Addressing challenge 2), we introduce an auxiliary branch into a self-distillation architecture to restore representations in the latent space, generating high-level semantic representations enriched with temporal information. Experiments with T-CoRe consistently show superior performance across several downstream tasks, demonstrating its effectiveness for video representation learning. The code is available at https://github.com/yafeng19/T-CORE.
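The sandwich sampling idea, one auxiliary frame on each side of the frame to be reconstructed, can be sketched as follows. The abstract gives no sampling details, so the function name `sandwich_sample`, the `max_gap` parameter, and the uniform choice of offsets are assumptions made purely for illustration.

```python
import random


def sandwich_sample(num_frames, max_gap, rng=random):
    """Pick (before, target, after) frame indices with before < target < after.

    The target frame is "squeezed" between two auxiliary frames that lie at
    most `max_gap` frames away on either side. Requires num_frames >= 3 and
    max_gap >= 1.
    """
    # The target needs at least one frame available on each side.
    target = rng.randrange(1, num_frames - 1)
    before = rng.randrange(max(0, target - max_gap), target)
    after = rng.randrange(target + 1, min(num_frames, target + max_gap + 1))
    return before, target, after


# Hypothetical usage on a 16-frame clip with a fixed seed for repeatability.
sampler_rng = random.Random(0)
before, target, after = sandwich_sample(num_frames=16, max_gap=4, rng=sampler_rng)
```

Compared with sampling a single random reference frame, bounding the target from both sides limits how far the scene can have drifted in either temporal direction.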

Distilling 3D distinctive local descriptors for 6D pose estimation

Amir Hamza,Andrea Caraffa,Davide Boscaini,Fabio Poiesi

Task: Train an efficient student model through a knowledge distillation framework to regress local descriptors from a GeDi teacher.

Motivation: GeDi performs strongly in zero-shot 6D pose estimation, but its expensive inference process makes it impractical for real-world applications.

Details

Method: A knowledge distillation framework is introduced, comprising an efficient large-scale training procedure and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. Result: Validated on five BOP Benchmark datasets, the method significantly reduces inference time while maintaining performance competitive with existing methods. Conclusion: The method brings zero-shot 6D pose estimation closer to real-time feasibility. Abstract: Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process. Can we retain GeDi's effectiveness while significantly improving its efficiency? In this paper, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility. Project Website: https://tev-fbk.github.io/dGeDi/

GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation

Zinqin Huang,Gu Wang,Chenyangguang Zhang,Ruida Zhang,Xiu Li,Xiangyang Ji

Task: Propose a new category-level object pose estimation method that improves accuracy by eliminating intra-class variation.

Motivation: Existing RGBD-based category-level object pose estimation methods rely on precise depth information, which limits their broader applicability. RGB-based methods have been developed in response, but existing geometry-guided pose regression suffers from intra-class variation in category-level tasks, leading to suboptimal results.

Details

Method: The Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model, is proposed; combining it with the strengths of the NOCS map yields the GIVEPose framework, which gradually eliminates intra-class variation. Result: Extensive evaluations on synthetic and real-world datasets show that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches in category-level object pose estimation. Conclusion: By eliminating intra-class variation, GIVEPose significantly improves the accuracy of category-level object pose estimation and offers a new direction for RGB-based methods. Abstract: Recent advances in RGBD-based category-level object pose estimation have been limited by their reliance on precise depth information, restricting their broader applicability. In response, RGB-based methods have been developed. Among these methods, geometry-guided pose regression that originated from instance-level tasks has demonstrated strong performance. However, we argue that the NOCS map is an inadequate intermediate representation for geometry-guided pose regression methods, as its many-to-one correspondence with category-level pose introduces redundant instance-specific information, resulting in suboptimal results. This paper identifies the intra-class variation problem inherent in pose regression based solely on the NOCS map and proposes the Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model. By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements Gradual Intra-class Variation Elimination for category-level object pose estimation. Extensive evaluations on both synthetic and real-world datasets demonstrate that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches, achieving substantial improvements in category-level object pose estimation. Our code is available at https://github.com/ziqin-h/GIVEPose.

Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation

Haoyu Ji,Bowen Chen,Weihong Ren,Wenze Huang,Zhihao Yang,Zhiyong Wang,Honghai Liu

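Task: Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements.

Motivation: Current STAS methods typically use spatio-temporal modeling to establish dependencies among joints and frames, and supervise frame-wise classification with one-hot encoding and cross-entropy loss. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements.

Details

Method: A Text-Derived Relational Graph-Enhanced Network (TRG-Net) is proposed, which uses prior graphs generated by Large Language Models (LLMs) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method combines Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to model spatial relations effectively, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method uses contrastive learning to regularize the absolute class distributions and uses Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. In addition, a Spatial-Aware Enhancement Processing (SAEP) method is proposed, which uses random joint occlusion and axial rotation to enhance spatial generalization. Result: Performance evaluations on four public datasets show that TRG-Net achieves state-of-the-art results. Conclusion: By introducing text-derived relational graphs and spatial-aware enhancement processing, TRG-Net significantly improves skeleton-based temporal action segmentation. Abstract: Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLM) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results.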
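The two augmentations named for SAEP, random joint occlusion and axial rotation, can be sketched on a single skeleton frame. The abstract does not give the exact procedure, so the choices below (rotation about the vertical y-axis, occluded joints set to the origin, and the function names `rotate_about_y` and `occlude_joints`) are assumptions for illustration only.

```python
import math
import random


def rotate_about_y(joints, angle):
    """Rotate every (x, y, z) joint about the vertical y-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in joints]


def occlude_joints(joints, num_occluded, rng):
    """Replace `num_occluded` randomly chosen joints with the origin."""
    hidden = set(rng.sample(range(len(joints)), num_occluded))
    return [(0.0, 0.0, 0.0) if i in hidden else j for i, j in enumerate(joints)]


# Hypothetical five-joint skeleton frame.
frame = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
         (1.0, 1.0, 0.0), (0.5, 0.5, 0.5)]
aug_rng = random.Random(0)
rotated = rotate_about_y(frame, aug_rng.uniform(-math.pi, math.pi))
augmented = occlude_joints(rotated, num_occluded=2, rng=aug_rng)
```

The rotation preserves bone lengths and each joint's height, so the skeleton stays physically plausible while its viewpoint changes.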

VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention

Mingzhe Zheng,Yongqi Xu,Haojian Huang,Xuran Ma,Yexin Liu,Wenjie Shu,Yatian Pang,Feilong Tang,Qifeng Chen,Harry Yang,Ser-Nam Lim

Task: Automate multi-shot video generation from a single sentence.

Motivation: Existing video generation models excel at short clips, but when generating cohesive multi-shot narratives they suffer from disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, which limits their practicality for movie-like content.

Details

Method: The VideoGen-of-Thought (VGoT) framework is proposed; through dynamic storyline modeling, identity-aware cross-shot propagation, and adjacent latent transition mechanisms, it systematically addresses three core challenges: narrative fragmentation, visual inconsistency, and transition artifacts. Result: Multi-shot videos generated by VGoT outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and 10x fewer manual adjustments. Conclusion: VGoT generates cohesive multi-shot videos, significantly outperforms existing methods, and shows strong practicality and potential. Abstract: Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative Fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which first converts the user prompt into concise shot descriptions, then elaborates them into detailed, cinematic specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, HDR lighting), ensuring logical narrative progression with self-validation. (2) Visual Inconsistency: Existing approaches struggle with maintaining visual consistency across shots. Our identity-aware cross-shot propagation generates identity-preserving portrait (IPP) tokens that maintain character fidelity while allowing trait variations (expressions, aging) dictated by the storyline. (3) Transition Artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity.
VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and 10x fewer manual adjustments than alternatives.

Object-Centric Pretraining via Target Encoder Bootstrapping

Nikola Đukić,Tim Lebailly,Tinne Tuytelaars

Task: Propose a new self-distillation setup for training object-centric models from scratch.

Motivation: Existing object-centric models rely on pretrained non-object-centric foundation models whose features serve as reconstruction targets; because these targets must remain frozen throughout training, they set an upper bound on model performance.

Details

Method: Object-CEntric Pretraining by Target Encoder BOotstrapping (OCEBO) is proposed: the target encoder is updated as an exponential moving average of the object-centric model, which enriches it with object-centric inductive biases, and a cross-view patch filtering approach is introduced to mitigate slot collapse. Result: After pretraining on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of object-centric models with frozen non-object-centric target encoders. Conclusion: With its self-distillation setup and cross-view patch filtering, OCEBO trains object-centric models from scratch on real-world data and reaches performance comparable to models that rely on frozen pretrained targets. Abstract: Object-centric representation learning has recently been successfully applied to real-world datasets. This success can be attributed to pretrained non-object-centric foundation models, whose features serve as reconstruction targets for slot attention. However, targets must remain frozen throughout the training, which sets an upper bound on the performance object-centric models can attain. Attempts to update the target encoder by bootstrapping result in large performance drops, which can be attributed to its lack of object-centric inductive biases, causing the object-centric model's encoder to drift away from representations useful as reconstruction targets. To address these limitations, we propose Object-CEntric Pretraining by Target Encoder BOotstrapping (OCEBO), a self-distillation setup for training object-centric models from scratch, on real-world data, for the first time ever. In OCEBO, the target encoder is updated as an exponential moving average of the object-centric model, thus explicitly being enriched with object-centric inductive biases introduced by slot attention while removing the upper bound on performance present in other models. We mitigate the slot collapse caused by random initialization of the target encoder by introducing a novel cross-view patch filtering approach that limits the supervision to sufficiently informative patches. When pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of object-centric models with frozen non-object-centric target encoders pretrained on hundreds of millions of images. The code and pretrained models are publicly available at https://github.com/djukicn/ocebo.
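The exponential-moving-average update of a target encoder is a standard self-distillation building block and can be written in a few lines. This is a generic sketch, not OCEBO's code: parameters are plain floats in dictionaries, and the name `ema_update` and the momentum values are assumptions.

```python
def ema_update(target, online, momentum=0.99):
    """Return new target parameters: momentum * target + (1 - momentum) * online.

    `target` and `online` map parameter names to float values. The target
    follows the online (trained) model slowly, which keeps the reconstruction
    targets stable while still letting them improve over training.
    """
    return {
        name: momentum * target[name] + (1.0 - momentum) * online[name]
        for name in target
    }


# Hypothetical single-parameter example with a low momentum for visibility.
target_params = {"w": 0.0}
online_params = {"w": 1.0}
target_params = ema_update(target_params, online_params, momentum=0.9)
```

After one step the target weight has moved a tenth of the way toward the online weight; repeated steps make it converge toward the online model if the latter stops changing.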

PointSFDA: Source-free Domain Adaptation for Point Cloud Completion

Xing He,Zhe Zhu,Liangliang Nan,Honghua Chen,Jing Qin,Mingqiang Wei

Task: Propose PointSFDA, a source-free domain adaptation framework for point cloud completion.

Motivation: Conventional point cloud completion methods face significant challenges when applied to real-world scans, especially when the source data is inaccessible.

Details

Method: A coarse-to-fine distillation solution is proposed, together with a self-supervised partial-mask consistency training strategy. Result: Experiments confirm that the method significantly improves cross-domain shape completion performance. Conclusion: PointSFDA effectively improves point cloud completion without access to source data. Abstract: Conventional methods for point cloud completion, typically trained on synthetic datasets, face significant challenges when applied to out-of-distribution real-world scans. In this paper, we propose an effective yet simple source-free domain adaptation framework for point cloud completion, termed PointSFDA. Unlike unsupervised domain adaptation that reduces the domain gap by directly leveraging labeled source data, PointSFDA uses only a pretrained source model and unlabeled target data for adaptation, avoiding the need for inaccessible source data in practical scenarios. Being the first source-free domain adaptation architecture for point cloud completion, our method offers two core contributions. First, we introduce a coarse-to-fine distillation solution to explicitly transfer the global geometry knowledge learned from the source dataset. Second, as noise may be introduced due to domain gaps, we propose a self-supervised partial-mask consistency training strategy to learn local geometry information in the target domain. Extensive experiments have validated that our method significantly improves the performance of state-of-the-art networks in cross-domain shape completion. Our code is available at https://github.com/Starak-x/PointSFDA.

ARC: Anchored Representation Clouds for High-Resolution INR Classification

Joost Luijmes,Alexander Gielisse,Roman Knyazhitskiy,Jan van Gemert

Task: Propose ARC, a novel implicit neural representation (INR) architecture, for image classification.

Motivation: Current INR image classification methods are demonstrated on low-resolution data, are sensitive to image-space transformations, and lack mechanisms for local representation.

Details

Method: ARC (Anchored Representation Clouds) is proposed, which introduces spatial structure by explicitly anchoring local latent vectors in image space. Result: ARC achieves state-of-the-art implicit image classification on both low- and high-resolution images and improves robustness to image-space translation. Conclusion: By introducing a mechanism for local representation, ARC significantly improves INR-based image classification. Abstract: Implicit neural representations (INRs) encode signals in neural network weights as a memory-efficient representation, decoupling sampling resolution from the associated resource costs. Current INR image classification methods are demonstrated on low-resolution data and are sensitive to image-space transformations. We attribute these issues to the global, fully-connected MLP neural network architecture encoding of current INRs, which lack mechanisms for local representation: MLPs are sensitive to absolute image location and struggle with high-frequency details. We propose ARC: Anchored Representation Clouds, a novel INR architecture that explicitly anchors latent vectors locally in image-space. By introducing spatial structure to the latent vectors, ARC captures local image data which in our testing leads to state-of-the-art implicit image classification of both low- and high-resolution images and increased robustness against image-space translation. Code can be found at https://github.com/JLuij/anchored_representation_clouds.

UltraFlwr -- An Efficient Federated Medical and Surgical Object Detection Framework

Yang Li,Soumya Snigdha Kundu,Maxence Boels,Toktam Mahmoodi,Sebastien Ourselin,Tom Vercauteren,Prokar Dasgupta,Jonathan Shapey,Alejandro Granados

Task: Propose UltraFlwr, a federated learning framework for medical and surgical object detection, and design the YOLO-PA strategies to reduce communication overhead.

Motivation: Address the challenges of deploying medical and surgical object detection at the edge, including limited high-quality annotated data, data-sharing restrictions, and constrained computational resources.

Details

Method: Uses federated learning (FL) for decentralized model training across multiple sites, and proposes the YOLO-PA strategies to reduce the communication overhead of YOLO models in FL. Result: YOLO-PA reduces per-round communication overhead by up to 83% while outperforming client-wise centralized training and full aggregation strategies on the BCCD and m2cai16-tool-locations datasets. Conclusion: UltraFlwr makes it more feasible to train and deploy detection models on resource-constrained edge devices, making federated object detection more practical for time-critical and resource-constrained medical and surgical applications. Abstract: Object detection shows promise for medical and surgical applications such as cell counting and tool tracking. However, it faces multiple real-world edge deployment challenges including limited high-quality annotated data, data sharing restrictions, and computational constraints. In this work, we introduce UltraFlwr, a framework for federated medical and surgical object detection. By leveraging Federated Learning (FL), UltraFlwr enables decentralized model training across multiple sites without sharing raw data. To further enhance UltraFlwr's efficiency, we propose YOLO-PA, a set of novel Partial Aggregation (PA) strategies specifically designed for YOLO models in FL. YOLO-PA significantly reduces communication overhead by up to 83% per round while maintaining performance comparable to Full Aggregation (FA) strategies. Our extensive experiments on BCCD and m2cai16-tool-locations datasets demonstrate that YOLO-PA not only provides better client models compared to client-wise centralized training and FA strategies, but also facilitates efficient training and deployment across resource-constrained edge devices. Further, we also establish one of the first benchmarks in federated medical and surgical object detection. This paper advances the feasibility of training and deploying detection models on the edge, making federated object detection more practical for time-critical and resource-constrained medical and surgical applications. UltraFlwr is publicly available at https://github.com/KCL-BMEIS/UltraFlwr.
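
The idea of partial aggregation can be illustrated with a minimal sketch. It is not the UltraFlwr implementation: the abstract does not say which YOLO components are aggregated, so the parameter names, the `shared_prefixes` argument, and the choice to share only a "backbone" are hypothetical. The sketch only shows that averaging a subset of parameters reduces what each round must communicate.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]


def partial_aggregate(client_weights: Sequence[Weights],
                      client_sizes: Sequence[int],
                      shared_prefixes: Sequence[str]) -> Weights:
    """Size-weighted average of only the parameters selected for sharing.

    Parameters whose names do not start with one of `shared_prefixes`
    (a hypothetical selection rule) are never sent, so each client keeps
    its own local copy of them.
    """
    total = sum(client_sizes)
    prefixes = tuple(shared_prefixes)
    aggregated: Weights = {}
    for name, reference in client_weights[0].items():
        if not name.startswith(prefixes):
            continue  # not communicated in this round
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(reference))
        ]
    return aggregated


def communicated_fraction(weights: Weights, shared_prefixes: Sequence[str]) -> float:
    """Fraction of scalar parameters that a client has to transmit."""
    prefixes = tuple(shared_prefixes)
    shared = sum(len(v) for k, v in weights.items() if k.startswith(prefixes))
    return shared / sum(len(v) for v in weights.values())


# Toy example with two clients holding 1 and 3 samples.
clients = [
    {"backbone.w": [1.0, 3.0], "head.w": [10.0]},
    {"backbone.w": [3.0, 5.0], "head.w": [20.0]},
]
global_update = partial_aggregate(clients, [1, 3], ["backbone."])
```

In this toy setup only two of the three scalars are transmitted per client; a full aggregation strategy would pass every prefix.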

Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU

Àlex Pujol Vidal,Sergio Escalera,Kamal Nasrollahi,Thomas B. Moeslund

Task: Investigate machine unlearning in hyperbolic contrastive learning, a setting designed to better capture semantic hierarchies.

Motivation: Explore the effectiveness of concept removal in hyperbolic space, filling a gap left by existing work, which has focused on Euclidean contrastive vision-language models.

Details

Method: Adapts Alignment Calibration to the MERU model, introducing hyperbolic-specific components including entailment calibration and norm regularization. Result: Experiments show that hyperbolic geometry offers distinct advantages for concept removal, achieving near-perfect forgetting while keeping reasonable performance on retained concepts. Conclusion: Hyperbolic unlearning differs fundamentally from Euclidean approaches in that it reorganizes the semantic hierarchy, offering new insights into concept representation and removal in multimodal models. Abstract: Machine unlearning methods have become increasingly important for selective concept removal in large pre-trained models. While recent work has explored unlearning in Euclidean contrastive vision-language models, the effectiveness of concept removal in hyperbolic spaces remains unexplored. This paper investigates machine unlearning in hyperbolic contrastive learning by adapting Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Through systematic experiments and ablation studies, we demonstrate that hyperbolic geometry offers distinct advantages for concept removal, achieving near perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space. Comparative analysis with Euclidean models reveals fundamental differences in unlearning dynamics, with hyperbolic unlearning reorganizing the semantic hierarchy while Euclidean approaches merely disconnect cross-modal associations. These findings not only advance machine unlearning techniques but also provide insights into the geometric properties that influence concept representation and removal in multimodal models. Source code available at https://github.com/alex-pv01/HAC

3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation

Gyeongrok Oh,Sungjune Kim,Heeju Ko,Hyung-gun Chi,Jinkyu Kim,Dongwook Lee,Daehyun Ji,Sungjoon Choi,Sujin Jang,Sangpil Kim

Task: Improve the quality of view transformation in camera-based 3D occupancy prediction when voxel queries are low-resolution.

Motivation: Computational constraints and the practical need for real-time deployment require smaller query resolutions, which cause information loss; rich visual details must therefore be encoded and preserved within limited query sizes.

Details

Method: Proposes ProtoOcc, a novel occupancy network that uses prototypes of clustered image segments in view transformation to enhance low-resolution context. Result: Experiments on the Occ3D and SemanticKITTI benchmarks demonstrate the method's effectiveness, showing clear improvements over the baselines. Conclusion: ProtoOcc achieves performance competitive with the baselines even with voxel resolution reduced by 75%. Abstract: The resolution of voxel queries significantly influences the quality of view transformation in camera-based 3D occupancy prediction. However, computational constraints and the practical necessity for real-time deployment require smaller query resolutions, which inevitably lead to information loss. Therefore, it is essential to encode and preserve rich visual details within limited query sizes while ensuring a comprehensive representation of 3D occupancy. To this end, we introduce ProtoOcc, a novel occupancy network that leverages prototypes of clustered image segments in view transformation to enhance low-resolution context. In particular, the mapping of 2D prototypes onto 3D voxel queries encodes high-level visual geometries and complements the loss of spatial information from reduced query resolutions. Additionally, we design a multi-perspective decoding strategy to efficiently disentangle the densely compressed visual cues into a high-dimensional 3D occupancy scene. Experimental results on both Occ3D and SemanticKITTI benchmarks demonstrate the effectiveness of the proposed method, showing clear improvements over the baselines. More importantly, ProtoOcc achieves competitive performance against the baselines even with 75% reduced voxel resolution.

Benchmarking Large Language Models for Handwritten Text Recognition

Giorgia Crosilla,Lukas Klic,Giovanni Colavizza

Task: Evaluate multimodal large language models (MLLMs) on handwritten text recognition (HTR) and compare them with traditional models.

Motivation: Traditional HTR models require extensive manual annotation and separate layout processing from text processing, which makes them error-prone. MLLMs offer a general approach that recognizes diverse handwriting styles without model-specific training.

Details

Method: Benchmarks a range of proprietary and open-source LLMs, evaluating their performance on modern and historical datasets and testing their ability to autonomously correct previously generated outputs. Result: Proprietary models, especially Claude 3.5 Sonnet, outperform open-source alternatives in zero-shot settings. MLLMs excel at recognizing modern handwriting and favor English because of the composition of their pre-training data. Comparisons with Transkribus show no consistent advantage for either approach. LLMs also show limited ability to autonomously correct errors in zero-shot transcriptions. Conclusion: MLLMs perform well on HTR, particularly for modern handwriting, but have limited ability to correct their own errors. Proprietary models outperform open-source models in zero-shot settings. Abstract: Traditional machine learning models for Handwritten Text Recognition (HTR) rely on supervised training, requiring extensive manual annotations, and often produce errors due to the separation between layout and text processing. In contrast, Multimodal Large Language Models (MLLMs) offer a general approach to recognizing diverse handwriting styles without the need for model-specific training. The study benchmarks various proprietary and open-source LLMs against Transkribus models, evaluating their performance on both modern and historical datasets written in English, French, German, and Italian. In addition, emphasis is placed on testing the models' ability to autonomously correct previously generated outputs. Findings indicate that proprietary models, especially Claude 3.5 Sonnet, outperform open-source alternatives in zero-shot settings. MLLMs achieve excellent results in recognizing modern handwriting and exhibit a preference for the English language due to their pre-training dataset composition. Comparisons with Transkribus show no consistent advantage for either approach. Moreover, LLMs demonstrate limited ability to autonomously correct errors in zero-shot transcriptions.

Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Feifei Li,Mi Zhang,Yiming Sun,Min Yang

Task: Propose Detect-and-Guide (DAG), a safe generation framework for detecting and eliminating harmful content in text-to-image diffusion models.

Motivation: Existing post-hoc model intervention techniques (such as concept unlearning and safety guidance) disturb the sampling trajectory when erasing harmful concepts and operate opaquely, making it unclear which part of the intermediate variables causes unsafe generation.

Details

Method: DAG leverages the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during sampling. It first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. Result: Experiments show that DAG achieves state-of-the-art safe generation performance on erasing sexual content, balancing harmfulness mitigation and text-following performance. Conclusion: DAG requires no fine-tuning of the diffusion model and therefore does not reduce generation diversity; it needs only a small annotated dataset to provide precise detection maps with generalizability and concept specificity. Abstract: Text-to-image diffusion models have achieved state-of-the-art results in synthesis tasks; however, there is a growing concern about their potential misuse in creating harmful content. To mitigate these risks, post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed. However, fine-tuning model weights or adapting the hidden states of the diffusion model operates in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation. These interventions severely affect the sampling trajectory when erasing harmful concepts from complex, multi-concept prompts, thus hindering their practical use in real-world settings. In this work, we propose the safe generation framework Detect-and-Guide (DAG), leveraging the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during the sampling process. DAG first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. The optimization only requires a small annotated dataset and can provide precise detection maps with generalizability and concept specificity. Moreover, DAG does not require fine-tuning of diffusion models, and therefore introduces no loss to their generation diversity. Experiments on erasing sexual content show that DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on multi-concept real-world prompts.

DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

Jiazhe Guo,Yikang Ding,Xiwu Chen,Shuo Chen,Bohan Li,Yingshuang Zou,Xiaoyang Lyu,Feiyang Tan,Xiaojuan Qi,Zhiheng Li,Hao Zhao

Task: Propose DiST-4D, a disentangled spatiotemporal diffusion framework for 4D driving scene generation.

Motivation: Current generative models struggle to synthesize dynamic 4D driving scenes that support both temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization.

Details

Method: DiST-4D uses metric depth as the core geometric representation and decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. Result: DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks and performs competitively in planning-related evaluations. Conclusion: Metric depth is essential for accurate, reliable temporal forecasting and spatial NVS because it provides a view-consistent geometric representation that generalizes well to unseen viewpoints. Abstract: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both accurate, reliable forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.

GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector

Zechuan Li,Hongshan Yu,Yihao Ding,Jinhao Qiao,Basim Azam,Naveed Akhtar

Task: Propose GO-N3RDet, a scene-geometry-optimized multi-view 3D object detector enhanced by neural radiance fields.

Motivation: Occlusion and the lack of 3D information make it challenging to construct 3D features from multi-view 2D images.

Details

Method: Introduces a voxel optimization mechanism with embedded 3D positional information to fuse multi-view features, devises a double importance sampling scheme for the NeRF branch, proposes an opacity optimization module that predicts voxel opacity precisely by enforcing multi-view consistency constraints, and uses ray distance as a weighting factor to minimize cumulative ray errors. Result: Extensive experiments on ScanNet and ARKITScenes establish a new state of the art in NeRF-based multi-view 3D detection. Conclusion: The modules of GO-N3RDet work together as an end-to-end neural model that reaches a new state of the art in NeRF-based multi-view 3D detection. Abstract: We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields. The key to accurate 3D object detection is in effective voxel representation. However, due to occlusion and lack of 3D information, constructing 3D features from multi-view 2D images is challenging. Addressing that, we introduce a unique 3D positional information embedded voxel optimization mechanism to fuse multi-view features. To prioritize neural field reconstruction in object regions, we also devise a double importance sampling scheme for the NeRF branch of our detector. We additionally propose an opacity optimization module for precise voxel opacity prediction by enforcing multi-view consistency constraints. Moreover, to further improve voxel density consistency across multiple perspectives, we incorporate ray distance as a weighting factor to minimize cumulative ray errors. Our unique modules synergetically form an end-to-end neural model that establishes a new state of the art in NeRF-based multi-view 3D detection, verified with extensive experiments on ScanNet and ARKITScenes. Code will be available at https://github.com/ZechuanLi/GO-N3RDet.

CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification

Wenlong Yu,Qilong Wang,Chuang Liu,Dong Li,Qinghua Hu

Task: Propose a Chain-of-Explanation (CoE) approach that automatically builds global concept explanation datasets and provides linguistic explanations of local decision processes.

Motivation: Current post-hoc explanation methods fall short in automatically constructing accurate and sufficient linguistic explanations for global concepts and local circuits; in particular, polysemanticity in semantic Visual Concepts (VCs) severely impedes the interpretability of concepts and deep vision models (DVMs).

Details

Method: CoE automatically decodes and describes VCs to build global concept explanation datasets, and uses a concept polysemanticity disentanglement and filtering mechanism to distinguish the most relevant concept atoms. It also proposes Concept Polysemanticity Entropy (CPE) as a measure of model interpretability. Result: GPT-4o and human-based experiments demonstrate the effectiveness of CPE and the superiority of CoE, with an average absolute improvement of 36% in explainability scores. Conclusion: By automatically building global concept explanation datasets and providing linguistic explanations of local decision processes, CoE significantly improves the explainability of deep vision models. Abstract: Explainability is a critical factor influencing the wide deployment of deep vision models (DVMs). Concept-based post-hoc explanation methods can provide both global and local insights into model decisions. However, current methods in this field face challenges in that they are too inflexible to automatically construct accurate and sufficient linguistic explanations for global concepts and local circuits. Particularly, the intrinsic polysemanticity in semantic Visual Concepts (VCs) impedes the interpretability of concepts and DVMs, which is severely underestimated. In this paper, we propose a Chain-of-Explanation (CoE) approach to address these issues. Specifically, CoE automates the decoding and description of VCs to construct global concept explanation datasets. Further, to alleviate the effect of polysemanticity on model explainability, we design a concept polysemanticity disentanglement and filtering mechanism to distinguish the most contextually relevant concept atoms. Besides, a Concept Polysemanticity Entropy (CPE), as a measure of model interpretability, is formulated to quantify the degree of concept uncertainty. The modeling of deterministic concepts is upgraded to uncertain concept atom distributions. Finally, CoE automatically enables linguistic local explanations of the decision-making process of DVMs by tracing the concept circuit. GPT-4o and human-based experiments demonstrate the effectiveness of CPE and the superiority of CoE, achieving an average absolute improvement of 36% in terms of explainability scores.
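
To make the notion of an entropy over concept atoms concrete, here is a small sketch. The abstract does not give the CPE formula, so this is only an assumption: it treats the relevance scores of a concept's atoms as an unnormalized distribution and computes its Shannon entropy. The function name `concept_entropy` and the score inputs are hypothetical.

```python
import math
from typing import Sequence


def concept_entropy(atom_scores: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a concept's atom distribution.

    `atom_scores` are non-negative relevance scores, one per concept atom
    (an assumed input format). A concept with a single dominant atom has
    entropy near 0; a concept spread evenly over many atoms has high
    entropy, i.e. it is more polysemantic.
    """
    total = sum(atom_scores)
    if total <= 0:
        raise ValueError("at least one atom score must be positive")
    probs = [s / total for s in atom_scores if s > 0]
    return -sum(p * math.log(p) for p in probs)


monosemantic = concept_entropy([1.0])
polysemantic = concept_entropy([1.0, 1.0, 1.0, 1.0])
```

Under this assumed definition, lower entropy corresponds to a less ambiguous concept.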

DEPT: Deep Extreme Point Tracing for Ultrasound Image Segmentation

Lei Shi,Xi Fang,Naiyu Wang,Junxing Zhang

Task: Propose Deep Extreme Point Tracing (DEPT) with the Feature-Guided Extreme Point Masking (FGEPM) algorithm for ultrasound image segmentation.

Motivation: Fully supervised learning requires extensive, labor-intensive annotation; weakly supervised methods, particularly those that use extreme points as supervisory signals, offer an effective alternative.

Details

Method: Generates pseudo labels by identifying the lowest-cost path that connects all extreme points on a cost matrix derived from the feature map, and proposes an iterative training strategy that progressively refines the pseudo labels, enabling continuous network improvement. Result: Experiments on two public datasets show that the method approaches the performance of the fully supervised method and outperforms several existing weakly supervised methods. Conclusion: DEPT with FGEPM performs well on ultrasound image segmentation, approaching fully supervised performance and outperforming existing weakly supervised methods. Abstract: Automatic medical image segmentation plays a crucial role in computer aided diagnosis. However, fully supervised learning approaches often require extensive and labor-intensive annotation efforts. To address this challenge, weakly supervised learning methods, particularly those using extreme points as supervisory signals, have the potential to offer an effective solution. In this paper, we introduce Deep Extreme Point Tracing (DEPT) integrated with the Feature-Guided Extreme Point Masking (FGEPM) algorithm for ultrasound image segmentation. Notably, our method generates pseudo labels by identifying the lowest-cost path that connects all extreme points on the feature map-based cost matrix. Additionally, an iterative training strategy is proposed to refine pseudo labels progressively, enabling continuous network improvement. Experimental results on two public datasets demonstrate the effectiveness of our proposed method. The performance of our method approaches that of the fully supervised method and outperforms several existing weakly supervised methods.
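
The lowest-cost path step can be sketched with Dijkstra's algorithm on a grid. This is an illustration under stated assumptions, not the authors' algorithm: the abstract does not say how the cost matrix is built or how paths are searched, so the 4-connected neighborhood, the rule that a path pays the cost of every cell it visits, and the helper names `lowest_cost_path` and `trace_contour` are all hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]


def lowest_cost_path(cost: Sequence[Sequence[float]], start: Cell, goal: Cell) -> List[Cell]:
    """Dijkstra on a 4-connected grid; costs must be non-negative."""
    rows, cols = len(cost), len(cost[0])
    dist = {start: cost[start[0]][start[1]]}
    prev = {}
    heap = [(dist[start], start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale heap entry
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (nd, (nr, nc)))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def trace_contour(cost: Sequence[Sequence[float]], extreme_points: Sequence[Cell]) -> List[Cell]:
    """Join consecutive extreme points (in the given order) into a closed contour."""
    contour: List[Cell] = []
    n = len(extreme_points)
    for i in range(n):
        segment = lowest_cost_path(cost, extreme_points[i], extreme_points[(i + 1) % n])
        contour.extend(segment[:-1])  # the segment end is the next segment's start
    return contour


# High values mark cells the boundary should avoid.
grid = [
    [1, 9, 1],
    [1, 9, 1],
    [1, 1, 1],
]
path = lowest_cost_path(grid, (0, 0), (0, 2))
```

The cells enclosed by the returned contour would then serve as a pseudo label; filling the contour is left out of the sketch.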

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Hengrui Kang,Siwei Wen,Zichen Wen,Junyan Ye,Weijia Li,Peilin Feng,Baichuan Zhou,Bin Wang,Dahua Lin,Linfeng Zhang,Conghui He

Task: Introduce SynthScars, a high-quality and diverse synthetic image dataset, and develop LEGION, an image forgery analysis framework based on a multimodal large language model.

Motivation: Current synthetic image detection methods lack textual interpretability, and existing datasets are often outdated and lack fine-grained annotations.

Details

Method: Introduces the SynthScars dataset and proposes the LEGION framework, which integrates artifact detection, segmentation, and explanation. Result: LEGION outperforms existing methods across multiple benchmarks; on SynthScars it surpasses the second-best traditional expert by 3.31% in mIoU and 7.75% in F1 score. Conclusion: LEGION not only improves synthetic image detection accuracy but can also guide the generation of higher-quality, more realistic images. Abstract: The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, and current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building upon this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, particularly surpassing the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. The code, model, and dataset will be released.

DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Ruowen Zhao,Junliang Ye,Zhengyi Wang,Guangce Liu,Yiwen Chen,Yikai Wang,Jun Zhu

Task: Propose DeepMesh, a framework for optimizing 3D mesh generation.

Motivation: Existing auto-regressive methods for generating structured meshes are constrained by limited face counts and mesh incompleteness.

Details

Method: DeepMesh optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy that combines a novel tokenization algorithm with improvements in data curation and processing; (2) the introduction of reinforcement learning (RL) into 3D mesh generation, achieving human preference alignment via Direct Preference Optimization (DPO). Result: Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming existing methods in both precision and quality. Conclusion: Through its pre-training strategy and reinforcement learning approach, DeepMesh significantly improves the precision and quality of 3D mesh generation. Abstract: Triangle meshes play a crucial role in 3D applications for efficient manipulation and rendering. While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). We design a scoring standard that combines human evaluation with 3D metrics to collect preference pairs for DPO, ensuring both visual appeal and geometric accuracy. Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in both precision and quality. Project page: https://zhaorw02.github.io/DeepMesh/
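
For readers unfamiliar with DPO, the standard per-pair objective can be sketched as follows. This is the generic DPO loss, not DeepMesh's training code: the abstract gives no hyperparameters, so `beta = 0.1` and the scalar sequence log-probabilities passed in are assumptions for illustration.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the preferred ("chosen") and
    dispreferred ("rejected") outputs under the trained policy and under a
    frozen reference model. `beta` is an assumed value.
    """
    margin = (policy_logp_chosen - ref_logp_chosen) - (policy_logp_rejected - ref_logp_rejected)
    x = -beta * margin
    # -log(sigmoid(beta * margin)) == softplus(x), written in a numerically stable form
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the preferred mesh: the loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

In a setup like the one described, each pair would hold two generated meshes for the same condition, with the preferred one selected by the scoring standard.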

Challenges and Trends in Egocentric Vision: A Survey

Xiang Li,Heqian Qiu,Lanxiao Wang,Hanwen Zhang,Chenghao Qi,Linfeng Han,Huiyu Xiong,Hongliang Li

Task: Provide a comprehensive survey of research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and categorizing tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding.

Motivation: With the rapid development of artificial intelligence technologies and wearable devices, egocentric vision understanding has emerged as a new research direction that is attracting growing attention from both academia and industry.

Details

Method: Systematically analyzes the components of egocentric scenes, categorizes tasks into four main areas, and explores the sub-tasks within each category in detail. Result: Summarizes the main challenges and trends in the field and provides an overview of high-quality egocentric vision datasets. Conclusion: Anticipates broad applications of egocentric vision technologies in augmented reality, virtual reality, and embodied intelligence, and proposes future research directions based on the latest developments in the field. Abstract: With the rapid development of artificial intelligence technologies and wearable devices, egocentric vision understanding has emerged as a new and challenging research direction, gradually attracting widespread attention from both academia and industry. Egocentric vision captures visual and multimodal data through cameras or sensors worn on the human body, offering a unique perspective that simulates human visual experiences. This paper provides a comprehensive survey of the research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and categorizing the tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding. We explore in detail the sub-tasks within each category. We also summarize the main challenges and trends currently existing in the field. Furthermore, this paper presents an overview of high-quality egocentric vision datasets, offering valuable resources for future research. By summarizing the latest advancements, we anticipate the broad applications of egocentric vision technologies in fields such as augmented reality, virtual reality, and embodied intelligence, and propose future research directions based on the latest developments in the field.

Teng-Fang Hsiao,Bo-Kai Ruan,Yi-Lun Wu,Tzu-Ling Lin,Hong-Han Shuai

Task: Propose a training-free text-and-image-to-image generation method (TF-TI2I) that improves generation quality under complex, multi-image instructions.

Motivation: Existing methods often use image inputs only partially, focusing on specific elements, or lose generation quality when handling complex multi-image instructions.

Details

Method: Builds on the MM-DiT architecture, extracting a condensed visual representation from reference images and sharing information selectively through Reference Contextual Masking, while a Winner-Takes-All module prioritizes the most relevant references. Result: The method performs robustly across various benchmarks, confirming its effectiveness on complex image generation tasks. Conclusion: TF-TI2I significantly improves the quality of complex image generation without additional training. Abstract: Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference Contextual Masking -- this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.

EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds

Yuanchao Yue,Hui Yuan,Qinglong Miao,Xiaolong Mao,Raouf Hamzaoui,Peter Eisert

Task: Propose a method that uses edge information for cross-modal registration between 2D images and 3D point clouds.

Motivation: Cross-modal data registration is widely used in autonomous driving and robotics. Because of computational constraints, existing methods usually downsample the original point cloud and image data, which costs precision. In addition, high-dimensional features extracted from different modalities need specific techniques to reduce cross-modal differences before they can be matched effectively.

Details

Method: Uses edge information from the original point clouds and images for cross-modal registration. Extracting edge points and edge pixels retains the crucial information in the original data; an attention-based feature exchange block removes cross-modal disparities, and an optimal matching layer improves correspondence identification. Result: The method's accuracy is validated on the KITTI and nuScenes datasets, where it shows state-of-the-art performance. Conclusion: The method improves registration accuracy while maintaining computational efficiency, addressing the precision loss and cross-modal differences of existing methods. Abstract: Cross-modal data registration has long been a critical task in computer vision, with extensive applications in autonomous driving and robotics. Accurate and robust registration methods are essential for aligning data from different modalities, forming the foundation for multimodal sensor data fusion and enhancing perception systems' accuracy and reliability. The registration task between 2D images captured by cameras and 3D point clouds captured by Light Detection and Ranging (LiDAR) sensors is usually treated as a visual pose estimation problem. High-dimensional feature similarities from different modalities are leveraged to identify pixel-point correspondences, followed by pose estimation techniques using least squares methods. However, existing approaches often resort to downsampling the original point cloud and image data due to computational constraints, inevitably leading to a loss in precision. Additionally, high-dimensional features extracted using different feature extractors from various modalities require specific techniques to mitigate cross-modal differences for effective matching. To address these challenges, we propose a method that uses edge information from the original point clouds and images for cross-modal registration. We retain crucial information from the original data by extracting edge points and pixels, enhancing registration accuracy while maintaining computational efficiency. The use of edge points and edge pixels allows us to introduce an attention-based feature exchange block to eliminate cross-modal disparities. Furthermore, we incorporate an optimal matching layer to improve correspondence identification. We validate the accuracy of our method on the KITTI and nuScenes datasets, demonstrating its state-of-the-art performance.

Yuanchao Yue,Zhengxin Li,Wei Zhang,Hui Yuan

Task: Propose a framework that projects point clouds into several 2D representations for matching with camera images, addressing cross-modal registration between LiDAR point clouds and camera images.

Motivation: Existing calibration methods for LiDAR point clouds and camera images are typically time-consuming and require an external calibration board or specific environmental features. Cross-modal registration aligns the data directly without external calibration, but because of the domain gap between point clouds and images, existing methods rarely reach satisfactory registration accuracy while maintaining real-time performance.

Details

Method: Projects point clouds into several 2D representations for matching with camera images, and introduces a multi-scale feature extraction network and a patch-to-pixel matching network to extract features more effectively and provide more effective supervision. Result: Experiments on the KITTI and nuScenes datasets validate the model, which achieves real-time performance and high registration accuracy, with a registration accuracy rate of over 99% on KITTI. Conclusion: The framework effectively addresses cross-modal registration between LiDAR point clouds and camera images, achieving high accuracy and real-time performance. Abstract: The primary requirement for cross-modal data fusion is the precise alignment of data from different sensors. However, the calibration between LiDAR point clouds and camera images is typically time-consuming and requires an external calibration board or specific environmental features. Cross-modal registration effectively solves this problem by aligning the data directly without requiring external calibration. However, due to the domain gap between the point cloud and the image, existing methods rarely achieve satisfactory registration accuracy while maintaining real-time performance. To address this issue, we propose a framework that projects point clouds into several 2D representations for matching with camera images, which not only leverages the geometric characteristics of LiDAR point clouds more effectively but also bridges the domain gap between the point cloud and image. Moreover, to tackle the challenges of cross-modal differences and the limited overlap between LiDAR point clouds and images in the image matching task, we introduce a multi-scale feature extraction network to effectively extract features from both camera images and the projection maps of LiDAR point clouds. Additionally, we propose a patch-to-pixel matching network to provide more effective supervision and achieve higher accuracy. We validate the performance of our model through experiments on the KITTI and nuScenes datasets. Our network achieves real-time performance and extremely high registration accuracy. On the KITTI dataset, our model achieves a registration accuracy rate of over 99%.

Test-Time Backdoor Detection for Object Detection Models

Hangtao Zhang,Yichen Wang,Shihui Yan,Chenyu Zhu,Ziqi Zhou,Linshan Hou,Shengshan Hu,Minghui Li,Yanjun Zhang,Leo Yu Zhang

Task: Design a new method for detecting poisoned samples at test time in object detection models.

Motivation: Object detection models are vulnerable to backdoor attacks, in which attackers embed a predefined trigger in training samples to manipulate predictions. Detecting poisoned samples can prevent backdoor activation, but the unique characteristics of object detection leave existing defenses inadequate.

Details

Method: Designs TRAnsformation Consistency Evaluation (TRACE), which applies foreground and background transformations and assesses transformation consistency by computing the variance of object confidences. Result: TRACE improves AUROC by 30% over existing defenses and resists adaptive attacks. Conclusion: TRACE is effective at detecting poisoned samples in object detection models and has broad application prospects. Abstract: Object detection models are vulnerable to backdoor attacks, where attackers poison a small subset of training samples by embedding a predefined trigger to manipulate predictions. Detecting poisoned samples (i.e., those containing triggers) at test time can prevent backdoor activation. However, unlike image classification tasks, the unique characteristics of object detection -- particularly its output of numerous objects -- pose fresh challenges for backdoor detection. The complex attack effects (e.g., "ghost" object emergence or "vanishing" object) further render current defenses fundamentally inadequate. To this end, we design TRAnsformation Consistency Evaluation (TRACE), a brand-new method for detecting poisoned samples at test time in object detection. Our journey begins with two intriguing observations: (1) poisoned samples exhibit significantly more consistent detection results than clean ones across varied backgrounds; (2) clean samples show higher detection consistency when introduced to different focal information. Based on these phenomena, TRACE applies foreground and background transformations to each test sample, then assesses transformation consistency by calculating the variance in object confidences. TRACE achieves black-box, universal backdoor detection, with extensive experiments showing a 30% improvement in AUROC over state-of-the-art defenses and resistance to adaptive attacks.
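
The variance computation can be illustrated with a small sketch. It assumes detections have already been matched across the transformed copies of one test sample, so that each object has one confidence per transformation; the function name and the input layout are hypothetical, and the abstract does not specify how TRACE matches objects, averages over objects, or thresholds the score.

```python
from statistics import pvariance
from typing import Sequence


def confidence_variance(confidences_per_transform: Sequence[Sequence[float]]) -> float:
    """Mean per-object variance of detection confidence across transformations.

    confidences_per_transform[t][k] is the confidence of object k in the
    t-th transformed copy of the same test image (assumed pre-matched).
    A low value means the detections are highly consistent.
    """
    num_objects = len(confidences_per_transform[0])
    per_object = [
        pvariance([run[k] for run in confidences_per_transform])
        for k in range(num_objects)
    ]
    return sum(per_object) / num_objects


# Two objects tracked over three background transformations.
stable = confidence_variance([[0.9, 0.8], [0.9, 0.8], [0.9, 0.8]])
unstable = confidence_variance([[1.0], [0.0]])
```

Following the two observations in the abstract, unusually low variance under background transformations, or unusually high variance under foreground transformations, would point toward a poisoned sample.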

DCA: Dividing and Conquering Amnesia in Incremental Object Detection

Aoting Zhang,Dongbao Yang,Chang Liu,Xiaopeng Hong,Miao Shang,Yu Zhou

Task: Study how, in incremental object detection (IOD), a detector can continually localize and recognize new classes while preserving its performance on previous classes.

Motivation: Existing methods have had some success by improving knowledge distillation and exemplar replay, but the intrinsic causes of forgetting remain underexplored.

Details

Method: Proposes a Divide-and-Conquer Amnesia (DCA) strategy that redesigns transformer-based IOD as a localization-then-recognition process and uses semantic knowledge from pre-trained language models to reduce feature drift in recognition. Result: Under the four-step setting on MS-COCO, DCA improves the final AP by 6.9%. Conclusion: DCA performs well in long-term incremental scenarios, maintaining and transferring localization ability while specifically addressing forgetting in recognition. Abstract: Incremental object detection (IOD) aims to cultivate an object detector that can continuously localize and recognize novel classes while preserving its performance on previous classes. Existing methods achieve certain success by improving knowledge distillation and exemplar replay for transformer-based detection frameworks, but the intrinsic forgetting mechanisms remain underexplored. In this paper, we dive into the cause of forgetting and discover forgetting imbalance between localization and recognition in transformer-based IOD, which means that localization is less-forgetting and can generalize to future classes, whereas catastrophic forgetting occurs primarily on recognition. Based on these insights, we propose a Divide-and-Conquer Amnesia (DCA) strategy, which redesigns the transformer-based IOD into a localization-then-recognition process. DCA can well maintain and transfer the localization ability, leaving decoupled fragile recognition to be specially conquered. To reduce feature drift in recognition, we leverage semantic knowledge encoded in pre-trained language models to anchor class representations within a unified feature space across incremental tasks. This involves designing a duplex classifier fusion and embedding class semantic features into the recognition decoding process in the form of queries. Extensive experiments validate that our approach achieves state-of-the-art performance, especially for long-term incremental scenarios. For example, under the four-step setting on MS-COCO, our DCA strategy significantly improves the final AP by 6.9%.

SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes

Weixiao Gao,Liangliang Nan,Hugo Ledoux

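Task: Introduce and evaluate SUM Parts, a large-scale dataset of urban textured meshes, and provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods.

Motivation: Semantic segmentation in urban scene analysis has mainly focused on images or point clouds, while textured meshes, which offer a richer spatial representation, remain underexplored.

Details

Method: Builds SUM Parts, an urban textured-mesh dataset covering about 2.5 km$^2$ with 21 classes, and develops an annotation tool that supports both face- and texture-based annotation with efficient interactive selection. Result: Provides a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods on the dataset. Conclusion: SUM Parts fills a gap in semantic segmentation of urban textured meshes and offers a valuable resource for related research. Abstract: Semantic segmentation in urban scene analysis has mainly focused on images or point clouds, while textured meshes - offering richer spatial representation - remain underexplored. This paper introduces SUM Parts, the first large-scale dataset for urban textured meshes with part-level semantic labels, covering about 2.5 km$^2$ with 21 classes. The dataset was created using our own annotation tool, which supports both face- and texture-based annotations with efficient interactive selection. We also provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods on this dataset. Our project page is available at https://tudelft3d.github.io/SUMParts/.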

Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport

Hao Tan,Zichang Tan,Jun Li,Ajian Liu,Jun Wan,Zhen Lei

Task: Address the loss of local semantics and the matching of irrelevant regions in open-vocabulary multi-label recognition.

Motivation: Existing vision-language models such as CLIP fall short in local semantics and region-label matching, which leads to unreliable predictions.

Details

Method: Proposes the RAM framework, comprising a Ladder Local Adapter (LLA) to recover local semantics and Knowledge-Constrained Optimal Transport (KCOT) to suppress irrelevant matching. Result: RAM achieves state-of-the-art performance on multiple datasets and shows potential to boost existing methods. Conclusion: By recovering local semantics and optimizing region matching, RAM addresses key problems in open-vocabulary multi-label recognition. Abstract: Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce refocusing on local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains, and shows great potential to boost the existing methods. Code: https://github.com/EricTan7/RAM.
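
To make the idea of casting region-label matching as optimal transport more concrete, here is a minimal sketch of plain entropy-regularized optimal transport solved with Sinkhorn iterations. It is not the paper's KCOT: the uniform marginals, the `epsilon` value, and the toy cost matrix (imagined as one minus the cosine similarity between region and label features) are all assumptions made for illustration.

```python
import math

def sinkhorn(cost, epsilon=0.1, n_iters=200):
    """Entropic optimal transport with uniform marginals (illustrative only).

    cost[i][j] is the cost of matching image region i to candidate label j.
    Returns the transport plan as a list of rows.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # assumed: equal mass on every region
    b = [1.0 / m] * m  # assumed: equal mass on every label
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical costs for 3 regions and 2 labels.
cost = [[0.1, 0.9],
        [0.8, 0.2],
        [0.5, 0.5]]
plan = sinkhorn(cost)
```

In this toy case the plan places most of region 0's mass on label 0 and most of region 1's mass on label 1, while the ambiguous region 2 is split between both. A simple average pooling over regions would not separate them this way.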

TruthLens: A Training-Free Paradigm for DeepFake Detection

Ritabrata Chakraborty,Rajatsubhra Chakraborty,Ali Khaleghi Rahimian,Thomas MacDougall

Task: Propose TruthLens, a novel training-free framework that recasts deepfake detection as a visual question-answering task.

Motivation: Current fake image detection methods rely mainly on binary classifiers that emphasize accuracy but often neglect interpretability, leaving users without a clear understanding of why an image is judged real or fake.

Details

Method: TruthLens uses state-of-the-art large vision-language models (LVLMs) to observe and describe visual artifacts, and combines this with the reasoning capabilities of large language models (LLMs) such as GPT-4 to analyze and aggregate evidence into informed decisions. Result: TruthLens achieves high accuracy on challenging datasets while maintaining strong interpretability. Conclusion: By reframing deepfake detection as a reasoning-driven process, TruthLens establishes a new paradigm for combating synthetic media, combining strong performance with interpretability to address the growing threat of visual disinformation. Abstract: The proliferation of synthetic images generated by advanced AI models poses significant challenges in identifying and understanding manipulated visual content. Current fake image detection methods predominantly rely on binary classification models that focus on accuracy while often neglecting interpretability, leaving users without clear insights into why an image is deemed real or fake. To bridge this gap, we introduce TruthLens, a novel training-free framework that reimagines deepfake detection as a visual question-answering (VQA) task. TruthLens utilizes state-of-the-art large vision-language models (LVLMs) to observe and describe visual artifacts and combines this with the reasoning capabilities of large language models (LLMs) like GPT-4 to analyze and aggregate evidence into informed decisions. By adopting a multimodal approach, TruthLens seamlessly integrates visual and semantic reasoning to not only classify images as real or fake but also provide interpretable explanations for its decisions. This transparency enhances trust and provides valuable insights into the artifacts that signal synthetic content. Extensive evaluations demonstrate that TruthLens outperforms conventional methods, achieving high accuracy on challenging datasets while maintaining a strong emphasis on explainability. By reframing deepfake detection as a reasoning-driven process, TruthLens establishes a new paradigm in combating synthetic media, combining cutting-edge performance with interpretability to address the growing threats of visual disinformation.

Boosting HDR Image Reconstruction via Semantic Knowledge Transfer

Qingsen Yan,Tao Hu,Genggeng Chen,Wei Dong,Yanning Zhang

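Task: Recover high dynamic range (HDR) images from multiple low dynamic range (LDR) images, particularly when the LDR images exhibit noticeable degradation and missing content.

Motivation: Scene-specific semantic priors offer a promising way to restore heavily degraded regions, but because these priors are usually extracted from sRGB standard dynamic range (SDR) images, the domain/format gap poses a significant challenge when applying them to HDR imaging.

Details

Method: Proposes a general framework that transfers semantic knowledge from the SDR domain into existing HDR reconstruction via self-distillation. The framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which uses semantic knowledge from SDR images to address ill-posed problems in the initial HDR reconstruction results. A self-distillation mechanism then constrains color and content information with semantic knowledge, aligning the external outputs of the baseline and SPGRM. To transfer the semantic knowledge of internal features, a semantic knowledge alignment module (SKAM) fills in missing semantic content using complementary masks. Result: Extensive experiments show that the method significantly improves the HDR imaging quality of existing methods. Conclusion: Through self-distillation and the semantic knowledge alignment module, the proposed framework effectively transfers semantic knowledge from the SDR domain into HDR reconstruction and significantly improves HDR imaging quality. Abstract: Recovering High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images becomes challenging when the LDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, because these priors are typically extracted from sRGB Standard Dynamic Range (SDR) images, the domain/format gap poses a significant challenge when applying them to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from SDR domain via self-distillation to boost existing HDR reconstruction. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs between the baseline and SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a semantic knowledge alignment module (SKAM) to fill the missing semantic contents with the complementary masks. Extensive experiments demonstrate that our method can significantly improve the HDR imaging quality of existing methods.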

EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models

Yinan Liang,Ziwei Wang,Xiuwei Xu,Jie Zhou,Jiwen Lu

Task: Propose an automatic pruning method that improves the efficiency of multimodal reasoning.

Motivation: Multimodal large language models perform strongly on complex reasoning tasks, but their model complexity makes deployment on resource-limited devices challenging.

Details

Method: Searches for a pruning policy using only a small number of samples, maximizing the policy's generalization to unknown training data while maintaining model accuracy, to reach a favorable trade-off between accuracy and efficiency. Result: Using only 64 samples for pruning policy search, EfficientLLaVA reaches 83.05% accuracy on ScienceQA with a $1.8\times$ speedup over the dense LLaVA-v1.5-7B model. Conclusion: The proposed method significantly improves the efficiency of multimodal reasoning while maintaining model accuracy. Abstract: While multimodal large language models demonstrate strong performance in complex reasoning tasks, they pose significant challenges related to model complexity during deployment, especially for resource-limited devices. In this paper, we propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Conventional methods rely on the training data of the original model to select the proper pruning ratio for different network components. However, these methods are impractical for large vision-language models due to the unaffordable search costs caused by web-scale training corpus. In contrast, our approach only leverages a small number of samples to search for the desired pruning policy by maximizing its generalization ability on unknown training data while maintaining the model accuracy, which enables the achievement of an optimal trade-off between accuracy and efficiency for large visual language models. Specifically, we formulate the generalization gap of the pruning strategy using the structural risk minimization principle. Based on both task performance and generalization capability, we iteratively search for the optimal pruning policy within a given search space and optimize the vision projector to evolve the search space with higher upper bound of performance. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering. Using only 64 samples for pruning policy search, EfficientLLaVA achieves an accuracy of 83.05% on ScienceQA, along with a $1.8\times$ speedup compared to the dense LLaVA-v1.5-7B model.

Yuchen Ren,Zhengyu Zhao,Chenhao Lin,Bo Yang,Lu Zhou,Zhe Liu,Chao Shen

Task: Study forward propagation refinement of Vision Transformers (ViTs) for improving adversarial transferability.

Motivation: To understand the robustness of ViTs in practical scenarios, the paper studies transferable adversarial examples on ViTs and explores whether refining forward propagation in the surrogate model can improve transferability.

Details

Method: Proposes Forward Propagation Refinement (FPR), consisting of Attention Map Diversification (AMD) and Momentum Token Embedding (MTE). AMD diversifies certain attention maps and implicitly induces beneficial gradient vanishing during backward propagation; MTE accumulates historical token embeddings to stabilize the forward updates. Result: Experiments show that FPR outperforms the current best backward surrogate refinement by up to 7.0% on average. Conclusion: FPR is effective at improving adversarial transferability, remains superior against popular defenses, and is compatible with other transfer methods. Abstract: Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is by refining the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement by up to 7.0% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods. Codes and appendix are available at https://github.com/RYC-98/FPR.
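
The abstract says only that MTE "accumulates historical token embeddings"; it does not give the update rule. As a rough illustration of what such accumulation could look like, the sketch below blends each new embedding with a running accumulation using an exponential moving average. The function name `momentum_accumulate`, the `decay` parameter, and the moving-average form itself are assumptions, not the authors' formulation.

```python
def momentum_accumulate(history, decay=0.9):
    """Blend each new embedding with a running accumulation of earlier ones.

    history: list of embeddings (lists of floats), one per attack iteration.
    decay:   assumed momentum factor; higher values weight the past more.
    """
    acc = None
    smoothed = []
    for emb in history:
        if acc is None:
            acc = list(emb)  # first step: nothing to accumulate yet
        else:
            acc = [decay * a + (1.0 - decay) * e for a, e in zip(acc, emb)]
        smoothed.append(acc)
    return smoothed

# Hypothetical 2-dimensional token embedding observed over three iterations.
steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
smoothed = momentum_accumulate(steps, decay=0.5)
```

With `decay=0.5` the raw embedding flips between two extremes, while the accumulated version moves more gradually (`[1.0, 0.0]`, then `[0.5, 0.5]`, then `[0.75, 0.25]`), which is the stabilizing effect the abstract attributes to accumulation.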

Visual Persona: Foundation Model for Full-Body Human Customization

Jisu Nam,Soowon Son,Zhan Xu,Jing Shi,Difan Liu,Feng Liu,Aashish Misraa,Seungryong Kim,Yang Zhou

Task: Develop a foundation model that generates diverse full-body human images guided by text descriptions.

Motivation: Existing methods focus mainly on preserving facial identity and neglect full-body appearance details. This work aims to capture detailed full-body appearance while aligning with text descriptions of body structure and scene variations.

Details

Method: Proposes a data curation pipeline that uses vision-language models to evaluate full-body appearance consistency, producing a dataset of 580k paired human images. A transformer encoder-decoder architecture, adapted to a pre-trained text-to-image diffusion model, splits the input image into distinct body regions, encodes them as local appearance features, and projects them into dense identity embeddings that condition the generation of customized images. Result: Visual Persona surpasses existing approaches at generating high-quality, customized images. Conclusion: Visual Persona demonstrates versatility across various downstream tasks, and extensive ablation studies validate its design choices. Abstract: We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which augments the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.

Learn Your Scales: Towards Scale-Consistent Generative Novel View Synthesis

Fereshteh Forghani,Jason J. Yu,Tristan Aumentado-Armstrong,Konstantinos G. Derpanis,Marcus A. Brubaker

Task: Understand and address the effect of scene scale ambiguity on generative novel view synthesis (GNVS) methods.

Motivation: Conventional multi-view datasets captured with a moving monocular camera have ambiguous scale. Previous methods acknowledged this through various ad-hoc normalization pre-processing steps, but did not directly analyze the effect of incorrect scene scales on their application.

Details

Method: Studies the effect of scene scale ambiguity on GNVS models when sampling from a single image and, based on these intuitions, defines new metrics that measure the scale inconsistency of generated views. Proposes an end-to-end framework that estimates scene scales jointly with the GNVS model. Result: Experiments show that the method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods. Conclusion: Removing scale ambiguity improves the image quality of the resulting GNVS model. Abstract: Conventional depth-free multi-view datasets are captured using a moving monocular camera without metric calibration. The scales of camera positions in this monocular setting are ambiguous. Previous methods have acknowledged scale ambiguity in multi-view data via various ad-hoc normalization pre-processing steps, but have not directly analyzed the effect of incorrect scene scales on their application. In this paper, we seek to understand and address the effect of scale ambiguity when used to train generative novel view synthesis methods (GNVS). In GNVS, new views of a scene or object can be minimally synthesized given a single image and are, thus, unconstrained, necessitating the use of generative methods. The generative nature of these models captures all aspects of uncertainty, including any uncertainty of scene scales, which act as nuisance variables for the task. We study the effect of scene scale ambiguity in GNVS when sampled from a single image by isolating its effect on the resulting models and, based on these intuitions, define new metrics that measure the scale inconsistency of generated views. We then propose a framework to estimate scene scales jointly with the GNVS model in an end-to-end fashion. Empirically, we show that our method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods. Further, we show that removing this ambiguity improves generated image quality of the resulting GNVS model.

Automated Processing of eXplainable Artificial Intelligence Outputs in Deep Learning Models for Fault Diagnostics of Large Infrastructures

Giovanni Floreale,Piero Baraldi,Enrico Zio,Olga Fink

Task: Propose a novel framework that combines post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations, reducing the workload of maintenance decision-makers.

Motivation: Deep learning models that process images to recognize the health state of large infrastructure components can exhibit biases and rely on non-causal shortcuts. Explainable AI (XAI) can address these issues, but manually analyzing the explanations produced by XAI techniques is time-consuming and error-prone.

Details

Method: Proposes a framework combining post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations that deviate from those of correctly classified images and may therefore indicate abnormal model behavior. Result: Average classification accuracy on two faulty classes improves by 8%, and maintenance operators need to manually reclassify only 15% of the images. Compared with a state-of-the-art approach based on the faithfulness metric, the framework consistently achieves higher $F_1$ scores. Conclusion: The framework significantly reduces the workload of maintenance decision-makers and successfully identifies correct classifications that result from non-causal shortcuts. Abstract: Deep Learning (DL) models processing images to recognize the health state of large infrastructure components can exhibit biases and rely on non-causal shortcuts. eXplainable Artificial Intelligence (XAI) can address these issues but manually analyzing explanations generated by XAI techniques is time-consuming and prone to errors. This work proposes a novel framework that combines post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations that deviate from those of correctly classified images and may therefore indicate model abnormal behaviors. This significantly reduces the workload for maintenance decision-makers, who only need to manually reclassify images flagged as having anomalous explanations. The proposed framework is applied to drone-collected images of insulator shells for power grid infrastructure monitoring, considering two different Convolutional Neural Networks (CNNs), GradCAM explanations and Deep Semi-Supervised Anomaly Detection. The average classification accuracy on two faulty classes is improved by 8% and maintenance operators are required to manually reclassify only 15% of the images. We compare the proposed framework with a state-of-the-art approach based on the faithfulness metric: the experimental results obtained demonstrate that the proposed framework consistently achieves $F_1$ scores larger than those of the faithfulness-based approach. Additionally, the proposed framework successfully identifies correct classifications that result from non-causal shortcuts, such as the presence of ID tags printed on insulator shells.

Temporal Regularization Makes Your Video Generator Stronger

Harold Haodong Chen,Haojian Huang,Xianfeng Wu,Yexin Liu,Yajing Bai,Wen-Jie Shu,Harry Yang,Ser-Nam Lim

Task: Explore temporal augmentation in video generation and introduce FluxFlow, a strategy for improving temporal quality.

Motivation: Temporal quality is a critical aspect of video generation because it ensures consistent motion and realistic dynamics across frames, yet achieving high temporal coherence and diversity remains challenging.

Details

Method: Applies FluxFlow at the data level, using controlled temporal perturbations to enhance temporal quality without architectural modifications. Result: On the UCF-101 and VBench benchmarks, FluxFlow significantly improves temporal coherence and diversity across various video generation models while preserving spatial fidelity. Conclusion: Temporal augmentation is a simple yet effective approach with the potential to improve video generation quality. Abstract: Temporal quality is a critical aspect of video generation, as it ensures consistent motion and realistic dynamics across frames. However, achieving high temporal coherence and diversity remains challenging. In this work, we explore temporal augmentation in video generation for the first time, and introduce FluxFlow for initial investigation, a strategy designed to enhance temporal quality. Operating at the data level, FluxFlow applies controlled temporal perturbations without requiring architectural modifications. Extensive experiments on UCF-101 and VBench benchmarks demonstrate that FluxFlow significantly improves temporal coherence and diversity across various video generation models, including U-Net, DiT, and AR-based architectures, while preserving spatial fidelity. These findings highlight the potential of temporal augmentation as a simple yet effective approach to advancing video generation quality.
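
The abstract does not say which temporal perturbations FluxFlow applies. Purely as an illustration of what a data-level temporal perturbation might look like, the sketch below swaps a small, controlled number of frames in a training clip and leaves the model untouched. The function `perturb_frames`, its `num_swaps` and `seed` parameters, and the swap-based scheme are hypothetical.

```python
import random

def perturb_frames(frames, num_swaps=1, seed=None):
    """Return a copy of the clip with a few randomly chosen frame pairs swapped.

    frames:    ordered list of frames (any objects); needs at least 2 entries.
    num_swaps: assumed knob controlling how strong the perturbation is.
    seed:      optional seed for reproducible augmentation.
    """
    rng = random.Random(seed)
    out = list(frames)  # copy so the original clip is not modified
    for _ in range(num_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out

# Hypothetical 8-frame clip, represented by frame labels.
clip = ["f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7"]
augmented = perturb_frames(clip, num_swaps=1, seed=0)
```

Because the perturbation acts only on the order of the training data, it could in principle be placed in front of any video generator's data loader, which matches the abstract's point that no architectural change is required.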

Visual Position Prompt for MLLM based Visual Grounding

Wei Tang,Yanpeng Sun,Qinying Gu,Zechao Li

Task: Improve the ability of multimodal large language models (MLLMs) to align coordinates with spatial information in images, particularly for visual grounding.

Motivation: Existing MLLMs excel at image-related tasks but struggle to precisely align coordinates with spatial information in position-aware tasks, mainly because they lack explicit spatial references and pay insufficient attention to fine-grained spatial details.

Details

Method: Proposes VPP-LLaVA, an MLLM equipped with a Visual Position Prompt (VPP) that strengthens grounding through two complementary mechanisms. The global VPP overlays learnable, axis-like embeddings on the input image to provide structured spatial cues, while the local VPP focuses on fine-grained localization by introducing position-aware queries. Result: On standard grounding benchmarks, VPP-LLaVA achieves state-of-the-art performance with fewer training samples (0.6M), outperforming other MLLMs such as MiniGPT-v2 that rely on much larger datasets (about 21M samples). Conclusion: By introducing visual position prompts and training on a compact dataset, VPP-LLaVA significantly improves MLLM performance on visual grounding and shows promise for position-aware tasks. Abstract: Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address this issue, we introduce VPP-LLaVA, an MLLM equipped with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms. The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues. The local VPP focuses on fine-grained localization by incorporating position-aware queries, which suggests probable object locations. We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training. Training on this dataset with VPP enhances the model's performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples compared to other MLLMs like MiniGPT-v2, which rely on much larger datasets ($\sim$21M samples). The code and VPP-SFT dataset will be available at https://github.com/WayneTomas/VPP-LLaVA upon acceptance.

V2X-DG: Domain Generalization for Vehicle-to-Everything Cooperative Perception

Baolu Li,Zongzhe Xu,Jinlong Li,Xinyu Liu,Jianwu Fang,Xiaopeng Li,Hongkai Yu

Task: Study domain generalization for LiDAR-based V2X cooperative perception in order to improve the generalization of 3D detection.

Motivation: Current cooperative perception algorithms are trained and tested on the same dataset, so their generalization ability remains underexplored.

Details

Method: Proposes Cooperative Mixup Augmentation based Generalization (CMAG) and a Cooperation Feature Consistency (CFC) constraint. Result: Experiments show significant performance gains when generalizing to unseen datasets, while strong performance is maintained on the source dataset. Conclusion: The proposed method effectively improves the generalization ability of LiDAR-based V2X cooperative perception systems. Abstract: LiDAR-based Vehicle-to-Everything (V2X) cooperative perception has demonstrated its impact on the safety and effectiveness of autonomous driving. Since current cooperative perception algorithms are trained and tested on the same dataset, the generalization ability of cooperative perception systems remains underexplored. This paper is the first work to study the Domain Generalization problem of LiDAR-based V2X cooperative perception (V2X-DG) for 3D detection based on four widely-used open source datasets: OPV2V, V2XSet, V2V4Real and DAIR-V2X. Our research seeks to sustain high performance not only within the source domain but also across other unseen domains, achieved solely through training on source domain. To this end, we propose Cooperative Mixup Augmentation based Generalization (CMAG) to improve the model generalization capability by simulating the unseen cooperation, which is designed compactly for the domain gaps in cooperative perception. Furthermore, we propose a constraint for the regularization of the robust generalized feature representation learning: Cooperation Feature Consistency (CFC), which aligns the intermediately fused features of the generalized cooperation by CMAG and the early fused features of the original cooperation in source domain. Extensive experiments demonstrate that our approach achieves significant performance gains when generalizing to other unseen datasets while it also maintains strong performance on the source dataset.

MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Lixing Xiao,Shunlin Lu,Huaijin Pi,Ke Fan,Liang Pan,Yueer Zhou,Ziyong Feng,Xiaowei Zhou,Sida Peng,Jingbo Wang

Task: Address text-conditioned streaming motion generation, which requires predicting the next-step human pose from variable-length historical motions and incoming text.

Motivation: Existing methods struggle with streaming motion generation: diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation caused by discretized non-causal tokenization.

Details

Method: Proposes MotionStreamer, a framework that incorporates a continuous causal latent space into a probabilistic autoregressive model, mitigating the information loss caused by discretization and reducing error accumulation during long-term autoregressive generation. Result: Experiments show that the method outperforms existing approaches while enabling more applications, including multi-round generation, long-term generation, and dynamic motion composition. Conclusion: By combining a continuous causal latent space with a probabilistic autoregressive model, MotionStreamer effectively addresses the problems of streaming motion generation and has broad application prospects. Abstract: This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation problem due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/

Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator

Yuanzhi Zhu,Xi Wang,Stéphane Lathuilière,Vicky Kalogeiton

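Task: Distill masked diffusion models (MDMs) into a one-step generator.

Motivation: MDMs are a powerful generative modeling technique, but they typically suffer from slow inference that requires several steps.

Details

Method: Proposes Di$\mathtt{[M]}$O, which uses token-level distribution matching, optimizing model output logits in an on-policy framework with the help of an auxiliary model, together with a token initialization strategy that injects randomness while staying close to the teacher's training distribution. Result: On class-conditional and text-conditional image generation, Di$\mathtt{[M]}$O achieves performance competitive with multi-step teacher outputs while drastically reducing inference time. Conclusion: To the authors' knowledge, this is the first successful one-step distillation of masked diffusion models and the first application of discrete distillation to text-to-image generation. Abstract: Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference with several steps. In this paper, we propose Di$\mathtt{[M]}$O, a novel approach that distills masked diffusion models into a one-step generator. Di$\mathtt{[M]}$O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes model output logits by an 'on-policy framework' with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to teacher training distribution. We show Di$\mathtt{[M]}$O's effectiveness on both class-conditional and text-conditional image generation, impressively achieving performance competitive to multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.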

FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers

Ruichen Chen,Keith G. Mills,Di Niu

Task: Propose FP4DiT, a post-training quantization method based on floating-point quantization for Diffusion Transformer models.

Motivation: Existing post-training quantization methods mainly target classical diffusion models with convolutional U-Nets, whereas newer Diffusion Transformer models (such as the PixArt series and Hunyuan) use different transformer backbones, and integer quantization does not align well with the network weight and activation distributions in low-bit settings.

Details

Method: Proposes FP4DiT, which extends and generalizes the Adaptive Rounding post-training quantization technique to calibrate weight quantization for floating-point quantization, and develops robust online activation quantization. Result: Experiments show that FP4DiT outperforms integer-based post-training quantization at W4A6 and W4A8 precision and generates convincing visual content on PixArt-$\alpha$, PixArt-$\Sigma$ and Hunyuan. Conclusion: FP4DiT better matches the weight and activation distributions of Diffusion Transformer models in low-bit settings and produces high-quality images. Abstract: Diffusion Models (DM) have revolutionized the text-to-image visual generation process. However, the large computational cost and model footprint of DMs hinders practical deployment, especially on edge devices. Post-training quantization (PTQ) is a lightweight method to alleviate these burdens without the need for training or fine-tuning. While recent DM PTQ methods achieve W4A8 on integer-based PTQ, two key limitations remain: First, while most existing DM PTQ methods evaluate on classical DMs like Stable Diffusion XL, 1.5 or earlier, which use convolutional U-Nets, newer Diffusion Transformer (DiT) models like the PixArt series, Hunyuan and others adopt fundamentally different transformer backbones to achieve superior image synthesis. Second, integer (INT) quantization is prevailing in DM PTQ but doesn't align well with the network weight and activation distribution, while Floating-Point Quantization (FPQ) is still under-investigated, yet it holds the potential to better align the weight and activation distributions in low-bit settings for DiT. In response, we introduce FP4DiT, a PTQ method that leverages FPQ to achieve W4A6 quantization. Specifically, we extend and generalize the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for FPQ and demonstrate that DiT activations depend on input patch data, necessitating robust online activation quantization techniques. Experimental results demonstrate that FP4DiT outperforms integer-based PTQ at W4A6 and W4A8 precision and generates convincing visual content on PixArt-$\alpha$, PixArt-$\Sigma$ and Hunyuan in terms of several T2I metrics such as HPSv2 and CLIP.
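
To illustrate why a floating-point grid behaves differently from an integer grid at 4 bits, the sketch below quantizes weights to the magnitudes representable by a 4-bit E2M1 format (one sign bit, two exponent bits, one mantissa bit). The choice of E2M1, the per-tensor max-abs scaling, and plain round-to-nearest are assumptions for illustration; the paper calibrates rounding with an extension of Adaptive Rounding, which this sketch does not implement.

```python
import math

# Non-negative magnitudes representable in a 4-bit E2M1 float (assumed format).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(weights):
    """Round each weight to the nearest scaled E2M1 value (simulated quantization)."""
    max_abs = max(abs(w) for w in weights)
    # Per-tensor scale so the largest weight maps to the largest grid value.
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    quantized = []
    for w in weights:
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        quantized.append(math.copysign(magnitude, w) * scale)
    return quantized

weights = [6.0, -3.0, 0.4, 1.2]
dequantized = quantize_fp4(weights)
```

The grid is dense near zero (steps of 0.5) and sparse near the maximum (a step of 2.0), whereas a signed 4-bit integer grid spaces its levels uniformly. That non-uniform spacing is one reason a floating-point format can fit bell-shaped weight distributions better in low-bit settings.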

EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining

Boshen Xu,Yuting Mei,Xinbi Liu,Sipeng Zheng,Qin Jin

Task: Jointly train the Egocentric Depth- and Text-aware Model (EgoDTM) through large-scale 3D-aware video pretraining and video-text contrastive learning.

Motivation: Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. Most previous work, however, learns from 1D text or 2D visual cues such as bounding boxes, which inherently lack 3D understanding.

Details

Method: Introduces EgoDTM, which incorporates a lightweight 3D-aware decoder to efficiently learn 3D awareness from pseudo depth maps generated by depth estimation models. The original brief captions are enriched with hand-object visual cues by combining several foundation models. Result: Extensive experiments show that EgoDTM performs strongly across diverse downstream tasks, demonstrating its 3D-aware visual understanding. Conclusion: Through 3D-aware video pretraining and video-text contrastive learning, EgoDTM improves video representation learning, particularly for 3D-aware visual understanding. Abstract: Egocentric video-language pretraining has significantly advanced video representation learning. Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. However, most previous works learn from 1D text or 2D visual cues, such as bounding boxes, which inherently lack 3D understanding. To bridge this gap, we introduce EgoDTM, an Egocentric Depth- and Text-aware Model, jointly trained through large-scale 3D-aware video pretraining and video-text contrastive learning. EgoDTM incorporates a lightweight 3D-aware decoder to efficiently learn 3D-awareness from pseudo depth maps generated by depth estimation models. To further facilitate 3D-aware video pretraining, we enrich the original brief captions with hand-object visual cues by organically combining several foundation models. Extensive experiments demonstrate EgoDTM's superior performance across diverse downstream tasks, highlighting its superior 3D-aware visual understanding. Our code will be released at https://github.com/xuboshen/EgoDTM.

Toward task-driven satellite image super-resolution

Maciej Ziaja,Pawel Kowaleczko,Daniel Kostrzewa,Nicolas Longépé,Michal Kawulok

Task: Learn super-resolution algorithms that generate high-resolution images suitable for automated image analysis.

Motivation: Existing super-resolution methods can produce images of high perceptual quality, but it is unclear whether the reconstructed details are close to the ground truth and whether they are more valuable to image analysis algorithms.

Details

Method: Proposes a methodological approach for assessing whether existing computer vision task models can be used to evaluate super-resolution reconstruction algorithms and to train them in a task-driven way. Result: The analysis is supported by an experimental study and is expected to lay a foundation for selecting appropriate computer vision tasks that advance real-world super-resolution. Conclusion: The study provides a methodological basis for task-driven super-resolution, helping to generate high-resolution images better suited to automated image analysis. Abstract: Super-resolution is aimed at reconstructing high-resolution images from low-resolution observations. State-of-the-art approaches underpinned with deep learning allow for obtaining outstanding results, generating images of high perceptual quality. However, it often remains unclear whether the reconstructed details are close to the actual ground-truth information and whether they constitute a more valuable source for image analysis algorithms. In the reported work, we address the latter problem, and we present our efforts toward learning super-resolution algorithms in a task-driven way to make them suitable for generating high-resolution images that can be exploited for automated image analysis. In the reported initial research, we propose a methodological approach for assessing the existing models that perform computer vision tasks in terms of whether they can be used for evaluating super-resolution reconstruction algorithms, as well as training them in a task-driven way. We support our analysis with experimental study and we expect it to establish a solid foundation for selecting appropriate computer vision tasks that will advance the capabilities of real-world super-resolution.

Cube: A Roblox View of 3D Intelligence

Foundation AI Team,Kiran Bhat,Nishchaie Khanna,Karun Channa,Tinghui Zhou,Yiheng Zhu,Xiaoxia Sun,Charles Shang,Anirudh Sudarshan,Maurice Chu,Daiqing Li,Kangle Deng,Jean-Philippe Fauconnier,Tijmen Verhulsdonck,Maneesh Agrawala,Kayvon Fatahalian,Alexander Weiss,Christian Reiser,Ravi Kiran Chirravuri,Ravali Kandur,Alejandro Pelaez,Akash Garg,Michael Palleschi,Jessica Wang,Skylar Litz,Leon Liu,Anying Li,David Harmon,Derek Liu,Liangjun Feng,Denis Goupil,Lukas Kuczynski,Jihyun Yoon,Naveen Marri,Peiye Zhuang,Yinan Zhang,Brian Yin,Haomiao Jiang,Marcel van Workum,Thomas Lane,Bryce Erickson,Salil Pathare,Kyle Price,Anupam Singh,David Baszucki

Task: Build a foundation model for 3D intelligence that supports developers in generating 3D objects, scenes, character rigs for animation, and programmatic scripts describing object behaviors.

Motivation: Foundation models have demonstrated remarkable reasoning and generation capabilities for text, images, audio, and video, and Roblox aims to build a comparable foundation model for 3D intelligence.

Details

Method: Discusses three key design requirements for a 3D foundation model and presents a 3D shape tokenizer as a first step. Result: Shows how the tokenization scheme can be used for text-to-shape generation, shape-to-text generation, and text-to-scene generation, and how these applications can collaborate with existing large language models (LLMs) for scene analysis and reasoning. Conclusion: Outlines a path toward a fully unified foundation model for 3D intelligence. Abstract: Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence, a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes to rigging characters for animation to producing programmatic scripts describing object behaviors. We discuss three key design requirements for such a 3D foundation model and then present our first step towards building such a model. We expect that 3D geometric shapes will be a core data type and describe our solution for 3D shape tokenizer. We show how our tokenization scheme can be used in applications for text-to-shape generation, shape-to-text generation and text-to-scene generation. We demonstrate how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. We conclude with a discussion outlining our path to building a fully unified foundation model for 3D intelligence.

TULIP: Towards Unified Language-Image Pretraining

Zineng Tang,Long Lian,Seun Eisape,XuDong Wang,Roei Herzig,Adam Yala,Alane Suhr,Trevor Darrell,David M. Chan

Task: Introduce TULIP to address the shortcomings of existing image-text contrastive models on vision-centric tasks.

Motivation: Existing image-text contrastive models such as CLIP and SigLIP perform poorly on tasks that demand high-fidelity image understanding (counting, depth estimation, fine-grained object recognition), while vision-focused models are limited in their handling of language.

Details

Method: TULIP uses generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Result: TULIP outperforms existing state-of-the-art models on multiple benchmarks, sets a new state-of-the-art zero-shot result on ImageNet-1K, delivers up to a 2x improvement over SigLIP on RxRx1 in linear probing for few-shot classification, and scores more than 3x higher than SigLIP on MMVP. Conclusion: TULIP learns fine-grained visual features while preserving global semantic alignment, substantially improving performance on vision-centric tasks. Abstract: Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. These models, by performing language alignment, tend to prioritize high-level semantics over visual understanding, weakening their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a $2\times$ enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over $3\times$ higher scores than SigLIP on MMVP. Our code/checkpoints are available at https://tulip-berkeley.github.io

SDF-TopoNet: A Two-Stage Framework for Tubular Structure Segmentation via SDF Pre-training and Topology-Aware Fine-Tuning

Siyi Wu,Leyi Zhao,Haitian Ma,Xinyuan Song

Task: Propose SDF-TopoNet, an improved topology-aware segmentation framework for accurately segmenting tubular and curvilinear structures.

Motivation: Existing methods that enforce topological correctness are computationally expensive and insensitive to pixel-level accuracy, so they need additional loss terms to compensate.

Details

Method: Proposes a two-stage training strategy: a pre-training phase that uses the signed distance function (SDF) as an auxiliary learning target, and a fine-tuning phase that combines a dynamic adapter with a refined topological loss. Result: Experiments on five benchmark datasets show that SDF-TopoNet outperforms existing methods in both topological accuracy and quantitative segmentation metrics, while significantly reducing training complexity. Conclusion: The SDF-TopoNet framework offers clear advantages in segmentation accuracy and training efficiency. Abstract: Accurate segmentation of tubular and curvilinear structures, such as blood vessels, neurons, and road networks, is crucial in various applications. A key challenge is ensuring topological correctness while maintaining computational efficiency. Existing approaches often employ topological loss functions based on persistent homology, such as Betti error, to enforce structural consistency. However, these methods suffer from high computational costs and are insensitive to pixel-level accuracy, often requiring additional loss terms like Dice or MSE to compensate. To address these limitations, we propose SDF-TopoNet, an improved topology-aware segmentation framework that enhances both segmentation accuracy and training efficiency. Our approach introduces a novel two-stage training strategy. In the pre-training phase, we utilize the signed distance function (SDF) as an auxiliary learning target, allowing the model to encode topological information without directly relying on computationally expensive topological loss functions. In the fine-tuning phase, we incorporate a dynamic adapter alongside a refined topological loss to ensure topological correctness while mitigating overfitting and computational overhead. We evaluate our method on five benchmark datasets. Experimental results demonstrate that SDF-TopoNet outperforms existing methods in both topological accuracy and quantitative segmentation metrics, while significantly reducing training complexity.
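
To make the pre-training target concrete, the following is a minimal sketch of how a signed distance map could be derived from a binary segmentation mask. It is not the authors' implementation: the function name `signed_distance`, the sign convention (negative inside the structure, positive outside), and the brute-force search are all assumptions chosen for clarity. On real images a distance transform such as `scipy.ndimage.distance_transform_edt` would be the practical choice.

```python
import math


def signed_distance(mask):
    """Brute-force signed distance map for a small binary mask.

    mask: list of rows containing 0 (background) or 1 (structure).
    Assumed convention: negative inside the structure, positive outside.
    The magnitude is the Euclidean distance to the nearest pixel of the
    opposite class.
    """
    h, w = len(mask), len(mask[0])
    fg = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    bg = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    sdf = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            inside = bool(mask[r][c])
            others = bg if inside else fg
            if not others:
                # Degenerate mask (all structure or all background).
                sdf[r][c] = -math.inf if inside else math.inf
                continue
            d = min(math.hypot(r - rr, c - cc) for rr, cc in others)
            sdf[r][c] = -d if inside else d
    return sdf


# A one-pixel-wide horizontal "vessel" through the middle of a 5x5 image.
tube = [[0] * 5, [0] * 5, [1] * 5, [0] * 5, [0] * 5]
sdf = signed_distance(tube)
```

Unlike a 0/1 mask, every pixel of the distance map carries information about how far it lies from the structure, which is one plausible reason a regression target of this kind can encode shape without an explicit topological loss.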

Frans Zdyb,Albert Alonso,Julius B. Kirkegaard

Task: Detect slender, overlapping structures in computational microscopy.

Motivation: Recent coordinate-based approaches improve detection but produce less accurate splines than pixel-based methods.

Details

Method: Proposes a training-free differentiable rendering approach to spline refinement that achieves both high reliability and sub-pixel accuracy. Result: The method improves spline quality, enhances robustness to distribution shifts, and shrinks the gap between synthetic and real-world data. Conclusion: Evaluated on C. elegans nematodes, a model organism used in drug discovery and biomedical research, the method combines the strengths of coordinate-based and pixel-based approaches. Abstract: Detecting slender, overlapping structures remains a challenge in computational microscopy. While recent coordinate-based approaches improve detection, they often produce less accurate splines than pixel-based methods. We introduce a training-free differentiable rendering approach to spline refinement, achieving both high reliability and sub-pixel accuracy. Our method improves spline quality, enhances robustness to distribution shifts, and shrinks the gap between synthetic and real-world data. Being fully unsupervised, the method is a drop-in replacement for the popular active contour model for spline refinement. Evaluated on C. elegans nematodes, a popular model organism for drug discovery and biomedical research, we demonstrate that our approach combines the strengths of both coordinate- and pixel-based methods.

Ship Detection in Remote Sensing Imagery for Arbitrarily Oriented Object Detection

Bibi Erum Ayesha,T. Satyanarayana Murthy,Palamakula Ramesh Babu,Ramu Kuchipudi

Task: Develop an innovative ship detection system for maritime surveillance and ecological monitoring.

Motivation: Conventional ship detection struggles with arbitrary orientations, complex backgrounds, and obscured perspectives, which calls for more accurate and efficient solutions.

Details

Method: Uses YOLOv8 for real-time processing, combined with an adapted U-Net for ship instance segmentation. Result: YOLOv8 achieves 88% mAP and U-Net achieves 89% mAP, improving ship detection accuracy and boundary delineation. Conclusion: The study demonstrates the potential of deep learning models in ship detection and strengthens maritime surveillance, disaster response, and ecological monitoring. Abstract: This research paper presents an innovative ship detection system tailored for applications like maritime surveillance and ecological monitoring. The study employs YOLOv8 and repurposed U-Net, two advanced deep learning models, to significantly enhance ship detection accuracy. Evaluation metrics include Mean Average Precision (mAP), processing speed, and overall accuracy. The research utilizes the "Airbus Ship Detection" dataset, featuring diverse remote sensing images, to assess the models' versatility in detecting ships with varying orientations and environmental contexts. Conventional ship detection faces challenges with arbitrary orientations, complex backgrounds, and obscured perspectives. Our approach incorporates YOLOv8 for real-time processing and U-Net for ship instance segmentation. Evaluation focuses on mAP, processing speed, and overall accuracy. The dataset is chosen for its diverse images, making it an ideal benchmark. Results demonstrate significant progress in ship detection. YOLOv8 achieves an 88% mAP, excelling in accurate and rapid ship detection. U-Net, adapted for ship instance segmentation, attains an 89% mAP, improving boundary delineation and handling occlusions. This research enhances maritime surveillance, disaster response, and ecological monitoring, exemplifying the potential of deep learning models in ship detection.

Praveen Shastry,Sowmya Chowdary Muthulur,Naveen Kumarasami,Anandakumar D,Mounigasri M,Keerthana R,Kishore Prasath Venkatesh,Bargava Subramanian,Kalyan Sivasailam,Revathi Ezhumalai,Abitha Marimuthu

Task: Propose a vision-language model (VLM) based on the SIGLIP encoder and the Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening.

Motivation: Integrating chest X-ray images with clinical data addresses the challenges of manual interpretation and improves diagnostic consistency and accessibility, particularly in resource-constrained settings.

Details

Method: The VLM architecture combines a Vision Transformer (ViT) for visual encoding with a transformer-based text encoder that processes clinical context such as patient histories and treatment records. Cross-modal attention aligns radiographic features with textual information, and the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned on 100,000 chronic TB-specific chest X-rays. Result: The model reached high precision (94%) and recall (94%) in detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. AUC scores exceeded 0.93 and IoU values were above 0.91, validating its effectiveness in detecting and localizing TB-related abnormalities. Conclusion: The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable, context-aware insights. Future work will address subtle pathologies and dataset biases to improve generalizability and ensure equitable performance across diverse populations and healthcare settings. Abstract: Background: This study proposes a Vision-Language Model (VLM) leveraging the SIGLIP encoder and Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening. By integrating chest X-ray images with clinical data, the model addresses the challenges of manual interpretation, improving diagnostic consistency and accessibility, particularly in resource-constrained settings. Methods: The VLM architecture combines a Vision Transformer (ViT) for visual encoding and a transformer-based text encoder to process clinical context, such as patient histories and treatment records. Cross-modal attention mechanisms align radiographic features with textual information, while the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned using 100,000 chronic TB-specific chest X-rays. Results: The model demonstrated high precision (94%) and recall (94%) for detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores exceeded 0.93, and Intersection over Union (IoU) values were above 0.91, validating its effectiveness in detecting and localizing TB-related abnormalities. Conclusion: The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable and context-aware insights. Future work will address subtle pathologies and dataset biases to enhance the model's generalizability, ensuring equitable performance across diverse populations and healthcare settings.
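
For readers unfamiliar with the reported metrics, the sketch below shows how precision, recall, and Intersection over Union are commonly computed. The abstract does not state whether IoU was measured on bounding boxes or on pixel masks, so the box form used here is an assumption, and the function names and the counts in the usage lines are illustrative only.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


# Made-up counts: 94 correct detections, 6 false alarms, 6 missed findings.
p, r = precision_recall(tp=94, fp=6, fn=6)
# Two 2x2 boxes that share a 1x1 corner.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Precision penalizes false alarms and recall penalizes missed findings, so a screening tool needs both to be high; IoU separately measures how well a predicted region coincides with the annotated one.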

Vision-Language Models for Acute Tuberculosis Diagnosis: A Multimodal Approach Combining Imaging and Clinical Data

Ananya Ganapthy,Praveen Shastry,Naveen Kumarasami,Anandakumar D,Keerthana R,Mounigasri M,Varshinipriya M,Kishore Prasath Venkatesh,Bargava Subramanian,Kalyan Sivasailam

Task: Use a vision-language model (VLM) built on the SIGLIP and Gemma-3b architectures for automated acute tuberculosis (TB) screening.

Motivation: Integrating chest X-ray images with clinical notes can improve diagnostic accuracy and efficiency, particularly in resource-limited settings.

Details

Method: The VLM combines visual data from chest X-rays with clinical context to generate detailed, context-aware diagnostic reports. The architecture uses SIGLIP for visual encoding and Gemma-3b for decoding to represent acute TB-specific pathologies and clinical insights. Result: Key acute TB pathologies, including consolidation, cavities, and nodules, were detected with high precision (97%) and recall (96%). The model showed strong spatial localization and robustness in distinguishing TB-positive cases, making it a reliable tool for acute TB diagnosis. Conclusion: The VLM's multimodal capability reduces reliance on radiologists and provides a scalable solution for acute TB screening. Future work will focus on detecting subtle pathologies and addressing dataset biases to improve generalizability across diverse global healthcare settings. Abstract: Background: This study introduces a Vision-Language Model (VLM) leveraging SIGLIP and Gemma-3b architectures for automated acute tuberculosis (TB) screening. By integrating chest X-ray images and clinical notes, the model aims to enhance diagnostic accuracy and efficiency, particularly in resource-limited settings. Methods: The VLM combines visual data from chest X-rays with clinical context to generate detailed, context-aware diagnostic reports. The architecture employs SIGLIP for visual encoding and Gemma-3b for decoding, ensuring effective representation of acute TB-specific pathologies and clinical insights. Results: Key acute TB pathologies, including consolidation, cavities, and nodules, were detected with high precision (97%) and recall (96%). The model demonstrated strong spatial localization capabilities and robustness in distinguishing TB-positive cases, making it a reliable tool for acute TB diagnosis. Conclusion: The multimodal capability of the VLM reduces reliance on radiologists, providing a scalable solution for acute TB screening. Future work will focus on improving the detection of subtle pathologies and addressing dataset biases to enhance its generalizability and application in diverse global healthcare settings.

AI-Driven Rapid Identification of Bacterial and Fungal Pathogens in Blood Smears of Septic Patients

Agnieszka Sroka-Oleksiak,Adam Pardyl,Dawid Rymarczyk,Aldona Olechowska-Jarząb,Katarzyna Biegun-Drożdż,Dorota Ochońska,Michał Wronka,Adriana Borowa,Tomasz Gosiewski,Miłosz Adamczyk,Henryk Telega,Bartosz Zieliński,Monika Brzychczy-Włoch

Task: Use deep learning algorithms to identify 14 bacterial species and 3 yeast-like fungi from Gram-stained smears.

Motivation: Traditional microbiological methods are time-consuming and expensive, while sepsis requires rapid diagnosis and treatment.

Details

Method: Uses the Cellpose 3 model for segmentation and Attention-based Deep Multiple Instance Learning for classification. Result: The model achieved an accuracy of 77.15% for bacteria and 71.39% for fungi, with ROC AUC of 0.97 and 0.88, respectively. Conclusion: The study confirms the model's potential for microbial classification but indicates the need for further optimization and a larger training dataset. Abstract: Sepsis is a life-threatening condition which requires rapid diagnosis and treatment. Traditional microbiological methods are time-consuming and expensive. In response to these challenges, deep learning algorithms were developed to identify 14 bacteria species and 3 yeast-like fungi from microscopic images of Gram-stained smears of positive blood samples from sepsis patients. A total of 16,637 Gram-stained microscopic images were used in the study. The analysis used the Cellpose 3 model for segmentation and Attention-based Deep Multiple Instance Learning for classification. Our model achieved an accuracy of 77.15% for bacteria and 71.39% for fungi, with ROC AUC of 0.97 and 0.88, respectively. The highest values, reaching up to 96.2%, were obtained for Cutibacterium acnes, Enterococcus faecium, Stenotrophomonas maltophilia and Nakaseomyces glabratus. Classification difficulties were observed in closely related species, such as Staphylococcus hominis and Staphylococcus haemolyticus, due to morphological similarity, and within Candida albicans due to high morphotic diversity. The study confirms the potential of our model for microbial classification, but it also indicates the need for further optimisation and expansion of the training data set. In the future, this technology could support microbial diagnosis, reducing diagnostic time and improving the effectiveness of sepsis treatment due to its simplicity and accessibility. Part of the results presented in this publication was covered by a patent application at the European Patent Office EP24461637.1 "A computer implemented method for identifying a microorganism in a blood and a data processing system therefor".

The Impact of Artificial Intelligence on Emergency Medicine: A Review of Recent Advances

Gustavo Correia,Victor Alves,Paulo Novais

Task: Review the applications of artificial intelligence in emergency imaging studies over the past five years.

Motivation: Explore the potential of AI in emergency medicine, particularly for improving diagnostic processes and patient outcomes.

Details

Method: Reviews and analyzes the literature of the past five years on AI in emergency imaging studies, with a focus on machine learning and deep learning techniques. Result: The reviewed studies show that AI performs well at accurately detecting conditions such as fractures, pneumothorax, and pulmonary diseases, and can predict clinical outcomes such as the need for mechanical ventilation. Conclusion: Despite challenges around data privacy, algorithmic bias, and the need for extensive validation, AI has transformative potential in emergency settings and should be combined with clinical expertise to raise patient care standards. Abstract: Artificial Intelligence (AI) is revolutionizing emergency medicine by enhancing diagnostic processes and improving patient outcomes. This article provides a review of the current applications of AI in emergency imaging studies, focusing on the last five years of advancements. AI technologies, particularly machine learning and deep learning, are pivotal in interpreting complex imaging data, offering rapid, accurate diagnoses and potentially surpassing traditional diagnostic methods. Studies highlighted within the article demonstrate AI's capabilities in accurately detecting conditions such as fractures, pneumothorax, and pulmonary diseases from various imaging modalities including X-rays, CT scans, and MRIs. Furthermore, AI's ability to predict clinical outcomes like mechanical ventilation needs illustrates its potential in crisis resource optimization. Despite these advancements, the integration of AI into clinical practice presents challenges such as data privacy, algorithmic bias, and the need for extensive validation across diverse settings. This review underscores the transformative potential of AI in emergency settings, advocating for a future where AI and clinical expertise synergize to elevate patient care standards.

Novel AI-Based Quantification of Breast Arterial Calcification to Predict Cardiovascular Risk

Theodorus Dapamede,Aisha Urooj,Vedant Joshi,Gabrielle Gershon,Frank Li,Mohammadreza Chavoshi,Beatrice Brown-Mulry,Rohan Satya Isaac,Aawez Mansuri,Chad Robichaux,Chadi Ayoub,Reza Arsanjani,Laurence Sperling,Judy Gichoya,Marly van Assen,Charles W. ONeill,Imon Banerjee,Hari Trivedi

Task: Identify women at risk for cardiovascular disease by automatically quantifying breast arterial calcification (BAC) on screening mammograms.

Motivation: Women are underdiagnosed and undertreated for cardiovascular disease; earlier identification and management can improve this.

Details

Method: A transformer-based neural network quantifies BAC severity (no BAC, mild, moderate, severe) on screening mammograms of 116,135 women. Result: BAC severity was independently associated with major adverse cardiovascular events (MACE), the association was significant in all age groups, and even mild BAC indicated increased risk in women under 50. Conclusion: Automated BAC quantification enables cardiovascular risk assessment during routine mammography without additional radiation or cost, with particular potential for early cardiovascular risk stratification in younger women. Abstract: Women are underdiagnosed and undertreated for cardiovascular disease. Automatic quantification of breast arterial calcification on screening mammography can identify women at risk for cardiovascular disease and enable earlier treatment and management of disease. In this retrospective study of 116,135 women from two healthcare systems, a transformer-based neural network quantified BAC severity (no BAC, mild, moderate, and severe) on screening mammograms. Outcomes included major adverse cardiovascular events (MACE) and all-cause mortality. BAC severity was independently associated with MACE after adjusting for cardiovascular risk factors, with increasing hazard ratios from mild (HR 1.18-1.22), moderate (HR 1.38-1.47), to severe BAC (HR 2.03-2.22) across datasets (all p<0.001). This association remained significant across all age groups, with even mild BAC indicating increased risk in women under 50. BAC remained an independent predictor when analyzed alongside ASCVD risk scores, showing significant associations with myocardial infarction, stroke, heart failure, and mortality (all p<0.005). Automated BAC quantification enables opportunistic cardiovascular risk assessment during routine mammography without additional radiation or cost. This approach provides value beyond traditional risk factors, particularly in younger women, offering potential for early CVD risk stratification in the millions of women undergoing annual mammography.

Synchronous vs Asynchronous Reinforcement Learning in a Real World Robot

Ali Parsaee,Fahim Shahriar,Chuxin He,Ruiqing Tan

Task: Compare the performance of asynchronous and synchronous reinforcement learning on a physical robot.

Motivation: Physical environments do not wait for a reinforcement learning agent to make decisions or perform updates. Existing RL algorithms do not account for this, which increases the agent's response time and can hurt learning performance. Asynchronous RL methods can address the problem, but comparisons on physical robots are scarce.

Details

Method: Compares asynchronous and synchronous reinforcement learning on a Franka Emika Panda robotic arm. Result: Agents using asynchronous RL learn faster and attain significantly higher returns. An agent with a faster response time outperforms one with a slower response time, even when the slower agent performs more gradient updates. Conclusion: Asynchronous RL outperforms synchronous RL on a physical robot, and response time has a significant effect on learning performance. Abstract: In recent times, reinforcement learning (RL) with physical robots has attracted the attention of a wide range of researchers. However, state-of-the-art RL algorithms do not consider that physical environments do not wait for the RL agent to make decisions or updates. RL agents learn by periodically conducting computationally expensive gradient updates. When decision-making and gradient update tasks are carried out sequentially by the RL agent in a physical robot, it significantly increases the agent's response time. In a rapidly changing environment, this increased response time may be detrimental to the performance of the learning agent. Asynchronous RL methods, which separate the computation of decision-making and gradient updates, are a potential solution to this problem. However, only a few comparisons between asynchronous and synchronous RL have been made with physical robots. For this reason, the exact performance benefits of using asynchronous RL methods over synchronous RL methods are still unclear. In this study, we provide a performance comparison between asynchronous and synchronous RL using a physical robotic arm called Franka Emika Panda. Our experiments show that the agents learn faster and attain significantly more returns using asynchronous RL. Our experiments also demonstrate that the learning agent with a faster response time performs better than the agent with a slower response time, even if the agent with a slower response time performs a higher number of gradient updates.
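
The response-time argument can be illustrated with a small, deterministic timing model. This is only a sketch of the reasoning in the abstract, not the authors' experimental setup: the function names, the fixed millisecond costs, and the update schedule are all hypothetical.

```python
def sync_response_times(n_steps, act_ms, update_ms, update_every):
    """Acting and learning share one loop, so an update delays that step's action."""
    times = []
    for step in range(1, n_steps + 1):
        latency = act_ms
        if step % update_every == 0:
            # The environment keeps changing while the gradient update runs.
            latency += update_ms
        times.append(latency)
    return times


def async_response_times(n_steps, act_ms):
    """Learning runs in a separate process or thread, so acting never waits for it."""
    return [act_ms] * n_steps


# Hypothetical costs: 10 ms to choose an action, 100 ms per gradient update,
# one update every second step.
sync = sync_response_times(n_steps=4, act_ms=10, update_ms=100, update_every=2)
asyn = async_response_times(n_steps=4, act_ms=10)
worst_case_gap_ms = max(sync) - max(asyn)
```

In the synchronous loop the worst-case latency grows with the cost of an update, whereas in the asynchronous arrangement it stays at the cost of action selection alone, which is the property the study links to better returns.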

Squeeze Out Tokens from Sample for Finer-Grained Data Governance

Weixiong Lin,Chen Ju,Haicheng Wang,Shengchao Hu,Shuai Xiao,Mengting Chen,Yuheng Jiao,Mingshuai Yao,Jinsong Lan,Qingwen Liu,Ying Chen

Task: Propose DataJuicer, a dual-branch data governance method that improves dataset quality through finer-grained, intra-sample governance.

Motivation: Existing data governance methods estimate each sample's contribution with heuristic scalar scores and discard low-value samples, yet the retained samples still contain many undesired tokens, leaving room for further compression and purification.

Details

Method: DataJuicer uses two branches: the vision branch retains salient image patches and extracts relevant object classes, and the text branch incorporates those classes into the captions, improving image-text alignment. Result: Experiments show that DataJuicer significantly outperforms the existing DataSieve approach on image-text retrieval, classification, and dense visual reasoning. Conclusion: Finer-grained data governance lets DataJuicer produce more refined datasets and significantly improve model performance. Abstract: Widely observed data scaling laws, in which error falls off as a power of the training size, demonstrate the diminishing returns of unselective data expansion. Hence, data governance is proposed to downsize datasets through pruning non-informative samples. Yet, isolating the impact of a specific sample on overall model performance is challenging, due to the vast computation required to try out all sample combinations. Current data governors circumvent this complexity by estimating sample contributions through heuristic-derived scalar scores, thereby discarding low-value ones. Despite thorough sample sieving, retained samples contain substantial undesired tokens intrinsically, underscoring the potential for further compression and purification. In this work, we upgrade data governance from a 'sieving' approach to a 'juicing' one. Instead of scanning for least-flawed samples, our dual-branch DataJuicer applies finer-grained intra-sample governance. It squeezes out informative tokens and boosts image-text alignments. Specifically, the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes to enhance captions. Consequently, DataJuicer yields more refined datasets through finer-grained governance. Extensive experiments across datasets demonstrate that DataJuicer significantly outperforms existing DataSieve in image-text retrieval, classification, and dense visual reasoning.

Analysis of human visual field information using machine learning methods and assessment of their accuracy

A. I. Medvedeva,V. V. Bakutkin

Task: Study methods for analyzing visual field (perimetric) images to diagnose and monitor glaucoma.

Motivation: The ophthalmological community is acutely concerned with disease control and import substitution, which calls for effective diagnostic methods.

Details

Method: Applies machine learning methods (stochastic gradient descent, logistic regression, random forest, naive Bayes) to classify image results. Result: A computer model that determines from an image whether the result indicates glaucoma or another disease (binary classification). Conclusion: The resulting classifier can classify glaucoma effectively, offering a new approach to its diagnosis and monitoring. Abstract: Subject of research: the study of methods for analyzing perimetric images for the diagnosis and control of glaucoma. Objects of research: a dataset collected on an ophthalmological perimeter with the results of various patient pathologies, since the ophthalmological community is acutely aware of the issue of disease control and import substitution [5]. Purpose of research: to consider various machine learning methods that can classify glaucoma. This is possible thanks to the classifier built after labeling the dataset. It is able to determine from the image whether the visual fields depicted in it result from the impact of glaucoma on the eyes or from other visual diseases. Earlier, in [3], a dataset collected on the Tomey perimeter was described. The average age of the examined patients ranged from 30 to 85 years. Methods of research: machine learning methods for classifying image results (stochastic gradient descent, logistic regression, random forest, naive Bayes). Main results of research: a computer model that can determine from the image whether the result is glaucoma or another disease (binary classification).

Three-dimensional Reconstruction of the Lumbar Spine with Submillimeter Accuracy Using Biplanar X-ray Images

Wanxin Yu,Zhemin Zhu,Cong Wang,Yihang Bao,Chunjie Xia,Rongshan Cheng,Yan Yu,Tsung-Yuan Tsai

Task: Develop and validate a fully automated method for high-accuracy 3D reconstruction of the lumbar spine from biplanar X-ray images.

Motivation: Current fully automated reconstruction methods have low accuracy and do not meet clinical application standards, so a high-accuracy 3D reconstruction method is needed.

Details

Method: The method performs lumbar decomposition and landmark detection on the raw X-ray images, followed by a deformable model and a landmark-weighted 2D-3D registration approach. Result: The proposed method achieved a 3D reconstruction accuracy of 0.80 mm, a significant improvement over mainstream approaches. Conclusion: The method will support the clinical diagnosis of lumbar spine diseases under weight-bearing conditions. Abstract: Three-dimensional reconstruction of the spine under weight-bearing conditions from biplanar X-ray images is of great importance for the clinical assessment of spinal diseases. However, the current fully automated reconstruction methods have low accuracy and fail to meet the clinical application standards. This study developed and validated a fully automated method for high-accuracy 3D reconstruction of the lumbar spine from biplanar X-ray images. The method involves lumbar decomposition and landmark detection from the raw X-ray images, followed by a deformable model and landmark-weighted 2D-3D registration approach. The reconstruction accuracy was validated by the gold standard obtained through the registration of CT-segmented vertebral models with the biplanar X-ray images. The proposed method achieved a 3D reconstruction accuracy of 0.80 mm, representing a significant improvement over the mainstream approaches. This study will contribute to the clinical diagnosis of lumbar spine diseases in weight-bearing positions.

Reinforcement learning-based motion imitation for physiologically plausible musculoskeletal motor control

Merkourios Simos,Alberto Silvio Chiappa,Alexander Mathis

Task: Present KINESIS, a model-free motion imitation framework, to advance the understanding of muscle-based motor control.

Motivation: Understanding human motion has broad applications in computer animation, motion synthesis, neuroscience, human prosthetics, and rehabilitation. Reinforcement learning has produced impressive results in capturing human motion, but controlling physiologically accurate models of the body remains an open challenge.

Details

Method: Using a lower-body musculoskeletal model with 80 muscle actuators and 20 degrees of freedom, KINESIS achieves strong imitation performance on 1.9 hours of motion capture data, can be controlled by natural language through pre-trained text-to-motion generative models, and can be fine-tuned to carry out high-level tasks. Result: KINESIS generates muscle activity patterns that correlate well with human EMG activity; this physiological plausibility makes it a promising model for tackling challenging problems in human motor control theory. Conclusion: KINESIS shows potential for understanding human motor control, illustrated by an investigation of Bernstein's redundancy problem in the context of locomotion. Abstract: How do humans move? The quest to understand human motion has broad applications in numerous fields, ranging from computer animation and motion synthesis to neuroscience, human prosthetics and rehabilitation. Although advances in reinforcement learning (RL) have produced impressive results in capturing human motion using simplified humanoids, controlling physiologically accurate models of the body remains an open challenge. In this work, we present a model-free motion imitation framework (KINESIS) to advance the understanding of muscle-based motor control. Using a musculoskeletal model of the lower body with 80 muscle actuators and 20 DoF, we demonstrate that KINESIS achieves strong imitation performance on 1.9 hours of motion capture data, is controllable by natural language through pre-trained text-to-motion generative models, and can be fine-tuned to carry out high-level tasks such as target goal reaching. Importantly, KINESIS generates muscle activity patterns that correlate well with human EMG activity. The physiological plausibility makes KINESIS a promising model for tackling challenging problems in human motor control theory, which we highlight by investigating Bernstein's redundancy problem in the context of locomotion. Code, videos and benchmarks will be available at https://github.com/amathislab/Kinesis.

Core-Periphery Principle Guided State Space Model for Functional Connectome Classification

Minheng Chen,Xiaowei Yu,Jing Zhang,Tong Chen,Chao Cao,Yan Zhuang,Yanjun Lyu,Lu Zhang,Tianming Liu,Dajiang Zhu

Task: Propose the Core-Periphery State-Space Model (CP-SSM), a framework for functional connectome classification.

Motivation: Traditional machine learning approaches struggle to capture the complex relationships between brain regions, while deep learning methods, particularly Transformer-based models, face computational challenges in long-sequence modeling.

Details

Method: Introduces Mamba, a selective state-space model with linear complexity, to capture long-range dependencies in functional brain networks, and designs CP-MoE, a core-periphery (CP) guided Mixture-of-Experts that improves representation learning of brain connectivity patterns. Result: Experiments on two benchmark fMRI datasets, ABIDE and ADNI, show that CP-SSM surpasses Transformer-based models in classification performance while significantly reducing computational complexity. Conclusion: CP-SSM is effective and efficient at modeling brain functional connectivity and offers a promising direction for neuroimaging-based diagnosis of neurological disease. Abstract: Understanding the organization of human brain networks has become a central focus in neuroscience, particularly in the study of functional connectivity, which plays a crucial role in diagnosing neurological disorders. Advances in functional magnetic resonance imaging and machine learning techniques have significantly improved brain network analysis. However, traditional machine learning approaches struggle to capture the complex relationships between brain regions, while deep learning methods, particularly Transformer-based models, face computational challenges due to their quadratic complexity in long-sequence modeling. To address these limitations, we propose a Core-Periphery State-Space Model (CP-SSM), an innovative framework for functional connectome classification. Specifically, we introduce Mamba, a selective state-space model with linear complexity, to effectively capture long-range dependencies in functional brain networks. Furthermore, inspired by the core-periphery (CP) organization, a fundamental characteristic of brain networks that enhances efficient information transmission, we design CP-MoE, a CP-guided Mixture-of-Experts that improves the representation learning of brain connectivity patterns. We evaluate CP-SSM on two benchmark fMRI datasets: ABIDE and ADNI. Experimental results demonstrate that CP-SSM surpasses Transformer-based models in classification performance while significantly reducing computational complexity. These findings highlight the effectiveness and efficiency of CP-SSM in modeling brain functional connectivity, offering a promising direction for neuroimaging-based neurological disease diagnosis.
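
The contrast between linear and quadratic complexity can be seen in a toy recurrence. The sketch below is a deliberately simplified scalar stand-in, not Mamba's actual parameterization and not the CP-SSM architecture: the update rule, the `gate` callable, and the function name are assumptions made for illustration. It shows only the two properties the abstract relies on, namely that the state is updated once per input (one pass over the sequence) and that the update depends on the current input ("selective").

```python
def selective_scan(xs, gate):
    """Toy input-dependent recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * x_t.

    xs:   sequence of floats, for example one region's signal over time.
    gate: callable mapping the current input to a retention factor a_t in [0, 1].
    Runs in a single pass, so cost grows linearly with len(xs).
    """
    h = 0.0
    outputs = []
    for x in xs:
        a = gate(x)                  # how much of the past state to keep
        h = a * h + (1.0 - a) * x    # blend retained state with the new input
        outputs.append(h)
    return outputs


# A constant gate of 0.5 yields an exponential moving average of the input.
smoothed = selective_scan([1.0, 1.0, 1.0], gate=lambda x: 0.5)
```

A self-attention layer, by comparison, scores every pair of positions, which is the source of the quadratic cost mentioned in the abstract.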

Rui Yang,Lin Song,Yicheng Xiao,Runhui Huang,Yixiao Ge,Ying Shan,Hengshuang Zhao

Task: Propose a simple yet efficient method for building a baseline native, end-to-end large multi-modal model in a single transformer.

Motivation: Most large multi-modal models (LMMs) model visual and textual modalities separately; native single-transformer LMMs are resource-intensive and often lag behind their compositional counterparts.

Details

Method: Proposes an early-fusion LMM that fuses multi-modal inputs at an early stage and responds to visual instructions auto-regressively, together with an efficient training recipe that exploits the prior knowledge of pre-trained models. Result: The proposed model outperforms other single-transformer LMMs and significantly narrows the performance gap with compositional LMMs. Conclusion: Early fusion combined with an efficient training recipe makes it possible to build a strong native, end-to-end LMM in a single transformer. Abstract: Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.

Bayesian Modeling of Zero-Shot Classifications for Urban Flood Detection

Matt Franchi,Nikhil Garg,Wendy Ju,Emma Pierson

Task: 提出一种两阶段方法（BayFlood）用于城市洪水检测，避免了对大规模标注数据集的需求。

Motivation: 街景数据集缺乏可靠的标签，且许多事件类型罕见，缺乏地面真实数据。

Details

Method: First performs zero-shot classification with a pretrained vision-language model (VLM), then fits a spatial Bayesian model on the VLM classifications. Result: VLMs provide strong zero-shot signal across multiple cities and time periods, the Bayesian model improves out-of-sample prediction, and the inferred flood risk correlates with known external predictors of risk. Conclusion: BayFlood can improve urban flood detection: it reveals high-risk populations overlooked by current methods, identifies demographic biases in existing methods, and suggests locations for new flood sensors. Abstract: Street scene datasets, collected from Street View or dashboard cameras, offer a promising means of detecting urban objects and incidents like street flooding. However, a major challenge in using these datasets is their lack of reliable labels: there are myriad types of incidents, many types occur rarely, and ground-truth measures of where incidents occur are lacking. Here, we propose BayFlood, a two-stage approach which circumvents this difficulty. First, we perform zero-shot classification of where incidents occur using a pretrained vision-language model (VLM). Second, we fit a spatial Bayesian model on the VLM classifications. The zero-shot approach avoids the need to annotate large training sets, and the Bayesian model provides frequent desiderata in urban settings - principled measures of uncertainty, smoothing across locations, and incorporation of external data like stormwater accumulation zones. We comprehensively validate this two-stage approach, showing that VLMs provide strong zero-shot signal for floods across multiple cities and time periods, the Bayesian model improves out-of-sample prediction relative to baseline methods, and our inferred flood risk correlates with known external predictors of risk. Having validated our approach, we show it can be used to improve urban flood detection: our analysis reveals 113,738 people who are at high risk of flooding overlooked by current methods, identifies demographic biases in existing methods, and suggests locations for new flood sensors. More broadly, our results showcase how Bayesian modeling of zero-shot LM annotations represents a promising paradigm because it avoids the need to collect large labeled datasets and leverages the power of foundation models while providing the expressiveness and uncertainty quantification of Bayesian models.
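
As a rough illustration of the second stage, the sketch below fits an independent Beta-Binomial posterior per location to zero-shot flood labels. It is a deliberately simplified stand-in, not the spatial Bayesian model used by BayFlood: there is no smoothing across locations and no external covariates, and the function name, the uniform prior, and the counts are assumptions made for the example.

```python
def beta_posterior(positives, total, prior_a=1.0, prior_b=1.0):
    """Posterior mean and variance of a flood rate under a Beta(prior_a, prior_b) prior.

    positives: number of images the VLM labelled as flooded at one location
    total:     number of images classified at that location
    """
    if not 0 <= positives <= total:
        raise ValueError("need 0 <= positives <= total")
    a = prior_a + positives
    b = prior_b + (total - positives)
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance


# Hypothetical counts: two locations with the same raw rate but different evidence.
few_mean, few_var = beta_posterior(1, 2)
many_mean, many_var = beta_posterior(50, 100)
```

Both locations have the same posterior mean, but the one with more classified images has a much smaller variance, which is the kind of principled uncertainty measure the abstract refers to.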

SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis

Hou In Ivan Tam,Hou In Derek Pun,Austin T. Wang,Angel X. Chang,Manolis Savva

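Task: Propose SceneEval, an evaluation framework for text-conditioned 3D indoor scene generation methods.

Motivation: Existing evaluation methods focus mainly on the realism of generated scenes and overlook alignment with the input text, a critical factor in determining how effectively a method meets user requirements.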

Details

Method: Proposes SceneEval, an evaluation framework with metrics for explicit user requirements (such as the presence of specific objects and their attributes) and implicit expectations (such as the absence of object collisions), and introduces the SceneEval-100 dataset for evaluating scene generation methods. Result: The evaluation shows that current methods struggle to generate scenes that meet user requirements. Conclusion: Further research is needed to improve text-conditioned 3D indoor scene generation. Abstract: Despite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics primarily assess the realism of generated scenes by comparing them to a set of ground-truth scenes, often overlooking alignment with the input text - a critical factor in determining how effectively a method meets user requirements. We present SceneEval, an evaluation framework designed to address this limitation. SceneEval includes metrics for both explicit user requirements, such as the presence of specific objects and their attributes described in the input text, and implicit expectations, like the absence of object collisions, providing a comprehensive assessment of scene quality. To facilitate evaluation, we introduce SceneEval-100, a dataset of scene descriptions with annotated ground-truth scene properties. We evaluate recent scene generation methods using SceneEval and demonstrate its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions. Our results show that current methods struggle to generate scenes that meet user requirements, underscoring the need for further research in this direction.
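
To make the "implicit expectation" of collision-free scenes concrete, here is a minimal sketch of a collision check over axis-aligned bounding boxes. The abstract does not say how SceneEval detects collisions, so the box representation, the function names, and the example scene are all assumptions for illustration; a real metric might operate on meshes instead.

```python
from itertools import combinations


def boxes_overlap(box_a, box_b):
    """True if two axis-aligned boxes intersect with positive volume.

    Each box is ((min_x, min_y, min_z), (max_x, max_y, max_z)).
    """
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    return all(a_min[i] < b_max[i] and b_min[i] < a_max[i] for i in range(3))


def colliding_pairs(objects):
    """Return name pairs whose boxes overlap; `objects` maps name -> box."""
    return [
        (name_a, name_b)
        for name_a, name_b in combinations(sorted(objects), 2)
        if boxes_overlap(objects[name_a], objects[name_b])
    ]


# Hypothetical scene: the nightstand pokes into the bed, the lamp stands clear.
scene = {
    "bed": ((0.0, 0.0, 0.0), (2.0, 1.5, 0.6)),
    "nightstand": ((1.8, 0.2, 0.0), (2.3, 0.7, 0.5)),
    "lamp": ((3.0, 3.0, 0.0), (3.2, 3.2, 1.5)),
}
collisions = colliding_pairs(scene)
```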

Involution and BSConv Multi-Depth Distillation Network for Lightweight Image Super-Resolution

Akram Khatami-Rizi,Ahmad Mahmoudi-Aznaveh

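Task: Reconstruct high-resolution images from low-resolution inputs.

Motivation: Deep learning, especially convolutional neural networks (CNNs), has advanced single image super-resolution (SISR), but increasing network depth increases parameters and memory usage and slows training, which is problematic for resource-limited devices.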

Details

Method: Proposes the Involution & BSConv Multi-Depth Distillation Network (IBMDN), which combines the Involution & BSConv Multi-Depth Distillation Block (IBMDB) with the Contrast and High-Frequency Attention Block (CHFAB). IBMDB integrates Involution and BSConv to balance computational efficiency and feature extraction; CHFAB enhances high-frequency details for better visual quality. Result: Experiments show that the method achieves high accuracy with minimal computational cost. Conclusion: IBMDB reduces complexity while improving evaluation metrics such as PSNR and SSIM, reduces memory usage in transformer-based models, and enhances perceptual quality in GANs. Abstract: Single Image Super-Resolution (SISR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs. Deep learning, especially Convolutional Neural Networks (CNNs), has advanced SISR. However, increasing network depth increases parameters and memory usage and slows training, which is problematic for resource-limited devices. To address this, lightweight models are developed to balance accuracy and efficiency. We propose the Involution & BSConv Multi-Depth Distillation Network (IBMDN), combining Involution & BSConv Multi-Depth Distillation Block (IBMDB) and the Contrast and High-Frequency Attention Block (CHFAB). IBMDB integrates Involution and BSConv to balance computational efficiency and feature extraction. CHFAB enhances high-frequency details for better visual quality. IBMDB is compatible with other SISR architectures and reduces complexity, improving evaluation metrics like PSNR and SSIM. In transformer-based models, IBMDB reduces memory usage while improving feature extraction. In GANs, it enhances perceptual quality, balancing pixel-level accuracy with perceptual details. Our experiments show that the method achieves high accuracy with minimal computational cost. The code is available at GitHub.

On the Robustness Tradeoff in Fine-Tuning

Kunyang Li,Jean-Charles Noirot Ferrand,Ryan Sheatsley,Blaine Hoak,Yohan Beugin,Eric Pauley,Patrick McDaniel

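Task: Study how fine-tuning pre-trained models affects robustness on downstream tasks.

Motivation: Fine-tuning has become the standard practice for adapting pre-trained models to downstream tasks, but its impact on model robustness is not well understood.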

Details

Method: Evaluates the robustness and accuracy of fine-tuned models over 6 benchmark datasets and 7 different fine-tuning strategies. Result: A consistent trade-off between adversarial robustness and accuracy is observed; peripheral updates such as BitFit are more effective on simple tasks, whereas on more complex tasks, fine-tuning information-heavy layers (such as attention layers) via Compacter achieves a better Pareto frontier. Conclusion: Robustness-aware fine-tuning is needed to ensure reliable real-world deployments. Abstract: Fine-tuning has become the standard practice for adapting pre-trained (upstream) models to downstream tasks. However, the impact on model robustness is not well understood. In this work, we characterize the robustness-accuracy trade-off in fine-tuning. We evaluate the robustness and accuracy of fine-tuned models over 6 benchmark datasets and 7 different fine-tuning strategies. We observe a consistent trade-off between adversarial robustness and accuracy. Peripheral updates such as BitFit are more effective for simple tasks--over 75% above the average measured with area under the Pareto frontiers on CIFAR-10 and CIFAR-100. In contrast, fine-tuning information-heavy layers, such as attention layers via Compacter, achieves a better Pareto frontier on more complex tasks--57.5% and 34.6% above the average on Caltech-256 and CUB-200, respectively. Lastly, we observe that robustness of fine-tuning against out-of-distribution data closely tracks accuracy. These insights emphasize the need for robustness-aware fine-tuning to ensure reliable real-world deployments.
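
The "area under the Pareto frontier" summary can be sketched as follows. This is one plausible reading, assuming each checkpoint yields a non-negative (accuracy, robustness) pair and that the area is the region dominated by the frontier, measured from the origin; the paper's exact definition may differ, and the function names and sample points are hypothetical.

```python
def pareto_frontier(points):
    """Keep the (accuracy, robustness) points that no other point beats on both axes."""
    frontier = []
    best_robustness = float("-inf")
    # Walk from highest to lowest accuracy; a point survives only if it
    # improves on the best robustness seen so far.
    for accuracy, robustness in sorted(points, key=lambda p: (-p[0], -p[1])):
        if robustness > best_robustness:
            frontier.append((accuracy, robustness))
            best_robustness = robustness
    return sorted(frontier)


def area_under_frontier(points):
    """Area of the region dominated by the frontier, measured from (0, 0)."""
    area = 0.0
    previous_accuracy = 0.0
    for accuracy, robustness in pareto_frontier(points):
        area += (accuracy - previous_accuracy) * robustness
        previous_accuracy = accuracy
    return area


# Hypothetical checkpoints of one fine-tuning strategy.
checkpoints = [(0.9, 0.2), (0.8, 0.5), (0.7, 0.4), (0.6, 0.6)]
frontier = pareto_frontier(checkpoints)
area = area_under_frontier(checkpoints)
```

Here the point (0.7, 0.4) is dropped because (0.8, 0.5) beats it on both axes; a strategy with a larger area offers a better overall trade-off under this definition.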

ClimateGS: Real-Time Climate Simulation with 3D Gaussian Style Transfer

Yuezhen Xie,Meiying Zhang,Qi Hao

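Task: Propose ClimateGS, a new framework for real-time rendering of climate effects.

Motivation: Adverse climate conditions pose significant challenges for autonomous systems, which need reliable perception and decision-making. Existing physically-based NeRF rendering methods generate realistic scene representations, but their slow rendering and long preprocessing times make them impractical for real-time testing and user interaction.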

Details

Method: Develops a linear transformation for 3D Gaussian photorealistic style transfer, a joint training strategy that combines supervised and self-supervised learning, and a real-time rendering method for climate simulation. Result: Evaluated on MipNeRF360 and Tanks and Temples, ClimateGS achieves visual quality comparable or superior to existing 2D/3D methods and is suitable for interactive applications. Conclusion: ClimateGS renders climate effects efficiently and realistically, making it suitable for real-time applications. Abstract: Adverse climate conditions pose significant challenges for autonomous systems, demanding reliable perception and decision-making across diverse environments. To better simulate these conditions, physically-based NeRF rendering methods have been explored for their ability to generate realistic scene representations. However, these methods suffer from slow rendering speeds and long preprocessing times, making them impractical for real-time testing and user interaction. This paper presents ClimateGS, a novel framework integrating 3D Gaussian representations with physical simulation to enable real-time climate effects rendering. The novelty of this work is threefold: 1) developing a linear transformation for 3D Gaussian photorealistic style transfer, enabling direct modification of spherical harmonics across bands for efficient and consistent style adaptation; 2) developing a joint training strategy for 3D style transfer, combining supervised and self-supervised learning to accelerate convergence while preserving original scene details; 3) developing a real-time rendering method for climate simulation, integrating physics-based effects with 3D Gaussian to achieve efficient and realistic rendering. We evaluate ClimateGS on MipNeRF360 and Tanks and Temples, demonstrating real-time rendering with comparable or superior visual quality to SOTA 2D/3D methods, making it suitable for interactive applications.

Exploring the Limits of KV Cache Compression in Visual Autoregressive Transformers

Bo Chen,Xiaoyu Li,Yekun Ke,Yingyu Liang,Zhenmei Shi,Zhao Song

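Task: Formally define the KV-cache compression problem for Visual Autoregressive transformers and establish a memory lower bound for it.

Motivation: Visual Autoregressive models incur substantial memory overhead during inference to store previously generated representations, and prior compression work has not explicitly formalized KV-cache compression in this context.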

Details

Method: Proves the lower bound via a reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction principles. Result: Any mechanism for sequential visual token generation under attention-based architectures must use at least $\Omega(n^2 d)$ memory when $d = \Omega(\log n)$. Conclusion: Truly sub-quadratic memory usage is impossible without additional structural constraints; sparsity priors on visual representations can influence memory efficiency. Abstract: A fundamental challenge in Visual Autoregressive models is the substantial memory overhead required during inference to store previously generated representations. Despite various attempts to mitigate this issue through compression techniques, prior works have not explicitly formalized the problem of KV-cache compression in this context. In this work, we take the first step in formally defining the KV-cache compression problem for Visual Autoregressive transformers. We then establish a fundamental negative result, proving that any mechanism for sequential visual token generation under attention-based architectures must use at least $\Omega(n^2 d)$ memory, when $d = \Omega(\log n)$, where $n$ is the number of tokens generated and $d$ is the embedding dimensionality. This result demonstrates that achieving truly sub-quadratic memory usage is impossible without additional structural constraints. Our proof is constructed via a reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction principles. Finally, we discuss how sparsity priors on visual representations can influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.
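
Unpacking the asymptotic notation, the negative result can be read roughly as follows. This is only a restatement of the bound quoted in the abstract; the symbol $M(n, d)$ for the memory used and the constants $c$, $c'$, $n_0$ are names introduced here for illustration, and the paper's exact formulation may differ.

```latex
\exists\, c, c' > 0,\ n_0 \in \mathbb{N} \ \text{such that}\quad
\forall\, n \ge n_0,\ \forall\, d \ge c' \log n:\qquad
M(n, d) \;\ge\; c\, n^{2} d .
```

Read this way, as long as $d$ stays in this regime, doubling the number of generated tokens $n$ quadruples the lower bound.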

He Huang,Yong Chen,Yujun Guo,Wei He

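Task: Propose U2K, a self-supervised unknown-to-known degradation transformation framework for blind hyperspectral image fusion.

Motivation: Existing supervised learning methods perform well when the degradation of the test data matches that of the training data, but they struggle to generalize to unknown degradations.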

Details

Method: Proposes the U2K framework, consisting of spatial and spectral Degradation Wrapping (DW) modules and Degradation Transformation (DT) modules, trained in a self-supervised manner with a consistency loss and greedy alternating optimization. Result: U2K significantly improves the flexibility of blind hyperspectral image fusion, boosts the adaptability of five existing supervised learning methods under various degradation settings, and surpasses existing blind methods. Conclusion: U2K effectively improves the adaptability and performance of blind hyperspectral image fusion. Abstract: Hyperspectral image (HSI) fusion is an efficient technique that combines low-resolution HSI (LR-HSI) and high-resolution multispectral images (HR-MSI) to generate high-resolution HSI (HR-HSI). Existing supervised learning methods (SLMs) can yield promising results when the test data degradation matches that of the training data, but they face challenges in generalizing to unknown degradations. To unleash the potential and generalization ability of SLMs, we propose a novel self-supervised unknown-to-known degradation transformation framework (U2K) for blind HSI fusion, which adaptively transforms unknown degradation into the same type of degradation as those handled by pre-trained SLMs. Specifically, the proposed U2K framework consists of: (1) spatial and spectral Degradation Wrapping (DW) modules that map HR-HSI to unknown degraded HR-MSI and LR-HSI, and (2) Degradation Transformation (DT) modules that convert these wrapped data into predefined degradation patterns. The transformed HR-MSI and LR-HSI pairs are then processed by a pre-trained network to reconstruct the target HR-HSI. We train the U2K framework in a self-supervised manner using consistency loss and greedy alternating optimization, significantly improving the flexibility of blind HSI fusion. Extensive experiments confirm the effectiveness of our proposed U2K framework in boosting the adaptability of five existing SLMs under various degradation settings and surpassing state-of-the-art blind methods.

FetalFlex: Anatomy-Guided Diffusion Model for Flexible Control on Fetal Ultrasound Image Synthesis

Yaofei Duan,Tao Tan,Zhiyuan Zhu,Yuhao Huang,Yuanji Zhang,Rui Gao,Patrick Cheong-Iao Pang,Xinru Gao,Guowei Tao,Xiang Cong,Zhou Li,Lianying Liang,Guangzhi He,Linliang Yin,Xuedong Deng,Xin Yang,Dong Ni

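Task: Propose FetalFlex, a flexible fetal ultrasound image generation framework, to address the difficulty of obtaining multi-plane annotated fetal ultrasound datasets.

Motivation: Obtaining a comprehensive, multi-plane annotated fetal ultrasound dataset is challenging, particularly for rare or complex anomalies, which makes it difficult to train novice radiologists and to develop robust AI models.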

Details

Method: FetalFlex leverages anatomical structures and multimodal information, uses a pre-alignment module and a repaint strategy for controllable synthesis of fetal ultrasound images, and adopts a two-stage adaptive sampling strategy to progressively refine image quality. Result: Experiments on multi-center datasets show that FetalFlex achieves state-of-the-art performance across multiple image quality metrics, and that its synthetic images significantly improve the performance of six typical deep models in downstream classification and anomaly detection tasks. Conclusion: FetalFlex's anatomy-level controllable generation offers a unique advantage for anomaly simulation and for creating paired or counterfactual data at the pixel level. Abstract: Fetal ultrasound (US) examinations require the acquisition of multiple planes, each providing unique diagnostic information to evaluate fetal development and screen for congenital anomalies. However, obtaining a comprehensive, multi-plane annotated fetal US dataset remains challenging, particularly for rare or complex anomalies owing to their low incidence and numerous subtypes. This poses difficulties in training novice radiologists and developing robust AI models, especially for detecting abnormal fetuses. In this study, we introduce a Flexible Fetal US image generation framework (FetalFlex) to address these challenges, which leverages anatomical structures and multimodal information to enable controllable synthesis of fetal US images across diverse planes. Specifically, FetalFlex incorporates a pre-alignment module to enhance controllability and introduces a repaint strategy to ensure consistent texture and appearance. Moreover, a two-stage adaptive sampling strategy is developed to progressively refine image quality from coarse to fine levels. We believe that FetalFlex is the first method capable of generating both in-distribution normal and out-of-distribution abnormal fetal US images, without requiring any abnormal data. Experiments on multi-center datasets demonstrate that FetalFlex achieved state-of-the-art performance across multiple image quality metrics. A reader study further confirms the close alignment of the generated results with expert visual assessments. Furthermore, synthetic images by FetalFlex significantly improve the performance of six typical deep models in downstream classification and anomaly detection tasks. Lastly, FetalFlex's anatomy-level controllable generation offers a unique advantage for anomaly simulation and creating paired or counterfactual data at the pixel level. The demo is available at: https://dyf1023.github.io/FetalFlex/.

POSTA: A Go-to Framework for Customized Artistic Poster Generation

Haoyu Chen,Xiaojie Xu,Wenbo Li,Jingjing Ren,Tian Ye,Songhua Liu,Ying-Cong Chen,Lei Zhu,Xinchao Wang

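Task: Propose POSTA, a modular framework powered by diffusion models and multimodal large language models, for customized artistic poster generation.

Motivation: Existing automatic poster design methods fall short in text accuracy, user customization, and aesthetic appeal, which limits their applicability in artistic domains such as movies and exhibitions.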

Details

Method: POSTA consists of three modules: Background Diffusion creates a themed background based on user input, Design MLLM generates layout and typography elements that match the background style, and ArtText Diffusion applies additional stylization to key text elements. Result: POSTA outperforms existing models in both text accuracy and aesthetic quality, demonstrating exceptional controllability and design diversity. Conclusion: Through its modular design and use of multimodal models, POSTA addresses the limitations of existing automatic poster design methods and produces visually cohesive, appealing, customized artistic posters. Abstract: Poster design is a critical medium for visual communication. Prior work has explored automatic poster design using deep learning techniques, but these approaches lack text accuracy, user customization, and aesthetic appeal, limiting their applicability in artistic domains such as movies and exhibitions, where both clear content delivery and visual impact are essential. To address these limitations, we present POSTA: a modular framework powered by diffusion models and multimodal large language models (MLLMs) for customized artistic poster generation. The framework consists of three modules. Background Diffusion creates a themed background based on user input. Design MLLM then generates layout and typography elements that align with and complement the background style. Finally, to enhance the poster's aesthetic appeal, ArtText Diffusion applies additional stylization to key text elements. The final result is a visually cohesive and appealing poster, with a fully modular process that allows for complete customization. To train our models, we develop the PosterArt dataset, comprising high-quality artistic posters annotated with layout, typography, and pixel-level stylized text segmentation. Our comprehensive experimental analysis demonstrates POSTA's exceptional controllability and design diversity, outperforming existing models in both text accuracy and aesthetic quality.

A Language Vision Model Approach for Automated Tumor Contouring in Radiation Oncology

Yi Luo,Hamed Hooshangnejad,Xue Feng,Gaofeng Huang,Xiaojian Chen,Rui Zhang,Quan Chen,Wil Ngwa,Kai Ding

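Task: Develop the Oncology Contouring Copilot (OCC) system, which combines the strengths of AI with human oversight to perform precise tumor contouring from textual descriptions and increase the efficiency of oncological workflows.

Motivation: Lung cancer is the leading cause of cancer-related mortality worldwide. Tumor delineation, which is crucial for radiation therapy, requires expertise that is often unavailable in resource-limited settings. AI, particularly advances in deep learning and natural language processing, offers potential solutions, but high false positive rates remain a challenge.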

Details

Method: OCC first identifies nodule candidates from CT scans, then uses Language Vision Models (LVMs) such as GPT-4V together with clinical descriptive texts to reduce false positives, merging textual and visual data to automate tumor delineation and incorporating knowledge from experienced domain experts to raise the quality of oncology care. Result: Deploying OCC reduced the false discovery rate by 35.0%, decreased false positives per scan by 72.4%, and achieved an F1-score of 0.652 across the authors' dataset for unbiased evaluation. Conclusion: OCC represents a significant advance in oncology care, particularly through the use of the latest LVMs to improve contouring results: (1) it streamlines oncology treatment workflows by optimizing tumor delineation and reducing manual processes; (2) it offers a scalable and intuitive framework for reducing false positives in radiotherapy planning using LVMs; (3) it introduces novel medical language vision prompt techniques that minimize LVM hallucinations, supported by an ablation study; and (4) it provides a comparative analysis of LVMs, highlighting their potential in addressing medical language vision challenges. Abstract: Background: Lung cancer ranks as the leading cause of cancer-related mortality worldwide. The complexity of tumor delineation, crucial for radiation therapy, requires expertise often unavailable in resource-limited settings. Artificial Intelligence (AI), particularly with advancements in deep learning (DL) and natural language processing (NLP), offers potential solutions yet is challenged by high false positive rates. Purpose: The Oncology Contouring Copilot (OCC) system is developed to leverage oncologist expertise for precise tumor contouring using textual descriptions, aiming to increase the efficiency of oncological workflows by combining the strengths of AI with human oversight. Methods: Our OCC system initially identifies nodule candidates from CT scans. Employing Language Vision Models (LVMs) like GPT-4V, OCC then effectively reduces false positives with clinical descriptive texts, merging textual and visual data to automate tumor delineation, designed to elevate the quality of oncology care by incorporating knowledge from experienced domain experts. Results: Deployments of the OCC system resulted in a significant reduction in the false discovery rate by 35.0%, a 72.4% decrease in false positives per scan, and an F1-score of 0.652 across our dataset for unbiased evaluation. Conclusions: OCC represents a significant advance in oncology care, particularly through the use of the latest LVMs to improve contouring results by (1) streamlining oncology treatment workflows by optimizing tumor delineation, reducing manual processes; (2) offering a scalable and intuitive framework to reduce false positives in radiotherapy planning using LVMs; (3) introducing novel medical language vision prompt techniques to minimize LVM hallucinations with ablation study, and (4) conducting a comparative analysis of LVMs, highlighting their potential in addressing medical language vision challenges.
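
For readers less familiar with the detection metrics quoted above, the sketch below spells out their standard definitions. All counts are made up for illustration and are not taken from the paper; note also that the abstract does not state whether the 35.0% figure is an absolute or a relative change, so the helper shows the relative form only as one possible reading.

```python
def false_discovery_rate(true_positives, false_positives):
    """FDR = FP / (FP + TP): the share of flagged candidates that are not real nodules."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no candidates were flagged")
    return false_positives / flagged


def false_positives_per_scan(false_positives, num_scans):
    """Average number of false candidates raised per CT scan."""
    if num_scans <= 0:
        raise ValueError("num_scans must be positive")
    return false_positives / num_scans


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after` (0.25 means 25% lower)."""
    return (before - after) / before


# Hypothetical counts before and after a language-vision filtering step.
fdr_before = false_discovery_rate(true_positives=80, false_positives=120)
fdr_after = false_discovery_rate(true_positives=76, false_positives=30)
fpps_before = false_positives_per_scan(120, num_scans=40)
fpps_after = false_positives_per_scan(30, num_scans=40)
fpps_drop = relative_reduction(fpps_before, fpps_after)
```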

A Novel Channel Boosted Residual CNN-Transformer with Regional-Boundary Learning for Breast Cancer Detection

Aamir Mehmood,Yue Hu,Saddam Hussain Khan

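Task: Propose CB-Res-RBCMT, a new hybrid framework for detailed analysis of breast ultrasound images (BUSI).

Motivation: Deep CNNs and vision transformers (ViTs) have shown promising initial results in detecting breast cancer from ultrasound images, but model complexity and variations in image contrast, texture, and tumor morphology limit the effectiveness of current methods.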

Details

Method: Combines customized residual CNNs with new ViT components in the CB-Res-RBCMT framework, using CMT blocks and new Regional and Boundary (RB) feature extraction operations to capture contrast and morphological variations. The CMT block adds global contextual interactions through multi-head attention and improves computational efficiency with a lightweight design. Customized inverse residual and stem CNNs within the CMT extract local texture information and handle vanishing gradients. A new channel-boosted (CB) strategy enriches the feature diversity of the limited dataset by combining the original RBCMT channels with feature maps generated by transfer learning-based residual CNNs. Result: On the standard harmonized stringent BUSI dataset, CB-Res-RBCMT achieves an F1-score of 95.57%, accuracy of 95.63%, sensitivity of 96.42%, and precision of 94.79%, outperforming existing ViT and CNN methods. Conclusion: The proposed integrated CNN-Transformer framework captures diverse features and delivers superior performance in BUSI cancer diagnosis. Abstract: Recent advancements in detecting tumors using deep learning on breast ultrasound images (BUSI) have demonstrated significant success. Deep CNNs and vision-transformers (ViTs) have demonstrated individually promising initial performance. However, challenges related to model complexity and contrast, texture, and tumor morphology variations introduce uncertainties that hinder the effectiveness of current methods. This study introduces a novel hybrid framework, CB-Res-RBCMT, combining customized residual CNNs and new ViT components for detailed BUSI cancer analysis. The proposed RBCMT uses stem convolution blocks with CNN Meet Transformer (CMT) blocks, followed by new Regional and boundary (RB) feature extraction operations for capturing contrast and morphological variations. Moreover, the CMT block incorporates global contextual interactions through multi-head attention, enhancing computational efficiency with a lightweight design. Additionally, the customized inverse residual and stem CNNs within the CMT effectively extract local texture information and handle vanishing gradients. Finally, the new channel-boosted (CB) strategy enriches the feature diversity of the limited dataset by combining the original RBCMT channels with transfer learning-based residual CNN-generated maps. These diverse channels are processed through a spatial attention block for optimal pixel selection, reducing redundancy and improving the discrimination of minor contrast and texture variations. The proposed CB-Res-RBCMT achieves an F1-score of 95.57%, accuracy of 95.63%, sensitivity of 96.42%, and precision of 94.79% on the standard harmonized stringent BUSI dataset, outperforming existing ViT and CNN methods. These results demonstrate the versatility of our integrated CNN-Transformer framework in capturing diverse features and delivering superior performance in BUSI cancer diagnosis.
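
As a quick sanity check on how the reported metrics relate, the sketch below computes the F1-score as the harmonic mean of the quoted precision and sensitivity. It assumes the figures refer to a single positive class; the abstract does not say how they were averaged, and the function name is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (recall is also called sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, expressed as fractions.
reported_precision = 0.9479
reported_sensitivity = 0.9642
f1_from_reported = f1_score(reported_precision, reported_sensitivity)
```

The harmonic mean comes out at roughly 0.956, close to the reported F1-score of 95.57%; the small gap could stem from rounding or from per-class averaging, which the abstract does not specify.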

DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling

Jianbo Zhao,Taiyu Ban,Zhihao Liu,Hangning Zhou,Xiyang Wang,Qibin Zhou,Hailong Qin,Mu Yang,Lei Liu,Bin Li

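Task: Propose Directional Rotary Position Embedding (DRoPE), a new method for improving trajectory generation in autonomous driving systems.

Motivation: Existing scene-centric, agent-centric, and query-centric frameworks face an impossible triangle among accuracy, computational time, and memory efficiency; a new approach is needed to break this limitation.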

Details

Method: Proposes DRoPE, which introduces a uniform identity scalar into RoPE's 2D rotary transformation so that rotation angles align with realistic agent headings and relative angular information is encoded naturally. Result: Theoretical analysis and empirical evaluation show that DRoPE can simultaneously optimize trajectory generation accuracy, time complexity, and space complexity, with significantly reduced space complexity. Conclusion: DRoPE performs well both in theory and in practice and effectively addresses the limitations of existing trajectory generation methods. Abstract: Accurate and efficient modeling of agent interactions is essential for trajectory generation, the core of autonomous driving systems. Existing methods, scene-centric, agent-centric, and query-centric frameworks, each present distinct advantages and drawbacks, creating an impossible triangle among accuracy, computational time, and memory efficiency. To break this limitation, we propose Directional Rotary Position Embedding (DRoPE), a novel adaptation of Rotary Position Embedding (RoPE), originally developed in natural language processing. Unlike traditional relative position embedding (RPE), which introduces significant space complexity, RoPE efficiently encodes relative positions without explicitly increasing complexity but faces inherent limitations in handling angular information due to periodicity. DRoPE overcomes this limitation by introducing a uniform identity scalar into RoPE's 2D rotary transformation, aligning rotation angles with realistic agent headings to naturally encode relative angular information. We theoretically analyze DRoPE's correctness and efficiency, demonstrating its capability to simultaneously optimize trajectory generation accuracy, time complexity, and space complexity. Empirical evaluations against various state-of-the-art trajectory generation models confirm DRoPE's good performance and significantly reduced space complexity, indicating both theoretical soundness and practical effectiveness. The video documentation is available at https://drope-traj.github.io/.
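
The sketch below illustrates the property the abstract attributes to DRoPE, under our reading that every 2D feature pair is rotated by the agent's heading itself (one shared scale of 1, in place of RoPE's per-pair frequencies). It is not the authors' implementation; the function names and vectors are hypothetical. With this choice, the query-key score depends only on the relative heading and respects the 2π periodicity of angles.

```python
import math


def rotate_pairs(vec, angle):
    """Rotate each consecutive (x, y) pair of `vec` by the same angle (radians)."""
    if len(vec) % 2:
        raise ValueError("vector length must be even")
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out.extend((c * x - s * y, s * x + c * y))
    return out


def heading_score(query, key, heading_q, heading_k):
    """Dot product of a query and a key after rotating each by its agent's heading."""
    rq = rotate_pairs(query, heading_q)
    rk = rotate_pairs(key, heading_k)
    return sum(a * b for a, b in zip(rq, rk))


# Hypothetical 4-dimensional query and key.
q = [1.0, 0.0, 0.5, 2.0]
k = [0.3, -1.0, 2.0, 1.0]
base = heading_score(q, k, 0.4, 1.1)
shifted = heading_score(q, k, 0.4 + 0.9, 1.1 + 0.9)    # both agents turn by 0.9 rad
wrapped = heading_score(q, k, 0.4, 1.1 + 2 * math.pi)  # same heading, one full turn later
```

Because both rotations use the same angle scale, turning both agents by the same amount or adding a full turn to one heading leaves the score unchanged, and no pairwise relative-angle table has to be stored.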

Texture-Aware StarGAN for CT data harmonisation

Francesco Di Feola,Ludovica Pompilio,Cecilia Assolito,Valerio Guarrasi,Paolo Soda

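Task: Propose a novel texture-aware StarGAN for CT data harmonization, enabling one-to-many translations across different reconstruction kernels.

Motivation: CT plays a pivotal role in medical diagnosis, but variability across reconstruction kernels prevents data-driven approaches such as deep learning models from achieving reliable and generalized performance. CT data harmonization, which standardizes data across different sources or conditions, is a promising way to reduce such non-biological variance.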

Details

Method: Proposes a texture-aware StarGAN model and introduces a multi-scale texture loss function that embeds texture information across different spatial and angular scales into the harmonization process. Result: Extensive experiments on a publicly available dataset of 48667 chest CT slices from 197 patients, distributed over three different reconstruction kernels, show that the method outperforms the baseline StarGAN. Conclusion: The proposed texture-aware StarGAN performs well in CT data harmonization and effectively addresses kernel-induced texture variations. Abstract: Computed Tomography (CT) plays a pivotal role in medical diagnosis; however, variability across reconstruction kernels hinders data-driven approaches, such as deep learning models, from achieving reliable and generalized performance. To this end, CT data harmonization has emerged as a promising solution to minimize such non-biological variances by standardizing data across different sources or conditions. In this context, Generative Adversarial Networks (GANs) have proved to be a powerful framework for harmonization, framing it as a style-transfer problem. However, GAN-based approaches still face limitations in capturing complex relationships within the images, which are essential for effective harmonization. In this work, we propose a novel texture-aware StarGAN for CT data harmonization, enabling one-to-many translations across different reconstruction kernels. Although the StarGAN model has been successfully applied in other domains, its potential for CT data harmonization remains unexplored. Furthermore, our approach introduces a multi-scale texture loss function that embeds texture information across different spatial and angular scales into the harmonization process, effectively addressing kernel-induced texture variations. We conducted extensive experimentation on a publicly available dataset, utilizing a total of 48667 chest CT slices from 197 patients distributed over three different reconstruction kernels, demonstrating the superiority of our method over the baseline StarGAN.

World Models in Artificial Intelligence: Sensing, Learning, and Reasoning Like a Child

Javier Del Ser,Jesus L. Lobo,Heimo Müller,Andreas Holzinger

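Task: Explore how integrating statistical learning with six key research areas (physics-informed learning, neurosymbolic learning, continual learning, causal inference, human-in-the-loop AI, and responsible AI) can improve the reasoning capabilities of AI.

Motivation: World models are widely used in reinforcement learning, but they lack the structured, adaptive representations that children develop intuitively. Moving beyond pattern recognition requires dynamic, interpretable frameworks inspired by Piaget's cognitive development theory.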

Details

Method: Proposes integrating statistical learning with six key research areas: physics-informed learning, neurosymbolic learning, continual learning, causal inference, human-in-the-loop AI, and responsible AI. Result: By integrating these areas, AI can evolve from pattern recognition to genuine understanding, adaptation, and reasoning capabilities. Conclusion: Integrating statistical learning with the six key research areas is essential for enabling true reasoning in AI. Abstract: World Models help Artificial Intelligence (AI) predict outcomes, reason about its environment, and guide decision-making. While widely used in reinforcement learning, they lack the structured, adaptive representations that even young children intuitively develop. Advancing beyond pattern recognition requires dynamic, interpretable frameworks inspired by Piaget's cognitive development theory. We highlight six key research areas -- physics-informed learning, neurosymbolic learning, continual learning, causal inference, human-in-the-loop AI, and responsible AI -- as essential for enabling true reasoning in AI. By integrating statistical learning with advances in these areas, AI can evolve from pattern recognition to genuine understanding, adaptation and reasoning capabilities.

A Review on Large Language Models for Visual Analytics

Navya Sonal Agarwal,Sanjay Kumar Sonbhadra

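Task: Provide a comprehensive review of the integration of Large Language Models (LLMs) with visual analytics, covering foundational concepts, capabilities, and wide-ranging applications.

Motivation: Explore the potential of LLMs in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations, and how combining them with visual analytics enhances data interpretation, visualization techniques, and interactive exploration.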

Details

Method: Critically evaluates key tools and platforms (LIDA, Chat2VIS, Julius AI, and Zoho Analytics) and specialized multimodal models (ChartLlama and CharXIV), and systematically explores a taxonomy of LLM tasks covering natural language understanding (NLU), natural language generation (NLG), dialogue systems, and text-to-media transformations. Result: Provides a SWOT analysis of integrating LLMs with visual analytics, highlighting strengths (such as accessibility and flexibility), weaknesses (such as computational demands and biases), opportunities (such as multimodal integration and user collaboration), and threats (such as privacy concerns and skill degradation). Conclusion: Ethical considerations and methodological improvements must be addressed for effective integration. Abstract: This paper provides a comprehensive review of the integration of Large Language Models (LLMs) with visual analytics, addressing their foundational concepts, capabilities, and wide-ranging applications. It begins by outlining the theoretical underpinnings of visual analytics and the transformative potential of LLMs, specifically focusing on their roles in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations. The review further investigates how the synergy between LLMs and visual analytics enhances data interpretation, visualization techniques, and interactive exploration capabilities. Key tools and platforms including LIDA, Chat2VIS, Julius AI, and Zoho Analytics, along with specialized multimodal models such as ChartLlama and CharXIV, are critically evaluated. The paper discusses their functionalities, strengths, and limitations in supporting data exploration, visualization enhancement, automated reporting, and insight extraction. The taxonomy of LLM tasks, ranging from natural language understanding (NLU), natural language generation (NLG), to dialogue systems and text-to-media transformations, is systematically explored. This review provides a SWOT analysis of integrating Large Language Models (LLMs) with visual analytics, highlighting strengths like accessibility and flexibility, weaknesses such as computational demands and biases, opportunities in multimodal integration and user collaboration, and threats including privacy concerns and skill degradation. It emphasizes addressing ethical considerations and methodological improvements for effective integration.

Beacon2Science: Enhancing STEREO/HI beacon data with machine learning for efficient CME tracking

Justin Le Louëdec,Maike Bauer,Tanja Amerstorfer,Jackie A. Davies

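Task: Improve the accuracy of real-time observation and forecasting of coronal mass ejections (CMEs) by enhancing the quality of beacon data.

Motivation: CMEs can generate strong geomagnetic storms with potentially damaging effects on satellites and electrical devices, so observing and forecasting them in real time is crucial.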

Details

Method: Proposes a new pipeline called "Beacon2Science" that enhances the quality of beacon data (signal-to-noise ratio and spatial resolution) and then increases its time resolution through learned interpolation to match the 40-minute resolution of science data. Result: The improved beacon images are comparable to science data and show better CME visibility than the original beacon data. Tracks extracted from enhanced beacon data are closer to those from science images, with a mean error of about 0.5° of elongation compared to 1° for the original beacon data. Conclusion: The work paves the way for application to forthcoming missions such as Vigil and PUNCH. Abstract: Observing and forecasting coronal mass ejections (CME) in real-time is crucial due to the strong geomagnetic storms they can generate that can have a potentially damaging effect, for example, on satellites and electrical devices. With its near-real-time availability, STEREO/HI beacon data is the perfect candidate for early forecasting of CMEs. However, previous work concluded that CME arrival prediction based on beacon data could not achieve the same accuracy as with high-resolution science data due to data gaps and lower quality. We present our novel pipeline entitled "Beacon2Science", bridging the gap between beacon and science data to improve CME tracking. Through this pipeline, we first enhance the quality (signal-to-noise ratio and spatial resolution) of beacon data. We then increase the time resolution of enhanced beacon images through learned interpolation to match science data's 40-minute resolution. We maximize information coherence between consecutive frames with adapted model architecture and loss functions through the different steps. The improved beacon images are comparable to science data, showing better CME visibility than the original beacon data. Furthermore, we compare CMEs tracked in beacon, enhanced beacon, and science images. The tracks extracted from enhanced beacon data are closer to those from science images, with a mean average error of $\sim 0.5 ^\circ$ of elongation compared to $1^\circ$ with original beacon data. The work presented in this paper paves the way for its application to forthcoming missions such as Vigil and PUNCH.
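
The track comparison can be illustrated with a small sketch. It assumes the quoted error is a mean absolute difference in elongation between tracks sampled at the same times, which the abstract does not spell out; the elongation values below are invented for the example and are not the paper's data.

```python
def mean_absolute_error(track_a, track_b):
    """Mean absolute difference between two equally sampled elongation tracks (degrees)."""
    if len(track_a) != len(track_b) or not track_a:
        raise ValueError("tracks must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(track_a, track_b)) / len(track_a)


# Hypothetical CME front elongations (degrees) at five common time steps.
science = [5.0, 6.2, 7.5, 8.9, 10.4]
beacon = [4.1, 7.1, 6.4, 10.0, 9.5]
enhanced = [4.6, 6.7, 7.0, 9.3, 10.9]

beacon_error = mean_absolute_error(science, beacon)
enhanced_error = mean_absolute_error(science, enhanced)
```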

Euclid Quick Data Release (Q1). Active galactic nuclei identification using diffusion-based inpainting of Euclid VIS images

Euclid Collaboration,G. Stevens,S. Fotopoulou,M. N. Bremer,T. Matamoro Zatarain,K. Jahnke,B. Margalef-Bentabol,M. Huertas-Company,M. J. Smith,M. Walmsley,M. Salvato,M. Mezcua,A. Paulino-Afonso,M. Siudek,M. Talia,F. Ricci,W. Roster,N. Aghanim,B. Altieri,S. Andreon,H. Aussel,C. Baccigalupi,M. Baldi,S. Bardelli,P. Battaglia,A. Biviano,A. Bonchi,E. Branchini,M. Brescia,J. Brinchmann,S. Camera,G. Cañas-Herrera,V. Capobianco,C. Carbone,J. Carretero,M. Castellano,G. Castignani,S. Cavuoti,K. C. Chambers,A. Cimatti,C. Colodro-Conde,G. Congedo,C. J. Conselice,L. Conversi,Y. Copin,A. Costille,F. Courbin,H. M. Courtois,M. Cropper,A. Da Silva,H. Degaudenzi,G. De Lucia,C. Dolding,H. Dole,M. Douspis,F. Dubath,X. Dupac,S. Dusini,S. Escoffier,M. Farina,S. Ferriol,K. George,C. Giocoli,B. R. Granett,A. Grazian,F. Grupp,S. V. H. Haugan,I. M. Hook,F. Hormuth,A. Hornstrup,P. Hudelot,M. Jhabvala,E. Keihänen,S. Kermiche,A. Kiessling,M. Kilbinger,B. Kubik,M. Kümmel,H. Kurki-Suonio,Q. Le Boulc'h,A. M. C. Le Brun,D. Le Mignant,P. B. Lilje,V. Lindholm,I. Lloro,G. Mainetti,D. Maino,E. Maiorano,O. Marggraf,M. Martinelli,N. Martinet,F. Marulli,R. Massey,S. Maurogordato,H. J. McCracken,E. Medinaceli,S. Mei,M. Melchior,M. Meneghetti,E. Merlin,G. Meylan,A. Mora,M. Moresco,L. Moscardini,R. Nakajima,C. Neissner,S. -M. Niemi,C. Padilla,S. Paltani,F. Pasian,K. Pedersen,W. J. Percival,V. Pettorino,G. Polenta,M. Poncet,L. A. Popa,L. Pozzetti,F. Raison,R. Rebolo,A. Renzi,J. Rhodes,G. Riccio,E. Romelli,M. Roncarelli,R. Saglia,A. G. Sánchez,D. Sapone,J. A. Schewtschenko,M. Schirmer,P. Schneider,T. Schrabback,A. Secroun,S. Serrano,P. Simon,C. Sirignano,G. Sirri,J. Skottfelt,L. Stanco,J. Steinwagner,P. Tallada-Crespí,A. N. Taylor,I. Tereno,S. Toft,R. Toledo-Moreo,F. Torradeflot,I. Tutusaus,L. Valenziano,J. Valiviita,T. Vassallo,G. Verdoes Kleijn,A. Veropalumbo,Y. Wang,J. Weller,A. Zacchei,G. Zamorani,F. M. Zerbi,I. A. Zinchenko,E. Zucca,V. Allevato,M. Ballardini,M. Bolzonella,E. Bozzo,C. Burigana,R. Cabanac,A. Cappi,J. A. Escartin Vigo,L. Gabarra,W. G. Hartley,J. Martín-Fleitas,S. Matthew,R. B. Metcalf,A. Pezzotta,M. Pöntinen,I. Risso,V. Scottez,M. Sereno,M. Tenti,M. Wiesmann,Y. Akrami,S. Alvi,I. T. Andika,S. Anselmi,M. Archidiacono,F. Atrio-Barandela,D. Bertacca,M. Bethermin,L. Bisigello,A. Blanchard,L. Blot,S. Borgani,M. L. Brown,S. Bruton,A. Calabro,F. Caro,T. Castro,F. Cogato,S. Davini,G. Desprez,A. Díaz-Sánchez,J. J. Diaz,S. Di Domizio,J. M. Diego,P. -A. Duc,A. Enia,Y. Fang,A. G. Ferrari,A. Finoguenov,A. Fontana,A. Franco,J. García-Bellido,T. Gasparetto,V. Gautard,E. Gaztanaga,F. Giacomini,F. Gianotti,M. Guidi,C. M. Gutierrez,A. Hall,S. Hemmati,H. Hildebrandt,J. Hjorth,J. J. E. Kajava,Y. Kang,V. Kansal,D. Karagiannis,C. C. Kirkpatrick,S. Kruk,L. Legrand,M. Lembo,F. Lepori,G. Leroy,J. Lesgourgues,L. Leuzzi,T. I. Liaudat,J. Macias-Perez,M. Magliocchetti,F. Mannucci,R. Maoli,C. J. A. P. Martins,L. Maurin,M. Miluzio,P. Monaco,G. Morgante,K. Naidoo,A. Navarro-Alsina,F. Passalacqua,K. Paterson,L. Patrizii,A. Pisani,D. Potter,S. Quai,M. Radovich,P. -F. Rocci,G. Rodighiero,S. Sacquegna,M. Sahlén,D. B. Sanders,E. Sarpa,A. Schneider,M. Schultheis,D. Sciotti,E. Sellentin,F. Shankar,L. C. Smith,K. Tanidis,G. Testera,R. Teyssier,S. Tosi,A. Troja,M. Tucci,C. Valieri,D. Vergani,G. Verza,N. A. Walton

Task: Propose a new method for identifying active galactic nuclei (AGN) and quasi-stellar objects (QSO) from a single image.

Motivation: Traditional AGN and QSO identification usually requires multi-wavelength observations; this paper aims to achieve high-completeness identification from a single image.

Details

Method: Exploiting the spatial resolving power of Euclid VIS images, the authors train a diffusion model that reconstructs the light distribution of normal galaxies and flags AGN and QSO as sources that deviate from it. Result: Using VIS imaging alone, the method achieves high completeness compared with traditional selection methods (optical, near-infrared, mid-infrared, and X-ray). Conclusion: The proposed method shows clear advantages for identifying AGN and QSO from a single image and offers a new tool for future astronomical observations. Abstract: Light emission from galaxies exhibits diverse brightness profiles, influenced by factors such as galaxy type, structural features and interactions with other galaxies. Elliptical galaxies feature more uniform light distributions, while spiral and irregular galaxies have complex, varied light profiles due to their structural heterogeneity and star-forming activity. In addition, galaxies with an active galactic nucleus (AGN) feature intense, concentrated emission from gas accretion around supermassive black holes, superimposed on regular galactic light, while quasi-stellar objects (QSO) are the extreme case of the AGN emission dominating the galaxy. The challenge of identifying AGN and QSO has been discussed many times in the literature, often requiring multi-wavelength observations. This paper introduces a novel approach to identify AGN and QSO from a single image. Diffusion models have been recently developed in the machine-learning literature to generate realistic-looking images of everyday objects. Utilising the spatial resolving power of the Euclid VIS images, we created a diffusion model trained on one million sources, without using any source pre-selection or labels. The model learns to reconstruct light distributions of normal galaxies, since the population is dominated by them. We condition the prediction of the central light distribution by masking the central few pixels of each source and reconstruct the light according to the diffusion model. We further use this prediction to identify sources that deviate from this profile by examining the reconstruction error of the few central pixels regenerated in each source's core. Our approach, solely using VIS imaging, features high completeness compared to traditional methods of AGN and QSO selection, including optical, near-infrared, mid-infrared, and X-rays. [abridged]
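
The mask-and-reconstruct scoring described in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: the diffusion model is replaced by a trivial stand-in inpainter (the mean of the pixels bordering the mask), and the toy galaxy profile, the function names, and the `agn_flux` parameter are all assumptions made for illustration. Only the idea is taken from the abstract: mask the central pixels, regenerate them from the surroundings, and score a source by the reconstruction error in its core.

```python
import numpy as np

def make_galaxy(size=21, agn_flux=0.0):
    """Toy source: a smooth Gaussian profile plus an optional central point source."""
    yy, xx = np.mgrid[:size, :size]
    c = size // 2
    r2 = (yy - c) ** 2 + (xx - c) ** 2
    image = np.exp(-r2 / (2 * 5.0 ** 2))  # smooth extended light
    image[c, c] += agn_flux               # hypothetical AGN-like excess
    return image

def inpaint_centre(image, half_width=1):
    """Stand-in for the diffusion model: fill the masked core with the
    mean of the one-pixel ring that borders the mask."""
    c = image.shape[0] // 2
    lo, hi = c - half_width, c + half_width + 1
    mask = np.zeros(image.shape, dtype=bool)
    mask[lo:hi, lo:hi] = True
    ring = np.zeros(image.shape, dtype=bool)
    ring[lo - 1:hi + 1, lo - 1:hi + 1] = True
    ring &= ~mask
    recon = image.copy()
    recon[mask] = image[ring].mean()
    return recon, mask

def core_error(image, half_width=1):
    """Mean squared reconstruction error over the masked central pixels."""
    recon, mask = inpaint_centre(image, half_width)
    return float(np.mean((image[mask] - recon[mask]) ** 2))

normal_score = core_error(make_galaxy())
agn_score = core_error(make_galaxy(agn_flux=5.0))
```

In this toy setting a source with a central excess receives a larger `core_error` than a smooth one, which is the quantity a threshold would be applied to.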

Abhi Kamboj,Minh N. Do

Task: Study multimodal alignment and its application to cross-modal transfer.

Motivation: Construct a joint latent vector space in which two modalities representing the same concept map to the same vector, and explore the feasibility of unsupervised cross-modal transfer.

Details

Method: Formulates multimodal alignment as an inverse problem and shows that perfect alignment is achievable under certain conditions. Assuming that semantic classes are represented as a mixture of Gaussians in the latent space, the paper shows how to perform cross-modal transfer. Result: Experiments on synthetic multimodal Gaussian data verify the effectiveness of the perfect alignment and cross-modal transfer method. Conclusion: The findings are expected to inspire further exploration of the applications of perfect alignment and of Gaussian models in cross-modal learning. Abstract: Multimodal alignment aims to construct a joint latent vector space where two modalities representing the same concept map to the same vector. We formulate this as an inverse problem and show that under certain conditions perfect alignment can be achieved. We then address a specific application of alignment referred to as cross-modal transfer. Unsupervised cross-modal transfer aims to leverage a model trained with one modality to perform inference on another modality, without any labeled fine-tuning on the new modality. Assuming that semantic classes are represented as a mixture of Gaussians in the latent space, we show how cross-modal transfer can be performed by projecting the data points from the representation space onto different subspaces representing each modality. Our experiments on synthetic multimodal Gaussian data verify the effectiveness of our perfect alignment and cross-modal transfer method. We hope these findings inspire further exploration of the applications of perfect alignment and the use of Gaussian models for cross-modal learning.
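
A toy version of this setup may help make the idea concrete. The sketch below is an assumption-laden illustration, not the paper's method: each modality is modelled as a known, noise-free linear map of a shared latent vector (matrices `A` and `B`, both hypothetical), alignment is done by inverting that map with a pseudo-inverse, and the classifier is a nearest-centroid rule fitted on labelled data from modality A only.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, obs_dim, n_per_class = 4, 8, 200
class_means = 5.0 * np.eye(3, latent_dim)  # three well-separated Gaussian classes

def sample_latents():
    """Draw latent vectors from a mixture of three isotropic Gaussians."""
    labels = np.repeat(np.arange(3), n_per_class)
    noise = 0.5 * rng.standard_normal((labels.size, latent_dim))
    return class_means[labels] + noise, labels

# Hypothetical linear "sensors": each modality observes x = M z.
A = rng.standard_normal((obs_dim, latent_dim))
B = rng.standard_normal((obs_dim, latent_dim))

def to_latent(x, mixing):
    """Solve the inverse problem x = M z for z via the pseudo-inverse of M."""
    return x @ np.linalg.pinv(mixing).T

# Fit class centroids using labelled data from modality A only.
z_train, y_train = sample_latents()
z_a = to_latent(z_train @ A.T, A)
centroids = np.stack([z_a[y_train == k].mean(axis=0) for k in range(3)])

# Classify unlabelled data from modality B after aligning it to the same space.
z_test, y_test = sample_latents()
z_b = to_latent(z_test @ B.T, B)
dists = np.linalg.norm(z_b[:, None, :] - centroids[None, :, :], axis=2)
accuracy = float((dists.argmin(axis=1) == y_test).mean())
```

Because both maps have full column rank, the pseudo-inverse recovers the latent vector exactly, so a classifier fitted on one modality transfers to the other without any labels from it. In practice the maps would have to be learned rather than assumed known.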

SemEval-2025 Task 1: AdMIRe -- Advancing Multimodal Idiomaticity Representation

Thomas Pickard,Aline Villavicencio,Maggie Mi,Wei He,Dylan Phelps,Carolina Scarton,Marco Idiart

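Task: Assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages.

Motivation: Idiomatic expressions pose a distinctive challenge in natural language processing because their meanings often cannot be inferred directly from their constituent words. Despite progress in large language models (LLMs), idiomaticity remains a significant obstacle to semantic representation.

Details

Method: Presents the datasets and tasks for SemEval-2025 Task 1: AdMIRe (Advancing Multimodal Idiomaticity Representation), with two subtasks: ranking images by how well they align with an idiomatic or literal meaning, and predicting the next image in a sequence. Result: The most effective methods reached human-level performance by using pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the models' weaknesses in representing idiomaticity. Conclusion: Tasks set in multimodal and multilingual contexts can substantially improve models' understanding of idiomatic expressions. Abstract: Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in Large Language Models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and tasks for SemEval-2025 Task 1: AdMIRe (Advancing Multimodal Idiomaticity Representation), which challenges the community to assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages. Participants competed in two subtasks: ranking images based on their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models' representations of idiomaticity.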

FedSCA: Federated Tuning with Similarity-guided Collaborative Aggregation for Heterogeneous Medical Image Segmentation

Yumin Zhang,Yan Gao,Haoran Duan,Hanqing Guo,Tejal Shah,Rajiv Ranjan,Bo Wei

Task: Propose FedSCA, a new framework for fine-tuning foundation models with federated learning for medical image segmentation.

Motivation: The limited size of medical image datasets and privacy restrictions on data centralization hinder the use of foundation models in medical image segmentation. Combining federated learning with foundation model fine-tuning can address this, but non-IID data and computational and communication constraints remain challenging.

Details

Method: Proposes the FedSCA framework, comprising parameter-efficient fine-tuning, partial low-level adapter transmission, and similarity-guided collaborative aggregation. Result: On three federated learning benchmarks for medical image segmentation, FedSCA establishes new SOTA performance. Conclusion: FedSCA effectively addresses non-IID data and computational and communication constraints, and markedly improves medical image segmentation performance. Abstract: Transformer-based foundation models (FMs) have recently demonstrated remarkable performance in medical image segmentation. However, scaling these models is challenging due to the limited size of medical image datasets within isolated hospitals, where data centralization is restricted due to privacy concerns. These constraints, combined with the data-intensive nature of FMs, hinder their broader application. Integrating federated learning (FL) with foundation models (FLFM) fine-tuning offers a potential solution to these challenges by enabling collaborative model training without data sharing, thus allowing FMs to take advantage of a diverse pool of sensitive medical image data across hospitals/clients. However, non-independent and identically distributed (non-IID) data among clients, paired with computational and communication constraints in federated environments, presents an additional challenge that limits further performance improvements and remains inadequately addressed in existing studies. In this work, we propose a novel FLFM fine-tuning framework, Federated tuning with Similarity-guided Collaborative Aggregation (FedSCA), encompassing all phases of the FL process. This includes (1) specially designed parameter-efficient fine-tuning (PEFT) for local client training to enhance computational efficiency; (2) partial low-level adapter transmission for communication efficiency; and (3) similarity-guided collaborative aggregation (SGCA) on the server side to address non-IID issues. Extensive experiments on three FL benchmarks for medical image segmentation demonstrate the effectiveness of our proposed FedSCA, establishing new SOTA performance.
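
The abstract does not spell out how similarity-guided aggregation works, so the following is only one plausible reading, sketched under stated assumptions: each client sends a flattened adapter update, the server measures pairwise cosine similarity, and every client receives its own weighted average that favours clients with similar updates. The function names and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def cosine_similarity_matrix(updates):
    """Pairwise cosine similarity between rows (one flattened update per client)."""
    norms = np.linalg.norm(updates, axis=1, keepdims=True)
    unit = updates / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def similarity_guided_aggregate(updates, temperature=0.5):
    """Return one personalised aggregate per client.

    Row i of `weights` is a softmax over client i's similarities, so clients
    with similar updates contribute more to client i's aggregate.
    """
    logits = cosine_similarity_matrix(updates) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ updates, weights

# Three toy clients: the first two agree, the third points the opposite way.
client_updates = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
aggregated, weights = similarity_guided_aggregate(client_updates)
```

Compared with plain averaging, this weighting limits how much a client with very different data pulls on the others, which is the kind of non-IID effect the abstract says the server-side step targets.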

Towards efficient keyword spotting using spike-based time difference encoders

Alejandro Pequeño-Zurro,Lyes Khacef,Stefano Panzeri,Elisabetta Chicca

Task: Explore the performance of the Temporal Difference Encoder (TDE) in keyword spotting.

Motivation: With voice-activated assistants in wide use, keyword spotting on edge devices is increasingly important, but its deployment is often limited by the extreme low-power constraints of the target embedded systems.

Details

Method: Using the TIdigits dataset, compares three spiking neural network (SNN) architectures whose hidden layers consist of feedforward TDE, feedforward CuBa-LIF, or recurrent CuBa-LIF neurons. Result: The feedforward TDE network reaches 89% accuracy, higher than the feedforward CuBa-LIF network (71%) and close to the recurrent CuBa-LIF network (91%), while performing 92% fewer synaptic operations than the recurrent CuBa-LIF network. Conclusion: The TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns. Abstract: Keyword spotting in edge devices is becoming increasingly important as voice-activated assistants are widely used. However, its deployment is often limited by the extreme low-power constraints of the target embedded systems. Here, we explore the Temporal Difference Encoder (TDE) performance in keyword spotting. This recent neuron model encodes the time difference in instantaneous frequency and spike count to perform efficient keyword spotting with neuromorphic processors. We use the TIdigits dataset of spoken digits with a formant decomposition and rate-based encoding into spikes. We compare three Spiking Neural Network (SNN) architectures to learn and classify spatio-temporal signals. The proposed SNN architectures are made of three layers with variation in their hidden layer, composed of either (1) feedforward TDE, (2) feedforward Current-Based Leaky Integrate-and-Fire (CuBa-LIF), or (3) recurrent CuBa-LIF neurons. We first show that the spike trains of the frequency-converted spoken digits have a large amount of information in the temporal domain, reinforcing the importance of better exploiting temporal encoding for such a task. We then train the three SNNs with the same number of synaptic weights to quantify and compare their performance based on the accuracy and synaptic operations. The resulting accuracy of the feedforward TDE network (89%) is higher than the feedforward CuBa-LIF network (71%) and close to the recurrent CuBa-LIF network (91%). However, the feedforward TDE-based network performs 92% fewer synaptic operations than the recurrent CuBa-LIF network with the same amount of synapses. In addition, the results of the TDE network are highly interpretable and correlated with the frequency and timescale features of the spoken keywords in the dataset. Our findings suggest that the TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns.
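
To make "encodes the time difference in ... spike count" concrete, here is a minimal discrete-time sketch of a generic TDE-style neuron. It is a simplified textbook-style model, not the paper's implementation, and every parameter value (time constants, weight `w`, threshold) is an assumption chosen for illustration. A facilitatory spike opens a decaying gain trace; a later trigger spike injects a current scaled by the remaining gain; a leaky integrate-and-fire membrane turns that current into output spikes.

```python
def tde_spike_count(delta_t_ms, w=2.0, tau_fac=10.0, tau_epsc=5.0,
                    tau_mem=10.0, v_th=1.0, dt=0.1, t_end=100.0):
    """Count output spikes for a facilitatory/trigger pair separated by delta_t_ms."""
    fac_step = int(round(10.0 / dt))                  # facilitatory spike at 10 ms
    trig_step = int(round((10.0 + delta_t_ms) / dt))  # trigger spike delta_t later
    gain = epsc = v = 0.0
    spikes = 0
    for step in range(int(round(t_end / dt))):
        if step == fac_step:
            gain += 1.0                    # facilitatory input opens the gain trace
        if step == trig_step:
            epsc += w * gain               # trigger current scaled by remaining gain
        v += dt * (-v / tau_mem + epsc)    # leaky integration (forward Euler)
        gain -= dt * gain / tau_fac        # exponential decay of the gain trace
        epsc -= dt * epsc / tau_epsc       # exponential decay of the synaptic current
        if v >= v_th:                      # threshold crossing: emit a spike, reset
            spikes += 1
            v = 0.0
    return spikes

short_gap = tde_spike_count(1.0)   # inputs 1 ms apart
long_gap = tde_spike_count(20.0)   # inputs 20 ms apart
```

With these illustrative parameters a short gap between the two inputs produces more output spikes than a long one, so the spike count carries the timing information that a downstream layer can classify.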

Federated Continual 3D Segmentation With Single-round Communication

Can Peng,Qianhui Men,Pramit Saha,Qianye Yang,Cheng Ouyang,J. Alison Noble

Task: Propose a federated continual learning strategy suited to dynamic federated analysis settings.

Motivation: Traditional federated learning assumes that client data and learning objectives are fixed, but in real-world scenarios new clients may join and existing clients may expand their label sets, which makes conventional per-round aggregation inefficient in communication and computation.

Details

Method: Performs a one-time model aggregation at the server through multi-model distillation, reducing the need for frequent server communication. Result: The approach is shown to be effective on a 3D abdominal CT segmentation task. Conclusion: By minimizing communication load and relaxing synchronization requirements, the approach provides an efficient and scalable federated analysis framework suited to real-world applications. Abstract: Federated learning seeks to foster collaboration among distributed clients while preserving the privacy of their local data. Traditionally, federated learning methods assume a fixed setting in which client data and learning objectives remain constant. However, in real-world scenarios, new clients may join, and existing clients may expand the segmentation label set as task requirements evolve. In such a dynamic federated analysis setup, the conventional federated communication strategy of model aggregation per communication round is suboptimal. As new clients join, this strategy requires retraining, linearly increasing communication and computation overhead. It also imposes requirements for synchronized communication, which is difficult to achieve among distributed clients. In this paper, we propose a federated continual learning strategy that employs a one-time model aggregation at the server through multi-model distillation. This approach builds and updates the global model while eliminating the need for frequent server communication. When integrating new data streams or onboarding new clients, this approach efficiently reuses previous client models, avoiding the need to retrain the global model across the entire federation. By minimizing communication load and bypassing the need to put unchanged clients online, our approach relaxes synchronization requirements among clients, providing an efficient and scalable federated analysis framework suited for real-world applications. Using multi-class 3D abdominal CT segmentation as an application task, we demonstrate the effectiveness of the proposed approach.

LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding

Amirhossein Kazerouni,Soroush Mehraban,Michael Brudno,Babak Taati

Task: Propose LIFT, a new high-performance framework that captures multiscale information through meta-learning to address the limitations of existing implicit neural representation (INR) frameworks.

Motivation: Existing INR frameworks often rely on global latent vectors or suffer from computational inefficiencies, which limits their broader applicability.

Details

Method: LIFT uses multiple parallel localized implicit functions together with a hierarchical latent generator to produce unified latent representations spanning local, intermediate, and global features. ReLIFT is an enhanced variant of LIFT that adds residual connections and expressive frequency encodings. Result: LIFT achieves state-of-the-art performance in generative modeling and classification tasks with notably lower computational cost. ReLIFT also performs well on signal representation and inverse problem tasks. Conclusion: LIFT and ReLIFT provide an efficient yet powerful solution that improves capacity and speeds up convergence across a range of tasks. Abstract: Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains, offering key advantages such as memory efficiency and resolution independence. Conventional deep learning models are typically modality-dependent, often requiring custom architectures and objectives for different types of signals. However, existing INR frameworks frequently rely on global latent vectors or exhibit computational inefficiencies that limit their broader applicability. We introduce LIFT, a novel, high-performance framework that addresses these challenges by capturing multiscale information through meta-learning. LIFT leverages multiple parallel localized implicit functions alongside a hierarchical latent generator to produce unified latent representations that span local, intermediate, and global features. This architecture facilitates smooth transitions across local regions, enhancing expressivity while maintaining inference efficiency. Additionally, we introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings. With this straightforward approach, ReLIFT effectively addresses the convergence-capacity gap found in comparable methods, providing an efficient yet powerful solution to improve capacity and speed up convergence. Empirical results show that LIFT achieves state-of-the-art (SOTA) performance in generative modeling and classification tasks, with notable reductions in computational costs. Moreover, in single-task settings, the streamlined ReLIFT architecture proves effective in signal representations and inverse problem tasks.

Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM

Yazeed Alnumay,Alexandre Barbet,Anna Bialas,William Darling,Shaan Desai,Joan Devassy,Kyle Duffy,Stephanie Howe,Olivia Lasche,Justin Lee,Anirudh Shrinivason,Jennifer Tracey

Task: Build high-quality large language models (LLMs) for enterprise Arabic applications.

Motivation: Building high-quality Arabic LLMs is challenging because digitized Arabic data is limited.

Details

Method: Uses a data synthesis and refinement strategy, combining synthetic data generation with human-in-the-loop annotation to expand the Arabic training corpus, followed by iterative post-training. Result: Releases a small, 7B, open-weight model that performs strongly on Arabic-focused benchmarks. Conclusion: Data synthesis and iterative post-training yield an LLM that performs well in Arabic applications. Abstract: Building high-quality large language models (LLMs) for enterprise Arabic applications remains challenging due to the limited availability of digitized Arabic data. In this work, we present a data synthesis and refinement strategy to help address this problem, namely, by leveraging synthetic data generation and human-in-the-loop annotation to expand our Arabic training corpus. We further present our iterative post training recipe that is essential to achieving state-of-the-art performance in aligning the model with human preferences, a critical aspect to enterprise use cases. The culmination of this effort is the release of a small, 7B, open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, RAG, and contextual faithfulness.

Retrieval-Augmented Simulacra: Generative Agents for Up-to-date and Knowledge-Adaptive Simulations

Hikaru Shimadzu,Takehito Utsuro,Daisuke Kitayama

Task: Evaluate how the search extension generation mechanism used in a virtual social network environment affects the ability to generate posts and replies.

Motivation: The influence of social networking services (SNS) in Japan is growing significantly, and SNS-based marketing and research on the propagation of emotions and information are active, creating the need for a system that predicts trends in SNS interactions.

Details

Method: Builds a virtual SNS environment in which LLM-driven agents form a chat community, simulating the behavior of various communities on SNS. Result: The search extension generation mechanism that mimics human search behavior is confirmed to generate the most natural exchanges. Conclusion: The proposed search extension generation mechanism performs well in the virtual SNS environment and generates natural posts and replies. Abstract: In the 2023 edition of the White Paper on Information and Communications, it is estimated that the population of social networking services in Japan will exceed 100 million by 2022, and the influence of social networking services in Japan is growing significantly. In addition, marketing using SNS and research on the propagation of emotions and information on SNS are being actively conducted, creating the need for a system for predicting trends in SNS interactions. We have already created a system that simulates the behavior of various communities on SNS by building a virtual SNS environment in which agents post and reply to each other in a chat community created by agents using LLMs. In this paper, we evaluate the impact of the search extension generation mechanism used to create posts and replies in a virtual SNS environment using a simulation system on the ability to generate posts and replies. As a result of the evaluation, we confirmed that the proposed search extension generation mechanism, which mimics human search behavior, generates the most natural exchange.

An Explainable Framework for Misinformation Identification via Critical Question Answering

Ramon Ruiz-Dolz,John Lawrence

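Task: Propose an explainable framework, based on argumentation schemes and critical questions, for detecting factual and rational misinformation.

Motivation: Existing natural language misinformation detection relies mainly on sequence classification, producing opaque systems in which the reasons for a classification are unclear. Explainable approaches have been proposed for automated fact-checking, but not yet for automated reason-checking systems.

Details

Method: Building on the theory of Argumentation Schemes and Critical Questions, the authors create and release the NLAS-CQ corpus, which combines 3,566 textbook-like natural language argumentation scheme instances with 4,687 corresponding answers to critical questions about these arguments. On this basis they implement and validate a new framework that combines classification with question answering to analyse arguments for misinformation and presents explanations to the user in the form of critical questions. Result: The new explainable framework is implemented and validated for detecting both factual and rational misinformation, with accompanying explanations. Conclusion: The framework shows promise for misinformation detection, and its explanations make the system more transparent and interpretable. Abstract: Natural language misinformation detection approaches have been, to date, largely dependent on sequence classification methods, producing opaque systems in which the reasons behind classification as misinformation are unclear. While an effort has been made in the area of automated fact-checking to propose explainable approaches to the problem, this is not the case for automated reason-checking systems. In this paper, we propose a new explainable framework for both factual and rational misinformation detection based on the theory of Argumentation Schemes and Critical Questions. For that purpose, we create and release NLAS-CQ, the first corpus combining 3,566 textbook-like natural language argumentation scheme instances and 4,687 corresponding answers to critical questions related to these arguments. On the basis of this corpus, we implement and validate our new framework which combines classification with question answering to analyse arguments in search of misinformation, and provides the explanations in form of critical questions to the human user.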

ConQuer: A Framework for Concept-Based Quiz Generation

Yicheng Fu,Zikui Wang,Liuxin Yang,Meiqing Huo,Zhongdongming Dai

Task: Propose ConQuer, a concept-based quiz generation framework that uses external knowledge sources to generate high-quality quizzes.

Motivation: Although LLMs have made quiz generation more efficient, concerns remain about the quality and educational impact of AI-generated quizzes.

Details

Method: Introduces the ConQuer framework, which generates quizzes from external knowledge sources and uses LLMs as judges for multi-dimensional quality evaluation. Result: Experiments show a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons against baseline quiz sets. Conclusion: ConQuer improves the quality of AI-generated quizzes, and ablation studies confirm the effectiveness of each component. Abstract: Quizzes play a crucial role in education by reinforcing students' understanding of key concepts and encouraging self-directed exploration. However, compiling high-quality quizzes can be challenging and requires deep expertise and insight into specific subject matter. Although LLMs have greatly enhanced the efficiency of quiz generation, concerns remain regarding the quality of these AI-generated quizzes and their educational impact on students. To address these issues, we introduce ConQuer, a concept-based quiz generation framework that leverages external knowledge sources. We employ comprehensive evaluation dimensions to assess the quality of the generated quizzes, using LLMs as judges. Our experiment results demonstrate a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons against baseline quiz sets. Ablation studies further underscore the effectiveness of each component in our framework. Code available at https://github.com/sofyc/ConQuer.

Synthetic Data Generation of Body Motion Data by Neural Gas Network for Emotion Recognition

Seyed Muhammad Hossein Mousavi

Task: Apply the Neural Gas Network (NGN) algorithm to synthesize body motion data for emotion recognition.

Motivation: Emotion recognition from body motion suffers from a scarcity of diverse and generalizable datasets, and existing synthetic data generation methods such as GANs and VAEs are often complex.

Details

Method: The NGN learns the skeletal structure topology by fitting neurons (gas particles) to body joints; the generated particles form new body postures, which are attached over frames to produce synthetic body motion. The generated dataset is compared against data from GANs, VAEs, and another benchmark algorithm using Fréchet Inception Distance (FID), diversity, and classification metrics. Result: The NGN produces more realistic and emotionally distinct body motion data, and synthesizes it faster, than the compared methods. Conclusion: The NGN algorithm is a viable and faster alternative to GAN- and VAE-based synthesis of body motion data for emotion recognition. Abstract: In the domain of emotion recognition using body motion, the primary challenge lies in the scarcity of diverse and generalizable datasets. Automatic emotion recognition uses machine learning and artificial intelligence techniques to recognize a person's emotional state from various data types, such as text, images, sound, and body motion. Body motion poses unique challenges as many factors, such as age, gender, ethnicity, personality, and illness, affect its appearance, leading to a lack of diverse and robust datasets specifically for emotion recognition. To address this, employing Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks (GANs) and Variational Auto Encoders (VAEs), offers potential solutions, though these methods are often complex. This research introduces a novel application of the Neural Gas Network (NGN) algorithm for synthesizing body motion data and optimizing diversity and generation speed. By learning skeletal structure topology, the NGN fits the neurons or gas particles on body joints. Generated gas particles, which form the skeletal structure later on, will be used to synthesize the new body posture. By attaching body postures over frames, the final synthetic body motion appears. We compared our generated dataset against others generated by GANs, VAEs, and another benchmark algorithm, using benchmark metrics such as Fréchet Inception Distance (FID), Diversity, and a few more. Furthermore, we continued evaluation using classification metrics such as accuracy, precision, recall, and a few others. Joint-related features or kinematic parameters were extracted, and the system assessed model performance against unseen data. Our findings demonstrate that the NGN algorithm produces more realistic and emotionally distinct body motion data and does so with more synthesizing speed than existing methods.

Generating Medically-Informed Explanations for Depression Detection using LLMs

Xiangyong Chen,Xiaochuan Lin

Task: Use a large language model for multi-task depression detection while generating textual explanations grounded in medical diagnostic criteria.

Motivation: Early detection of depression from social media data offers a valuable opportunity for timely intervention, but the task requires professional medical knowledge and models that are both accurate and explainable.

Details

Method: Proposes LLM-MTD (Large Language Model for Multi-Task Depression Detection), which uses a pre-trained LLM to classify social media posts for depression and, at the same time, generate textual explanations grounded in medical diagnostic criteria. A multi-task learning framework with a combined loss function optimizes both classification accuracy and explanation quality. Result: Evaluated on the benchmark Reddit Self-Reported Depression Dataset (RSDD) against several competitive baselines, including traditional machine learning and fine-tuned BERT, LLM-MTD achieves state-of-the-art depression detection with significant improvements in AUPRC and other key metrics. Human evaluation finds the generated explanations relevant, complete, and medically accurate, which strengthens the interpretability of the approach. Conclusion: The work contributes a new depression detection method that combines large language models with explainability. Abstract: Early detection of depression from social media data offers a valuable opportunity for timely intervention. However, this task poses significant challenges, requiring both professional medical knowledge and the development of accurate and explainable models. In this paper, we propose LLM-MTD (Large Language Model for Multi-Task Depression Detection), a novel approach that leverages a pre-trained large language model to simultaneously classify social media posts for depression and generate textual explanations grounded in medical diagnostic criteria. We train our model using a multi-task learning framework with a combined loss function that optimizes both classification accuracy and explanation quality. We evaluate LLM-MTD on the benchmark Reddit Self-Reported Depression Dataset (RSDD) and compare its performance against several competitive baseline methods, including traditional machine learning and fine-tuned BERT. Our experimental results demonstrate that LLM-MTD achieves state-of-the-art performance in depression detection, showing significant improvements in AUPRC and other key metrics. Furthermore, human evaluation of the generated explanations reveals their relevance, completeness, and medical accuracy, highlighting the enhanced interpretability of our approach. This work contributes a novel methodology for depression detection that combines the power of large language models with the crucial aspect of explainability.
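
The abstract mentions a combined loss but does not give its form. A common choice, shown here purely as an assumption, is a weighted sum of a classification loss and a token-level generation loss for the explanation. The weighting factor `lam` and all function names are hypothetical.

```python
import numpy as np

def binary_cross_entropy(probs, labels):
    """Classification loss for the depression / no-depression decision."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    return float(-(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean())

def token_cross_entropy(logits, targets):
    """Generation loss: mean negative log-likelihood of the explanation tokens.

    logits has shape (num_tokens, vocab_size); targets holds one token id per row.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def combined_loss(probs, labels, logits, targets, lam=0.5):
    """Weighted sum of the two task losses; lam trades accuracy against explanation fit."""
    return binary_cross_entropy(probs, labels) + lam * token_cross_entropy(logits, targets)

# Toy inputs: an uninformed classifier and a uniform distribution over 4 tokens.
toy_loss = combined_loss(np.array([0.5, 0.5]), np.array([1.0, 0.0]),
                         np.zeros((3, 4)), np.array([0, 1, 2]), lam=1.0)
```

For the toy inputs the two terms are log 2 and log 4, so `toy_loss` equals log 8. Setting `lam` to zero would reduce the objective to plain classification.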

Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control

Hejia Chen,Haoxian Zhang,Shoulong Zhang,Xiaoqiang Liu,Sisi Zhuang,Yuan Zhang,Pengfei Wan,Di Zhang,Shuai Li

Task: Propose Cafe-Talk, a diffusion-transformer-based 3D talking face generation model that delivers accurate lip synchronization and controllable expressions.

Motivation: Existing methods use only discrete emotion labels to control expressions globally, which limits flexible fine-grained facial control in the spatiotemporal domain.

Details

Method: Uses a two-stage training pipeline: the model is first trained with speech audio and coarse-grained conditions, then a fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs). A swap-label training mechanism and a mask-based CFG technique regulate the occurrence and intensity of fine-grained control. A text-based detector with text-AU alignment supports natural language user input and multimodal control. Result: Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness and is widely accepted for fine-grained control in user studies. Conclusion: With multimodal control conditions, Cafe-Talk achieves accurate lip synchronization and flexible expression control, markedly improving 3D talking face generation. Abstract: Speech-driven 3D talking face methods should offer both accurate lip synchronization and controllable expressions. Previous methods solely adopt discrete emotion labels to globally control expressions throughout sequences while limiting flexible fine-grained facial control within the spatiotemporal domain. We propose a diffusion-transformer-based 3D talking face generation model, Cafe-Talk, which simultaneously incorporates coarse- and fine-grained multimodal control conditions. Nevertheless, the entanglement of multiple conditions challenges achieving satisfying performance. To disentangle speech audio and fine-grained conditions, we employ a two-stage training pipeline. Specifically, Cafe-Talk is initially trained using only speech audio and coarse-grained conditions. Then, a proposed fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs), preventing unfavorable speech-lip synchronization. To disentangle coarse- and fine-grained conditions, we design a swap-label training mechanism, which enables the dominance of the fine-grained conditions. We also devise a mask-based CFG technique to regulate the occurrence and intensity of fine-grained control. In addition, a text-based detector is introduced with text-AU alignment to enable natural language user input and further support multimodal control. Extensive experimental results prove that Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness performance and receives wide acceptance in fine-grained control in user studies. Project page: https://harryxd2018.github.io/cafe-talk/

Rui Yang,Lin Song,Yicheng Xiao,Runhui Huang,Yixiao Ge,Ying Shan,Hengshuang Zhao

Task: Build a baseline for a native, end-to-end large multi-modal model in a single transformer.

Motivation: Existing multi-modal models usually model the visual and textual modalities separately, and native single-transformer models are resource-intensive and show performance gaps, so a more efficient way to build native multi-modal models is needed.

Details

Method: Proposes a new early-fusion multi-modal model that fuses multi-modal inputs at an early stage and responds to visual instructions auto-regressively, together with an efficient training recipe that uses the prior knowledge of pre-trained models to address performance limitations and resource consumption. Result: The proposed model outperforms other single-transformer multi-modal models and significantly narrows the performance gap with compositional multi-modal models. Conclusion: The method offers a simple yet effective baseline for building efficient native, end-to-end large multi-modal models. Abstract: Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.

Salient Temporal Encoding for Dynamic Scene Graph Generation

Zhihao Zhu

Task: Propose a new spatial-temporal scene graph generation method that selectively builds temporal connections between temporally relevant object pairs and represents temporal relations as explicit edges in the scene graph.

Motivation: Because current benchmark datasets lack explicitly annotated temporal relations, existing spatial-temporal scene graph generation methods build dense and abstract temporal connections among all objects, yet not all of these connections encode meaningful temporal dynamics.

Details

Method: Selectively builds temporal connections only between temporally relevant object pairs and represents the temporal relations as explicit edges in the scene graph. Result: The method improves on strong baselines by up to 4.4% in Scene Graph Detection. It can also improve downstream vision tasks: in action recognition it yields a 0.6% gain in mAP over the state of the art. Conclusion: Selective temporal connections produce a sparse and explicit temporal representation that improves both scene graph generation and downstream vision tasks. Abstract: Representing a dynamic scene using a structured spatial-temporal scene graph is a novel and particularly challenging task. To tackle this task, it is crucial to learn the temporal interactions between objects in addition to their spatial relations. Due to the lack of explicitly annotated temporal relations in current benchmark datasets, most of the existing spatial-temporal scene graph generation methods build dense and abstract temporal connections among all objects across frames. However, not all temporal connections encode meaningful temporal dynamics. We propose a novel spatial-temporal scene graph generation method that selectively builds temporal connections only between temporally relevant object pairs and represents the temporal relations as explicit edges in the scene graph. The resulting sparse and explicit temporal representation allows us to improve upon strong scene graph generation baselines by up to 4.4% in Scene Graph Detection. In addition, we show that our approach can be leveraged to improve downstream vision tasks. Particularly, applying our approach to action recognition shows a 0.6% gain in mAP in comparison to the state-of-the-art.

Second language Korean Universal Dependency treebank v1.2: Focus on data augmentation and annotation scheme refinement

Hakyung Sung,Gyu-Ho Shin

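Task: Expand the second language (L2) Korean Universal Dependencies (UD) treebank and evaluate performance on in-domain and out-of-domain datasets.

Motivation: Align the treebank more closely with the UD framework and improve Korean language models' morphosyntactic analysis of L2 Korean data.

Details

Method: Manually annotates 5,454 sentences, revises the annotation guidelines, and fine-tunes three Korean language models on the enhanced treebank. Result: Fine-tuning significantly improves the models' performance across various metrics. Conclusion: Fine-tuning first-language-based, general-purpose language models on well-tailored L2 datasets is important for the morphosyntactic analysis of L2 data. Abstract: We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. The annotation guidelines are also revised to better align with the UD framework. Using this enhanced treebank, we fine-tune three Korean language models and evaluate their performance on in-domain and out-of-domain L2-Korean datasets. The results show that fine-tuning significantly improves their performance across various metrics, thus highlighting the importance of using well-tailored L2 datasets for fine-tuning first-language-based, general-purpose language models for the morphosyntactic analysis of L2 data.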

ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis

Yu Fang,Yue Yang,Xinghao Zhu,Kaiyuan Zheng,Gedas Bertasius,Daniel Szafir,Mingyu Ding

Task: Propose ReBot, a real-to-sim-to-real approach for scaling real robot datasets and adapting vision-language-action (VLA) models to target domains.

Motivation: The high cost of real-world data collection limits the generalizability of VLA models.

Details

Method: ReBot replays real robot trajectories in simulation to diversify the manipulated objects, then combines the simulated movements with inpainted real-world backgrounds to synthesize physically realistic and temporally consistent robot videos. Result: ReBot significantly improves the performance and robustness of VLA models in both simulation and real-world environments. Conclusion: By combining real data with the scalability of simulation, ReBot effectively improves the generalization and performance of VLA models. Abstract: Vision-language-action (VLA) models present a promising paradigm by training policies directly on real robot datasets like Open X-Embodiment. However, the high cost of real-world data collection hinders further data scaling, thereby restricting the generalizability of VLAs. In this paper, we introduce ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains, which is the last-mile deployment challenge in robot manipulation. Specifically, ReBot replays real-world robot trajectories in simulation to diversify manipulated objects (real-to-sim), and integrates the simulated movements with inpainted real-world backgrounds to synthesize physically realistic and temporally consistent robot videos (sim-to-real). Our approach has several advantages: 1) it enjoys the benefit of real data to minimize the sim-to-real gap; 2) it leverages the scalability of simulation; and 3) it can generalize a pretrained VLA to a target domain with fully automated data pipelines. Extensive experiments in both simulation and real-world environments show that ReBot significantly enhances the performance and robustness of VLAs. For example, in SimplerEnv with the WidowX robot, ReBot improved the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. For real-world evaluation with a Franka robot, ReBot increased the success rates of Octo by 17% and OpenVLA by 20%. More information can be found at: https://yuffish.github.io/rebot/

Strategic resource allocation in memory encoding: An efficiency principle shaping language processing

Weijie Xu,Richard Futrell

Task: Investigate strategic resource allocation as an efficiency principle for working memory in sentence processing.

Motivation: Examine how working memory efficiently supports human linguistic behavior, in particular how resources are dynamically allocated to prioritize novel and unexpected information.

Details

Method: Derive the hypothesis theoretically from a resource-rational perspective and test it empirically on naturalistic corpus data. Result: Converging evidence for strategic resource allocation is found in the context of dependency locality: non-local dependencies with less predictable antecedents are associated with a reduced locality effect. The results also show considerable cross-linguistic variability. Conclusion: How strategic resource allocation, as a universal efficiency principle, interacts with language-specific phrase structures needs closer examination. Abstract: How is the limited capacity of working memory efficiently used to support human linguistic behaviors? In this paper, we investigate strategic resource allocation as an efficiency principle for memory encoding in sentence processing. The idea is that working memory resources are dynamically and strategically allocated to prioritize novel and unexpected information, enhancing its representation to make it less susceptible to memory decay and interference. Theoretically, from a resource-rational perspective, we argue that this efficiency principle naturally arises from two functional assumptions about working memory, namely, its limited capacity and its noisy representation. Empirically, through naturalistic corpus data, we find converging evidence for strategic resource allocation in the context of dependency locality from both the production and the comprehension side, where non-local dependencies with less predictable antecedents are associated with a reduced locality effect. However, our results also reveal considerable cross-linguistic variability, highlighting the need for a closer examination of how strategic resource allocation, as a universal efficiency principle, interacts with language-specific phrase structures.

SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders

Qing Li,Jiahui Geng,Derui Zhu,Fengyu Cai,Chenyang Lyu,Fakhri Karray

Task: Propose SAUCE, a method for fine-grained and selective concept unlearning in vision-language models (VLMs).

Motivation: Existing VLM unlearning methods mainly adapt techniques from large language models (LLMs); they require extensive annotated forget sets and unlearn at a coarse granularity, which leads to excessive forgetting and reduced model utility.

Details

Method: SAUCE trains sparse autoencoders (SAEs) to capture high-dimensional, semantically rich sparse features and identifies the features most relevant to the target concept. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information. Result: SAUCE is evaluated on two VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across 60 concepts. Experiments show that it outperforms state-of-the-art methods by 18.04% in unlearning quality while maintaining comparable model utility. Conclusion: SAUCE is an effective and scalable solution for selective concept unlearning in VLMs. Abstract: Unlearning methods for vision-language models (VLMs) have primarily adapted techniques from large language models (LLMs), relying on weight updates that demand extensive annotated forget sets. Moreover, these methods perform unlearning at a coarse granularity, often leading to excessive forgetting and reduced model utility. To address this issue, we introduce SAUCE, a novel method that leverages sparse autoencoders (SAEs) for fine-grained and selective concept unlearning in VLMs. Briefly, SAUCE first trains SAEs to capture high-dimensional, semantically rich sparse features. It then identifies the features most relevant to the target concept for unlearning. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information. We evaluate SAUCE on two distinct VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across two types of tasks: concrete concept unlearning (objects and sports scenes) and abstract concept unlearning (emotions, colors, and materials), encompassing a total of 60 concepts. Extensive experiments demonstrate that SAUCE outperforms state-of-the-art methods by 18.04% in unlearning quality while maintaining comparable model utility. Furthermore, we investigate SAUCE's robustness against widely used adversarial attacks, its transferability across models, and its scalability in handling multiple simultaneous unlearning requests. Our findings establish SAUCE as an effective and scalable solution for selective concept unlearning in VLMs.

Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence

Sophia Hager,David Mueller,Kevin Duh,Nicholas Andrews

Task: Study how to make large language models (LLMs) communicate the likelihood that their answers to factual questions are correct.

Motivation: As LLMs are increasingly used for factual question answering, it becomes more important that they can communicate the likelihood that an answer is correct. For these expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence.

Details

Method: Propose a simple procedure, uncertainty distillation, which uses held-out data to map initial uncertainty estimates to meaningful probabilities and creates examples annotated with verbalized probabilities for supervised fine-tuning. Result: Experiments show that the verbalized confidences correlate with observed error rates, and that semantic uncertainty correlates well with lexical uncertainty on short answers. Conclusion: Uncertainty distillation can teach LLMs to verbalize calibrated semantic confidences, improving their reliability in factual question answering. Abstract: As large language models (LLMs) are increasingly used for factual question-answering, it becomes more important for LLMs to have the capability to communicate the likelihood that their answer is correct. For these verbalized expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence. However, when prompted to express confidence, the error rates of current LLMs are inconsistent with their communicated confidences, highlighting the need for uncertainty quantification methods. Many prior methods calculate lexical uncertainty, estimating a model's confidence in the specific string it generated. In some cases, however, it may be more useful to estimate semantic uncertainty, or the model's confidence in the answer regardless of how it is verbalized. We propose a simple procedure, uncertainty distillation, to teach an LLM to verbalize calibrated semantic confidences. Using held-out data to map initial uncertainty estimates to meaningful probabilities, we create examples annotated with verbalized probabilities for supervised fine-tuning. We demonstrate that our method yields verbalized confidences that correlate with observed error rates with a small fine-tuned language model as well as with larger instruction-tuned models, and find that our semantic uncertainty correlates well with lexical uncertainty on short answers.
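
The step of mapping initial uncertainty estimates to meaningful probabilities on held-out data can be illustrated with a small sketch. The abstract does not say which mapping the authors use; histogram binning is assumed here purely for illustration, and the names `fit_bin_calibrator` and `annotate` are hypothetical.

```python
from bisect import bisect_right


def fit_bin_calibrator(scores, correct, n_bins=5):
    """Return a function mapping a raw score in [0, 1] to its bin's held-out accuracy.

    scores  -- initial (uncalibrated) confidence estimates on held-out data
    correct -- 1 if the corresponding held-out answer was right, else 0
    Histogram binning is an assumption, not the paper's stated method.
    """
    edges = [i / n_bins for i in range(1, n_bins)]  # interior bin edges
    hits = [0] * n_bins
    totals = [0] * n_bins
    for score, is_correct in zip(scores, correct):
        b = bisect_right(edges, score)
        hits[b] += is_correct
        totals[b] += 1
    overall = sum(correct) / len(correct)
    # Empty bins fall back to the overall held-out accuracy.
    table = [hits[b] / totals[b] if totals[b] else overall for b in range(n_bins)]

    def calibrate(score):
        return table[bisect_right(edges, score)]

    return calibrate


def annotate(answer, probability):
    """Build a fine-tuning target that verbalizes the calibrated probability."""
    return f"{answer} (confidence: {round(probability * 100)}%)"
```

A calibrator fitted this way would be applied to each training answer's raw score, and the annotated strings would then serve as supervised fine-tuning targets.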

Interpretable Unsupervised Joint Denoising and Enhancement for Real-World low-light Scenarios

Huaqiu Li,Xiaowan Hu,Haoqian Wang

Task: Propose a joint denoising and low-light enhancement framework for real-world low-light images.

Motivation: Real-world low-light images often suffer from complex degradations such as local overexposure, low brightness, noise, and uneven illumination, which existing methods struggle to handle.

Details

Method: Grounded in physical imaging principles and Retinex theory, derive a training strategy based on paired sub-images, use the Discrete Cosine Transform (DCT) for frequency-domain decomposition in the sRGB space, and introduce an implicit-guided hybrid representation strategy. Result: Experiments demonstrate the superiority of the method for low-light image enhancement and denoising. Conclusion: The method performs well in real-world scenarios; the code will be released on GitHub. Abstract: Real-world low-light images often suffer from complex degradations such as local overexposure, low brightness, noise, and uneven illumination. Supervised methods tend to overfit to specific scenarios, while unsupervised methods, though better at generalization, struggle to model these degradations due to the lack of reference images. To address this issue, we propose an interpretable, zero-reference joint denoising and low-light enhancement framework tailored for real-world scenarios. Our method derives a training strategy based on paired sub-images with varying illumination and noise levels, grounded in physical imaging principles and Retinex theory. Additionally, we leverage the Discrete Cosine Transform (DCT) to perform frequency domain decomposition in the sRGB space, and introduce an implicit-guided hybrid representation strategy that effectively separates intricate compounded degradations. In the backbone network design, we develop a retinal decomposition network guided by implicit degradation representation mechanisms. Extensive experiments demonstrate the superiority of our method. Code will be available at https://github.com/huaqlili/unsupervised-light-enhance-ICLR2025.
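
To make the idea of DCT-based frequency-domain decomposition concrete, here is a minimal sketch on a single row of pixel intensities. It is not the authors' pipeline: the paper works on sRGB images, whereas this uses a 1-D signal, and the `cutoff` parameter and the function names are assumptions made for illustration.

```python
import math


def dct(x):
    """Orthonormal 1-D DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        total = sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)
        )
        out.append(scale * total)
    return out


def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt((1 if k == 0 else 2) / n)
            total += scale * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(total)
    return out


def split_frequencies(x, cutoff):
    """Split a signal into a low-frequency part and a high-frequency part.

    Coefficients with index < cutoff go to the low band, the rest to the
    high band. The hard cutoff is a hypothetical choice for this sketch.
    """
    coeffs = dct(x)
    low = idct([c if k < cutoff else 0.0 for k, c in enumerate(coeffs)])
    high = idct([c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)])
    return low, high
```

Because the transform is linear, the two bands sum back to the original signal, so slowly varying content and fine detail such as noise can be handled separately without losing information.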

Language Independent Named Entity Recognition via Orthogonal Transformation of Word Vectors

Omar E. Rakha,Hazem M. Abbas

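Task: Perform cross-lingual named entity recognition using a Bidirectional LSTM/CRF model with word embeddings.

Motivation: Word embeddings are a key building block in NLP. The paper aims to enable cross-lingual named entity recognition by training a model on a source language (English) and transforming target-language word embeddings into source-language word embeddings.

Details

Method: Propose a model based on Bidirectional LSTM/CRF with word embeddings, in which an orthogonal linear transformation matrix maps target-language word embeddings into the source-language embedding space. Result: A model trained on an English dataset detects named entities in an Arabic dataset without training or fine-tuning on Arabic data. Conclusion: The approach shows the potential of cross-lingual named entity recognition without additional training or fine-tuning on the target language. Abstract: Word embeddings have been a key building block for NLP, with models relying heavily on them in many different tasks. In this paper, a model is proposed based on using Bidirectional LSTM/CRF with word embeddings to perform named entity recognition for any language. This is done by training a model on a source language (English) and transforming word embeddings from the target language into word embeddings of the source language by using an orthogonal linear transformation matrix. Evaluation of the model shows that by training a model on an English dataset the model was capable of detecting named entities in an Arabic dataset without either training or fine-tuning the model on an Arabic language dataset.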
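
The abstract does not say how the orthogonal transformation matrix is estimated. One common formulation is a least-squares fit over a seed dictionary of translation pairs (the orthogonal Procrustes problem), which in high dimensions is usually solved with an SVD. The sketch below shows only the 2-D, rotation-only special case, where the solution has a closed form; the toy vectors, the seed-dictionary assumption, and the function names are all hypothetical.

```python
import math


def fit_rotation(foreign, english):
    """Angle of the rotation that best maps 2-D foreign vectors onto English ones.

    Minimizes sum ||R(theta) f_i - e_i||^2 over rotations; pairs (f_i, e_i)
    are assumed to come from a seed translation dictionary.
    """
    dot = sum(fx * ex + fy * ey for (fx, fy), (ex, ey) in zip(foreign, english))
    cross = sum(fx * ey - fy * ex for (fx, fy), (ex, ey) in zip(foreign, english))
    return math.atan2(cross, dot)


def rotate(vec, theta):
    """Apply the 2-D rotation matrix R(theta) to a vector."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)
```

Once fitted, the same rotation would be applied to every target-language embedding before it is fed to the tagger trained on the source language, so the tagger itself never needs retraining.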

Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey

Liewen Liao,Weihao Yan,Ming Yang,Songan Zhang

Task: Survey learning-based 3D reconstruction in autonomous driving and its recent advances.

Motivation: 3D reconstruction plays an important role in autonomous driving: it enables precise modeling of dynamic and static environments and supports key tasks such as scene understanding and closed-loop simulation.

Details

Method: Through a multi-perspective, in-depth analysis, systematically introduce the preliminaries of learning-based 3D reconstruction, including data formats, benchmarks, and technical foundations, then categorize the methods and analyze them along multiple dimensions. Result: The survey summarizes the development trends and existing challenges of learning-based 3D reconstruction in autonomous driving. Conclusion: The authors hope the survey will inspire future research. Abstract: Learning-based 3D reconstruction has emerged as a transformative technique in autonomous driving, enabling precise modeling of both dynamic and static environments through advanced neural representations. Besides augmenting perception, 3D reconstruction inspires pioneering solutions for vital tasks in the field of autonomous driving, such as scene understanding and closed-loop simulation. Commencing with an examination of input modalities, we investigate the details of 3D reconstruction and conduct a multi-perspective, in-depth analysis of recent advancements. Specifically, we first provide a systematic introduction of preliminaries, including data formats, benchmarks and technical preliminaries of learning-based 3D reconstruction, facilitating instant identification of suitable methods based on hardware configurations and sensor suites. Then, we systematically review learning-based 3D reconstruction methods in autonomous driving, categorizing approaches by subtasks and conducting multi-dimensional analysis and summary to establish a comprehensive technical reference. The development trends and existing challenges are summarized in the context of learning-based 3D reconstruction in autonomous driving. We hope that our review will inspire future research.

FACTS&EVIDENCE: An Interactive Tool for Transparent Fine-Grained Factual Verification of Machine-Generated Text

Varich Boonsanong,Vidhisha Balachandran,Xiaochuang Han,Shangbin Feng,Lucy Lu Wang,Yulia Tsvetkov

Task: Develop an interactive and transparent tool for user-driven factual verification of complex text.

Motivation: Existing automated fact-verification tools lack transparency in their prediction reasoning and diversity in their evidence sources, so they cannot provide a trustworthy user experience.

Details

Method: Develop Facts&Evidence, a tool that breaks down complex input text, visualizes the credibility of individual claims, and provides explanations of model decisions with attribution to multiple, diverse evidence sources. Result: Facts&Evidence helps users understand, verify, selectively trust, and use machine-generated text. Conclusion: Facts&Evidence aims to empower consumers of machine-generated text and give them the agency to understand, verify, selectively trust, and use such text. Abstract: With the widespread consumption of AI-generated content, there has been an increased focus on developing automated tools to verify the factual accuracy of such content. However, prior research and tools developed for fact verification treat it as a binary classification or a linear regression problem. Although this is a useful mechanism as part of automatic guardrails in systems, we argue that such tools lack transparency in the prediction reasoning and diversity in source evidence to provide a trustworthy user experience. We develop Facts&Evidence, an interactive and transparent tool for user-driven verification of complex text. The tool facilitates the intricate decision-making involved in fact-verification, presenting its users with a breakdown of complex input texts to visualize the credibility of individual claims along with an explanation of model decisions and attribution to multiple, diverse evidence sources. Facts&Evidence aims to empower consumers of machine-generated text and give them agency to understand, verify, selectively trust and use such text.

Matching Skeleton-based Activity Representations with Heterogeneous Signals for HAR

Shuheng Li,Jiayun Zhang,Xiaohan Fu,Xiyuan Zhang,Jingbo Shang,Rajesh K. Gupta

Task: Propose SKELAR, a framework that pretrains activity representations from skeleton data and matches them with heterogeneous HAR signals.

Motivation: In human activity recognition (HAR), activity labels have typically been encoded in one-hot format, with a recent shift towards textual representations that provide contextual knowledge. Text is inherently limited, however, so HAR should be anchored to physical motion data.

Details

Method: SKELAR captures core motion knowledge through a self-supervised coarse angle reconstruction task and dynamically prioritizes relevant body parts through a self-attention matching module. Result: SKELAR achieves state-of-the-art performance in both full-shot and few-shot settings and can effectively leverage synthetic skeleton data in scenarios without skeleton collection. Conclusion: With skeleton-based pretraining and a self-attention matching module, SKELAR addresses two major challenges in HAR and performs well in experiments. Abstract: In human activity recognition (HAR), activity labels have typically been encoded in one-hot format, with a recent shift towards using textual representations to provide contextual knowledge. Here, we argue that HAR should be anchored to physical motion data, as motion forms the basis of activity and applies effectively across sensing systems, whereas text is inherently limited. We propose SKELAR, a novel HAR framework that pretrains activity representations from skeleton data and matches them with heterogeneous HAR signals. Our method addresses two major challenges: (1) capturing core motion knowledge without context-specific details. We achieve this through a self-supervised coarse angle reconstruction task that recovers joint rotation angles, invariant to both users and deployments; (2) adapting the representations to downstream tasks with varying modalities and focuses. To address this, we introduce a self-attention matching module that dynamically prioritizes relevant body parts in a data-driven manner. Given the lack of corresponding labels in existing skeleton data, we establish MASD, a new HAR dataset with IMU, WiFi, and skeleton, collected from 20 subjects performing 27 activities. This is the first broadly applicable HAR dataset with time-synchronized data across three modalities. Experiments show that SKELAR achieves state-of-the-art performance in both full-shot and few-shot settings. We also demonstrate that SKELAR can effectively leverage synthetic skeleton data to extend its use in scenarios without skeleton collections.

MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models

Chejian Xu,Jiawei Zhang,Zhaorun Chen,Chulin Xie,Mintong Kang,Yujin Potter,Zhun Wang,Zhuowen Yuan,Alexander Xiong,Zidi Xiong,Chenhui Zhang,Lingzhi Yuan,Yi Zeng,Peiyang Xu,Chengquan Guo,Andy Zhou,Jeffrey Ziwei Tan,Xuandong Zhao,Francesco Pinto,Zhen Xiang,Yu Gai,Zinan Lin,Dan Hendrycks,Bo Li,Dawn Song

Task: Present MMDT, a unified platform for comprehensively evaluating the safety and trustworthiness of multimodal foundation models.

Motivation: Existing benchmarks for multimodal models either mainly assess helpfulness or focus on limited perspectives such as fairness and privacy; a comprehensive safety and trustworthiness evaluation is missing.

Details

Method: Design evaluation scenarios and red teaming algorithms that assess models from multiple perspectives: safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution generalization. Result: An evaluation of a range of multimodal models reveals vulnerabilities and areas for improvement across these perspectives. Conclusion: MMDT is presented as the first comprehensive and unique safety and trustworthiness evaluation platform for multimodal foundation models, paving the way for safer and more reliable models and systems. Abstract: Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as the generation of unsafe content by text-to-image models. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at https://mmdecodingtrust.github.io/.

Fire and Smoke Datasets in 20 Years: An In-depth Review

Sayed Pedram Haeri Boroujeni,Niloufar Mehrabi,Fatemeh Afghah,Connor Peter McGrath,Danish Bhatkar,Mithilesh Anil Biradar,Abolfazl Razi

Task: Systematically analyze and evaluate fire and smoke datasets collected over the past 20 years.

Motivation: Fire and smoke pose a significant threat to the natural environment, ecosystems, the global economy, and human and animal life. More advanced technologies are needed for early detection, real-time monitoring, and minimizing the impact of fires on ecological balance and public safety.

Details

Method: Review fire and smoke datasets in depth, analyze the characteristics of each dataset (type, size, format, collection methods, and geographical diversity), and summarize each dataset's strengths and weaknesses. Result: Extensive experimental analyses are conducted across different datasets using state-of-the-art algorithms such as ResNet-50, DeepLab-V3, and YoloV8. Conclusion: Fire and smoke datasets play a critical role in training, evaluating, and testing advanced deep learning models; the review offers in-depth insight into their potential for advancing fire management research and technology. Abstract: Fire and smoke phenomena pose a significant threat to the natural environment, ecosystems, and global economy, as well as human lives and wildlife. In this particular circumstance, there is a demand for more sophisticated and advanced technologies to implement an effective strategy for early detection, real-time monitoring, and minimizing the overall impacts of fires on ecological balance and public safety. Recently, the rapid advancement of Artificial Intelligence (AI) and Computer Vision (CV) frameworks has substantially accelerated the development of efficient fire management systems. However, these systems extensively rely on the availability of adequate and high-quality fire and smoke data to create proficient Machine Learning (ML) methods for various tasks, such as detection and monitoring. Although fire and smoke datasets play a critical role in training, evaluating, and testing advanced Deep Learning (DL) models, a comprehensive review of the existing datasets is still lacking. For this purpose, we provide an in-depth review to systematically analyze and evaluate fire and smoke datasets collected over the past 20 years. We investigate the characteristics of each dataset, including type, size, format, collection methods, and geographical diversity. We also review and highlight the unique features of each dataset, such as imaging modalities (RGB, thermal, infrared) and their applicability for different fire management tasks (classification, segmentation, detection). Furthermore, we summarize the strengths and weaknesses of each dataset and discuss their potential for advancing research and technology in fire management. Finally, we conduct extensive experimental analyses across different datasets using several state-of-the-art algorithms, such as ResNet-50, DeepLab-V3, and YoloV8.

The CLEF-2025 CheckThat! Lab: Subjectivity, Fact-Checking, Claim Normalization, and Retrieval

Firoj Alam,Julia Maria Struß,Tanmoy Chakraborty,Stefan Dietze,Salim Hafid,Katerina Korre,Arianna Muti,Preslav Nakov,Federico Ruggeri,Sebastian Schellhammer,Vinay Setty,Megha Sundriyal,Konstantin Todorov,Venktesh V

Task: The CheckThat! lab aims to advance the development of innovative technologies for identifying and counteracting online disinformation and manipulation across languages and platforms.

Motivation: Advance counter-disinformation technology through key tasks in the information verification pipeline, such as check-worthiness, evidence retrieval and pairing, and verification.

Details

Method: The 2025 edition revisits core verification tasks and considers auxiliary challenges, including subjectivity identification, claim normalization, fact-checking numerical claims, and scientific web discourse processing. Result: The tasks pose challenging classification and retrieval problems at both the document and span levels, including multilingual settings. Conclusion: By continually expanding and updating its tasks, the CheckThat! lab advances counter-disinformation technology, particularly in multilingual and cross-platform settings. Abstract: The CheckThat! lab aims to advance the development of innovative technologies designed to identify and counteract online disinformation and manipulation efforts across various languages and platforms. The first five editions focused on key tasks in the information verification pipeline, including check-worthiness, evidence retrieval and pairing, and verification. Since the 2023 edition, the lab has expanded its scope to address auxiliary tasks that support research and decision-making in verification. In the 2025 edition, the lab revisits core verification tasks while also considering auxiliary challenges. Task 1 focuses on the identification of subjectivity (a follow-up from CheckThat! 2024), Task 2 addresses claim normalization, Task 3 targets fact-checking numerical claims, and Task 4 explores scientific web discourse processing. These tasks present challenging classification and retrieval problems at both the document and span levels, including multilingual settings.

Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions

Kasra Borazjani,Payam Abdisarabshali,Naji Khosravan,Seyyedali Hosseinalipour

Task: Study the impact of data heterogeneity on federated learning (FL) performance and propose a new embedding-based definition of data heterogeneity.

Motivation: Existing FL literature mainly models data heterogeneity by imposing label distribution skew across clients, which fails to fully capture real-world data heterogeneity among clients in computer vision tasks.

Details

Method: Use pre-trained deep neural networks to extract task-specific data embeddings, cluster the data points, and distribute them among clients using the Dirichlet distribution, thereby defining task-specific data heterogeneity. Result: Extensive experiments evaluate different FL methods under the new notion of data heterogeneity and introduce new benchmark performance measures. Conclusion: Existing approaches that rely on label/class distribution skew overestimate FL performance; this exposes an overlooked gap in the literature and opens new research directions. Abstract: Federated Learning (FL) represents a paradigm shift in distributed machine learning (ML), enabling clients to train models collaboratively while keeping their raw data private. This paradigm shift from traditional centralized ML introduces challenges due to the non-iid (non-independent and identically distributed) nature of data across clients, significantly impacting FL's performance. Existing literature predominantly models data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the real-world data heterogeneity among clients in computer vision tasks beyond classification. Subsequently, we demonstrate that current approaches overestimate FL's performance by relying on label/class distribution skew, exposing an overlooked gap in the literature. By utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. We further unveil a series of open research directions that can be pursued.
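
The final step of the methodology, distributing clustered data points among clients with a Dirichlet distribution, can be sketched as follows. This is a minimal illustration and not the authors' code: the cluster ids are assumed to come from some clustering of embeddings extracted by a pre-trained network (not shown), and the function names and the `alpha` concentration parameter are hypothetical.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k entries."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws) or 1.0  # guard against all draws underflowing to 0
    return [d / total for d in draws]


def partition_by_cluster(cluster_ids, n_clients, alpha, seed=0):
    """Assign sample indices to clients, cluster by cluster.

    cluster_ids[i] is the (assumed, precomputed) embedding cluster of sample i.
    Smaller alpha concentrates each cluster on fewer clients, which means
    stronger heterogeneity across clients.
    """
    rng = random.Random(seed)
    by_cluster = {}
    for idx, cluster in enumerate(cluster_ids):
        by_cluster.setdefault(cluster, []).append(idx)

    clients = [[] for _ in range(n_clients)]
    for cluster in sorted(by_cluster):
        members = by_cluster[cluster]
        rng.shuffle(members)
        shares = sample_dirichlet(alpha, n_clients, rng)
        start, cumulative = 0, 0.0
        for client in range(n_clients):
            cumulative += shares[client]
            # The last client takes whatever remains, so no index is dropped.
            if client == n_clients - 1:
                end = len(members)
            else:
                end = round(cumulative * len(members))
            clients[client].extend(members[start:end])
            start = end
    return clients
```

The same routine reproduces the conventional label-skew setup if class labels are passed in place of cluster ids, which makes the two notions of heterogeneity easy to compare side by side.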

MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer

Honglin Lin,Zhuoshi Pan,Yu Li,Qizhi Pei,Xin Gao,Mengzhang Cai,Conghui He,Lijun Wu

Task: Propose MetaLadder, a framework that prompts large language models (LLMs) to recall and reflect on meta-problems and their solutions in order to improve accuracy on mathematical reasoning tasks.

Motivation: Current methods typically generate the chain of thought and the answer directly, which differs from human problem-solving strategies. Humans often recall analogous cases and use their solutions to reason about the current task.

Details

Method: MetaLadder prompts LLMs to recall and reflect on meta-problems and their solutions, and introduces a problem-restating mechanism to improve the model's comprehension of the target problem. Result: On mathematical benchmarks, MetaLadder significantly improves LLMs' problem-solving accuracy, with a 10.3% accuracy gain over standard chain-of-thought methods. Conclusion: By mimicking the human ability to learn from examples and generalize, MetaLadder significantly improves LLM performance on mathematical reasoning tasks. Abstract: Large Language Models (LLMs) have demonstrated promising capabilities in solving mathematical reasoning tasks, leveraging Chain-of-Thought (CoT) data as a vital component in guiding answer generation. Current paradigms typically generate CoT and answers directly for a given problem, diverging from human problem-solving strategies to some extent. Humans often solve problems by recalling analogous cases and leveraging their solutions to reason about the current task. Inspired by this cognitive process, we propose MetaLadder, a novel framework that explicitly prompts LLMs to recall and reflect on meta-problems, those structurally or semantically analogous problems, alongside their CoT solutions before addressing the target problem. Additionally, we introduce a problem-restating mechanism to enhance the model's comprehension of the target problem by regenerating the original question, which further improves reasoning accuracy. Therefore, the model can achieve reasoning transfer from analogical problems, mimicking human-like "learning from examples" and generalization abilities. Extensive experiments on mathematical benchmarks demonstrate that our MetaLadder significantly boosts LLMs' problem-solving accuracy, largely outperforming standard CoT-based methods (10.3% accuracy gain) and other methods. Our code and data have been released at https://github.com/LHL3341/MetaLadder.

SuperPC: A Single Diffusion Model for Point Cloud Completion, Upsampling, Denoising, and Colorization

Yi Du,Zhipeng Zhao,Shaoshu Su,Sharath Golluri,Haoze Zheng,Runmao Yao,Chen Wang

Task: Propose SuperPC, a unified diffusion model that handles point cloud completion, upsampling, denoising, and colorization concurrently.

Motivation: Existing methods usually address each task independently and ignore how the defects influence and correlate with one another, which leads to error accumulation and higher computational cost.

Details

Method: Employ a three-level-conditioned diffusion framework with a novel spatial-mix-fusion strategy that exploits the correlations among the four defects for simultaneous, efficient processing. Result: SuperPC outperforms state-of-the-art specialized models and their combination on all four tasks. Conclusion: With a single unified model, SuperPC effectively addresses multiple point cloud processing tasks and shows superior performance on compounded defects. Abstract: Point cloud (PC) processing tasks, such as completion, upsampling, denoising, and colorization, are crucial in applications like autonomous driving and 3D reconstruction. Despite substantial advancements, prior approaches often address each of these tasks independently, with separate models focused on individual issues. However, this isolated approach fails to account for the fact that defects like incompleteness, low resolution, noise, and lack of color frequently coexist, with each defect influencing and correlating with the others. Simply applying these models sequentially can lead to error accumulation from each model, along with increased computational costs. To address these challenges, we introduce SuperPC, the first unified diffusion model capable of concurrently handling all four tasks. Our approach employs a three-level-conditioned diffusion framework, enhanced by a novel spatial-mix-fusion strategy, to leverage the correlations among these four defects for simultaneous, efficient processing. We show that SuperPC outperforms the state-of-the-art specialized models as well as their combination on all four individual tasks.

Deep Contrastive Unlearning for Language Models

Estrid He,Tabinda Sarwar,Ibrahim Khalil,Xun Yi,Ke Wang

Task: Propose DeepCUT, a machine unlearning framework for fine-tuning language models that removes the information carried by particular training samples without deteriorating the model's predictive quality.

Motivation: Large language models are trained on vast amounts of text, including copyrighted content and user-generated knowledge, which creates a risk of exposing users' privacy and violating copyright. To safeguard the "right to be forgotten", the paper studies how to remove the information carried by particular training samples without degrading predictive quality.

Details

Method: Propose DeepCUT, a machine unlearning framework that achieves unlearning by directly optimizing the model's latent space. Result: Comprehensive experiments on real-world datasets show that DeepCUT is more effective and efficient than baseline methods, with consistent and significant improvement. Conclusion: DeepCUT addresses the machine unlearning problem effectively: it removes the information of particular training samples while preserving predictive quality, offering a new way to protect user privacy and copyright. Abstract: The past few years have witnessed the great success of large language models, demonstrating powerful capabilities in comprehending textual data and generating human-like language. Large language models achieve success by being trained on vast amounts of textual data, including online sources with copyrighted content and user-generated knowledge. However, this comes at a cost: the potential risk of exposing users' privacy and violating copyright protections. Thus, to safeguard individuals' "right to be forgotten", there has been increasing interest in machine unlearning, the process of removing information carried by particular training samples from a model while not deteriorating its predictive quality. This is a challenging task due to the black-box nature of language models. Most existing studies focus on mitigating the impact of the forgotten samples upon a model's outputs, and do not explicitly consider the geometric distributions of samples in the latent space of a model. To address this issue, we propose a machine unlearning framework, named Deep Contrastive Unlearning for fine-Tuning (DeepCUT) language models. Our proposed model achieves machine unlearning by directly optimizing the latent space of a model. Comprehensive experiments on real-world datasets demonstrate the effectiveness and efficiency of DeepCUT with consistent and significant improvement over baseline methods.

Effortless Active Labeling for Long-Term Test-Time Adaptation

Guowei Wang,Changxing Ding

Task: Investigate effortless active labeling for long-term test-time adaptation (TTA), in which at most one sample per batch is selected for annotation.

Motivation: Long-term TTA is challenging because of error accumulation. Existing methods address this by actively labeling a small proportion of samples in each batch, but the annotation burden grows quickly as the number of batches increases.

Details

Method: First, annotate the most valuable sample in each batch based on a single-step optimization perspective in the TTA context, and identify such samples efficiently using feature perturbation. Second, having found that annotated and unannotated samples produce gradients of significantly different magnitudes, balance their impact on model optimization with two dynamic weights. Result: Extensive experiments on the ImageNet-C, -R, -K, -A and PACS databases show that the method consistently outperforms state-of-the-art methods at significantly lower annotation cost. Conclusion: The proposed method achieves active labeling without heavy annotation in long-term TTA, significantly reducing annotation cost while outperforming existing methods on multiple databases. Abstract: Long-term test-time adaptation (TTA) is a challenging task due to error accumulation. Recent approaches tackle this issue by actively labeling a small proportion of samples in each batch, yet the annotation burden quickly grows as the batch number increases. In this paper, we investigate how to achieve effortless active labeling so that a maximum of one sample is selected for annotation in each batch. First, we annotate the most valuable sample in each batch based on the single-step optimization perspective in the TTA context. In this scenario, the samples that border between the source- and target-domain data distributions are considered the most feasible for the model to learn in one iteration. Then, we introduce an efficient strategy to identify these samples using feature perturbation. Second, we discover that the gradient magnitudes produced by the annotated and unannotated samples have significant variations. Therefore, we propose balancing their impact on model optimization using two dynamic weights. Extensive experiments on the popular ImageNet-C, -R, -K, -A and PACS databases demonstrate that our approach consistently outperforms state-of-the-art methods with significantly lower annotation costs.

MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models

Jiazheng Li,Lu Yu,Qing Cui,Zhiqiang Zhang,Jun Zhou,Yanfang Ye,Chuxu Zhang

Task: Propose MASS, a data selection framework for pretraining large language models (LLMs) in the mathematical reasoning domain.

Motivation: High-quality data plays a critical role in pretraining and fine-tuning LLMs, but existing data selection methods tend to overlook the specific nuances of domain-related data.

Details

Method: Construct a skill graph that captures mathematical skills and their interrelations, use it to assign quality scores to the target dataset, and select a high-quality subset for pretraining. Result: Experiments show that MASS is efficient and effective across model sizes and pretraining datasets, significantly reducing the number of training tokens while improving model performance. Conclusion: MASS can improve both the efficiency and the effectiveness of LLM pretraining and has broad application potential. Abstract: High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to identify subsets of data that can effectively and efficiently enhance model performance. However, most of these methods focus on general data selection and tend to overlook the specific nuances of domain-related data. In this paper, we introduce MASS, a MAthematical data Selection framework using the Skill graph for pretraining LLMs in the mathematical reasoning domain. By taking into account the unique characteristics of mathematics and reasoning, we construct a skill graph that captures the mathematical skills and their interrelations from a reference dataset. This skill graph guides us in assigning quality scores to the target dataset, enabling us to select the top-ranked subset which is further used to pretrain LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Specifically, in terms of efficiency, models trained on subsets selected by MASS can achieve similar performance to models trained on the original datasets, with a significant reduction in the number of trained tokens, ranging from 50% to 70% fewer tokens. In terms of effectiveness, when trained on the same amount of tokens, models trained on the data selected by MASS outperform those trained on the original datasets by 3.3% to 5.9%. These results underscore the potential of MASS to improve both the efficiency and effectiveness of pretraining LLMs.

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Sara Sarto,Marcella Cornia,Rita Cucchiara

Task: Evaluating machine-generated image captions.

Motivation: With the advent of multimodal large language models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics.

Details

Method: This survey provides a comprehensive overview of advances in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. Result: The analysis reveals several limitations of standard evaluation approaches and suggests promising directions for future research on image captioning evaluation. Conclusion: The survey highlights the limitations of existing evaluation metrics and outlines directions for future research. Abstract: The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.

Covering Cracks in Content Moderation: Delexicalized Distant Supervision for Illicit Drug Jargon Detection

Minkyoo Song,Eugene Jang,Jaehan Kim,Seungwon Shin

Task: Detecting illicit drug jargon on social media.

Motivation: With rising drug-related concerns and the spread of social media, sales and discussions of illicit drugs have become commonplace online; existing drug jargon detection methods are easily evaded and cannot distinguish benign uses of a term.

Details

Method: Proposes JEDIS, a framework that detects illicit drug jargon by analyzing context; it combines distant supervision with delexicalization and can be trained without human-labeled data. Result: On two manually annotated datasets, JEDIS significantly outperforms existing word-based baselines in F1-score and detection coverage. Conclusion: JEDIS performs strongly at detecting illicit drug jargon and effectively addresses the shortcomings of existing methods. Abstract: In light of rising drug-related concerns and the increasing role of social media, sales and discussions of illicit drugs have become commonplace online. Social media platforms hosting user-generated content must therefore perform content moderation, which is a difficult task due to the vast amount of jargon used in drug discussions. Previous works on drug jargon detection were limited to extracting a list of terms, but these approaches have fundamental problems in practical application. First, they are trivially evaded using word substitutions. Second, they cannot distinguish whether euphemistic terms such as "pot" or "crack" are being used as drugs or in their benign meanings. We argue that drug content moderation should be done using contexts rather than relying on a banlist. However, manually annotated datasets for training such a task are not only expensive but also prone to becoming obsolete. We present JEDIS, a framework for detecting illicit drug jargon terms by analyzing their contexts. JEDIS utilizes a novel approach that combines distant supervision and delexicalization, which allows JEDIS to be trained without human-labeled data while being robust to new terms and euphemisms. Experiments on two manually annotated datasets show JEDIS significantly outperforms state-of-the-art word-based baselines in terms of F1-score and detection coverage in drug jargon detection. We also conduct qualitative analysis that demonstrates JEDIS is robust against pitfalls faced by existing approaches.
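
To make the combination of distant supervision and delexicalization concrete, here is a minimal sketch under stated assumptions. The seed list, the `[TERM]` placeholder, and the function names are hypothetical; the abstract does not describe JEDIS's actual labeling rule or masking token. The idea shown is only the general pattern: a seed list produces weak labels without human annotation, and the matched term is then masked so that a downstream classifier has to rely on context rather than on the word itself.

```python
import re


def delexicalize(text, terms, placeholder="[TERM]"):
    """Replace every occurrence of the given terms with a placeholder."""
    if not terms:
        return text
    # Longest terms first so multi-word terms win over their substrings.
    alternatives = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(placeholder, text)


def make_training_example(text, seed_terms):
    """Weakly label a text (1 if a seed term occurs) and mask the seed terms.

    The label comes from the seed list alone (distant supervision);
    the returned text no longer contains the term that produced the label.
    """
    masked = delexicalize(text, seed_terms)
    label = 1 if masked != text else 0
    return masked, label


def mask_candidate(text, candidate):
    """At inference time, mask an unseen candidate word the same way."""
    return delexicalize(text, {candidate})
```

Because the classifier only ever sees the placeholder, an unseen euphemism masked with `mask_candidate` presents the same input shape as the training data, which is one plausible reading of why delexicalization helps with new terms.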

Can Large Vision Language Models Read Maps Like a Human?

Shuo Xing,Zezhou Sun,Shuangyu Xie,Kaiyuan Chen,Yanjia Huang,Yuping Wang,Jiachen Li,Dezhen Song,Zhengzhong Tu

Task: Introducing MapBench, a dataset designed for outdoor navigation on human-readable, pixel-based maps.

Motivation: To address navigation in complex path-finding scenarios and to challenge the spatial reasoning and structured decision-making abilities of existing large vision-language models (LVLMs).

Details

Method: MapBench contains over 1600 pixel-space map path-finding problems drawn from 100 different maps, and provides a Map Space Scene Graph (MSSG) as an indexing data structure for converting between natural language and LVLM-generated results. Result: MapBench significantly challenges existing LVLMs and reveals critical limitations in their spatial reasoning and structured decision-making abilities. Conclusion: MapBench provides an important benchmark for evaluating and improving LVLMs on complex navigation tasks. Abstract: In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based map-based outdoor navigation, curated from complex path finding scenarios. MapBench comprises over 1600 pixel space map path finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides Map Space Scene Graph (MSSG) as an indexing data structure to convert between natural language and evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all the code and dataset at https://github.com/taco-group/MapBench.

ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming

Dewei Wang,Wei Zhu,Liyang Ling,Ettore Tiotto,Quintin Wang,Whitney Tsang,Julian Opperman,Jacky Deng

Task: Proposing ML-Triton, a multi-level compilation flow and programming interface that makes better use of the GPU hierarchy.

Motivation: The existing Triton compiler lowers directly from the workgroup level to the per-thread level, which is premature lowering and fails to exploit the GPU's hierarchical structure and the capabilities of its SIMD units.

Details

Method: Proposes ML-Triton, which adopts a multi-level gradual lowering flow from the workgroup level to the warp and intrinsic levels, and extends the Triton language to support user-set compiler hints and warp-level programming. Result: Experiments show that ML-Triton reaches more than 95% of the performance of expert-written kernels on Intel GPU. Conclusion: Through its multi-level compilation flow and programming interface, ML-Triton makes better use of the GPU hierarchy and delivers better performance. Abstract: In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tile-based approach. While traditional GPU programming often relies on low-level interfaces like CUDA or SYCL, Triton has emerged as a DSL that offers a more user-friendly and portable alternative by programming at a higher level. The current Triton starts at the workgroup (aka threadblock) level and lowers directly to the per-thread level. It then attempts to coalesce and amend through a series of passes, promoting information from the low-level representation. We believe this is premature lowering, based on the following observations. 1. GPUs have a hierarchical structure both physically and logically. Modern GPUs often feature SIMD units capable of directly operating on tiles on a warp or warpgroup basis, such as blocked load and blocked MMA. 2. Multi-level gradual lowering can make the compiler decoupled and clean by separating inter-layer and intra-layer considerations. 3. Kernel developers often need fine control to get good performance on the latest hardware. FlashAttention2 advocates explicit data partitioning between warps to boost performance. In this context, we propose ML-Triton, which features a multi-level compilation flow and programming interface. Our approach begins at the workgroup level and progressively lowers to the warp and intrinsic level, implementing a multi-level lowering aligned with the hierarchical nature of GPUs. Additionally, we extend the Triton language to support user-set compiler hints and warp-level programming, enabling researchers to get good out-of-the-box performance without awaiting compiler updates. Experimental results demonstrate that our approach achieves performance above 95% of expert-written kernels on Intel GPU, as measured by the geometric mean.

Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer

Yi Liao,Yongsheng Gao,Weichuan Zhang

Task: Proposing a new visual explanation method, the Dynamic Accumulated Attention Map (DAAM), for visualizing the attention flow inside ViT models.

Motivation: Existing visual explanation methods cannot display the attention flow hidden inside ViT models, and therefore cannot explain how the final attention regions are formed during a ViT's decision-making.

Details

Method: Proposes a novel decomposition module that constructs and stores spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block, and obtains channel importance coefficients by decomposing the classification score. For self-supervised ViT models, dimension-wise importance weights are proposed to compute the channel importance coefficients. Result: Quantitative and qualitative analyses consistently validate the effectiveness and superiority of DAAM for interpreting ViT models. Conclusion: DAAM can visualize the evolution of decision-making attention at any intermediate block inside a ViT model, providing a new tool for interpreting ViT models. Abstract: Various Vision Transformer (ViT) models have been widely used for image recognition tasks. However, existing visual explanation methods cannot display the attention flow hidden inside the inner structure of ViT models, which explains how the final attention regions are formed inside a ViT for its decision-making. In this paper, a novel visual explanation approach, Dynamic Accumulated Attention Map (DAAM), is proposed to provide a tool that can visualize, for the first time, the attention flow from the top to the bottom through ViT networks. To this end, a novel decomposition module is proposed to construct and store the spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block. The module can also obtain the channel importance coefficients by decomposing the classification score for supervised ViT models. Because of the lack of classification score in self-supervised ViT models, we propose dimension-wise importance weights to compute the channel importance coefficients. Such spatial features are linearly combined with the corresponding channel importance coefficients, forming the attention map for each block. The dynamic attention flow is revealed by block-wisely accumulating each attention map. The contribution of this work focuses on visualizing the evolution dynamic of the decision-making attention for any intermediate block inside a ViT model by proposing a novel decomposition module and dimension-wise importance weights. The quantitative and qualitative analysis consistently validate the effectiveness and superior capacity of the proposed DAAM for not only interpreting ViT models with the fully-connected layers as the classifier but also self-supervised ViT models. The code is available at https://github.com/ly9802/DynamicAccumulatedAttentionMap.
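
The two steps the abstract describes (linearly combining spatial features with channel importance coefficients, then accumulating the per-block maps) can be sketched in a few lines. This is an illustration only, not the released implementation: spatial features are assumed to be given as one flat list of values per channel, the coefficients are assumed to be already computed, and the function names are hypothetical.

```python
def block_attention_map(features, coeffs):
    """Linearly combine per-channel spatial features into one map.

    features: list of C channels, each a flat list of P spatial values.
    coeffs:   list of C channel importance coefficients.
    """
    n_positions = len(features[0])
    return [
        sum(c * channel[p] for c, channel in zip(coeffs, features))
        for p in range(n_positions)
    ]


def accumulate_maps(block_maps):
    """Return the running sum of attention maps, one entry per block."""
    accumulated = [0.0] * len(block_maps[0])
    history = []
    for block_map in block_maps:
        accumulated = [a + v for a, v in zip(accumulated, block_map)]
        history.append(list(accumulated))
    return history
```

Reading `history[k]` gives the accumulated map after block `k`, so stepping through the list shows how the attended region evolves from one block to the next.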

Inspecting the Representation Manifold of Differentially-Private Text

Stefan Arnold

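Task: Studying differential privacy (DP) for text, in particular the effect of text paraphrasing with language models and temperature sampling.

Motivation: To examine the geometric distortion that differentially private paraphrasing introduces to structure and complexity in the representation space.

Details

Method: Estimates the intrinsic dimension of paraphrased text under varying privacy budgets and compares the representation manifolds of word-level and sentence-level methods. Result: Word-level methods severely raise the representation manifold, while sentence-level methods produce paraphrases whose manifolds are topologically closer to human-written paraphrases. Among sentence-level methods, masked paraphrasing preserves structural complexity better than causal paraphrasing. Conclusion: Autoregressive generation propagates distortions from unnatural word choices, inflating the representation space, whereas masked paraphrasing is better at preserving structural complexity. Abstract: Differential Privacy (DP) for text has recently taken the form of text paraphrasing using language models and temperature sampling to better balance privacy and utility. However, the geometric distortion of DP regarding the structure and complexity in the representation space remains unexplored. By estimating the intrinsic dimension of paraphrased text across varying privacy budgets, we find that word-level methods severely raise the representation manifold, while sentence-level methods produce paraphrases whose manifolds are topologically more consistent with human-written paraphrases. Among sentence-level methods, masked paraphrasing, compared to causal paraphrasing, demonstrates superior preservation of structural complexity, suggesting that autoregressive generation propagates distortions from unnatural word choices that cascade and inflate the representation space.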
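
Temperature sampling, the mechanism the abstract refers to, can be sketched as follows. The snippet shows only the generic temperature-scaled softmax and a sampling step; how a given temperature maps to a privacy budget depends on the specific DP paraphrasing method and is deliberately not modeled here. Function names are illustrative.

```python
import math
import random


def temperature_softmax(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

A higher temperature flattens the distribution, so less likely tokens are sampled more often. That extra randomness is what such methods trade against utility, and it is also a plausible source of the unnatural word choices the abstract mentions.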

A Simple Combination of Diffusion Models for Better Quality Trade-Offs in Image Denoising

Jonas Dornbusch,Emanuel Pfarr,Florin-Alexandru Vasluianu,Frank Werner,Radu Timofte

Task: Proposing the Linear Combination Diffusion Denoiser (LCDD), which balances high visual quality and low distortion in image denoising.

Motivation: Existing diffusion models perform well on image reconstruction tasks but struggle to balance high visual quality against low distortion efficiently.

Details

Method: Proposes the Linear Combination Diffusion Denoiser (LCDD), which combines two complementary inference procedures: one that leverages the model's generative potential and another that ensures faithful signal recovery. Result: LCDD achieves state-of-the-art denoising performance and offers a controllable trade-off through the adjustment of a simple scalar hyperparameter. Conclusion: By combining generative potential with faithful signal recovery, LCDD balances high visual quality and low distortion in image denoising. Abstract: Diffusion models have garnered considerable interest in computer vision, owing both to their capacity to synthesize photorealistic images and to their proven effectiveness in image reconstruction tasks. However, existing approaches fail to efficiently balance the high visual quality of diffusion models with the low distortion achieved by previous image reconstruction methods. Specifically, for the fundamental task of additive Gaussian noise removal, we first illustrate an intuitive method for leveraging pretrained diffusion models. Further, we introduce our proposed Linear Combination Diffusion Denoiser (LCDD), which unifies two complementary inference procedures - one that leverages the model's generative potential and another that ensures faithful signal recovery. By exploiting the inherent structure of the denoising samples, LCDD achieves state-of-the-art performance and offers controlled, well-behaved trade-offs through a simple scalar hyperparameter adjustment.
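
A scalar-controlled trade-off of this kind can be illustrated with a convex combination of two estimates. This is a sketch under assumptions: the abstract does not give the exact combination rule LCDD uses, so the formula `lam * x_perceptual + (1 - lam) * x_low_distortion`, the parameter name `lam`, and the representation of images as flat lists of floats are all hypothetical.

```python
def combine_estimates(x_perceptual, x_low_distortion, lam):
    """Blend two denoised estimates with a single scalar in [0, 1].

    lam = 0 returns the low-distortion estimate unchanged;
    lam = 1 returns the perceptually oriented estimate unchanged.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        lam * a + (1.0 - lam) * b
        for a, b in zip(x_perceptual, x_low_distortion)
    ]


def mean_squared_error(x, reference):
    """Distortion of an estimate with respect to a reference signal."""
    return sum((a - b) ** 2 for a, b in zip(x, reference)) / len(x)
```

Sweeping `lam` from 0 to 1 and recording `mean_squared_error` alongside a perceptual metric would trace out the kind of quality-distortion curve the abstract alludes to.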

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

Francesco Maria Molfese,Luca Moroni,Luca Gioffrè,Alessandro Scirè,Simone Conia,Roberto Navigli

Task: Evaluating large language models (LLMs) on multiple-choice question answering (MCQA).

Motivation: Although MCQA is comparatively easy to evaluate, recent studies have questioned its reliability, especially when the model generates free-form text before selecting an answer.

Details

Method: Systematically analyzes whether existing answer extraction methods agree with human judgment, and how they are affected by answer constraints in the prompt across different domains. Result: Traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Conclusion: There is a trade-off between including format constraints in the prompt to simplify answer extraction and allowing free-form generation to improve reasoning; the authors call for standardized evaluation methodologies. Abstract: One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to assess, as the model's answer is thought to be simple to extract and is directly compared to a set of predefined choices. However, recent studies have started to question the reliability of MCQA evaluation, showing that multiple factors can significantly impact the reported performance of LLMs, especially when the model generates free-form text before selecting one of the answer choices. In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons. We systematically analyze whether existing answer extraction methods are aligned with human judgment, and how they are influenced by answer constraints in the prompt across different domains. Our experiments demonstrate that traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, we reveal a fundamental trade-off between including format constraints in the prompt to simplify answer extraction and allowing models to generate free-form text to improve reasoning. Our findings call for standardized evaluation methodologies and highlight the need for more reliable and consistent MCQA evaluation practices.
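
The gap between extraction strategies can be seen in a small example. The two extractors below are hypothetical stand-ins written for illustration, not the methods evaluated in the paper: a strict one that only accepts a bare option letter, and a lenient one that also looks for an "answer is X" pattern in free-form text.

```python
import re

# Accepts outputs that consist of a single option letter, e.g. "B" or "(B)."
STRICT = re.compile(r"^\s*\(?([A-D])\)?\s*\.?\s*$")
# Looks for phrases such as "the answer is (B)" or "Answer: C" anywhere.
LENIENT = re.compile(r"answer\s*(?:is|:)\s*\(?([A-D])\)?\b", re.IGNORECASE)


def extract_strict(output):
    """Return the option letter only if the output is nothing but a letter."""
    match = STRICT.match(output)
    return match.group(1) if match else None


def extract_lenient(output):
    """Return the last 'answer is X' letter, falling back to the strict rule."""
    matches = LENIENT.findall(output)
    if matches:
        return matches[-1].upper()
    return extract_strict(output)


def accuracy(outputs, gold, extractor):
    """Fraction of outputs whose extracted letter equals the gold letter."""
    hits = sum(1 for o, g in zip(outputs, gold) if extractor(o) == g)
    return hits / len(gold)
```

A correct free-form response such as "Paris is the capital, so the answer is (B)." scores as wrong under the strict rule and right under the lenient one. The lenient pattern has its own failure mode: because it ignores case, a phrase like "the answer is a matter of debate" would be read as option A.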

These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models

Parker Ewen,Hao Chen,Seth Isaacson,Joey Wilson,Katherine A. Skinner,Ram Vasudevan

Task: Proposing a new approach to uncertainty quantification for radiance fields that uses higher-order moments of the rendering equation.

Motivation: Uncertainty quantification is crucial for downstream tasks such as view planning and scene understanding, where safety and robustness matter. However, the high dimensionality and complexity of radiance fields make uncertainty quantification difficult, limiting the use of such methods in high-speed decision-making.

Details

Method: Exploits the probabilistic nature of the rendering process to compute higher-order moments of radiance field outputs (color, depth, and semantic predictions) efficiently and differentiably. Result: Extensive experiments on synthetic and real-world scenes show that the method achieves state-of-the-art performance while remaining simple. Conclusion: Beyond outperforming existing techniques in uncertainty quantification, the method also proves useful in downstream applications such as next-best-view selection and active ray sampling for neural radiance field training. Abstract: This paper introduces a novel approach to uncertainty quantification for radiance fields by leveraging higher-order moments of the rendering equation. Uncertainty quantification is crucial for downstream tasks including view planning and scene understanding, where safety and robustness are paramount. However, the high dimensionality and complexity of radiance fields pose significant challenges for uncertainty quantification, limiting the use of these uncertainty quantification methods in high-speed decision-making. We demonstrate that the probabilistic nature of the rendering process enables efficient and differentiable computation of higher-order moments for radiance field outputs, including color, depth, and semantic predictions. Our method outperforms existing radiance field uncertainty estimation techniques while offering a more direct, computationally efficient, and differentiable formulation without the need for post-processing. Beyond uncertainty quantification, we also illustrate the utility of our approach in downstream applications such as next-best-view (NBV) selection and active ray sampling for neural radiance field training. Extensive experiments on synthetic and real-world scenes confirm the efficacy of our approach, which achieves state-of-the-art performance while maintaining simplicity.
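
One way to read "higher-order moments of the rendering equation" is to treat the standard volume-rendering weights along a ray as a discrete probability distribution over samples. The sketch below follows that reading; it is an assumption for illustration rather than the paper's exact formulation. The weights use the familiar form w_i = T_i * alpha_i with T_i the transmittance accumulated before sample i, and any leftover transmittance is assigned to a hypothetical `background` value so the weights sum to one.

```python
def ray_weights(alphas):
    """Return per-sample rendering weights and the leftover transmittance."""
    weights = []
    transmittance = 1.0
    for alpha in alphas:
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights, transmittance


def ray_moments(alphas, values, background=0.0):
    """Mean and variance of a per-sample quantity along one ray.

    values can hold any scalar attached to the samples, such as depth
    or a single color channel.
    """
    weights, residual = ray_weights(alphas)
    first = sum(w * v for w, v in zip(weights, values)) + residual * background
    second = (
        sum(w * v * v for w, v in zip(weights, values))
        + residual * background * background
    )
    return first, second - first * first
```

The first moment is the usual rendered value. The variance is large when the weight is split between samples with very different values, which is the sort of per-ray uncertainty signal that next-best-view selection could consume.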

LLM Alignment for the Arabs: A Homogenous Culture or Diverse Ones?

Amr Keleg

Task: Discussing the limitations of Arabic large language models (LLMs) with respect to cultural diversity, and offering preliminary thoughts on building systems that better represent the cultural diversity of the Arab world.

Motivation: Existing Arabic LLMs assume that the Arab world is culturally homogeneous, overlooking its cultural diversity.

Details

Method: A position paper that analyzes the limitations of existing Arabic LLMs through discussion and offers preliminary ideas. Result: Identifies the shortcomings of existing Arabic LLMs regarding cultural diversity and proposes improvements. Conclusion: The author hopes the NLP community will take cultural diversity within a single language community into account when developing multilingual and Arabic-specific LLMs. Abstract: Large language models (LLMs) have the potential of being useful tools that can automate tasks and assist humans. However, these models are more fluent in English and more aligned with Western cultures, norms, and values. Arabic-specific LLMs are being developed to better capture the nuances of the Arabic language, as well as the views of the Arabs. Yet, Arabs are sometimes assumed to share the same culture. In this position paper, I discuss the limitations of this assumption and provide preliminary thoughts for how to build systems that can better represent the cultural diversity within the Arab world. The invalidity of the cultural homogeneity assumption might seem obvious, yet, it is widely adopted in developing multilingual and Arabic-specific LLMs. I hope that this paper will encourage the NLP community to be considerate of the cultural diversity within various communities speaking the same language.

Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs

Liu Jing,Amirul Rahman

Task: Enhancing the complex visual reasoning ability of large vision-language models (LVLMs) through end-to-end training.

Motivation: Existing LVLMs perform well on multimodal tasks but struggle with complex visual reasoning that requires multi-step inference.

Details

Method: Proposes MF-SQ-LLaVA, which augments visual question answering datasets with reasoning chains made of sub-question and answer pairs and trains the LVLM with a multi-task loss that encourages generating and answering these intermediate steps as well as predicting the final answer. Result: Experiments on ScienceQA and VQAv2 show that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models, including the base LLaVA and the original SQ-LLaVA. Ablation studies validate the contribution of each component, and human evaluation confirms improved accuracy and coherence of the reasoning process. Conclusion: Through end-to-end training with implicit self-questioning, MF-SQ-LLaVA significantly improves LVLM performance on complex visual reasoning tasks. Abstract: Large Vision-Language Models (LVLMs) have shown remarkable progress in various multimodal tasks, yet they often struggle with complex visual reasoning that requires multi-step inference. To address this limitation, we propose MF-SQ-LLaVA, a novel approach that enhances LVLMs by enabling implicit self-questioning through end-to-end training. Our method involves augmenting visual question answering datasets with reasoning chains consisting of sub-question and answer pairs, and training the LVLM with a multi-task loss that encourages the generation and answering of these intermediate steps, as well as the prediction of the final answer. We conduct extensive experiments on the ScienceQA and VQAv2 datasets, demonstrating that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models, including the base LLaVA and the original SQ-LLaVA. Ablation studies further validate the contribution of each component of our approach, and human evaluation confirms the improved accuracy and coherence of the reasoning process enabled by our method.

SPADE: Systematic Prompt Framework for Automated Dialogue Expansion in Machine-Generated Text Detection

Haoyi Li,Angela Yifei Yuan,Soyeon Caren Han,Christopher Leckie

Task: Developing models for detecting machine-generated text (MGT) and addressing the lack of training data.

Motivation: The growing ability of large language models (LLMs) to generate synthetic content has raised concerns about misuse and driven the development of MGT detection models. However, these detectors face significant challenges because systematically generated, high-quality datasets are lacking.

Details

Method: Proposes five novel data augmentation frameworks that generate synthetic user dialogues through a structured prompting approach, reducing the cost of traditional data collection. The method yields 14 new dialogue datasets, which are benchmarked on seven MGT detection models. Result: Generalization performance improves when using a mixed dataset produced by the proposed augmentation frameworks. The work also simulates online dialogue detection and examines the relationship between chat history length and detection accuracy. Conclusion: The proposed data augmentation frameworks improve the generalization of MGT detection models, and the open-source datasets are available for further research. Abstract: The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of systematically generated, high-quality datasets for training. To address this issue, we propose five novel data augmentation frameworks for synthetic user dialogue generation through a structured prompting approach, reducing the costs associated with traditional data collection methods. Our proposed method yields 14 new dialogue datasets, which we benchmark against seven MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by our proposed augmentation framework. Furthermore, considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. We also benchmark online detection performance with limited chat history on our frameworks. Our open-source datasets can be downloaded from https://github.com/AngieYYF/SPADE-customer-service-dialogue.

SplatVoxel: History-Aware Novel View Streaming without Temporal Training

Yiming Wang,Lucy Chai,Xuan Luo,Michael Niemeyer,Manuel Lagunas,Stephen Lombardi,Siyu Tang,Tiancheng Sun

Task: Generating a continuous sequence of high-quality, temporally consistent novel views from sparse-view videos.

Motivation: Existing novel view synthesis methods struggle with temporal coherence and visual fidelity, leading to flickering and inconsistency. To address this, the paper introduces history-awareness, using previous frames to reconstruct the scene and improve quality and stability.

Details

Method: Proposes a hybrid splat-voxel feed-forward scene reconstruction approach that combines Gaussian Splatting to propagate information over time with a hierarchical voxel grid for temporal fusion. Gaussian primitives are efficiently warped over time using a motion graph that extends 2D tracking models to 3D motion, while a sparse voxel transformer integrates new temporal observations in an error-aware manner. Result: The method achieves state-of-the-art performance in both static and streaming scene reconstruction, effectively reducing temporal and visual artifacts while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU. Conclusion: The method requires no training on multi-view video datasets, can be applied directly to sparse-view video streams in a history-aware manner at inference time, and significantly improves the quality and stability of novel view synthesis. Abstract: We study the problem of novel view streaming from sparse-view videos, which aims to generate a continuous sequence of high-quality, temporally consistent novel views as new input frames arrive. However, existing novel view synthesis methods struggle with temporal coherence and visual fidelity, leading to flickering and inconsistency. To address these challenges, we introduce history-awareness, leveraging previous frames to reconstruct the scene and improve quality and stability. We propose a hybrid splat-voxel feed-forward scene reconstruction approach that combines Gaussian Splatting to propagate information over time, with a hierarchical voxel grid for temporal fusion. Gaussian primitives are efficiently warped over time using a motion graph that extends 2D tracking models to 3D motion, while a sparse voxel transformer integrates new temporal observations in an error-aware manner. Crucially, our method does not require training on multi-view video datasets, which are currently limited in size and diversity, and can be directly applied to sparse-view video streams in a history-aware manner at inference time. Our approach achieves state-of-the-art performance in both static and streaming scene reconstruction, effectively reducing temporal artifacts and visual artifacts while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU. Project Page: https://19reborn.github.io/SplatVoxel/

ELTEX: A Framework for Domain-Driven Synthetic Data Generation

Arina Razmyslovich,Kseniia Murasheva,Sofia Sedlova,Julien Capitaine,Eugene Dmitriev

Task: Proposing ELTEX, a framework for generating high-quality synthetic training data in specialized domains.

Motivation: The performance of large language models (LLMs) in specialized domains such as cybersecurity is limited by the scarcity of domain-specific training data.

Details

Method: ELTEX systematically integrates explicit domain indicator extraction with dynamic prompting to preserve critical domain knowledge throughout the generation process. Result: For blockchain-related cyberattack detection, the ELTEX-enhanced model performs competitively with GPT-4 on standard classification metrics and uncertainty calibration while requiring significantly fewer computational resources. Conclusion: Domain-driven synthetic data generation can effectively bridge the performance gap between resource-efficient models and larger architectures in specialized domains. Abstract: We present ELTEX (Efficient LLM Token Extraction), a domain-driven framework for generating high-quality synthetic training data in specialized domains. While Large Language Models (LLMs) have shown impressive general capabilities, their performance in specialized domains like cybersecurity remains limited by the scarcity of domain-specific training data. ELTEX addresses this challenge by systematically integrating explicit domain indicator extraction with dynamic prompting to preserve critical domain knowledge throughout the generation process. We demonstrate ELTEX's effectiveness in the context of blockchain-related cyberattack detection, where we fine-tune Gemma-2B using various combinations of real and ELTEX-generated data. Our results show that the ELTEX-enhanced model achieves performance competitive with GPT-4 across both standard classification metrics and uncertainty calibration, while requiring significantly fewer computational resources. We release a curated synthetic dataset of social media texts for cyberattack detection in blockchain. Our work demonstrates that domain-driven synthetic data generation can effectively bridge the performance gap between resource-efficient models and larger architectures in specialized domains.

Construction Site Scaffolding Completeness Detection Based on Mask R-CNN and Hough Transform

Pei-Hsin Lin,Jacob J. Lin,Shang-Hsien Hsieh

Task: Proposing a deep learning-based computer vision approach for detecting scaffolding and its cross braces.

Motivation: Safety inspection of scaffolding is essential, but manual inspection is time-consuming and error-prone.

Details

Method: Trains a convolutional neural network (CNN) model on a scaffold image dataset with annotated labels. Result: The proposed approach automatically detects the completeness of cross braces from images taken at construction sites, without manual inspection. Conclusion: This non-invasive and efficient approach to detecting scaffolding completeness can help improve safety on construction sites. Abstract: Construction site scaffolding is essential for many building projects, and ensuring its safety is crucial to prevent accidents. The safety inspector must check the scaffolding's completeness and integrity, where most violations occur. The inspection process includes ensuring all the components are in the right place since workers often compromise safety for convenience and disassemble parts such as cross braces. This paper proposes a deep learning-based approach to detect the scaffolding and its cross braces using computer vision. A scaffold image dataset with annotated labels is used to train a convolutional neural network (CNN) model. With the proposed approach, we can automatically detect the completeness of cross braces from images taken at construction sites, without the need for manual inspection, saving a significant amount of time and labor costs. This non-invasive and efficient solution for detecting scaffolding completeness can help improve safety in construction sites.

A Data-driven Investigation of Euphemistic Language: Comparing the usage of "slave" and "servant" in 19th century US newspapers

Jaihyun Park,Ryan Cordell

Task: Studying the usage of the words "slave" and "servant" in 19th century US newspapers.

Motivation: To examine how "slave" and "servant" were used differently in 19th century US newspapers and the social and cultural meaning behind those differences.

Details

Method: Uses FastText embeddings to account for OCR errors, excludes text reprints, uses Word2vec embeddings to find words semantically close to "slave" and "servant", and computes log-odds ratios to identify over-represented words in Southern and Northern newspapers. Result: "Slave" is associated with socio-economic, legal, and administrative words, while "servant" is linked to religious words in Northern newspapers and to domestic and familial words in Southern newspapers. Slave discourse words from Southern newspapers are more prevalent in Northern newspapers, whereas servant discourse words from each side are prevalent in their own region. Conclusion: The study contributes to understanding how 19th century US newspapers created different discourses around enslaved African Americans. Abstract: This study investigates the usage of "slave" and "servant" in 19th century US newspapers using computational methods. While both terms were used to refer to enslaved African Americans, they were used in distinct ways. In the Chronicling America corpus, we included possible OCR errors by using FastText embedding and excluded text reprints to consider text reprint culture in the 19th century. Word2vec embedding was used to find semantically close words to "slave" and "servant" and log-odds ratio was calculated to identify over-represented discourse words in the Southern and Northern newspapers. We found that "slave" is associated with socio-economic, legal, and administrative words; however, "servant" is linked to religious words in the Northern newspapers while Southern newspapers associated "servant" with domestic and familial words. We further found that slave discourse words in Southern newspapers are more prevalent in Northern newspapers while servant discourse words from each side are prevalent in their own region. This study contributes to the understanding of how newspapers created different discourses around enslaved African Americans in the 19th century US.
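
The log-odds ratio used to find over-represented words can be sketched as follows. This is a simple additively smoothed variant chosen for illustration; the abstract does not say which variant or prior the authors used, and the `smoothing` parameter and function names are assumptions.

```python
import math
from collections import Counter


def log_odds_ratio(word, counts_a, counts_b, smoothing=0.5):
    """Smoothed log-odds ratio of a word between corpus A and corpus B.

    Positive values mean the word is over-represented in A,
    negative values mean it is over-represented in B.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    in_a = counts_a.get(word, 0)
    in_b = counts_b.get(word, 0)
    odds_a = (in_a + smoothing) / (total_a - in_a + smoothing)
    odds_b = (in_b + smoothing) / (total_b - in_b + smoothing)
    return math.log(odds_a) - math.log(odds_b)


def rank_overrepresented(counts_a, counts_b, smoothing=0.5):
    """Sort the joint vocabulary from most A-leaning to most B-leaning."""
    vocab = set(counts_a) | set(counts_b)
    return sorted(
        vocab,
        key=lambda w: log_odds_ratio(w, counts_a, counts_b, smoothing),
        reverse=True,
    )
```

Usage with invented toy counts (not data from the study): given `Counter({"legal": 15, "church": 5})` for one region and `Counter({"legal": 5, "church": 15})` for the other, "legal" receives a positive score and "church" a negative score of equal magnitude.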

ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints

Vihaan Misra,Peter Schaldenbrand,Jean Oh

Task: Generating images that match a semantic description using only a fixed set of rigid shapes, akin to solving tangram puzzles or arranging real-world objects.

Motivation: To tackle the more challenging problem of producing an image that matches a semantic description when only a fixed set of rigid shapes may be used.

Details

Method: The proposed method, ShapeShift, explicitly parameterizes each shape within a differentiable vector graphics pipeline and iteratively optimizes placement and orientation through score distillation sampling from pretrained diffusion models. To preserve arrangement clarity, it introduces a content-aware collision resolution mechanism that applies minimal, semantically coherent adjustments when overlaps occur, ensuring smooth convergence toward physically valid configurations. Result: Experiments show that ShapeShift produces compelling results across diverse scenarios and outperforms alternative techniques both quantitatively and qualitatively. Conclusion: By combining diffusion-based semantic guidance with explicit geometric constraints, the approach yields interpretable compositions whose spatial relationships clearly embody the text prompt. Abstract: While diffusion-based models excel at generating photorealistic images from text, a more nuanced challenge emerges when constrained to using only a fixed set of rigid shapes, akin to solving tangram puzzles or arranging real-world objects to match semantic descriptions. We formalize this problem as shape-based image generation, a new text-guided image-to-image translation task that requires rearranging the input set of rigid shapes into non-overlapping configurations and visually communicating the target concept. Unlike pixel-manipulation approaches, our method, ShapeShift, explicitly parameterizes each shape within a differentiable vector graphics pipeline, iteratively optimizing placement and orientation through score distillation sampling from pretrained diffusion models. To preserve arrangement clarity, we introduce a content-aware collision resolution mechanism that applies minimal semantically coherent adjustments when overlaps occur, ensuring smooth convergence toward physically valid configurations. By bridging diffusion-based semantic guidance with explicit geometric constraints, our approach yields interpretable compositions where spatial relationships clearly embody the textual prompt. Extensive experiments demonstrate compelling results across diverse scenarios, with quantitative and qualitative advantages over alternative techniques.

Exploring Model Editing for LLM-based Aspect-Based Sentiment Classification

Shichen Li,Zhongqing Wang,Zheyu Zhao,Yue Zhang,Peifeng Li

Task: Investigating model editing as an efficient way to adapt large language models (LLMs) to aspect-based sentiment classification.

Motivation: Model editing can significantly reduce computational cost and can precisely target critical components within LLMs, showing great potential for efficient fine-tuning.

Details

Method: Uses causal interventions to trace and determine which neuron hidden states are essential to the model's prediction. By performing interventions and restorations on each component of the LLM, the importance of these components for aspect-based sentiment classification is identified. Result: A distinct set of mid-layer representations is found to be essential for detecting the sentiment polarity of given aspect words. Based on these findings, a model editing approach that focuses on these critical parts of the LLM is developed. Conclusion: Experiments show that the approach achieves competitive results against the currently strongest methods with significantly fewer trainable parameters, demonstrating a more efficient and interpretable fine-tuning strategy. Abstract: Model editing aims at selectively updating a small subset of a neural model's parameters with an interpretable strategy to achieve desired modifications. It can significantly reduce computational costs to adapt to large language models (LLMs). Given its ability to precisely target critical components within LLMs, model editing shows great potential for efficient fine-tuning applications. In this work, we investigate model editing to serve as an efficient method for adapting LLMs to solve aspect-based sentiment classification. Through causal interventions, we trace and determine which neuron hidden states are essential for the prediction of the model. By performing interventions and restorations on each component of an LLM, we identify the importance of these components for aspect-based sentiment classification. Our findings reveal that a distinct set of mid-layer representations is essential for detecting the sentiment polarity of given aspect words. Leveraging these insights, we develop a model editing approach that focuses exclusively on these critical parts of the LLM, leading to a more efficient method for adapting LLMs. Our in-domain and out-of-domain experiments demonstrate that this approach achieves competitive results compared to the currently strongest methods with significantly fewer trainable parameters, highlighting a more efficient and interpretable fine-tuning strategy.

HandSplat: Embedding-Driven Gaussian Splatting for High-Fidelity Hand Rendering

Yilan Dong,Haohe Liu,Qing Wang,Jiahao Yang,Wenqing Wang,Gregory Slabaugh,Shanxin Yuan

Task: Proposing HandSplat, a novel Gaussian Splatting-based framework that improves the fidelity and stability of hand rendering.

Motivation: Existing 3D Gaussian Splatting methods for hand rendering rely on an oversimplified non-rigid motion model that fails to capture fine geometric and appearance details, and they suffer from geometric detail loss, temporal instability, and inefficient point distribution.

Details

Method: Extends the standard 3DGS attributes with implicit geometry and appearance embeddings to improve non-rigid motion modeling, and proposes a local gradient-aware densification strategy and pose-conditioned attribute regularization. Result: Experiments on InterHand2.6M show that HandSplat surpasses existing methods in fidelity and stability while achieving real-time performance. Conclusion: HandSplat effectively improves the fidelity and stability of hand rendering with real-time performance; the code and pretrained models will be released upon acceptance. Abstract: Existing 3D Gaussian Splatting (3DGS) methods for hand rendering rely on rigid skeletal motion with an oversimplified non-rigid motion model, which fails to capture fine geometric and appearance details. Additionally, they perform densification based solely on per-point gradients and process poses independently, ignoring spatial and temporal correlations. These limitations lead to geometric detail loss, temporal instability, and inefficient point distribution. To address these issues, we propose HandSplat, a novel Gaussian Splatting-based framework that enhances both fidelity and stability for hand rendering. To improve fidelity, we extend standard 3DGS attributes with implicit geometry and appearance embeddings for finer non-rigid motion modeling while preserving the static hand characteristic modeled by original 3DGS attributes. Additionally, we introduce a local gradient-aware densification strategy that dynamically refines Gaussian density in high-variation regions. To improve stability, we incorporate pose-conditioned attribute regularization to encourage attribute consistency across similar poses, mitigating temporal artifacts. Extensive experiments on InterHand2.6M demonstrate that HandSplat surpasses existing methods in fidelity and stability while achieving real-time performance. We will release the code and pre-trained models upon acceptance.

Increasing the Robustness of the Fine-tuned Multilingual Machine-Generated Text Detectors

Dominik Macko,Robert Moro,Ivan Srba

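Task: Develop an automated method to accurately detect machine-generated content.

Motivation: With the proliferation of LLMs, there are concerns about their misuse for creating and spreading harmful content. Humans can no longer distinguish high-quality machine-generated text from authentic human-written text, so automated means of detecting machine-generated content are needed.

Details

Method: Proposes a robust fine-tuning process of LLMs for the detection task, making the detectors more robust against obfuscation and more generalizable to out-of-distribution data. Result: The approach makes detectors more robust and more generalizable when detecting machine-generated content. Conclusion: The proposed robust fine-tuning process enables more effective detection of machine-generated content, which in turn provides additional information about its credibility. Abstract: Since the proliferation of LLMs, there have been concerns about their misuse for harmful content creation and spreading. Recent studies justify such fears, providing evidence of LLM vulnerabilities and high potential of their misuse. Humans are no longer able to distinguish between high-quality machine-generated and authentic human-written texts. Therefore, it is crucial to develop automated means to accurately detect machine-generated content. This would make it possible to identify such content in the online information space, thus providing additional information about its credibility. This work addresses the problem by proposing a robust fine-tuning process of LLMs for the detection task, making the detectors more robust against obfuscation and more generalizable to out-of-distribution data.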

RETHINED: A New Benchmark and Baseline for Real-Time High-Resolution Image Inpainting On Edge Devices

Marcelo Sanchez,Gil Triginer,Ignacio Sarasua,Lara Raad,Coloma Ballester

Task: Propose a baseline method capable of real-time high-resolution image inpainting on edge devices.

Motivation: Existing image inpainting methods perform well on low-resolution images but fail at high resolutions and require powerful hardware, which limits their deployment on edge devices.

Details

Method: Proposes a simple yet effective method consisting of a lightweight convolutional neural network (CNN) and a resolution-agnostic patch replacement mechanism. Result: An extensive analysis on various mobile devices shows inpainting performance similar to existing state-of-the-art methods while being 100 times faster. Conclusion: The method performs real-time high-resolution inpainting on edge devices, and the authors release the first free-form mask ultra-high-definition inpainting dataset. Abstract: Existing image inpainting methods have shown impressive completion results for low-resolution images. However, most of these algorithms fail at high resolutions and require powerful hardware, limiting their deployment on edge devices. Motivated by this, we propose the first baseline for REal-Time High-resolution image INpainting on Edge Devices (RETHINED) that is able to inpaint at ultra-high resolution and can run in real time (≤ 30 ms) on a wide variety of mobile devices. It is a simple yet effective novel method formed by a lightweight Convolutional Neural Network (CNN) to recover structure, followed by a resolution-agnostic patch replacement mechanism to provide detailed texture. Specifically, our pipeline leverages the structural capacity of CNNs and the high-level detail of patch-based methods, which is a key component for high-resolution image inpainting. To demonstrate the real application of our method, we conduct an extensive analysis on various mobile-friendly devices and demonstrate similar inpainting performance while being 100× faster than existing state-of-the-art methods. Furthermore, we release DF8K-Inpainting, the first free-form mask UHD inpainting dataset.

Christina Zorenböhmer,Sebastian Schmidt,Bernd Resch

Task: Generate the first training dataset for Aspect-based Emotion Analysis (ABEA) and fine-tune a BERT-based model for Aspect Term Extraction (ATE) and Aspect Emotion Classification (AEC).

Motivation: Address the dataset bottleneck and the increased complexity of emotion classes that the ABEA field faces.

Details

Method: Generates an ABEA training dataset of 2,621 English Tweets annotated according to the hierarchical emotion theory of Shaver et al., and fine-tunes the GRACE model for the ABEA sub-tasks. Result: An F1-score of 70.1% for ATE and 46.9% for joint ATE and AEC extraction. Conclusion: Model performance is limited mainly by the small training dataset and the increased task complexity, which cause overfitting and limited generalization. Abstract: While sentiment analysis has advanced from the sentence level to the aspect level, i.e., the identification of concrete terms related to a sentiment, the equivalent field of Aspect-based Emotion Analysis (ABEA) is faced with dataset bottlenecks and the increased complexity of emotion classes in contrast to binary sentiments. This paper addresses these gaps by generating a first ABEA training dataset, consisting of 2,621 English Tweets, and fine-tuning a BERT-based model for the ABEA sub-tasks of Aspect Term Extraction (ATE) and Aspect Emotion Classification (AEC). The dataset annotation process was based on the hierarchical emotion theory by Shaver et al. [1] and made use of group annotation and majority voting strategies to facilitate label consistency. The resulting dataset contained aspect-level emotion labels for Anger, Sadness, Happiness, Fear, and a None class. Using the new ABEA training dataset, the state-of-the-art ABSA model GRACE by Luo et al. [2] was fine-tuned for ABEA. The results reflected a performance plateau at an F1-score of 70.1% for ATE and 46.9% for joint ATE and AEC extraction. The limiting factors for model performance were broadly identified as the small training dataset size coupled with the increased task complexity, causing model overfitting and limited abilities to generalize well on new data.

Validation of Human Pose Estimation and Human Mesh Recovery for Extracting Clinically Relevant Motion Data from Videos

Kai Armstrong,Alexander Rodrigues,Alexander P. Willmott,Lei Zhang,Xujiong Ye

Task: Comparatively analyze the use of marker-less motion capture techniques in a clinical setting and validate their comparability with existing techniques.

Motivation: Validate the feasibility of marker-less motion capture for kinematic analysis and show its advantages over conventional techniques.

Details

Method: Compares marker-less motion capture techniques (human pose estimation and human mesh recovery) against existing techniques (inertial measurement units and retroreflective marker-based optical motion capture). Result: Marker-less motion capture produces results in line with both IMU and MoCap kinematics, with a shorter set-up time and less required expertise. Conclusion: Although the data quality of marker-less motion capture still has room for improvement, its error is acceptable for clinical tests that involve low-speed actions. Abstract: This work aims to discuss the current landscape of kinematic analysis tools, ranging from the state-of-the-art in sports biomechanics such as inertial measurement units (IMUs) and retroreflective marker-based optical motion capture (MoCap) to more novel approaches from the field of computing such as human pose estimation and human mesh recovery. Primarily, this comparative analysis aims to validate the use of marker-less MoCap techniques in a clinical setting by showing that these marker-less techniques are within a reasonable range for kinematics analysis compared to the more cumbersome and less portable state-of-the-art tools. Not only does marker-less motion capture using human pose estimation produce results in line with both the IMU and MoCap kinematics, but it also benefits from a reduced set-up time and requires less practical knowledge and expertise to set up. Overall, while there is still room for improvement when it comes to the quality of the data produced, we believe that this compromise is within the margin of error acceptable for the low-speed actions used in small clinical tests.

Comparing Llama3 and DeepSeekR1 on Biomedical Text Classification Tasks

Yuting Guo,Abeed Sarker

Task: Compare the performance of two open-source large language models (Llama3-70B and DeepSeekR1-distill-Llama3-70B) on six biomedical text classification tasks.

Motivation: Assess how different large language models perform on biomedical text classification tasks, particularly in zero-shot settings.

Details

Method: Runs experiments on six biomedical text classification tasks, four involving social media data and two involving clinical notes from electronic health records. All experiments use zero-shot settings and report precision, recall, and F1 scores with their 95% confidence intervals. Result: DeepSeekR1-distill-Llama3-70B achieves better precision on most tasks, with mixed results on recall. The zero-shot LLMs reach high F1 scores on some tasks but perform poorly on others. Conclusion: Model selection should be guided by the specific requirements of the health-related text classification task, particularly when weighing precision-recall trade-offs. When annotated data is available, supervised classification approaches may be more reliable than zero-shot LLMs. Abstract: This study compares the performance of two open-source large language models (LLMs), Llama3-70B and DeepSeekR1-distill-Llama3-70B, on six biomedical text classification tasks. Four tasks involve data from social media, while two tasks focus on clinical notes from electronic health records, and all experiments were performed in zero-shot settings. Performance metrics, including precision, recall, and F1 scores, were measured for each task, along with their 95% confidence intervals. Results demonstrated that DeepSeekR1-distill-Llama3-70B generally performs better in terms of precision on most tasks, with mixed results on recall. While the zero-shot LLMs demonstrated high F1 scores for some tasks, they grossly underperformed on others, for data from both sources. The findings suggest that model selection should be guided by the specific requirements of the health-related text classification tasks, particularly when considering the precision-recall trade-offs, and that, in the presence of annotated data, supervised classification approaches may be more reliable than zero-shot LLMs.
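
The metrics named in this abstract can be sketched for a single binary task as follows. The abstract does not say how the 95% confidence intervals were computed, so the percentile bootstrap here is an assumption, and the labels, predictions, and function names are hypothetical.

```python
import random

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class of a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def bootstrap_f1_interval(y_true, y_pred, n_resamples=1000, seed=0):
    """Percentile-bootstrap 95% interval for F1 (an assumed CI method)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample examples with replacement and recompute F1.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(precision_recall_f1([y_true[i] for i in idx],
                                          [y_pred[i] for i in idx])[2])
    scores.sort()
    return scores[int(0.025 * n_resamples)], scores[int(0.975 * n_resamples) - 1]

# Hypothetical gold labels and zero-shot predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
low, high = bootstrap_f1_interval(y_true, y_pred)
```

With only eight examples the interval is very wide, which illustrates why point estimates alone can mislead when comparing two models on small test sets.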

Revisiting Image Fusion for Multi-Illuminant White-Balance Correction

David Serrano-Lozano,Aditya Arora,Luis Herranz,Konstantinos G. Derpanis,Michael S. Brown,Javier Vazquez-Corral

Task: Propose an efficient transformer-based model to improve white-balance correction in multi-illuminant scenes.

Motivation: Existing fusion-based methods are suboptimal in multi-illuminant scenarios, and dedicated multi-illuminant image datasets are lacking.

Details

Method: Proposes a transformer-based model that effectively captures spatial dependencies across sRGB white-balance presets, and introduces a large-scale multi-illuminant dataset of over 16,000 sRGB images. Result: The new method improves on existing techniques by up to 100% on the multi-illuminant image fusion dataset. Conclusion: The proposed transformer model and multi-illuminant dataset substantially improve white-balance correction in multi-illuminant scenes. Abstract: White balance (WB) correction in scenes with multiple illuminants remains a persistent challenge in computer vision. Recent methods explored fusion-based approaches, where a neural network linearly blends multiple sRGB versions of an input image, each processed with predefined WB presets. However, we demonstrate that these methods are suboptimal for common multi-illuminant scenarios. Additionally, existing fusion-based methods rely on sRGB WB datasets lacking dedicated multi-illuminant images, limiting both training and evaluation. To address these challenges, we introduce two key contributions. First, we propose an efficient transformer-based model that effectively captures spatial dependencies across sRGB WB presets, substantially improving upon linear fusion techniques. Second, we introduce a large-scale multi-illuminant dataset comprising over 16,000 sRGB images rendered with five different WB settings, along with WB-corrected images. Our method achieves up to 100% improvement over existing techniques on our new multi-illuminant image fusion dataset.
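
The linear fusion baseline that this abstract argues against can be illustrated with a minimal sketch. It assumes per-pixel weight maps are already given (in the prior methods a neural network predicts them); the two presets, the weights, and the function name are hypothetical, and this is not the transformer model the paper proposes.

```python
def blend_presets(presets, weights):
    """Per-pixel linear blend: out[p][c] = sum over k of weights[k][p] * presets[k][p][c].

    presets: list of images, each a list of [r, g, b] pixels.
    weights: list of weight maps, one scalar per pixel per preset.
    """
    n_pixels = len(presets[0])
    blended = []
    for p in range(n_pixels):
        pixel = [0.0, 0.0, 0.0]
        for preset, weight_map in zip(presets, weights):
            for c in range(3):
                pixel[c] += weight_map[p] * preset[p][c]
        blended.append(pixel)
    return blended

# Two hypothetical WB renderings of the same two-pixel image.
warm = [[1.0, 0.5, 0.0], [1.0, 0.5, 0.0]]
cool = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]
# Pixel 0 uses only the warm preset; pixel 1 mixes 25% warm with 75% cool.
weights = [[1.0, 0.25], [0.0, 0.75]]
fused = blend_presets([warm, cool], weights)
```

Because each output pixel depends only on its own weights in this form, any spatial reasoning has to happen in whatever predicts the weight maps.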

Entity-aware Cross-lingual Claim Detection for Automated Fact-checking

Rrubaa Panchendrarajan,Arkaitz Zubiaga

Task: Identify claims that require verification, especially given the proliferation of misinformation on social media platforms.

Motivation: Despite significant progress on this task, handling multilingual and multimodal data remains an open challenge.

Details

Method: Proposes the EX-Claim model, which uses named entity recognition and entity linking to improve language-level performance. Result: Extensive experiments on three datasets from different social media platforms show that the model significantly outperforms the baselines across 27 languages and achieves the highest rate of knowledge transfer. Conclusion: EX-Claim handles claims written in any language and achieves good knowledge transfer even with limited training data. Abstract: Identifying claims requiring verification is a critical task in automated fact-checking, especially given the proliferation of misinformation on social media platforms. Despite significant progress in the task, there remain open challenges such as dealing with multilingual and multimodal data prevalent in online discourse. Addressing the multilingual challenge, recent efforts have focused on fine-tuning pre-trained multilingual language models. While these models can handle multiple languages, their ability to effectively transfer cross-lingual knowledge for detecting claims spreading on social media remains under-explored. In this paper, we introduce EX-Claim, an entity-aware cross-lingual claim detection model that generalizes well to handle claims written in any language. The model leverages entity information derived from named entity recognition and entity linking techniques to improve the language-level performance of both seen and unseen languages during training. Extensive experiments conducted on three datasets from different social media platforms demonstrate that our proposed model significantly outperforms the baselines, across 27 languages, and achieves the highest rate of knowledge transfer, even with limited training data.

RAT: Boosting Misclassification Detection Ability without Extra Data

Ge Yan,Tsui-Wei Weng

Task: Study misclassification detection for image classification models, specifically through the lens of adversarial perturbation.

Motivation: As deep neural networks are widely deployed in high-stakes domains, detecting incorrect model predictions and intervening becomes crucial.

Details

Method: Proposes robust radius (i.e., input-space margin) as a confidence metric and designs two efficient estimation algorithms, RR-BS and RR-Fast. Additionally designs a training method called Radius Aware Training (RAT) to improve the model's ability to identify mistakes. Result: Experiments show up to a 29.3% reduction in AURC and a 21.62% reduction in FPR@95TPR compared with previous methods. Conclusion: The proposed method performs well at detecting misclassifications of image classification models and significantly improves detection performance. Abstract: As deep neural networks (DNNs) become increasingly prevalent, particularly in high-stakes areas such as autonomous driving and healthcare, the ability to detect incorrect predictions of models and intervene accordingly becomes crucial for safety. In this work, we investigate the detection of misclassified inputs for image classification models through the lens of adversarial perturbation: we propose to use robust radius (a.k.a. input-space margin) as a confidence metric and design two efficient estimation algorithms, RR-BS and RR-Fast, for misclassification detection. Furthermore, we design a training method called Radius Aware Training (RAT) to boost models' ability to identify mistakes. Extensive experiments show our method could achieve up to 29.3% reduction in AURC and 21.62% reduction in FPR@95TPR, compared with previous methods.
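
The idea of using an input-space margin as a confidence score can be illustrated with a toy sketch. The abstract does not describe how RR-BS or RR-Fast work, so the code below is not either algorithm. It only binary-searches, along one fixed direction, for the smallest step that flips the label of a hypothetical linear classifier, which gives an upper bound on the robust radius.

```python
def predict(x, w, b):
    """Toy linear classifier (hypothetical): 1 if w.x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def radius_along_direction(x, direction, w, b, max_radius=10.0, tol=1e-6):
    """Smallest step along `direction` that changes the predicted label."""
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]
    label = predict(x, w, b)

    def flipped(r):
        return predict([xi + r * ui for xi, ui in zip(x, unit)], w, b) != label

    if not flipped(max_radius):
        return max_radius  # no label change found within the search range
    lo, hi = 0.0, max_radius
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if flipped(mid):
            hi = mid  # flip already happens at mid, so shrink from above
        else:
            lo = mid
    return hi

w, b = [1.0, 0.0], 0.0
# A point far from the decision boundary versus one close to it.
confident = radius_along_direction([3.0, 0.0], [-1.0, 0.0], w, b)
borderline = radius_along_direction([0.5, 0.0], [-1.0, 0.0], w, b)
```

Under the abstract's premise, the input with the smaller radius would be flagged as the more likely misclassification. The binary search assumes that the label flips only once along the direction, which holds for this linear toy model but not for DNNs in general.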

Model Hubs and Beyond: Analyzing Model Popularity, Performance, and Documentation

Pritam Kadasi,Sriman Reddy,Srivathsa Vamsi Chaturvedula,Rudranshu Sen,Agnish Saha,Soumavo Sikdar,Sayani Sarkar,Suhani Mittal,Rohit Jindal,Mayank Singh

Task: Study the relationship between model popularity and actual performance on Hugging Face, and how the comprehensiveness of model documentation correlates with popularity and performance.

Motivation: With the surge in the number of machine learning models on platforms such as Hugging Face, users choosing a model often rely on popularity (download counts, likes, or recency) and overlook actual performance.

Details

Method: Evaluates 500 sentiment analysis models on Hugging Face, with large-scale human annotation (nearly 80,000 annotations) and extensive model training and evaluation. Result: Model popularity does not necessarily correlate with actual performance. About 80% of the models lack detailed information about the model, training, and evaluation processes, and about 88% of model authors overstate their models' performance in the model cards. Conclusion: Based on the findings, the authors provide a checklist of guidelines to help users choose models for downstream tasks. Abstract: With the massive surge in ML models on platforms like Hugging Face, users often lose track and struggle to choose the best model for their downstream tasks, frequently relying on model popularity indicated by download counts, likes, or recency. We investigate whether this popularity aligns with actual model performance and how the comprehensiveness of model documentation correlates with both popularity and performance. In our study, we evaluated a comprehensive set of 500 Sentiment Analysis models on Hugging Face. This evaluation involved massive annotation efforts, with human annotators completing nearly 80,000 annotations, alongside extensive model training and evaluation. Our findings reveal that model popularity does not necessarily correlate with performance. Additionally, we identify critical inconsistencies in model card reporting: approximately 80% of the models analyzed lack detailed information about the model, training, and evaluation processes. Furthermore, about 88% of model authors overstate their models' performance in the model cards. Based on our findings, we provide a checklist of guidelines for users to choose good models for downstream tasks.

SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting

Haiyang Ying,Matthias Zwicker

Task: Reconstruct parametric 3D edges from calibrated multi-view images.

Motivation: Existing methods usually reconstruct a 3D edge point set from multi-view 2D edge images and then fit 3D edges to that point set. However, noise in the point set may cause gaps among the fitted edges, and because edge fitting depends only on the reconstructed 3D point set, the recovered edges may not align with the input multi-view images.

Details

Method: Proposes SketchSplat, a method that reconstructs accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. It represents 3D edges as sketches, which are parametric lines and curves defined by attributes such as control points, scales, and opacity. During edge reconstruction, Gaussian points are iteratively sampled from a set of sketches and rasterized onto 2D edge images. The gradient of the image error with respect to the input 2D edge images can then be back-propagated to optimize the sketch attributes. Result: Experiments show that the method achieves state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset. Conclusion: The method bridges 2D edge images and 3D edges in a differentiable manner, which ensures that the 3D edges align with the 2D images and leads to accurate and complete results. A series of adaptive topological operations applied during sketch optimization reduces the number of sketches required while preserving high accuracy, yielding a more compact reconstruction. Abstract: Edges are one of the most basic parametric primitives to describe structural information in 3D. In this paper, we study parametric 3D edge reconstruction from calibrated multi-view images. Previous methods usually reconstruct a 3D edge point set from multi-view 2D edge images, and then fit 3D edges to the point set. However, noise in the point set may cause gaps among fitted edges, and the recovered edges may not align with input multi-view images since the edge fitting depends only on the reconstructed 3D point set. To mitigate these problems, we propose SketchSplat, a method to reconstruct accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. We represent 3D edges as sketches, which are parametric lines and curves defined by attributes including control points, scales, and opacity. During edge reconstruction, we iteratively sample Gaussian points from a set of sketches and rasterize the Gaussians onto 2D edge images. Then the gradient of the image error with respect to the input 2D edge images can be back-propagated to optimize the sketch attributes. Our method bridges 2D edge images and 3D edges in a differentiable manner, which ensures that 3D edges align well with 2D images and leads to accurate and complete results. We also propose a series of adaptive topological operations and apply them along with the sketch optimization. The topological operations help reduce the number of sketches required while ensuring high accuracy, yielding a more compact reconstruction. Finally, we contribute an accurate 2D edge detector that improves the performance of both our method and existing methods. Experiments show that our method achieves state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset.

Exploring Large Language Models for Word Games: Who is the Spy?

Chentian Wei,Jiewei Chen,Jinzhu Xu

Task: Explore how large language models (LLMs) can participate effectively in word games, and propose a training-free framework.

Motivation: Because of their rule-based and situational nature, word games hold significant research value for natural language processing (NLP), game theory, and related fields.

Details

Method: Taking the classic word game "Who is the Spy" as an example, introduces a Chain-of-Thought (CoT)-based scheduling framework that enables LLMs to perform well in tasks such as inferring role words and disguising their identities. Result: Experimental results confirm the framework's effectiveness, with notable improvements in LLM performance across multiple datasets. Conclusion: The work highlights the potential of LLMs to master situational reasoning and social interaction within structured game environments. Abstract: Word games hold significant research value for natural language processing (NLP), game theory, and related fields due to their rule-based and situational nature. This study explores how large language models (LLMs) can be effectively involved in word games and proposes a training-free framework. "Shei Shi Wo Di", or "Who is the Spy" in English, is a classic word game. Using this game as an example, we introduce a Chain-of-Thought (CoT)-based scheduling framework to enable LLMs to achieve excellent performance in tasks such as inferring role words and disguising their identities. We evaluate the framework's performance based on game success rates and the accuracy of the LLM agents' analytical results. Experimental results affirm the framework's effectiveness, demonstrating notable improvements in LLM performance across multiple datasets. This work highlights the potential of LLMs in mastering situational reasoning and social interactions within structured game environments. Our code is publicly available at https://github.com/ct-wei/Who-is-The-Spy.

Prototype Perturbation for Relaxing Alignment Constraints in Backward-Compatible Learning

Zikun Zhou,Yushuai Sun,Wenjie Pei,Xin Li,Yaowei Wang

Task: Propose a new method that relaxes the alignment constraints in backward-compatible learning in order to preserve the discriminative ability of the new model.

Motivation: The traditional way of updating retrieval models requires re-computing the embeddings of the gallery data, a time-consuming and computationally intensive process. Backward-Compatible Learning (BCL) has been widely explored to avoid this, but its strong alignment constraints compromise the discriminative ability of the new model.

Details

Method: Relaxes the constraints by introducing perturbations to the old feature prototypes, so that the new feature space is aligned with a pseudo-old feature space defined by the perturbed prototypes. Two approaches for computing the perturbations are proposed: Neighbor-Driven Prototype Perturbation (NDPP) and Optimization-Driven Prototype Perturbation (ODPP). Result: Experiments on multiple datasets show that the proposed approaches outperform existing state-of-the-art algorithms in backward-compatible learning. Conclusion: Relaxing the constraints through prototype perturbation achieves backward compatibility while preserving the discriminative ability of the new model. Abstract: The traditional paradigm to update retrieval models requires re-computing the embeddings of the gallery data, a time-consuming and computationally intensive process known as backfilling. To circumvent backfilling, Backward-Compatible Learning (BCL) has been widely explored, which aims to train a new model compatible with the old one. Many previous works focus on effectively aligning the embeddings of the new model with those of the old one to enhance the backward-compatibility. Nevertheless, such strong alignment constraints would compromise the discriminative ability of the new model, particularly when different classes are closely clustered and hard to distinguish in the old feature space. To address this issue, we propose to relax the constraints by introducing perturbations to the old feature prototypes. This allows us to align the new feature space with a pseudo-old feature space defined by these perturbed prototypes, thereby preserving the discriminative ability of the new model in backward-compatible learning. We have developed two approaches for calculating the perturbations: Neighbor-Driven Prototype Perturbation (NDPP) and Optimization-Driven Prototype Perturbation (ODPP). Particularly, they take into account the feature distributions of not only the old but also the new models to obtain proper perturbations along with new model updating. Extensive experiments on the landmark and commodity datasets demonstrate that our approaches perform favorably against state-of-the-art BCL algorithms.

BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?

Pierre Chambon,Baptiste Roziere,Benoit Sagot,Gabriel Synnaeve

Task: Evaluate the ability of generative language models to understand and generate code with specified time and space complexities.

Motivation: Current evaluations often overlook the ability of models to comprehend and produce code constrained by computational complexity; BigO(Bench) aims to fill this gap.

Details

Method: BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, plus a set of 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred time and space complexity labels. Result: Multiple state-of-the-art language models are evaluated and show distinct strengths and weaknesses in handling complexity requirements; in particular, token-space reasoning models excel at code generation but not at complexity understanding. Conclusion: Token-space reasoning models excel at code generation but not at complexity understanding, which suggests they may not generalize well to tasks for which no reward was given at training time. Abstract: We introduce BigO(Bench), a novel coding benchmark designed to evaluate the capabilities of generative language models in understanding and generating code with specified time and space complexities. This benchmark addresses the gap in current evaluations that often overlook the ability of models to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as corresponding runtime and memory footprint values for a large set of input sizes. We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.
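
Inferring complexity from measurements can be illustrated with a minimal sketch. The abstract does not describe the benchmark's actual tooling, so everything here is an assumption: the code counts loop steps instead of profiling real runtime (to keep the example deterministic), and it estimates the polynomial order as the least-squares slope of log(cost) against log(n). All function names are hypothetical.

```python
import math

def loglog_slope(sizes, costs):
    """Least-squares slope of log(cost) versus log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def count_linear(n):
    """Stand-in for a measured O(n) function: returns its step count."""
    steps = 0
    for _ in range(n):
        steps += 1
    return steps

def count_quadratic(n):
    """Stand-in for a measured O(n^2) function: returns its step count."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            steps += 1
    return steps

sizes = [50, 100, 200, 400]
linear_slope = loglog_slope(sizes, [count_linear(n) for n in sizes])
quadratic_slope = loglog_slope(sizes, [count_quadratic(n) for n in sizes])
```

A slope near 1 suggests linear growth and a slope near 2 suggests quadratic growth. Real runtime measurements are noisy and include logarithmic factors, so a practical tool would need repeated runs and a richer set of candidate complexity classes than this sketch covers.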

Decompositional Neural Scene Reconstruction with Generative Diffusion Prior

Junfeng Ni,Yu Liu,Ruijie Lu,Zirui Zhou,Song-Chun Zhu,Yixin Chen,Siyuan Huang

Task: Decompositional reconstruction of 3D scenes from sparse views, with complete shapes and detailed textures for all objects.

Motivation: Existing methods degrade significantly in underconstrained areas and fail to recover occluded regions, so the missing information for these areas must be supplemented.

Details

Method: Proposes DP-Recon, which uses diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each object under novel views, and introduces a visibility-guided approach that dynamically adjusts the per-pixel SDS loss weights. Result: Experiments on Replica and ScanNet++ show that the method significantly outperforms existing techniques, achieving better object reconstruction under 10 views than the baselines under 100 views. Conclusion: Through SDS optimization, DP-Recon enables seamless text-based editing of geometry and appearance and produces decomposed object meshes with detailed UV maps that support photorealistic visual effects (VFX) editing. Abstract: Decompositional reconstruction of 3D scenes, with complete shapes and detailed texture of all objects within, is intriguing for downstream applications but remains challenging, particularly with sparse views as input. Recent approaches incorporate semantic or geometric regularization to address this issue, but they suffer significant degradation in underconstrained areas and fail to recover occluded regions. We argue that the key to solving this problem lies in supplementing missing information for these areas. To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. This provides additional information for the underconstrained areas, but directly incorporating diffusion prior raises potential conflicts between the reconstruction and generative guidance. Therefore, we further introduce a visibility-guided approach to dynamically adjust the per-pixel SDS loss weights. Together these components enhance both geometry and appearance recovery while remaining faithful to input images. Extensive experiments across Replica and ScanNet++ demonstrate that our method significantly outperforms SOTA methods. Notably, it achieves better object reconstruction under 10 views than the baselines under 100 views. Our method enables seamless text-based editing for geometry and appearance through SDS optimization and produces decomposed object meshes with detailed UV maps that support photorealistic visual effects (VFX) editing. The project page is available at https://dp-recon.github.io/.

MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration

David Wan,Justin Chih-Yao Chen,Elias Stengel-Eskin,Mohit Bansal

Task: Extend multi-agent multi-model reasoning to generation tasks, specifically to improving the faithfulness of generated content through refinement.

Motivation: Multi-agent collaboration has shown promise in reasoning tasks but is underexplored in long-form generation tasks such as summarization and question answering.

Details

Method: Investigates iterative collaboration among multiple instances and types of large language models (LLMs) on the subtasks of the refinement process: error detection, critiquing unfaithful sentences, and making corrections based on critiques. Result: Both multi-agent and multi-model approaches help error detection and critiquing, and reframing critiquing and refinement as reranking rather than generation tasks improves multi-agent performance. Conclusion: The authors propose a final "recipe" called MAMM-Refine, in which multi-agent and multi-model collaboration significantly boosts performance on three summarization datasets and on long-form question answering, demonstrating the effectiveness and generalizability of the approach. Abstract: Multi-agent collaboration among models has shown promise in reasoning tasks but is underexplored in long-form generation tasks like summarization and question-answering. We extend multi-agent multi-model reasoning to generation, specifically to improving faithfulness through refinement, i.e., revising model-generated outputs to remove factual inconsistencies. We investigate how iterative collaboration among multiple instances and types of large language models (LLMs) enhances subtasks in the refinement process, such as error detection, critiquing unfaithful sentences, and making corrections based on critiques. We design intrinsic evaluations for each subtask, with our findings indicating that both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing. Additionally, reframing critiquing and refinement as reranking rather than generation tasks improves multi-agent performance. We consolidate these insights into a final "recipe" called Multi-Agent Multi-Model Refinement (MAMM-Refine), where multi-agent and multi-model collaboration significantly boosts performance on three summarization datasets as well as on long-form question answering, demonstrating the effectiveness and generalizability of our recipe.

H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection

Yuhang Liu,Wenjie Zhao,Yunhui Guo

Task: Propose a new continual OOD detection method, Hierarchical Two-sample Tests (H2ST), for open-world Task Incremental Learning (TIL) scenarios.

Motivation: Existing TIL methods operate under the closed-world assumption that incoming data is always in-distribution (ID). In an open-world setting, however, incoming samples may come from out-of-distribution (OOD) sources with unknown task identities, and current OOD detection methods face several challenges when detecting OOD samples continually.

Details

Method: Proposes Hierarchical Two-sample Tests (H2ST), which removes the need for threshold selection through hypothesis testing and uses feature maps to better exploit model capabilities without excessive dependence on model performance. Result: Extensive experiments and analysis validate the effectiveness of H2ST in open-world TIL scenarios and its superiority over existing methods. Conclusion: H2ST performs well in open-world TIL scenarios, offering lower overhead, superior performance, and task-level detection. Abstract: Task Incremental Learning (TIL) is a specialized form of Continual Learning (CL) in which a model incrementally learns from non-stationary data streams. Existing TIL methodologies operate under the closed-world assumption, presuming that incoming data remains in-distribution (ID). However, in an open-world setting, incoming samples may originate from out-of-distribution (OOD) sources, with their task identities inherently unknown. Continually detecting OOD samples presents several challenges for current OOD detection methods: reliance on model outputs leads to excessive dependence on model performance, selecting suitable thresholds is difficult, hindering real-world deployment, and binary ID/OOD classification fails to provide task-level identification. To address these issues, we propose a novel continual OOD detection method called the Hierarchical Two-sample Tests (H2ST). H2ST eliminates the need for threshold selection through hypothesis testing and utilizes feature maps to better exploit model capabilities without excessive dependence on model performance. The proposed hierarchical architecture enables task-level detection with superior performance and lower overhead compared to non-hierarchical classifier two-sample tests. Extensive experiments and analysis validate the effectiveness of H2ST in open-world TIL scenarios and its superiority to the existing methods. Code is available at https://github.com/YuhangLiuu/H2ST.
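
The basic question a two-sample test answers, whether a batch of incoming features comes from the same distribution as a reference batch, can be sketched with a plain permutation test. This is a simpler technique swapped in for illustration, not the hierarchical classifier-based test that H2ST proposes. The one-dimensional feature values and the function name are hypothetical.

```python
import random

def permutation_p_value(sample_a, sample_b, n_permutations=2000, seed=0):
    """Permutation two-sample test on the absolute difference of means."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical 1-D feature statistics for a known task and an incoming batch.
reference = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 0.98, 1.02]
shifted = [2.9, 3.1, 3.0, 2.95, 3.05, 3.0, 2.98, 3.02]
p_shifted = permutation_p_value(reference, shifted)
p_same = permutation_p_value(reference, reference)
```

A small p-value for the shifted batch is evidence that it does not belong to the reference task. The decision is then made at a conventional significance level rather than at a score threshold tuned per model.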

TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification

Junnan Zhu,Min Xiao,Yining Wang,Feifei Zhai,Yu Zhou,Chengqing Zong

Task: Design and evaluate the Text pROVEnance (TROVE) challenge, which traces each sentence of a target text back to its source sentences and annotates fine-grained relationships.

Motivation: In high-stakes domains such as healthcare, law, and news, it is crucial to understand where content comes from and how it is created.

Details

Method: Builds the TROVE dataset from three public datasets covering 11 diverse scenarios, using a three-stage annotation process (sentence retrieval, GPT provenance, and human provenance), and evaluates 11 LLMs under direct prompting and retrieval-augmented paradigms. Result: Retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, although open-source models show significant promise with retrieval augmentation. Conclusion: The TROVE challenge offers an effective way to trace text provenance and fine-grained relationships, and retrieval augmentation is essential to improving model performance. Abstract: LLMs have achieved remarkable fluency and coherence in text generation, yet their widespread adoption has raised concerns about content reliability and accountability. In high-stakes domains such as healthcare, law, and news, it is crucial to understand where and how the content is created. To address this, we introduce the Text pROVEnance (TROVE) challenge, designed to trace each sentence of a target text back to specific source sentences within potentially lengthy or multi-document inputs. Beyond identifying sources, TROVE annotates the fine-grained relationships (quotation, compression, inference, and others), providing a deep understanding of how each target sentence is formed. To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios (e.g., QA and summarization) in English and Chinese, spanning source texts of varying lengths (0-5k, 5-10k, 10k+), emphasizing the multi-document and long-document settings essential for provenance. To ensure high-quality data, we employ a three-stage annotation process: sentence retrieval, GPT provenance, and human provenance. We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms, revealing that retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, yet open-source models show significant promise, particularly with retrieval augmentation.

SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments

Yinqi Chen,Meiying Zhang,Qi Hao,Guang Zhou

Task: Simultaneously predict scene flow and instance segmentation of full-resolution point clouds.

Motivation: Traditional methods usually treat object motion estimation and instance segmentation in dynamic traffic scenes as separate tasks, which leads to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios.

Details

Method: Proposes a multi-task SemanticFlow framework comprising a coarse-to-fine prediction scheme, a set of loss functions, and a self-supervised learning scheme. Result: The framework is validated on the Argoverse and Waymo datasets, showing superior instance segmentation accuracy, scene flow estimation, and computational efficiency. Conclusion: The framework establishes a new benchmark for self-supervised methods in dynamic scene understanding. Abstract: Accurate perception of dynamic traffic scenes is crucial for high-level autonomous driving systems, requiring robust object motion estimation and instance segmentation. However, traditional methods often treat them as separate tasks, leading to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios due to the absence of information sharing. This paper proposes a multi-task SemanticFlow framework to simultaneously predict scene flow and instance segmentation of full-resolution point clouds. The novelty of this work is threefold: 1) developing a coarse-to-fine prediction based multi-task scheme, where an initial coarse segmentation of static backgrounds and dynamic objects is used to provide contextual information for refining motion and semantic information through a shared feature processing module; 2) developing a set of loss functions to enhance the performance of scene flow estimation and instance segmentation, while helping ensure spatial and temporal consistency of both static and dynamic objects within traffic scenes; 3) developing a self-supervised learning scheme, which utilizes coarse segmentation to detect rigid objects and compute their transformation matrices between sequential frames, enabling the generation of self-supervised labels. The proposed framework is validated on the Argoverse and Waymo datasets, demonstrating superior performance in instance segmentation accuracy, scene flow estimation, and computational efficiency, establishing a new benchmark for self-supervised methods in dynamic scene understanding.

Inside-Out: Hidden Factual Knowledge in LLMs

Zorik Gekhman,Eyal Ben David,Hadas Orgad,Eran Ofek,Yonatan Belinkov,Idan Szpector,Jonathan Herzig,Roi Reichart

Task: 评估大型语言模型（LLMs）在其参数中编码的事实知识是否比其输出中表达的更多。

Motivation: 尽管一些研究暗示了这种可能性，但尚未有研究明确定义或证明这一现象。

Details

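Method: Proposes a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs in which the correct answer is ranked higher. This yields external and internal knowledge, depending on the information used to score individual answer candidates: the model's observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. Result: (1) LLMs consistently encode more factual knowledge internally than they express externally, with an average gap of 40%. (2) Surprisingly, some knowledge is hidden so deeply that a model can know an answer perfectly internally yet fail to generate it even once across 1,000 sampled answers. (3) This reveals a fundamental limitation in the generation capabilities of LLMs and places a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, even though they would be guaranteed to rank first if they were. Conclusion: LLMs encode more factual knowledge in their parameters than they express in their outputs, which reveals a fundamental limitation in their generation capabilities and places a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA. Abstract: This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model's observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. 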
This reveals fundamental limitations in the generation capabilities of LLMs, which (3) puts a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.
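
The knowledge definition above can be made concrete with a small sketch. It computes, for one question, the fraction of correct-incorrect answer pairs in which the correct answer scores strictly higher, once with an "external" scorer and once with an "internal" one. The function name `knowledge_score`, the answer strings, and all score values are hypothetical and chosen only for illustration; how ties are treated is also an assumption here.

```python
from itertools import product
from typing import Callable, Sequence


def knowledge_score(
    correct: Sequence[str],
    incorrect: Sequence[str],
    score: Callable[[str], float],
) -> float:
    """Fraction of (correct, incorrect) answer pairs in which the correct
    answer receives the strictly higher score."""
    pairs = list(product(correct, incorrect))
    if not pairs:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1 for good, bad in pairs if score(good) > score(bad))
    return wins / len(pairs)


# Hypothetical scores for a single question (made-up numbers).
external = {"Paris": 0.30, "Lyon": 0.45, "Rome": 0.10}  # e.g. token-level probabilities
internal = {"Paris": 0.90, "Lyon": 0.20, "Rome": 0.05}  # e.g. a scorer over intermediate computations

k_ext = knowledge_score(["Paris"], ["Lyon", "Rome"], external.__getitem__)
k_int = knowledge_score(["Paris"], ["Lyon", "Rome"], internal.__getitem__)
hidden_gap = k_int - k_ext  # positive when internal knowledge exceeds external knowledge
```

In this toy case the external scorer ranks the correct answer above only one of the two incorrect candidates, while the internal scorer ranks it above both, so the question exhibits hidden knowledge in the sense defined above.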

Unlocking the Capabilities of Vision-Language Models for Generalizable and Explainable Deepfake Detection

Peipeng Yu,Jianwei Fei,Hui Gao,Xuan Feng,Zhihua Xia,Chip Hong Chang

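Task: Propose a new paradigm that uses vision-language models (VLMs) for deepfake detection.

Motivation: Current VLMs excel at understanding multimodal data, but their potential for deepfake detection remains underexplored, mainly because their knowledge is misaligned with forensic patterns.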

Details

Method: Unlocks the potential of VLMs through three components: (1) a knowledge-guided forgery adaptation module that aligns the VLM's semantic space with forensic features through contrastive learning; (2) a multi-modal prompt tuning framework that jointly optimizes visual-textual embeddings for localization and explainability; (3) an iterative refinement strategy that supports multi-turn dialogue for evidence-based reasoning. Result: Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, and DFDC, show that the scheme surpasses existing methods in generalization performance and supports multi-turn dialogue. Conclusion: By combining a Knowledge-guided Forgery Detector (KFD), a VLM image encoder, and a large language model (LLM), the framework achieves effective deepfake detection and localization and generates textual detection responses to assist judgment. Abstract: Current vision-language models (VLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment between their knowledge and forensic patterns. To this end, we present a novel paradigm that unlocks VLMs' potential capabilities through three components: (1) A knowledge-guided forgery adaptation module that aligns the VLM's semantic space with forensic features through contrastive learning with external manipulation knowledge; (2) A multi-modal prompt tuning framework that jointly optimizes visual-textual embeddings for both localization and explainability; (3) An iterative refinement strategy enabling multi-turn dialogue for evidence-based reasoning. Our framework includes a VLM-based Knowledge-guided Forgery Detector (KFD), a VLM image encoder, and a Large Language Model (LLM). The VLM image encoder extracts visual prompt embeddings from images, while the LLM receives visual and question prompt embeddings for inference. The KFD is used to calculate correlations between image features and pristine/deepfake class embeddings, enabling forgery classification and localization. The outputs from these components are used to construct forgery prompt embeddings. Finally, we feed these prompt embeddings into the LLM to generate textual detection responses to assist judgment. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, and DFDC, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.

SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models

I-Fan Lin,Faegheh Hasibi,Suzan Verberne

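Task: Propose SPILL, a domain-adaptive method for intent clustering that requires no fine-tuning.

Motivation: Existing embedding-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them generalize poorly across datasets.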

Details

Method: Proposes a two-stage approach. In the first stage, an existing embedder produces an embedding for each utterance (the seed), and a distance metric selects a pool of candidates close to the seed. In the second stage, an LLM selects from these candidates the utterances that share the seed's intent, and the selected candidates are pooled with the seed to produce a refined embedding for the seed. Result: The method generally outperforms using the embedder directly and achieves results comparable to other state-of-the-art studies that use much larger models and require fine-tuning. Conclusion: The method improves existing embedders without additional fine-tuning, making them more adaptable to new domain datasets; treating clustering as a small-scale selection problem also opens the possibility of using LLMs to customize clustering tasks according to the user's goals. Abstract: In this paper, we propose Selection and Pooling with Large Language Models (SPILL), an intuitive and domain-adaptive method for intent clustering without fine-tuning. Existing embeddings-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them less generalizable to multiple datasets. Our goal is to make these existing embedders more generalizable to new domain datasets without further fine-tuning. Inspired by our theoretical derivation and simulation results on the effectiveness of sampling and pooling techniques, we view the clustering task as a small-scale selection problem. A good solution to this problem is associated with better clustering performance. Accordingly, we propose a two-stage approach: First, for each utterance (referred to as the seed), we derive its embedding using an existing embedder. Then, we apply a distance metric to select a pool of candidates close to the seed. Because the embedder is not optimized for new datasets, in the second stage, we use an LLM to further select utterances from these candidates that share the same intent as the seed. Finally, we pool these selected candidates with the seed to derive a refined embedding for the seed. We found that our method generally outperforms directly using an embedder, and it achieves comparable results to other state-of-the-art studies, even those that use much larger models and require fine-tuning, showing its strength and efficiency. Our results indicate that our method enables existing embedders to be further improved without additional fine-tuning, making them more adaptable to new domain datasets. 
Additionally, viewing the clustering task as a small-scale selection problem gives the potential of using LLMs to customize clustering tasks according to the user's goals.
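
The two-stage selection-and-pooling procedure described above can be sketched as follows. This is a minimal illustration under several assumptions: cosine distance as the distance metric, mean pooling as the pooling step, and a plain callable `same_intent` standing in for the LLM selection step. The names `refine_embedding`, `same_intent`, and `pool_size`, as well as the toy embeddings and labels, are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def refine_embedding(
    seed: int,
    embeddings: Sequence[Vector],
    same_intent: Callable[[int, int], bool],
    pool_size: int = 3,
) -> Vector:
    # Stage 1: pick the pool_size utterances closest to the seed.
    others = [i for i in range(len(embeddings)) if i != seed]
    others.sort(key=lambda i: cosine_distance(embeddings[seed], embeddings[i]))
    candidates = others[:pool_size]
    # Stage 2: keep only candidates judged to share the seed's intent
    # (an LLM call in the described method; a plain callable here).
    selected = [i for i in candidates if same_intent(seed, i)]
    # Pooling: average the seed with the selected candidates.
    members = [seed] + selected
    dim = len(embeddings[seed])
    return [sum(embeddings[i][d] for i in members) / len(members) for d in range(dim)]


# Toy data: four utterances with 2-D embeddings and made-up intent labels.
embeddings = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4], [0.0, 1.0]]
labels = ["book_flight", "book_flight", "cancel", "cancel"]
refined = refine_embedding(
    0, embeddings, lambda s, c: labels[s] == labels[c], pool_size=2
)
```

Here the two nearest candidates to utterance 0 are utterances 1 and 2; only utterance 1 passes the selection step, so the refined embedding is the mean of utterances 0 and 1.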

Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task, Dataset and Benchmark

Ying Liu,Yijing Hua,Haojiang Chai,Yanbo Wang,TengQi Ye

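Task: Propose the 3F-OVD task, which extends supervised fine-grained object detection to the open-vocabulary setting.

Motivation: Evaluations of existing open-vocabulary detectors can be unfair and unreliable, mainly because of variations in the vision-aware language vocabulary data used for open-vocabulary learning.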

Details

Method: Introduces the 3F-OVD task, creates a new dataset, NEU-171K, and proposes a simple yet effective post-processing technique. Result: State-of-the-art object detectors are benchmarked on the NEU-171K dataset. Conclusion: The 3F-OVD task is challenging, requiring a deep understanding of fine-grained captions and of fine-grained details in images; the new dataset and post-processing technique help improve detection accuracy. Abstract: Open-vocabulary detectors are proposed to locate and recognize objects in novel classes. However, variations in vision-aware language vocabulary data used for open-vocabulary learning can lead to unfair and unreliable evaluations. Recent evaluation methods have attempted to address this issue by incorporating object properties or adding locations and characteristics to the captions. Nevertheless, since these properties and locations depend on the specific details of the images instead of classes, detectors cannot make accurate predictions without precise descriptions provided through human annotation. This paper introduces 3F-OVD, a novel task that extends supervised fine-grained object detection to the open-vocabulary setting. Our task is intuitive and challenging, requiring a deep understanding of Fine-grained captions and careful attention to Fine-grained details in images in order to accurately detect Fine-grained objects. Additionally, due to the scarcity of qualified fine-grained object detection datasets, we have created a new dataset, NEU-171K, tailored for both supervised and open-vocabulary settings. We benchmark state-of-the-art object detectors on our dataset for both settings. Furthermore, we propose a simple yet effective post-processing technique.

Optimizing Decomposition for Optimal Claim Verification

Yining Lu,Noah Ziems,Hy Dang,Meng Jiang

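Task: Study how a dynamic decomposition policy can optimize the decomposition and verification steps in evaluating the factuality of long-form text.

Motivation: Existing decomposition policies are often misaligned with downstream verifiers in terms of atomicity, leading to suboptimal verification results.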

Details

Method: Proposes dynamic decomposition, a reinforcement learning framework that uses verifier feedback to learn a dynamic decomposition policy. Result: Experiments show that dynamic decomposition outperforms existing decomposition policies, improving verification confidence by 0.07 and accuracy by 0.12 on average. Conclusion: Dynamic decomposition effectively improves verification in long-form factuality evaluation. Abstract: Current research on the Decompose-Then-Verify paradigm for evaluating the factuality of long-form text typically treats decomposition and verification in isolation, overlooking their interactions and potential misalignment. We find that existing decomposition policies, typically hand-crafted demonstrations, do not align well with downstream verifiers in terms of atomicity (a novel metric quantifying information density), leading to suboptimal verification results. We formulate finding the optimal decomposition policy for optimal verification as a bilevel optimization problem. To approximate a solution for this strongly NP-hard problem, we propose dynamic decomposition, a reinforcement learning framework that leverages verifier feedback to learn a policy for dynamically decomposing claims to verifier-preferred atomicity. Experimental results show that dynamic decomposition outperforms existing decomposition policies, improving verification confidence by 0.07 and accuracy by 0.12 (on a 0-1 scale) on average across varying verifiers, datasets, and atomicities of input claims.

Temporal-Consistent Video Restoration with Pre-trained Diffusion Models

Hengkang Wang,Yang Liu,Huidong Liu,Chien-Chih Wang,Yanhui Guo,Hongdong Li,Bryan Wang,Ju Sun

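Task: Propose a new maximum a posteriori (MAP) framework for recovering high-quality videos from degraded ones.

Motivation: Existing zero-shot video restoration methods built on pre-trained diffusion models suffer from approximation errors during reverse diffusion and from insufficient temporal consistency.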

Details

Method: Treats the reverse process in diffusion models as a function, parameterizes video frames directly in the seed space of the diffusion model, and improves temporal consistency through semantic-level and pixel-level consistency strategies. Result: Extensive experiments on multiple video restoration tasks show that the method outperforms the state of the art in visual quality and temporal consistency. Conclusion: The proposed MAP framework performs well on video restoration, eliminating approximation errors and improving temporal consistency. Abstract: Video restoration (VR) aims to recover high-quality videos from degraded ones. Although recent zero-shot VR methods using pre-trained diffusion models (DMs) show good promise, they suffer from approximation errors during reverse diffusion and insufficient temporal consistency. Moreover, dealing with 3D video data, VR is inherently computationally intensive. In this paper, we advocate viewing the reverse process in DMs as a function and present a novel Maximum a Posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors. We also introduce strategies to promote bilevel temporal consistency: semantic consistency by leveraging clustering structures in the seed space, and pixel-level consistency by progressive warping with optical flow refinements. Extensive experiments on multiple video restoration tasks demonstrate superior visual quality and temporal consistency achieved by our method compared to the state-of-the-art.

SemEval-2025 Task 1: AdMIRe -- Advancing Multimodal Idiomaticity Representation

Thomas Pickard,Aline Villavicencio,Maggie Mi,Wei He,Dylan Phelps,Carolina Scarton,Marco Idiart

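Task: Assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages.

Motivation: Idiomatic expressions pose a unique challenge in natural language processing because their meanings often cannot be inferred directly from their constituent words. Despite advances in large language models (LLMs), idiomaticity remains a significant obstacle to semantic representation.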

Details

Method: Presents the datasets and tasks for SemEval-2025 Task 1: AdMiRe (Advancing Multimodal Idiomaticity Representation), which comprises two subtasks: ranking images by how well they align with an idiomatic or literal meaning, and predicting the next image in a sequence. Result: The most effective methods achieved human-level performance by using pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models' representations of idiomaticity. Conclusion: Tasks set in multimodal and multilingual contexts can substantially improve models' understanding of idiomatic expressions. Abstract: Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in Large Language Models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and tasks for SemEval-2025 Task 1: AdMiRe (Advancing Multimodal Idiomaticity Representation), which challenges the community to assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages. Participants competed in two subtasks: ranking images based on their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models' representations of idiomaticity.

DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition

Caoshuo Li,Tanzhe Li,Xiaobin Hu,Donghao Luo,Taisong Jin

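Task: Propose a new vision architecture, Dilated Vision HyperGraph Neural Network (DVHGNN), to address key issues in Vision Graph Neural Network (ViG).

Motivation: Vision Graph Neural Network (ViG) has attracted considerable attention in computer vision, but it suffers from the quadratic computational complexity of its K-Nearest Neighbor (KNN) graph construction and from the restriction of ordinary graphs to pairwise relations.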

Details

Method: Proposes DVHGNN, a new vision architecture that uses multi-scale hypergraphs to efficiently capture high-order correlations among objects. It combines Clustering and Dilated HyperGraph Construction (DHGC) to adaptively capture multi-scale dependencies among data samples, and introduces a dynamic hypergraph convolution mechanism for adaptive feature exchange and fusion at the hypergraph level. Result: Extensive qualitative and quantitative evaluations on benchmark image datasets show that DVHGNN significantly outperforms existing vision backbones. For example, DVHGNN-S reaches 83.1% top-1 accuracy on ImageNet-1K, 1.0% higher than ViG-S and 0.6% higher than ViHGNN-S. Conclusion: With multi-scale hypergraphs and dynamic hypergraph convolution, DVHGNN addresses the key issues in ViG and achieves significant gains in image classification. Abstract: Recently, Vision Graph Neural Network (ViG) has gained considerable attention in computer vision. Despite its groundbreaking innovation, Vision Graph Neural Network encounters key issues including the quadratic computational complexity caused by its K-Nearest Neighbor (KNN) graph construction and the limitation of pairwise relations of normal graphs. To address the aforementioned challenges, we propose a novel vision architecture, termed Dilated Vision HyperGraph Neural Network (DVHGNN), which is designed to leverage multi-scale hypergraphs to efficiently capture high-order correlations among objects. Specifically, the proposed method tailors Clustering and Dilated HyperGraph Construction (DHGC) to adaptively capture multi-scale dependencies among the data samples. Furthermore, a dynamic hypergraph convolution mechanism is proposed to facilitate adaptive feature exchange and fusion at the hypergraph level. Extensive qualitative and quantitative evaluations on benchmark image datasets demonstrate that the proposed DVHGNN significantly outperforms the state-of-the-art vision backbones. For instance, our DVHGNN-S achieves an impressive top-1 accuracy of 83.1% on ImageNet-1K, surpassing ViG-S by +1.0% and ViHGNN-S by +0.6%.

Real-world validation of a multimodal LLM-powered pipeline for High-Accuracy Clinical Trial Patient Matching leveraging EHR data

Anatole Callies,Quentin Bodinier,Philippe Ravaud,Kourosh Davarpanah

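Task: Automate the matching of patients to clinical trials.

Motivation: Patient recruitment in clinical trials is hindered by complex eligibility criteria and labor-intensive chart reviews.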

Details

Method: Introduces a broadly applicable, integration-free, LLM-powered pipeline that performs patient-trial matching on unprocessed EHR documents, combining the new reasoning-LLM paradigm, the visual capabilities of the latest LLMs, and multimodal embeddings. Result: The pipeline reaches 93% criterion-level accuracy on the n2c2 dataset and 87% accuracy in real-world trials; users reviewed eligibility in under 9 minutes per patient on average, an 80% improvement over traditional manual chart review. Conclusion: The pipeline performs well in clinical trial patient matching without custom integration with site systems or trial-specific tailoring, enabling scalable deployment across sites. Abstract: Background: Patient recruitment in clinical trials is hindered by complex eligibility criteria and labor-intensive chart reviews. Prior research using text-only models has struggled to address this problem in a reliable and scalable way due to (1) limited reasoning capabilities, (2) information loss from converting visual records to text, and (3) lack of a generic EHR integration to extract patient data. Methods: We introduce a broadly applicable, integration-free, LLM-powered pipeline that automates patient-trial matching using unprocessed documents extracted from EHRs. Our approach leverages (1) the new reasoning-LLM paradigm, enabling the assessment of even the most complex criteria, (2) visual capabilities of latest LLMs to interpret medical records without lossy image-to-text conversions, and (3) multimodal embeddings for efficient medical record search. The pipeline was validated on the n2c2 2018 cohort selection dataset (288 diabetic patients) and a real-world dataset composed of 485 patients from 30 different sites matched against 36 diverse trials. Results: On the n2c2 dataset, our method achieved a new state-of-the-art criterion-level accuracy of 93%. In real-world trials, the pipeline yielded an accuracy of 87%, undermined by the difficulty of replicating human decision-making when medical records lack sufficient information. Nevertheless, users were able to review overall eligibility in under 9 minutes per patient on average, representing an 80% improvement over traditional manual chart reviews. 
Conclusion: This pipeline demonstrates robust performance in clinical trial patient matching without requiring custom integration with site systems or trial-specific tailoring, thereby enabling scalable deployment across sites seeking to leverage AI for patient matching.

Efficient Personalization of Quantized Diffusion Model without Backpropagation

Hoigi Seo,Wongi Jeong,Kyungryeol Lee,Se Young Chun

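Task: Fine-tune diffusion models efficiently through quantization and zeroth-order optimization in order to reduce memory requirements.

Motivation: Diffusion models perform remarkably well in image synthesis, but training and fine-tuning them demands extensive compute and memory, which is a particular problem on edge devices.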

Details

Method: Quantizes a diffusion model personalized via Textual Inversion and optimizes the personalization tokens with zeroth-order optimization, avoiding dequantization and the storage of gradients and activations. It also proposes Subspace Gradient denoising and Partial Uniform Timestep Sampling. Result: When personalizing Stable Diffusion, the method achieves image and text alignment scores comparable to prior methods while reducing training memory demand by up to 8.2×. Conclusion: The method reduces memory requirements while preserving the performance of the diffusion model, making it suitable for personalization on edge devices. Abstract: Diffusion models have shown remarkable performance in image synthesis, but they demand extensive computational and memory resources for training, fine-tuning and inference. Although advanced quantization techniques have successfully minimized memory usage for inference, training and fine-tuning these quantized models still require large memory, possibly due to dequantization for accurate computation of gradients and/or backpropagation for gradient-based algorithms. However, memory-efficient fine-tuning is particularly desirable for applications such as personalization that often must be run on edge devices like mobile phones with private data. In this work, we address this challenge by quantizing a diffusion model with personalization via Textual Inversion and by leveraging a zeroth-order optimization on personalization tokens without dequantization so that it does not require gradient and activation storage for backpropagation that consumes considerable memory. Since a gradient estimation using zeroth-order optimization is quite noisy for a single or a few images in personalization, we propose to denoise the estimated gradient by projecting it onto a subspace that is constructed with the past history of the tokens, dubbed Subspace Gradient. In addition, we investigated the influence of text embedding in image generation, leading to our proposed timestep sampling, dubbed Partial Uniform Timestep Sampling, for sampling with effective diffusion timesteps. Our method achieves comparable performance to prior methods in image and text alignment scores for personalizing Stable Diffusion with only forward passes while reducing training memory demand up to 8.2×.
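
To illustrate the two ideas named above, the sketch below shows a generic two-point zeroth-order gradient estimate (forward evaluations only) followed by a projection of the noisy estimate onto a subspace spanned by past vectors. This is not the authors' implementation: the Gaussian perturbations, the Gram-Schmidt basis, the toy quadratic loss, and the names `zo_gradient`, `orthonormal_basis`, `project`, `token`, and `history` are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def zo_gradient(
    loss: Callable[[Vector], float],
    x: Vector,
    eps: float = 1e-3,
    n_samples: int = 8,
    rng: Optional[random.Random] = None,
) -> Vector:
    """Two-point zeroth-order estimate: uses only forward evaluations of loss."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]  # random probe direction
        x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
        x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
        slope = (loss(x_plus) - loss(x_minus)) / (2 * eps)
        grad = [g + slope * ui / n_samples for g, ui in zip(grad, u)]
    return grad


def orthonormal_basis(vectors: Sequence[Vector], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt orthonormalization; near-dependent vectors are dropped."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project(v: Vector, basis: Sequence[Vector]) -> Vector:
    """Orthogonal projection of v onto the span of an orthonormal basis."""
    out = [0.0] * len(v)
    for b in basis:
        c = dot(v, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


# Toy setup: a quadratic loss and a hypothetical history of past token vectors.
token = [1.0, -2.0, 0.5, 0.0]
history = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormal_basis(history)
noisy_grad = zo_gradient(lambda x: sum(xi * xi for xi in x), token)
denoised_grad = project(noisy_grad, basis)
```

The projection discards every component of the noisy estimate that lies outside the span of the history vectors, which is the sense in which a subspace projection can act as a denoiser; how the subspace is actually built from the token history is not specified in the abstract.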

VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning

Yang Tan,Chen Liu,Jingyuan Gao,Banghao Wu,Mingchen Li,Ruilin Wang,Lingrong Zhang,Huiqun Yu,Guisheng Fan,Liang Hong,Bingxin Zhou

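Task: Develop VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of protein language models.

Motivation: Interdisciplinary adoption of pre-trained protein language models (PLMs) remains limited because of challenges in data collection, task benchmarking, and application.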

Details

Method: VenusFactory integrates more than 40 protein-related datasets and more than 40 popular PLMs, and supports both command-line execution and a Gradio-based no-code interface. Result: VenusFactory brings together biological data retrieval, task benchmarking, and PLM fine-tuning, supporting both the computer science and biology communities. Conclusion: VenusFactory is an open-source, versatile engine intended to promote interdisciplinary adoption of pre-trained protein language models, and it provides a rich set of resources and tools. Abstract: Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory supports both computer science and biology communities with choices of both a command-line execution and a Gradio-based no-code interface, integrating 40+ protein-related datasets and 40+ popular PLMs. All implementations are open-sourced on https://github.com/tyang816/VenusFactory.

DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework

Henrique Morimitsu,Xiaobin Zhu,Roberto M. Cesar Jr.,Xiangyang Ji,Xu-Cheng Yin

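Task: Propose DPFlow, an adaptive optical flow architecture, and introduce Kubric-NK, a new benchmark for evaluating optical flow methods on high-resolution inputs.

Motivation: Current optical flow methods are usually designed for low-resolution inputs and do not generalize to high-resolution ones, and benchmarks with high-resolution samples are lacking.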

Details

Method: Proposes the DPFlow architecture, which generalizes to inputs of up to 8K resolution while being trained only on low-resolution samples, and introduces the Kubric-NK benchmark. Result: DPFlow achieves state-of-the-art results on MPI-Sintel, KITTI 2015, Spring, and other benchmarks. Conclusion: DPFlow and Kubric-NK fill a gap in high-resolution optical flow estimation and shed light on the generalization capabilities of existing methods. Abstract: Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while being trained on only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks.

SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Tongyao Zhu,Qian Liu,Haonan Wang,Shiqi Chen,Xiangming Gu,Tianyu Pang,Min-Yen Kan

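Task: Explore an optimal context window scheduling strategy that better balances long-context capability with pretraining efficiency.

Motivation: Under a fixed token budget, models pretrained with shorter context windows consistently outperform their long-context counterparts.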

Details

Method: Proposes SkyLadder, which implements a short-to-long context window transition. Result: 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) pretrained on 100B tokens show gains of up to 3.7% on common benchmarks and train up to 22% faster than baselines. Conclusion: SkyLadder preserves strong performance on standard benchmarks while matching or exceeding baseline results on long-context tasks. Abstract: Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance, while matching or exceeding baseline results on long context tasks. Through extensive experiments, we pre-train 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. The code is at https://github.com/sail-sg/SkyLadder.
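
A short-to-long context window transition can be expressed as a schedule that maps the training step to a window length. The sketch below assumes a linear ramp followed by a constant phase; the abstract does not state the shape of the schedule, so the function `context_window` and every parameter value (`start`, `final`, `ramp_fraction`, `multiple`) are hypothetical.

```python
def context_window(
    step: int,
    total_steps: int,
    start: int = 32,
    final: int = 8192,
    ramp_fraction: float = 0.6,
    multiple: int = 32,
) -> int:
    """Context window (in tokens) to use at a given training step.

    The window grows linearly from start to final over the first
    ramp_fraction of training, rounded down to a multiple of `multiple`,
    and stays at final afterwards.
    """
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    if step >= ramp_steps:
        return final
    raw = start + (final - start) * step / ramp_steps
    return max(start, int(raw // multiple) * multiple)


# Window lengths sampled every 100 steps of a hypothetical 1,000-step run.
schedule = [context_window(s, 1000) for s in range(0, 1000, 100)]
```

A training loop would call such a function before building each batch and pack or mask sequences to the returned length, so that early steps see short windows and later steps see the full target length.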

Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations

Shuo Li,Jiajun Sun,Guodong Zheng,Xiaoran Fan,Yujiong Shen,Yi Lu,Zhiheng Xi,Yuming Yang,Wenming Tan,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang

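Task: Propose Multi-Frequency Perturbations (MFP), a method for reducing object hallucinations of multimodal large language models (MLLMs) in vision-language tasks.

Motivation: MLLMs perform well on vision-language tasks, but the authenticity of their responses is often compromised by object hallucinations. A key cause is the model's over-susceptibility to specific image frequency features.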

Details

Method: Introduces Multi-Frequency Perturbations (MFP), which uses both low-frequency and high-frequency image features to perturb visual feature representations and explicitly suppresses redundant frequency-domain features during inference. Result: Experiments show that the method significantly reduces object hallucinations across various model architectures. As a training-time method, MFP can also be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark. Conclusion: MFP is a simple, cost-effective, and pluggable method that effectively reduces object hallucinations in MLLMs. Abstract: Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.
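
As background for the low-frequency and high-frequency features mentioned above, the sketch below splits a grayscale image into a smooth (low-frequency) component and a residual (high-frequency) component. It uses a box blur as a simple stand-in for a Fourier-domain low-pass filter and does not reproduce how MFP perturbs or suppresses features; the function names and the toy image are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]


def box_blur(img: Image, radius: int = 1) -> Image:
    """Average each pixel with its neighbours inside a (2*radius+1)^2 window,
    clipped at the image border. Acts as a crude low-pass filter."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [
                img[rr][cc]
                for rr in range(max(0, r - radius), min(h, r + radius + 1))
                for cc in range(max(0, c - radius), min(w, c + radius + 1))
            ]
            out[r][c] = sum(window) / len(window)
    return out


def split_frequencies(img: Image, radius: int = 1) -> Tuple[Image, Image]:
    """Return (low, high): the blurred image and the residual detail."""
    low = box_blur(img, radius)
    high = [[p - l for p, l in zip(row, low_row)] for row, low_row in zip(img, low)]
    return low, high


# A flat image has no detail; an image with one bright pixel does.
flat = [[5.0] * 3 for _ in range(3)]
spot = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
flat_low, flat_high = split_frequencies(flat)
spot_low, spot_high = split_frequencies(spot)
```

By construction the two components sum back to the original image, so a perturbation applied to either band can be recombined with the other band to form a modified input.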

Evaluating Bias in Retrieval-Augmented Medical Question-Answering Systems

Yuelyu Ji,Hang Zhang,Yanshan Wang

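Task: Evaluate bias in medical question-answering systems built on Retrieval-Augmented Generation (RAG) models.

Motivation: RAG models that support clinical decision-making may introduce biases related to race, gender, and social determinants of health, so these biases need to be evaluated systematically.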

Details

Method: Examines demographic-sensitive queries and measures retrieval discrepancies, analyzing retrieval overlap and correctness disparities on datasets such as MMLU and MedMCQA. Result: Substantial demographic disparities are found within RAG pipelines. Conclusion: Retrieval methods that explicitly account for fairness are needed to ensure equitable clinical decision-making. Abstract: Medical QA systems powered by Retrieval-Augmented Generation (RAG) models support clinical decision-making but may introduce biases related to race, gender, and social determinants of health. We systematically evaluate biases in RAG-based LLMs by examining demographic-sensitive queries and measuring retrieval discrepancies. Using datasets like MMLU and MedMCQA, we analyze retrieval overlap and correctness disparities. Our findings reveal substantial demographic disparities within RAG pipelines, emphasizing the critical need for retrieval methods that explicitly account for fairness to ensure equitable clinical decision-making.
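
One way to quantify retrieval overlap across demographic variants of the same question is the Jaccard similarity between the sets of retrieved documents. The abstract does not say which overlap measure is used, so the sketch below is an assumption; the variant names and document identifiers are made up.

```python
from itertools import combinations
from typing import Dict, Iterable, List


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty retrievals are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def mean_pairwise_overlap(retrieved: Dict[str, List[str]]) -> float:
    """Average Jaccard overlap over all pairs of query variants."""
    pairs = list(combinations(retrieved.values(), 2))
    if not pairs:
        raise ValueError("need at least two query variants")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical top-3 retrievals for one question phrased with different
# demographic attributes.
retrieved = {
    "baseline": ["d1", "d2", "d3"],
    "variant_a": ["d1", "d2", "d4"],
    "variant_b": ["d1", "d5", "d6"],
}
overlap = mean_pairwise_overlap(retrieved)
```

A low overlap means that changing only the demographic attribute in the query changes which documents the retriever returns, which is the kind of retrieval discrepancy the evaluation looks for.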

When Domain Generalization meets Generalized Category Discovery: An Adaptive Task-Arithmetic Driven Approach

Vaibhav Rathore,Shubhranil B,Saikat Dutta,Sarthak Mehrotra,Zsolt Kira,Biplab Banerjee

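Task: Cluster base and novel classes in a target domain using supervision from a source domain that contains only base classes.

Motivation: Current methods often falter under distribution shift and typically require access to target data during training, which can be impractical.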

Details

Method: Introduces the DG-GCD paradigm, in which only source data is used for training and the target domain remains unseen until inference. DG2CD-Net builds a domain-independent, discriminative embedding space through an episodic training strategy that combines open-set domain adaptation, a novel margin loss, and representation learning to optimize the feature space. Result: Experiments on three datasets show that DG2CD-Net outperforms existing GCD methods customized for DG-GCD. Conclusion: The episodic update mechanism of DG2CD-Net improves the base model's adaptability to unseen targets, demonstrating its effectiveness for cross-domain generalization. Abstract: Generalized Class Discovery (GCD) clusters base and novel classes in a target domain using supervision from a source domain with only base classes. Current methods often falter with distribution shifts and typically require access to target data during training, which can sometimes be impractical. To address this issue, we introduce the novel paradigm of Domain Generalization in GCD (DG-GCD), where only source data is available for training, while the target domain, with a distinct data distribution, remains unseen until inference. To this end, our solution, DG2CD-Net, aims to construct a domain-independent, discriminative embedding space for GCD. The core innovation is an episodic training strategy that enhances cross-domain generalization by adapting a base model on tasks derived from source and synthetic domains generated by a foundation model. Each episode focuses on a cross-domain GCD task, diversifying task setups over episodes and combining open-set domain adaptation with a novel margin loss and representation learning for optimizing the feature space progressively. To capture the effects of fine-tuning on the base model, we extend task arithmetic by adaptively weighting the local task vectors concerning the fine-tuned models based on their GCD performance on a validation distribution. This episodic update mechanism boosts the adaptability of the base model to unseen targets. Experiments across three datasets confirm that DG2CD-Net outperforms existing GCD methods customized for DG-GCD.
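
The adaptive weighting of task vectors mentioned above can be sketched with plain parameter lists. A task vector is the difference between fine-tuned and base parameters, and the merged model adds a weighted sum of task vectors to the base. The softmax weighting over validation scores is an assumption made for this sketch, since the abstract says only that the weights depend on GCD performance on a validation distribution; the names `merge_task_vectors` and `val_scores` and all numbers are hypothetical.

```python
import math
from typing import List, Sequence

Params = List[float]


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_task_vectors(
    base: Params,
    finetuned: Sequence[Params],
    val_scores: Sequence[float],
    temperature: float = 1.0,
) -> Params:
    """base + sum_i w_i * (finetuned_i - base), with w = softmax(val_scores)."""
    weights = softmax(val_scores, temperature)
    merged = list(base)
    for weight, params in zip(weights, finetuned):
        task_vector = [p - b for p, b in zip(params, base)]
        merged = [m + weight * t for m, t in zip(merged, task_vector)]
    return merged


# Two hypothetical fine-tuned models derived from the same base parameters.
base = [0.0, 0.0]
finetuned = [[1.0, 0.0], [0.0, 1.0]]
equal_merge = merge_task_vectors(base, finetuned, val_scores=[0.5, 0.5])
skewed_merge = merge_task_vectors(base, finetuned, val_scores=[0.9, 0.1])
```

With equal validation scores the merge is a plain average of the task vectors; with unequal scores the better-performing fine-tuned model contributes more to the update of the base model.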

From 1,000,000 Users to Every User: Scaling Up Personalized Preference for User-level Alignment

Jia-Nan Li,Jian Guan,Songhao Wu,Wei Wu,Rui Yan

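Task: Propose a scalable framework for personalized alignment of large language models.

Motivation: Traditional LLM alignment methods assume uniform human preferences and overlook the diversity of user values and needs.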

Details

Method: Establishes a systematic preference space characterizing psychological and behavioral dimensions, together with diverse persona representations. On this basis, it introduces the AlignX dataset and two complementary alignment approaches: in-context alignment and preference-bridged alignment. Result: Experiments show an average accuracy gain of 17.06% across four benchmarks, along with strong adaptation to novel preferences, robustness to limited user data, and precise preference controllability. Conclusion: The framework advances the development of truly user-adaptive AI systems. Abstract: Large language models (LLMs) have traditionally been aligned through one-size-fits-all approaches that assume uniform human preferences, fundamentally overlooking the diversity in user values and needs. This paper introduces a comprehensive framework for scalable personalized alignment of LLMs. We establish a systematic preference space characterizing psychological and behavioral dimensions, alongside diverse persona representations for robust preference inference in real-world scenarios. Building upon this foundation, we introduce AlignX, a large-scale dataset of over 1.3 million personalized preference examples, and develop two complementary alignment approaches: in-context alignment directly conditioning on persona representations and preference-bridged alignment modeling intermediate preference distributions. Extensive experiments demonstrate substantial improvements over existing methods, with an average 17.06% accuracy gain across four benchmarks while exhibiting a strong adaptation capability to novel preferences, robustness to limited user data, and precise preference controllability. These results validate our framework's effectiveness, advancing toward truly user-adaptive AI systems.

Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation

Siwei Wen,Junyan Ye,Peilin Feng,Hengrui Kang,Zichen Wen,Yize Chen,Jiang Wu,Wenjun Wu,Conghui He,Weijia Li

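Task: Develop FakeVLM, a specialized large multimodal model for detecting synthetic images and DeepFakes that provides natural language explanations to improve interpretability.

Motivation: With the rapid advance of Artificial Intelligence Generated Content (AIGC) technologies, synthetic images are increasingly common in everyday life, creating new challenges for authenticity assessment and detection. Existing methods are effective at judging image authenticity and locating forgeries, but they often lack human interpretability and do not fully address the growing complexity of synthetic data.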

Details

Method: Introduces FakeVLM, a large multimodal model specialized for both general synthetic image detection and DeepFake detection. FakeVLM distinguishes real from fake images and also provides clear natural language explanations of image artifacts. The work also presents FakeClue, a dataset of over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. Result: In extensive evaluations across multiple datasets, FakeVLM shows superior performance in both authenticity classification and artifact explanation, setting a new benchmark for synthetic image detection. Conclusion: FakeVLM is a robust solution for synthetic data detection that requires no additional classifiers. The dataset and code will be released at https://github.com/opendatalab/FakeVLM. Abstract: With the rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, synthetic images have become increasingly prevalent in everyday life, posing new challenges for authenticity assessment and detection. Despite the effectiveness of existing methods in evaluating image authenticity and locating forgeries, these approaches often lack human interpretability and do not fully address the growing complexity of synthetic data. To tackle these challenges, we introduce FakeVLM, a specialized large multimodal model designed for both general synthetic image and DeepFake detection tasks. FakeVLM not only excels in distinguishing real from fake images but also provides clear, natural language explanations for image artifacts, enhancing interpretability. Additionally, we present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. FakeVLM demonstrates performance comparable to expert models while eliminating the need for additional classifiers, making it a robust solution for synthetic data detection. Extensive evaluations across multiple datasets confirm the superiority of FakeVLM in both authenticity classification and artifact explanation tasks, setting a new benchmark for synthetic image detection. The dataset and code will be released at https://github.com/opendatalab/FakeVLM.

Dynamic Bi-Elman Attention Networks (DBEAN): Dual-Directional Context-Aware Representation Learning for Enhanced Text Classification

ZhengLin Lai,MengYao Liao,Dong Xu

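Task: Propose DBEAN, a new text classification model that addresses the limitations of existing models in interpretability, computational efficiency, and long-range contextual understanding.

Motivation: Traditional methods struggle with complex linguistic structures and semantic dependencies, and existing deep learning models still fall short in balancing interpretability, computational efficiency, and long-range contextual understanding.

Details

Method: Proposes the Dynamic Bidirectional Elman with Attention Network (DBEAN), which combines bidirectional temporal modelling with self-attention to dynamically assign weights to the critical segments of the input. Result: DBEAN improves contextual representation while maintaining computational efficiency. Conclusion: DBEAN performs well on text classification and effectively balances interpretability, computational efficiency, and long-range contextual understanding. Abstract: Text classification, a fundamental task in natural language processing (NLP), aims to categorize textual data into predefined labels. Traditional methods struggled with complex linguistic structures and semantic dependencies. The advent of deep learning, particularly recurrent neural networks (RNNs) and Transformer-based models, has significantly advanced the field by enabling nuanced feature extraction and context-aware predictions. Despite improvements, existing models exhibit limitations in balancing interpretability, computational efficiency, and long-range contextual understanding. This paper proposes the Dynamic Bidirectional Elman with Attention Network (DBEAN), which integrates bidirectional temporal modelling with self-attention mechanisms. DBEAN dynamically assigns weights to critical segments of input, improving contextual representation while maintaining computational efficiency.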
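To make the combination of bidirectional recurrence and attention concrete, here is a minimal pure-Python sketch. It is not the authors' architecture: the weight sharing between the two directions, the single learned `query` vector used in place of full self-attention, and all function names are assumptions made for illustration.

```python
import math

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def elman(xs, W_in, W_rec):
    # Elman recurrence: h_t = tanh(W_in x_t + W_rec h_{t-1}), starting from h = 0.
    h = [0.0] * len(W_rec)
    states = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_rec, h))]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states

def bi_elman_attention(xs, W_in, W_rec, query):
    # Run the recurrence left-to-right and right-to-left
    # (weights are shared between directions here, an assumption for brevity).
    fwd = elman(xs, W_in, W_rec)
    bwd = elman(xs[::-1], W_in, W_rec)[::-1]
    states = [f + b for f, b in zip(fwd, bwd)]  # concatenate per time step
    # Score each time step against the query, then apply a stable softmax.
    scores = [sum(q * s for q, s in zip(query, st)) for st in states]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the bidirectional states.
    pooled = [sum(w * st[k] for w, st in zip(weights, states))
              for k in range(len(states[0]))]
    return weights, pooled
```

The returned `weights` indicate which input positions dominate the pooled representation, which is the sense in which such a model can "dynamically assign weights to critical segments"; a classifier head would then consume `pooled`.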

Robust Distribution Alignment for Industrial Anomaly Detection under Distribution Shift

Jingyi Liao,Xun Xu,Yongyi Su,Rong-Cheng Tu,Yifan Liu,Dacheng Tao,Xulei Yang

Task: Improve generalization to unseen target domains by optimizing a Sinkhorn distance, addressing anomaly detection in industrial applications.

Motivation: Existing methods for handling domain shifts (such as lighting variations or sensor drift) often rely on prior knowledge of target distributions and can hardly generalize to backbones designed for other data modalities.

Details

Method: Builds on memory-bank-based anomaly detection, optimizing a robust Sinkhorn distance on limited target training data to improve generalization to unseen target domains. Result: On 2D and 3D anomaly detection benchmarks with simulated distribution shifts, the proposed method outperforms state-of-the-art anomaly detection and domain adaptation methods. Conclusion: The proposed method improves generalization to unseen target domains and outperforms existing anomaly detection and domain adaptation methods. Abstract: Anomaly detection plays a crucial role in quality control for industrial applications. However, ensuring robustness under unseen domain shifts such as lighting variations or sensor drift remains a significant challenge. Existing methods attempt to address domain shifts by training generalizable models but often rely on prior knowledge of target distributions and can hardly generalise to backbones designed for other data modalities. To overcome these limitations, we build upon memory-bank-based anomaly detection methods, optimizing a robust Sinkhorn distance on limited target training data to enhance generalization to unseen target domains. We evaluate the effectiveness on both 2D and 3D anomaly detection benchmarks with simulated distribution shifts. Our proposed method demonstrates superior results compared with state-of-the-art anomaly detection and domain adaptation methods.
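For readers unfamiliar with the Sinkhorn distance, the sketch below shows the standard entropy-regularised optimal transport iteration (Sinkhorn-Knopp) in pure Python. It illustrates the generic quantity only; the "robust" variant and the memory-bank features used in the paper are not described in the abstract, and the function name, `eps`, and `n_iters` are assumptions.

```python
import math

def sinkhorn_distance(cost, a, b, eps=0.1, n_iters=200):
    # cost[i][j]: cost of moving mass from source bin i to target bin j.
    # a, b: source and target histograms, each summing to 1.
    n, m = len(a), len(b)
    # Gibbs kernel of the cost matrix; eps controls the entropic smoothing.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan and its total transport cost.
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    dist = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return dist, plan
```

In an anomaly detection setting, `a` and `b` would typically be weights over source and target feature sets and `cost` a pairwise feature distance, so a small distance indicates well-aligned distributions.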

Value Profiles for Encoding Human Variation

Taylor Sorensen,Pushkar Mishra,Roma Patel,Michael Henry Tessler,Michiel Bakker,Georgina Evans,Iason Gabriel,Noah Goodman,Verena Rieser

Task: Model human variation in rating tasks to enable AI systems for personalization, pluralistic model alignment, and computational social science.

Motivation: Enabling AI systems for personalization, pluralistic model alignment, and computational social science requires modelling human variation in rating tasks.

Details

Method: Proposes representing individuals with value profiles (natural-language descriptions of underlying values compressed from in-context demonstrations), together with a steerable decoder model that estimates ratings conditioned on a value profile or other rater information. Result: Demonstrations contain the most information, followed by value profiles and then demographics. Value profiles offer advantages in scrutability, interpretability, and steerability, and effectively compress the useful information in demonstrations (>70% information preservation). Conclusion: Value profiles offer a novel, predictive way to describe individual variation beyond demographics or group information. Abstract: Modelling human variation in rating tasks is crucial for enabling AI systems for personalization, pluralistic model alignment, and computational social science. We propose representing individuals using value profiles -- natural language descriptions of underlying values compressed from in-context demonstrations -- along with a steerable decoder model to estimate ratings conditioned on a value profile or other rater information. To measure the predictive information in rater representations, we introduce an information-theoretic methodology. We find that demonstrations contain the most information, followed by value profiles and then demographics. However, value profiles offer advantages in terms of scrutability, interpretability, and steerability due to their compressed natural language format. Value profiles effectively compress the useful information from demonstrations (>70% information preservation). Furthermore, clustering value profiles to identify similarly behaving individuals better explains rater variation than the most predictive demographic groupings. Going beyond test set performance, we show that the decoder models interpretably change ratings according to semantic profile differences, are well-calibrated, and can help explain instance-level disagreement by simulating an annotator population. These results demonstrate that value profiles offer novel, predictive ways to describe individual variation beyond demographics or group information.

Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology

Siyuan Yan,Ming Hu,Yiwen Jiang,Xieji Li,Hao Fei,Philipp Tschandl,Harald Kittler,Zongyuan Ge

Task: Present and validate Derm1M, a large-scale vision-language dataset for dermatology, to advance AI research and clinical application.

Motivation: Existing dermatology datasets are limited in scale and depth and lack rich textual descriptions and clinical context, which constrains the use of AI in dermatology.

Details

Method: Builds the Derm1M dataset of 1,029,761 image-text pairs covering more than 390 skin conditions and 130 clinical concepts, and pretrains a series of CLIP-like models (DermLIP) on it. Result: The DermLIP models significantly outperform state-of-the-art foundation models across multiple tasks and datasets, including zero-shot skin disease classification, clinical and artifact concept identification, few-shot/full-shot learning, and cross-modal retrieval. Conclusion: Derm1M and DermLIP show significant potential for advancing AI research and clinical application; the dataset and code will be made public. Abstract: The emergence of vision-language models has transformed medical AI, enabling unprecedented advances in diagnostic capability and clinical applications. However, progress in dermatology has lagged behind other medical domains due to the lack of standard image-text pairs. Existing dermatological datasets are limited in both scale and depth, offering only single-label annotations across a narrow range of diseases instead of rich textual descriptions, and lacking the crucial clinical context needed for real-world applications. To address these limitations, we present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. Built from diverse educational resources and structured around a standard ontology collaboratively developed by experts, Derm1M provides comprehensive coverage for over 390 skin conditions across four hierarchical levels and 130 clinical concepts with rich contextual information such as medical history, symptoms, and skin tone. To demonstrate Derm1M's potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset. The DermLIP family significantly outperforms state-of-the-art foundation models on eight diverse datasets across multiple tasks, including zero-shot skin disease classification, clinical and artifacts concept identification, few-shot/full-shot learning, and cross-modal retrieval. Our dataset and code will be public.

Policy Frameworks for Transparent Chain-of-Thought Reasoning in Large Language Models

Yihang Chen,Haikang Deng,Kaiqiao Han,Qingyue Zhao

Task: Analyze the dual-edged implications of full Chain-of-Thought (CoT) disclosure and propose a tiered-access policy framework.

Motivation: Current CoT disclosure policies vary across models and lack a unified policy framework; transparency, accountability, and security need to be balanced.

Details

Method: Proposes a tiered-access policy framework that tailors CoT availability through ethical licensing, structured reasoning outputs, and cross-tier safeguards. Result: The framework aims to balance transparency, accountability, and security, advancing responsible AI deployment while mitigating risks of misuse or misinterpretation. Conclusion: A tiered-access policy framework can balance transparency, accountability, and security, advancing responsible AI deployment while mitigating risks of misuse or misinterpretation. Abstract: Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by decomposing complex problems into step-by-step solutions, improving performance on reasoning tasks. However, current CoT disclosure policies vary widely across different models in frontend visibility, API access, and pricing strategies, lacking a unified policy framework. This paper analyzes the dual-edged implications of full CoT disclosure: while it empowers small-model distillation, fosters trust, and enables error diagnosis, it also risks violating intellectual property, enabling misuse, and incurring operational costs. We propose a tiered-access policy framework that balances transparency, accountability, and security by tailoring CoT availability to academic, business, and general users through ethical licensing, structured reasoning outputs, and cross-tier safeguards. By harmonizing accessibility with ethical and operational considerations, this framework aims to advance responsible AI deployment while mitigating risks of misuse or misinterpretation.

Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes

Gahye Lee,Hyejeong Yoon,Jungeon Kim,Seungyong Lee

Task: Propose a new deep-learning-based framework for compactly representing 3D indoor scenes.

Motivation: Indoor scenes consist mainly of man-made objects such as furniture, which often exhibit rectilinear geometry, so they can be represented compactly as combinations of polycuboids, which benefits downstream applications like furniture rearrangement.

Details

Method: The framework first detects six types of cuboid faces with a transformer network, then uses a graph neural network to validate the spatial relationships of the detected faces and form potential polycuboids, and finally reconstructs each polycuboid instance from the aggregated face labels. Result: The framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. Conclusion: The method's versatility is demonstrated through practical applications such as virtual room tours and scene editing. Abstract: This paper presents a novel framework for compactly representing a 3D indoor scene using a set of polycuboids through a deep learning-based fitting method. Indoor scenes mainly consist of man-made objects, such as furniture, which often exhibit rectilinear geometry. This property allows indoor scenes to be represented using combinations of polycuboids, providing a compact representation that benefits downstream applications like furniture rearrangement. Our framework takes a noisy point cloud as input and first detects six types of cuboid faces using a transformer network. Then, a graph neural network is used to validate the spatial relationships of the detected faces to form potential polycuboids. Finally, each polycuboid instance is reconstructed by forming a set of boxes based on the aggregated face labels. To train our networks, we introduce a synthetic dataset encompassing a diverse range of cuboid and polycuboid shapes that reflect the characteristics of indoor scenes. Our framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. The versatility of our method is demonstrated through practical applications, such as virtual room tours and scene editing.

Threefold model for AI Readiness: A Case Study with Finnish Healthcare SMEs

Mohammed Alnajjar,Khalid Alnajjar,Mika Hämäläinen

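Task: Study AI adoption among small and medium-sized enterprises (SMEs) in the Finnish healthcare sector.

Motivation: Understand the current state and challenges of AI adoption among healthcare SMEs.

Details

Method: Through semi-structured interviews with six health-tech companies, identifies three AI engagement categories: AI-curious (exploring AI), AI-embracing (integrating AI), and AI-catering (providing AI solutions). Result: Proposes a threefold model that highlights key adoption barriers, including regulatory complexities, technical expertise gaps, and financial constraints. Most SMEs remain in early adoption stages. Conclusion: Provides actionable recommendations to accelerate AI integration, focusing on regulatory reforms, talent development, and inter-company collaboration, offering valuable insights for healthcare organizations, policymakers, and researchers. Abstract: This study examines AI adoption among Finnish healthcare SMEs through semi-structured interviews with six health-tech companies. We identify three AI engagement categories: AI-curious (exploring AI), AI-embracing (integrating AI), and AI-catering (providing AI solutions). Our proposed threefold model highlights key adoption barriers, including regulatory complexities, technical expertise gaps, and financial constraints. While SMEs recognize AI's potential, most remain in early adoption stages. We provide actionable recommendations to accelerate AI integration, focusing on regulatory reforms, talent development, and inter-company collaboration, offering valuable insights for healthcare organizations, policymakers, and researchers.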

GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation

Junyu Shi,Lijiang Liu,Yong Sun,Zhiyuan Zhang,Jinni Zhou,Qiang Nie

Task: Propose the Generative Pretrained Multi-path Motion Model (GenM$^3$) framework to address data heterogeneity in large-scale multi-source datasets and learn unified motion representations.

Motivation: Scaling up motion datasets is crucial for enhancing motion generation, but training on large-scale multi-source datasets introduces data heterogeneity challenges.

Details

Method: GenM$^3$ comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation; and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations through separate modality-specific pathways and improves inter-modal alignment through a text-motion shared pathway. Result: GenM$^3$ achieves an FID of 0.035 on the HumanML3D benchmark, surpassing existing methods by a large margin, and shows strong zero-shot generalization on the IDEA400 dataset. Conclusion: GenM$^3$ performs well across diverse motion scenarios, demonstrating its effectiveness and adaptability. Abstract: Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose Generative Pretrained Multi-path Motion Model (GenM$^3$), a comprehensive framework designed to learn unified motion representations. GenM$^3$ comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment by the text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment it with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin. It also demonstrates strong zero-shot generalization on the IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.

Squeeze Out Tokens from Sample for Finer-Grained Data Governance

Weixiong Lin,Chen Ju,Haicheng Wang,Shengchao Hu,Shuai Xiao,Mengting Chen,Yuheng Jiao,Mingshuai Yao,Jinsong Lan,Qingwen Liu,Ying Chen

Task: Propose DataJuicer, a dual-branch data governance method that improves dataset quality through finer-grained intra-sample governance.

Motivation: Existing data governance methods downsize datasets by filtering out low-value samples, but the retained samples still contain many undesired tokens, leaving room for further compression and purification.

Details

Method: DataJuicer uses a dual-branch design: the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes into captions, improving image-text alignment. Result: Experiments show that DataJuicer significantly outperforms the existing DataSieve in image-text retrieval, classification, and dense visual reasoning. Conclusion: Through finer-grained data governance, DataJuicer yields more refined datasets and thereby improves model performance. Abstract: Widely observed data scaling laws, in which error falls off as a power of the training size, demonstrate the diminishing returns of unselective data expansion. Hence, data governance is proposed to downsize datasets through pruning non-informative samples. Yet, isolating the impact of a specific sample on overall model performance is challenging, due to the vast computation required to try out all sample combinations. Current data governors circumvent this complexity by estimating sample contributions through heuristic-derived scalar scores, thereby discarding low-value ones. Despite thorough sample sieving, retained samples contain substantial undesired tokens intrinsically, underscoring the potential for further compression and purification. In this work, we upgrade data governance from a 'sieving' approach to a 'juicing' one. Instead of scanning for least-flawed samples, our dual-branch DataJuicer applies finer-grained intra-sample governance. It squeezes out informative tokens and boosts image-text alignments. Specifically, the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes to enhance captions. Consequently, DataJuicer yields more refined datasets through finer-grained governance. Extensive experiments across datasets demonstrate that DataJuicer significantly outperforms existing DataSieve in image-text retrieval, classification, and dense visual reasoning.

Shushing! Let's Imagine an Authentic Speech from the Silent Video

Jiaxin Ye,Hongming Shan

Task: Generate authentic speech from visual input without relying on auditory signals.

Motivation: Existing methods struggle with cross-modal alignment across semantics, timbre, and emotional prosody, motivating the Consistent Video-to-Speech (CV2S) task to enhance cross-modal consistency.

Details

Method: Proposes ImaginTalk, a novel cross-modal diffusion framework in which a discrete lip aligner predicts discrete speech tokens from lip videos, an error detector identifies misaligned tokens, and those tokens are refined through masked language modeling with BERT. A style diffusion transformer equipped with a face-style adapter further enhances the expressiveness of the generated speech. Result: Experiments show that ImaginTalk generates high-fidelity speech with more accurate semantic details and greater expressiveness in timbre and emotion than state-of-the-art baselines. Conclusion: ImaginTalk performs well on vision-guided speech generation, producing high-quality speech with broad application potential. Abstract: Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals, offering significant potential for applications such as dubbing in filmmaking and assisting individuals with aphonia. Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody from visual cues, prompting us to propose Consistent Video-to-Speech (CV2S) as an extended task to enhance cross-modal consistency. To tackle emerging challenges, we introduce ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech using only visual input, operating within a discrete space. Specifically, we propose a discrete lip aligner that predicts discrete speech tokens from lip videos to capture semantic information, while an error detector identifies misaligned tokens, which are subsequently refined through masked language modeling with BERT. To further enhance the expressiveness of the generated speech, we develop a style diffusion transformer equipped with a face-style adapter that adaptively customizes identity and prosody dynamics across both the channel and temporal dimensions while ensuring synchronization with lip-aware semantic features. Extensive experiments demonstrate that ImaginTalk can generate high-fidelity speech with more accurate semantic details and greater expressiveness in timbre and emotion compared to state-of-the-art baselines. Demos are shown at our project page: https://imagintalk.github.io.

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Sara Sarto,Marcella Cornia,Rita Cucchiara

Task: Evaluate machine-generated image captions.

Motivation: With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics.

Details

Method: Surveys advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. Result: The analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment. Conclusion: The survey underscores the limitations of existing evaluation metrics and outlines directions for future research. Abstract: The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.

FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

Chongjun Tu,Lin Zhang,Pengtao Chen,Peng Ye,Xianfang Zeng,Wei Cheng,Gang Yu,Tao Chen

Task: Assess and improve the fine-grained motion understanding of Multimodal Large Language Models (MLLMs) in video content understanding.

Motivation: Existing MLLMs perform remarkably well at video content understanding but still struggle with fine-grained motion comprehension.

Details

Method: Introduces FAVOR-Bench, comprising 1,776 videos with structured manual annotations, along with 8,184 multiple-choice question-answer pairs and open-ended evaluation methods. Further builds FAVOR-Train, a dataset of 17,152 videos with fine-grained motion annotations. Result: Experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. Finetuning Qwen2.5-VL on FAVOR-Train yields consistent improvements on motion-related tasks of TVBench, MotionBench, and FAVOR-Bench. Conclusion: FAVOR-Bench and FAVOR-Train provide valuable tools for developing more powerful video understanding models. Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel cost-efficient LLM-free and a GPT-assisted caption assessment method, where the former can enhance benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. The results of finetuning Qwen2.5-VL on FAVOR-Train yield consistent improvements on motion-related tasks of TVBench, MotionBench and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: https://favor-bench.github.io/.

Unique Hard Attention: A Tale of Two Sides

Selim Jerad,Anej Svete,Jiaoda Li,Ryan Cotterell

Task: Analyze the expressive power of leftmost-hard attention transformers and its relationship to Linear Temporal Logic (LTL).

Motivation: Understanding the expressive power of transformers offers insight into their abilities and limitations, in particular how unique hard attention transformers choose among multiple positions that attain the maximum attention score.

Details

Method: Compares leftmost- and rightmost-hard attention transformers with Linear Temporal Logic (LTL) in terms of equivalence, and examines the relationship between leftmost-hard attention transformers and soft attention. Result: Transformers with only leftmost-hard attention correspond to a strictly weaker fragment of LTL and are equivalent to soft attention. Conclusion: Leftmost-hard attention transformers may better approximate real-world transformers; the findings refine the landscape of transformer expressivity and underscore the role of attention directionality. Abstract: Understanding the expressive power of transformers has recently attracted attention, as it offers insights into their abilities and limitations. Many studies analyze unique hard attention transformers, where attention selects a single position that maximizes the attention scores. When multiple positions achieve the maximum score, either the rightmost or the leftmost of those is chosen. In this paper, we highlight the importance of this seeming triviality. Recently, finite-precision transformers with both leftmost- and rightmost-hard attention were shown to be equivalent to Linear Temporal Logic (LTL). We show that this no longer holds with only leftmost-hard attention -- in that case, they correspond to a strictly weaker fragment of LTL. Furthermore, we show that models with leftmost-hard attention are equivalent to soft attention, suggesting they may better approximate real-world transformers than right-attention models. These findings refine the landscape of transformer expressivity and underscore the role of attention directionality.
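The tie-breaking rule at the heart of this entry is easy to state in code. The sketch below is a toy illustration of unique hard attention over a single list of scores, not the paper's formal construction; the function name and the `side` parameter are assumptions.

```python
def hard_attention(scores, values, side="left"):
    # Unique hard attention: attend to exactly one position with the maximum score.
    best = max(scores)
    positions = [i for i, s in enumerate(scores) if s == best]
    # Ties are broken by taking the leftmost or the rightmost maximal position.
    idx = positions[0] if side == "left" else positions[-1]
    return idx, values[idx]

# With scores [1, 3, 2, 3], positions 1 and 3 tie for the maximum,
# so the two variants return different values.
left_pick = hard_attention([1, 3, 2, 3], ["a", "b", "c", "d"], side="left")
right_pick = hard_attention([1, 3, 2, 3], ["a", "b", "c", "d"], side="right")
```

When no ties occur the two variants agree, which is why the choice looks trivial; the abstract's claim is that it nonetheless changes the class of languages the model can recognize.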

Optimal Transport Adapter Tuning for Bridging Modality Gaps in Few-Shot Remote Sensing Scene Classification

Zhong Ji,Ci Liu,Jingren Liu,Chen Tang,Yanwei Pang,Xuelong Li

Task: Few-Shot Remote Sensing Scene Classification (FS-RSSC) with limited labeled samples.

Motivation: Existing methods typically emphasize single-modal feature learning, neglecting the potential benefits of optimizing multi-modal representations.

Details

Method: Propose a novel Optimal Transport Adapter Tuning (OTAT) framework to construct an ideal Platonic representational space through optimal transport (OT) theory. The framework includes an Optimal Transport Adapter (OTA) with a cross-modal attention mechanism and a sample-level Entropy-Aware Weighted (EAW) loss. Result: OTAT achieves state-of-the-art performance in FS-RSSC, significantly improving model performance and generalization. Conclusion: The OTAT framework offers a scalable and efficient solution for advancing multimodal learning in remote sensing applications. Abstract: Few-Shot Remote Sensing Scene Classification (FS-RSSC) presents the challenge of classifying remote sensing images with limited labeled samples. Existing methods typically emphasize single-modal feature learning, neglecting the potential benefits of optimizing multi-modal representations. To address this limitation, we propose a novel Optimal Transport Adapter Tuning (OTAT) framework aimed at constructing an ideal Platonic representational space through optimal transport (OT) theory. This framework seeks to harmonize rich visual information with less dense textual cues, enabling effective cross-modal information transfer and complementarity. Central to this approach is the Optimal Transport Adapter (OTA), which employs a cross-modal attention mechanism to enrich textual representations and facilitate subsequent better information interaction. By transforming the network optimization into an OT optimization problem, OTA establishes efficient pathways for balanced information exchange between modalities. Moreover, we introduce a sample-level Entropy-Aware Weighted (EAW) loss, which combines difficulty-weighted similarity scores with entropy-based regularization. This loss function provides finer control over the OT optimization process, enhancing its solvability and stability. Our framework offers a scalable and efficient solution for advancing multimodal learning in remote sensing applications. Extensive experiments on benchmark datasets demonstrate that OTAT achieves state-of-the-art performance in FS-RSSC, significantly improving the model performance and generalization.

RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving

Wenqi Jiang,Suvinay Subramanian,Cat Graves,Gustavo Alonso,Amir Yazdanbakhsh,Vidushi Dadu

Task: Propose an optimization framework for efficient retrieval-augmented generation (RAG) serving.

Motivation: Efficient RAG serving remains an open challenge because of the rapid emergence of RAG variants and the substantial differences in workload characteristics across them.

Details

Method: Introduces RAGSchema as a structured abstraction of RAG algorithms, analyzes representative RAG workloads, and proposes the RAGO optimization framework. Result: RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions. Conclusion: RAGO significantly improves the efficiency of RAG serving and meets diverse performance requirements. Abstract: Retrieval-augmented generation (RAG), which combines large language models (LLMs) with retrievals from external knowledge databases, is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. In this paper, we make three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. Our evaluation shows that RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.

VisNumBench: Evaluating Number Sense of Multimodal Large Language Models

Tengjin Weng,Jingyi Wang,Wenhao Jiang,Zhong Ming

Task: Evaluate the number sense of Multimodal Large Language Models (MLLMs) on visual numerical tasks.

Motivation: Investigate whether MLLMs can develop an intuitive number sense similar to humans.

Details

Method: Introduces the Visual Number Benchmark (VisNumBench), comprising about 1,900 multiple-choice question-answer pairs that cover seven visual numerical attributes and four types of visual numerical estimation tasks. Result: The 17 MLLMs tested perform significantly below human level on number-sense tasks; multimodal mathematical models and multimodal chain-of-thought (CoT) models show no significant improvement in number sense; stronger MLLMs with larger parameter sizes and broader general abilities show modest gains. Conclusion: VisNumBench is intended as a valuable resource for the research community, encouraging further work on MLLMs' number sense. Abstract: Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested, including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash, perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs' number sense abilities. All benchmark resources, including code and datasets, will be publicly available at https://wwwtttjjj.github.io/VisNumBench/.

Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations

Shuo Li,Jiajun Sun,Guodong Zheng,Xiaoran Fan,Yujiong Shen,Yi Lu,Zhiheng Xi,Yuming Yang,Wenming Tan,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang

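Task: Propose Multi-Frequency Perturbations (MFP), a method that uses the low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, mitigating object hallucinations in multimodal large language models (MLLMs).

Motivation: MLLMs perform remarkably well on vision-language tasks, but their responses are often compromised by object hallucinations. The authors identify the model's over-susceptibility to specific image frequency features when detecting objects as a key cause of these hallucinations.

Details

Method: Introduces Multi-Frequency Perturbations (MFP), which uses both low-frequency and high-frequency image features to perturb visual feature representations and explicitly suppresses redundant frequency-domain features during inference. Result: Experiments show that the method significantly mitigates object hallucinations across various model architectures. As a training-time method, MFP can also be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark. Conclusion: MFP is a simple, cost-effective, and pluggable method that effectively mitigates object hallucinations in MLLMs and achieves state-of-the-art performance on the CHAIR benchmark. Abstract: Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.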

UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

Qihui Zhang,Munan Ning,Zheyuan Liu,Yanbo Wang,Jiayi Ye,Yue Huang,Shuo Yang,Xiao Chen,Yibing Song,Li Yuan

Task: Propose an unsupervised peer-review framework for evaluating multimodal large language models (MLLMs), addressing the heavy human workload and bias of existing evaluation methods.

Motivation: Existing MLLM evaluation methods require substantial human effort to design Q&A pairs, and automated evaluation methods tend to introduce bias.

Details

Method: Proposes an unsupervised peer-review evaluation framework that uses only image data: models automatically generate questions and peer-review each other's answers, and a vision-language scoring system is introduced to reduce bias. Result: UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMstar dataset and 0.814 on the ScienceQA dataset. Conclusion: UPME effectively reduces reliance on human effort and aligns closely with human-designed benchmarks and human preferences. Abstract: Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models. Existing evaluation methods face limitations due to the significant human workload required to design Q&A pairs for visual images, which inherently restricts the scale and scope of evaluations. Although automated MLLM-as-judge approaches attempt to reduce the human workload through automatic evaluations, they often introduce biases. To address these problems, we propose an Unsupervised Peer review MLLM Evaluation framework. It utilizes only image data, allowing models to automatically generate questions and conduct peer review assessments of answers from other models, effectively alleviating the reliance on human workload. Additionally, we introduce the vision-language scoring system to mitigate the bias issues, which focuses on three aspects: (i) response correctness; (ii) visual understanding and reasoning; and (iii) image-text correlation. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMstar dataset and 0.814 on the ScienceQA dataset, indicating that our framework closely aligns with human-designed benchmarks and inherent human preferences.
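The headline numbers here are Pearson correlations between framework scores and human evaluations. As a reminder of what that statistic measures, the following pure-Python sketch computes it for two score lists; the function name is an assumption, and any score values passed to it would be made up for illustration rather than taken from the paper.

```python
import math

def pearson(x, y):
    # Pearson correlation: covariance of x and y divided by the product
    # of their standard deviations.
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the automatic scores rank and space the models almost exactly as the human scores do; the statistic is insensitive to any linear rescaling of either score list.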

Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU

Àlex Pujol Vidal,Sergio Escalera,Kamal Nasrollahi,Thomas B. Moeslund

Task: Study machine unlearning in hyperbolic contrastive learning for selectively removing concepts from pre-trained models.

Motivation: Explore the effectiveness of concept removal in hyperbolic space, since prior work has focused on unlearning in Euclidean contrastive vision-language models.

Details

Method: Adapts Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Result: Experiments show that hyperbolic geometry offers distinct advantages for concept removal, achieving near-perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Conclusion: Hyperbolic unlearning reorganizes the semantic hierarchy, whereas Euclidean approaches merely disconnect cross-modal associations; the findings advance machine unlearning techniques and provide insight into how geometric properties influence concept representation and removal in multimodal models. Abstract: Machine unlearning methods have become increasingly important for selective concept removal in large pre-trained models. While recent work has explored unlearning in Euclidean contrastive vision-language models, the effectiveness of concept removal in hyperbolic spaces remains unexplored. This paper investigates machine unlearning in hyperbolic contrastive learning by adapting Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Through systematic experiments and ablation studies, we demonstrate that hyperbolic geometry offers distinct advantages for concept removal, achieving near perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space. Comparative analysis with Euclidean models reveals fundamental differences in unlearning dynamics, with hyperbolic unlearning reorganizing the semantic hierarchy while Euclidean approaches merely disconnect cross-modal associations. These findings not only advance machine unlearning techniques but also provide insights into the geometric properties that influence concept representation and removal in multimodal models. Source code available at https://github.com/alex-pv01/HAC
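To give a feel for how distances behave in hyperbolic space, the sketch below computes the geodesic distance in the Lorentz (hyperboloid) model with curvature parameter `c`. The abstract does not state which model of hyperbolic space is used, so the choice of the Lorentz model, the function names, and the default `c=1.0` are all assumptions made for illustration.

```python
import math

def lift(x, c=1.0):
    # Time component that places the space-like vector x on the hyperboloid
    # defined by <p, p>_L = -1/c.
    return math.sqrt(1.0 / c + sum(v * v for v in x))

def lorentz_distance(x, y, c=1.0):
    # Lorentzian inner product: Euclidean dot product of the space components
    # minus the product of the time components.
    inner = sum(a * b for a, b in zip(x, y)) - lift(x, c) * lift(y, c)
    # Clamp to 1.0 to guard against rounding pushing the argument below
    # the domain of acosh.
    return math.acosh(max(1.0, -c * inner)) / math.sqrt(c)
```

Distances grow much faster than Euclidean ones as points move away from the origin, which is the property that lets tree-like hierarchies embed with low distortion and that a hyperbolic unlearning method has to respect.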

3D Engine-ready Photorealistic Avatars via Dynamic Textures

Yifan Wang,Ivan Molodetskikh,Ondrej Texler,Dimitar Dinev

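Task: Propose an end-to-end pipeline that builds explicitly represented, photorealistic 3D avatars from standard 3D assets.

Motivation: As the digital and physical worlds become intertwined, demand for photorealistic digital avatars is growing, but existing digitization methods are costly and impractical for mass-market consumers.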

Details

Method: Uses dynamically generated textures to enhance realism and visually mask deficiencies in the underlying mesh geometry. Result: Integrates seamlessly with existing graphics pipelines while achieving visual quality comparable to state-of-the-art 3D avatar generation methods. Conclusion: The proposed method generates photorealistic 3D avatars without relying on expensive equipment and remains compatible with existing graphics pipelines. Abstract: As the digital and physical worlds become more intertwined, there has been a lot of interest in digital avatars that closely resemble their real-world counterparts. Current digitization methods used in 3D production pipelines require costly capture setups, making them impractical for mass usage among common consumers. Recent academic literature has found success in reconstructing humans from limited data using implicit representations (e.g., voxels used in NeRFs), which are able to produce impressive videos. However, these methods are incompatible with traditional rendering pipelines, making it difficult to use them in applications such as games. In this work, we propose an end-to-end pipeline that builds explicitly-represented photorealistic 3D avatars using standard 3D assets. Our key idea is the use of dynamically-generated textures to enhance the realism and visually mask deficiencies in the underlying mesh geometry. This allows for seamless integration with current graphics pipelines while achieving comparable visual quality to state-of-the-art 3D avatar generation methods.

A Review on Large Language Models for Visual Analytics

Navya Sonal Agarwal,Sanjay Kumar Sonbhadra

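Task: Review the integration of large language models (LLMs) with visual analytics, covering foundational concepts, capabilities, and wide-ranging applications.

Motivation: Examine the potential of LLMs in natural language understanding, natural language generation, dialogue systems, and text-to-media transformation, and how their synergy with visual analytics enhances data interpretation, visualization techniques, and interactive exploration.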

Details

Method: Evaluates key tools and platforms (such as LIDA, Chat2VIS, Julius AI, and Zoho Analytics) and specialized multimodal models (such as ChartLlama and CharXIV), and systematically explores a taxonomy of LLM tasks spanning natural language understanding (NLU), natural language generation (NLG), dialogue systems, and text-to-media transformation. Result: Provides a SWOT analysis of integrating LLMs with visual analytics, highlighting strengths (such as accessibility and flexibility), weaknesses (such as computational demands and biases), opportunities (such as multimodal integration and user collaboration), and threats (such as privacy concerns and skill degradation). Conclusion: Emphasizes the importance of addressing ethical considerations and methodological improvements for effective integration. Abstract: This paper provides a comprehensive review of the integration of Large Language Models (LLMs) with visual analytics, addressing their foundational concepts, capabilities, and wide-ranging applications. It begins by outlining the theoretical underpinnings of visual analytics and the transformative potential of LLMs, specifically focusing on their roles in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations. The review further investigates how the synergy between LLMs and visual analytics enhances data interpretation, visualization techniques, and interactive exploration capabilities. Key tools and platforms including LIDA, Chat2VIS, Julius AI, and Zoho Analytics, along with specialized multimodal models such as ChartLlama and CharXIV, are critically evaluated. The paper discusses their functionalities, strengths, and limitations in supporting data exploration, visualization enhancement, automated reporting, and insight extraction. The taxonomy of LLM tasks, ranging from natural language understanding (NLU) and natural language generation (NLG) to dialogue systems and text-to-media transformations, is systematically explored. This review provides a SWOT analysis of integrating LLMs with visual analytics, highlighting strengths like accessibility and flexibility, weaknesses such as computational demands and biases, opportunities in multimodal integration and user collaboration, and threats including privacy concerns and skill degradation. It emphasizes addressing ethical considerations and methodological improvements for effective integration.

MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Zihan Cao,Yu Zhong,Ziqi Wang,Liang-Jian Deng

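Task: Propose a unified framework for multi-task, multi-degradation, language-guided image fusion.

Motivation: Existing methods have several significant limitations: they require task- or dataset-specific models, neglect real-world image degradations, operate in pixel space at high computational cost, and lack user interaction capabilities.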

Details

Method: Proposes a framework with two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space that fuses a clean image conditioned on the degraded inputs and the generated prompts. Result: Extensive qualitative and quantitative experiments show that the method addresses the limitations above and outperforms previous restoration+fusion and all-in-one pipelines. Conclusion: The proposed framework performs strongly on multi-task, multi-degradation, language-guided image fusion, and the code is publicly available. Abstract: Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (e.g., noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines. Code is available at https://github.com/294coder/MMAIF.

When Pigs Get Sick: Multi-Agent AI for Swine Disease Detection

Tittaya Mairittha,Tanakon Sawanglok,Panuwit Raden,Sorrawit Treesuk

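Task: Develop an AI-based multi-agent diagnostic system for swine disease surveillance and clinical guidance.

Motivation: Address the challenges of swine disease surveillance in sustainable global agriculture, including limited veterinary resources, delayed case identification, and variability in diagnostic accuracy.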

Details

Method: Uses Retrieval-Augmented Generation (RAG), automatically classifies user inputs as knowledge retrieval queries or symptom-based diagnostic queries, and diagnoses through an adaptive questioning protocol and a confidence-weighted decision fusion mechanism. Result: The system shows high accuracy, rapid response times, and consistent reliability in query classification, disease diagnosis, and knowledge retrieval. Conclusion: The AI-driven diagnostic framework improves veterinary decision-making, advances sustainable livestock management practices, and contributes substantively to global food security. Abstract: Swine disease surveillance is critical to the sustainability of global agriculture, yet its effectiveness is frequently undermined by limited veterinary resources, delayed identification of cases, and variability in diagnostic accuracy. To overcome these barriers, we introduce a novel AI-powered, multi-agent diagnostic system that leverages Retrieval-Augmented Generation (RAG) to deliver timely, evidence-based disease detection and clinical guidance. By automatically classifying user inputs into either Knowledge Retrieval Queries or Symptom-Based Diagnostic Queries, the system ensures targeted information retrieval and facilitates precise diagnostic reasoning. An adaptive questioning protocol systematically collects relevant clinical signs, while a confidence-weighted decision fusion mechanism integrates multiple diagnostic hypotheses to generate robust disease predictions and treatment recommendations. Comprehensive evaluations encompassing query classification, disease diagnosis, and knowledge retrieval demonstrate that the system achieves high accuracy, rapid response times, and consistent reliability. By providing a scalable, AI-driven diagnostic framework, this approach enhances veterinary decision-making, advances sustainable livestock management practices, and contributes substantively to the realization of global food security.
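
The abstract does not give the formula behind its confidence-weighted decision fusion, so the following is only one plausible reading: each agent proposes a disease with a confidence, confidences for the same disease are summed, and the totals are normalized. The function name, the disease labels, and every number are hypothetical.

```python
def fuse_hypotheses(hypotheses):
    """Combine (disease, confidence) pairs into a normalized score per disease.

    Assumed scheme (not taken from the paper): sum the confidences that
    different agents assign to the same disease, then normalize to sum to 1.
    """
    totals = {}
    for disease, confidence in hypotheses:
        totals[disease] = totals.get(disease, 0.0) + confidence
    norm = sum(totals.values())
    if norm == 0:
        return {}
    return {disease: score / norm for disease, score in totals.items()}

# Hypothetical outputs from three diagnostic agents (illustrative values only).
agent_outputs = [
    ("PRRS", 0.8),
    ("swine influenza", 0.5),
    ("PRRS", 0.6),
    ("African swine fever", 0.1),
]
fused = fuse_hypotheses(agent_outputs)
top_diagnosis = max(fused, key=fused.get)
print(top_diagnosis, round(fused[top_diagnosis], 2))
```

Summing before normalizing lets agreement between agents count for more than a single confident agent, which is the usual motivation for this kind of fusion.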

Generating Multimodal Driving Scenes via Next-Scene Prediction

Yanhao Wu,Haoyang Zhang,Tianwei Lin,Lichao Huang,Shujie Luo,Rui Wu,Congpei Qiu,Wei Ke,Tong Zhang

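Task: Propose a multimodal generation framework that produces controllable autonomous driving scenes for comprehensive evaluation.

Motivation: Existing generative methods capture only a limited range of modalities, which restricts their ability to generate controllable scenes and prevents comprehensive evaluation of autonomous driving systems.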

Details

Method: Introduces a multimodal generation framework covering four major data modalities and generates scene sequences with a two-stage approach comprising a Temporal AutoRegressive (TAR) component and an Ordered AutoRegressive (OAR) component, together with an Action-aware Map Alignment (AMA) module. Result: The framework effectively generates complex, realistic driving scenes over long sequences, ensures multimodal consistency, and provides fine-grained control over scene elements. Conclusion: The proposed framework performs well at generating complex driving scenes and meets the needs of comprehensive evaluation of autonomous driving systems. Abstract: Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by only capturing a limited range of modalities, restricting the capability of generating controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of a map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation based on the ego-action. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.

Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context

Junyi Ao,Dekun Chen,Xiaohai Tian,Wenjie Feng,Jun Zhang,Lu Lu,Yuxuan Wang,Haizhou Li,Zhizheng Wu

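Task: Propose Solla, a new framework that understands speech instructions and audio context at the same time.

Motivation: Existing models mainly analyze input signals using text instructions and overlook scenarios in which speech instructions and audio are mixed as input.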

Details

Method: Solla combines an audio tagging module with an ASR-assisted prediction method to identify and represent audio events effectively and to improve comprehension of spoken content. Result: On the SA-Eval benchmark, Solla performs on par with or better than baseline models, demonstrating its effectiveness in jointly understanding speech and audio. Conclusion: Solla handles mixed speech-instruction and audio inputs well and offers an effective solution to this challenge. Abstract: Large Language Models (LLMs) have recently shown remarkable ability to process not only text but also multimodal inputs such as speech and audio. However, most existing models primarily focus on analyzing input signals using text instructions, overlooking scenarios in which speech instructions and audio are mixed and serve as inputs to the model. To address these challenges, we introduce Solla, a novel framework designed to understand speech-based questions and hear the acoustic context concurrently. Solla incorporates an audio tagging module to effectively identify and represent audio events, as well as an ASR-assisted prediction method to improve comprehension of spoken content. To rigorously evaluate Solla and other publicly available models, we propose a new benchmark dataset called SA-Eval, which includes three tasks: audio event classification, audio captioning, and audio question answering. SA-Eval has diverse speech instructions with various speaking styles, encompassing two difficulty levels, easy and hard, to capture the range of real-world acoustic conditions. Experimental results show that Solla performs on par with or outperforms baseline models on both the easy and hard test sets, underscoring its effectiveness in jointly understanding speech and audio.

ChatStitch: Visualizing Through Structures via Surround-View Unsupervised Deep Image Stitching with Collaborative LLM-Agents

Hao Liang,Zhipeng Dong,Yi Yang,Mengyin Fu

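Task: Propose ChatStitch, a system that reveals obscured blind-spot information through natural language commands integrated with external digital assets.

Motivation: Existing collaborative perception systems are limited in user-interaction efficiency and in photorealistic multi-camera visualization.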

Details

Method: Adopts a multi-agent collaborative framework based on large language models and proposes SV-UDIS, the first surround-view unsupervised deep image stitching method under the non-global-overlapping condition. Result: On the UDIS-D dataset, SV-UDIS achieves state-of-the-art performance on 3-, 4-, and 5-image stitching tasks, improving PSNR by 9%, 17%, and 21% and SSIM by 8%, 18%, and 26%, respectively. Conclusion: Through natural language commands and a multi-agent collaborative framework, ChatStitch addresses the user-interaction and multi-camera visualization problems of collaborative perception systems, and SV-UDIS performs strongly on image stitching. Abstract: Collaborative perception has garnered significant attention for its ability to enhance the perception capabilities of individual vehicles through the exchange of information with surrounding vehicle-agents. However, existing collaborative perception systems are limited by inefficiencies in user interaction and the challenge of multi-camera photorealistic visualization. To address these challenges, this paper introduces ChatStitch, the first collaborative perception system capable of unveiling obscured blind spot information through natural language commands integrated with external digital assets. To adeptly handle complex or abstract commands, ChatStitch employs a multi-agent collaborative framework based on Large Language Models. To achieve the most intuitive perception for humans, ChatStitch proposes SV-UDIS, the first surround-view unsupervised deep image stitching method under the non-global-overlapping condition. We conducted extensive experiments on the open UDIS-D and MCOV-SLAM datasets and on our real-world dataset. Specifically, our SV-UDIS method achieves state-of-the-art performance on the UDIS-D dataset for 3, 4, and 5 image stitching tasks, with PSNR improvements of 9%, 17%, and 21%, and SSIM improvements of 8%, 18%, and 26%, respectively.

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Noam Razin,Zixuan Wang,Hubert Strauss,Stanley Wei,Jason D. Lee,Sanjeev Arora

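Task: Study how reward models affect optimization in Reinforcement Learning from Human Feedback (RLHF).

Motivation: Examine whether a reward model's accuracy is enough to measure how effective it is as a teacher, and analyze from an optimization perspective how reward variance affects the RLHF objective.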

Details

Method: Proves theoretically how the reward variance induced by a reward model affects optimization, and validates the theory experimentally. Result: Even an accurate reward model leads to very slow optimization of the RLHF objective if it induces low reward variance. Experiments confirm the interplay between reward variance, accuracy, and reward maximization rate. Conclusion: Beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization. Abstract: The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
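
The link between low reward variance and a flat objective can be seen in a toy setting. The sketch below is not the paper's theorem or experimental setup: it assumes a tabular softmax policy over four candidate responses with objective J(theta) = sum_i p_i r_i, for which dJ/dtheta_i = p_i (r_i - J) and the gradient norm is at most the square root of the reward variance under the policy. The two reward vectors are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward_variance(probs, rewards):
    """Variance of the reward under the policy's response distribution."""
    mean = sum(p * r for p, r in zip(probs, rewards))
    return sum(p * (r - mean) ** 2 for p, r in zip(probs, rewards))

def gradient_norm(probs, rewards):
    """Norm of dJ/dtheta for J = sum_i softmax(theta)_i * r_i."""
    mean = sum(p * r for p, r in zip(probs, rewards))
    return math.sqrt(sum((p * (r - mean)) ** 2 for p, r in zip(probs, rewards)))

probs = softmax([0.0, 0.0, 0.0, 0.0])  # uniform toy policy over four responses

# Both hypothetical reward models rank the responses identically (same ranking
# accuracy), but one spreads the rewards far more than the other.
high_variance_rewards = [0.0, 1.0, 2.0, 3.0]
low_variance_rewards = [0.00, 0.01, 0.02, 0.03]

for name, rewards in [("high", high_variance_rewards), ("low", low_variance_rewards)]:
    print(name, reward_variance(probs, rewards), gradient_norm(probs, rewards))
```

With identical rankings, the low-variance reward model yields a gradient one hundred times smaller in this toy case, which is the flat-landscape effect in miniature.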

USAM-Net: A U-Net-based Network for Improved Stereo Correspondence and Scene Depth Estimation using Features from a Pre-trained Image Segmentation network

Joseph Emmanuel DL Dayo,Prospero C. Naval Jr

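Task: Propose USAM-Net, a new convolutional neural network that improves depth estimation performance.

Motivation: Growing demand for high-accuracy depth estimation in autonomous driving and augmented reality calls for advanced neural architectures that can effectively exploit multiple data modalities.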

Details

Method: USAM-Net uses a dual-pathway architecture that combines a pre-trained segmentation model (SAM) with a depth estimation model, enhancing depth estimation by combining semantic segmentation maps with the stereo images. Result: Experiments on the DrivingStereo dataset show that USAM-Net outperforms traditional models such as CFNet, SegStereo, and iResNet in Global Difference (GD) and End-Point Error (EPE). Conclusion: By integrating segmentation information, USAM-Net performs strongly on stereo depth estimation and shows potential for applications that need high-precision depth data. Abstract: The increasing demand for high-accuracy depth estimation in autonomous driving and augmented reality applications necessitates advanced neural architectures capable of effectively leveraging multiple data modalities. In this context, we introduce the Unified Segmentation Attention Mechanism Network (USAM-Net), a novel convolutional neural network that integrates stereo image inputs with semantic segmentation maps and attention to enhance depth estimation performance. USAM-Net employs a dual-pathway architecture, which combines a pre-trained segmentation model (SAM) and a depth estimation model. The segmentation pathway preprocesses the stereo images to generate semantic masks, which are then concatenated with the stereo images as inputs to the depth estimation pathway. This integration allows the model to focus on important features such as object boundaries and surface textures which are crucial for accurate depth perception. Empirical evaluation on the DrivingStereo dataset demonstrates that USAM-Net achieves superior performance metrics, including a Global Difference (GD) of 3.61% and an End-Point Error (EPE) of 0.88, outperforming traditional models such as CFNet, SegStereo, and iResNet. These results underscore the effectiveness of integrating segmentation information into stereo depth estimation tasks, highlighting the potential of USAM-Net in applications demanding high-precision depth data.
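
End-Point Error is commonly defined as the mean absolute difference between predicted and ground-truth disparity over pixels that have ground truth. A minimal sketch under that definition follows; the flat-list representation, the convention that non-positive ground truth marks a missing pixel, and the sample values are assumptions, not details from the paper.

```python
def end_point_error(pred_disparity, gt_disparity):
    """Mean absolute disparity error over pixels with valid ground truth."""
    # Assumed convention: ground-truth values <= 0 mark pixels without a label.
    errors = [abs(p - g) for p, g in zip(pred_disparity, gt_disparity) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    return sum(errors) / len(errors)

# Illustrative disparities in pixels for four image locations.
pred = [10.0, 12.5, 30.0, 7.0]
gt = [11.0, 12.0, 0.0, 7.0]  # third pixel has no ground truth and is skipped
epe = end_point_error(pred, gt)
print(epe)
```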

TULIP: Towards Unified Language-Image Pretraining

Zineng Tang,Long Lian,Seun Eisape,XuDong Wang,Roei Herzig,Adam Yala,Alane Suhr,Trevor Darrell,David M. Chan

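Task: Propose TULIP to address the shortcomings of existing image-text contrastive models on vision-centric tasks.

Motivation: Existing image-text contrastive models (such as CLIP and SigLIP) perform poorly on tasks that require high-fidelity image understanding (such as counting, depth estimation, and fine-grained object recognition), while vision-focused models are limited on language tasks.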

Details

Method: TULIP learns fine-grained visual features while preserving global semantic alignment through generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization. Result: TULIP outperforms existing state-of-the-art models across multiple benchmarks, sets a new zero-shot SOTA on ImageNet-1K, delivers up to a 2× improvement over SigLIP on RxRx1 in linear probing for few-shot classification, and achieves over 3× higher scores than SigLIP on MMVP when used in vision-language models. Conclusion: By combining generative data augmentation with contrastive learning, TULIP substantially improves performance on image understanding and language-driven tasks, offering a new option for vision-language models. Abstract: Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. These models, by performing language alignment, tend to prioritize high-level semantics over visual understanding, weakening their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a 2× enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over 3× higher scores than SigLIP on MMVP. Our code/checkpoints are available at https://tulip-berkeley.github.io

Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching

Yang Liu,Wentao Feng,Zhuoyao Liu,Shudong Huang,Jiancheng Lv

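Task: Address multi-view description matching with a new visual semantic embedding method.

Motivation: The embeddings learned by existing methods have limited information capacity and are prone to interference from locally similar negative samples.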

Details

Method: Proposes Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE), which enhances the information capacity of sparse text through dense text distillation. Result: Extensive evaluation on MS-COCO and Flickr30K shows that D2S-VSE outperforms existing state-of-the-art methods. Conclusion: By enhancing the information capacity of sparse text, D2S-VSE effectively addresses multi-view description matching. Abstract: Enabling Visual Semantic Models to effectively handle multi-view description matching has been a longstanding challenge. Existing methods typically learn a set of embeddings to find the optimal match for each view's text and compute similarity. However, the visual and text embeddings learned through these approaches have limited information capacity and are prone to interference from locally similar negative samples. To address this issue, we argue that the information capacity of embeddings is crucial and propose Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE), which enhances the information capacity of sparse text by leveraging dense text distillation. Specifically, D2S-VSE is a two-stage framework. In the pre-training stage, we align images with dense text to enhance the information capacity of visual semantic embeddings. In the fine-tuning stage, we optimize two tasks simultaneously, distilling dense text embeddings to sparse text embeddings while aligning images and sparse texts, enhancing the information capacity of sparse text embeddings. Our proposed D2S-VSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.

Depth-Aware Range Image-Based Model for Point Cloud Segmentation

Bike Chen,Antti Tikanmäki,Juha Röning

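Task: Point cloud segmentation (PCS) aims to separate a point cloud into distinct, meaningful groups.

Motivation: Point cloud segmentation plays an important role in robotics because it lets robots understand their physical environment directly. Range image-based models are commonly adopted for sparse, large-scale outdoor point clouds, but they lack explicit depth information, so objects that are separate in 3D space touch each other in the range image, which makes segmentation harder. In addition, existing PCS models are usually derived from color image models and cannot fully exploit the implicit but ordered depth information inherent in range images, which leads to inferior performance.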

Details

Method: Proposes the Depth-Aware Module (DAM) and Fast FMVNet V3. DAM perceives the ordered depth information in the range image by explicitly modelling the interdependence among channels. Fast FMVNet V3 incorporates DAM into the last block of each architecture stage. Result: Extensive experiments on SemanticKITTI, nuScenes, and SemanticPOSS show that DAM brings a significant improvement to Fast FMVNet V3 at negligible computational cost. Conclusion: DAM and Fast FMVNet V3 effectively address the shortcomings of range image-based point cloud segmentation models in handling depth information and significantly improve segmentation performance. Abstract: Point cloud segmentation (PCS) aims to separate points into different and meaningful groups. The task plays an important role in robotics because PCS enables robots to understand their physical environments directly. To process sparse and large-scale outdoor point clouds in real time, range image-based models are commonly adopted. However, in a range image, the lack of explicit depth information inevitably causes some separate objects in 3D space to touch each other, bringing difficulty for the range image-based models in correctly segmenting the objects. Moreover, previous PCS models are usually derived from the existing color image-based models and are unable to make full use of the implicit but ordered depth information inherent in the range image, thereby achieving inferior performance. In this paper, we propose the Depth-Aware Module (DAM) and Fast FMVNet V3. DAM perceives the ordered depth information in the range image by explicitly modelling the interdependence among channels. Fast FMVNet V3 incorporates DAM by integrating it into the last block in each architecture stage. Extensive experiments conducted on SemanticKITTI, nuScenes, and SemanticPOSS demonstrate that DAM brings a significant improvement for Fast FMVNet V3 with negligible computational cost.

Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering

Thanh-Son Nguyen,Hong Yang,Tzeh Yuan Neoh,Hao Zhang,Ee Yeo Keat,Basura Fernando

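Task: Introduce a new video question-answering (VQA) dataset that challenges models to use procedural knowledge for complex reasoning.

Motivation: The task requires recognizing visual entities, generating hypotheses, and performing contextual, causal, and counterfactual reasoning.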

Details

Method: Proposes a neuro-symbolic reasoning module that integrates neural networks with LLM-driven constrained reasoning to generate interpretable answers. Result: Combining LLMs with structured knowledge reasoning enhances procedural reasoning on the STAR benchmark and on the authors' dataset. Conclusion: Code and dataset will be released at https://github.com/LUNAProject22/KML. Abstract: This paper introduces a new video question-answering (VQA) dataset that challenges models to leverage procedural knowledge for complex reasoning. It requires recognizing visual entities, generating hypotheses, and performing contextual, causal, and counterfactual reasoning. To address this, we propose a neuro-symbolic reasoning module that integrates neural networks and LLM-driven constrained reasoning over variables for interpretable answer generation. Results show that combining LLMs with structured, logic-based knowledge reasoning enhances procedural reasoning on the STAR benchmark and our dataset. Code and dataset will be available soon at https://github.com/LUNAProject22/KML.

Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning

Zixuan Zheng,Yilei Shi,Chunlei Li,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou

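Task: Study medical video segmentation in an extremely low-data regime, using annotations from only a few video frames together with existing labeled images.

Motivation: Reduce the need for dense frame annotations, which are costly and scarce in the medical domain.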

Details

Method: Proposes a two-phase framework: first learn a few-shot segmentation model from labeled images, then improve performance on medical videos with a spatiotemporal consistency relearning approach. Result: Experiments show that the method outperforms existing state-of-the-art few-shot segmentation methods. Conclusion: The model achieves strong video segmentation performance in the low-data regime, bridging the gap between abundant labeled medical images and scarce, sparsely labeled medical videos. Abstract: Few-shot video object segmentation aims to reduce annotation costs; however, existing methods still require abundant dense frame annotations for training, which are scarce in the medical domain. We investigate an extremely low-data regime that utilizes annotations from only a few video frames and leverages existing labeled images to minimize costly video annotations. Specifically, we propose a two-phase framework. First, we learn a few-shot segmentation model using labeled images. Subsequently, to improve performance without full supervision, we introduce a spatiotemporal consistency relearning approach on medical videos that enforces consistency between consecutive frames. Constraints are also enforced between the image model and relearning model at both feature and prediction levels. Experiments demonstrate the superiority of our approach over state-of-the-art few-shot segmentation methods. Our model bridges the gap between abundant annotated medical images and scarce, sparsely labeled medical videos to achieve strong video segmentation performance in this low data regime. Code is available at https://github.com/MedAITech/RAB.

Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition

Seungyeon Cho,Tae-Kyun Kim

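Task: Propose BHaRNet, a new framework for skeleton-based human action recognition with particular attention to subtle hand motions.

Motivation: Existing methods focus mainly on full-body movements and often overlook subtle hand motions, which are critical for distinguishing fine-grained actions.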

Details

Method: BHaRNet augments a typical body-expert model with a hand-expert model and uses joint training and cross-attention to fuse complementary information. Result: Across several large-scale benchmarks, BHaRNet reaches state-of-the-art accuracy with fewer computations and parameters, improving from 86.4% to 93.0% on hand-intensive actions. Conclusion: By combining body and hand expert models, BHaRNet effectively captures subtle hand motions and significantly improves action recognition accuracy. Abstract: Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human-robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet, these models often blur fine hand details due to the disparity between body and hand action characteristics and the loss of subtle features during spatial pooling. In this paper, we propose BHaRNet (Body-Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves SOTA accuracies, improving from 86.4% to 93.0% in hand-intensive actions, while maintaining fewer GFLOPs and parameters than the relevant unified methods.

Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models

Tingxiu Chen,Yilei Shi,Zixuan Zheng,Bingcong Yan,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou

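Task: Address the scarcity of ultrasound video datasets by synthesizing ultrasound videos.

Motivation: Publicly available ultrasound video datasets are scarce, which hinders the development of effective video classification models.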

Details

Method: Proposes a latent dynamic diffusion model (LDDM) that efficiently translates static images into dynamic sequences with realistic video characteristics. Result: Shows strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Training video classification models on a combination of real and LDDM-synthesized videos substantially outperforms training on real data alone. Conclusion: The image-to-video approach provides an effective data augmentation solution for advancing ultrasound video analysis. Abstract: Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we introduce a latent dynamic diffusion model (LDDM) to efficiently translate static images to dynamic sequences with realistic video characteristics. We demonstrate strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Notably, training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone, indicating our method successfully emulates dynamics critical for discrimination. Our image-to-video approach provides an effective data augmentation solution to advance ultrasound video analysis. Code is available at https://github.com/MedAITech/U_I2V.

Language-based Image Colorization: A Benchmark and Beyond

Yifan Li,Shuai Yang,Jiaying Liu

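Task: Provide a comprehensive review and benchmark of language-based image colorization methods.

Motivation: Because of color ambiguity, automatic colorization methods struggle to generate high-quality images and offer limited user controllability. Language-based methods use the efficiency and flexibility of text descriptions to guide colorization.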

Details

Method: First summarizes existing automatic colorization methods, then focuses on language-based methods and divides them into two categories: those that train a cross-modality network from scratch and those that use a pre-trained cross-modality model to establish textual-visual correspondence. Based on the limitations of existing methods, proposes a simple yet effective method based on a distilled diffusion model. Result: Extensive experiments show that the simple baseline produces better results than previous complex methods while running 14 times faster. Conclusion: This is the first comprehensive review and benchmark of language-based image colorization, providing meaningful insights for the field. Abstract: Image colorization aims to bring colors back to grayscale images. Automatic image colorization methods, which require no additional guidance, struggle to generate high-quality images due to color ambiguity, and provide limited user controllability. Thanks to the emergence of cross-modality datasets and models, language-based colorization methods are proposed to fully utilize the efficiency and flexibility of text descriptions to guide colorization. In view of the lack of a comprehensive review of language-based colorization literature, we conduct a thorough analysis and benchmarking. We first briefly summarize existing automatic colorization methods. Then, we focus on language-based methods and point out their core challenge on cross-modal alignment. We further divide these methods into two categories: one attempts to train a cross-modality network from scratch, while the other utilizes the pre-trained cross-modality model to establish the textual-visual correspondence. Based on the analyzed limitations of existing language-based methods, we propose a simple yet effective method based on a distilled diffusion model. Extensive experiments demonstrate that our simple baseline can produce better results than previous complex methods with a 14 times speedup. To the best of our knowledge, this is the first comprehensive review and benchmark on the language-based image colorization field, providing meaningful insights for the community. The code is available at https://github.com/lyf1212/Color-Turbo.

Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening

Zihan Cao,Yu Zhong,Liang-Jian Deng

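Task: Propose a one-step, high-quality pansharpening method based on the Optimal Transport Flow Matching (OTFM) framework.

Motivation: Diffusion models based on stochastic differential equations (SDEs) perform well, but their multi-step sampling imposes substantial computational overhead that limits practical use. Existing methods improve efficiency by reducing sampling steps, but often at the cost of fusion quality.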

Details

Method: Proposes the Optimal Transport Flow Matching (OTFM) framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. UOT relaxes marginal constraints to enhance modeling flexibility and accommodate the spectral and spatial disparities in remote sensing data. Result: Experiments show that OTFM matches or exceeds previous regression-based models and leading diffusion-based methods on multiple datasets while needing only one sampling step. Conclusion: OTFM enables simulation-free training and single-step inference while adhering to pansharpening constraints, significantly improving computational efficiency and maintaining high fusion quality. Abstract: Pansharpening, a pivotal task in remote sensing for fusing high-resolution panchromatic and multispectral imagery, has garnered significant research interest. Recent advancements employing diffusion models based on stochastic differential equations (SDEs) have demonstrated state-of-the-art performance. However, the inherent multi-step sampling process of SDEs imposes substantial computational overhead, hindering practical deployment. While existing methods adopt efficient samplers, knowledge distillation, or retraining to reduce sampling steps (e.g., from 1,000 to fewer steps), such approaches often compromise fusion quality. In this work, we propose the Optimal Transport Flow Matching (OTFM) framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Unlike conventional OT formulations that enforce rigid distribution alignment, UOT relaxes marginal constraints to enhance modeling flexibility, accommodating the intrinsic spectral and spatial disparities in remote sensing data. Furthermore, we incorporate task-specific regularization into the UOT objective, enhancing the robustness of the flow model. The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints. Experimental evaluations across multiple datasets demonstrate that OTFM matches or exceeds the performance of previous regression-based models and leading diffusion-based methods while only needing one sampling step. Code is available at https://github.com/294coder/PAN-OTFM.
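
Why straight transport paths make single-step sampling possible can be sketched with generic flow matching. The snippet assumes the common linear interpolation x_t = (1 - t) x0 + t x1 with target velocity x1 - x0, and replaces the learned network with an oracle velocity for one fixed pair; it does not implement the UOT coupling or any pansharpening constraint, and all names and values are illustrative.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for flow matching on a straight path: constant x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, num_steps=1):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy source and target vectors (hypothetical stand-ins for images).
source = [0.0, 1.0]
target = [2.0, -1.0]

def oracle_velocity(x, t):
    # Stand-in for a trained network: returns the exact straight-path velocity.
    return target_velocity(source, target)

one_step = euler_sample(source, oracle_velocity, num_steps=1)
ten_steps = euler_sample(source, oracle_velocity, num_steps=10)
print(one_step, ten_steps)
```

When the velocity field is constant along each path, one Euler step and ten Euler steps reach the same endpoint, so the extra steps buy nothing. Curved paths, as in SDE-based diffusion sampling, do not have this property.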

One-Shot Medical Video Object Segmentation via Temporal Contrastive Memory Networks

Yaxiong Chen,Junjian Hu,Chunlei Li,Zixuan Zheng,Jingliang Hu,Yilei Shi,Shengwu Xiong,Xiao Xiang Zhu,Lichao Mou

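Task: Introduce the task of one-shot medical video object segmentation, which separates foreground and background pixels throughout a video given only the mask annotation of the first frame.

Motivation: Analyzing and annotating medical video data is hampered by limited data availability and annotation cost, so an effective way to reduce the annotation burden is needed.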

Details

Method: Proposes a temporal contrastive memory network consisting of image and mask encoders that learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames and stores these features, and a decoder that fuses encoded image features with memory readouts for segmentation. Result: Extensive experiments show state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, demonstrating the ability to generalize from scarce labels. Conclusion: The method has the potential to reduce the annotation burden for medical video analysis; the code is publicly available. Abstract: Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart distant ones to explicitly model inter-frame relationships and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi-source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, showing the ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at https://github.com/MedAITech/TCMN.
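
The idea of aligning embeddings from adjacent frames while pushing apart distant ones can be sketched as an InfoNCE-style loss. This is an illustrative assumption about the form of the objective, not the loss used in TCMN; the function name, the `window` parameter, and the temperature value are hypothetical.

```python
import numpy as np

def temporal_contrastive_loss(frames, anchor, temperature=0.1, window=1):
    """InfoNCE-style loss for one anchor frame.

    frames: (T, D) array of per-frame embeddings.
    Frames within `window` steps of the anchor are positives; all other
    frames act as negatives (assumed definition of "adjacent"/"distant").
    """
    z = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    sims = z @ z[anchor] / temperature          # cosine similarity to the anchor
    idx = np.arange(len(z))
    others = idx != anchor
    positives = others & (np.abs(idx - anchor) <= window)
    log_denominator = np.log(np.exp(sims[others]).sum())
    return float(np.mean(log_denominator - sims[positives]))
```

The loss is small when neighbouring frames are the most similar ones to the anchor and grows when a distant frame is more similar than a neighbour.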

Semi-KAN: KAN Provides an Effective Representation for Semi-Supervised Learning in Medical Image Segmentation

Zanting Ye,Xiaolong Niu,Xuanbin Wu,Wenxiang Yi,Yuan Chang,Lijun Lu

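Task: Propose Semi-KAN, a semi-supervised medical image segmentation method based on Kolmogorov-Arnold Networks (KANs), to strengthen representation learning.

Motivation: Existing semi-supervised medical image segmentation methods typically rely on a single fixed activation function and linear modeling patterns, which limits their ability to learn robust representations.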

Details

Method: Proposes Semi-KAN, which integrates KANs at the encoder bottleneck and the top decoder layers of a U-Net pipeline to extract high-level semantic features, and designs a multi-branch U-Net architecture with uncertainty estimation. Result: Experiments on four public datasets show that Semi-KAN outperforms baseline networks while using fewer KAN layers and lower computational cost. Conclusion: KANs are a promising approach for semi-supervised medical image segmentation. Abstract: Deep learning-based medical image segmentation has shown remarkable success; however, it typically requires extensive pixel-level annotations, which are both expensive and time-intensive. Semi-supervised medical image segmentation (SSMIS) offers a viable alternative, driven by advancements in CNNs and ViTs. However, these networks often rely on single fixed activation functions and linear modeling patterns, limiting their ability to effectively learn robust representations. Given the limited availability of labeled data, achieving robust representation learning becomes crucial. Inspired by Kolmogorov-Arnold Networks (KANs), we propose Semi-KAN, which leverages the untapped potential of KANs to enhance backbone architectures for representation learning in SSMIS. Our findings indicate that: (1) compared to networks with fixed activation functions, KANs exhibit superior representation learning capabilities with fewer parameters, and (2) KANs excel in high-semantic feature spaces. Building on these insights, we integrate KANs into tokenized intermediate representations, applying them selectively at the encoder's bottleneck and the decoder's top layers within a U-Net pipeline to extract high-level semantic features. Although learnable activation functions improve feature expansion, they introduce significant computational overhead with only marginal performance gains. To mitigate this, we reduce the feature dimensions and employ horizontal scaling to capture multiple pattern representations. Furthermore, we design a multi-branch U-Net architecture with uncertainty estimation to effectively learn diverse pattern representations. Extensive experiments on four public datasets demonstrate that Semi-KAN surpasses baseline networks, utilizing fewer KAN layers and lower computational cost, thereby underscoring the potential of KANs as a promising approach for SSMIS.

Disentangling Modes and Interference in the Spectrogram of Multicomponent Signals

Kévin Polisano,Sylvain Meignen,Nils Laurent,Hubert Leterme

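Task: Investigate how the spectrogram of multicomponent signals can be decomposed into a mode part and an interference part.

Motivation: Explore two approaches to improve time-frequency analysis in the presence of strong interference.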

Details

Method: Decomposes the spectrogram using a variational method and a supervised learning approach based on a U-Net architecture. Result: Numerical experiments illustrate the advantages and limitations of both approaches for spectrogram decomposition. Conclusion: Both approaches show potential for improving time-frequency analysis under strong interference. Abstract: In this paper, we investigate how the spectrogram of multicomponent signals can be decomposed into a mode part and an interference part. We explore two approaches: (i) a variational method inspired by texture-geometry decomposition in image processing, and (ii) a supervised learning approach using a U-Net architecture, trained on a dataset encompassing diverse interference patterns and noise conditions. Once the interference component is identified, we explain how it enables us to define a criterion to locally adapt the window length used in the definition of the spectrogram, for the sake of improving ridge detection in the presence of close modes. Numerical experiments illustrate the advantages and limitations of both approaches for spectrogram decomposition, highlighting their potential for enhancing time-frequency analysis in the presence of strong interference.

TGV: Tabular Data-Guided Learning of Visual Cardiac Representations

Marta Hasny,Maxime Di Folco,Keno Bressem,Julia Schnabel

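Task: Propose a method that uses clinically relevant tabular data to identify distinct patient phenotypes and form more meaningful pairs in a contrastive learning framework.

Motivation: In medical imaging, the goal is often to compare entire patients with different phenotypes rather than multiple augmentations of a single scan.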

Details

Method: Uses tabular attributes to guide the training of visual representations without requiring a joint embedding space. Result: The method's strength is demonstrated on short-axis cardiac MR images and clinical attributes from the UK Biobank, where tabular data helps distinguish patient subgroups more effectively. On downstream tasks, including fine-tuning and zero-shot prediction of cardiovascular diseases and cardiac phenotypes, visual representations trained with tabular guidance outperform conventional methods that rely only on image augmentations or on joint image-tabular embeddings. Conclusion: Image encoders trained with tabular guidance embed demographic information in their representations, allowing them to use insights from tabular data for unimodal predictions at inference time, which suits real-world medical settings. Abstract: Contrastive learning methods in computer vision typically rely on different views of the same image to form pairs. However, in medical imaging, we often seek to compare entire patients with different phenotypes rather than just multiple augmentations of one scan. We propose harnessing clinically relevant tabular data to identify distinct patient phenotypes and form more meaningful pairs in a contrastive learning framework. Our method uses tabular attributes to guide the training of visual representations, without requiring a joint embedding space. We demonstrate its strength using short-axis cardiac MR images and clinical attributes from the UK Biobank, where tabular data helps to more effectively distinguish between patient subgroups. Evaluation on downstream tasks, including fine-tuning and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes, shows that incorporating tabular data yields stronger visual representations than conventional methods that rely solely on image augmentations or combined image-tabular embeddings. Furthermore, we demonstrate that image encoders trained with tabular guidance are capable of embedding demographic information in their representations, allowing them to use insights from tabular data for unimodal predictions, making them well-suited to real-world medical settings where extensive clinical annotations may not be routinely available at inference time. The code will be available on GitHub.
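
One way to picture how tabular attributes could guide pair formation is to pick, for each patient, the most similar other patients in standardized tabular space and treat their images as positives. The nearest-neighbour rule below is an assumption made for illustration; the abstract does not specify the selection rule, and the function name is hypothetical.

```python
import numpy as np

def tabular_positive_pairs(tabular, k=1):
    """Return, for each patient, the indices of the k most similar other patients.

    tabular: (N, F) array of clinical attributes (assumed numeric).
    Similarity is Euclidean distance after per-feature standardization.
    """
    x = (tabular - tabular.mean(axis=0)) / (tabular.std(axis=0) + 1e-8)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a patient is never paired with itself
    return np.argsort(dists, axis=1)[:, :k]
```

The returned indices could then replace the usual "two augmentations of the same image" positives in a contrastive loss, so that the image encoder never needs the tabular data at inference time.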

Low-Complexity Patch-based No-Reference Point Cloud Quality Metric exploiting Weighted Structure and Texture Features

Michael Neri,Federica Battisti

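Task: Propose PST-PCQA, a no-reference point cloud quality metric based on a low-complexity, learning-based framework.

Motivation: Compression, transmission, and rendering of point clouds introduce various artifacts that affect the quality perceived by the end user, yet evaluating the impact of these distortions on overall quality is challenging.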

Details

Method: Analyzes individual patches and integrates local and global features to predict the Mean Opinion Score. The process extracts features from patches, combines them, and uses correlation weights to predict the overall quality. Result: Experiments on three state-of-the-art datasets show good prediction capabilities for PST-PCQA, based on an analysis of different feature pooling strategies and of its ability to generalize across datasets. An ablation study confirms the benefit of evaluating quality patch by patch. Conclusion: The lightweight structure of PST-PCQA makes it suitable for real-time applications and devices with limited computational capacity. For reproducibility, code, model, and pretrained weights are available at https://github.com/michaelneri/PST-PCQA. Abstract: During the compression, transmission, and rendering of point clouds, various artifacts are introduced, affecting the quality perceived by the end user. However, evaluating the impact of these distortions on the overall quality is a challenging task. This study introduces PST-PCQA, a no-reference point cloud quality metric based on a low-complexity, learning-based framework. It evaluates point cloud quality by analyzing individual patches, integrating local and global features to predict the Mean Opinion Score. In summary, the process involves extracting features from patches, combining them, and using correlation weights to predict the overall quality. This approach allows us to assess point cloud quality without relying on a reference point cloud, making it particularly useful in scenarios where reference data is unavailable. Experimental tests on three state-of-the-art datasets show good prediction capabilities of PST-PCQA, through the analysis of different feature pooling strategies and its ability to generalize across different datasets. The ablation study confirms the benefits of evaluating quality on a patch-by-patch basis. Additionally, PST-PCQA's lightweight structure, with a small number of parameters to learn, makes it well-suited for real-time applications and devices with limited computational capacity. For reproducibility purposes, we made code, model, and pretrained weights available at https://github.com/michaelneri/PST-PCQA.
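
The final pooling step, in which per-patch predictions are combined with weights into one overall score, can be sketched as a weighted average. How PST-PCQA actually computes its correlation weights is not described in the abstract, so the weights here are simply given as inputs; the function name is hypothetical.

```python
def pooled_quality(patch_scores, patch_weights):
    """Weighted average of per-patch quality scores.

    patch_scores: predicted quality for each patch.
    patch_weights: non-negative importance of each patch (assumed given).
    """
    total = sum(patch_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(s * w for s, w in zip(patch_scores, patch_weights))
    return weighted / total
```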

Semantic Segmentation of Transparent and Opaque Drinking Glasses with the Help of Zero-shot Learning

Annalena Blänsdorf,Tristan Wirth,Arne Rak,Thomas Pöllabauer,Volker Knauthe,Arjan Kuijper

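Task: Propose TransCaGNet, a model for segmenting transparent structures in images.

Motivation: Transparent structures, such as everyday drinking glasses, are difficult to distinguish from the background in images.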

Details

Method: Replaces the segmentation backbone of CaGNet with the Trans4Trans architecture and uses zero-shot learning to segment glass categories not seen during training. Result: With multiple possible categories assigned to glasses, mean IoU and mean accuracy improve by up to 13.68% and 17.88% on the synthetic dataset, and by up to 5.55% and 5.72% on the real-world dataset. Conclusion: TransCaGNet performs well at segmenting transparent structures, and training on the synthetic dataset also improves results on the real-world dataset. Abstract: Segmenting transparent structures in images is challenging since they are difficult to distinguish from the background. Common examples are drinking glasses, which are a ubiquitous part of our lives and appear in many different shapes and sizes. In this work we propose TransCaGNet, a modified version of the zero-shot model CaGNet. We exchange the segmentation backbone with the architecture of Trans4Trans to be capable of segmenting transparent objects. Since some glasses are rarely captured, we use zero-shot learning to be able to create semantic segmentations of glass categories not given during training. We propose a novel synthetic dataset covering a diverse set of different environmental conditions. Additionally, we capture a real-world evaluation dataset since most applications take place in the real world. Comparing our model with ZegCLIP, we are able to show that TransCaGNet produces better mean IoU and accuracy values while ZegCLIP outperforms it mostly for unseen classes. To improve the segmentation results, we combine the semantic segmentation of the models with the segmentation results of SAM 2. Our evaluation emphasizes that distinguishing between different classes is challenging for the models due to similarity, points of view, or coverings. Taking this behavior into account, we assign glasses multiple possible categories. The modification leads to an improvement of up to 13.68% for the mean IoU and up to 17.88% for the mean accuracy values on the synthetic dataset. Using our difficult synthetic dataset for training, the models produce even better results on the real-world dataset. The mean IoU is improved by up to 5.55% and the mean accuracy by up to 5.72% on the real-world dataset.
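
The two reported metrics, mean IoU and mean accuracy, can be computed from a class confusion matrix. The sketch below uses the standard definitions, which is an assumption since the abstract does not spell out the exact evaluation protocol; it also assumes every class appears at least once in the ground truth.

```python
import numpy as np

def mean_iou_and_accuracy(confusion):
    """Compute mean IoU and mean per-class accuracy.

    confusion[i, j] = number of pixels of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    actual = confusion.sum(axis=1)      # pixels per true class
    predicted = confusion.sum(axis=0)   # pixels per predicted class
    iou = true_pos / (actual + predicted - true_pos)
    accuracy = true_pos / actual
    return iou.mean(), accuracy.mean()
```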

Universal Scene Graph Generation

Shengqiong Wu,Hao Fei,Tat-Seng Chua

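Task: Propose Universal Scene Graph (USG), a representation that comprehensively characterizes semantic scenes from multiple modality inputs.

Motivation: Current scene graph research is largely confined to single-modality scene modeling and cannot fully exploit the complementary strengths of scene graph representations from different modalities in depicting holistic scene semantics.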

Details

Method: Designs a niche-targeting USG parser (USG-Par) with a modular architecture for end-to-end USG generation. It includes an object associator that reduces the modality gap for cross-modal object alignment, and a text-centric scene contrastive learning mechanism that mitigates domain imbalance by aligning multimodal objects and relations with textual scene graphs. Result: Experiments show that USG expresses scene semantics more capably than standalone scene graphs, and that USG-Par achieves higher efficacy and performance. Conclusion: USG can comprehensively characterize semantic scenes from any given combination of modality inputs, and USG-Par effectively addresses the key bottlenecks of cross-modal object alignment and out-of-domain challenges. Abstract: Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained intensive research in SG generation. In the real world, multiple modalities often coexist, with different types, such as images, text, video, and 3D data, expressing distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing the full utilization of the complementary strengths of different modality SG representations in depicting holistic scene semantics. To this end, we introduce Universal SG (USG), a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks of cross-modal object alignment and out-of-domain challenges. We design the USG-Par with modular architecture for end-to-end USG generation, in which we devise an object associator to relieve the modality gap for cross-modal object alignment. Further, we propose a text-centric scene contrasting learning mechanism to mitigate domain imbalances by aligning multimodal objects and relations with textual SGs. Through extensive experiments, we demonstrate that USG offers a stronger capability for expressing scene semantics than standalone SGs, and also that our USG-Par achieves higher efficacy and performance.

Manifold Learning for Hyperspectral Images

Fethi Harkat,Tiphaine Deuberet,Guillaume Gey,Valérie Perrier,Kévin Polisano

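Task: Propose a method that approximates the dataset topology by constructing adjacency graphs, in order to improve the representation of X-ray transmission multi-energy images.

Motivation: Traditional feature extraction and projection techniques, such as Principal Component Analysis, represent X-ray transmission multi-energy images poorly, which limits the performance of neural networks in decision-making processes.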

Details

Method: Constructs adjacency graphs with Uniform Manifold Approximation and Projection (UMAP) to capture nonlinear correlations in the data. Result: The method significantly improves the performance of machine learning algorithms, particularly on hyperspectral images from X-ray transmission spectroscopy, and enhances feature separability. Conclusion: The method preserves the global structure of the data and leads to more accurate and robust classification. Abstract: Traditional feature extraction and projection techniques, such as Principal Component Analysis, struggle to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, limiting the performance of neural networks in decision-making processes. To address this issue, we propose a method that approximates the dataset topology by constructing adjacency graphs using the Uniform Manifold Approximation and Projection. This approach captures nonlinear correlations within the data, significantly improving the performance of machine learning algorithms, particularly in processing Hyperspectral Images (HSI) from X-ray transmission spectroscopy. This technique not only preserves the global structure of the data but also enhances feature separability, leading to more accurate and robust classification results.
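
The adjacency-graph idea can be illustrated with a plain k-nearest-neighbour graph, which is the starting point of UMAP's graph construction. This sketch is a simplification: UMAP itself builds a weighted fuzzy graph with locally adaptive distances, and none of that is reproduced here. The function name is hypothetical.

```python
import numpy as np

def knn_adjacency(points, k):
    """Build a symmetric, unweighted k-nearest-neighbour adjacency matrix.

    points: (N, D) array, e.g. one spectrum per pixel.
    """
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)               # exclude self-loops
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k closest points per row
    adjacency = np.zeros((len(points), len(points)), dtype=bool)
    rows = np.repeat(np.arange(len(points)), k)
    adjacency[rows, neighbors.ravel()] = True
    return adjacency | adjacency.T                # make the graph undirected
```

Because edges follow local neighbourhoods rather than global variance directions, such a graph can follow curved structures that a linear projection like PCA flattens.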

Exploiting Diffusion Prior for Real-World Image Dehazing with Unpaired Training

Yunwei Lan,Zhigao Cui,Chang Liu,Jialun Peng,Nian Wang,Xin Luo,Dong Liu

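Task: Develop an unpaired training framework for real-world image dehazing that exploits diffusion priors and physical priors.

Motivation: Current methods generalize poorly across real scenes, mainly because of limited feature representation and insufficient use of real-world priors.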

Details

Method: Proposes an unpaired framework named Diff-Dehazer, which uses diffusion priors as bijective mapping learners within CycleGAN and integrates physical priors to mine real-world knowledge. Result: Extensive experiments on multiple real-world datasets demonstrate the superior performance of the method. Conclusion: By combining diffusion priors and physical priors, Diff-Dehazer markedly improves real-world dehazing. Abstract: Unpaired training has been verified as one of the most effective paradigms for real scene dehazing by learning from unpaired real-world hazy and clear images. Although numerous studies have been proposed, current methods demonstrate limited generalization for various real scenes due to limited feature representation and insufficient use of real-world priors. Inspired by the strong generative capabilities of diffusion models in producing both hazy and clear images, we exploit diffusion prior for real-world image dehazing, and propose an unpaired framework named Diff-Dehazer. Specifically, we leverage diffusion prior as bijective mapping learners within the CycleGAN, a classic unpaired learning framework. Considering that physical priors contain pivotal statistics information of real-world data, we further excavate real-world knowledge by integrating physical priors into our framework. Furthermore, we introduce a new perspective for adequately leveraging the representation ability of diffusion models by removing degradation in image and text modalities, so as to improve the dehazing effect. Extensive experiments on multiple real-world datasets demonstrate the superior performance of our method. Our code is available at https://github.com/ywxjm/Diff-Dehazer.

Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Shengqiong Wu,Hao Fei,Jingkang Yang,Xiangtai Li,Juncheng Li,Hanwang Zhang,Tat-Seng Chua

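Task: Propose a new framework for 4D Panoptic Scene Graph (4D-PSG) generation that uses rich 2D visual scene annotations to enhance 4D scene learning.

Motivation: Current 4D-PSG research suffers from data scarcity and the resulting out-of-vocabulary problems, and the pipeline nature of the benchmark generation method leads to suboptimal performance.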

Details

Method: Introduces a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder, designs a chained scene graph inference mechanism, and proposes a 2D-to-4D visual scene transfer learning framework. Result: Extensive experiments on the benchmark data show that the method outperforms baseline models by a large margin. Conclusion: The proposed method effectively addresses data scarcity and out-of-vocabulary problems in 4D-PSG generation and substantially improves performance. Abstract: The recently introduced 4D Panoptic Scene Graph (4D-PSG) provides an advanced representation for comprehensively modeling the dynamic 4D visual real world. Unfortunately, current pioneering 4D-PSG research suffers severely from data scarcity, as well as the resulting out-of-vocabulary problems; also, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to infer accurate and comprehensive object and relation labels iteratively. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that we outperform baseline models by a large margin, highlighting the effectiveness of our method.

Saad Lahlali,Sandra Kara,Hejer Ammar,Florian Chabot,Nicolas Granger,Hervé Le Borgne,Quoc-Cuong Pham

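Task: Propose a new framework that uses 2D motion cues for multi-object discovery in 3D data.

Motivation: Object discovery has received wide attention in 2D image analysis but remains under-explored in 3D data, where existing approaches rely on 3D motion and face several challenges.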

Details

Method: Proposes two frameworks, DIOD-3D for multi-object discovery in 3D data and xMOD for cross-modal training, which combine 2D and 3D data and rely on 2D motion cues. Result: Extensive evaluation on a synthetic dataset (TRIP-PD) and real-world datasets (KITTI and Waymo) shows substantial improvement over the 2D object discovery state of the art, with F1@50 gains ranging from +8.7 to +15.1. Conclusion: The proposed method performs strongly on 3D object discovery, delivers substantial performance gains, and supports both RGB and point cloud inputs. Abstract: Object discovery, which refers to the task of localizing objects without human annotations, has gained significant attention in 2D image analysis. However, despite this growing interest, it remains under-explored in 3D data, where approaches rely exclusively on 3D motion, despite its several challenges. In this paper, we present a novel framework that leverages advances in 2D object discovery which are based on 2D motion to exploit the advantages of such motion cues being more flexible and generalizable and to bridge the gap between 2D and 3D modalities. Our primary contributions are twofold: (i) we introduce DIOD-3D, the first baseline for multi-object discovery in 3D data using 2D motion, incorporating scene completion as an auxiliary task to enable dense object localization from sparse input data; (ii) we develop xMOD, a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues. xMOD employs a teacher-student training paradigm across the two modalities to mitigate confirmation bias by leveraging the domain gap. During inference, the model supports both RGB-only and point cloud-only inputs. Additionally, we propose a late-fusion technique tailored to our pipeline that further enhances performance when both modalities are available at inference. We evaluate our approach extensively on synthetic (TRIP-PD) and challenging real-world datasets (KITTI and Waymo). Notably, our approach yields a substantial performance improvement compared with the 2D object discovery state-of-the-art on all datasets with gains ranging from +8.7 to +15.1 in F1@50 score. The code is available at https://github.com/CEA-LIST/xMOD
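
An F1 score at an IoU threshold of 0.5 can be sketched as follows. The example assumes axis-aligned boxes and greedy one-to-one matching; the paper's evaluation may instead use segmentation masks or a different matching rule, so treat this as an illustration of the metric's general shape. Both function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def f1_at_iou(predictions, ground_truth, threshold=0.5):
    """F1 score where a prediction counts as correct if it overlaps an
    unmatched ground-truth box with IoU >= threshold (greedy matching)."""
    unmatched = list(ground_truth)
    true_pos = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: box_iou(pred, gt), default=None)
        if best is not None and box_iou(pred, best) >= threshold:
            unmatched.remove(best)   # each ground-truth object matches once
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predictions)
    recall = true_pos / len(ground_truth)
    return 2 * precision * recall / (precision + recall)
```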

Bridging the Gap: Fusing CNNs and Transformers to Decode the Elegance of Handwritten Arabic Script

Chaouki Boufenar,Mehdi Ayoub Rabiai,Boualem Nadjib Zahaf,Khelil Rafik Ouaras

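Task: Propose a hybrid approach combining convolutional neural networks (CNNs) and Transformer architectures for handwritten Arabic script recognition.

Motivation: Handwritten Arabic script recognition is challenging because of the script's dynamic letter forms and contextual variations.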

Details

Method: Combines CNN and Transformer architectures, evaluates custom and fine-tuned models including EfficientNet-B7 and Vision Transformer (ViT-B16), and introduces an ensemble model based on confidence-based fusion. Result: On the IFN/ENIT dataset, the ensemble reaches 96.38% accuracy for letter classification and 97.22% for positional classification. Conclusion: The complementary nature of CNNs and Transformers shows their potential for Arabic handwriting recognition, advancing OCR systems and offering a scalable solution for real-world applications. Abstract: Handwritten Arabic script recognition is a challenging task due to the script's dynamic letter forms and contextual variations. This paper proposes a hybrid approach combining convolutional neural networks (CNNs) and Transformer-based architectures to address these complexities. We evaluated custom and fine-tuned models, including EfficientNet-B7 and Vision Transformer (ViT-B16), and introduced an ensemble model that leverages confidence-based fusion to integrate their strengths. Our ensemble achieves remarkable performance on the IFN/ENIT dataset, with 96.38% accuracy for letter classification and 97.22% for positional classification. The results highlight the complementary nature of CNNs and Transformers, demonstrating their combined potential for robust Arabic handwriting recognition. This work advances OCR systems, offering a scalable solution for real-world applications.
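
A simple form of confidence-based fusion keeps, for each sample, the prediction of whichever model is more confident. The abstract does not give the exact fusion rule, so the rule below is an assumption for illustration, and the function name is hypothetical.

```python
import numpy as np

def confidence_fusion(probs_a, probs_b):
    """Fuse two classifiers by trusting the more confident one per sample.

    probs_a, probs_b: (N, C) arrays of class probabilities, e.g. from a CNN
    and a Transformer. Returns an (N,) array of predicted class indices.
    """
    probs_a = np.asarray(probs_a, dtype=float)
    probs_b = np.asarray(probs_b, dtype=float)
    use_a = probs_a.max(axis=1) >= probs_b.max(axis=1)
    return np.where(use_a, probs_a.argmax(axis=1), probs_b.argmax(axis=1))
```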

Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

Jin Wang,Chenghui Lv,Xian Li,Shichao Dong,Huadong Li,Kelu Yao,Chao Li,Wenqi Shao,Ping Luo

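Task: Design a comprehensive benchmark suite to evaluate the ability of Large Vision Language Models (LVLMs) to detect forged media.

Motivation: The rapid development of AIGC has greatly increased the diversity of fake media, posing unprecedented threats to social security, politics, law, and other areas. Detecting this increasingly diverse malicious content calls for a comprehensive benchmark suite that assesses the capabilities of LVLMs.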

Details

Method: Presents Forensics-Bench, a forgery detection evaluation benchmark suite of 63,292 carefully curated multi-choice visual questions covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models. Result: A thorough evaluation of 22 open-source LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet) shows that Forensics-Bench poses significant challenges for comprehensive forgery detection. Conclusion: Forensics-Bench is expected to motivate the community to advance the frontier of LVLMs toward all-around forgery detectors in the AIGC era. Abstract: Recently, the rapid development of AIGC has significantly boosted the diversity of fake media spread on the Internet, posing unprecedented threats to social security, politics, law, etc. To detect the ever-increasingly diverse malicious fake media in the new era of AIGC, recent studies have proposed to exploit Large Vision Language Models (LVLMs) to design robust forgery detectors due to their impressive performance on a wide range of multimodal tasks. However, a comprehensive benchmark for assessing LVLMs' discerning capabilities on forgery media is still lacking. To fill this gap, we present Forensics-Bench, a new forgery detection evaluation benchmark suite to assess LVLMs across massive forgery detection tasks, requiring comprehensive recognition, location and reasoning capabilities on diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types and forgery models. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC. The deliverables will be updated at https://Forensics-Bench.github.io/.

Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation

Suhyeon Lee,Kwanyoung Kim,Jong Chul Ye

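Task: Propose Implicit Bridge Consistency Distillation (IBCD), a new framework for single-step bidirectional unpaired image translation.

Motivation: Methods based on diffusion models or Schrödinger bridges have not been widely adopted in real-world applications because of their iterative sampling.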

Details

Method: IBCD uses a diffusion implicit bridge model to connect PF-ODE trajectories and introduces two key improvements: 1) distribution matching for consistency distillation and 2) an adaptive weighting method based on distillation difficulty. Result: Experiments show that IBCD achieves state-of-the-art performance on benchmark datasets in a single generation step. Conclusion: The IBCD framework effectively removes the need for iterative sampling in unpaired image translation and achieves strong single-step performance. Abstract: Unpaired image-to-image translation has seen significant progress since the introduction of CycleGAN. However, methods based on diffusion models or Schrödinger bridges have yet to be widely adopted in real-world applications due to their iterative sampling nature. To address this challenge, we propose a novel framework, Implicit Bridge Consistency Distillation (IBCD), which enables single-step bidirectional unpaired translation without using adversarial loss. IBCD extends consistency distillation by using a diffusion implicit bridge model that connects PF-ODE trajectories between distributions. Additionally, we introduce two key improvements: 1) distribution matching for consistency distillation and 2) adaptive weighting method based on distillation difficulty. Experimental results demonstrate that IBCD achieves state-of-the-art performance on benchmark datasets in a single generation step. Project page available at https://hyn2028.github.io/project_page/IBCD/index.html

Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis

Imanol G. Estepa,Jesús M. Rodríguez-de-Vera,Ignacio Sarasúa,Bhalaji Nagarajan,Petia Radeva

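Task: Propose Sorcen, a new unified self-supervised learning framework that combines contrastive and reconstruction objectives to bridge representation learning and generative modeling.

Motivation: Existing unified self-supervised learning methods rely on semantic token reconstruction, which requires an external tokenizer and adds training overhead.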

Details

Method: Sorcen introduces a contrastive objective called "Echo Contrast", which uses the model's generative capabilities to produce an echo sample in the semantic token space and form the contrastive positive pair, with no need for additional image crops or augmentations. Result: Experiments on ImageNet-1k show that Sorcen outperforms the previous unified SSL state of the art on linear probing, unconditional image generation, few-shot learning, and transfer learning, while being 60.8% more efficient. Conclusion: Sorcen brings significant improvements to unified self-supervised learning models, particularly in linear probing and unconditional image generation. Abstract: While representation learning and generative modeling seek to understand visual data, unifying both domains remains unexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they rely solely on semantic token reconstruction, which requires an external tokenizer during training, introducing a significant overhead. In this work, we introduce Sorcen, a novel unified SSL framework, incorporating a synergic Contrastive-Reconstruction objective. Our Contrastive objective, "Echo Contrast", leverages the generative capabilities of Sorcen, eliminating the need for additional image crops or augmentations during training. Sorcen "generates" an echo sample in the semantic token space, forming the contrastive positive pair. Sorcen operates exclusively on precomputed tokens, eliminating the need for an online token transformation during training, thereby significantly reducing computational overhead. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Additionally, Sorcen surpasses previous single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.

MultiBARF: Integrating Imagery of Different Wavelength Regions by Using Neural Radiance Fields

Kana Kurata,Hitoshi Niigaki,Xiaojun Wu,Ryuichi Tanida

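Task: Develop MultiBARF, a method that simplifies data preparation for images from different sensors.

Motivation: Make data preparation easier for users unfamiliar with sensing and image processing by reducing the expertise it requires.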

Details

Method: MultiBARF replaces co-registration and geometric calibration by synthesizing pairs of two different sensor images and depth images, extending Bundle Adjusting Neural Radiance Fields (BARF), a deep neural network-based method. Result: Experiments show that MultiBARF superimposes two color channels of visible light and thermographic images on NeRF. Conclusion: MultiBARF effectively simplifies data preparation for images from different sensors, making it more accessible to non-expert users. Abstract: Optical sensor applications have become popular through digital transformation. Linking observed data to real-world locations and combining different image sensors is essential to make the applications practical and efficient. However, data preparation to try different sensor combinations requires high sensing and image processing expertise. To make data preparation easier for users unfamiliar with sensing and image processing, we have developed MultiBARF. This method replaces the co-registration and geometric calibration by synthesizing pairs of two different sensor images and depth images at assigned viewpoints. Our method extends Bundle Adjusting Neural Radiance Fields (BARF), a deep neural network-based novel view synthesis method, for the two imagers. Through experiments on visible light and thermographic images, we demonstrate that our method superimposes two color channels of those sensor images on NeRF.

An Investigation of Beam Density on LiDAR Object Detection Performance

Christoph Griesbacher,Christian Fruhwirth-Reisinger

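Task: Study 3D object detection for autonomous driving, focusing on how detection performance differs across LiDAR sensors with different beam densities.

Motivation: LiDAR sensors are essential for autonomous driving, but differences between training and inference data cause performance drops, especially when beam density differs.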

Details

Method: Evaluates different object detection architectures, including combined voxel- and point-based approaches, to study the effect of beam density on the domain gap. Result: Experiments show that combining voxel- and point-based approaches yields superior cross-domain performance, and that detectors trained on denser data are robust to beam density variations. Conclusion: Beam-density-induced domain gaps must be evaluated together with other domain shifts, and detectors trained on denser data are robust to beam density variations at inference. Abstract: Accurate 3D object detection is a critical component of autonomous driving, enabling vehicles to perceive their surroundings with precision and make informed decisions. LiDAR sensors, widely used for their ability to provide detailed 3D measurements, are key to achieving this capability. However, variations between training and inference data can cause significant performance drops when object detection models are employed in different sensor settings. One critical factor is beam density, as inference on sparse, cost-effective LiDAR sensors is often preferred in real-world applications. Despite previous work addressing the beam-density-induced domain gap, substantial knowledge gaps remain, particularly concerning dense 128-beam sensors in cross-domain scenarios. To gain a better understanding of the impact of beam density on domain gaps, we conduct a comprehensive investigation that includes an evaluation of different object detection architectures. Our architecture evaluation reveals that combining voxel- and point-based approaches yields superior cross-domain performance by leveraging the strengths of both representations. Building on these findings, we analyze beam-density-induced domain gaps and argue that these domain gaps must be evaluated in conjunction with other domain shifts. Contrary to conventional beliefs, our experiments reveal that detectors benefit from training on denser data and exhibit robustness to beam density variations during inference.
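
A common way to simulate a sparser sensor from dense LiDAR data is to keep only every n-th beam. Whether the authors obtain their lower beam densities this way is not stated in the abstract, so this is an assumed setup; the function name and the per-point `beam_ids` input are hypothetical.

```python
import numpy as np

def subsample_beams(points, beam_ids, keep_every):
    """Keep only the points recorded by every `keep_every`-th beam.

    points: (N, 3) array of xyz coordinates.
    beam_ids: (N,) integer array with the beam (ring) index of each point.
    keep_every=2 turns a 128-beam scan into a simulated 64-beam scan.
    """
    points = np.asarray(points)
    beam_ids = np.asarray(beam_ids)
    return points[beam_ids % keep_every == 0]
```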

When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning

Yang Liu,Qianqian Xu,Peisong Wen,Siran Dai,Qingming Huang

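Task: Propose T-CoRe, a self-supervised framework for video representation learning that addresses the uncertainty introduced by random temporal sampling and the insufficient information compression in pixel space.

Motivation: Existing Masked Video Modeling (MVM) methods have made significant progress on video tasks but still face two key challenges: 1) the uncertainty introduced by random temporal sampling makes model training harder; 2) insufficient information compression in pixel space hurts performance on downstream tasks.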

Details

Method: Proposes the T-CoRe framework, which includes a sandwich sampling strategy to reduce reconstruction uncertainty and adds an auxiliary branch to a self-distillation architecture to restore representations in the latent space. Result: T-CoRe performs strongly on multiple downstream tasks, demonstrating its effectiveness for video representation learning. Conclusion: By combining sandwich sampling with a self-distillation architecture, T-CoRe addresses both key challenges in video representation learning and achieves strong results on multiple downstream tasks. Abstract: The past decade has witnessed notable achievements in self-supervised learning for video tasks. Recent efforts typically adopt the Masked Video Modeling (MVM) paradigm, leading to significant progress on multiple video tasks. However, two critical challenges remain: 1) Without human annotations, the random temporal sampling introduces uncertainty, increasing the difficulty of model training. 2) Previous MVM methods primarily recover the masked patches in the pixel space, leading to insufficient information compression for downstream tasks. To address these challenges jointly, we propose a self-supervised framework that leverages Temporal Correspondence for video Representation learning (T-CoRe). For challenge 1), we propose a sandwich sampling strategy that selects two auxiliary frames to reduce reconstruction uncertainty in a two-side-squeezing manner. Addressing challenge 2), we introduce an auxiliary branch into a self-distillation architecture to restore representations in the latent space, generating high-level semantic representations enriched with temporal information. Experiments with T-CoRe consistently show superior performance across several downstream tasks, demonstrating its effectiveness for video representation learning. The code is available at https://github.com/yafeng19/T-CORE.
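
The sandwich sampling strategy, which selects two auxiliary frames that squeeze the target from both sides, can be sketched as index sampling. The uniform choice of frames and the `max_gap` parameter are assumptions for illustration and are not taken from the T-CoRe code.

```python
import random

def sandwich_sample(num_frames, max_gap, rng=random):
    """Pick a target frame plus one auxiliary frame on each side of it.

    Returns (before, target, after) with before < target < after and each
    auxiliary frame at most `max_gap` frames away from the target.
    """
    if num_frames < 3 or max_gap < 1:
        raise ValueError("need at least 3 frames and max_gap >= 1")
    target = rng.randrange(1, num_frames - 1)
    before = rng.randrange(max(0, target - max_gap), target)
    after = rng.randrange(target + 1, min(num_frames, target + max_gap + 1))
    return before, target, after
```

Compared with sampling a single random reference frame, bracketing the target with a past and a future frame constrains what the masked content can plausibly be.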

Distilling 3D distinctive local descriptors for 6D pose estimation

Amir Hamza,Andrea Caraffa,Davide Boscaini,Fabio Poiesi

Task: Train an efficient student model, through a knowledge distillation framework, to regress local descriptors from a GeDi teacher model.

Motivation: GeDi performs well in zero-shot 6D pose estimation, but its inference is computationally expensive, making it impractical for real-world applications.

Details

Method: Introduces a knowledge distillation framework comprising an efficient large-scale training procedure and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. Result: Validated on five BOP Benchmark datasets, the approach significantly reduces inference time while maintaining performance competitive with existing methods. Conclusion: The approach brings zero-shot 6D pose estimation closer to real-time feasibility. Abstract: Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process. Can we retain GeDi's effectiveness while significantly improving its efficiency? In this paper, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility. Project Website: https://tev-fbk.github.io/dGeDi/

GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation

Zinqin Huang,Gu Wang,Chenyangguang Zhang,Ruida Zhang,Xiu Li,Xiangyang Ji

Task: Propose a new category-level object pose estimation method that addresses the intra-class variation problem in geometry-guided pose regression based on NOCS maps.

Motivation: Existing RGBD-based methods rely on precise depth information, which limits their broader applicability. Among RGB-based methods, geometry-guided pose regression based on NOCS maps suffers from intra-class variation, leading to suboptimal results.

Details

Method: Proposes the Intra-class Variation-Free Consensus (IVFC) map and, by combining the strengths of the NOCS map and the IVFC map, develops the GIVEPose framework, which gradually eliminates intra-class variation. Result: Extensive evaluations on synthetic and real-world datasets show that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches. Conclusion: GIVEPose substantially improves category-level object pose estimation by addressing the intra-class variation problem. Abstract: Recent advances in RGBD-based category-level object pose estimation have been limited by their reliance on precise depth information, restricting their broader applicability. In response, RGB-based methods have been developed. Among these methods, geometry-guided pose regression that originated from instance-level tasks has demonstrated strong performance. However, we argue that the NOCS map is an inadequate intermediate representation for geometry-guided pose regression methods, as its many-to-one correspondence with category-level pose introduces redundant instance-specific information, resulting in suboptimal results. This paper identifies the intra-class variation problem inherent in pose regression based solely on the NOCS map and proposes the Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model. By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements Gradual Intra-class Variation Elimination for category-level object pose estimation. Extensive evaluations on both synthetic and real-world datasets demonstrate that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches, achieving substantial improvements in category-level object pose estimation. Our code is available at https://github.com/ziqin-h/GIVEPose.

Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation

Haoyu Ji,Bowen Chen,Weihong Ren,Wenze Huang,Zhihao Yang,Zhiyong Wang,Honghai Liu

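Task: Skeleton-based Temporal Action Segmentation (STAS), which aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements.

Motivation: Current STAS methods typically use spatio-temporal modeling to establish dependencies among joints and frames, and supervise frame-wise classification with one-hot encoding and cross-entropy loss. However, they overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements.

Details

Method: Proposes the Text-Derived Relational Graph-Enhanced Network (TRG-Net), which uses prior graphs generated by large language models (LLMs) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method combines Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method uses contrastive learning to regularize the absolute class distributions and Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. A Spatial-Aware Enhancement Processing (SAEP) method additionally applies random joint occlusion and axial rotation to improve spatial generalization. Result: Performance evaluations on four public datasets show that TRG-Net achieves state-of-the-art results. Conclusion: By introducing text-derived prior graphs and dynamic spatio-temporal fusion modeling, TRG-Net significantly improves skeleton-based temporal action segmentation. Abstract: Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLM) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results.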

VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention

Mingzhe Zheng,Yongqi Xu,Haojian Huang,Xuran Ma,Yexin Liu,Wenjie Shu,Yatian Pang,Feilong Tang,Qifeng Chen,Harry Yang,Ser-Nam Lim

Task: Automate multi-shot video generation from a single sentence.

Motivation: Existing video generation models excel at short clips but produce disjointed visual dynamics and fractured storylines when asked for cohesive multi-shot narratives. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content.

Details

Method: Proposes the VideoGen-of-Thought (VGoT) framework, which systematically addresses three core challenges (narrative fragmentation, visual inconsistency, and transition artifacts) through dynamic storyline modeling, identity-aware cross-shot propagation, and adjacent latent transition mechanisms. Result: Multi-shot videos generated by VGoT outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and 10x fewer manual adjustments. Conclusion: The VGoT framework addresses narrative coherence, visual consistency, and transition smoothness in multi-shot video generation, significantly improving the quality and practicality of the generated videos. Abstract: Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative Fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which first converts the user prompt into concise shot descriptions, then elaborates them into detailed, cinematic specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, HDR lighting), ensuring logical narrative progression with self-validation. (2) Visual Inconsistency: Existing approaches struggle with maintaining visual consistency across shots. Our identity-aware cross-shot propagation generates identity-preserving portrait (IPP) tokens that maintain character fidelity while allowing trait variations (expressions, aging) dictated by the storyline. (3) Transition Artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity.
VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and 10x fewer manual adjustments than alternatives.

Object-Centric Pretraining via Target Encoder Bootstrapping

Nikola Đukić,Tim Lebailly,Tinne Tuytelaars

Task: Propose OCEBO, a new self-distillation setup for training object-centric models from scratch, enabling unsupervised object discovery on real-world data.

Motivation: Existing object-centric representation learning methods rely on pretrained non-object-centric foundation models, whose features must remain frozen during training, which limits the performance object-centric models can attain.

Details

Method: Proposes OCEBO, which updates the target encoder as an exponential moving average of the object-centric model, thereby introducing object-centric inductive biases, and uses cross-view patch filtering to mitigate the slot collapse caused by random initialization of the target encoder. Result: When pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of models with frozen non-object-centric target encoders. Conclusion: Through self-distillation and target encoder bootstrapping, OCEBO achieves unsupervised object discovery on real-world data; the code and pretrained models are publicly available. Abstract: Object-centric representation learning has recently been successfully applied to real-world datasets. This success can be attributed to pretrained non-object-centric foundation models, whose features serve as reconstruction targets for slot attention. However, targets must remain frozen throughout the training, which sets an upper bound on the performance object-centric models can attain. Attempts to update the target encoder by bootstrapping result in large performance drops, which can be attributed to its lack of object-centric inductive biases, causing the object-centric model's encoder to drift away from representations useful as reconstruction targets. To address these limitations, we propose Object-CEntric Pretraining by Target Encoder BOotstrapping, a self-distillation setup for training object-centric models from scratch, on real-world data, for the first time ever. In OCEBO, the target encoder is updated as an exponential moving average of the object-centric model, thus explicitly being enriched with object-centric inductive biases introduced by slot attention while removing the upper bound on performance present in other models. We mitigate the slot collapse caused by random initialization of the target encoder by introducing a novel cross-view patch filtering approach that limits the supervision to sufficiently informative patches. When pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of object-centric models with frozen non-object-centric target encoders pretrained on hundreds of millions of images. The code and pretrained models are publicly available at https://github.com/djukicn/ocebo.
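
The exponential moving average (EMA) update of the target encoder mentioned above can be sketched in a few lines of plain Python. This is the generic EMA rule used in self-distillation setups, not OCEBO's actual code: the function name `ema_update`, the dict-of-lists weight layout, and the momentum value are assumptions made for illustration.

```python
def ema_update(target, online, momentum=0.99):
    """Return new target weights as an EMA of the online (student) weights.

    Generic rule: target <- momentum * target + (1 - momentum) * online.
    Weights are modelled as {name: list of floats} for illustration only.
    """
    if target.keys() != online.keys():
        raise ValueError("target and online must share parameter names")
    return {
        name: [
            momentum * t + (1.0 - momentum) * o
            for t, o in zip(target[name], online[name])
        ]
        for name in target
    }


if __name__ == "__main__":
    target = {"encoder.w": [0.0, 1.0]}
    online = {"encoder.w": [1.0, 1.0]}
    # With momentum 0.9 the target moves 10% of the way toward the online weights.
    print(ema_update(target, online, momentum=0.9))
```

A momentum close to 1 makes the target encoder change slowly, which is the usual way such setups keep the reconstruction targets stable while still letting them improve.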

PointSFDA: Source-free Domain Adaptation for Point Cloud Completion

Xing He,Zhe Zhu,Liangliang Nan,Honghua Chen,Jing Qin,Mingqiang Wei

Task: Propose PointSFDA, a source-free domain adaptation framework for point cloud completion.

Motivation: Conventional point cloud completion methods face significant challenges when applied to real-world scans, especially on out-of-distribution data.

Details

Method: Proposes a coarse-to-fine distillation solution that explicitly transfers the global geometry knowledge learned from the source dataset, together with a self-supervised partial-mask consistency training strategy that learns local geometry information in the target domain. Result: Experiments confirm that the method significantly improves cross-domain shape completion performance. Conclusion: PointSFDA improves point cloud completion performance without requiring access to source data. Abstract: Conventional methods for point cloud completion, typically trained on synthetic datasets, face significant challenges when applied to out-of-distribution real-world scans. In this paper, we propose an effective yet simple source-free domain adaptation framework for point cloud completion, termed PointSFDA. Unlike unsupervised domain adaptation that reduces the domain gap by directly leveraging labeled source data, PointSFDA uses only a pretrained source model and unlabeled target data for adaptation, avoiding the need for inaccessible source data in practical scenarios. Being the first source-free domain adaptation architecture for point cloud completion, our method offers two core contributions. First, we introduce a coarse-to-fine distillation solution to explicitly transfer the global geometry knowledge learned from the source dataset. Second, as noise may be introduced due to domain gaps, we propose a self-supervised partial-mask consistency training strategy to learn local geometry information in the target domain. Extensive experiments have validated that our method significantly improves the performance of state-of-the-art networks in cross-domain shape completion. Our code is available at https://github.com/Starak-x/PointSFDA.

ARC: Anchored Representation Clouds for High-Resolution INR Classification

Joost Luijmes,Alexander Gielisse,Roman Knyazhitskiy,Jan van Gemert

Task: Propose a new implicit neural representation (INR) architecture for image classification.

Motivation: Current INR image classification methods are demonstrated only on low-resolution data, are sensitive to image-space transformations, and lack mechanisms for local representation.

Details

Method: Proposes the ARC (Anchored Representation Clouds) architecture, which introduces spatial structure by explicitly anchoring local latent vectors in image space. Result: ARC achieves state-of-the-art implicit image classification on both low- and high-resolution images and is more robust to image-space translation. Conclusion: By introducing a local representation mechanism, ARC significantly improves INR-based image classification. Abstract: Implicit neural representations (INRs) encode signals in neural network weights as a memory-efficient representation, decoupling sampling resolution from the associated resource costs. Current INR image classification methods are demonstrated on low-resolution data and are sensitive to image-space transformations. We attribute these issues to the global, fully-connected MLP neural network architecture encoding of current INRs, which lack mechanisms for local representation: MLPs are sensitive to absolute image location and struggle with high-frequency details. We propose ARC: Anchored Representation Clouds, a novel INR architecture that explicitly anchors latent vectors locally in image-space. By introducing spatial structure to the latent vectors, ARC captures local image data which in our testing leads to state-of-the-art implicit image classification of both low- and high-resolution images and increased robustness against image-space translation. Code can be found at https://github.com/JLuij/anchored_representation_clouds.

UltraFlwr -- An Efficient Federated Medical and Surgical Object Detection Framework

Yang Li,Soumya Snigdha Kundu,Maxence Boels,Toktam Mahmoodi,Sebastien Ourselin,Tom Vercauteren,Prokar Dasgupta,Jonathan Shapey,Alejandro Granados

Task: Propose UltraFlwr, a federated learning framework for medical and surgical object detection, and design the YOLO-PA strategies to reduce communication overhead.

Motivation: Address the challenges that medical and surgical object detection faces in edge deployment, including limited high-quality annotated data, data sharing restrictions, and computational constraints.

Details

Method: Uses Federated Learning (FL) for decentralized model training and proposes the YOLO-PA strategies to reduce communication overhead. Result: YOLO-PA reduces communication overhead by up to 83% per round while performing comparably to Full Aggregation (FA) strategies. Conclusion: UltraFlwr enables efficient training and deployment on resource-constrained edge devices, making federated object detection more practical for time-critical and resource-constrained medical and surgical applications. Abstract: Object detection shows promise for medical and surgical applications such as cell counting and tool tracking. However, it faces multiple real-world edge deployment challenges including limited high-quality annotated data, data sharing restrictions, and computational constraints. In this work, we introduce UltraFlwr, a framework for federated medical and surgical object detection. By leveraging Federated Learning (FL), UltraFlwr enables decentralized model training across multiple sites without sharing raw data. To further enhance UltraFlwr's efficiency, we propose YOLO-PA, a set of novel Partial Aggregation (PA) strategies specifically designed for YOLO models in FL. YOLO-PA significantly reduces communication overhead by up to 83% per round while maintaining performance comparable to Full Aggregation (FA) strategies. Our extensive experiments on BCCD and m2cai16-tool-locations datasets demonstrate that YOLO-PA not only provides better client models compared to client-wise centralized training and FA strategies, but also facilitates efficient training and deployment across resource-constrained edge devices. Further, we also establish one of the first benchmarks in federated medical and surgical object detection. This paper advances the feasibility of training and deploying detection models on the edge, making federated object detection more practical for time-critical and resource-constrained medical and surgical applications. UltraFlwr is publicly available at https://github.com/KCL-BMEIS/UltraFlwr.
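
To make the idea of partial aggregation concrete, here is a minimal Python sketch of size-weighted federated averaging restricted to a subset of parameters. It is not UltraFlwr's or Flower's API: the function names, the name-prefix rule for choosing which parameters are shared, and the example prefixes `backbone.` and `head.` are all assumptions, and the abstract does not say which parts of the YOLO model YOLO-PA actually aggregates.

```python
def partial_aggregate(client_weights, client_sizes, shared_prefixes):
    """Size-weighted average of only the parameters whose names match a prefix.

    client_weights: list of {name: list of floats}, one dict per client.
    client_sizes:   number of training samples per client (averaging weights).
    shared_prefixes: hypothetical rule selecting which parameters are sent.
    """
    total = sum(client_sizes)
    prefixes = tuple(shared_prefixes)
    shared_names = [n for n in client_weights[0] if n.startswith(prefixes)]
    averaged = {}
    for name in shared_names:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * s for w, s in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


def apply_update(local_weights, averaged):
    """Overwrite the shared parameters locally; everything else stays client-specific."""
    merged = dict(local_weights)
    merged.update(averaged)
    return merged


if __name__ == "__main__":
    clients = [
        {"backbone.w": [0.0, 4.0], "head.w": [1.0]},
        {"backbone.w": [4.0, 0.0], "head.w": [9.0]},
    ]
    avg = partial_aggregate(clients, client_sizes=[1, 3], shared_prefixes=["backbone."])
    print(apply_update(clients[0], avg))
```

Only the averaged subset has to travel between clients and server each round, which is where a communication saving of the kind reported in the abstract would come from.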

Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU

Àlex Pujol Vidal,Sergio Escalera,Kamal Nasrollahi,Thomas B. Moeslund

Task: Study machine unlearning in hyperbolic contrastive learning, specifically selective concept removal achieved by adapting Alignment Calibration to the MERU model.

Motivation: Recent work has explored unlearning in Euclidean contrastive vision-language models, but the effectiveness of concept removal in hyperbolic space remains unexplored.

Details

Method: Uses systematic experiments and ablation studies to show the distinct advantages of hyperbolic geometry for concept removal, and introduces hyperbolic-specific components including entailment calibration and norm regularization. Result: Hyperbolic geometry performs well for concept removal, achieving near-perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Conclusion: Hyperbolic unlearning differs fundamentally from Euclidean approaches in that it reorganizes the semantic hierarchy; the findings advance machine unlearning techniques and provide insight into how geometric properties influence concept representation and removal in multimodal models. Abstract: Machine unlearning methods have become increasingly important for selective concept removal in large pre-trained models. While recent work has explored unlearning in Euclidean contrastive vision-language models, the effectiveness of concept removal in hyperbolic spaces remains unexplored. This paper investigates machine unlearning in hyperbolic contrastive learning by adapting Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Through systematic experiments and ablation studies, we demonstrate that hyperbolic geometry offers distinct advantages for concept removal, achieving near perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space. Comparative analysis with Euclidean models reveals fundamental differences in unlearning dynamics, with hyperbolic unlearning reorganizing the semantic hierarchy while Euclidean approaches merely disconnect cross-modal associations. These findings not only advance machine unlearning techniques but also provide insights into the geometric properties that influence concept representation and removal in multimodal models. Source code available at https://github.com/alex-pv01/HAC

3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation

Gyeongrok Oh,Sungjune Kim,Heeju Ko,Hyung-gun Chi,Jinkyu Kim,Dongwook Lee,Daehyun Ji,Sungjoon Choi,Sujin Jang,Sangpil Kim

Task: Improve view transformation quality in camera-based 3D occupancy prediction when voxel queries are low-resolution.

Motivation: Computational constraints and the practical need for real-time deployment require smaller query resolutions, which leads to information loss; rich visual details must therefore be encoded and preserved within limited query sizes.

Details

Method: Proposes ProtoOcc, a novel occupancy network that uses prototypes of clustered image segments in view transformation to enhance low-resolution context. Result: Experiments on the Occ3D and SemanticKITTI benchmarks demonstrate the effectiveness of the method, showing clear improvements over the baselines. Conclusion: ProtoOcc achieves performance competitive with the baselines even with voxel resolution reduced by 75%. Abstract: The resolution of voxel queries significantly influences the quality of view transformation in camera-based 3D occupancy prediction. However, computational constraints and the practical necessity for real-time deployment require smaller query resolutions, which inevitably leads to an information loss. Therefore, it is essential to encode and preserve rich visual details within limited query sizes while ensuring a comprehensive representation of 3D occupancy. To this end, we introduce ProtoOcc, a novel occupancy network that leverages prototypes of clustered image segments in view transformation to enhance low-resolution context. In particular, the mapping of 2D prototypes onto 3D voxel queries encodes high-level visual geometries and complements the loss of spatial information from reduced query resolutions. Additionally, we design a multi-perspective decoding strategy to efficiently disentangle the densely compressed visual cues into a high-dimensional 3D occupancy scene. Experimental results on both Occ3D and SemanticKITTI benchmarks demonstrate the effectiveness of the proposed method, showing clear improvements over the baselines. More importantly, ProtoOcc achieves competitive performance against the baselines even with 75% reduced voxel resolution.

Benchmarking Large Language Models for Handwritten Text Recognition

Giorgia Crosilla,Lukas Klic,Giovanni Colavizza

Task: Evaluate the performance of Multimodal Large Language Models (MLLMs) on Handwritten Text Recognition (HTR) and compare them with traditional models.

Motivation: Traditional HTR models require extensive manual annotation and often produce errors because layout and text processing are separated. MLLMs offer a general approach that needs no model-specific training.

Details

Method: Benchmarks a range of proprietary and open-source LLMs, evaluating their performance on modern and historical datasets and testing their ability to autonomously correct previously generated outputs. Result: Proprietary models, especially Claude 3.5 Sonnet, outperform open-source models in zero-shot settings. MLLMs perform very well on modern handwriting but favor English. Comparisons with Transkribus show no consistent advantage for either approach. LLMs show limited ability to autonomously correct errors in zero-shot transcriptions. Conclusion: MLLMs perform well on HTR, especially for modern handwriting, but their preference for English and limited self-correction ability still need improvement. Abstract: Traditional machine learning models for Handwritten Text Recognition (HTR) rely on supervised training, requiring extensive manual annotations, and often produce errors due to the separation between layout and text processing. In contrast, Multimodal Large Language Models (MLLMs) offer a general approach to recognizing diverse handwriting styles without the need for model-specific training. The study benchmarks various proprietary and open-source LLMs against Transkribus models, evaluating their performance on both modern and historical datasets written in English, French, German, and Italian. In addition, emphasis is placed on testing the models' ability to autonomously correct previously generated outputs. Findings indicate that proprietary models, especially Claude 3.5 Sonnet, outperform open-source alternatives in zero-shot settings. MLLMs achieve excellent results in recognizing modern handwriting and exhibit a preference for the English language due to their pre-training dataset composition. Comparisons with Transkribus show no consistent advantage for either approach. Moreover, LLMs demonstrate limited ability to autonomously correct errors in zero-shot transcriptions.

Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Feifei Li,Mi Zhang,Yiming Sun,Min Yang

Task: Propose Detect-and-Guide (DAG), a safe generation framework that detects and removes harmful content in text-to-image diffusion models.

Motivation: Existing post-hoc model intervention techniques, such as concept unlearning and safety guidance, disturb the sampling trajectory when erasing harmful concepts and operate in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation.

Details

Method: DAG uses the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during sampling. It first detects harmful concepts from noisy latents using cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. Result: Experiments on erasing sexual content show that DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance. Conclusion: DAG requires no fine-tuning of the diffusion model and therefore does not reduce generation diversity; it needs only a small annotated dataset to provide precise detection maps with generalizability and concept specificity. Abstract: Text-to-image diffusion models have achieved state-of-the-art results in synthesis tasks; however, there is a growing concern about their potential misuse in creating harmful content. To mitigate these risks, post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed. However, fine-tuning model weights or adapting the hidden states of the diffusion model operates in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation. These interventions severely affect the sampling trajectory when erasing harmful concepts from complex, multi-concept prompts, thus hindering their practical use in real-world settings. In this work, we propose the safe generation framework Detect-and-Guide (DAG), leveraging the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during the sampling process. DAG first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. The optimization only requires a small annotated dataset and can provide precise detection maps with generalizability and concept specificity. Moreover, DAG does not require fine-tuning of diffusion models, and therefore introduces no loss to their generation diversity. Experiments on erasing sexual content show that DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on multi-concept real-world prompts.

DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

Jiazhe Guo,Yikang Ding,Xiwu Chen,Shuo Chen,Bohan Li,Yingshuang Zou,Xiaoyang Lyu,Feiyang Tan,Xiaojuan Qi,Zhiheng Li,Hao Zhao

Task: Propose DiST-4D, a disentangled spatiotemporal diffusion framework for 4D driving scene generation that supports both temporal extrapolation and spatial novel view synthesis.

Motivation: Current generative models struggle to support temporal extrapolation and spatial novel view synthesis simultaneously without per-scene optimization. The key is to find an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis.

Details

Method: Proposes the DiST-4D framework, which uses metric depth as the core geometric representation and decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial novel view synthesis by training only on existing viewpoints while enforcing cycle consistency. Result: DiST-4D achieves state-of-the-art performance in temporal prediction and novel view synthesis, and competitive performance in planning-related evaluations. Conclusion: With its disentangled spatiotemporal diffusion framework, DiST-4D addresses temporal and spatial synthesis in 4D driving scene generation and performs strongly across several tasks. Abstract: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both reliable forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.

GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector

Zechuan Li,Hongshan Yu,Yihao Ding,Jinhao Qiao,Basim Azam,Naveed Akhtar

Task: Propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields.

Motivation: Occlusion and the lack of 3D information make it challenging to construct 3D features from multi-view 2D images.

Details

Method: Introduces a voxel optimization mechanism with embedded 3D positional information to fuse multi-view features, designs a double importance sampling scheme for the NeRF branch, and proposes an opacity optimization module that predicts voxel opacity precisely by enforcing multi-view consistency constraints. Result: Extensive experiments on ScanNet and ARKITScenes confirm that the model sets a new state of the art in NeRF-based multi-view 3D detection. Conclusion: The modules of GO-N3RDet form an end-to-end neural model that significantly improves the accuracy of multi-view 3D object detection. Abstract: We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields. The key to accurate 3D object detection is in effective voxel representation. However, due to occlusion and lack of 3D information, constructing 3D features from multi-view 2D images is challenging. Addressing that, we introduce a unique 3D positional information embedded voxel optimization mechanism to fuse multi-view features. To prioritize neural field reconstruction in object regions, we also devise a double importance sampling scheme for the NeRF branch of our detector. We additionally propose an opacity optimization module for precise voxel opacity prediction by enforcing multi-view consistency constraints. Moreover, to further improve voxel density consistency across multiple perspectives, we incorporate ray distance as a weighting factor to minimize cumulative ray errors. Our unique modules synergistically form an end-to-end neural model that establishes new state-of-the-art in NeRF-based multi-view 3D detection, verified with extensive experiments on ScanNet and ARKITScenes. Code will be available at https://github.com/ZechuanLi/GO-N3RDet.

CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification

Wenlong Yu,Qilong Wang,Chuang Liu,Dong Li,Qinghua Hu

Task: Propose a Chain-of-Explanation (CoE) approach that automates the construction of global concept explanation datasets and provides linguistic explanations of local decision-making processes.

Motivation: Current concept-based post-hoc explanation methods fall short in automatically constructing accurate and sufficient linguistic explanations for global concepts and local circuits. In particular, the polysemanticity of semantic Visual Concepts (VCs) severely impedes the interpretability of concepts and of deep vision models (DVMs).

Details

Method: Proposes CoE, which automates the decoding and description of VCs to construct global concept explanation datasets, designs a concept polysemanticity disentanglement and filtering mechanism to distinguish the most relevant concept atoms, and introduces Concept Polysemanticity Entropy (CPE) as a measure of model interpretability. Result: GPT-4o and human-based experiments confirm the effectiveness of CPE and the superiority of CoE, with an average absolute improvement of 36% in explainability scores. Conclusion: CoE addresses the shortcomings of current methods in automatically constructing linguistic explanations and significantly improves the interpretability of deep vision models. Abstract: Explainability is a critical factor influencing the wide deployment of deep vision models (DVMs). Concept-based post-hoc explanation methods can provide both global and local insights into model decisions. However, current methods in this field are too inflexible to automatically construct accurate and sufficient linguistic explanations for global concepts and local circuits. Particularly, the intrinsic polysemanticity in semantic Visual Concepts (VCs) impedes the interpretability of concepts and DVMs, which is underestimated severely. In this paper, we propose a Chain-of-Explanation (CoE) approach to address these issues. Specifically, CoE automates the decoding and description of VCs to construct global concept explanation datasets. Further, to alleviate the effect of polysemanticity on model explainability, we design a concept polysemanticity disentanglement and filtering mechanism to distinguish the most contextually relevant concept atoms. Besides, a Concept Polysemanticity Entropy (CPE), as a measure of model interpretability, is formulated to quantify the degree of concept uncertainty. The modeling of deterministic concepts is upgraded to uncertain concept atom distributions. Finally, CoE automatically enables linguistic local explanations of the decision-making process of DVMs by tracing the concept circuit. GPT-4o and human-based experiments demonstrate the effectiveness of CPE and the superiority of CoE, achieving an average absolute improvement of 36% in terms of explainability scores.

DEPT: Deep Extreme Point Tracing for Ultrasound Image Segmentation

Lei Shi,Xi Fang,Naiyu Wang,Junxing Zhang

Task: Propose Deep Extreme Point Tracing (DEPT) with the Feature-Guided Extreme Point Masking (FGEPM) algorithm for ultrasound image segmentation.

Motivation: Fully supervised learning requires extensive, labor-intensive annotation. Weakly supervised methods, particularly those using extreme points as supervisory signals, offer an effective way to address this.

Details

Method: Generates pseudo labels by identifying the lowest-cost path that connects all extreme points on a cost matrix derived from the feature map, and proposes an iterative training strategy that progressively refines the pseudo labels so the network keeps improving. Result: Experiments on two public datasets show that the method is effective: its performance approaches that of the fully supervised method and exceeds that of several existing weakly supervised methods. Conclusion: DEPT with FGEPM performs well on ultrasound image segmentation, reducing annotation effort while improving segmentation performance. Abstract: Automatic medical image segmentation plays a crucial role in computer aided diagnosis. However, fully supervised learning approaches often require extensive and labor-intensive annotation efforts. To address this challenge, weakly supervised learning methods, particularly those using extreme points as supervisory signals, have the potential to offer an effective solution. In this paper, we introduce Deep Extreme Point Tracing (DEPT) integrated with Feature-Guided Extreme Point Masking (FGEPM) algorithm for ultrasound image segmentation. Notably, our method generates pseudo labels by identifying the lowest-cost path that connects all extreme points on the feature map-based cost matrix. Additionally, an iterative training strategy is proposed to refine pseudo labels progressively, enabling continuous network improvement. Experimental results on two public datasets demonstrate the effectiveness of our proposed method. The performance of our method approaches that of the fully supervised method and outperforms several existing weakly supervised methods.
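
The pseudo-label step, finding a lowest-cost path through the extreme points on a cost matrix, can be sketched with Dijkstra's algorithm on a pixel grid. The sketch below is a generic shortest-path illustration, not the paper's implementation: the 4-connected neighborhood, the non-negative per-cell costs, the function names, and the choice to visit the extreme points in the given order and close the loop are all assumptions.

```python
import heapq


def lowest_cost_path(cost, start, goal):
    """Dijkstra on a 4-connected grid; path cost is the sum of visited cell costs.

    Assumes non-negative costs. `start` and `goal` are (row, col) tuples.
    """
    rows, cols = len(cost), len(cost[0])
    best = {start: cost[start[0]][start[1]]}
    parent = {}
    heap = [(best[start], start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best[node]:
            continue  # stale heap entry
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                cand = dist + cost[nr][nc]
                if cand < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = cand
                    parent[(nr, nc)] = node
                    heapq.heappush(heap, (cand, (nr, nc)))
    # Walk back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    path.reverse()
    return path


def trace_contour(cost, extreme_points):
    """Chain lowest-cost paths through the extreme points and close the loop."""
    contour = []
    n = len(extreme_points)
    for i in range(n):
        segment = lowest_cost_path(cost, extreme_points[i], extreme_points[(i + 1) % n])
        contour.extend(segment[:-1])  # drop the endpoint; the next segment starts there
    return contour


if __name__ == "__main__":
    cost = [
        [1, 9, 1],
        [1, 9, 1],
        [1, 1, 1],
    ]
    # The path detours around the expensive middle column.
    print(lowest_cost_path(cost, (0, 0), (0, 2)))
```

In a segmentation setting the cost would be low along likely object boundaries, so the closed contour returned by `trace_contour` could be filled to obtain a pseudo mask.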

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Hengrui Kang,Siwei Wen,Zichen Wen,Junyan Ye,Weijia Li,Peilin Feng,Baichuan Zhou,Bin Wang,Dahua Lin,Linfeng Zhang,Conghui He

Task: Introduce SynthScars, a high-quality and diverse synthetic image dataset, and develop LEGION, an image forgery analysis framework based on a multimodal large language model.

Motivation: Current synthetic image detection methods lack textual interpretability, and existing datasets are often outdated and lack fine-grained annotations.

Details

Method: Introduces the SynthScars dataset and proposes the LEGION framework, which integrates artifact detection, segmentation, and explanation. Result: LEGION outperforms existing methods across multiple benchmarks; on SynthScars in particular it surpasses the second-best method by 3.31% in mIoU and 7.75% in F1 score. Conclusion: LEGION not only improves the accuracy of synthetic image detection but can also guide the generation of higher-quality, more realistic images. Abstract: The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, and current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building upon this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, particularly surpassing the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. The code, model, and dataset will be released.

DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Ruowen Zhao,Junliang Ye,Zhengyi Wang,Guangce Liu,Yiwen Chen,Yikai Wang,Jun Zhu

Task: Propose DeepMesh, a framework for optimizing 3D mesh generation.

Motivation: Auto-regressive methods that generate structured meshes are constrained by limited face counts and mesh incompleteness.

Details

Method: Optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy that incorporates a novel tokenization algorithm along with improvements in data curation and processing; (2) the introduction of reinforcement learning into 3D mesh generation, achieving human preference alignment via Direct Preference Optimization (DPO). Result: Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in precision and quality. Conclusion: Through its pre-training strategy and reinforcement learning approach, DeepMesh significantly improves the precision and quality of 3D mesh generation. Abstract: Triangle meshes play a crucial role in 3D applications for efficient manipulation and rendering. While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). We design a scoring standard that combines human evaluation with 3D metrics to collect preference pairs for DPO, ensuring both visual appeal and geometric accuracy. Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in both precision and quality. Project page: https://zhaorw02.github.io/DeepMesh/
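For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO objective, written for scalar sequence log-probabilities. The argument names and the default `beta` are assumptions for illustration; the abstract does not say how DeepMesh computes log-probabilities over mesh tokens or which hyperparameters it uses.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the total log-probability of a full sequence under the
    trained policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree on both sequences the margin is zero and the loss equals log 2; it falls as the policy raises the preferred sample's likelihood relative to the reference.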

Challenges and Trends in Egocentric Vision: A Survey

Xiang Li,Heqian Qiu,Lanxiao Wang,Hanwen Zhang,Chenghao Qi,Linfeng Han,Huiyu Xiong,Hongliang Li

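Task: Provide a comprehensive survey of research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and grouping tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding.

Motivation: With the rapid development of artificial intelligence and wearable devices, egocentric vision understanding has emerged as a new and challenging research direction that is attracting wide attention from academia and industry.

Details

Method: Systematically analyzes the components of egocentric scenes and groups tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding. Explores the sub-tasks in each category in detail and summarizes the main challenges and trends in the field. Result: Summarizes high-quality egocentric vision datasets, offering valuable resources for future research. Conclusion: By summarizing the latest advances, the survey anticipates broad applications of egocentric vision in augmented reality, virtual reality, and embodied intelligence, and proposes future research directions based on recent developments. Abstract: With the rapid development of artificial intelligence technologies and wearable devices, egocentric vision understanding has emerged as a new and challenging research direction, gradually attracting widespread attention from both academia and industry. Egocentric vision captures visual and multimodal data through cameras or sensors worn on the human body, offering a unique perspective that simulates human visual experiences. This paper provides a comprehensive survey of the research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and categorizing the tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding. We explore in detail the sub-tasks within each category. We also summarize the main challenges and trends currently existing in the field. Furthermore, this paper presents an overview of high-quality egocentric vision datasets, offering valuable resources for future research. By summarizing the latest advancements, we anticipate the broad applications of egocentric vision technologies in fields such as augmented reality, virtual reality, and embodied intelligence, and propose future research directions based on the latest developments in the field.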

Teng-Fang Hsiao,Bo-Kai Ruan,Yi-Lun Wu,Tzu-Ling Lin,Hong-Han Shuai

Task: Propose a training-free text-and-image-to-image generation method (TF-TI2I) that improves image generation quality under complex multi-image instructions.

Motivation: Existing methods often use image inputs only partially, focusing on specific elements, or suffer a drop in generation quality when handling complex multi-image instructions.

Details

Method: Builds on the MM-DiT architecture, extracting a condensed visual representation from reference images, sharing information selectively through Reference Contextual Masking, and using a Winner-Takes-All module to prioritize the most relevant references. Result: The method performs robustly across various benchmarks, confirming its effectiveness on complex image-generation tasks. Conclusion: TF-TI2I improves image generation quality under complex multi-image instructions without additional training. Abstract: Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference Contextual Masking -- this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.

EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds

Yuanchao Yue,Hui Yuan,Qinglong Miao,Xiaolong Mao,Raouf Hamzaoui,Peter Eisert

Task: Address cross-modal data registration between 2D images and 3D point clouds.

Motivation: Cross-modal registration is widely used in autonomous driving and robotics. Accurate registration underpins multimodal sensor data fusion and improves the accuracy and reliability of perception systems.

Details

Method: Proposes a cross-modal registration method that uses edge information from the original point clouds and images. It retains key information by extracting edge points and edge pixels, and introduces an attention-based feature exchange block and an optimal matching layer to improve registration accuracy. Result: The method's accuracy is validated on the KITTI and nuScenes datasets, demonstrating state-of-the-art performance. Conclusion: The proposed method improves cross-modal registration accuracy while maintaining computational efficiency, addressing the precision loss and cross-modal disparities of existing approaches. Abstract: Cross-modal data registration has long been a critical task in computer vision, with extensive applications in autonomous driving and robotics. Accurate and robust registration methods are essential for aligning data from different modalities, forming the foundation for multimodal sensor data fusion and enhancing perception systems' accuracy and reliability. The registration task between 2D images captured by cameras and 3D point clouds captured by Light Detection and Ranging (LiDAR) sensors is usually treated as a visual pose estimation problem. High-dimensional feature similarities from different modalities are leveraged to identify pixel-point correspondences, followed by pose estimation techniques using least squares methods. However, existing approaches often resort to downsampling the original point cloud and image data due to computational constraints, inevitably leading to a loss in precision. Additionally, high-dimensional features extracted using different feature extractors from various modalities require specific techniques to mitigate cross-modal differences for effective matching. To address these challenges, we propose a method that uses edge information from the original point clouds and images for cross-modal registration. We retain crucial information from the original data by extracting edge points and pixels, enhancing registration accuracy while maintaining computational efficiency. The use of edge points and edge pixels allows us to introduce an attention-based feature exchange block to eliminate cross-modal disparities. Furthermore, we incorporate an optimal matching layer to improve correspondence identification. We validate the accuracy of our method on the KITTI and nuScenes datasets, demonstrating its state-of-the-art performance.

Yuanchao Yue,Zhengxin Li,Wei Zhang,Hui Yuan

Task: Propose a framework that projects point clouds into several 2D representations for matching with camera images, addressing cross-modal registration between LiDAR point clouds and camera images.

Motivation: Existing calibration between LiDAR point clouds and camera images is typically time-consuming and requires an external calibration board or specific environmental features. Cross-modal registration aligns the data directly without external calibration, but because of the domain gap between point clouds and images, existing methods rarely reach satisfactory registration accuracy while maintaining real-time performance.

Details

Method: Proposes a framework that projects point clouds into several 2D representations for matching with camera images, and introduces a multi-scale feature extraction network and a patch-to-pixel matching network to extract features effectively and provide stronger supervision. Result: Experiments on the KITTI and nuScenes datasets validate the model, which achieves real-time performance and high registration accuracy, exceeding 99% registration accuracy on KITTI. Conclusion: The proposed framework effectively addresses cross-modal registration between LiDAR point clouds and camera images, achieving high accuracy and real-time performance. Abstract: The primary requirement for cross-modal data fusion is the precise alignment of data from different sensors. However, the calibration between LiDAR point clouds and camera images is typically time-consuming and needs an external calibration board or specific environmental features. Cross-modal registration effectively solves this problem by aligning the data directly without requiring external calibration. However, due to the domain gap between the point cloud and the image, existing methods rarely achieve satisfactory registration accuracy while maintaining real-time performance. To address this issue, we propose a framework that projects point clouds into several 2D representations for matching with camera images, which not only leverages the geometric characteristics of LiDAR point clouds more effectively but also bridges the domain gap between the point cloud and image. Moreover, to tackle the challenges of cross-modal differences and the limited overlap between LiDAR point clouds and images in the image matching task, we introduce a multi-scale feature extraction network to effectively extract features from both camera images and the projection maps of the LiDAR point cloud. Additionally, we propose a patch-to-pixel matching network to provide more effective supervision and achieve higher accuracy. We validate the performance of our model through experiments on the KITTI and nuScenes datasets. Our network achieves real-time performance and extremely high registration accuracy. On the KITTI dataset, our model achieves a registration accuracy rate of over 99%.
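One simple way to turn a point cloud into a 2D representation is a pinhole projection to a sparse depth map. The sketch below illustrates that generic idea only: the abstract does not say which projections the authors use, and the function name, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the nearest-point rule are all assumptions made for illustration.

```python
def project_to_depth_map(points, fx, fy, cx, cy, width, height):
    """Project 3D points (camera frame, z forward) to a sparse depth map.

    Returns a height x width grid holding the nearest depth per pixel,
    or None where no point lands.
    """
    depth = [[None] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # behind the camera, not visible
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # Keep the closest point when several project to one pixel.
            if depth[v][u] is None or z < depth[v][u]:
                depth[v][u] = z
    return depth
```

A map like this can be fed to an image feature extractor alongside the camera frame, which is the general motivation for projecting before matching.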

Test-Time Backdoor Detection for Object Detection Models

Hangtao Zhang,Yichen Wang,Shihui Yan,Chenyu Zhu,Ziqi Zhou,Linshan Hou,Shengshan Hu,Minghui Li,Yanjun Zhang,Leo Yu Zhang

Task: Design a new method for detecting poisoned samples at test time in object detection models.

Motivation: Object detection models are vulnerable to backdoor attacks, and existing defenses fall short against complex attack effects.

Details

Method: Designs TRAnsformation Consistency Evaluation (TRACE), which applies foreground and background transformations to assess transformation consistency. Result: TRACE improves AUROC by 30% over existing defenses and resists adaptive attacks. Conclusion: TRACE performs well at detecting poisoned samples in object detection models and is robust. Abstract: Object detection models are vulnerable to backdoor attacks, where attackers poison a small subset of training samples by embedding a predefined trigger to manipulate prediction. Detecting poisoned samples (i.e., those containing triggers) at test time can prevent backdoor activation. However, unlike image classification tasks, the unique characteristics of object detection -- particularly its output of numerous objects -- pose fresh challenges for backdoor detection. The complex attack effects (e.g., "ghost" object emergence or "vanishing" object) further render current defenses fundamentally inadequate. To this end, we design TRAnsformation Consistency Evaluation (TRACE), a brand-new method for detecting poisoned samples at test time in object detection. Our journey begins with two intriguing observations: (1) poisoned samples exhibit significantly more consistent detection results than clean ones across varied backgrounds; (2) clean samples show higher detection consistency when introduced to different focal information. Based on these phenomena, TRACE applies foreground and background transformations to each test sample, then assesses transformation consistency by calculating the variance in object confidences. TRACE achieves black-box, universal backdoor detection, with extensive experiments showing a 30% improvement in AUROC over state-of-the-art defenses and resistance to adaptive attacks.
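The consistency signal described above, variance in object confidences across transformed copies of a test sample, can be sketched as follows. This is only an illustration of that one step: `confidence_variance` is a hypothetical helper, it assumes detections are already matched to the same objects across views, and the abstract does not specify how TRACE combines the foreground and background scores into a final decision.

```python
from statistics import pvariance


def confidence_variance(conf_per_view):
    """Mean per-object variance of detection confidence across views.

    conf_per_view: one list per transformed view, each holding the
    confidences of the same objects in the same order.
    """
    per_object = list(zip(*conf_per_view))  # regroup by object
    variances = [pvariance(confs) for confs in per_object]
    return sum(variances) / len(variances)
```

Per the abstract's first observation, an unusually low variance under background changes would point toward a poisoned sample, since a trigger keeps the detection stable regardless of context.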

DCA: Dividing and Conquering Amnesia in Incremental Object Detection

Aoting Zhang,Dongbao Yang,Chang Liu,Xiaopeng Hong,Miao Shang,Yu Zhou

Task: Incremental object detection (IOD) aims to train an object detector that can continuously localize and recognize novel classes while preserving its performance on previous classes.

Motivation: Existing methods have had some success by improving knowledge distillation and exemplar replay, but the intrinsic forgetting mechanisms remain underexplored. This paper investigates the causes of forgetting and finds a forgetting imbalance between localization and recognition in transformer-based IOD.

Details

Method: Proposes the Divide-and-Conquer Amnesia (DCA) strategy, which redesigns transformer-based IOD into a localization-then-recognition process. DCA maintains and transfers the localization ability while leaving the decoupled, fragile recognition to be handled separately. To reduce feature drift in recognition, it uses semantic knowledge from pre-trained language models to anchor class representations in a unified feature space across incremental tasks. Result: Extensive experiments show that the approach achieves state-of-the-art performance in long-term incremental scenarios. For example, under the four-step setting on MS-COCO, DCA improves the final AP by 6.9%. Conclusion: The DCA strategy effectively addresses the forgetting imbalance in transformer-based IOD and significantly improves performance in long-term incremental scenarios. Abstract: Incremental object detection (IOD) aims to cultivate an object detector that can continuously localize and recognize novel classes while preserving its performance on previous classes. Existing methods achieve certain success by improving knowledge distillation and exemplar replay for transformer-based detection frameworks, but the intrinsic forgetting mechanisms remain underexplored. In this paper, we dive into the cause of forgetting and discover forgetting imbalance between localization and recognition in transformer-based IOD, which means that localization is less-forgetting and can generalize to future classes, whereas catastrophic forgetting occurs primarily on recognition. Based on these insights, we propose a Divide-and-Conquer Amnesia (DCA) strategy, which redesigns the transformer-based IOD into a localization-then-recognition process. DCA can well maintain and transfer the localization ability, leaving decoupled fragile recognition to be specially conquered. To reduce feature drift in recognition, we leverage semantic knowledge encoded in pre-trained language models to anchor class representations within a unified feature space across incremental tasks. This involves designing a duplex classifier fusion and embedding class semantic features into the recognition decoding process in the form of queries. Extensive experiments validate that our approach achieves state-of-the-art performance, especially for long-term incremental scenarios. For example, under the four-step setting on MS-COCO, our DCA strategy significantly improves the final AP by 6.9%.

SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes

Weixiao Gao,Liangliang Nan,Hugo Ledoux

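Task: Introduce and evaluate SUM Parts, a large-scale dataset for urban textured meshes with part-level semantic labels.

Motivation: Semantic segmentation in urban scene analysis has focused mainly on images or point clouds, while textured meshes, which offer a richer spatial representation, remain underexplored.

Details

Method: Creates an urban textured mesh dataset covering about 2.5 km² with 21 classes, annotated with an in-house annotation tool that supports both face-based and texture-based annotation with efficient interactive selection. Result: Provides a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods. Conclusion: The SUM Parts dataset fills a gap in semantic segmentation of urban textured meshes and offers a valuable resource for related research. Abstract: Semantic segmentation in urban scene analysis has mainly focused on images or point clouds, while textured meshes - offering richer spatial representation - remain underexplored. This paper introduces SUM Parts, the first large-scale dataset for urban textured meshes with part-level semantic labels, covering about 2.5 km² with 21 classes. The dataset was created using our own annotation tool, which supports both face- and texture-based annotations with efficient interactive selection. We also provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods on this dataset. Our project page is available at https://tudelft3d.github.io/SUMParts/.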

Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport

Hao Tan,Zichang Tan,Jun Li,Ajian Liu,Jun Wan,Zhen Lei

Task: Address the loss of local semantics and the matching of irrelevant regions in open-vocabulary multi-label recognition.

Motivation: Existing vision-language models such as CLIP fall short in local semantics and region matching, which leads to unreliable predictions.

Details

Method: Proposes the RAM framework, comprising the Ladder Local Adapter (LLA) to recover local semantics and Knowledge-Constrained Optimal Transport (KCOT) to suppress matching to irrelevant regions. Result: RAM achieves state-of-the-art performance on multiple datasets and shows potential to boost existing methods. Conclusion: The RAM framework effectively addresses the key problems in open-vocabulary multi-label recognition and has broad application prospects. Abstract: Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce refocusing on local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains, and shows great potential to boost the existing methods. Code: https://github.com/EricTan7/RAM.
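To make the "optimal transport problem" framing concrete, here is a minimal sketch of plain entropy-regularized optimal transport solved with Sinkhorn iterations, matching rows (for example, image regions) to columns (for example, candidate labels). It is not KCOT: the knowledge constraints described in the paper are not reproduced, and the cost matrix, marginals, `eps`, and iteration count are illustrative assumptions.

```python
import math


def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic OT plan between marginals a (rows) and b (columns).

    cost[i][j] is the cost of matching row i to column j; a and b are
    non-negative weights that each sum to 1.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Unlike average pooling, the resulting plan assigns each region's mass preferentially to low-cost labels, so regions that match nothing well contribute little to any single label.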

TruthLens: A Training-Free Paradigm for DeepFake Detection

Ritabrata Chakraborty,Rajatsubhra Chakraborty,Ali Khaleghi Rahimian,Thomas MacDougall

Task: Propose TruthLens, a novel training-free framework for deepfake detection that recasts the problem as a visual question-answering task.

Motivation: Current fake image detection methods rely mainly on binary classification models that focus on accuracy and often neglect interpretability, leaving users without a clear understanding of why an image is judged real or fake.

Details

Method: TruthLens uses state-of-the-art large vision-language models (LVLMs) to observe and describe visual artifacts, and combines this with the reasoning capabilities of large language models (LLMs) such as GPT-4 to analyze and aggregate evidence into informed decisions. Result: Extensive evaluations show that TruthLens achieves high accuracy on challenging datasets while maintaining strong interpretability. Conclusion: By reframing deepfake detection as a reasoning-driven process, TruthLens establishes a new paradigm for combating synthetic media, combining strong performance with interpretability to address the growing threat of visual disinformation. Abstract: The proliferation of synthetic images generated by advanced AI models poses significant challenges in identifying and understanding manipulated visual content. Current fake image detection methods predominantly rely on binary classification models that focus on accuracy while often neglecting interpretability, leaving users without clear insights into why an image is deemed real or fake. To bridge this gap, we introduce TruthLens, a novel training-free framework that reimagines deepfake detection as a visual question-answering (VQA) task. TruthLens utilizes state-of-the-art large vision-language models (LVLMs) to observe and describe visual artifacts and combines this with the reasoning capabilities of large language models (LLMs) like GPT-4 to analyze and aggregate evidence into informed decisions. By adopting a multimodal approach, TruthLens seamlessly integrates visual and semantic reasoning to not only classify images as real or fake but also provide interpretable explanations for its decisions. This transparency enhances trust and provides valuable insights into the artifacts that signal synthetic content. Extensive evaluations demonstrate that TruthLens outperforms conventional methods, achieving high accuracy on challenging datasets while maintaining a strong emphasis on explainability. By reframing deepfake detection as a reasoning-driven process, TruthLens establishes a new paradigm in combating synthetic media, combining cutting-edge performance with interpretability to address the growing threats of visual disinformation.

Boosting HDR Image Reconstruction via Semantic Knowledge Transfer

Qingsen Yan,Tao Hu,Genggeng Chen,Wei Dong,Yanning Zhang

Task: Recover High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images, particularly when the LDR images exhibit noticeable degradation and missing content.

Motivation: Scene-specific semantic priors can help restore heavily degraded regions, but because these priors are typically extracted from sRGB Standard Dynamic Range (SDR) images, the domain/format gap poses a significant challenge when they are applied to HDR imaging.

Details

Method: Proposes a general framework that transfers semantic knowledge from the SDR domain to existing HDR reconstruction via self-distillation. It introduces the Semantic Priors Guided Reconstruction Model (SPGRM) and a self-distillation mechanism, and uses a semantic knowledge alignment module (SKAM) to fill in missing semantic content. Result: Experiments show that the method significantly improves the HDR imaging quality of existing methods. Conclusion: Through self-distillation and semantic knowledge alignment, the proposed framework improves HDR reconstruction quality, especially for LDR images with degradation and missing content. Abstract: Recovering High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images becomes challenging when the LDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, these priors are typically extracted from sRGB Standard Dynamic Range (SDR) images, and the domain/format gap poses a significant challenge when applying them to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from the SDR domain via self-distillation to boost existing HDR reconstruction. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs between the baseline and SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a semantic knowledge alignment module (SKAM) to fill the missing semantic contents with the complementary masks. Extensive experiments demonstrate that our method can significantly improve the HDR imaging quality of existing methods.

EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models

Yinan Liang,Ziwei Wang,Xiuwei Xu,Jie Zhou,Jiwen Lu

Task: Propose an automatic pruning method that improves the efficiency of multimodal reasoning.

Motivation: Multimodal large language models perform well on complex reasoning tasks, but their model complexity makes deployment on resource-limited devices challenging.

Details

Method: Searches for a pruning policy using only a small number of samples, maximizing the policy's generalization ability on unknown training data while maintaining model accuracy, to reach a favorable trade-off between accuracy and efficiency for large vision-language models. Result: Using only 64 samples for the pruning policy search, EfficientLLaVA reaches 83.05% accuracy on ScienceQA with a 1.8x speedup over the dense LLaVA-v1.5-7B model. Conclusion: The proposed automatic pruning method significantly improves efficiency while maintaining model accuracy, making it suitable for resource-limited devices. Abstract: While multimodal large language models demonstrate strong performance in complex reasoning tasks, they pose significant challenges related to model complexity during deployment, especially for resource-limited devices. In this paper, we propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Conventional methods rely on the training data of the original model to select the proper pruning ratio for different network components. However, these methods are impractical for large vision-language models due to the unaffordable search costs caused by web-scale training corpora. In contrast, our approach only leverages a small number of samples to search for the desired pruning policy by maximizing its generalization ability on unknown training data while maintaining the model accuracy, which enables the achievement of an optimal trade-off between accuracy and efficiency for large vision-language models. Specifically, we formulate the generalization gap of the pruning strategy using the structural risk minimization principle. Based on both task performance and generalization capability, we iteratively search for the optimal pruning policy within a given search space and optimize the vision projector to evolve the search space with a higher upper bound of performance. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering. Using only 64 samples for pruning policy search, EfficientLLaVA achieves an accuracy of 83.05% on ScienceQA, along with a 1.8$\times$ speedup compared to the dense LLaVA-v1.5-7B model.

Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement

Yuchen Ren,Zhengyu Zhao,Chenhao Lin,Bo Yang,Lu Zhou,Zhe Liu,Chao Shen

Task: Study forward propagation refinement of Vision Transformers (ViTs) for transferable adversarial examples.

Motivation: Transferable adversarial examples on ViTs are studied to gain insight into their robustness in practical scenarios.

Details

Method: Proposes Forward Propagation Refinement (FPR), consisting of Attention Map Diversification (AMD) and Momentum Token Embedding (MTE). Result: Experiments show that FPR outperforms the current best backward-propagation refinement in adversarial transferability by up to 7.0% on average. Conclusion: FPR performs strongly in adversarial transfer, remains effective against popular defenses, and is compatible with other transfer methods. Abstract: Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is by refining the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement by up to 7.0% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods. Codes and appendix are available at https://github.com/RYC-98/FPR.

Visual Persona: Foundation Model for Full-Body Human Customization

Jisu Nam,Soowon Son,Zhan Xu,Jing Shi,Difan Liu,Feng Liu,Aashish Misraa,Seungryong Kim,Yang Zhou

Task: Develop a foundation model that generates customized full-body human images guided by text descriptions.

Motivation: Existing methods focus mainly on preserving facial identity and neglect the details of full-body appearance. This work aims to capture detailed full-body appearance while aligning with text descriptions of body structure and scene variations.

Details

Method: Proposes a data curation pipeline that uses vision-language models to evaluate full-body appearance consistency, yielding a dataset of 580k paired images. Adopts a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which splits the input image into distinct body regions, encodes them as local appearance features, and projects them into dense identity embeddings to generate customized images. Result: Visual Persona outperforms existing methods at generating high-quality customized images, and extensive ablation studies validate the design choices. Conclusion: Visual Persona demonstrates its versatility across various downstream tasks and generates high-quality customized images from in-the-wild inputs. Abstract: We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which augments the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.

Learn Your Scales: Towards Scale-Consistent Generative Novel View Synthesis

Fereshteh Forghani,Jason J. Yu,Tristan Aumentado-Armstrong,Konstantinos G. Derpanis,Marcus A. Brubaker

Task: Study and address the effect of scene scale ambiguity in generative novel view synthesis (GNVS) methods.

Motivation: Conventional multi-view datasets captured with a moving monocular camera have ambiguous scale. Previous methods acknowledged this through various ad-hoc normalization pre-processing steps but did not directly analyze the effect of incorrect scene scales on their application.

Details

Method: Studies the effect of scene scale ambiguity on GNVS models when sampling from a single image and, based on these observations, defines new metrics that measure the scale inconsistency of generated views. Proposes an end-to-end framework that estimates scene scales jointly with the GNVS model. Result: Experiments show that the method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods. Conclusion: Removing scale ambiguity improves the generated image quality of GNVS models. Abstract: Conventional depth-free multi-view datasets are captured using a moving monocular camera without metric calibration. The scales of camera positions in this monocular setting are ambiguous. Previous methods have acknowledged scale ambiguity in multi-view data via various ad-hoc normalization pre-processing steps, but have not directly analyzed the effect of incorrect scene scales on their application. In this paper, we seek to understand and address the effect of scale ambiguity when used to train generative novel view synthesis methods (GNVS). In GNVS, new views of a scene or object can be minimally synthesized given a single image and are, thus, unconstrained, necessitating the use of generative methods. The generative nature of these models captures all aspects of uncertainty, including any uncertainty of scene scales, which act as nuisance variables for the task. We study the effect of scene scale ambiguity in GNVS when sampled from a single image by isolating its effect on the resulting models and, based on these intuitions, define new metrics that measure the scale inconsistency of generated views. We then propose a framework to estimate scene scales jointly with the GNVS model in an end-to-end fashion. Empirically, we show that our method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods. Further, we show that removing this ambiguity improves generated image quality of the resulting GNVS model.
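The scale ambiguity at issue follows directly from pinhole geometry: scaling the whole scene and the camera positions by the same factor leaves every image unchanged. The sketch below demonstrates this with a translation-only camera; the helper names and the focal length are illustrative assumptions and are not taken from the paper.

```python
def project(point, cam_position, focal=1.0):
    """Pinhole projection of a 3D point seen from a translated camera.

    The camera has no rotation and looks along +z (a simplifying assumption).
    """
    x, y, z = (p - c for p, c in zip(point, cam_position))
    return (focal * x / z, focal * y / z)


def scale(vec, s):
    """Uniformly scale a 3D vector."""
    return tuple(s * v for v in vec)
```

Because `project(scale(P, s), scale(C, s))` equals `project(P, C)` for any positive `s`, images alone cannot reveal the metric scale of the camera motion, which is why multi-view datasets need either normalization or an estimated per-scene scale.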

Automated Processing of eXplainable Artificial Intelligence Outputs in Deep Learning Models for Fault Diagnostics of Large Infrastructures

Giovanni Floreale,Piero Baraldi,Enrico Zio,Olga Fink

Task: Propose a novel framework that combines post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations, reducing the workload of maintenance decision-makers.

Motivation: Deep learning models that process images to recognize the health state of large infrastructure components can exhibit biases and rely on non-causal shortcuts, and manually analyzing the explanations produced by XAI techniques is time-consuming and error-prone.

Details

Method: Combines post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations that deviate from those of correctly classified images, applied to drone-collected images of insulator shells in power grid infrastructure. Result: Average classification accuracy on two faulty classes improves by 8%, and maintenance operators need to manually reclassify only 15% of the images. Conclusion: The proposed framework achieves higher F1 scores than a state-of-the-art approach based on the faithfulness metric and successfully identifies correct classifications that result from non-causal shortcuts. Abstract: Deep Learning (DL) models processing images to recognize the health state of large infrastructure components can exhibit biases and rely on non-causal shortcuts. eXplainable Artificial Intelligence (XAI) can address these issues but manually analyzing explanations generated by XAI techniques is time-consuming and prone to errors. This work proposes a novel framework that combines post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations that deviate from those of correctly classified images and may therefore indicate model abnormal behaviors. This significantly reduces the workload for maintenance decision-makers, who only need to manually reclassify images flagged as having anomalous explanations. The proposed framework is applied to drone-collected images of insulator shells for power grid infrastructure monitoring, considering two different Convolutional Neural Networks (CNNs), GradCAM explanations and Deep Semi-Supervised Anomaly Detection. The average classification accuracy on two faulty classes is improved by 8% and maintenance operators are required to manually reclassify only 15% of the images. We compare the proposed framework with a state-of-the-art approach based on the faithfulness metric: the experimental results obtained demonstrate that the proposed framework consistently achieves F1 scores larger than those of the faithfulness-based approach. Additionally, the proposed framework successfully identifies correct classifications that result from non-causal shortcuts, such as the presence of ID tags printed on insulator shells.

Temporal Regularization Makes Your Video Generator Stronger

Harold Haodong Chen,Haojian Huang,Xianfeng Wu,Yexin Liu,Yajing Bai,Wen-Jie Shu,Harry Yang,Ser-Nam Lim

Task: Explore temporal augmentation in video generation to improve temporal coherence and diversity.

Motivation: Temporal quality is a critical aspect of video generation, ensuring consistent motion and realistic dynamics across frames, but achieving high temporal coherence and diversity remains challenging.

Details

Method: Proposes FluxFlow, a strategy that improves temporal quality by applying controlled temporal perturbations at the data level, without requiring changes to the model architecture. Result: Extensive experiments on the UCF-101 and VBench benchmarks show that FluxFlow significantly improves temporal coherence and diversity across various video generation models while preserving spatial fidelity. Conclusion: Temporal augmentation is a simple yet effective approach with the potential to improve video generation quality. Abstract: Temporal quality is a critical aspect of video generation, as it ensures consistent motion and realistic dynamics across frames. However, achieving high temporal coherence and diversity remains challenging. In this work, we explore temporal augmentation in video generation for the first time, and introduce FluxFlow for initial investigation, a strategy designed to enhance temporal quality. Operating at the data level, FluxFlow applies controlled temporal perturbations without requiring architectural modifications. Extensive experiments on UCF-101 and VBench benchmarks demonstrate that FluxFlow significantly improves temporal coherence and diversity across various video generation models, including U-Net, DiT, and AR-based architectures, while preserving spatial fidelity. These findings highlight the potential of temporal augmentation as a simple yet effective approach to advancing video generation quality.
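A data-level temporal perturbation can be as simple as swapping a few frames in a training clip before it reaches the model. The sketch below shows one such perturbation as an illustration only: the abstract does not specify which perturbations FluxFlow applies, so the function name, the swap-based scheme, and the `num_swaps` parameter are assumptions.

```python
import random


def perturb_frame_order(frames, num_swaps=1, seed=None):
    """Return a copy of a clip with a few randomly chosen frame pairs swapped.

    frames: sequence of frames (any objects); the input is not modified.
    num_swaps: how many pairwise swaps to apply, which controls the strength.
    """
    rng = random.Random(seed)
    out = list(frames)
    if len(out) < 2:
        return out  # nothing to swap
    for _ in range(num_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out
```

Since the perturbation touches only the training data, it can be dropped into an existing data loader without changing the generator, which matches the architecture-agnostic claim in the abstract.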

Visual Position Prompt for MLLM based Visual Grounding

Wei Tang,Yanpeng Sun,Qinying Gu,Zechao Li

Task: Improve the ability of multimodal large language models (MLLMs) to align coordinates with spatial information in visual grounding tasks.

Motivation: Existing MLLMs excel at image-related tasks, but in position-aware tasks such as visual grounding their localization is weak because they lack explicit spatial references and their feature extraction prioritizes global context.

Details

Method: Proposes VPP-LLaVA, a model that strengthens grounding by introducing a Visual Position Prompt (VPP). VPP-LLaVA integrates two complementary mechanisms: the global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues, and the local VPP focuses on fine-grained localization by incorporating position-aware queries. Result: On standard grounding benchmarks, VPP-LLaVA achieves state-of-the-art results with fewer training samples (0.6M), outperforming other MLLMs that rely on much larger datasets (about 21M samples). Conclusion: By introducing the visual position prompt mechanism, VPP-LLaVA significantly improves MLLM performance on visual grounding and shows its potential for position-aware tasks. Abstract: Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address this issue, we introduce VPP-LLaVA, an MLLM equipped with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms. The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues. The local VPP focuses on fine-grained localization by incorporating position-aware queries, which suggest probable object locations. We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training. Training on this dataset with VPP enhances the model's performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples compared to other MLLMs like MiniGPT-v2, which rely on much larger datasets ($\sim$21M samples). The code and VPP-SFT dataset will be available at https://github.com/WayneTomas/VPP-LLaVA upon acceptance.

V2X-DG: Domain Generalization for Vehicle-to-Everything Cooperative Perception

Baolu Li,Zongzhe Xu,Jinlong Li,Xinyu Liu,Jianwu Fang,Xiaopeng Li,Hongkai Yu

Task: Study the domain generalization problem of LiDAR-based V2X cooperative perception to improve the generalization of 3D detection.

Motivation: Current cooperative perception algorithms are trained and tested on the same dataset, so the generalization ability of cooperative perception systems remains underexplored.

Details

Method: Proposes Cooperative Mixup Augmentation based Generalization (CMAG) and a Cooperation Feature Consistency (CFC) constraint to improve generalization to unseen domains. Result: Experiments show that the approach achieves significant performance gains on unseen datasets while maintaining strong performance on the source dataset. Conclusion: The proposed approach effectively improves the generalization ability of LiDAR-based V2X cooperative perception systems across multiple datasets. Abstract: LiDAR-based Vehicle-to-Everything (V2X) cooperative perception has demonstrated its impact on the safety and effectiveness of autonomous driving. Since current cooperative perception algorithms are trained and tested on the same dataset, the generalization ability of cooperative perception systems remains underexplored. This paper is the first work to study the Domain Generalization problem of LiDAR-based V2X cooperative perception (V2X-DG) for 3D detection based on four widely-used open source datasets: OPV2V, V2XSet, V2V4Real and DAIR-V2X. Our research seeks to sustain high performance not only within the source domain but also across other unseen domains, achieved solely through training on the source domain. To this end, we propose Cooperative Mixup Augmentation based Generalization (CMAG) to improve the model generalization capability by simulating the unseen cooperation, which is designed compactly for the domain gaps in cooperative perception. Furthermore, we propose a constraint for the regularization of the robust generalized feature representation learning: Cooperation Feature Consistency (CFC), which aligns the intermediately fused features of the generalized cooperation by CMAG and the early fused features of the original cooperation in the source domain. Extensive experiments demonstrate that our approach achieves significant performance gains when generalizing to other unseen datasets while it also maintains strong performance on the source dataset.

MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Lixing Xiao,Shunlin Lu,Huaijin Pi,Ke Fan,Liang Pan,Yueer Zhou,Ziyong Feng,Xiaowei Zhou,Sida Peng,Jingbo Wang

Task: Address text-conditioned streaming motion generation, i.e., predicting the next-step human pose from variable-length historical motions and incoming text.

Motivation: Existing methods struggle with streaming motion generation: diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation due to discretized non-causal tokenization.

Details

Method: Proposes MotionStreamer, a framework that incorporates a continuous causal latent space into a probabilistic autoregressive model; the continuous latents mitigate information loss caused by discretization and reduce error accumulation during long-term autoregressive generation. Result: Experiments show that the method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Conclusion: With a continuous causal latent space and a probabilistic autoregressive model, MotionStreamer effectively addresses the problems of streaming motion generation and performs well across several applications. Abstract: This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation problems due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/

Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator

Yuanzhi Zhu,Xi Wang,Stéphane Lathuilière,Vicky Kalogeiton

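Task: Distill masked diffusion models (MDMs) into a one-step generator.

Motivation: MDMs achieve remarkable results but typically suffer from slow inference that requires several steps.

Details

Method: Proposes Di$\mathtt{[M]}$O, which combines token-level distribution matching, optimizing model output logits through an on-policy framework with the help of an auxiliary model, with a token initialization strategy that injects randomness while staying close to the teacher's training distribution. Result: On class-conditional and text-conditional image generation, Di$\mathtt{[M]}$O achieves performance competitive with multi-step teacher outputs while drastically reducing inference time. Conclusion: According to the authors, this is the first successful one-step distillation of masked diffusion models and the first application of discrete distillation to text-to-image generation. Abstract: Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference with several steps. In this paper, we propose Di$\mathtt{[M]}$O, a novel approach that distills masked diffusion models into a one-step generator. Di$\mathtt{[M]}$O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes model output logits by an 'on-policy framework' with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to teacher training distribution. We show Di$\mathtt{[M]}$O's effectiveness on both class-conditional and text-conditional image generation, impressively achieving performance competitive to multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.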

FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers

Ruichen Chen,Keith G. Mills,Di Niu

Task: Propose FP4DiT, a post-training quantization method based on floating-point quantization, for low-bit quantization of Diffusion Transformer models.

Motivation: Existing DM PTQ methods mainly target classical DMs built on convolutional U-Nets, whereas newer Diffusion Transformer models such as the PixArt series and Hunyuan adopt different transformer backbones, and integer quantization does not align well with network weight and activation distributions in low-bit settings.

Details

Method: Extends and generalizes the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for floating-point quantization, and proposes robust online activation quantization. Result: FP4DiT outperforms integer-based PTQ at W4A6 and W4A8 precision and generates convincing visual content on PixArt-$\alpha$, PixArt-$\Sigma$ and Hunyuan. Conclusion: In low-bit settings, FP4DiT aligns better with the weight and activation distributions of Diffusion Transformer models and produces high-quality images. Abstract: Diffusion Models (DM) have revolutionized the text-to-image visual generation process. However, the large computational cost and model footprint of DMs hinder practical deployment, especially on edge devices. Post-training quantization (PTQ) is a lightweight method to alleviate these burdens without the need for training or fine-tuning. While recent DM PTQ methods achieve W4A8 on integer-based PTQ, two key limitations remain: First, while most existing DM PTQ methods evaluate on classical DMs like Stable Diffusion XL, 1.5 or earlier, which use convolutional U-Nets, newer Diffusion Transformer (DiT) models like the PixArt series, Hunyuan and others adopt fundamentally different transformer backbones to achieve superior image synthesis. Second, integer (INT) quantization is prevailing in DM PTQ but doesn't align well with the network weight and activation distribution, while Floating-Point Quantization (FPQ) is still under-investigated, yet it holds the potential to better align the weight and activation distributions in low-bit settings for DiT. In response, we introduce FP4DiT, a PTQ method that leverages FPQ to achieve W4A6 quantization. Specifically, we extend and generalize the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for FPQ and demonstrate that DiT activations depend on input patch data, necessitating robust online activation quantization techniques. Experimental results demonstrate that FP4DiT outperforms integer-based PTQ at W4A6 and W4A8 precision and generates convincing visual content on PixArt-$\alpha$, PixArt-$\Sigma$ and Hunyuan in terms of several T2I metrics such as HPSv2 and CLIP.
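
To make the contrast between floating-point and integer 4-bit grids concrete, the sketch below fake-quantizes a weight list onto both. It assumes the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit), a per-tensor scale, and plain round-to-nearest; the abstract does not state which FP4 layout is used, and the adaptive rounding calibration it describes is not reproduced here. The function names are hypothetical.

```python
from typing import List

# Non-negative magnitudes representable in a 4-bit float with the E2M1 layout.
# Assumed format for illustration; the paper may use a different FP4 layout.
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_fp4(weights: List[float]) -> List[float]:
    """Fake-quantize to the E2M1 grid with a per-tensor scale (round to nearest)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / FP4_E2M1_MAGNITUDES[-1]  # map the largest weight to 6.0
    out = []
    for w in weights:
        target = abs(w) / scale
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(g - target))
        out.append((nearest if w >= 0 else -nearest) * scale)
    return out


def quantize_int4(weights: List[float]) -> List[float]:
    """Fake-quantize to a symmetric INT4 grid with integer levels -7..7."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / 7
    return [max(-7, min(7, round(w / scale))) * scale for w in weights]


if __name__ == "__main__":
    w = [6.0, 0.4, -2.6, 0.0]
    print("FP4 :", quantize_fp4(w))
    print("INT4:", quantize_int4(w))
```

With this toy input the small weight 0.4 survives as 0.5 on the FP4 grid but is flushed to zero on the uniform INT4 grid, which illustrates why a non-uniform grid can suit distributions concentrated near zero.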

EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining

Boshen Xu,Yuting Mei,Xinbi Liu,Sipeng Zheng,Qin Jin

Task: Jointly train the Egocentric Depth- and Text-aware Model (EgoDTM) through large-scale 3D-aware video pretraining and video-text contrastive learning to improve 3D-aware visual understanding.

Motivation: Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. However, most previous works learn from 1D text or 2D visual cues such as bounding boxes, which inherently lack 3D understanding.

Details

Method: Introduces EgoDTM, which incorporates a lightweight 3D-aware decoder to efficiently learn 3D-awareness from pseudo depth maps generated by depth estimation models, and enriches the original brief captions with hand-object visual cues by combining several foundation models. Result: Extensive experiments show that EgoDTM performs strongly across diverse downstream tasks, demonstrating its 3D-aware visual understanding. Conclusion: Through 3D-aware video pretraining and video-text contrastive learning, EgoDTM significantly improves 3D-aware visual understanding and performs well across a range of tasks. Abstract: Egocentric video-language pretraining has significantly advanced video representation learning. Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. However, most previous works learn from 1D text or 2D visual cues, such as bounding boxes, which inherently lack 3D understanding. To bridge this gap, we introduce EgoDTM, an Egocentric Depth- and Text-aware Model, jointly trained through large-scale 3D-aware video pretraining and video-text contrastive learning. EgoDTM incorporates a lightweight 3D-aware decoder to efficiently learn 3D-awareness from pseudo depth maps generated by depth estimation models. To further facilitate 3D-aware video pretraining, we enrich the original brief captions with hand-object visual cues by organically combining several foundation models. Extensive experiments demonstrate EgoDTM's superior performance across diverse downstream tasks, highlighting its superior 3D-aware visual understanding. Our code will be released at https://github.com/xuboshen/EgoDTM.

Toward task-driven satellite image super-resolution

Maciej Ziaja,Pawel Kowaleczko,Daniel Kostrzewa,Nicolas Longépé,Michal Kawulok

Task: Learn super-resolution algorithms so that they generate high-resolution images suitable for automated image analysis.

Motivation: Existing super-resolution methods produce images of high perceptual quality, but it remains unclear whether the reconstructed details are close to the actual ground truth and whether they are a more valuable source for image analysis algorithms.

Details

Method: Proposes a methodological approach for assessing whether existing models for computer vision tasks can be used to evaluate super-resolution reconstruction algorithms and to train them in a task-driven way. Result: The analysis is supported by an experimental study and is expected to establish a solid foundation for selecting appropriate computer vision tasks that advance real-world super-resolution. Conclusion: The study provides an initial methodological basis for task-driven learning and evaluation of super-resolution algorithms, which may help advance super-resolution in practical applications. Abstract: Super-resolution is aimed at reconstructing high-resolution images from low-resolution observations. State-of-the-art approaches underpinned with deep learning allow for obtaining outstanding results, generating images of high perceptual quality. However, it often remains unclear whether the reconstructed details are close to the actual ground-truth information and whether they constitute a more valuable source for image analysis algorithms. In the reported work, we address the latter problem, and we present our efforts toward learning super-resolution algorithms in a task-driven way to make them suitable for generating high-resolution images that can be exploited for automated image analysis. In the reported initial research, we propose a methodological approach for assessing the existing models that perform computer vision tasks in terms of whether they can be used for evaluating super-resolution reconstruction algorithms, as well as training them in a task-driven way. We support our analysis with an experimental study and we expect it to establish a solid foundation for selecting appropriate computer vision tasks that will advance the capabilities of real-world super-resolution.

Cube: A Roblox View of 3D Intelligence

Foundation AI Team,Kiran Bhat,Nishchaie Khanna,Karun Channa,Tinghui Zhou,Yiheng Zhu,Xiaoxia Sun,Charles Shang,Anirudh Sudarshan,Maurice Chu,Daiqing Li,Kangle Deng,Jean-Philippe Fauconnier,Tijmen Verhulsdonck,Maneesh Agrawala,Kayvon Fatahalian,Alexander Weiss,Christian Reiser,Ravi Kiran Chirravuri,Ravali Kandur,Alejandro Pelaez,Akash Garg,Michael Palleschi,Jessica Wang,Skylar Litz,Leon Liu,Anying Li,David Harmon,Derek Liu,Liangjun Feng,Denis Goupil,Lukas Kuczynski,Jihyun Yoon,Naveen Marri,Peiye Zhuang,Yinan Zhang,Brian Yin,Haomiao Jiang,Marcel van Workum,Thomas Lane,Bryce Erickson,Salil Pathare,Kyle Price,Anupam Singh,David Baszucki

Task: Build a foundation model for 3D intelligence that supports developers in generating 3D objects, scenes, character animation, and programmatic scripts describing object behaviors.

Motivation: Foundation models have demonstrated remarkable reasoning and generation capabilities in text, images, audio and video; Roblox aims to build a similar foundation model for 3D intelligence to support developers in creating all aspects of a Roblox experience.

Details

Method: Discusses three key design requirements for a 3D foundation model and presents a 3D shape tokenizer, showing its use in text-to-shape, shape-to-text, and text-to-scene generation. Result: Demonstrates how these applications can collaborate with existing large language models (LLMs) for scene analysis and reasoning. Conclusion: Outlines a path to building a fully unified foundation model for 3D intelligence. Abstract: Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence, a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes to rigging characters for animation to producing programmatic scripts describing object behaviors. We discuss three key design requirements for such a 3D foundation model and then present our first step towards building such a model. We expect that 3D geometric shapes will be a core data type and describe our solution for a 3D shape tokenizer. We show how our tokenization scheme can be used in applications for text-to-shape generation, shape-to-text generation and text-to-scene generation. We demonstrate how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. We conclude with a discussion outlining our path to building a fully unified foundation model for 3D intelligence.

TULIP: Towards Unified Language-Image Pretraining

Zineng Tang,Long Lian,Seun Eisape,XuDong Wang,Roei Herzig,Adam Yala,Alane Suhr,Trevor Darrell,David M. Chan

Task: Propose TULIP to address the shortcomings of existing image-text contrastive models on vision-centric tasks.

Motivation: Existing image-text contrastive models such as CLIP and SigLIP perform poorly on tasks that demand high-fidelity image understanding (counting, depth estimation, fine-grained object recognition), while vision-focused models lack flexibility for language-driven tasks.

Details

Method: Uses generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Result: TULIP outperforms existing state-of-the-art models across multiple benchmarks, setting a new state-of-the-art zero-shot performance on ImageNet-1K, delivering up to a $2\times$ improvement over SigLIP on RxRx1 in linear probing for few-shot classification, and scoring over $3\times$ higher than SigLIP on MMVP. Conclusion: By combining generative data augmentation with contrastive learning, TULIP significantly improves image understanding and language-driven task performance, making it an effective replacement for existing CLIP-like models. Abstract: Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. These models, by performing language alignment, tend to prioritize high-level semantics over visual understanding, weakening their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a $2\times$ enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over $3\times$ higher scores than SigLIP on MMVP. Our code/checkpoints are available at https://tulip-berkeley.github.io

SDF-TopoNet: A Two-Stage Framework for Tubular Structure Segmentation via SDF Pre-training and Topology-Aware Fine-Tuning

Siyi Wu,Leyi Zhao,Haitian Ma,Xinyuan Song

Task: Propose SDF-TopoNet, an improved topology-aware segmentation framework for accurately segmenting tubular and curvilinear structures.

Motivation: Existing methods that enforce topological correctness are computationally expensive and insensitive to pixel-level accuracy.

Details

Method: Proposes a two-stage training strategy: a pre-training phase that uses the signed distance function (SDF) as an auxiliary learning target, and a fine-tuning phase that combines a dynamic adapter with a refined topological loss. Result: Experiments on five benchmark datasets show that SDF-TopoNet outperforms existing methods in both topological accuracy and quantitative segmentation metrics while significantly reducing training complexity. Conclusion: SDF-TopoNet improves both segmentation accuracy and training efficiency, addressing the limitations of existing methods. Abstract: Accurate segmentation of tubular and curvilinear structures, such as blood vessels, neurons, and road networks, is crucial in various applications. A key challenge is ensuring topological correctness while maintaining computational efficiency. Existing approaches often employ topological loss functions based on persistent homology, such as Betti error, to enforce structural consistency. However, these methods suffer from high computational costs and are insensitive to pixel-level accuracy, often requiring additional loss terms like Dice or MSE to compensate. To address these limitations, we propose SDF-TopoNet, an improved topology-aware segmentation framework that enhances both segmentation accuracy and training efficiency. Our approach introduces a novel two-stage training strategy. In the pre-training phase, we utilize the signed distance function (SDF) as an auxiliary learning target, allowing the model to encode topological information without directly relying on computationally expensive topological loss functions. In the fine-tuning phase, we incorporate a dynamic adapter alongside a refined topological loss to ensure topological correctness while mitigating overfitting and computational overhead. We evaluate our method on five benchmark datasets. Experimental results demonstrate that SDF-TopoNet outperforms existing methods in both topological accuracy and quantitative segmentation metrics, while significantly reducing training complexity.
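
For readers unfamiliar with the signed distance function used as the pre-training target, the sketch below computes one by brute force from a small binary mask. The sign convention (positive inside the foreground, negative outside) and the pixel-center distance are assumptions; the abstract does not say which convention the authors use, and a real pipeline would use a fast distance transform rather than this quadratic loop.

```python
import math
from typing import List


def signed_distance(mask: List[List[int]]) -> List[List[float]]:
    """Brute-force signed distance map of a binary mask.

    Each pixel gets the Euclidean distance to the nearest pixel of the
    opposite class: positive for foreground pixels, negative for background
    (assumed convention, for illustration only).
    """
    h, w = len(mask), len(mask[0])
    fg = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    bg = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    sdf = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            inside = bool(mask[r][c])
            others = bg if inside else fg
            if not others:  # mask is all foreground or all background
                sdf[r][c] = math.inf if inside else -math.inf
                continue
            d = min(math.hypot(r - rr, c - cc) for rr, cc in others)
            sdf[r][c] = d if inside else -d
    return sdf


if __name__ == "__main__":
    tube = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    for row in signed_distance(tube):
        print(["%+.2f" % v for v in row])
```

Unlike a binary mask, the resulting map varies smoothly with distance from the structure boundary, which is what makes it usable as a dense regression target.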

Frans Zdyb,Albert Alonso,Julius B. Kirkegaard

Task: Detect slender, overlapping structures in computational microscopy.

Motivation: Existing coordinate-based approaches improve detection but often produce less accurate splines than pixel-based methods.

Details

Method: Proposes a training-free differentiable rendering approach to spline refinement that achieves both high reliability and sub-pixel accuracy. Result: The method improves spline quality, enhances robustness to distribution shifts, and narrows the gap between synthetic and real-world data. Conclusion: Evaluated on C. elegans nematodes, a model organism widely used in drug discovery and biomedical research, the method combines the strengths of coordinate- and pixel-based approaches. Abstract: Detecting slender, overlapping structures remains a challenge in computational microscopy. While recent coordinate-based approaches improve detection, they often produce less accurate splines than pixel-based methods. We introduce a training-free differentiable rendering approach to spline refinement, achieving both high reliability and sub-pixel accuracy. Our method improves spline quality, enhances robustness to distribution shifts, and shrinks the gap between synthetic and real-world data. Being fully unsupervised, the method is a drop-in replacement for the popular active contour model for spline refinement. Evaluated on C. elegans nematodes, a popular model organism for drug discovery and biomedical research, we demonstrate that our approach combines the strengths of both coordinate- and pixel-based methods.

Ship Detection in Remote Sensing Imagery for Arbitrarily Oriented Object Detection

Bibi Erum Ayesha,T. Satyanarayana Murthy,Palamakula Ramesh Babu,Ramu Kuchipudi

Task: Develop a ship detection system for applications such as maritime surveillance and ecological monitoring.

Motivation: Conventional ship detection struggles with arbitrary orientations, complex backgrounds, and obscured perspectives, so a more accurate and efficient approach is needed.

Details

Method: Uses YOLOv8 for real-time processing and a repurposed U-Net for ship instance segmentation. Result: YOLOv8 achieves an 88% mAP, with accurate and rapid ship detection; the adapted U-Net attains an 89% mAP, improving boundary delineation and occlusion handling. Conclusion: The study demonstrates the potential of deep learning models for ship detection and strengthens maritime surveillance, disaster response, and ecological monitoring. Abstract: This research paper presents an innovative ship detection system tailored for applications like maritime surveillance and ecological monitoring. The study employs YOLOv8 and repurposed U-Net, two advanced deep learning models, to significantly enhance ship detection accuracy. Evaluation metrics include Mean Average Precision (mAP), processing speed, and overall accuracy. The research utilizes the "Airbus Ship Detection" dataset, featuring diverse remote sensing images, to assess the models' versatility in detecting ships with varying orientations and environmental contexts. Conventional ship detection faces challenges with arbitrary orientations, complex backgrounds, and obscured perspectives. Our approach incorporates YOLOv8 for real-time processing and U-Net for ship instance segmentation. Evaluation focuses on mAP, processing speed, and overall accuracy. The dataset is chosen for its diverse images, making it an ideal benchmark. Results demonstrate significant progress in ship detection. YOLOv8 achieves an 88% mAP, excelling in accurate and rapid ship detection. U-Net, adapted for ship instance segmentation, attains an 89% mAP, improving boundary delineation and handling occlusions. This research enhances maritime surveillance, disaster response, and ecological monitoring, exemplifying the potential of deep learning models in ship detection.

Praveen Shastry,Sowmya Chowdary Muthulur,Naveen Kumarasami,Anandakumar D,Mounigasri M,Keerthana R,Kishore Prasath Venkatesh,Bargava Subramanian,Kalyan Sivasailam,Revathi Ezhumalai,Abitha Marimuthu

Task: Propose a Vision-Language Model (VLM) based on the SIGLIP encoder and the Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening.

Motivation: Integrating chest X-ray images with clinical data addresses the challenges of manual interpretation and improves diagnostic consistency and accessibility, particularly in resource-constrained settings.

Details

Method: The VLM architecture combines a Vision Transformer (ViT) for visual encoding with a transformer-based text encoder that processes clinical context such as patient histories and treatment records. Cross-modal attention aligns radiographic features with textual information, and the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned on 100,000 chronic TB-specific chest X-rays. Result: The model shows high precision (94%) and recall (94%) for detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores exceed 0.93 and Intersection over Union (IoU) values exceed 0.91, validating its effectiveness in detecting and localizing TB-related abnormalities. Conclusion: The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable, context-aware insights. Future work will address subtle pathologies and dataset biases to improve generalizability and ensure equitable performance across diverse populations and healthcare settings. Abstract: Background: This study proposes a Vision-Language Model (VLM) leveraging the SIGLIP encoder and Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening. By integrating chest X-ray images with clinical data, the model addresses the challenges of manual interpretation, improving diagnostic consistency and accessibility, particularly in resource-constrained settings. Methods: The VLM architecture combines a Vision Transformer (ViT) for visual encoding and a transformer-based text encoder to process clinical context, such as patient histories and treatment records. Cross-modal attention mechanisms align radiographic features with textual information, while the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned using 100,000 chronic TB-specific chest X-rays. Results: The model demonstrated high precision (94%) and recall (94%) for detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores exceeded 0.93, and Intersection over Union (IoU) values were above 0.91, validating its effectiveness in detecting and localizing TB-related abnormalities. Conclusion: The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable and context-aware insights. Future work will address subtle pathologies and dataset biases to enhance the model's generalizability, ensuring equitable performance across diverse populations and healthcare settings.
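
The precision, recall, and IoU figures quoted in the abstract above follow standard definitions, sketched below. The helper names and the axis-aligned `(x_min, y_min, x_max, y_max)` box format are assumptions for illustration, and the counts in the usage example are made up; they are not data from the study.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical counts chosen only to show the arithmetic.
    print(precision_recall(tp=47, fp=3, fn=3))
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU above 0.9 requires the predicted and reference regions to overlap almost completely, so it is a much stricter localization criterion than the 0.5 threshold commonly used in detection benchmarks.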

Vision-Language Models for Acute Tuberculosis Diagnosis: A Multimodal Approach Combining Imaging and Clinical Data

Ananya Ganapthy,Praveen Shastry,Naveen Kumarasami,Anandakumar D,Keerthana R,Mounigasri M,Varshinipriya M,Kishore Prasath Venkatesh,Bargava Subramanian,Kalyan Sivasailam

Task: Use a Vision-Language Model (VLM) built on the SIGLIP and Gemma-3b architectures for automated acute tuberculosis (TB) screening.

Motivation: Integrating chest X-ray images with clinical notes can improve diagnostic accuracy and efficiency, particularly in resource-limited settings.

Details

Method: The VLM combines visual data from chest X-rays with clinical context to generate detailed, context-aware diagnostic reports. The architecture uses SIGLIP for visual encoding and Gemma-3b for decoding, ensuring effective representation of acute TB-specific pathologies and clinical insights. Result: Key acute TB pathologies, including consolidation, cavities, and nodules, are detected with high precision (97%) and recall (96%). The model shows strong spatial localization and robustness in distinguishing TB-positive cases, making it a reliable tool for acute TB diagnosis. Conclusion: The multimodal capability of the VLM reduces reliance on radiologists and provides a scalable solution for acute TB screening. Future work will focus on improving detection of subtle pathologies and addressing dataset biases to improve generalizability across diverse global healthcare settings. Abstract: Background: This study introduces a Vision-Language Model (VLM) leveraging SIGLIP and Gemma-3b architectures for automated acute tuberculosis (TB) screening. By integrating chest X-ray images and clinical notes, the model aims to enhance diagnostic accuracy and efficiency, particularly in resource-limited settings. Methods: The VLM combines visual data from chest X-rays with clinical context to generate detailed, context-aware diagnostic reports. The architecture employs SIGLIP for visual encoding and Gemma-3b for decoding, ensuring effective representation of acute TB-specific pathologies and clinical insights. Results: Key acute TB pathologies, including consolidation, cavities, and nodules, were detected with high precision (97%) and recall (96%). The model demonstrated strong spatial localization capabilities and robustness in distinguishing TB-positive cases, making it a reliable tool for acute TB diagnosis. Conclusion: The multimodal capability of the VLM reduces reliance on radiologists, providing a scalable solution for acute TB screening. Future work will focus on improving the detection of subtle pathologies and addressing dataset biases to enhance its generalizability and application in diverse global healthcare settings.

AI-Driven Rapid Identification of Bacterial and Fungal Pathogens in Blood Smears of Septic Patients

Agnieszka Sroka-Oleksiak,Adam Pardyl,Dawid Rymarczyk,Aldona Olechowska-Jarząb,Katarzyna Biegun-Drożdż,Dorota Ochońska,Michał Wronka,Adriana Borowa,Tomasz Gosiewski,Miłosz Adamczyk,Henryk Telega,Bartosz Zieliński,Monika Brzychczy-Włoch

Task: Use deep learning to identify 14 bacteria species and 3 yeast-like fungi from microscopic images of Gram-stained smears.

Motivation: Traditional microbiological methods are time-consuming and expensive, while sepsis requires rapid diagnosis and treatment.

Details

Method: Uses the Cellpose 3 model for segmentation and Attention-based Deep Multiple Instance Learning for classification. Result: The model reaches a classification accuracy of 77.15% for bacteria and 71.39% for fungi, with ROC AUC of 0.97 and 0.88, respectively. Conclusion: The study confirms the model's potential for microbial classification but indicates the need for further optimisation and a larger training data set. Abstract: Sepsis is a life-threatening condition which requires rapid diagnosis and treatment. Traditional microbiological methods are time-consuming and expensive. In response to these challenges, deep learning algorithms were developed to identify 14 bacteria species and 3 yeast-like fungi from microscopic images of Gram-stained smears of positive blood samples from sepsis patients. A total of 16,637 Gram-stained microscopic images were used in the study. The analysis used the Cellpose 3 model for segmentation and Attention-based Deep Multiple Instance Learning for classification. Our model achieved an accuracy of 77.15% for bacteria and 71.39% for fungi, with ROC AUC of 0.97 and 0.88, respectively. The highest values, reaching up to 96.2%, were obtained for Cutibacterium acnes, Enterococcus faecium, Stenotrophomonas maltophilia and Nakaseomyces glabratus. Classification difficulties were observed in closely related species, such as Staphylococcus hominis and Staphylococcus haemolyticus, due to morphological similarity, and within Candida albicans due to high morphotic diversity. The study confirms the potential of our model for microbial classification, but it also indicates the need for further optimisation and expansion of the training data set. In the future, this technology could support microbial diagnosis, reducing diagnostic time and improving the effectiveness of sepsis treatment due to its simplicity and accessibility. Part of the results presented in this publication was covered by a patent application at the European Patent Office EP24461637.1 "A computer implemented method for identifying a microorganism in a blood and a data processing system therefor".
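
Attention-based multiple instance learning treats each image as a bag of instance embeddings (for example, one per segmented cell) and pools them with learned attention weights before classification. The sketch below shows only that pooling step, using the common formulation $a_k = \mathrm{softmax}_k\big(w^\top \tanh(V h_k)\big)$ and $z = \sum_k a_k h_k$. The matrices `V` and `w` are fixed toy values here rather than learned parameters, and nothing in the block is taken from the authors' code.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attention_mil_pool(
    instances: List[Vector], V: List[Vector], w: Vector
) -> Tuple[List[float], Vector]:
    """Pool a bag of instance embeddings with attention weights.

    instances: K vectors of dimension D (one per instance in the bag)
    V:         L x D projection matrix (toy values, normally learned)
    w:         L-dimensional scoring vector (toy values, normally learned)
    Returns (attention weights summing to 1, D-dimensional bag embedding).
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wj * tj for wj, tj in zip(w, hidden)))
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(attn, instances)) for d in range(dim)]
    return attn, bag


if __name__ == "__main__":
    bag_of_cells = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
    attn, embedding = attention_mil_pool(bag_of_cells, V=[[1.0, 0.0]], w=[2.0])
    print(attn, embedding)
```

The attention weights also indicate which instances drove the bag-level prediction, which is one reason this pooling is popular for microscopy images where only a few regions are informative.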

The Impact of Artificial Intelligence on Emergency Medicine: A Review of Recent Advances

Gustavo Correia,Victor Alves,Paulo Novais

Task: Review applications of artificial intelligence in emergency imaging and the advances of the last five years.

Motivation: Examine the potential of AI in emergency medicine, particularly for imaging-based diagnosis and prediction of patient outcomes.

Details

Method: Reviews studies from the last five years on AI in emergency imaging, with a focus on machine learning and deep learning techniques. Result: AI performs well in emergency imaging, accurately detecting conditions such as fractures, pneumothorax, and pulmonary diseases, and can predict clinical outcomes such as the need for mechanical ventilation. Conclusion: AI has transformative potential in emergency medicine, but challenges such as data privacy, algorithmic bias, and the need for extensive validation must be addressed so that AI and clinical expertise can work together to raise patient care standards. Abstract: Artificial Intelligence (AI) is revolutionizing emergency medicine by enhancing diagnostic processes and improving patient outcomes. This article provides a review of the current applications of AI in emergency imaging studies, focusing on the last five years of advancements. AI technologies, particularly machine learning and deep learning, are pivotal in interpreting complex imaging data, offering rapid, accurate diagnoses and potentially surpassing traditional diagnostic methods. Studies highlighted within the article demonstrate AI's capabilities in accurately detecting conditions such as fractures, pneumothorax, and pulmonary diseases from various imaging modalities including X-rays, CT scans, and MRIs. Furthermore, AI's ability to predict clinical outcomes like mechanical ventilation needs illustrates its potential in crisis resource optimization. Despite these advancements, the integration of AI into clinical practice presents challenges such as data privacy, algorithmic bias, and the need for extensive validation across diverse settings. This review underscores the transformative potential of AI in emergency settings, advocating for a future where AI and clinical expertise synergize to elevate patient care standards.

Novel AI-Based Quantification of Breast Arterial Calcification to Predict Cardiovascular Risk

Theodorus Dapamede,Aisha Urooj,Vedant Joshi,Gabrielle Gershon,Frank Li,Mohammadreza Chavoshi,Beatrice Brown-Mulry,Rohan Satya Isaac,Aawez Mansuri,Chad Robichaux,Chadi Ayoub,Reza Arsanjani,Laurence Sperling,Judy Gichoya,Marly van Assen,Charles W. ONeill,Imon Banerjee,Hari Trivedi

Task: Identify women at risk of cardiovascular disease by automatically quantifying breast arterial calcification (BAC) on screening mammograms.

Motivation: Women are underdiagnosed and undertreated for cardiovascular disease; automatic BAC quantification can identify at-risk women during routine mammography and enable earlier treatment and management.

Details

Method: A transformer-based neural network quantifies BAC severity (no BAC, mild, moderate, and severe) on screening mammograms of 116,135 women. Result: BAC severity is independently associated with major adverse cardiovascular events (MACE), with hazard ratios increasing from mild (HR 1.18-1.22) through moderate (HR 1.38-1.47) to severe BAC (HR 2.03-2.22) (all p<0.001). The association is significant across all age groups, and even mild BAC indicates increased risk in women under 50. Conclusion: Automated BAC quantification enables cardiovascular risk assessment during routine mammography without additional radiation or cost and offers potential for early cardiovascular risk stratification, particularly in younger women. Abstract: Women are underdiagnosed and undertreated for cardiovascular disease. Automatic quantification of breast arterial calcification on screening mammography can identify women at risk for cardiovascular disease and enable earlier treatment and management of disease. In this retrospective study of 116,135 women from two healthcare systems, a transformer-based neural network quantified BAC severity (no BAC, mild, moderate, and severe) on screening mammograms. Outcomes included major adverse cardiovascular events (MACE) and all-cause mortality. BAC severity was independently associated with MACE after adjusting for cardiovascular risk factors, with increasing hazard ratios from mild (HR 1.18-1.22), moderate (HR 1.38-1.47), to severe BAC (HR 2.03-2.22) across datasets (all p<0.001). This association remained significant across all age groups, with even mild BAC indicating increased risk in women under 50. BAC remained an independent predictor when analyzed alongside ASCVD risk scores, showing significant associations with myocardial infarction, stroke, heart failure, and mortality (all p<0.005). Automated BAC quantification enables opportunistic cardiovascular risk assessment during routine mammography without additional radiation or cost. This approach provides value beyond traditional risk factors, particularly in younger women, offering potential for early CVD risk stratification in the millions of women undergoing annual mammography.

Synchronous vs Asynchronous Reinforcement Learning in a Real World Robot

Ali Parsaee,Fahim Shahriar,Chuxin He,Ruiqing Tan

Task: Compare the performance of asynchronous and synchronous reinforcement learning on a physical robot.

Motivation: State-of-the-art reinforcement learning algorithms do not account for the time that decision-making and gradient updates take in a physical environment. Asynchronous RL methods may address this, but their performance advantage on physical robots remains unclear.

Details

Method: Experiments comparing asynchronous and synchronous RL on a Franka Emika Panda robotic arm. Result: Agents using asynchronous RL learn faster and attain significantly higher returns, and agents with faster response times outperform agents with slower response times. Conclusion: Asynchronous RL offers a significant performance advantage on physical robots, especially in rapidly changing environments. Abstract: In recent times, reinforcement learning (RL) with physical robots has attracted the attention of a wide range of researchers. However, state-of-the-art RL algorithms do not consider that physical environments do not wait for the RL agent to make decisions or updates. RL agents learn by periodically conducting computationally expensive gradient updates. When decision-making and gradient update tasks are carried out sequentially by the RL agent in a physical robot, it significantly increases the agent's response time. In a rapidly changing environment, this increased response time may be detrimental to the performance of the learning agent. Asynchronous RL methods, which separate the computation of decision-making and gradient updates, are a potential solution to this problem. However, only a few comparisons between asynchronous and synchronous RL have been made with physical robots. For this reason, the exact performance benefits of using asynchronous RL methods over synchronous RL methods are still unclear. In this study, we provide a performance comparison between asynchronous and synchronous RL using a physical robotic arm called Franka Emika Panda. Our experiments show that the agents learn faster and attain significantly more returns using asynchronous RL. Our experiments also demonstrate that the learning agent with a faster response time performs better than the agent with a slower response time, even if the agent with a slower response time performs a higher number of gradient updates.
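
The separation of decision-making from gradient updates described in the abstract can be illustrated with two threads linked by a queue. This is a minimal sketch, not the authors' implementation: the function name `run_async`, the batch size, and the placeholder comments standing in for the policy and the gradient step are all assumptions made for illustration.

```python
import queue
import threading


def run_async(num_steps: int, batch_size: int) -> dict:
    """Run a toy actor and learner concurrently (hypothetical sketch)."""
    transitions: queue.Queue = queue.Queue()
    stats = {"actions": 0, "updates": 0}

    def actor() -> None:
        # The actor never waits for the learner, so its response time
        # does not grow with the cost of a gradient update.
        for step in range(num_steps):
            # Placeholder for: read observation, choose action, send it to the robot.
            transitions.put(step)
            stats["actions"] += 1
        transitions.put(None)  # sentinel: no more transitions

    def learner() -> None:
        batch = []
        while True:
            item = transitions.get()
            if item is None:
                break
            batch.append(item)
            if len(batch) == batch_size:
                # Placeholder for a computationally expensive gradient update.
                stats["updates"] += 1
                batch.clear()

    threads = [threading.Thread(target=actor), threading.Thread(target=learner)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return stats
```

In a synchronous agent the two loops would be interleaved in one thread, so each expensive update would delay the next action; here the actor keeps acting while the learner works through the queue.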

Squeeze Out Tokens from Sample for Finer-Grained Data Governance

Weixiong Lin,Chen Ju,Haicheng Wang,Shengchao Hu,Shuai Xiao,Mengting Chen,Yuheng Jiao,Mingshuai Yao,Jinsong Lan,Qingwen Liu,Ying Chen

Task: Upgrade data governance from sample sieving to finer-grained intra-sample governance.

Motivation: Existing data governance methods estimate sample contributions with heuristic scalar scores and discard low-value samples, yet the retained samples still contain many undesired tokens, leaving room for further compression and purification.

Details

Method: Proposes the dual-branch DataJuicer, which applies finer-grained intra-sample governance to extract informative tokens and strengthen image-text alignment. The vision branch retains salient image patches and extracts relevant object classes; the text branch incorporates these classes to enhance captions. Result: Experiments across multiple datasets show that DataJuicer significantly outperforms the existing DataSieve on image-text retrieval, classification, and dense visual reasoning. Conclusion: DataJuicer yields more refined datasets through finer-grained governance, significantly improving model performance. Abstract: Widely observed data scaling laws, in which error falls off as a power of the training size, demonstrate the diminishing returns of unselective data expansion. Hence, data governance is proposed to downsize datasets through pruning non-informative samples. Yet, isolating the impact of a specific sample on overall model performance is challenging, due to the vast computation required to try out all sample combinations. Current data governors circumvent this complexity by estimating sample contributions through heuristic-derived scalar scores, thereby discarding low-value ones. Despite thorough sample sieving, retained samples contain substantial undesired tokens intrinsically, underscoring the potential for further compression and purification. In this work, we upgrade data governance from a 'sieving' approach to a 'juicing' one. Instead of scanning for least-flawed samples, our dual-branch DataJuicer applies finer-grained intra-sample governance. It squeezes out informative tokens and boosts image-text alignments. Specifically, the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes to enhance captions. Consequently, DataJuicer yields more refined datasets through finer-grained governance. Extensive experiments across datasets demonstrate that DataJuicer significantly outperforms existing DataSieve in image-text retrieval, classification, and dense visual reasoning.

Analysis of human visual field information using machine learning methods and assessment of their accuracy

A. I. Medvedeva,V. V. Bakutkin

Task: Study methods for analyzing visual field images for the diagnosis and control of glaucoma.

Motivation: The ophthalmological community is acutely concerned with disease control and import substitution, so effective diagnostic methods need to be studied.

Details

Method: Machine learning methods (stochastic gradient descent, logistic regression, random forest, naive Bayes) are used to classify image results. Result: A computer model that determines from an image whether the result indicates glaucoma or another disease (binary classification). Conclusion: Building a classifier on a labeled dataset makes classification-based diagnosis of glaucoma possible. Abstract: Subject of research: the study of methods for analyzing perimetric images for the diagnosis and control of glaucoma. Objects of research: a dataset collected on an ophthalmological perimeter with the results of various patient pathologies, since the ophthalmological community is acutely aware of the issue of disease control and import substitution [5]. Purpose of research: to consider various machine learning methods that can classify glaucoma. This is possible thanks to the classifier built after labeling the dataset. It is able to determine from an image whether the visual fields depicted in it result from the impact of glaucoma on the eyes or from other visual diseases. An earlier work [3] described a dataset that was collected on the Tomey perimeter. The average age of the examined patients ranged from 30 to 85 years. Methods of research: machine learning methods for classifying image results (stochastic gradient descent, logistic regression, random forest, naive Bayes). Main results of research: the result of the study is a computer model that can determine from an image whether the result is glaucoma or another disease (binary classification).

Three-dimensional Reconstruction of the Lumbar Spine with Submillimeter Accuracy Using Biplanar X-ray Images

Wanxin Yu,Zhemin Zhu,Cong Wang,Yihang Bao,Chunjie Xia,Rongshan Cheng,Yan Yu,Tsung-Yuan Tsai

Task: Develop and validate a fully automated method for high-accuracy 3D reconstruction of the lumbar spine from biplanar X-ray images.

Motivation: Current fully automated reconstruction methods have low accuracy and fail to meet clinical application standards, so a high-accuracy 3D reconstruction method is needed.

Details

Method: The method performs lumbar decomposition and landmark detection on the raw X-ray images, followed by a deformable model and landmark-weighted 2D-3D registration. Result: The proposed method achieves a 3D reconstruction accuracy of 0.80 mm, a significant improvement over mainstream approaches. Conclusion: The method will aid clinical diagnosis of the lumbar spine in weight-bearing positions. Abstract: Three-dimensional reconstruction of the spine under weight-bearing conditions from biplanar X-ray images is of great importance for the clinical assessment of spinal diseases. However, the current fully automated reconstruction methods have low accuracy and fail to meet the clinical application standards. This study developed and validated a fully automated method for high-accuracy 3D reconstruction of the lumbar spine from biplanar X-ray images. The method involves lumbar decomposition and landmark detection from the raw X-ray images, followed by a deformable model and landmark-weighted 2D-3D registration approach. The reconstruction accuracy was validated by the gold standard obtained through the registration of CT-segmented vertebral models with the biplanar X-ray images. The proposed method achieved a 3D reconstruction accuracy of 0.80 mm, representing a significant improvement over the mainstream approaches. This study will contribute to the clinical diagnosis of the lumbar spine in weight-bearing positions.

Reinforcement learning-based motion imitation for physiologically plausible musculoskeletal motor control

Merkourios Simos,Alberto Silvio Chiappa,Alexander Mathis

Task: Present a model-free motion imitation framework (KINESIS) to advance the understanding of muscle-based motor control.

Motivation: Understanding human motion has broad applications in computer animation, motion synthesis, neuroscience, human prosthetics, and rehabilitation. Although reinforcement learning has produced impressive results in capturing human motion, controlling physiologically accurate body models remains a challenge.

Details

Method: Using a lower-body musculoskeletal model with 80 muscle actuators and 20 degrees of freedom, KINESIS achieves strong imitation performance on 1.9 hours of motion capture data, can be controlled by natural language through pre-trained text-to-motion generative models, and can be fine-tuned for high-level tasks such as target goal reaching. Result: KINESIS generates muscle activity patterns that correlate well with human EMG activity; this physiological plausibility makes it a promising model for tackling challenging problems in human motor control theory. Conclusion: KINESIS shows potential for understanding human motor control, particularly in the context of Bernstein's redundancy problem. Abstract: How do humans move? The quest to understand human motion has broad applications in numerous fields, ranging from computer animation and motion synthesis to neuroscience, human prosthetics and rehabilitation. Although advances in reinforcement learning (RL) have produced impressive results in capturing human motion using simplified humanoids, controlling physiologically accurate models of the body remains an open challenge. In this work, we present a model-free motion imitation framework (KINESIS) to advance the understanding of muscle-based motor control. Using a musculoskeletal model of the lower body with 80 muscle actuators and 20 DoF, we demonstrate that KINESIS achieves strong imitation performance on 1.9 hours of motion capture data, is controllable by natural language through pre-trained text-to-motion generative models, and can be fine-tuned to carry out high-level tasks such as target goal reaching. Importantly, KINESIS generates muscle activity patterns that correlate well with human EMG activity. The physiological plausibility makes KINESIS a promising model for tackling challenging problems in human motor control theory, which we highlight by investigating Bernstein's redundancy problem in the context of locomotion. Code, videos and benchmarks will be available at https://github.com/amathislab/Kinesis.

Core-Periphery Principle Guided State Space Model for Functional Connectome Classification

Minheng Chen,Xiaowei Yu,Jing Zhang,Tong Chen,Chao Cao,Yan Zhuang,Yanjun Lyu,Lu Zhang,Tianming Liu,Dajiang Zhu

Task: Propose a Core-Periphery State-Space Model (CP-SSM) for functional connectome classification.

Motivation: Traditional machine learning approaches struggle to capture the complex relationships between brain regions, while deep learning methods, particularly Transformer-based models, face high computational complexity in long-sequence modeling.

Details

Method: The CP-SSM framework introduces Mamba, a selective state-space model with linear complexity, and designs a core-periphery-guided Mixture-of-Experts (CP-MoE) to improve representation learning of brain connectivity patterns. Result: On two benchmark fMRI datasets, ABIDE and ADNI, CP-SSM surpasses Transformer-based models in classification performance while significantly reducing computational complexity. Conclusion: CP-SSM is both effective and efficient at modeling brain functional connectivity, offering a promising direction for neuroimaging-based diagnosis of neurological diseases. Abstract: Understanding the organization of human brain networks has become a central focus in neuroscience, particularly in the study of functional connectivity, which plays a crucial role in diagnosing neurological disorders. Advances in functional magnetic resonance imaging and machine learning techniques have significantly improved brain network analysis. However, traditional machine learning approaches struggle to capture the complex relationships between brain regions, while deep learning methods, particularly Transformer-based models, face computational challenges due to their quadratic complexity in long-sequence modeling. To address these limitations, we propose a Core-Periphery State-Space Model (CP-SSM), an innovative framework for functional connectome classification. Specifically, we introduce Mamba, a selective state-space model with linear complexity, to effectively capture long-range dependencies in functional brain networks. Furthermore, inspired by the core-periphery (CP) organization, a fundamental characteristic of brain networks that enhances efficient information transmission, we design CP-MoE, a CP-guided Mixture-of-Experts that improves the representation learning of brain connectivity patterns. We evaluate CP-SSM on two benchmark fMRI datasets: ABIDE and ADNI. Experimental results demonstrate that CP-SSM surpasses Transformer-based models in classification performance while significantly reducing computational complexity. These findings highlight the effectiveness and efficiency of CP-SSM in modeling brain functional connectivity, offering a promising direction for neuroimaging-based neurological disease diagnosis.
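
The linear complexity attributed to state-space models comes from a recurrence that visits each sequence element once. The sketch below is a deliberately simplified scalar, time-invariant recurrence and not Mamba itself (Mamba makes its parameters input-dependent); the function name `ssm_scan` and the parameter names `a`, `b`, `c` are assumptions chosen for illustration.

```python
from typing import List, Sequence


def ssm_scan(xs: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Scalar discrete state-space recurrence (illustrative only).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    One pass over the input, so the cost grows linearly with sequence length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # state update carries information from earlier inputs
        ys.append(c * h)   # readout of the current state
    return ys
```

Because the state `h` summarizes the whole prefix, no pairwise comparison between positions is needed, which is where the contrast with quadratic self-attention arises.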

Rui Yang,Lin Song,Yicheng Xiao,Runhui Huang,Yixiao Ge,Ying Shan,Hengshuang Zhao

Task: Propose a simple yet efficient method to construct a baseline for native, end-to-end large multi-modal models in a single transformer.

Motivation: Most large multi-modal models (LMMs) model visual and textual modalities separately, while native single-transformer LMMs are resource-intensive and often lag behind their compositional counterparts.

Details

Method: Proposes a new early-fusion LMM that fuses multi-modal inputs at an early stage and responds to visual instructions auto-regressively, together with an efficient training recipe that harnesses the prior knowledge of pre-trained models. Result: The proposed model outperforms other single-transformer LMMs and significantly narrows the performance gap with compositional LMMs. Conclusion: The method addresses both resource consumption and performance limitations, providing a new baseline for native end-to-end large multi-modal models. Abstract: Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.

Bayesian Modeling of Zero-Shot Classifications for Urban Flood Detection

Matt Franchi,Nikhil Garg,Wendy Ju,Emma Pierson

Task: Propose BayFlood, a two-stage approach for detecting urban flooding.

Motivation: Street scene datasets lack reliable labels, which makes them hard to use directly for detecting urban flooding.

Details

Method: First, a pretrained vision-language model (VLM) performs zero-shot classification; then a spatial Bayesian model is fit on the VLM classifications. Result: VLMs provide a strong zero-shot signal across multiple cities and time periods, the Bayesian model improves out-of-sample prediction, and the inferred flood risk correlates with known external predictors of risk. Conclusion: BayFlood can improve urban flood detection: it reveals high-risk populations overlooked by current methods, identifies demographic biases in existing methods, and suggests locations for new flood sensors. Abstract: Street scene datasets, collected from Street View or dashboard cameras, offer a promising means of detecting urban objects and incidents like street flooding. However, a major challenge in using these datasets is their lack of reliable labels: there are myriad types of incidents, many types occur rarely, and ground-truth measures of where incidents occur are lacking. Here, we propose BayFlood, a two-stage approach which circumvents this difficulty. First, we perform zero-shot classification of where incidents occur using a pretrained vision-language model (VLM). Second, we fit a spatial Bayesian model on the VLM classifications. The zero-shot approach avoids the need to annotate large training sets, and the Bayesian model provides frequent desiderata in urban settings - principled measures of uncertainty, smoothing across locations, and incorporation of external data like stormwater accumulation zones. We comprehensively validate this two-stage approach, showing that VLMs provide strong zero-shot signal for floods across multiple cities and time periods, the Bayesian model improves out-of-sample prediction relative to baseline methods, and our inferred flood risk correlates with known external predictors of risk. Having validated our approach, we show it can be used to improve urban flood detection: our analysis reveals 113,738 people who are at high risk of flooding overlooked by current methods, identifies demographic biases in existing methods, and suggests locations for new flood sensors.
More broadly, our results showcase how Bayesian modeling of zero-shot LM annotations represents a promising paradigm because it avoids the need to collect large labeled datasets and leverages the power of foundation models while providing the expressiveness and uncertainty quantification of Bayesian models.
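
To give a feel for why a Bayesian layer on top of zero-shot labels helps, the sketch below applies a conjugate Beta-Binomial update to per-location counts of positive classifications. It is a much simpler stand-in than the spatial model the abstract describes (there is no smoothing across locations and no external covariates), and the function name and the uniform Beta(1, 1) prior are assumptions made for illustration.

```python
def posterior_flood_rate(positives: int, total: int,
                         prior_a: float = 1.0, prior_b: float = 1.0) -> float:
    """Posterior mean of a location's flood rate under a Beta prior.

    positives: images at this location that the classifier flagged as flooded
    total:     images classified at this location
    """
    if positives < 0 or total < positives:
        raise ValueError("need 0 <= positives <= total")
    # Beta(prior_a, prior_b) prior + Binomial likelihood gives a
    # Beta(prior_a + positives, prior_b + total - positives) posterior.
    return (prior_a + positives) / (prior_a + prior_b + total)


# A location with a single flagged image is pulled toward the prior,
# while a location with many images stays close to its raw rate.
sparse = posterior_flood_rate(1, 1)      # raw rate 1.0
dense = posterior_flood_rate(50, 100)    # raw rate 0.5
```

The shrinkage of sparsely observed locations is one simple form of the principled uncertainty handling that the abstract attributes to the Bayesian stage.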

SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis

Hou In Ivan Tam,Hou In Derek Pun,Austin T. Wang,Angel X. Chang,Manolis Savva

Task: Present SceneEval, an evaluation framework for text-conditioned 3D indoor scene generation methods.

Motivation: Existing evaluation methods mainly assess the realism of generated scenes and overlook alignment with the input text, a critical factor in whether a method meets user requirements.

Details

Method: SceneEval includes metrics for explicit user requirements (such as the presence of specific objects and their attributes) and implicit expectations (such as the absence of object collisions), and introduces the SceneEval-100 dataset. Result: The evaluation shows that current methods struggle to generate scenes that meet user requirements. Conclusion: SceneEval provides detailed assessments of generated scenes, highlighting the strengths of current methods and areas for improvement, and underscores the need for further research. Abstract: Despite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics primarily assess the realism of generated scenes by comparing them to a set of ground-truth scenes, often overlooking alignment with the input text - a critical factor in determining how effectively a method meets user requirements. We present SceneEval, an evaluation framework designed to address this limitation. SceneEval includes metrics for both explicit user requirements, such as the presence of specific objects and their attributes described in the input text, and implicit expectations, like the absence of object collisions, providing a comprehensive assessment of scene quality. To facilitate evaluation, we introduce SceneEval-100, a dataset of scene descriptions with annotated ground-truth scene properties. We evaluate recent scene generation methods using SceneEval and demonstrate its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions. Our results show that current methods struggle at generating scenes that meet user requirements, underscoring the need for further research in this direction.

Involution and BSConv Multi-Depth Distillation Network for Lightweight Image Super-Resolution

Akram Khatami-Rizi,Ahmad Mahmoudi-Aznaveh

Task: Reconstruct high-resolution images from low-resolution inputs.

Motivation: Address the growth in parameters and memory usage, and the slower training, that come with deeper networks in single image super-resolution.

Details

Method: Proposes IBMDN, a lightweight model that combines the Involution & BSConv Multi-Depth Distillation Block (IBMDB) with the Contrast and High-Frequency Attention Block (CHFAB). Result: The method reduces complexity while improving evaluation metrics such as PSNR and SSIM; it also reduces memory usage in transformer-based models and enhances perceptual quality in GANs. Conclusion: Experiments show that the method achieves high accuracy with minimal computational cost. Abstract: Single Image Super-Resolution (SISR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs. Deep learning, especially Convolutional Neural Networks (CNNs), has advanced SISR. However, increasing network depth increases parameter count and memory usage and slows training, which is problematic for resource-limited devices. To address this, lightweight models are developed to balance accuracy and efficiency. We propose the Involution & BSConv Multi-Depth Distillation Network (IBMDN), combining Involution & BSConv Multi-Depth Distillation Block (IBMDB) and the Contrast and High-Frequency Attention Block (CHFAB). IBMDB integrates Involution and BSConv to balance computational efficiency and feature extraction. CHFAB enhances high-frequency details for better visual quality. IBMDB is compatible with other SISR architectures and reduces complexity, improving evaluation metrics like PSNR and SSIM. In transformer-based models, IBMDB reduces memory usage while improving feature extraction. In GANs, it enhances perceptual quality, balancing pixel-level accuracy with perceptual details. Our experiments show that the method achieves high accuracy with minimal computational cost. The code is available at GitHub.
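
PSNR, one of the metrics named in the abstract, is defined from the mean squared error between a reference image and a restored one. The sketch below uses the standard definition on flat lists of pixel values; the function name `psnr` and the 8-bit peak value of 255 are assumptions, and the paper's own evaluation protocol (color space, border cropping) is not specified in the abstract.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels for two equally sized images."""
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a reconstruction closer to the reference; because PSNR is purely pixel-wise, it is commonly reported together with SSIM, as the abstract does.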

On the Robustness Tradeoff in Fine-Tuning

Kunyang Li,Jean-Charles Noirot Ferrand,Ryan Sheatsley,Blaine Hoak,Yohan Beugin,Eric Pauley,Patrick McDaniel

Task: Study the trade-off between robustness and accuracy when fine-tuning pre-trained models for downstream tasks.

Motivation: Fine-tuning has become the standard practice for adapting pre-trained models to downstream tasks, but its impact on model robustness is not well understood.

Details

Method: Evaluates the robustness and accuracy of fine-tuned models over 6 benchmark datasets and 7 fine-tuning strategies. Result: There is a consistent trade-off between adversarial robustness and accuracy. Peripheral updates such as BitFit are more effective on simple tasks, whereas fine-tuning information-heavy layers (such as attention layers) via Compacter achieves a better Pareto frontier on complex tasks. Conclusion: The findings emphasize the need for robustness-aware fine-tuning to ensure reliable real-world deployment. Abstract: Fine-tuning has become the standard practice for adapting pre-trained (upstream) models to downstream tasks. However, the impact on model robustness is not well understood. In this work, we characterize the robustness-accuracy trade-off in fine-tuning. We evaluate the robustness and accuracy of fine-tuned models over 6 benchmark datasets and 7 different fine-tuning strategies. We observe a consistent trade-off between adversarial robustness and accuracy. Peripheral updates such as BitFit are more effective for simple tasks--over 75% above the average measured with area under the Pareto frontiers on CIFAR-10 and CIFAR-100. In contrast, fine-tuning information-heavy layers, such as attention layers via Compacter, achieves a better Pareto frontier on more complex tasks--57.5% and 34.6% above the average on Caltech-256 and CUB-200, respectively. Lastly, we observe that robustness of fine-tuning against out-of-distribution data closely tracks accuracy. These insights emphasize the need for robustness-aware fine-tuning to ensure reliable real-world deployments.
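
The abstract compares strategies by the area under their Pareto frontiers. One plausible way to compute such a quantity is sketched below: keep the non-dominated (accuracy, robustness) points and integrate robustness over accuracy with the trapezoidal rule. The abstract does not say how the authors compute the area, so the integration rule, the function names, and the sample points are all assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (accuracy, robustness), both higher-is-better


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return non-dominated points, sorted by increasing accuracy."""
    frontier: List[Point] = []
    best_robustness = float("-inf")
    # Scan from the most accurate point down; a point survives only if it
    # is more robust than every point that is at least as accurate.
    for acc, rob in sorted(points, key=lambda p: (-p[0], -p[1])):
        if rob > best_robustness:
            frontier.append((acc, rob))
            best_robustness = rob
    return frontier[::-1]


def area_under_frontier(points: List[Point]) -> float:
    """Trapezoidal area under the frontier (assumed definition)."""
    frontier = pareto_frontier(points)
    area = 0.0
    for (a0, r0), (a1, r1) in zip(frontier, frontier[1:]):
        area += (a1 - a0) * (r0 + r1) / 2.0
    return area


# Hypothetical checkpoints from one fine-tuning strategy.
checkpoints = [(0.9, 0.1), (0.8, 0.3), (0.7, 0.2), (0.6, 0.5)]
```

With these made-up points, `(0.7, 0.2)` is dominated by `(0.8, 0.3)` and drops out, leaving three frontier points. A larger area means the strategy offers better robustness at comparable accuracy across its checkpoints.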

ClimateGS: Real-Time Climate Simulation with 3D Gaussian Style Transfer

Yuezhen Xie,Meiying Zhang,Qi Hao

Task: Present ClimateGS, a novel framework that integrates 3D Gaussian representations with physical simulation for real-time rendering of climate effects.

Motivation: Adverse climate conditions pose significant challenges for autonomous systems, which need reliable perception and decision-making. Existing physically-based NeRF rendering methods generate realistic scene representations but suffer from slow rendering and long preprocessing times, making them unsuitable for real-time testing and user interaction.

Details

Method: 1) a linear transformation for 3D Gaussian photorealistic style transfer that directly modifies spherical harmonics for efficient and consistent style adaptation; 2) a joint training strategy that combines supervised and self-supervised learning to accelerate convergence while preserving original scene details; 3) a real-time rendering method that integrates physics-based effects with 3D Gaussians for efficient and realistic rendering. Result: Evaluated on MipNeRF360 and Tanks and Temples, ClimateGS demonstrates real-time rendering with visual quality comparable or superior to SOTA 2D/3D methods, making it suitable for interactive applications. Conclusion: By combining 3D Gaussian representations with physical simulation, ClimateGS renders climate effects in real time, efficiently and realistically, and is suitable for interactive applications. Abstract: Adverse climate conditions pose significant challenges for autonomous systems, demanding reliable perception and decision-making across diverse environments. To better simulate these conditions, physically-based NeRF rendering methods have been explored for their ability to generate realistic scene representations. However, these methods suffer from slow rendering speeds and long preprocessing times, making them impractical for real-time testing and user interaction. This paper presents ClimateGS, a novel framework integrating 3D Gaussian representations with physical simulation to enable real-time climate effects rendering. The novelty of this work is threefold: 1) developing a linear transformation for 3D Gaussian photorealistic style transfer, enabling direct modification of spherical harmonics across bands for efficient and consistent style adaptation; 2) developing a joint training strategy for 3D style transfer, combining supervised and self-supervised learning to accelerate convergence while preserving original scene details; 3) developing a real-time rendering method for climate simulation, integrating physics-based effects with 3D Gaussian to achieve efficient and realistic rendering. We evaluate ClimateGS on MipNeRF360 and Tanks and Temples, demonstrating real-time rendering with comparable or superior visual quality to SOTA 2D/3D methods, making it suitable for interactive applications.

Exploring the Limits of KV Cache Compression in Visual Autoregressive Transformers

Bo Chen,Xiaoyu Li,Yekun Ke,Yingyu Liang,Zhenmei Shi,Zhao Song

Task: Formally define the KV-cache compression problem for Visual Autoregressive transformers and prove that sequential visual token generation under attention-based architectures must use at least $\Omega(n^2 d)$ memory.

Motivation: Visual Autoregressive models must store previously generated representations during inference, which incurs substantial memory overhead. Although various compression techniques have tried to mitigate this, prior work has not explicitly formalized the KV-cache compression problem.

Details

Method: The proof is constructed via a reduction from a computational lower bound problem, leveraging randomized embedding techniques. Result: Any mechanism for sequential visual token generation under attention-based architectures must use at least $\Omega(n^2 d)$ memory, so truly sub-quadratic memory usage is impossible without additional structural constraints. Conclusion: The paper discusses how sparsity priors on visual representations can influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead. Abstract: A fundamental challenge in Visual Autoregressive models is the substantial memory overhead required during inference to store previously generated representations. Despite various attempts to mitigate this issue through compression techniques, prior works have not explicitly formalized the problem of KV-cache compression in this context. In this work, we take the first step in formally defining the KV-cache compression problem for Visual Autoregressive transformers. We then establish a fundamental negative result, proving that any mechanism for sequential visual token generation under attention-based architectures must use at least $\Omega(n^2 d)$ memory, when $d = \Omega(\log n)$, where $n$ is the number of tokens generated and $d$ is the embedding dimensionality. This result demonstrates that achieving truly sub-quadratic memory usage is impossible without additional structural constraints. Our proof is constructed via a reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction principles. Finally, we discuss how sparsity priors on visual representations can influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.

He Huang,Yong Chen,Yujun Guo,Wei He

Task: Propose a self-supervised unknown-to-known degradation transformation framework (U2K) for blind hyperspectral image fusion.

Motivation: Existing supervised learning methods perform well when the degradation of the test data matches that of the training data, but struggle with unknown degradations. U2K is proposed to unleash the potential and generalization ability of supervised learning methods.

Details

Method: U2K comprises spatial and spectral Degradation Wrapping (DW) modules and Degradation Transformation (DT) modules, and is trained in a self-supervised manner using consistency loss and greedy alternating optimization. Result: Experiments show that U2K significantly improves the adaptability of five existing supervised learning methods under various degradation settings and surpasses existing blind methods. Conclusion: The proposed U2K framework effectively improves the flexibility and adaptability of blind hyperspectral image fusion. Abstract: Hyperspectral image (HSI) fusion is an efficient technique that combines low-resolution HSI (LR-HSI) and high-resolution multispectral images (HR-MSI) to generate high-resolution HSI (HR-HSI). Existing supervised learning methods (SLMs) can yield promising results when test data degradation matches that of the training data, but they face challenges in generalizing to unknown degradations. To unleash the potential and generalization ability of SLMs, we propose a novel self-supervised unknown-to-known degradation transformation framework (U2K) for blind HSI fusion, which adaptively transforms unknown degradation into the same type of degradation as those handled by pre-trained SLMs. Specifically, the proposed U2K framework consists of: (1) spatial and spectral Degradation Wrapping (DW) modules that map HR-HSI to unknown degraded HR-MSI and LR-HSI, and (2) Degradation Transformation (DT) modules that convert these wrapped data into predefined degradation patterns. The transformed HR-MSI and LR-HSI pairs are then processed by a pre-trained network to reconstruct the target HR-HSI. We train the U2K framework in a self-supervised manner using consistency loss and greedy alternating optimization, significantly improving the flexibility of blind HSI fusion. Extensive experiments confirm the effectiveness of our proposed U2K framework in boosting the adaptability of five existing SLMs under various degradation settings and surpassing state-of-the-art blind methods.

FetalFlex: Anatomy-Guided Diffusion Model for Flexible Control on Fetal Ultrasound Image Synthesis

Yaofei Duan,Tao Tan,Zhiyuan Zhu,Yuhao Huang,Yuanji Zhang,Rui Gao,Patrick Cheong-Iao Pang,Xinru Gao,Guowei Tao,Xiang Cong,Zhou Li,Lianying Liang,Guangzhi He,Linliang Yin,Xuedong Deng,Xin Yang,Dong Ni

Task: Propose FetalFlex, a flexible fetal ultrasound image generation framework, to address the difficulty of obtaining multi-plane annotated fetal ultrasound datasets.

Motivation: Obtaining a comprehensive, multi-plane annotated fetal ultrasound dataset is challenging, especially for rare or complex anomalies, which hinders both the training of novice radiologists and the development of robust AI models.

Details

Method: FetalFlex leverages anatomical structures and multimodal information, using a pre-alignment module and a repaint strategy for controllable fetal ultrasound image synthesis, and a two-stage adaptive sampling strategy to progressively refine image quality. Result: FetalFlex achieves state-of-the-art performance on multi-center datasets, its generated images align closely with expert visual assessments, and they significantly improve the performance of six typical deep models in downstream classification and anomaly detection tasks. Conclusion: FetalFlex's anatomy-level controllable generation offers a unique advantage for anomaly simulation and for creating paired or counterfactual data at the pixel level. Abstract: Fetal ultrasound (US) examinations require the acquisition of multiple planes, each providing unique diagnostic information to evaluate fetal development and screen for congenital anomalies. However, obtaining a comprehensive, multi-plane annotated fetal US dataset remains challenging, particularly for rare or complex anomalies owing to their low incidence and numerous subtypes. This poses difficulties in training novice radiologists and developing robust AI models, especially for detecting abnormal fetuses. In this study, we introduce a Flexible Fetal US image generation framework (FetalFlex) to address these challenges, which leverages anatomical structures and multimodal information to enable controllable synthesis of fetal US images across diverse planes. Specifically, FetalFlex incorporates a pre-alignment module to enhance controllability and introduces a repaint strategy to ensure consistent texture and appearance. Moreover, a two-stage adaptive sampling strategy is developed to progressively refine image quality from coarse to fine levels. We believe that FetalFlex is the first method capable of generating both in-distribution normal and out-of-distribution abnormal fetal US images, without requiring any abnormal data. Experiments on multi-center datasets demonstrate that FetalFlex achieved state-of-the-art performance across multiple image quality metrics. A reader study further confirms the close alignment of the generated results with expert visual assessments. Furthermore, synthetic images by FetalFlex significantly improve the performance of six typical deep models in downstream classification and anomaly detection tasks.
Lastly, FetalFlex's anatomy-level controllable generation offers a unique advantage for anomaly simulation and creating paired or counterfactual data at the pixel level. The demo is available at: https://dyf1023.github.io/FetalFlex/.

POSTA: A Go-to Framework for Customized Artistic Poster Generation

Haoyu Chen,Xiaojie Xu,Wenbo Li,Jingjing Ren,Tian Ye,Songhua Liu,Ying-Cong Chen,Lei Zhu,Xinchao Wang

Task: Present POSTA, a modular framework powered by diffusion models and multimodal large language models (MLLMs) for customized artistic poster generation.

Motivation: Existing automatic poster design methods fall short in text accuracy, user customization, and aesthetic appeal, which limits their applicability in artistic domains such as movies and exhibitions.

Details

Method: POSTA consists of three modules: Background Diffusion generates a themed background, Design MLLM generates layout and typography elements consistent with the background style, and ArtText Diffusion applies additional stylization to key text elements. Result: POSTA outperforms existing models in text accuracy and aesthetic quality and demonstrates exceptional controllability and design diversity. Conclusion: Through its modular design and use of multimodal models, POSTA overcomes the limitations of existing automatic poster design methods and produces visually cohesive, appealing, customized artistic posters. Abstract: Poster design is a critical medium for visual communication. Prior work has explored automatic poster design using deep learning techniques, but these approaches lack text accuracy, user customization, and aesthetic appeal, limiting their applicability in artistic domains such as movies and exhibitions, where both clear content delivery and visual impact are essential. To address these limitations, we present POSTA: a modular framework powered by diffusion models and multimodal large language models (MLLMs) for customized artistic poster generation. The framework consists of three modules. Background Diffusion creates a themed background based on user input. Design MLLM then generates layout and typography elements that align with and complement the background style. Finally, to enhance the poster's aesthetic appeal, ArtText Diffusion applies additional stylization to key text elements. The final result is a visually cohesive and appealing poster, with a fully modular process that allows for complete customization. To train our models, we develop the PosterArt dataset, comprising high-quality artistic posters annotated with layout, typography, and pixel-level stylized text segmentation. Our comprehensive experimental analysis demonstrates POSTA's exceptional controllability and design diversity, outperforming existing models in both text accuracy and aesthetic quality.

A Language Vision Model Approach for Automated Tumor Contouring in Radiation Oncology

Yi Luo,Hamed Hooshangnejad,Xue Feng,Gaofeng Huang,Xiaojian Chen,Rui Zhang,Quan Chen,Wil Ngwa,Kai Ding

Task: Develop the Oncology Contouring Copilot (OCC) system, which combines AI with human oversight to perform precise tumor contouring from textual descriptions.

Motivation: Lung cancer is the leading cause of cancer-related mortality worldwide. Tumor delineation is complex and requires expertise that is often unavailable in resource-limited settings; advances in AI, particularly deep learning and natural language processing, offer potential solutions.

Details

Method: The OCC system first identifies nodule candidates from CT scans, then uses language vision models (such as GPT-4V) together with clinical descriptive text to reduce false positives, merging textual and visual data to automate tumor contouring. Result: Deploying OCC reduced the false discovery rate by 35.0% and false positives per scan by 72.4%, with an F1-score of 0.652 across the dataset. Conclusion: By using the latest language vision models, OCC improves contouring results, streamlines oncology treatment workflows by reducing manual processes, offers a scalable and intuitive framework for reducing false positives in radiotherapy planning, and introduces new medical language vision prompt techniques to reduce model hallucinations, showing the potential of language vision models for medical language vision challenges. Abstract: Background: Lung cancer ranks as the leading cause of cancer-related mortality worldwide. The complexity of tumor delineation, crucial for radiation therapy, requires expertise often unavailable in resource-limited settings. Artificial Intelligence (AI), particularly with advancements in deep learning (DL) and natural language processing (NLP), offers potential solutions yet is challenged by high false positive rates. Purpose: The Oncology Contouring Copilot (OCC) system is developed to leverage oncologist expertise for precise tumor contouring using textual descriptions, aiming to increase the efficiency of oncological workflows by combining the strengths of AI with human oversight. Methods: Our OCC system initially identifies nodule candidates from CT scans. Employing Language Vision Models (LVMs) like GPT-4V, OCC then effectively reduces false positives with clinical descriptive texts, merging textual and visual data to automate tumor delineation, designed to elevate the quality of oncology care by incorporating knowledge from experienced domain experts. Results: Deployments of the OCC system resulted in a significant reduction in the false discovery rate by 35.0%, a 72.4% decrease in false positives per scan, and an F1-score of 0.652 across our dataset for unbiased evaluation. Conclusions: OCC represents a significant advance in oncology care, particularly through the use of the latest LVMs to improve contouring results by (1) streamlining oncology treatment workflows by optimizing tumor delineation, reducing manual processes; (2) offering a scalable and intuitive framework to reduce false positives in radiotherapy planning using LVMs; (3) introducing novel medical language vision prompt techniques to minimize LVM hallucinations with an ablation study, and (4) conducting a comparative analysis of LVMs, highlighting their potential in addressing medical language vision challenges.
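
The abstract reports three detection metrics: false discovery rate, false positives per scan, and F1-score. As a reading aid, the sketch below shows how these quantities are conventionally computed from true-positive, false-positive, and false-negative counts. The function name `detection_metrics` and all counts are hypothetical and are not taken from the paper.

```python
def detection_metrics(tp: int, fp: int, fn: int, n_scans: int) -> dict:
    """Return precision, recall, F1, false discovery rate, and false positives per scan."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # FDR is the share of reported detections that are false: FP / (TP + FP).
        "fdr": fp / (tp + fp) if (tp + fp) else 0.0,
        "fp_per_scan": fp / n_scans,
    }


# Hypothetical counts for a candidate detector before and after a filtering stage.
before = detection_metrics(tp=80, fp=120, fn=20, n_scans=40)
after = detection_metrics(tp=76, fp=40, fn=24, n_scans=40)

for name, m in (("before", before), ("after", after)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

A filtering stage that removes many false positives while losing few true positives lowers both the false discovery rate and the false positives per scan, which is the pattern the reported results describe.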

A Novel Channel Boosted Residual CNN-Transformer with Regional-Boundary Learning for Breast Cancer Detection

Aamir Mehmood,Yue Hu,Saddam Hussain Khan

Task: Propose a new hybrid framework, CB-Res-RBCMT, for detailed cancer analysis of breast ultrasound images (BUSI).

Motivation: Existing deep CNNs and vision transformers (ViTs) show promising performance in tumor detection on breast ultrasound images, but model complexity and variations in image contrast, texture, and tumor morphology limit the effectiveness of current methods.

Details

Method: Proposes the hybrid framework CB-Res-RBCMT, which combines customized residual CNNs with new ViT components, using CMT blocks and new regional and boundary (RB) feature extraction operations to capture contrast and morphological variations, with multi-head attention providing global contextual interactions. Result: On the standard BUSI dataset, CB-Res-RBCMT achieves an F1-score of 95.57%, accuracy of 95.63%, sensitivity of 96.42%, and precision of 94.79%, outperforming existing ViT and CNN methods. Conclusion: The proposed CB-Res-RBCMT framework captures diverse features and delivers superior performance in BUSI cancer diagnosis. Abstract: Recent advancements in detecting tumors using deep learning on breast ultrasound images (BUSI) have demonstrated significant success. Deep CNNs and vision-transformers (ViTs) have demonstrated individually promising initial performance. However, challenges related to model complexity and contrast, texture, and tumor morphology variations introduce uncertainties that hinder the effectiveness of current methods. This study introduces a novel hybrid framework, CB-Res-RBCMT, combining customized residual CNNs and new ViT components for detailed BUSI cancer analysis. The proposed RBCMT uses stem convolution blocks with CNN Meet Transformer (CMT) blocks, followed by new Regional and Boundary (RB) feature extraction operations for capturing contrast and morphological variations. Moreover, the CMT block incorporates global contextual interactions through multi-head attention, enhancing computational efficiency with a lightweight design. Additionally, the customized inverse residual and stem CNNs within the CMT effectively extract local texture information and handle vanishing gradients. Finally, the new channel-boosted (CB) strategy enriches the feature diversity of the limited dataset by combining the original RBCMT channels with transfer learning-based residual CNN-generated maps. These diverse channels are processed through a spatial attention block for optimal pixel selection, reducing redundancy and improving the discrimination of minor contrast and texture variations. The proposed CB-Res-RBCMT achieves an F1-score of 95.57%, accuracy of 95.63%, sensitivity of 96.42%, and precision of 94.79% on the standard harmonized stringent BUSI dataset, outperforming existing ViT and CNN methods. These results demonstrate the versatility of our integrated CNN-Transformer framework in capturing diverse features and delivering superior performance in BUSI cancer diagnosis.

DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling

Jianbo Zhao,Taiyu Ban,Zhihao Liu,Hangning Zhou,Xiyang Wang,Qibin Zhou,Hailong Qin,Mu Yang,Lei Liu,Bin Li

Task: Propose Directional Rotary Position Embedding (DRoPE), a new method for improving trajectory generation in autonomous driving systems.

Motivation: Existing scene-centric, agent-centric, and query-centric frameworks face an "impossible triangle" among accuracy, computational time, and memory efficiency; a new approach is needed to break this limitation.

Details

Method: Proposes DRoPE, which introduces a uniform identity scalar into RoPE's 2D rotary transformation so that rotation angles align with realistic agent headings, naturally encoding relative angular information. Result: Theoretical analysis and empirical evaluation indicate that DRoPE can simultaneously optimize trajectory generation accuracy, time complexity, and space complexity, with significantly reduced space complexity. Conclusion: DRoPE performs well both in theory and in practice and addresses the trade-off among accuracy, computational time, and memory efficiency in existing methods. Abstract: Accurate and efficient modeling of agent interactions is essential for trajectory generation, the core of autonomous driving systems. Existing methods, scene-centric, agent-centric, and query-centric frameworks, each present distinct advantages and drawbacks, creating an impossible triangle among accuracy, computational time, and memory efficiency. To break this limitation, we propose Directional Rotary Position Embedding (DRoPE), a novel adaptation of Rotary Position Embedding (RoPE), originally developed in natural language processing. Unlike traditional relative position embedding (RPE), which introduces significant space complexity, RoPE efficiently encodes relative positions without explicitly increasing complexity but faces inherent limitations in handling angular information due to periodicity. DRoPE overcomes this limitation by introducing a uniform identity scalar into RoPE's 2D rotary transformation, aligning rotation angles with realistic agent headings to naturally encode relative angular information. We theoretically analyze DRoPE's correctness and efficiency, demonstrating its capability to simultaneously optimize trajectory generation accuracy, time complexity, and space complexity. Empirical evaluations compared with various state-of-the-art trajectory generation models confirm DRoPE's good performance and significantly reduced space complexity, indicating both theoretical soundness and practical effectiveness. The video documentation is available at https://drope-traj.github.io/.
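
The rotary idea in the abstract can be illustrated numerically. The sketch below is one reading of the abstract, not the authors' implementation: it rotates every 2D pair of a query and key vector by the same angle (a single shared scalar, here the agent heading in radians) and checks that the resulting attention score depends only on the difference between the two headings and is unchanged by a full 2π turn. The helper names `rotate_pairs` and `dot` and the vectors are hypothetical.

```python
import math


def rotate_pairs(vec, angle):
    """Rotate each consecutive 2D pair of `vec` by the same angle (one shared scalar)."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query/key features for two agents.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Headings 0.2 rad and 1.0 rad.
score_a = dot(rotate_pairs(q, 0.2), rotate_pairs(k, 1.0))
# Both headings shifted by 2.0 rad: the relative heading is still 0.8 rad.
score_b = dot(rotate_pairs(q, 2.2), rotate_pairs(k, 3.0))
# One heading wrapped by a full turn: the same physical direction.
score_c = dot(rotate_pairs(q, 0.2 + 2 * math.pi), rotate_pairs(k, 1.0))

print(score_a, score_b, score_c)
```

Because every pair shares one rotation angle, angles that differ by 2π produce identical rotations. Standard RoPE instead uses a different frequency for each pair, which is what makes periodic quantities such as headings awkward to encode.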

Texture-Aware StarGAN for CT data harmonisation

Francesco Di Feola,Ludovica Pompilio,Cecilia Assolito,Valerio Guarrasi,Paolo Soda

Task: Propose a novel texture-aware StarGAN for CT data harmonization, enabling one-to-many translations across different reconstruction kernels.

Motivation: CT plays a pivotal role in medical diagnosis, but variability across reconstruction kernels prevents data-driven approaches such as deep learning models from achieving reliable, generalizable performance. CT data harmonization, which standardizes data across sources or conditions, is a promising way to reduce this non-biological variance.

Details

Method: Proposes a texture-aware StarGAN model with a multi-scale texture loss function that embeds texture information across different spatial and angular scales into the harmonization process. Result: Extensive experiments on a publicly available dataset of 48667 chest CT slices from 197 patients, distributed over three reconstruction kernels, show that the method outperforms the baseline StarGAN. Conclusion: The proposed texture-aware StarGAN performs well for CT data harmonization and addresses kernel-induced texture variations. Abstract: Computed Tomography (CT) plays a pivotal role in medical diagnosis; however, variability across reconstruction kernels hinders data-driven approaches, such as deep learning models, from achieving reliable and generalized performance. To this end, CT data harmonization has emerged as a promising solution to minimize such non-biological variances by standardizing data across different sources or conditions. In this context, Generative Adversarial Networks (GANs) have proved to be a powerful framework for harmonization, framing it as a style-transfer problem. However, GAN-based approaches still face limitations in capturing complex relationships within the images, which are essential for effective harmonization. In this work, we propose a novel texture-aware StarGAN for CT data harmonization, enabling one-to-many translations across different reconstruction kernels. Although the StarGAN model has been successfully applied in other domains, its potential for CT data harmonization remains unexplored. Furthermore, our approach introduces a multi-scale texture loss function that embeds texture information across different spatial and angular scales into the harmonization process, effectively addressing kernel-induced texture variations. We conducted extensive experimentation on a publicly available dataset, utilizing a total of 48667 chest CT slices from 197 patients distributed over three different reconstruction kernels, demonstrating the superiority of our method over the baseline StarGAN.

World Models in Artificial Intelligence: Sensing, Learning, and Reasoning Like a Child

Javier Del Ser,Jesus L. Lobo,Heimo Müller,Andreas Holzinger

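Task: Explore how AI can move from pattern recognition to genuine understanding, adaptation, and reasoning.

Motivation: World models are widely used in reinforcement learning, but they lack the structured, adaptive representations that even young children develop intuitively.

Details

Method: Integrates statistical learning with advances in six key research areas: physics-informed learning, neurosymbolic learning, continual learning, causal inference, human-in-the-loop AI, and responsible AI. Result: Puts forward a dynamic, interpretable framework intended to let AI progress from pattern recognition to genuine understanding, adaptation, and reasoning. Conclusion: By integrating statistical learning with advances in these key research areas, AI can evolve toward genuine understanding, adaptation, and reasoning capabilities. Abstract: World Models help Artificial Intelligence (AI) predict outcomes, reason about its environment, and guide decision-making. While widely used in reinforcement learning, they lack the structured, adaptive representations that even young children intuitively develop. Advancing beyond pattern recognition requires dynamic, interpretable frameworks inspired by Piaget's cognitive development theory. We highlight six key research areas -- physics-informed learning, neurosymbolic learning, continual learning, causal inference, human-in-the-loop AI, and responsible AI -- as essential for enabling true reasoning in AI. By integrating statistical learning with advances in these areas, AI can evolve from pattern recognition to genuine understanding, adaptation and reasoning capabilities.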

A Review on Large Language Models for Visual Analytics

Navya Sonal Agarwal,Sanjay Kumar Sonbhadra

Task: Review the integration of large language models (LLMs) with visual analytics, covering foundational concepts, capabilities, and wide-ranging applications.

Motivation: Examine the transformative potential of LLMs in natural language understanding, natural language generation, dialogue systems, and text-to-media transformation, and how their synergy with visual analytics enhances data interpretation, visualization techniques, and interactive exploration.

Details

Method: Evaluates key tools and platforms (LIDA, Chat2VIS, Julius AI, and Zoho Analytics) and specialized multimodal models (ChartLlama and CharXIV), and systematically explores a taxonomy of LLM tasks, including natural language understanding (NLU), natural language generation (NLG), dialogue systems, and text-to-media transformation. Result: Provides a SWOT analysis of integrating LLMs with visual analytics, highlighting strengths (accessibility and flexibility), weaknesses (computational demands and biases), opportunities (multimodal integration and user collaboration), and threats (privacy concerns and skill degradation). Conclusion: Emphasizes the importance of addressing ethical considerations and methodological improvements for effective integration. Abstract: This paper provides a comprehensive review of the integration of Large Language Models (LLMs) with visual analytics, addressing their foundational concepts, capabilities, and wide-ranging applications. It begins by outlining the theoretical underpinnings of visual analytics and the transformative potential of LLMs, specifically focusing on their roles in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations. The review further investigates how the synergy between LLMs and visual analytics enhances data interpretation, visualization techniques, and interactive exploration capabilities. Key tools and platforms including LIDA, Chat2VIS, Julius AI, and Zoho Analytics, along with specialized multimodal models such as ChartLlama and CharXIV, are critically evaluated. The paper discusses their functionalities, strengths, and limitations in supporting data exploration, visualization enhancement, automated reporting, and insight extraction. The taxonomy of LLM tasks, ranging from natural language understanding (NLU), natural language generation (NLG), to dialogue systems and text-to-media transformations, is systematically explored. This review provides a SWOT analysis of integrating Large Language Models (LLMs) with visual analytics, highlighting strengths like accessibility and flexibility, weaknesses such as computational demands and biases, opportunities in multimodal integration and user collaboration, and threats including privacy concerns and skill degradation. It emphasizes addressing ethical considerations and methodological improvements for effective integration.

Beacon2Science: Enhancing STEREO/HI beacon data with machine learning for efficient CME tracking

Justin Le Louëdec,Maike Bauer,Tanja Amerstorfer,Jackie A. Davies

Task: Enhance STEREO/HI beacon data to improve the accuracy of real-time observation and forecasting of coronal mass ejections (CMEs).

Motivation: CMEs can trigger strong geomagnetic storms with potentially damaging effects on satellites and electrical devices, so observing and forecasting them in real time is crucial.

Details

Method: Presents a new pipeline, "Beacon2Science", that enhances the quality of beacon data (signal-to-noise ratio and spatial resolution) and increases its time resolution through learned interpolation to match the 40-minute resolution of science data. Result: The improved beacon images are comparable to science data and show better CME visibility than the original beacon data. Tracks extracted from enhanced beacon data are closer to those from science images, with a mean error of about 0.5° of elongation compared to 1° for the original beacon data. Conclusion: The work paves the way for application to forthcoming missions such as Vigil and PUNCH. Abstract: Observing and forecasting coronal mass ejections (CMEs) in real-time is crucial due to the strong geomagnetic storms they can generate that can have a potentially damaging effect, for example, on satellites and electrical devices. With its near-real-time availability, STEREO/HI beacon data is the perfect candidate for early forecasting of CMEs. However, previous work concluded that CME arrival prediction based on beacon data could not achieve the same accuracy as with high-resolution science data due to data gaps and lower quality. We present our novel pipeline entitled "Beacon2Science", bridging the gap between beacon and science data to improve CME tracking. Through this pipeline, we first enhance the quality (signal-to-noise ratio and spatial resolution) of beacon data. We then increase the time resolution of enhanced beacon images through learned interpolation to match science data's 40-minute resolution. We maximize information coherence between consecutive frames with adapted model architecture and loss functions through the different steps. The improved beacon images are comparable to science data, showing better CME visibility than the original beacon data. Furthermore, we compare CMEs tracked in beacon, enhanced beacon, and science images. The tracks extracted from enhanced beacon data are closer to those from science images, with a mean average error of $\sim 0.5^\circ$ of elongation compared to $1^\circ$ with original beacon data. The work presented in this paper paves the way for its application to forthcoming missions such as Vigil and PUNCH.

Euclid Quick Data Release (Q1). Active galactic nuclei identification using diffusion-based inpainting of Euclid VIS images

Euclid Collaboration,G. Stevens,S. Fotopoulou,M. N. Bremer,T. Matamoro Zatarain,K. Jahnke,B. Margalef-Bentabol,M. Huertas-Company,M. J. Smith,M. Walmsley,M. Salvato,M. Mezcua,A. Paulino-Afonso,M. Siudek,M. Talia,F. Ricci,W. Roster,N. Aghanim,B. Altieri,S. Andreon,H. Aussel,C. Baccigalupi,M. Baldi,S. Bardelli,P. Battaglia,A. Biviano,A. Bonchi,E. Branchini,M. Brescia,J. Brinchmann,S. Camera,G. Cañas-Herrera,V. Capobianco,C. Carbone,J. Carretero,M. Castellano,G. Castignani,S. Cavuoti,K. C. Chambers,A. Cimatti,C. Colodro-Conde,G. Congedo,C. J. Conselice,L. Conversi,Y. Copin,A. Costille,F. Courbin,H. M. Courtois,M. Cropper,A. Da Silva,H. Degaudenzi,G. De Lucia,C. Dolding,H. Dole,M. Douspis,F. Dubath,X. Dupac,S. Dusini,S. Escoffier,M. Farina,S. Ferriol,K. George,C. Giocoli,B. R. Granett,A. Grazian,F. Grupp,S. V. H. Haugan,I. M. Hook,F. Hormuth,A. Hornstrup,P. Hudelot,M. Jhabvala,E. Keihänen,S. Kermiche,A. Kiessling,M. Kilbinger,B. Kubik,M. Kümmel,H. Kurki-Suonio,Q. Le Boulc'h,A. M. C. Le Brun,D. Le Mignant,P. B. Lilje,V. Lindholm,I. Lloro,G. Mainetti,D. Maino,E. Maiorano,O. Marggraf,M. Martinelli,N. Martinet,F. Marulli,R. Massey,S. Maurogordato,H. J. McCracken,E. Medinaceli,S. Mei,M. Melchior,M. Meneghetti,E. Merlin,G. Meylan,A. Mora,M. Moresco,L. Moscardini,R. Nakajima,C. Neissner,S. -M. Niemi,C. Padilla,S. Paltani,F. Pasian,K. Pedersen,W. J. Percival,V. Pettorino,G. Polenta,M. Poncet,L. A. Popa,L. Pozzetti,F. Raison,R. Rebolo,A. Renzi,J. Rhodes,G. Riccio,E. Romelli,M. Roncarelli,R. Saglia,A. G. Sánchez,D. Sapone,J. A. Schewtschenko,M. Schirmer,P. Schneider,T. Schrabback,A. Secroun,S. Serrano,P. Simon,C. Sirignano,G. Sirri,J. Skottfelt,L. Stanco,J. Steinwagner,P. Tallada-Crespí,A. N. Taylor,I. Tereno,S. Toft,R. Toledo-Moreo,F. Torradeflot,I. Tutusaus,L. Valenziano,J. Valiviita,T. Vassallo,G. Verdoes Kleijn,A. Veropalumbo,Y. Wang,J. Weller,A. Zacchei,G. Zamorani,F. M. Zerbi,I. A. Zinchenko,E. Zucca,V. Allevato,M. Ballardini,M. Bolzonella,E. Bozzo,C. Burigana,R. 
Cabanac,A. Cappi,J. A. Escartin Vigo,L. Gabarra,W. G. Hartley,J. Martín-Fleitas,S. Matthew,R. B. Metcalf,A. Pezzotta,M. Pöntinen,I. Risso,V. Scottez,M. Sereno,M. Tenti,M. Wiesmann,Y. Akrami,S. Alvi,I. T. Andika,S. Anselmi,M. Archidiacono,F. Atrio-Barandela,D. Bertacca,M. Bethermin,L. Bisigello,A. Blanchard,L. Blot,S. Borgani,M. L. Brown,S. Bruton,A. Calabro,F. Caro,T. Castro,F. Cogato,S. Davini,G. Desprez,A. Díaz-Sánchez,J. J. Diaz,S. Di Domizio,J. M. Diego,P. -A. Duc,A. Enia,Y. Fang,A. G. Ferrari,A. Finoguenov,A. Fontana,A. Franco,J. García-Bellido,T. Gasparetto,V. Gautard,E. Gaztanaga,F. Giacomini,F. Gianotti,M. Guidi,C. M. Gutierrez,A. Hall,S. Hemmati,H. Hildebrandt,J. Hjorth,J. J. E. Kajava,Y. Kang,V. Kansal,D. Karagiannis,C. C. Kirkpatrick,S. Kruk,L. Legrand,M. Lembo,F. Lepori,G. Leroy,J. Lesgourgues,L. Leuzzi,T. I. Liaudat,J. Macias-Perez,M. Magliocchetti,F. Mannucci,R. Maoli,C. J. A. P. Martins,L. Maurin,M. Miluzio,P. Monaco,G. Morgante,K. Naidoo,A. Navarro-Alsina,F. Passalacqua,K. Paterson,L. Patrizii,A. Pisani,D. Potter,S. Quai,M. Radovich,P. -F. Rocci,G. Rodighiero,S. Sacquegna,M. Sahlén,D. B. Sanders,E. Sarpa,A. Schneider,M. Schultheis,D. Sciotti,E. Sellentin,F. Shankar,L. C. Smith,K. Tanidis,G. Testera,R. Teyssier,S. Tosi,A. Troja,M. Tucci,C. Valieri,D. Vergani,G. Verza,N. A. Walton

Task: Propose a new method for identifying active galactic nuclei (AGN) and quasi-stellar objects (QSO) from a single image.

Motivation: Traditional AGN and QSO identification usually requires multi-wavelength observations; this work aims for high-completeness identification from a single image.

Details

Method: Using the spatial resolving power of Euclid VIS images, trains a diffusion model that reconstructs galaxy light distributions, and identifies AGN and QSO from the reconstruction error of the masked central pixels. Result: Using VIS imaging alone, the method achieves high completeness compared to traditional selection methods (optical, near-infrared, mid-infrared, and X-ray). Conclusion: The method performs well at identifying AGN and QSO from a single image and offers a new tool for future astronomical observations. Abstract: Light emission from galaxies exhibits diverse brightness profiles, influenced by factors such as galaxy type, structural features and interactions with other galaxies. Elliptical galaxies feature more uniform light distributions, while spiral and irregular galaxies have complex, varied light profiles due to their structural heterogeneity and star-forming activity. In addition, galaxies with an active galactic nucleus (AGN) feature intense, concentrated emission from gas accretion around supermassive black holes, superimposed on regular galactic light, while quasi-stellar objects (QSO) are the extreme case of the AGN emission dominating the galaxy. The challenge of identifying AGN and QSO has been discussed many times in the literature, often requiring multi-wavelength observations. This paper introduces a novel approach to identify AGN and QSO from a single image. Diffusion models have been recently developed in the machine-learning literature to generate realistic-looking images of everyday objects. Utilising the spatial resolving power of the Euclid VIS images, we created a diffusion model trained on one million sources, without using any source pre-selection or labels. The model learns to reconstruct light distributions of normal galaxies, since the population is dominated by them. We condition the prediction of the central light distribution by masking the central few pixels of each source and reconstruct the light according to the diffusion model. We further use this prediction to identify sources that deviate from this profile by examining the reconstruction error of the few central pixels regenerated in each source's core. Our approach, solely using VIS imaging, features high completeness compared to traditional methods of AGN and QSO selection, including optical, near-infrared, mid-infrared, and X-rays. [abridged]
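
The selection step described in the abstract, scoring each source by the reconstruction error of its regenerated central pixels, can be sketched as follows. This toy example does not run a diffusion model: the `predicted` image is a hand-made stand-in for an inpainted cutout, and `central_error`, `make_image`, the 7x7 cutout size, and all flux values are hypothetical.

```python
def central_error(image, reconstruction, half_width=1):
    """Mean squared error over the central (2*half_width + 1)^2 pixels of a square cutout."""
    c = len(image) // 2
    idx = range(c - half_width, c + half_width + 1)
    diffs = [(image[r][col] - reconstruction[r][col]) ** 2 for r in idx for col in idx]
    return sum(diffs) / len(diffs)


def make_image(size, core_flux):
    """Flat toy cutout with a single bright central pixel."""
    img = [[1.0] * size for _ in range(size)]
    img[size // 2][size // 2] = core_flux
    return img


# Stand-in for what an inpainting model trained mostly on normal galaxies would fill in.
predicted = make_image(7, 2.0)
normal = make_image(7, 2.1)    # core close to the prediction
agn_like = make_image(7, 9.0)  # core much brighter than predicted

err_normal = central_error(normal, predicted)
err_agn = central_error(agn_like, predicted)
print(err_normal, err_agn)
```

A source whose core is far brighter than the model expects yields a large central error, so ranking sources by this score and applying a threshold gives a candidate list. How the threshold is chosen is not stated in the abstract.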

Abhi Kamboj,Minh N. Do

Task: Study the multimodal alignment problem, in particular unsupervised cross-modal transfer.

Motivation: Construct a joint latent vector space in which two modalities representing the same concept map to the same vector.

Details

Method: Formulates multimodal alignment as an inverse problem and shows that perfect alignment can be achieved under certain conditions. Assuming semantic classes are represented as a mixture of Gaussians in the latent space, shows how to perform cross-modal transfer. Result: Experiments on synthetic multimodal Gaussian data verify the effectiveness of the perfect alignment and cross-modal transfer method. Conclusion: The authors hope the findings inspire further exploration of applications of perfect alignment and of Gaussian models for cross-modal learning. Abstract: Multimodal alignment aims to construct a joint latent vector space where two modalities representing the same concept map to the same vector. We formulate this as an inverse problem and show that under certain conditions perfect alignment can be achieved. We then address a specific application of alignment referred to as cross-modal transfer. Unsupervised cross-modal transfer aims to leverage a model trained with one modality to perform inference on another modality, without any labeled fine-tuning on the new modality. Assuming that semantic classes are represented as a mixture of Gaussians in the latent space, we show how cross-modal transfer can be performed by projecting the data points from the representation space onto different subspaces representing each modality. Our experiments on synthetic multimodal Gaussian data verify the effectiveness of our perfect alignment and cross-modal transfer method. We hope these findings inspire further exploration of the applications of perfect alignment and the use of Gaussian models for cross-modal learning.

SemEval-2025 Task 1: AdMIRe -- Advancing Multimodal Idiomaticity Representation

Thomas Pickard,Aline Villavicencio,Maggie Mi,Wei He,Dylan Phelps,Carolina Scarton,Marco Idiart

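Task: Assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages.

Motivation: Idiomatic expressions pose a unique challenge in NLP because their meanings often cannot be inferred directly from their constituent words. Despite advances in large language models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation.

Details

Method: Presents the datasets and tasks for SemEval-2025 Task 1: AdMIRe (Advancing Multimodal Idiomaticity Representation), with two subtasks: ranking images by their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. Result: The most effective methods reached human-level performance by using pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries to smooth over weaknesses in these models' representations of idiomaticity. Conclusion: Multimodal, multilingual approaches can substantially improve models' understanding and representation of idiomatic expressions. Abstract: Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in Large Language Models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and tasks for SemEval-2025 Task 1: AdMIRe (Advancing Multimodal Idiomaticity Representation), which challenges the community to assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages. Participants competed in two subtasks: ranking images based on their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models' representations of idiomaticity.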

FedSCA: Federated Tuning with Similarity-guided Collaborative Aggregation for Heterogeneous Medical Image Segmentation

Yumin Zhang,Yan Gao,Haoran Duan,Hanqing Guo,Tejal Shah,Rajiv Ranjan,Bo Wei

Task: Propose FedSCA, a new federated learning framework for fine-tuning foundation models for medical image segmentation.

Motivation: Limited medical image dataset sizes and privacy restrictions on data centralization hinder the use of foundation models in medical image segmentation. Combining federated learning with foundation model fine-tuning can address this, but non-IID data and computational and communication constraints remain challenging.

Details

Method: Proposes the FedSCA framework, comprising parameter-efficient fine-tuning (PEFT), partial low-level adapter transmission, and similarity-guided collaborative aggregation (SGCA). Result: On three federated learning benchmarks for medical image segmentation, FedSCA establishes new SOTA performance. Conclusion: FedSCA addresses non-IID data and computational and communication constraints in federated settings and improves medical image segmentation performance. Abstract: Transformer-based foundation models (FMs) have recently demonstrated remarkable performance in medical image segmentation. However, scaling these models is challenging due to the limited size of medical image datasets within isolated hospitals, where data centralization is restricted due to privacy concerns. These constraints, combined with the data-intensive nature of FMs, hinder their broader application. Integrating federated learning (FL) with foundation models (FLFM) fine-tuning offers a potential solution to these challenges by enabling collaborative model training without data sharing, thus allowing FMs to take advantage of a diverse pool of sensitive medical image data across hospitals/clients. However, non-independent and identically distributed (non-IID) data among clients, paired with computational and communication constraints in federated environments, presents an additional challenge that limits further performance improvements and remains inadequately addressed in existing studies. In this work, we propose a novel FLFM fine-tuning framework, Federated tuning with Similarity-guided Collaborative Aggregation (FedSCA), encompassing all phases of the FL process. This includes (1) specially designed parameter-efficient fine-tuning (PEFT) for local client training to enhance computational efficiency; (2) partial low-level adapter transmission for communication efficiency; and (3) similarity-guided collaborative aggregation (SGCA) on the server side to address non-IID issues. Extensive experiments on three FL benchmarks for medical image segmentation demonstrate the effectiveness of our proposed FedSCA, establishing new SOTA performance.
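
The abstract does not spell out how similarity guides the server-side aggregation. The sketch below shows one generic form of the idea and should not be read as the paper's SGCA: each client update is weighted by its mean cosine similarity to the other clients' updates, so that an outlier update (for example, from a client with very different data) contributes less. The function names and the toy update vectors are hypothetical.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def similarity_weighted_average(client_updates):
    """Weight each client by its mean cosine similarity to the others, then average.

    Assumes at least two clients, all with update vectors of equal length.
    """
    n = len(client_updates)
    raw = []
    for i, u in enumerate(client_updates):
        sims = [cosine(u, v) for j, v in enumerate(client_updates) if j != i]
        raw.append(max(sum(sims) / len(sims), 0.0))  # clip negative similarity to zero
    total = sum(raw)
    # Fall back to a plain average if every client was clipped to zero.
    weights = [w / total for w in raw] if total else [1.0 / n] * n
    dim = len(client_updates[0])
    merged = [sum(w * u[d] for w, u in zip(weights, client_updates)) for d in range(dim)]
    return merged, weights


# Toy adapter updates from three clients; the third points in the opposite direction.
updates = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]]
merged, weights = similarity_weighted_average(updates)
print(weights, merged)
```

In this toy case the third client receives zero weight because its update is anti-aligned with the other two. A practical scheme would likely use softer weighting so that minority clients are not discarded entirely.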

Towards efficient keyword spotting using spike-based time difference encoders

Alejandro Pequeño-Zurro,Lyes Khacef,Stefano Panzeri,Elisabetta Chicca

Task: Explore the performance of the Temporal Difference Encoder (TDE) in keyword spotting.

Motivation: With voice-activated assistants in wide use, keyword spotting on edge devices is increasingly important, but deployment is often limited by the extreme low-power constraints of the target embedded systems.

Details

Method: Uses the TIdigits dataset and compares three spiking neural network (SNN) architectures for learning and classifying spatio-temporal signals. Result: The feedforward TDE network reaches 89% accuracy, higher than the feedforward CuBa-LIF network (71%) and close to the recurrent CuBa-LIF network (91%), while performing 92% fewer synaptic operations than the recurrent CuBa-LIF network. Conclusion: The TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns. Abstract: Keyword spotting in edge devices is becoming increasingly important as voice-activated assistants are widely used. However, its deployment is often limited by the extreme low-power constraints of the target embedded systems. Here, we explore the Temporal Difference Encoder (TDE) performance in keyword spotting. This recent neuron model encodes the time difference in instantaneous frequency and spike count to perform efficient keyword spotting with neuromorphic processors. We use the TIdigits dataset of spoken digits with a formant decomposition and rate-based encoding into spikes. We compare three Spiking Neural Network (SNN) architectures to learn and classify spatio-temporal signals. The proposed SNN architectures have three layers and differ in the hidden layer, which is composed of either (1) feedforward TDE, (2) feedforward Current-Based Leaky Integrate-and-Fire (CuBa-LIF), or (3) recurrent CuBa-LIF neurons. We first show that the spike trains of the frequency-converted spoken digits have a large amount of information in the temporal domain, reinforcing the importance of better exploiting temporal encoding for such a task. We then train the three SNNs with the same number of synaptic weights to quantify and compare their performance based on the accuracy and synaptic operations. The resulting accuracy of the feedforward TDE network (89%) is higher than the feedforward CuBa-LIF network (71%) and close to the recurrent CuBa-LIF network (91%). However, the feedforward TDE-based network performs 92% fewer synaptic operations than the recurrent CuBa-LIF network with the same number of synapses. In addition, the results of the TDE network are highly interpretable and correlated with the frequency and timescale features of the spoken keywords in the dataset. Our findings suggest that the TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns.
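
To make the phrase "encodes the time difference in instantaneous frequency and spike count" concrete, here is a minimal discrete-time sketch of a TDE-style unit. It assumes a common formulation that the abstract does not detail: a facilitatory spike starts an exponentially decaying gain, a later trigger spike injects a current scaled by the gain at that moment, and a current-based leaky integrate-and-fire neuron turns that current into output spikes. All parameter values and the function name `tde_spike_count` are hypothetical.

```python
import math


def tde_spike_count(dt_ms, tau_gain=20.0, tau_epsc=10.0, tau_mem=10.0,
                    weight=20.0, threshold=1.0, sim_ms=200.0, step_ms=0.1):
    """Count output spikes when the trigger arrives dt_ms after the facilitatory spike."""
    gain = math.exp(-dt_ms / tau_gain)  # facilitatory trace sampled at trigger time
    current = weight * gain             # synaptic current amplitude scaled by the gain
    v, spikes = 0.0, 0
    for _ in range(int(sim_ms / step_ms)):
        v += step_ms * (current - v) / tau_mem   # leaky integration (Euler step)
        current -= step_ms * current / tau_epsc  # synaptic current decays exponentially
        if v >= threshold:
            spikes += 1
            v = 0.0                              # reset after an output spike
    return spikes


# Shorter time differences should give at least as many output spikes.
counts = {dt: tde_spike_count(dt) for dt in (1.0, 10.0, 30.0, 80.0)}
print(counts)
```

In this sketch the spike count never increases as the time difference grows, so a downstream layer can read the delay between two input channels from the unit's output activity.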

Federated Continual 3D Segmentation With Single-round Communication

Can Peng,Qianhui Men,Pramit Saha,Qianye Yang,Cheng Ouyang,J. Alison Noble

Task: Propose a federated continual learning strategy that performs a one-time model aggregation at the server through multi-model distillation, suited to dynamic federated analysis settings.

Motivation: In dynamic settings, conventional federated learning incurs high communication and computation overhead and requires synchronized communication that is hard to achieve; a more efficient and scalable solution is needed.

Details

Method: Uses multi-model distillation to perform a one-time model aggregation at the server, reducing frequent server communication and reusing previous client models. Result: The approach is shown to be effective on a 3D abdominal CT segmentation task, reducing communication load and relaxing synchronization requirements among clients. Conclusion: The approach provides an efficient, scalable federated analysis framework for real-world applications and handles the challenges of dynamic settings. Abstract: Federated learning seeks to foster collaboration among distributed clients while preserving the privacy of their local data. Traditionally, federated learning methods assume a fixed setting in which client data and learning objectives remain constant. However, in real-world scenarios, new clients may join, and existing clients may expand the segmentation label set as task requirements evolve. In such a dynamic federated analysis setup, the conventional federated communication strategy of model aggregation per communication round is suboptimal. As new clients join, this strategy requires retraining, linearly increasing communication and computation overhead. It also imposes requirements for synchronized communication, which is difficult to achieve among distributed clients. In this paper, we propose a federated continual learning strategy that employs a one-time model aggregation at the server through multi-model distillation. This approach builds and updates the global model while eliminating the need for frequent server communication. When integrating new data streams or onboarding new clients, this approach efficiently reuses previous client models, avoiding the need to retrain the global model across the entire federation. By minimizing communication load and bypassing the need to put unchanged clients online, our approach relaxes synchronization requirements among clients, providing an efficient and scalable federated analysis framework suited for real-world applications. Using multi-class 3D abdominal CT segmentation as an application task, we demonstrate the effectiveness of the proposed approach.

LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding

Amirhossein Kazerouni,Soroush Mehraban,Michael Brudno,Babak Taati

Task: Propose LIFT, a new high-performance framework that captures multiscale information through meta-learning and addresses limitations of existing implicit neural representation (INR) frameworks.

Motivation: Existing INR frameworks often rely on global latent vectors or suffer from computational inefficiencies, limiting their broader applicability.

Details

Method: LIFT uses multiple parallel localized implicit functions and a hierarchical latent generator to produce unified latent representations spanning local, intermediate, and global features. ReLIFT is an enhanced variant of LIFT that adds residual connections and expressive frequency encodings. Result: LIFT achieves state-of-the-art performance in generative modeling and classification tasks with notable reductions in computational cost. ReLIFT is also effective in signal representation and inverse problem tasks. Conclusion: By capturing multiscale information and adding residual connections, LIFT and ReLIFT offer efficient yet powerful solutions that improve capacity and speed up convergence. Abstract: Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains, offering key advantages such as memory efficiency and resolution independence. Conventional deep learning models are typically modality-dependent, often requiring custom architectures and objectives for different types of signals. However, existing INR frameworks frequently rely on global latent vectors or exhibit computational inefficiencies that limit their broader applicability. We introduce LIFT, a novel, high-performance framework that addresses these challenges by capturing multiscale information through meta-learning. LIFT leverages multiple parallel localized implicit functions alongside a hierarchical latent generator to produce unified latent representations that span local, intermediate, and global features. This architecture facilitates smooth transitions across local regions, enhancing expressivity while maintaining inference efficiency. Additionally, we introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings. With this straightforward approach, ReLIFT effectively addresses the convergence-capacity gap found in comparable methods, providing an efficient yet powerful solution to improve capacity and speed up convergence. Empirical results show that LIFT achieves state-of-the-art (SOTA) performance in generative modeling and classification tasks, with notable reductions in computational costs. Moreover, in single-task settings, the streamlined ReLIFT architecture proves effective in signal representations and inverse problem tasks.

An Explainable Framework for Misinformation Identification via Critical Question Answering

Ramon Ruiz-Dolz,John Lawrence

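Task: Propose an explainable framework, based on argumentation schemes and critical questions, for detecting factual and rational misinformation.

Motivation: Existing natural language misinformation detection approaches rely mainly on sequence classification, producing opaque systems in which the reasons for classifying something as misinformation are unclear.

Details

Method: Creates and releases the NLAS-CQ corpus, which combines 3,566 textbook-like natural language argumentation scheme instances with 4,687 answers to critical questions related to these arguments. On this basis, implements and validates a new framework that combines classification with question answering to analyse arguments for misinformation and presents explanations to the user in the form of critical questions. Result: The new framework is implemented and validated on the NLAS-CQ corpus. Conclusion: The proposed framework detects factual and rational misinformation and explains its decisions through critical questions, improving the transparency and explainability of the system. Abstract: Natural language misinformation detection approaches have been, to date, largely dependent on sequence classification methods, producing opaque systems in which the reasons behind classification as misinformation are unclear. While an effort has been made in the area of automated fact-checking to propose explainable approaches to the problem, this is not the case for automated reason-checking systems. In this paper, we propose a new explainable framework for both factual and rational misinformation detection based on the theory of Argumentation Schemes and Critical Questions. For that purpose, we create and release NLAS-CQ, the first corpus combining 3,566 textbook-like natural language argumentation scheme instances and 4,687 corresponding answers to critical questions related to these arguments. On the basis of this corpus, we implement and validate our new framework which combines classification with question answering to analyse arguments in search of misinformation, and provides the explanations in the form of critical questions to the human user.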

ConQuer: A Framework for Concept-Based Quiz Generation

Yicheng Fu,Zikui Wang,Liuxin Yang,Meiqing Huo,Zhongdongming Dai

Task: Introduce ConQuer, a concept-based quiz generation framework that leverages external knowledge sources to improve the quality of AI-generated quizzes.

Motivation: Although LLMs have made quiz generation more efficient, concerns remain about the quality of AI-generated quizzes and their educational impact on students.

Details

Method: Introduces the ConQuer framework, which draws on external knowledge sources, and assesses the quality of the generated quizzes along comprehensive evaluation dimensions using LLMs as judges. Result: Experiments show a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons against baseline quiz sets. Ablation studies further confirm the effectiveness of each component of the framework. Conclusion: The ConQuer framework improves the quality of AI-generated quizzes and shows clear potential for educational applications. Abstract: Quizzes play a crucial role in education by reinforcing students' understanding of key concepts and encouraging self-directed exploration. However, compiling high-quality quizzes can be challenging and requires deep expertise and insight into specific subject matter. Although LLMs have greatly enhanced the efficiency of quiz generation, concerns remain regarding the quality of these AI-generated quizzes and their educational impact on students. To address these issues, we introduce ConQuer, a concept-based quiz generation framework that leverages external knowledge sources. We employ comprehensive evaluation dimensions to assess the quality of the generated quizzes, using LLMs as judges. Our experimental results demonstrate a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons against baseline quiz sets. Ablation studies further underscore the effectiveness of each component in our framework. Code available at https://github.com/sofyc/ConQuer.

Synthetic Data Generation of Body Motion Data by Neural Gas Network for Emotion Recognition

Seyed Muhammad Hossein Mousavi

Task: Use the Neural Gas Network (NGN) algorithm to generate diverse and generalizable body motion data for emotion recognition.

Motivation: Address the lack of diverse and robust datasets in body-motion-based emotion recognition.

Details

Method: Applies the Neural Gas Network (NGN) algorithm to generate body motion data, learning the topology of the skeletal structure to optimize diversity and generation speed. Result: The body motion data generated by the NGN algorithm is more realistic and emotionally distinct than that of existing methods, and is generated faster. Conclusion: The NGN algorithm performs well at generating body motion data and helps address the lack of dataset diversity and robustness. Abstract: In the domain of emotion recognition using body motion, the primary challenge lies in the scarcity of diverse and generalizable datasets. Automatic emotion recognition uses machine learning and artificial intelligence techniques to recognize a person's emotional state from various data types, such as text, images, sound, and body motion. Body motion poses unique challenges as many factors, such as age, gender, ethnicity, personality, and illness, affect its appearance, leading to a lack of diverse and robust datasets specifically for emotion recognition. To address this, employing Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks (GANs) and Variational Auto Encoders (VAEs), offers potential solutions, though these methods are often complex. This research introduces a novel application of the Neural Gas Network (NGN) algorithm for synthesizing body motion data and optimizing diversity and generation speed. By learning skeletal structure topology, the NGN fits the neurons or gas particles on body joints. Generated gas particles, which form the skeletal structure later on, will be used to synthesize the new body posture. By attaching body postures over frames, the final synthetic body motion appears. We compared our generated dataset against others generated by GANs, VAEs, and another benchmark algorithm, using benchmark metrics such as Fréchet Inception Distance (FID), Diversity, and a few more. Furthermore, we continued evaluation using classification metrics such as accuracy, precision, recall, and a few others. Joint-related features or kinematic parameters were extracted, and the system assessed model performance against unseen data. Our findings demonstrate that the NGN algorithm produces more realistic and emotionally distinct body motion data and does so at a higher synthesis speed than existing methods.
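
The abstract above describes fitting neural gas units ("gas particles") onto body joints. The following is a minimal sketch of the classic rank-based neural gas update on 2D points, not the authors' pipeline: the function names, the decay schedules, the initialisation, and the toy joint coordinates are all assumptions made for illustration.

```python
import math

def fit_neural_gas(points, n_units, steps=200,
                   eps=(0.5, 0.05), lam=(1.0, 0.01)):
    """Fit n_units prototype vectors ("gas particles") to 2D points."""
    # Assumption: initialise units on the first n_units data points.
    units = [list(p) for p in points[:n_units]]
    for t in range(steps):
        frac = t / max(steps - 1, 1)
        # Exponentially decay the learning rate and the neighbourhood range.
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac
        lam_t = lam[0] * (lam[1] / lam[0]) ** frac
        x = points[t % len(points)]
        # Rank units by distance to the current sample (rank 0 = closest).
        order = sorted(range(n_units), key=lambda i: math.dist(units[i], x))
        for rank, i in enumerate(order):
            # Closer-ranked units take a larger step toward the sample.
            step = eps_t * math.exp(-rank / lam_t)
            units[i][0] += step * (x[0] - units[i][0])
            units[i][1] += step * (x[1] - units[i][1])
    return units

def quantization_error(points, units):
    """Largest distance from any point to its nearest unit."""
    return max(min(math.dist(p, u) for u in units) for p in points)

# Hypothetical joint coordinates forming two well-separated groups.
joints = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
units = fit_neural_gas(joints, n_units=2)
```

Because every update is a convex combination of a unit and a sample, units initialised on data points stay inside the bounding box of the data. New postures would then be synthesised from the fitted units, a step this sketch leaves out.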

Generating Medically-Informed Explanations for Depression Detection using LLMs

Xiangyong Chen,Xiaochuan Lin

Task: Use a large language model for multi-task depression detection while generating textual explanations grounded in medical diagnostic criteria.

Motivation: Early detection of depression from social media data offers a valuable opportunity for timely intervention, but the task requires professional medical knowledge and accurate, explainable models.

Details

Method: Proposes LLM-MTD (Large Language Model for Multi-Task Depression Detection), which uses a pre-trained large language model to simultaneously classify social media posts for depression and generate textual explanations grounded in medical diagnostic criteria. Result: On the benchmark Reddit Self-Reported Depression Dataset (RSDD), LLM-MTD significantly outperforms baselines such as traditional machine learning and fine-tuned BERT in AUPRC and other key metrics. Conclusion: LLM-MTD combines the power of large language models with explainability, offering a novel approach to depression detection. Abstract: Early detection of depression from social media data offers a valuable opportunity for timely intervention. However, this task poses significant challenges, requiring both professional medical knowledge and the development of accurate and explainable models. In this paper, we propose LLM-MTD (Large Language Model for Multi-Task Depression Detection), a novel approach that leverages a pre-trained large language model to simultaneously classify social media posts for depression and generate textual explanations grounded in medical diagnostic criteria. We train our model using a multi-task learning framework with a combined loss function that optimizes both classification accuracy and explanation quality. We evaluate LLM-MTD on the benchmark Reddit Self-Reported Depression Dataset (RSDD) and compare its performance against several competitive baseline methods, including traditional machine learning and fine-tuned BERT. Our experimental results demonstrate that LLM-MTD achieves state-of-the-art performance in depression detection, showing significant improvements in AUPRC and other key metrics. Furthermore, human evaluation of the generated explanations reveals their relevance, completeness, and medical accuracy, highlighting the enhanced interpretability of our approach. This work contributes a novel methodology for depression detection that combines the power of large language models with the crucial aspect of explainability.

Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control

Hejia Chen,Haoxian Zhang,Shoulong Zhang,Xiaoqiang Liu,Sisi Zhuang,Yuan Zhang,Pengfei Wan,Di Zhang,Shuai Li

Task: Propose Cafe-Talk, a diffusion-transformer-based 3D talking face generation model that achieves accurate lip synchronization and controllable expressions.

Motivation: Existing methods use only discrete emotion labels to control expressions globally, which limits flexible fine-grained facial control in the spatiotemporal domain.

Details

Method: Uses a two-stage training pipeline that first trains the model with speech audio and coarse-grained conditions and then gradually adds fine-grained control conditions; designs a swap-label training mechanism and a mask-based CFG technique; and introduces a text-based detector to support natural language user input. Result: Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness, and its fine-grained control is widely accepted in user studies. Conclusion: With multimodal control conditions, Cafe-Talk achieves accurate lip synchronization and flexible expression control, with broad application prospects. Abstract: Speech-driven 3D talking face methods should offer both accurate lip synchronization and controllable expressions. Previous methods solely adopt discrete emotion labels to globally control expressions throughout sequences while limiting flexible fine-grained facial control within the spatiotemporal domain. We propose a diffusion-transformer-based 3D talking face generation model, Cafe-Talk, which simultaneously incorporates coarse- and fine-grained multimodal control conditions. Nevertheless, the entanglement of multiple conditions challenges achieving satisfying performance. To disentangle speech audio and fine-grained conditions, we employ a two-stage training pipeline. Specifically, Cafe-Talk is initially trained using only speech audio and coarse-grained conditions. Then, a proposed fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs), preventing unfavorable speech-lip synchronization. To disentangle coarse- and fine-grained conditions, we design a swap-label training mechanism, which enables the dominance of the fine-grained conditions. We also devise a mask-based CFG technique to regulate the occurrence and intensity of fine-grained control. In addition, a text-based detector is introduced with text-AU alignment to enable natural language user input and further support multimodal control. Extensive experimental results prove that Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness performance and receives wide acceptance in fine-grained control in user studies. Project page: https://harryxd2018.github.io/cafe-talk/

Rui Yang,Lin Song,Yicheng Xiao,Runhui Huang,Yixiao Ge,Ying Shan,Hengshuang Zhao

Task: Propose a simple yet efficient method to construct a baseline for native, end-to-end large multi-modal models built on a single transformer.

Motivation: Most large multi-modal models (LMMs) model the visual and textual modalities separately, while native single-transformer models are resource-intensive and lag behind compositional models in performance.

Details

Method: Proposes an early-fusion LMM that fuses multi-modal inputs at an early stage and responds to visual instructions autoregressively, and designs an efficient training recipe that exploits the prior knowledge of pre-trained models. Result: The proposed model outperforms other single-transformer LMMs and significantly narrows the performance gap with compositional LMMs. Conclusion: The method addresses both resource consumption and performance limitations, providing a new baseline for native end-to-end large multi-modal models. Abstract: Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.

Salient Temporal Encoding for Dynamic Scene Graph Generation

Zhihao Zhu

Task: Propose a new spatial-temporal scene graph generation method that selectively builds temporal connections between temporally relevant object pairs and represents temporal relations as explicit edges in the scene graph.

Motivation: Because current benchmark datasets lack explicitly annotated temporal relations, existing spatial-temporal scene graph generation methods build dense and abstract temporal connections among all objects, yet not all temporal connections encode meaningful temporal dynamics.

Details

Method: Proposes a new spatial-temporal scene graph generation method that selectively builds temporal connections only between temporally relevant object pairs and represents the temporal relations as explicit edges in the scene graph. Result: The method improves on strong baselines by up to 4.4% in Scene Graph Detection and, applied to action recognition, gains 0.6% mAP over the state of the art. Conclusion: The sparse and explicit temporal representation improves both scene graph generation and downstream vision tasks. Abstract: Representing a dynamic scene using a structured spatial-temporal scene graph is a novel and particularly challenging task. To tackle this task, it is crucial to learn the temporal interactions between objects in addition to their spatial relations. Due to the lack of explicitly annotated temporal relations in current benchmark datasets, most of the existing spatial-temporal scene graph generation methods build dense and abstract temporal connections among all objects across frames. However, not all temporal connections encode meaningful temporal dynamics. We propose a novel spatial-temporal scene graph generation method that selectively builds temporal connections only between temporally relevant object pairs and represents the temporal relations as explicit edges in the scene graph. The resulting sparse and explicit temporal representation allows us to improve upon strong scene graph generation baselines by up to 4.4% in Scene Graph Detection. In addition, we show that our approach can be leveraged to improve downstream vision tasks. In particular, applying our approach to action recognition shows a 0.6% gain in mAP in comparison to the state of the art.

Second language Korean Universal Dependency treebank v1.2: Focus on data augmentation and annotation scheme refinement

Hakyung Sung,Gyu-Ho Shin

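Task: Expand the second language (L2) Korean Universal Dependencies (UD) treebank and evaluate models trained on it on in-domain and out-of-domain datasets.

Motivation: Align the treebank more closely with the UD framework and improve the morphosyntactic analysis performance of Korean language models on L2 Korean data.

Details

Method: Manually annotates 5,454 sentences, revises the annotation guidelines, and fine-tunes three Korean language models on the enhanced treebank. Result: Fine-tuning significantly improves model performance across various metrics. Conclusion: Fine-tuning first-language-based, general-purpose language models on well-tailored L2 datasets is important for the morphosyntactic analysis of L2 data. Abstract: We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. The annotation guidelines are also revised to better align with the UD framework. Using this enhanced treebank, we fine-tune three Korean language models and evaluate their performance on in-domain and out-of-domain L2-Korean datasets. The results show that fine-tuning significantly improves their performance across various metrics, thus highlighting the importance of using well-tailored L2 datasets for fine-tuning first-language-based, general-purpose language models for the morphosyntactic analysis of L2 data.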

ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis

Yu Fang,Yue Yang,Xinghao Zhu,Kaiyuan Zheng,Gedas Bertasius,Daniel Szafir,Mingyu Ding

Task: Propose ReBot, a real-to-sim-to-real approach for scaling real robot datasets and adapting vision-language-action (VLA) models to target domains.

Motivation: The high cost of real-world data collection limits the generalizability of VLA models; ReBot aims to address this with a real-to-sim-to-real approach.

Details

Method: ReBot replays real-world robot trajectories in simulation to diversify the manipulated objects, and combines the simulated movements with inpainted real-world backgrounds to synthesize physically realistic and temporally consistent robot videos. Result: Experiments show that ReBot significantly improves the performance and robustness of VLA models in both simulation and real-world environments. Conclusion: ReBot scales real robot datasets through its real-to-sim-to-real approach and adapts VLA models to target domains, addressing the last-mile deployment challenge in robot manipulation. Abstract: Vision-language-action (VLA) models present a promising paradigm by training policies directly on real robot datasets like Open X-Embodiment. However, the high cost of real-world data collection hinders further data scaling, thereby restricting the generalizability of VLAs. In this paper, we introduce ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains, which is the last-mile deployment challenge in robot manipulation. Specifically, ReBot replays real-world robot trajectories in simulation to diversify manipulated objects (real-to-sim), and integrates the simulated movements with inpainted real-world background to synthesize physically realistic and temporally consistent robot videos (sim-to-real). Our approach has several advantages: 1) it enjoys the benefit of real data to minimize the sim-to-real gap; 2) it leverages the scalability of simulation; and 3) it can generalize a pretrained VLA to a target domain with fully automated data pipelines. Extensive experiments in both simulation and real-world environments show that ReBot significantly enhances the performance and robustness of VLAs. For example, in SimplerEnv with the WidowX robot, ReBot improved the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. For real-world evaluation with a Franka robot, ReBot increased the success rates of Octo by 17% and OpenVLA by 20%. More information can be found at: https://yuffish.github.io/rebot/

Strategic resource allocation in memory encoding: An efficiency principle shaping language processing

Weijie Xu,Richard Futrell

Task: Investigate strategic resource allocation as an efficiency principle for memory encoding in sentence processing.

Motivation: Understand how the limited capacity of working memory is used efficiently to support human linguistic behavior.

Details

Method: Combines a theoretical analysis from a resource-rational perspective with an empirical study of naturalistic corpus data. Result: Finds support for strategic resource allocation in the context of dependency locality, but also reveals considerable cross-linguistic variability. Conclusion: How strategic resource allocation, as a universal efficiency principle, interacts with language-specific phrase structures needs further study. Abstract: How is the limited capacity of working memory efficiently used to support human linguistic behaviors? In this paper, we investigate strategic resource allocation as an efficiency principle for memory encoding in sentence processing. The idea is that working memory resources are dynamically and strategically allocated to prioritize novel and unexpected information, enhancing their representations to make them less susceptible to memory decay and interference. Theoretically, from a resource-rational perspective, we argue that this efficiency principle naturally arises from two functional assumptions about working memory, namely, its limited capacity and its noisy representation. Empirically, through naturalistic corpus data, we find converging evidence for strategic resource allocation in the context of dependency locality from both the production and the comprehension side, where non-local dependencies with less predictable antecedents are associated with reduced locality effect. However, our results also reveal considerable cross-linguistic variability, highlighting the need for a closer examination of how strategic resource allocation, as a universal efficiency principle, interacts with language-specific phrase structures.

SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders

Qing Li,Jiahui Geng,Derui Zhu,Fengyu Cai,Chenyang Lyu,Fakhri Karray

Task: Propose SAUCE, a new method for fine-grained and selective concept unlearning in vision-language models (VLMs).

Motivation: Existing unlearning methods for VLMs mainly adapt techniques from large language models (LLMs); they require extensive annotated forget sets and unlearn at a coarse granularity, which leads to excessive forgetting and reduced model utility.

Details

Method: SAUCE uses sparse autoencoders (SAEs) to capture high-dimensional, semantically rich sparse features and identifies the features most relevant to the target concept. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information. Result: SAUCE is evaluated on two distinct VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across 60 concepts. Experiments show that it outperforms existing methods by 18.04% in unlearning quality while maintaining comparable model utility. Conclusion: SAUCE is an effective and scalable solution for selective concept unlearning in VLMs. Abstract: Unlearning methods for vision-language models (VLMs) have primarily adapted techniques from large language models (LLMs), relying on weight updates that demand extensive annotated forget sets. Moreover, these methods perform unlearning at a coarse granularity, often leading to excessive forgetting and reduced model utility. To address this issue, we introduce SAUCE, a novel method that leverages sparse autoencoders (SAEs) for fine-grained and selective concept unlearning in VLMs. Briefly, SAUCE first trains SAEs to capture high-dimensional, semantically rich sparse features. It then identifies the features most relevant to the target concept for unlearning. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information. We evaluate SAUCE on two distinct VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across two types of tasks: concrete concept unlearning (objects and sports scenes) and abstract concept unlearning (emotions, colors, and materials), encompassing a total of 60 concepts. Extensive experiments demonstrate that SAUCE outperforms state-of-the-art methods by 18.04% in unlearning quality while maintaining comparable model utility. Furthermore, we investigate SAUCE's robustness against widely used adversarial attacks, its transferability across models, and its scalability in handling multiple simultaneous unlearning requests. Our findings establish SAUCE as an effective and scalable solution for selective concept unlearning in VLMs.
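
The inference-time step described above (modify selected sparse features, leave the rest alone) might look roughly like the sketch below. It is an illustration under stated assumptions, not SAUCE's implementation: the ReLU encoder, the tiny hand-written weights, the `scale` parameter, and the choice to add back only the decoded change are all hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def suppress_concept(activation, w_enc, b_enc, w_dec,
                     concept_features, scale=0.0):
    """Scale chosen SAE features and write the change back to the activation."""
    # Encode the model activation into sparse features.
    feats = relu([z + b for z, b in zip(matvec(w_enc, activation), b_enc)])
    # Scale only the features tied to the concept being unlearned.
    edited = [f * scale if i in concept_features else f
              for i, f in enumerate(feats)]
    # Assumption: decode only the difference, so unrelated directions
    # (and any SAE reconstruction error) are left untouched.
    delta = [e - f for e, f in zip(edited, feats)]
    shift = matvec(w_dec, delta)
    return [a + s for a, s in zip(activation, shift)]

# Toy 2-feature SAE with identity weights (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
edited = suppress_concept([1.0, 2.0], identity, [0.0, 0.0], identity,
                          concept_features={1})
```

With `scale=0.0` the selected features are removed entirely; an intermediate value would only dampen them.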

Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence

Sophia Hager,David Mueller,Kevin Duh,Nicholas Andrews

Task: Develop a method for teaching large language models (LLMs) to express calibrated semantic confidence.

Motivation: As LLMs are increasingly used for factual question answering, it becomes more important for them to communicate the likelihood that their answers are correct.

Details

Method: Proposes a simple procedure, uncertainty distillation, which uses held-out data to map initial uncertainty estimates to meaningful probabilities and creates annotated examples for supervised fine-tuning. Result: The method yields verbalized confidences that correlate with observed error rates, and the semantic uncertainty correlates well with lexical uncertainty on short answers. Conclusion: Uncertainty distillation teaches LLMs to express calibrated semantic confidence, improving their reliability in practical applications. Abstract: As large language models (LLMs) are increasingly used for factual question-answering, it becomes more important for LLMs to have the capability to communicate the likelihood that their answer is correct. For these verbalized expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence. However, when prompted to express confidence, the error rates of current LLMs are inconsistent with their communicated confidences, highlighting the need for uncertainty quantification methods. Many prior methods calculate lexical uncertainty, estimating a model's confidence in the specific string it generated. In some cases, however, it may be more useful to estimate semantic uncertainty, or the model's confidence in the answer regardless of how it is verbalized. We propose a simple procedure, uncertainty distillation, to teach an LLM to verbalize calibrated semantic confidences. Using held-out data to map initial uncertainty estimates to meaningful probabilities, we create examples annotated with verbalized probabilities for supervised fine-tuning. We demonstrate our method yields verbalized confidences that correlate with observed error rates with a small fine-tuned language model as well as with larger instruction-tuned models, and find that our semantic uncertainty correlates well with lexical uncertainty on short answers.
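
One way to picture the step "use held-out data to map initial uncertainty estimates to meaningful probabilities" is histogram binning, sketched below. The abstract does not say which mapping the authors use, so the binning scheme, the function names, and the annotation format are assumptions.

```python
def fit_bin_calibrator(scores, correct, n_bins=10):
    """Return per-bin empirical accuracy estimated on held-out data."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for s, c in zip(scores, correct):
        b = min(int(s * n_bins), n_bins - 1)  # scores assumed to lie in [0, 1]
        hits[b] += int(c)
        counts[b] += 1
    # Assumption: empty bins fall back to the bin midpoint.
    return [hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]

def calibrate(score, bin_accuracy):
    """Map a raw confidence score to the accuracy observed in its bin."""
    n_bins = len(bin_accuracy)
    return bin_accuracy[min(int(score * n_bins), n_bins - 1)]

def annotate(question, answer, score, bin_accuracy):
    """Build a fine-tuning example with a verbalized calibrated confidence."""
    p = calibrate(score, bin_accuracy)
    return f"Q: {question}\nA: {answer} (confidence: {p:.0%})"

# Hypothetical held-out scores and correctness labels.
table = fit_bin_calibrator([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 0], n_bins=2)
example = annotate("Capital of France?", "Paris", 0.95, table)
```

In this toy run the model's raw score of 0.95 is rewritten as 50%, because only half of the held-out answers in that score range were correct.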

Interpretable Unsupervised Joint Denoising and Enhancement for Real-World low-light Scenarios

Huaqiu Li,Xiaowan Hu,Haoqian Wang

Task: Propose an interpretable, zero-reference joint denoising and low-light enhancement framework for real-world low-light images.

Motivation: Real-world low-light images often suffer from complex degradations such as local overexposure, low brightness, noise, and uneven illumination. Supervised methods tend to overfit to specific scenarios, while unsupervised methods struggle to model these degradations because they lack reference images.

Details

Method: Grounded in physical imaging principles and Retinex theory, derives a training strategy based on paired sub-images, uses the Discrete Cosine Transform (DCT) to perform frequency-domain decomposition in the sRGB space, and introduces an implicit-guided hybrid representation strategy. Result: Extensive experiments demonstrate the superiority of the method. Conclusion: The method performs well on real-world low-light images, and the code will be released on GitHub. Abstract: Real-world low-light images often suffer from complex degradations such as local overexposure, low brightness, noise, and uneven illumination. Supervised methods tend to overfit to specific scenarios, while unsupervised methods, though better at generalization, struggle to model these degradations due to the lack of reference images. To address this issue, we propose an interpretable, zero-reference joint denoising and low-light enhancement framework tailored for real-world scenarios. Our method derives a training strategy based on paired sub-images with varying illumination and noise levels, grounded in physical imaging principles and retinex theory. Additionally, we leverage the Discrete Cosine Transform (DCT) to perform frequency domain decomposition in the sRGB space, and introduce an implicit-guided hybrid representation strategy that effectively separates intricate compounded degradations. In the backbone network design, we develop a retinal decomposition network guided by implicit degradation representation mechanisms. Extensive experiments demonstrate the superiority of our method. Code will be available at https://github.com/huaqlili/unsupervised-light-enhance-ICLR2025.
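
To illustrate what a DCT-based frequency decomposition does, here is a minimal one-dimensional sketch using the orthonormal DCT-II and its inverse. The paper works on sRGB images, so the 1D signal, the hard cutoff, and the function names are simplifying assumptions rather than the authors' design.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1D signal."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)))
    return out

def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(math.sqrt((1 if k == 0 else 2) / n) * coeffs[k]
                * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
            for i in range(n)]

def split_bands(x, cutoff):
    """Split a signal into low- and high-frequency parts that sum to x."""
    c = dct(x)
    low = idct([v if k < cutoff else 0.0 for k, v in enumerate(c)])
    high = idct([v if k >= cutoff else 0.0 for k, v in enumerate(c)])
    return low, high

# Hypothetical row of pixel intensities.
low, high = split_bands([1.0, 4.0, 2.0, 8.0], cutoff=2)
```

The low band carries smooth structure such as overall illumination, while the high band carries fine detail and much of the noise; because the transform is linear, the two bands add back to the original signal.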

Language Independent Named Entity Recognition via Orthogonal Transformation of Word Vectors

Omar E. Rakha,Hazem M. Abbas

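Task: Perform cross-lingual named entity recognition using a bidirectional LSTM/CRF model with word embeddings.

Motivation: Word embeddings are a key building block of NLP. The paper aims to train a model on a source language (English) and apply it to named entity recognition in a target language (Arabic) without training or fine-tuning on the target language.

Details

Method: Proposes a model based on a bidirectional LSTM/CRF with word embeddings, in which an orthogonal linear transformation matrix maps target-language word embeddings into the source-language embedding space. Result: Trained on an English dataset, the model detects named entities in an Arabic dataset without any training or fine-tuning on Arabic data. Conclusion: The approach shows the potential of cross-lingual named entity recognition without additional training or fine-tuning on the target language. Abstract: Word embeddings have been a key building block for NLP in which models relied heavily on word embeddings in many different tasks. In this paper, a model is proposed based on using Bidirectional LSTM/CRF with word embeddings to perform named entity recognition for any language. This is done by training a model on a source language (English) and transforming word embeddings from the target language into word embeddings of the source language by using an orthogonal linear transformation matrix. Evaluation of the model shows that by training a model on an English dataset the model was capable of detecting named entities in an Arabic dataset without either training or fine-tuning the model on an Arabic language dataset.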
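
The abstract does not say how the orthogonal matrix is obtained. A common choice is the orthogonal Procrustes solution fitted on a seed dictionary of translation pairs; the sketch below shows only its two-dimensional rotation case, where the optimum has a closed form. The seed pairs, the restriction to 2D rotations, and the function names are assumptions; real embeddings would need the general SVD-based solution.

```python
import math

def fit_rotation(target_vecs, source_vecs):
    """Angle of the 2D rotation that best maps target vectors onto source
    vectors in the least-squares sense."""
    dot = sum(t[0] * s[0] + t[1] * s[1]
              for t, s in zip(target_vecs, source_vecs))
    cross = sum(t[0] * s[1] - t[1] * s[0]
                for t, s in zip(target_vecs, source_vecs))
    return math.atan2(cross, dot)

def rotate(vec, theta):
    """Apply the rotation matrix for angle theta to a 2D vector."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])

# Hypothetical seed dictionary: target-language vectors and the
# source-language vectors of their translations.
target = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
source = [rotate(t, 0.5) for t in target]

theta = fit_rotation(target, source)
mapped = [rotate(t, theta) for t in target]  # now in the source space
```

Because a rotation preserves lengths and angles, the mapped target-language vectors keep their mutual geometry, which is why a tagger trained on source-language embeddings can consume them without retraining.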

Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey

Liewen Liao,Weihao Yan,Ming Yang,Songan Zhang

Task: Survey learning-based 3D reconstruction in autonomous driving and its recent progress.

Motivation: Through advanced neural representations, 3D reconstruction can precisely model both dynamic and static environments, which augments perception and offers new solutions for key tasks such as scene understanding and closed-loop simulation.

Details

Method: First gives a systematic introduction to the preliminaries of learning-based 3D reconstruction, including data formats, benchmarks, and technical background, then categorizes methods by subtask and analyzes and summarizes them along multiple dimensions. Result: Summarizes the development trends and open challenges of learning-based 3D reconstruction in autonomous driving. Conclusion: The authors hope the survey will inspire future research. Abstract: Learning-based 3D reconstruction has emerged as a transformative technique in autonomous driving, enabling precise modeling of both dynamic and static environments through advanced neural representations. Besides augmenting perception, 3D reconstruction inspires pioneering solutions for vital tasks in the field of autonomous driving, such as scene understanding and closed-loop simulation. Commencing with an examination of input modalities, we investigate the details of 3D reconstruction and conduct a multi-perspective, in-depth analysis of recent advancements. Specifically, we first provide a systematic introduction of preliminaries, including data formats, benchmarks and technical preliminaries of learning-based 3D reconstruction, facilitating instant identification of suitable methods based on hardware configurations and sensor suites. Then, we systematically review learning-based 3D reconstruction methods in autonomous driving, categorizing approaches by subtasks and conducting multi-dimensional analysis and summary to establish a comprehensive technical reference. The development trends and existing challenges are summarized in the context of learning-based 3D reconstruction in autonomous driving. We hope that our review will inspire future research.

FACTS&EVIDENCE: An Interactive Tool for Transparent Fine-Grained Factual Verification of Machine-Generated Text

Varich Boonsanong,Vidhisha Balachandran,Xiaochuang Han,Shangbin Feng,Lucy Lu Wang,Yulia Tsvetkov

Task: Develop an interactive and transparent tool for user-driven factual verification of complex text.

Motivation: Existing tools lack transparency in their prediction reasoning and diversity in their evidence sources, so they cannot provide a trustworthy user experience.

Details

Method: Develops the Facts&Evidence tool, which breaks complex input text down into individual claims, visualizes the credibility of each claim, and provides explanations of model decisions along with attribution to multiple evidence sources. Result: Facts&Evidence helps users understand, verify, selectively trust, and use machine-generated text. Conclusion: Facts&Evidence aims to empower consumers of machine-generated text so that they can understand, verify, selectively trust, and use such text. Abstract: With the widespread consumption of AI-generated content, there has been an increased focus on developing automated tools to verify the factual accuracy of such content. However, prior research and tools developed for fact verification treat it as a binary classification or a linear regression problem. Although this is a useful mechanism as part of automatic guardrails in systems, we argue that such tools lack transparency in the prediction reasoning and diversity in source evidence to provide a trustworthy user experience. We develop Facts&Evidence - an interactive and transparent tool for user-driven verification of complex text. The tool facilitates the intricate decision-making involved in fact-verification, presenting its users a breakdown of complex input texts to visualize the credibility of individual claims along with an explanation of model decisions and attribution to multiple, diverse evidence sources. Facts&Evidence aims to empower consumers of machine-generated text and give them agency to understand, verify, selectively trust and use such text.

Matching Skeleton-based Activity Representations with Heterogeneous Signals for HAR

Shuheng Li,Jiayun Zhang,Xiaohan Fu,Xiyuan Zhang,Jingbo Shang,Rajesh K. Gupta

Task: Propose SKELAR, a new framework that pretrains activity representations from skeleton data and matches them with heterogeneous HAR signals.

Motivation: Traditional HAR typically encodes activity labels in one-hot format, with a recent shift toward textual representations that provide contextual knowledge. Text is inherently limited, however; HAR should be anchored to physical motion data, because motion forms the basis of activity and applies across sensing systems.

Details

Method: SKELAR pretrains activity representations from skeleton data through a self-supervised coarse angle reconstruction task, and uses a self-attention matching module to dynamically prioritize the relevant body parts. Result: SKELAR achieves state-of-the-art performance in both full-shot and few-shot settings and can effectively use synthetic skeleton data to extend its range of application. Conclusion: With skeleton-based pretraining and self-attention matching, SKELAR handles heterogeneous HAR signals effectively and performs well across a range of scenarios. Abstract: In human activity recognition (HAR), activity labels have typically been encoded in one-hot format, with a recent shift towards using textual representations to provide contextual knowledge. Here, we argue that HAR should be anchored to physical motion data, as motion forms the basis of activity and applies effectively across sensing systems, whereas text is inherently limited. We propose SKELAR, a novel HAR framework that pretrains activity representations from skeleton data and matches them with heterogeneous HAR signals. Our method addresses two major challenges: (1) capturing core motion knowledge without context-specific details. We achieve this through a self-supervised coarse angle reconstruction task that recovers joint rotation angles, invariant to both users and deployments; (2) adapting the representations to downstream tasks with varying modalities and focuses. To address this, we introduce a self-attention matching module that dynamically prioritizes relevant body parts in a data-driven manner. Given the lack of corresponding labels in existing skeleton data, we establish MASD, a new HAR dataset with IMU, WiFi, and skeleton, collected from 20 subjects performing 27 activities. This is the first broadly applicable HAR dataset with time-synchronized data across three modalities. Experiments show that SKELAR achieves the state-of-the-art performance in both full-shot and few-shot settings. We also demonstrate that SKELAR can effectively leverage synthetic skeleton data to extend its use in scenarios without skeleton collections.

MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models

Chejian Xu,Jiawei Zhang,Zhaorun Chen,Chulin Xie,Mintong Kang,Yujin Potter,Zhun Wang,Zhuowen Yuan,Alexander Xiong,Zidi Xiong,Chenhui Zhang,Lingzhi Yuan,Yi Zeng,Peiyang Xu,Chengquan Guo,Andy Zhou,Jeffrey Ziwei Tan,Xuandong Zhao,Francesco Pinto,Zhen Xiang,Yu Gai,Zinan Lin,Dan Hendrycks,Bo Li,Dawn Song

Task: Present MMDT (Multimodal DecodingTrust), a unified platform for comprehensively evaluating the safety and trustworthiness of multimodal foundation models (MMFMs).

Motivation: Existing benchmarks for multimodal models either mainly assess helpfulness or focus on a limited set of perspectives such as fairness and privacy, and so lack a comprehensive evaluation of safety and trustworthiness.

Details

Method: Designs the unified MMDT platform, which evaluates models from multiple perspectives: safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. Designs various evaluation scenarios and red teaming algorithms to generate challenging data, forming a high-quality benchmark. Result: Evaluates a range of multimodal models and reveals vulnerabilities and areas for improvement across these perspectives. Conclusion: Introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for safer and more reliable MMFMs and systems. Abstract: Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as generating unsafe content by text-to-image models. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at https://mmdecodingtrust.github.io/.

Fire and Smoke Datasets in 20 Years: An In-depth Review

Sayed Pedram Haeri Boroujeni,Niloufar Mehrabi,Fatemeh Afghah,Connor Peter McGrath,Danish Bhatkar,Mithilesh Anil Biradar,Abolfazl Razi

Task: Systematically analyze and evaluate the fire and smoke datasets collected over the past 20 years.

Motivation: Fire and smoke pose a significant threat to the natural environment, ecosystems, the global economy, and human and wildlife lives, so more advanced technologies are needed for early detection, real-time monitoring, and minimizing the impact of fires on ecological balance and public safety.

Details

Method: Conducts an in-depth review of the fire and smoke datasets collected over the past 20 years, analyzing the characteristics of each dataset, including type, size, format, collection methods, and geographical diversity, and summarizing the strengths and weaknesses of each. Result: Presents extensive experimental analyses across different datasets using state-of-the-art algorithms such as ResNet-50, DeepLab-V3, and YoloV8. Conclusion: The study offers valuable insights and potential directions for research and technological progress in fire management. Abstract: Fire and smoke phenomena pose a significant threat to the natural environment, ecosystems, and global economy, as well as human lives and wildlife. In this particular circumstance, there is a demand for more sophisticated and advanced technologies to implement an effective strategy for early detection, real-time monitoring, and minimizing the overall impacts of fires on ecological balance and public safety. Recently, the rapid advancement of Artificial Intelligence (AI) and Computer Vision (CV) frameworks has substantially revolutionized the momentum for developing efficient fire management systems. However, these systems extensively rely on the availability of adequate and high-quality fire and smoke data to create proficient Machine Learning (ML) methods for various tasks, such as detection and monitoring. Although fire and smoke datasets play a critical role in training, evaluating, and testing advanced Deep Learning (DL) models, a comprehensive review of the existing datasets is still unexplored. For this purpose, we provide an in-depth review to systematically analyze and evaluate fire and smoke datasets collected over the past 20 years. We investigate the characteristics of each dataset, including type, size, format, collection methods, and geographical diversities. We also review and highlight the unique features of each dataset, such as imaging modalities (RGB, thermal, infrared) and their applicability for different fire management tasks (classification, segmentation, detection). Furthermore, we summarize the strengths and weaknesses of each dataset and discuss their potential for advancing research and technology in fire management. Ultimately, we conduct extensive experimental analyses across different datasets using several state-of-the-art algorithms, such as ResNet-50, DeepLab-V3, and YoloV8.

The CLEF-2025 CheckThat! Lab: Subjectivity, Fact-Checking, Claim Normalization, and Retrieval

Firoj Alam,Julia Maria Struß,Tanmoy Chakraborty,Stefan Dietze,Salim Hafid,Katerina Korre,Arianna Muti,Preslav Nakov,Federico Ruggeri,Sebastian Schellhammer,Vinay Setty,Megha Sundriyal,Konstantin Todorov,Venktesh V

Task: Identify and counter online disinformation and manipulation, covering subjectivity identification, claim normalization, fact-checking of numerical claims, and scientific web discourse processing.

Motivation: Advance the development of innovative technologies that identify and counter online disinformation and manipulation across languages and platforms.

Details

Method: Across successive editions of the CheckThat! lab, the task scope has been gradually expanded to cover both core verification tasks and auxiliary tasks. Result: The lab poses several challenging classification and retrieval problems at both the document and span levels, including multilingual settings. Conclusion: By continually expanding its task scope, the CheckThat! lab advances technology for identifying online disinformation and manipulation. Abstract: The CheckThat! lab aims to advance the development of innovative technologies designed to identify and counteract online disinformation and manipulation efforts across various languages and platforms. The first five editions focused on key tasks in the information verification pipeline, including check-worthiness, evidence retrieval and pairing, and verification. Since the 2023 edition, the lab has expanded its scope to address auxiliary tasks that support research and decision-making in verification. In the 2025 edition, the lab revisits core verification tasks while also considering auxiliary challenges. Task 1 focuses on the identification of subjectivity (a follow-up from CheckThat! 2024), Task 2 addresses claim normalization, Task 3 targets fact-checking numerical claims, and Task 4 explores scientific web discourse processing. These tasks present challenging classification and retrieval problems at both the document and span levels, including multilingual settings.

Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions

Kasra Borazjani,Payam Abdisarabshali,Naji Khosravan,Seyyedali Hosseinalipour

Task: Study a new definition of data heterogeneity for federated learning (FL) in computer vision tasks and its effect on performance.

Motivation: Existing literature mainly models data heterogeneity through label distribution skew, which fails to fully capture real-world data heterogeneity, especially in computer vision tasks beyond classification.

Details

Method: Use pre-trained deep neural networks to extract task-specific data embeddings, cluster the data points, and assign them to clients using the Dirichlet distribution, thereby defining task-specific data heterogeneity. Result: Extensive experiments evaluate different FL methods under the new notion of data heterogeneity and introduce new benchmark performance measures. Conclusion: Existing approaches overestimate FL performance by relying on label/class distribution skew; the work exposes an overlooked gap in the literature and proposes new research directions. Abstract: Federated Learning (FL) represents a paradigm shift in distributed machine learning (ML), enabling clients to train models collaboratively while keeping their raw data private. This paradigm shift from traditional centralized ML introduces challenges due to the non-iid (non-independent and identically distributed) nature of data across clients, significantly impacting FL's performance. Existing literature predominantly models data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the real-world data heterogeneity among clients in computer vision tasks beyond classification. Subsequently, we demonstrate that current approaches overestimate FL's performance by relying on label/class distribution skew, exposing an overlooked gap in the literature. By utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. We further unveil a series of open research directions that can be pursued.
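
The clustering-then-Dirichlet assignment described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the embedding clusters have already been computed (the `cluster_ids` list is a hypothetical input), and the function names, the symmetric concentration parameter `alpha`, and the rounding scheme are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def partition_by_cluster(cluster_ids, num_clients, alpha, seed=0):
    """Assign sample indices to clients, one embedding cluster at a time.

    cluster_ids[i] is the (hypothetical, precomputed) cluster label of sample i.
    Smaller alpha tends to give more skewed, more heterogeneous client datasets.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(idx)

    clients = [[] for _ in range(num_clients)]
    for cluster in sorted(by_cluster):
        indices = by_cluster[cluster]
        rng.shuffle(indices)
        shares = sample_dirichlet(alpha, num_clients, rng)
        # Convert the Dirichlet shares into cut points over this cluster.
        start, cumulative = 0, 0.0
        for client_id, share in enumerate(shares):
            cumulative += share
            if client_id == num_clients - 1:
                end = len(indices)  # last client takes whatever remains
            else:
                end = min(len(indices), round(cumulative * len(indices)))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients
```

Calling `partition_by_cluster(cluster_ids, num_clients=5, alpha=0.5)` returns one index list per client; every sample lands on exactly one client, and lowering `alpha` concentrates each cluster on fewer clients.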

MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer

Honglin Lin,Zhuoshi Pan,Yu Li,Qizhi Pei,Xin Gao,Mengzhang Cai,Conghui He,Lijun Wu

Task: Propose MetaLadder, a new framework that improves LLM accuracy on mathematical reasoning tasks by recalling and reflecting on structurally or semantically analogous problems and their CoT solutions.

Motivation: Humans usually solve problems by recalling similar cases and using their solutions to reason about the current task, whereas current LLM paradigms typically generate the CoT and answer directly, which differs from human problem-solving strategies.

Details

Method: The MetaLadder framework explicitly prompts LLMs to recall and reflect on meta-problems and their CoT solutions, and adds a problem-restating mechanism to strengthen the model's understanding of the target problem. Result: On mathematical benchmarks, MetaLadder significantly improves LLM problem-solving accuracy, with a 10.3% accuracy gain over standard CoT methods. Conclusion: By mimicking human "learning from examples" and generalization abilities, MetaLadder achieves reasoning transfer from analogous problems and significantly improves LLM problem-solving accuracy. Abstract: Large Language Models (LLMs) have demonstrated promising capabilities in solving mathematical reasoning tasks, leveraging Chain-of-Thought (CoT) data as a vital component in guiding answer generation. Current paradigms typically generate CoT and answers directly for a given problem, diverging from human problem-solving strategies to some extent. Humans often solve problems by recalling analogous cases and leveraging their solutions to reason about the current task. Inspired by this cognitive process, we propose MetaLadder, a novel framework that explicitly prompts LLMs to recall and reflect on meta-problems, that is, structurally or semantically analogous problems, alongside their CoT solutions before addressing the target problem. Additionally, we introduce a problem-restating mechanism to enhance the model's comprehension of the target problem by regenerating the original question, which further improves reasoning accuracy. As a result, the model can achieve reasoning transfer from analogical problems, mimicking human-like "learning from examples" and generalization abilities. Extensive experiments on mathematical benchmarks demonstrate that our MetaLadder significantly boosts LLMs' problem-solving accuracy, largely outperforming standard CoT-based methods (10.3% accuracy gain) and other methods. Our code and data have been released at https://github.com/LHL3341/MetaLadder.

SuperPC: A Single Diffusion Model for Point Cloud Completion, Upsampling, Denoising, and Colorization

Yi Du,Zhipeng Zhao,Shaoshu Su,Sharath Golluri,Haoze Zheng,Runmao Yao,Chen Wang

Task: Propose SuperPC, a unified diffusion model that handles point cloud completion, upsampling, denoising, and colorization simultaneously.

Motivation: Existing methods usually handle point cloud processing tasks independently and ignore the mutual influence and correlation among these defects, which leads to error accumulation and higher computational cost.

Details

Method: A three-level-conditioned diffusion framework combined with a novel spatial-mix-fusion strategy exploits the correlations among the four defects for simultaneous, efficient processing. Result: SuperPC outperforms state-of-the-art specialized models and their combination on all four individual tasks. Conclusion: With a single unified diffusion model, SuperPC effectively addresses multiple defects in point cloud processing and shows superior performance across tasks. Abstract: Point cloud (PC) processing tasks, such as completion, upsampling, denoising, and colorization, are crucial in applications like autonomous driving and 3D reconstruction. Despite substantial advancements, prior approaches often address each of these tasks independently, with separate models focused on individual issues. However, this isolated approach fails to account for the fact that defects like incompleteness, low resolution, noise, and lack of color frequently coexist, with each defect influencing and correlating with the others. Simply applying these models sequentially can lead to error accumulation from each model, along with increased computational costs. To address these challenges, we introduce SuperPC, the first unified diffusion model capable of concurrently handling all four tasks. Our approach employs a three-level-conditioned diffusion framework, enhanced by a novel spatial-mix-fusion strategy, to leverage the correlations among these four defects for simultaneous, efficient processing. We show that SuperPC outperforms the state-of-the-art specialized models as well as their combination on all four individual tasks.

Deep Contrastive Unlearning for Language Models

Estrid He,Tabinda Sarwar,Ibrahim Khalil,Xun Yi,Ke Wang

Task: Propose Deep Contrastive Unlearning for fine-Tuning (DeepCUT), a framework for machine unlearning in language models.

Motivation: Protect user privacy and copyright, and address the fact that existing methods do not consider the geometric distribution of samples in the model's latent space.

Details

Method: Achieve machine unlearning by directly optimizing the model's latent space. Result: Experiments on real-world datasets show that DeepCUT outperforms baseline methods in both effectiveness and efficiency. Conclusion: The DeepCUT framework performs well on machine unlearning and can effectively protect user privacy and copyright. Abstract: The past few years have witnessed the great success of large language models, demonstrating powerful capabilities in comprehending textual data and generating human-like language. Large language models achieve success by being trained on vast amounts of textual data, including online sources with copyrighted content and user-generated knowledge. However, this comes at a cost: the potential risk of exposing users' privacy and violating copyright protections. Thus, to safeguard individuals' "right to be forgotten", there has been increasing interest in machine unlearning, the process of removing information carried by particular training samples from a model while not deteriorating its predictive quality. This is a challenging task due to the black-box nature of language models. Most existing studies focus on mitigating the impact of the forgotten samples on a model's outputs, and do not explicitly consider the geometric distributions of samples in the latent space of a model. To address this issue, we propose a machine unlearning framework, named Deep Contrastive Unlearning for fine-Tuning (DeepCUT) language models. Our proposed model achieves machine unlearning by directly optimizing the latent space of a model. Comprehensive experiments on real-world datasets demonstrate the effectiveness and efficiency of DeepCUT with consistent and significant improvement over baseline methods.

Effortless Active Labeling for Long-Term Test-Time Adaptation

Guowei Wang,Changxing Ding

Task: Study how to achieve effortless active labeling in long-term test-time adaptation (TTA), selecting at most one sample per batch for annotation.

Motivation: Long-term TTA is challenging because of error accumulation. Existing methods address this by actively labeling a small proportion of samples in each batch, but the annotation burden grows quickly as the number of batches increases.

Details

Method: First, annotate the most valuable sample in each batch, chosen from the single-step optimization perspective in the TTA context, and introduce an efficient strategy that identifies such samples using feature perturbation. Second, because the gradient magnitudes produced by annotated and unannotated samples differ significantly, use two dynamic weights to balance their impact on model optimization. Result: Extensive experiments on the ImageNet-C, -R, -K, -A and PACS databases show that the method consistently outperforms state-of-the-art methods at significantly lower annotation cost. Conclusion: The proposed method achieves effortless active labeling in long-term TTA, significantly reducing annotation cost while outperforming existing methods on multiple databases. Abstract: Long-term test-time adaptation (TTA) is a challenging task due to error accumulation. Recent approaches tackle this issue by actively labeling a small proportion of samples in each batch, yet the annotation burden quickly grows as the batch number increases. In this paper, we investigate how to achieve effortless active labeling so that a maximum of one sample is selected for annotation in each batch. First, we annotate the most valuable sample in each batch based on the single-step optimization perspective in the TTA context. In this scenario, the samples that border between the source- and target-domain data distributions are considered the most feasible for the model to learn in one iteration. Then, we introduce an efficient strategy to identify these samples using feature perturbation. Second, we discover that the gradient magnitudes produced by the annotated and unannotated samples have significant variations. Therefore, we propose balancing their impact on model optimization using two dynamic weights. Extensive experiments on the popular ImageNet-C, -R, -K, -A and PACS databases demonstrate that our approach consistently outperforms state-of-the-art methods with significantly lower annotation costs.

MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models

Jiazheng Li,Lu Yu,Qing Cui,Zhiqiang Zhang,Jun Zhou,Yanfang Ye,Chuxu Zhang

Task: Propose MASS, a skill-graph-based mathematical data selection framework for pretraining large language models (LLMs) in the mathematical reasoning domain.

Motivation: High-quality data plays a key role in pretraining and fine-tuning LLMs, but existing data selection methods tend to overlook the specific nuances of domain-related data.

Details

Method: Build a skill graph that captures mathematical skills and their interrelations, use it to assign quality scores to the target dataset, and select the top-ranked subset for pretraining LLMs. Result: Experiments show that MASS is efficient and effective across model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Conclusion: MASS can significantly improve the efficiency and effectiveness of LLM pretraining, reducing the number of training tokens required while improving model performance. Abstract: High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to identify subsets of data that can effectively and efficiently enhance model performance. However, most of these methods focus on general data selection and tend to overlook the specific nuances of domain-related data. In this paper, we introduce MASS, a MAthematical data Selection framework using the Skill graph for pretraining LLMs in the mathematical reasoning domain. By taking into account the unique characteristics of mathematics and reasoning, we construct a skill graph that captures the mathematical skills and their interrelations from a reference dataset. This skill graph guides us in assigning quality scores to the target dataset, enabling us to select the top-ranked subset which is further used to pretrain LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Specifically, in terms of efficiency, models trained on subsets selected by MASS can achieve similar performance to models trained on the original datasets, with a significant reduction in the number of trained tokens, ranging from 50% to 70% fewer tokens. In terms of effectiveness, when trained on the same amount of tokens, models trained on the data selected by MASS outperform those trained on the original datasets by 3.3% to 5.9%. These results underscore the potential of MASS to improve both the efficiency and effectiveness of pretraining LLMs.

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Sara Sarto,Marcella Cornia,Rita Cucchiara

Task: Evaluate machine-generated image captions, in particular captions generated by Multimodal Large Language Models (MLLMs).

Motivation: With the advent of MLLMs, image captioning has become a core task, which calls for more robust and reliable evaluation metrics.

Details

Method: A comprehensive survey of progress in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. Result: The metrics are assessed along multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Conclusion: The analysis reveals limitations of standard evaluation approaches and suggests promising directions for future research on image captioning evaluation. Abstract: The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.

Covering Cracks in Content Moderation: Delexicalized Distant Supervision for Illicit Drug Jargon Detection

Minkyoo Song,Eugene Jang,Jaehan Kim,Seungwon Shin

Task: Detect illicit drug jargon on social media.

Motivation: As drug-related content on social media increases, existing detection methods based on term lists are easy to evade and cannot tell apart benign uses of the same terms.

Details

Method: The JEDIS framework detects illicit drug jargon by analyzing context, combining distant supervision with delexicalization. Result: On two manually annotated datasets, JEDIS significantly outperforms existing word-based baselines in F1-score and detection coverage. Conclusion: JEDIS performs well at detecting illicit drug jargon and effectively addresses the shortcomings of existing methods. Abstract: In light of rising drug-related concerns and the increasing role of social media, sales and discussions of illicit drugs have become commonplace online. Social media platforms hosting user-generated content must therefore perform content moderation, which is a difficult task due to the vast amount of jargon used in drug discussions. Previous works on drug jargon detection were limited to extracting a list of terms, but these approaches have fundamental problems in practical application. First, they are trivially evaded using word substitutions. Second, they cannot distinguish whether euphemistic terms such as "pot" or "crack" are being used as drugs or in their benign meanings. We argue that drug content moderation should be done using contexts rather than relying on a banlist. However, manually annotated datasets for training such a task are not only expensive but also prone to becoming obsolete. We present JEDIS, a framework for detecting illicit drug jargon terms by analyzing their contexts. JEDIS utilizes a novel approach that combines distant supervision and delexicalization, which allows JEDIS to be trained without human-labeled data while being robust to new terms and euphemisms. Experiments on two manually annotated datasets show JEDIS significantly outperforms state-of-the-art word-based baselines in terms of F1-score and detection coverage in drug jargon detection. We also conduct qualitative analysis that demonstrates JEDIS is robust against pitfalls faced by existing approaches.
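
To make the idea of delexicalization concrete, here is a minimal sketch. The abstract does not specify how JEDIS implements it, so everything below is an assumption for illustration: the `[TERM]` placeholder, the `delexicalize` helper, and the use of a seed-term list as the distant-supervision signal are hypothetical.

```python
import re

MASK = "[TERM]"  # hypothetical placeholder token, not taken from the paper


def delexicalize(sentence, seed_terms):
    """Replace seed terms with a placeholder so that only the context remains.

    Returns (masked_sentence, matched_terms). A match against the seed list
    acts as a weak, distantly supervised label; masking the term itself forces
    a downstream classifier to rely on the surrounding words.
    """
    # Longer terms first so multi-word terms win over their substrings.
    ordered = sorted(seed_terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    matched = [m.group(0).lower() for m in pattern.finditer(sentence)]
    return pattern.sub(MASK, sentence), matched
```

For example, `delexicalize("Selling pot and crack tonight", {"pot", "crack"})` yields the masked sentence `Selling [TERM] and [TERM] tonight`, while the word `pottery` in another sentence is left untouched because of the word-boundary match.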

Can Large Vision Language Models Read Maps Like a Human?

Shuo Xing,Zezhou Sun,Shuangyu Xie,Kaiyuan Chen,Yanjia Huang,Yuping Wang,Jiachen Li,Dezhen Song,Zhengzhong Tu

Task: Introduce MapBench, a dataset designed specifically for outdoor navigation on human-readable, pixel-based maps.

Motivation: Address navigation in complex path-finding scenarios and challenge the spatial reasoning and structured decision-making abilities of existing large vision-language models (LVLMs).

Details

Method: MapBench contains over 1600 pixel-space map path-finding problems drawn from 100 diverse maps, and provides a Map Space Scene Graph (MSSG) as an indexing data structure. Result: MapBench significantly challenges existing LVLMs and reveals key limitations in their spatial reasoning and structured decision-making. Conclusion: MapBench is an important tool for evaluating and improving LVLMs on complex navigation tasks. Abstract: In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based map-based outdoor navigation, curated from complex path finding scenarios. MapBench comprises over 1600 pixel space map path finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides Map Space Scene Graph (MSSG) as an indexing data structure to convert between natural language and evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all the code and dataset at https://github.com/taco-group/MapBench.

ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming

Dewei Wang,Wei Zhu,Liyang Ling,Ettore Tiotto,Quintin Wang,Whitney Tsang,Julian Opperman,Jacky Deng

Task: Propose ML-Triton, a multi-level compilation flow and programming interface that makes better use of the GPU hierarchy.

Motivation: The conventional Triton compiler lowers directly from the workgroup level to the per-thread level, and this premature lowering makes it hard to fully exploit the GPU's hierarchical structure and SIMD units.

Details

Method: ML-Triton uses a multi-level compilation flow that progressively lowers from the workgroup level to the warp and intrinsic levels, and extends the Triton language to support user-set compiler hints and warp-level programming. Result: Experiments show that ML-Triton achieves performance above 95% of expert-written kernels on Intel GPU. Conclusion: With its multi-level compilation flow and extended programming interface, ML-Triton delivers good performance without waiting for compiler updates. Abstract: In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tile-based approach. While traditional GPU programming often relies on low-level interfaces like CUDA or SYCL, Triton has emerged as a DSL that offers a more user-friendly and portable alternative by programming at a higher level. The current Triton starts at the workgroup (aka threadblock) level and directly lowers to the per-thread level. It then attempts to coalesce and amend through a series of passes, promoting information from the low-level representation. We believe this is premature lowering based on the following observations. 1. GPU has a hierarchical structure both physically and logically. Modern GPUs often feature SIMD units capable of directly operating on tiles on a warp or warpgroup basis, such as blocked load and blocked MMA. 2. Multi-level gradual lowering can make the compiler decoupled and clean by separating considerations between and within logical layers. 3. Kernel developers often need fine control to get good performance on the latest hardware. FlashAttention2 advocates explicit data partition between warps to boost performance. In this context, we propose ML-Triton, which features a multi-level compilation flow and programming interface. Our approach begins at the workgroup level and progressively lowers to the warp and intrinsic level, implementing a multi-level lowering aligned with the hierarchical nature of GPU. Additionally, we extend the Triton language to support user-set compiler hints and warp-level programming, enabling researchers to get good out-of-the-box performance without awaiting compiler updates. Experimental results demonstrate that our approach achieves performance above 95% of expert-written kernels on Intel GPU, as measured by the geometric mean.
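
For readers unfamiliar with the summary statistic, the geometric mean of per-kernel performance ratios can be computed as in the sketch below. The helper name and the example ratios are hypothetical and are not taken from the paper's measurements.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed in log space for stability."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical ratios of compiled-kernel throughput to expert-kernel throughput.
example_ratios = [0.97, 0.94, 1.01, 0.93]
summary = geometric_mean(example_ratios)
```

Unlike the arithmetic mean, the geometric mean treats a 2x slowdown and a 2x speedup as cancelling out, which is why it is a common choice for aggregating relative performance across kernels.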

Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer

Yi Liao,Yongsheng Gao,Weichuan Zhang

Task: Propose a new visual explanation method, the Dynamic Accumulated Attention Map (DAAM), for visualizing the attention flow inside Vision Transformer (ViT) models.

Motivation: Existing visual explanation methods cannot show the attention flow hidden inside ViT models, and so cannot explain how the final attention regions are formed during a ViT's decision-making.

Details

Method: A novel decomposition module constructs and stores spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block, and obtains channel importance coefficients by decomposing the classification score. For self-supervised ViT models, dimension-wise importance weights are proposed to compute the channel importance coefficients. Result: Quantitative and qualitative analyses consistently validate the effectiveness and superiority of DAAM for interpreting ViT models, both those using fully-connected layers as the classifier and self-supervised ones. Conclusion: DAAM can visualize the evolution of decision-making attention at any intermediate block inside a ViT model and has strong explanatory power. Abstract: Various Vision Transformer (ViT) models have been widely used for image recognition tasks. However, existing visual explanation methods cannot display the attention flow hidden inside the inner structure of ViT models, which explains how the final attention regions are formed inside a ViT for its decision-making. In this paper, a novel visual explanation approach, Dynamic Accumulated Attention Map (DAAM), is proposed to provide a tool that can visualize, for the first time, the attention flow from the top to the bottom through ViT networks. To this end, a novel decomposition module is proposed to construct and store the spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block. The module can also obtain the channel importance coefficients by decomposing the classification score for supervised ViT models. Because of the lack of classification score in self-supervised ViT models, we propose dimension-wise importance weights to compute the channel importance coefficients. Such spatial features are linearly combined with the corresponding channel importance coefficients, forming the attention map for each block. The dynamic attention flow is revealed by block-wisely accumulating each attention map. The contribution of this work focuses on visualizing the evolution dynamic of the decision-making attention for any intermediate block inside a ViT model by proposing a novel decomposition module and dimension-wise importance weights. The quantitative and qualitative analyses consistently validate the effectiveness and superior capacity of the proposed DAAM for interpreting not only ViT models with fully-connected layers as the classifier but also self-supervised ViT models. The code is available at https://github.com/ly9802/DynamicAccumulatedAttentionMap.

Inspecting the Representation Manifold of Differentially-Private Text

Stefan Arnold

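Task: Study the application of Differential Privacy (DP) to text, in particular the effect of text paraphrasing with language models and temperature sampling.

Motivation: Explore the geometric distortion that DP introduces in the text representation space, in particular changes in structure and complexity.

Details

Method: Estimate the intrinsic dimension of paraphrased text under varying privacy budgets and compare word-level and sentence-level methods in the representation space. Result: Word-level methods severely raise the representation manifold, while sentence-level methods produce paraphrases that are topologically closer to human-written paraphrases. Among sentence-level methods, masked paraphrasing preserves structural complexity better than causal paraphrasing. Conclusion: Autoregressive generation propagates distortions from unnatural word choices and inflates the representation space, whereas masked paraphrasing is better at preserving structural complexity. Abstract: Differential Privacy (DP) for text has recently taken the form of text paraphrasing using language models and temperature sampling to better balance privacy and utility. However, the geometric distortion of DP regarding the structure and complexity in the representation space remains unexplored. By estimating the intrinsic dimension of paraphrased text across varying privacy budgets, we find that word-level methods severely raise the representation manifold, while sentence-level methods produce paraphrases whose manifolds are topologically more consistent with human-written paraphrases. Among sentence-level methods, masked paraphrasing, compared to causal paraphrasing, demonstrates superior preservation of structural complexity, suggesting that autoregressive generation propagates distortions from unnatural word choices that cascade and inflate the representation space.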
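
The temperature sampling mentioned in the abstract rests on a temperature-scaled softmax over next-token logits. The sketch below shows only that generic building block; how a given DP paraphrasing method maps its privacy budget to a temperature is not stated in the abstract, so no such mapping is assumed here, and the function name is illustrative.

```python
import math


def temperature_softmax(logits, temperature):
    """Turn logits into sampling probabilities at a given temperature.

    Higher temperatures flatten the distribution (more randomness);
    lower temperatures sharpen it toward the highest-scoring token.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Comparing `temperature_softmax([2.0, 0.0], 1.0)` with `temperature_softmax([2.0, 0.0], 10.0)` shows the second distribution is much closer to uniform, which is the mechanism that trades output fidelity for randomness.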

A Simple Combination of Diffusion Models for Better Quality Trade-Offs in Image Denoising

Jonas Dornbusch,Emanuel Pfarr,Florin-Alexandru Vasluianu,Frank Werner,Radu Timofte

Task: Propose the Linear Combination Diffusion Denoiser (LCDD), which balances high visual quality and low distortion in image denoising.

Motivation: Existing diffusion models perform well on image reconstruction tasks but fail to balance high visual quality with low distortion effectively.

Details

Method: LCDD combines two complementary inference procedures: one that leverages the model's generative potential and one that ensures accurate signal recovery. Result: LCDD achieves state-of-the-art denoising performance and offers a controllable trade-off through a simple scalar hyperparameter. Conclusion: LCDD performs well on image denoising and effectively balances high visual quality and low distortion. Abstract: Diffusion models have garnered considerable interest in computer vision, owing both to their capacity to synthesize photorealistic images and to their proven effectiveness in image reconstruction tasks. However, existing approaches fail to efficiently balance the high visual quality of diffusion models with the low distortion achieved by previous image reconstruction methods. Specifically, for the fundamental task of additive Gaussian noise removal, we first illustrate an intuitive method for leveraging pretrained diffusion models. Further, we introduce our proposed Linear Combination Diffusion Denoiser (LCDD), which unifies two complementary inference procedures: one that leverages the model's generative potential and another that ensures faithful signal recovery. By exploiting the inherent structure of the denoising samples, LCDD achieves state-of-the-art performance and offers controlled, well-behaved trade-offs through a simple scalar hyperparameter adjustment.
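
The abstract does not give the combination formula, so the following is only a guess at its simplest possible form: a pixel-wise linear blend of two denoised estimates controlled by one scalar. The function name, the flat-list image representation, and the convention that `weight` multiplies the generative estimate are all assumptions made for illustration.

```python
def combine_estimates(generative, faithful, weight):
    """Pixel-wise linear combination of two denoised estimates.

    generative: estimate from a sampling-style (perceptually sharp) procedure.
    faithful:   estimate from a low-distortion procedure.
    weight:     scalar trade-off; 1.0 keeps only the generative estimate,
                0.0 keeps only the faithful one (an assumed convention).
    """
    if len(generative) != len(faithful):
        raise ValueError("estimates must have the same number of pixels")
    return [weight * g + (1.0 - weight) * f for g, f in zip(generative, faithful)]
```

Sweeping `weight` between 0 and 1 traces out a family of outputs between the two estimates, which is one way a single scalar can expose a quality versus distortion trade-off.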

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

Francesco Maria Molfese,Luca Moroni,Luca Gioffrè,Alessandro Scirè,Simone Conia,Roberto Navigli

Task: Evaluate how Large Language Models (LLMs) are assessed on Multiple-Choice Question Answering (MCQA) tasks.

Motivation: Although MCQA is relatively simple to evaluate, its reliability has been questioned, especially when the model generates free-form text before selecting an answer.

Details

Method: Systematically analyze whether existing answer extraction methods agree with human judgment and how they are affected by answer constraints in the prompt. Result: Traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Conclusion: Standardized evaluation methodologies are needed, along with more reliable and consistent MCQA evaluation practices. Abstract: One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to assess, as the model's answer is thought to be simple to extract and is directly compared to a set of predefined choices. However, recent studies have started to question the reliability of MCQA evaluation, showing that multiple factors can significantly impact the reported performance of LLMs, especially when the model generates free-form text before selecting one of the answer choices. In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons. We systematically analyze whether existing answer extraction methods are aligned with human judgment, and how they are influenced by answer constraints in the prompt across different domains. Our experiments demonstrate that traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, we reveal a fundamental trade-off between including format constraints in the prompt to simplify answer extraction and allowing models to generate free-form text to improve reasoning. Our findings call for standardized evaluation methodologies and highlight the need for more reliable and consistent MCQA evaluation practices.
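
A small example helps show why answer extraction can disagree with the model's actual answer when it writes free-form text first. The two extractors below are hypothetical and deliberately simple; they are not the methods evaluated in the paper.

```python
import re


def extract_first_letter(response, choices="ABCD"):
    """Naive strategy: trust the first character if it is a valid label."""
    text = response.strip()
    return text[0] if text and text[0] in choices else None


def extract_with_pattern(response, choices="ABCD"):
    """Look for an explicit 'answer is X' statement and keep the last one."""
    pattern = r"(?i:answer\s+is)\s*:?\s*\(?([" + choices + r"])\)?\b"
    matches = re.findall(pattern, response)
    return matches[-1] if matches else None


response = (
    "A common mistake is to pick the largest value. "
    "Comparing the options, the answer is (C)."
)
naive = extract_first_letter(response)      # picks up the article "A"
patterned = extract_with_pattern(response)  # picks up the stated answer
```

Here the naive extractor scores the response as choice A because the sentence happens to start with the word "A", while the pattern-based extractor recovers C. Neither rule is robust in general, which is the kind of inconsistency the abstract describes.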

These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models

Parker Ewen,Hao Chen,Seth Isaacson,Joey Wilson,Katherine A. Skinner,Ram Vasudevan

Task: Propose a new approach to uncertainty quantification for radiance fields that uses higher-order moments of the rendering equation.

Motivation: Uncertainty quantification is crucial for downstream tasks such as view planning and scene understanding, especially where safety and robustness matter. However, the high dimensionality and complexity of radiance fields make uncertainty quantification difficult and limit the use of these methods in high-speed decision-making.

Details

Method: Exploit the probabilistic nature of the rendering process to compute higher-order moments of radiance field outputs (color, depth, and semantic predictions) efficiently and differentiably. Result: The method outperforms existing radiance field uncertainty estimation techniques while offering a more direct, computationally efficient, and differentiable formulation that needs no post-processing. Conclusion: Extensive experiments confirm the method's efficacy, reaching state-of-the-art performance on synthetic and real-world scenes while remaining simple. Abstract: This paper introduces a novel approach to uncertainty quantification for radiance fields by leveraging higher-order moments of the rendering equation. Uncertainty quantification is crucial for downstream tasks including view planning and scene understanding, where safety and robustness are paramount. However, the high dimensionality and complexity of radiance fields pose significant challenges for uncertainty quantification, limiting the use of these uncertainty quantification methods in high-speed decision-making. We demonstrate that the probabilistic nature of the rendering process enables efficient and differentiable computation of higher-order moments for radiance field outputs, including color, depth, and semantic predictions. Our method outperforms existing radiance field uncertainty estimation techniques while offering a more direct, computationally efficient, and differentiable formulation without the need for post-processing. Beyond uncertainty quantification, we also illustrate the utility of our approach in downstream applications such as next-best-view (NBV) selection and active ray sampling for neural radiance field training. Extensive experiments on synthetic and real-world scenes confirm the efficacy of our approach, which achieves state-of-the-art performance while maintaining simplicity.
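
One way to read "moments of the rendering equation" is to treat the standard volume rendering weights along a ray as a discrete probability distribution over sample depths and take its mean and variance. The sketch below does exactly that for depth; it is an interpretation under stated assumptions, not the paper's formulation. In particular, renormalizing the weights so they sum to one, and the function names, are choices made for this example.

```python
def rendering_weights(alphas):
    """Standard volume rendering weights w_i = T_i * alpha_i along one ray,
    where T_i is the transmittance accumulated before sample i."""
    weights = []
    transmittance = 1.0
    for alpha in alphas:
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def depth_moments(alphas, depths):
    """Mean and variance of depth under the normalized rendering weights."""
    weights = rendering_weights(alphas)
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("ray does not hit any opaque sample")
    probs = [w / total for w in weights]  # assumption: renormalize to sum to 1
    mean = sum(p * d for p, d in zip(probs, depths))
    second = sum(p * d * d for p, d in zip(probs, depths))
    return mean, second - mean * mean
```

With `alphas=[0.5, 1.0]` and `depths=[1.0, 3.0]` the two samples receive equal weight, giving a mean depth of 2.0 and a variance of 1.0; a ray that is fully absorbed at its first sample has zero variance. Because the computation uses only sums and products, it stays differentiable when written in an autodiff framework.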

LLM Alignment for the Arabs: A Homogenous Culture or Diverse Ones?

Amr Keleg

Task: Discuss the limitations of Arabic large language models (LLMs) with respect to cultural diversity and offer preliminary thoughts on building systems that better represent the cultural diversity of the Arab world.

Motivation: Existing Arabic LLMs assume that Arabs share the same culture, an assumption that ignores the cultural diversity of the Arab world.

Details

Method: Discuss the limitations of the cultural homogeneity assumption and offer suggestions for improvement. Result: The paper points out how widely the cultural homogeneity assumption is adopted and how it affects the development of Arabic LLMs. Conclusion: The author hopes the paper will encourage the NLP community to consider cultural diversity within the same language community when developing multilingual and Arabic-specific LLMs. Abstract: Large language models (LLMs) have the potential of being useful tools that can automate tasks and assist humans. However, these models are more fluent in English and more aligned with Western cultures, norms, and values. Arabic-specific LLMs are being developed to better capture the nuances of the Arabic language, as well as the views of the Arabs. Yet, Arabs are sometimes assumed to share the same culture. In this position paper, I discuss the limitations of this assumption and provide preliminary thoughts for how to build systems that can better represent the cultural diversity within the Arab world. The invalidity of the cultural homogeneity assumption might seem obvious, yet, it is widely adopted in developing multilingual and Arabic-specific LLMs. I hope that this paper will encourage the NLP community to be considerate of the cultural diversity within various communities speaking the same language.

Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs

Liu Jing,Amirul Rahman

Task: Propose a method that enhances Large Vision-Language Models (LVLMs) by enabling implicit self-questioning through end-to-end training, to handle multi-step inference in complex visual reasoning tasks.

Motivation: Existing LVLMs perform well on multimodal tasks but struggle with complex visual reasoning that requires multi-step inference.

Details

Method: Augment visual question answering datasets with reasoning chains made of sub-question and answer pairs, and train the LVLM with a multi-task loss that encourages generating and answering these intermediate steps as well as predicting the final answer. Result: Experiments on ScienceQA and VQAv2 show that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models, including the base LLaVA and the original SQ-LLaVA. Ablation studies validate the contribution of each component, and human evaluation confirms improved accuracy and coherence of the reasoning process. Conclusion: Through implicit self-questioning and multi-task training, MF-SQ-LLaVA significantly improves LVLM performance on complex visual reasoning tasks. Abstract: Large Vision-Language Models (LVLMs) have shown remarkable progress in various multimodal tasks, yet they often struggle with complex visual reasoning that requires multi-step inference. To address this limitation, we propose MF-SQ-LLaVA, a novel approach that enhances LVLMs by enabling implicit self-questioning through end-to-end training. Our method involves augmenting visual question answering datasets with reasoning chains consisting of sub-question and answer pairs, and training the LVLM with a multi-task loss that encourages the generation and answering of these intermediate steps, as well as the prediction of the final answer. We conduct extensive experiments on the ScienceQA and VQAv2 datasets, demonstrating that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models, including the base LLaVA and the original SQ-LLaVA. Ablation studies further validate the contribution of each component of our approach, and human evaluation confirms the improved accuracy and coherence of the reasoning process enabled by our method.

SPADE: Systematic Prompt Framework for Automated Dialogue Expansion in Machine-Generated Text Detection

Haoyi Li,Angela Yifei Yuan,Soyeon Caren Han,Christopher Leckie

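Task: Develop models for detecting machine-generated text (MGT) and generate high-quality synthetic user dialogue datasets.

Motivation: Existing MGT detectors are held back by the lack of systematically generated, high-quality datasets, so new data augmentation frameworks are needed to reduce the cost of traditional data collection.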

Details

Method: Proposes five new data augmentation frameworks that generate synthetic user dialogues through structured prompting, yielding 14 new dialogue datasets. Result: A mixed dataset produced by the proposed augmentation frameworks gives better generalization across seven MGT detection models. The authors also simulate online dialogue detection and study the relationship between chat history length and detection accuracy. Conclusion: The proposed frameworks reduce data collection costs and improve MGT detector performance. The open-source datasets are available for download. Abstract: The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of systematically generated, high-quality datasets for training. To address this issue, we propose five novel data augmentation frameworks for synthetic user dialogue generation through a structured prompting approach, reducing the costs associated with traditional data collection methods. Our proposed method yields 14 new dialogue datasets, which we benchmark against seven MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by our proposed augmentation framework. Furthermore, considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. We also benchmark online detection performance with limited chat history on our frameworks. Our open-source datasets can be downloaded from https://github.com/AngieYYF/SPADE-customer-service-dialogue.

SplatVoxel: History-Aware Novel View Streaming without Temporal Training

Yiming Wang,Lucy Chai,Xuan Luo,Michael Niemeyer,Manuel Lagunas,Stephen Lombardi,Siyu Tang,Tiancheng Sun

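Task: Generate high-quality, temporally consistent novel-view sequences from sparse-view videos.

Motivation: Existing novel view synthesis methods struggle with temporal coherence and visual fidelity, leading to flickering and inconsistency.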

Details

Method: Introduces history-awareness, using previous frames to reconstruct the scene and improve quality and stability. Proposes a hybrid splat-voxel feed-forward scene reconstruction approach that combines Gaussian Splatting, to propagate information over time, with a hierarchical voxel grid for temporal fusion. Result: Achieves state-of-the-art performance in both static and streaming scene reconstruction, reducing temporal and visual artifacts while running at interactive rates (15 fps with 350 ms delay) on a single H100 GPU. Conclusion: The method requires no training on multi-view video datasets and can be applied directly to sparse-view video streams in a history-aware manner at inference time. Abstract: We study the problem of novel view streaming from sparse-view videos, which aims to generate a continuous sequence of high-quality, temporally consistent novel views as new input frames arrive. However, existing novel view synthesis methods struggle with temporal coherence and visual fidelity, leading to flickering and inconsistency. To address these challenges, we introduce history-awareness, leveraging previous frames to reconstruct the scene and improve quality and stability. We propose a hybrid splat-voxel feed-forward scene reconstruction approach that combines Gaussian Splatting to propagate information over time, with a hierarchical voxel grid for temporal fusion. Gaussian primitives are efficiently warped over time using a motion graph that extends 2D tracking models to 3D motion, while a sparse voxel transformer integrates new temporal observations in an error-aware manner. Crucially, our method does not require training on multi-view video datasets, which are currently limited in size and diversity, and can be directly applied to sparse-view video streams in a history-aware manner at inference time. Our approach achieves state-of-the-art performance in both static and streaming scene reconstruction, effectively reducing temporal artifacts and visual artifacts while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU. Project Page: https://19reborn.github.io/SplatVoxel/

ELTEX: A Framework for Domain-Driven Synthetic Data Generation

Arina Razmyslovich,Kseniia Murasheva,Sofia Sedlova,Julien Capitaine,Eugene Dmitriev

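Task: Propose ELTEX, a framework for generating high-quality synthetic training data in specialized domains.

Motivation: The performance of large language models (LLMs) in specialized domains such as cybersecurity is limited by the scarcity of domain-specific training data.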

Details

Method: ELTEX systematically integrates explicit domain indicator extraction with dynamic prompting to preserve critical domain knowledge throughout generation. Result: For blockchain-related cyberattack detection, Gemma-2B fine-tuned with ELTEX-generated data is competitive with GPT-4 on both standard classification metrics and uncertainty calibration while requiring significantly fewer computational resources. Conclusion: Domain-driven synthetic data generation can bridge the performance gap between resource-efficient models and larger architectures in specialized domains. Abstract: We present ELTEX (Efficient LLM Token Extraction), a domain-driven framework for generating high-quality synthetic training data in specialized domains. While Large Language Models (LLMs) have shown impressive general capabilities, their performance in specialized domains like cybersecurity remains limited by the scarcity of domain-specific training data. ELTEX addresses this challenge by systematically integrating explicit domain indicator extraction with dynamic prompting to preserve critical domain knowledge throughout the generation process. We demonstrate ELTEX's effectiveness in the context of blockchain-related cyberattack detection, where we fine-tune Gemma-2B using various combinations of real and ELTEX-generated data. Our results show that the ELTEX-enhanced model achieves performance competitive with GPT-4 across both standard classification metrics and uncertainty calibration, while requiring significantly fewer computational resources. We release a curated synthetic dataset of social media texts for cyberattack detection in blockchain. Our work demonstrates that domain-driven synthetic data generation can effectively bridge the performance gap between resource-efficient models and larger architectures in specialized domains.

Construction Site Scaffolding Completeness Detection Based on Mask R-CNN and Hough Transform

Pei-Hsin Lin,Jacob J. Lin,Shang-Hsien Hsieh

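Task: Propose a deep learning-based computer vision approach for detecting scaffolding and its cross braces.

Motivation: Ensure the safety and completeness of scaffolding, prevent accidents, and reduce the time and cost of manual inspection.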

Details

Method: Trains a convolutional neural network (CNN) model on a scaffold image dataset with annotated labels. Result: The completeness of cross braces can be detected automatically from images taken at construction sites, without manual inspection. Conclusion: This non-invasive and efficient solution for detecting scaffolding completeness can help improve safety on construction sites. Abstract: Construction site scaffolding is essential for many building projects, and ensuring its safety is crucial to prevent accidents. The safety inspector must check the scaffolding's completeness and integrity, where most violations occur. The inspection process includes ensuring all the components are in the right place since workers often compromise safety for convenience and disassemble parts such as cross braces. This paper proposes a deep learning-based approach to detect the scaffolding and its cross braces using computer vision. A scaffold image dataset with annotated labels is used to train a convolutional neural network (CNN) model. With the proposed approach, we can automatically detect the completeness of cross braces from images taken at construction sites, without the need for manual inspection, saving a significant amount of time and labor costs. This non-invasive and efficient solution for detecting scaffolding completeness can help improve safety in construction sites.

A Data-driven Investigation of Euphemistic Language: Comparing the usage of "slave" and "servant" in 19th century US newspapers

Jaihyun Park,Ryan Cordell

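Task: Study how the words "slave" and "servant" were used in 19th century US newspapers.

Motivation: Explore the different ways "slave" and "servant" were used in 19th century US newspapers and the social and cultural meaning behind them.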

Details

Method: Uses FastText embeddings to account for OCR errors, excludes text reprints, uses Word2vec embeddings to find words semantically close to "slave" and "servant", and computes log-odds ratios to identify over-represented discourse words in Southern and Northern newspapers. Result: "Slave" is associated with socio-economic, legal, and administrative words, while "servant" is linked to religious words in Northern newspapers and to domestic and familial words in Southern newspapers. Slave discourse words from Southern newspapers are more prevalent in Northern newspapers, while servant discourse words from each side are prevalent in their own region. Conclusion: The study contributes to understanding how 19th century US newspapers created different discourses around enslaved African Americans. Abstract: This study investigates the usage of "slave" and "servant" in 19th century US newspapers using computational methods. While both terms were used to refer to enslaved African Americans, they were used in distinct ways. In the Chronicling America corpus, we included possible OCR errors by using FastText embedding and excluded text reprints to consider text reprint culture in the 19th century. Word2vec embedding was used to find semantically close words to "slave" and "servant" and log-odds ratio was calculated to identify over-represented discourse words in the Southern and Northern newspapers. We found that "slave" is associated with socio-economic, legal, and administrative words; however, "servant" is linked to religious words in the Northern newspapers while Southern newspapers associated "servant" with domestic and familial words. We further found that slave discourse words in Southern newspapers are more prevalent in Northern newspapers while servant discourse words from each side are prevalent in their own region. This study contributes to the understanding of how newspapers created different discourses around enslaved African Americans in the 19th century US.
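
The log-odds ratio used to find over-represented words can be sketched as follows. The abstract does not say which variant was used, so the additive smoothing constant and the function name `log_odds_ratio` are assumptions. A positive score means the word is over-represented in the first corpus, a negative score in the second.

```python
import math
from collections import Counter


def log_odds_ratio(word, counts_a, counts_b, smoothing=0.5):
    """Smoothed log-odds ratio of `word` in corpus A versus corpus B.

    counts_a and counts_b map each word to its frequency in a corpus.
    The smoothing term (an illustrative choice) avoids division by
    zero and log(0) for words missing from one corpus.
    """
    in_a = counts_a.get(word, 0)
    in_b = counts_b.get(word, 0)
    rest_a = sum(counts_a.values()) - in_a
    rest_b = sum(counts_b.values()) - in_b
    odds_a = (in_a + smoothing) / (rest_a + smoothing)
    odds_b = (in_b + smoothing) / (rest_b + smoothing)
    return math.log(odds_a) - math.log(odds_b)


# Toy counts, invented purely to exercise the function.
corpus_a = Counter({"x": 10, "y": 90})
corpus_b = Counter({"x": 90, "y": 10})
scores = {w: log_odds_ratio(w, corpus_a, corpus_b) for w in ("x", "y")}
```

Ranking the vocabulary by this score and taking both ends of the list gives the words most characteristic of each corpus.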

ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints

Vihaan Misra,Peter Schaldenbrand,Jean Oh

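Task: Address text-guided image generation constrained to a fixed set of rigid shapes.

Motivation: Diffusion-based models excel at generating photorealistic images, but they face challenges when restricted to a fixed set of rigid shapes, a setting akin to solving tangram puzzles or arranging real-world objects to match semantic descriptions.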

Details

Method: Proposes ShapeShift, which explicitly parameterizes each shape within a differentiable vector graphics pipeline and iteratively optimizes placement and orientation through score distillation sampling from pretrained diffusion models. A content-aware collision resolution mechanism applies minimal, semantically coherent adjustments when overlaps occur. Result: Experiments show compelling results across diverse scenarios, with quantitative and qualitative advantages over alternative techniques. Conclusion: By combining diffusion-based semantic guidance with explicit geometric constraints, ShapeShift produces interpretable compositions whose spatial relationships clearly embody the text prompt. Abstract: While diffusion-based models excel at generating photorealistic images from text, a more nuanced challenge emerges when constrained to using only a fixed set of rigid shapes, akin to solving tangram puzzles or arranging real-world objects to match semantic descriptions. We formalize this problem as shape-based image generation, a new text-guided image-to-image translation task that requires rearranging the input set of rigid shapes into non-overlapping configurations and visually communicating the target concept. Unlike pixel-manipulation approaches, our method, ShapeShift, explicitly parameterizes each shape within a differentiable vector graphics pipeline, iteratively optimizing placement and orientation through score distillation sampling from pretrained diffusion models. To preserve arrangement clarity, we introduce a content-aware collision resolution mechanism that applies minimal semantically coherent adjustments when overlaps occur, ensuring smooth convergence toward physically valid configurations. By bridging diffusion-based semantic guidance with explicit geometric constraints, our approach yields interpretable compositions where spatial relationships clearly embody the textual prompt. Extensive experiments demonstrate compelling results across diverse scenarios, with quantitative and qualitative advantages over alternative techniques.

Exploring Model Editing for LLM-based Aspect-Based Sentiment Classification

Shichen Li,Zhongqing Wang,Zheyu Zhao,Yue Zhang,Peifeng Li

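Task: Investigate model editing as an efficient way to adapt large language models (LLMs) to aspect-based sentiment classification.

Motivation: Model editing selectively updates a small subset of a neural model's parameters, which significantly reduces computational cost, and it can precisely target critical components within LLMs, so it shows potential for efficient fine-tuning.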

Details

Method: Uses causal interventions to trace and determine which neuron hidden states are essential to the model's predictions. By intervening on and restoring each component of the LLM, the authors identify how important each component is for aspect-based sentiment classification. Result: A distinct set of mid-layer representations is essential for detecting the sentiment polarity of given aspect words. Building on this, the authors develop a model editing approach that focuses on these critical parts of the LLM, giving a more efficient adaptation method. Conclusion: In in-domain and out-of-domain experiments, the approach achieves results competitive with the currently strongest methods while using significantly fewer trainable parameters, pointing to a more efficient and interpretable fine-tuning strategy. Abstract: Model editing aims at selectively updating a small subset of a neural model's parameters with an interpretable strategy to achieve desired modifications. It can significantly reduce the computational cost of adapting large language models (LLMs). Given its ability to precisely target critical components within LLMs, model editing shows great potential for efficient fine-tuning applications. In this work, we investigate model editing to serve as an efficient method for adapting LLMs to solve aspect-based sentiment classification. Through causal interventions, we trace and determine which neuron hidden states are essential for the prediction of the model. By performing interventions and restorations on each component of an LLM, we identify the importance of these components for aspect-based sentiment classification. Our findings reveal that a distinct set of mid-layer representations is essential for detecting the sentiment polarity of given aspect words. Leveraging these insights, we develop a model editing approach that focuses exclusively on these critical parts of the LLM, leading to a more efficient method for adapting LLMs. Our in-domain and out-of-domain experiments demonstrate that this approach achieves competitive results compared to the currently strongest methods with significantly fewer trainable parameters, highlighting a more efficient and interpretable fine-tuning strategy.

HandSplat: Embedding-Driven Gaussian Splatting for High-Fidelity Hand Rendering

Yilan Dong,Haohe Liu,Qing Wang,Jiahao Yang,Wenqing Wang,Gregory Slabaugh,Shanxin Yuan

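Task: Propose HandSplat, a hand rendering framework based on 3D Gaussian Splatting that improves rendering fidelity and stability.

Motivation: Existing 3D Gaussian Splatting methods for hand rendering rely on rigid skeletal motion with an oversimplified non-rigid motion model that cannot capture fine geometric and appearance details. They also densify based solely on per-point gradients and process poses independently, ignoring spatial and temporal correlations. These limitations lead to loss of geometric detail, temporal instability, and inefficient point distribution.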

Details

Method: HandSplat extends the standard 3DGS attributes with implicit geometry and appearance embeddings for finer non-rigid motion modeling, while the original 3DGS attributes continue to model static hand characteristics. A local gradient-aware densification strategy dynamically refines Gaussian density in high-variation regions. For stability, pose-conditioned attribute regularization encourages attribute consistency across similar poses and reduces temporal artifacts. Result: Extensive experiments on InterHand2.6M show that HandSplat surpasses existing methods in fidelity and stability while achieving real-time performance. Conclusion: By improving non-rigid motion modeling and introducing local gradient-aware densification, HandSplat significantly improves the fidelity and stability of hand rendering while remaining real-time. Abstract: Existing 3D Gaussian Splatting (3DGS) methods for hand rendering rely on rigid skeletal motion with an oversimplified non-rigid motion model, which fails to capture fine geometric and appearance details. Additionally, they perform densification based solely on per-point gradients and process poses independently, ignoring spatial and temporal correlations. These limitations lead to geometric detail loss, temporal instability, and inefficient point distribution. To address these issues, we propose HandSplat, a novel Gaussian Splatting-based framework that enhances both fidelity and stability for hand rendering. To improve fidelity, we extend standard 3DGS attributes with implicit geometry and appearance embeddings for finer non-rigid motion modeling while preserving the static hand characteristic modeled by original 3DGS attributes. Additionally, we introduce a local gradient-aware densification strategy that dynamically refines Gaussian density in high-variation regions. To improve stability, we incorporate pose-conditioned attribute regularization to encourage attribute consistency across similar poses, mitigating temporal artifacts. Extensive experiments on InterHand2.6M demonstrate that HandSplat surpasses existing methods in fidelity and stability while achieving real-time performance. We will release the code and pre-trained models upon acceptance.

Increasing the Robustness of the Fine-tuned Multilingual Machine-Generated Text Detectors

Dominik Macko,Robert Moro,Ivan Srba

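Task: Develop an automated method for accurately detecting machine-generated content.

Motivation: The proliferation of LLMs has raised concerns about their misuse for creating and spreading harmful content. Studies show that humans can no longer distinguish high-quality machine-generated text from authentic human-written text, so automated detection is needed.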

Details

Method: Proposes a robust fine-tuning process of LLMs for the detection task. Result: The resulting detectors are more robust against obfuscation and generalize better to out-of-distribution data. Conclusion: The proposed robust fine-tuning process enables more effective detection of machine-generated content, which can support credibility assessment in the online information space. Abstract: Since the proliferation of LLMs, there have been concerns about their misuse for harmful content creation and spreading. Recent studies justify such fears, providing evidence of LLM vulnerabilities and high potential of their misuse. Humans are no longer able to distinguish between high-quality machine-generated and authentic human-written texts. Therefore, it is crucial to develop automated means to accurately detect machine-generated content. This would make it possible to identify such content in the online information space, thus providing additional information about its credibility. This work addresses the problem by proposing a robust fine-tuning process of LLMs for the detection task, making the detectors more robust against obfuscation and more generalizable to out-of-distribution data.

RETHINED: A New Benchmark and Baseline for Real-Time High-Resolution Image Inpainting On Edge Devices

Marcelo Sanchez,Gil Triginer,Ignacio Sarasua,Lara Raad,Coloma Ballester

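Task: Propose a baseline method for real-time high-resolution image inpainting on edge devices.

Motivation: Existing image inpainting methods perform well on low-resolution images but fail at high resolutions and require powerful hardware, which limits their deployment on edge devices.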

Details

Method: Proposes a simple yet effective method that combines a lightweight convolutional neural network (CNN) with a resolution-agnostic patch replacement mechanism. Result: An extensive analysis on various mobile devices shows similar inpainting performance while being 100 times faster than existing state-of-the-art methods. Conclusion: The method performs real-time high-resolution inpainting on edge devices, and the authors release DF8K-Inpainting, the first free-form mask UHD inpainting dataset. Abstract: Existing image inpainting methods have shown impressive completion results for low-resolution images. However, most of these algorithms fail at high resolutions and require powerful hardware, limiting their deployment on edge devices. Motivated by this, we propose the first baseline for REal-Time High-resolution image INpainting on Edge Devices (RETHINED) that is able to inpaint at ultra-high-resolution and can run in real-time ($\leq$ 30ms) in a wide variety of mobile devices. Our simple yet effective method consists of a lightweight Convolutional Neural Network (CNN) to recover structure, followed by a resolution-agnostic patch replacement mechanism to provide detailed texture. Specifically, our pipeline leverages the structural capacity of CNN and the high-level detail of patch-based methods, which is a key component for high-resolution image inpainting. To demonstrate the real application of our method, we conduct an extensive analysis on various mobile-friendly devices and demonstrate similar inpainting performance while being $100\times$ faster than existing state-of-the-art methods. Furthermore, we release DF8K-Inpainting, the first free-form mask UHD inpainting dataset.

Christina Zorenböhmer,Sebastian Schmidt,Bernd Resch

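Task: Generate the first training dataset for Aspect-based Emotion Analysis (ABEA) and fine-tune a BERT-based model for the ABEA sub-tasks of Aspect Term Extraction (ATE) and Aspect Emotion Classification (AEC).

Motivation: Address the dataset bottleneck in ABEA and the increased complexity of emotion classes.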

Details

Method: Following the hierarchical emotion theory of Shaver et al., uses group annotation and majority voting to build an ABEA training dataset of 2,621 English tweets, then fine-tunes the GRACE model for ABEA. Result: The model reaches an F1-score of 70.1% for ATE and 46.9% for joint ATE and AEC extraction. Conclusion: Model performance is limited mainly by the small training dataset combined with the increased task complexity, which causes overfitting and limited generalization. Abstract: While sentiment analysis has advanced from sentence to aspect-level, i.e., the identification of concrete terms related to a sentiment, the equivalent field of Aspect-based Emotion Analysis (ABEA) is faced with dataset bottlenecks and the increased complexity of emotion classes in contrast to binary sentiments. This paper addresses these gaps, by generating a first ABEA training dataset, consisting of 2,621 English Tweets, and fine-tuning a BERT-based model for the ABEA sub-tasks of Aspect Term Extraction (ATE) and Aspect Emotion Classification (AEC). The dataset annotation process was based on the hierarchical emotion theory by Shaver et al. [1] and made use of group annotation and majority voting strategies to facilitate label consistency. The resulting dataset contained aspect-level emotion labels for Anger, Sadness, Happiness, Fear, and a None class. Using the new ABEA training dataset, the state-of-the-art ABSA model GRACE by Luo et al. [2] was fine-tuned for ABEA. The results reflected a performance plateau at an F1-score of 70.1% for ATE and 46.9% for joint ATE and AEC extraction. The limiting factors for model performance were broadly identified as the small training dataset size coupled with the increased task complexity, causing model overfitting and limited abilities to generalize well on new data.
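
The majority voting step used to consolidate annotator labels can be sketched in a few lines. The abstract does not describe how ties or missing votes were handled, so returning `None` in those cases (to flag the item for re-annotation) is an assumption, as is the function name.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or None when no single winner exists.

    labels is the list of emotion labels that different annotators
    assigned to one aspect term. Ties and empty input yield None,
    which is an illustrative choice rather than the paper's rule.
    """
    if not labels:
        return None
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent labels
    return ranked[0][0]
```

Using an odd number of annotators per item makes two-way ties rarer, though with five label classes a three-way split is still possible.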

Validation of Human Pose Estimation and Human Mesh Recovery for Extracting Clinically Relevant Motion Data from Videos

Kai Armstrong,Alexander Rodrigues,Alexander P. Willmott,Lei Zhang,Xujiong Ye

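Task: Provide a comparative analysis of marker-less motion capture techniques for use in clinical settings.

Motivation: Validate marker-less motion capture for kinematic analysis by comparing it with established inertial measurement units (IMUs) and retroreflective marker-based optical motion capture (MoCap).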

Details

Method: Compares marker-less motion capture techniques, such as human pose estimation and human mesh recovery, against established IMU and MoCap techniques for kinematic analysis. Result: Marker-less motion capture produces results in line with IMU and MoCap kinematics, with shorter set-up time and less expertise required. Conclusion: Although the data quality of marker-less motion capture still has room for improvement, its error is acceptable for the low-speed actions used in clinical tests. Abstract: This work aims to discuss the current landscape of kinematic analysis tools, ranging from the state-of-the-art in sports biomechanics such as inertial measurement units (IMUs) and retroreflective marker-based optical motion capture (MoCap) to more novel approaches from the field of computing such as human pose estimation and human mesh recovery. Primarily, this comparative analysis aims to validate the use of marker-less MoCap techniques in a clinical setting by showing that these marker-less techniques are within a reasonable range for kinematics analysis compared to the more cumbersome and less portable state-of-the-art tools. Marker-less motion capture using human pose estimation not only produces results in line with both the IMU and MoCap kinematics but also benefits from a reduced set-up time and requires less practical knowledge and expertise to set up. Overall, while there is still room for improvement when it comes to the quality of the data produced, we believe that this compromise is within the margin of error acceptable for the low-speed actions used in small clinical tests.

Comparing Llama3 and DeepSeekR1 on Biomedical Text Classification Tasks

Yuting Guo,Abeed Sarker

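Task: Compare the performance of two open-source large language models, Llama3-70B and DeepSeekR1-distill-Llama3-70B, on six biomedical text classification tasks.

Motivation: Evaluate how different large language models perform on biomedical text classification, particularly in zero-shot settings.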

Details

Method: Runs experiments on six biomedical text classification tasks, four involving social media data and two involving clinical notes from electronic health records. All experiments are zero-shot, and precision, recall, and F1 scores are reported with 95% confidence intervals. Result: DeepSeekR1-distill-Llama3-70B has better precision on most tasks, with mixed results on recall. The zero-shot LLMs reach high F1 scores on some tasks but underperform on others. Conclusion: Model selection should follow the specific requirements of the health-related text classification task, particularly the precision-recall trade-off. When annotated data is available, supervised classification may be more reliable than zero-shot LLMs. Abstract: This study compares the performance of two open-source large language models (LLMs), Llama3-70B and DeepSeekR1-distill-Llama3-70B, on six biomedical text classification tasks. Four tasks involve data from social media, while two tasks focus on clinical notes from electronic health records, and all experiments were performed in zero-shot settings. Performance metrics, including precision, recall, and F1 scores, were measured for each task, along with their 95% confidence intervals. Results demonstrated that DeepSeekR1-distill-Llama3-70B generally performs better in terms of precision on most tasks, with mixed results on recall. While the zero-shot LLMs demonstrated high F1 scores for some tasks, they grossly underperformed on others, for data from both sources. The findings suggest that model selection should be guided by the specific requirements of the health-related text classification tasks, particularly when considering the precision-recall trade-offs, and that, in the presence of annotated data, supervised classification approaches may be more reliable than zero-shot LLMs.
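
The metrics reported here can be computed as in the sketch below. Precision, recall, and F1 follow their standard definitions; the abstract does not say how the 95% confidence intervals were obtained, so the percentile bootstrap, the function names, and all default parameters are assumptions.

```python
import random


def precision_recall_f1(y_true, y_pred, positive=1):
    """Standard binary precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def bootstrap_f1_interval(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (one possible CI method)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample example indices with replacement and rescore.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_true = [y_true[i] for i in idx]
        sample_pred = [y_pred[i] for i in idx]
        scores.append(precision_recall_f1(sample_true, sample_pred)[2])
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

A wide interval on a small test set is a reminder that differences between two models' point estimates may not be meaningful.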

Revisiting Image Fusion for Multi-Illuminant White-Balance Correction

David Serrano-Lozano,Aditya Arora,Luis Herranz,Konstantinos G. Derpanis,Michael S. Brown,Javier Vazquez-Corral

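Task: Propose an efficient transformer-based model for white-balance correction in multi-illuminant scenes.

Motivation: Existing fusion-based methods perform poorly in multi-illuminant scenes, and dedicated multi-illuminant image datasets are lacking.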

Details

Method: Proposes a transformer-based model that captures spatial dependencies across sRGB white-balance presets, and introduces a large-scale multi-illuminant dataset of over 16,000 sRGB images. Result: The method improves on existing techniques by up to 100% on the new multi-illuminant image fusion dataset. Conclusion: The proposed transformer-based model and the new multi-illuminant dataset substantially improve white-balance correction in multi-illuminant scenes. Abstract: White balance (WB) correction in scenes with multiple illuminants remains a persistent challenge in computer vision. Recent methods explored fusion-based approaches, where a neural network linearly blends multiple sRGB versions of an input image, each processed with predefined WB presets. However, we demonstrate that these methods are suboptimal for common multi-illuminant scenarios. Additionally, existing fusion-based methods rely on sRGB WB datasets lacking dedicated multi-illuminant images, limiting both training and evaluation. To address these challenges, we introduce two key contributions. First, we propose an efficient transformer-based model that effectively captures spatial dependencies across sRGB WB presets, substantially improving upon linear fusion techniques. Second, we introduce a large-scale multi-illuminant dataset comprising over 16,000 sRGB images rendered with five different WB settings, along with WB-corrected images. Our method achieves up to 100% improvement over existing techniques on our new multi-illuminant image fusion dataset.

Entity-aware Cross-lingual Claim Detection for Automated Fact-checking

Rrubaa Panchendrarajan,Arkaitz Zubiaga

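Task: Identify claims that require verification, especially given the proliferation of misinformation on social media platforms.

Motivation: Despite significant progress on this task, handling multilingual and multimodal data remains an open challenge.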

Details

Method: Introduces EX-Claim, an entity-aware cross-lingual claim detection model that uses named entity recognition and entity linking to improve language-level performance. Result: Extensive experiments on three datasets from different social media platforms show that the model significantly outperforms the baselines across 27 languages and achieves the highest rate of knowledge transfer, even with limited training data. Conclusion: EX-Claim handles claims written in any language and performs well at cross-lingual knowledge transfer. Abstract: Identifying claims requiring verification is a critical task in automated fact-checking, especially given the proliferation of misinformation on social media platforms. Despite significant progress in the task, there remain open challenges such as dealing with multilingual and multimodal data prevalent in online discourse. Addressing the multilingual challenge, recent efforts have focused on fine-tuning pre-trained multilingual language models. While these models can handle multiple languages, their ability to effectively transfer cross-lingual knowledge for detecting claims spreading on social media remains under-explored. In this paper, we introduce EX-Claim, an entity-aware cross-lingual claim detection model that generalizes well to handle claims written in any language. The model leverages entity information derived from named entity recognition and entity linking techniques to improve the language-level performance of both seen and unseen languages during training. Extensive experiments conducted on three datasets from different social media platforms demonstrate that our proposed model significantly outperforms the baselines, across 27 languages, and achieves the highest rate of knowledge transfer, even with limited training data.

RAT: Boosting Misclassification Detection Ability without Extra Data

Ge Yan,Tsui-Wei Weng

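Task: Detect misclassified inputs for image classification models.

Motivation: As deep neural networks are increasingly deployed in high-stakes areas, detecting a model's incorrect predictions and intervening accordingly becomes crucial.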

Details

Method: Proposes the robust radius (input-space margin) as a confidence metric and designs two efficient estimation algorithms, RR-BS and RR-Fast. Also designs a training method, Radius Aware Training (RAT), to improve the model's ability to identify its mistakes. Result: Compared with previous methods, the approach achieves up to a 29.3% reduction in AURC and a 21.62% reduction in FPR@95TPR. Conclusion: The proposed method performs well at detecting misclassified inputs and clearly outperforms existing methods. Abstract: As deep neural networks (DNNs) become increasingly prevalent, particularly in high-stakes areas such as autonomous driving and healthcare, the ability to detect incorrect predictions of models and intervene accordingly becomes crucial for safety. In this work, we investigate the detection of misclassified inputs for image classification models through the lens of adversarial perturbation: we propose to use robust radius (a.k.a. input-space margin) as a confidence metric and design two efficient estimation algorithms, RR-BS and RR-Fast, for misclassification detection. Furthermore, we design a training method called Radius Aware Training (RAT) to boost models' ability to identify mistakes. Extensive experiments show our method could achieve up to 29.3% reduction on AURC and 21.62% reduction in FPR@95TPR, compared with previous methods.
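
The idea of using a robust radius as a confidence score can be illustrated with a toy binary search. This is not the paper's RR-BS or RR-Fast, whose details the abstract does not give. It is a minimal sketch that assumes a fixed perturbation direction and a single prediction flip along it; the function name and parameters are hypothetical.

```python
def radius_along_direction(predict, x, direction, max_radius=10.0, tol=1e-4):
    """Binary-search the step size at which the prediction changes.

    predict maps a list of floats to a class label. The search assumes
    the label flips once along `direction` within max_radius; if no
    flip is seen at max_radius, max_radius is returned.
    """
    original = predict(x)

    def moved(step):
        return [xi + step * di for xi, di in zip(x, direction)]

    if predict(moved(max_radius)) == original:
        return max_radius
    low, high = 0.0, max_radius
    while high - low > tol:
        mid = (low + high) / 2
        if predict(moved(mid)) == original:
            low = mid   # still the original label: boundary is farther
        else:
            high = mid  # label flipped: boundary is closer
    return high


def toy_classifier(v):
    """Hypothetical linear classifier with boundary v[0] + v[1] = 1."""
    return int(v[0] + v[1] > 1.0)


radius = radius_along_direction(toy_classifier, [0.0, 0.0], [1.0, 0.0])
```

The distance found along any one direction is only an upper bound on the true robust radius, which is the minimum over all directions. Inputs with a small radius sit close to the decision boundary and would be flagged as likely misclassifications.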

Model Hubs and Beyond: Analyzing Model Popularity, Performance, and Documentation

Pritam Kadasi,Sriman Reddy,Srivathsa Vamsi Chaturvedula,Rudranshu Sen,Agnish Saha,Soumavo Sikdar,Sayani Sarkar,Suhani Mittal,Rohit Jindal,Mayank Singh

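Task: Study the relationship between model popularity and actual performance on Hugging Face, and how the comprehensiveness of model documentation correlates with both.

Motivation: With the surge of machine learning models on Hugging Face, users choosing a model often rely on popularity signals such as download counts, likes, or recency rather than on actual performance.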

Details

Method: Evaluates 500 sentiment analysis models on Hugging Face, with large-scale human annotation (nearly 80,000 annotations) and extensive model training and evaluation. Result: Model popularity does not necessarily correlate with performance. About 80% of the models lack detailed information about the model, training, and evaluation processes, and about 88% of model authors overstate their models' performance in the model cards. Conclusion: Based on the findings, the authors provide a checklist of guidelines to help users choose suitable models for downstream tasks. Abstract: With the massive surge in ML models on platforms like Hugging Face, users often lose track and struggle to choose the best model for their downstream tasks, frequently relying on model popularity indicated by download counts, likes, or recency. We investigate whether this popularity aligns with actual model performance and how the comprehensiveness of model documentation correlates with both popularity and performance. In our study, we evaluated a comprehensive set of 500 Sentiment Analysis models on Hugging Face. This evaluation involved massive annotation efforts, with human annotators completing nearly 80,000 annotations, alongside extensive model training and evaluation. Our findings reveal that model popularity does not necessarily correlate with performance. Additionally, we identify critical inconsistencies in model card reporting: approximately 80% of the models analyzed lack detailed information about the model, training, and evaluation processes. Furthermore, about 88% of model authors overstate their models' performance in the model cards. Based on our findings, we provide a checklist of guidelines for users to choose good models for downstream tasks.
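
One way to check whether popularity tracks performance is a rank correlation between, say, download counts and evaluation scores. The abstract does not name the statistic the authors used, so Spearman's coefficient is an assumption here, as are the function names. The sketch uses only the standard library.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

A coefficient near zero between downloads and measured accuracy would be consistent with the finding that popularity is a weak guide to performance.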

SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting

Haiyang Ying,Matthias Zwicker

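Task: Reconstruct parametric 3D edges from calibrated multi-view images.

Motivation: Existing methods usually reconstruct a 3D edge point set from multi-view 2D edge images and then fit 3D edges to that point set. Noise in the point set can leave gaps between fitted edges, and because the fitting depends only on the reconstructed 3D points, the recovered edges may not align with the input multi-view images.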

Details

Method: Proposes SketchSplat, which reconstructs accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. 3D edges are represented as sketches: parametric lines and curves defined by attributes such as control points, scales, and opacity. During reconstruction, Gaussian points are iteratively sampled from the set of sketches and rasterized onto 2D edge images, and the gradient of the image error with respect to the input 2D edge images is back-propagated to optimize the sketch attributes. Result: Experiments show state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset. Conclusion: SketchSplat bridges 2D edge images and 3D edges in a differentiable manner, which keeps 3D edges well aligned with the 2D images and gives accurate and complete results. A series of adaptive topological operations applied alongside sketch optimization reduces the number of sketches required while preserving accuracy, yielding a more compact reconstruction. Abstract: Edges are one of the most basic parametric primitives to describe structural information in 3D. In this paper, we study parametric 3D edge reconstruction from calibrated multi-view images. Previous methods usually reconstruct a 3D edge point set from multi-view 2D edge images, and then fit 3D edges to the point set. However, noise in the point set may cause gaps among fitted edges, and the recovered edges may not align with input multi-view images since the edge fitting depends only on the reconstructed 3D point set. To mitigate these problems, we propose SketchSplat, a method to reconstruct accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. We represent 3D edges as sketches, which are parametric lines and curves defined by attributes including control points, scales, and opacity. During edge reconstruction, we iteratively sample Gaussian points from a set of sketches and rasterize the Gaussians onto 2D edge images. Then the gradient of the image error with respect to the input 2D edge images can be back-propagated to optimize the sketch attributes. Our method bridges 2D edge images and 3D edges in a differentiable manner, which ensures that 3D edges align well with 2D images and leads to accurate and complete results. We also propose a series of adaptive topological operations and apply them along with the sketch optimization. The topological operations help reduce the number of sketches required while ensuring high accuracy, yielding a more compact reconstruction. Finally, we contribute an accurate 2D edge detector that improves the performance of both ours and existing methods. Experiments show that our method achieves state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset.
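
The step of sampling Gaussian points from a parametric sketch can be pictured with the sketch below. The abstract says only that sketches are lines and curves defined by control points, scales, and opacity. The cubic Bézier form, the uniform sampling in the curve parameter, and the dictionary layout are assumptions for illustration. Rasterization and gradient back-propagation are not shown.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Point on a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    return tuple(
        u ** 3 * a + 3 * u * u * t * b + 3 * u * t * t * c + t ** 3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_gaussians(control_points, scale, opacity, n_samples=8):
    """Place Gaussian centers at evenly spaced parameters along one curve.

    Each sampled Gaussian inherits the sketch's scale and opacity.
    n_samples must be at least 2 so both endpoints are included.
    """
    p0, p1, p2, p3 = control_points
    gaussians = []
    for i in range(n_samples):
        t = i / (n_samples - 1)
        gaussians.append({
            "mean": cubic_bezier(p0, p1, p2, p3, t),
            "scale": scale,
            "opacity": opacity,
        })
    return gaussians


# Hypothetical control points for a single 3D curve.
curve = [(0.0, 0.0, 0.0), (0.3, 0.5, 0.0), (0.7, 0.5, 0.0), (1.0, 0.0, 0.0)]
points = sample_gaussians(curve, scale=0.01, opacity=1.0, n_samples=5)
```

Because every sampled center is a smooth function of the control points, an image-space error on the rasterized Gaussians can in principle be differentiated back to the curve parameters, which is the property the method relies on.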

Exploring Large Language Models for Word Games: Who is the Spy?

Chentian Wei,Jiewei Chen,Jinzhu Xu

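Task: Explore how large language models (LLMs) can be applied effectively to word games, and propose a training-free framework.

Motivation: Because of their rule-based and situational nature, word games hold significant research value for natural language processing (NLP), game theory, and related fields.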

Details

Method: Proposes a Chain-of-Thought (CoT)-based scheduling framework, using the game "Who is the Spy" as an example, that enables LLMs to perform well at tasks such as inferring role words and disguising their identities. Result: Experiments confirm the framework's effectiveness, with notable improvements in LLM performance across multiple datasets. Conclusion: The work shows the potential of LLMs for situational reasoning and social interaction within structured game environments. Abstract: Word games hold significant research value for natural language processing (NLP), game theory, and related fields due to their rule-based and situational nature. This study explores how large language models (LLMs) can be effectively involved in word games and proposes a training-free framework. "Shei Shi Wo Di" or "Who is the Spy" in English, is a classic word game. Using this game as an example, we introduce a Chain-of-Thought (CoT)-based scheduling framework to enable LLMs to achieve excellent performance in tasks such as inferring role words and disguising their identities. We evaluate the framework's performance based on game success rates and the accuracy of the LLM agents' analytical results. Experimental results affirm the framework's effectiveness, demonstrating notable improvements in LLM performance across multiple datasets. This work highlights the potential of LLMs in mastering situational reasoning and social interactions within structured game environments. Our code is publicly available at https://github.com/ct-wei/Who-is-The-Spy.

Prototype Perturbation for Relaxing Alignment Constraints in Backward-Compatible Learning

Zikun Zhou,Yushuai Sun,Wenjie Pei,Xin Li,Yaowei Wang

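Task: Propose a new method that relaxes the constraints in backward-compatible learning in order to preserve the discriminative ability of the new model.

Motivation: The traditional way of updating a retrieval model requires re-computing the embeddings of the gallery data, a time-consuming and computationally intensive process. Backward-Compatible Learning (BCL) has been widely explored to avoid this, but its strong alignment constraints compromise the discriminative ability of the new model.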

Details

Method: Relaxes the constraints by introducing perturbations to the old feature prototypes, so that the new feature space is aligned with a pseudo-old feature space defined by the perturbed prototypes. Two ways of computing the perturbations are proposed: Neighbor-Driven Prototype Perturbation (NDPP) and Optimization-Driven Prototype Perturbation (ODPP). Result: Experiments on multiple datasets show that the proposed approaches perform favorably against existing backward-compatible learning algorithms. Conclusion: Relaxing the constraints through prototype perturbation achieves backward compatibility while preserving the discriminative ability of the new model. Abstract: The traditional paradigm to update retrieval models requires re-computing the embeddings of the gallery data, a time-consuming and computationally intensive process known as backfilling. To circumvent backfilling, Backward-Compatible Learning (BCL) has been widely explored, which aims to train a new model compatible with the old one. Many previous works focus on effectively aligning the embeddings of the new model with those of the old one to enhance the backward-compatibility. Nevertheless, such strong alignment constraints would compromise the discriminative ability of the new model, particularly when different classes are closely clustered and hard to distinguish in the old feature space. To address this issue, we propose to relax the constraints by introducing perturbations to the old feature prototypes. This allows us to align the new feature space with a pseudo-old feature space defined by these perturbed prototypes, thereby preserving the discriminative ability of the new model in backward-compatible learning. We have developed two approaches for calculating the perturbations: Neighbor-Driven Prototype Perturbation (NDPP) and Optimization-Driven Prototype Perturbation (ODPP). In particular, they take into account the feature distributions of not only the old but also the new models to obtain proper perturbations as the new model is updated. Extensive experiments on the landmark and commodity datasets demonstrate that our approaches perform favorably against state-of-the-art BCL algorithms.

BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?

Pierre Chambon,Baptiste Roziere,Benoit Sagot,Gabriel Synnaeve

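Task: Evaluate the ability of generative language models to understand and generate code with specified time and space complexities.

Motivation: Current evaluations often overlook whether models can comprehend and produce code constrained by computational complexity; BigO(Bench) aims to fill this gap.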

Details

Method: BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, together with a set of 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred time and space complexity labels. Result: Several state-of-the-art language models are evaluated, revealing their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding. Conclusion: Token-space reasoning models may not generalize well to tasks for which no reward was given at training time. Abstract: We introduce BigO(Bench), a novel coding benchmark designed to evaluate the capabilities of generative language models in understanding and generating code with specified time and space complexities. This benchmark addresses the gap in current evaluations that often overlook the ability of models to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as corresponding runtime and memory footprint values for a large set of input sizes. We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.
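
The idea of inferring complexity from profiling measurements can be illustrated with a minimal sketch. This is not the BigO(Bench) complexity framework; it assumes runtimes follow a pure power law t ≈ c·n^k and recovers k as the least-squares slope on a log-log scale. The function names `fit_loglog_slope` and `guess_polynomial_class` are hypothetical.

```python
import math


def fit_loglog_slope(sizes, runtimes):
    """Least-squares slope of log(runtime) against log(input size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def guess_polynomial_class(sizes, runtimes):
    """Round the fitted slope to the nearest integer exponent k, i.e. O(n^k).

    Assumption: only polynomial classes are distinguished, so logarithmic
    factors (e.g. n log n) are absorbed into the nearest exponent.
    """
    k = round(fit_loglog_slope(sizes, runtimes))
    if k <= 0:
        return "O(1)"
    if k == 1:
        return "O(n)"
    return f"O(n^{k})"


if __name__ == "__main__":
    sizes = [100, 200, 400, 800]
    quadratic = [1e-6 * n * n for n in sizes]  # synthetic measurements
    print(guess_polynomial_class(sizes, quadratic))
```

A real tool would need repeated measurements, noise handling, and a richer set of candidate classes; the sketch only shows why measurements over several input sizes are enough to separate growth rates.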

Decompositional Neural Scene Reconstruction with Generative Diffusion Prior

Junfeng Ni,Yu Liu,Ruijie Lu,Zirui Zhou,Song-Chun Zhu,Yixin Chen,Siyuan Huang

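Task: Reconstruct the complete shapes and detailed textures of 3D scenes from sparse views.

Motivation: With sparse-view input, existing methods recover underconstrained and occluded regions poorly.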

Details

Method: Proposes DP-Recon, which uses a diffusion prior in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each object under novel views, and introduces a visibility-guided approach that dynamically adjusts the per-pixel SDS loss weights. Result: Experiments on Replica and ScanNet++ show that the method significantly outperforms existing methods; notably, its object reconstruction from 10 views is better than the baselines' from 100 views. Conclusion: Through SDS optimization, DP-Recon enables seamless text-based editing of geometry and appearance, and produces decomposed object meshes with detailed UV maps that support photorealistic visual effects (VFX) editing. Abstract: Decompositional reconstruction of 3D scenes, with complete shapes and detailed texture of all objects within, is intriguing for downstream applications but remains challenging, particularly with sparse views as input. Recent approaches incorporate semantic or geometric regularization to address this issue, but they suffer significant degradation in underconstrained areas and fail to recover occluded regions. We argue that the key to solving this problem lies in supplementing missing information for these areas. To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. This provides additional information for the underconstrained areas, but directly incorporating the diffusion prior raises potential conflicts between the reconstruction and generative guidance. Therefore, we further introduce a visibility-guided approach to dynamically adjust the per-pixel SDS loss weights. Together these components enhance both geometry and appearance recovery while remaining faithful to input images. Extensive experiments across Replica and ScanNet++ demonstrate that our method significantly outperforms SOTA methods. Notably, it achieves better object reconstruction under 10 views than the baselines under 100 views. Our method enables seamless text-based editing for geometry and appearance through SDS optimization and produces decomposed object meshes with detailed UV maps that support photorealistic visual effects (VFX) editing. The project page is available at https://dp-recon.github.io/.

MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration

David Wan,Justin Chih-Yao Chen,Elias Stengel-Eskin,Mohit Bansal

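Task: Extend multi-agent multi-model reasoning to generation tasks, specifically to improving the faithfulness of generated content through refinement.

Motivation: Multi-agent collaboration has shown promise in reasoning tasks but remains underexplored in long-form generation tasks such as summarization and question answering.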

Details

Method: Studies how iterative collaboration among multiple instances and types of large language models (LLMs) affects the subtasks of the refinement process: error detection, critiquing unfaithful sentences, and making corrections based on critiques. Result: Both multi-agent and multi-model approaches help error detection and critiquing, and reframing critiquing and refinement as reranking rather than generation tasks improves multi-agent performance. Conclusion: The resulting "recipe", Multi-Agent Multi-Model Refinement (MAMM-Refine), significantly improves performance on three summarization datasets and on long-form question answering, demonstrating the effectiveness and generalizability of the approach. Abstract: Multi-agent collaboration among models has shown promise in reasoning tasks but is underexplored in long-form generation tasks like summarization and question-answering. We extend multi-agent multi-model reasoning to generation, specifically to improving faithfulness through refinement, i.e., revising model-generated outputs to remove factual inconsistencies. We investigate how iterative collaboration among multiple instances and types of large language models (LLMs) enhances subtasks in the refinement process, such as error detection, critiquing unfaithful sentences, and making corrections based on critiques. We design intrinsic evaluations for each subtask, with our findings indicating that both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing. Additionally, reframing critiquing and refinement as reranking rather than generation tasks improves multi-agent performance. We consolidate these insights into a final "recipe" called Multi-Agent Multi-Model Refinement (MAMM-Refine), where multi-agent and multi-model collaboration significantly boosts performance on three summarization datasets as well as on long-form question answering, demonstrating the effectiveness and generalizability of our recipe.

H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection

Yuhang Liu,Wenjie Zhao,Yunhui Guo

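Task: Propose a new continual OOD detection method, Hierarchical Two-Sample Tests (H2ST), for open-world Task Incremental Learning (TIL) scenarios.

Motivation: Existing TIL methods operate under a closed-world assumption, presuming that incoming data is in-distribution (ID). In an open-world setting, however, incoming samples may come from out-of-distribution (OOD) sources whose task identities are unknown, and current OOD detection methods face several challenges when detecting OOD samples continually.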

Details

Method: Proposes a new continual OOD detection method, Hierarchical Two-Sample Tests (H2ST), which removes the need for threshold selection through hypothesis testing and uses feature maps to better exploit model capabilities without excessive dependence on model performance. Result: Extensive experiments and analysis validate the effectiveness of H2ST in open-world TIL scenarios and its superiority over existing methods. Conclusion: H2ST performs well in open-world TIL scenarios, with lower overhead and superior performance, making it suitable for real-world deployment. Abstract: Task Incremental Learning (TIL) is a specialized form of Continual Learning (CL) in which a model incrementally learns from non-stationary data streams. Existing TIL methodologies operate under the closed-world assumption, presuming that incoming data remains in-distribution (ID). However, in an open-world setting, incoming samples may originate from out-of-distribution (OOD) sources, with their task identities inherently unknown. Continually detecting OOD samples presents several challenges for current OOD detection methods: reliance on model outputs leads to excessive dependence on model performance, selecting suitable thresholds is difficult, hindering real-world deployment, and binary ID/OOD classification fails to provide task-level identification. To address these issues, we propose a novel continual OOD detection method called the Hierarchical Two-sample Tests (H2ST). H2ST eliminates the need for threshold selection through hypothesis testing and utilizes feature maps to better exploit model capabilities without excessive dependence on model performance. The proposed hierarchical architecture enables task-level detection with superior performance and lower overhead compared to non-hierarchical classifier two-sample tests. Extensive experiments and analysis validate the effectiveness of H2ST in open-world TIL scenarios and its superiority to the existing methods. Code is available at https://github.com/YuhangLiuu/H2ST.

TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification

Junnan Zhu,Min Xiao,Yining Wang,Feifei Zhai,Yu Zhou,Chengqing Zong

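Task: Design and evaluate the Text pROVEnance (TROVE) challenge, which traces each sentence of a target text back to specific source sentences and annotates the fine-grained relationships between them.

Motivation: In high-stakes domains such as healthcare, law, and news, it is crucial to understand where content comes from and how it is created.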

Details

Method: Builds a dataset from three public datasets covering 11 diverse scenarios, using a three-stage annotation process (sentence retrieval, GPT provenance, and human provenance), and evaluates 11 LLMs under direct prompting and retrieval-augmented paradigms. Result: Retrieval is essential for robust performance; larger models perform better in complex relationship classification; closed-source models often lead, but open-source models show significant promise with retrieval augmentation. Conclusion: The TROVE challenge offers an effective way to trace text provenance and annotate fine-grained relationships, and retrieval augmentation is essential for improving model performance. Abstract: LLMs have achieved remarkable fluency and coherence in text generation, yet their widespread adoption has raised concerns about content reliability and accountability. In high-stakes domains such as healthcare, law, and news, it is crucial to understand where and how the content is created. To address this, we introduce the Text pROVEnance (TROVE) challenge, designed to trace each sentence of a target text back to specific source sentences within potentially lengthy or multi-document inputs. Beyond identifying sources, TROVE annotates the fine-grained relationships (quotation, compression, inference, and others), providing a deep understanding of how each target sentence is formed. To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios (e.g., QA and summarization) in English and Chinese, spanning source texts of varying lengths (0-5k, 5-10k, 10k+), emphasizing the multi-document and long-document settings essential for provenance. To ensure high-quality data, we employ a three-stage annotation process: sentence retrieval, GPT provenance, and human provenance. We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms, revealing that retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, yet open-source models show significant promise, particularly with retrieval augmentation.

SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments

Yinqi Chen,Meiying Zhang,Qi Hao,Guang Zhou

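Task: Simultaneously predict scene flow and instance segmentation for full-resolution point clouds.

Motivation: Traditional methods usually treat object motion estimation and instance segmentation in dynamic traffic scenes as separate tasks, leading to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios.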

Details

Method: Proposes a multi-task SemanticFlow framework comprising a coarse-to-fine prediction scheme, a set of loss functions, and a self-supervised learning scheme. Result: The framework is validated on the Argoverse and Waymo datasets, showing superior performance in instance segmentation accuracy, scene flow estimation, and computational efficiency. Conclusion: The framework establishes a new benchmark for self-supervised methods in dynamic scene understanding. Abstract: Accurate perception of dynamic traffic scenes is crucial for high-level autonomous driving systems, requiring robust object motion estimation and instance segmentation. However, traditional methods often treat them as separate tasks, leading to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios due to the absence of information sharing. This paper proposes a multi-task SemanticFlow framework to simultaneously predict scene flow and instance segmentation of full-resolution point clouds. The novelty of this work is threefold: 1) developing a coarse-to-fine prediction based multi-task scheme, where an initial coarse segmentation of static backgrounds and dynamic objects is used to provide contextual information for refining motion and semantic information through a shared feature processing module; 2) developing a set of loss functions to enhance the performance of scene flow estimation and instance segmentation, while helping ensure spatial and temporal consistency of both static and dynamic objects within traffic scenes; 3) developing a self-supervised learning scheme, which utilizes coarse segmentation to detect rigid objects and compute their transformation matrices between sequential frames, enabling the generation of self-supervised labels. The proposed framework is validated on the Argoverse and Waymo datasets, demonstrating superior performance in instance segmentation accuracy, scene flow estimation, and computational efficiency, establishing a new benchmark for self-supervised methods in dynamic scene understanding.

Inside-Out: Hidden Factual Knowledge in LLMs

Zorik Gekhman,Eyal Ben David,Hadas Orgad,Eran Ofek,Yonatan Belinkov,Idan Szpector,Jonathan Herzig,Roi Reichart

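Task: Assess whether large language models (LLMs) encode more factual knowledge in their parameters than they express in their outputs.

Motivation: Earlier studies hint at this possibility, but none has clearly defined or demonstrated the phenomenon.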

Details

Method: Proposes a formal definition of knowledge that quantifies, for a given question, the fraction of correct-incorrect answer pairs in which the correct answer is ranked higher. Depending on the information used for scoring, this yields external knowledge and internal knowledge; hidden knowledge arises when internal knowledge exceeds external knowledge. Result: LLMs encode more factual knowledge internally than they express externally, with an average gap of 40%. Some knowledge is so deeply hidden that a model can internally know an answer perfectly yet fail to generate it even once in 1,000 repeated samples. Conclusion: LLMs' generation capabilities have fundamental limitations, which constrains the practical benefit of scaling test-time compute via repeated answer sampling in closed-book QA: some answers are practically never sampled, yet if they were, they would be guaranteed to rank first. Abstract: This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model's observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) puts a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.
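
The pairwise definition of knowledge in the abstract can be made concrete with a small sketch. It assumes each candidate answer already has a numeric score (from token probabilities for external knowledge, or from some probe over intermediate computations for internal knowledge) and that ties do not count as "ranked higher". The names `knowledge_score` and `hidden_knowledge_gap` are hypothetical and are not taken from the paper.

```python
from itertools import product


def knowledge_score(correct_scores, incorrect_scores):
    """Fraction of (correct, incorrect) answer pairs where the correct answer scores higher."""
    pairs = list(product(correct_scores, incorrect_scores))
    if not pairs:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1 for correct, incorrect in pairs if correct > incorrect)
    return wins / len(pairs)


def hidden_knowledge_gap(internal, external):
    """Internal minus external knowledge for one question.

    Each argument is a (correct_scores, incorrect_scores) tuple produced by
    the corresponding scoring method; a positive gap indicates hidden knowledge.
    """
    return knowledge_score(*internal) - knowledge_score(*external)


if __name__ == "__main__":
    external = ([0.9, 0.4], [0.5, 0.1, 0.3])  # illustrative scores only
    internal = ([0.9, 0.8], [0.5, 0.1, 0.3])
    print(knowledge_score(*external), hidden_knowledge_gap(internal, external))
```

In the first example, the correct answer scored 0.4 loses to the incorrect answer scored 0.5, so five of the six pairs are ordered correctly.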

Unlocking the Capabilities of Vision-Language Models for Generalizable and Explainable Deepfake Detection

Peipeng Yu,Jianwei Fei,Hui Gao,Xuan Feng,Zhihua Xia,Chip Hong Chang

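Task: Propose a new paradigm that uses vision-language models (VLMs) for deepfake detection.

Motivation: Current vision-language models excel at understanding multimodal data, but their potential for deepfake detection remains underexplored, mainly because their knowledge is misaligned with forensic patterns.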

Details

Method: Proposes a paradigm with three components: (1) a knowledge-guided forgery adaptation module that aligns the VLM's semantic space with forensic features through contrastive learning; (2) a multi-modal prompt tuning framework that jointly optimizes visual-textual embeddings for localization and explainability; (3) an iterative refinement strategy that supports multi-turn dialogue for evidence-based reasoning. Result: Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, and DFDC, show that the scheme surpasses existing methods in generalization performance while supporting multi-turn dialogue. Conclusion: The work unlocks the potential of VLMs for deepfake detection, offering an approach that surpasses existing methods in generalization performance and adds multi-turn dialogue capability. Abstract: Current vision-language models (VLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment between their knowledge and forensic patterns. To this end, we present a novel paradigm that unlocks VLMs' potential capabilities through three components: (1) A knowledge-guided forgery adaptation module that aligns VLM's semantic space with forensic features through contrastive learning with external manipulation knowledge; (2) A multi-modal prompt tuning framework that jointly optimizes visual-textual embeddings for both localization and explainability; (3) An iterative refinement strategy enabling multi-turn dialog for evidence-based reasoning. Our framework includes a VLM-based Knowledge-guided Forgery Detector (KFD), a VLM image encoder, and a Large Language Model (LLM). The VLM image encoder extracts visual prompt embeddings from images, while the LLM receives visual and question prompt embeddings for inference. The KFD is used to calculate correlations between image features and pristine/deepfake class embeddings, enabling forgery classification and localization. The outputs from these components are used to construct forgery prompt embeddings. Finally, we feed these prompt embeddings into the LLM to generate textual detection responses to assist judgment. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, and DFDC, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.

SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models

I-Fan Lin,Faegheh Hasibi,Suzan Verberne

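Task: Propose SPILL, a domain-adaptive intent clustering method that requires no fine-tuning.

Motivation: Existing embedding-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them less generalizable across datasets.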

Details

Method: Proposes a two-stage approach. First, an embedding is derived for each utterance (the seed), and a distance metric selects a pool of candidates close to the seed. Second, a large language model (LLM) selects, from that pool, the utterances that share the seed's intent; the selected candidates are then pooled with the seed to produce a refined embedding. Result: The method generally outperforms using the embedder directly and achieves results comparable to other state-of-the-art studies that use much larger models and require fine-tuning, showing its strength and efficiency. Conclusion: The method lets existing embedders improve further without additional fine-tuning, making them more adaptable to new domain datasets. Treating clustering as a small-scale selection problem also shows the potential of using LLMs to customize clustering tasks to the user's goals. Abstract: In this paper, we propose Selection and Pooling with Large Language Models (SPILL), an intuitive and domain-adaptive method for intent clustering without fine-tuning. Existing embeddings-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them less generalizable to multiple datasets. Our goal is to make these existing embedders more generalizable to new domain datasets without further fine-tuning. Inspired by our theoretical derivation and simulation results on the effectiveness of sampling and pooling techniques, we view the clustering task as a small-scale selection problem. A good solution to this problem is associated with better clustering performance. Accordingly, we propose a two-stage approach: First, for each utterance (referred to as the seed), we derive its embedding using an existing embedder. Then, we apply a distance metric to select a pool of candidates close to the seed. Because the embedder is not optimized for new datasets, in the second stage, we use an LLM to further select utterances from these candidates that share the same intent as the seed. Finally, we pool these selected candidates with the seed to derive a refined embedding for the seed. We found that our method generally outperforms directly using an embedder, and it achieves comparable results to other state-of-the-art studies, even those that use much larger models and require fine-tuning, showing its strength and efficiency. Our results indicate that our method enables existing embedders to be further improved without additional fine-tuning, making them more adaptable to new domain datasets.
Additionally, viewing the clustering task as a small-scale selection problem gives the potential of using LLMs to customize clustering tasks according to the user's goals.
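
The two-stage selection-and-pooling procedure described in this abstract can be sketched as follows. This is an illustrative reading, not the authors' implementation: Euclidean distance, mean pooling, the `pool_size` parameter, and the `same_intent` callback (a stand-in for the LLM selection step) are all assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def refine_embedding(seed_idx, embeddings, utterances, same_intent, pool_size=3):
    """Return a refined embedding for the seed utterance.

    Stage 1: pick the `pool_size` utterances nearest to the seed.
    Stage 2: keep only candidates that `same_intent(seed_text, candidate_text)`
             accepts (in practice this would be an LLM call).
    Pooling: average the seed embedding with the kept candidates.
    """
    seed_vec = embeddings[seed_idx]
    others = [i for i in range(len(embeddings)) if i != seed_idx]
    others.sort(key=lambda i: euclidean(seed_vec, embeddings[i]))
    candidates = others[:pool_size]

    kept = [i for i in candidates
            if same_intent(utterances[seed_idx], utterances[i])]

    group = [seed_vec] + [embeddings[i] for i in kept]
    return [sum(vec[d] for vec in group) / len(group)
            for d in range(len(seed_vec))]


if __name__ == "__main__":
    utterances = ["book a flight", "reserve a flight", "play music", "weather today"]
    embeddings = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [10.0, 10.0]]
    # Toy selector standing in for the LLM: accept candidates mentioning "flight".
    selector = lambda seed, cand: "flight" in cand
    print(refine_embedding(0, embeddings, utterances, selector, pool_size=2))
```

The refined embeddings would then be passed to any standard clustering algorithm in place of the raw embeddings.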

Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task, Dataset and Benchmark

Ying Liu,Yijing Hua,Haojiang Chai,Yanbo Wang,TengQi Ye

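Task: Extend supervised fine-grained object detection to the open-vocabulary setting by proposing the 3F-OVD task.

Motivation: Evaluations of existing open-vocabulary detectors can be unfair and unreliable, mainly because of variations in the vision-aware language vocabulary data.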

Details

Method: Proposes the 3F-OVD task and creates a new dataset, NEU-171K, suited to both supervised and open-vocabulary settings. A simple yet effective post-processing technique is also proposed. Result: State-of-the-art object detectors are benchmarked on the NEU-171K dataset. Conclusion: The 3F-OVD task is challenging, requiring a deep understanding of fine-grained captions and of details in images; the new dataset and post-processing technique help improve detection accuracy. Abstract: Open-vocabulary detectors are proposed to locate and recognize objects in novel classes. However, variations in vision-aware language vocabulary data used for open-vocabulary learning can lead to unfair and unreliable evaluations. Recent evaluation methods have attempted to address this issue by incorporating object properties or adding locations and characteristics to the captions. Nevertheless, since these properties and locations depend on the specific details of the images instead of classes, detectors cannot make accurate predictions without precise descriptions provided through human annotation. This paper introduces 3F-OVD, a novel task that extends supervised fine-grained object detection to the open-vocabulary setting. Our task is intuitive and challenging, requiring a deep understanding of Fine-grained captions and careful attention to Fine-grained details in images in order to accurately detect Fine-grained objects. Additionally, due to the scarcity of qualified fine-grained object detection datasets, we have created a new dataset, NEU-171K, tailored for both supervised and open-vocabulary settings. We benchmark state-of-the-art object detectors on our dataset for both settings. Furthermore, we propose a simple yet effective post-processing technique.

Optimizing Decomposition for Optimal Claim Verification

Yining Lu,Noah Ziems,Hy Dang,Meng Jiang

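Task: Optimize decomposition and verification strategies for evaluating the factuality of long-form text.

Motivation: Existing work usually treats decomposition and verification as independent processes, overlooking their interactions and potential misalignment, which leads to suboptimal verification results.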

Details

Method: Proposes dynamic decomposition, a reinforcement learning framework that uses verifier feedback to learn a policy for decomposing claims to the atomicity the verifier prefers. Result: Dynamic decomposition outperforms existing decomposition policies, improving verification confidence by 0.07 and accuracy by 0.12 on average across varying verifiers, datasets, and atomicities of input claims. Conclusion: Dynamic decomposition effectively improves verification when evaluating the factuality of long-form text. Abstract: Current research on the Decompose-Then-Verify paradigm for evaluating the factuality of long-form text typically treats decomposition and verification in isolation, overlooking their interactions and potential misalignment. We find that existing decomposition policies, typically hand-crafted demonstrations, do not align well with downstream verifiers in terms of atomicity -- a novel metric quantifying information density -- leading to suboptimal verification results. We formulate finding the optimal decomposition policy for optimal verification as a bilevel optimization problem. To approximate a solution for this strongly NP-hard problem, we propose dynamic decomposition, a reinforcement learning framework that leverages verifier feedback to learn a policy for dynamically decomposing claims to verifier-preferred atomicity. Experimental results show that dynamic decomposition outperforms existing decomposition policies, improving verification confidence by 0.07 and accuracy by 0.12 (on a 0-1 scale) on average across varying verifiers, datasets, and atomicities of input claims.

Temporal-Consistent Video Restoration with Pre-trained Diffusion Models

Hengkang Wang,Yang Liu,Huidong Liu,Chien-Chih Wang,Yanhui Guo,Hongdong Li,Bryan Wang,Ju Sun

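Task: Propose a new maximum a posteriori (MAP) framework that directly parameterizes video frames in the seed space of diffusion models (DMs), eliminating approximation errors and improving temporal consistency.

Motivation: Existing zero-shot video restoration methods built on pre-trained diffusion models suffer from approximation errors and insufficient temporal consistency, and processing 3D video data is computationally intensive.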

Details

Method: Proposes a new maximum a posteriori (MAP) framework that directly parameterizes video frames in the seed space of diffusion models, and improves temporal consistency through semantic-consistency and pixel-level-consistency strategies. Result: Extensive experiments on multiple video restoration tasks show that the method outperforms the state of the art in visual quality and temporal consistency. Conclusion: The proposed method delivers superior visual quality and temporal consistency in video restoration, addressing the approximation errors and temporal inconsistency of existing methods. Abstract: Video restoration (VR) aims to recover high-quality videos from degraded ones. Although recent zero-shot VR methods using pre-trained diffusion models (DMs) show good promise, they suffer from approximation errors during reverse diffusion and insufficient temporal consistency. Moreover, dealing with 3D video data, VR is inherently computationally intensive. In this paper, we advocate viewing the reverse process in DMs as a function and present a novel Maximum a Posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors. We also introduce strategies to promote bilevel temporal consistency: semantic consistency by leveraging clustering structures in the seed space, and pixel-level consistency by progressive warping with optical flow refinements. Extensive experiments on multiple video restoration tasks demonstrate superior visual quality and temporal consistency achieved by our method compared to the state-of-the-art.

SemEval-2025 Task 1: AdMIRe -- Advancing Multimodal Idiomaticity Representation

Thomas Pickard,Aline Villavicencio,Maggie Mi,Wei He,Dylan Phelps,Carolina Scarton,Marco Idiart

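Task: Evaluate and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages.

Motivation: Idiomatic expressions pose a unique challenge in natural language processing because their meanings often cannot be inferred directly from their constituent words. Despite progress in large language models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation.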

Details

Method: Presents the datasets and tasks for SemEval-2025 Task 1: AdMIRe (Advancing Multimodal Idiomaticity Representation), with two subtasks: ranking images by how well they align with the idiomatic or literal meaning, and predicting the next image in a sequence. Result: The most effective methods reached human-level performance by using pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models' representations of idiomaticity. Conclusion: Tasks set in multimodal and multilingual contexts can effectively evaluate and improve models' understanding of idiomatic expressions, and pretrained LLMs and vision-language models play an important role in this. Abstract: Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in Large Language Models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and tasks for SemEval-2025 Task 1: AdMIRe (Advancing Multimodal Idiomaticity Representation), which challenges the community to assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages. Participants competed in two subtasks: ranking images based on their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models' representations of idiomaticity.

DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition

Caoshuo Li,Tanzhe Li,Xiaobin Hu,Donghao Luo,Taisong Jin

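Task: Propose a new vision architecture, the Dilated Vision HyperGraph Neural Network (DVHGNN), to address key issues in the Vision Graph Neural Network (ViG).

Motivation: The Vision Graph Neural Network (ViG) has attracted considerable attention in computer vision, but it has two key issues: the quadratic computational complexity caused by its K-Nearest Neighbor (KNN) graph construction, and the restriction of ordinary graphs to pairwise relations.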

Details

Method: Proposes DVHGNN, a new vision architecture that uses multi-scale hypergraphs to efficiently capture high-order correlations among objects. It combines a tailored Clustering and Dilated HyperGraph Construction (DHGC) with a dynamic hypergraph convolution mechanism. Result: Extensive qualitative and quantitative evaluations on benchmark image datasets show that DVHGNN significantly outperforms existing vision backbones. For example, DVHGNN-S reaches 83.1% top-1 accuracy on ImageNet-1K, surpassing ViG-S and ViHGNN-S. Conclusion: Through multi-scale hypergraphs and dynamic hypergraph convolution, DVHGNN effectively addresses the key issues in ViG and achieves notable gains in image classification. Abstract: Recently, Vision Graph Neural Network (ViG) has gained considerable attention in computer vision. Despite its groundbreaking innovation, Vision Graph Neural Network encounters key issues including the quadratic computational complexity caused by its K-Nearest Neighbor (KNN) graph construction and the limitation of pairwise relations of normal graphs. To address the aforementioned challenges, we propose a novel vision architecture, termed Dilated Vision HyperGraph Neural Network (DVHGNN), which is designed to leverage multi-scale hypergraph to efficiently capture high-order correlations among objects. Specifically, the proposed method tailors Clustering and Dilated HyperGraph Construction (DHGC) to adaptively capture multi-scale dependencies among the data samples. Furthermore, a dynamic hypergraph convolution mechanism is proposed to facilitate adaptive feature exchange and fusion at the hypergraph level. Extensive qualitative and quantitative evaluations on benchmark image datasets demonstrate that the proposed DVHGNN significantly outperforms the state-of-the-art vision backbones. For instance, our DVHGNN-S achieves an impressive top-1 accuracy of 83.1% on ImageNet-1K, surpassing ViG-S by +1.0% and ViHGNN-S by +0.6%.

Real-world validation of a multimodal LLM-powered pipeline for High-Accuracy Clinical Trial Patient Matching leveraging EHR data

Anatole Callies,Quentin Bodinier,Philippe Ravaud,Kourosh Davarpanah

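Task: Automate patient-trial matching using unprocessed documents extracted from electronic health records (EHRs).

Motivation: Address the complexity of patient recruitment in clinical trials and the labor-intensive chart reviews it requires.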

Details

Method: Introduces a broadly applicable, integration-free, LLM-powered pipeline that uses (1) the new reasoning-LLM paradigm, (2) the visual capabilities of the latest LLMs, and (3) multimodal embeddings for efficient medical record search. Result: On the n2c2 dataset, the method reached a criterion-level accuracy of 93%; in real-world trials, accuracy was 87%, and users could review a patient's overall eligibility in under 9 minutes on average. Conclusion: The pipeline performs well in clinical trial patient matching without custom integration with site systems or trial-specific tailoring, enabling scalable deployment across sites. Abstract: Background: Patient recruitment in clinical trials is hindered by complex eligibility criteria and labor-intensive chart reviews. Prior research using text-only models has struggled to address this problem in a reliable and scalable way due to (1) limited reasoning capabilities, (2) information loss from converting visual records to text, and (3) lack of a generic EHR integration to extract patient data. Methods: We introduce a broadly applicable, integration-free, LLM-powered pipeline that automates patient-trial matching using unprocessed documents extracted from EHRs. Our approach leverages (1) the new reasoning-LLM paradigm, enabling the assessment of even the most complex criteria, (2) visual capabilities of latest LLMs to interpret medical records without lossy image-to-text conversions, and (3) multimodal embeddings for efficient medical record search. The pipeline was validated on the n2c2 2018 cohort selection dataset (288 diabetic patients) and a real-world dataset composed of 485 patients from 30 different sites matched against 36 diverse trials. Results: On the n2c2 dataset, our method achieved a new state-of-the-art criterion-level accuracy of 93%. In real-world trials, the pipeline yielded an accuracy of 87%, undermined by the difficulty of replicating human decision-making when medical records lack sufficient information. Nevertheless, users were able to review overall eligibility in under 9 minutes per patient on average, representing an 80% improvement over traditional manual chart reviews.
Conclusion: This pipeline demonstrates robust performance in clinical trial patient matching without requiring custom integration with site systems or trial-specific tailoring, thereby enabling scalable deployment across sites seeking to leverage AI for patient matching.

Efficient Personalization of Quantized Diffusion Model without Backpropagation

Hoigi Seo,Wongi Jeong,Kyungryeol Lee,Se Young Chun

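Task: Enable memory-efficient fine-tuning of diffusion models through quantization and zeroth-order optimization.

Motivation: Diffusion models perform remarkably well in image synthesis, but training and fine-tuning them demands extensive compute and memory, which is a particular problem when running on edge devices.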

Details

Method: Quantizes a diffusion model for personalization via Textual Inversion and applies zeroth-order optimization to the personalization tokens, avoiding dequantization and the gradient and activation storage that backpropagation requires. Also proposes Subspace Gradient, which denoises the estimated gradient, and Partial Uniform Timestep Sampling. Result: When personalizing Stable Diffusion, the method achieves image and text alignment scores comparable to prior methods while reducing training memory demand by up to 8.2×. Conclusion: The method reduces memory requirements while preserving the diffusion model's performance, making it suitable for personalization on edge devices. Abstract: Diffusion models have shown remarkable performance in image synthesis, but they demand extensive computational and memory resources for training, fine-tuning and inference. Although advanced quantization techniques have successfully minimized memory usage for inference, training and fine-tuning these quantized models still require large memory possibly due to dequantization for accurate computation of gradients and/or backpropagation for gradient-based algorithms. However, memory-efficient fine-tuning is particularly desirable for applications such as personalization that often must be run on edge devices like mobile phones with private data. In this work, we address this challenge by quantizing a diffusion model with personalization via Textual Inversion and by leveraging a zeroth-order optimization on personalization tokens without dequantization so that it does not require gradient and activation storage for backpropagation that consumes considerable memory. Since a gradient estimation using zeroth-order optimization is quite noisy for a single or a few images in personalization, we propose to denoise the estimated gradient by projecting it onto a subspace that is constructed with the past history of the tokens, dubbed Subspace Gradient. In addition, we investigated the influence of text embedding in image generation, leading to our proposed time steps sampling, dubbed Partial Uniform Timestep Sampling for sampling with effective diffusion timesteps. Our method achieves comparable performance to prior methods in image and text alignment scores for personalizing Stable Diffusion with only forward passes while reducing training memory demand up to 8.2×.
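
Zeroth-order optimization estimates a gradient from forward evaluations only, which is why no activations need to be stored. The sketch below shows a generic two-point estimator along a single direction and a projection onto a given subspace. It is a textbook illustration under stated assumptions, not the paper's procedure: the function names, the `eps` step size, and the requirement that `basis` be orthonormal are all assumptions, and how the subspace is built from the token history is not shown.

```python
def zeroth_order_gradient(loss, x, direction, eps=1e-3):
    """Two-point gradient estimate along `direction`, using forward evaluations only.

    Computes ((loss(x + eps*u) - loss(x - eps*u)) / (2*eps)) * u.
    """
    x_plus = [xi + eps * ui for xi, ui in zip(x, direction)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, direction)]
    slope = (loss(x_plus) - loss(x_minus)) / (2 * eps)
    return [slope * ui for ui in direction]


def project_onto_subspace(grad, basis):
    """Project `grad` onto the span of `basis` (assumed orthonormal vectors)."""
    projected = [0.0] * len(grad)
    for b in basis:
        coeff = sum(g * bi for g, bi in zip(grad, b))
        projected = [p + coeff * bi for p, bi in zip(projected, b)]
    return projected


if __name__ == "__main__":
    # Toy quadratic loss standing in for the model's forward-pass loss.
    loss = lambda v: sum(vi * vi for vi in v)
    estimate = zeroth_order_gradient(loss, [1.0, 2.0], [0.6, 0.8])
    print(estimate, project_onto_subspace(estimate, [[1.0, 0.0]]))
```

In practice the direction is sampled at random on every step, so a single estimate is noisy; projecting onto a low-dimensional subspace discards the components outside it, which is the denoising idea the abstract describes.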

VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning

Yang Tan,Chen Liu,Jingyuan Gao,Banghao Wu,Mingchen Li,Ruilin Wang,Lingrong Zhang,Huiqun Yu,Guisheng Fan,Liang Hong,Bingxin Zhou

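Task: Develop VenusFactory, an engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of protein language models.

Motivation: Interdisciplinary adoption of pre-trained protein language models remains limited by challenges in data collection, task benchmarking, and application.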

Details

Method: VenusFactory integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning, and offers both command-line execution and a Gradio-based no-code interface. Result: VenusFactory integrates more than 40 protein-related datasets and more than 40 popular protein language models; all implementations are open source. Conclusion: VenusFactory gives the computer science and biology communities a versatile tool that promotes interdisciplinary use of protein language models. Abstract: Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory supports both computer science and biology communities with choices of both a command-line execution and a Gradio-based no-code interface, integrating $40+$ protein-related datasets and $40+$ popular PLMs. All implementations are open-sourced on https://github.com/tyang816/VenusFactory.

DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework

Henrique Morimitsu,Xiaobin Zhu,Roberto M. Cesar Jr.,Xiangyang Ji,Xu-Cheng Yin

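Task: Propose DPFlow, an adaptive optical flow architecture that generalizes to inputs of up to 8K resolution, and introduce Kubric-NK, a new benchmark for evaluating optical flow methods.

Motivation: Current optical flow methods are usually designed for low resolution and do not generalize to large inputs, and there is no benchmark with high-resolution samples for measuring how existing methods actually perform.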

Details

Method: Proposes DPFlow, an adaptive optical flow architecture that generalizes to 8K-resolution inputs while being trained only on low-resolution samples, and introduces the Kubric-NK benchmark. Result: DPFlow achieves state-of-the-art results on MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks. Conclusion: DPFlow and Kubric-NK fill a gap in high-resolution optical flow estimation and reveal how well existing methods generalize. Abstract: Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks.

SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Tongyao Zhu,Qian Liu,Haonan Wang,Shiqi Chen,Xiangming Gu,Tianyu Pang,Min-Yen Kan

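Task: Explore an optimal context window scheduling strategy that better balances long-context capability with pretraining efficiency.

Motivation: A pilot study finds that, under a fixed token budget, models pretrained with shorter context windows consistently outperform their long-context counterparts.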

Details

Method: Proposes SkyLadder, which implements a short-to-long context window transition during pretraining. Result: With 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) pretrained on 100B tokens, SkyLadder yields gains of up to 3.7% on common benchmarks and trains up to 22% faster than baselines. Conclusion: SkyLadder preserves strong standard benchmark performance while matching or exceeding baseline results on long-context tasks. Abstract: Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance, while matching or exceeding baseline results on long context tasks. Through extensive experiments, we pre-train 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. The code is at https://github.com/sail-sg/SkyLadder.
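A short-to-long context window transition of the kind described above can be sketched as a function of the training step. The abstract does not give the actual schedule, so the linear ramp, the default values, and the name `context_window` are assumptions for illustration only.

```python
def context_window(step, start=32, end=8192, ramp_steps=1000, multiple=32):
    """Context window (in tokens) to use at a given training step.

    Assumed schedule: grow linearly from `start` to `end` over `ramp_steps`
    steps, rounded down to a multiple of `multiple`, then stay at `end`.
    """
    if step >= ramp_steps:
        return end
    frac = step / ramp_steps
    window = start + frac * (end - start)
    window = int(window // multiple) * multiple  # keep sizes hardware-friendly
    return max(start, min(window, end))


# Example: inspect the window at a few points of the ramp.
for step in (0, 250, 500, 750, 1000):
    print(step, context_window(step))
```

Short windows early in training make each step cheaper, which is one way to read the reported speedup; the final window equals the target length so the model still sees long sequences.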

Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations

Shuo Li,Jiajun Sun,Guodong Zheng,Xiaoran Fan,Yujiong Shen,Yi Lu,Zhiheng Xi,Yuming Yang,Wenming Tan,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang

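Task: Propose Multi-Frequency Perturbations (MFP), a method for reducing object hallucinations of multimodal large language models (MLLMs) in vision-language tasks.

Motivation: MLLMs perform well on vision-language tasks, but their responses are often compromised by object hallucinations. The study identifies the model's over-susceptibility to specific image frequency features as a key cause of these hallucinations.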

Details

Method: Introduces Multi-Frequency Perturbations (MFP), which uses both low-frequency and high-frequency image features to perturb visual feature representations and explicitly suppresses redundant frequency-domain features during inference. Result: Experiments show that the method significantly reduces object hallucinations across various model architectures. As a training-time method, MFP can also be combined with inference-time methods to reach state-of-the-art performance on the CHAIR benchmark. Conclusion: MFP is a simple, cost-effective, and pluggable method that effectively mitigates object hallucinations in MLLMs and improves their performance on vision-language tasks. Abstract: Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.
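For readers unfamiliar with CHAIR, the benchmark cited above, the sketch below shows how its two scores are commonly computed: the share of mentioned objects that are not in the image (instance level) and the share of captions containing at least one such object (sentence level). The input format and the name `chair_scores` are assumptions, and real evaluations also map caption words to object categories, which is omitted here.

```python
def chair_scores(captions):
    """Compute instance-level and sentence-level hallucination rates.

    `captions` is a list of (mentioned_objects, ground_truth_objects) pairs,
    an assumed simplified input with object extraction already done.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, truth in captions:
        truth = set(truth)
        bad = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
        hallucinated_captions += bool(bad)  # count captions with any hallucination
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_captions / len(captions) if captions else 0.0
    return chair_i, chair_s


# Example: the first caption mentions a frisbee that is not in the image.
example = [(["dog", "frisbee"], ["dog"]), (["cat"], ["cat", "sofa"])]
print(chair_scores(example))
```

Lower is better for both scores.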

Evaluating Bias in Retrieval-Augmented Medical Question-Answering Systems

Yuelyu Ji,Hang Zhang,Yanshan Wang

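Task: Evaluate bias in medical question-answering systems built on Retrieval-Augmented Generation (RAG) models.

Motivation: Medical QA systems play an important role in clinical decision-making but may introduce biases related to race, gender, and social determinants of health.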

Details

Method: Systematically evaluates bias in RAG models by examining demographic-sensitive queries and measuring retrieval discrepancies. Using datasets such as MMLU and MedMCQA, it analyzes retrieval overlap and correctness disparities. Result: The study finds substantial demographic disparities within RAG pipelines. Conclusion: Retrieval methods that explicitly account for fairness are needed to ensure equitable clinical decision-making. Abstract: Medical QA systems powered by Retrieval-Augmented Generation (RAG) models support clinical decision-making but may introduce biases related to race, gender, and social determinants of health. We systematically evaluate biases in RAG-based LLM by examining demographic-sensitive queries and measuring retrieval discrepancies. Using datasets like MMLU and MedMCQA, we analyze retrieval overlap and correctness disparities. Our findings reveal substantial demographic disparities within RAG pipelines, emphasizing the critical need for retrieval methods that explicitly account for fairness to ensure equitable clinical decision-making.
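The two measurements named above, retrieval overlap and correctness disparity, can be made concrete with a small sketch. The abstract does not define either metric, so Jaccard overlap of retrieved document IDs and a max-minus-min accuracy gap are assumed stand-ins, and the function names are hypothetical.

```python
def retrieval_overlap(docs_a, docs_b):
    """Jaccard overlap between the documents retrieved for two query variants."""
    set_a, set_b = set(docs_a), set(docs_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty retrievals are treated as identical
    return len(set_a & set_b) / len(union)


def accuracy_gap(correct_by_group):
    """Largest difference in answer accuracy between demographic groups.

    `correct_by_group` maps a group label to a list of booleans, one per
    question, saying whether the answer was correct for that variant.
    """
    accuracies = {
        group: sum(flags) / len(flags) for group, flags in correct_by_group.items()
    }
    gap = max(accuracies.values()) - min(accuracies.values())
    return gap, accuracies


# Example: the same question asked with two demographic attributes.
print(retrieval_overlap(["d1", "d2", "d3"], ["d2", "d3", "d4"]))
```

An overlap well below 1 for queries that differ only in a demographic attribute would indicate that the retriever, not just the generator, treats the groups differently.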

When Domain Generalization meets Generalized Category Discovery: An Adaptive Task-Arithmetic Driven Approach

Vaibhav Rathore,Shubhranil B,Saikat Dutta,Sarthak Mehrotra,Zsolt Kira,Biplab Banerjee

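Task: Cluster base and novel classes in a target domain using supervision from a source domain that contains only base classes.

Motivation: Current methods falter under distribution shift and typically require access to target data during training, which is sometimes impractical.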

Details

Method: Introduces Domain Generalization in Generalized Category Discovery (DG-GCD), a new paradigm in which only source data is available for training and the target domain remains unseen until inference. The proposed DG2CD-Net improves cross-domain generalization by adapting a base model on tasks derived from the source domain and from synthetic domains generated by a foundation model. Result: Experiments on three datasets confirm that DG2CD-Net outperforms existing GCD methods customized for DG-GCD. Conclusion: With a domain-independent, discriminative embedding space and an extension of task arithmetic, DG2CD-Net improves the base model's adaptability to unseen targets. Abstract: Generalized Class Discovery (GCD) clusters base and novel classes in a target domain using supervision from a source domain with only base classes. Current methods often falter with distribution shifts and typically require access to target data during training, which can sometimes be impractical. To address this issue, we introduce the novel paradigm of Domain Generalization in GCD (DG-GCD), where only source data is available for training, while the target domain, with a distinct data distribution, remains unseen until inference. To this end, our solution, DG2CD-Net, aims to construct a domain-independent, discriminative embedding space for GCD. The core innovation is an episodic training strategy that enhances cross-domain generalization by adapting a base model on tasks derived from source and synthetic domains generated by a foundation model. Each episode focuses on a cross-domain GCD task, diversifying task setups over episodes and combining open-set domain adaptation with a novel margin loss and representation learning for optimizing the feature space progressively. To capture the effects of fine-tuning on the base model, we extend task arithmetic by adaptively weighting the local task vectors concerning the fine-tuned models based on their GCD performance on a validation distribution. This episodic update mechanism boosts the adaptability of the base model to unseen targets. Experiments across three datasets confirm that DG2CD-Net outperforms existing GCD methods customized for DG-GCD.
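The adaptively weighted task arithmetic mentioned in the abstract can be sketched on flat parameter lists. A task vector is the difference between a fine-tuned model and the base model; the sketch adds a weighted sum of task vectors back to the base. The softmax over validation scores and the name `adaptive_task_arithmetic` are assumptions, since the abstract does not state how the weights are derived from GCD performance.

```python
import math


def adaptive_task_arithmetic(base, finetuned, val_scores, temperature=1.0):
    """Merge fine-tuned models into the base using score-dependent weights.

    base:       list of floats (base model parameters)
    finetuned:  list of parameter lists, one per episode or task
    val_scores: validation score of each fine-tuned model
    Weights are a softmax of the scores (assumed weighting rule).
    """
    exps = [math.exp(score / temperature) for score in val_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    merged = list(base)
    for weight, model in zip(weights, finetuned):
        for k, (theta, theta0) in enumerate(zip(model, base)):
            merged[k] += weight * (theta - theta0)  # weighted task vector
    return merged, weights


# Example: two fine-tuned models with equal validation scores.
merged, weights = adaptive_task_arithmetic(
    [0.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], [0.7, 0.7]
)
print(merged, weights)
```

With equal scores the update reduces to the plain average of the task vectors; a better-scoring model pulls the merged parameters further toward itself.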

From 1,000,000 Users to Every User: Scaling Up Personalized Preference for User-level Alignment

Jia-Nan Li,Jian Guan,Songhao Wu,Wei Wu,Rui Yan

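Task: Propose a framework for scalable personalized alignment of large language models (LLMs).

Motivation: Traditional one-size-fits-all alignment overlooks the diversity of user values and needs.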

Details

Method: Establishes a systematic preference space and diverse persona representations, builds AlignX, a dataset of over 1.3 million personalized preference examples, and proposes two complementary approaches: in-context alignment and preference-bridged alignment. Result: Achieves an average accuracy gain of 17.06% across four benchmarks, with strong adaptation to novel preferences, robustness to limited user data, and precise preference controllability. Conclusion: The framework advances the development of truly user-adaptive AI systems. Abstract: Large language models (LLMs) have traditionally been aligned through one-size-fits-all approaches that assume uniform human preferences, fundamentally overlooking the diversity in user values and needs. This paper introduces a comprehensive framework for scalable personalized alignment of LLMs. We establish a systematic preference space characterizing psychological and behavioral dimensions, alongside diverse persona representations for robust preference inference in real-world scenarios. Building upon this foundation, we introduce AlignX, a large-scale dataset of over 1.3 million personalized preference examples, and develop two complementary alignment approaches: in-context alignment directly conditioning on persona representations and preference-bridged alignment modeling intermediate preference distributions. Extensive experiments demonstrate substantial improvements over existing methods, with an average 17.06% accuracy gain across four benchmarks while exhibiting a strong adaptation capability to novel preferences, robustness to limited user data, and precise preference controllability. These results validate our framework's effectiveness, advancing toward truly user-adaptive AI systems.

Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation

Siwei Wen,Junyan Ye,Peilin Feng,Hengrui Kang,Zichen Wen,Yize Chen,Jiang Wu,Wenjun Wu,Conghui He,Weijia Li

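Task: Develop FakeVLM, a specialized large multimodal model that detects synthetic images and DeepFakes and explains its decisions in natural language.

Motivation: With the rapid advancement of AI-generated content (AIGC) technologies, synthetic images are increasingly prevalent in everyday life, posing new challenges for authenticity assessment and detection. Existing methods are effective at evaluating image authenticity and locating forgeries, but they lack human interpretability and do not fully address the growing complexity of synthetic data.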

Details

Method: Introduces FakeVLM, a large multimodal model specialized for both general synthetic image detection and DeepFake detection. FakeVLM distinguishes real from fake images and provides clear natural-language explanations of image artifacts, which improves interpretability. The work also presents FakeClue, a dataset of over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. Result: Extensive evaluations across multiple datasets show that FakeVLM is superior in both authenticity classification and artifact explanation, setting a new benchmark for synthetic image detection. Conclusion: FakeVLM performs comparably to expert models without needing additional classifiers, making it a robust solution for synthetic data detection. The dataset and code will be released at https://github.com/opendatalab/FakeVLM. Abstract: With the rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, synthetic images have become increasingly prevalent in everyday life, posing new challenges for authenticity assessment and detection. Despite the effectiveness of existing methods in evaluating image authenticity and locating forgeries, these approaches often lack human interpretability and do not fully address the growing complexity of synthetic data. To tackle these challenges, we introduce FakeVLM, a specialized large multimodal model designed for both general synthetic image and DeepFake detection tasks. FakeVLM not only excels in distinguishing real from fake images but also provides clear, natural language explanations for image artifacts, enhancing interpretability. Additionally, we present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. FakeVLM demonstrates performance comparable to expert models while eliminating the need for additional classifiers, making it a robust solution for synthetic data detection. Extensive evaluations across multiple datasets confirm the superiority of FakeVLM in both authenticity classification and artifact explanation tasks, setting a new benchmark for synthetic image detection. The dataset and code will be released in: https://github.com/opendatalab/FakeVLM.

Dynamic Bi-Elman Attention Networks (DBEAN): Dual-Directional Context-Aware Representation Learning for Enhanced Text Classification

ZhengLin Lai,MengYao Liao,Dong Xu

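Task: Propose the Dynamic Bidirectional Elman Attention Network (DBEAN), a text classification model that addresses the shortcomings of existing models in interpretability, computational efficiency, and long-range contextual understanding.

Motivation: Traditional methods struggle with complex linguistic structures and semantic dependencies, while existing deep learning models are limited in balancing interpretability, computational efficiency, and long-range contextual understanding.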

Details

Method: Proposes the Dynamic Bidirectional Elman Attention Network (DBEAN), which combines bidirectional temporal modelling with self-attention and dynamically assigns weights to critical segments of the input. Result: DBEAN improves contextual representation while maintaining computational efficiency. Conclusion: DBEAN performs well on text classification and better balances interpretability, computational efficiency, and long-range contextual understanding. Abstract: Text classification, a fundamental task in natural language processing (NLP), aims to categorize textual data into predefined labels. Traditional methods struggled with complex linguistic structures and semantic dependencies. The advent of deep learning, particularly recurrent neural networks (RNNs) and Transformer-based models, has significantly advanced the field by enabling nuanced feature extraction and context-aware predictions. Despite improvements, existing models exhibit limitations in balancing interpretability, computational efficiency, and long-range contextual understanding. This paper proposes the Dynamic Bidirectional Elman with Attention Network (DBEAN), which integrates bidirectional temporal modelling with self-attention mechanisms. DBEAN dynamically assigns weights to critical segments of input, improving contextual representation while maintaining computational efficiency.

Robust Distribution Alignment for Industrial Anomaly Detection under Distribution Shift

Jingyi Liao,Xun Xu,Yongyi Su,Rong-Cheng Tu,Yifan Liu,Dacheng Tao,Xulei Yang

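Task: Make anomaly detection, which plays a crucial role in quality control for industrial applications, robust to unseen distribution shifts.

Motivation: Existing methods struggle with unseen domain shifts such as lighting variations or sensor drift. They often rely on prior knowledge of the target distribution and hardly generalize to backbones designed for other data modalities.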

Details

Method: Builds on memory-bank-based anomaly detection and optimizes a robust Sinkhorn distance on limited target training data to improve generalization to unseen target domains. Result: On 2D and 3D anomaly detection benchmarks with simulated distribution shifts, the proposed method outperforms state-of-the-art anomaly detection and domain adaptation methods. Conclusion: The method generalizes well to unseen target domains and outperforms existing anomaly detection and domain adaptation methods. Abstract: Anomaly detection plays a crucial role in quality control for industrial applications. However, ensuring robustness under unseen domain shifts such as lighting variations or sensor drift remains a significant challenge. Existing methods attempt to address domain shifts by training generalizable models but often rely on prior knowledge of target distributions and can hardly generalise to backbones designed for other data modalities. To overcome these limitations, we build upon memory-bank-based anomaly detection methods, optimizing a robust Sinkhorn distance on limited target training data to enhance generalization to unseen target domains. We evaluate the effectiveness on both 2D and 3D anomaly detection benchmarks with simulated distribution shifts. Our proposed method demonstrates superior results compared with state-of-the-art anomaly detection and domain adaptation methods.
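As background for the Sinkhorn distance named above, the sketch below computes the standard entropy-regularized transport cost between two small sets of scalar features with uniform weights. It shows only the plain Sinkhorn iteration; the robust variant optimized in the paper is not specified in the abstract and is not reproduced here. The function name and the scalar feature format are assumptions.

```python
import math


def sinkhorn_cost(xs, ys, eps=0.5, iters=500):
    """Entropy-regularized optimal transport cost between two 1D point sets.

    Uses squared distance as the ground cost and uniform weights on both
    sets. `eps` is the regularization strength; very small values can
    underflow in this plain (non-log-domain) implementation.
    """
    n, m = len(xs), len(ys)
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = 1.0 / n, 1.0 / m  # uniform marginals
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m)
    )


# Example: a memory bank of features versus a shifted copy of itself.
bank = [0.0, 1.0, 2.0]
print(sinkhorn_cost(bank, bank), sinkhorn_cost(bank, [1.0, 2.0, 3.0]))
```

The shifted set yields the larger cost, which is the property a distribution-alignment objective relies on: minimizing the distance pulls target features back toward the memory bank.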

Value Profiles for Encoding Human Variation

Taylor Sorensen,Pushkar Mishra,Roma Patel,Michael Henry Tessler,Michiel Bakker,Georgina Evans,Iason Gabriel,Noah Goodman,Verena Rieser

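Task: Model human variation in rating tasks to enable AI systems for personalization, pluralistic model alignment, and computational social science.

Motivation: AI systems for personalization, pluralistic model alignment, and computational social science need a model of how individual raters differ.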

Details

Method: Represents individuals with value profiles (natural-language descriptions of underlying values) and uses a steerable decoder model to estimate ratings conditioned on a value profile or other rater information. Result: Demonstrations contain the most information, followed by value profiles and then demographics. Value profiles offer advantages in scrutability, interpretability, and steerability. Conclusion: Value profiles offer a novel, predictive way to describe individual variation beyond demographics or group information. Abstract: Modelling human variation in rating tasks is crucial for enabling AI systems for personalization, pluralistic model alignment, and computational social science. We propose representing individuals using value profiles -- natural language descriptions of underlying values compressed from in-context demonstrations -- along with a steerable decoder model to estimate ratings conditioned on a value profile or other rater information. To measure the predictive information in rater representations, we introduce an information-theoretic methodology. We find that demonstrations contain the most information, followed by value profiles and then demographics. However, value profiles offer advantages in terms of scrutability, interpretability, and steerability due to their compressed natural language format. Value profiles effectively compress the useful information from demonstrations (>70% information preservation). Furthermore, clustering value profiles to identify similarly behaving individuals better explains rater variation than the most predictive demographic groupings. Going beyond test set performance, we show that the decoder models interpretably change ratings according to semantic profile differences, are well-calibrated, and can help explain instance-level disagreement by simulating an annotator population. These results demonstrate that value profiles offer novel, predictive ways to describe individual variation beyond demographics or group information.

Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology

Siyuan Yan,Ming Hu,Yiwen Jiang,Xieji Li,Hao Fei,Philipp Tschandl,Harald Kittler,Zongyuan Ge

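Task: Present and validate Derm1M, a large-scale vision-language dataset for dermatology intended to advance AI research and clinical application.

Motivation: Existing dermatology datasets are limited in scale and depth and lack rich textual descriptions and clinical context, which holds back the development of dermatological AI.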

Details

Method: Builds the Derm1M dataset of 1,029,761 image-text pairs covering over 390 skin conditions and 130 clinical concepts, and pretrains a series of CLIP-like models (DermLIP) on it. Result: DermLIP models significantly outperform state-of-the-art foundation models across multiple tasks and datasets, including zero-shot skin disease classification, clinical and artifact concept identification, few-shot/full-shot learning, and cross-modal retrieval. Conclusion: Derm1M and DermLIP show strong potential for advancing dermatological AI research and clinical application; the dataset and code will be made public. Abstract: The emergence of vision-language models has transformed medical AI, enabling unprecedented advances in diagnostic capability and clinical applications. However, progress in dermatology has lagged behind other medical domains due to the lack of standard image-text pairs. Existing dermatological datasets are limited in both scale and depth, offering only single-label annotations across a narrow range of diseases instead of rich textual descriptions, and lacking the crucial clinical context needed for real-world applications. To address these limitations, we present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. Built from diverse educational resources and structured around a standard ontology collaboratively developed by experts, Derm1M provides comprehensive coverage for over 390 skin conditions across four hierarchical levels and 130 clinical concepts with rich contextual information such as medical history, symptoms, and skin tone. To demonstrate Derm1M's potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset. The DermLIP family significantly outperforms state-of-the-art foundation models on eight diverse datasets across multiple tasks, including zero-shot skin disease classification, clinical and artifacts concept identification, few-shot/full-shot learning, and cross-modal retrieval. Our dataset and code will be public.

Policy Frameworks for Transparent Chain-of-Thought Reasoning in Large Language Models

Yihang Chen,Haikang Deng,Kaiqiao Han,Qingyue Zhao

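Task: Analyze the double-edged implications of fully disclosing Chain-of-Thought (CoT) reasoning in large language models (LLMs) and propose a tiered-access policy framework.

Motivation: CoT disclosure policies currently vary across models and lack a unified policy framework, while full disclosure risks intellectual property violations, misuse, and higher operational costs.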

Details

Method: Proposes a tiered-access policy framework that balances transparency, accountability, and security through ethical licensing, structured reasoning outputs, and cross-tier safeguards. Result: By harmonizing accessibility with ethical and operational considerations, the framework aims to advance responsible AI deployment while mitigating risks of misuse or misinterpretation. Conclusion: A tiered-access policy framework can balance transparency, accountability, and security, supporting responsible AI deployment and reducing potential risks. Abstract: Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by decomposing complex problems into step-by-step solutions, improving performance on reasoning tasks. However, current CoT disclosure policies vary widely across different models in frontend visibility, API access, and pricing strategies, lacking a unified policy framework. This paper analyzes the dual-edged implications of full CoT disclosure: while it empowers small-model distillation, fosters trust, and enables error diagnosis, it also risks violating intellectual property, enabling misuse, and incurring operational costs. We propose a tiered-access policy framework that balances transparency, accountability, and security by tailoring CoT availability to academic, business, and general users through ethical licensing, structured reasoning outputs, and cross-tier safeguards. By harmonizing accessibility with ethical and operational considerations, this framework aims to advance responsible AI deployment while mitigating risks of misuse or misinterpretation.

Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes

Gahye Lee,Hyejeong Yoon,Jungeon Kim,Seungyong Lee

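Task: Propose a deep-learning-based framework for compactly representing 3D indoor scenes.

Motivation: Indoor scenes consist mainly of man-made objects such as furniture, which often have rectilinear geometry. They can therefore be represented with combinations of polycuboids, giving a compact representation that benefits downstream applications such as furniture rearrangement.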

Details

Method: The framework first detects six types of cuboid faces with a transformer network, then uses a graph neural network to validate the spatial relationships of the detected faces and form candidate polycuboids, and finally reconstructs each polycuboid instance from the aggregated face labels. Result: The framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. Conclusion: Practical applications such as virtual room tours and scene editing demonstrate the method's versatility. Abstract: This paper presents a novel framework for compactly representing a 3D indoor scene using a set of polycuboids through a deep learning-based fitting method. Indoor scenes mainly consist of man-made objects, such as furniture, which often exhibit rectilinear geometry. This property allows indoor scenes to be represented using combinations of polycuboids, providing a compact representation that benefits downstream applications like furniture rearrangement. Our framework takes a noisy point cloud as input and first detects six types of cuboid faces using a transformer network. Then, a graph neural network is used to validate the spatial relationships of the detected faces to form potential polycuboids. Finally, each polycuboid instance is reconstructed by forming a set of boxes based on the aggregated face labels. To train our networks, we introduce a synthetic dataset encompassing a diverse range of cuboid and polycuboid shapes that reflect the characteristics of indoor scenes. Our framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. The versatility of our method is demonstrated through practical applications, such as virtual room tours and scene editing.

Threefold model for AI Readiness: A Case Study with Finnish Healthcare SMEs

Mohammed Alnajjar,Khalid Alnajjar,Mika Hämäläinen

Task: Study AI adoption among Finnish healthcare SMEs.

Motivation: Understand the current state of AI use in healthcare SMEs and the challenges they face.

Details

Method: Semi-structured interviews with six health-tech companies. Result: Identifies three AI engagement categories, namely AI-curious (exploring AI), AI-embracing (integrating AI), and AI-catering (providing AI solutions), and highlights key adoption barriers, including regulatory complexities, technical expertise gaps, and financial constraints. Conclusion: Provides actionable recommendations to accelerate AI integration, focusing on regulatory reforms, talent development, and inter-company collaboration, with insights for healthcare organizations, policymakers, and researchers. Abstract: This study examines AI adoption among Finnish healthcare SMEs through semi-structured interviews with six health-tech companies. We identify three AI engagement categories: AI-curious (exploring AI), AI-embracing (integrating AI), and AI-catering (providing AI solutions). Our proposed threefold model highlights key adoption barriers, including regulatory complexities, technical expertise gaps, and financial constraints. While SMEs recognize AI's potential, most remain in early adoption stages. We provide actionable recommendations to accelerate AI integration, focusing on regulatory reforms, talent development, and inter-company collaboration, offering valuable insights for healthcare organizations, policymakers, and researchers.

GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation

Junyu Shi,Lijiang Liu,Yong Sun,Zhiyuan Zhang,Jinni Zhou,Qiang Nie

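Task: Propose the Generative Pretrained Multi-path Motion Model (GenM$^3$), a framework that addresses data heterogeneity in large-scale multi-source datasets and learns unified motion representations.

Motivation: Scaling up motion datasets is needed to improve motion generation, but training on large-scale multi-source datasets introduces data heterogeneity challenges.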

Details

Method: GenM$^3$ has two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations with separate modality-specific pathways and improves inter-modal alignment with a text-motion shared pathway. Result: GenM$^3$ achieves an FID of 0.035 on the HumanML3D benchmark, surpassing existing methods by a large margin, and shows strong zero-shot generalization on the IDEA400 dataset. Conclusion: GenM$^3$ is effective and adaptable across diverse motion scenarios. Abstract: Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose Generative Pretrained Multi-path Motion Model (GenM$^3$), a comprehensive framework designed to learn unified motion representations. GenM$^3$ comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment by the text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment it with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin. It also demonstrates strong zero-shot generalization on the IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.

Squeeze Out Tokens from Sample for Finer-Grained Data Governance

Weixiong Lin,Chen Ju,Haicheng Wang,Shengchao Hu,Shuai Xiao,Mengting Chen,Yuheng Jiao,Mingshuai Yao,Jinsong Lan,Qingwen Liu,Ying Chen

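Task: Upgrade data governance from sample sieving to finer-grained intra-sample governance, squeezing out informative tokens and improving image-text alignment.

Motivation: Existing data governance methods shrink datasets by pruning low-value samples, but the retained samples still contain many undesired tokens, leaving room for further compression and purification.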

Details

Method: Proposes the dual-branch DataJuicer: the vision branch retains salient image patches and extracts relevant object classes, and the text branch incorporates these classes to enhance captions. Result: Experiments show that DataJuicer significantly outperforms the existing DataSieve in image-text retrieval, classification, and dense visual reasoning. Conclusion: Through finer-grained governance, DataJuicer produces more refined datasets and significantly improves model performance. Abstract: Widely observed data scaling laws, in which error falls off as a power of the training size, demonstrate the diminishing returns of unselective data expansion. Hence, data governance is proposed to downsize datasets through pruning non-informative samples. Yet, isolating the impact of a specific sample on overall model performance is challenging, due to the vast computation required to try out all sample combinations. Current data governors circumvent this complexity by estimating sample contributions through heuristic-derived scalar scores, thereby discarding low-value ones. Despite thorough sample sieving, retained samples contain substantial undesired tokens intrinsically, underscoring the potential for further compression and purification. In this work, we upgrade data governance from a 'sieving' approach to a 'juicing' one. Instead of scanning for least-flawed samples, our dual-branch DataJuicer applies finer-grained intra-sample governance. It squeezes out informative tokens and boosts image-text alignments. Specifically, the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes to enhance captions. Consequently, DataJuicer yields more refined datasets through finer-grained governance. Extensive experiments across datasets demonstrate that DataJuicer significantly outperforms existing DataSieve in image-text retrieval, classification, and dense visual reasoning.

Shushing! Let's Imagine an Authentic Speech from the Silent Video

Jiaxin Ye,Hongming Shan

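Task: Generate authentic speech under visual guidance, relying only on facial appearance or lip motion and not on auditory signals.

Motivation: The task has significant potential for applications such as dubbing in filmmaking and assisting individuals with aphonia.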

Details

Method: Proposes ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech from visual input alone while operating in a discrete space. It comprises a discrete lip aligner, an error detector, and a style diffusion transformer. Result: Experiments show that ImaginTalk generates high-fidelity speech with more accurate semantic details and greater expressiveness in timbre and emotion than state-of-the-art baselines. Conclusion: ImaginTalk achieves strong cross-modal consistency and produces high-quality speech. Abstract: Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals, offering significant potential for applications such as dubbing in filmmaking and assisting individuals with aphonia. Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody from visual cues, prompting us to propose Consistent Video-to-Speech (CV2S) as an extended task to enhance cross-modal consistency. To tackle emerging challenges, we introduce ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech using only visual input, operating within a discrete space. Specifically, we propose a discrete lip aligner that predicts discrete speech tokens from lip videos to capture semantic information, while an error detector identifies misaligned tokens, which are subsequently refined through masked language modeling with BERT. To further enhance the expressiveness of the generated speech, we develop a style diffusion transformer equipped with a face-style adapter that adaptively customizes identity and prosody dynamics across both the channel and temporal dimensions while ensuring synchronization with lip-aware semantic features. Extensive experiments demonstrate that ImaginTalk can generate high-fidelity speech with more accurate semantic details and greater expressiveness in timbre and emotion compared to state-of-the-art baselines. Demos are shown at our project page: https://imagintalk.github.io.

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Sara Sarto,Marcella Cornia,Rita Cucchiara

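Task: Evaluate machine-generated image captions.

Motivation: With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics.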

Details

Method: The survey gives a comprehensive overview of advances in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. Result: The analysis reveals some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment. Conclusion: The survey underlines the limitations of existing evaluation metrics and outlines directions for future research. Abstract: The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.

FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

Chongjun Tu,Lin Zhang,Pengtao Chen,Peng Ye,Xianfang Zeng,Wei Cheng,Gang Yu,Tao Chen

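Task: Evaluate the fine-grained motion understanding of multimodal large language models (MLLMs) in video content understanding.

Motivation: Existing MLLMs are strong at video content understanding but still struggle with fine-grained motion comprehension. FAVOR-Bench is introduced to assess their motion understanding comprehensively.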

Details

Method: Introduces FAVOR-Bench, comprising 1,776 videos with structured manual annotations and 8,184 multiple-choice question-answer pairs, and develops two evaluation methods: a novel cost-efficient LLM-free method and a GPT-assisted caption assessment method. Result: Experiments show that 21 state-of-the-art MLLMs have significant limitations in comprehending and describing detailed temporal dynamics in video motions. Fine-tuning Qwen2.5-VL on the newly built FAVOR-Train dataset yields consistent improvements on motion-related tasks of TVBench, MotionBench, and FAVOR-Bench. Conclusion: FAVOR-Bench and FAVOR-Train give the community valuable tools for developing more powerful video understanding models. Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel cost-efficient LLM-free and a GPT-assisted caption assessment method, where the former can enhance benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. The results of finetuning Qwen2.5-VL on FAVOR-Train yield consistent improvements on motion-related tasks of TVBench, MotionBench and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: https://favor-bench.github.io/.

Unique Hard Attention: A Tale of Two Sides

Selim Jerad,Anej Svete,Jiaoda Li,Ryan Cotterell

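Task: Analyzing the expressive power of leftmost-hard attention transformers and its relationship to Linear Temporal Logic (LTL).

Motivation: Understanding the expressive power of transformers offers insights into their abilities and limitations, in particular the role that hard attention mechanisms play.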

Details

Method: Compares transformers with leftmost-hard and rightmost-hard attention and analyzes their equivalence to Linear Temporal Logic (LTL). Result: Transformers with only leftmost-hard attention correspond to a strictly weaker fragment of LTL and are equivalent to soft attention transformers. Conclusion: These findings refine the landscape of transformer expressivity and underscore the role of attention directionality. Abstract: Understanding the expressive power of transformers has recently attracted attention, as it offers insights into their abilities and limitations. Many studies analyze unique hard attention transformers, where attention selects a single position that maximizes the attention scores. When multiple positions achieve the maximum score, either the rightmost or the leftmost of those is chosen. In this paper, we highlight the importance of this seeming triviality. Recently, finite-precision transformers with both leftmost- and rightmost-hard attention were shown to be equivalent to Linear Temporal Logic (LTL). We show that this no longer holds with only leftmost-hard attention -- in that case, they correspond to a strictly weaker fragment of LTL. Furthermore, we show that models with leftmost-hard attention are equivalent to soft attention, suggesting they may better approximate real-world transformers than right-attention models. These findings refine the landscape of transformer expressivity and underscore the role of attention directionality.
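
As a minimal sketch of the tie-breaking rule this abstract describes, the snippet below selects the single position with the maximal attention score and resolves ties toward either the left or the right end. The function name `hard_attention` and the toy scores are illustrative assumptions, not taken from the paper.

```python
def hard_attention(scores, values, tie_break="leftmost"):
    """Return (index, value) of the single position with the maximal score.

    tie_break is a hypothetical parameter: "leftmost" or "rightmost".
    """
    best = max(scores)
    # All positions that achieve the maximum score.
    positions = [i for i, s in enumerate(scores) if s == best]
    index = positions[0] if tie_break == "leftmost" else positions[-1]
    return index, values[index]


# Toy example: positions 1 and 2 tie for the maximum score.
scores = [0.2, 0.9, 0.9, 0.1]
values = ["a", "b", "c", "d"]
left = hard_attention(scores, values, "leftmost")
right = hard_attention(scores, values, "rightmost")
```

When no ties occur the two variants agree; they differ only on inputs such as the one above, which is the case whose consequences the paper studies.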

Optimal Transport Adapter Tuning for Bridging Modality Gaps in Few-Shot Remote Sensing Scene Classification

Zhong Ji,Ci Liu,Jingren Liu,Chen Tang,Yanwei Pang,Xuelong Li

Task: Few-Shot Remote Sensing Scene Classification (FS-RSSC) with limited labeled samples.

Motivation: Existing methods typically emphasize single-modal feature learning, neglecting the potential benefits of optimizing multi-modal representations.

Details

Method: Propose a novel Optimal Transport Adapter Tuning (OTAT) framework to construct an ideal Platonic representational space through optimal transport (OT) theory, harmonizing visual and textual cues. Result: OTAT achieves state-of-the-art performance in FS-RSSC, significantly improving model performance and generalization. Conclusion: The OTAT framework offers a scalable and efficient solution for advancing multimodal learning in remote sensing applications. Abstract: Few-Shot Remote Sensing Scene Classification (FS-RSSC) presents the challenge of classifying remote sensing images with limited labeled samples. Existing methods typically emphasize single-modal feature learning, neglecting the potential benefits of optimizing multi-modal representations. To address this limitation, we propose a novel Optimal Transport Adapter Tuning (OTAT) framework aimed at constructing an ideal Platonic representational space through optimal transport (OT) theory. This framework seeks to harmonize rich visual information with less dense textual cues, enabling effective cross-modal information transfer and complementarity. Central to this approach is the Optimal Transport Adapter (OTA), which employs a cross-modal attention mechanism to enrich textual representations and facilitate subsequent better information interaction. By transforming the network optimization into an OT optimization problem, OTA establishes efficient pathways for balanced information exchange between modalities. Moreover, we introduce a sample-level Entropy-Aware Weighted (EAW) loss, which combines difficulty-weighted similarity scores with entropy-based regularization. This loss function provides finer control over the OT optimization process, enhancing its solvability and stability. Our framework offers a scalable and efficient solution for advancing multimodal learning in remote sensing applications. 
Extensive experiments on benchmark datasets demonstrate that OTAT achieves state-of-the-art performance in FS-RSSC, significantly improving the model performance and generalization.
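
The abstract does not say how the OT problem is solved. Purely as a generic illustration, entropy-regularized optimal transport is often solved with Sinkhorn iterations; the sketch below assumes that solver and a toy cost matrix between two "visual" and two "textual" tokens. All names and numbers are hypothetical, and this is not the OTAT implementation.

```python
import math


def sinkhorn(cost, a, b, epsilon=0.5, iterations=200):
    """Approximate an entropy-regularized transport plan.

    cost: n x m cost matrix, a: source marginals, b: target marginals.
    epsilon (regularization strength) and iterations are assumed defaults.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical costs: each visual token is cheap to match with "its" text token.
cost = [[0.0, 1.0], [1.0, 0.0]]
a = [0.5, 0.5]
b = [0.5, 0.5]
plan = sinkhorn(cost, a, b)
```

The resulting plan puts most of its mass on the low-cost pairs while still matching both marginals, which is the kind of balanced exchange between modalities the abstract refers to.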

RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving

Wenqi Jiang,Suvinay Subramanian,Cat Graves,Gustavo Alonso,Amir Yazdanbakhsh,Vidushi Dadu

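Task: Proposing a method for efficient Retrieval-Augmented Generation (RAG) serving.

Motivation: Efficient RAG serving remains a challenge because of the rapid emergence of RAG variants and the substantial differences in their workload characteristics.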

Details

Method: Introduces RAGSchema as a structured abstraction of RAG algorithms, analyzes representative RAG workloads, and proposes RAGO, a system optimization framework. Result: RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency. Conclusion: The RAGO framework significantly improves the efficiency of RAG serving and meets diverse performance requirements. Abstract: Retrieval-augmented generation (RAG), which combines large language models (LLMs) with retrievals from external knowledge databases, is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. In this paper, we make three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. Our evaluation shows that RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.

VisNumBench: Evaluating Number Sense of Multimodal Large Language Models

Tengjin Weng,Jingyi Wang,Wenhao Jiang,Zhong Ming

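Task: Evaluating the number sense of Multimodal Large Language Models (MLLMs) on visual numerical tasks.

Motivation: Investigating whether MLLMs can develop an intuitive number sense similar to that of humans.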

Details

Method: Introduces the Visual Number Benchmark (VisNumBench), which contains about 1,900 multiple-choice question-answer pairs covering seven visual numerical attributes and four types of visual numerical estimation tasks. Result: The 17 MLLMs tested perform significantly below human level on number sense tasks; multimodal mathematical models and multimodal chain-of-thought models show no significant improvement in number sense; stronger MLLMs with larger parameter sizes and broader general abilities show modest gains. Conclusion: VisNumBench is expected to be a valuable resource for the research community and to encourage further work on improving the number sense of MLLMs. Abstract: Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested, including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash, perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs' number sense abilities. All benchmark resources, including code and datasets, will be publicly available at https://wwwtttjjj.github.io/VisNumBench/.

Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations

Shuo Li,Jiajun Sun,Guodong Zheng,Xiaoran Fan,Yujiong Shen,Yi Lu,Zhiheng Xi,Yuming Yang,Wenming Tan,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang

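Task: Proposing Multi-Frequency Perturbations (MFP), a method for reducing object hallucinations of Multimodal Large Language Models (MLLMs) in visual-language tasks.

Motivation: Existing MLLMs perform well on visual-language tasks, but their responses are often compromised by object hallucinations. The authors identify the model's over-susceptibility to specific image frequency features when detecting objects as a key cause of these hallucinations.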

Details

Method: Introduces Multi-Frequency Perturbations (MFP), which uses both low-frequency and high-frequency image features to perturb visual feature representations and explicitly suppresses redundant frequency-domain features during inference, thereby reducing hallucinations. Result: Experiments show that the method significantly reduces object hallucinations across various model architectures. As a training-time method, MFP can also be combined with inference-time methods to reach state-of-the-art performance on the CHAIR benchmark. Conclusion: MFP is a simple, cost-effective, and pluggable method that effectively reduces object hallucinations in MLLMs and achieves strong results on the CHAIR benchmark. Abstract: Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.

UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

Qihui Zhang,Munan Ning,Zheyuan Liu,Yanbo Wang,Jiayi Ye,Yue Huang,Shuo Yang,Xiao Chen,Yibing Song,Li Yuan

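Task: Proposing an unsupervised peer review framework for evaluating Multimodal Large Language Models, addressing the heavy human workload and the biases of existing evaluation methods.

Motivation: Existing evaluation methods require substantial human effort to design Q&A pairs for visual images, which restricts the scale and scope of evaluations. Automated MLLM-as-judge approaches reduce that workload but introduce biases.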

Details

Method: Proposes an unsupervised peer review MLLM evaluation framework that uses only image data: models automatically generate questions and peer-review the answers of other models, and a vision-language scoring system is introduced to reduce bias. Result: UPME achieves a Pearson correlation with human evaluations of 0.944 on the MMstar dataset and 0.814 on the ScienceQA dataset, indicating that the framework closely aligns with human-designed benchmarks and inherent human preferences. Conclusion: The UPME framework reduces the human workload of evaluation, mitigates bias through its vision-language scoring system, and agrees closely with human evaluation results. Abstract: Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models. Existing evaluation methods face limitations due to the significant human workload required to design Q&A pairs for visual images, which inherently restricts the scale and scope of evaluations. Although automated MLLM-as-judge approaches attempt to reduce the human workload through automatic evaluations, they often introduce biases. To address these problems, we propose an Unsupervised Peer review MLLM Evaluation framework. It utilizes only image data, allowing models to automatically generate questions and conduct peer review assessments of answers from other models, effectively alleviating the reliance on human workload. Additionally, we introduce the vision-language scoring system to mitigate the bias issues, which focuses on three aspects: (i) response correctness; (ii) visual understanding and reasoning; and (iii) image-text correlation. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMstar dataset and 0.814 on the ScienceQA dataset, indicating that our framework closely aligns with human-designed benchmarks and inherent human preferences.
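
The agreement figures above are Pearson correlations between framework scores and human scores. As a small sketch of how such a coefficient is computed, the snippet below implements the standard formula; the score lists are made-up placeholders, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores from an automatic framework and from human raters.
framework_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
human_scores = [0.60, 0.75, 0.50, 0.82, 0.70]
r = pearson(framework_scores, human_scores)
```

A value close to 1 means the automatic ranking tracks the human one almost linearly; the function is undefined when either list is constant, because the denominator is then zero.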

Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU

Àlex Pujol Vidal,Sergio Escalera,Kamal Nasrollahi,Thomas B. Moeslund

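Task: Investigating machine unlearning in hyperbolic contrastive learning, specifically concept removal achieved by adapting Alignment Calibration to the MERU model.

Motivation: Existing work focuses on unlearning in Euclidean contrastive vision-language models, so the effectiveness of concept removal in hyperbolic space remains unexplored.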

Details

Method: Through systematic experiments and ablation studies, introduces hyperbolic-specific components, including entailment calibration and norm regularization, that exploit the unique properties of hyperbolic space. Result: Hyperbolic geometry shows distinct advantages for concept removal, particularly when scaling to multiple concepts, achieving near-perfect forgetting while keeping reasonable performance on retained concepts. Conclusion: Hyperbolic unlearning not only advances machine unlearning techniques but also offers insight into how geometric properties influence concept representation and removal in multimodal models. Abstract: Machine unlearning methods have become increasingly important for selective concept removal in large pre-trained models. While recent work has explored unlearning in Euclidean contrastive vision-language models, the effectiveness of concept removal in hyperbolic spaces remains unexplored. This paper investigates machine unlearning in hyperbolic contrastive learning by adapting Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Through systematic experiments and ablation studies, we demonstrate that hyperbolic geometry offers distinct advantages for concept removal, achieving near perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space. Comparative analysis with Euclidean models reveals fundamental differences in unlearning dynamics, with hyperbolic unlearning reorganizing the semantic hierarchy while Euclidean approaches merely disconnect cross-modal associations. These findings not only advance machine unlearning techniques but also provide insights into the geometric properties that influence concept representation and removal in multimodal models. Source code available at https://github.com/alex-pv01/HAC

3D Engine-ready Photorealistic Avatars via Dynamic Textures

Yifan Wang,Ivan Molodetskikh,Ondrej Texler,Dimitar Dinev

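Task: Proposing an end-to-end pipeline that builds explicitly represented photorealistic 3D avatars from standard 3D assets.

Motivation: Current digitization methods used in 3D production pipelines require costly capture setups, making them impractical for mass use by ordinary consumers.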

Details

Method: Uses dynamically generated textures to enhance realism and visually mask deficiencies in the underlying mesh geometry. Result: The approach integrates seamlessly with current graphics pipelines while achieving visual quality comparable to state-of-the-art 3D avatar generation methods. Conclusion: The method stays compatible with existing graphics pipelines while preserving high visual quality, which makes it broadly applicable. Abstract: As the digital and physical worlds become more intertwined, there has been a lot of interest in digital avatars that closely resemble their real-world counterparts. Current digitization methods used in 3D production pipelines require costly capture setups, making them impractical for mass usage among common consumers. Recent academic literature has found success in reconstructing humans from limited data using implicit representations (e.g., voxels used in NeRFs), which are able to produce impressive videos. However, these methods are incompatible with traditional rendering pipelines, making it difficult to use them in applications such as games. In this work, we propose an end-to-end pipeline that builds explicitly-represented photorealistic 3D avatars using standard 3D assets. Our key idea is the use of dynamically-generated textures to enhance the realism and visually mask deficiencies in the underlying mesh geometry. This allows for seamless integration with current graphics pipelines while achieving comparable visual quality to state-of-the-art 3D avatar generation methods.

A Review on Large Language Models for Visual Analytics

Navya Sonal Agarwal,Sanjay Kumar Sonbhadra

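Task: Reviewing the integration of Large Language Models (LLMs) with visual analytics, covering foundational concepts, capabilities, and wide-ranging applications.

Motivation: Examining the potential of LLMs in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations, and how their synergy with visual analytics enhances data interpretation, visualization techniques, and interactive exploration.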

Details

Method: Critically evaluates key tools and platforms (LIDA, Chat2VIS, Julius AI, and Zoho Analytics) and specialized multimodal models (ChartLlama and CharXIV), and systematically explores a taxonomy of LLM tasks ranging from natural language understanding and natural language generation to dialogue systems and text-to-media transformations. Result: Provides a SWOT analysis of integrating LLMs with visual analytics, highlighting strengths (accessibility and flexibility), weaknesses (computational demands and biases), opportunities (multimodal integration and user collaboration), and threats (privacy concerns and skill degradation). Conclusion: Emphasizes the importance of addressing ethical considerations and methodological improvements for effective integration. Abstract: This paper provides a comprehensive review of the integration of Large Language Models (LLMs) with visual analytics, addressing their foundational concepts, capabilities, and wide-ranging applications. It begins by outlining the theoretical underpinnings of visual analytics and the transformative potential of LLMs, specifically focusing on their roles in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations. The review further investigates how the synergy between LLMs and visual analytics enhances data interpretation, visualization techniques, and interactive exploration capabilities. Key tools and platforms including LIDA, Chat2VIS, Julius AI, and Zoho Analytics, along with specialized multimodal models such as ChartLlama and CharXIV, are critically evaluated. The paper discusses their functionalities, strengths, and limitations in supporting data exploration, visualization enhancement, automated reporting, and insight extraction. The taxonomy of LLM tasks, ranging from natural language understanding (NLU), natural language generation (NLG), to dialogue systems and text-to-media transformations, is systematically explored. This review provides a SWOT analysis of integrating Large Language Models (LLMs) with visual analytics, highlighting strengths like accessibility and flexibility, weaknesses such as computational demands and biases, opportunities in multimodal integration and user collaboration, and threats including privacy concerns and skill degradation. It emphasizes addressing ethical considerations and methodological improvements for effective integration.

MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Zihan Cao,Yu Zhong,Ziqi Wang,Liang-Jian Deng

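Task: Proposing a unified framework for multi-task, multi-degradation, and language-guided image fusion.

Motivation: Existing methods have several significant limitations: they require task- or dataset-specific models, neglect real-world image degradations, operate in pixel space at high computational cost, and lack user interaction capabilities.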

Details

Method: Proposes a unified framework consisting of a practical degradation pipeline and an all-in-one Diffusion Transformer (DiT) operating in latent space. Result: Experiments show that the approach effectively addresses the limitations above and outperforms previous restoration+fusion and all-in-one pipelines. Conclusion: The proposed framework performs well on multi-task, multi-degradation, and language-guided image fusion and resolves several limitations of existing methods. Abstract: Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (e.g., noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines. Codes are available at https://github.com/294coder/MMAIF.

When Pigs Get Sick: Multi-Agent AI for Swine Disease Detection

Tittaya Mairittha,Tanakon Sawanglok,Panuwit Raden,Sorrawit Treesuk

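Task: Developing an AI-based multi-agent diagnostic system for swine disease surveillance and clinical guidance.

Motivation: The sustainability of global agriculture is challenged by limited veterinary resources, delayed identification of cases, and inconsistent diagnostic accuracy.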

Details

Method: Uses Retrieval-Augmented Generation (RAG), automatically classifies user inputs as either knowledge retrieval queries or symptom-based diagnostic queries, collects relevant clinical signs through an adaptive questioning protocol, and integrates multiple diagnostic hypotheses with a confidence-weighted decision fusion mechanism. Result: The system shows high accuracy, rapid response times, and consistent reliability in query classification, disease diagnosis, and knowledge retrieval. Conclusion: The AI-driven diagnostic framework improves veterinary decision-making, advances sustainable livestock management practices, and contributes substantively to global food security. Abstract: Swine disease surveillance is critical to the sustainability of global agriculture, yet its effectiveness is frequently undermined by limited veterinary resources, delayed identification of cases, and variability in diagnostic accuracy. To overcome these barriers, we introduce a novel AI-powered, multi-agent diagnostic system that leverages Retrieval-Augmented Generation (RAG) to deliver timely, evidence-based disease detection and clinical guidance. By automatically classifying user inputs into either Knowledge Retrieval Queries or Symptom-Based Diagnostic Queries, the system ensures targeted information retrieval and facilitates precise diagnostic reasoning. An adaptive questioning protocol systematically collects relevant clinical signs, while a confidence-weighted decision fusion mechanism integrates multiple diagnostic hypotheses to generate robust disease predictions and treatment recommendations. Comprehensive evaluations encompassing query classification, disease diagnosis, and knowledge retrieval demonstrate that the system achieves high accuracy, rapid response times, and consistent reliability. By providing a scalable, AI-driven diagnostic framework, this approach enhances veterinary decision-making, advances sustainable livestock management practices, and contributes substantively to the realization of global food security.
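
The abstract names a confidence-weighted decision fusion mechanism without giving its formula. One plausible reading, sketched below as an assumption, is to sum each agent's confidence per candidate disease and normalize the totals; the function name, the agent outputs, and the disease labels are all hypothetical.

```python
from collections import defaultdict


def fuse_hypotheses(hypotheses):
    """Fuse (disease, confidence) pairs from several agents into a ranked list.

    Assumed scheme: sum confidences per disease, then normalize to sum to 1.
    """
    totals = defaultdict(float)
    for disease, confidence in hypotheses:
        totals[disease] += confidence
    total_weight = sum(totals.values())
    if total_weight == 0:
        raise ValueError("at least one hypothesis needs a positive confidence")
    ranked = [(disease, weight / total_weight) for disease, weight in totals.items()]
    # Highest fused confidence first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical outputs of three diagnostic agents.
hypotheses = [("PRRS", 0.6), ("swine influenza", 0.7), ("PRRS", 0.5)]
ranked = fuse_hypotheses(hypotheses)
```

In this toy case two moderately confident agents that agree outweigh a single more confident agent, which is the behavior a fusion step is usually meant to provide.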

Generating Multimodal Driving Scenes via Next-Scene Prediction

Yanhao Wu,Haoyang Zhang,Tianwei Lin,Lichao Huang,Shujie Luo,Rui Wu,Congpei Qiu,Wei Ke,Tong Zhang

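Task: Proposing a multimodal generation framework for generating controllable autonomous driving scenes.

Motivation: Existing generative methods capture only a limited range of modalities, which restricts their ability to generate controllable scenes and prevents comprehensive evaluation of autonomous driving systems.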

Details

Method: Introduces a multimodal generation framework covering four major data modalities. Scene sequences are generated with a two-stage approach comprising a Temporal AutoRegressive (TAR) component and an Ordered AutoRegressive (OAR) component, complemented by an Action-aware Map Alignment (AMA) module. Result: The framework effectively generates complex, realistic driving scenes over long sequences, ensures multimodal consistency, and offers fine-grained control over scene elements. Conclusion: The proposed framework generates controllable autonomous driving scenes and provides an effective tool for comprehensive evaluation of autonomous driving systems. Abstract: Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by only capturing a limited range of modalities, restricting the capability of generating controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including a novel addition of map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation based on the ego-action to maintain coherence between these modalities. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.

Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context

Junyi Ao,Dekun Chen,Xiaohai Tian,Wenjie Feng,Jun Zhang,Lu Lu,Yuxuan Wang,Haizhou Li,Zhizheng Wu

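Task: Proposing Solla, a new framework designed to understand spoken questions and the acoustic context at the same time.

Motivation: Existing models mainly analyze input signals using text instructions and overlook scenarios in which speech instructions and audio are mixed as model input.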

Details

Method: Solla combines an audio tagging module with an ASR-assisted prediction method to identify and represent audio events effectively and to improve comprehension of spoken content. Result: Experiments show that Solla performs on par with or better than baseline models on both the easy and hard test sets. Conclusion: Solla is effective at jointly understanding speech and audio. Abstract: Large Language Models (LLMs) have recently shown remarkable ability to process not only text but also multimodal inputs such as speech and audio. However, most existing models primarily focus on analyzing input signals using text instructions, overlooking scenarios in which speech instructions and audio are mixed and serve as inputs to the model. To address these challenges, we introduce Solla, a novel framework designed to understand speech-based questions and hear the acoustic context concurrently. Solla incorporates an audio tagging module to effectively identify and represent audio events, as well as an ASR-assisted prediction method to improve comprehension of spoken content. To rigorously evaluate Solla and other publicly available models, we propose a new benchmark dataset called SA-Eval, which includes three tasks: audio event classification, audio captioning, and audio question answering. SA-Eval has diverse speech instruction with various speaking styles, encompassing two difficulty levels, easy and hard, to capture the range of real-world acoustic conditions. Experimental results show that Solla performs on par with or outperforms baseline models on both the easy and hard test sets, underscoring its effectiveness in jointly understanding speech and audio.

ChatStitch: Visualizing Through Structures via Surround-View Unsupervised Deep Image Stitching with Collaborative LLM-Agents

Hao Liang,Zhipeng Dong,Yi Yang,Mengyin Fu

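Task: Introducing ChatStitch, a collaborative perception system that reveals obscured blind-spot information through natural language commands integrated with external digital assets.

Motivation: Existing collaborative perception systems are limited by inefficient user interaction and by the difficulty of multi-camera photorealistic visualization.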

Details

Method: ChatStitch employs a multi-agent collaborative framework based on Large Language Models and proposes SV-UDIS, a surround-view unsupervised deep image stitching method for the non-global-overlapping condition. Result: Extensive experiments were run on the UDIS-D and MCOV-SLAM open datasets and on a real-world dataset. On UDIS-D, SV-UDIS achieves state-of-the-art performance on 3-, 4-, and 5-image stitching tasks, with PSNR improvements of 9%, 17%, and 21% and SSIM improvements of 8%, 18%, and 26%, respectively. Conclusion: With natural language commands and a multi-agent collaborative framework, ChatStitch substantially improves collaborative perception, especially in revealing blind-spot information and in image stitching. Abstract: Collaborative perception has garnered significant attention for its ability to enhance the perception capabilities of individual vehicles through the exchange of information with surrounding vehicle-agents. However, existing collaborative perception systems are limited by inefficiencies in user interaction and the challenge of multi-camera photorealistic visualization. To address these challenges, this paper introduces ChatStitch, the first collaborative perception system capable of unveiling obscured blind spot information through natural language commands integrated with external digital assets. To adeptly handle complex or abstract commands, ChatStitch employs a multi-agent collaborative framework based on Large Language Models. For achieving the most intuitive perception for humans, ChatStitch proposes SV-UDIS, the first surround-view unsupervised deep image stitching method under the non-global-overlapping condition. We conducted extensive experiments on the UDIS-D, MCOV-SLAM open datasets, and our real-world dataset. Specifically, our SV-UDIS method achieves state-of-the-art performance on the UDIS-D dataset for 3, 4, and 5 image stitching tasks, with PSNR improvements of 9%, 17%, and 21%, and SSIM improvements of 8%, 18%, and 26%, respectively.
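
The stitching results above are reported as PSNR and SSIM gains. As a brief sketch of the first metric, the snippet below computes PSNR from the mean squared error between two small grayscale images, assuming 8-bit pixel values; the pixel data is invented for illustration.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images.

    Images are lists of rows; max_value=255 assumes 8-bit pixels.
    """
    flat_ref = [pixel for row in reference for pixel in row]
    flat_test = [pixel for row in test for pixel in row]
    mse = sum((r - t) ** 2 for r, t in zip(flat_ref, flat_test)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 reference patch and a stitched result that deviates slightly.
reference = [[10, 20], [30, 40]]
stitched = [[12, 18], [30, 44]]
value = psnr(reference, stitched)
```

Because PSNR is logarithmic in the error, a percentage improvement in PSNR corresponds to a much larger relative reduction in mean squared error.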

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Noam Razin,Zixuan Wang,Hubert Strauss,Stanley Wei,Jason D. Lee,Sanjeev Arora

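Task: Studying how the quality of reward models should be assessed in Reinforcement Learning from Human Feedback (RLHF).

Motivation: Reward model quality is primarily evaluated through accuracy, yet it is unclear whether accuracy fully captures what makes a reward model an effective teacher.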

Details

Method: Analyzes, from an optimization perspective, the relationship between a reward model's accuracy and its performance in RLHF, and validates the theory with experiments. Result: Even a highly accurate reward model leads to very slow optimization of the RLHF objective if it induces low reward variance. Experiments confirm the interplay between reward variance, accuracy, and reward maximization rate. Conclusion: Beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization. Abstract: The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
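
To make the distinction between accuracy and reward variance concrete, the toy sketch below compares two hypothetical reward models on four candidate outputs. Pairwise ranking accuracy and variance under a uniform policy are simplifying assumptions chosen for illustration; the paper's formal definitions may differ.

```python
from itertools import combinations


def pairwise_accuracy(rewards, true_quality):
    """Fraction of output pairs that the reward model ranks in the true order."""
    pairs = list(combinations(range(len(rewards)), 2))
    correct = sum(
        1
        for i, j in pairs
        if (rewards[i] - rewards[j]) * (true_quality[i] - true_quality[j]) > 0
    )
    return correct / len(pairs)


def reward_variance(rewards, probs):
    """Variance of the reward under the policy's output distribution."""
    mean = sum(p * r for p, r in zip(probs, rewards))
    return sum(p * (r - mean) ** 2 for p, r in zip(probs, rewards))


true_quality = [1.0, 2.0, 3.0, 4.0]       # hypothetical ground-truth quality
probs = [0.25, 0.25, 0.25, 0.25]          # hypothetical uniform policy
accurate_flat = [0.500, 0.501, 0.502, 0.503]  # perfect ranking, tiny spread
noisy_spread = [0.0, 2.0, 1.0, 3.0]           # one swapped pair, large spread

flat_stats = (pairwise_accuracy(accurate_flat, true_quality), reward_variance(accurate_flat, probs))
spread_stats = (pairwise_accuracy(noisy_spread, true_quality), reward_variance(noisy_spread, probs))
```

The first model ranks every pair correctly but barely separates the outputs, while the second misorders one pair yet has a far larger variance; this is the regime in which the abstract says the less accurate model can be the better teacher.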

USAM-Net: A U-Net-based Network for Improved Stereo Correspondence and Scene Depth Estimation using Features from a Pre-trained Image Segmentation network

Joseph Emmanuel DL Dayo,Prospero C. Naval Jr

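Task: Proposing USAM-Net, a new convolutional neural network that improves depth estimation performance.

Motivation: The growing demand for high-accuracy depth estimation in autonomous driving and augmented reality calls for advanced neural architectures that can effectively leverage multiple data modalities.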

Details

Method: USAM-Net uses a dual-pathway architecture that combines a pre-trained segmentation model (SAM) with a depth estimation model, enhancing depth estimation by concatenating semantic segmentation maps with the stereo images. Result: Experiments on the DrivingStereo dataset show that USAM-Net outperforms traditional models in Global Difference (GD) and End-Point Error (EPE). Conclusion: By integrating segmentation information, USAM-Net shows potential for applications that require high-precision depth data. Abstract: The increasing demand for high-accuracy depth estimation in autonomous driving and augmented reality applications necessitates advanced neural architectures capable of effectively leveraging multiple data modalities. In this context, we introduce the Unified Segmentation Attention Mechanism Network (USAM-Net), a novel convolutional neural network that integrates stereo image inputs with semantic segmentation maps and attention to enhance depth estimation performance. USAM-Net employs a dual-pathway architecture, which combines a pre-trained segmentation model (SAM) and a depth estimation model. The segmentation pathway preprocesses the stereo images to generate semantic masks, which are then concatenated with the stereo images as inputs to the depth estimation pathway. This integration allows the model to focus on important features such as object boundaries and surface textures which are crucial for accurate depth perception. Empirical evaluation on the DrivingStereo dataset demonstrates that USAM-Net achieves superior performance metrics, including a Global Difference (GD) of 3.61% and an End-Point Error (EPE) of 0.88, outperforming traditional models such as CFNet, SegStereo, and iResNet. These results underscore the effectiveness of integrating segmentation information into stereo depth estimation tasks, highlighting the potential of USAM-Net in applications demanding high-precision depth data.
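
The abstract reports an End-Point Error (EPE) without defining it. In stereo matching, EPE is commonly taken as the mean absolute difference between predicted and ground-truth disparities; the sketch below assumes that definition and uses invented disparity maps.

```python
def end_point_error(predicted, ground_truth):
    """Mean absolute disparity error over all pixels (assumed EPE definition)."""
    errors = [
        abs(p - g)
        for pred_row, gt_row in zip(predicted, ground_truth)
        for p, g in zip(pred_row, gt_row)
    ]
    return sum(errors) / len(errors)


# Hypothetical 2x2 disparity maps in pixels.
predicted = [[10.0, 12.0], [8.0, 9.5]]
ground_truth = [[10.5, 11.0], [8.0, 10.0]]
epe = end_point_error(predicted, ground_truth)
```

Real benchmarks typically mask out pixels without valid ground truth before averaging; that step is omitted here to keep the sketch short.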

TULIP: Towards Unified Language-Image Pretraining

Zineng Tang,Long Lian,Seun Eisape,XuDong Wang,Roei Herzig,Adam Yala,Alane Suhr,Trevor Darrell,David M. Chan

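Task: Proposing TULIP, a model that improves on existing CLIP-like models for vision-centric tasks.

Motivation: Existing image-text contrastive models such as CLIP and SigLIP perform poorly on tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition, while vision-focused models are limited on language tasks.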

Details

Method: TULIP learns fine-grained visual features while preserving global semantic alignment through generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization. Result: TULIP outperforms existing state-of-the-art models across multiple benchmarks, setting a new state-of-the-art zero-shot result on ImageNet-1K, delivering up to a 2x improvement over SigLIP on RxRx1 in linear probing, and achieving over 3x higher scores than SigLIP on MMVP for vision-language models. Conclusion: By combining generative data augmentation with contrastive learning, TULIP substantially improves performance on vision-centric tasks while retaining semantic alignment. Abstract: Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. These models, by performing language alignment, tend to prioritize high-level semantics over visual understanding, weakening their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a 2x enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over 3x higher scores than SigLIP on MMVP. Our code/checkpoints are available at https://tulip-berkeley.github.io

Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching

Yang Liu,Wentao Feng,Zhuoyao Liu,Shudong Huang,Jiancheng Lv

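Task: Address multi-view description matching and increase the information capacity of visual semantic models.

Motivation: Existing methods learn a set of embeddings to find the best match for each view's text and compute similarity, but the resulting visual and text embeddings have limited information capacity and are easily disturbed by locally similar negative samples.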

Details

Method: The authors propose Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE), which increases the information capacity of sparse text through dense text distillation. The method has a pre-training stage and a fine-tuning stage: pre-training aligns images with dense text to increase the information capacity of the visual semantic embeddings; fine-tuning optimizes two tasks simultaneously, distilling dense text embeddings into sparse text embeddings while aligning images with sparse text, which increases the information capacity of the sparse text embeddings. Result: Extensive evaluation on MS-COCO and Flickr30K shows that D2S-VSE outperforms recent state-of-the-art methods. Conclusion: By increasing the information capacity of sparse text embeddings, D2S-VSE effectively addresses multi-view description matching and demonstrates its superiority on large-scale datasets. Abstract: Enabling Visual Semantic Models to effectively handle multi-view description matching has been a longstanding challenge. Existing methods typically learn a set of embeddings to find the optimal match for each view's text and compute similarity. However, the visual and text embeddings learned through these approaches have limited information capacity and are prone to interference from locally similar negative samples. To address this issue, we argue that the information capacity of embeddings is crucial and propose Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE), which enhances the information capacity of sparse text by leveraging dense text distillation. Specifically, D2S-VSE is a two-stage framework. In the pre-training stage, we align images with dense text to enhance the information capacity of visual semantic embeddings. In the fine-tuning stage, we optimize two tasks simultaneously, distilling dense text embeddings to sparse text embeddings while aligning images and sparse texts, enhancing the information capacity of sparse text embeddings. Our proposed D2S-VSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.

Depth-Aware Range Image-Based Model for Point Cloud Segmentation

Bike Chen,Antti Tikanmäki,Juha Röning

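Task: Point cloud segmentation (PCS), which aims to separate a point cloud into different, meaningful groups.

Motivation: PCS matters in robotics because it lets robots understand their physical environment directly. However, when processing sparse, large-scale outdoor point clouds, range image-based models usually fail to make full use of the implicit but ordered depth information, which leads to inferior performance.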

Details

Method: The paper proposes the Depth-Aware Module (DAM) and Fast FMVNet V3. DAM perceives the ordered depth information in the range image by explicitly modeling the interdependence among channels. Fast FMVNet V3 incorporates DAM by integrating it into the last block of each architecture stage. Result: Extensive experiments on SemanticKITTI, nuScenes, and SemanticPOSS show that DAM brings a significant improvement to Fast FMVNet V3 at negligible computational cost. Conclusion: DAM and Fast FMVNet V3 effectively improve point cloud segmentation, particularly on large-scale outdoor point clouds. Abstract: Point cloud segmentation (PCS) aims to separate points into different and meaningful groups. The task plays an important role in robotics because PCS enables robots to understand their physical environments directly. To process sparse and large-scale outdoor point clouds in real time, range image-based models are commonly adopted. However, in a range image, the lack of explicit depth information inevitably causes some separate objects in 3D space to touch each other, bringing difficulty for the range image-based models in correctly segmenting the objects. Moreover, previous PCS models are usually derived from the existing color image-based models and are unable to make full use of the implicit but ordered depth information inherent in the range image, thereby achieving inferior performance. In this paper, we propose the Depth-Aware Module (DAM) and Fast FMVNet V3. DAM perceives the ordered depth information in the range image by explicitly modelling the interdependence among channels. Fast FMVNet V3 incorporates DAM by integrating it into the last block in each architecture stage. Extensive experiments conducted on SemanticKITTI, nuScenes, and SemanticPOSS demonstrate that DAM brings a significant improvement for Fast FMVNet V3 with negligible computational cost.

Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering

Thanh-Son Nguyen,Hong Yang,Tzeh Yuan Neoh,Hao Zhang,Ee Yeo Keat,Basura Fernando

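Task: Introduce a new video question-answering (VQA) dataset that challenges models to use procedural knowledge for complex reasoning.

Motivation: The task requires recognizing visual entities, generating hypotheses, and performing contextual, causal, and counterfactual reasoning.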

Details

Method: A neuro-symbolic reasoning module that integrates neural networks with LLM-driven constrained reasoning to generate interpretable answers. Result: Combining LLMs with structured knowledge reasoning improves procedural reasoning on the STAR benchmark and on the authors' dataset. Conclusion: Code and dataset are to be released at https://github.com/LUNAProject22/KML. Abstract: This paper introduces a new video question-answering (VQA) dataset that challenges models to leverage procedural knowledge for complex reasoning. It requires recognizing visual entities, generating hypotheses, and performing contextual, causal, and counterfactual reasoning. To address this, we propose a neuro-symbolic reasoning module that integrates neural networks and LLM-driven constrained reasoning over variables for interpretable answer generation. Results show that combining LLMs with structured knowledge reasoning with logic enhances procedural reasoning on the STAR benchmark and our dataset. Code and dataset will be available at https://github.com/LUNAProject22/KML soon.

Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning

Zixuan Zheng,Yilei Shi,Chunlei Li,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou

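Task: Study video object segmentation in an extremely low-data regime to reduce annotation costs in the medical domain.

Motivation: Existing methods still require abundant dense frame annotations for training, which are scarce in the medical domain.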

Details

Method: A two-phase framework: first learn a few-shot segmentation model from labeled images, then improve performance on medical videos through spatiotemporal consistency relearning. Result: Experiments show the approach outperforms state-of-the-art few-shot segmentation methods. Conclusion: The model achieves strong video segmentation performance in the low-data regime, bridging the gap between abundant annotated medical images and scarce, sparsely labeled medical videos. Abstract: Few-shot video object segmentation aims to reduce annotation costs; however, existing methods still require abundant dense frame annotations for training, which are scarce in the medical domain. We investigate an extremely low-data regime that utilizes annotations from only a few video frames and leverages existing labeled images to minimize costly video annotations. Specifically, we propose a two-phase framework. First, we learn a few-shot segmentation model using labeled images. Subsequently, to improve performance without full supervision, we introduce a spatiotemporal consistency relearning approach on medical videos that enforces consistency between consecutive frames. Constraints are also enforced between the image model and relearning model at both feature and prediction levels. Experiments demonstrate the superiority of our approach over state-of-the-art few-shot segmentation methods. Our model bridges the gap between abundant annotated medical images and scarce, sparsely labeled medical videos to achieve strong video segmentation performance in this low data regime. Code is available at https://github.com/MedAITech/RAB.
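The abstract says the relearning phase "enforces consistency between consecutive frames" without stating the loss. One simple way such a constraint could be written, purely as an assumption for illustration, is the mean squared difference between the predicted foreground probabilities of adjacent frames. The function name and toy data below are hypothetical, and a real system would likely account for motion between frames.

```python
def temporal_consistency_loss(frame_probs):
    """Mean squared difference between predictions on consecutive frames.

    frame_probs: list of frames, each a flat list of per-pixel
    foreground probabilities of equal length.
    """
    if len(frame_probs) < 2:
        return 0.0  # a single frame has nothing to be consistent with
    total, count = 0.0, 0
    for prev, curr in zip(frame_probs, frame_probs[1:]):
        for p, q in zip(prev, curr):
            total += (p - q) ** 2
            count += 1
    return total / count

# Toy predictions on three frames of three pixels each (hypothetical).
smooth = [[0.9, 0.1, 0.8], [0.9, 0.1, 0.8], [0.85, 0.15, 0.8]]
flicker = [[0.9, 0.1, 0.8], [0.1, 0.9, 0.2], [0.9, 0.1, 0.8]]

print(temporal_consistency_loss(smooth), temporal_consistency_loss(flicker))
```

A flickering prediction is penalized far more than a smooth one, which is the behavior a consistency term on unlabeled video frames is meant to encourage.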

Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition

Seungyeon Cho,Tae-Kyun Kim

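Task: Propose BHaRNet, a new framework for skeleton-based human action recognition that pays particular attention to subtle hand motions.

Motivation: Existing methods focus mainly on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions.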

Details

Method: BHaRNet augments a typical body-expert model with a hand-expert model, trains both jointly, and uses cross-attention for feature-level interaction and selective fusion of complementary information. Result: On several large-scale benchmarks, BHaRNet reaches state-of-the-art accuracy with fewer GFLOPs and parameters, improving from 86.4% to 93.0% on hand-intensive actions. Conclusion: By combining body and hand expert models, BHaRNet captures subtle hand motions effectively and substantially improves action recognition accuracy. Abstract: Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human-robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet, these models often blur fine hand details due to the disparity between body and hand action characteristics and the loss of subtle features during spatial pooling. In this paper, we propose BHaRNet (Body-Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves SOTA accuracies -- improving from 86.4\% to 93.0\% in hand-intensive actions -- while maintaining fewer GFLOPs and parameters than the relevant unified methods.

Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models

Tingxiu Chen,Yilei Shi,Zixuan Zheng,Bingcong Yan,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou

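Task: Address the scarcity of ultrasound video datasets by synthesizing ultrasound videos.

Motivation: Publicly available ultrasound video datasets are scarce, which hinders the development of effective video classification models.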

Details

Method: A latent dynamic diffusion model (LDDM) that efficiently translates static images into dynamic sequences with realistic video characteristics. Result: The method shows strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Training video classification models on a combination of real and LDDM-synthesized videos substantially outperforms training on real data alone. Conclusion: The image-to-video approach provides an effective data augmentation solution for advancing ultrasound video analysis. Abstract: Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we introduce a latent dynamic diffusion model (LDDM) to efficiently translate static images to dynamic sequences with realistic video characteristics. We demonstrate strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Notably, training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone, indicating our method successfully emulates dynamics critical for discrimination. Our image-to-video approach provides an effective data augmentation solution to advance ultrasound video analysis. Code is available at https://github.com/MedAITech/U_I2V.

Language-based Image Colorization: A Benchmark and Beyond

Yifan Li,Shuai Yang,Jiaying Liu

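Task: Provide a comprehensive review and benchmark of language-based image colorization methods.

Motivation: Because of color ambiguity, automatic colorization methods struggle to generate high-quality images and offer limited user control. Language-based colorization uses the efficiency and flexibility of text descriptions to guide colorization.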

Details

Method: The paper first summarizes existing automatic colorization methods, then focuses on language-based methods and divides them into two categories: those that train a cross-modality network from scratch and those that use a pre-trained cross-modality model to establish textual-visual correspondence. It also proposes a simple yet effective method based on a distilled diffusion model. Result: Experiments show that the simple baseline produces better results than previous complex methods and runs 14 times faster. Conclusion: This is the first comprehensive review and benchmark of language-based image colorization, and it offers meaningful insights for the field. Abstract: Image colorization aims to bring colors back to grayscale images. Automatic image colorization methods, which require no additional guidance, struggle to generate high-quality images due to color ambiguity, and provide limited user controllability. Thanks to the emergence of cross-modality datasets and models, language-based colorization methods are proposed to fully utilize the efficiency and flexibility of text descriptions to guide colorization. In view of the lack of a comprehensive review of language-based colorization literature, we conduct a thorough analysis and benchmarking. We first briefly summarize existing automatic colorization methods. Then, we focus on language-based methods and point out their core challenge on cross-modal alignment. We further divide these methods into two categories: one attempts to train a cross-modality network from scratch, while the other utilizes the pre-trained cross-modality model to establish the textual-visual correspondence. Based on the analyzed limitations of existing language-based methods, we propose a simple yet effective method based on a distilled diffusion model. Extensive experiments demonstrate that our simple baseline can produce better results than previous complex methods with a 14-fold speedup. To the best of our knowledge, this is the first comprehensive review and benchmark in the language-based image colorization field, providing meaningful insights for the community. The code is available at https://github.com/lyf1212/Color-Turbo.

Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening

Zihan Cao,Yu Zhong,Liang-Jian Deng

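Task: Propose a high-quality one-step pansharpening method based on the Optimal Transport Flow Matching (OTFM) framework.

Motivation: Diffusion models based on stochastic differential equations (SDEs) perform well on pansharpening, but their multi-step sampling imposes substantial computational overhead that limits practical use.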

Details

Method: The Optimal Transport Flow Matching (OTFM) framework integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Result: Experiments on multiple datasets show that OTFM matches or exceeds previous regression-based models and leading diffusion-based methods while needing only one sampling step. Conclusion: OTFM enables simulation-free training and single-step inference while adhering to pansharpening constraints, which markedly improves computational efficiency. Abstract: Pansharpening, a pivotal task in remote sensing for fusing high-resolution panchromatic and multispectral imagery, has garnered significant research interest. Recent advancements employing diffusion models based on stochastic differential equations (SDEs) have demonstrated state-of-the-art performance. However, the inherent multi-step sampling process of SDEs imposes substantial computational overhead, hindering practical deployment. While existing methods adopt efficient samplers, knowledge distillation, or retraining to reduce sampling steps (e.g., from 1,000 to fewer steps), such approaches often compromise fusion quality. In this work, we propose the Optimal Transport Flow Matching (OTFM) framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Unlike conventional OT formulations that enforce rigid distribution alignment, UOT relaxes marginal constraints to enhance modeling flexibility, accommodating the intrinsic spectral and spatial disparities in remote sensing data. Furthermore, we incorporate task-specific regularization into the UOT objective, enhancing the robustness of the flow model. The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints. Experimental evaluations across multiple datasets demonstrate that OTFM matches or exceeds the performance of previous regression-based models and leading diffusion-based methods while only needing one sampling step. Codes are available at https://github.com/294coder/PAN-OTFM.
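To make "simulation-free training and single-step inference" more concrete, the sketch below shows the generic straight-line flow matching setup: a point on the path is x_t = (1 - t) x0 + t x1, the regression target is the constant velocity x1 - x0, and sampling with one Euler step gives x0 + v(x0, 0). This is a minimal illustration of plain flow matching, not the OTFM objective; it omits the unbalanced optimal transport coupling and the task-specific regularization, and the function names and toy vectors are assumptions.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field on a straight path."""
    return [b - a for a, b in zip(x0, x1)]

def one_step_sample(x0, velocity_fn):
    """Single Euler step from t=0 to t=1 using the velocity at (x0, 0)."""
    v = velocity_fn(x0, 0.0)
    return [a + dv for a, dv in zip(x0, v)]

random.seed(0)
x0 = [random.gauss(0, 1) for _ in range(4)]  # noise sample
x1 = [0.2, 0.4, 0.6, 0.8]                    # toy "data" sample

# An oracle velocity field standing in for a trained network (hypothetical).
oracle = lambda x, t: target_velocity(x0, x1)
sample = one_step_sample(x0, oracle)
midpoint = interpolate(x0, x1, 0.5)
print(sample)
```

With an exact velocity field on a straight path, one Euler step lands on the data sample, which is why straighter transport paths make few-step or one-step sampling plausible.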

One-Shot Medical Video Object Segmentation via Temporal Contrastive Memory Networks

Yaxiong Chen,Junjian Hu,Chunlei Li,Zixuan Zheng,Jingliang Hu,Yilei Shi,Shengwu Xiong,Xiao Xiang Zhu,Lichao Mou

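Task: Introduce one-shot medical video object segmentation, which separates foreground and background pixels throughout a video given only the mask annotation of the first frame.

Motivation: Analysis and annotation of medical video data face challenges in data availability and labeling, which calls for a method that can learn and generalize from few annotations.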

Details

Method: A temporal contrastive memory network consisting of image and mask encoders that learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. Result: Experiments show state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, demonstrating the ability to generalize from scarce labels. Conclusion: The method has the potential to reduce the annotation burden for medical video analysis; the code is publicly available. Abstract: Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart distant ones to explicitly model inter-frame relationships and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi-source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, showing ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at https://github.com/MedAITech/TCMN.

Semi-KAN: KAN Provides an Effective Representation for Semi-Supervised Learning in Medical Image Segmentation

Zanting Ye,Xiaolong Niu,Xuanbin Wu,Wenxiang Yi,Yuan Chang,Lijun Lu

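Task: Propose Semi-KAN, a semi-supervised medical image segmentation method based on Kolmogorov-Arnold Networks (KANs).

Motivation: Existing semi-supervised medical image segmentation methods usually rely on single fixed activation functions and linear modeling patterns, which limits their ability to learn robust representations.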

Details

Method: Semi-KAN uses KANs to strengthen the representation learning of the backbone and integrates them at the encoder's bottleneck and the decoder's top layers of a U-Net pipeline to extract high-level semantic features. Result: Experiments on four public datasets show that Semi-KAN surpasses baseline networks while using fewer KAN layers and lower computational cost. Conclusion: KANs show promise for semi-supervised medical image segmentation and can effectively improve representation learning. Abstract: Deep learning-based medical image segmentation has shown remarkable success; however, it typically requires extensive pixel-level annotations, which are both expensive and time-intensive. Semi-supervised medical image segmentation (SSMIS) offers a viable alternative, driven by advancements in CNNs and ViTs. However, these networks often rely on single fixed activation functions and linear modeling patterns, limiting their ability to effectively learn robust representations. Given the limited availability of labeled data, achieving robust representation learning becomes crucial. Inspired by Kolmogorov-Arnold Networks (KANs), we propose Semi-KAN, which leverages the untapped potential of KANs to enhance backbone architectures for representation learning in SSMIS. Our findings indicate that: (1) compared to networks with fixed activation functions, KANs exhibit superior representation learning capabilities with fewer parameters, and (2) KANs excel in high-semantic feature spaces. Building on these insights, we integrate KANs into tokenized intermediate representations, applying them selectively at the encoder's bottleneck and the decoder's top layers within a U-Net pipeline to extract high-level semantic features. Although learnable activation functions improve feature expansion, they introduce significant computational overhead with only marginal performance gains. To mitigate this, we reduce the feature dimensions and employ horizontal scaling to capture multiple pattern representations. Furthermore, we design a multi-branch U-Net architecture with uncertainty estimation to effectively learn diverse pattern representations. Extensive experiments on four public datasets demonstrate that Semi-KAN surpasses baseline networks, utilizing fewer KAN layers and lower computational cost, thereby underscoring the potential of KANs as a promising approach for SSMIS.

Disentangling Modes and Interference in the Spectrogram of Multicomponent Signals

Kévin Polisano,Sylvain Meignen,Nils Laurent,Hubert Leterme

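Task: Study how to decompose the spectrogram of multicomponent signals into a mode part and an interference part.

Motivation: Improve time-frequency analysis in the presence of strong interference.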

Details

Method: Two approaches are explored: (i) a variational method inspired by texture-geometry decomposition in image processing, and (ii) a supervised learning approach using a U-Net architecture trained on a dataset covering diverse interference patterns and noise conditions. Result: Numerical experiments illustrate the advantages and limitations of both approaches for spectrogram decomposition. Conclusion: Both approaches have the potential to enhance time-frequency analysis under strong interference. Abstract: In this paper, we investigate how the spectrogram of multicomponent signals can be decomposed into a mode part and an interference part. We explore two approaches: (i) a variational method inspired by texture-geometry decomposition in image processing, and (ii) a supervised learning approach using a U-Net architecture, trained on a dataset encompassing diverse interference patterns and noise conditions. Once the interference component is identified, we explain how it enables us to define a criterion to locally adapt the window length used in the definition of the spectrogram, for the sake of improving ridge detection in the presence of close modes. Numerical experiments illustrate the advantages and limitations of both approaches for spectrogram decomposition, highlighting their potential for enhancing time-frequency analysis in the presence of strong interference.
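Why a spectrogram splits into a "mode part" and an "interference part" follows from elementary algebra: the short-time Fourier transform is linear, so for a two-component signal the squared magnitude expands as |V1 + V2|^2 = |V1|^2 + |V2|^2 + 2 Re(V1 conj(V2)). The sketch below checks this numerically at one time-frequency point. It is an illustration under assumptions: the paper's exact definitions of the two parts may differ, and the Hann window, frequencies, and helper name `stft_point` are chosen only for the example.

```python
import cmath
import math

def stft_point(signal, center, freq, window_len):
    """STFT value at sample `center` and normalized frequency `freq`
    (cycles per sample), using a Hann window of length window_len + 1."""
    half = window_len // 2
    acc = 0j
    for k in range(-half, half + 1):
        n = center + k
        if 0 <= n < len(signal):
            w = 0.5 * (1 + math.cos(math.pi * k / half))  # Hann weight
            acc += signal[n] * w * cmath.exp(-2j * math.pi * freq * k)
    return acc

N = 256
# Two close pure tones (hypothetical modes) and their sum.
mode1 = [cmath.exp(2j * math.pi * 0.20 * n) for n in range(N)]
mode2 = [cmath.exp(2j * math.pi * 0.22 * n) for n in range(N)]
mix = [a + b for a, b in zip(mode1, mode2)]

center, freq, win = 128, 0.21, 64
v1 = stft_point(mode1, center, freq, win)
v2 = stft_point(mode2, center, freq, win)
v = stft_point(mix, center, freq, win)

spectrogram = abs(v) ** 2
mode_part = abs(v1) ** 2 + abs(v2) ** 2
interference = 2 * (v1 * v2.conjugate()).real
print(spectrogram, mode_part, interference)
```

The cross term is what oscillates between close modes; shortening or lengthening the window changes how much the two windowed tones overlap in frequency, which is consistent with the abstract's idea of adapting the window length locally.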

TGV: Tabular Data-Guided Learning of Visual Cardiac Representations

Marta Hasny,Maxime Di Folco,Keno Bressem,Julia Schnabel

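Task: Use clinically relevant tabular data to identify distinct patient phenotypes and form more meaningful pairs in a contrastive learning framework.

Motivation: In medical imaging, the goal is often to compare entire patients with different phenotypes rather than multiple augmentations of a single scan.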

Details

Method: Tabular attributes guide the training of visual representations without requiring a joint embedding space. Result: On short-axis cardiac MR images and clinical attributes from the UK Biobank, tabular data helps distinguish patient subgroups more effectively. On downstream tasks, including fine-tuning and zero-shot prediction of cardiovascular diseases and cardiac phenotypes, tabular-guided visual representations are stronger than those from conventional methods that rely only on image augmentations or on combined image-tabular embeddings. Conclusion: Image encoders trained with tabular guidance embed demographic information in their representations, so they can use insights from tabular data for unimodal prediction at inference time, which suits real-world medical settings. Abstract: Contrastive learning methods in computer vision typically rely on different views of the same image to form pairs. However, in medical imaging, we often seek to compare entire patients with different phenotypes rather than just multiple augmentations of one scan. We propose harnessing clinically relevant tabular data to identify distinct patient phenotypes and form more meaningful pairs in a contrastive learning framework. Our method uses tabular attributes to guide the training of visual representations, without requiring a joint embedding space. We demonstrate its strength using short-axis cardiac MR images and clinical attributes from the UK Biobank, where tabular data helps to more effectively distinguish between patient subgroups. Evaluation on downstream tasks, including fine-tuning and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes, shows that incorporating tabular data yields stronger visual representations than conventional methods that rely solely on image augmentations or combined image-tabular embeddings. Furthermore, we demonstrate that image encoders trained with tabular guidance are capable of embedding demographic information in their representations, allowing them to use insights from tabular data for unimodal predictions, making them well-suited to real-world medical settings where extensive clinical annotations may not be routinely available at inference time. The code will be available on GitHub.
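The abstract does not say how tabular attributes are turned into contrastive pairs. One plausible reading, shown here strictly as an assumption, is to treat patients that are close in standardized tabular space as positives for each other. The function name, the choice of Euclidean distance, and the toy columns (age and ejection fraction) are hypothetical.

```python
import math

def tabular_positive_pairs(table, k=1):
    """For each patient, return the indices of the k nearest other
    patients in z-scored tabular space (candidate positive pairs)."""
    n, d = len(table), len(table[0])
    means = [sum(row[j] for row in table) / n for j in range(d)]
    # Fall back to 1.0 when a column is constant to avoid dividing by zero.
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in table) / n) or 1.0
        for j in range(d)
    ]
    z = [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in table]
    pairs = {}
    for i in range(n):
        dists = sorted((math.dist(z[i], z[j]), j) for j in range(n) if j != i)
        pairs[i] = [j for _, j in dists[:k]]
    return pairs

# Hypothetical columns: age (years), ejection fraction (%).
table = [[45, 60.0], [47, 62.0], [70, 35.0], [72, 33.0]]
pairs = tabular_positive_pairs(table)
print(pairs)
```

In a contrastive setup, the images of paired patients would then be pulled together instead of (or in addition to) two augmentations of the same scan.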

Low-Complexity Patch-based No-Reference Point Cloud Quality Metric exploiting Weighted Structure and Texture Features

Michael Neri,Federica Battisti

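Task: Propose PST-PCQA, a no-reference point cloud quality metric that assesses how distortions introduced during compression, transmission, and rendering affect quality.

Motivation: Compression, transmission, and rendering of point clouds introduce artifacts that affect the quality perceived by the end user, and evaluating the impact of these distortions on overall quality is challenging.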

Details

Method: PST-PCQA is a no-reference point cloud quality metric built on a low-complexity learning-based framework; it analyzes individual patches and integrates local and global features to predict the Mean Opinion Score. Result: Tests on three state-of-the-art datasets show that PST-PCQA has good prediction capability and generalizes across datasets. Conclusion: The lightweight structure of PST-PCQA suits real-time applications and devices with limited computational capacity, and evaluating quality patch by patch brings clear benefits. Abstract: During the compression, transmission, and rendering of point clouds, various artifacts are introduced, affecting the quality perceived by the end user. However, evaluating the impact of these distortions on the overall quality is a challenging task. This study introduces PST-PCQA, a no-reference point cloud quality metric based on a low-complexity, learning-based framework. It evaluates point cloud quality by analyzing individual patches, integrating local and global features to predict the Mean Opinion Score. In summary, the process involves extracting features from patches, combining them, and using correlation weights to predict the overall quality. This approach allows us to assess point cloud quality without relying on a reference point cloud, making it particularly useful in scenarios where reference data is unavailable. Experimental tests on three state-of-the-art datasets show good prediction capabilities of PST-PCQA, through the analysis of different feature pooling strategies and its ability to generalize across different datasets. The ablation study confirms the benefits of evaluating quality on a patch-by-patch basis. Additionally, PST-PCQA's light-weight structure, with a small number of parameters to learn, makes it well-suited for real-time applications and devices with limited computational capacity. For reproducibility purposes, we made code, model, and pretrained weights available at https://github.com/michaelneri/PST-PCQA.

Semantic Segmentation of Transparent and Opaque Drinking Glasses with the Help of Zero-shot Learning

Annalena Blänsdorf,Tristan Wirth,Arne Rak,Thomas Pöllabauer,Volker Knauthe,Arjan Kuijper

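Task: Propose TransCaGNet, a model for segmenting transparent structures in images.

Motivation: Transparent structures, such as everyday drinking glasses, are hard to distinguish from the background in images.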

Details

Method: The CaGNet model is modified to use the Trans4Trans architecture for transparent object segmentation, and zero-shot learning handles glass categories not seen during training. A new synthetic dataset and a real-world evaluation dataset are also introduced. Result: TransCaGNet improves mean IoU and mean accuracy by up to 13.68% and 17.88% on the synthetic dataset, and by up to 5.55% and 5.72% on the real-world dataset. Conclusion: TransCaGNet performs well at segmenting transparent structures, and training on the synthetic dataset also improves results on the real-world dataset. Abstract: Segmenting transparent structures in images is challenging since they are difficult to distinguish from the background. Common examples are drinking glasses, which are a ubiquitous part of our lives and appear in many different shapes and sizes. In this work, we propose TransCaGNet, a modified version of the zero-shot model CaGNet. We exchange the segmentation backbone with the architecture of Trans4Trans to be capable of segmenting transparent objects. Since some glasses are rarely captured, we use zero-shot learning to be able to create semantic segmentations of glass categories not given during training. We propose a novel synthetic dataset covering a diverse set of different environmental conditions. Additionally, we capture a real-world evaluation dataset since most applications take place in the real world. Comparing our model with ZegCLIP, we are able to show that TransCaGNet produces better mean IoU and accuracy values while ZegCLIP outperforms it mostly for unseen classes. To improve the segmentation results, we combine the semantic segmentation of the models with the segmentation results of SAM 2. Our evaluation emphasizes that distinguishing between different classes is challenging for the models due to similarity, points of view, or coverings. Taking this behavior into account, we assign glasses multiple possible categories. The modification leads to an improvement up to 13.68% for the mean IoU and up to 17.88% for the mean accuracy values on the synthetic dataset. Using our difficult synthetic dataset for training, the models produce even better results on the real-world dataset. The mean IoU is improved up to 5.55% and the mean accuracy up to 5.72% on the real-world dataset.
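The results above are reported as mean IoU and mean accuracy. For readers less familiar with these metrics, the sketch below computes both from flat per-pixel label lists: per-class IoU is TP / (TP + FP + FN), per-class accuracy is TP / (TP + FN), and each is averaged over classes. Skipping classes that are absent from the ground truth is one common convention and an assumption here; the paper's exact evaluation protocol is not stated in the abstract.

```python
def mean_iou_and_accuracy(pred, target, num_classes):
    """Return (mean IoU, mean per-class accuracy) for flat label lists."""
    ious, accs = [], []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        if tp + fn == 0:
            continue  # class not present in the ground truth: skip it
        ious.append(tp / (tp + fp + fn))
        accs.append(tp / (tp + fn))
    return sum(ious) / len(ious), sum(accs) / len(accs)

# Toy example: six pixels, three classes (e.g. background and two glass types).
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
miou, macc = mean_iou_and_accuracy(pred, target, 3)
print(miou, macc)
```

Because both metrics average over classes rather than pixels, rare glass categories weigh as much as common ones, which matters when some classes are unseen during training.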

Universal Scene Graph Generation

Shengqiong Wu,Hao Fei,Tat-Seng Chua

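Task: Propose the Universal Scene Graph (USG), a representation that can fully characterize semantic scenes from any combination of modality inputs.

Motivation: Current scene graph research is largely confined to single-modality scene modeling and cannot exploit the complementary strengths of scene graph representations from different modalities when describing holistic scene semantics.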

Details

Method: The USG parser (USG-Par) uses a modular architecture for end-to-end USG generation, with an object associator for cross-modal object alignment and a text-centric scene contrasting learning mechanism that mitigates domain imbalance. Result: Experiments show that USG expresses scene semantics better than standalone scene graphs and that USG-Par achieves higher efficacy and performance. Conclusion: USG and USG-Par effectively address cross-modal object alignment and domain imbalance and provide a more comprehensive representation of scene semantics. Abstract: Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained intensive research in SG generation. In the real world, multiple modalities often coexist, with different types, such as images, text, video, and 3D data, expressing distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing the full utilization of the complementary strengths of different modality SG representations in depicting holistic scene semantics. To this end, we introduce Universal SG (USG), a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks of cross-modal object alignment and out-of-domain challenges. We design the USG-Par with modular architecture for end-to-end USG generation, in which we devise an object associator to relieve the modality gap for cross-modal object alignment. Further, we propose a text-centric scene contrasting learning mechanism to mitigate domain imbalances by aligning multimodal objects and relations with textual SGs. Through extensive experiments, we demonstrate that USG offers a stronger capability for expressing scene semantics than standalone SGs, and also that our USG-Par achieves higher efficacy and performance.

Manifold Learning for Hyperspectral Images

Fethi Harkat,Tiphaine Deuberet,Guillaume Gey,Valérie Perrier,Kévin Polisano

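Task: Propose a method that approximates the dataset topology by constructing adjacency graphs, in order to improve the representation of X-ray transmission multi-energy images.

Motivation: Traditional feature extraction and projection techniques such as Principal Component Analysis represent X-ray transmission multi-energy images poorly, which limits the performance of neural networks in decision-making.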

Details

Method: Adjacency graphs are constructed with Uniform Manifold Approximation and Projection (UMAP) to capture nonlinear correlations in the data. Result: The method significantly improves the performance of machine learning algorithms on hyperspectral images from X-ray transmission spectroscopy and enhances feature separability. Conclusion: The method preserves the global structure of the data and improves classification accuracy and robustness. Abstract: Traditional feature extraction and projection techniques, such as Principal Component Analysis, struggle to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, limiting the performance of neural networks in decision-making processes. To address this issue, we propose a method that approximates the dataset topology by constructing adjacency graphs using the Uniform Manifold Approximation and Projection. This approach captures nonlinear correlations within the data, significantly improving the performance of machine learning algorithms, particularly in processing Hyperspectral Images (HSI) from X-ray transmission spectroscopy. This technique not only preserves the global structure of the data but also enhances feature separability, leading to more accurate and robust classification results.
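The first ingredient of a UMAP-style adjacency graph is a k-nearest-neighbor graph over the data points. The sketch below builds an unweighted, symmetrized k-NN adjacency matrix in plain Python. It is a simplification made for illustration: UMAP itself assigns fuzzy edge weights and then optimizes a low-dimensional layout, neither of which is shown, and the toy "spectra" and the function name are assumptions.

```python
import math

def knn_adjacency(points, k):
    """Symmetric 0/1 adjacency matrix linking each point to its k nearest
    neighbors under Euclidean distance."""
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        neighbors = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for _, j in neighbors:
            adj[i][j] = adj[j][i] = 1  # symmetrize the edge
    return adj

# Toy per-pixel spectra in two energy bins (hypothetical): two materials.
spectra = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
adj = knn_adjacency(spectra, k=1)
for row in adj:
    print(row)
```

Pixels from the same material end up connected while the two groups stay disconnected, which is the kind of local neighborhood structure a linear projection such as PCA does not model explicitly.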

Exploiting Diffusion Prior for Real-World Image Dehazing with Unpaired Training

Yunwei Lan,Zhigao Cui,Chang Liu,Jialun Peng,Nian Wang,Xin Luo,Dong Liu

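Task: Real-world image dehazing using diffusion priors and physical priors.

Motivation: Existing unpaired training methods generalize poorly to real scenes, mainly because of limited feature representation and insufficient use of real-world priors.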

Details

Method: Diff-Dehazer is an unpaired framework that uses diffusion priors as bijective mapping learners within CycleGAN and integrates physical priors to exploit real-world knowledge. Result: Extensive experiments on multiple real-world datasets demonstrate the superior performance of the method. Conclusion: By exploiting diffusion priors and physical priors, Diff-Dehazer substantially improves real-world image dehazing. Abstract: Unpaired training has been verified as one of the most effective paradigms for real scene dehazing by learning from unpaired real-world hazy and clear images. Although numerous studies have been proposed, current methods demonstrate limited generalization for various real scenes due to limited feature representation and insufficient use of real-world prior. Inspired by the strong generative capabilities of diffusion models in producing both hazy and clear images, we exploit diffusion prior for real-world image dehazing, and propose an unpaired framework named Diff-Dehazer. Specifically, we leverage diffusion prior as bijective mapping learners within the CycleGAN, a classic unpaired learning framework. Considering that physical priors contain pivotal statistical information of real-world data, we further excavate real-world knowledge by integrating physical priors into our framework. Furthermore, we introduce a new perspective for adequately leveraging the representation ability of diffusion models by removing degradation in image and text modalities, so as to improve the dehazing effect. Extensive experiments on multiple real-world datasets demonstrate the superior performance of our method. Our code is available at https://github.com/ywxjm/Diff-Dehazer.

Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Shengqiong Wu,Hao Fei,Jingkang Yang,Xiangtai Li,Juncheng Li,Hanwang Zhang,Tat-seng Chua

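Task: Propose a new framework for 4D Panoptic Scene Graph (4D-PSG) generation that uses rich 2D visual scene annotations to enhance 4D scene learning.

Motivation: Current 4D-PSG research suffers from data scarcity and out-of-vocabulary problems, and the pipeline nature of the benchmark generation method leads to suboptimal performance.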

Details

Method: A 4D Large Language Model (4D-LLM) is integrated with a 3D mask decoder, a chained scene graph inference mechanism is designed, and a 2D-to-4D visual scene transfer learning framework is proposed. Result: Extensive experiments on benchmark data show that the method significantly outperforms baseline models. Conclusion: The method effectively addresses data scarcity in 4D-PSG generation and substantially improves performance. Abstract: The recently introduced 4D Panoptic Scene Graph (4D-PSG) provides an advanced representation for comprehensively modeling the dynamic 4D visual real world. Unfortunately, current pioneering 4D-PSG research suffers severely from data scarcity, as well as the resulting out-of-vocabulary problems; also, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to infer accurate and comprehensive object and relation labels iteratively. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that we outperform baseline models by a large margin, highlighting the effectiveness of our method.

Saad Lahlali,Sandra Kara,Hejer Ammar,Florian Chabot,Nicolas Granger,Hervé Le Borgne,Quoc-Cuong Pham

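Task: Perform multi-object discovery in 3D data using 2D motion cues.

Motivation: Object discovery has received wide attention in 2D image analysis but remains under-explored in 3D data, where existing approaches rely on 3D motion and face several challenges.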

Details

Method: Proposes DIOD-3D and the xMOD framework, which use 2D motion cues for 3D object discovery and adopt a cross-modal training framework that integrates 2D and 3D data. Result: Extensive evaluation on a synthetic dataset (TRIP-PD) and real-world datasets (KITTI and Waymo) shows substantial improvements, with F1@50 gains ranging from +8.7 to +15.1. Conclusion: The proposed approach performs strongly on 3D object discovery and significantly outperforms existing 2D object discovery methods. Abstract: Object discovery, which refers to the task of localizing objects without human annotations, has gained significant attention in 2D image analysis. However, despite this growing interest, it remains under-explored in 3D data, where approaches rely exclusively on 3D motion, despite its several challenges. In this paper, we present a novel framework that leverages advances in 2D object discovery which are based on 2D motion to exploit the advantages of such motion cues being more flexible and generalizable and to bridge the gap between 2D and 3D modalities. Our primary contributions are twofold: (i) we introduce DIOD-3D, the first baseline for multi-object discovery in 3D data using 2D motion, incorporating scene completion as an auxiliary task to enable dense object localization from sparse input data; (ii) we develop xMOD, a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues. xMOD employs a teacher-student training paradigm across the two modalities to mitigate confirmation bias by leveraging the domain gap. During inference, the model supports both RGB-only and point cloud-only inputs. Additionally, we propose a late-fusion technique tailored to our pipeline that further enhances performance when both modalities are available at inference. We evaluate our approach extensively on synthetic (TRIP-PD) and challenging real-world datasets (KITTI and Waymo). Notably, our approach yields a substantial performance improvement compared with the 2D object discovery state-of-the-art on all datasets with gains ranging from +8.7 to +15.1 in F1@50 score. The code is available at https://github.com/CEA-LIST/xMOD
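
For readers unfamiliar with the F1@50 metric quoted above, the following is a minimal sketch of how such a score can be computed. It assumes axis-aligned 2D boxes and greedy one-to-one matching at an IoU threshold of 0.5; the paper may well score masks rather than boxes, and the function names `iou` and `f1_at_50` are illustrative, not taken from the xMOD code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_50(preds, gts, thr=0.5):
    """F1 score where a prediction counts as a hit if IoU >= thr (assumed rule)."""
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for p in preds:
        best_j, best_iou = -1, thr
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # predictions with no matching object
    fn = len(gts) - tp    # objects that were never discovered
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Greedy matching is the simplest choice here; an optimal assignment (for example Hungarian matching) could give slightly different counts on crowded scenes.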

Bridging the Gap: Fusing CNNs and Transformers to Decode the Elegance of Handwritten Arabic Script

Chaouki Boufenar,Mehdi Ayoub Rabiai,Boualem Nadjib Zahaf,Khelil Rafik Ouaras

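Task: Propose a hybrid approach combining convolutional neural networks (CNNs) and Transformer architectures for handwritten Arabic script recognition.

Motivation: Handwritten Arabic script recognition is challenging because of the script's dynamic letter forms and contextual variations.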

Details

Method: Evaluates custom and fine-tuned models, including EfficientNet-B7 and Vision Transformer (ViT-B16), and introduces an ensemble model based on confidence-based fusion. Result: On the IFN/ENIT dataset, the ensemble reaches 96.38% accuracy for letter classification and 97.22% for positional classification. Conclusion: Combining CNNs and Transformers shows their complementarity for Arabic handwriting recognition and offers a scalable solution for real-world applications. Abstract: Handwritten Arabic script recognition is a challenging task due to the script's dynamic letter forms and contextual variations. This paper proposes a hybrid approach combining convolutional neural networks (CNNs) and Transformer-based architectures to address these complexities. We evaluated custom and fine-tuned models, including EfficientNet-B7 and Vision Transformer (ViT-B16), and introduced an ensemble model that leverages confidence-based fusion to integrate their strengths. Our ensemble achieves remarkable performance on the IFN/ENIT dataset, with 96.38% accuracy for letter classification and 97.22% for positional classification. The results highlight the complementary nature of CNNs and Transformers, demonstrating their combined potential for robust Arabic handwriting recognition. This work advances OCR systems, offering a scalable solution for real-world applications.
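
The abstract does not spell out the fusion rule, so the sketch below shows only one plausible reading of "confidence-based fusion": for each sample, trust whichever model is more confident in its top prediction. The function name `fuse_by_confidence` and the tie-breaking rule are assumptions made for illustration.

```python
def fuse_by_confidence(probs_a, probs_b):
    """Return (label, confidence) from the model with the higher top probability.

    probs_a and probs_b are per-class probability lists from two classifiers
    (for instance a CNN and a Transformer) over the same label set.
    """
    top_a = max(range(len(probs_a)), key=probs_a.__getitem__)
    top_b = max(range(len(probs_b)), key=probs_b.__getitem__)
    # Ties go to the first model; this is an arbitrary illustrative choice.
    if probs_a[top_a] >= probs_b[top_b]:
        return top_a, probs_a[top_a]
    return top_b, probs_b[top_b]
```

Other fusion schemes, such as averaging or confidence-weighted summing of the two distributions, fit the same description and may be closer to what the authors used.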

Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

Jin Wang,Chenghui Lv,Xian Li,Shichao Dong,Huadong Li,kelu Yao,Chao Li,Wenqi Shao,Ping Luo

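Task: Design a comprehensive benchmark for assessing the discerning capabilities of Large Vision Language Models (LVLMs) in forged media detection.

Motivation: With the rapid development of AIGC, the diversity of fake media has grown markedly, posing unprecedented threats to social security, politics, law, and other areas. Although recent studies propose using LVLMs to build robust forgery detectors, a comprehensive benchmark for evaluating LVLMs on forged media detection is still missing.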

Details

Method: Presents Forensics-Bench, a forgery detection evaluation benchmark suite of 63,292 meticulously curated multiple-choice visual questions covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models. Result: A thorough evaluation of 22 open-source LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet) highlights the significant challenges that Forensics-Bench poses for comprehensive forgery detection. Conclusion: Forensics-Bench is expected to motivate the community to advance the frontier of LVLMs toward all-around forgery detectors in the era of AIGC. Abstract: Recently, the rapid development of AIGC has significantly boosted the diversity of fake media spread on the Internet, posing unprecedented threats to social security, politics, law, etc. To detect the increasingly diverse malicious fake media in the new era of AIGC, recent studies have proposed to exploit Large Vision Language Models (LVLMs) to design robust forgery detectors due to their impressive performance on a wide range of multimodal tasks. However, a benchmark designed to comprehensively assess LVLMs' discerning capabilities on forgery media is still lacking. To fill this gap, we present Forensics-Bench, a new forgery detection evaluation benchmark suite to assess LVLMs across massive forgery detection tasks, requiring comprehensive recognition, location and reasoning capabilities on diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types and forgery models. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC. The deliverables will be updated at https://Forensics-Bench.github.io/.

Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation

Suhyeon Lee,Kwanyoung Kim,Jong Chul Ye

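Task: Propose a new framework, Implicit Bridge Consistency Distillation (IBCD), for single-step bidirectional unpaired image translation.

Motivation: Methods based on diffusion models or Schrödinger bridges have yet to be widely adopted in real-world applications because of their iterative sampling; IBCD is proposed to address this.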

Details

Method: IBCD uses a diffusion implicit bridge model to connect PF-ODE trajectories between distributions and introduces two key improvements: 1) distribution matching for consistency distillation and 2) an adaptive weighting method based on distillation difficulty. Result: Experiments show that IBCD achieves state-of-the-art performance on benchmark datasets with a single generation step. Conclusion: IBCD performs strongly in single-step generation and has broad application potential. Abstract: Unpaired image-to-image translation has seen significant progress since the introduction of CycleGAN. However, methods based on diffusion models or Schrödinger bridges have yet to be widely adopted in real-world applications due to their iterative sampling nature. To address this challenge, we propose a novel framework, Implicit Bridge Consistency Distillation (IBCD), which enables single-step bidirectional unpaired translation without using adversarial loss. IBCD extends consistency distillation by using a diffusion implicit bridge model that connects PF-ODE trajectories between distributions. Additionally, we introduce two key improvements: 1) distribution matching for consistency distillation and 2) adaptive weighting method based on distillation difficulty. Experimental results demonstrate that IBCD achieves state-of-the-art performance on benchmark datasets in a single generation step. Project page available at https://hyn2028.github.io/project_page/IBCD/index.html

Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis

Imanol G. Estepa,Jesús M. Rodríguez-de-Vera,Ignacio Sarasúa,Bhalaji Nagarajan,Petia Radeva

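Task: Propose Sorcen, a new unified self-supervised learning framework with a combined contrastive-reconstruction objective that bridges representation learning and generative modeling.

Motivation: Existing unified self-supervised learning methods rely on semantic token reconstruction, which requires an external tokenizer and adds significant overhead during training.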

Details

Method: Sorcen introduces a synergic contrastive-reconstruction objective, uses its own generative capability to produce echo samples that form contrastive positive pairs, and operates only on precomputed tokens to reduce computational overhead. Result: Experiments on ImageNet-1k show that Sorcen outperforms the previous state of the art by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Conclusion: Sorcen delivers significant improvements for unified self-supervised learning models, surpassing the previous single-crop MIM state of the art in linear probing and reaching state-of-the-art unconditional image generation. Abstract: While representation learning and generative modeling seek to understand visual data, unifying both domains remains unexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they rely solely on semantic token reconstruction, which requires an external tokenizer during training and introduces a significant overhead. In this work, we introduce Sorcen, a novel unified SSL framework, incorporating a synergic Contrastive-Reconstruction objective. Our Contrastive objective, "Echo Contrast", leverages the generative capabilities of Sorcen, eliminating the need for additional image crops or augmentations during training. Sorcen "generates" an echo sample in the semantic token space, forming the contrastive positive pair. Sorcen operates exclusively on precomputed tokens, eliminating the need for an online token transformation during training, thereby significantly reducing computational overhead. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Additionally, Sorcen surpasses previous single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.

MultiBARF: Integrating Imagery of Different Wavelength Regions by Using Neural Radiance Fields

Kana Kurata,Hitoshi Niigaki,Xiaojun Wu,Ryuichi Tanida

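Task: Develop MultiBARF, a method that simplifies data preparation for images from different sensors.

Motivation: Make data preparation easier for users unfamiliar with sensing and image processing, reducing the reliance on specialist expertise.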

Details

Method: Replaces co-registration and geometric calibration by synthesizing pairs of two different sensor images and depth images, extending Bundle Adjusting Neural Radiance Fields (BARF). Result: Experiments show that the method superimposes two color channels, from visible-light and thermographic images, on a NeRF. Conclusion: MultiBARF effectively simplifies data preparation and applies to combinations of different sensor images. Abstract: Optical sensor applications have become popular through digital transformation. Linking observed data to real-world locations and combining different image sensors is essential to make the applications practical and efficient. However, data preparation to try different sensor combinations requires high sensing and image processing expertise. To make data preparation easier for users unfamiliar with sensing and image processing, we have developed MultiBARF. This method replaces the co-registration and geometric calibration by synthesizing pairs of two different sensor images and depth images at assigned viewpoints. Our method extends Bundle Adjusting Neural Radiance Fields (BARF), a deep neural network-based novel view synthesis method, for the two imagers. Through experiments on visible light and thermographic images, we demonstrate that our method superimposes two color channels of those sensor images on NeRF.

An Investigation of Beam Density on LiDAR Object Detection Performance

Christoph Griesbacher,Christian Fruhwirth-Reisinger

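Task: Study how beam density affects the performance of 3D object detection models across different sensor settings.

Motivation: Accurate 3D object detection is critical for autonomous driving, but differences between training and inference data can cause significant performance drops, especially when inference runs on sparse, cost-effective LiDAR sensors.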

Details

Method: Investigates the beam-density-induced domain gap by evaluating different object detection architectures, including ones that combine voxel- and point-based approaches. Result: Combining voxel- and point-based approaches yields superior cross-domain performance, and detectors trained on denser data are robust to beam density variations. Conclusion: Beam-density-induced domain gaps must be evaluated together with other domain shifts; detectors trained on denser data are robust to changes in beam density at inference. Abstract: Accurate 3D object detection is a critical component of autonomous driving, enabling vehicles to perceive their surroundings with precision and make informed decisions. LiDAR sensors, widely used for their ability to provide detailed 3D measurements, are key to achieving this capability. However, variations between training and inference data can cause significant performance drops when object detection models are employed in different sensor settings. One critical factor is beam density, as inference on sparse, cost-effective LiDAR sensors is often preferred in real-world applications. Despite previous work addressing the beam-density-induced domain gap, substantial knowledge gaps remain, particularly concerning dense 128-beam sensors in cross-domain scenarios. To gain better understanding of the impact of beam density on domain gaps, we conduct a comprehensive investigation that includes an evaluation of different object detection architectures. Our architecture evaluation reveals that combining voxel- and point-based approaches yields superior cross-domain performance by leveraging the strengths of both representations. Building on these findings, we analyze beam-density-induced domain gaps and argue that these domain gaps must be evaluated in conjunction with other domain shifts. Contrary to conventional beliefs, our experiments reveal that detectors benefit from training on denser data and exhibit robustness to beam density variations during inference.
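
A common way to study beam density is to simulate a sparser sensor by dropping beams from a dense scan. The sketch below illustrates that idea only; the abstract does not say how the authors vary beam density, and both the per-point beam index and the helper name `subsample_beams` are assumptions.

```python
def subsample_beams(points, keep_every):
    """Keep only points whose beam index is a multiple of keep_every.

    points is a list of (x, y, z, beam_id) tuples, where beam_id is the
    (assumed available) index of the laser ring that produced the point.
    With a 128-beam scan, keep_every=4 leaves a simulated 32-beam scan.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be a positive integer")
    return [p for p in points if p[3] % keep_every == 0]
```

When a dataset does not store ring indices, they are usually estimated from each point's elevation angle, which adds its own source of error to the simulated sparse scan.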

When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning

Yang Liu,Qianqian Xu,Peisong Wen,Siran Dai,Qingming Huang

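Task: Propose T-CoRe, a self-supervised framework that exploits temporal correspondence for video representation learning.

Motivation: Address the uncertainty introduced by random temporal sampling and the insufficient information compression that results when earlier MVM methods recover masked patches in pixel space.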

Details

Method: Proposes a sandwich sampling strategy to reduce reconstruction uncertainty and adds an auxiliary branch to a self-distillation architecture to restore representations in the latent space. Result: T-CoRe performs strongly across several downstream tasks, demonstrating its effectiveness for video representation learning. Conclusion: By reducing uncertainty and producing high-level semantic representations, T-CoRe significantly improves video representation learning. Abstract: The past decade has witnessed notable achievements in self-supervised learning for video tasks. Recent efforts typically adopt the Masked Video Modeling (MVM) paradigm, leading to significant progress on multiple video tasks. However, two critical challenges remain: 1) Without human annotations, the random temporal sampling introduces uncertainty, increasing the difficulty of model training. 2) Previous MVM methods primarily recover the masked patches in the pixel space, leading to insufficient information compression for downstream tasks. To address these challenges jointly, we propose a self-supervised framework that leverages Temporal Correspondence for video Representation learning (T-CoRe). For challenge 1), we propose a sandwich sampling strategy that selects two auxiliary frames to reduce reconstruction uncertainty in a two-side-squeezing manner. Addressing challenge 2), we introduce an auxiliary branch into a self-distillation architecture to restore representations in the latent space, generating high-level semantic representations enriched with temporal information. Experiments of T-CoRe consistently present superior performance across several downstream tasks, demonstrating its effectiveness for video representation learning. The code is available at https://github.com/yafeng19/T-CORE.
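
The sandwich sampling idea, two auxiliary frames that bracket the target frame, can be sketched as follows. This is only an illustration under stated assumptions: the uniform sampling, the `max_gap` window, and the function name `sandwich_sample` are not taken from the T-CoRe code.

```python
import random


def sandwich_sample(num_frames, max_gap, rng=random):
    """Pick a target frame plus one auxiliary frame on each side of it.

    Returns (before, target, after) frame indices with
    before < target < after, each auxiliary frame at most max_gap
    frames away from the target.
    """
    if num_frames < 3 or max_gap < 1:
        raise ValueError("need at least 3 frames and max_gap >= 1")
    # The target cannot be the first or last frame, so both sides exist.
    target = rng.randrange(1, num_frames - 1)
    before = rng.randrange(max(0, target - max_gap), target)
    after = rng.randrange(target + 1, min(num_frames, target + max_gap + 1))
    return before, target, after
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which helps when comparing it with purely random temporal sampling.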

Distilling 3D distinctive local descriptors for 6D pose estimation

Amir Hamza,Andrea Caraffa,Davide Boscaini,Fabio Poiesi

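Task: Use a knowledge distillation framework to train an efficient student model that regresses the local descriptors of a GeDi teacher.

Motivation: GeDi shows strong zero-shot 6D pose estimation, but its expensive inference makes it impractical for real-world applications.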

Details

Method: Introduces a knowledge distillation framework that trains an efficient student model, together with a new loss formulation that handles weak supervision from non-distinctive teacher descriptors. Result: Validated on five BOP Benchmark datasets, the approach significantly reduces inference time while remaining competitive with existing methods. Conclusion: The approach brings zero-shot 6D pose estimation closer to real-time feasibility. Abstract: Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process. Can we retain GeDi's effectiveness while significantly improving its efficiency? In this paper, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility. Project Website: https://tev-fbk.github.io/dGeDi/

GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation

Ziqin Huang,Gu Wang,Chenyangguang Zhang,Ruida Zhang,Xiu Li,Xiangyang Ji

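Task: Propose a new coordinate representation (the IVFC map) and a framework (GIVEPose) for category-level object pose estimation.

Motivation: Existing geometry-guided pose regression based on the NOCS map suffers from intra-class variation, which leads to suboptimal results.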

Details

Method: Proposes the Intra-class Variation-Free Consensus (IVFC) map and develops the GIVEPose framework, which combines the complementary strengths of the NOCS map and the IVFC map. Result: Extensive evaluation on synthetic and real-world datasets shows that GIVEPose significantly outperforms existing state-of-the-art RGB-based methods. Conclusion: By gradually eliminating intra-class variation, GIVEPose significantly improves category-level object pose estimation. Abstract: Recent advances in RGBD-based category-level object pose estimation have been limited by their reliance on precise depth information, restricting their broader applicability. In response, RGB-based methods have been developed. Among these methods, geometry-guided pose regression that originated from instance-level tasks has demonstrated strong performance. However, we argue that the NOCS map is an inadequate intermediate representation for geometry-guided pose regression methods, as its many-to-one correspondence with category-level pose introduces redundant instance-specific information, resulting in suboptimal results. This paper identifies the intra-class variation problem inherent in pose regression based solely on the NOCS map and proposes the Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model. By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements Gradual Intra-class Variation Elimination for category-level object pose estimation. Extensive evaluations on both synthetic and real-world datasets demonstrate that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches, achieving substantial improvements in category-level object pose estimation. Our code is available at https://github.com/ziqin-h/GIVEPose.

Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation

Haoyu Ji,Bowen Chen,Weihong Ren,Wenze Huang,Zhihao Yang,Zhiyong Wang,Honghai Liu

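Task: Segment and recognize actions in long, untrimmed sequences of human skeletal movements.

Motivation: Existing STAS methods typically use spatio-temporal modeling to establish dependencies among joints and frames, and supervise frame-wise classification with one-hot encoding and cross-entropy loss. They overlook the intrinsic correlations among joints and actions within skeletal features, which limits their understanding of human movement.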

Details

Method: Proposes the Text-Derived Relational Graph-Enhanced Network (TRG-Net), which uses prior graphs generated by a large language model (LLM) to enhance both modeling and supervision. For modeling, Dynamic Spatio-Temporal Fusion Modeling (DSFM) combines Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, Absolute-Relative Inter-Class Supervision (ARIS) applies contrastive learning between action features and text embeddings to regularize absolute class distributions, and uses Text-Derived Action Graphs (TAG) to capture relative inter-class relationships among action features. In addition, Spatial-Aware Enhancement Processing (SAEP) combines random joint occlusion and axial rotation to improve spatial generalization. Result: Evaluations on four public datasets show that TRG-Net achieves state-of-the-art results. Conclusion: By introducing text-derived relational graphs and dynamic spatio-temporal fusion modeling, TRG-Net significantly improves skeleton-based action segmentation and recognition. Abstract: Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLM) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results.

VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention

Mingzhe Zheng,Yongqi Xu,Haojian Huang,Xuran Ma,Yexin Liu,Wenjie Shu,Yatian Pang,Feilong Tang,Qifeng Chen,Harry Yang,Ser-Nam Lim

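Task: Build a framework that automates multi-shot video generation from a single sentence.

Motivation: Existing video generation models excel at short clips but produce disjointed visual dynamics and fractured storylines when asked for cohesive multi-shot narratives.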

Details

Method: Proposes the VideoGen-of-Thought (VGoT) framework, which addresses three core challenges (narrative fragmentation, visual inconsistency, and transition artifacts) through dynamic storyline modeling, identity-aware cross-shot propagation, and adjacent latent transition mechanisms. Result: VGoT improves on existing baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and requiring 10x fewer manual adjustments. Conclusion: VGoT generates cohesive multi-shot videos and significantly outperforms existing methods. Abstract: Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative Fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which first converts the user prompt into concise shot descriptions, then elaborates them into detailed, cinematic specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, HDR lighting), ensuring logical narrative progression with self-validation. (2) Visual Inconsistency: Existing approaches struggle with maintaining visual consistency across shots. Our identity-aware cross-shot propagation generates identity-preserving portrait (IPP) tokens that maintain character fidelity while allowing trait variations (expressions, aging) dictated by the storyline. (3) Transition Artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and 10x fewer manual adjustments than alternatives.

Object-Centric Pretraining via Target Encoder Bootstrapping

Nikola Đukić,Tim Lebailly,Tinne Tuytelaars

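Task: Propose OCEBO, a new self-distillation setup for training object-centric models from scratch.

Motivation: Existing object-centric representation learning relies on pretrained non-object-centric foundation models whose features serve as reconstruction targets; those targets must stay frozen, which caps performance.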

Details

Method: Updates the target encoder by bootstrapping (OCEBO), making it an exponential moving average of the object-centric model, which introduces object-centric inductive biases and removes the upper bound on performance. Result: OCEBO pretrained on COCO reaches unsupervised object discovery performance comparable to models that use frozen non-object-centric target encoders. Conclusion: With its self-distillation setup and cross-view patch filtering, OCEBO trains object-centric models from scratch on real-world data and matches models whose target encoders were pretrained on far more images. Abstract: Object-centric representation learning has recently been successfully applied to real-world datasets. This success can be attributed to pretrained non-object-centric foundation models, whose features serve as reconstruction targets for slot attention. However, targets must remain frozen throughout the training, which sets an upper bound on the performance object-centric models can attain. Attempts to update the target encoder by bootstrapping result in large performance drops, which can be attributed to its lack of object-centric inductive biases, causing the object-centric model's encoder to drift away from representations useful as reconstruction targets. To address these limitations, we propose Object-CEntric Pretraining by Target Encoder BOotstrapping, a self-distillation setup for training object-centric models from scratch, on real-world data, for the first time ever. In OCEBO, the target encoder is updated as an exponential moving average of the object-centric model, thus explicitly being enriched with object-centric inductive biases introduced by slot attention while removing the upper bound on performance present in other models. We mitigate the slot collapse caused by random initialization of the target encoder by introducing a novel cross-view patch filtering approach that limits the supervision to sufficiently informative patches. When pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of object-centric models with frozen non-object-centric target encoders pretrained on hundreds of millions of images. The code and pretrained models are publicly available at https://github.com/djukicn/ocebo.
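
The exponential moving average update at the heart of the bootstrapping step can be written in a few lines. The sketch below uses plain floats keyed by parameter name to stay self-contained; the momentum value and the name `ema_update` are illustrative assumptions, not settings from the OCEBO repository.

```python
def ema_update(target, online, momentum=0.99):
    """Move each target parameter toward the matching online parameter.

    Implements target <- momentum * target + (1 - momentum) * online,
    applied in place to the target dict, which is also returned.
    """
    for name, value in online.items():
        target[name] = momentum * target[name] + (1.0 - momentum) * value
    return target
```

A momentum close to 1 makes the target encoder change slowly, which is what keeps the reconstruction targets stable while the online model trains.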

PointSFDA: Source-free Domain Adaptation for Point Cloud Completion

Xing He,Zhe Zhu,Liangliang Nan,Honghua Chen,Jing Qin,Mingqiang Wei

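Task: Propose PointSFDA, an effective source-free domain adaptation framework for point cloud completion.

Motivation: Conventional methods face significant challenges when applied to real-world scans, particularly when the source data is inaccessible.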

Details

Method: Adapts using only a pretrained source model and unlabeled target data, introducing a coarse-to-fine distillation solution and a self-supervised partial-mask consistency training strategy. Result: Experiments confirm that the method significantly improves cross-domain shape completion. Conclusion: PointSFDA is an effective source-free domain adaptation framework that significantly improves point cloud completion. Abstract: Conventional methods for point cloud completion, typically trained on synthetic datasets, face significant challenges when applied to out-of-distribution real-world scans. In this paper, we propose an effective yet simple source-free domain adaptation framework for point cloud completion, termed PointSFDA. Unlike unsupervised domain adaptation that reduces the domain gap by directly leveraging labeled source data, PointSFDA uses only a pretrained source model and unlabeled target data for adaptation, avoiding the need for inaccessible source data in practical scenarios. Being the first source-free domain adaptation architecture for point cloud completion, our method offers two core contributions. First, we introduce a coarse-to-fine distillation solution to explicitly transfer the global geometry knowledge learned from the source dataset. Second, as noise may be introduced due to domain gaps, we propose a self-supervised partial-mask consistency training strategy to learn local geometry information in the target domain. Extensive experiments have validated that our method significantly improves the performance of state-of-the-art networks in cross-domain shape completion. Our code is available at https://github.com/Starak-x/PointSFDA.

ARC: Anchored Representation Clouds for High-Resolution INR Classification

Joost Luijmes,Alexander Gielisse,Roman Knyazhitskiy,Jan van Gemert

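Task: Propose ARC, a new implicit neural representation (INR) architecture for image classification.

Motivation: Current INR image classification methods are demonstrated on low-resolution data, are sensitive to image-space transformations, and lack a mechanism for local representation.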

Details

Method: Proposes ARC (Anchored Representation Clouds), which introduces spatial structure by explicitly anchoring local latent vectors in image space. Result: ARC reaches state-of-the-art implicit image classification on both low- and high-resolution images and is more robust to image-space translation. Conclusion: By introducing a local representation mechanism, ARC significantly improves INR performance in image classification. Abstract: Implicit neural representations (INRs) encode signals in neural network weights as a memory-efficient representation, decoupling sampling resolution from the associated resource costs. Current INR image classification methods are demonstrated on low-resolution data and are sensitive to image-space transformations. We attribute these issues to the global, fully-connected MLP neural network architecture encoding of current INRs, which lack mechanisms for local representation: MLPs are sensitive to absolute image location and struggle with high-frequency details. We propose ARC: Anchored Representation Clouds, a novel INR architecture that explicitly anchors latent vectors locally in image-space. By introducing spatial structure to the latent vectors, ARC captures local image data which in our testing leads to state-of-the-art implicit image classification of both low- and high-resolution images and increased robustness against image-space translation. Code can be found at https://github.com/JLuij/anchored_representation_clouds.

UltraFlwr -- An Efficient Federated Medical and Surgical Object Detection Framework

Yang Li,Soumya Snigdha Kundu,Maxence Boels,Toktam Mahmoodi,Sebastien Ourselin,Tom Vercauteren,Prokar Dasgupta,Jonathan Shapey,Alejandro Granados

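Task: Propose UltraFlwr, a federated learning framework for medical and surgical object detection, and design the YOLO-PA strategies to reduce communication overhead.

Motivation: Address the challenges that medical and surgical object detection faces in edge deployment: limited high-quality annotated data, data sharing restrictions, and constrained compute.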

Details

Method: Uses Federated Learning (FL) for decentralized model training across multiple sites and proposes the YOLO-PA strategies to reduce communication overhead. Result: YOLO-PA cuts communication overhead by up to 83% per round while outperforming client-wise centralized training and full aggregation strategies on the BCCD and m2cai16-tool-locations datasets. Conclusion: UltraFlwr improves the feasibility of training and deploying detection models on edge devices, making federated object detection more practical for time-critical and resource-constrained medical and surgical applications. Abstract: Object detection shows promise for medical and surgical applications such as cell counting and tool tracking. However, it faces multiple real-world edge deployment challenges including limited high-quality annotated data, data sharing restrictions, and computational constraints. In this work, we introduce UltraFlwr, a framework for federated medical and surgical object detection. By leveraging Federated Learning (FL), UltraFlwr enables decentralized model training across multiple sites without sharing raw data. To further enhance UltraFlwr's efficiency, we propose YOLO-PA, a set of novel Partial Aggregation (PA) strategies specifically designed for YOLO models in FL. YOLO-PA significantly reduces communication overhead by up to 83% per round while maintaining performance comparable to Full Aggregation (FA) strategies. Our extensive experiments on BCCD and m2cai16-tool-locations datasets demonstrate that YOLO-PA not only provides better client models compared to client-wise centralized training and FA strategies, but also facilitates efficient training and deployment across resource-constrained edge devices. Further, we also establish one of the first benchmarks in federated medical and surgical object detection. This paper advances the feasibility of training and deploying detection models on the edge, making federated object detection more practical for time-critical and resource-constrained medical and surgical applications. UltraFlwr is publicly available at https://github.com/KCL-BMEIS/UltraFlwr.
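
Partial aggregation can be pictured as federated averaging restricted to a subset of parameters, with everything else staying local to each client. The sketch below is a toy illustration under loud assumptions: parameters are plain floats, the shared subset is chosen by name prefix, and the names `partial_fedavg`, `backbone.` and `head.` are hypothetical. The abstract does not state which YOLO components YOLO-PA actually shares.

```python
def partial_fedavg(client_states, client_sizes, shared_prefixes):
    """Weighted-average only the parameters whose names match a shared prefix.

    client_states: one dict per client mapping parameter name -> value.
    client_sizes: number of local training samples per client (the weights).
    shared_prefixes: name prefixes of parameters that are sent and averaged.
    Parameters outside the shared set are never transmitted, which is where
    the communication saving comes from.
    """
    total = sum(client_sizes)
    prefixes = tuple(shared_prefixes)
    shared_keys = [k for k in client_states[0] if k.startswith(prefixes)]
    return {
        k: sum(state[k] * n for state, n in zip(client_states, client_sizes)) / total
        for k in shared_keys
    }
```

For example, two clients holding 1 and 3 samples with `backbone.w` values 1.0 and 3.0 would receive the averaged value 2.5, while each keeps its own `head.w`.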

Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU

Àlex Pujol Vidal,Sergio Escalera,Kamal Nasrollahi,Thomas B. Moeslund

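Task: Investigate machine unlearning in hyperbolic contrastive learning.

Motivation: Explore how effective concept removal is in hyperbolic space, which is used to better capture semantic hierarchies.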

Details

Method: Adapts Alignment Calibration to the MERU model and introduces hyperbolic-specific components, including entailment calibration and norm regularization. Result: Hyperbolic geometry offers distinct advantages for concept removal, achieving near-perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Conclusion: Hyperbolic unlearning reorganizes the semantic hierarchy, which differs fundamentally from Euclidean approaches; the findings advance machine unlearning and offer insight into the geometric properties that shape concept representation and removal in multimodal models. Abstract: Machine unlearning methods have become increasingly important for selective concept removal in large pre-trained models. While recent work has explored unlearning in Euclidean contrastive vision-language models, the effectiveness of concept removal in hyperbolic spaces remains unexplored. This paper investigates machine unlearning in hyperbolic contrastive learning by adapting Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Through systematic experiments and ablation studies, we demonstrate that hyperbolic geometry offers distinct advantages for concept removal, achieving near perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space. Comparative analysis with Euclidean models reveals fundamental differences in unlearning dynamics, with hyperbolic unlearning reorganizing the semantic hierarchy while Euclidean approaches merely disconnect cross-modal associations. These findings not only advance machine unlearning techniques but also provide insights into the geometric properties that influence concept representation and removal in multimodal models. Source code available at https://github.com/alex-pv01/HAC
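
To make "embedding in hyperbolic space" concrete, the sketch below computes geodesic distance in the Lorentz (hyperboloid) model, the standard formula d(x, y) = arccosh(-c <x, y>_L) / sqrt(c) for curvature -c. Using the Lorentz model here is an assumption, since the abstract says only "hyperbolic space", and the helper names `lift`, `lorentz_inner`, and `lorentz_distance` are illustrative.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of products of space parts."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lift(space, curv=1.0):
    """Lift a Euclidean vector onto the hyperboloid of curvature -curv.

    The time component is chosen so that <x, x>_L = -1 / curv.
    """
    time = math.sqrt(1.0 / curv + sum(v * v for v in space))
    return [time] + list(space)


def lorentz_distance(x, y, curv=1.0):
    """Geodesic distance between two points on the hyperboloid."""
    # Clamp to 1.0 so rounding error cannot push acosh out of its domain.
    arg = max(1.0, -curv * lorentz_inner(x, y))
    return math.acosh(arg) / math.sqrt(curv)
```

Distances grow quickly away from the origin in this geometry, which is why generic concepts tend to sit near the origin and specific ones farther out, the hierarchy that the unlearning study reorganizes.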

3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation

Gyeongrok Oh,Sungjune Kim,Heeju Ko,Hyung-gun Chi,Jinkyu Kim,Dongwook Lee,Daehyun Ji,Sungjoon Choi,Sujin Jang,Sangpil Kim

Task: Preserve view transformation quality in camera-based 3D occupancy prediction when voxel queries are low-resolution.

Motivation: Computational constraints and the practical need for real-time deployment call for smaller query resolutions, which lose information, so rich visual details must be encoded and preserved within limited query sizes.

Details

Method: Propose ProtoOcc, a novel occupancy network that uses prototypes of clustered image segments in view transformation to enhance low-resolution context. Result: Experiments on the Occ3D and SemanticKITTI benchmarks show the method is effective, with clear improvements over the baselines. Conclusion: ProtoOcc remains competitive with the baselines even with voxel resolution reduced by 75%. Abstract: The resolution of voxel queries significantly influences the quality of view transformation in camera-based 3D occupancy prediction. However, computational constraints and the practical necessity for real-time deployment require smaller query resolutions, which inevitably leads to information loss. Therefore, it is essential to encode and preserve rich visual details within limited query sizes while ensuring a comprehensive representation of 3D occupancy. To this end, we introduce ProtoOcc, a novel occupancy network that leverages prototypes of clustered image segments in view transformation to enhance low-resolution context. In particular, the mapping of 2D prototypes onto 3D voxel queries encodes high-level visual geometries and complements the loss of spatial information from reduced query resolutions. Additionally, we design a multi-perspective decoding strategy to efficiently disentangle the densely compressed visual cues into a high-dimensional 3D occupancy scene. Experimental results on both Occ3D and SemanticKITTI benchmarks demonstrate the effectiveness of the proposed method, showing clear improvements over the baselines. More importantly, ProtoOcc achieves competitive performance against the baselines even with 75% reduced voxel resolution.

Benchmarking Large Language Models for Handwritten Text Recognition

Giorgia Crosilla,Lukas Klic,Giovanni Colavizza

Task: Evaluate multimodal large language models (MLLMs) on handwritten text recognition (HTR) and compare them with traditional models.

Motivation: Traditional HTR models require extensive manual annotation and separate layout processing from text processing, which invites errors. MLLMs offer a general approach that recognizes diverse handwriting styles without model-specific training.

Details

Method: Benchmark a range of proprietary and open-source LLMs on modern and historical datasets, and test the models' ability to autonomously correct previously generated outputs. Result: Proprietary models, especially Claude 3.5 Sonnet, outperform open-source models in zero-shot settings. MLLMs perform very well on modern handwriting but favor English because of the composition of their pre-training data. The comparison with Transkribus shows no consistent advantage for either approach, and LLMs show limited ability to autonomously correct errors in zero-shot transcriptions. Conclusion: MLLMs perform well on HTR, particularly for modern handwriting, but autonomous error correction still has room for improvement. Abstract: Traditional machine learning models for Handwritten Text Recognition (HTR) rely on supervised training, requiring extensive manual annotations, and often produce errors due to the separation between layout and text processing. In contrast, Multimodal Large Language Models (MLLMs) offer a general approach to recognizing diverse handwriting styles without the need for model-specific training. The study benchmarks various proprietary and open-source LLMs against Transkribus models, evaluating their performance on both modern and historical datasets written in English, French, German, and Italian. In addition, emphasis is placed on testing the models' ability to autonomously correct previously generated outputs. Findings indicate that proprietary models, especially Claude 3.5 Sonnet, outperform open-source alternatives in zero-shot settings. MLLMs achieve excellent results in recognizing modern handwriting and exhibit a preference for the English language due to their pre-training dataset composition. Comparisons with Transkribus show no consistent advantage for either approach. Moreover, LLMs demonstrate limited ability to autonomously correct errors in zero-shot transcriptions.

Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Feifei Li,Mi Zhang,Yiming Sun,Min Yang

Task: Propose Detect-and-Guide (DAG), a safe generation framework that detects and removes harmful content in text-to-image diffusion models.

Motivation: Existing post-hoc model intervention techniques (such as concept unlearning and safety guidance) affect the sampling trajectory when erasing harmful concepts and operate opaquely, making it hard to tell which part of the intermediate variables causes unsafe generation.

Details

Method: DAG uses the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during sampling. It first detects harmful concepts in noisy latents through the cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. Result: Experiments show that DAG achieves state-of-the-art safe generation performance on erasing sexual content, balancing harmfulness mitigation and text-following performance. Conclusion: DAG requires no fine-tuning of the diffusion model and therefore does not reduce its generation diversity; it needs only a small annotated dataset to provide precise detection maps with generalizability and concept specificity. Abstract: Text-to-image diffusion models have achieved state-of-the-art results in synthesis tasks; however, there is a growing concern about their potential misuse in creating harmful content. To mitigate these risks, post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed. However, fine-tuning model weights or adapting the hidden states of the diffusion model operates in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation. These interventions severely affect the sampling trajectory when erasing harmful concepts from complex, multi-concept prompts, thus hindering their practical use in real-world settings. In this work, we propose the safe generation framework Detect-and-Guide (DAG), leveraging the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during the sampling process. DAG first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. The optimization only requires a small annotated dataset and can provide precise detection maps with generalizability and concept specificity. Moreover, DAG does not require fine-tuning of diffusion models, and therefore introduces no loss to their generation diversity. Experiments on erasing sexual content show that DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on multi-concept real-world prompts.

DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

Jiazhe Guo,Yikang Ding,Xiwu Chen,Shuo Chen,Bohan Li,Yingshuang Zou,Xiaoyang Lyu,Feiyang Tan,Xiaojuan Qi,Zhiheng Li,Hao Zhao

Task: Propose DiST-4D, a disentangled spatiotemporal diffusion framework for 4D driving scene generation, to address the difficulty existing generative models have with temporal extrapolation and spatial novel view synthesis.

Motivation: Existing generative models struggle to support temporal extrapolation and spatial novel view synthesis at the same time without per-scene optimization; the key is finding an efficient and generalizable geometric representation.

Details

Method: DiST-4D uses metric depth as the core geometric representation and decomposes the problem into two diffusion processes: DiST-T predicts future metric depth and multi-view RGB sequences, and DiST-S achieves spatial novel view synthesis by training only on existing viewpoints while enforcing cycle consistency. Result: Experiments show that DiST-4D achieves state-of-the-art performance in both temporal prediction and novel view synthesis, and performs competitively in planning-related evaluations. Conclusion: Metric depth, as a view-consistent geometric representation, is essential for accurate and reliable temporal forecasting and spatial novel view synthesis; DiST-4D addresses the existing challenges through its disentangled spatiotemporal diffusion framework. Abstract: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both accurate and reliable forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.

GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector

Zechuan Li,Hongshan Yu,Yihao Ding,Jinhao Qiao,Basim Azam,Naveed Akhtar

Task: Propose GO-N3RDet, a scene-geometry-optimized multi-view 3D object detector enhanced by neural radiance fields.

Motivation: Occlusion and the lack of 3D information make it challenging to construct 3D features from multi-view 2D images.

Details

Method: Introduce a voxel optimization mechanism with embedded 3D positional information to fuse multi-view features, devise a double importance sampling scheme that prioritizes neural field reconstruction in object regions, propose an opacity optimization module that predicts voxel opacity precisely by enforcing multi-view consistency constraints, and use ray distance as a weighting factor to minimize cumulative ray errors. Result: Extensive experiments on ScanNet and ARKITScenes show that GO-N3RDet establishes a new state of the art in NeRF-based multi-view 3D detection. Conclusion: The modules work together as an end-to-end neural model that markedly improves the accuracy of multi-view 3D object detection. Abstract: We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields. The key to accurate 3D object detection is in effective voxel representation. However, due to occlusion and lack of 3D information, constructing 3D features from multi-view 2D images is challenging. Addressing that, we introduce a unique 3D positional information embedded voxel optimization mechanism to fuse multi-view features. To prioritize neural field reconstruction in object regions, we also devise a double importance sampling scheme for the NeRF branch of our detector. We additionally propose an opacity optimization module for precise voxel opacity prediction by enforcing multi-view consistency constraints. Moreover, to further improve voxel density consistency across multiple perspectives, we incorporate ray distance as a weighting factor to minimize cumulative ray errors. Our unique modules synergetically form an end-to-end neural model that establishes a new state of the art in NeRF-based multi-view 3D detection, verified with extensive experiments on ScanNet and ARKITScenes. Code will be available at https://github.com/ZechuanLi/GO-N3RDet.

CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification

Wenlong Yu,Qilong Wang,Chuang Liu,Dong Li,Qinghua Hu

Task: Propose a Chain-of-Explanation (CoE) approach that automatically builds global concept explanation datasets and provides linguistic explanations of local decision processes.

Motivation: Current post-hoc explanation methods fall short in automatically constructing accurate and sufficient linguistic explanations for global concepts and local circuits; in particular, the polysemanticity of semantic visual concepts (VCs) severely impedes the interpretability of concepts and deep vision models (DVMs).

Details

Method: CoE automates the decoding and description of VCs to build global concept explanation datasets, designs a concept polysemanticity disentanglement and filtering mechanism, and introduces Concept Polysemanticity Entropy (CPE) as a measure of model interpretability. Result: Experiments show that CPE is effective and that CoE achieves an average absolute improvement of 36% in explainability scores. Conclusion: CoE addresses the shortcomings of current post-hoc explanation methods and markedly improves the explainability of deep vision models. Abstract: Explainability is a critical factor influencing the wide deployment of deep vision models (DVMs). Concept-based post-hoc explanation methods can provide both global and local insights into model decisions. However, current methods in this field face challenges in that they cannot flexibly and automatically construct accurate and sufficient linguistic explanations for global concepts and local circuits. In particular, the intrinsic polysemanticity in semantic Visual Concepts (VCs) impedes the interpretability of concepts and DVMs, a problem that is severely underestimated. In this paper, we propose a Chain-of-Explanation (CoE) approach to address these issues. Specifically, CoE automates the decoding and description of VCs to construct global concept explanation datasets. Further, to alleviate the effect of polysemanticity on model explainability, we design a concept polysemanticity disentanglement and filtering mechanism to distinguish the most contextually relevant concept atoms. Besides, a Concept Polysemanticity Entropy (CPE), as a measure of model interpretability, is formulated to quantify the degree of concept uncertainty. The modeling of deterministic concepts is upgraded to uncertain concept atom distributions. Finally, CoE automatically enables linguistic local explanations of the decision-making process of DVMs by tracing the concept circuit. GPT-4o and human-based experiments demonstrate the effectiveness of CPE and the superiority of CoE, achieving an average absolute improvement of 36% in terms of explainability scores.
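
The abstract does not give the CPE formula, so the following is only a rough sketch of the general idea of scoring concept uncertainty with an entropy. It assumes, hypothetically, that each visual concept comes with non-negative relevance weights over its "concept atoms" and uses plain Shannon entropy; the function name and the weighting scheme are illustrative, not the paper's definition.

```python
import math


def concept_entropy(atom_weights):
    """Shannon entropy (in nats) of a concept's atom distribution.

    atom_weights: non-negative relevance weights, one per concept atom
    (a hypothetical input format). Weights are normalized to sum to 1.
    """
    total = sum(atom_weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    probs = [w / total for w in atom_weights if w > 0]
    return -sum(p * math.log(p) for p in probs)
```

Under this reading, a concept dominated by a single atom scores near zero (monosemantic), while a concept spread evenly over many atoms scores high (polysemantic).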

DEPT: Deep Extreme Point Tracing for Ultrasound Image Segmentation

Lei Shi,Xi Fang,Naiyu Wang,Junxing Zhang

Task: Propose an ultrasound image segmentation method that combines Deep Extreme Point Tracing (DEPT) with the Feature-Guided Extreme Point Masking (FGEPM) algorithm.

Motivation: Fully supervised methods for medical image segmentation need large amounts of annotated data; this work explores weakly supervised learning, in particular the potential of extreme points as supervisory signals.

Details

Method: Generate pseudo labels by identifying the lowest-cost path that connects all extreme points on a cost matrix derived from the feature map, and refine the pseudo labels progressively with an iterative training strategy. Result: Experiments on two public datasets show that the method approaches fully supervised performance and outperforms several existing weakly supervised methods. Conclusion: The method performs well on ultrasound image segmentation and reduces annotation effort while maintaining high segmentation accuracy. Abstract: Automatic medical image segmentation plays a crucial role in computer-aided diagnosis. However, fully supervised learning approaches often require extensive and labor-intensive annotation efforts. To address this challenge, weakly supervised learning methods, particularly those using extreme points as supervisory signals, have the potential to offer an effective solution. In this paper, we introduce Deep Extreme Point Tracing (DEPT) integrated with the Feature-Guided Extreme Point Masking (FGEPM) algorithm for ultrasound image segmentation. Notably, our method generates pseudo labels by identifying the lowest-cost path that connects all extreme points on the feature map-based cost matrix. Additionally, an iterative training strategy is proposed to refine pseudo labels progressively, enabling continuous network improvement. Experimental results on two public datasets demonstrate the effectiveness of our proposed method. The performance of our method approaches that of the fully supervised method and outperforms several existing weakly supervised methods.
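
The lowest-cost path step can be pictured with a small sketch. The code below runs Dijkstra's algorithm on a 2D cost grid and chains consecutive extreme points into a closed contour. The 8-connectivity, the visiting order of the extreme points, and the function names are assumptions made for illustration; the abstract does not specify how the cost matrix is built or which search procedure is used.

```python
import heapq


def lowest_cost_path(cost, start, goal):
    """Dijkstra on an 8-connected grid; the path cost is the sum of visited cells.

    cost: 2D list of positive numbers; start/goal: (row, col) tuples.
    """
    rows, cols = len(cost), len(cost[0])
    dist = {start: cost[start[0]][start[1]]}
    prev = {}
    heap = [(dist[start], start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist[(r, c)]:
            continue  # stale queue entry
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    nd = d + cost[nr][nc]
                    if nd < dist.get((nr, nc), float("inf")):
                        dist[(nr, nc)] = nd
                        prev[(nr, nc)] = (r, c)
                        heapq.heappush(heap, (nd, (nr, nc)))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def trace_contour(cost, extreme_points):
    """Chain lowest-cost paths between consecutive extreme points (closed loop)."""
    contour = []
    n = len(extreme_points)
    for i in range(n):
        segment = lowest_cost_path(cost, extreme_points[i], extreme_points[(i + 1) % n])
        contour.extend(segment[:-1])  # drop the endpoint; the next segment starts there
    return contour
```

Filling the interior of the resulting contour would give a pseudo mask; low costs along object boundaries make the path follow the boundary.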

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Hengrui Kang,Siwei Wen,Zichen Wen,Junyan Ye,Weijia Li,Peilin Feng,Baichuan Zhou,Bin Wang,Dahua Lin,Linfeng Zhang,Conghui He

Task: Propose LEGION, an image forgery analysis framework based on a multimodal large language model (MLLM), and introduce the high-quality dataset SynthScars.

Motivation: Current synthetic image detection methods lack textual interpretability, and existing datasets are often outdated and lack fine-grained annotations.

Details

Method: Propose the LEGION framework, which integrates artifact detection, segmentation, and explanation, and apply it within image refinement pipelines. Result: LEGION outperforms existing methods on multiple benchmarks; on SynthScars it surpasses the second-best method by 3.31% in mIoU and 7.75% in F1 score. Conclusion: LEGION not only improves the accuracy of image forgery detection but can also guide the generation of higher-quality, more realistic images. Abstract: The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, and current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building upon this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, particularly surpassing the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. The code, model, and dataset will be released.

DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Ruowen Zhao,Junliang Ye,Zhengyi Wang,Guangce Liu,Yiwen Chen,Yikai Wang,Jun Zhu

Task: Propose DeepMesh, a framework that optimizes 3D mesh generation.

Motivation: Auto-regressive methods for generating structured meshes are constrained by limited face counts and mesh incompleteness.

Details

Method: Optimize mesh generation through two key innovations: (1) an efficient pre-training strategy that incorporates a novel tokenization algorithm, along with improvements in data curation and processing; (2) the introduction of reinforcement learning (RL) into 3D mesh generation, achieving human preference alignment via Direct Preference Optimization (DPO). Result: Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming existing methods in both precision and quality. Conclusion: Through its pre-training strategy and reinforcement learning approach, DeepMesh markedly improves the precision and quality of 3D mesh generation. Abstract: Triangle meshes play a crucial role in 3D applications for efficient manipulation and rendering. While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). We design a scoring standard that combines human evaluation with 3D metrics to collect preference pairs for DPO, ensuring both visual appeal and geometric accuracy. Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in both precision and quality. Project page: https://zhaorw02.github.io/DeepMesh/
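
As background for the DPO step, here is a minimal sketch of the standard DPO objective for a single preference pair. It shows the generic loss only; the log-probability inputs, the `beta` value, and the function name are illustrative assumptions, and nothing here reflects how DeepMesh scores or samples its mesh pairs.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full sequence (for
    example, a tokenized mesh) under the trained policy or the frozen
    reference model. beta=0.1 is an illustrative default.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the likelihood of the preferred sample relative to the reference model, and rises when it favors the rejected one.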

Challenges and Trends in Egocentric Vision: A Survey

Xiang Li,Heqian Qiu,Lanxiao Wang,Hanwen Zhang,Chenghao Qi,Linfeng Han,Huiyu Xiong,Hongliang Li

Task: Survey research on egocentric vision understanding comprehensively, systematically analyzing the components of egocentric scenes and grouping tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding.

Motivation: With the rapid development of artificial intelligence and wearable devices, egocentric vision understanding has emerged as a new research direction that is attracting broad attention from academia and industry.

Details

Method: Examine the sub-tasks within each category in detail and summarize the main challenges and trends in the field. Result: Provide an overview of high-quality egocentric vision datasets as a resource for future research. Conclusion: Summarizing the latest advances, the survey anticipates broad applications of egocentric vision in augmented reality, virtual reality, and embodied intelligence, and proposes future research directions based on recent developments. Abstract: With the rapid development of artificial intelligence technologies and wearable devices, egocentric vision understanding has emerged as a new and challenging research direction, gradually attracting widespread attention from both academia and industry. Egocentric vision captures visual and multimodal data through cameras or sensors worn on the human body, offering a unique perspective that simulates human visual experiences. This paper provides a comprehensive survey of the research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and categorizing the tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding. We explore in detail the sub-tasks within each category. We also summarize the main challenges and trends currently existing in the field. Furthermore, this paper presents an overview of high-quality egocentric vision datasets, offering valuable resources for future research. By summarizing the latest advancements, we anticipate the broad applications of egocentric vision technologies in fields such as augmented reality, virtual reality, and embodied intelligence, and propose future research directions based on the latest developments in the field.

Teng-Fang Hsiao,Bo-Kai Ruan,Yi-Lun Wu,Tzu-Ling Lin,Hong-Han Shuai

Task: Propose Training-Free Text-and-Image-to-Image (TF-TI2I), a generation method that needs no training, to improve results on complex image generation tasks.

Motivation: Existing methods often use image inputs only partially, focusing on specific elements, or lose generation quality when handling complex multi-image instructions.

Details

Method: Build on the MM-DiT architecture by extracting a condensed visual representation from reference images and sharing information selectively through Reference Contextual Masking, with a Winner-Takes-All module to mitigate distribution shifts. Result: Robust performance across various benchmarks confirms the method's effectiveness on complex image-generation tasks. Conclusion: TF-TI2I handles complex image generation tasks without additional training, and the work introduces FG-TI2I Bench as an evaluation benchmark. Abstract: Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference Contextual Masking; this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.

EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds

Yuanchao Yue,Hui Yuan,Qinglong Miao,Xiaolong Mao,Raouf Hamzaoui,Peter Eisert

Task: Cross-modal data registration, specifically between 2D images and 3D point clouds.

Motivation: In autonomous driving and robotics, accurate and robust registration is essential for aligning data from different modalities. It underpins multimodal sensor data fusion and improves the accuracy and reliability of perception systems.

Details

Method: Propose a cross-modal registration method that uses edge information from the original point clouds and images. Extracting edge points and edge pixels retains the key information in the original data; an attention-based feature exchange block removes cross-modal disparities, and an optimal matching layer improves correspondence identification. Result: Validation on the KITTI and nuScenes datasets demonstrates state-of-the-art accuracy. Conclusion: The method improves cross-modal registration accuracy while maintaining computational efficiency, removing cross-modal disparities and identifying correspondences more accurately. Abstract: Cross-modal data registration has long been a critical task in computer vision, with extensive applications in autonomous driving and robotics. Accurate and robust registration methods are essential for aligning data from different modalities, forming the foundation for multimodal sensor data fusion and enhancing perception systems' accuracy and reliability. The registration task between 2D images captured by cameras and 3D point clouds captured by Light Detection and Ranging (LiDAR) sensors is usually treated as a visual pose estimation problem. High-dimensional feature similarities from different modalities are leveraged to identify pixel-point correspondences, followed by pose estimation techniques using least squares methods. However, existing approaches often resort to downsampling the original point cloud and image data due to computational constraints, inevitably leading to a loss in precision. Additionally, high-dimensional features extracted using different feature extractors from various modalities require specific techniques to mitigate cross-modal differences for effective matching. To address these challenges, we propose a method that uses edge information from the original point clouds and images for cross-modal registration. We retain crucial information from the original data by extracting edge points and pixels, enhancing registration accuracy while maintaining computational efficiency. The use of edge points and edge pixels allows us to introduce an attention-based feature exchange block to eliminate cross-modal disparities. Furthermore, we incorporate an optimal matching layer to improve correspondence identification. We validate the accuracy of our method on the KITTI and nuScenes datasets, demonstrating its state-of-the-art performance.

Yuanchao Yue,Zhengxin Li,Wei Zhang,Hui Yuan

Task: Propose a framework that projects point clouds into multiple 2D representations for matching with camera images, addressing cross-modal registration between LiDAR point clouds and camera images.

Motivation: Calibrating LiDAR point clouds against camera images is typically time-consuming and requires an external calibration board or specific environmental features, and existing methods struggle to reach satisfactory registration accuracy while maintaining real-time performance.

Details

Method: Project point clouds into several 2D representations for matching with camera images, and introduce a multi-scale feature extraction network and a patch-to-pixel matching network. Result: Experiments on the KITTI and nuScenes datasets validate the model's performance; on KITTI the registration accuracy rate exceeds 99%. Conclusion: The framework makes more effective use of the geometric characteristics of LiDAR point clouds, bridges the domain gap between point clouds and images, and achieves both real-time performance and high registration accuracy. Abstract: The primary requirement for cross-modal data fusion is the precise alignment of data from different sensors. However, the calibration between LiDAR point clouds and camera images is typically time-consuming and requires an external calibration board or specific environmental features. Cross-modal registration effectively solves this problem by aligning the data directly without requiring external calibration. However, due to the domain gap between the point cloud and the image, existing methods rarely achieve satisfactory registration accuracy while maintaining real-time performance. To address this issue, we propose a framework that projects point clouds into several 2D representations for matching with camera images, which not only leverages the geometric characteristics of LiDAR point clouds more effectively but also bridges the domain gap between the point cloud and the image. Moreover, to tackle the challenges of cross-modal differences and the limited overlap between LiDAR point clouds and images in the image matching task, we introduce a multi-scale feature extraction network to effectively extract features from both camera images and the projection maps of LiDAR point clouds. Additionally, we propose a patch-to-pixel matching network to provide more effective supervision and achieve higher accuracy. We validate the performance of our model through experiments on the KITTI and nuScenes datasets. Our network achieves real-time performance and extremely high registration accuracy. On the KITTI dataset, our model achieves a registration accuracy rate of over 99%.

Test-Time Backdoor Detection for Object Detection Models

Hangtao Zhang,Yichen Wang,Shihui Yan,Chenyu Zhu,Ziqi Zhou,Linshan Hou,Shengshan Hu,Minghui Li,Yanjun Zhang,Leo Yu Zhang

Task: Detect backdoor-poisoned samples for object detection models at test time.

Motivation: Object detection models are vulnerable to backdoor attacks, in which attackers embed a predefined trigger in training samples to manipulate predictions. Detecting trigger-carrying samples at test time can prevent backdoor activation. However, the unique characteristics of object detection, particularly its output of multiple objects, pose new challenges for backdoor detection, and complex attack effects (e.g., "ghost" object emergence or "vanishing" objects) render current defenses fundamentally inadequate.

Details

Method: Design TRAnsformation Consistency Evaluation (TRACE), which applies foreground and background transformations to each test sample and assesses transformation consistency by computing the variance of object confidences. Result: TRACE achieves black-box, universal backdoor detection; experiments show a 30% improvement in AUROC over state-of-the-art defenses and resistance to adaptive attacks. Conclusion: TRACE performs well at detecting backdoor-poisoned samples for object detection models, markedly improving detection performance and strengthening resistance to adaptive attacks. Abstract: Object detection models are vulnerable to backdoor attacks, where attackers poison a small subset of training samples by embedding a predefined trigger to manipulate predictions. Detecting poisoned samples (i.e., those containing triggers) at test time can prevent backdoor activation. However, unlike image classification tasks, the unique characteristics of object detection, particularly its output of numerous objects, pose fresh challenges for backdoor detection. The complex attack effects (e.g., "ghost" object emergence or "vanishing" object) further render current defenses fundamentally inadequate. To this end, we design TRAnsformation Consistency Evaluation (TRACE), a brand-new method for detecting poisoned samples at test time in object detection. Our journey begins with two intriguing observations: (1) poisoned samples exhibit significantly more consistent detection results than clean ones across varied backgrounds; (2) clean samples show higher detection consistency when introduced to different focal information. Based on these phenomena, TRACE applies foreground and background transformations to each test sample, then assesses transformation consistency by calculating the variance in object confidences. TRACE achieves black-box, universal backdoor detection, with extensive experiments showing a 30% improvement in AUROC over state-of-the-art defenses and resistance to adaptive attacks.
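
The two observations suggest a simple variance-based consistency check, sketched below. The abstract does not say how the foreground and background variances are combined, so the difference score, the function names, and the example confidence values are all hypothetical; only the use of variance over object confidences comes from the text.

```python
from statistics import pvariance


def confidence_variance(confidences):
    """Population variance of one object's confidence across transformed copies."""
    return pvariance(confidences)


def consistency_score(background_confs, foreground_confs):
    """Hypothetical suspicion score: higher suggests a poisoned sample.

    Following the two observations, a poisoned sample should be unusually
    stable across background changes (low variance) and less stable across
    foreground changes (high variance).
    """
    return confidence_variance(foreground_confs) - confidence_variance(background_confs)
```

For example, `consistency_score([0.9, 0.9, 0.9], [0.9, 0.2, 0.5])` is positive (poisoned-like), while `consistency_score([0.9, 0.3, 0.6], [0.8, 0.8, 0.8])` is negative (clean-like). A real detector would also need to match objects across transformed images and pick a decision threshold, both of which are left out here.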

DCA: Dividing and Conquering Amnesia in Incremental Object Detection

Aoting Zhang,Dongbao Yang,Chang Liu,Xiaopeng Hong,Miao Shang,Yu Zhou

Task: Study the forgetting mechanism in incremental object detection (IOD) and propose a Divide-and-Conquer Amnesia (DCA) strategy to mitigate it.

Motivation: Existing methods have had some success by improving knowledge distillation and exemplar replay, but the intrinsic causes of forgetting remain underexplored. This work investigates those causes and proposes a solution.

Details

Method: Propose the DCA strategy, which redesigns transformer-based IOD as a localization-then-recognition process and uses semantic knowledge from pre-trained language models to reduce feature drift in recognition. Result: Experiments show strong performance in long-term incremental scenarios; under the four-step setting on MS-COCO, the final AP improves by 6.9%. Conclusion: DCA maintains and transfers localization ability while specifically addressing the decoupled, fragile recognition, markedly improving incremental object detection. Abstract: Incremental object detection (IOD) aims to cultivate an object detector that can continuously localize and recognize novel classes while preserving its performance on previous classes. Existing methods achieve certain success by improving knowledge distillation and exemplar replay for transformer-based detection frameworks, but the intrinsic forgetting mechanisms remain underexplored. In this paper, we dive into the cause of forgetting and discover a forgetting imbalance between localization and recognition in transformer-based IOD: localization is less prone to forgetting and can generalize to future classes, whereas catastrophic forgetting occurs primarily in recognition. Based on these insights, we propose a Divide-and-Conquer Amnesia (DCA) strategy, which redesigns the transformer-based IOD into a localization-then-recognition process. DCA maintains and transfers the localization ability well, leaving decoupled fragile recognition to be specially conquered. To reduce feature drift in recognition, we leverage semantic knowledge encoded in pre-trained language models to anchor class representations within a unified feature space across incremental tasks. This involves designing a duplex classifier fusion and embedding class semantic features into the recognition decoding process in the form of queries. Extensive experiments validate that our approach achieves state-of-the-art performance, especially for long-term incremental scenarios. For example, under the four-step setting on MS-COCO, our DCA strategy significantly improves the final AP by 6.9%.

SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes

Weixiao Gao,Liangliang Nan,Hugo Ledoux

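Task: Introduce and evaluate SUM Parts, a large-scale dataset of urban textured meshes with part-level semantic labels.

Motivation: Semantic segmentation for urban scene analysis has focused mainly on images or point clouds, while textured meshes, which offer a richer spatial representation, remain underexplored.

Details

Method: Build an urban textured mesh dataset covering about 2.5 km² with 21 classes, annotated with a purpose-built tool that supports face- and texture-based annotation with efficient interactive selection. Result: Provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods. Conclusion: SUM Parts is an important resource for research on semantic segmentation of urban textured meshes and shows potential for practical applications. Abstract: Semantic segmentation in urban scene analysis has mainly focused on images or point clouds, while textured meshes - offering richer spatial representation - remain underexplored. This paper introduces SUM Parts, the first large-scale dataset for urban textured meshes with part-level semantic labels, covering about 2.5 km² with 21 classes. The dataset was created using our own annotation tool, which supports both face- and texture-based annotations with efficient interactive selection. We also provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods on this dataset. Our project page is available at https://tudelft3d.github.io/SUMParts/.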

Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport

Hao Tan,Zichang Tan,Jun Li,Ajian Liu,Jun Wan,Zhen Lei

Task: Address the loss of local semantics and the region-to-label matching problem in open-vocabulary multi-label recognition.

Motivation: Existing vision-language models such as CLIP are weak on local semantics and on matching regions to labels, which leads to unreliable predictions.

Details

Method: Propose the RAM framework, comprising the Ladder Local Adapter (LLA) and Knowledge-Constrained Optimal Transport (KCOT), to address these problems. Result: RAM achieves state-of-the-art performance on multiple datasets and shows potential to boost existing methods. Conclusion: By recovering local semantics and optimizing the matching between regions and labels, RAM addresses the key problems in open-vocabulary multi-label recognition. Abstract: Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce refocusing on local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains, and shows great potential to boost the existing methods. Code: https://github.com/EricTan7/RAM.
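
To make the optimal transport formulation concrete, the sketch below solves a small entropy-regularized transport problem between image regions and candidate labels with Sinkhorn iterations. It is a generic solver under assumed inputs: the cost matrix, the marginals, `epsilon`, and the iteration count are illustrative, and the knowledge constraints that give KCOT its name are not described in the abstract and are not modeled here.

```python
import math


def sinkhorn(cost, row_marginal, col_marginal, epsilon=0.1, iterations=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost[i][j]: cost of matching region i to label j (for example,
    1 - cosine similarity). Returns a transport plan whose row and column
    sums approximate the given marginals.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_marginal[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Unlike average pooling, the resulting plan assigns most of each region's mass to the labels it matches cheaply, so irrelevant regions contribute little to a label's score.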

TruthLens: A Training-Free Paradigm for DeepFake Detection

Ritabrata Chakraborty,Rajatsubhra Chakraborty,Ali Khaleghi Rahimian,Thomas MacDougall

Task: Propose TruthLens, a novel training-free framework for deepfake detection that reframes the task as visual question answering.

Motivation: Current fake image detection methods rely mainly on binary classification models that emphasize accuracy but neglect interpretability, leaving users without a clear understanding of why an image is judged real or fake.

Details

Method: TruthLens uses state-of-the-art large vision-language models (LVLMs) to observe and describe visual artifacts, and combines this with the reasoning capabilities of large language models (LLMs) such as GPT-4 to analyze and aggregate evidence into informed decisions. Result: TruthLens performs well on challenging datasets, achieving high accuracy while retaining strong explainability. Conclusion: By reframing deepfake detection as a reasoning-driven process, TruthLens establishes a new paradigm for combating synthetic media, combining strong performance with interpretability to address the growing threat of visual disinformation. Abstract: The proliferation of synthetic images generated by advanced AI models poses significant challenges in identifying and understanding manipulated visual content. Current fake image detection methods predominantly rely on binary classification models that focus on accuracy while often neglecting interpretability, leaving users without clear insights into why an image is deemed real or fake. To bridge this gap, we introduce TruthLens, a novel training-free framework that reimagines deepfake detection as a visual question-answering (VQA) task. TruthLens utilizes state-of-the-art large vision-language models (LVLMs) to observe and describe visual artifacts and combines this with the reasoning capabilities of large language models (LLMs) like GPT-4 to analyze and aggregate evidence into informed decisions. By adopting a multimodal approach, TruthLens seamlessly integrates visual and semantic reasoning to not only classify images as real or fake but also provide interpretable explanations for its decisions. This transparency enhances trust and provides valuable insights into the artifacts that signal synthetic content. Extensive evaluations demonstrate that TruthLens outperforms conventional methods, achieving high accuracy on challenging datasets while maintaining a strong emphasis on explainability. By reframing deepfake detection as a reasoning-driven process, TruthLens establishes a new paradigm in combating synthetic media, combining cutting-edge performance with interpretability to address the growing threats of visual disinformation.

Boosting HDR Image Reconstruction via Semantic Knowledge Transfer

Qingsen Yan,Tao Hu,Genggeng Chen,Wei Dong,Yanning Zhang

Task: Recover high dynamic range (HDR) images from multiple low dynamic range (LDR) images, particularly when the LDR images exhibit noticeable degradation and missing content.

Motivation: Scene-specific semantic priors offer a promising way to restore heavily degraded regions, but because they are typically extracted from sRGB standard dynamic range (SDR) images, the domain/format gap poses a significant challenge when they are applied to HDR imaging.

Details

Method: Proposes a general framework that transfers semantic knowledge from the SDR domain via self-distillation to boost existing HDR reconstruction methods. It introduces the Semantic Priors Guided Reconstruction Model (SPGRM) and a self-distillation mechanism, and uses a semantic knowledge alignment module (SKAM) to fill in missing semantic content. Result: Experiments show that the method significantly improves the HDR imaging quality of existing methods. Conclusion: Through self-distillation and semantic knowledge alignment, the proposed framework effectively improves HDR image reconstruction quality. Abstract: Recovering High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images becomes challenging when the LDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, because these priors are typically extracted from sRGB Standard Dynamic Range (SDR) images, the domain/format gap poses a significant challenge when applying them to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from the SDR domain via self-distillation to boost existing HDR reconstruction. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs between the baseline and SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a semantic knowledge alignment module (SKAM) to fill the missing semantic contents with the complementary masks. Extensive experiments demonstrate that our method can significantly improve the HDR imaging quality of existing methods.

EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models

Yinan Liang,Ziwei Wang,Xiuwei Xu,Jie Zhou,Jiwen Lu

Task: Propose an automatic pruning method that improves the efficiency of multimodal reasoning.

Motivation: Multimodal large language models perform strongly on complex reasoning tasks, but their model complexity makes deployment on resource-limited devices challenging.

Details

Method: Searches for a pruning policy using only a small number of samples, maximizing the policy's generalization to unknown training data while maintaining model accuracy, in order to reach a favorable trade-off between accuracy and efficiency. Result: On ScienceQA, EfficientLLaVA reaches 83.05% accuracy with a 1.8× speedup over the dense LLaVA-v1.5-7B model. Conclusion: The method substantially improves efficiency while preserving model accuracy, making it suitable for resource-limited devices. Abstract: While multimodal large language models demonstrate strong performance in complex reasoning tasks, they pose significant challenges related to model complexity during deployment, especially for resource-limited devices. In this paper, we propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Conventional methods rely on the training data of the original model to select the proper pruning ratio for different network components. However, these methods are impractical for large vision-language models due to the unaffordable search costs caused by web-scale training corpora. In contrast, our approach only leverages a small number of samples to search for the desired pruning policy by maximizing its generalization ability on unknown training data while maintaining the model accuracy, which enables the achievement of an optimal trade-off between accuracy and efficiency for large visual language models. Specifically, we formulate the generalization gap of the pruning strategy using the structural risk minimization principle. Based on both task performance and generalization capability, we iteratively search for the optimal pruning policy within a given search space and optimize the vision projector to evolve the search space with a higher upper bound of performance. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering. Using only 64 samples for pruning policy search, EfficientLLaVA achieves an accuracy of 83.05% on ScienceQA, along with a $1.8\times$ speedup compared to the dense LLaVA-v1.5-7B model.

Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement

Yuchen Ren,Zhengyu Zhao,Chenhao Lin,Bo Yang,Lu Zhou,Zhe Liu,Chao Shen

Task: Study forward propagation refinement for transferring adversarial examples on Vision Transformers (ViTs).

Motivation: To understand the robustness of ViTs in practical scenarios, the work studies the transferability of adversarial examples and explores refining the forward pass of the surrogate model rather than only the backward pass.

Details

Method: Proposes Forward Propagation Refinement (FPR), which comprises Attention Map Diversification (AMD) and Momentum Token Embedding (MTE). Result: Experiments show that FPR outperforms the current best backward-propagation refinement on adversarial transfer by up to 7.0% on average. Conclusion: FPR offers a clear advantage in adversarial transferability, remains superior against popular defenses, and is compatible with other transfer methods. Abstract: Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is to refine the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement by up to 7.0% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods. Codes and appendix are available at https://github.com/RYC-98/FPR.
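
The abstract says MTE "accumulates historical token embeddings" but does not give the update rule. As a rough illustration only, the sketch below uses an exponential-moving-average blend; the function name `momentum_accumulate` and the `decay` parameter are hypothetical, and the authors' repository should be consulted for the actual formulation.

```python
def momentum_accumulate(history, current, decay=0.9):
    """Blend a running token embedding with the current one.

    Assumption: an EMA-style rule stands in for the paper's accumulation,
    which the abstract does not specify.
    """
    if history is None:
        # First step: nothing to accumulate yet.
        return list(current)
    return [decay * h + (1.0 - decay) * c for h, c in zip(history, current)]


# Toy run over three steps of a 2-dimensional "embedding".
running = None
for embedding in ([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]):
    running = momentum_accumulate(running, embedding, decay=0.5)
```

The running value changes less from step to step than the raw embeddings do, which is the stabilizing effect the abstract describes.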

Visual Persona: Foundation Model for Full-Body Human Customization

Jisu Nam,Soowon Son,Zhan Xu,Jing Shi,Difan Liu,Feng Liu,Aashish Misraa,Seungryong Kim,Yang Zhou

Task: Develop a foundation model that generates customized full-body human images guided by text descriptions.

Motivation: Existing methods focus mainly on preserving facial identity and neglect detailed full-body appearance and alignment with text descriptions.

Details

Method: Proposes a data curation pipeline that uses vision-language models to evaluate full-body appearance consistency, and introduces a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model. Result: Visual Persona surpasses existing approaches in generating high-quality, customized images and shows its versatility across various downstream tasks. Conclusion: Visual Persona generates high-quality customized images from in-the-wild inputs, and extensive ablation studies validate its design choices. Abstract: We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which augments the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.

Learn Your Scales: Towards Scale-Consistent Generative Novel View Synthesis

Fereshteh Forghani,Jason J. Yu,Tristan Aumentado-Armstrong,Konstantinos G. Derpanis,Marcus A. Brubaker

Task: Study and address the effect of scene scale ambiguity on generative novel view synthesis (GNVS) methods.

Motivation: Conventional depth-free multi-view datasets are captured with a moving monocular camera without metric calibration, which leaves the scale of camera positions ambiguous. Previous methods acknowledged this ambiguity through various ad-hoc normalization pre-processing steps but did not directly analyze how incorrect scene scales affect their application.

Details

Method: Studies the effect of scene scale ambiguity on GNVS models when sampling from a single image and, based on these intuitions, defines new metrics that measure the scale inconsistency of generated views. Proposes a framework that estimates scene scales jointly with the GNVS model in an end-to-end fashion. Result: Experiments show that the method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods, and that removing the ambiguity improves the image quality of the resulting GNVS model. Conclusion: Jointly estimating scene scales and the GNVS model effectively reduces scale inconsistency in generated views and improves generated image quality. Abstract: Conventional depth-free multi-view datasets are captured using a moving monocular camera without metric calibration. The scales of camera positions in this monocular setting are ambiguous. Previous methods have acknowledged scale ambiguity in multi-view data via various ad-hoc normalization pre-processing steps, but have not directly analyzed the effect of incorrect scene scales on their application. In this paper, we seek to understand and address the effect of scale ambiguity when used to train generative novel view synthesis methods (GNVS). In GNVS, new views of a scene or object can be minimally synthesized given a single image and are, thus, unconstrained, necessitating the use of generative methods. The generative nature of these models captures all aspects of uncertainty, including any uncertainty of scene scales, which act as nuisance variables for the task. We study the effect of scene scale ambiguity in GNVS when sampled from a single image by isolating its effect on the resulting models and, based on these intuitions, define new metrics that measure the scale inconsistency of generated views. We then propose a framework to estimate scene scales jointly with the GNVS model in an end-to-end fashion. Empirically, we show that our method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods. Further, we show that removing this ambiguity improves generated image quality of the resulting GNVS model.
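
The scale ambiguity mentioned in the abstract follows from standard pinhole projection. The short derivation below is a textbook illustration rather than anything taken from the paper; the symbols $K$ (intrinsics), $R$ and $t$ (camera rotation and translation), $X$ (a 3D point) and $s$ (an arbitrary positive scale) are notation assumed here.

$$
x \sim K\,(R X + t)
\quad\Longrightarrow\quad
K\,\bigl(R\,(sX) + s\,t\bigr) = s\,K\,(R X + t) \sim K\,(R X + t), \qquad s > 0 .
$$

Because homogeneous image coordinates are defined only up to a nonzero factor, scaling the scene and the camera translations by the same $s$ leaves every image unchanged, so uncalibrated monocular captures cannot determine $s$.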

Automated Processing of eXplainable Artificial Intelligence Outputs in Deep Learning Models for Fault Diagnostics of Large Infrastructures

Giovanni Floreale,Piero Baraldi,Enrico Zio,Olga Fink

Task: Propose a new framework that combines post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations and reduce the workload of maintenance decision-makers.

Motivation: Deep learning models that process images to recognize the health state of large infrastructure components can exhibit biases and rely on non-causal shortcuts, and manually analyzing the explanations produced by XAI techniques is time-consuming and error-prone.

Details

Method: Combines post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations, and applies the framework to drone-collected images of insulator shells in power grid infrastructure. Result: Average classification accuracy on two faulty classes improves by 8%, and maintenance operators need to manually reclassify only 15% of the images. Conclusion: The framework achieves higher $F_1$ scores than a state-of-the-art approach based on the faithfulness metric and successfully identifies correct classifications that result from non-causal shortcuts. Abstract: Deep Learning (DL) models processing images to recognize the health state of large infrastructure components can exhibit biases and rely on non-causal shortcuts. eXplainable Artificial Intelligence (XAI) can address these issues but manually analyzing explanations generated by XAI techniques is time-consuming and prone to errors. This work proposes a novel framework that combines post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations that deviate from those of correctly classified images and may therefore indicate abnormal model behavior. This significantly reduces the workload for maintenance decision-makers, who only need to manually reclassify images flagged as having anomalous explanations. The proposed framework is applied to drone-collected images of insulator shells for power grid infrastructure monitoring, considering two different Convolutional Neural Networks (CNNs), GradCAM explanations and Deep Semi-Supervised Anomaly Detection. The average classification accuracy on two faulty classes is improved by 8% and maintenance operators are required to manually reclassify only 15% of the images. We compare the proposed framework with a state-of-the-art approach based on the faithfulness metric: the experimental results obtained demonstrate that the proposed framework consistently achieves $F_1$ scores larger than those of the faithfulness-based approach. Additionally, the proposed framework successfully identifies correct classifications that result from non-causal shortcuts, such as the presence of ID tags printed on insulator shells.

Temporal Regularization Makes Your Video Generator Stronger

Harold Haodong Chen,Haojian Huang,Xianfeng Wu,Yexin Liu,Yajing Bai,Wen-Jie Shu,Harry Yang,Ser-Nam Lim

Task: Explore temporal augmentation in video generation and introduce FluxFlow, a strategy for improving temporal quality.

Motivation: Temporal quality is a critical aspect of video generation, ensuring consistent motion and realistic dynamics across frames, yet achieving high temporal coherence and diversity remains challenging.

Details

Method: FluxFlow operates at the data level, applying controlled temporal perturbations to improve temporal quality without modifying the model architecture. Result: Extensive experiments on the UCF-101 and VBench benchmarks show that FluxFlow significantly improves temporal coherence and diversity across various video generation models while preserving spatial fidelity. Conclusion: Temporal augmentation is a simple yet effective approach with the potential to improve video generation quality. Abstract: Temporal quality is a critical aspect of video generation, as it ensures consistent motion and realistic dynamics across frames. However, achieving high temporal coherence and diversity remains challenging. In this work, we explore temporal augmentation in video generation for the first time, and introduce FluxFlow for initial investigation, a strategy designed to enhance temporal quality. Operating at the data level, FluxFlow applies controlled temporal perturbations without requiring architectural modifications. Extensive experiments on the UCF-101 and VBench benchmarks demonstrate that FluxFlow significantly improves temporal coherence and diversity across various video generation models, including U-Net, DiT, and AR-based architectures, while preserving spatial fidelity. These findings highlight the potential of temporal augmentation as a simple yet effective approach to advancing video generation quality.
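
The abstract does not spell out which "controlled temporal perturbations" FluxFlow applies. As one plausible data-level example, the sketch below swaps a small number of frame positions in a training clip; the function `perturb_frame_order` and its parameters are hypothetical and are not taken from the paper.

```python
import random


def perturb_frame_order(frames, num_swaps=1, seed=None):
    """Return a copy of `frames` with `num_swaps` random pairs of positions swapped.

    Assumption: frame swapping is used here as a stand-in for whatever
    perturbation the paper actually defines.
    """
    rng = random.Random(seed)
    perturbed = list(frames)  # leave the caller's clip untouched
    if len(perturbed) < 2:
        return perturbed
    for _ in range(num_swaps):
        i, j = rng.sample(range(len(perturbed)), 2)  # two distinct positions
        perturbed[i], perturbed[j] = perturbed[j], perturbed[i]
    return perturbed


clip = list(range(8))  # stand-in for 8 video frames
augmented = perturb_frame_order(clip, num_swaps=1, seed=0)
```

Keeping `num_swaps` small relative to the clip length is what makes the perturbation "controlled": the frame content is unchanged and only a little of the ordering is disturbed.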

Visual Position Prompt for MLLM based Visual Grounding

Wei Tang,Yanpeng Sun,Qinying Gu,Zechao Li

Task: Improve the ability of multimodal large language models (MLLMs) to align with spatial information in images, particularly for visual grounding.

Motivation: Existing MLLMs struggle to align with spatial information in images, especially in position-aware tasks such as visual grounding, mainly because they lack explicit spatial references and extract fine-grained spatial details poorly.

Details

Method: Proposes VPP-LLaVA, which introduces a Visual Position Prompt (VPP) to strengthen grounding. VPP-LLaVA integrates two complementary mechanisms: a global VPP that overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues, and a local VPP that focuses on fine-grained localization through position-aware queries. Result: Trained on the VPP-SFT dataset, VPP-LLaVA achieves state-of-the-art results on standard grounding benchmarks while using far fewer training samples than other MLLMs such as MiniGPT-v2. Conclusion: By introducing the Visual Position Prompt, VPP-LLaVA significantly improves MLLM performance on visual grounding and shows that efficient training is possible with fewer samples. Abstract: Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address this issue, we introduce VPP-LLaVA, an MLLM equipped with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms. The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues. The local VPP focuses on fine-grained localization by incorporating position-aware queries, which suggest probable object locations. We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training. Training on this dataset with VPP enhances the model's performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples compared to other MLLMs like MiniGPT-v2, which rely on much larger datasets ($\sim$21M samples). The code and VPP-SFT dataset will be available at https://github.com/WayneTomas/VPP-LLaVA upon acceptance.

V2X-DG: Domain Generalization for Vehicle-to-Everything Cooperative Perception

Baolu Li,Zongzhe Xu,Jinlong Li,Xinyu Liu,Jianwu Fang,Xiaopeng Li,Hongkai Yu

Task: Study domain generalization for LiDAR-based V2X cooperative perception (V2X-DG) to improve the generalization of 3D detection.

Motivation: Current cooperative perception algorithms are trained and tested on the same dataset, so the generalization ability of cooperative perception systems remains underexplored.

Details

Method: Proposes Cooperative Mixup Augmentation based Generalization (CMAG) and a Cooperation Feature Consistency (CFC) constraint to improve generalization to unseen domains. Result: Experiments show significant performance gains on unseen datasets while strong performance is maintained on the source dataset. Conclusion: The proposed method effectively improves the generalization ability of LiDAR-based V2X cooperative perception systems. Abstract: LiDAR-based Vehicle-to-Everything (V2X) cooperative perception has demonstrated its impact on the safety and effectiveness of autonomous driving. Since current cooperative perception algorithms are trained and tested on the same dataset, the generalization ability of cooperative perception systems remains underexplored. This paper is the first work to study the Domain Generalization problem of LiDAR-based V2X cooperative perception (V2X-DG) for 3D detection based on four widely-used open source datasets: OPV2V, V2XSet, V2V4Real and DAIR-V2X. Our research seeks to sustain high performance not only within the source domain but also across other unseen domains, achieved solely through training on the source domain. To this end, we propose Cooperative Mixup Augmentation based Generalization (CMAG) to improve the model generalization capability by simulating the unseen cooperation, which is designed compactly for the domain gaps in cooperative perception. Furthermore, we propose a constraint for the regularization of the robust generalized feature representation learning: Cooperation Feature Consistency (CFC), which aligns the intermediately fused features of the generalized cooperation by CMAG and the early fused features of the original cooperation in the source domain. Extensive experiments demonstrate that our approach achieves significant performance gains when generalizing to other unseen datasets while it also maintains strong performance on the source dataset.

MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Lixing Xiao,Shunlin Lu,Huaijin Pi,Ke Fan,Liang Pan,Yueer Zhou,Ziyong Feng,Xiaowei Zhou,Sida Peng,Jingbo Wang

Task: Address text-conditioned streaming motion generation, predicting the next-step human pose from variable-length historical motions and incoming text.

Motivation: Existing methods struggle with streaming motion generation: diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation caused by discretized non-causal tokenization.

Details

Method: Proposes MotionStreamer, a framework that incorporates a continuous causal latent space into a probabilistic autoregressive model, reducing the information loss caused by discretization and the error accumulation in long-term autoregressive generation. Result: Experiments show that the method outperforms existing approaches and enables more applications, including multi-round generation, long-term generation, and dynamic motion composition. Conclusion: With a continuous causal latent space and a probabilistic autoregressive model, MotionStreamer effectively addresses the problems of streaming motion generation and has broad application potential. Abstract: This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation problems due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/

Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator

Yuanzhi Zhu,Xi Wang,Stéphane Lathuilière,Vicky Kalogeiton

Task: Propose Di[M]O, a new method that distills masked diffusion models into a one-step generator.

Motivation: Masked diffusion models generate high-quality results but are slow at inference because they require multiple generation steps.

Details

Method: Uses token-level distribution matching and a token initialization strategy to address, respectively, the unavailability of intermediate-step information and the lack of entropy in the initial distribution. Result: Di[M]O performs well on both class-conditional and text-conditional image generation while drastically reducing inference time. Conclusion: Di[M]O is the first to achieve one-step distillation of masked diffusion models and opens new paths for efficient generative modeling. Abstract: Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference with several steps. In this paper, we propose Di$\mathtt{[M]}$O, a novel approach that distills masked diffusion models into a one-step generator. Di$\mathtt{[M]}$O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes model output logits by an 'on-policy framework' with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to the teacher training distribution. We show Di$\mathtt{[M]}$O's effectiveness on both class-conditional and text-conditional image generation, impressively achieving performance competitive to multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.

FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers

Ruichen Chen,Keith G. Mills,Di Niu

Task: Propose FP4DiT, a post-training quantization method based on floating-point quantization, for low-bit quantization of Diffusion Transformer models.

Motivation: Existing post-training quantization methods mainly target diffusion models with convolutional U-Net backbones, and integer quantization does not align well with network weight and activation distributions in low-bit settings.

Details

Method: Extends and generalizes the Adaptive Rounding post-training quantization technique to adequately calibrate weight quantization for floating-point quantization, and proposes robust online activation quantization. Result: FP4DiT outperforms integer-based post-training quantization at W4A6 and W4A8 precision and generates convincing visual content on PixArt-α, PixArt-Σ and Hunyuan. Conclusion: FP4DiT aligns weight and activation distributions better in low-bit settings, suits Diffusion Transformer models, and demonstrates its advantage on image generation tasks. Abstract: Diffusion Models (DM) have revolutionized the text-to-image visual generation process. However, the large computational cost and model footprint of DMs hinder practical deployment, especially on edge devices. Post-training quantization (PTQ) is a lightweight method to alleviate these burdens without the need for training or fine-tuning. While recent DM PTQ methods achieve W4A8 on integer-based PTQ, two key limitations remain: First, while most existing DM PTQ methods evaluate on classical DMs like Stable Diffusion XL, 1.5 or earlier, which use convolutional U-Nets, newer Diffusion Transformer (DiT) models like the PixArt series, Hunyuan and others adopt fundamentally different transformer backbones to achieve superior image synthesis. Second, integer (INT) quantization is prevailing in DM PTQ but doesn't align well with the network weight and activation distribution, while Floating-Point Quantization (FPQ) is still under-investigated, yet it holds the potential to better align the weight and activation distributions in low-bit settings for DiT. In response, we introduce FP4DiT, a PTQ method that leverages FPQ to achieve W4A6 quantization. Specifically, we extend and generalize the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for FPQ and demonstrate that DiT activations depend on input patch data, necessitating robust online activation quantization techniques. Experimental results demonstrate that FP4DiT outperforms integer-based PTQ at W4A6 and W4A8 precision and generates convincing visual content on PixArt-$\alpha$, PixArt-$\Sigma$ and Hunyuan in terms of several T2I metrics such as HPSv2 and CLIP.
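
To make the contrast with integer quantization concrete, the sketch below rounds weights to the nearest value on a 4-bit floating-point grid. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) and plain round-to-nearest with a single per-tensor scale; the abstract does not state which FP4 format FP4DiT uses, and the paper's Adaptive Rounding calibration is not reproduced here. The name `quantize_fp4` is hypothetical.

```python
# Non-negative magnitudes representable in E2M1 (assumed FP4 layout).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_fp4(weights, scale=None):
    """Round each weight to the nearest scaled E2M1 value.

    Assumption: round-to-nearest with one scale per tensor; this is a
    baseline, not the Adaptive Rounding scheme the paper extends.
    """
    if not weights:
        return []
    if scale is None:
        # Map the largest magnitude onto the top of the grid.
        max_abs = max(abs(w) for w in weights)
        scale = max_abs / E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    quantized = []
    for w in weights:
        magnitude = abs(w) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - magnitude))
        quantized.append((-nearest if w < 0 else nearest) * scale)
    return quantized


example = quantize_fp4([6.0, -3.2, 0.4, 1.2])
```

The grid is denser near zero (steps of 0.5) and coarser near the maximum (steps of 2), whereas an integer grid is uniformly spaced. That non-uniform spacing is the property the abstract refers to when it says floating-point quantization may better match weight and activation distributions.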

EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining

Boshen Xu,Yuting Mei,Xinbi Liu,Sipeng Zheng,Qin Jin

Task: Jointly train the Egocentric Depth- and Text-aware Model (EgoDTM) through large-scale 3D-aware video pretraining and video-text contrastive learning.

Motivation: Humans perceive and interact with a fully 3D world, developing spatial awareness that goes beyond text-based understanding. Most previous work, however, learns from 1D text or 2D visual cues such as bounding boxes, which inherently lack 3D understanding.

Details

Method: Introduces EgoDTM, which incorporates a lightweight 3D-aware decoder to efficiently learn 3D awareness from pseudo depth maps generated by depth estimation models. The original brief captions are enriched with hand-object visual cues by organically combining several foundation models. Result: Extensive experiments show that EgoDTM performs strongly across diverse downstream tasks, demonstrating superior 3D-aware visual understanding. Conclusion: Through 3D-aware video pretraining and video-text contrastive learning, EgoDTM significantly improves video representation learning, particularly 3D-aware visual understanding. Abstract: Egocentric video-language pretraining has significantly advanced video representation learning. Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. However, most previous works learn from 1D text or 2D visual cues, such as bounding boxes, which inherently lack 3D understanding. To bridge this gap, we introduce EgoDTM, an Egocentric Depth- and Text-aware Model, jointly trained through large-scale 3D-aware video pretraining and video-text contrastive learning. EgoDTM incorporates a lightweight 3D-aware decoder to efficiently learn 3D-awareness from pseudo depth maps generated by depth estimation models. To further facilitate 3D-aware video pretraining, we enrich the original brief captions with hand-object visual cues by organically combining several foundation models. Extensive experiments demonstrate EgoDTM's superior performance across diverse downstream tasks, highlighting its superior 3D-aware visual understanding. Our code will be released at https://github.com/xuboshen/EgoDTM.

Toward task-driven satellite image super-resolution

Maciej Ziaja,Pawel Kowaleczko,Daniel Kostrzewa,Nicolas Longépé,Michal Kawulok

Task: Learn super-resolution algorithms that generate high-resolution images suitable for automated image analysis.

Motivation: Existing super-resolution methods produce images of high perceptual quality, but it is unclear whether the reconstructed details are close to the ground truth and whether they are more valuable to image analysis algorithms.

Details

Method: Proposes a methodology for assessing whether existing models for computer vision tasks can be used to evaluate super-resolution reconstruction algorithms and to train them in a task-driven way. Result: An experimental study supports the analysis and lays the groundwork for selecting appropriate computer vision tasks. Conclusion: The study provides a solid foundation for advancing real-world super-resolution capabilities. Abstract: Super-resolution is aimed at reconstructing high-resolution images from low-resolution observations. State-of-the-art approaches underpinned with deep learning allow for obtaining outstanding results, generating images of high perceptual quality. However, it often remains unclear whether the reconstructed details are close to the actual ground-truth information and whether they constitute a more valuable source for image analysis algorithms. In the reported work, we address the latter problem, and we present our efforts toward learning super-resolution algorithms in a task-driven way to make them suitable for generating high-resolution images that can be exploited for automated image analysis. In the reported initial research, we propose a methodological approach for assessing the existing models that perform computer vision tasks in terms of whether they can be used for evaluating super-resolution reconstruction algorithms, as well as training them in a task-driven way. We support our analysis with an experimental study and we expect it to establish a solid foundation for selecting appropriate computer vision tasks that will advance the capabilities of real-world super-resolution.

Cube: A Roblox View of 3D Intelligence

Foundation AI Team,Kiran Bhat,Nishchaie Khanna,Karun Channa,Tinghui Zhou,Yiheng Zhu,Xiaoxia Sun,Charles Shang,Anirudh Sudarshan,Maurice Chu,Daiqing Li,Kangle Deng,Jean-Philippe Fauconnier,Tijmen Verhulsdonck,Maneesh Agrawala,Kayvon Fatahalian,Alexander Weiss,Christian Reiser,Ravi Kiran Chirravuri,Ravali Kandur,Alejandro Pelaez,Akash Garg,Michael Palleschi,Jessica Wang,Skylar Litz,Leon Liu,Anying Li,David Harmon,Derek Liu,Liangjun Feng,Denis Goupil,Lukas Kuczynski,Jihyun Yoon,Naveen Marri,Peiye Zhuang,Yinan Zhang,Brian Yin,Haomiao Jiang,Marcel van Workum,Thomas Lane,Bryce Erickson,Salil Pathare,Kyle Price,Anupam Singh,David Baszucki

Task: Build a foundation model for 3D intelligence that helps developers generate 3D objects and scenes, rig characters for animation, and produce programmatic scripts describing object behaviors.

Motivation: Foundation models have shown remarkable reasoning and generation capabilities in text, images, audio and video; the goal is to extend them to 3D intelligence so that Roblox developers can create complete experiences.

Details

Method: Presents three key design requirements for a 3D foundation model and describes a 3D shape tokenizer. Shows how the tokenization scheme applies to text-to-shape generation, shape-to-text generation and text-to-scene generation. Result: Demonstrates how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. Conclusion: Outlines the path toward a fully unified foundation model for 3D intelligence. Abstract: Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence, a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes to rigging characters for animation to producing programmatic scripts describing object behaviors. We discuss three key design requirements for such a 3D foundation model and then present our first step towards building such a model. We expect that 3D geometric shapes will be a core data type and describe our solution for a 3D shape tokenizer. We show how our tokenization scheme can be used in applications for text-to-shape generation, shape-to-text generation and text-to-scene generation. We demonstrate how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. We conclude with a discussion outlining our path to building a fully unified foundation model for 3D intelligence.

TULIP: Towards Unified Language-Image Pretraining

Zineng Tang,Long Lian,Seun Eisape,XuDong Wang,Roei Herzig,Adam Yala,Alane Suhr,Trevor Darrell,David M. Chan

Task: Propose TULIP to address the shortcomings of existing image-text contrastive models on vision-centric tasks.

Motivation: Existing image-text contrastive models such as CLIP and SigLIP perform poorly on tasks that demand high-fidelity image understanding (counting, depth estimation, fine-grained object recognition), while vision-focused models lack flexibility on language-driven tasks.

Details

Method: Uses generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Result: TULIP outperforms existing state-of-the-art models across multiple benchmarks: it sets a new state of the art for zero-shot performance on ImageNet-1K, delivers up to a 2× improvement over SigLIP on RxRx1 in linear probing for few-shot classification, and, when used to improve vision-language models, scores more than 3× higher than SigLIP on MMVP. Conclusion: By combining generative data augmentation with contrastive learning, TULIP significantly improves both image understanding and language alignment, offering a stronger basis for vision-language tasks. Abstract: Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. These models, by performing language alignment, tend to prioritize high-level semantics over visual understanding, weakening their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a $2\times$ enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over $3\times$ higher scores than SigLIP on MMVP. Our code/checkpoints are available at https://tulip-berkeley.github.io
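
The abstract names image-image and text-text contrastive learning without giving the loss. The sketch below shows a generic InfoNCE objective over paired embeddings, which is one common way to implement a contrastive term for any pair of views. It is an assumption for illustration, not TULIP's actual loss; `info_nce` and `temperature` are hypothetical names.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where anchors[i] is paired with positives[i].

    Assumption: cosine similarity with a softmax over the batch; all
    vectors must be non-zero so they can be normalized.
    """
    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in p]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


views = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(views, views)
mismatched_loss = info_nce(views, views[::-1])
```

The loss is small when each anchor is closest to its own positive and large when the pairs are mismatched, so minimizing it pulls matching views together and pushes non-matching ones apart.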

SDF-TopoNet: A Two-Stage Framework for Tubular Structure Segmentation via SDF Pre-training and Topology-Aware Fine-Tuning

Siyi Wu,Leyi Zhao,Haitian Ma,Xinyuan Song

Task: Propose SDF-TopoNet, an improved topology-aware segmentation framework for accurately segmenting tubular and curvilinear structures.

Motivation: Existing methods that enforce topological correctness are computationally expensive and insensitive to pixel-level accuracy, so they need additional loss terms to compensate.

Details

Method: Proposes a two-stage training strategy: pre-training uses the signed distance function (SDF) as an auxiliary learning target, and fine-tuning combines a dynamic adapter with a refined topological loss. Result: Experiments on five benchmark datasets show that SDF-TopoNet outperforms existing methods in both topological accuracy and quantitative segmentation metrics while significantly reducing training complexity. Conclusion: SDF-TopoNet offers clear advantages in segmentation accuracy and training efficiency. Abstract: Accurate segmentation of tubular and curvilinear structures, such as blood vessels, neurons, and road networks, is crucial in various applications. A key challenge is ensuring topological correctness while maintaining computational efficiency. Existing approaches often employ topological loss functions based on persistent homology, such as Betti error, to enforce structural consistency. However, these methods suffer from high computational costs and are insensitive to pixel-level accuracy, often requiring additional loss terms like Dice or MSE to compensate. To address these limitations, we propose SDF-TopoNet, an improved topology-aware segmentation framework that enhances both segmentation accuracy and training efficiency. Our approach introduces a novel two-stage training strategy. In the pre-training phase, we utilize the signed distance function (SDF) as an auxiliary learning target, allowing the model to encode topological information without directly relying on computationally expensive topological loss functions. In the fine-tuning phase, we incorporate a dynamic adapter alongside a refined topological loss to ensure topological correctness while mitigating overfitting and computational overhead. We evaluate our method on five benchmark datasets. Experimental results demonstrate that SDF-TopoNet outperforms existing methods in both topological accuracy and quantitative segmentation metrics, while significantly reducing training complexity.
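
For readers unfamiliar with the pre-training target, the sketch below computes a signed distance function from a small binary mask by brute force. The sign convention (negative inside the structure, positive outside) and the pixel-center distance are assumptions made here; the abstract does not say which convention or implementation the authors use, and a production pipeline would normally use a dedicated distance-transform routine instead.

```python
import math


def signed_distance(mask):
    """Brute-force SDF for a small binary mask given as a list of rows.

    Each pixel gets the Euclidean distance to the nearest pixel of the
    opposite class: negative for foreground, positive for background.
    Cost is O(N^2) in the number of pixels, so this is for illustration.
    """
    height, width = len(mask), len(mask[0])
    cells = [(r, c) for r in range(height) for c in range(width)]
    foreground = [(r, c) for r, c in cells if mask[r][c]]
    background = [(r, c) for r, c in cells if not mask[r][c]]
    sdf = [[0.0] * width for _ in range(height)]
    for r, c in cells:
        targets = background if mask[r][c] else foreground
        if not targets:
            continue  # mask is uniform, so no boundary exists
        distance = min(math.hypot(r - tr, c - tc) for tr, tc in targets)
        sdf[r][c] = -distance if mask[r][c] else distance
    return sdf


# A one-pixel-thick horizontal "vessel" in a 5x3 image.
vessel = [
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
]
field = signed_distance(vessel)
```

Unlike the binary mask, the field varies smoothly with distance from the structure, which gives a regression target that carries shape information at every pixel.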

Frans Zdyb,Albert Alonso,Julius B. Kirkegaard

Task: Detect slender, overlapping structures in computational microscopy.

Motivation: Recent coordinate-based approaches improve detection but often produce less accurate splines than pixel-based methods.

Details

Method: A training-free differentiable rendering approach to spline refinement that achieves both high reliability and sub-pixel accuracy.

Result: The method improves spline quality, enhances robustness to distribution shifts, and narrows the gap between synthetic and real-world data.

Conclusion: Evaluated on C. elegans nematodes, a popular model organism, the method combines the strengths of coordinate- and pixel-based approaches.

Abstract: Detecting slender, overlapping structures remains a challenge in computational microscopy. While recent coordinate-based approaches improve detection, they often produce less accurate splines than pixel-based methods. We introduce a training-free differentiable rendering approach to spline refinement, achieving both high reliability and sub-pixel accuracy. Our method improves spline quality, enhances robustness to distribution shifts, and shrinks the gap between synthetic and real-world data. Being fully unsupervised, the method is a drop-in replacement for the popular active contour model for spline refinement. Evaluated on C. elegans nematodes, a popular model organism for drug discovery and biomedical research, we demonstrate that our approach combines the strengths of both coordinate- and pixel-based methods.

Ship Detection in Remote Sensing Imagery for Arbitrarily Oriented Object Detection

Bibi Erum Ayesha,T. Satyanarayana Murthy,Palamakula Ramesh Babu,Ramu Kuchipudi

Task: Develop a ship detection system for maritime surveillance and ecological monitoring.

Motivation: Conventional ship detection struggles with arbitrary orientations, complex backgrounds, and obscured perspectives.

Details

Method: YOLOv8 for real-time processing and a repurposed U-Net for ship instance segmentation.

Result: YOLOv8 achieves an 88% mAP and U-Net an 89% mAP, improving ship detection accuracy and boundary delineation.

Conclusion: The study shows the potential of deep learning models for ship detection, with benefits for maritime surveillance, disaster response, and ecological monitoring.

Abstract: This research paper presents an innovative ship detection system tailored for applications like maritime surveillance and ecological monitoring. The study employs YOLOv8 and repurposed U-Net, two advanced deep learning models, to significantly enhance ship detection accuracy. Evaluation metrics include Mean Average Precision (mAP), processing speed, and overall accuracy. The research utilizes the "Airbus Ship Detection" dataset, featuring diverse remote sensing images, to assess the models' versatility in detecting ships with varying orientations and environmental contexts. Conventional ship detection faces challenges with arbitrary orientations, complex backgrounds, and obscured perspectives. Our approach incorporates YOLOv8 for real-time processing and U-Net for ship instance segmentation. Evaluation focuses on mAP, processing speed, and overall accuracy. The dataset is chosen for its diverse images, making it an ideal benchmark. Results demonstrate significant progress in ship detection. YOLOv8 achieves an 88% mAP, excelling in accurate and rapid ship detection. U-Net, adapted for ship instance segmentation, attains an 89% mAP, improving boundary delineation and handling occlusions. This research enhances maritime surveillance, disaster response, and ecological monitoring, exemplifying the potential of deep learning models in ship detection.

Praveen Shastry,Sowmya Chowdary Muthulur,Naveen Kumarasami,Anandakumar D,Mounigasri M,Keerthana R,Kishore Prasath Venkatesh,Bargava Subramanian,Kalyan Sivasailam,Revathi Ezhumalai,Abitha Marimuthu

Task: Propose a Vision-Language Model (VLM) that uses the SIGLIP encoder and the Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening.

Motivation: Integrating chest X-ray images with clinical data addresses the challenges of manual interpretation and improves diagnostic consistency and accessibility, particularly in resource-constrained settings.

Details

Method: The VLM architecture combines a Vision Transformer (ViT) for visual encoding with a transformer-based text encoder that processes clinical context such as patient histories and treatment records. Cross-modal attention aligns radiographic features with textual information, and the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned on 100,000 chronic TB-specific chest X-rays.

Result: The model reached high precision (94%) and recall (94%) in detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores exceeded 0.93 and Intersection over Union (IoU) values were above 0.91, supporting its effectiveness in detecting and localizing TB-related abnormalities.

Conclusion: The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable, context-aware insights. Future work will address subtle pathologies and dataset biases to improve generalizability and ensure equitable performance across diverse populations and healthcare settings.

Abstract: Background: This study proposes a Vision-Language Model (VLM) leveraging the SIGLIP encoder and Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening. By integrating chest X-ray images with clinical data, the model addresses the challenges of manual interpretation, improving diagnostic consistency and accessibility, particularly in resource-constrained settings. Methods: The VLM architecture combines a Vision Transformer (ViT) for visual encoding and a transformer-based text encoder to process clinical context, such as patient histories and treatment records. Cross-modal attention mechanisms align radiographic features with textual information, while the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned using 100,000 chronic TB-specific chest X-rays. Results: The model demonstrated high precision (94 percent) and recall (94 percent) for detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores exceeded 0.93, and Intersection over Union (IoU) values were above 0.91, validating its effectiveness in detecting and localizing TB-related abnormalities. Conclusion: The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable and context-aware insights. Future work will address subtle pathologies and dataset biases to enhance the model's generalizability, ensuring equitable performance across diverse populations and healthcare settings.
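
The reported metrics (precision, recall, IoU) can be pinned down with a short sketch. The helper names and the example counts below are hypothetical and only illustrate the definitions; they are not the study's evaluation code or data.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts chosen so that both metrics come out at 94%.
p, r = precision_recall(tp=94, fp=6, fn=6)
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))  # overlap 1, union 7
```

Precision and recall score whether a finding was detected at all, while IoU scores how well a predicted region overlaps the annotated one, which is why the abstract reports them separately.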

Vision-Language Models for Acute Tuberculosis Diagnosis: A Multimodal Approach Combining Imaging and Clinical Data

Ananya Ganapthy,Praveen Shastry,Naveen Kumarasami,Anandakumar D,Keerthana R,Mounigasri M,Varshinipriya M,Kishore Prasath Venkatesh,Bargava Subramanian,Kalyan Sivasailam

Task: Use a Vision-Language Model (VLM) built on the SIGLIP and Gemma-3b architectures for automated acute tuberculosis (TB) screening.

Motivation: Integrating chest X-ray images with clinical notes can improve diagnostic accuracy and efficiency, particularly in resource-limited settings.

Details

Method: The VLM combines visual data from chest X-rays with clinical context to generate detailed, context-aware diagnostic reports. The architecture uses SIGLIP for visual encoding and Gemma-3b for decoding to represent acute TB-specific pathologies and clinical insights.

Result: Key acute TB pathologies, including consolidation, cavities, and nodules, were detected with high precision (97%) and recall (96%). The model showed strong spatial localization and robustness in distinguishing TB-positive cases.

Conclusion: The multimodal capability of the VLM reduces reliance on radiologists and provides a scalable solution for acute TB screening. Future work will focus on detecting subtle pathologies and addressing dataset biases to improve generalizability across diverse global healthcare settings.

Abstract: Background: This study introduces a Vision-Language Model (VLM) leveraging SIGLIP and Gemma-3b architectures for automated acute tuberculosis (TB) screening. By integrating chest X-ray images and clinical notes, the model aims to enhance diagnostic accuracy and efficiency, particularly in resource-limited settings. Methods: The VLM combines visual data from chest X-rays with clinical context to generate detailed, context-aware diagnostic reports. The architecture employs SIGLIP for visual encoding and Gemma-3b for decoding, ensuring effective representation of acute TB-specific pathologies and clinical insights. Results: Key acute TB pathologies, including consolidation, cavities, and nodules, were detected with high precision (97 percent) and recall (96 percent). The model demonstrated strong spatial localization capabilities and robustness in distinguishing TB-positive cases, making it a reliable tool for acute TB diagnosis. Conclusion: The multimodal capability of the VLM reduces reliance on radiologists, providing a scalable solution for acute TB screening. Future work will focus on improving the detection of subtle pathologies and addressing dataset biases to enhance its generalizability and application in diverse global healthcare settings.

AI-Driven Rapid Identification of Bacterial and Fungal Pathogens in Blood Smears of Septic Patients

Agnieszka Sroka-Oleksiak,Adam Pardyl,Dawid Rymarczyk,Aldona Olechowska-Jarząb,Katarzyna Biegun-Drożdż,Dorota Ochońska,Michał Wronka,Adriana Borowa,Tomasz Gosiewski,Miłosz Adamczyk,Henryk Telega,Bartosz Zieliński,Monika Brzychczy-Włoch

Task: Use deep learning to identify 14 bacteria species and 3 yeast-like fungi in Gram-stained smears from sepsis patients.

Motivation: Traditional microbiological methods are time-consuming and expensive, while sepsis requires rapid diagnosis and treatment.

Details

Method: The Cellpose 3 model for segmentation and Attention-based Deep Multiple Instance Learning for classification.

Result: The model reached an accuracy of 77.15% for bacteria and 71.39% for fungi, with ROC AUC of 0.97 and 0.88, respectively.

Conclusion: The study confirms the model's potential for microbial classification but also shows the need for further optimisation and a larger training set. In the future the technology could support microbial diagnosis, shortening diagnostic time and improving the effectiveness of sepsis treatment.

Abstract: Sepsis is a life-threatening condition which requires rapid diagnosis and treatment. Traditional microbiological methods are time-consuming and expensive. In response to these challenges, deep learning algorithms were developed to identify 14 bacteria species and 3 yeast-like fungi from microscopic images of Gram-stained smears of positive blood samples from sepsis patients. A total of 16,637 Gram-stained microscopic images were used in the study. The analysis used the Cellpose 3 model for segmentation and Attention-based Deep Multiple Instance Learning for classification. Our model achieved an accuracy of 77.15% for bacteria and 71.39% for fungi, with ROC AUC of 0.97 and 0.88, respectively. The highest values, reaching up to 96.2%, were obtained for Cutibacterium acnes, Enterococcus faecium, Stenotrophomonas maltophilia and Nakaseomyces glabratus. Classification difficulties were observed in closely related species, such as Staphylococcus hominis and Staphylococcus haemolyticus, due to morphological similarity, and within Candida albicans due to high morphotic diversity. The study confirms the potential of our model for microbial classification, but it also indicates the need for further optimisation and expansion of the training data set. In the future, this technology could support microbial diagnosis, reducing diagnostic time and improving the effectiveness of sepsis treatment due to its simplicity and accessibility. Part of the results presented in this publication was covered by a patent application at the European Patent Office EP24461637.1 "A computer implemented method for identifying a microorganism in a blood and a data processing system therefor".
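
As a rough illustration of the classification stage, the sketch below shows attention pooling as commonly used in attention-based multiple instance learning: each instance (for example, one segmented cell) gets a learned weight, and the bag (the whole smear image) is the weighted sum of instance features. The abstract does not give the architecture, so the scoring form `w . tanh(V h)`, the parameter values, and the feature vectors are assumptions.

```python
import math

def attention_mil_pool(instances, V, w):
    """Pool instance features into one bag feature with softmax attention.

    instances: list of feature vectors; V: hidden x dim matrix; w: hidden vector.
    Score of instance h is w . tanh(V h); weights are a softmax over scores.
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights

# Toy parameters; in practice V and w are learned jointly with the classifier.
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bag, weights = attention_mil_pool(cells, V, w)
```

The attention weights also indicate which instances drove the bag-level prediction, which is useful when only image-level species labels are available.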

The Impact of Artificial Intelligence on Emergency Medicine: A Review of Recent Advances

Gustavo Correia,Victor Alves,Paulo Novais

Task: Review applications of artificial intelligence in emergency imaging over the last five years.

Motivation: Explore the potential of AI in emergency medicine, particularly for improving diagnostic processes and patient outcomes.

Details

Method: A review and analysis of studies from the last five years on AI in emergency imaging, with a focus on machine learning and deep learning techniques.

Result: The reviewed studies show AI accurately detecting conditions such as fractures, pneumothorax, and pulmonary diseases, and predicting clinical outcomes such as the need for mechanical ventilation.

Conclusion: Despite challenges such as data privacy, algorithmic bias, and the need for extensive validation, AI has transformative potential in emergency settings and should be combined with clinical expertise to raise patient care standards.

Abstract: Artificial Intelligence (AI) is revolutionizing emergency medicine by enhancing diagnostic processes and improving patient outcomes. This article provides a review of the current applications of AI in emergency imaging studies, focusing on the last five years of advancements. AI technologies, particularly machine learning and deep learning, are pivotal in interpreting complex imaging data, offering rapid, accurate diagnoses and potentially surpassing traditional diagnostic methods. Studies highlighted within the article demonstrate AI's capabilities in accurately detecting conditions such as fractures, pneumothorax, and pulmonary diseases from various imaging modalities including X-rays, CT scans, and MRIs. Furthermore, AI's ability to predict clinical outcomes like mechanical ventilation needs illustrates its potential in crisis resource optimization. Despite these advancements, the integration of AI into clinical practice presents challenges such as data privacy, algorithmic bias, and the need for extensive validation across diverse settings. This review underscores the transformative potential of AI in emergency settings, advocating for a future where AI and clinical expertise synergize to elevate patient care standards.

Novel AI-Based Quantification of Breast Arterial Calcification to Predict Cardiovascular Risk

Theodorus Dapamede,Aisha Urooj,Vedant Joshi,Gabrielle Gershon,Frank Li,Mohammadreza Chavoshi,Beatrice Brown-Mulry,Rohan Satya Isaac,Aawez Mansuri,Chad Robichaux,Chadi Ayoub,Reza Arsanjani,Laurence Sperling,Judy Gichoya,Marly van Assen,Charles W. ONeill,Imon Banerjee,Hari Trivedi

Task: Identify women at high risk of cardiovascular disease by automatically quantifying breast arterial calcification (BAC) on screening mammograms.

Motivation: Women are underdiagnosed and undertreated for cardiovascular disease; automatic BAC quantification allows cardiovascular risk assessment during routine mammography.

Details

Method: A transformer-based neural network quantified BAC severity (no BAC, mild, moderate, and severe) on the screening mammograms of 116,135 women.

Result: BAC severity was independently associated with major adverse cardiovascular events (MACE) and remained significant across all age groups; even mild BAC indicated increased risk in women under 50.

Conclusion: Automated BAC quantification enables cardiovascular risk assessment during routine mammography without additional radiation or cost, and offers potential for early cardiovascular risk stratification, particularly in younger women.

Abstract: Women are underdiagnosed and undertreated for cardiovascular disease. Automatic quantification of breast arterial calcification on screening mammography can identify women at risk for cardiovascular disease and enable earlier treatment and management of disease. In this retrospective study of 116,135 women from two healthcare systems, a transformer-based neural network quantified BAC severity (no BAC, mild, moderate, and severe) on screening mammograms. Outcomes included major adverse cardiovascular events (MACE) and all-cause mortality. BAC severity was independently associated with MACE after adjusting for cardiovascular risk factors, with increasing hazard ratios from mild (HR 1.18-1.22), moderate (HR 1.38-1.47), to severe BAC (HR 2.03-2.22) across datasets (all p<0.001). This association remained significant across all age groups, with even mild BAC indicating increased risk in women under 50. BAC remained an independent predictor when analyzed alongside ASCVD risk scores, showing significant associations with myocardial infarction, stroke, heart failure, and mortality (all p<0.005). Automated BAC quantification enables opportunistic cardiovascular risk assessment during routine mammography without additional radiation or cost. This approach provides value beyond traditional risk factors, particularly in younger women, offering potential for early CVD risk stratification in the millions of women undergoing annual mammography.

Synchronous vs Asynchronous Reinforcement Learning in a Real World Robot

Ali Parsaee,Fahim Shahriar,Chuxin He,Ruiqing Tan

Task: Compare the performance of asynchronous and synchronous reinforcement learning on a physical robot.

Motivation: State-of-the-art RL algorithms do not account for the time that decision-making and gradient updates take in a physical environment, which can hurt a learning agent in rapidly changing conditions.

Details

Method: A performance comparison of asynchronous and synchronous RL on a Franka Emika Panda robotic arm.

Result: With asynchronous RL the agents learn faster and attain significantly higher returns, and agents with faster response times perform better.

Conclusion: Asynchronous RL has a clear performance advantage on physical robots, especially in rapidly changing environments.

Abstract: In recent times, reinforcement learning (RL) with physical robots has attracted the attention of a wide range of researchers. However, state-of-the-art RL algorithms do not consider that physical environments do not wait for the RL agent to make decisions or updates. RL agents learn by periodically conducting computationally expensive gradient updates. When decision-making and gradient update tasks are carried out sequentially by the RL agent in a physical robot, it significantly increases the agent's response time. In a rapidly changing environment, this increased response time may be detrimental to the performance of the learning agent. Asynchronous RL methods, which separate the computation of decision-making and gradient updates, are a potential solution to this problem. However, only a few comparisons between asynchronous and synchronous RL have been made with physical robots. For this reason, the exact performance benefits of using asynchronous RL methods over synchronous RL methods are still unclear. In this study, we provide a performance comparison between asynchronous and synchronous RL using a physical robotic arm called Franka Emika Panda. Our experiments show that the agents learn faster and attain significantly more returns using asynchronous RL. Our experiments also demonstrate that the learning agent with a faster response time performs better than the agent with a slower response time, even if the agent with a slower response time performs a higher number of gradient updates.
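
The response-time argument can be illustrated with a toy timing model. This is a sketch, not the authors' experimental code: the function name, the millisecond figures, and the update schedule are hypothetical. In the synchronous case a gradient update blocks the next action; in the asynchronous case updates run in a separate process, so the acting loop only pays for action selection.

```python
def response_times(n_steps, action_ms, update_ms, update_every, asynchronous):
    """Per-step delay (ms) between observing and acting in a toy timing model."""
    delays = []
    for step in range(1, n_steps + 1):
        delay = action_ms
        if not asynchronous and step % update_every == 0:
            delay += update_ms  # the update blocks the next action
        delays.append(delay)
    return delays

# Hypothetical timings: 5 ms to pick an action, 50 ms per gradient update.
sync = response_times(10, action_ms=5, update_ms=50, update_every=2, asynchronous=False)
asyn = response_times(10, action_ms=5, update_ms=50, update_every=2, asynchronous=True)
worst_sync, worst_async = max(sync), max(asyn)
```

In this model the worst-case delay of the synchronous agent grows with the cost of an update, while the asynchronous agent's delay stays constant. The model ignores the cost of the asynchronous agent acting on slightly stale parameters.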

Squeeze Out Tokens from Sample for Finer-Grained Data Governance

Weixiong Lin,Chen Ju,Haicheng Wang,Shengchao Hu,Shuai Xiao,Mengting Chen,Yuheng Jiao,Mingshuai Yao,Jinsong Lan,Qingwen Liu,Ying Chen

Task: Upgrade data governance from sample sieving to finer-grained intra-sample governance.

Motivation: Existing data governance methods estimate sample contributions with heuristic scalar scores and discard low-value samples, yet the retained samples still contain many undesired tokens, which leaves room for further compression and purification.

Details

Method: A dual-branch DataJuicer applies intra-sample governance to squeeze out informative tokens and strengthen image-text alignment. The vision branch retains salient image patches and extracts relevant object classes; the text branch incorporates these classes to enhance captions.

Result: Experiments show that DataJuicer significantly outperforms the existing DataSieve in image-text retrieval, classification, and dense visual reasoning.

Conclusion: Finer-grained governance lets DataJuicer produce more refined datasets and markedly better model performance.

Abstract: Widely observed data scaling laws, in which error falls off as a power of the training size, demonstrate the diminishing returns of unselective data expansion. Hence, data governance is proposed to downsize datasets through pruning non-informative samples. Yet, isolating the impact of a specific sample on overall model performance is challenging, due to the vast computation required to try out all sample combinations. Current data governors circumvent this complexity by estimating sample contributions through heuristic-derived scalar scores, thereby discarding low-value ones. Despite thorough sample sieving, retained samples contain substantial undesired tokens intrinsically, underscoring the potential for further compression and purification. In this work, we upgrade data governance from a 'sieving' approach to a 'juicing' one. Instead of scanning for least-flawed samples, our dual-branch DataJuicer applies finer-grained intra-sample governance. It squeezes out informative tokens and boosts image-text alignments. Specifically, the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes to enhance captions. Consequently, DataJuicer yields more refined datasets through finer-grained governance. Extensive experiments across datasets demonstrate that DataJuicer significantly outperforms existing DataSieve in image-text retrieval, classification, and dense visual reasoning.
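
The "diminishing returns" claim follows directly from a power-law scaling curve. The sketch below assumes an error of the form a * N^(-b); the constants a = 1.0 and b = 0.5 are illustrative and not taken from the paper.

```python
def power_law_error(n, a=1.0, b=0.5):
    """Assumed scaling law: error = a * n ** (-b)."""
    return a * n ** (-b)

sizes = [10 ** k for k in range(3, 8)]            # 1e3 ... 1e7 samples
errors = [power_law_error(n) for n in sizes]
# Absolute error reduction bought by each successive 10x increase in data.
gains = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]
```

Each tenfold increase in data multiplies the error by the same factor (10^(-b)), so the absolute improvement per added sample keeps shrinking. This is the motivation for improving data quality instead of adding more unselected data.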

Analysis of human visual field information using machine learning methods and assessment of their accuracy

A. I. Medvedeva,V. V. Bakutkin

Task: Study methods for analyzing visual field (perimetric) images for the diagnosis and control of glaucoma.

Motivation: The ophthalmological community is acutely concerned with disease control and import substitution, which motivates work on such methods.

Details

Method: Machine learning methods for classifying image results (stochastic gradient descent, logistic regression, random forest, naive Bayes).

Result: A computer model that can determine from an image whether the result indicates glaucoma or another disease (binary classification).

Conclusion: Building a classifier on a labeled dataset makes classification-based diagnosis of glaucoma possible.

Abstract: Subject of research: the study of methods for analyzing perimetric images for the diagnosis and control of glaucoma. Objects of research: a dataset collected on an ophthalmological perimeter with the results of various patient pathologies, since the ophthalmological community is acutely aware of the issue of disease control and import substitution [5]. Purpose of research: to consider various machine learning methods that can classify glaucoma. This is possible thanks to a classifier built after labeling the dataset. It is able to determine from an image whether the visual fields depicted in it result from glaucoma or from other visual diseases. Earlier work [3] described a dataset that was collected on the Tomey perimeter. The average age of the examined patients ranged from 30 to 85 years. Methods of research: machine learning methods for classifying image results (stochastic gradient descent, logistic regression, random forest, naive Bayes). Main results of research: a computer model that can determine from the image whether the result is glaucoma or another disease (binary classification).

Three-dimensional Reconstruction of the Lumbar Spine with Submillimeter Accuracy Using Biplanar X-ray Images

Wanxin Yu,Zhemin Zhu,Cong Wang,Yihang Bao,Chunjie Xia,Rongshan Cheng,Yan Yu,Tsung-Yuan Tsai

Task: Develop and validate a fully automated method for high-accuracy 3D reconstruction of the lumbar spine from biplanar X-ray images.

Motivation: Current fully automated reconstruction methods have low accuracy and do not meet clinical application standards.

Details

Method: Lumbar decomposition and landmark detection on the raw X-ray images, followed by a deformable model and a landmark-weighted 2D-3D registration approach.

Result: The method achieved a 3D reconstruction accuracy of 0.80 mm, a significant improvement over mainstream approaches.

Conclusion: The study will contribute to the clinical diagnosis of the lumbar spine in weight-bearing positions.

Abstract: Three-dimensional reconstruction of the spine under weight-bearing conditions from biplanar X-ray images is of great importance for the clinical assessment of spinal diseases. However, the current fully automated reconstruction methods have low accuracy and fail to meet the clinical application standards. This study developed and validated a fully automated method for high-accuracy 3D reconstruction of the lumbar spine from biplanar X-ray images. The method involves lumbar decomposition and landmark detection from the raw X-ray images, followed by a deformable model and landmark-weighted 2D-3D registration approach. The reconstruction accuracy was validated by the gold standard obtained through the registration of CT-segmented vertebral models with the biplanar X-ray images. The proposed method achieved a 3D reconstruction accuracy of 0.80 mm, representing a significant improvement over the mainstream approaches. This study will contribute to the clinical diagnosis of the lumbar spine in weight-bearing positions.

Reinforcement learning-based motion imitation for physiologically plausible musculoskeletal motor control

Merkourios Simos,Alberto Silvio Chiappa,Alexander Mathis

Task: Present a model-free motion imitation framework (KINESIS) to advance the understanding of muscle-based motor control.

Motivation: Understanding human motion has broad applications in computer animation, motion synthesis, neuroscience, prosthetics, and rehabilitation. Reinforcement learning has made impressive progress in capturing human motion, but controlling physiologically accurate body models remains an open challenge.

Details

Method: Using a musculoskeletal model of the lower body with 80 muscle actuators and 20 degrees of freedom, KINESIS achieves strong imitation performance on 1.9 hours of motion capture data, can be controlled by natural language through pre-trained text-to-motion generative models, and can be fine-tuned for high-level tasks.

Result: KINESIS generates muscle activity patterns that correlate well with human EMG activity; this physiological plausibility makes it a promising model for challenging problems in human motor control theory.

Conclusion: KINESIS shows promise for understanding human motor control, illustrated by an investigation of Bernstein's redundancy problem in the context of locomotion.

Abstract: How do humans move? The quest to understand human motion has broad applications in numerous fields, ranging from computer animation and motion synthesis to neuroscience, human prosthetics and rehabilitation. Although advances in reinforcement learning (RL) have produced impressive results in capturing human motion using simplified humanoids, controlling physiologically accurate models of the body remains an open challenge. In this work, we present a model-free motion imitation framework (KINESIS) to advance the understanding of muscle-based motor control. Using a musculoskeletal model of the lower body with 80 muscle actuators and 20 DoF, we demonstrate that KINESIS achieves strong imitation performance on 1.9 hours of motion capture data, is controllable by natural language through pre-trained text-to-motion generative models, and can be fine-tuned to carry out high-level tasks such as target goal reaching. Importantly, KINESIS generates muscle activity patterns that correlate well with human EMG activity. The physiological plausibility makes KINESIS a promising model for tackling challenging problems in human motor control theory, which we highlight by investigating Bernstein's redundancy problem in the context of locomotion. Code, videos and benchmarks will be available at https://github.com/amathislab/Kinesis.

Core-Periphery Principle Guided State Space Model for Functional Connectome Classification

Minheng Chen,Xiaowei Yu,Jing Zhang,Tong Chen,Chao Cao,Yan Zhuang,Yanjun Lyu,Lu Zhang,Tianming Liu,Dajiang Zhu

Task: Propose a Core-Periphery State-Space Model (CP-SSM) for functional connectome classification.

Motivation: Traditional machine learning struggles to capture the complex relationships between brain regions, while deep learning methods, particularly Transformer-based models, face computational challenges from their quadratic complexity in long-sequence modeling.

Details

Method: The CP-SSM framework introduces Mamba, a selective state-space model with linear complexity, and CP-MoE, a core-periphery-guided Mixture-of-Experts.

Result: On two benchmark fMRI datasets, ABIDE and ADNI, CP-SSM surpasses Transformer-based models in classification performance while significantly reducing computational complexity.

Conclusion: CP-SSM models brain functional connectivity effectively and efficiently, offering a promising direction for neuroimaging-based diagnosis of neurological disease.

Abstract: Understanding the organization of human brain networks has become a central focus in neuroscience, particularly in the study of functional connectivity, which plays a crucial role in diagnosing neurological disorders. Advances in functional magnetic resonance imaging and machine learning techniques have significantly improved brain network analysis. However, traditional machine learning approaches struggle to capture the complex relationships between brain regions, while deep learning methods, particularly Transformer-based models, face computational challenges due to their quadratic complexity in long-sequence modeling. To address these limitations, we propose a Core-Periphery State-Space Model (CP-SSM), an innovative framework for functional connectome classification. Specifically, we introduce Mamba, a selective state-space model with linear complexity, to effectively capture long-range dependencies in functional brain networks. Furthermore, inspired by the core-periphery (CP) organization, a fundamental characteristic of brain networks that enhances efficient information transmission, we design CP-MoE, a CP-guided Mixture-of-Experts that improves the representation learning of brain connectivity patterns. We evaluate CP-SSM on two benchmark fMRI datasets: ABIDE and ADNI. Experimental results demonstrate that CP-SSM surpasses Transformer-based models in classification performance while significantly reducing computational complexity. These findings highlight the effectiveness and efficiency of CP-SSM in modeling brain functional connectivity, offering a promising direction for neuroimaging-based neurological disease diagnosis.
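
The linear-complexity claim for state-space models can be seen in a minimal recurrence. The sketch below is a fixed-parameter diagonal SSM, not Mamba itself: Mamba additionally makes the parameters depend on the input (the "selective" part), which is omitted here. The function name and all numeric values are assumptions.

```python
def ssm_scan(xs, a, b, c):
    """Diagonal linear SSM over a scalar input sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state)
    y_t = sum(c * h_t)
    One pass over the sequence: cost grows linearly with its length.
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys

# Impulse response with a fast-decaying (0.5) and a slow-decaying (0.9) state.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0], a=[0.5, 0.9], b=[1.0, 1.0], c=[1.0, 1.0])
```

Self-attention compares every position with every other position, so its cost grows quadratically with sequence length. The recurrence above instead carries a fixed-size state forward, and the slow-decaying component is what lets information from early inputs persist.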

Rui Yang,Lin Song,Yicheng Xiao,Runhui Huang,Yixiao Ge,Ying Shan,Hengshuang Zhao

Task: Build a baseline for a native, end-to-end large multi-modal model in a single transformer.

Motivation: Most multi-modal models handle the visual and textual modalities separately, while native single-transformer models are resource-intensive and lag behind compositional models, so a more efficient approach is needed.

Details

Method: A new early-fusion multi-modal model that fuses multi-modal inputs at an early stage and responds to visual instructions auto-regressively, together with an efficient training recipe that uses the prior knowledge of pre-trained models to address both performance limitations and resource consumption.

Result: The model outperforms other single-transformer multi-modal models and significantly narrows the performance gap with compositional multi-modal models.

Conclusion: Early fusion combined with an efficient training recipe makes it possible to build an efficient, well-performing end-to-end large multi-modal model in a single transformer.

Abstract: Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.

Bayesian Modeling of Zero-Shot Classifications for Urban Flood Detection

Matt Franchi,Nikhil Garg,Wendy Ju,Emma Pierson

Task: Propose BayFlood, a two-stage approach to urban flood detection that avoids the need for large annotated datasets.

Motivation: Street scene datasets lack reliable labels, and many incident types occur so rarely that ground truth is hard to obtain.

Details

Method: First, zero-shot classification with a pretrained vision-language model (VLM); second, a spatial Bayesian model fitted on the VLM classifications.

Result: VLMs provide a strong zero-shot signal for floods across multiple cities and time periods, the Bayesian model improves out-of-sample prediction relative to baseline methods, and the inferred flood risk correlates with known external predictors of risk.

Conclusion: BayFlood can improve urban flood detection: it reveals high-risk populations overlooked by current methods, identifies demographic biases in existing methods, and suggests locations for new flood sensors.

Abstract: Street scene datasets, collected from Street View or dashboard cameras, offer a promising means of detecting urban objects and incidents like street flooding. However, a major challenge in using these datasets is their lack of reliable labels: there are myriad types of incidents, many types occur rarely, and ground-truth measures of where incidents occur are lacking. Here, we propose BayFlood, a two-stage approach which circumvents this difficulty. First, we perform zero-shot classification of where incidents occur using a pretrained vision-language model (VLM). Second, we fit a spatial Bayesian model on the VLM classifications. The zero-shot approach avoids the need to annotate large training sets, and the Bayesian model provides frequent desiderata in urban settings - principled measures of uncertainty, smoothing across locations, and incorporation of external data like stormwater accumulation zones. We comprehensively validate this two-stage approach, showing that VLMs provide strong zero-shot signal for floods across multiple cities and time periods, the Bayesian model improves out-of-sample prediction relative to baseline methods, and our inferred flood risk correlates with known external predictors of risk. Having validated our approach, we show it can be used to improve urban flood detection: our analysis reveals 113,738 people who are at high risk of flooding overlooked by current methods, identifies demographic biases in existing methods, and suggests locations for new flood sensors. More broadly, our results showcase how Bayesian modeling of zero-shot LM annotations represents a promising paradigm because it avoids the need to collect large labeled datasets and leverages the power of foundation models while providing the expressiveness and uncertainty quantification of Bayesian models.
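
To show why a Bayesian layer on top of zero-shot labels is useful, here is a deliberately simplified, non-spatial sketch: a Beta-Binomial model of the flood rate in one area, given how many of its images the VLM flagged. The paper's model is spatial and more expressive; the Beta(1, 1) prior, the function name, and the counts below are assumptions.

```python
def posterior_flood_rate(positives, total, prior_a=1.0, prior_b=1.0):
    """Posterior mean and variance of a rate under a Beta-Binomial model.

    positives: images the zero-shot classifier flagged as flooded
    total:     images available for the area
    """
    a = prior_a + positives
    b = prior_b + (total - positives)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, var

# Two hypothetical areas where every image was flagged, with different coverage.
sparse_mean, sparse_var = posterior_flood_rate(2, 2)
dense_mean, dense_var = posterior_flood_rate(50, 50)
```

Both areas have a raw flagged fraction of 100%, but the sparsely imaged area is pulled toward the prior and carries a much larger variance. This is the kind of uncertainty estimate the abstract refers to; a spatial model would additionally share information between neighboring areas.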

SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis

Hou In Ivan Tam,Hou In Derek Pun,Austin T. Wang,Angel X. Chang,Manolis Savva

Task: Present SceneEval, an evaluation framework for text-conditioned 3D indoor scene generation methods.

Motivation: Existing metrics mainly assess the realism of generated scenes and overlook alignment with the input text, a critical factor in whether a method meets user requirements.

Details

Method: SceneEval includes metrics for explicit user requirements (such as the presence of specific objects and their attributes) and for implicit expectations (such as the absence of object collisions).

Result: The evaluation shows that current methods struggle to generate scenes that meet user requirements.

Conclusion: SceneEval provides detailed assessments of scene quality, highlighting strengths and areas for improvement in current methods and the need for further research on meeting user requirements.

Abstract: Despite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics primarily assess the realism of generated scenes by comparing them to a set of ground-truth scenes, often overlooking alignment with the input text - a critical factor in determining how effectively a method meets user requirements. We present SceneEval, an evaluation framework designed to address this limitation. SceneEval includes metrics for both explicit user requirements, such as the presence of specific objects and their attributes described in the input text, and implicit expectations, like the absence of object collisions, providing a comprehensive assessment of scene quality. To facilitate evaluation, we introduce SceneEval-100, a dataset of scene descriptions with annotated ground-truth scene properties. We evaluate recent scene generation methods using SceneEval and demonstrate its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions. Our results show that current methods struggle at generating scenes that meet user requirements, underscoring the need for further research in this direction.
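
The split between explicit requirements and implicit expectations can be sketched with two simple checks: an object-count check against the text requirements, and an axis-aligned bounding-box collision test. These are illustrative stand-ins, not SceneEval's actual metrics; the function names, the scene, and the requirement dictionary are hypothetical.

```python
def boxes_collide(a, b):
    """a, b: axis-aligned boxes as (min_xyz, max_xyz); True if they overlap."""
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))

def evaluate_scene(objects, required_counts):
    """objects: list of (category, box). Returns (explicit_ok, collision_pairs)."""
    counts = {}
    for category, _ in objects:
        counts[category] = counts.get(category, 0) + 1
    # Explicit requirement: every requested category appears often enough.
    explicit_ok = all(counts.get(cat, 0) >= n for cat, n in required_counts.items())
    # Implicit expectation: no pair of objects interpenetrates.
    collisions = [
        (i, j)
        for i in range(len(objects))
        for j in range(i + 1, len(objects))
        if boxes_collide(objects[i][1], objects[j][1])
    ]
    return explicit_ok, collisions

scene = [
    ("bed", ((0, 0, 0), (2, 1, 2))),
    ("nightstand", ((1.5, 0, 0), (2.5, 1, 1))),  # overlaps the bed
    ("lamp", ((4, 0, 0), (4.5, 1, 0.5))),
]
ok, collisions = evaluate_scene(scene, {"bed": 1, "lamp": 1})
```

This scene satisfies the text ("a bed and a lamp") yet still fails the implicit expectation, which is the kind of failure a realism-only metric would not separate out.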

Involution and BSConv Multi-Depth Distillation Network for Lightweight Image Super-Resolution

Akram Khatami-Rizi,Ahmad Mahmoudi-Aznaveh

Task: Reconstruct high-resolution images from low-resolution inputs.

Motivation: Deep learning, especially Convolutional Neural Networks (CNNs), has advanced Single Image Super-Resolution (SISR), but deeper networks increase parameter count and memory usage and slow training, which is a problem for resource-limited devices.

Details

Method: The Involution & BSConv Multi-Depth Distillation Network (IBMDN) combines the Involution & BSConv Multi-Depth Distillation Block (IBMDB) with the Contrast and High-Frequency Attention Block (CHFAB). IBMDB integrates Involution and BSConv to balance computational efficiency and feature extraction; CHFAB enhances high-frequency details for better visual quality.

Result: Experiments show that the method achieves high accuracy at minimal computational cost.

Conclusion: IBMDB reduces complexity while improving evaluation metrics such as PSNR and SSIM; it reduces memory usage in transformer-based models and enhances perceptual quality in GANs.

Abstract: Single Image Super-Resolution (SISR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs. Deep learning, especially Convolutional Neural Networks (CNNs), has advanced SISR. However, increasing network depth increases parameter count and memory usage and slows training, which is problematic for resource-limited devices. To address this, lightweight models are developed to balance accuracy and efficiency. We propose the Involution & BSConv Multi-Depth Distillation Network (IBMDN), combining Involution & BSConv Multi-Depth Distillation Block (IBMDB) and the Contrast and High-Frequency Attention Block (CHFAB). IBMDB integrates Involution and BSConv to balance computational efficiency and feature extraction. CHFAB enhances high-frequency details for better visual quality. IBMDB is compatible with other SISR architectures and reduces complexity, improving evaluation metrics like PSNR and SSIM. In transformer-based models, IBMDB reduces memory usage while improving feature extraction. In GANs, it enhances perceptual quality, balancing pixel-level accuracy with perceptual details. Our experiments show that the method achieves high accuracy with minimal computational cost. The code is available at GitHub.
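
PSNR, one of the metrics named above, is defined from the mean squared error between a reference image and a reconstruction. A minimal sketch for 8-bit grayscale images follows; the function name and toy pixel values are assumptions, and evaluation protocols used in SISR papers (for example, which channel is scored and how borders are cropped) are not covered.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized grayscale images."""
    diffs = [
        (r - s) ** 2
        for ref_row, res_row in zip(reference, restored)
        for r, s in zip(ref_row, res_row)
    ]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [[10, 20], [30, 40]]
restored = [[11, 19], [31, 39]]   # every pixel off by one, so MSE = 1
score = psnr(reference, restored)
```

Because PSNR depends only on pixel-wise error, it rewards pixel-level fidelity and says little about perceptual quality, which is why it is usually reported alongside SSIM.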

On the Robustness Tradeoff in Fine-Tuning

Kunyang Li,Jean-Charles Noirot Ferrand,Ryan Sheatsley,Blaine Hoak,Yohan Beugin,Eric Pauley,Patrick McDaniel

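Task: Study how fine-tuning pre-trained models affects robustness on downstream tasks.

Motivation: Fine-tuning has become the standard practice for adapting pre-trained models to downstream tasks, but its impact on model robustness is not well understood.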

Details

Method: Evaluates the robustness and accuracy of fine-tuned models over 6 benchmark datasets and 7 different fine-tuning strategies. Result: A consistent trade-off between adversarial robustness and accuracy is observed; peripheral updates such as BitFit are more effective on simple tasks, whereas fine-tuning information-heavy layers, such as attention layers via Compacter, performs better on complex tasks. Conclusion: The findings emphasize the need for robustness-aware fine-tuning to ensure reliable real-world deployments. Abstract: Fine-tuning has become the standard practice for adapting pre-trained (upstream) models to downstream tasks. However, the impact on model robustness is not well understood. In this work, we characterize the robustness-accuracy trade-off in fine-tuning. We evaluate the robustness and accuracy of fine-tuned models over 6 benchmark datasets and 7 different fine-tuning strategies. We observe a consistent trade-off between adversarial robustness and accuracy. Peripheral updates such as BitFit are more effective for simple tasks--over 75% above the average measured with area under the Pareto frontiers on CIFAR-10 and CIFAR-100. In contrast, fine-tuning information-heavy layers, such as attention layers via Compacter, achieves a better Pareto frontier on more complex tasks--57.5% and 34.6% above the average on Caltech-256 and CUB-200, respectively. Lastly, we observe that robustness of fine-tuning against out-of-distribution data closely tracks accuracy. These insights emphasize the need for robustness-aware fine-tuning to ensure reliable real-world deployments.
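The abstract compares strategies by the area under their Pareto frontiers but does not define the computation. The Python sketch below shows one plausible reading: each checkpoint is an (accuracy, robustness) pair in [0, 1], the frontier is the set of non-dominated pairs, and the area is taken under the resulting staircase. The staircase definition and the sample points are assumptions, not the paper's procedure.

```python
def pareto_frontier(points):
    """Return the non-dominated (accuracy, robustness) pairs, sorted by accuracy."""
    frontier = []
    best_robustness = float("-inf")
    # Scan from the highest accuracy down; a point is kept only if it is
    # more robust than every point with higher (or equal) accuracy.
    for acc, rob in sorted(points, key=lambda p: (-p[0], -p[1])):
        if rob > best_robustness:
            frontier.append((acc, rob))
            best_robustness = rob
    return sorted(frontier)


def area_under_frontier(points):
    """Area under the staircase: best robustness achievable at each accuracy level."""
    area = 0.0
    prev_acc = 0.0
    for acc, rob in pareto_frontier(points):
        area += (acc - prev_acc) * rob
        prev_acc = acc
    return area


if __name__ == "__main__":
    # Hypothetical checkpoints from a single fine-tuning strategy.
    checkpoints = [(0.9, 0.2), (0.8, 0.5), (0.7, 0.4), (0.6, 0.7)]
    print(pareto_frontier(checkpoints))
    print(area_under_frontier(checkpoints))
```

A larger area means the strategy offers better robustness across the range of accuracies it can reach, which is how two strategies could be ranked against an average.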

ClimateGS: Real-Time Climate Simulation with 3D Gaussian Style Transfer

Yuezhen Xie,Meiying Zhang,Qi Hao

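Task: Propose ClimateGS, a new framework that integrates 3D Gaussian representations with physical simulation to enable real-time rendering of climate effects.

Motivation: Adverse climate conditions pose significant challenges for autonomous systems, which require reliable perception and decision-making. Existing physically-based NeRF rendering methods generate realistic scene representations, but their slow rendering and long preprocessing times make them unsuitable for real-time testing and user interaction.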

Details

Method: 1) A linear transformation for 3D Gaussian photorealistic style transfer that directly modifies spherical harmonics for efficient and consistent style adaptation; 2) a joint training strategy that combines supervised and self-supervised learning to accelerate convergence while preserving original scene details; 3) a real-time rendering method that integrates physics-based effects with 3D Gaussians for efficient and realistic rendering. Result: ClimateGS is evaluated on the MipNeRF360 and Tanks and Temples datasets, demonstrating real-time rendering with visual quality comparable or superior to SOTA 2D/3D methods, making it suitable for interactive applications. Conclusion: By combining 3D Gaussian representations with physical simulation, the ClimateGS framework achieves efficient, realistic, real-time rendering of climate effects suitable for interactive applications. Abstract: Adverse climate conditions pose significant challenges for autonomous systems, demanding reliable perception and decision-making across diverse environments. To better simulate these conditions, physically-based NeRF rendering methods have been explored for their ability to generate realistic scene representations. However, these methods suffer from slow rendering speeds and long preprocessing times, making them impractical for real-time testing and user interaction. This paper presents ClimateGS, a novel framework integrating 3D Gaussian representations with physical simulation to enable real-time climate effects rendering. The novelty of this work is threefold: 1) developing a linear transformation for 3D Gaussian photorealistic style transfer, enabling direct modification of spherical harmonics across bands for efficient and consistent style adaptation; 2) developing a joint training strategy for 3D style transfer, combining supervised and self-supervised learning to accelerate convergence while preserving original scene details; 3) developing a real-time rendering method for climate simulation, integrating physics-based effects with 3D Gaussian to achieve efficient and realistic rendering. We evaluate ClimateGS on MipNeRF360 and Tanks and Temples, demonstrating real-time rendering with comparable or superior visual quality to SOTA 2D/3D methods, making it suitable for interactive applications.
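The first contribution rests on a simple property: because view-dependent color is a linear combination of spherical-harmonic coefficients, a linear color transform can be applied to the coefficients of every band once, instead of to each rendered pixel. The Python sketch below illustrates that property for bands 0 and 1 only. The 3x3 matrix, the coefficient values, and the basis sign convention are assumptions for illustration, not the learned transform from the paper, and the offset and clamping used by practical 3D Gaussian renderers are omitted.

```python
SH_C0 = 0.28209479177387814  # band-0 real SH constant
SH_C1 = 0.4886025119029199   # band-1 real SH constant


def sh_basis(direction):
    """Real SH basis up to band 1 for a unit view direction (one common sign convention)."""
    x, y, z = direction
    return [SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x]


def matvec(matrix, vec):
    """Multiply a 3x3 matrix (list of rows) by a length-3 vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def eval_color(coeffs, direction):
    """RGB color as the basis-weighted sum of per-band RGB coefficients."""
    basis = sh_basis(direction)
    return [sum(b * c[ch] for b, c in zip(basis, coeffs)) for ch in range(3)]


def transform_coeffs(coeffs, matrix):
    """Apply the same linear color transform to the coefficients of every band."""
    return [matvec(matrix, c) for c in coeffs]


if __name__ == "__main__":
    coeffs = [[0.5, 0.4, 0.3], [0.1, 0.0, -0.1], [0.2, 0.1, 0.0], [-0.05, 0.02, 0.03]]
    style = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.7]]  # hypothetical style matrix
    view = (0.0, 0.6, 0.8)
    print(eval_color(transform_coeffs(coeffs, style), view))
    print(matvec(style, eval_color(coeffs, view)))
```

Both printed colors agree up to floating-point rounding, so the styled appearance stays consistent across viewpoints without any per-frame image processing.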

Exploring the Limits of KV Cache Compression in Visual Autoregressive Transformers

Bo Chen,Xiaoyu Li,Yekun Ke,Yingyu Liang,Zhenmei Shi,Zhao Song

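Task: Formally define the KV-cache compression problem for Visual Autoregressive transformers and establish a memory lower bound for it.

Motivation: Visual Autoregressive models incur substantial memory overhead during inference to store previously generated representations, and prior compression work has not explicitly formalized the KV-cache compression problem in this context.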

Details

Method: Formalizes KV-cache compression for Visual Autoregressive transformers and proves a lower bound via a reduction from a computational lower bound problem, using randomized embedding techniques inspired by dimensionality reduction. Result: Any mechanism for sequential visual token generation under attention-based architectures must use at least $\Omega(n^2 d)$ memory when $d = \Omega(\log n)$. Conclusion: Truly sub-quadratic memory usage is impossible without additional structural constraints; sparsity priors on visual representations are discussed as a possible direction for mitigating memory overhead. Abstract: A fundamental challenge in Visual Autoregressive models is the substantial memory overhead required during inference to store previously generated representations. Despite various attempts to mitigate this issue through compression techniques, prior works have not explicitly formalized the problem of KV-cache compression in this context. In this work, we take the first step in formally defining the KV-cache compression problem for Visual Autoregressive transformers. We then establish a fundamental negative result, proving that any mechanism for sequential visual token generation under attention-based architectures must use at least $\Omega(n^2 d)$ memory, when $d = \Omega(\log n)$, where $n$ is the number of tokens generated and $d$ is the embedding dimensionality. This result demonstrates that achieving truly sub-quadratic memory usage is impossible without additional structural constraints. Our proof is constructed via a reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction principles. Finally, we discuss how sparsity priors on visual representations can influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.

He Huang,Yong Chen,Yujun Guo,Wei He

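Task: Propose a self-supervised unknown-to-known degradation transformation framework (U2K) for blind hyperspectral image fusion.

Motivation: Existing supervised learning methods perform well when the degradation of the test data matches that of the training data, but they struggle with unknown degradations.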

Details

Method: Proposes the U2K framework, consisting of spatial and spectral Degradation Wrapping (DW) modules and Degradation Transformation (DT) modules, trained in a self-supervised manner using consistency loss and greedy alternating optimization. Result: Experiments show that U2K improves the adaptability of five existing supervised learning methods under various degradation settings and surpasses existing blind methods. Conclusion: The U2K framework significantly improves the flexibility and adaptability of blind hyperspectral image fusion. Abstract: Hyperspectral image (HSI) fusion is an efficient technique that combines low-resolution HSI (LR-HSI) and high-resolution multispectral images (HR-MSI) to generate high-resolution HSI (HR-HSI). Existing supervised learning methods (SLMs) can yield promising results when test data degradation matches the training ones, but they face challenges in generalizing to unknown degradations. To unleash the potential and generalization ability of SLMs, we propose a novel self-supervised unknown-to-known degradation transformation framework (U2K) for blind HSI fusion, which adaptively transforms unknown degradation into the same type of degradation as those handled by pre-trained SLMs. Specifically, the proposed U2K framework consists of: (1) spatial and spectral Degradation Wrapping (DW) modules that map HR-HSI to unknown degraded HR-MSI and LR-HSI, and (2) Degradation Transformation (DT) modules that convert these wrapped data into predefined degradation patterns. The transformed HR-MSI and LR-HSI pairs are then processed by a pre-trained network to reconstruct the target HR-HSI. We train the U2K framework in a self-supervised manner using consistency loss and greedy alternating optimization, significantly improving the flexibility of blind HSI fusion. Extensive experiments confirm the effectiveness of our proposed U2K framework in boosting the adaptability of five existing SLMs under various degradation settings and surpassing state-of-the-art blind methods.

FetalFlex: Anatomy-Guided Diffusion Model for Flexible Control on Fetal Ultrasound Image Synthesis

Yaofei Duan,Tao Tan,Zhiyuan Zhu,Yuhao Huang,Yuanji Zhang,Rui Gao,Patrick Cheong-Iao Pang,Xinru Gao,Guowei Tao,Xiang Cong,Zhou Li,Lianying Liang,Guangzhi He,Linliang Yin,Xuedong Deng,Xin Yang,Dong Ni

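Task: Propose a flexible fetal ultrasound image generation framework (FetalFlex) for generating diverse fetal ultrasound images.

Motivation: Fetal ultrasound data for rare or complex anomalies are hard to obtain, which makes it difficult to train novice radiologists and to develop robust AI models.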

Details

Method: FetalFlex leverages anatomical structures and multimodal information, achieves controllable image synthesis through a pre-alignment module and a repaint strategy, and uses a two-stage adaptive sampling strategy to progressively refine image quality. Result: Experiments on multi-center datasets show that FetalFlex achieves state-of-the-art performance across multiple image quality metrics, and that its generated images significantly improve performance on downstream classification and anomaly detection tasks. Conclusion: FetalFlex can generate both normal and abnormal fetal ultrasound images without requiring any abnormal data, offering a unique advantage for anomaly simulation and for creating pixel-level paired or counterfactual data. Abstract: Fetal ultrasound (US) examinations require the acquisition of multiple planes, each providing unique diagnostic information to evaluate fetal development and screen for congenital anomalies. However, obtaining a comprehensive, multi-plane annotated fetal US dataset remains challenging, particularly for rare or complex anomalies owing to their low incidence and numerous subtypes. This poses difficulties in training novice radiologists and developing robust AI models, especially for detecting abnormal fetuses. In this study, we introduce a Flexible Fetal US image generation framework (FetalFlex) to address these challenges, which leverages anatomical structures and multimodal information to enable controllable synthesis of fetal US images across diverse planes. Specifically, FetalFlex incorporates a pre-alignment module to enhance controllability and introduces a repaint strategy to ensure consistent texture and appearance. Moreover, a two-stage adaptive sampling strategy is developed to progressively refine image quality from coarse to fine levels. We believe that FetalFlex is the first method capable of generating both in-distribution normal and out-of-distribution abnormal fetal US images, without requiring any abnormal data. Experiments on multi-center datasets demonstrate that FetalFlex achieved state-of-the-art performance across multiple image quality metrics. A reader study further confirms the close alignment of the generated results with expert visual assessments. Furthermore, synthetic images by FetalFlex significantly improve the performance of six typical deep models in downstream classification and anomaly detection tasks. Lastly, FetalFlex's anatomy-level controllable generation offers a unique advantage for anomaly simulation and creating paired or counterfactual data at the pixel level. The demo is available at: https://dyf1023.github.io/FetalFlex/.

POSTA: A Go-to Framework for Customized Artistic Poster Generation

Haoyu Chen,Xiaojie Xu,Wenbo Li,Jingjing Ren,Tian Ye,Songhua Liu,Ying-Cong Chen,Lei Zhu,Xinchao Wang

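Task: Propose POSTA, a modular framework based on diffusion models and multimodal large language models, for generating customized artistic posters.

Motivation: Existing automatic poster design methods fall short in text accuracy, user customization, and aesthetic appeal, limiting their use in artistic domains such as movies and exhibitions.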

Details

Method: The POSTA framework consists of three modules: Background Diffusion generates a themed background, Design MLLM generates layout and typography elements consistent with the background style, and ArtText Diffusion applies additional stylization to key text elements. Result: POSTA outperforms existing models in text accuracy and aesthetic quality, demonstrating exceptional controllability and design diversity. Conclusion: Through its modular design and use of multimodal models, POSTA addresses the limitations of existing automatic poster design methods and produces visually cohesive, appealing, customized artistic posters. Abstract: Poster design is a critical medium for visual communication. Prior work has explored automatic poster design using deep learning techniques, but these approaches lack text accuracy, user customization, and aesthetic appeal, limiting their applicability in artistic domains such as movies and exhibitions, where both clear content delivery and visual impact are essential. To address these limitations, we present POSTA: a modular framework powered by diffusion models and multimodal large language models (MLLMs) for customized artistic poster generation. The framework consists of three modules. Background Diffusion creates a themed background based on user input. Design MLLM then generates layout and typography elements that align with and complement the background style. Finally, to enhance the poster's aesthetic appeal, ArtText Diffusion applies additional stylization to key text elements. The final result is a visually cohesive and appealing poster, with a fully modular process that allows for complete customization. To train our models, we develop the PosterArt dataset, comprising high-quality artistic posters annotated with layout, typography, and pixel-level stylized text segmentation. Our comprehensive experimental analysis demonstrates POSTA's exceptional controllability and design diversity, outperforming existing models in both text accuracy and aesthetic quality.

A Language Vision Model Approach for Automated Tumor Contouring in Radiation Oncology

Yi Luo,Hamed Hooshangnejad,Xue Feng,Gaofeng Huang,Xiaojian Chen,Rui Zhang,Quan Chen,Wil Ngwa,Kai Ding

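Task: Develop the Oncology Contouring Copilot (OCC) system, which combines AI with human expertise to improve the efficiency and accuracy of tumor contouring.

Motivation: Lung cancer is the leading cause of cancer-related mortality worldwide, and the complexity of tumor contouring is especially acute in resource-limited settings. Advances in AI, particularly deep learning and natural language processing, offer potential solutions.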

Details

Method: The OCC system first identifies nodule candidates from CT scans, then uses language vision models (such as GPT-4V) with clinical descriptive texts to reduce false positives, merging textual and visual data to automate tumor contouring. Result: Deploying OCC reduced the false discovery rate by 35.0% and false positives per scan by 72.4%, with an F1-score of 0.652 on the dataset. Conclusion: By using the latest language vision models, OCC represents a significant advance in oncology care: it optimizes tumor contouring and reduces manual processes, offers a scalable and intuitive framework for reducing false positives in radiotherapy planning, introduces new medical language vision prompt techniques to reduce hallucinations, and demonstrates the potential of language vision models for medical language vision challenges. Abstract: Background: Lung cancer ranks as the leading cause of cancer-related mortality worldwide. The complexity of tumor delineation, crucial for radiation therapy, requires expertise often unavailable in resource-limited settings. Artificial Intelligence (AI), particularly with advancements in deep learning (DL) and natural language processing (NLP), offers potential solutions yet is challenged by high false positive rates. Purpose: The Oncology Contouring Copilot (OCC) system is developed to leverage oncologist expertise for precise tumor contouring using textual descriptions, aiming to increase the efficiency of oncological workflows by combining the strengths of AI with human oversight. Methods: Our OCC system initially identifies nodule candidates from CT scans. Employing Language Vision Models (LVMs) like GPT-4V, OCC then effectively reduces false positives with clinical descriptive texts, merging textual and visual data to automate tumor delineation, designed to elevate the quality of oncology care by incorporating knowledge from experienced domain experts. Results: Deployments of the OCC system resulted in a significant reduction in the false discovery rate by 35.0%, a 72.4% decrease in false positives per scan, and an F1-score of 0.652 across our dataset for unbiased evaluation. Conclusions: OCC represents a significant advance in oncology care, particularly through the use of the latest LVMs to improve contouring results by (1) streamlining oncology treatment workflows by optimizing tumor delineation, reducing manual processes; (2) offering a scalable and intuitive framework to reduce false positives in radiotherapy planning using LVMs; (3) introducing novel medical language vision prompt techniques to minimize LVM hallucinations with ablation study, and (4) conducting a comparative analysis of LVMs, highlighting their potential in addressing medical language vision challenges.
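The three reported quantities (false discovery rate, false positives per scan, and F1-score) all follow from raw detection counts. The Python sketch below uses the standard definitions; the counts in the usage example are hypothetical and are not taken from the paper.

```python
def false_discovery_rate(tp, fp):
    """Fraction of reported detections that are false: FP / (TP + FP)."""
    detections = tp + fp
    return fp / detections if detections else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def false_positives_per_scan(fp, n_scans):
    """Average number of false detections per scan."""
    return fp / n_scans


if __name__ == "__main__":
    # Hypothetical counts before and after a false-positive filtering stage.
    before = {"tp": 60, "fp": 100, "fn": 20}
    after = {"tp": 58, "fp": 30, "fn": 22}
    for name, c in (("before", before), ("after", after)):
        print(name,
              false_discovery_rate(c["tp"], c["fp"]),
              f1_score(c["tp"], c["fp"], c["fn"]),
              false_positives_per_scan(c["fp"], n_scans=10))
```

A filtering stage that removes many false positives while discarding few true nodules lowers the false discovery rate sharply, which is the behavior the results describe.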

A Novel Channel Boosted Residual CNN-Transformer with Regional-Boundary Learning for Breast Cancer Detection

Aamir Mehmood,Yue Hu,Saddam Hussain Khan

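Task: Propose CB-Res-RBCMT, a new hybrid framework that combines customized residual convolutional neural networks (CNNs) with new vision transformer (ViT) components for detailed cancer analysis of breast ultrasound images (BUSI).

Motivation: Existing deep CNNs and vision transformers (ViTs) have shown initial success in detecting tumors in breast ultrasound images (BUSI), but challenges such as model complexity and variations in contrast, texture, and tumor morphology limit the effectiveness of current methods.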

Details

Method: Proposes the hybrid framework CB-Res-RBCMT, combining customized residual CNNs with new ViT components. It uses stem convolution blocks with CNN Meet Transformer (CMT) blocks, followed by new regional and boundary (RB) feature extraction operations that capture contrast and morphological variations. The CMT block enhances global contextual interactions through multi-head attention and uses a lightweight design for computational efficiency. Customized inverse residual and stem CNNs within the CMT extract local texture information and handle vanishing gradients. A new channel-boosted (CB) strategy enriches the feature diversity of the limited dataset by combining the original RBCMT channels with feature maps generated by transfer-learning-based residual CNNs. These diverse channels pass through a spatial attention block for optimal pixel selection, reducing redundancy and improving discrimination of minor contrast and texture variations. Result: CB-Res-RBCMT achieves an F1-score of 95.57%, accuracy of 95.63%, sensitivity of 96.42%, and precision of 94.79% on the standard harmonized stringent BUSI dataset, outperforming existing ViT and CNN methods. Conclusion: The results show the versatility of the integrated CNN-Transformer framework in capturing diverse features and delivering superior BUSI cancer diagnosis performance. Abstract: Recent advancements in detecting tumors using deep learning on breast ultrasound images (BUSI) have demonstrated significant success. Deep CNNs and vision-transformers (ViTs) have demonstrated individually promising initial performance. However, challenges related to model complexity and contrast, texture, and tumor morphology variations introduce uncertainties that hinder the effectiveness of current methods. This study introduces a novel hybrid framework, CB-Res-RBCMT, combining customized residual CNNs and new ViT components for detailed BUSI cancer analysis. The proposed RBCMT uses stem convolution blocks with CNN Meet Transformer (CMT) blocks, followed by new Regional and boundary (RB) feature extraction operations for capturing contrast and morphological variations. Moreover, the CMT block incorporates global contextual interactions through multi-head attention, enhancing computational efficiency with a lightweight design. Additionally, the customized inverse residual and stem CNNs within the CMT effectively extract local texture information and handle vanishing gradients. Finally, the new channel-boosted (CB) strategy enriches the feature diversity of the limited dataset by combining the original RBCMT channels with transfer learning-based residual CNN-generated maps. These diverse channels are processed through a spatial attention block for optimal pixel selection, reducing redundancy and improving the discrimination of minor contrast and texture variations. The proposed CB-Res-RBCMT achieves an F1-score of 95.57%, accuracy of 95.63%, sensitivity of 96.42%, and precision of 94.79% on the standard harmonized stringent BUSI dataset, outperforming existing ViT and CNN methods. These results demonstrate the versatility of our integrated CNN-Transformer framework in capturing diverse features and delivering superior performance in BUSI cancer diagnosis.

DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling

Jianbo Zhao,Taiyu Ban,Zhihao Liu,Hangning Zhou,Xiyang Wang,Qibin Zhou,Hailong Qin,Mu Yang,Lei Liu,Bin Li

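Task: Propose Directional Rotary Position Embedding (DRoPE), a new method for improving trajectory generation in autonomous driving systems.

Motivation: Existing scene-centric, agent-centric, and query-centric frameworks face an irreconcilable trade-off among accuracy, computational time, and memory efficiency; a new approach is needed to break this limitation.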

Details

Method: Proposes Directional Rotary Position Embedding (DRoPE), which introduces a uniform identity scalar into RoPE's 2D rotary transformation so that rotation angles align with realistic agent headings and relative angular information is encoded naturally. Result: Theoretical analysis and empirical evaluation show that DRoPE can simultaneously optimize the accuracy, time complexity, and space complexity of trajectory generation, with significantly reduced space complexity. Conclusion: DRoPE performs well both in theory and in practice and effectively addresses the limitations of existing trajectory generation methods. Abstract: Accurate and efficient modeling of agent interactions is essential for trajectory generation, the core of autonomous driving systems. Existing methods, scene-centric, agent-centric, and query-centric frameworks, each present distinct advantages and drawbacks, creating an impossible triangle among accuracy, computational time, and memory efficiency. To break this limitation, we propose Directional Rotary Position Embedding (DRoPE), a novel adaptation of Rotary Position Embedding (RoPE), originally developed in natural language processing. Unlike traditional relative position embedding (RPE), which introduces significant space complexity, RoPE efficiently encodes relative positions without explicitly increasing complexity but faces inherent limitations in handling angular information due to periodicity. DRoPE overcomes this limitation by introducing a uniform identity scalar into RoPE's 2D rotary transformation, aligning rotation angles with realistic agent headings to naturally encode relative angular information. We theoretically analyze DRoPE's correctness and efficiency, demonstrating its capability to simultaneously optimize trajectory generation accuracy, time complexity, and space complexity. Empirical evaluations compared with various state-of-the-art trajectory generation models, confirm DRoPE's good performance and significantly reduced space complexity, indicating both theoretical soundness and practical effectiveness. The video documentation is available at https://drope-traj.github.io/.
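One way to read the "uniform identity scalar" is that every 2D feature pair is rotated by the agent's heading itself, rather than by the heading scaled by a different frequency per pair as in standard RoPE. The Python sketch below follows that reading, which is an assumption based on the abstract rather than the authors' implementation; the function names and vectors are illustrative.

```python
import math


def rotate_pairs(vec, angle):
    """Rotate every consecutive 2D pair of vec by the same angle (uniform scalar of 1)."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out


def directional_score(query, key, heading_q, heading_k):
    """Attention logit after rotating query and key by their agents' headings."""
    q = rotate_pairs(query, heading_q)
    k = rotate_pairs(key, heading_k)
    return sum(a * b for a, b in zip(q, k))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.8, 0.5]
    k = [1.0, 0.4, -0.6, 0.9]
    base = directional_score(q, k, 0.3, 1.1)
    shifted = directional_score(q, k, 0.3 + 0.7, 1.1 + 0.7)       # same relative heading
    wrapped = directional_score(q, k, 0.3 + 2 * math.pi, 1.1)     # heading wrapped by 2*pi
    print(base, shifted, wrapped)
```

Under this reading the score depends only on the heading difference and is unchanged when a heading wraps by a full turn, which per-pair frequencies would not guarantee. No pairwise relative-angle tensor needs to be stored, which is consistent with the claimed space savings.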

Texture-Aware StarGAN for CT data harmonisation

Francesco Di Feola,Ludovica Pompilio,Cecilia Assolito,Valerio Guarrasi,Paolo Soda

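Task: Propose a novel texture-aware StarGAN for CT data harmonization, enabling one-to-many translations across different reconstruction kernels.

Motivation: CT plays a pivotal role in medical diagnosis, but variability across reconstruction kernels prevents data-driven approaches such as deep learning models from achieving reliable and generalized performance. CT data harmonization minimizes such non-biological variance by standardizing data across different sources or conditions.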

Details

Method: Proposes a texture-aware StarGAN model and introduces a multi-scale texture loss function that embeds texture information across different spatial and angular scales into the harmonization process. Result: Extensive experiments on a publicly available dataset, using 48667 chest CT slices from 197 patients distributed over three different reconstruction kernels, show that the method outperforms the baseline StarGAN. Conclusion: The texture-aware StarGAN performs well in CT data harmonization and effectively addresses kernel-induced texture variations. Abstract: Computed Tomography (CT) plays a pivotal role in medical diagnosis; however, variability across reconstruction kernels hinders data-driven approaches, such as deep learning models, from achieving reliable and generalized performance. To this end, CT data harmonization has emerged as a promising solution to minimize such non-biological variances by standardizing data across different sources or conditions. In this context, Generative Adversarial Networks (GANs) have proved to be a powerful framework for harmonization, framing it as a style-transfer problem. However, GAN-based approaches still face limitations in capturing complex relationships within the images, which are essential for effective harmonization. In this work, we propose a novel texture-aware StarGAN for CT data harmonization, enabling one-to-many translations across different reconstruction kernels. Although the StarGAN model has been successfully applied in other domains, its potential for CT data harmonization remains unexplored. Furthermore, our approach introduces a multi-scale texture loss function that embeds texture information across different spatial and angular scales into the harmonization process, effectively addressing kernel-induced texture variations. We conducted extensive experimentation on a publicly available dataset, utilizing a total of 48667 chest CT slices from 197 patients distributed over three different reconstruction kernels, demonstrating the superiority of our method over the baseline StarGAN.

World Models in Artificial Intelligence: Sensing, Learning, and Reasoning Like a Child

Javier Del Ser,Jesus L. Lobo,Heimo Müller,Andreas Holzinger

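Task: Explore how integrating statistical learning with six key research areas (physics-informed learning, neurosymbolic learning, continual learning, causal inference, human-in-the-loop AI, and responsible AI) can improve AI reasoning.

Motivation: Existing world models are widely used in reinforcement learning but lack the structured, adaptive representations that children develop intuitively.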

Details

Method: Proposes improving AI reasoning by integrating statistical learning with six key research areas (physics-informed learning, neurosymbolic learning, continual learning, causal inference, human-in-the-loop AI, and responsible AI). Result: By integrating these areas, AI can evolve from pattern recognition to genuine understanding, adaptation, and reasoning. Conclusion: Integrating statistical learning with the six key research areas is essential for enabling true reasoning in AI. Abstract: World Models help Artificial Intelligence (AI) predict outcomes, reason about its environment, and guide decision-making. While widely used in reinforcement learning, they lack the structured, adaptive representations that even young children intuitively develop. Advancing beyond pattern recognition requires dynamic, interpretable frameworks inspired by Piaget's cognitive development theory. We highlight six key research areas -- physics-informed learning, neurosymbolic learning, continual learning, causal inference, human-in-the-loop AI, and responsible AI -- as essential for enabling true reasoning in AI. By integrating statistical learning with advances in these areas, AI can evolve from pattern recognition to genuine understanding, adaptation and reasoning capabilities.

A Review on Large Language Models for Visual Analytics

Navya Sonal Agarwal,Sanjay Kumar Sonbhadra

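Task: Review the integration of Large Language Models (LLMs) with visual analytics, covering foundational concepts, capabilities, and wide-ranging applications.

Motivation: Examine the potential of LLMs in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations, and how the synergy between LLMs and visual analytics enhances data interpretation, visualization techniques, and interactive exploration.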

Details

Method: Critically evaluates key tools and platforms (such as LIDA, Chat2VIS, Julius AI, and Zoho Analytics) and specialized multimodal models (such as ChartLlama and CharXIV), and systematically explores a taxonomy of LLM tasks ranging from natural language understanding and natural language generation to dialogue systems and text-to-media transformations. Result: Provides a SWOT analysis of integrating LLMs with visual analytics, highlighting strengths such as accessibility and flexibility, weaknesses such as computational demands and biases, opportunities in multimodal integration and user collaboration, and threats including privacy concerns and skill degradation. Conclusion: Emphasizes the importance of addressing ethical considerations and methodological improvements for effective integration. Abstract: This paper provides a comprehensive review of the integration of Large Language Models (LLMs) with visual analytics, addressing their foundational concepts, capabilities, and wide-ranging applications. It begins by outlining the theoretical underpinnings of visual analytics and the transformative potential of LLMs, specifically focusing on their roles in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations. The review further investigates how the synergy between LLMs and visual analytics enhances data interpretation, visualization techniques, and interactive exploration capabilities. Key tools and platforms including LIDA, Chat2VIS, Julius AI, and Zoho Analytics, along with specialized multimodal models such as ChartLlama and CharXIV, are critically evaluated. The paper discusses their functionalities, strengths, and limitations in supporting data exploration, visualization enhancement, automated reporting, and insight extraction. The taxonomy of LLM tasks, ranging from natural language understanding (NLU), natural language generation (NLG), to dialogue systems and text-to-media transformations, is systematically explored. This review provides a SWOT analysis of integrating Large Language Models (LLMs) with visual analytics, highlighting strengths like accessibility and flexibility, weaknesses such as computational demands and biases, opportunities in multimodal integration and user collaboration, and threats including privacy concerns and skill degradation. It emphasizes addressing ethical considerations and methodological improvements for effective integration.

Beacon2Science: Enhancing STEREO/HI beacon data with machine learning for efficient CME tracking

Justin Le Louëdec,Maike Bauer,Tanja Amerstorfer,Jackie A. Davies

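Task: Improve the real-time observation and forecasting accuracy of coronal mass ejections (CMEs) by enhancing the quality of beacon data.

Motivation: The strong geomagnetic storms triggered by coronal mass ejections (CMEs) can damage satellites and electrical devices, so observing and forecasting CMEs in real time is crucial.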

Details

Method: Proposes a new pipeline called "Beacon2Science" that enhances the quality (signal-to-noise ratio and spatial resolution) of beacon data and increases its time resolution through learned interpolation to match the 40-minute resolution of science data. Result: The improved beacon images are comparable to science data and show better CME visibility than the original beacon data. Tracks extracted from enhanced beacon data are closer to those from science images, with a mean average error of about 0.5° of elongation compared to 1° with original beacon data. Conclusion: The work paves the way for application to forthcoming missions such as Vigil and PUNCH. Abstract: Observing and forecasting coronal mass ejections (CME) in real-time is crucial due to the strong geomagnetic storms they can generate that can have a potentially damaging effect, for example, on satellites and electrical devices. With its near-real-time availability, STEREO/HI beacon data is the perfect candidate for early forecasting of CMEs. However, previous work concluded that CME arrival prediction based on beacon data could not achieve the same accuracy as with high-resolution science data due to data gaps and lower quality. We present our novel pipeline entitled "Beacon2Science", bridging the gap between beacon and science data to improve CME tracking. Through this pipeline, we first enhance the quality (signal-to-noise ratio and spatial resolution) of beacon data. We then increase the time resolution of enhanced beacon images through learned interpolation to match science data's 40-minute resolution. We maximize information coherence between consecutive frames with adapted model architecture and loss functions through the different steps. The improved beacon images are comparable to science data, showing better CME visibility than the original beacon data. Furthermore, we compare CMEs tracked in beacon, enhanced beacon, and science images. The tracks extracted from enhanced beacon data are closer to those from science images, with a mean average error of $\sim 0.5 ^\circ$ of elongation compared to $1^\circ$ with original beacon data. The work presented in this paper paves the way for its application to forthcoming missions such as Vigil and PUNCH.

Euclid Quick Data Release (Q1). Active galactic nuclei identification using diffusion-based inpainting of Euclid VIS images

Euclid Collaboration,G. Stevens,S. Fotopoulou,M. N. Bremer,T. Matamoro Zatarain,K. Jahnke,B. Margalef-Bentabol,M. Huertas-Company,M. J. Smith,M. Walmsley,M. Salvato,M. Mezcua,A. Paulino-Afonso,M. Siudek,M. Talia,F. Ricci,W. Roster,N. Aghanim,B. Altieri,S. Andreon,H. Aussel,C. Baccigalupi,M. Baldi,S. Bardelli,P. Battaglia,A. Biviano,A. Bonchi,E. Branchini,M. Brescia,J. Brinchmann,S. Camera,G. Cañas-Herrera,V. Capobianco,C. Carbone,J. Carretero,M. Castellano,G. Castignani,S. Cavuoti,K. C. Chambers,A. Cimatti,C. Colodro-Conde,G. Congedo,C. J. Conselice,L. Conversi,Y. Copin,A. Costille,F. Courbin,H. M. Courtois,M. Cropper,A. Da Silva,H. Degaudenzi,G. De Lucia,C. Dolding,H. Dole,M. Douspis,F. Dubath,X. Dupac,S. Dusini,S. Escoffier,M. Farina,S. Ferriol,K. George,C. Giocoli,B. R. Granett,A. Grazian,F. Grupp,S. V. H. Haugan,I. M. Hook,F. Hormuth,A. Hornstrup,P. Hudelot,M. Jhabvala,E. Keihänen,S. Kermiche,A. Kiessling,M. Kilbinger,B. Kubik,M. Kümmel,H. Kurki-Suonio,Q. Le Boulc'h,A. M. C. Le Brun,D. Le Mignant,P. B. Lilje,V. Lindholm,I. Lloro,G. Mainetti,D. Maino,E. Maiorano,O. Marggraf,M. Martinelli,N. Martinet,F. Marulli,R. Massey,S. Maurogordato,H. J. McCracken,E. Medinaceli,S. Mei,M. Melchior,M. Meneghetti,E. Merlin,G. Meylan,A. Mora,M. Moresco,L. Moscardini,R. Nakajima,C. Neissner,S. -M. Niemi,C. Padilla,S. Paltani,F. Pasian,K. Pedersen,W. J. Percival,V. Pettorino,G. Polenta,M. Poncet,L. A. Popa,L. Pozzetti,F. Raison,R. Rebolo,A. Renzi,J. Rhodes,G. Riccio,E. Romelli,M. Roncarelli,R. Saglia,A. G. Sánchez,D. Sapone,J. A. Schewtschenko,M. Schirmer,P. Schneider,T. Schrabback,A. Secroun,S. Serrano,P. Simon,C. Sirignano,G. Sirri,J. Skottfelt,L. Stanco,J. Steinwagner,P. Tallada-Crespí,A. N. Taylor,I. Tereno,S. Toft,R. Toledo-Moreo,F. Torradeflot,I. Tutusaus,L. Valenziano,J. Valiviita,T. Vassallo,G. Verdoes Kleijn,A. Veropalumbo,Y. Wang,J. Weller,A. Zacchei,G. Zamorani,F. M. Zerbi,I. A. Zinchenko,E. Zucca,V. Allevato,M. Ballardini,M. Bolzonella,E. Bozzo,C. Burigana,R. 
Cabanac,A. Cappi,J. A. Escartin Vigo,L. Gabarra,W. G. Hartley,J. Martín-Fleitas,S. Matthew,R. B. Metcalf,A. Pezzotta,M. Pöntinen,I. Risso,V. Scottez,M. Sereno,M. Tenti,M. Wiesmann,Y. Akrami,S. Alvi,I. T. Andika,S. Anselmi,M. Archidiacono,F. Atrio-Barandela,D. Bertacca,M. Bethermin,L. Bisigello,A. Blanchard,L. Blot,S. Borgani,M. L. Brown,S. Bruton,A. Calabro,F. Caro,T. Castro,F. Cogato,S. Davini,G. Desprez,A. Díaz-Sánchez,J. J. Diaz,S. Di Domizio,J. M. Diego,P. -A. Duc,A. Enia,Y. Fang,A. G. Ferrari,A. Finoguenov,A. Fontana,A. Franco,J. García-Bellido,T. Gasparetto,V. Gautard,E. Gaztanaga,F. Giacomini,F. Gianotti,M. Guidi,C. M. Gutierrez,A. Hall,S. Hemmati,H. Hildebrandt,J. Hjorth,J. J. E. Kajava,Y. Kang,V. Kansal,D. Karagiannis,C. C. Kirkpatrick,S. Kruk,L. Legrand,M. Lembo,F. Lepori,G. Leroy,J. Lesgourgues,L. Leuzzi,T. I. Liaudat,J. Macias-Perez,M. Magliocchetti,F. Mannucci,R. Maoli,C. J. A. P. Martins,L. Maurin,M. Miluzio,P. Monaco,G. Morgante,K. Naidoo,A. Navarro-Alsina,F. Passalacqua,K. Paterson,L. Patrizii,A. Pisani,D. Potter,S. Quai,M. Radovich,P. -F. Rocci,G. Rodighiero,S. Sacquegna,M. Sahlén,D. B. Sanders,E. Sarpa,A. Schneider,M. Schultheis,D. Sciotti,E. Sellentin,F. Shankar,L. C. Smith,K. Tanidis,G. Testera,R. Teyssier,S. Tosi,A. Troja,M. Tucci,C. Valieri,D. Vergani,G. Verza,N. A. Walton

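Task: Propose a new method for identifying active galactic nuclei (AGN) and quasi-stellar objects (QSO) from a single image.

Motivation: Traditional AGN and QSO identification usually requires multi-wavelength observations; this work aims to achieve high-completeness identification from a single image.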

Details

Method: Exploits the spatial resolving power of Euclid VIS images to train a diffusion model that reconstructs the light distribution of normal galaxies, and identifies AGN and QSO as sources that deviate from that distribution. Result: Using VIS imaging alone, the method achieves high completeness compared to traditional selection methods (including optical, near-infrared, mid-infrared, and X-ray). Conclusion: The proposed method performs well at identifying AGN and QSO from a single image and has strong potential for application. Abstract: Light emission from galaxies exhibits diverse brightness profiles, influenced by factors such as galaxy type, structural features and interactions with other galaxies. Elliptical galaxies feature more uniform light distributions, while spiral and irregular galaxies have complex, varied light profiles due to their structural heterogeneity and star-forming activity. In addition, galaxies with an active galactic nucleus (AGN) feature intense, concentrated emission from gas accretion around supermassive black holes, superimposed on regular galactic light, while quasi-stellar objects (QSO) are the extreme case of the AGN emission dominating the galaxy. The challenge of identifying AGN and QSO has been discussed many times in the literature, often requiring multi-wavelength observations. This paper introduces a novel approach to identify AGN and QSO from a single image. Diffusion models have been recently developed in the machine-learning literature to generate realistic-looking images of everyday objects. Utilising the spatial resolving power of the Euclid VIS images, we created a diffusion model trained on one million sources, without using any source pre-selection or labels. The model learns to reconstruct light distributions of normal galaxies, since the population is dominated by them. We condition the prediction of the central light distribution by masking the central few pixels of each source and reconstruct the light according to the diffusion model. We further use this prediction to identify sources that deviate from this profile by examining the reconstruction error of the few central pixels regenerated in each source's core. Our approach, solely using VIS imaging, features high completeness compared to traditional methods of AGN and QSO selection, including optical, near-infrared, mid-infrared, and X-rays. [abridged]
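The selection step reduces to scoring each source by how badly the inpainted core matches the observed core. The Python sketch below computes such a score, assuming a mean-squared error over a small square central window; the error metric, the window size, and the toy cutouts are assumptions, and the reconstructed image stands in for the output of a diffusion model that is not implemented here.

```python
def central_reconstruction_error(original, reconstructed, half_width=1):
    """Mean squared error over the central (2*half_width + 1)^2 pixels.

    Both images are equally sized 2D lists of pixel values.
    """
    n_rows, n_cols = len(original), len(original[0])
    center_r, center_c = n_rows // 2, n_cols // 2
    total, count = 0.0, 0
    for r in range(center_r - half_width, center_r + half_width + 1):
        for c in range(center_c - half_width, center_c + half_width + 1):
            diff = original[r][c] - reconstructed[r][c]
            total += diff * diff
            count += 1
    return total / count


if __name__ == "__main__":
    # Toy 5x5 cutouts: a flat model prediction versus an observed source
    # with a bright, point-like core that the model did not anticipate.
    predicted = [[1.0] * 5 for _ in range(5)]
    observed = [row[:] for row in predicted]
    observed[2][2] = 11.0
    print(central_reconstruction_error(observed, predicted))
```

Sources whose score exceeds a chosen threshold would be flagged as AGN or QSO candidates, since a model trained mostly on normal galaxies under-predicts a concentrated nuclear excess.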

Abhi Kamboj,Minh N. Do

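Task: Construct a joint latent vector space in which two modalities representing the same concept map to the same vector.

Motivation: Study the multimodal alignment problem, explore whether perfect alignment can be achieved under certain conditions, and apply it to cross-modal transfer.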

Details

Method: Formulates multimodal alignment as an inverse problem and, assuming semantic classes are represented as a mixture of Gaussians in the latent space, performs cross-modal transfer by projecting data points onto different subspaces representing each modality. Result: Experiments on synthetic multimodal Gaussian data verify the effectiveness of the perfect alignment and cross-modal transfer method. Conclusion: The findings are expected to inspire further exploration of the applications of perfect alignment and the use of Gaussian models for cross-modal learning. Abstract: Multimodal alignment aims to construct a joint latent vector space where two modalities representing the same concept map to the same vector. We formulate this as an inverse problem and show that under certain conditions perfect alignment can be achieved. We then address a specific application of alignment referred to as cross-modal transfer. Unsupervised cross-modal transfer aims to leverage a model trained with one modality to perform inference on another modality, without any labeled fine-tuning on the new modality. Assuming that semantic classes are represented as a mixture of Gaussians in the latent space, we show how cross-modal transfer can be performed by projecting the data points from the representation space onto different subspaces representing each modality. Our experiments on synthetic multimodal Gaussian data verify the effectiveness of our perfect alignment and cross-modal transfer method. We hope these findings inspire further exploration of the applications of perfect alignment and the use of Gaussian models for cross-modal learning.

SemEval-2025 Task 1: AdMIRe -- Advancing Multimodal Idiomaticity Representation

Thomas Pickard,Aline Villavicencio,Maggie Mi,Wei He,Dylan Phelps,Carolina Scarton,Marco Idiart

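Task: Assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages.

Motivation: Idiomatic expressions pose a unique challenge in natural language processing because their meanings often cannot be inferred directly from their constituent words. Despite progress in large language models (LLMs), idiomaticity remains a significant obstacle to semantic representation.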

Details

Method: Presents the datasets and tasks for SemEval-2025 Task 1: AdMIRe (Advancing Multimodal Idiomaticity Representation), comprising two subtasks: ranking images by how well they align with an idiomatic or literal meaning, and predicting the next image in a sequence. Result: The most effective methods reached human-level performance by using pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over these models' weaknesses in representing idiomaticity. Conclusion: Multimodal, multilingual idiomaticity representation tasks can substantially improve models' understanding of idiomatic expressions. Abstract: Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in Large Language Models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and tasks for SemEval-2025 Task 1: AdMIRe (Advancing Multimodal Idiomaticity Representation), which challenges the community to assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages. Participants competed in two subtasks: ranking images based on their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models' representations of idiomaticity.

FedSCA: Federated Tuning with Similarity-guided Collaborative Aggregation for Heterogeneous Medical Image Segmentation

Yumin Zhang,Yan Gao,Haoran Duan,Hanqing Guo,Tejal Shah,Rajiv Ranjan,Bo Wei

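Task: Propose FedSCA, a new framework for fine-tuning foundation models under federated learning, for medical image segmentation.

Motivation: Foundation models are hard to apply to medical image segmentation because medical image datasets are small and privacy concerns restrict data centralization. Combining federated learning with foundation model fine-tuning can address these issues, but non-IID data and computational and communication constraints remain open challenges.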

Details

Method: Proposes the FedSCA framework, consisting of (1) specially designed parameter-efficient fine-tuning (PEFT) for local client training to improve computational efficiency; (2) partial low-level adapter transmission to improve communication efficiency; and (3) server-side similarity-guided collaborative aggregation (SGCA) to address non-IID data. Result: Extensive experiments on three federated learning benchmarks demonstrate the effectiveness of FedSCA and establish new SOTA performance. Conclusion: Through parameter-efficient fine-tuning, partial low-level adapter transmission, and similarity-guided collaborative aggregation, FedSCA addresses non-IID data and computational and communication constraints in medical image segmentation and substantially improves performance. Abstract: Transformer-based foundation models (FMs) have recently demonstrated remarkable performance in medical image segmentation. However, scaling these models is challenging due to the limited size of medical image datasets within isolated hospitals, where data centralization is restricted due to privacy concerns. These constraints, combined with the data-intensive nature of FMs, hinder their broader application. Integrating federated learning (FL) with foundation model fine-tuning (FLFM) offers a potential solution to these challenges by enabling collaborative model training without data sharing, thus allowing FMs to take advantage of a diverse pool of sensitive medical image data across hospitals/clients. However, non-independent and identically distributed (non-IID) data among clients, paired with computational and communication constraints in federated environments, present an additional challenge that limits further performance improvements and remains inadequately addressed in existing studies. In this work, we propose a novel FLFM fine-tuning framework, Federated tuning with Similarity-guided Collaborative Aggregation (FedSCA), encompassing all phases of the FL process. This includes (1) specially designed parameter-efficient fine-tuning (PEFT) for local client training to enhance computational efficiency; (2) partial low-level adapter transmission for communication efficiency; and (3) similarity-guided collaborative aggregation (SGCA) on the server side to address non-IID issues.
Extensive experiments on three FL benchmarks for medical image segmentation demonstrate the effectiveness of our proposed FedSCA, establishing new SOTA performance.
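
The abstract does not spell out how SGCA computes its aggregation, so the following is only a generic sketch of similarity-weighted server-side aggregation, not the FedSCA algorithm. It assumes each client uploads a flat adapter parameter vector, scores client pairs by cosine similarity, and turns each client's scores into softmax weights (the `temperature` parameter and all function names are hypothetical) to build one personalized aggregate per client.

```python
import math

def cosine(u, v):
    """Cosine similarity between two parameter vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_weights(adapters, temperature=0.5):
    """Row i holds softmax weights over all clients, based on their cosine
    similarity to client i; more similar clients get larger weights."""
    weights = []
    for u in adapters:
        scores = [math.exp(cosine(u, v) / temperature) for v in adapters]
        total = sum(scores)
        weights.append([s / total for s in scores])
    return weights

def aggregate(adapters, temperature=0.5):
    """Return one aggregated adapter per client as a similarity-weighted
    average of all uploaded adapters."""
    dim = len(adapters[0])
    return [
        [sum(w * v[k] for w, v in zip(row, adapters)) for k in range(dim)]
        for row in similarity_weights(adapters, temperature)
    ]
```

Weighting by similarity lets clients with comparable data distributions share more with each other than with dissimilar clients, which is one common way to soften the effect of non-IID data compared with a uniform average.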

Towards efficient keyword spotting using spike-based time difference encoders

Alejandro Pequeño-Zurro,Lyes Khacef,Stefano Panzeri,Elisabetta Chicca

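Task: Explore the performance of the Temporal Difference Encoder (TDE) in keyword spotting.

Motivation: With voice assistants in wide use, keyword spotting on edge devices is increasingly important, but its deployment is limited by the extreme low-power constraints of the target embedded systems.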

Details

Method: Using the TIdigits dataset, three Spiking Neural Network (SNN) architectures (feedforward TDE, feedforward CuBa-LIF, and recurrent CuBa-LIF) are trained to learn and classify spatio-temporal signals. Result: The feedforward TDE network reaches higher accuracy (89%) than the feedforward CuBa-LIF network (71%) and comes close to the recurrent CuBa-LIF network (91%), while performing 92% fewer synaptic operations than the recurrent CuBa-LIF network. Conclusion: The TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns. Abstract: Keyword spotting in edge devices is becoming increasingly important as voice-activated assistants are widely used. However, its deployment is often limited by the extreme low-power constraints of the target embedded systems. Here, we explore the performance of the Temporal Difference Encoder (TDE) in keyword spotting. This recent neuron model encodes the time difference in instantaneous frequency and spike count to perform efficient keyword spotting with neuromorphic processors. We use the TIdigits dataset of spoken digits with a formant decomposition and rate-based encoding into spikes. We compare three Spiking Neural Network (SNN) architectures to learn and classify spatio-temporal signals. The proposed SNN architectures are made of three layers and differ in their hidden layer, which is composed of either (1) feedforward TDE, (2) feedforward Current-Based Leaky Integrate-and-Fire (CuBa-LIF), or (3) recurrent CuBa-LIF neurons. We first show that the spike trains of the frequency-converted spoken digits have a large amount of information in the temporal domain, reinforcing the importance of better exploiting temporal encoding for such a task. We then train the three SNNs with the same number of synaptic weights to quantify and compare their performance based on the accuracy and synaptic operations. The resulting accuracy of the feedforward TDE network (89%) is higher than that of the feedforward CuBa-LIF network (71%) and close to that of the recurrent CuBa-LIF network (91%). However, the feedforward TDE-based network performs 92% fewer synaptic operations than the recurrent CuBa-LIF network with the same number of synapses.
In addition, the results of the TDE network are highly interpretable and correlated with the frequency and timescale features of the spoken keywords in the dataset. Our findings suggest that the TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns.
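
To make the idea of encoding a time difference in spike count concrete, here is a minimal discrete-time sketch of a TDE-like unit. It assumes a commonly described formulation that the abstract itself does not detail: a facilitatory input spike sets a decaying gain trace, a later trigger spike injects a synaptic current scaled by the gain remaining at that moment, and a leaky integrate-and-fire membrane converts that current into output spikes. All parameter values and the function name are illustrative assumptions, not values from the paper.

```python
import math

def tde_spike_count(dt_ms, tau_gain=20.0, tau_epsc=10.0, tau_mem=10.0,
                    weight=1.0, threshold=1.0, step=0.1, duration=200.0):
    """Count output spikes when the trigger spike arrives dt_ms after the
    facilitatory spike (which arrives at t = 0). Times are in milliseconds."""
    n_steps = int(round(duration / step))
    trigger_step = int(round(dt_ms / step))
    gain_decay = math.exp(-step / tau_gain)   # per-step decay of the gain trace
    epsc_decay = math.exp(-step / tau_epsc)   # per-step decay of the current
    gain = epsc = v = 0.0
    spikes = 0
    for n in range(n_steps):
        if n == 0:
            gain = 1.0                # facilitatory spike opens the gain trace
        if n == trigger_step:
            epsc += gain              # trigger spike samples the remaining gain
        # Forward-Euler update of the leaky membrane driven by the current.
        v += step * (-v / tau_mem + weight * epsc)
        if v >= threshold:
            spikes += 1
            v = 0.0                   # reset after an output spike
        gain *= gain_decay
        epsc *= epsc_decay
    return spikes
```

With these assumed parameters, a short interval between the two input spikes leaves a large gain and produces a burst of output spikes, while a long interval leaves too little gain to reach threshold, so the unit stays silent.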

Federated Continual 3D Segmentation With Single-round Communication

Can Peng,Qianhui Men,Pramit Saha,Qianye Yang,Cheng Ouyang,J. Alison Noble

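Task: Propose a federated continual learning strategy that performs a one-time model aggregation at the server through multi-model distillation, to handle new clients joining and label sets expanding in dynamic federated analysis settings.

Motivation: Traditional federated learning assumes fixed client data and learning objectives, but in practice new clients may join and existing clients may expand their label sets. In such dynamic settings, conventional approaches incur high communication and computation overhead and require synchronized communication, which is difficult to achieve.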

Details

Method: Uses multi-model distillation to perform a one-time model aggregation at the server, reducing how often clients must communicate with the server, reusing previous client models, and avoiding retraining of the global model. Result: The effectiveness of the proposed approach is demonstrated on a multi-class 3D abdominal CT segmentation task. Conclusion: The approach reduces communication load and relaxes synchronization requirements among clients, providing an efficient and scalable federated analysis framework suited to real-world applications. Abstract: Federated learning seeks to foster collaboration among distributed clients while preserving the privacy of their local data. Traditionally, federated learning methods assume a fixed setting in which client data and learning objectives remain constant. However, in real-world scenarios, new clients may join, and existing clients may expand the segmentation label set as task requirements evolve. In such a dynamic federated analysis setup, the conventional federated communication strategy of model aggregation per communication round is suboptimal. As new clients join, this strategy requires retraining, linearly increasing communication and computation overhead. It also imposes requirements for synchronized communication, which is difficult to achieve among distributed clients. In this paper, we propose a federated continual learning strategy that employs a one-time model aggregation at the server through multi-model distillation. This approach builds and updates the global model while eliminating the need for frequent server communication. When integrating new data streams or onboarding new clients, this approach efficiently reuses previous client models, avoiding the need to retrain the global model across the entire federation. By minimizing communication load and bypassing the need to put unchanged clients online, our approach relaxes synchronization requirements among clients, providing an efficient and scalable federated analysis framework suited for real-world applications. Using multi-class 3D abdominal CT segmentation as an application task, we demonstrate the effectiveness of the proposed approach.

LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding

Amirhossein Kazerouni,Soroush Mehraban,Michael Brudno,Babak Taati

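Task: Introduce LIFT, a novel high-performance framework that captures multiscale information through meta-learning to address the limitations of existing implicit neural representation (INR) frameworks.

Motivation: Existing INR frameworks often rely on global latent vectors or suffer from computational inefficiencies, which limits their broader applicability.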

Details

Method: LIFT uses multiple parallel localized implicit functions together with a hierarchical latent generator to produce unified latent representations spanning local, intermediate, and global features. ReLIFT is an enhanced variant of LIFT that adds residual connections and expressive frequency encodings. Result: LIFT achieves state-of-the-art performance in generative modeling and classification tasks with notably lower computational cost; ReLIFT also performs well in signal representation and inverse problem tasks. Conclusion: By capturing multiscale information and introducing residual connections, LIFT and ReLIFT address the limitations of existing methods and offer an efficient yet powerful solution. Abstract: Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains, offering key advantages such as memory efficiency and resolution independence. Conventional deep learning models are typically modality-dependent, often requiring custom architectures and objectives for different types of signals. However, existing INR frameworks frequently rely on global latent vectors or exhibit computational inefficiencies that limit their broader applicability. We introduce LIFT, a novel, high-performance framework that addresses these challenges by capturing multiscale information through meta-learning. LIFT leverages multiple parallel localized implicit functions alongside a hierarchical latent generator to produce unified latent representations that span local, intermediate, and global features. This architecture facilitates smooth transitions across local regions, enhancing expressivity while maintaining inference efficiency. Additionally, we introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings. With this straightforward approach, ReLIFT effectively addresses the convergence-capacity gap found in comparable methods, providing an efficient yet powerful solution to improve capacity and speed up convergence. Empirical results show that LIFT achieves state-of-the-art (SOTA) performance in generative modeling and classification tasks, with notable reductions in computational costs. Moreover, in single-task settings, the streamlined ReLIFT architecture proves effective in signal representations and inverse problem tasks.