2025 03 25

ChatGPT or A Silent Everywhere Helper: A Survey of Large Language Models

Azim Akhtarshenas,Afshin Dini,Navid Ayoobi

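Task: Provide a comprehensive analysis of ChatGPT, covering its architecture, training process, functionality, and applications across multiple domains.

Motivation: Examine ChatGPT as a representative large language model, including its distinctive features, performance metrics, and impact on industry and society.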

Details

Method: Assesses ChatGPT's performance, risks, and future directions through a literature review and comparative analysis.

Result: Summarizes ChatGPT's strengths, potential risks (such as misinformation, bias, and data privacy issues), and future research directions.

Conclusion: ChatGPT has had a significant impact on the AI field; further research is needed to address its challenges and advance the technology.

Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), with Chat Generative Pre-trained Transformer (ChatGPT) standing out as a notable example due to its advanced capabilities and widespread applications. This survey provides a comprehensive analysis of ChatGPT, exploring its architecture, training processes, and functionalities. We examine its integration into various domains across industries such as customer service, education, healthcare, and entertainment. A comparative analysis with other LLMs highlights ChatGPT's unique features and performance metrics. Regarding benchmarks, the paper examines ChatGPT's comparative performance against other LLMs and discusses potential risks such as misinformation, bias, and data privacy concerns. Additionally, we offer a number of figures and tables that outline the backdrop of the discussion, the main ideas of the article, the numerous LLM models, a thorough list of datasets used for pre-training, fine-tuning, and evaluation, as well as particular LLM applications with pertinent references. Finally, we identify future research directions and technological advancements, underscoring the evolving landscape of LLMs and their profound impact on Artificial Intelligence (AI) and society.

A Comprehensive Survey on Long Context Language Modeling

Jiaheng Liu,Dawei Zhu,Zhiqi Bai,Yancheng He,Huanxuan Liao,Haoran Que,Zekun Wang,Chenchen Zhang,Ge Zhang,Jiebin Zhang,Yuanxing Zhang,Zhuo Chen,Hangyu Guo,Shilong Li,Ziqiang Liu,Yong Shan,Yifan Song,Jiayi Tian,Wenhao Wu,Zhejian Zhou,Ruijie Zhu,Junlan Feng,Yang Gao,Shizhu He,Zhoujun Li,Tianyu Liu,Fanyu Meng,Wenbo Su,Yingshui Tan,Zili Wang,Jian Yang,Wei Ye,Bo Zheng,Wangchunshu Zhou,Wenhao Huang,Sujian Li,Zhaoxiang Zhang

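Task: Provide a comprehensive survey of recent advances in long context language models (LCLMs).

Motivation: As long documents, dialogues, and other textual data grow in volume, developing language models that handle long contexts efficiently has become essential.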

Details

Method: Organized around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy them efficiently, and how to evaluate and analyze them comprehensively.

Result: Provides a detailed analysis of data strategies, architectural designs, infrastructure requirements, and evaluation paradigms, and discusses application scenarios and future directions.

Conclusion: The survey gives researchers and engineers an up-to-date literature resource on long context language models, with an accompanying GitHub repository.

Abstract: Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented toward long-context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we hope will serve as a valuable resource for both researchers and engineers. An associated GitHub repository collecting the latest papers and repos is available at https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling (LCLM-Horizon).

Beyond Negation Detection: Comprehensive Assertion Detection Models for Clinical NLP

Veysel Kocaman,Yigit Gul,M. Aytug Kaya,Hasham Ul Haq,Mehmet Butgul,Cabir Celik,David Talby

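Task: Develop and evaluate advanced assertion status detection models to improve the accurate extraction of medical facts in clinical natural language processing (NLP).

Motivation: Existing commercial solutions (such as AWS Medical Comprehend, Azure AI Text Analytics, and GPT-4o) adapt poorly to the clinical domain, and a narrow focus on negation detection has led to underperformance.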

Details

Method: Developed several models, including fine-tuned large language models (LLMs), transformer-based classifiers, few-shot classifiers, and deep learning approaches, and compared them against commercial API solutions, the legacy rule-based NegEx approach, and GPT-4o.

Result: The fine-tuned LLM reaches the highest overall accuracy (0.962), clearly ahead of GPT-4o (0.901) and the commercial APIs, and is especially strong on Present, Absent, and Hypothetical assertions; the deep learning models beat commercial solutions on the Conditional and Associated-with-Someone-Else categories; the few-shot classifier is a lightweight yet highly competitive alternative (0.929).

Conclusion: Domain-adapted, transparent, and customizable clinical NLP solutions outperform general-purpose LLMs and proprietary APIs; the models are implemented in Spark NLP, enabling scalable inference and seamless integration with other medical NLP tasks.

Abstract: Assertion status detection is a critical yet often overlooked component of clinical NLP, essential for accurately attributing extracted medical facts. Past studies have narrowly focused on negation detection, leading to underperforming commercial solutions such as AWS Medical Comprehend, Azure AI Text Analytics, and GPT-4o due to their limited domain adaptation. To address this gap, we developed state-of-the-art assertion detection models, including fine-tuned LLMs, transformer-based classifiers, few-shot classifiers, and deep learning (DL) approaches. We evaluated these models against cloud-based commercial API solutions, the legacy rule-based NegEx approach, and GPT-4o. Our fine-tuned LLM achieves the highest overall accuracy (0.962), outperforming GPT-4o (0.901) and commercial APIs by a notable margin, particularly excelling in Present (+4.2%), Absent (+8.4%), and Hypothetical (+23.4%) assertions. Our DL-based models surpass commercial solutions in Conditional (+5.3%) and Associated-with-Someone-Else (+10.1%) categories, while the few-shot classifier offers a lightweight yet highly competitive alternative (0.929), making it ideal for resource-constrained environments. Integrated within Spark NLP, our models consistently outperform black-box commercial solutions while enabling scalable inference and seamless integration with medical NER, Relation Extraction, and Terminology Resolution. These results reinforce the importance of domain-adapted, transparent, and customizable clinical NLP solutions over general-purpose LLMs and proprietary APIs.

Language-specific Neurons Do Not Facilitate Cross-Lingual Transfer

Soumen Kumar Mondal,Sayambhu Sen,Abhishek Singhania,Preethi Jyothi

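Task: Investigate whether existing techniques for identifying language-specific neurons can improve cross-lingual task performance for low-resource languages.

Motivation: Multilingual large language models degrade significantly on low-resource languages, so ways to improve their cross-lingual understanding need to be studied.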

Details

Method: Applies language-specific neuron identification techniques (such as Language Activation Probability Entropy and activation probability-based thresholding) together with neuron-specific LoRA fine-tuning, with experiments on models including Llama 3.1 and Mistral Nemo.

Result: Neuron-specific interventions are insufficient to improve cross-lingual downstream tasks (such as XNLI and XQuAD) in low-resource languages.

Conclusion: The study exposes the challenges of achieving cross-lingual generalization and offers important insights for multilingual large language models.

Abstract: Multilingual large language models (LLMs) aim towards robust natural language understanding across diverse languages, yet their performance significantly degrades on low-resource languages. This work explores whether existing techniques to identify language-specific neurons can be leveraged to enhance cross-lingual task performance of low-resource languages. We conduct detailed experiments covering existing language-specific neuron identification techniques (such as Language Activation Probability Entropy and activation probability-based thresholding) and neuron-specific LoRA fine-tuning with models like Llama 3.1 and Mistral Nemo. We find that such neuron-specific interventions are insufficient to yield cross-lingual improvements on downstream tasks (XNLI, XQuAD) in low-resource languages. This study highlights the challenges in achieving cross-lingual generalization and provides critical insights for multilingual LLMs.
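The entropy-based identification step named in this abstract can be illustrated with a small sketch. It assumes the commonly described form of Language Activation Probability Entropy: for each neuron, take the probability that it fires on tokens of each language, normalize those probabilities across languages, and compute the entropy; a low value suggests the neuron responds mostly to one language. The function names, the toy probabilities, and the threshold are hypothetical and are not taken from the paper.

```python
import math

def lape(activation_probs):
    """Entropy of one neuron's activation probabilities, normalized across languages."""
    total = sum(activation_probs)
    if total == 0:
        # Assumption: a neuron that never fires is treated as maximally unspecific.
        return math.log(len(activation_probs))
    dist = [p / total for p in activation_probs]
    return -sum(p * math.log(p) for p in dist if p > 0)

def language_specific(neurons, threshold):
    """Return the names of neurons whose entropy falls below the threshold."""
    return [name for name, probs in neurons.items() if lape(probs) < threshold]

# Toy activation probabilities for three languages (illustrative values only).
neurons = {
    "n0": [0.9, 0.01, 0.01],  # fires almost only on the first language
    "n1": [0.3, 0.3, 0.3],    # fires equally on all languages
}
selected = language_specific(neurons, threshold=0.5)
```

In this toy setting only `n0` is selected; the paper's finding is that intervening on neurons chosen this way does not by itself improve cross-lingual transfer.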

AI-driven Automation of End-to-end Assessment of Suturing Expertise

Atharva Deo,Nicholas Matsumoto,Sun Kim,Peter Wager,Randy G. Tsai,Aaron Denmark,Cherine Yang,Xi Li,Jay Moran,Miguel Hernandez,Andrew J. Hung

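Task: Develop an AI-based method that automates end-to-end scoring for EASE, a suturing skills assessment tool.

Motivation: EASE scoring is currently done by human evaluators, which is time-consuming and labor-intensive; an AI method can provide real-time scores and feedback, speeding up learning and reducing critical errors during surgery.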

Details

Method: Uses an AI-based approach to predict scores in real time, focusing on 7 suturing skill domains across 3 suturing phases.

Result: The AI approach delivers real-time scoring with minimal resources, giving surgeons and trainees real-time feedback.

Conclusion: The AI approach substantially improves the efficiency of suturing skill assessment and could improve patient outcomes.

Abstract: We present an AI-based approach to automate the End-to-end Assessment of Suturing Expertise (EASE), a suturing skills assessment tool that comprehensively defines criteria around relevant sub-skills. While EASE provides granular skills assessment related to suturing to provide trainees with an objective evaluation of their aptitude along with actionable insights, the scoring process is currently performed by human evaluators, which is time- and resource-consuming. The AI-based approach solves this by enabling real-time score prediction with minimal resources during model inference. This enables the possibility of real-time feedback to the surgeons/trainees, potentially accelerating the learning process for the suturing task and mitigating critical errors during the surgery, improving patient outcomes. In this study, we focus on the following 7 EASE domains that come under 3 suturing phases: 1) Needle Handling: Number of Repositions, Needle Hold Depth, Needle Hold Ratio, and Needle Hold Angle; 2) Needle Driving: Driving Smoothness, and Wrist Rotation; 3) Needle Withdrawal: Wrist Rotation.

ConvoGen: Enhancing Conversational AI with Synthetic Data: A Multi-Agent Approach

Reem Gody,Mahmoud Goudy,Ahmed Y. Tawfik

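Task: Propose ConvoGen, a new framework that generates synthetic conversational data with a multi-agent system.

Motivation: Generate diverse and realistic conversational scenarios to support the training and evaluation of conversational AI models and to augment existing datasets.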

Details

Method: Uses few-shot learning with iterative sampling from a dynamically updated few-shot hub.

Result: Experiments show the method produces high-quality, diverse synthetic conversational data.

Conclusion: The ConvoGen framework has the potential to improve the development and evaluation of conversational AI systems.

Abstract: In this paper, we present ConvoGen: an innovative framework for generating synthetic conversational data using multi-agent systems. Our method leverages few-shot learning and introduces iterative sampling from a dynamically updated few-shot hub to create diverse and realistic conversational scenarios. The generated data has numerous applications, including training and evaluating conversational AI models, and augmenting existing datasets for tasks like conversational intent classification or conversation summarization. Our experiments demonstrate the effectiveness of this method in producing high-quality, diverse synthetic conversational data, highlighting its potential to enhance the development and evaluation of conversational AI systems.

IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes

Haochen Zhang,Nader Zantout,Pujith Kachana,Ji Zhang,Wenshan Wang

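Task: Build IRef-VLA, a benchmark dataset for interactive referential vision- and language-guided action in 3D scenes.

Motivation: Address the challenges in indoor navigation that stem from the need for 3D spatial reasoning and semantic understanding, and from language that is imperfect or misaligned with the scene.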

Details

Method: Builds the IRef-VLA dataset by combining over 11.5K scanned 3D rooms from existing datasets, 7.6M heuristically generated semantic relations, and 4.7M referential statements, and adds semantic object and room annotations, scene graphs, navigable free space annotations, and imperfect language.

Result: IRef-VLA is the largest real-world dataset for the referential grounding task, and its generalizability is verified with state-of-the-art models.

Conclusion: IRef-VLA provides a resource for developing robust, interactive navigation systems; the dataset and source code are publicly released.

Abstract: With the recent rise of large language models, vision-language models, and other general foundation models, there is growing potential for multimodal, multi-task robotics that can operate in diverse environments given natural language input. One such application is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the 3D spatial reasoning and semantic understanding required. Additionally, the language used may be imperfect or misaligned with the scene, further complicating the task. To address this challenge, we curate a benchmark dataset, IRef-VLA, for Interactive Referential Vision and Language-guided Action in 3D Scenes with imperfect references. IRef-VLA is the largest real-world dataset for the referential grounding task, consisting of over 11.5K scanned 3D rooms from existing datasets, 7.6M heuristically generated semantic relations, and 4.7M referential statements. Our dataset also contains semantic object and room annotations, scene graphs, navigable free space annotations, and is augmented with statements where the language has imperfections or ambiguities. We verify the generalizability of our dataset by evaluating with state-of-the-art models to obtain a performance baseline and also develop a graph-search baseline to demonstrate the performance bound and generation of alternatives using scene-graph knowledge. With this benchmark, we aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems. The dataset and all source code are publicly released at https://github.com/HaochenZ11/IRef-VLA.

SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia

Lama Ayash,Hassan Alhuzali,Ashwag Alasmari,Sultan Aloufi

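Task: Evaluate the cultural competence of large language models in the cultural context of Saudi Arabia.

Motivation: Large language models perform well in natural language processing but fall short in capturing cultural nuances, especially for a country like Saudi Arabia with rich cultural and dialectal diversity.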

Details

Method: Introduces the SaudiCulture benchmark dataset, covering five major geographical regions and multiple cultural domains, and evaluates five large language models on questions of varying complexity.

Result: All models decline significantly on highly specialized or region-specific questions, especially those requiring multiple correct answers, and some cultural categories are easier to identify than others.

Conclusion: Stresses the importance of incorporating region-specific knowledge into LLM training to improve cultural competence.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing; however, they often struggle to accurately capture and reflect cultural nuances. This research addresses this challenge by focusing on Saudi Arabia, a country characterized by diverse dialects and rich cultural traditions. We introduce SaudiCulture, a novel benchmark designed to evaluate the cultural competence of LLMs within the distinct geographical and cultural contexts of Saudi Arabia. SaudiCulture is a comprehensive dataset of questions covering five major geographical regions (West, East, South, North, and Center), along with general questions applicable across all regions. The dataset encompasses a broad spectrum of cultural domains, including food, clothing, entertainment, celebrations, and crafts. To ensure a rigorous evaluation, SaudiCulture includes questions of varying complexity, such as open-ended, single-choice, and multiple-choice formats, with some requiring multiple correct answers. Additionally, the dataset distinguishes between common cultural knowledge and specialized regional aspects. We conduct extensive evaluations on five LLMs (GPT-4, Llama 3.3, FANAR, Jais, and AceGPT), analyzing their performance across different question types and cultural contexts. Our findings reveal that all models experience significant performance declines when faced with highly specialized or region-specific questions, particularly those requiring multiple correct responses. Additionally, certain cultural categories are more easily identifiable than others, further highlighting inconsistencies in LLMs' cultural understanding. These results emphasize the importance of incorporating region-specific knowledge into LLM training to enhance their cultural competence.

Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)

Yicheng Duan,Xi Huang,Duo Chen

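Task: Propose a framework that combines vector similarity search with graph data structures to improve adaptive, time-sensitive video retrieval.

Motivation: The rapid growth of video content demands efficient and precise retrieval systems, yet existing vision-language models (VLMs) fall short on adaptive, time-sensitive video retrieval.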

Details

Method: Uses VLM embeddings for initial retrieval and graph data structures to model contextual relationships among video segments, enabling adaptive query refinement.

Result: Experiments show the method performs well in precision, scalability, and robustness.

Conclusion: The framework offers an effective solution for interactive video retrieval in dynamic environments.

Abstract: The rapid growth of video content demands efficient and precise retrieval systems. While vision-language models (VLMs) excel in representation learning, they often struggle with adaptive, time-sensitive video retrieval. This paper introduces a novel framework that combines vector similarity search with graph-based data structures. By leveraging VLM embeddings for initial retrieval and modeling contextual relationships among video segments, our approach enables adaptive query refinement and improves retrieval accuracy. Experiments demonstrate its precision, scalability, and robustness, offering an effective solution for interactive video retrieval in dynamic environments.
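The two-stage idea in this abstract (embedding similarity for the initial hit, then a graph over segments for context) might look roughly like the following. This is a minimal sketch under assumptions: the abstract does not specify the similarity metric, the graph construction, or the traversal rule, so cosine similarity, a plain adjacency dictionary, and a fixed number of hops are stand-ins, and all names and vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, embeddings, graph, k=1, hops=1):
    """Stage 1: top-k segments by similarity. Stage 2: expand along graph edges."""
    ranked = sorted(embeddings, key=lambda s: cosine(query, embeddings[s]), reverse=True)
    results = ranked[:k]
    frontier = list(results)
    for _ in range(hops):
        next_frontier = []
        for segment in frontier:
            for neighbor in graph.get(segment, []):
                if neighbor not in results:
                    results.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return results

# Toy segment embeddings and a graph whose edges point to related (e.g. subsequent) segments.
embeddings = {"clip_a": [1.0, 0.0], "clip_b": [0.0, 1.0], "clip_c": [0.7, 0.7]}
graph = {"clip_a": ["clip_c"], "clip_c": ["clip_b"]}
hits = retrieve([0.9, 0.1], embeddings, graph, k=1, hops=1)
```

Here the query matches `clip_a` by similarity, and one hop along the graph adds `clip_c`, which similarity alone would have ranked lower.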

Judge Anything: MLLM as a Judge Across Any Modality

Shu Pu,Yaochen Wang,Dongping Chen,Yuhang Chen,Guohao Wang,Qi Qin,Zhongyi Zhang,Zhiyuan Zhang,Zetong Zhou,Shuang Gong,Yi Gui,Yao Wan,Philip S. Yu

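Task: Evaluate generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks, and extend the use of multimodal large language models (MLLMs) as automated judges.

Motivation: The complexity of cross-modal interactions makes multimodal tasks hard to evaluate, calling for a more unified evaluation approach and a standardized testbed.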

Details

Method: Introduces two benchmarks, TaskAnything and JudgeAnything, which respectively evaluate the overall performance and the judging capabilities of MLLMs on any-to-any modality tasks, and develops the automated platform OmniArena.

Result: MLLMs do reasonably well at judging MMU tasks (averaging 66.55% in the Pair Comparison setting and 42.79% in the Score Evaluation setting) but poorly on MMG tasks (53.37% and 30.05%), revealing cross-modality biases and hallucination issues.

Conclusion: Fairer evaluation protocols and stronger alignment with human preferences are needed; the source code and dataset are publicly available.

Abstract: Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in Pair Comparison setting and 42.79% in Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in Pair Comparison setting and 30.05% in Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://urrealhero.github.io/judgeanythingweb/.

Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition

Ran Liu,Fengyu Zhang,Cong Yu,Longjiang Yang,Zhuofan Wen,Siyuan Zhang,Hailiang Yao,Shun Chen,Zheng Lian,Bin Liu

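Task: Propose a multimodal emotion recognition method that fuses Vision Transformer (ViT) and Residual Network (ResNet) features for the compound expression recognition challenge.

Motivation: Multimodal emotion recognition has important applications in affective computing and human-computer interaction, but in the real world compound emotion recognition faces greater uncertainty and modal conflicts.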

Details

Method: Fuses ViT and ResNet features and runs experiments on the C-EXPR-DB and MELD datasets.

Result: In scenarios with complex visual and audio cues (such as C-EXPR-DB), the model that fuses ViT and ResNet features shows superior performance.

Conclusion: The proposed method performs well on compound expression recognition; the code is open source.

Abstract: This article presents our results for the eighth Affective Behavior Analysis in-the-wild (ABAW) competition. Multimodal emotion recognition (ER) has important applications in affective computing and human-computer interaction. However, in the real world, compound emotion recognition faces greater issues of uncertainty and modal conflicts. For the Compound Expression (CE) Recognition Challenge, this paper proposes a multimodal emotion recognition method that fuses the features of Vision Transformer (ViT) and Residual Network (ResNet). We conducted experiments on the C-EXPR-DB and MELD datasets. The results show that in scenarios with complex visual and audio cues (such as C-EXPR-DB), the model that fuses the features of ViT and ResNet exhibits superior performance. Our code is available at https://github.com/MyGitHub-ax/8th_ABAW

Follow-up Question Generation For Enhanced Patient-Provider Conversations

Joseph Gatto,Parker Seegmiller,Timothy Burdick,Inas S. Khayal,Sarah DeLozier,Sarah M. Preum

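Task: Propose FollowupQ, a multi-agent framework that generates personalized follow-up questions in asynchronous medical conversations.

Motivation: Address the challenges that fragmented information and parallel thought processes pose in medical dialogue, and reduce the communication cost between providers and patients.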

Details

Method: FollowupQ processes patient messages and electronic health record (EHR) data to generate personalized follow-up questions.

Result: FollowupQ reduces requisite provider follow-up communications by 34% and improves performance by 17% and 5% on real and synthetic data, respectively.

Conclusion: FollowupQ makes asynchronous medical conversations more efficient, and the authors release the first public dataset of asynchronous medical messages.

Abstract: Follow-up question generation is an essential feature of dialogue systems as it can reduce conversational ambiguity and enhance the modeling of complex interactions. Conversational contexts often pose core NLP challenges such as (i) extracting relevant information buried in fragmented data sources, and (ii) modeling parallel thought processes. These two challenges occur frequently in medical dialogue as a doctor asks questions based not only on patient utterances but also their prior EHR data and current diagnostic hypotheses. Asking medical questions in asynchronous conversations compounds these issues as doctors can only rely on static EHR information to motivate follow-up questions. To address these challenges, we introduce FollowupQ, a novel framework for enhancing asynchronous medical conversation. FollowupQ is a multi-agent framework that processes patient messages and EHR data to generate personalized follow-up questions, clarifying patient-reported medical conditions. FollowupQ reduces requisite provider follow-up communications by 34%. It also improves performance by 17% and 5% on real and synthetic data, respectively. We also release the first public dataset of asynchronous medical messages with linked EHR data alongside 2,300 follow-up questions written by clinical experts for the wider NLP research community.

High Efficiency Wiener Filter-based Point Cloud Quality Enhancement for MPEG G-PCC

Yuxuan Wei,Zehan Wang,Tian Guo,Hao Liu,Liquan Shen,Hui Yuan

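Task: Propose a high-efficiency Wiener filter to improve reconstruction quality and rate-distortion performance for dynamic point clouds under the G-PCC standard.

Motivation: G-PCC reconstruction quality is low at low bitrates, so a method to improve it is needed.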

Details

Method: Proposes a Wiener filter-based method with coefficient inheritance, variance-based point classification, and a Morton code-based fast nearest neighbor search algorithm.

Result: Experiments show average Bjøntegaard delta rates of -6.1%, -7.3%, and -8.0% for the Luma, Chroma Cb, and Chroma Cr components, respectively.

Conclusion: The proposed method significantly improves compression performance for dynamic point clouds at acceptable computational complexity.

Abstract: Point clouds, which directly record the geometry and attributes of scenes or objects by a large number of points, are widely used in various applications such as virtual reality and immersive communication. However, due to the huge data volume and unstructured geometry, efficient compression of point clouds is crucial. In recent years, the Moving Picture Expert Group has been establishing a geometry-based point cloud compression (G-PCC) standard for both static and dynamic point clouds. Although lossy compression of G-PCC can achieve a very high compression ratio, the reconstruction quality is relatively low, especially at low bitrates. To mitigate this problem, we propose a high efficiency Wiener filter that can be integrated into the encoder and decoder pipeline of G-PCC to improve the reconstruction quality as well as the rate-distortion performance for dynamic point clouds. Specifically, we first propose a basic Wiener filter, and then improve it by introducing coefficients inheritance and variance-based point classification for the Luma component. Besides, to reduce the complexity of the nearest neighbor search during the application of the Wiener filter, we also propose a Morton code-based fast nearest neighbor search algorithm for efficient calculation of filter coefficients. Experimental results demonstrate that the proposed method can achieve average Bjøntegaard delta rates of -6.1%, -7.3%, and -8.0% for Luma, Chroma Cb, and Chroma Cr components, respectively, under the condition of lossless-geometry-lossy-attributes configuration compared to the latest G-PCC encoding platform (i.e., geometry-based solid content test model version 7.0 release candidate 2) while consuming affordable computational complexity.
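To make the Morton code-based neighbor search concrete, here is a minimal sketch of the general technique, not the paper's algorithm: a 3D Morton (Z-order) code interleaves the bits of the x, y, and z coordinates so that points close in space tend to be close in sorted order, and neighbor candidates can then be drawn from a small window around each point in that order. The window size and function names are hypothetical, and a window search of this kind is approximate.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of non-negative integers x, y, z into one code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def approx_nearest(points, window=1):
    """For each point, pick the closest candidate within `window` positions in Morton order."""
    order = sorted(range(len(points)), key=lambda i: morton3d(*points[i]))
    nearest = {}
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - window), min(len(order), pos + window + 1)
        candidates = [order[p] for p in range(lo, hi) if p != pos]
        nearest[i] = min(
            candidates,
            key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
        )
    return nearest

# Four integer-coordinate points forming two tight pairs (needs at least two points).
points = [(0, 0, 0), (1, 0, 0), (7, 7, 7), (6, 7, 7)]
neighbors = approx_nearest(points, window=1)
```

Sorting once and scanning a fixed window replaces an all-pairs distance search, which is where the speedup of this family of methods comes from; the trade-off is that a true nearest neighbor can occasionally fall outside the window.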

Language Models May Verbatim Complete Text They Were Not Explicitly Trained On

Ken Ziyu Liu,Christopher A. Choquette-Choo,Matthew Jagielski,Peter Kairouz,Sanmi Koyejo,Percy Liang,Nicolas Papernot

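Task: Study how completion tests are used to judge whether a text was used to train a large language model (LLM), and expose problems with n-gram based membership definitions.

Motivation: Current membership definitions based on n-gram overlap can be gamed, which calls the accuracy of completion tests into question.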

Details

Method: Retrains LLMs after removing the training samples that were completed, and studies the membership-definition problem in natural and adversarial settings.

Result: n-gram membership definitions are easily gamed, and a single suitable value of n is hard to find; adversarial datasets are designed to demonstrate the problem.

Conclusion: n-gram membership definitions are inadequate because they fail to account for auxiliary information available to the training algorithm.

Abstract: An important question today is whether a given text was used to train a large language model (LLM). A completion test is often employed: check if the LLM completes a sufficiently complex text. This, however, requires a ground-truth definition of membership; most commonly, a text is defined as a member based on the n-gram overlap between the target text and any text in the dataset. In this work, we demonstrate that this n-gram based membership definition can be effectively gamed. We study scenarios where sequences are non-members for a given n and we find that completion tests still succeed. We find many natural cases of this phenomenon by retraining LLMs from scratch after removing all training samples that were completed; these cases include exact duplicates, near-duplicates, and even short overlaps. They showcase that it is difficult to find a single viable choice of n for membership definitions. Using these insights, we design adversarial datasets that can cause a given target sequence to be completed without containing it, for any reasonable choice of n. Our findings highlight the inadequacy of n-gram membership, suggesting membership definitions fail to account for auxiliary information available to the training algorithm.
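A small sketch can show why the choice of n matters for an n-gram membership definition. It assumes the simplest variant, in which a target counts as a member if it shares at least one word-level n-gram with any training document; the paper may use a different tokenization or overlap threshold, and the helper names and example strings are hypothetical.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_member(target, corpus, n):
    """True if the target shares at least one n-gram with any document in the corpus."""
    target_grams = ngrams(target.split(), n)
    return any(target_grams & ngrams(doc.split(), n) for doc in corpus)

corpus = ["a quick brown fox sleeps"]
target = "the quick brown fox jumps"

member_at_3 = is_member(target, corpus, n=3)  # shares "quick brown fox"
member_at_4 = is_member(target, corpus, n=4)  # no shared 4-gram
```

The same target is a member for n = 3 and a non-member for n = 4, even though the training text is a near-duplicate of it, which mirrors the abstract's point that no single n cleanly separates members from non-members.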

Spatiotemporal Learning with Context-aware Video Tubelets for Ultrasound Video Analysis

Gary Y. Li,Li Chen,Bryson Hicks,Nikolai Schnittke,David O. Kessler,Jeffrey Shupp,Maria Parker,Cristiana Baloescu,Christopher Moore,Cynthia Gregory,Kenton Gregory,Balasundar Raju,Jochen Kruecker,Alvin Chen

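Task: Propose a lightweight framework for object detection and video classification based on video sub-volumes (tubelets) that preserves global spatial context and fine spatiotemporal features.

Motivation: Current state-of-the-art methods classify video sub-volumes but often lose global spatial context because they focus only on local regions within detection ROIs.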

Details

Method: Embeds tubelet location, size, and confidence as classifier inputs, and uses ROI-aligned feature maps from a pre-trained detection model to increase the receptive field and reduce computational complexity.

Result: In five-fold cross-validation on 14,804 videos, the method outperforms previous tubelet-based approaches and is suited to real-time workflows.

Conclusion: The method performs video classification and object detection efficiently while preserving global context, and is suited to real-time applications.

Abstract: Computer-aided pathology detection algorithms for video-based imaging modalities must accurately interpret complex spatiotemporal information by integrating findings across multiple frames. Current state-of-the-art methods operate by classifying on video sub-volumes (tubelets), but they often lose global spatial context by focusing only on local regions within detection ROIs. Here we propose a lightweight framework for tubelet-based object detection and video classification that preserves both global spatial context and fine spatiotemporal features. To address the loss of global context, we embed tubelet location, size, and confidence as inputs to the classifier. Additionally, we use ROI-aligned feature maps from a pre-trained detection model, leveraging learned feature representations to increase the receptive field and reduce computational complexity. Our method is efficient, with the spatiotemporal tubelet classifier comprising only 0.4M parameters. We apply our approach to detect and classify lung consolidation and pleural effusion in ultrasound videos. Five-fold cross-validation on 14,804 videos from 828 patients shows our method outperforms previous tubelet-based approaches and is suited for real-time workflows.

Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models

Linlu Qiu,Fei Sha,Kelsey Allen,Yoon Kim,Tal Linzen,Sjoerd van Steenkiste

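Task: Evaluate whether large language models (LLMs) can update their beliefs incrementally, as the Bayesian inference framework prescribes, to adapt to user preferences.

Motivation: Study whether LLMs, during interaction, can progressively refine their internal representations and predictions through Bayesian inference as humans do.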

Details

Method: Uses the Bayesian inference framework to evaluate how LLMs update their beliefs, and improves their reasoning by training them to mimic the predictions of an optimal Bayesian model.

Result: The trained LLMs improve significantly on the specific recommendation task and also generalize Bayesian reasoning to other tasks.

Conclusion: LLMs can learn reasoning strategies effectively and generalize them to new domains, which in part explains their success in practice.

Abstract: Artificial intelligence systems based on large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs need to construct internal representations of the world and form probabilistic beliefs about those representations. To provide a user with personalized recommendations, for example, the LLM needs to gradually infer the user's preferences, over the course of multiple interactions. To evaluate whether contemporary LLMs are able to do so, we use the Bayesian inference framework from probability theory, which lays out the optimal way to update an agent's beliefs as it receives new information. We first show that the LLMs do not update their beliefs as expected from the Bayesian framework, and that consequently their predictions do not improve as expected as more information becomes available, even less so than we find is the case for humans. To address this issue, we teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of an optimal Bayesian model. We find that this approach not only significantly improves the LLM's performance on the particular recommendation task it is trained on, but also enables generalization to other tasks. This suggests that this method endows the LLM with broader Bayesian reasoning skills. More generally, our results indicate that LLMs can learn about reasoning strategies effectively and generalize those skills to new domains, which in part explains LLMs' empirical success.
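The belief updating that the abstract uses as its yardstick is Bayes' rule applied once per interaction: the posterior over hypotheses is proportional to the prior times the likelihood of the observed choice. The sketch below works through a toy case with two hypotheses about a user's preference. The hypotheses, the likelihood values, and the function name are all illustrative assumptions and do not come from the paper's task.

```python
def bayes_update(prior, likelihood):
    """Posterior over hypotheses: prior times likelihood, renormalized to sum to 1."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: value / evidence for h, value in unnormalized.items()}

# Start undecided between two hypothetical user preferences.
belief = {"prefers_cheap": 0.5, "prefers_fast": 0.5}

# Assumed probability that the user picks the cheaper option under each hypothesis.
likelihood_pick_cheap = {"prefers_cheap": 0.8, "prefers_fast": 0.2}

# The user picks the cheaper option in two consecutive interactions.
history = []
for _ in range(2):
    belief = bayes_update(belief, likelihood_pick_cheap)
    history.append(belief["prefers_cheap"])
```

After the first observation the belief in `prefers_cheap` rises from 0.5 to 0.8, and after the second to 16/17 (about 0.94). This steady sharpening with each new observation is the behavior the abstract reports that untrained LLMs fail to show.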

ProtoGS: Efficient and High-Quality Rendering with 3D Gaussian Prototypes

Zhengqing Gao,Dongting Hu,Jia-Wang Bian,Huan Fu,Yan Li,Tongliang Liu,Mingming Gong,Kun Zhang

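Task: Propose ProtoGS, which learns Gaussian prototypes to represent Gaussian primitives, substantially reducing the number of Gaussians without sacrificing visual quality.

Motivation: 3D Gaussian Splatting (3DGS) has advanced novel view synthesis but needs a large number of Gaussian primitives, which makes deployment on lightweight devices difficult. Existing compression methods fail to preserve rendering quality and efficiency.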

Details

Method: ProtoGS renders efficiently with Gaussian prototypes and uses the reconstruction loss to guide prototype learning; it groups Gaussian primitives using SfM points as anchors and derives the prototypes with K-means clustering.

Result: Outperforms existing methods on real-world and synthetic datasets, substantially reducing the number of Gaussians while maintaining or improving rendering quality and achieving high rendering speed.

Conclusion: With Gaussian prototypes and anchor-based optimization, ProtoGS addresses the Gaussian-count problem of 3DGS and makes deployment on lightweight devices more practical.

Abstract: 3D Gaussian Splatting (3DGS) has made significant strides in novel view synthesis but is limited by the substantial number of Gaussian primitives required, posing challenges for deployment on lightweight devices. Recent methods address this issue by compressing the storage size of densified Gaussians, yet fail to preserve rendering quality and efficiency. To overcome these limitations, we propose ProtoGS to learn Gaussian prototypes to represent Gaussian primitives, significantly reducing the total Gaussian amount without sacrificing visual quality. Our method directly uses Gaussian prototypes to enable efficient rendering and leverages the resulting reconstruction loss to guide prototype learning. To further optimize memory efficiency during training, we incorporate structure-from-motion (SfM) points as anchor points to group Gaussian primitives. Gaussian prototypes are derived within each group by K-means clustering, and both the anchor points and the prototypes are optimized jointly. Our experiments on real-world and synthetic datasets show that we outperform existing methods, achieving a substantial reduction in the number of Gaussians, and enabling high rendering speed while maintaining or even enhancing rendering fidelity.
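The per-group clustering step mentioned in this abstract can be sketched in isolation. The code below runs plain K-means inside each anchor group and keeps the cluster centers as prototypes. It is only an illustration of that one step: the paper also optimizes anchors and prototypes jointly with a reconstruction loss, which is not shown, and the 2D feature vectors, the deterministic initialization, and all names are hypothetical simplifications.

```python
def kmeans(points, k, iters=10):
    """Plain K-means; centers start at the first k points for reproducibility."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its closest center (squared Euclidean distance).
            j = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def prototypes_per_anchor(groups, k):
    """Cluster the primitives attached to each anchor and return k prototypes per anchor."""
    return {anchor: kmeans(features, k) for anchor, features in groups.items()}

# Toy feature vectors for the Gaussian primitives grouped under one anchor point.
groups = {"anchor_0": [(0, 0), (10, 10), (0, 1), (10, 11)]}
prototypes = prototypes_per_anchor(groups, k=2)
```

In this toy group, four primitives collapse into two prototypes, which is the kind of count reduction the abstract describes; real Gaussian primitives would carry position, covariance, opacity, and color parameters rather than two numbers.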

Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility

Suet-Ying Lam,Qingcheng Zeng,Jingyi Wu,Rob Voigt

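Task: Investigate whether large language models (LLMs) exhibit a production-interpretation distinction in language processing similar to that of humans.

Motivation: Explore whether LLMs show the asymmetry between production and interpretation that humans show in sentence processing.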

Details

Method: Uses the human production-interpretation asymmetry for implicit causality verbs as a testbed to evaluate whether instruction-tuned LLMs replicate the distinction.

Result: Some LLMs reflect the human production-interpretation asymmetry both quantitatively and qualitatively, and this behavior depends on model size and on the choice of meta-linguistic prompts.

Conclusion: Under certain conditions LLMs can exhibit a human-like production-interpretation distinction; model size and prompt choice are key factors.

Abstract: Whether large language models (LLMs) process language similarly to humans has been the subject of much theoretical and practical debate. We examine this question through the lens of the production-interpretation distinction found in human sentence processing and evaluate the extent to which instruction-tuned LLMs replicate this distinction. Using an empirically documented asymmetry between production and interpretation in humans for implicit causality verbs as a testbed, we find that some LLMs do quantitatively and qualitatively reflect human-like asymmetries between production and interpretation. We demonstrate that whether this behavior holds depends upon both model size (with larger models more likely to reflect human-like patterns) and the choice of meta-linguistic prompts used to elicit the behavior.

ProDehaze: Prompting Diffusion Models Toward Faithful Image Dehazing

Tianwen Zhou,Jing Wang,Songtao Wu,Kuanhong Xu

Task: Propose ProDehaze, a framework that uses internal image priors to guide the external priors of pretrained models, addressing hallucination in image dehazing.

Motivation: Large-scale pretrained diffusion models improve perceptual quality in image dehazing, but hallucinations often make the dehazed image unfaithful to the original.

Details

Method: Introduces two selective internal priors: a Structure-Prompted Restorer in the latent space (emphasizing structure-rich regions) and a Haze-Aware Self-Correcting Refiner in the decoding process (aligning the distributions of clearer input regions and the output).

Result: Experiments on real-world datasets show that ProDehaze achieves high-fidelity dehazing, particularly in reducing color shifts.

Conclusion: By combining internal and external priors, ProDehaze mitigates hallucination in dehazing and improves image fidelity.

Abstract: Recent approaches using large-scale pretrained diffusion models for image dehazing improve perceptual quality but often suffer from hallucination issues, producing dehazed images that are unfaithful to the original. To mitigate this, we propose ProDehaze, a framework that employs internal image priors to direct external priors encoded in pretrained models. We introduce two types of selective internal priors that prompt the model to concentrate on critical image areas: a Structure-Prompted Restorer in the latent space that emphasizes structure-rich regions, and a Haze-Aware Self-Correcting Refiner in the decoding process to align distributions between clearer input regions and the output. Extensive experiments on real-world datasets demonstrate that ProDehaze achieves high-fidelity results in image dehazing, particularly in reducing color shifts. Our code is at https://github.com/TianwenZhou/ProDehaze.

GPBench: A Comprehensive and Fine-Grained Benchmark for Evaluating Large Language Models as General Practitioners

Zheqing Li,Yiying Yang,Jiping Lang,Wenhao Jiang,Yuhang Zhao,Shuang Li,Dingqian Wang,Zhu Lin,Xuanna Li,Yuze Tang,Jiexian Qiu,Xiaolin Lu,Hongji Yu,Shuang Chen,Yuhua Bi,Xiaofei Zeng,Yixian Chen,Junrong Chen,Lin Yao

Task: Evaluate how effectively large language models (LLMs) make decisions in the daily work of general practitioners (GPs).

Motivation: The clinical proficiency of GPs varies significantly across regions and healthcare settings, and existing evaluation frameworks are mostly exam-style multiple-choice tests that lack comprehensive assessment of real-world scenarios, so an evaluation tool closer to actual practice is needed.

Details

Method: Designs GPBench, comprising test questions drawn from clinical practice and a novel evaluation framework based on the GP competency model; the questions include multiple-choice and scenario-based problems, all finely annotated by experts.

Result: Mainstream LLMs show at least ten major shortcomings in areas such as disease staging, complication recognition, treatment detail, and medication usage.

Conclusion: Existing LLMs are not yet suitable for independent, unsupervised use in real-world GP working scenarios.

Abstract: General practitioners (GPs) serve as the cornerstone of primary healthcare systems by providing continuous and comprehensive medical services. However, due to the community-oriented nature of their practice, uneven training, and resource gaps, the clinical proficiency among GPs can vary significantly across regions and healthcare settings. Currently, Large Language Models (LLMs) have demonstrated great potential in clinical and medical applications, making them a promising tool for supporting general practice. However, most existing benchmarks and evaluation frameworks focus on exam-style assessments, typically multiple-choice questions, and lack comprehensive assessment sets that accurately mirror the real-world scenarios encountered by GPs. To evaluate how effectively LLMs can make decisions in the daily work of GPs, we designed GPBench, which consists of both test questions from clinical practice and a novel evaluation framework. The test set includes multiple-choice questions that assess fundamental knowledge of general practice, as well as realistic, scenario-based problems. All questions are meticulously annotated by experts, incorporating rich fine-grained information related to clinical management. The proposed LLM evaluation framework is based on the competency model for general practice, providing a comprehensive methodology for assessing LLM performance in real-world settings. As the first large-model evaluation set targeting GP decision-making scenarios, GPBench allows us to evaluate current mainstream LLMs. Expert assessment and evaluation reveal that in areas such as disease staging, complication recognition, treatment detail, and medication usage, these models exhibit at least ten major shortcomings. Overall, existing LLMs are not yet suitable for independent use in real-world GP working scenarios without human oversight.

Meme Similarity and Emotion Detection using Multimodal Analysis

Aidos Konyspay,Pakizar Shamoi,Malika Ziyada,Zhusup Smambayev

Task: Propose a multimodal approach for comparing and analyzing the visual and textual elements of internet memes and the emotions they elicit.

Motivation: Existing research mostly focuses on either the visual or the textual element of memes alone and neglects their interplay, so an effective method is needed to fill this gap.

Details

Method: Uses the multimodal CLIP model to embed the visual and textual content of memes and extracts features from the Reddit Meme Dataset and the Memotion Dataset; validates the automated assessments with a user study and classifies meme emotions with a DistilBERT model.

Result: Agreement between the computational model and human judgments reaches 67.23%; anger and joy are the dominant emotions in memes, and motivational memes elicit stronger emotional responses.

Conclusion: The study offers a new approach to multimodal meme analysis, improves online visual communication and user experience, and informs content moderation strategies.

Abstract: Internet memes are a central element of online culture, blending images and text. While substantial research has focused on either the visual or textual components of memes, little attention has been given to their interplay. This gap raises a key question: What methodology can effectively compare memes and the emotions they elicit? Our study employs a multimodal methodological approach, analyzing both the visual and textual elements of memes. Specifically, we employ a multimodal CLIP (Contrastive Language-Image Pre-training) model for grouping similar memes based on text and visual content embeddings, enabling robust similarity assessments across modalities. Using the Reddit Meme Dataset and Memotion Dataset, we extract low-level visual features and high-level semantic features to identify similar meme pairs. To validate these automated similarity assessments, we conducted a user study with 50 participants, asking them to provide yes/no responses regarding meme similarity and their emotional reactions. The comparison of experimental results with human judgments showed a 67.23% agreement, suggesting that the computational approach aligns well with human perception. Additionally, we implemented a text-based classifier using the DistilBERT model to categorize memes into one of six basic emotions. The results indicate that anger and joy are the dominant emotions in memes, with motivational memes eliciting stronger emotional responses. This research contributes to the study of multimodal memes, enhancing both language-based and visual approaches to analyzing and improving online visual communication and user experiences. Furthermore, it provides insights for better content moderation strategies in online platforms.

Enhancing Persona Consistency for LLMs' Role-Playing using Persona-Aware Contrastive Learning

Ke Ji,Yixin Lian,Linxu Li,Jingsheng Gao,Weiyuan Li,Bin Dai

Task: Propose PCL, an annotation-free framework that aligns the behavior of large language models during role-playing to enhance their persona consistency.

Motivation: Existing LLMs lack emotion and fine-grained role awareness in dialogue generation, which limits personalized and diverse interactions, and traditional alignment methods are hard to deploy in role-playing scenarios.

Details

Method: Designs a role chain method that prompts the model to self-question based on role characteristics and dialogue context, and strengthens the role-playing strategy through iterative contrastive learning.

Result: Models equipped with PCL significantly outperform vanilla models under both automatic and human evaluation.

Conclusion: PCL improves the persona consistency of LLMs in role-playing without costly annotated data.

Abstract: In recent years, large language models (LLMs) have achieved breakthrough progress in many dialogue generation tasks. However, their lack of emotion and fine-grained role awareness further limits the model's ability to provide personalized and diverse interactions. Current methods face high costs in collecting high-quality annotated data for scenarios such as role-playing, and traditional human alignment methods are difficult to deploy due to the inherent diversity of model behavior in role-playing scenarios. Inspired by the alignment of models for safety behaviors through RLHF (Reinforcement Learning from Human Feedback), in this paper, we revisit model role-playing behavior from the perspective of persona alignment and propose a novel annotation-free framework named Persona-Aware Contrastive Learning (PCL) to align LLMs' behavior during role-playing, enhancing the model's role consistency. Specifically, we first design a role chain method to encourage the model to self-question based on the role characteristics and dialogue context to adjust personality consistency. Then, we further enhance the model's role-playing strategy through iterative contrastive learning between using role characteristics and not using them. Experiments on both black-box and white-box LLMs show that LLMs equipped with PCL significantly outperform vanilla LLMs under automatic evaluation methods (CharEval and GPT-4) and human expert evaluation.

You Only Look Once at Anytime (AnytimeYOLO): Analysis and Optimization of Early-Exits for Object-Detection

Daniel Kuhse,Harun Teper,Sebastian Buschjäger,Chien-Yao Wang,Jian-Jia Chen

Task: Propose AnytimeYOLO, a family of YOLO variants that supports interruptible inference for safety-critical real-time applications.

Motivation: Safety-critical real-time applications need inference that can be interrupted at any time, which calls for more flexible prediction.

Details

Method: Modifies the YOLO architecture through structured exploration, introduces fine-grained termination points, and proposes a novel transposed variant and two optimization algorithms.

Result: Evaluates the anytime performance and trade-offs of the design choices, proposes a new anytime quality metric, and discusses key deployment challenges.

Conclusion: AnytimeYOLO offers flexible inference for real-time applications, but deployment remains costly.

Abstract: We introduce AnytimeYOLO, a family of variants of the YOLO architecture that enables anytime object detection. Our AnytimeYOLO networks allow for interruptible inference, i.e., they provide a prediction at any point in time, a property desirable for safety-critical real-time applications. We present structured explorations to modify the YOLO architecture, enabling early termination to obtain intermediate results. We focus on providing fine-grained control through high granularity of available termination points. First, we formalize Anytime Models as a special class of prediction models that offer anytime predictions. Then, we discuss a novel transposed variant of the YOLO architecture that changes the architecture to enable better early predictions and greater freedom for the order of processing stages. Finally, we propose two optimization algorithms that, given an anytime model, can be used to determine the optimal exit execution order and the optimal subset of early-exits to select for deployment in low-resource environments. We evaluate the anytime performance and trade-offs of design choices, proposing a new anytime quality metric for this purpose. In particular, we also discuss key challenges for anytime inference that currently make its deployment costly.
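
To make the idea of interruptible inference concrete, here is a minimal sketch of an early-exit loop. It is an illustration under stated assumptions, not the AnytimeYOLO code: the compute budget is an abstract integer rather than wall-clock time, the cost of each exit head is ignored, and `anytime_predict`, `refine`, and `TOY_STAGES` are hypothetical names.

```python
def anytime_predict(stages, x, budget):
    """Run (stage_fn, exit_fn, cost) triples in order while the budget allows.

    Returns the prediction of the last exit reached and the cost spent.
    If even the first stage does not fit, the prediction is None.
    """
    hidden, prediction, spent = x, None, 0
    for stage_fn, exit_fn, cost in stages:
        if spent + cost > budget:
            break  # interrupted: fall back to the most recent early exit
        hidden = stage_fn(hidden)
        prediction = exit_fn(hidden)
        spent += cost
    return prediction, spent

def refine(h):
    """Toy stage: move the running estimate halfway towards the target 8.0."""
    return h + (8.0 - h) / 2

def identity(h):
    """Toy exit head: report the current estimate unchanged."""
    return h

# Four identical stages, each with an exit and a unit cost (assumed values).
TOY_STAGES = [(refine, identity, 1)] * 4
```

Calling `anytime_predict(TOY_STAGES, 0.0, 2)` stops after two stages and returns the intermediate estimate, while a larger budget lets later exits refine it further.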

Can LLMs Automate Fact-Checking Article Writing?

Dhruv Sahnan,David Corney,Irene Larraz,Giovanni Zagni,Ruben Miguez,Zhuohan Xie,Iryna Gurevych,Elizabeth Churchill,Tanmoy Chakraborty,Preslav Nakov

Task: Extend the automatic fact-checking pipeline to generate full fact-checking articles.

Motivation: Existing automated fact-checking systems do not produce output suitable for public dissemination, whereas human fact-checkers communicate their findings through articles.

Details

Method: Identifies article requirements through expert interviews and develops QRAFT, an LLM-based framework that mimics the human writing workflow.

Result: QRAFT outperforms other text-generation approaches but still lags considerably behind expert-written articles.

Conclusion: The work lays the groundwork for the new direction of automatically generating fact-checking articles.

Abstract: Automatic fact-checking aims to support professional fact-checkers by offering tools that can help speed up manual fact-checking. Yet, existing frameworks fail to address the key step of producing output suitable for broader dissemination to the general public: while human fact-checkers communicate their findings through fact-checking articles, automated systems typically produce little or no justification for their assessments. Here, we aim to bridge this gap. We argue for the need to extend the typical automatic fact-checking pipeline with automatic generation of full fact-checking articles. We first identify key desiderata for such articles through a series of interviews with experts from leading fact-checking organizations. We then develop QRAFT, an LLM-based agentic framework that mimics the writing workflow of human fact-checkers. Finally, we assess the practical usefulness of QRAFT through human evaluations with professional fact-checkers. Our evaluation shows that while QRAFT outperforms several previously proposed text-generation approaches, it lags considerably behind expert-written articles. We hope that our work will enable further research in this new and important direction.

Event-Based Crossing Dataset (EBCD)

Joey Mulé,Dhandeep Challagundla,Rachit Saini,Riadul Islam

Task: Present EBCD, an event-based dataset with a multi-thresholding framework, for pedestrian and vehicle detection in dynamic outdoor environments.

Motivation: Conventional event-based datasets use a fixed threshold for pixel activation, which cannot adapt to real-world changes and leads to lost detail or increased noise.

Details

Method: Captures event images at ten threshold levels, evaluates detection performance under varying sparsity and noise suppression, and benchmarks several detection architectures.

Result: EBCD enables a systematic assessment of how threshold selection affects detection performance and supports a more adaptive evaluation of event-based detection.

Conclusion: EBCD is a publicly available dataset for further research in low-latency, high-fidelity neuromorphic imaging.

Abstract: Event-based vision revolutionizes traditional image sensing by capturing asynchronous intensity variations rather than static frames, enabling ultrafast temporal resolution, sparse data encoding, and enhanced motion perception. While this paradigm offers significant advantages, conventional event-based datasets impose a fixed thresholding constraint to determine pixel activations, severely limiting adaptability to real-world environmental fluctuations. Lower thresholds retain finer details but introduce pervasive noise, whereas higher thresholds suppress extraneous activations at the expense of crucial object information. To mitigate these constraints, we introduce the Event-Based Crossing Dataset (EBCD), a comprehensive dataset tailored for pedestrian and vehicle detection in dynamic outdoor environments, incorporating a multi-thresholding framework to refine event representations. By capturing event-based images at ten distinct threshold levels (4, 8, 12, 16, 20, 30, 40, 50, 60, and 75), this dataset facilitates an extensive assessment of object detection performance under varying conditions of sparsity and noise suppression. We benchmark state-of-the-art detection architectures, including YOLOv4, YOLOv7, EfficientDet-b0, MobileNet-v1, and Histogram of Oriented Gradients (HOG), to examine the impact of threshold selection on detection performance. By offering a systematic approach to threshold variation, we foresee that EBCD fosters a more adaptive evaluation of event-based object detection, aligning diverse neuromorphic vision with real-world scene dynamics. The dataset is publicly available to propel further advancements in low-latency, high-fidelity neuromorphic imaging: https://ieee-dataport.org/documents/event-based-crossing-dataset-ebcd
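
The trade-off between threshold level and event sparsity can be sketched as follows. This is a simplified illustration, assuming an event fires when the absolute intensity change of a pixel between two frames reaches the threshold; how EBCD actually derives its event images is not specified in the abstract, and `event_mask` and `event_counts` are hypothetical helpers. Only the ten threshold values come from the abstract.

```python
# Threshold levels listed in the abstract.
THRESHOLDS = [4, 8, 12, 16, 20, 30, 40, 50, 60, 75]

def event_mask(prev, curr, threshold):
    """Binary mask: 1 where |curr - prev| >= threshold (assumed firing rule)."""
    return [
        [1 if abs(c - p) >= threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]

def event_counts(prev, curr, thresholds=THRESHOLDS):
    """Number of active pixels per threshold level for one frame pair."""
    return {t: sum(map(sum, event_mask(prev, curr, t))) for t in thresholds}
```

On any frame pair the counts can only stay equal or shrink as the threshold rises, which mirrors the abstract's point that low thresholds keep detail and noise while high thresholds suppress both.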

Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection

Chatrine Qwaider,Bashar Alhafni,Kirill Chirkunov,Nizar Habash,Ted Briscoe

Task: Propose a new framework that uses large language models and Transformers to generate synthetic Arabic essay datasets for automated essay scoring (AES).

Motivation: Arabic AES systems are held back by the shortage of annotated essay datasets, which must be addressed to improve scoring performance.

Details

Method: Prompts an LLM to generate essays at different CEFR levels and injects errors using a fine-tuned Standard Arabic BERT model, yielding a realistic, human-like essay dataset.

Result: Produces 3,040 annotated essays and a BERT-based auto-marking system; experiments show that the framework improves Arabic AES performance.

Conclusion: The framework provides a high-quality dataset and a scoring tool for Arabic AES, improving scoring efficiency and accuracy.

Abstract: Automated Essay Scoring (AES) plays a crucial role in assessing language learners' writing quality, reducing grading workload, and providing real-time feedback. Arabic AES systems are particularly challenged by the lack of annotated essay datasets. This paper presents a novel framework leveraging Large Language Models (LLMs) and Transformers to generate synthetic Arabic essay datasets for AES. We prompt an LLM to generate essays across CEFR proficiency levels and introduce controlled error injection using a fine-tuned Standard Arabic BERT model for error type prediction. Our approach produces realistic human-like essays, contributing a dataset of 3,040 annotated essays. Additionally, we develop a BERT-based auto-marking system for accurate and scalable Arabic essay evaluation. Experimental results demonstrate the effectiveness of our framework in improving Arabic AES performance.

Should we pre-train a decoder in contrastive learning for dense prediction tasks?

Sébastien Quetin,Tapotosh Ghosh,Farhad Maleki

Task: Propose DeCon, a framework that extends encoder-only self-supervised contrastive learning methods into an efficient encoder-decoder framework with joint pre-training.

Motivation: Conventional approaches pre-train only the encoder and overlook the potential benefits of jointly pre-training the encoder and decoder.

Details

Method: Updates existing architectures to accommodate a decoder and its contrastive loss, introduces a weighted encoder-decoder contrastive loss, and adapts two contrastive self-supervised frameworks.

Result: Sets a new state of the art on COCO object detection and instance segmentation, matches the state of the art on Pascal VOC semantic segmentation, and improves the encoder's representations.

Conclusion: DeCon can pre-train a decoder and strengthen the encoder's representation power, and it remains effective with heterogeneous decoder architectures and in low-data scenarios.

Abstract: Contrastive learning in self-supervised settings primarily focuses on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. This conventional approach, however, overlooks the potential benefits of jointly pre-training both the encoder and decoder. In this paper, we propose DeCon: a framework-agnostic adaptation to convert an encoder-only self-supervised learning (SSL) contrastive approach to an efficient encoder-decoder framework that can be pre-trained in a contrastive manner. We first update the existing architecture to accommodate a decoder and its respective contrastive loss. We then introduce a weighted encoder-decoder contrastive loss with non-competing objectives that facilitates the joint encoder-decoder architecture pre-training. We adapt two established contrastive SSL frameworks tailored for dense prediction tasks, achieve new state-of-the-art results in COCO object detection and instance segmentation, and match state-of-the-art performance on Pascal VOC semantic segmentation. We show that our approach allows for pre-training a decoder and enhances the representation power of the encoder and its performance in dense prediction tasks. This benefit holds across heterogeneous decoder architectures between pre-training and fine-tuning and persists in out-of-domain, limited-data scenarios.

Building Resource-Constrained Language Agents: A Korean Case Study on Chemical Toxicity Information

Hojun Cho,Donghu Kim,Soyoung Yang,Chan Lee,Hunjoo Lee,Jaegul Choo

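Task: Develop Tox-chat, a Korean chemical toxicity information agent that runs efficiently in resource-constrained environments.

Motivation: Address the challenges of deploying large language models in resource-constrained environments, particularly for specialized domains and less-common languages.

Details

Method: Proposes two innovations: (1) a context-efficient architecture that reduces token consumption through hierarchical section search; (2) a scenario-based dialogue generation method that distills tool-using capabilities from larger models.

Result: Experiments show that the fine-tuned 8B-parameter model substantially outperforms untuned models and baseline approaches in database faithfulness and user preference.

Conclusion: The work offers useful insights for researchers developing domain-specific language agents under practical constraints.

Abstract: Language agents powered by large language models (LLMs) face significant deployment challenges in resource-constrained environments, particularly for specialized domains and less-common languages. This paper presents Tox-chat, a Korean chemical toxicity information agent devised within these limitations. We propose two key innovations: a context-efficient architecture that reduces token consumption through hierarchical section search, and a scenario-based dialogue generation methodology that effectively distills tool-using capabilities from larger models. Experimental evaluations demonstrate that our fine-tuned 8B parameter model substantially outperforms both untuned models and baseline approaches, in terms of DB faithfulness and preference. Our work offers valuable insights for researchers developing domain-specific language agents under practical constraints.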

FMDConv: Fast Multi-Attention Dynamic Convolution via Speed-Accuracy Trade-off

Tianyu Zhang,Fan Wan,Haoran Duan,Kevin W. Tong,Jingjing Deng,Yang Long

Task: Propose FMDConv, a dynamic convolution method that optimizes the speed-accuracy trade-off.

Motivation: Dynamic convolution improves model accuracy but incurs high computational overhead, which limits its use in resource-constrained environments.

Details

Method: FMDConv integrates input attention, temperature-degraded kernel attention, and output attention to selectively enhance feature extraction at lower complexity.

Result: Experiments on CIFAR-10, CIFAR-100, and ImageNet show that FMDConv reduces computational cost by up to 49.8% on ResNet-18 and 42.2% on ResNet-50 while maintaining competitive accuracy.

Conclusion: FMDConv is well suited to resource-constrained real-world applications.

Abstract: Spatial convolution is fundamental in constructing deep Convolutional Neural Networks (CNNs) for visual recognition. While dynamic convolution enhances model accuracy by adaptively combining static kernels, it incurs significant computational overhead, limiting its deployment in resource-constrained environments such as federated edge computing. To address this, we propose Fast Multi-Attention Dynamic Convolution (FMDConv), which integrates input attention, temperature-degraded kernel attention, and output attention to optimize the speed-accuracy trade-off. FMDConv achieves a better balance between accuracy and efficiency by selectively enhancing feature extraction with lower complexity. Furthermore, we introduce two novel quantitative metrics, the Inverse Efficiency Score and Rate-Correct Score, to systematically evaluate this trade-off. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet demonstrate that FMDConv reduces the computational cost by up to 49.8% on ResNet-18 and 42.2% on ResNet-50 compared to prior multi-attention dynamic convolution methods while maintaining competitive accuracy. These advantages make FMDConv highly suitable for real-world, resource-constrained applications.

Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes

Sharan Maiya,Yinhong Liu,Ramit Debnath,Anna Korhonen

Task: Propose a method based on linear classifying probes to extract the preferences of large language models (LLMs) more accurately.

Motivation: When used as automated judges, LLMs exhibit unintentional biases that undermine their effectiveness.

Details

Method: Trains linear classifying probes on the differences between contrasting pairs of prompts to access the LLM's latent knowledge directly.

Result: Experiments show that both supervised and unsupervised probing outperform traditional generation-based judgement on text quality evaluation and common sense reasoning tasks at similar computational cost.

Conclusion: Linear probing offers an accurate, robust, and efficient approach for LLM-as-judge tasks and reveals how models encode judgement-relevant knowledge.

Abstract: Large Language Models (LLMs) are often used as automated judges to evaluate text, but their effectiveness can be hindered by various unintentional biases. We propose using linear classifying probes, trained by leveraging differences between contrasting pairs of prompts, to directly access LLMs' latent knowledge and extract more accurate preferences. Through extensive experiments using models of varying size from four different families and six diverse datasets assessing text quality evaluation and common sense reasoning, we demonstrate that both supervised and unsupervised probing approaches consistently outperform traditional generation-based judgement while maintaining similar computational costs. These probes generalise under domain shifts and can even outperform finetuned evaluators with the same training data size. Our results suggest linear probing offers an accurate, robust and computationally efficient approach for LLM-as-judge tasks while providing interpretable insights into how models encode judgement-relevant knowledge. Our data and code will be openly released in the future.
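
A supervised linear probe over contrast-pair differences can be sketched in a few lines. The sketch rests on loud assumptions: the "hidden states" are synthetic vectors produced by the hypothetical `make_pairs` helper (no real LLM is queried), the probe is a bias-free logistic regression trained by plain gradient descent, and the paper's unsupervised variant is not shown.

```python
import math
import random

def make_pairs(n, dim=4, seed=0):
    """Synthetic stand-in for differences of hidden states of prompt pairs.

    Axis 0 plays the role of a latent 'preference' direction; the remaining
    axes are small noise. Label 1 means the first prompt is preferred.
    """
    rng = random.Random(seed)
    diffs, labels = [], []
    for _ in range(n):
        sign = rng.choice([-1.0, 1.0])
        d = [sign * rng.uniform(0.5, 1.5)]
        d += [rng.uniform(-0.2, 0.2) for _ in range(dim - 1)]
        diffs.append(d)
        labels.append(1 if sign > 0 else 0)
    return diffs, labels

def probe_score(w, diff):
    """Linear probe output on one difference vector h(A) - h(B)."""
    return sum(wi * di for wi, di in zip(w, diff))

def train_probe(diffs, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights (no bias term) by gradient descent."""
    dim = len(diffs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for d, y in zip(diffs, labels):
            p = 1.0 / (1.0 + math.exp(-probe_score(w, d)))
            for i in range(dim):
                grad[i] += (p - y) * d[i]
        w = [wi - lr * gi / len(diffs) for wi, gi in zip(w, grad)]
    return w

def accuracy(w, diffs, labels):
    """Fraction of pairs where the sign of the probe matches the label."""
    hits = sum((probe_score(w, d) > 0) == (y == 1) for d, y in zip(diffs, labels))
    return hits / len(diffs)
```

Omitting the bias term is a design choice in this sketch: swapping the two prompts of a pair negates the difference vector, so a bias-free probe flips its preference accordingly.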

DermDiff: Generative Diffusion Model for Mitigating Racial Biases in Dermatology Diagnosis

Nusrat Munia,Abdullah-Al-Zubaer Imran

Task: Propose DermDiff, a novel generative model that produces diverse and representative dermoscopic image data to improve skin disease diagnosis.

Motivation: Existing AI models for skin disease diagnosis are built on limited and biased datasets and perform poorly on certain skin tones, a problem that needs to be addressed.

Details

Method: Uses text prompting and multimodal image-text learning to generate diverse and representative dermoscopic image data.

Result: DermDiff generates images with high fidelity and diversity and shows potential for mitigating racial biases in dermatology diagnosis.

Conclusion: DermDiff offers an effective way to address dataset bias in skin disease diagnosis and shows potential for improving diagnostic fairness.

Abstract: Skin diseases, such as skin cancer, are a significant public health issue, and early diagnosis is crucial for effective treatment. Artificial intelligence (AI) algorithms have the potential to assist in triaging benign vs malignant skin lesions and improve diagnostic accuracy. However, existing AI models for skin disease diagnosis are often developed and tested on limited and biased datasets, leading to poor performance on certain skin tones. To address this problem, we propose a novel generative model, named DermDiff, that can generate diverse and representative dermoscopic image data for skin disease diagnosis. Leveraging text prompting and multimodal image-text learning, DermDiff improves the representation of underrepresented groups (patients, diseases, etc.) in highly imbalanced datasets. Our extensive experimentation showcases the effectiveness of DermDiff in terms of high fidelity and diversity. Furthermore, downstream evaluation suggests the potential of DermDiff in mitigating racial biases for dermatology diagnosis. Our code is available at https://github.com/Munia03/DermDiff

Relation Extraction with Instance-Adapted Predicate Descriptions

Yuhang Jiang,Ramakanth Kavuluru

Task: Explore a new approach based on a dual-encoder architecture to improve performance on relation extraction.

Motivation: Although decoder-only large language models excel at generative tasks, smaller encoder models remain the preferred architecture for relation extraction, so improving their performance is worthwhile.

Details

Method: Proposes a dual-encoder architecture with a joint contrastive and cross-entropy loss, using a second encoder to compute instance-specific predicate representations.

Result: On two biomedical relation extraction datasets and two general-domain datasets, F1 improves by 1% to 2% over state-of-the-art methods.

Conclusion: A simple dual-encoder architecture with a joint loss improves relation extraction performance, and ablation studies confirm the importance of each component.

Abstract: Relation extraction (RE) is a standard information extraction task playing a major role in downstream applications such as knowledge discovery and question answering. Although decoder-only large language models are excelling in generative tasks, smaller encoder models are still the go-to architecture for RE. In this paper, we revisit fine-tuning such smaller models using a novel dual-encoder architecture with a joint contrastive and cross-entropy loss. Unlike previous methods that employ a fixed linear layer for predicate representations, our approach uses a second encoder to compute instance-specific predicate representations by infusing them with real entity spans from corresponding input instances. We conducted experiments on two biomedical RE datasets and two general domain datasets. Our approach achieved F1 score improvements ranging from 1% to 2% over state-of-the-art methods with a simple but elegant formulation. Ablation studies justify the importance of various components built into the proposed architecture.

Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks

Bhishma Dedhia,David Bourgin,Krishna Kumar Singh,Yuheng Li,Yan Kang,Zhan Xu,Niraj K. Jha,Yuchen Liu

Task: Propose Video Interface Networks (VINs), a new paradigm that addresses the high computational cost of generating long videos with Diffusion Transformers (DiTs).

Motivation: Directly training and sampling long videos is computationally challenging, and existing methods require multiple sampling chain iterations and specialized consistency modules, which is inefficient.

Details

Method: VINs add an abstraction module that enables parallel inference over video chunks, using a global semantic encoding to guide DiTs in denoising the chunks in parallel.

Result: VINs outperform existing methods in background consistency and subject coherence, use 25-40% fewer FLOPs, and receive higher ratings for video quality and temporal consistency in a user study.

Conclusion: VINs provide an efficient, high-quality approach to long video generation.

Abstract: Diffusion Transformers (DiTs) can generate short photorealistic videos, yet directly training and sampling longer videos with full attention across the video remains computationally challenging. Alternative methods break long videos down into sequential generation of short video segments, requiring multiple sampling chain iterations and specialized consistency modules. To overcome these challenges, we introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. At each diffusion step, VINs encode global semantics from the noisy input of local chunks and the encoded representations, in turn, guide DiTs in denoising chunks in parallel. The coupling of VIN and DiT is learned end-to-end on the denoising objective. Further, the VIN architecture maintains fixed-size encoding tokens that encode the input via a single cross-attention step. Disentangling the encoding tokens from the input thus enables VIN to scale to long videos and learn essential semantics. Experiments on VBench demonstrate that VINs surpass existing chunk-based methods in preserving background consistency and subject coherence. We then show via an optical flow analysis that our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation. Finally, human raters favorably assessed the overall video quality and temporal consistency of our method in a user study.

ParsiPy: NLP Toolkit for Historical Persian Texts in Python

Farhan Farsi,Parnian Fazel,Sepand Haghighi,Sadra Sabouri,Farzaneh Goshtasb,Nadia Hajipour,Ehsaneddin Asgari,Hossein Sameti

Task: Develop ParsiPy, an NLP toolkit for analyzing historical Persian languages.

Motivation: Research on historical languages faces complex orthographic systems, fragmentary textual evidence, and a lack of standardized digital representations, so dedicated NLP tools are needed to handle phonetic transcriptions and analyze ancient texts.

Details

Method: ParsiPy provides modules for tokenization, lemmatization, part-of-speech tagging, phoneme-to-transliteration conversion, and word embedding.

Result: Processing Parsig (Middle Persian) texts demonstrates the toolkit's utility and its potential for expanding computational methods in the study of historical languages.

Conclusion: The work contributes to computational philology, offering tools that can be adapted for the broader study of ancient texts and their digital preservation.

Abstract: The study of historical languages presents unique challenges due to their complex orthographic systems, fragmentary textual evidence, and the absence of standardized digital representations of text in those languages. Tackling these challenges requires specialized NLP tools to handle phonetic transcriptions and analyze ancient texts. This work introduces ParsiPy, an NLP toolkit designed to facilitate the analysis of historical Persian languages by offering modules for tokenization, lemmatization, part-of-speech tagging, phoneme-to-transliteration conversion, and word embedding. We demonstrate the utility of our toolkit through the processing of Parsig (Middle Persian) texts, highlighting its potential for expanding computational methods in the study of historical languages. Through this work, we contribute to computational philology, offering tools that can be adapted for the broader study of ancient texts and their digital preservation.

PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning

Yan Zhang,Yao Feng,Alpár Cseke,Nitin Saini,Nathan Bajandas,Nicolas Heron,Michael J. Black

Task: Develop a generative motion model that drives interactive avatars to move through 3D space in a perpetual, realistic, controllable, and responsive manner.

Motivation: Most existing motion generation methods cannot support "embodied intelligence" because of offline settings, slow speed, limited motion lengths, or unnatural movements.

Details

Method: Proposes PRIMAL, an autoregressive diffusion model trained with a two-stage paradigm: a pretraining stage that learns motion dynamics and an adaptation stage that fine-tunes motor control with a ControlNet-like adaptor.

Result: From a single-frame initial state, the model generates unbounded, realistic, and controllable motion, responds to external impulses in real time, and adapts well to few-shot personalized actions and spatial control tasks.

Conclusion: PRIMAL outperforms existing baselines and powers a responsive real-time character animation system.

Abstract: To build a motor system of the interactive avatar, it is essential to develop a generative motion model that drives the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although motion generation has been extensively studied, most methods do not support "embodied intelligence" due to their offline setting, slow speed, limited motion lengths, or unnatural movements. To overcome these limitations, we propose PRIMAL, an autoregressive diffusion model that is learned with a two-stage paradigm, inspired by recent advances in foundation models. In the pretraining stage, the model learns motion dynamics from a large number of sub-second motion segments, providing "motor primitives" from which more complex motions are built. In the adaptation phase, we employ a ControlNet-like adaptor to fine-tune the motor control for semantic action generation and spatial target reaching. Experiments show that physics effects emerge from our training. Given a single-frame initial state, our model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In addition, we can effectively and efficiently adapt our base model to few-shot personalized actions and the task of spatial control. Evaluations show that our proposed method outperforms state-of-the-art baselines. We leverage the model to create a real-time character animation system in Unreal Engine that is highly responsive and natural. Code, models, and more results are available at: https://yz-cnsdqz.github.io/eigenmotion/PRIMAL

Feather-SQL: A Lightweight NL2SQL Framework with Dual-Model Collaboration Paradigm for Small Language Models

Wenqi Pei,Hailing Xu,Hengyuan Zhao,Shizheng Hou,Han Chen,Zining Zhang,Pingyi Luo,Bingsheng He

Task: Propose Feather-SQL, a lightweight framework that improves the performance of small language models (SLMs) on natural language to SQL (NL2SQL) tasks.

Motivation: Large language models (LLMs) perform well on NL2SQL but depend on closed-source systems and high computational resources, raising data privacy and deployment concerns; small language models (SLMs) perform poorly and are incompatible with existing frameworks.

Details

Method: Feather-SQL improves SQL executability and accuracy through schema pruning and linking and through multi-path and multi-candidate generation, and it introduces the 1+1 Model Collaboration Paradigm, which pairs a general-purpose chat model with a SQL specialist model.

Result: On the BIRD dataset, Feather-SQL improves the performance of SLMs without fine-tuning by about 10% and raises the accuracy ceiling of SLMs to 54.76%.

Conclusion: Feather-SQL addresses the performance problems of SLMs on NL2SQL while remaining practical with respect to privacy and deployment.

Abstract: Natural Language to SQL (NL2SQL) has seen significant advancements with large language models (LLMs). However, these models often depend on closed-source systems and high computational resources, posing challenges in data privacy and deployment. In contrast, small language models (SLMs) struggle with NL2SQL tasks, exhibiting poor performance and incompatibility with existing frameworks. To address these issues, we introduce Feather-SQL, a new lightweight framework tailored for SLMs. Feather-SQL improves SQL executability and accuracy through (1) schema pruning and linking and (2) multi-path and multi-candidate generation. Additionally, we introduce the 1+1 Model Collaboration Paradigm, which pairs a strong general-purpose chat model with a fine-tuned SQL specialist, combining strong analytical reasoning with high-precision SQL generation. Experimental results on BIRD demonstrate that Feather-SQL improves NL2SQL performance on SLMs, with around 10% boost for models without fine-tuning. The proposed paradigm raises the accuracy ceiling of SLMs to 54.76%, highlighting its effectiveness.
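
One way multi-candidate generation can improve executability is to try the candidates against the database and keep the first one that runs. The sketch below illustrates that selection step only and is an assumption, not Feather-SQL's actual selection logic: the candidate strings are hard-coded where a model would normally produce them, and `first_executable`, `demo`, and the `singer` table are hypothetical.

```python
import sqlite3

def first_executable(conn, candidates):
    """Return (sql, rows) for the first candidate that executes without error.

    Candidates should be read-only SELECT statements; this sketch does not
    sandbox them. Returns (None, []) if every candidate fails.
    """
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # e.g. unknown column or table: try the next candidate
        return sql, rows
    return None, []

def demo():
    """Toy database plus three candidates, two of which contain schema errors."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ana", 31), ("Bo", 24)])
    candidates = [
        "SELECT nme FROM singer WHERE age > 30",    # misspelled column
        "SELECT name FROM singers WHERE age > 30",  # misspelled table
        "SELECT name FROM singer WHERE age > 30",   # valid
    ]
    return first_executable(conn, candidates)
```

Executability is only a necessary condition for correctness, so a fuller pipeline would still need a way to rank the candidates that do run.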

Is there anything left? Measuring semantic residuals of objects removed from 3D Gaussian Splatting

Simona Kocour,Assia Benbihi,Aikaterini Adam,Torsten Sattler

Task: Propose a quantitative evaluation that measures whether a removal operation in a 3D scene leaves object residuals that can be reasoned over.

Motivation: In privacy-preserving mapping, it is an open question whether inferable information remains after private elements are removed; this work fills that gap.

Details

Method: Proposes a removal refinement method based on spatial and semantic consistency and validates the proposed evaluation metrics experimentally.

Result: Experiments show that the proposed metrics are meaningful and consistent with a user study, and that the refinement method is effective.

Conclusion: This is the first work to address residuals left by removal operations in 3D scenes, providing a new tool for privacy protection.

Abstract: Searching in and editing 3D scenes has become extremely intuitive with trainable scene representations that allow linking human concepts to elements in the scene. These operations are often evaluated on the basis of how accurately the searched element is segmented or extracted from the scene. In this paper, we address the inverse problem, that is, how much of the searched element remains in the scene after it is removed. This question is particularly important in the context of privacy-preserving mapping when a user reconstructs a 3D scene and wants to remove private elements before sharing the map. To the best of our knowledge, this is the first work to address this question. To answer this, we propose a quantitative evaluation that measures whether a removal operation leaves object residuals that can be reasoned over. The scene is not private when such residuals are present. Experiments on state-of-the-art scene representations show that the proposed metrics are meaningful and consistent with the user study that we also present. We also propose a method to refine the removal based on spatial and semantic consistency.

Enhancing Retrieval Systems with Inference-Time Logical Reasoning

Felix Faltings,Wei Wei,Yujia Bao

Task: Propose a new framework that explicitly incorporates logical reasoning into the retrieval process.

Motivation: Traditional retrieval methods perform poorly on complex queries involving logical constructs such as negation, conjunction, and disjunction.

Details

Method: Extract logical reasoning structures from natural language queries and compose the individual cosine similarity scores into final document scores. Result: On synthetic and real-world benchmarks, the method consistently outperforms traditional retrieval methods and significantly improves retrieval performance on complex queries. Conclusion: The proposed method incorporates logical reasoning into retrieval without compromising computational efficiency. Abstract: Traditional retrieval methods rely on transforming user queries into vector representations and retrieving documents based on cosine similarity within an embedding space. While efficient and scalable, this approach often fails to handle complex queries involving logical constructs such as negations, conjunctions, and disjunctions. In this paper, we propose a novel inference-time logical reasoning framework that explicitly incorporates logical reasoning into the retrieval process. Our method extracts logical reasoning structures from natural language queries and then composes the individual cosine similarity scores to formulate the final document scores. This approach enables the retrieval process to handle complex logical reasoning without compromising computational efficiency. Our results on both synthetic and real-world benchmarks demonstrate that the proposed method consistently outperforms traditional retrieval methods across different models and datasets, significantly improving retrieval performance for complex queries.
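
To make "composing individual cosine similarity scores" concrete, here is a minimal sketch under explicit assumptions. The abstract does not state the composition rule, so the fuzzy-logic operators used below (`min` for AND, `max` for OR, `1 - s` for NOT), the rescaling of cosine similarity to [0, 1], the tuple-based query tree, and the two-dimensional toy embeddings are all hypothetical choices, not the paper's method.

```python
import math


def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def score(expr, doc_vec, term_vecs):
    """Score a document against a parsed logical query.

    expr is a nested tuple: ("term", name), ("not", e), ("and", e1, e2, ...)
    or ("or", e1, e2, ...). The operators are assumed, not taken from the paper.
    """
    op = expr[0]
    if op == "term":
        # Map cosine from [-1, 1] to [0, 1] so that negation is well defined.
        return (cosine(term_vecs[expr[1]], doc_vec) + 1.0) / 2.0
    if op == "not":
        return 1.0 - score(expr[1], doc_vec, term_vecs)
    if op == "and":
        return min(score(e, doc_vec, term_vecs) for e in expr[1:])
    if op == "or":
        return max(score(e, doc_vec, term_vecs) for e in expr[1:])
    raise ValueError(f"unknown operator: {op}")


# Toy embeddings: axis 0 means "cats", axis 1 means "dogs".
term_vecs = {"cats": [1.0, 0.0], "dogs": [0.0, 1.0]}
docs = {"d_cat": [1.0, 0.0], "d_both": [1.0, 1.0], "d_dog": [0.0, 1.0]}

# Query: "cats AND NOT dogs"
query = ("and", ("term", "cats"), ("not", ("term", "dogs")))
ranking = sorted(docs, key=lambda d: score(query, docs[d], term_vecs), reverse=True)
print(ranking)
```

Because each term is scored with one similarity lookup and the composition is simple arithmetic, the extra cost over plain dense retrieval is small, which is consistent with the efficiency claim in the abstract.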

Guidance Free Image Editing via Explicit Conditioning

Mehdi Noroozi,Alberto Gil Ramos,Luca Morreale,Ruchika Chavhan,Malcolm Chadwick,Abhinav Mehrotra,Sourav Bhattacharya

Task: Propose a new conditioning technique, Explicit Conditioning (EC), to reduce the computational burden of conditional diffusion models.

Motivation: Classifier Free Guidance (CFG) requires several denoising passes per time step to generate high-quality images, which makes inference computationally expensive.

Details

Method: Explicitly condition the noise distribution on the input modalities (EC) to guide the diffusion process and reduce computation. Result: On image editing tasks, EC outperforms CFG, generating diverse, high-quality images with significantly less computation. Conclusion: EC significantly improves the inference efficiency of diffusion models while maintaining generation quality. Abstract: Current sampling mechanisms for conditional diffusion models rely mainly on Classifier Free Guidance (CFG) to generate high-quality images. However, CFG requires several denoising passes in each time step, e.g., up to three passes in image editing tasks, resulting in excessive computational costs. This paper introduces a novel conditioning technique to ease the computational burden of the well-established guidance techniques, thereby significantly improving the inference time of diffusion models. We present Explicit Conditioning (EC) of the noise distribution on the input modalities to achieve this. Intuitively, we model the noise to guide the conditional diffusion model during the diffusion process. We present evaluations on image editing tasks and demonstrate that EC outperforms CFG in generating diverse high-quality images with significantly reduced computations.
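
The cost that motivates EC can be seen in a small sketch of two-condition classifier-free guidance, in the form commonly used for instruction-based image editing (for example in InstructPix2Pix). The sketch does not implement EC itself. The toy `eps` function stands in for a real denoising network, and the guidance scales are arbitrary illustrative values.

```python
def make_counting_denoiser():
    """Return a toy denoiser and a counter of how often it was called."""
    calls = {"n": 0}

    def eps(x, image_cond, text_cond):
        calls["n"] += 1
        # Toy stand-in for a network: each active condition shifts the output.
        shift = (0.5 if image_cond is not None else 0.0) + (
            0.25 if text_cond is not None else 0.0
        )
        return 0.1 * x + shift

    return eps, calls


def cfg_two_conditions(eps, x, image_cond, text_cond, s_img, s_txt):
    """Two-condition CFG: three denoiser evaluations for one sampling step."""
    e_uncond = eps(x, None, None)             # pass 1: no conditions
    e_img = eps(x, image_cond, None)          # pass 2: image condition only
    e_full = eps(x, image_cond, text_cond)    # pass 3: image and text conditions
    return e_uncond + s_img * (e_img - e_uncond) + s_txt * (e_full - e_img)


eps, calls = make_counting_denoiser()
guided = cfg_two_conditions(eps, 1.0, "source image", "edit instruction", 1.5, 7.5)
print(guided, calls["n"])
```

Every sampling step pays for three network evaluations here, so a technique that guides the model in a single pass per step would cut the denoising cost by roughly that factor.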

Satisfactory Medical Consultation based on Terminology-Enhanced Information Retrieval and Emotional In-Context Learning

Kaiwen Zuo,Jing Tang,Hanbing Qin,Binli Luo,Ligang He,Shiyan Tang

Task: Propose a new medical consultation framework that combines Terminology-Enhanced Information Retrieval (TEIR) and Emotional In-Context Learning (EICL) to improve the performance of large language models (LLMs) in the medical domain.

Motivation: Despite progress in medical consultation, LLMs still fall short of the standards set by professional consultations.

Details

Method: The TEIR module provides implicit reasoning and long-context processing, the EICL module generates highly relevant sentences by memorizing semantic and attribute information, and a large-scale dataset further strengthens the model. Result: Experiments show that the method outperforms five baseline models on BLEU and ROUGE, and that it has the potential to improve patient satisfaction in real clinical consultations. Conclusion: The TEIR and EICL modules significantly improve LLM performance in medical consultation and show practical potential. Abstract: Recent advancements in Large Language Models (LLMs) have marked significant progress in understanding and responding to medical inquiries. However, their performance still falls short of the standards set by professional consultations. This paper introduces a novel framework for medical consultation, comprising two main modules: Terminology-Enhanced Information Retrieval (TEIR) and Emotional In-Context Learning (EICL). TEIR ensures implicit reasoning through the utilization of inductive knowledge and key terminology retrieval, overcoming the limitations of restricted domain knowledge in public databases. Additionally, this module features capabilities for processing long context. The EICL module aids in generating sentences with high attribute relevance by memorizing semantic and attribute information from unlabelled corpora and applying controlled retrieval for the required information. Furthermore, a dataset comprising 803,564 consultation records was compiled in China, significantly enhancing the model's capability for complex dialogues and proactive inquiry initiation. Comprehensive experiments demonstrate the proposed method's effectiveness in extending the context window length of existing LLMs. The experimental outcomes and extensive data validate the framework's superiority over five baseline models in terms of BLEU and ROUGE performance metrics, with substantial leads in certain capabilities. Notably, ablation studies confirm the significance of the TEIR and EICL components. In addition, our new framework has the potential to significantly improve patient satisfaction in real clinical consulting situations.

Karol Chlasta,Katarzyna Wisiecka,Krzysztof Krejtz,Izabela Krejtz

Task: Develop an AI-assisted screening method for affective disorders by analysing visual attention scan paths with convolutional neural networks (CNNs).

Motivation: Well-being is a dynamic construct that fluctuates within individuals; reduced well-being is often linked to depression or anxiety disorders, which are characterised by biases in visual attention towards specific stimuli such as human faces.

Details

Method: Convert eye-tracking data into images and analyse the attention patterns with residual CNNs (ResNet). Result: Experiments show an average accuracy of 48% for a three-class system and 62% for a two-class system. Conclusion: The method could be used in rapid, ecological, and effective mental health screening systems that assess well-being through eye-tracking. Abstract: Well-being is a dynamic construct that evolves over time and fluctuates within individuals, presenting challenges for accurate quantification. Reduced well-being is often linked to depression or anxiety disorders, which are characterised by biases in visual attention towards specific stimuli, such as human faces. This paper introduces a novel approach to AI-assisted screening of affective disorders by analysing visual attention scan paths using convolutional neural networks (CNNs). Data were collected from two studies examining (1) attentional tendencies in individuals diagnosed with major depression and (2) social anxiety. These data were processed using residual CNNs through images generated from eye-gaze patterns. Experimental results, obtained with ResNet architectures, demonstrated an average accuracy of 48% for a three-class system and 62% for a two-class system. Based on these exploratory findings, we propose that this method could be employed in rapid, ecological, and effective mental health screening systems to assess well-being through eye-tracking.

Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior

Shengyun Si,Xinpeng Wang,Guangyao Zhai,Nassir Navab,Barbara Plank

Task: Study how prompting safety reflection can reduce false refusal behavior in large language models (LLMs).

Motivation: LLMs are currently made harmless mainly by training them to reject harmful requests, which can cause them to falsely refuse benign queries and reduces their usefulness.

Details

Method: Propose the Think-Before-Refusal (TBR) schema and conduct safety-aware instruction fine-tuning that incorporates safety reflection. Result: An ablation study across 15 pre-trained models shows that fine-tuning with safety reflection significantly reduces false refusals while maintaining safety and overall performance. Conclusion: Prompting safety reflection effectively reduces false refusal behavior in LLMs and improves their usefulness. Abstract: Recent advancements in large language models (LLMs) have demonstrated that fine-tuning and human alignment can render LLMs harmless. In practice, such "harmlessness" behavior is mainly achieved by training models to reject harmful requests, such as "Explain how to burn down my neighbor's house", where the model appropriately declines to respond. However, this approach can inadvertently result in false refusal, where models reject benign queries as well, such as "Tell me how to kill a Python process". In this work, we demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. Building on this finding, we introduce the Think-Before-Refusal (TBR) schema and conduct safety-aware instruction fine-tuning incorporating safety reflection. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior while maintaining safety and overall performance compared to those fine-tuned without safety reflection.

Enhancing Martian Terrain Recognition with Deep Constrained Clustering

Tejas Panambur,Mario Parente

Task: Propose a new algorithm, Deep Constrained Clustering with Metric Learning (DCCML), to improve the accuracy of Martian terrain classification.

Motivation: Martian terrain recognition is pivotal for understanding topography, geomorphology, paleoclimate, and habitability, but existing deep clustering methods struggle with natural variations in intensity, scale, and rotation.

Details

Method: DCCML combines soft must-link constraints (derived from spatial and depth similarities) with hard constraints (from stereo camera pairs and temporally adjacent images) to guide the clustering process. Result: On the Curiosity rover dataset, DCCML increases homogeneous clusters by 16.7%, reduces the Davies-Bouldin Index from 3.86 to 1.82, and raises retrieval accuracy from 86.71% to 89.86%. Conclusion: DCCML significantly improves the classification of Martian geological features and strengthens the analysis and understanding of the Martian landscape. Abstract: Martian terrain recognition is pivotal for advancing our understanding of topography, geomorphology, paleoclimate, and habitability. While deep clustering methods have shown promise in learning semantically homogeneous feature embeddings from Martian rover imagery, the natural variations in intensity, scale, and rotation pose significant challenges for accurate terrain classification. To address these limitations, we propose Deep Constrained Clustering with Metric Learning (DCCML), a novel algorithm that leverages multiple constraint types to guide the clustering process. DCCML incorporates soft must-link constraints derived from spatial and depth similarities between neighboring patches, alongside hard constraints from stereo camera pairs and temporally adjacent images. Experimental evaluation on the Curiosity rover dataset (with 150 clusters) demonstrates that DCCML increases homogeneous clusters by 16.7 percent while reducing the Davies-Bouldin Index from 3.86 to 1.82 and boosting retrieval accuracy from 86.71 percent to 89.86 percent. This improvement enables more precise classification of Martian geological features, advancing our capacity to analyze and understand the planet's landscape.
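
For readers unfamiliar with the Davies-Bouldin Index reported above, the following minimal sketch computes it from its standard definition: the average, over clusters, of the worst-case ratio of within-cluster scatter to between-centroid distance. Lower values mean tighter, better separated clusters. The two toy clusterings are made up for illustration and have nothing to do with the rover data.

```python
import math


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters, each a list of point tuples."""
    centroids = [
        tuple(sum(coord) / len(points) for coord in zip(*points)) for points in clusters
    ]
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst (largest) similarity ratio between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k


well_separated = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
close_together = [[(0, 0), (0, 2)], [(2, 0), (2, 2)]]
db_separated = davies_bouldin(well_separated)
db_close = davies_bouldin(close_together)
print(db_separated, db_close)
```

In the toy example the same clusters moved closer together give a higher index, which is the direction of change that the reported drop from 3.86 to 1.82 reverses.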

MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation

Hsin-Ling Hsu,Cong-Tinh Dao,Luning Wang,Zitao Shuai,Thao Nguyen Minh Phan,Jun-En Ding,Chun-Chieh Liao,Pengfei Hu,Xiaoxue Han,Chih-Ho Hsu,Dongsheng Luo,Wen-Chih Peng,Feng Liu,Fang-Ming Hung,Chenwei Wu

Task: Propose MedPlan, a two-stage framework based on the SOAP methodology, to improve treatment plan generation from electronic health records (EHR).

Motivation: Current approaches generate treatment plans without the sequential reasoning process used by clinicians, make little use of patient history, and fail to distinguish effectively between subjective and objective clinical information.

Details

Method: Use a two-stage architecture that first generates a clinical assessment and then generates a structured treatment plan based on the assessment and patient-specific information. Result: MedPlan significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality. Conclusion: By mirroring clinician workflows, MedPlan significantly improves treatment plan generation. Abstract: Despite recent success in applying large language models (LLMs) to electronic health records (EHR), most systems focus primarily on assessment rather than treatment planning. We identify three critical limitations in current approaches: they generate treatment plans in a single pass rather than following the sequential reasoning process used by clinicians; they rarely incorporate patient-specific historical context; and they fail to effectively distinguish between subjective and objective clinical information. Motivated by the SOAP methodology (Subjective, Objective, Assessment, Plan), we introduce MedPlan, a novel framework that structures LLM reasoning to align with real-life clinician workflows. Our approach employs a two-stage architecture that first generates a clinical assessment based on patient symptoms and objective data, then formulates a structured treatment plan informed by this assessment and enriched with patient-specific information through retrieval-augmented generation. Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.

InstructVEdit: A Holistic Approach for Instructional Video Editing

Chi Zhang,Chengjian Feng,Feng Yan,Qiming Zhang,Mingjin Zhang,Yujie Zhong,Jing Zhang,Lin Ma

Task: Propose InstructVEdit, a full-cycle instructional video editing approach that addresses the scarcity of video editing data and the limited exploration of model architectures.

Motivation: Large-scale, high-quality edited video pair data is hard to collect, which limits the availability of training data and hinders systematic exploration of model architectures.

Details

Method: Establish a reliable dataset curation workflow, improve the model architecture to raise edit quality while preserving temporal consistency, and propose an iterative refinement strategy based on real-world data. Result: Experiments show that InstructVEdit achieves state-of-the-art performance in instruction-based video editing and adapts robustly to diverse real-world scenarios. Conclusion: InstructVEdit offers an effective full-cycle solution to the challenges of instructional video editing. Abstract: Video editing according to instructions is a highly challenging task due to the difficulty in collecting large-scale, high-quality edited video pair data. This scarcity not only limits the availability of training data but also hinders the systematic exploration of model architectures and training strategies. While prior work has improved specific aspects of video editing (e.g., synthesizing a video dataset using image editing techniques or decomposed video editing training), a holistic framework addressing the above challenges remains underexplored. In this study, we introduce InstructVEdit, a full-cycle instructional video editing approach that: (1) establishes a reliable dataset curation workflow to initialize training, (2) incorporates two model architectural improvements to enhance edit quality while preserving temporal consistency, and (3) proposes an iterative refinement strategy leveraging real-world data to enhance generalization and minimize train-test discrepancies. Extensive experiments show that InstructVEdit achieves state-of-the-art performance in instruction-based video editing, demonstrating robust adaptability to diverse real-world scenarios. Project page: https://o937-blip.github.io/InstructVEdit.

WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference

Youhui Zuo,Sibo Wei,Chen Zhang,Zhuorui Liu,Wenpeng Lu,Dawei Song

Task: Propose WindowKV, a task-adaptive KV cache window selection method that reduces GPU memory consumption while preserving semantic coherence.

Motivation: Existing KV cache compression methods overlook semantic coherence and task-specific characteristics, which degrades performance.

Details

Method: Dynamically select local semantic windows and combine this with an intra-group layer KV cache index sharing strategy. Result: On the LongBench benchmark, WindowKV maintains performance while using only 12% of the original KV cache, and it achieves state-of-the-art results in the Needle-in-a-Haystack evaluation. Conclusion: WindowKV balances efficiency and performance and is suited to industrial scenarios. Abstract: With the advancements in long-context inference capabilities of large language models (LLMs), the KV cache has become one of the foundational components. However, its substantial GPU memory consumption makes KV cache compression a key technique for enabling efficient LLM inference in industrial scenarios. While recent studies have focused on optimizing the memory occupied by the KV cache, they overlook two critical factors: preserving semantic coherence and considering task-specific characteristics during compression. To address these limitations, we propose a novel task-adaptive KV cache window selection method, WindowKV. WindowKV dynamically selects local semantic windows consisting of consecutive tokens, according to task-specific characteristics, ensuring the retained KV cache captures continuous, essential context. Additionally, we introduce an intra-group layer KV cache indices sharing strategy to reduce computational overhead, achieving a balance between performance and efficiency. We rigorously evaluate WindowKV on the LongBench benchmark, and the results demonstrate that it maintains a performance comparable to full KV cache retention while using only 12% of the original KV cache, significantly reducing memory requirements. Furthermore, our method also achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.
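
The difference between keeping isolated high-scoring tokens and keeping windows of consecutive tokens can be sketched as follows. This is an illustration only: the per-token importance scores, the fixed window size, the "sum of scores" window ranking, and the function name `select_windows` are assumptions, and the task-adaptive and layer-sharing parts of WindowKV are not modelled.

```python
def select_windows(scores, window, budget):
    """Keep whole windows of consecutive tokens until the token budget is used.

    scores: per-token importance values (assumed given, e.g. from attention).
    window: number of consecutive tokens per window.
    budget: maximum number of tokens to retain in the cache.
    Returns the sorted indices of the retained tokens.
    """
    # Score each non-overlapping window by the sum of its token scores.
    windows = [
        (sum(scores[start:start + window]), start)
        for start in range(0, len(scores), window)
    ]
    n_keep = max(1, budget // window)
    top = sorted(windows, reverse=True)[:n_keep]

    kept = []
    for _, start in sorted(top, key=lambda item: item[1]):
        kept.extend(range(start, min(start + window, len(scores))))
    return kept


token_scores = [0.1, 0.1, 0.9, 0.8, 0.1, 0.2, 0.7, 0.9]
kept_indices = select_windows(token_scores, window=2, budget=4)
print(kept_indices)
```

Selecting at window granularity keeps neighbouring tokens together, which is the intuition behind the abstract's emphasis on retaining continuous context rather than scattered tokens.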

Visual Variational Autoencoder Prompt Tuning

Xi Xiao,Yunbei Zhang,Yanshuh Li,Xingjian Li,Tianyang Wang,Jihun Hamm,Xiao Wang,Min Xu

Task: Propose V$^2$APT, a visual variational autoencoder prompt tuning framework that generates dynamic, input-dependent prompts to overcome the limitations of static prompt tuning.

Motivation: Existing visual prompt tuning (VPT) methods rely mainly on static, domain-specific prompts that cannot capture the visual diversity within individual instances.

Details

Method: Use a variational autoencoder architecture to generate dynamic, input-dependent prompts by learning a latent representation of image features and decoding it into customized prompts. Result: On the FGVC, HTA, and VTAB-1k benchmarks, V$^2$APT outperforms existing PEFT methods, with a 3.2% improvement over VPT-Deep on HTA and an average gain of 2.0%. Conclusion: Dynamic prompt generation in V$^2$APT significantly improves performance on visual tasks and offers a new direction for parameter-efficient tuning. Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for adapting large vision transformers to downstream tasks without the prohibitive computational costs of full fine-tuning. While existing visual prompt tuning (VPT) methods have made significant strides, they predominantly rely on static, domain-specific prompts that fail to capture the rich visual diversity within individual instances. This paper introduces V$^2$APT (Visual Variational Autoencoder Prompt Tuning), a novel framework that generates dynamic, input-dependent prompts using a variational autoencoder architecture. By learning a latent representation of image-specific features and decoding them into customized prompts, V$^2$APT adapts to the unique visual characteristics of each input. Extensive experiments on FGVC, HTA, and VTAB-1k benchmarks demonstrate that our approach consistently outperforms state-of-the-art PEFT methods. Notably, V$^2$APT achieves +3.2% improvement over VPT-Deep on HTA, with an average performance gain of +2.0% across all three datasets.

STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

Xunguang Wang,Wenxuan Wang,Zhenlan Ji,Zongjie Li,Pingchuan Ma,Daoyuan Wu,Shuai Wang

Task: Propose STShield, a lightweight framework for real-time detection of and defense against jailbreak attacks on large language models (LLMs).

Motivation: Existing defense methods are either vulnerable to adaptive attacks or require computationally expensive auxiliary models, so a more efficient and practical solution is needed.

Details

Method: STShield introduces a single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence and uses the LLM's own alignment capabilities for detection. The framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations. Result: Experiments show that STShield defends effectively against various jailbreak attacks while maintaining the model's performance on legitimate queries, with minimal computational overhead. Conclusion: STShield is an efficient defense suitable for real-world deployment and outperforms existing approaches. Abstract: Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. While existing defense methods either suffer from adaptive attacks or require computationally expensive auxiliary models, we present STShield, a lightweight framework for real-time jailbroken judgement. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks, while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.

Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization

Zhuo Tao,Liang Li,Qi Chen,Yunbin Tu,Zheng-Jun Zha,Ming-Hsuan Yang,Yuankai Qi,Qingming Huang

Task: Natural language video localization (NLVL), which aims to localize the target moment in a video specified by a given language description.

Motivation: The point-supervised paradigm balances localization accuracy against annotation cost, but without complete annotations it is hard to align video content with language descriptions.

Details

Method: Propose the COllaborative Temporal consistEncy Learning (COTEL) framework, which combines saliency detection and moment localization and includes frame-level and segment-level temporal consistency learning modules and a cross-consistency guidance scheme. Result: On two benchmarks, the method performs favorably against state-of-the-art approaches. Conclusion: Through collaborative learning and contrastive alignment, COTEL strengthens the alignment between videos and language descriptions. Abstract: Natural language video localization (NLVL) is a crucial task in video understanding that aims to localize the target moment in videos specified by a given language description. Recently, a point-supervised paradigm has been presented to address this task, requiring only a single annotated frame within the target moment rather than complete temporal boundaries. Compared with the fully-supervised paradigm, it offers a balance between localization accuracy and annotation cost. However, due to the absence of complete annotation, it is challenging to align the video content with language descriptions, consequently hindering accurate moment prediction. To address this problem, we propose a new COllaborative Temporal consistEncy Learning (COTEL) framework that leverages the synergy between saliency detection and moment localization to strengthen the video-language alignment. Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs. Then, we design a cross-consistency guidance scheme, including a Frame-level Consistency Guidance (FCG) and a Segment-level Consistency Guidance (SCG), that enables the two temporal consistency learning paths to reinforce each other mutually. Further, we introduce a Hierarchical Contrastive Alignment Loss (HCAL) to comprehensively align the video and text query. Extensive experiments on two benchmarks demonstrate that our method performs favorably against SoTA approaches. We will release all the source codes.

Experience Retrieval-Augmentation with Electronic Health Records Enables Accurate Discharge QA

Justice Ou,Tinglin Huang,Yilun Zhao,Ziyang Yu,Peiqing Lu,Rex Ying

Task: Propose the Experience Retrieval Augmentation (ExpRAG) framework, based on electronic health records (EHR), to improve the reliability of large language models (LLMs) in clinical applications.

Motivation: Clinical case-based knowledge is critical for effective medical reasoning because it provides context grounded in real patient experiences, whereas existing retrieval-augmented generation (RAG) relies mainly on general medical knowledge from open-ended datasets.

Details

Method: ExpRAG retrieves through a coarse-to-fine process, using an EHR-based report ranker and an experience retriever to extract relevant context from other patients' discharge reports. Result: On the DischargeQA dataset, ExpRAG outperforms a text-based ranker with an average relative improvement of 5.2%. Conclusion: Case-based knowledge is valuable for medical reasoning, and ExpRAG offers an effective way to improve LLM reliability in clinical applications. Abstract: To improve the reliability of Large Language Models (LLMs) in clinical applications, retrieval-augmented generation (RAG) is extensively applied to provide factual medical knowledge. However, beyond general medical knowledge from open-ended datasets, clinical case-based knowledge is also critical for effective medical reasoning, as it provides context grounded in real-world patient experiences. Motivated by this, we propose the Experience Retrieval Augmentation (ExpRAG) framework based on Electronic Health Records (EHR), aiming to offer the relevant context from other patients' discharge reports. ExpRAG performs retrieval through a coarse-to-fine process, utilizing an EHR-based report ranker to efficiently identify similar patients, followed by an experience retriever to extract task-relevant content for enhanced medical reasoning. To evaluate ExpRAG, we introduce DischargeQA, a clinical QA dataset with 1,280 discharge-related questions across diagnosis, medication, and instruction tasks. Each problem is generated using EHR data to ensure realistic and challenging scenarios. Experimental results demonstrate that ExpRAG consistently outperforms a text-based ranker, achieving an average relative improvement of 5.2%, highlighting the importance of case-based knowledge for medical reasoning.

Efficient Diffusion Training through Parallelization with Truncated Karhunen-Loève Expansion

Yumeng Ren,Yaofang Liu,Aitor Artola,Laurent Mertz,Raymond H. Chan,Jean-michel Morel

Task: Propose KL diffusion, a method based on the Karhunen-Loève expansion, to address slow convergence when training diffusion denoising models.

Motivation: Diffusion denoising models perform well in image generation but converge slowly during training, partly because of the complexity of the Brownian motion driving the forward process.

Details

Method: Represent the Brownian motion with the Karhunen-Loève expansion truncated to a limited number of eigenfunctions, propose KL diffusion as a new forward process, and design a matching denoising loss function. Result: KL diffusion significantly outperforms baseline models, reaching the baseline's best FID score twice as fast and ultimately achieving lower FID scores; it supports highly parallelized computation and requires no additional learnable parameters. Conclusion: KL diffusion is efficient and flexible and can be integrated into existing diffusion models to improve performance. Abstract: Diffusion denoising models have become a popular approach for image generation, but they often suffer from slow convergence during training. In this paper, we identify that this slow convergence is partly due to the complexity of the Brownian motion driving the forward-time process. To address this, we represent the Brownian motion using the Karhunen-Loève expansion, truncating it to a limited number of eigenfunctions. We propose a novel ordinary differential equation with augmented random initials, termed KL diffusion, as a new forward-time process for training and sampling. By developing an appropriate denoising loss function, we facilitate the integration of our KL-diffusion into existing denoising-based models. Using the widely adopted DDIM framework as our baseline ensures a fair comparison, as our modifications focus solely on the forward process and loss function, leaving the network architecture and sampling methods unchanged. Our method significantly outperforms baseline diffusion models, reaching the best FID score of the baseline twice as fast and ultimately yielding much lower FID scores. Notably, our approach allows for highly parallelized computation, requires no additional learnable parameters, and can be flexibly integrated into existing diffusion methods. The code will be made publicly available.
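
As background for the truncation step, the textbook Karhunen-Loève expansion of standard Brownian motion on [0, 1] writes the process as a sum of sine eigenfunctions weighted by independent standard normal coefficients. The sketch below implements only that classical expansion; the paper's ODE, loss function, and choice of truncation level are not reproduced, and the function names and the value of `n_terms` are illustrative assumptions.

```python
import math
import random


def kl_brownian_path(times, n_terms, rng):
    """Sample a truncated Karhunen-Loeve approximation of Brownian motion on [0, 1].

    W(t) is approximated by sum_k Z_k * sqrt(2) * sin((k + 1/2) * pi * t) / ((k + 1/2) * pi)
    for k = 0 .. n_terms - 1, with Z_k independent standard normals.
    """
    z = [rng.gauss(0.0, 1.0) for _ in range(n_terms)]
    return [
        sum(
            zk * math.sqrt(2.0) * math.sin((k + 0.5) * math.pi * t) / ((k + 0.5) * math.pi)
            for k, zk in enumerate(z)
        )
        for t in times
    ]


def truncated_variance(t, n_terms):
    """Variance of the truncated expansion at time t; tends to t as n_terms grows."""
    return sum(
        2.0 * math.sin((k + 0.5) * math.pi * t) ** 2 / ((k + 0.5) * math.pi) ** 2
        for k in range(n_terms)
    )


rng = random.Random(0)
times = [i / 10 for i in range(11)]
path = kl_brownian_path(times, n_terms=50, rng=rng)
print(truncated_variance(1.0, 1000), truncated_variance(0.5, 1000))
```

Once the finite set of coefficients is drawn, the value at every time is a deterministic function of them, so all time steps can be evaluated independently; this is one way to read the abstract's remark about highly parallelized computation.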

An Empirical Study of the Role of Incompleteness and Ambiguity in Interactions with Large Language Models

Riya Naik,Ashwin Srinivasan,Estrid He,Swati Agarwal

Task: Study when multi-turn interaction with a large language model (LLM) is needed to get a question answered successfully or to conclude that it is unanswerable.

Motivation: LLMs have greatly increased the viability of natural language as a medium for human-computer interaction, but how multi-turn interaction can improve question answering remains to be explored.

Details

Method: Propose a neural symbolic framework that models interactions between human and LLM agents, and define incompleteness and ambiguity of questions from the messages exchanged in the interaction. Result: Experiments show that multi-turn interactions are usually required for datasets with a high proportion of incomplete or ambiguous questions, and that longer interactions reduce incompleteness and ambiguity. Conclusion: The proposed measures of incompleteness and ambiguity can serve as useful tools for characterising question-answering interactions with an LLM. Abstract: Natural language as a medium for human-computer interaction has long been anticipated, and has been undergoing a sea-change with the advent of Large Language Models (LLMs) with startling capacities for processing and generating language. Many of us now treat LLMs as modern-day oracles, asking them almost any kind of question. Unlike its Delphic predecessor, consulting an LLM does not have to be a single-turn activity (ask a question, receive an answer, leave); and, also unlike the Pythia, it is widely acknowledged that answers from LLMs can be improved with additional context. In this paper, we aim to study when we need multi-turn interactions with LLMs to successfully get a question answered, or to conclude that a question is unanswerable. We present a neural symbolic framework that models the interactions between human and LLM agents. Through the proposed framework, we define incompleteness and ambiguity in the questions as properties deducible from the messages exchanged in the interaction, and provide results from benchmark problems, in which the answer-correctness is shown to depend on whether or not questions demonstrate the presence of incompleteness or ambiguity (according to the properties we identify). Our results show multi-turn interactions are usually required for datasets which have a high proportion of incomplete or ambiguous questions; and that increasing interaction length has the effect of reducing incompleteness or ambiguity. The results also suggest that our measures of incompleteness and ambiguity can be useful tools for characterising interactions with an LLM on question-answering problems.

OMR-Diffusion: Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Intent Understanding

Kun Li,Jianhui Wang,Miao Zhang,Xueqian Wang

Task: Propose a Visual Co-Adaptation (VCA) framework that uses human feedback to optimize text-driven image generation so that it better matches user preferences and intents in multi-turn dialogue.

Motivation: Generative AI struggles to keep producing images that match user preferences and intents in multi-turn dialogue scenarios, so more effective optimization methods are needed.

Details

Method: Use a reward model driven by human feedback, combine multiple reward functions (diversity, consistency, and preference feedback), and refine the diffusion model through LoRA. Result: The model achieves 508 wins in human evaluation, ahead of DALL-E 3 (463 wins), needs 3.4 dialogue rounds (vs. 13.7 for DALL-E 3), and performs well on metrics such as LPIPS (0.15) and BLIP (0.59). Conclusion: The VCA framework significantly improves image consistency and alignment with user intent compared with existing baselines. Abstract: Generative AI has significantly advanced text-driven image generation, but it still faces challenges in producing outputs that consistently align with evolving user preferences and intents, particularly in multi-turn dialogue scenarios. In this research, we present a Visual Co-Adaptation (VCA) framework that incorporates human-in-the-loop feedback, utilizing a well-trained reward model specifically designed to closely align with human preferences. Using a diverse multi-turn dialogue dataset, the framework applies multiple reward functions (such as diversity, consistency, and preference feedback) to refine the diffusion model through LoRA, effectively optimizing image generation based on user input. We also constructed multi-round dialogue datasets with prompts and image pairs that fit user intent well. Experiments show the model achieves 508 wins in human evaluation, outperforming DALL-E 3 (463 wins) and others. It also achieves 3.4 rounds in dialogue efficiency (vs. 13.7 for DALL-E 3) and excels in metrics like LPIPS (0.15) and BLIP (0.59). Various experiments demonstrate the effectiveness of the proposed method over state-of-the-art baselines, with significant improvements in image consistency and alignment with user intent.

SLIDE: Sliding Localized Information for Document Extraction

Divyansh Singh,Manuel Nunez Martinez,Bonnie J. Dorr,Sonja Schmer Galunder

Task: Propose SLIDE, a chunking method for building accurate knowledge graphs from long texts and low-resource languages.

Motivation: Large language models (LLMs) degrade on long texts and low-resource languages, in particular because truncated information leads to inaccurate entity and relationship extraction.

Details

Method: SLIDE generates local context through overlapping windows, retaining key information and improving knowledge graph construction. Result: SLIDE significantly improves GraphRAG performance: for English, entity extraction improves by 24% and relationship extraction by 39%; for Afrikaans, a low-resource language, entity extraction improves by 49% and relationship extraction by 82%. Conclusion: SLIDE performs well in multilingual and resource-constrained settings and improves question-answering metrics such as comprehensiveness, diversity, and empowerment. Abstract: Constructing accurate knowledge graphs from long texts and low-resource languages is challenging, as large language models (LLMs) experience degraded performance with longer input chunks. This problem is amplified in low-resource settings where data scarcity hinders accurate entity and relationship extraction. Contextual retrieval methods, while improving retrieval accuracy, struggle with long documents. They truncate critical information in texts exceeding maximum context lengths of LLMs, significantly limiting knowledge graph construction. We introduce SLIDE (Sliding Localized Information for Document Extraction), a chunking method that processes long documents by generating local context through overlapping windows. SLIDE ensures that essential contextual information is retained, enhancing knowledge graph extraction from documents exceeding LLM context limits. It significantly improves GraphRAG performance, achieving a 24% increase in entity extraction and a 39% improvement in relationship extraction for English. For Afrikaans, a low-resource language, SLIDE achieves a 49% increase in entity extraction and an 82% improvement in relationship extraction. Furthermore, it improves upon state-of-the-art in question-answering metrics such as comprehensiveness, diversity and empowerment, demonstrating its effectiveness in multilingual and resource-constrained settings.
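
Overlapping-window chunking, the general mechanism the abstract refers to, can be sketched in a few lines. This is a generic illustration rather than SLIDE itself: the function name `overlapping_windows`, the unit of chunking (a list of sentences or tokens), and the window size and stride are all assumptions.

```python
def overlapping_windows(items, size, stride):
    """Split a sequence into windows of `size` items that advance by `stride`.

    With stride < size, consecutive windows share size - stride items, so text
    near a boundary is seen with context from both sides.
    """
    if size <= 0 or stride <= 0 or stride > size:
        raise ValueError("require 0 < stride <= size")
    if not items:
        return []
    windows = []
    start = 0
    while True:
        windows.append(items[start:start + size])
        if start + size >= len(items):
            break  # the last window already reaches the end of the sequence
        start += stride
    return windows


sentences = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
chunks = overlapping_windows(sentences, size=4, stride=2)
print(chunks)
```

Each chunk stays below a chosen length, so nothing has to be truncated to fit a model's context limit, while the overlap keeps entities that straddle a boundary visible in at least one chunk together with their surroundings.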

3D Modeling: Camera Movement Estimation and path Correction for SFM Model using the Combination of Modified A-SIFT and Stereo System

Usha Kumari,Shuvendu Rana

Task: Propose a modified ASIFT method and a two-camera model to estimate and correct the camera movement trajectory accurately, in order to produce accurate 3D models.

Motivation: Address the challenges that large viewpoint variations, computational complexity, and alignment discrepancies pose for 3D modeling, and improve the efficiency and accuracy of camera path generation.

Details

Method: Modify ASIFT to extract more matching points with less computational overhead; introduce a two-camera rotation correction model and a stereo camera translation estimation model; combine ASIFT with the two-camera SFM model. Result: Experiments show that the estimated camera movement trajectory reaches 99.9% accuracy and outperforms existing methods. Conclusion: The method provides a robust solution for 3D reconstruction applications that require high fidelity and efficiency. Abstract: Creating accurate and efficient 3D models poses significant challenges, particularly in addressing large viewpoint variations, computational complexity, and alignment discrepancies. Efficient camera path generation can help resolve these issues. In this context, a modified version of the Affine Scale-Invariant Feature Transform (ASIFT) is proposed to extract more matching points with reduced computational overhead, ensuring an adequate number of inliers for precise camera rotation angle estimation. Additionally, a novel two-camera-based rotation correction model is introduced to mitigate small rotational errors, further enhancing accuracy. Furthermore, a stereo camera-based translation estimation and correction model is implemented to determine camera movement in 3D space by altering the Structure From Motion (SFM) model. Finally, the novel combination of ASIFT and two camera-based SFM models provides an accurate camera movement trajectory in 3D space. Experimental results show that the proposed camera movement approach achieves 99.9% accuracy compared to the actual camera movement path and outperforms state-of-the-art camera path estimation methods. By leveraging this accurate camera path, the system facilitates the creation of precise 3D models, making it a robust solution for applications requiring high fidelity and efficiency in 3D reconstruction.

Won: Establishing Best Practices for Korean Financial NLP

Guijin Son,Hyunwoo Ko,Haneral Jung,Chami Hwang

Task: Present an open leaderboard for evaluating Korean large language models in the financial domain.

Motivation: Advance financial large language models for Korean and other languages by providing better evaluation resources and best practices.

Details

Method: Evaluate 1,119 submissions over about eight weeks on a closed benchmark covering five MCQA categories and one open-ended QA task, and release an open instruction dataset of 80k instances. Result: The authors release the open instruction dataset and Won, a fully open and transparent LLM, and summarize widely used training strategies. Conclusion: The contributions help advance the development of better and safer financial LLMs. Abstract: In this work, we present the first open leaderboard for evaluating Korean large language models focused on finance. Operated for about eight weeks, the leaderboard evaluated 1,119 submissions on a closed benchmark covering five MCQA categories (finance and accounting, stock price prediction, domestic company analysis, financial markets, and financial agent tasks) and one open-ended QA task. Building on insights from these evaluations, we release an open instruction dataset of 80k instances and summarize widely used training strategies observed among top-performing models. Finally, we introduce Won, a fully open and transparent LLM built using these best practices. We hope our contributions help advance the development of better and safer financial LLMs for Korean and other languages.

Yuheng Feng,Jianhui Wang,Kun Li,Sida Li,Tianyu Shi,Haoyue Han,Miao Zhang,Xueqian Wang

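Task: Address ambiguous prompts and user-intent alignment in text-to-image generation with a two-phase dialogue refinement and co-adaptation framework (TDRI).

Motivation: Existing text-to-image generation techniques still struggle with ambiguous prompts and with aligning outputs to user intent.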

Details

Method: TDRI consists of an initial generation phase and an interactive refinement phase, and iteratively incorporates user feedback through three modules: D2P, FR, and AO. Result: TDRI outperforms existing methods on human preference (33.6%) and on CLIP and BLIP alignment scores (0.338 and 0.336); user satisfaction reaches 88% after 8 rounds of feedback. Conclusion: TDRI has broad application potential in creative and industrial domains, streamlining the creative process and improving alignment with user preferences. Abstract: Although text-to-image generation technologies have made significant advancements, they still face challenges when dealing with ambiguous prompts and aligning outputs with user intent. Our proposed framework, TDRI (Two-Phase Dialogue Refinement and Co-Adaptation), addresses these issues by enhancing image generation through iterative user interaction. It consists of two phases: the Initial Generation Phase, which creates base images based on user prompts, and the Interactive Refinement Phase, which integrates user feedback through three key modules. The Dialogue-to-Prompt (D2P) module ensures that user feedback is effectively transformed into actionable prompts, which improves the alignment between user intent and model input. By evaluating generated outputs against user expectations, the Feedback-Reflection (FR) module identifies discrepancies and facilitates improvements. In an effort to ensure consistently high-quality results, the Adaptive Optimization (AO) module fine-tunes the generation process by balancing user preferences and maintaining prompt fidelity. Experimental results show that TDRI outperforms existing methods by achieving 33.6% human preference, compared to 6.2% for GPT-4 augmentation, and the highest CLIP and BLIP alignment scores (0.338 and 0.336, respectively). In iterative feedback tasks, user satisfaction increased to 88% after 8 rounds, with diminishing returns beyond 6 rounds. Furthermore, TDRI has been found to reduce the number of iterations and improve personalization in the creation of fashion products. TDRI exhibits a strong potential for a wide range of applications in the creative and industrial domains, as it streamlines the creative process and improves alignment with user preferences.

Understanding the Effects of RLHF on the Quality and Detectability of LLM-Generated Texts

Beining Xu,Arkaitz Zubiaga

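Task: Study how RLHF editing affects the quality of LLM-generated text and the performance of detectors.

Motivation: LLM-generated text is hard to distinguish from human text and may be exploited maliciously, and existing detection methods may be circumvented.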

Details

Method: Edits LLM-generated text with RLHF and analyzes the effect on text quality and detector performance. Result: RLHF improves text quality but increases detectability; training-based detectors are vulnerable to short texts and to texts containing code, while zero-shot detectors are more robust. Conclusion: Although RLHF improves text quality, it also increases detectability; detection methods need to improve to counter potential malicious use. Abstract: Large Language Models (LLMs) have demonstrated exceptional performance on a range of downstream NLP tasks by generating text that closely resembles human writing. However, the ease of achieving this similarity raises concerns about potential malicious uses at scale by bad actors, as LLM-generated text becomes increasingly difficult to discern from human text. Although detection methods have been developed to address this issue, bad actors can further manipulate LLM-generated texts to make them less detectable. In this work, we study how further editing texts with Reinforcement Learning from Human Feedback (RLHF), which aligns model outputs with human preferences, affects (a) the quality of generated texts for two tasks, and (b) the performance of LLM-generated text detectors, looking at both training-based and zero-shot detection methods. Although RLHF improves the quality of LLM-generated texts, we find that it also tends to produce more detectable, lengthy, and repetitive outputs. Additionally, we observe that training-based detectors are vulnerable to short texts and to texts that incorporate code, whereas zero-shot detectors exhibit greater robustness.

A Temporal Modeling Framework for Video Pre-Training on Video Instance Segmentation

Qing Zhong,Peng-Tao Jiang,Wen Wang,Guodong Ding,Lin Wu,Kaiqi Huang

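Task: Propose a novel video pre-training method to improve video instance segmentation (VIS) models, especially on videos with intricate instance relationships.

Motivation: Existing VIS methods usually pre-train on images and then fine-tune on videos, but the pre-trained model lacks temporal knowledge, which degrades performance.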

Details

Method: Creates diverse pseudo-video samples through consistent pseudo-video augmentations and introduces a multi-scale temporal module to strengthen the model's handling of temporal relations. Result: On common VIS benchmarks the method outperforms prior work, with a 4.0% gain in average precision on the OVIS dataset. Conclusion: The method narrows the gap between the pre-training and fine-tuning stages and improves VIS performance without constraining the model architecture. Abstract: Contemporary Video Instance Segmentation (VIS) methods typically adhere to a pre-train then fine-tune regime, where a segmentation model trained on images is fine-tuned on videos. However, the lack of temporal knowledge in the pre-trained model introduces a domain gap which may adversely affect the VIS performance. To effectively bridge this gap, we present a novel video pre-training approach to enhance VIS models, especially for videos with intricate instance relationships. Our crucial innovation focuses on reducing disparities between the pre-training and fine-tuning stages. Specifically, we first introduce consistent pseudo-video augmentations to create diverse pseudo-video samples for pre-training while maintaining the instance consistency across frames. Then, we incorporate a multi-scale temporal module to enhance the model's ability to model temporal relations through self- and cross-attention at short- and long-term temporal spans. Our approach does not set constraints on model architecture and can integrate seamlessly with various VIS methods. Experimental results on commonly adopted VIS benchmarks show that our method consistently outperforms state-of-the-art methods. Our approach achieves a notable 4.0% increase in average precision on the challenging OVIS dataset.

Instructing the Architecture Search for Spatial-temporal Sequence Forecasting with LLM

Xin Xue,Haoyi Zhou,Tianyu Chen,Shuai Zhang,Yizhou Long,Jianxin Li

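Task: Propose a novel neural architecture search (NAS) method based on large language models (LLMs) for spatial-temporal sequence forecasting (STSF).

Motivation: Existing NAS methods for STSF rely on a time-consuming data-driven approach and are limited in using background knowledge and exploring complicated search trajectories, while LLMs excel at decision-making but have not yet been applied to NAS for STSF.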

Details

Method: Elicits the LLM's capability through a multi-level enhancement mechanism: at the step level, decomposes the generation task into decision steps with prompt engineering; at the instance level, uses a one-step tuning framework and a memory bank; at the task level, designs a two-stage search that balances exploration and optimization. Result: Experiments show the method achieves competitive effectiveness with superior efficiency compared with existing NAS methods. Conclusion: LLM-based NAS offers an efficient and effective solution for STSF. Abstract: Spatial-temporal sequence forecasting (STSF) is a long-standing research problem with widespread real-world applications. Neural architecture search (NAS), which automates the neural network design, has been shown effective in tackling the STSF problem. However, the existing NAS methods for STSF focus on generating architectures in a time-consuming data-driven fashion, which heavily limits their ability to use background knowledge and explore the complicated search trajectory. Large language models (LLMs) have shown remarkable ability in decision-making with comprehensive internal world knowledge, but how they could benefit NAS for STSF remains unexplored. In this paper, we propose a novel NAS method for STSF based on LLM. Instead of directly generating architectures with the LLM, we elicit the LLM's capability with a multi-level enhancement mechanism. Specifically, on the step-level, we decompose the generation task into decision steps with powerful prompt engineering and prompt the LLM to serve as instructor for architecture search based on its internal knowledge. On the instance-level, we utilize a one-step tuning framework to quickly evaluate the architecture instance and a memory bank to accumulate knowledge to improve the LLM's search ability. On the task-level, we propose a two-stage architecture search, balancing the exploration stage and optimization stage, to reduce the possibility of being trapped in local optima. Extensive experimental results demonstrate that our method can achieve competitive effectiveness with superior efficiency against existing NAS methods for STSF.

DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion

Jinyuan Liu,Bowei Zhang,Qingyun Mei,Xingyuan Li,Yang Zou,Zhiying Jiang,Long Ma,Risheng Liu,Xin Fan

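Task: Propose DCEvo, a discriminative cross-dimensional evolutionary learning framework that improves both the visual quality and the perception accuracy of infrared and visible image fusion.

Motivation: Existing methods usually treat image fusion and subsequent high-level tasks separately, so fused images bring only limited gains in task performance and provide no constructive feedback to the fusion process.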

Details

Method: Leveraging the search capability of evolutionary learning, formulates the dual-task optimization as a multi-objective problem and introduces a Discriminative Enhancer and a Cross-Dimensional Embedding block for complementary feature learning and efficient feature integration. Result: On three benchmarks the method significantly outperforms prior work, with an average 9.32% improvement in visual quality, while also improving subsequent high-level task performance. Conclusion: By dynamically balancing loss function parameters and integrating features across dimensions, DCEvo optimizes both image fusion and task performance. Abstract: Infrared and visible image fusion integrates information from distinct spectral bands to enhance image quality by leveraging the strengths and mitigating the limitations of each modality. Existing approaches typically treat image fusion and subsequent high-level tasks as separate processes, resulting in fused images that offer only marginal gains in task performance and fail to provide constructive feedback for optimizing the fusion process. To overcome these limitations, we propose a Discriminative Cross-Dimension Evolutionary Learning Framework, termed DCEvo, which simultaneously enhances visual quality and perception accuracy. Leveraging the robust search capabilities of Evolutionary Learning, our approach formulates the optimization of dual tasks as a multi-objective problem by employing an Evolutionary Algorithm (EA) to dynamically balance loss function parameters. Inspired by visual neuroscience, we integrate a Discriminative Enhancer (DE) within both the encoder and decoder, enabling the effective learning of complementary features from different modalities. Additionally, our Cross-Dimensional Embedding (CDE) block facilitates mutual enhancement between high-dimensional task features and low-dimensional fusion features, ensuring a cohesive and efficient feature integration process. Experimental results on three benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, achieving an average improvement of 9.32% in visual quality while also enhancing subsequent high-level tasks. The code is available at https://github.com/Beate-Suy-Zhang/DCEvo.

Personalized Language Models via Privacy-Preserving Evolutionary Model Merging

Kyuyoung Kim,Jinwoo Shin,Jaehyung Kim

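Task: Propose a privacy-preserving method for personalizing large language models that directly optimizes task-specific metrics.

Motivation: Existing methods fail to directly optimize task-specific metrics and lack privacy-preservation mechanisms.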

Details

Method: Uses a gradient-free method based on evolutionary algorithms (PriME) that incorporates privacy preservation into the optimization. Result: On the LaMP benchmark, PriME improves performance by up to 45% and achieves a better privacy-utility trade-off. Conclusion: Evolutionary algorithms show potential for privacy-preserving LLM personalization. Abstract: Personalization in large language models (LLMs) seeks to tailor models to individual user or user group preferences. Prompt-based methods augment queries with user preference information, whereas training-based methods directly encode preferences into model parameters for more effective personalization. Despite achieving some success in personalizing LLMs, prior methods often fail to directly optimize task-specific metrics and lack explicit privacy-preservation mechanisms. To address these limitations, we propose Privacy-Preserving Model Merging via Evolutionary Algorithms (PriME), a novel approach to personalization that employs gradient-free methods to directly optimize task-specific metrics while preserving user privacy. By incorporating privacy preservation into optimization, PriME produces a personalized module that effectively captures the target user's preferences while minimizing the privacy risks for the users sharing their private information. Experiments on the LaMP benchmark show that PriME outperforms both prompt-based and training-based methods, achieving up to a 45% performance improvement over the prior art. Further analysis shows that PriME achieves a significantly better privacy-utility trade-off, highlighting the potential of evolutionary approaches for privacy-preserving LLM personalization.
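As a rough illustration of gradient-free model merging, the sketch below searches for merge weights over a few parameter vectors with a simple elitist evolution strategy. It is not the PriME algorithm: the privacy term is omitted entirely, and `merge`, `evolve_weights`, the toy `fitness` function, and all parameter values are hypothetical names and choices introduced only for this example.

```python
import random


def merge(modules, weights):
    """Weighted sum of equally sized parameter vectors (toy stand-in for module merging)."""
    dim = len(modules[0])
    return [sum(w * m[i] for w, m in zip(weights, modules)) for i in range(dim)]


def evolve_weights(modules, fitness, generations=50, children=8, sigma=0.1, seed=0):
    """Elitist evolution strategy: mutate the best weights, keep a child only if it scores higher.

    No gradients are used; `fitness` is any black-box score (higher is better).
    """
    rng = random.Random(seed)
    best = [1.0 / len(modules)] * len(modules)  # start from a uniform merge
    best_score = fitness(merge(modules, best))
    for _ in range(generations):
        for _ in range(children):
            candidate = [w + rng.gauss(0.0, sigma) for w in best]
            score = fitness(merge(modules, candidate))
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


# Hypothetical usage: two toy "modules" and a black-box metric that
# rewards closeness to an (assumed) target parameter vector.
modules = [[1.0, 0.0], [0.0, 1.0]]
target = [0.8, 0.2]


def fitness(params):
    return -sum((p - t) ** 2 for p, t in zip(params, target))


weights, score = evolve_weights(modules, fitness)
```

Because the search only ever calls `fitness`, the score can be any task metric, including non-differentiable ones, which is the property the abstract attributes to gradient-free optimization.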

Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

Shulei Wang,Wang Lin,Hai Huang,Hanting Wang,Sihang Cai,WenKang Han,Tao Jin,Jingyuan Chen,Jiacheng Sun,Jieming Zhu,Zhou Zhao

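Task: Propose a new training-free method to improve alignment in Transformer-based text-guided diffusion models (TGDMs).

Motivation: Existing TGDMs generate poorly aligned images for complex text prompts or multi-concept attribute binding, and earlier U-Net-based methods have limited effectiveness on Transformer architectures.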

Details

Method: Directly optimizes cross-attention maps during generation and introduces Self-Coherence Guidance, which refines the attention maps dynamically using masks from previous denoising steps. Result: Experiments show that the method significantly outperforms existing methods on coarse-grained attribute binding, fine-grained attribute binding, and style binding. Conclusion: The method achieves precise alignment without additional training and offers a new direction for improving TGDMs. Abstract: We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at https://scg-diffusion.github.io/scg-diffusion.

Investigating Recent Large Language Models for Vietnamese Machine Reading Comprehension

Anh Duc Nguyen,Hieu Minh Phi,Anh Viet Ngo,Long Hai Trieu,Thai Phuong Nguyen

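Task: Fine-tune two large language models, Llama 3 and Gemma, in the low-resource Vietnamese setting and evaluate their machine reading comprehension performance on the ViMMRC dataset.

Motivation: Explore how effective large language models are on low-resource languages such as Vietnamese, filling a gap in existing research.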

Details

Method: Fine-tunes Llama 3 and Gemma efficiently with Quantized Low-Rank Adaptation (QLoRA) and compares them with traditional BERT-based methods and larger models such as GPT-3 and GPT-3.5. Result: The fine-tuned models outperform both BERT-based methods and the larger models (GPT-3 and GPT-3.5), demonstrating the effectiveness of the fine-tuning process. Conclusion: Modern large language models can surpass traditional models in low-resource language settings while remaining suitable for resource-constrained environments. The study contributes to NLP for low-resource languages and releases the fine-tuned models. Abstract: Large Language Models (LLMs) have shown remarkable proficiency in Machine Reading Comprehension (MRC) tasks; however, their effectiveness for low-resource languages like Vietnamese remains largely unexplored. In this paper, we fine-tune and evaluate two state-of-the-art LLMs: Llama 3 (8B parameters) and Gemma (7B parameters), on ViMMRC, a Vietnamese MRC dataset. By utilizing Quantized Low-Rank Adaptation (QLoRA), we efficiently fine-tune these models and compare their performance against powerful LLM-based baselines. Although our fine-tuned models are smaller than GPT-3 and GPT-3.5, they outperform both traditional BERT-based approaches and these larger models. This demonstrates the effectiveness of our fine-tuning process, showcasing how modern LLMs can surpass the capabilities of older models like BERT while still being suitable for deployment in resource-constrained environments. Through intensive analyses, we explore various aspects of model performance, providing valuable insights into adapting LLMs for low-resource languages like Vietnamese. Our study contributes to the advancement of natural language processing in low-resource languages, and we make our fine-tuned models publicly available at: https://huggingface.co/iaiuet.

CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model

Ziyu Yao,Xuxin Cheng,Zhiqi Huang,Lei Li

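Task: Propose CountLLM, a large language model (LLM)-based framework for counting repetitive actions in videos.

Motivation: Existing methods rely on regression networks with limited representational capacity, and supervised learning on narrow training sets overfits and generalizes poorly.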

Details

Method: CountLLM takes video data and periodic text prompts as input, leverages the representational capability of pre-trained LLMs, and improves performance with a periodicity-based structured template and a progressive multimodal training paradigm. Result: On widely recognized benchmarks CountLLM shows superior performance and generalization, especially on novel actions that differ significantly from the training data. Conclusion: CountLLM offers a promising new approach to repetitive action counting. Abstract: Repetitive action counting, which aims to count periodic movements in a video, is valuable for video analysis applications such as fitness monitoring. However, existing methods largely rely on regression networks with limited representational capacity, which hampers their ability to accurately capture variable periodic patterns. Additionally, their supervised learning on narrow, limited training sets leads to overfitting and restricts their ability to generalize across diverse scenarios. To address these challenges, we propose CountLLM, the first large language model (LLM)-based framework that takes video data and periodic text prompts as inputs and outputs the desired counting value. CountLLM leverages the rich clues from explicit textual instructions and the powerful representational capabilities of pre-trained LLMs for repetitive action counting. To effectively guide CountLLM, we develop a periodicity-based structured template for instructions that describes the properties of periodicity and implements a standardized answer format to ensure consistency. Additionally, we propose a progressive multimodal training paradigm to enhance the periodicity-awareness of the LLM. Empirical evaluations on widely recognized benchmarks demonstrate CountLLM's superior performance and generalization, particularly in handling novel and out-of-domain actions that deviate significantly from the training data, offering a promising avenue for repetitive action counting.

Dynamic Task Vector Grouping for Efficient Multi-Task Prompt Tuning

Pieyi Zhang,Richong Zhang,Zhijie Nie

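Task: Improve performance on low-resource target tasks through multi-task prompt tuning.

Motivation: Existing methods usually transfer, in a single step, the soft prompt from all source tasks or from a single highly similar source task, but the best transfer performance often comes from a combination of source tasks, and task similarity changes dynamically after transfer.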

Details

Method: Proposes Dynamic Task Vector Grouping (DTVG), which measures similarity with task vectors, groups source tasks by target similarity and knowledge consistency, and dynamically updates the combination at each iteration. Result: Experiments on 26 NLP datasets show that DTVG groups similar source tasks effectively and reduces negative transfer, achieving state-of-the-art performance. Conclusion: By dynamically combining source tasks and updating similarity, DTVG significantly improves multi-task prompt tuning. Abstract: Multi-task prompt tuning utilizes multiple high-resource source tasks to improve performance on low-resource target tasks. Existing approaches transfer the soft prompt trained by combining all source tasks or a single "highly similar" source task, and do so only once. However, we find that the optimal transfer performance often comes from a combination of source tasks, which is neither one nor all. Further, we find that the similarity between source and target tasks also changes dynamically during fine-tuning after transferring, making similarity calculation in the initiation stage inadequate. To address these issues, we propose a method called Dynamic Task Vector Grouping (DTVG), whose core ideas are (1) measuring the task similarity with task vectors instead of soft prompts, (2) grouping the optimal source task combination based on two metrics, target similarity and knowledge consistency, and (3) dynamically updating the combination in each iteration step. Extensive experiments on 26 NLP datasets under different settings demonstrate that DTVG effectively groups similar source tasks while reducing negative transfer, achieving state-of-the-art performance.
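The idea of comparing tasks through task vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the DTVG implementation: it assumes a task vector is the element-wise difference between tuned and initial parameters, uses cosine similarity as the target-similarity score, and leaves out the knowledge-consistency metric and the per-iteration update. All function names, the threshold, and the toy numbers are hypothetical.

```python
import math


def task_vector(tuned, initial):
    """Assumed definition: tuned parameters minus the shared initial parameters."""
    return [t - i for t, i in zip(tuned, initial)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_sources(target_vec, source_vecs, threshold=0.5):
    """Keep the source tasks whose task vector points in a similar direction to the target's."""
    return [name for name, vec in source_vecs.items()
            if cosine(target_vec, vec) > threshold]


# Hypothetical toy prompts: every task starts from the same initial parameters.
initial = [0.0, 0.0]
target_vec = task_vector([1.0, 0.0], initial)
source_vecs = {
    "task_a": task_vector([2.0, 0.0], initial),   # same direction as the target
    "task_b": task_vector([-1.0, 0.0], initial),  # opposite direction
    "task_c": task_vector([1.0, 1.0], initial),   # partially aligned
}
group = select_sources(target_vec, source_vecs)
```

In a dynamic variant, `select_sources` would be called again after each fine-tuning step with the updated target vector, so the group can change as training proceeds.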

MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion

Yikun Ma,Yiqing Li,Jiawei Wu,Zhi Jin

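Task: Propose MotionDiff, a training-free zero-shot diffusion method for complex multi-view motion editing.

Motivation: Controllable editing with generative models is challenging, especially for motion editing, which involves spatial information; existing methods struggle with complex rotation and stretching motions and lack multi-view consistency.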

Details

Method: Uses optical flow for multi-view motion editing: the Point Kinematic Model (PKM) estimates multi-view optical flows, and the Multi-view Motion Diffusion Stage (MMDS) generates the multi-view motion results. Result: MotionDiff outperforms other physics-based generative motion editing methods in producing high-quality, multi-view consistent motion results, without retraining. Conclusion: MotionDiff is an efficient, training-free multi-view motion editing method suited to a variety of downstream tasks. Abstract: Generative models have made remarkable advancements and are capable of producing high-quality content. However, performing controllable editing with generative models remains challenging, due to their inherent uncertainty in outputs. This challenge is particularly pronounced in motion editing, which involves the processing of spatial information. While some physics-based generative methods have attempted to implement motion editing, they typically operate on single-view images with simple motions, such as translation and dragging. These methods struggle to handle complex rotation and stretching motions and ensure multi-view consistency, often necessitating resource-intensive retraining. To address these challenges, we propose MotionDiff, a training-free zero-shot diffusion method that leverages optical flow for complex multi-view motion editing. Specifically, given a static scene, users can interactively select objects of interest to add motion priors. The proposed Point Kinematic Model (PKM) then estimates corresponding multi-view optical flows during the Multi-view Flow Estimation Stage (MFES). Subsequently, these optical flows are utilized to generate multi-view motion results through decoupled motion representation in the Multi-view Motion Diffusion Stage (MMDS). Extensive experiments demonstrate that MotionDiff outperforms other physics-based generative motion editing methods in achieving high-quality multi-view consistent motion results. Notably, MotionDiff does not require retraining, enabling users to conveniently adapt it for various downstream tasks.

Long Is More Important Than Difficult for Training Reasoning Models

Si Shen,Fei Huang,Zhixiao Zhao,Chang Liu,Tiansheng Zheng,Danhao Zhu

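Task: Propose a method that improves reasoning models by decoupling reasoning length from problem difficulty.

Motivation: Difficult problems are scarce, which limits dataset size, while reasoning length has the larger effect on model performance.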

Details

Method: Shows empirically that reasoning length is the primary factor, proposes a technique for generating reasoning data of arbitrary length, and fine-tunes a model on the Long1K dataset. Result: Long1K-32B reaches 95.6% accuracy on MATH and 71.1% on GPQA, outperforming the baseline model. Conclusion: Generating long reasoning data can significantly improve reasoning models even with a small dataset. Abstract: Difficult problems, which often result in long reasoning traces, are widely recognized as key factors for enhancing the performance of reasoning models. However, such high-challenge problems are scarce, limiting the size of available datasets. In this paper, we propose a simple method to decouple the reliance on problem difficulty. First, we empirically demonstrate that reasoning length, rather than problem difficulty, primarily influences the performance of trained models. Second, we identify a scaling law on reasoning length, showing that model performance increases in a log-linear fashion as the reasoning data length grows. Finally, we introduce a straightforward technique to generate reasoning data of arbitrary length, and show that synthesized data is effective for training reasoning models. After fine-tuning the Qwen2.5-32B-Instruct language model on our Long1K dataset, we present our model, Long1K-32B, which achieves remarkable performance with only 1,000 training samples, achieving 95.6% accuracy on MATH and 71.1% on GPQA, outperforming DeepSeek-R1-Distill-Qwen-32B. The model, code, and dataset are all open-sourced, available at https://huggingface.co/ZTss/LONG1.
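A log-linear relationship of the kind mentioned in the abstract can be checked with an ordinary least-squares fit on the logarithm of the length. The abstract does not give the functional form, so the sketch below assumes score ≈ a + b · ln(length); the function names and the data points are synthetic placeholders, not results from the paper.

```python
import math


def fit_log_linear(lengths, scores):
    """Least-squares fit of score = a + b * ln(length); returns (a, b)."""
    xs = [math.log(n) for n in lengths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, length):
    """Score predicted by the fitted log-linear model at a given reasoning length."""
    return a + b * math.log(length)


# Synthetic data generated from an assumed law (a=10, b=5), for illustration only.
lengths = [1000, 2000, 4000, 8000]
scores = [10.0 + 5.0 * math.log(n) for n in lengths]
a, b = fit_log_linear(lengths, scores)
```

Under this form, each doubling of reasoning length adds the same increment, b · ln 2, to the predicted score, which is what "log-linear" growth means in practice.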

MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking

Haolin Qin,Tingfa Xu,Tianhao Li,Zhenxiang Chen,Tao Feng,Jianan Li

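Task: Present MUST, the first large-scale multispectral UAV single object tracking dataset, and UNTrack, a new tracking framework, to address the challenges of multispectral UAV tracking.

Motivation: Multispectral images (MSI) can help with small targets and occlusions in UAV tracking, but the lack of relevant datasets has hindered progress.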

Details

Method: Introduces the MUST dataset and proposes UNTrack, which encodes unified spectral, spatial, and temporal features from spectrum prompts, initial templates, and sequential searches, using an asymmetric transformer and a spectral background elimination mechanism for relationship modeling. Result: UNTrack outperforms state-of-the-art methods on multispectral UAV tracking. Conclusion: The MUST dataset and the UNTrack framework will support future research on multispectral UAV tracking. Abstract: UAV tracking faces significant challenges in real-world scenarios, such as small-size targets and occlusions, which limit the performance of RGB-based trackers. Multispectral images (MSI), which capture additional spectral information, offer a promising solution to these challenges. However, progress in this field has been hindered by the lack of relevant datasets. To address this gap, we introduce the first large-scale Multispectral UAV Single Object Tracking dataset (MUST), which includes 250 video sequences spanning diverse environments and challenges, providing a comprehensive data foundation for multispectral UAV tracking. We also propose a novel tracking framework, UNTrack, which encodes unified spectral, spatial, and temporal features from spectrum prompts, initial templates, and sequential searches. UNTrack employs an asymmetric transformer with a spectral background elimination mechanism for optimal relationship modeling and an encoder that continuously updates the spectrum prompt to refine tracking, improving both accuracy and efficiency. Extensive experiments show that our proposed UNTrack outperforms state-of-the-art UAV trackers. We believe our dataset and framework will drive future research in this area. The dataset is available at https://github.com/q2479036243/MUST-Multispectral-UAV-Single-Object-Tracking.

Mind with Eyes: from Language Reasoning to Multimodal Reasoning

Zhiyu Lin,Yifei Gao,Xian Zhao,Yunfan Yang,Jitao Sang

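Task: Systematically survey recent multimodal reasoning methods, dividing them into language-centric multimodal reasoning and collaborative multimodal reasoning.

Motivation: Multimodal reasoning can unlock the potential of language models to achieve more comprehensive, human-like cognitive capabilities.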

Details

Method: Categorizes approaches into language-centric multimodal reasoning (one-pass visual perception and active visual perception) and collaborative multimodal reasoning (action generation and state update). Result: Analyzes the technical evolution, challenges, key benchmark tasks, and evaluation metrics, and proposes future research directions. Conclusion: Provides a structured overview of multimodal reasoning research, aiming to drive further progress in the field. Abstract: Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential to achieve more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of the recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning. The former encompasses one-pass visual perception and active visual perception, where vision primarily serves a supporting role in language reasoning. The latter involves action generation and state update within the reasoning process, enabling a more dynamic interaction between modalities. Furthermore, we analyze the technical evolution of these methods, discuss their inherent challenges, and introduce key benchmark tasks and evaluation metrics for assessing multimodal reasoning performance. Finally, we provide insights into future research directions from the following two perspectives: (i) from visual-language reasoning to omnimodal reasoning and (ii) from multimodal reasoning to multimodal agents. This survey aims to provide a structured overview that will inspire further advancements in multimodal reasoning research.

MAMAT: 3D Mamba-Based Atmospheric Turbulence Removal and its Object Detection Capability

Paul Hill,Zhiming Liu,Nantheera Anantrasirichai

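Task: Propose MAMAT, a dual-module method based on the 3D Mamba architecture, for removing video distortions caused by atmospheric turbulence.

Motivation: Atmospheric turbulence degrades video quality and affects surveillance tasks such as visualization, object detection, classification, and tracking, so effective restoration and enhancement methods are needed.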

Details

Method: Uses a dual-module strategy: the first module applies deformable 3D convolutions for non-rigid registration to reduce spatial shifts, and the second module enhances contrast and detail. Result: MAMAT improves visual quality by up to 3% and object detection by 15%, outperforming existing learning-based methods. Conclusion: MAMAT improves video visualization and significantly raises object detection accuracy, narrowing the gap between visual restoration and the effectiveness of surveillance applications. Abstract: Restoration and enhancement are essential for improving the quality of videos captured under atmospheric turbulence conditions, aiding visualization, object detection, classification, and tracking in surveillance systems. In this paper, we introduce a novel Mamba-based method, the 3D Mamba-Based Atmospheric Turbulence Removal (MAMAT), which employs a dual-module strategy to mitigate these distortions. The first module utilizes deformable 3D convolutions for non-rigid registration to minimize spatial shifts, while the second module enhances contrast and detail. Leveraging the advanced capabilities of the 3D Mamba architecture, experimental results demonstrate that MAMAT outperforms state-of-the-art learning-based methods, achieving up to a 3% improvement in visual quality and a 15% boost in object detection. It not only enhances visualization but also significantly improves object detection accuracy, bridging the gap between visual restoration and the effectiveness of surveillance applications.

On the effectiveness of LLMs for automatic grading of open-ended questions in Spanish

Germán Capdehourat,Isabel Amigo,Brian Lorenzo,Joaquín Trigo

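Task: Explore how different LLMs and prompting techniques perform at automatically grading short-text answers to open-ended questions.

Motivation: Grading is a time-consuming task for educators, and timely feedback benefits learning; the emergence of LLMs offers a new approach to automatic grading.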

Details

Method: Compares different LLMs and prompting techniques in a Spanish-language setting against human-expert grades. Result: Advanced LLMs perform well in accuracy, precision, and consistency; the best combinations exceed 95% accuracy on a three-level grading task and 98% on a binary task. Conclusion: LLMs show potential for automatic grading, particularly in education applications, but prompt style significantly affects results. Abstract: Grading is a time-consuming and laborious task that educators must face. It is an important task since it provides feedback signals to learners, and it has been demonstrated that timely feedback improves the learning process. In recent years, the emergence of LLMs has opened new possibilities for automatic grading. In this paper, we explore the performance of different LLMs and prompting techniques in automatically grading short-text answers to open-ended questions. Unlike most of the literature, our study focuses on a use case where the questions, answers, and prompts are all in Spanish. Experimental results comparing automatic scores to those of human-expert evaluators show good outcomes in terms of accuracy, precision and consistency for advanced LLMs, both open and proprietary. Results are notably sensitive to prompt styles, suggesting biases toward certain words or content in the prompt. However, the best combinations of models and prompt strategies consistently surpass an accuracy of 95% in a three-level grading task, which rises to more than 98% when it is simplified to a binary right-or-wrong rating problem. This demonstrates the potential of LLMs to implement this type of automation in education applications.

GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration

Yuchen Sun,Shanhui Zhao,Tao Yu,Hao Wen,Samith Va,Mengwei Xu,Yuanchun Li,Chongyang Zhang

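Task: Present the GUI-Xplore dataset and the Xplore-Agent framework to improve the generalization of GUI agents across applications and tasks.

Motivation: Existing datasets overlook structural differences between applications and focus only on navigation tasks, which limits the generalization of GUI agents.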

Details

Method: Designs the GUI-Xplore dataset around an exploration-and-reasoning framework and proposes Xplore-Agent, which combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning. Result: Xplore-Agent improves performance by 10% in unfamiliar environments, but further work is needed for true generalization. Conclusion: GUI-Xplore and Xplore-Agent open a new direction for generalizable GUI agents, with room for improvement remaining. Abstract: GUI agents hold significant potential to enhance the experience and efficiency of human-device interaction. However, current methods face challenges in generalizing across applications (apps) and tasks, primarily due to two fundamental limitations in existing datasets. First, these datasets overlook developer-induced structural variations among apps, limiting the transferability of knowledge across diverse software environments. Second, many of them focus solely on navigation tasks, which restricts their capacity to represent comprehensive software architectures and complex user interactions. To address these challenges, we introduce GUI-Xplore, a dataset meticulously designed to enhance cross-application and cross-task generalization via an exploration-and-reasoning framework. GUI-Xplore integrates pre-recorded exploration videos providing contextual insights, alongside five hierarchically structured downstream tasks designed to comprehensively evaluate GUI agent capabilities. To fully exploit GUI-Xplore's unique features, we propose Xplore-Agent, a GUI agent framework that combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning. Further experiments indicate that Xplore-Agent achieves a 10% improvement over existing methods in unfamiliar environments, yet there remains significant potential for further enhancement towards truly generalizable GUI agents.

A Multi-Model Adaptation of Speculative Decoding for Classification

Somnath Roy,Padharthi Sreekar,Srivatsa Narasimha,Anubhav Anand

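Task: Repurpose speculative decoding from generation tasks to classification tasks.

Motivation: Improve computational efficiency with a multi-model framework while preserving classification accuracy.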

Details

Method: Uses up to three lightweight worker models and one more powerful judge model; the workers predict labels independently, a majority agreement is accepted directly, and otherwise the judge model intervenes. Result: 3B-parameter worker models agree with the judge model at a level close to 7B-parameter models and are faster and more efficient. Conclusion: The method balances efficiency and accuracy on classification tasks and is particularly suited to lightweight models. Abstract: The current study introduces a novel adaptation of speculative decoding, repurposed from generation to classification tasks. We propose a multi-model framework employing up to three lightweight worker models and a single, more robust judge model, analogous to the draft models and target model, respectively, in speculative decoding. The worker models, tasked with the bulk of the computation, independently predict discrete class labels for a given input. When a majority of worker models agree on a label, it is accepted as the final label, optimizing efficiency by bypassing the computationally expensive judge model. In cases of disagreement, the judge model intervenes to resolve the label. This approach minimizes redundant computation, leverages the redundancy of multiple workers for confidence, and confines the judge model's role to challenging cases, offering a practical balance of efficiency and accuracy. Our analysis suggests that smaller out-of-the-box instruction/chat finetuned worker models with 3 billion parameters (hereafter, 3B) demonstrate a level of alignment with judge models comparable to that of larger finetuned worker models with 7 billion parameters (hereafter, 7B) across both simple and higher order reasoning tasks. The top performing 3B worker model pair achieves an agreement rate of approximately 80-83% for sentiment and around 50-80% for similar ticket when compared to judge models. Additionally, 3B worker models provide a speedup ranging from 2.8x to 9x relative to the judge models, while 7B worker model combinations achieve a speedup ranging from 1.28x to 0.28x.
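The worker-and-judge decision rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the models are stand-in callables that map a text to a label, the strict-majority default is an assumption about what "majority" means for an arbitrary number of workers, and every name here (`classify_with_workers`, `min_votes`, the stub models) is hypothetical.

```python
from collections import Counter


def classify_with_workers(text, workers, judge, min_votes=None):
    """Return (label, used_judge).

    Each worker predicts a label independently. If at least `min_votes`
    workers agree, that label is accepted and the judge is never called;
    otherwise the judge decides.
    """
    if min_votes is None:
        min_votes = len(workers) // 2 + 1  # assumed rule: strict majority
    votes = Counter(worker(text) for worker in workers)
    label, count = votes.most_common(1)[0]
    if count >= min_votes:
        return label, False
    return judge(text), True


# Hypothetical stub models for illustration; real workers and the judge would be LLM calls.
def worker_a(text):
    return "positive"


def worker_b(text):
    return "positive"


def worker_c(text):
    return "negative"


def judge_model(text):
    return "negative"


result = classify_with_workers("great service", [worker_a, worker_b, worker_c], judge_model)
```

The second element of the returned pair records whether the judge was consulted, which makes it easy to measure how often the expensive model is bypassed on a given dataset.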

Multi-modality Anomaly Segmentation on the Road

Heng Gao,Zhuolin He,Shoumeng Qiu,Xiangyang Xue,Jian Pu

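Task: Develop a multi-modal uncertainty-based anomaly segmentation framework (MMRAS+) for autonomous driving systems.

Motivation: Current uni-modal anomaly segmentation frameworks produce high anomaly scores for non-anomalous image regions, which compromises the safety of autonomous driving systems.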

Details

Method: Introduce a text modality via the CLIP text encoder to build the multi-modal anomaly segmentation framework MMRAS+, and design an ensemble module to boost performance. Result: Superior performance on the RoadAnomaly, SMIYC, and Fishyscapes validation datasets. Conclusion: MMRAS+ is the first multi-modal anomaly segmentation solution for autonomous driving and effectively reduces high anomaly outputs for non-anomalous classes. Abstract: Semantic segmentation allows autonomous driving cars to understand the surroundings of the vehicle comprehensively. However, it is also crucial for the model to detect obstacles that may jeopardize the safety of autonomous driving systems. Based on our experiments, we find that current uni-modal anomaly segmentation frameworks tend to produce high anomaly scores for non-anomalous regions in images. Motivated by this empirical finding, we develop a multi-modal uncertainty-based anomaly segmentation framework, named MMRAS+, for autonomous driving systems. MMRAS+ effectively reduces the high anomaly outputs of non-anomalous classes by introducing a text modality via the CLIP text encoder. Indeed, MMRAS+ is the first multi-modal anomaly segmentation solution for autonomous driving. Moreover, we develop an ensemble module to further boost the anomaly segmentation performance. Experiments on RoadAnomaly, SMIYC, and Fishyscapes validation datasets demonstrate the superior performance of our method. The code is available at https://github.com/HengGao12/MMRAS_plus.

Temporal Relation Extraction in Clinical Texts: A Span-based Graph Transformer Approach

Rochana Chaturvedi,Peyman Baghershahi,Sourav Medya,Barbara Di Eugenio

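Task: Extract clinical events and their temporal relations from unstructured text.

Motivation: In the medical domain, extracting temporal information from text is essential for contextualizing events and deriving actionable insights.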

Details

Method: Propose GRAPHTREX, which combines span-based entity-relation extraction, clinical large pre-trained language models (LPLMs), and Heterogeneous Graph Transformers (HGT) to capture local and global dependencies. Result: A 5.5% improvement in tempeval F1 over the previous best method, and an 8.9% improvement on long-range relations. Conclusion: The work advances temporal information extraction and lays the groundwork for improved diagnostic and prognostic models through enhanced temporal reasoning. Abstract: Temporal information extraction from unstructured text is essential for contextualizing events and deriving actionable insights, particularly in the medical domain. We address the task of extracting clinical events and their temporal relations using the well-studied I2B2 2012 Temporal Relations Challenge corpus. This task is inherently challenging due to complex clinical language, long documents, and sparse annotations. We introduce GRAPHTREX, a novel method integrating span-based entity-relation extraction, clinical large pre-trained language models (LPLMs), and Heterogeneous Graph Transformers (HGT) to capture local and global dependencies. Our HGT component facilitates information propagation across the document through innovative global landmarks that bridge distant entities. Our method improves on the state of the art with a 5.5% improvement in the tempeval $F_1$ score over the previous best and up to 8.9% improvement on long-range relations, which present a formidable challenge. This work not only advances temporal information extraction but also lays the groundwork for improved diagnostic and prognostic models through enhanced temporal reasoning.

Normalized Matching Transformer

Abtin Pourhadi,Paul Swoboda

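Task: Propose a new sparse keypoint matching method for matching between pairs of images.

Motivation: Improve the accuracy and efficiency of keypoint matching by combining deep learning with graph neural networks.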

Details

Method: Use a fully deep-learning-based framework that combines a visual backbone, a SplineCNN graph neural network, a normalized transformer decoder, and the Sinkhorn algorithm, trained with a contrastive loss and a hyperspherical loss. Result: On the PascalVOC and SPair-71k datasets the method outperforms existing approaches by 5.1% and 2.2% respectively, while training for at least 1.7x fewer epochs. Conclusion: With a simple architecture and advanced loss functions, the method significantly improves sparse keypoint matching performance. Abstract: We present a new state-of-the-art approach for sparse keypoint matching between pairs of images. Our method consists of a fully deep-learning-based approach combining a visual backbone coupled with a SplineCNN graph neural network for feature processing and a normalized transformer decoder for decoding keypoint correspondences together with the Sinkhorn algorithm. Our method is trained using a contrastive and a hyperspherical loss for better feature representations. We additionally use data augmentation during training. This comparatively simple architecture combining extensive normalization and advanced losses outperforms current state-of-the-art approaches (BBGM, ASAR, COMMON and GMTR) on the PascalVOC and SPair-71k datasets by $5.1\%$ and $2.2\%$ respectively, while training for at least $1.7\times$ fewer epochs.

$D^2LoRA$: Data-Driven LoRA Initialization for Low Resource Tasks

Javad SeraJ,Mohammad Mahdi Mohajeri,Mohammad Javad Dousti

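Task: Study how the $D^2LoRA$ method can improve the performance of large language models in data-scarce scenarios.

Motivation: In data-scarce scenarios LoRA converges slowly, so a more efficient initialization method is needed to improve training efficiency and model performance.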

Details

Method: Propose $D^2LoRA$, a data-driven initialization method for LoRA matrices, and compare it experimentally with vanilla LoRA in terms of performance and catastrophic forgetting. Result: $D^2LoRA$ improves by 1% on the GSM8K benchmark and by 2 ROUGE points on title generation. Conclusion: $D^2LoRA$ adapts effectively to multi-task settings and reduces training cost, particularly when data is scarce. Abstract: Tuning large language models is essential for optimizing their performance across diverse applications, particularly in scenarios with limited data availability. Tuning in scarce-data scenarios is especially important given that the convergence speed of the LoRA method is lower than that of full fine-tuning. In this paper, we present an analysis of post-training methods including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO) within the context of task-specific learning using the LoRA method. Next we introduce $D^2LoRA$, a data-driven approach for initializing LoRA matrices that enhances training efficiency, especially in limited-data settings. Our experiments compare $D^2LoRA$ with vanilla LoRA in terms of performance and catastrophic forgetting under extremely data-constrained conditions. The results demonstrate that $D^2LoRA$ achieves a 1% improvement on the GSM8K benchmark and a 2-point improvement in ROUGE score in title generation tasks. $D^2LoRA$ facilitates the adaptation of LLMs to multiple tasks even when task-specific data is scarce, thereby reducing training expenses and data costs.

EMPLACE: Self-Supervised Urban Scene Change Detection

Tim Alpherts,Sennay Ghebreab,Nanne van Noord

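Task: Introduce the AC-1M dataset and the EMPLACE method for Urban Scene Change Detection (USCD).

Motivation: Traditional USCD methods rely on small-scale datasets and supervised learning, which makes them hard to transfer to new cities and costly to annotate.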

Details

Method: Train a Vision Transformer with the self-supervised method EMPLACE and an adaptive triplet loss. Result: EMPLACE outperforms existing methods both as pre-training and in a zero-shot setting, and in an Amsterdam case study it detects small and large changes and their correlation with housing prices. Conclusion: AC-1M and EMPLACE provide an efficient, scalable solution for USCD and reveal links between urban change and socio-economic factors. Abstract: Urban change is a constant process that influences the perception of neighbourhoods and the lives of the people within them. The field of Urban Scene Change Detection (USCD) aims to capture changes in street scenes using computer vision and can help raise awareness of changes that make it possible to better understand the city and its residents. Traditionally, the field of USCD has used supervised methods with small-scale datasets. This constrains methods when applied to new cities, as it requires labour-intensive labeling processes and forces a priori definitions of relevant change. In this paper we introduce AC-1M, by far the largest USCD dataset with over 1.1M images, together with EMPLACE, a self-supervised method to train a Vision Transformer using our adaptive triplet loss. We show that EMPLACE outperforms SOTA methods both as a pre-training method for linear fine-tuning and in a zero-shot setting. Lastly, in a case study of Amsterdam, we show that we are able to detect both small and large changes throughout the city and that changes uncovered by EMPLACE, depending on size, correlate with housing prices, which in turn is indicative of inequity.
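
The abstract does not specify how EMPLACE's triplet loss is made adaptive. For orientation only, the sketch below shows the standard fixed-margin triplet loss of which such losses are variants. The Euclidean distance, the fixed `margin`, and the function names are assumptions, and the adaptive component is deliberately not reproduced.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """Standard triplet loss: pull the positive closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, and it grows linearly otherwise.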

Clarifying Misconceptions in COVID-19 Vaccine Sentiment and Stance Analysis and Their Implications for Vaccine Hesitancy Mitigation: A Systematic Review

Lorena G Barberia,Belinda Lombard,Norton Trevisan Roman,Tatiane C. M. Sousa

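Task: Systematically review studies that use sentiment analysis or stance detection to examine Twitter discourse on COVID-19 vaccines.

Motivation: Advances in machine learning models make it possible to detect vaccine hesitancy on social media through natural language processing, but existing studies suffer from measurement bias.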

Details

Method: A PROSPERO-registered systematic review of studies from 2020 to 2023 that used supervised machine learning for sentiment analysis or stance detection, categorized along five dimensions. Result: Measurement bias is widespread and undermines the generalisability and interpretability of findings. Conclusion: Better reporting of NLP methods is key to closing knowledge gaps in vaccine hesitancy research. Abstract: Background: Advances in machine learning (ML) models have increased the capability of researchers to detect vaccine hesitancy in social media using Natural Language Processing (NLP). A considerable volume of research has identified the persistence of COVID-19 vaccine hesitancy in discourse shared on various social media platforms. Methods: Our objective in this study was to conduct a systematic review of research employing sentiment analysis or stance detection to study discourse towards COVID-19 vaccines and vaccination spread on Twitter (officially known as X since 2023). Following registration in the PROSPERO international registry of systematic reviews, we searched papers published from 1 January 2020 to 31 December 2023 that used supervised machine learning to assess COVID-19 vaccine hesitancy through stance detection or sentiment analysis on Twitter. We categorized the studies according to a taxonomy of five dimensions: tweet sample selection approach, self-reported study type, classification typology, annotation codebook definitions, and interpretation of results. We analyzed whether studies using stance detection report different hesitancy trends than those using sentiment analysis by examining how COVID-19 vaccine hesitancy is measured, and whether efforts were made to avoid measurement bias. Results: Our review found that measurement bias is widely prevalent in studies employing supervised machine learning to analyze sentiment and stance toward COVID-19 vaccines and vaccination. The reporting errors are sufficiently serious that they hinder the generalisability and interpretation of these studies for understanding whether individual opinions communicate reluctance to vaccinate against SARS-CoV-2.
Conclusion: Improving the reporting of NLP methods is crucial to addressing knowledge gaps in vaccine hesitancy discourse.

BackMix: Regularizing Open Set Recognition by Removing Underlying Fore-Background Priors

Yu Wang,Junxian Mu,Hongzhi Huang,Qilong Wang,Pengfei Zhu,Qinghua Hu

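Task: Propose BackMix, a method that improves open set recognition (OSR) models without elaborately selecting auxiliary known outliers.

Motivation: Existing methods regularize with unknown samples from auxiliary datasets but are sensitive to which known outliers are selected, which limits their practical use.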

Details

Method: Analyze the roles of foreground and background in open set recognition and propose BackMix, which removes fore-background priors by randomly replacing backgrounds. Result: Experiments show that BackMix significantly improves open set recognition performance while remaining simple to implement. Conclusion: BackMix offers an efficient, general solution for open set recognition that requires no extra operations at inference. Abstract: Open set recognition (OSR) requires models to classify known samples while detecting unknown samples for real-world applications. Existing studies show impressive progress using unknown samples from auxiliary datasets to regularize OSR models, but they have proved to be sensitive to the selection of such known outliers. In this paper, we discuss the aforementioned problem from a new perspective: Can we regularize OSR models without elaborately selecting auxiliary known outliers? We first empirically and theoretically explore the role of foregrounds and backgrounds in open set recognition and disclose that: 1) backgrounds that correlate with foregrounds would mislead the model and cause failures when it encounters 'partially' known images; 2) backgrounds unrelated to foregrounds can serve as auxiliary known outliers and provide regularization via global average pooling. Based on the above insights, we propose a new method, Background Mix (BackMix), that mixes the foreground of an image with different backgrounds to remove the underlying fore-background priors. Specifically, BackMix first estimates the foreground with class activation maps (CAMs), then randomly replaces image patches with backgrounds from other images to obtain mixed images for training. With backgrounds de-correlated from foregrounds, the open set recognition performance is significantly improved. The proposed method is quite simple to implement, requires no extra operation at inference, and can be seamlessly integrated into almost all of the existing frameworks. The code is released at https://github.com/Vanixxz/BackMix.
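
The two BackMix steps named in the abstract (estimate the foreground from a class activation map, then randomly swap in background patches from other images) can be illustrated on a patch grid. The sketch below is a simplified reading, not the released code: the CAM `threshold`, the replacement probability `p`, the same-sized patch grids, and the assumption that the donor patches are background regions are all choices made for the example.

```python
import random
from typing import Any, List

Grid = List[List[Any]]  # a grid of image patches; each patch can be any object


def backmix(image: Grid, cam: List[List[float]], donor: Grid,
            threshold: float = 0.5, p: float = 0.5, seed: int = 0) -> Grid:
    """Replace low-activation (background) patches with patches from a donor grid."""
    rng = random.Random(seed)
    mixed = []
    for img_row, cam_row, donor_row in zip(image, cam, donor):
        mixed.append([
            # Patches whose CAM score is at or above the threshold count as
            # foreground and are always kept; background patches are swapped
            # with probability p.
            d if (c < threshold and rng.random() < p) else x
            for x, c, d in zip(img_row, cam_row, donor_row)
        ])
    return mixed
```

Because the mixing is applied only when building training images, a sketch like this is consistent with the abstract's claim that nothing extra is needed at inference time.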

Muhidin A. Mohamed,Shuab D. Ahmed,Yahye A. Isse,Hanad M. Mohamed,Fuad M. Hassan,Houssein A. Assowe

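Task: Create the first transformer-based monolingual language model for Somali (SomBERTa) and develop two annotated datasets for fake news and toxic content classification.

Motivation: Address the challenges that low-resource languages such as Somali face in AI automation, including the lack of annotated datasets and tailored language models.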

Details

Method: Create two human-annotated Somali social media datasets and develop SomBERTa, a transformer-based monolingual language model, then fine-tune and evaluate it on several classification tasks. Result: SomBERTa outperforms other multilingual models on fake news and toxic content classification, with an average accuracy of 87.99%. Conclusion: The study provides a foundational language model and a replicable framework for Somali NLP, promoting digital inclusivity and linguistic diversity for low-resource languages. Abstract: The fact that everyone with a social media account can create and share content, and the increasing public reliance on social media platforms as a news and information source bring about significant challenges such as misinformation, fake news, harmful content, etc. Although human content moderation may be useful to an extent and is used by these platforms to flag posted materials, the use of AI models provides a more sustainable, scalable, and effective way to mitigate such harmful content. However, low-resourced languages such as the Somali language face limitations in AI automation, including scarce annotated training datasets and a lack of language models tailored to their unique linguistic characteristics. This paper presents part of our ongoing research work to bridge some of these gaps for the Somali language. In particular, we created two human-annotated social-media-sourced Somali datasets for two downstream applications, fake news and toxicity classification, and developed a transformer-based monolingual Somali language model (named SomBERTa), the first of its kind to the best of our knowledge. SomBERTa is then fine-tuned and evaluated on toxic content, fake news and news topic classification datasets. Comparative evaluation analysis of the proposed model against related multilingual models (e.g., AfriBERTa, AfroXLMR, etc.) demonstrated that SomBERTa consistently outperformed these comparators in both fake news and toxic content classification tasks while achieving the best average accuracy (87.99%) across all tasks. This research contributes to Somali NLP by offering a foundational language model and a replicable framework for other low-resource languages, promoting digital and AI inclusivity and linguistic diversity.

Towards Invisible Backdoor Attack on Text-to-Image Diffusion Model

Jie Zhang,Zhongqi Wang,Shiguang Shan,Xilin Chen

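Task: Propose a novel Invisible Backdoor Attack (IBA) that makes backdoor samples stealthier by reducing semantic consistency and attention consistency.

Motivation: Current backdoor samples leave detectable traces in semantic consistency and attention consistency that defenders can readily identify, so a stealthier attack method is needed.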

Details

Method: Use syntactic structures as backdoor triggers to increase sensitivity to textual variations, and apply a regularization based on Kernel Maximum Mean Discrepancy (KMMD) to align the cross-attention response distributions of backdoor and benign samples. Result: IBA reaches a 97.5% attack success rate, and 98% of backdoor samples bypass three state-of-the-art detection mechanisms. Conclusion: IBA significantly improves the stealthiness of backdoor attacks and their resistance to defenses. Abstract: Backdoor attacks targeting text-to-image diffusion models have advanced rapidly, enabling attackers to implant malicious triggers into these models to manipulate their outputs. However, current backdoor samples often exhibit two key abnormalities compared to benign samples: 1) Semantic Consistency, where backdoor prompts tend to generate images with similar semantic content even with significant textual variations to the prompts; 2) Attention Consistency, where the trigger induces consistent structural responses in the cross-attention maps. These consistencies leave detectable traces for defenders, making backdoors easier to identify. To enhance the stealthiness of backdoor samples, we propose a novel Invisible Backdoor Attack (IBA) by explicitly mitigating these consistencies. Specifically, our approach leverages syntactic structures as backdoor triggers to amplify the sensitivity to textual variations, effectively breaking down the semantic consistency. In addition, a regularization method based on Kernel Maximum Mean Discrepancy (KMMD) is proposed to align the distribution of cross-attention responses between backdoor and benign samples, thereby disrupting attention consistency. Extensive experiments demonstrate that our IBA achieves a 97.5% attack success rate while exhibiting stronger resistance to defenses, with an average of over 98% of backdoor samples bypassing three state-of-the-art detection mechanisms. The code is available at https://github.com/Robin-WZQ/IBA.
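
The quantity behind a KMMD-style regularizer is the kernel maximum mean discrepancy between two sample sets. The sketch below computes the biased squared-MMD estimate; the RBF kernel, its bandwidth `sigma`, and the representation of each sample as a flat list of floats are assumptions for the example, since the abstract does not state which kernel or features the authors use.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf(a: Vector, b: Vector, sigma: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(xs: Sequence[Vector], ys: Sequence[Vector], sigma: float = 1.0) -> float:
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_kernel(p: Sequence[Vector], q: Sequence[Vector]) -> float:
        return sum(rbf(a, b, sigma) for a in p for b in q) / (len(p) * len(q))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The value is zero when the two sets coincide and grows as they separate. Used as a training penalty, minimizing it over one sample set pulls that set's distribution toward the other, which is how the abstract describes the alignment of cross-attention responses.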

GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks

Varvara Krechetova,Denis Kochedykov

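Task: Establish a benchmark for evaluating large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners.

Motivation: Provide a standardized way to assess LLM capabilities on geospatial tasks, in particular solving complex tasks and rejecting hallucinations.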

Details

Method: Evaluate seven leading commercial LLMs (Sonnet 3.5 and 3.7, Haiku 3.5, Gemini 2.0, GPT-4o, GPT-4o mini, and o3-mini) with a simple tool-calling agent equipped with 23 geospatial functions; tasks fall into four categories of increasing complexity and include both solvable and intentionally unsolvable tasks. Result: Sonnet 3.5 and GPT-4o perform best overall; Claude models excel on solvable tasks, while OpenAI models are better at identifying unsolvable scenarios. Anthropic models use substantially more tokens than the others. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. Conclusion: The released benchmark set, evaluation framework, and data generation pipeline provide a standardized method for evaluating LLMs for GeoAI. Abstract: In this paper, we establish a benchmark for evaluating large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess seven leading commercial LLMs (Sonnet 3.5 and 3.7, Haiku 3.5, Gemini 2.0, GPT-4o, GPT-4o mini, and o3-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks across four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test hallucination rejection. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference implementations. Results show Sonnet 3.5 and GPT-4o achieve the best overall performance, with Claude models excelling on solvable tasks while OpenAI models better identify unsolvable scenarios. We observe significant differences in token usage, with Anthropic models consuming substantially more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline are released as open-source resources, providing one more standardized method for ongoing evaluation of LLMs for GeoAI.

DynASyn: Multi-Subject Personalization Enabling Dynamic Action Synthesis

Yongjin Choi,Chanhun Park,Seung Jun Baek

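Task: Propose DynASyn, a method for multi-subject personalization from a single reference image that also handles modifying subject behaviors and dynamic interactions.

Motivation: Existing methods perform poorly when modifying subject behaviors or dynamic interactions and tend to overfit, especially when only one reference image is available.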

Details

Method: Align concept-based priors with subject appearances and actions through attention-map regularization and concept-based prompt-and-image augmentation, combined with SDE-based editing to generate diverse images. Result: DynASyn synthesizes highly realistic images with novel contexts and dynamic interactions and outperforms baseline methods both quantitatively and qualitatively. Conclusion: DynASyn is an effective multi-subject personalization method that generates diverse behaviors and interactions while preserving identity consistency. Abstract: Recent advances in text-to-image diffusion models have spurred research on personalization, i.e., customized image synthesis, of subjects within reference images. Although existing personalization methods are able to alter the subjects' positions or to personalize multiple subjects simultaneously, they often struggle to modify the behaviors of subjects or their dynamic interactions. The difficulty is attributable to overfitting to reference images, which worsens if only a single reference image is available. We propose DynASyn, an effective multi-subject personalization method that works from a single reference image and addresses these challenges. DynASyn preserves the subject identity in the personalization process by aligning concept-based priors with subject appearances and actions. This is achieved by regularizing the attention maps between the subject token and images through concept-based priors. In addition, we propose concept-based prompt-and-image augmentation for an enhanced trade-off between identity preservation and action diversity. We adopt SDE-based editing guided by augmented prompts to generate diverse appearances and actions while maintaining identity consistency in the augmented images. Experiments show that DynASyn is capable of synthesizing highly realistic images of subjects with novel contexts and dynamic interactions with the surroundings, and outperforms baseline methods in both quantitative and qualitative aspects.

MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection

Yibo Yan,Shen Wang,Jiahao Huo,Philip S. Yu,Xuming Hu,Qingsong Wen

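Task: Develop the MathAgent framework to address the challenges multimodal large language models face in mathematical error detection.

Motivation: Multimodal large language models are strong at mathematical problem-solving but still fall short at detecting and categorizing errors in complex multimodal mathematical scenarios.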

Details

Method: Propose the MathAgent framework, which handles error detection in stages through three specialized agents: an image-text consistency validator, a visual semantic interpreter, and an integrative error analyzer. Result: On real-world educational data, MathAgent improves error step identification and error categorization by about 5% and 3% over baseline models, and in a deployed educational platform it achieves nearly 90% student satisfaction with significant cost savings. Conclusion: MathAgent performs well on multimodal mathematical error detection and has practical value. Abstract: Mathematical error detection in educational settings presents a significant challenge for Multimodal Large Language Models (MLLMs), requiring a sophisticated understanding of both visual and textual mathematical content along with complex reasoning capabilities. Though effective in mathematical problem-solving, MLLMs often struggle with the nuanced task of identifying and categorizing student errors in multimodal mathematical contexts. Therefore, we introduce MathAgent, a novel Mixture-of-Math-Agent framework designed specifically to address these challenges. Our approach decomposes error detection into three phases, each handled by a specialized agent: an image-text consistency validator, a visual semantic interpreter, and an integrative error analyzer. This architecture enables more accurate processing of mathematical content by explicitly modeling relationships between multimodal problems and student solution steps. We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification and 3% improvement in error categorization compared to baseline models. In addition, MathAgent has been successfully deployed in an educational platform that has served over one million K-12 students, achieving nearly 90% student satisfaction while generating significant cost savings by reducing manual error detection.

Co-op: Correspondence-based Novel Object Pose Estimation

Sungphill Moon,Hyeontae Son,Dongcheol Hur,Sangwook Kim

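Task: Propose Co-op, a new method for accurately and robustly estimating the 6DoF pose of objects unseen during training from a single RGB image.

Motivation: Existing template-based methods are inefficient because they use a large number of templates; a more efficient method that needs no additional fine-tuning is required.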

Details

Method: Find semi-dense correspondences between the input image and pre-rendered templates, use a hybrid representation combining patch-level classification and offset regression, and refine the pose with a differentiable PnP layer. Result: Outperforms existing methods by a large margin on the seven core datasets of the BOP Challenge, reaching state-of-the-art accuracy. Conclusion: Co-op estimates object poses quickly and significantly outperforms existing methods in accuracy and generalization. Abstract: We propose Co-op, a novel method for accurately and robustly estimating the 6DoF pose of objects unseen during training from a single RGB image. Our method requires only the CAD model of the target object and can precisely estimate its pose without any additional fine-tuning. While existing model-based methods suffer from inefficiency due to using a large number of templates, our method enables fast and accurate estimation with a small number of templates. This improvement is achieved by finding semi-dense correspondences between the input image and the pre-rendered templates. Our method achieves strong generalization performance by leveraging a hybrid representation that combines patch-level classification and offset regression. Additionally, our pose refinement model estimates probabilistic flow between the input image and the rendered image, refining the initial estimate to an accurate pose using a differentiable PnP layer. We demonstrate that our method not only estimates object poses rapidly but also outperforms existing methods by a large margin on the seven core datasets of the BOP Challenge, achieving state-of-the-art accuracy.

Evaluating Negative Sampling Approaches for Neural Topic Models

Suman Adhya,Avishek Lahiri,Debarshi Kumar Sanyal,Partha Pratim Das

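Task: Study how negative sampling strategies affect the performance of neural topic models.

Motivation: Negative sampling works well in computer vision and natural language processing, but its effect in unsupervised domains such as topic modeling has not been thoroughly explored.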

Details

Method: Introduce negative sampling into the decoder of variational autoencoder-based neural topic models and compare different negative sampling strategies. Result: Experiments show that negative sampling significantly improves topic coherence, topic diversity, and document classification accuracy, and raises the quality of the generated topics. Conclusion: Negative sampling is a powerful tool for improving neural topic models. Abstract: Negative sampling has emerged as an effective technique that enables deep learning models to learn better representations by introducing the paradigm of learn-to-compare. The goal of this approach is to add robustness to deep learning models so that they learn better representations by comparing the positive samples against the negative ones. Despite its numerous demonstrations in various areas of computer vision and natural language processing, the effect of negative sampling in an unsupervised domain like topic modeling has not been well explored. In this paper, we present a comprehensive analysis of the impact of different negative sampling strategies on neural topic models. We compare the performance of several popular neural topic models by incorporating a negative sampling technique in the decoder of variational autoencoder-based neural topic models. Experiments on four publicly available datasets demonstrate that integrating negative sampling into topic models results in significant enhancements across multiple aspects, including improved topic coherence, richer topic diversity, and more accurate document classification. Manual evaluations also indicate that the inclusion of negative sampling into neural topic models enhances the quality of the generated topics. These findings highlight the potential of negative sampling as a valuable tool for advancing the effectiveness of neural topic models.

V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

Yiming Zhao,Yu Zeng,Yukun Qi,YaoYang Liu,Lin Chen,Zehui Chen,Xikun Bao,Jie Zhao,Feng Zhao

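Task: Propose V2P-Bench, a benchmark for evaluating the video understanding capabilities of Large Vision-Language Models (LVLMs) in multimodal human-model interaction scenarios.

Motivation: Current benchmarks rely solely on text prompts for evaluation and lack precise spatial and temporal references, which limits the experience and efficiency of human-model interaction.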

Details

Method: V2P-Bench contains 980 unique videos and 1,172 QA pairs covering 5 main tasks and 12 dimensions, supporting instance-level fine-grained understanding. Result: Even the strongest existing models (such as GPT-4o and Gemini-1.5-Pro) perform poorly on V2P-Bench (65.4% and 67.9%), far below the 88.3% of human experts. Conclusion: V2P-Bench exposes the shortcomings of LVLMs in understanding video visual prompts and is expected to advance multimodal human-model interaction and video understanding evaluation. Abstract: Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently. However, current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: https://github.com/gaotiexinqu/V2P-Bench.

Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering

Zixin Chen,Sicheng Song,Kashun Shum,Yanna Lin,Rui Sheng,Huamin Qu

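Task: Evaluate the ability of multimodal large language models (MLLMs) to detect and interpret misleading charts.

Motivation: Although misleading charts are a long-standing problem, no existing work has systematically evaluated how MLLMs perform on this task.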

Details

Method: Introduce the Misleading ChartQA Benchmark, a multimodal dataset of over 3,000 examples covering 21 types of misleaders and 10 chart types. Result: Benchmarking 16 state-of-the-art MLLMs reveals their limitations in identifying visually deceptive practices; a new pipeline for detecting and localizing misleaders is also proposed. Conclusion: The work lays a foundation for MLLM-driven misleading chart comprehension and releases the dataset to support further research. Abstract: Misleading chart visualizations, which intentionally manipulate data representations to support specific claims, can distort perceptions and lead to incorrect conclusions. Despite decades of research, misleading visualizations remain a widespread and pressing issue. Recent advances in multimodal large language models (MLLMs) have demonstrated strong chart comprehension capabilities, yet no existing work has systematically evaluated their ability to detect and interpret misleading charts. This paper introduces the Misleading Chart Question Answering (Misleading ChartQA) Benchmark, a large-scale multimodal dataset designed to assess MLLMs in identifying and reasoning about misleading charts. It contains over 3,000 curated examples, covering 21 types of misleaders and 10 chart types. Each example includes standardized chart code, CSV data, and multiple-choice questions with labeled explanations, validated through multi-round MLLM checks and exhaustive expert human review. We benchmark 16 state-of-the-art MLLMs on our dataset, revealing their limitations in identifying visually deceptive practices. We also propose a novel pipeline that detects and localizes misleaders, enhancing MLLMs' accuracy in misleading chart interpretation. Our work establishes a foundation for advancing MLLM-driven misleading chart comprehension. We publicly release the sample dataset to support further research in this critical area.

Serial Low-rank Adaptation of Vision Transformer

Houqiang Zhong,Shaocheng Shen,Ke Cai,Zhenglong Wu,Jiangchao Yao,Yuan Cheng,Xuefei Li,Xiaoyun Zhang,Li Song,Qiang Hu

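Task: Propose Serial LoRA, a novel low-rank adaptation method for parameter-efficient fine-tuning of large pre-trained vision foundation models.

Motivation: Given practical constraints on computational and storage costs, developing more advanced low-rank adaptation methods that reduce parameter and memory requirements is an important challenge in resource-constrained application scenarios.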

Details

Method: Building on the commonly used vision transformer, propose Serial LoRA, which introduces a shared low-rank matrix composed serially with the attention mechanism to extract the commonality of parameters in adaptation and significantly reduce redundancy. Result: Serial LoRA uses only 1/4 of the parameters of LoRA yet achieves comparable performance in most cases; experiments confirm its consistent advantage. Conclusion: Serial LoRA is an efficient low-rank adaptation method that significantly reduces parameter and memory requirements while maintaining performance. Abstract: Fine-tuning large pre-trained vision foundation models in a parameter-efficient manner is critical for downstream vision tasks, considering the practical constraints of computational and storage costs. Low-rank adaptation (LoRA) is a well-established technique in this domain, achieving impressive efficiency by reducing the parameter space to a low-rank form. However, developing more advanced low-rank adaptation methods to reduce parameters and memory requirements remains a significant challenge in resource-constrained application scenarios. In this study, we build on the commonly used vision transformer and propose Serial LoRA, a novel LoRA variant that introduces a shared low-rank matrix serially composed with the attention mechanism. Such a design extracts the underlying commonality of parameters in adaptation, significantly reducing redundancy. Notably, Serial LoRA uses only 1/4 of the parameters of LoRA but achieves comparable performance in most cases. We conduct extensive experiments on a range of vision foundation models with the transformer structure, and the results confirm the consistent superiority of our method.

GINGER: Grounded Information Nugget-Based Generation of Responses

Weronika Łajewska,Krisztian Balog

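Task: Propose a modular pipeline that uses information nuggets to improve the factual accuracy, source attribution, and response completeness of retrieval-augmented generation (RAG).

Motivation: Address RAG's challenges with factual correctness, source attribution, and response completeness.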

Details

Method: Use a multistage pipeline comprising nugget detection, clustering, ranking, top cluster summarization, and fluency enhancement. Result: The GINGER framework achieves state-of-the-art performance on the TREC RAG'24 dataset. Conclusion: The proposed method ensures factual grounding and source attribution and maximizes information inclusion within length constraints. Abstract: Retrieval-augmented generation (RAG) faces challenges related to factual correctness, source attribution, and response completeness. To address them, we propose a modular pipeline for grounded response generation that operates on information nuggets: minimal, atomic units of relevant information extracted from retrieved documents. The multistage pipeline encompasses nugget detection, clustering, ranking, top cluster summarization, and fluency enhancement. It guarantees grounding in specific facts, facilitates source attribution, and ensures maximum information inclusion within length constraints. Extensive experiments on the TREC RAG'24 dataset evaluated with the AutoNuggetizer framework demonstrate that GINGER achieves state-of-the-art performance on this benchmark.

HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving

R. D. Lin,Pengcheng Weng,Yinqiao Wang,Han Ding,Jinsong Han,Fei Wang

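Task: Propose HiLoTs, a semi-supervised learning method for LiDAR point cloud semantic segmentation that exploits the long-term temporal properties of autonomous driving scenarios.

Motivation: Existing semi-supervised methods usually focus only on point cloud spatial distribution or short-term temporal representations, overlooking the rich long-term temporal properties of autonomous driving scenarios.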

Details

Method: HiLoTs learns high-temporal-sensitivity and low-temporal-sensitivity representations from continuous LiDAR frames, enhances and fuses them with a cross-attention mechanism, and uses a teacher-student framework to exploit unlabeled data. Result: On the SemanticKITTI and nuScenes datasets, HiLoTs outperforms existing semi-supervised methods and comes close to LiDAR+Camera multimodal approaches. Conclusion: HiLoTs makes effective use of long-term temporal properties and significantly improves semi-supervised LiDAR point cloud semantic segmentation. Abstract: LiDAR point cloud semantic segmentation plays a crucial role in autonomous driving. In recent years, semi-supervised methods have gained popularity due to their significant reduction in annotation labor and time costs. Current semi-supervised methods typically focus on point cloud spatial distribution or consider short-term temporal representations, e.g., only two adjacent frames, often overlooking the rich long-term temporal properties inherent in autonomous driving scenarios. In driving experience, we observe that nearby objects, such as roads and vehicles, remain stable while driving, whereas distant objects exhibit greater variability in category and shape. This natural phenomenon is also captured by LiDAR, which reflects lower temporal sensitivity for nearby objects and higher sensitivity for distant ones. To leverage these characteristics, we propose HiLoTs, which learns high-temporal-sensitivity and low-temporal-sensitivity representations from continuous LiDAR frames. These representations are further enhanced and fused using a cross-attention mechanism. Additionally, we employ a teacher-student framework to align the representations learned by the labeled and unlabeled branches, effectively utilizing the large amounts of unlabeled data. Experimental results on the SemanticKITTI and nuScenes datasets demonstrate that our proposed HiLoTs outperforms state-of-the-art semi-supervised methods, and achieves performance close to LiDAR+Camera multimodal approaches. Code is available at https://github.com/rdlin118/HiLoTs

Exploring Topic Trends in COVID-19 Research Literature using Non-Negative Matrix Factorization

Divya Patel,Vansh Parikh,Om Patel,Agam Shah,Bhaskar Chaudhury

Task: Apply Non-Negative Matrix Factorization (NMF) topic modeling to the COVID-19 Open Research Dataset (CORD-19) to uncover its underlying thematic structure and how that structure evolves.

Motivation: Analyzing the thematic structure of the COVID-19 research literature and its evolution provides a valuable resource for future research.

Details

Method: Factorize the document-term matrix with NMF, using tf-idf feature extraction and a stability analysis to select the optimal number of topics. Result: The analysis reveals how topics evolve within the CORD-19 dataset and contributes to understanding the knowledge structure of COVID-19 research. Conclusion: The approach offers an effective tool for analyzing the knowledge structure of the COVID-19 research field and can serve as a basis for future work. Abstract: In this work, we apply topic modeling using Non-Negative Matrix Factorization (NMF) on the COVID-19 Open Research Dataset (CORD-19) to uncover the underlying thematic structure and its evolution within the extensive body of COVID-19 research literature. NMF factorizes the document-term matrix into two non-negative matrices, effectively representing the topics and their distribution across the documents. This helps us see how strongly documents relate to topics and how topics relate to words. We describe the complete methodology, which involves a series of rigorous pre-processing steps to standardize the available text data while preserving the context of phrases, followed by feature extraction using term frequency-inverse document frequency (tf-idf), which assigns weights to words based on their frequency and rarity in the dataset. To ensure the robustness of our topic model, we conduct a stability analysis. This process assesses the stability scores of the NMF topic model for different numbers of topics, enabling us to select the optimal number of topics for our analysis. Through our analysis, we track the evolution of topics over time within the CORD-19 dataset. Our findings contribute to the understanding of the knowledge structure of the COVID-19 research landscape, providing a valuable resource for future research in this field.
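The factorization step described in this abstract can be illustrated with a minimal pure-Python sketch. It assumes the classic multiplicative-update rule for the Frobenius objective, which is a standard NMF solver and not necessarily the one the authors used; the toy document-term matrix, the function names, and the iteration count are all hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix V (docs x terms) into W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Hypothetical tf-idf-like matrix: 4 documents x 4 terms, two obvious themes.
    docs = [[2, 1, 0, 0], [4, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 6]]
    w, h = nmf(docs, k=2)
    print("document-topic weights:", w)
    print("topic-term weights:", h)
    print("reconstruction error:", frobenius_error(docs, w, h))
```

Rows of W indicate how strongly each document relates to each topic, and rows of H indicate how strongly each topic relates to each term. In practice one would rerun this for several values of k and compare stability scores, as the abstract describes.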

CODA: Repurposing Continuous VAEs for Discrete Tokenization

Zeyu Liu,Zanlin Ni,Yeguo Hua,Xin Deng,Xiao Ma,Cheng Zhong,Gao Huang

Task: Propose CODA, a framework that adapts continuous variational autoencoders (VAEs) into discrete visual tokenizers, decoupling compression from discretization.

Motivation: Traditional discrete tokenizers learn compression and discretization jointly, which leads to unstable training, low codebook utilization, and limited reconstruction quality.

Details

Method: Adapt off-the-shelf continuous VAEs into discrete tokenizers through a carefully designed discretization process, focusing primarily on the discretization task. Result: On the ImageNet 256×256 benchmark, CODA reaches 100% codebook utilization with a 6× smaller training budget, with notable reconstruction FID (rFID) scores of 0.43 and 1.34. Conclusion: By decoupling compression and discretization, CODA trains stably and efficiently while retaining the strong visual fidelity of continuous VAEs. Abstract: Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, often leading to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs, already optimized for perceptual compression, into discrete tokenizers via a carefully designed discretization process. By primarily focusing on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with a 6× smaller training budget than standard VQGAN, our approach achieves a remarkable codebook utilization of 100% and notable reconstruction FID (rFID) of 0.43 and 1.34 for 8× and 16× compression, respectively, on the ImageNet 256×256 benchmark.
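To make the codebook utilization metric concrete, here is a small sketch that maps continuous latent vectors to their nearest code and reports the fraction of codes that are actually used. It uses plain nearest-neighbour vector quantization as a stand-in; CODA's own discretization process is not specified in this abstract, and the function names and toy data are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )


def codebook_utilization(vectors, codebook):
    """Fraction of codebook entries selected at least once for the given latents."""
    used = {nearest_code(v, codebook) for v in vectors}
    return len(used) / len(codebook)


if __name__ == "__main__":
    # Hypothetical 2-D latents and a 3-entry codebook; the last code is never chosen.
    codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    latents = [[0.1, 0.0], [0.9, 1.2], [0.2, 0.1]]
    print("tokens:", [nearest_code(z, codebook) for z in latents])
    print("utilization:", codebook_utilization(latents, codebook))
```

A utilization of 100%, as reported in the abstract, would mean every code is selected by at least one latent in the evaluation set.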

LakotaBERT: A Transformer-based Model for Low Resource Lakota Language

Kanishka Parankusham,Rodrigue Rizk,KC Santosh

Task: Develop LakotaBERT, the first large language model for Lakota, to support the revitalization of this endangered language.

Motivation: Lakota, an endangered language of the Sioux people in North America, is seeing declining fluency among younger generations and urgently needs technological support for its revitalization.

Details

Method: Build a corpus of 105K sentences in Lakota, English, and parallel texts, pre-train a model based on the RoBERTa architecture, and evaluate it against existing models such as RoBERTa and BERT. Result: Initial results show a masked language modeling accuracy of 51%, comparable to English-based models. Conclusion: By combining AI with linguistic methodology, the study offers a technical precedent for revitalizing other endangered indigenous languages. Abstract: Lakota, a critically endangered language of the Sioux people in North America, faces significant challenges due to declining fluency among younger generations. This paper introduces LakotaBERT, the first large language model (LLM) tailored for Lakota, aiming to support language revitalization efforts. Our research has two primary objectives: (1) to create a comprehensive Lakota language corpus and (2) to develop a customized LLM for Lakota. We compiled a diverse corpus of 105K sentences in Lakota, English, and parallel texts from various sources, such as books and websites, emphasizing the cultural significance and historical context of the Lakota language. Utilizing the RoBERTa architecture, we pre-trained our model and conducted comparative evaluations against established models such as RoBERTa, BERT, and multilingual BERT. Initial results demonstrate a masked language modeling accuracy of 51% with a single ground truth assumption, showcasing performance comparable to that of English-based models. We also evaluated the model using additional metrics, such as precision and F1 score, to provide a comprehensive assessment of its capabilities. By integrating AI and linguistic methodologies, we aspire to enhance linguistic diversity and cultural resilience, setting a valuable precedent for leveraging technology in the revitalization of other endangered indigenous languages.

GOAL: Global-local Object Alignment Learning

Hyungyu Choi,Young Kyun Jang,Chanho Eom

Task: Propose GOAL, a fine-tuning method that strengthens CLIP's ability to handle lengthy text descriptions.

Motivation: Existing vision-language models such as CLIP align short text well but perform poorly on lengthy text.

Details

Method: Improve CLIP's handling of lengthy text through global and local semantic alignment (LISM and TSL). Result: On three new benchmarks, GOAL significantly outperforms baseline CLIP fine-tuning. Conclusion: By combining local semantic alignment with global context, GOAL yields finer-grained embeddings for tasks involving lengthy text. Abstract: Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because of their training focus on short and concise captions. We present GOAL (Global-local Object Alignment Learning), a novel fine-tuning method that enhances CLIP's ability to handle lengthy text by leveraging both global and local semantic alignments between image and lengthy text. Our approach consists of two key components: Local Image-Sentence Matching (LISM), which identifies corresponding pairs between image segments and descriptive sentences, and Token Similarity-based Learning (TSL), which efficiently propagates local element attention through these matched pairs. Evaluating GOAL on three new benchmarks for image-lengthy text retrieval, we demonstrate significant improvements over baseline CLIP fine-tuning, establishing a simple yet effective approach for adapting CLIP to detailed textual descriptions. Through extensive experiments, we show that our method's focus on local semantic alignment alongside global context leads to more nuanced and representative embeddings, particularly beneficial for tasks requiring fine-grained understanding of lengthy text descriptions.

Mapping Hymns and Organizing Concepts in the Rigveda: Quantitatively Connecting the Vedic Suktas

Venkatesh Bollineni,Igor Crk,Eren Gultepe

Task: Use NLP techniques to analyze the topics and semantic connections of the Rigveda.

Motivation: The Rigveda is difficult to understand and analyze because of its ancient language, poetic structure, and large volume of text.

Details

Method: Generate embeddings with LSA, SBERT, and Doc2Vec, reduce dimensionality with UMAP, and analyze the topic networks with community detection methods (Louvain, Leiden, label propagation). Result: Only LSA combined with the Leiden method was significant (z = 2.726, p < .01), and it identified all seven well-known topic groups; Doc2Vec was ineffective, and SBERT was partially successful but not significant. Conclusion: The adapted LSA method performed best for analyzing topics in the Rigveda. Abstract: Accessing and gaining insight into the Rigveda poses a non-trivial challenge due to its extremely ancient Sanskrit language, poetic structure, and large volume of text. By using NLP techniques, this study identified topics and semantic connections of hymns within the Rigveda that were corroborated by seven well-known groupings of hymns. The 1,028 suktas (hymns) from the modern English translation of the Rigveda by Jamison and Brereton were preprocessed, and sukta-level embeddings were obtained using (i) a novel adaptation of LSA, presented herein, (ii) SBERT, and (iii) Doc2Vec embeddings. Following a UMAP dimension reduction of the vectors, the network of suktas was formed using k-nearest neighbours. Then, community detection of topics in the sukta networks was performed with the Louvain, Leiden, and label propagation methods, and the statistical significance of the resulting topics was determined using an appropriate null distribution. Only the novel adaptation of LSA with the Leiden method detected sukta topic networks that were significant (z = 2.726, p < .01), with a modularity score of 0.944. Of the seven famous sukta groupings analyzed (e.g., creation, funeral, water, etc.), the LSA-derived network was successful in all seven cases, while Doc2Vec was not significant and failed to detect the relevant suktas. SBERT detected four of the famous suktas as separate groups, but mistakenly combined three of them into a single mixed group. Also, the SBERT network was not statistically significant.

Aligning Foundation Model Priors and Diffusion-Based Hand Interactions for Occlusion-Resistant Two-Hand Reconstruction

Gaoge Han,Yongkang Cheng,Zhe Chen,Shaoli Huang,Tongliang Liu

Task: Accurately reconstruct two-hand poses and interactions from monocular images, addressing alignment and occlusion problems.

Motivation: Existing methods struggle to achieve plausible interaction alignment under complex, dynamic hand postures and occlusions, resulting in misalignment and penetration artifacts.

Details

Method: Propose a new framework that combines foundation model-driven 2D priors with diffusion-based interaction refinement, using a Fusion Alignment Encoder and a two-hand diffusion model to achieve occlusion-robust reconstruction. Result: The method achieves state-of-the-art performance on the InterHand2.6M, FreiHAND, and HIC datasets, significantly improving occlusion handling and interaction robustness. Conclusion: By effectively combining multimodal priors with a diffusion model, the method addresses the alignment and occlusion problems in two-hand reconstruction. Abstract: Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions, causing significant difficulty in achieving plausible interaction alignment. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. To tackle this, we propose a novel framework that attempts to precisely align hand poses and interactions by synergistically integrating foundation model-driven 2D priors with diffusion-based interaction refinement for occlusion-resistant two-hand reconstruction. First, we introduce a Fusion Alignment Encoder that learns to align fused multimodal priors (keypoints, segmentation maps, and depth cues) from foundation models during training. This provides robust structured guidance, further enabling efficient inference without foundation models at test time while maintaining high reconstruction accuracy. Second, we employ a two-hand diffusion model explicitly trained to transform interpenetrated poses into plausible, non-penetrated interactions, leveraging gradient-guided denoising to correct artifacts and ensure realistic spatial relations. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on InterHand2.6M, FreiHAND, and HIC datasets, significantly advancing occlusion handling and interaction robustness.

ShED-HD: A Shannon Entropy Distribution Framework for Lightweight Hallucination Detection on Edge Devices

Aneesh Vathul,Daniel Lee,Sheryl Chen,Arthi Tasmia

Task: Propose ShED-HD, a lightweight hallucination detection framework for efficiently detecting false content generated by large language models.

Motivation: Existing hallucination detection methods fall short in either computational cost or accuracy and cannot meet the needs of resource-constrained environments such as edge devices.

Details

Method: Classify Shannon entropy distributions with a lightweight BiLSTM architecture and single-headed attention to detect sequence-level entropy patterns. Result: Experiments on three datasets show that ShED-HD significantly outperforms other efficient methods in the out-of-distribution setting while performing comparably in the in-distribution setting. Conclusion: ShED-HD is a low-cost, accurate, and generalizable hallucination detection method that improves the credibility of LLM-generated content in resource-constrained environments. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities on a broad array of NLP tasks, but their tendency to produce hallucinations (plausible-sounding but factually incorrect content) poses severe challenges in high-stakes domains. Existing hallucination detection methods either bear the computational cost of multiple inference passes or sacrifice accuracy for efficiency with single-pass approaches, neither of which is ideal in resource-constrained environments such as edge devices. We propose the Shannon Entropy Distribution Hallucination Detector (ShED-HD), a novel hallucination detection framework that bridges this gap by classifying sequence-level entropy patterns using a lightweight BiLSTM architecture with single-headed attention. In contrast to prior approaches, ShED-HD efficiently detects distinctive uncertainty patterns across entire output sequences, preserving contextual awareness. Through in-depth evaluation on three datasets (BioASQ, TriviaQA, and Jeopardy Questions), we show that ShED-HD significantly outperforms other computationally efficient approaches in the out-of-distribution setting, while achieving comparable performance in the in-distribution setting. ShED-HD facilitates hallucination detection that is low-cost, accurate, and generalizable, improving the credibility of content generated by LLMs in resource-constrained environments where trustworthy AI functionality is crucial.
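The sequence-level entropy pattern that such a classifier consumes can be sketched as follows. This only shows how a per-token Shannon entropy sequence might be computed from next-token probability distributions; the BiLSTM classifier itself is omitted, and the function names and toy distributions are hypothetical rather than taken from the paper.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of one probability distribution (zero entries are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_sequence(token_distributions):
    """Entropy of the predictive distribution at each generated token position."""
    return [shannon_entropy(d) for d in token_distributions]


if __name__ == "__main__":
    # Hypothetical next-token distributions over a 4-word vocabulary for 3 output tokens.
    distributions = [
        [1.0, 0.0, 0.0, 0.0],      # model is certain
        [0.5, 0.5, 0.0, 0.0],      # two equally likely candidates
        [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    ]
    print(entropy_sequence(distributions))
```

The resulting list of entropies, one per output token, is the kind of single-pass signal a lightweight sequence classifier could take as input without rerunning the LLM.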

Topology preserving Image segmentation using the iterative convolution-thresholding method

Lingyun Deng,Litong Liu,Dong Wang,Xiao-Ping Wang

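Task: Introduce a topology-preserving constraint into the iterative convolution-thresholding method (ICTM), yielding the topology-preserving ICTM (TP-ICTM), to improve the accuracy and robustness of image segmentation.

Motivation: Traditional segmentation models focus mainly on the visual attributes of an image and neglect the topological properties of the target objects, so segmentation results deviate from the ground truth in images with complex topological structures.

Details

Method: Add a topology-preserving constraint to ICTM, forming TP-ICTM. Result: Experiments show that, by explicitly preserving the topological properties of target objects (such as connectivity), TP-ICTM achieves higher accuracy and robustness on images with intricate structures or noise. Conclusion: By incorporating topological constraints, TP-ICTM significantly improves segmentation performance, especially on images with complex topological structures. Abstract: Variational models are widely used in image segmentation, with various models designed to address different types of images by optimizing specific objective functionals. However, traditional segmentation models primarily focus on the visual attributes of the image, often neglecting the topological properties of the target objects. This limitation can lead to segmentation results that deviate from the ground truth, particularly in images with complex topological structures. In this paper, we introduce a topology-preserving constraint into the iterative convolution-thresholding method (ICTM), resulting in the topology-preserving ICTM (TP-ICTM). Extensive experiments demonstrate that, by explicitly preserving the topological properties of target objects, such as connectivity, the proposed algorithm achieves enhanced accuracy and robustness, particularly in images with intricate structures or noise.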

Tadesse Destaw Belay,Israel Abebe Azime,Ibrahim Said Ahmad,Idris Abdulmumin,Abinew Ali Ayele,Shamsuddeen Hassan Muhammad,Seid Muhie Yimam

Task: Explore domain-adaptive and task-adaptive continual pretraining approaches for low-resource African languages.

Motivation: Pretrained Language Models (PLMs) perform well across many tasks and data sources, but research on low-resource African languages is scarce, so their adaptability needs to be explored.

Details

Method: Use the AfriSocial corpus for domain adaptive pretraining (DAPT) and combine it with task adaptive pretraining (TAPT). Result: DAPT improves F1 score on fine-grained emotion classification for 16 targeted languages by 1% to 28.27%; TAPT yields gains of 0.55% to 15.11%; combining DAPT and TAPT also outperforms the base models. Conclusion: The proposed approaches significantly improve NLP task performance for low-resource African languages, and the resources will be released to support similar tasks. Abstract: Pretrained Language Models (PLMs) built from various sources are the foundation of today's NLP progress. Language representations learned by such models achieve strong performance across many tasks with datasets of varying sizes drawn from various sources. We thoroughly analyze domain and task adaptive continual pretraining approaches for low-resource African languages and show promising results on the evaluated tasks. We create AfriSocial, a corpus designed for domain adaptive finetuning that passes through quality pre-processing steps. Continually pretraining PLMs with AfriSocial as domain adaptive pretraining (DAPT) data consistently improves performance on the fine-grained emotion classification task for 16 targeted languages, by 1% to 28.27% macro F1 score. Likewise, using the task adaptive pretraining (TAPT) approach, further finetuning with small unlabeled but similar task data shows promising results. For example, unlabeled sentiment data (source) for the fine-grained emotion classification task (target) improves the base model results by an F1 score ranging from 0.55% to 15.11%. Combining the two methods, DAPT + TAPT, also achieves better results than the base models. All the resources will be made available to improve low-resource NLP tasks in general, as well as other similar domain tasks such as hate speech and sentiment tasks.

Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models

Ketan Suhaas Saichandran,Xavier Thomas,Prakhar Kaushik,Deepti Ghadiyaram

Task: Propose SCoPE, a method that improves the alignment of text-to-image generative models by progressively refining the input prompt.

Motivation: Address the alignment problems that text-to-image generative models face with complex scenes and diverse objects.

Details

Method: Decompose a detailed input prompt into multiple sub-prompts that are refined from coarse to fine, and interpolate between them during inference. Result: On the GenAI-Bench dataset, VQA scores improve by an average of up to 4% on 85% of the prompts. Conclusion: SCoPE is a training-free method that significantly improves text-to-image alignment. Abstract: Text-to-image generative models often struggle with long prompts detailing complex scenes, diverse objects with distinct visual characteristics and spatial relationships. In this work, we propose SCoPE (Scheduled interpolation of Coarse-to-fine Prompt Embeddings), a training-free method to improve text-to-image alignment by progressively refining the input prompt in a coarse-to-fine-grained manner. Given a detailed input prompt, we first decompose it into multiple sub-prompts which evolve from describing broad scene layout to highly intricate details. During inference, we interpolate between these sub-prompts and thus progressively introduce finer-grained details into the generated image. Our training-free plug-and-play approach significantly enhances prompt alignment, achieving an average improvement of up to +4% in Visual Question Answering (VQA) scores over the Stable Diffusion baselines on 85% of the prompts from the GenAI-Bench dataset.
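The coarse-to-fine interpolation idea can be sketched in a few lines. This assumes a simple piecewise-linear schedule over the denoising steps; the actual schedule used by SCoPE is not given in this abstract, and the function name, the 2-D "embeddings", and the step count are hypothetical.

```python
def interpolate_prompt_embedding(sub_prompt_embeddings, step, total_steps):
    """Blend neighbouring sub-prompt embeddings, moving from coarse (first) to fine (last).

    Assumes a piecewise-linear schedule: progress through the denoising steps is
    mapped to a position along the ordered list of sub-prompt embeddings.
    """
    n = len(sub_prompt_embeddings)
    if n == 1 or total_steps <= 1:
        return list(sub_prompt_embeddings[-1])
    position = (step / (total_steps - 1)) * (n - 1)
    lower = min(int(position), n - 2)  # clamp so lower + 1 stays in range
    t = position - lower
    a, b = sub_prompt_embeddings[lower], sub_prompt_embeddings[lower + 1]
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


if __name__ == "__main__":
    # Hypothetical embeddings for three sub-prompts: layout, objects, fine details.
    embeddings = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    for step in range(5):
        print(step, interpolate_prompt_embedding(embeddings, step, total_steps=5))
```

Early steps are conditioned mostly on the broad scene description, and later steps drift toward the most detailed sub-prompt, which matches the progressive refinement described in the abstract.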

PAD: Towards Efficient Data Generation for Transfer Learning Using Phrase Alignment

Jong Myoung Kim,Young-Jun Lee,Ho-Jin Choi,Sangkeun Jung

Task: Explore how Phrase Aligned Data (PAD) can improve the efficiency of transfer learning for Korean language modeling.

Motivation: Leverage the abundance of English data to address the scarcity of resources for non-English languages such as Korean.

Details

Method: Run experiments with PAD generated by standardized Statistical Machine Translation (SMT), taking the syntactic characteristics of Korean into account. Result: PAD mitigates the weaknesses of SMT, significantly improves model performance, and complements traditional data construction methods. Conclusion: PAD offers an efficient, low-cost solution for resource-scarce languages. Abstract: Transfer learning leverages the abundance of English data to address the scarcity of resources in modeling non-English languages, such as Korean. In this study, we explore the potential of Phrase Aligned Data (PAD) from standardized Statistical Machine Translation (SMT) to enhance the efficiency of transfer learning. Through extensive experiments, we demonstrate that PAD synergizes effectively with the syntactic characteristics of the Korean language, mitigating the weaknesses of SMT and significantly improving model performance. Moreover, we reveal that PAD complements traditional data construction methods and enhances their effectiveness when combined. This innovative approach not only boosts model performance but also suggests a cost-efficient solution for resource-scarce languages.

GaussianFocus: Constrained Attention Focus for 3D Gaussian Splatting

Zexu Huang,Min Xu,Stuart Perry

Task: Propose GaussianFocus to address the shortcomings of 3D Gaussian Splatting in rendering quality and scalability.

Motivation: 3D Gaussian Splatting excels in small-scale and object-centric scenes, but in large-scale scenes it suffers from redundant noisy Gaussians, memory limits, and long optimization times.

Details

Method: Introduce a patch attention algorithm to refine rendering quality, adopt a Gaussian constraints strategy to reduce redundancy, and propose a subdivision reconstruction strategy for large-scale scenes. Result: GaussianFocus significantly reduces redundant Gaussians and improves rendering quality, surpassing existing state-of-the-art methods, and handles large-scale scenes efficiently. Conclusion: GaussianFocus addresses the limitations of 3D Gaussian Splatting and provides an effective solution for high-quality rendering of large-scale scenes. Abstract: Recent developments in 3D reconstruction and neural rendering have significantly propelled the capabilities of photo-realistic 3D scene rendering across various academic and industrial fields. The 3D Gaussian Splatting technique, alongside its derivatives, integrates the advantages of primitive-based and volumetric representations to deliver top-tier rendering quality and efficiency. Despite these advancements, the method tends to generate excessive redundant noisy Gaussians overfitted to every training view, which degrades the rendering quality. Additionally, while 3D Gaussian Splatting excels in small-scale and object-centric scenes, its application to larger scenes is hindered by constraints such as limited video memory, excessive optimization duration, and variable appearance across views. To address these challenges, we introduce GaussianFocus, an innovative approach that incorporates a patch attention algorithm to refine rendering quality and implements a Gaussian constraints strategy to minimize redundancy. Moreover, we propose a subdivision reconstruction strategy for large-scale scenes, dividing them into smaller, manageable blocks for individual training. Our results indicate that GaussianFocus significantly reduces unnecessary Gaussians and enhances rendering quality, surpassing existing State-of-The-Art (SoTA) methods. Furthermore, we demonstrate the capability of our approach to effectively manage and render large scenes, such as urban environments, whilst maintaining high fidelity in the visual output.

Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages

Tadesse Destaw Belay,Dawit Ketema Gete,Abinew Ali Ayele,Olga Kolesnikova,Grigori Sidorov,Seid Muhie Yimam

Task: Model and integrate emotion-understanding models for multi-label emotion annotation and emotion intensity analysis.

Motivation: On social media platforms, users often express several emotions at once, and emotion intensity is critical for decision-making and feedback analysis.

Details

Method: Extend the EthioEmo dataset with emotion intensity annotations and evaluate a range of Pretrained Language Models (PLMs) and Large Language Models (LLMs). Result: The study provides comprehensive benchmark results showing how different models perform on multi-label emotion annotation and emotion intensity analysis. Conclusion: Combining emotion intensity with multi-label annotation gives emotion-understanding models a more complete view, which is especially important for decision support and mental health research. Abstract: In this digital world, people freely express their emotions using different social media platforms. As a result, modeling and integrating emotion-understanding models are vital for various human-computer interaction tasks such as decision-making, product and customer feedback analysis, political promotions, marketing research, and social media monitoring. As users express different emotions simultaneously in a single instance, annotating emotions in a multilabel setting such as the EthioEmo (Belay et al., 2025) dataset effectively captures this dynamic. Additionally, incorporating intensity, or the degree of emotion, is crucial, as emotions can significantly differ in their expressive strength and impact. This intensity is significant for assessing whether further action is necessary in decision-making processes, especially concerning negative emotions in applications such as healthcare and mental health studies. To enhance the EthioEmo dataset, we include annotations for the intensity of each labeled emotion. Furthermore, we evaluate various state-of-the-art encoder-only Pretrained Language Models (PLMs) and decoder-only Large Language Models (LLMs) to provide comprehensive benchmarking.

LightLoc: Learning Outdoor LiDAR Localization at Light Speed

Wen Li,Chen Liu,Shangshu Yu,Dunqiang Liu,Yin Zhou,Siqi Shen,Chenglu Wen,Cheng Wang

Task: Propose LightLoc, a method for quickly learning localization in large-scale outdoor scenes.

Motivation: Existing scene coordinate regression methods perform well for outdoor LiDAR localization, but their long training times limit their practicality in time-sensitive applications such as autonomous driving, drones, and robotics.

Details

Method: Introduce two new techniques, sample classification guidance and redundant sample downsampling, to improve training efficiency and reduce training time. Result: Experiments on large-scale outdoor datasets show that LightLoc achieves state-of-the-art performance with a 50x reduction in training time. Conclusion: With fast training and confidence estimation, LightLoc provides an efficient localization solution for time-sensitive applications. Abstract: Scene coordinate regression achieves impressive results in outdoor LiDAR localization but requires days of training. Since training needs to be repeated for each new scene, long training times make these methods impractical for time-sensitive applications, such as autonomous driving, drones, and robotics. We identify large coverage areas and vast data in large-scale outdoor scenes as key challenges that limit fast training. In this paper, we propose LightLoc, the first method capable of efficiently learning localization in a new scene at light speed. LightLoc introduces two novel techniques to address these challenges. First, we introduce sample classification guidance to assist regression learning, reducing ambiguity from similar samples and improving training efficiency. Second, we propose redundant sample downsampling to remove well-learned frames during training, reducing training time without compromising accuracy. Additionally, the fast training and confidence estimation capabilities of sample classification enable its integration into SLAM, effectively eliminating error accumulation. Extensive experiments on large-scale outdoor datasets demonstrate that LightLoc achieves state-of-the-art performance with a 50x reduction in training time compared with existing methods. Our code is available at https://github.com/liw95/LightLoc.

Bridging Emotions and Architecture: Sentiment Analysis in Modern Distributed Systems

Mahak Shah,Akaash Vishal Hazarika,Meetu Malhotra,Sachin C. Patil,Joshit Mohanty

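Task: Study the combination of sentiment analysis and distributed systems, examining its approaches, challenges, and future directions.

Motivation: Sentiment analysis has important applications in areas such as social media monitoring, customer feedback evaluation, and market research, while distributed systems can process large-scale data efficiently.

Details

Method: Experimentally compare the performance and accuracy of sentiment analysis models trained on a single-node configuration and on a distributed architecture. Result: The experiments show the strengths and weaknesses of each approach in terms of performance and accuracy. Conclusion: Combining sentiment analysis with distributed systems is promising, but further research is needed to address the remaining challenges. Abstract: Sentiment analysis is a field within NLP that has gained importance because it is applied in various areas such as social media surveillance, customer feedback evaluation, and market research. At the same time, distributed systems allow for effective processing of large amounts of data. Therefore, this paper examines how sentiment analysis converges with distributed systems by concentrating on different approaches, challenges, and future investigations. Furthermore, we conduct an extensive experiment in which we train sentiment analysis models using both a single-node configuration and a distributed architecture to bring out the benefits and shortcomings of each method in terms of performance and accuracy.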

RefCut: Interactive Segmentation with Reference Guidance

Zheng Lin,Nan Zhou,Chen-Xi Du,Deng-Ping Fan,Shi-Min Hu

Task: Address part ambiguity and object ambiguity in interactive segmentation by proposing the RefCut framework.

Motivation: Existing methods cannot give the model intuitive guidance, which leads to unstable outputs and makes it hard to meet large-scale, efficient annotation requirements.

Details

Method: Introduce RefCut, a reference-based interactive segmentation framework in which users provide a reference image and masks and the model is optimized based on them. Result: In the combined evaluation across multiple datasets, RefCut achieves state-of-the-art performance. Conclusion: RefCut advances intuitive and controllable interactive segmentation; the code will be released and a demo video is available. Abstract: Interactive segmentation aims to segment the specified target on the image with positive and negative clicks from users. Interactive ambiguity is a crucial issue in this field, which refers to the possibility of multiple compliant outcomes with the same clicks, such as selecting a part of an object versus the entire object, a single object versus a combination of multiple objects, and so on. The existing methods cannot provide intuitive guidance to the model, which leads to unstable output results and makes it difficult to meet the large-scale and efficient annotation requirements for specific targets in some scenarios. To bridge this gap, we introduce RefCut, a reference-based interactive segmentation framework designed to address part ambiguity and object ambiguity in segmenting specific targets. Users only need to provide a reference image and corresponding reference masks, and the model will be optimized based on them, which greatly reduces the interactive burden on users when annotating a large number of such targets. In addition, to enrich these two kinds of ambiguous data, we propose a new Target Disassembly Dataset which contains two subsets of part disassembly and object disassembly for evaluation. In the combination evaluation of multiple datasets, our RefCut achieved state-of-the-art performance. Extensive experiments and visualized results demonstrate that RefCut advances the field of intuitive and controllable interactive segmentation. Our code will be publicly available and the demo video is at https://www.lin-zheng.com/refcut.

Sun-Shine: A Large Language Model for Tibetan Culture

Cheng Huang,Fan Gao,Nyima Tashi,Yutong Liu,Xiangxiang Wang,Thupten Tsering,Ban Ma-bao,Renzeg Duojie,Gadeng Luosang,Rinchen Dongrub,Dorje Tashi,Xiao Feng,Yongbin Yu

Task: Develop Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, and build TIB-STC, the first large-scale dataset for Tibetan culture.

Motivation: Current Large Language Models (LLMs) fall short in serving minority languages such as Tibetan, and Tibetan culture faces data scarcity because of its complexity and uniqueness.

Details

Method: Use state-of-the-art model architectures optimized for Tibetan's linguistic features, and build TIB-STC, a comprehensive dataset covering literature, religious scripts, news, and conversational data. Result: Sun-Shine performs well on Tibetan language processing tasks (such as language modeling, text classification, machine translation, and syntactic analysis) and shows strong generalization in low-resource scenarios. Conclusion: Sun-Shine is the first large language model for Tibetan culture, filling a gap for LLMs in minority languages and showing preliminary intelligence capabilities. Abstract: Tibetan, a minority language in China, features a highly intricate grammatical structure, characterized by four verb tenses and a tense system with frequent irregularities, contributing to its extensive inflectional diversity. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in many domains. Despite the success in other fields, current LLMs often fall short in catering to the needs of domain experts like Tibetans, and the potential of LLMs for Tibetan culture is under-explored. The intrinsic reasons are the immense and intricate nature of Tibetan culture as well as the necessity for higher granularity and richness in knowledge. Simultaneously, the complexity and uniqueness of its grammatical structure, coupled with its status as a minority ethnic language, contribute to data scarcity, which remains a fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, which is expert in various Tibetan language processing tasks. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan's linguistic features. We also propose TIB-STC, a comprehensive dataset comprising diverse Tibetan texts such as literature, religious scripts, news, and conversational data, which is also the first large-scale dataset for Tibetan culture. Through comprehensive experiments, Sun-Shine not only demonstrates a higher level of knowledge expertise for Tibetan culture but also gains preliminary embodied intelligence capabilities in Tibetan language processing tasks, like language modeling, text classification, machine translation, and syntactic analysis. Moreover, it excels in low-resource scenarios, showcasing strong generalization capabilities.

Fractal-IR: A Unified Framework for Efficient and Scalable Image Restoration

Yawei Li,Bin Ren,Jingyun Liang,Rakesh Ranjan,Mengyuan Liu,Nicu Sebe,Ming-Hsuan Yang,Luca Benini

Task: Propose Fractal-IR, a fractal-based image restoration method that efficiently handles restoration tasks across multiple degradation types and resolutions.

Motivation: Although vision transformers have achieved breakthroughs in image restoration, scaling them efficiently across multiple degradations and resolutions remains challenging.

Details

Method: Use a fractal design that restores images by progressively expanding local information into broader regions, avoiding computationally heavy long-range self-attention. Result: The method achieves state-of-the-art performance on seven common image restoration tasks, for example a 0.21 dB PSNR gain on Manga109. Conclusion: Through its fractal architecture and effective model scaling strategies, Fractal-IR significantly improves image restoration performance and efficiency. Abstract: While vision transformers achieve significant breakthroughs in various image restoration (IR) tasks, it is still challenging to efficiently scale them across multiple types of degradations and resolutions. In this paper, we propose Fractal-IR, a fractal-based design that progressively refines degraded images by repeatedly expanding local information into broader regions. This fractal architecture naturally captures local details at early stages and seamlessly transitions toward global context in deeper fractal stages, removing the need for computationally heavy long-range self-attention mechanisms. Moreover, we observe the challenge in scaling up vision transformers for IR tasks. Through a series of analyses, we identify a holistic set of strategies to effectively guide model scaling. Extensive experimental results show that Fractal-IR achieves state-of-the-art performance in seven common image restoration tasks, including super-resolution, denoising, JPEG artifact removal, IR in adverse weather conditions, motion deblurring, defocus deblurring, and demosaicking. For 2× SR on Manga109, Fractal-IR achieves a 0.21 dB PSNR gain. For grayscale image denoising on Urban100, Fractal-IR surpasses the previous method by 0.2 dB for σ = 50.
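To put the reported PSNR gains in perspective, the sketch below computes PSNR from mean squared error using the standard definition, PSNR = 10 * log10(MAX^2 / MSE), and converts a dB gain into the corresponding MSE ratio. The pixel values and function names are hypothetical, and the evaluation protocol used in the paper (for example colour space or border cropping) is not covered here.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    # Hypothetical two-pixel example: each pixel is off by 10 grey levels.
    print("PSNR:", psnr([100, 100], [110, 90]))
    print("MSE ratio for a 0.21 dB gain:", mse_ratio_for_gain(0.21))
```

Under this definition, a 0.21 dB gain corresponds to an MSE that is about 4.7% lower than the baseline's.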

When is dataset cartography ineffective? Using training dynamics does not improve robustness against Adversarial SQuAD

Paul K. Mandal

Task: Investigate the effectiveness of dataset cartography for extractive question answering on the SQuAD dataset.

Motivation: Analyze annotation artifacts in SQuAD and evaluate the impact of adversarial datasets (AddSent and AddOneSent) on an ELECTRA-small model.

Details

Method: Use training dynamics to partition SQuAD into easy-to-learn, ambiguous, and hard-to-learn subsets, and compare models trained on cartography-based subsets with models trained on random subsets. Result: Training on cartography-based subsets does not significantly improve generalization to the SQuAD validation set or the AddSent adversarial set, with only a slight gain on AddOneSent. Conclusion: Dataset cartography offers limited help for adversarial robustness in SQuAD-style QA tasks, and the results differ from prior findings on SNLI. Abstract: In this paper, I investigate the effectiveness of dataset cartography for extractive question answering on the SQuAD dataset. I begin by analyzing annotation artifacts in SQuAD and evaluate the impact of two adversarial datasets, AddSent and AddOneSent, on an ELECTRA-small model. Using training dynamics, I partition SQuAD into easy-to-learn, ambiguous, and hard-to-learn subsets. I then compare the performance of models trained on these subsets to those trained on randomly selected samples of equal size. Results show that training on cartography-based subsets does not improve generalization to the SQuAD validation set or the AddSent adversarial set. While the hard-to-learn subset yields a slightly higher F1 score on the AddOneSent dataset, the overall gains are limited. These findings suggest that dataset cartography provides little benefit for adversarial robustness in SQuAD-style QA tasks. I conclude by comparing these results to prior findings on SNLI and discuss possible reasons for the observed differences.
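The partitioning step can be sketched as follows, assuming the usual dataset cartography statistics: confidence is the mean probability assigned to the gold answer across epochs, and variability is its standard deviation. How the paper defines the gold-answer probability for span extraction, and the subset sizes it uses, are not stated in this abstract; the example ids, probabilities, and function names are hypothetical.

```python
from statistics import mean, pstdev


def training_dynamics(gold_probs):
    """Map example id -> (confidence, variability) from per-epoch gold-answer probabilities."""
    return {ex: (mean(p), pstdev(p)) for ex, p in gold_probs.items()}


def partition(stats, k):
    """Pick k examples per region; regions may overlap, as in the usual cartography setup."""
    by_conf = sorted(stats, key=lambda ex: stats[ex][0])                # low -> high confidence
    by_var = sorted(stats, key=lambda ex: stats[ex][1], reverse=True)   # high -> low variability
    return {
        "hard_to_learn": by_conf[:k],
        "easy_to_learn": by_conf[len(by_conf) - k:],
        "ambiguous": by_var[:k],
    }


if __name__ == "__main__":
    # Hypothetical gold-answer probabilities recorded after each of three epochs.
    probs = {
        "q1": [0.90, 0.95, 0.99],  # consistently right
        "q2": [0.10, 0.05, 0.10],  # consistently wrong
        "q3": [0.10, 0.90, 0.50],  # fluctuates between epochs
        "q4": [0.60, 0.70, 0.65],
    }
    print(partition(training_dynamics(probs), k=1))
```

A random baseline of equal size, as used in the study, would simply sample k ids uniformly from the same pool.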

Wenxuan Zhu,Bing Li,Cheng Zheng,Jinjie Mai,Jun Chen,Letian Jiang,Abdullah Hamdi,Sara Rojas Martinez,Chia-Wen Lin,Mohamed Elhoseiny,Bernard Ghanem

Task: Evaluate how well multimodal large language models (MLLMs) understand 4D objects (3D objects that evolve over time).

Motivation: No publicly standardized benchmark currently exists for assessing MLLMs' 4D object understanding, and this gap needs to be filled.

Details

Method: Proposes the 4D-Bench benchmark, comprising 4D object question answering and 4D object captioning tasks, with diverse categories, high-quality annotations, and a need for multi-view spatial-temporal understanding. Result: MLLMs are weaker at temporal understanding, where open-source models trail closed-source models by a larger margin; on 4D object QA, even GPT-4o reaches only 63% accuracy, far below the human baseline of 91%. Conclusion: A substantial gap remains in 4D object understanding, and MLLMs need further improvement. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects (3D objects that evolve over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks necessitating multi-view spatial-temporal understanding, different from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results from the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding compared to their appearance understanding. Notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.

Fact-checking AI-generated news reports: Can LLMs catch their own lies?

Jiayi Yao,Haibo Sun,Nianwen Xue

Task: Evaluate the ability of large language models (LLMs) to judge the veracity of claims in "news reports" generated by themselves or by other LLMs.

Motivation: Determine whether LLMs can fact-check their own generated content as effectively as they verify claims made by humans.

Details

Method: Applies methods similar to those used for verifying human claims, combined with Retrieval-Augmented Generation (RAG), to evaluate LLMs across different types of news. Result: LLMs assess national or international news better than local news, static information better than dynamic information, and true claims better than false ones; RAG significantly reduces the number of claims that cannot be assessed but also increases incorrect assessments. Conclusion: Future research should prioritize the precision and relevance of retrieved information; dynamic events and local news may require human-in-the-loop fact-checking systems to ensure accuracy. Abstract: In this paper, we evaluate the ability of Large Language Models (LLMs) to assess the veracity of claims in "news reports" generated by themselves or other LLMs. Our goal is to determine whether LLMs can effectively fact-check their own content, using methods similar to those used to verify claims made by humans. Our findings indicate that LLMs are more effective at assessing claims in national or international news stories than in local news stories, better at evaluating static information than dynamic information, and better at verifying true claims compared to false ones. We hypothesize that this disparity arises because the former types of claims are better represented in the training data. Additionally, we find that incorporating retrieved results from a search engine in a Retrieval-Augmented Generation (RAG) setting significantly reduces the number of claims an LLM cannot assess. However, this approach also increases the occurrence of incorrect assessments, partly due to irrelevant or low-quality search results. This diagnostic study highlights the need for future research on fact-checking machine-generated reports to prioritize improving the precision and relevance of retrieved information to better support fact-checking efforts. Furthermore, claims about dynamic events and local news may require human-in-the-loop fact-checking systems to ensure accuracy and reliability.

ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling

Radu Beche,Sergiu Nedevschi

Task: Develop ClaraVid, a synthetic aerial dataset, and introduce the Delentropic Scene Profile (DSP) as a scene complexity metric.

Motivation: Existing synthetic datasets suffer from task-specific limitations, unrealistic scene compositions, and rendering artifacts, which hold back aerial scene understanding algorithms.

Details

Method: Builds the ClaraVid dataset of 16,917 high-resolution images with dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks; proposes DSP as a scene complexity metric. Result: Experiments show that DSP reliably quantifies scene complexity and that higher delentropy correlates strongly with increased reconstruction error. Conclusion: ClaraVid and DSP provide high-quality data and a complexity assessment tool for aerial scene understanding, advancing neural reconstruction methods. Abstract: The development of aerial holistic scene understanding algorithms is hindered by the scarcity of comprehensive datasets that enable both semantic and geometric reconstruction. While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. Comprising 16,917 high-resolution images captured at 4032x3024 from multiple viewpoints across diverse landscapes, ClaraVid provides dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. To further advance neural reconstruction, we introduce the Delentropic Scene Profile (DSP), a novel complexity metric derived from differential entropy analysis, designed to quantitatively assess scene difficulty and inform reconstruction tasks. Utilizing DSP, we systematically benchmark neural reconstruction methods, uncovering a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirical results indicate that higher delentropy strongly correlates with increased reconstruction errors, validating DSP as a reliable complexity prior. The work is currently under review; upon acceptance, the data and code will be available at https://rdbch.github.io/claravid.

Surgical Action Planning with Large Language Models

Mengya Xu,Zhongzhen Huang,Jie Zhang,Xiaofan Zhang,Qi Dou

Task: Propose an LLM-based surgical action planning framework (LLM-SAP) that generates future action plans from visual inputs.

Motivation: Current intelligent applications lack intraoperative predictive planning. Surgical Action Planning (SAP) has potential for enhancing intraoperative guidance and automating procedures, but faces challenges such as understanding instrument-action relationships and tracking surgical progress.

Details

Method: LLM-SAP integrates two new modules: the Near-History Focus Memory Module (NHF-MM) for modeling historical states, and the prompts factory for action planning. Pre-trained LLMs (such as Qwen2.5 and Qwen2-VL) are tested zero-shot, and supervised fine-tuning (SFT) with LoRA is used to address data privacy concerns. Result: On the CholecT50-SAP dataset, Qwen2.5-72B-SFT achieves 19.3% higher accuracy than Qwen2.5-72B. Conclusion: LLM-SAP performs well at surgical action prediction and supports applications such as surgical education, intraoperative decision-making, procedure documentation, and skill analysis. Abstract: In robot-assisted minimally invasive surgery, we introduce the Surgical Action Planning (SAP) task, which generates future action plans from visual inputs to address the absence of intraoperative predictive planning in current intelligent applications. SAP shows great potential for enhancing intraoperative guidance and automating procedures. However, it faces challenges such as understanding instrument-action relationships and tracking surgical progress. Large Language Models (LLMs) show promise in understanding surgical video content but remain underexplored for predictive decision-making in SAP, as they focus mainly on retrospective analysis. Challenges like data privacy, computational demands, and modality-specific constraints further highlight significant research gaps. To tackle these challenges, we introduce LLM-SAP, a Large Language Models-based Surgical Action Planning framework that predicts future actions and generates text responses by interpreting natural language prompts of surgical goals. The text responses potentially support surgical education, intraoperative decision-making, procedure documentation, and skill analysis. LLM-SAP integrates two novel modules: the Near-History Focus Memory Module (NHF-MM) for modeling historical states and the prompts factory for action planning. We evaluate LLM-SAP on our constructed CholecT50-SAP dataset using models like Qwen2.5 and Qwen2-VL, demonstrating its effectiveness in next-action prediction. Pre-trained LLMs are tested zero-shot, and supervised fine-tuning (SFT) with LoRA is implemented to address data privacy concerns. Our experiments show that Qwen2.5-72B-SFT surpasses Qwen2.5-72B with a 19.3% higher accuracy.

A Causal Adjustment Module for Debiasing Scene Graph Generation

Li Liu,Shuzhou Sun,Shuaifeng Zhi,Fan Shi,Zhen Liu,Janne Heikkilä,Yongxiang Liu

Task: Use causal inference to model the causal relationships among the skewed distributions observed in Scene Graph Generation (SGG), in order to address model bias.

Motivation: Existing methods attribute model bias solely to the long-tail distribution of relationships and overlook the deeper skew in object and object-pair distributions, so a more complete causal analysis is needed.

Details

Method: Proposes the Mediator-based Causal Chain Model (MCCM) and the Causal Adjustment Module (CAModule), which model causality among objects, object pairs, and relationships and introduce a mediator variable (the co-occurrence distribution) to complement the causal chain. Result: CAModule achieves state-of-the-art mean recall across multiple SGG backbones and benchmarks, with significant gains on zero-shot recall as well. Conclusion: Causal modeling and adjustment address model bias in SGG more comprehensively and improve zero-shot relationship recognition. Abstract: While recent debiasing methods for Scene Graph Generation (SGG) have shown impressive performance, these efforts often attribute model bias solely to the long-tail distribution of relationships, overlooking the more profound causes stemming from skewed object and object pair distributions. In this paper, we employ causal inference techniques to model the causality among these observed skewed distributions. Our insight lies in the ability of causal inference to capture the unobservable causal effects between complex distributions, which is crucial for tracing the roots of model bias. Specifically, we introduce the Mediator-based Causal Chain Model (MCCM), which, in addition to modeling causality among objects, object pairs, and relationships, incorporates mediator variables, i.e., the co-occurrence distribution, for complementing the causality. Following this, we propose the Causal Adjustment Module (CAModule) to estimate the modeled causal structure, using variables from MCCM as inputs to produce a set of adjustment factors aimed at correcting biased model predictions. Moreover, our method enables the composition of zero-shot relationships, thereby enhancing the model's ability to recognize such relationships. Experiments conducted across various SGG backbones and popular benchmarks demonstrate that CAModule achieves state-of-the-art mean recall rates, with significant improvements also observed on the challenging zero-shot recall rate metric.
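Mean recall is the headline metric here because it averages recall per predicate class, so rare relationships count as much as frequent ones. The sketch below contrasts it with overall recall on a toy long-tailed example. It is a simplified illustration, not the benchmark's evaluation code: the function name and the input format (one `(gold_predicate, was_recalled)` pair per ground-truth triplet) are hypothetical, and the top-K matching step that decides `was_recalled` is assumed to have happened already.

```python
from collections import defaultdict


def recall_and_mean_recall(gold_triplets):
    """Return (overall recall, mean per-predicate recall).

    gold_triplets: iterable of (gold_predicate, was_recalled) pairs,
    one per ground-truth relationship.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for predicate, recalled in gold_triplets:
        totals[predicate] += 1
        hits[predicate] += int(recalled)
    if not totals:
        raise ValueError("no ground-truth triplets given")
    overall = sum(hits.values()) / sum(totals.values())
    per_class = [hits[p] / totals[p] for p in totals]
    return overall, sum(per_class) / len(per_class)
```

With nine recalled instances of a frequent predicate and one missed instance of a rare predicate, overall recall is 0.9 while mean recall is 0.5, which is why a model biased toward head predicates can look strong on recall and weak on mean recall.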

J&H: Evaluating the Robustness of Large Language Models Under Knowledge-Injection Attacks in Legal Domain

Yiran Hu,Huanghai Liu,Qingjing Chen,Ning Zheng,Chong Wang,Yun Liu,Charles L. A. Clarke,Weixing Shen

Task: Propose a legal knowledge injection attack method for testing the robustness of large language models (LLMs) in the legal domain and inferring whether they possess legal knowledge and reasoning logic.

Motivation: As LLMs scale up, their use in knowledge-intensive fields such as law has drawn attention, but it remains unclear whether their judgments rest on domain knowledge and logical reasoning. If they rely only on specific words or patterns rather than the underlying logic of the language, real-world use carries risk.

Details

Method: Proposes the J&H evaluation framework, which tests LLM robustness through legal knowledge injection attacks targeting each part of the reasoning logic (major premise, minor premise, and conclusion generation). Result: Existing LLMs fail to withstand the attacks in the experiments, indicating insufficient robustness. Methods for enhancing the knowledge robustness of LLMs are also proposed and compared. Conclusion: LLMs may lack logical reasoning ability in legal tasks and are easily misled; their knowledge robustness needs further improvement. Abstract: As the scale and capabilities of Large Language Models (LLMs) increase, their applications in knowledge-intensive fields such as the legal domain have garnered widespread attention. However, it remains doubtful whether these LLMs make judgments based on domain knowledge for reasoning. If LLMs base their judgments solely on specific words or patterns, rather than on the underlying logic of the language, the "LLM-as-judges" paradigm poses substantial risks in real-world applications. To address this question, we propose a method of legal knowledge injection attacks for robustness testing, thereby inferring whether LLMs have learned legal knowledge and reasoning logic. In this paper, we propose J&H: an evaluation framework for detecting the robustness of LLMs under knowledge injection attacks in the legal domain. The aim of the framework is to explore whether LLMs perform deductive reasoning when accomplishing legal tasks. To further this aim, we have attacked each part of the reasoning logic underlying these tasks (major premise, minor premise, and conclusion generation). We have collected mistakes that legal experts might make in judicial decisions in the real world, such as typos, legal synonyms, and inaccurate retrieval of external legal statutes. In real legal practice, legal experts tend to overlook these mistakes and make judgments based on logic. When faced with these errors, however, LLMs are likely to be misled by typographical errors and may not utilize logic in their judgments. We conducted knowledge injection attacks on existing general and domain-specific LLMs. Current LLMs are not robust against the attacks employed in our experiments. In addition, we propose and compare several methods to enhance the knowledge robustness of LLMs.

good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval

Pranavi Kolouju,Eric Xing,Robert Pless,Nathan Jacobs,Abby Stylianou

Task: Propose a structured pipeline that uses vision-language models to generate high-quality synthetic annotations, improving composed image retrieval (CIR).

Motivation: Existing datasets rely on simplistic, ambiguous, or insufficient manual annotations, which limits fine-grained retrieval performance.

Details

Method: Extracts fine-grained object descriptions from query images, generates comparable descriptions for target images, and synthesizes textual instructions that capture meaningful transformations between images. Result: Reduces hallucination, increases modification diversity, ensures object-level consistency, and improves the retrieval accuracy of CIR models. Conclusion: The method improves existing datasets and supports the creation of new datasets across domains, providing a framework for research in CIR and multi-modal retrieval. Abstract: Composed image retrieval (CIR) enables users to search images using a reference image combined with textual modifications. Recent advances in vision-language models have improved CIR, but dataset limitations remain a barrier. Existing datasets often rely on simplistic, ambiguous, or insufficient manual annotations, hindering fine-grained retrieval. We introduce good4cir, a structured pipeline leveraging vision-language models to generate high-quality synthetic annotations. Our method involves: (1) extracting fine-grained object descriptions from query images, (2) generating comparable descriptions for target images, and (3) synthesizing textual instructions capturing meaningful transformations between images. This reduces hallucination, enhances modification diversity, and ensures object-level consistency. Applying our method improves existing datasets and enables creating new datasets across diverse domains. Results demonstrate improved retrieval accuracy for CIR models trained on our pipeline-generated datasets. We release our dataset construction framework to support further research in CIR and multi-modal retrieval.

Teaching LLMs for Step-Level Automatic Math Correction via Reinforcement Learning

Junsong Li,Jie Zhou,Yutao Yang,Bihao Zhan,Qianjun Pan,Yuyang Ding,Qin Chen,Jiang Bo,Xin Lin,Liang He

Task: Use reinforcement learning to improve large language models at step-level automatic correction of math solutions.

Motivation: Existing work mostly judges the final answer at the problem level and lacks detailed feedback on each step of the solution process, which requires semantic understanding and reasoning.

Details

Method: Proposes StepAMC, a reinforcement-learning-based method that casts step-level math correction as an RL problem and designs a space-constrained policy network and a fine-grained reward network. Result: Experiments on two benchmark datasets show that the model outperforms eleven strong baselines. Conclusion: StepAMC uses reinforcement learning to improve the reasoning ability and stability of LLMs in step-level math correction. Abstract: Automatic math correction aims to check students' solutions to mathematical problems via artificial intelligence technologies. Most existing studies focus on judging the final answer at the problem level, while they ignore detailed feedback on each step in a math problem-solving process, which requires abilities of semantic understanding and reasoning. In this paper, we propose a reinforcement learning (RL)-based method to boost large language models (LLMs) for step-level automatic math correction, named StepAMC. Particularly, we convert the step-level automatic math correction within the text classification task into an RL problem to enhance the reasoning capabilities of LLMs. Then, we design a space-constrained policy network to improve the stability of RL. Finally, we introduce a fine-grained reward network to convert the binary human feedback into a continuous value. We conduct extensive experiments over two benchmark datasets and the results show that our model outperforms eleven strong baselines.

IceBench: A Benchmark for Deep Learning based Sea Ice Type Classification

Samira Alkaee Taleghan,Andrew P. Barrett,Walter N. Meier,Farnoush Banaei-Kashani

Task: Propose IceBench, a standardized benchmarking framework for sea ice type classification.

Motivation: Traditional manual methods are time-consuming, costly, and biased. Automated classification, especially with deep learning, can improve efficiency and consistency, but standardized benchmarks and comparative studies are missing.

Details

Method: Uses the AI4Arctic Sea Ice Challenge dataset to build the IceBench framework with evaluation metrics and representative models, then carries out a comparative study and systematic experiments. Result: IceBench supports the integration and evaluation of new methods, and the experiments analyze model transferability across time and space, data downscaling, and preprocessing strategies. Conclusion: IceBench provides a standardized tool for sea ice type classification, facilitating method comparison and reproducible research. Abstract: Sea ice plays a critical role in the global climate system and maritime operations, making timely and accurate classification essential. However, traditional manual methods are time-consuming, costly, and have inherent biases. Automating sea ice type classification addresses these challenges by enabling faster, more consistent, and scalable analysis. While both traditional and deep learning approaches have been explored, deep learning models offer a promising direction for improving efficiency and consistency in sea ice classification. However, the absence of a standardized benchmark and comparative study prevents a clear consensus on the best-performing models. To bridge this gap, we introduce IceBench, a comprehensive benchmarking framework for sea ice type classification. Our key contributions are threefold: First, we establish the IceBench benchmarking framework, which leverages the existing AI4Arctic Sea Ice Challenge dataset as a standardized dataset, incorporates a comprehensive set of evaluation metrics, and includes representative models from the entire spectrum of sea ice type classification methods, categorized in two distinct groups, namely, pixel-based classification methods and patch-based classification methods. IceBench is open-source and allows for convenient integration and evaluation of other sea ice type classification methods, thereby facilitating comparative evaluation of new methods and improving reproducibility in the field. Second, we conduct an in-depth comparative study on representative models to assess their strengths and limitations, providing insights for both practitioners and researchers. Third, we leverage IceBench for systematic experiments addressing key research questions on model transferability across seasons (time) and locations (space), data downscaling, and preprocessing strategies.

Words as Bridges: Exploring Computational Support for Cross-Disciplinary Translation Work

Calvin Bao,Yow-Ting Shiue,Marine Carpuat,Joel Chan

Task: Explore how cross-domain word embedding alignment can support scholars' conceptual exploration across academic fields.

Motivation: Scholars doing cross-disciplinary research are often hampered by field-specific jargon. Existing approaches mostly simplify or summarize jargon, whereas this work tries to preserve it as a conceptual bridge.

Details

Method: Treats different scholarly domains as different language communities and applies unsupervised cross-lingual word embedding alignment to explore conceptual alignments between domain-specific embedding spaces. Result: A prototype cross-domain search engine was developed and tested in two case studies, revealing the promise and limits of the approach. Conclusion: Offers design insights for future computational interfaces that support cross-domain information seeking. Abstract: Scholars often explore literature outside of their home community of study. This exploration process is frequently hampered by field-specific jargon. Past computational work often focuses on supporting translation work by removing jargon through simplification and summarization; here, we explore a different approach that preserves jargon as useful bridges to new conceptual spaces. Specifically, we cast different scholarly domains as different language-using communities, and explore how to adapt techniques from unsupervised cross-lingual alignment of word embeddings to explore conceptual alignments between domain-specific word embedding spaces. We developed a prototype cross-domain search engine that uses aligned domain-specific embeddings to support conceptual exploration, and tested this prototype in two case studies. We discuss qualitative insights into the promises and pitfalls of this approach to translation work, and suggest design insights for future interfaces that provide computational support for cross-domain information seeking.

What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images

Dongheng Lin,Han Hu,Jianbo Jiao

Task: Explore how time awareness can be learned from static images, and answer the question "what time tells us".

Motivation: Time becomes visible through illumination changes, which motivates learning time awareness from static images.

Details

Method: Proposes Time-Image Contrastive Learning (TICL), which jointly models timestamps and visual representations through cross-modal contrastive learning. Result: TICL performs strongly on timestamp estimation, and the learned time-aware embeddings do well on several downstream tasks (such as image retrieval, video classification, and image editing). Conclusion: Time-related visual cues in static images can be learned, laying a foundation for future research on time-related visual context. Abstract: Time becomes visible through illumination changes in what we see. Inspired by this, in this paper we explore the potential to learn time awareness from static images, trying to answer: what time tells us? To this end, we first introduce a Time-Oriented Collection (TOC) dataset, which contains 130,906 images with reliable timestamps. Leveraging this dataset, we propose a Time-Image Contrastive Learning (TICL) approach to jointly model timestamps and related visual representations through cross-modal contrastive learning. We found that the proposed TICL 1) achieves state-of-the-art performance on the timestamp estimation task across various benchmark metrics, and 2) interestingly, despite only seeing static images, learns time-aware embeddings that show strong capability in several time-aware downstream tasks such as time-based image retrieval, video scene classification, and time-aware image editing. Our findings suggest that time-related visual cues can be learned from static images and are beneficial for various vision tasks, laying a foundation for future research on understanding time-related visual context. Project page: https://rathgrith.github.io/timetells/.

Whispering in Amharic: Fine-tuning Whisper for Low-resource Language

Dawit Ketema Gete,Bedru Yimam Ahamed,Tadesse Destaw Belay,Yohannes Ayana Ejigu,Sukairaj Hafiz Imam,Alemu Belay Tessema,Mohammed Oumer Adem,Tadesse Amare Belay,Robert Geislinger,Umma Aliyu Musa,Martin Semmann,Shamsuddeen Hassan Muhammad,Henning Schreiber,Seid Muhie Yimam

Task: Fine-tune OpenAI's Whisper automatic speech recognition (ASR) model to improve transcription accuracy for Amharic, a low-resource language.

Motivation: The base Whisper model performs poorly on Amharic because the language has limited representation in its training data.

Details

Method: Fine-tunes Whisper on the Mozilla Common Voice, FLEURS, and BDU-speech datasets and explores data combination strategies. Result: The best model, Whispersmall-am, improves significantly when trained on a mix of FLEURS data and new datasets; normalizing Amharic homophones also significantly improves WER and BLEU scores. Conclusion: The study underscores the importance of fine-tuning strategies and dataset composition for ASR in low-resource languages and offers insights for future Amharic speech recognition research. Abstract: This work explores fine-tuning OpenAI's Whisper automatic speech recognition (ASR) model for Amharic, a low-resource language, to improve transcription accuracy. While the foundational Whisper model struggles with Amharic due to limited representation in its training data, we fine-tune it using datasets like Mozilla Common Voice, FLEURS, and the BDU-speech dataset. The best-performing model, Whispersmall-am, significantly improves when fine-tuned on a mix of existing FLEURS data and new, unseen Amharic datasets. Training solely on new data leads to poor performance, but combining it with FLEURS data reinforces the model, enabling better specialization in Amharic. We also demonstrate that normalizing Amharic homophones significantly improves Word Error Rate (WER) and Bilingual Evaluation Understudy (BLEU) scores. This study underscores the importance of fine-tuning strategies and dataset composition for improving ASR in low-resource languages, providing insights for future Amharic speech recognition research.
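The effect of homophone normalization on WER can be illustrated with a small sketch. Amharic has several character series that are pronounced alike, so a transcript can be phonetically right yet counted as wrong if it picks a different spelling. The mapping below is an assumption for illustration, not the authors' normalization table: it folds a few well-known homophone rows of the Ethiopic Unicode block onto one canonical row each (the choice of canonical row is arbitrary), and the names `HOMOPHONE_ROWS`, `normalize_homophones`, and `word_error_rate` are hypothetical.

```python
# (variant_row_start, canonical_row_start) code points in the Ethiopic block.
# Each row holds the seven vowel orders of one consonant series.
HOMOPHONE_ROWS = [
    (0x1210, 0x1200),  # HHA series -> HA series
    (0x1280, 0x1200),  # XA series  -> HA series
    (0x1220, 0x1230),  # SZA series -> SA series
    (0x12D0, 0x12A0),  # pharyngeal A series -> glottal A series
    (0x1340, 0x1338),  # TZA series -> TSA series
]
TABLE = {
    variant + i: canonical + i
    for variant, canonical in HOMOPHONE_ROWS
    for i in range(7)
}


def normalize_homophones(text):
    """Replace homophone variants with their canonical characters."""
    return text.translate(TABLE)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

Applying `normalize_homophones` to both reference and hypothesis before scoring removes errors that stem only from spelling variants; for a two-word reference whose first word differs from the hypothesis only in an SA versus SZA character, WER drops from 0.5 to 0.0.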

Guided Diffusion for the Extension of Machine Vision to Human Visual Perception

Takahiro Shindo,Yui Tatsumi,Taiju Watanabe,Hiroshi Watanabe

Task: Propose a guided-diffusion method for extending machine vision to human visual perception.

Motivation: With rapid progress in image recognition models, Image Coding for Machines (ICM), i.e. image compression for AI tasks, has become important, while the needs of human visual perception must also be met. Diffusion models can generate human-viewable images from a small amount of data, which offers a new approach to image compression.

Details

Method: Uses the output of an ICM method to guide a diffusion model that generates human-viewable images from random noise, bridging machine vision and human vision. Result: The generated images are evaluated on bitrate and image quality and compared with other scalable image coding methods. Conclusion: Guided diffusion enables transitions between machine vision and human vision without additional bitrate overhead and has potential practical value. Abstract: Image compression technology eliminates redundant information to enable efficient transmission and storage of images, serving both machine vision and human visual perception. For years, image coding focused on human perception has been well-studied, leading to the development of various image compression standards. On the other hand, with the rapid advancements in image recognition models, image compression for AI tasks, known as Image Coding for Machines (ICM), has gained significant importance. Therefore, scalable image coding techniques that address the needs of both machines and humans have become a key area of interest. Additionally, there is increasing demand for research applying the diffusion model, which can generate human-viewable images from a small amount of data, to image compression methods for human vision. Image compression methods that use diffusion models can partially reconstruct the target image by guiding the generation process with a small amount of conditioning information. Inspired by the diffusion model's potential, we propose a method for extending machine vision to human visual perception using guided diffusion. Utilizing the diffusion model guided by the output of the ICM method, we generate images for human perception from random noise. Guided diffusion acts as a bridge between machine vision and human vision, enabling transitions between them without any additional bitrate overhead. The generated images are then evaluated based on bitrate and image quality, and we compare their compression performance with other scalable image coding methods for humans and machines.

MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

Shuo Yang,Siwen Luo,Soyeon Caren Han,Eduard Hovy

Task: Improve the robustness of Visual Question Answering (VQA) by systematically integrating commonsense knowledge with Large Vision-Language Models (LVLMs).

Motivation: LVLMs lack integrated commonsense knowledge in VQA, which limits their performance in real-world scenarios.

Details

Method: Proposes the MAGIC-VQA framework with a three-stage process: explicit knowledge integration, by-type post-processing, and implicit knowledge augmentation based on a Graph Neural Network (GNN). Result: Achieves state-of-the-art performance on benchmark datasets and significantly improves commonsense reasoning in VQA. Conclusion: MAGIC-VQA bridges a key gap by unifying commonsense knowledge with LVLM-driven reasoning, without extensive pre-training or complex prompt tuning. Abstract: Visual Question Answering (VQA) requires reasoning across visual and textual modalities, yet Large Vision-Language Models (LVLMs) often lack integrated commonsense knowledge, limiting their robustness in real-world scenarios. To address this, we introduce MAGIC-VQA, a novel framework that enhances VQA by systematically integrating commonsense knowledge with LVLMs. MAGIC-VQA employs a three-stage process: (1) Explicit Knowledge Integration from external sources, (2) By-Type Post-Processing for contextual refinement, and (3) Implicit Knowledge Augmentation using a Graph Neural Network (GNN) for structured reasoning. GNNs bring greater depth to structured inference, enabling relational inference beyond what LVLMs provide. MAGIC-VQA bridges a key gap by unifying commonsense knowledge with LVLM-driven reasoning, eliminating the need for extensive pre-training or complex prompt tuning. Our framework achieves state-of-the-art performance on benchmark datasets, significantly improving commonsense reasoning in VQA.

Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization

Zefeng Zhang,Hengzhu Tang,Jiawei Sheng,Zhenyu Zhang,Yiming Ren,Zhenyang Li,Dawei Yin,Duohe Ma,Tingwen Liu

Task: Address modality bias in multimodal large language models.

Motivation: Multimodal large language models tend to rely on a single modality and overlook critical information in the others, leading to incorrect focus and irrelevant responses.

Details

Method: Proposes a preference-optimization approach comprising the debiased dataset RLAIFVBias and a noise-aware preference optimization algorithm. Result: Experiments validate the approach, which both mitigates modality bias and significantly reduces hallucinations. Conclusion: Preference optimization successfully addresses modality bias and improves overall model performance. Abstract: Multimodal Large Language Models excel in various tasks, yet often struggle with modality bias, where the model tends to rely heavily on a single modality and overlook critical information in other modalities, leading to incorrect focus and irrelevant responses. In this paper, we propose using the paradigm of preference optimization to solve the modality bias problem, including RLAIFVBias, a debiased preference optimization dataset, and a Noise Aware Preference Optimization algorithm. Specifically, we first construct the dataset by introducing perturbations to reduce the informational content of certain modalities, compelling the model to rely on a specific modality when generating negative responses. To address the inevitable noise in automatically constructed data, we combine the noise-robust Mean Absolute Error with the Binary Cross Entropy in Direct Preference Optimization by a negative Box-Cox transformation, and dynamically adjust the algorithm's noise robustness based on the evaluated noise levels in the data. Extensive experiments validate our approach, demonstrating not only its effectiveness in mitigating modality bias but also its significant role in minimizing hallucinations.
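One standard way to interpolate between binary cross-entropy and mean absolute error with a negative Box-Cox transform is the generalized cross-entropy form (1 - p^q) / q, where p is the probability assigned to the preferred response. The sketch below applies that form to the DPO preference probability. It is an illustration under assumptions: the abstract does not give the paper's exact loss or how q is set from the estimated noise level, and the function names and the `beta` default are hypothetical.

```python
import math


def preference_probability(policy_chosen_logp, policy_rejected_logp,
                           ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO probability that the chosen response is preferred over the rejected one."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return 1.0 / (1.0 + math.exp(-margin))


def box_cox_preference_loss(p, q):
    """Negative Box-Cox loss (1 - p**q) / q for a preference probability p.

    q -> 0 recovers binary cross-entropy (-log p), the usual DPO loss;
    q = 1 gives 1 - p, a mean-absolute-error style loss that is more
    tolerant of mislabeled preference pairs.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if q == 0.0:
        return -math.log(p)
    return (1.0 - p ** q) / q
```

Under this reading, raising q when the data is judged noisier shifts the loss toward the MAE end, which bounds the penalty on pairs the model strongly disagrees with, while q near 0 keeps the sharper gradients of the cross-entropy end.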

Autoregressive Language Models for Knowledge Base Population: A case study in the space mission domain

Andrés García-Silva,José Manuel Gómez-Pérez

Task: Achieve end-to-end knowledge base population (KBP) by fine-tuning an autoregressive language model.

Motivation: Use the large context windows supported by large language models to meet the need for populating and maintaining knowledge bases.

Details

Method: Generates a dataset and fine-tunes autoregressive language models, applied to populating a space mission knowledge graph. Result: Fine-tuned smaller models can match or even exceed the accuracy of larger models on KBP, with lower deployment and inference costs. Conclusion: Small models specialized for KBP do not need the ontology in the prompt, leaving more context space for input text or output serialization. Abstract: Knowledge base population (KBP) plays a crucial role in populating knowledge bases and keeping them up to date in organizations by leveraging domain corpora. Motivated by the increasingly large context windows supported by large language models, we propose to fine-tune an autoregressive language model for end-to-end KBP. Our case study involves the population of a space mission knowledge graph. To fine-tune the model we generate a dataset for end-to-end KBP tapping into existing domain resources. Our case study shows that fine-tuned language models of limited size can achieve competitive and even higher accuracy than larger models in the KBP task. Smaller models specialized for KBP offer affordable deployment and lower-cost inference. Moreover, KBP specialist models do not require the ontology to be included in the prompt, allowing for more space in the context for additional input text or output serialization.

TransAnimate: Taming Layer Diffusion to Generate RGBA Video

Xuewei Chen,Zhimin Chen,Yiren Song

Task: Propose TransAnimate, a framework for generating dynamic videos with a transparency channel (RGBA).

Motivation: Address the data scarcity and model adaptation problems that existing text-to-video models face when generating transparent videos.

Details

Method: Combines RGBA image generation techniques with video generation modules, using pre-trained model weights, temporal models, and controllability plugins, and introduces an interactive motion-guided control mechanism. Result: TransAnimate generates high-quality RGBA videos suitable for gaming and visual effects. Conclusion: TransAnimate is a practical and effective tool that fills a gap in transparent video generation. Abstract: Text-to-video generative models have made remarkable advancements in recent years. However, generating RGBA videos with alpha channels for transparency and visual effects remains a significant challenge due to the scarcity of suitable datasets and the complexity of adapting existing models for this purpose. To address these limitations, we present TransAnimate, an innovative framework that integrates RGBA image generation techniques with video generation modules, enabling the creation of dynamic and transparent videos. TransAnimate efficiently leverages pre-trained text-to-transparent image model weights and combines them with temporal models and controllability plugins trained on RGB videos, adapting them for controllable RGBA video generation tasks. Additionally, we introduce an interactive motion-guided control mechanism, where directional arrows define movement and colors adjust scaling, offering precise and intuitive control for designing game effects. To further alleviate data scarcity, we have developed a pipeline for creating an RGBA video dataset, incorporating high-quality game effect videos, extracted foreground objects, and synthetic transparent videos. Comprehensive experiments demonstrate that TransAnimate generates high-quality RGBA videos, establishing it as a practical and effective tool for applications in gaming and visual effects.

SciClaims: An End-to-End Generative System for Biomedical Claim Analysis

Raúl Ortega,José Manuel Gómez-Pérez

Task: Propose and validate SciClaims, a system that automates key claim extraction, evidence retrieval, and verification for scientific literature.

Motivation: Current scientific claim verification approaches are limited: they lack end-to-end pipelines, depend on complex NLP and information retrieval systems, and explain verification outcomes poorly.

Details

Method: Builds SciClaims on state-of-the-art large language models (LLMs), integrating claim extraction, evidence retrieval, and verification end to end. Result: SciClaims outperforms existing approaches in claim extraction and verification without additional fine-tuning, setting a new benchmark for automated scientific claim analysis. Conclusion: By integrating LLMs, SciClaims overcomes the limitations of current claim verification and offers a more efficient and reliable solution for automated analysis of scientific literature. Abstract: Validating key claims in scientific literature, particularly in biomedical research, is essential for ensuring accuracy and advancing knowledge. This process is critical in sectors like the pharmaceutical industry, where rapid scientific progress requires automation and deep domain expertise. However, current solutions have significant limitations. They lack end-to-end pipelines encompassing all claim extraction, evidence retrieval, and verification steps; rely on complex NLP and information retrieval pipelines prone to multiple failure points; and often fail to provide clear, user-friendly justifications for claim verification outcomes. To address these challenges, we introduce SciClaims, an advanced system powered by state-of-the-art large language models (LLMs) that seamlessly integrates the entire scientific claim analysis process. SciClaims outperforms previous approaches in both claim extraction and verification without requiring additional fine-tuning, setting a new benchmark for automated scientific claim analysis.

Cross-Domain Underwater Image Enhancement Guided by No-Reference Image Quality Assessment: A Transfer Learning Approach

Zhi Zhang,Daoyi Chen

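Task: Propose a transfer learning-based method for single underwater image enhancement (Trans-UIE) that addresses pseudo labels and dataset scarcity.

Motivation: Underwater image enhancement is hampered by domain discrepancy caused by pseudo labels and by the tendency of small datasets to cause overfitting.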

Details

Method: Captures the fundamental paradigms of underwater image enhancement through pretraining, fine-tunes on a combination of reference and non-reference datasets, and uses no-reference image quality assessment (NR-IQA) metrics and a Pearson correlation loss to optimize the model. Result: On both reference and no-reference underwater benchmark datasets, Trans-UIE significantly outperforms existing methods. Conclusion: Through transfer learning and multi-loss optimization, Trans-UIE effectively addresses domain discrepancy and overfitting in underwater image enhancement. Abstract: Single underwater image enhancement (UIE) is a challenging ill-posed problem, and its development is hindered by two major issues: (1) The labels in underwater reference datasets are pseudo labels, and relying on these pseudo ground truths in supervised learning leads to domain discrepancy. (2) Underwater reference datasets are scarce, making training on such small datasets prone to overfitting and distribution shift. To address these challenges, we propose Trans-UIE, a transfer learning-based UIE model that captures the fundamental paradigms of UIE through pretraining and utilizes a dataset composed of both reference and non-reference datasets for fine-tuning. However, fine-tuning the model using only reconstruction loss may introduce confirmation bias. To mitigate this, our method leverages no-reference image quality assessment (NR-IQA) metrics from above-water scenes to guide the transfer learning process across domains while generating enhanced images with the style of the above-water image domain. Additionally, to reduce the risk of overfitting during the pretraining stage, we introduce Pearson correlation loss. Experimental results on both full-reference and no-reference underwater benchmark datasets demonstrate that Trans-UIE significantly outperforms state-of-the-art methods.
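
The abstract names a Pearson correlation loss but does not give its formula. The sketch below shows one common form, `1 - r`, where `r` is the Pearson correlation between predicted and target values; this form, the function names, and the handling of constant inputs are assumptions for illustration, not the authors' definition.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    if denom == 0.0:
        # Assumption: a constant input is treated as uncorrelated.
        return 0.0
    return cov / denom


def pearson_loss(pred, target):
    """Hypothetical loss: 0 for perfect positive correlation, 2 for perfect negative."""
    return 1.0 - pearson_correlation(pred, target)
```

Because the loss depends only on correlation, it is invariant to a positive rescaling or offset of the prediction, which is why it is usually paired with a reconstruction term rather than used alone.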

Natural Language Processing for Electronic Health Records in Scandinavian Languages: Norwegian, Swedish, and Danish

Ashenafi Zebene Woldaregay,Jørgen Aarmo Lund,Phuong Dinh Ngo,Mariyam Tayefi,Joel Burman,Stine Hansen,Martin Hylleholt Sillesen,Hercules Dalianis,Robert Jenssen,Lindsetmo Rolf Ole,Karl Øyvind Mikalsen

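Task: Conduct a systematic review of state-of-the-art natural language processing methods for clinical text in the mainland Scandinavian languages.

Motivation: Clinical natural language processing (NLP) has great potential in healthcare, but research on clinical text in the mainland Scandinavian languages shows gaps and disparities that call for a comprehensive assessment.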

Details

Method: Searched multiple online databases (including PubMed and ScienceDirect) and selected English-language articles published between 2010 and 2024, focusing on clinical NLP research in Norwegian, Swedish, and Danish. Result: Swedish dominates the research (64%), with fewer studies on Norwegian and Danish; in tasks such as de-identification, research activity on Norwegian and Danish is markedly lower; resource sharing and the use of transfer learning are at a low level. Conclusion: The review comprehensively assesses the state of clinical NLP for the mainland Scandinavian languages and identifies potential barriers and challenges that hinder rapid progress in the field. Abstract: Background: Clinical natural language processing (NLP) refers to the use of computational methods for extracting, processing, and analyzing unstructured clinical text data, and holds a huge potential to transform healthcare in various clinical tasks. Objective: The study aims to perform a systematic review to comprehensively assess and analyze the state-of-the-art NLP methods for the mainland Scandinavian clinical text. Method: A literature search was conducted in various online databases including PubMed, ScienceDirect, Google Scholar, ACM digital library, and IEEE Xplore between December 2022 and February 2024. Further, relevant references to the included articles were also used to solidify our search. The final pool includes articles that conducted clinical NLP in the mainland Scandinavian languages and were published in English between 2010 and 2024. Results: Out of the 113 articles, 18% (n=21) focus on Norwegian clinical text, 64% (n=72) on Swedish, 10% (n=11) on Danish, and 8% (n=9) focus on more than one language. Generally, the review identified positive developments across the region despite some observable gaps and disparities between the languages. There are substantial disparities in the level of adoption of transformer-based models. In essential tasks such as de-identification, there is significantly less research activity focusing on Norwegian and Danish compared to Swedish text. Further, the review identified a low level of sharing resources such as data, experimentation code, pre-trained models, and rate of adaptation and transfer learning in the region. Conclusion: The review presented a comprehensive assessment of the state-of-the-art Clinical NLP for electronic health records (EHR) text in mainland Scandinavian languages and highlighted the potential barriers and challenges that hinder the rapid advancement of the field in the region.

Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning

Xiang Fang,Shihua Zhang,Hao Zhang,Tao Lu,Huabing Zhou,Jiayi Ma

Task: Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing their underlying different information.

Motivation: Previous methods either treat the information equally or require explicit storage of the entire context, which is laborious in real-world scenarios.

Details

Method: Proposed CorrMamba, leveraging Mamba's selectivity to mine information from true correspondences while mitigating interference from false ones, with a causal sequential learning approach and local-context enhancement module. Result: CorrMamba achieves state-of-the-art performance, surpassing previous SOTA by 2.58 absolute percentage points in AUC@20° for outdoor relative pose estimation. Conclusion: CorrMamba demonstrates practical superiority in correspondence filtering, with code to be publicly available. Abstract: Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing their underlying different information. Previous methods either treat the information equally or require the explicit storage of the entire context, tending to be laborious in real-world scenarios. Inspired by Mamba's inherent selectivity, we propose CorrMamba, a Correspondence filter leveraging Mamba's ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. To prevent Mamba from being potentially impacted by unordered keypoints that obscure its ability to mine spatial information, we customize a causal sequential learning approach based on the Gumbel-Softmax technique to establish causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module is designed to capture critical contextual cues essential for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation, visual localization, and analysis demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by 2.58 absolute percentage points in AUC@20°, highlighting its practical superiority. Our code will be publicly available.
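
For readers unfamiliar with the Gumbel-Softmax technique mentioned in the abstract, the following is a minimal sketch of the generic relaxation only: it draws a soft, near-one-hot sample from a categorical distribution defined by logits. It is not the paper's sequencing module, and the function name, the `temperature` default, and the clamping constant are assumptions made for this illustration.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed (soft) categorical sample for the given logits."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))  # Gumbel(0, 1) sample
        noisy.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: lower temperatures push the output closer to a one-hot vector.
sample = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=random.Random(0))
```

In a differentiable framework the same computation is applied to tensors, so gradients can flow through the soft sample to the logits.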

Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models

Nariman Naderi,Seyed Amir Ahmad Safavi-Naini,Thomas Savage,Zahra Atf,Peter Lewis,Girish Nadkarni,Ali Soroush

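Task: Evaluate the self-reported certainty of several large language models when answering gastroenterology questions.

Motivation: Investigate uncertainty estimation in large language models in the medical domain to ensure their safe use.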

Details

Method: Tested multiple models (GPT, Claude, Llama, and others) on 300 gastroenterology questions and analyzed their Brier scores and AUROC values. Result: The best-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 and an AUROC of 0.6, but all models showed a tendency towards overconfidence. Conclusion: Uncertainty estimation is a significant challenge for the safe use of large language models in healthcare. Abstract: This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 and AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare. Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification
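
The two metrics reported above can be sketched as follows, assuming each answer comes with a self-reported confidence in [0, 1] and a binary outcome (1 if the answer was correct, 0 otherwise). The function names and the toy data are illustrative assumptions; the study's own evaluation code is not described in the abstract.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome (lower is better)."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def auroc(confidences, outcomes):
    """Probability that a correct answer received higher confidence than an incorrect one."""
    correct = [c for c, o in zip(confidences, outcomes) if o == 1]
    incorrect = [c for c, o in zip(confidences, outcomes) if o == 0]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in correct:
        for n in incorrect:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(correct) * len(incorrect))


# Toy data (made up for illustration): four answers, two of them correct.
toy_confidences = [0.9, 0.8, 0.7, 0.6]
toy_outcomes = [1, 0, 1, 0]
```

An AUROC near 0.5 means confidence barely separates correct from incorrect answers, which is the sense in which a value of 0.6 indicates weak discrimination.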

FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation

Dong Zhao,Jinlong Li,Shuang Wang,Mengyao Wu,Qi Zang,Nicu Sebe,Zhun Zhong

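Task: Propose FisherTune, a fine-tuning method that improves the performance of Vision Foundation Models (VFMs) on Domain Generalized Semantic Segmentation (DGSS).

Motivation: Existing methods for fine-tuning vision foundation models may not fully exploit their potential, and domain-sensitive parameters can hinder generalization.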

Details

Method: Proposes FisherTune, based on the Domain-Related Fisher Information Matrix (DR-FIM), which stabilizes DR-FIM estimation through variational inference and selectively updates parameters to preserve generalization. Result: Experiments show that FisherTune performs strongly on cross-domain segmentation, outperforming selective-parameter and adapter-based methods. Conclusion: FisherTune is an effective fine-tuning method that improves domain generalized semantic segmentation while preserving generalization. Abstract: Vision Foundation Models (VFMs) excel in generalization due to large-scale pretraining, but fine-tuning them for Domain Generalized Semantic Segmentation (DGSS) while maintaining this ability remains challenging. Existing approaches either selectively fine-tune parameters or freeze the VFMs and update only the adapters, both of which may underutilize the VFMs' full potential in DGSS tasks. We observe that domain-sensitive parameters in VFMs, arising from task and distribution differences, can hinder generalization. To address this, we propose FisherTune, a robust fine-tuning method guided by the Domain-Related Fisher Information Matrix (DR-FIM). DR-FIM measures parameter sensitivity across tasks and domains, enabling selective updates that preserve generalization and enhance DGSS adaptability. FisherTune incorporates variational inference to stabilize DR-FIM estimation, treating parameters as Gaussian-distributed variables and leveraging pre-trained priors. Extensive experiments show that FisherTune achieves superior cross-domain segmentation while maintaining generalization, outperforming selective-parameter and adapter-based methods.

ClinText-SP and RigoBERTa Clinical: a new set of open resources for Spanish Clinical NLP

Guillem García Subies,Álvaro Barbero Jiménez,Paloma Martínez Fernández

Task: Introduce ClinText-SP, the largest publicly available Spanish clinical corpus, and RigoBERTa Clinical, a state-of-the-art clinical encoder language model.

Motivation: To provide a rich and diverse dataset and a high-performing model for Spanish clinical NLP, addressing the lack of accessible resources in this domain.

Details

Method: Curate ClinText-SP from diverse open sources and develop RigoBERTa Clinical through domain-adaptive pretraining on this dataset. Result: RigoBERTa Clinical outperforms existing models on multiple clinical NLP benchmarks. Conclusion: The release of ClinText-SP and RigoBERTa Clinical aims to empower the research community and advance clinical NLP for healthcare applications. Abstract: We present a novel contribution to Spanish clinical natural language processing by introducing the largest publicly available clinical corpus, ClinText-SP, along with a state-of-the-art clinical encoder language model, RigoBERTa Clinical. Our corpus was meticulously curated from diverse open sources, including clinical cases from medical journals and annotated corpora from shared tasks, providing a rich and diverse dataset that was previously difficult to access. RigoBERTa Clinical, developed through domain-adaptive pretraining on this comprehensive dataset, significantly outperforms existing models on multiple clinical NLP benchmarks. By publicly releasing both the dataset and the model, we aim to empower the research community with robust resources that can drive further advancements in clinical NLP and ultimately contribute to improved healthcare applications.

Real-World Remote Sensing Image Dehazing: Benchmark and Baseline

Zeng-Hui Zhu,Wei Lu,Si-Bao Chen,Chris H. Q. Ding,Jin Tang,Bin Luo

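Task: Propose MCAF-Net, a new framework for real-world Remote Sensing Image Dehazing (RSID), and introduce RRSHID, the first large-scale real-world remote sensing hazy image dataset.

Motivation: Existing methods rely mainly on synthetic datasets and are hard to apply effectively in real scenes because of the domain gap between synthetic and real data.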

Details

Method: The MCAF-Net framework has three innovative components: the Multi-branch Feature Integration Block Aggregator (MFIBA), the Color-Calibrated Self-Supervised Attention Module (CSAM), and the Multi-Scale Feature Adaptive Fusion Module (MFAFM). Result: MCAF-Net achieves state-of-the-art performance on real-world RSID while remaining competitive on synthetic datasets. Conclusion: RRSHID and MCAF-Net set new benchmarks for real-world RSID research and advance practical solutions for this complex task. Abstract: Remote Sensing Image Dehazing (RSID) poses significant challenges in real-world scenarios due to the complex atmospheric conditions and severe color distortions that degrade image quality. The scarcity of real-world remote sensing hazy image pairs has compelled existing methods to rely primarily on synthetic datasets. However, these methods struggle with real-world applications due to the inherent domain gap between synthetic and real data. To address this, we introduce Real-World Remote Sensing Hazy Image Dataset (RRSHID), the first large-scale dataset featuring real-world hazy and dehazed image pairs across diverse atmospheric conditions. Based on this, we propose MCAF-Net, a novel framework tailored for real-world RSID. Its effectiveness arises from three innovative components: Multi-branch Feature Integration Block Aggregator (MFIBA), which enables robust feature extraction through cascaded integration blocks and parallel multi-branch processing; Color-Calibrated Self-Supervised Attention Module (CSAM), which mitigates complex color distortions via self-supervised learning and attention-guided refinement; and Multi-Scale Feature Adaptive Fusion Module (MFAFM), which integrates features effectively while preserving local details and global context. Extensive experiments validate that MCAF-Net demonstrates state-of-the-art performance in real-world RSID, while maintaining competitive performance on synthetic datasets. The introduction of RRSHID and MCAF-Net sets new benchmarks for real-world RSID research, advancing practical solutions for this complex task. The code and dataset are publicly available at https://github.com/lwCVer/RRSHID.

LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL

Yihan Wang,Peiyu Liu,Xin Yang

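Task: Address schema linking in multi-database settings to improve performance on Text-to-SQL tasks.

Motivation: Schema linking is a key bottleneck to reaching human-level performance in Text-to-SQL, especially in multi-database settings.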

Details

Method: Proposes the LinkAlign framework, which tackles database retrieval and schema item grounding through multi-round semantic enhanced retrieval, irrelevant information isolation, and schema extraction enhancement. Result: Performs strongly on the SPIDER and BIRD benchmarks, outperforming existing baselines. Conclusion: LinkAlign bridges the gap between current research and real-world scenarios, providing a practical and scalable solution for schema linking. Abstract: Schema linking is a critical bottleneck in achieving human-level performance in Text-to-SQL tasks, particularly in real-world large-scale multi-database scenarios. Addressing schema linking faces two major challenges: (1) Database Retrieval: selecting the correct database from a large schema pool in multi-database settings, while filtering out irrelevant ones. (2) Schema Item Grounding: accurately identifying the relevant tables and columns from within a large and redundant schema for SQL generation. To address this, we introduce LinkAlign, a novel framework that can effectively adapt existing baselines to real-world environments by systematically addressing schema linking. Our framework comprises three key steps: multi-round semantic enhanced retrieval and irrelevant information isolation for Challenge 1, and schema extraction enhancement for Challenge 2. We evaluate our method's schema linking performance on the SPIDER and BIRD benchmarks, and its ability to adapt existing Text-to-SQL models to real-world environments on the SPIDER 2.0-lite benchmark. Experiments show that LinkAlign outperforms existing baselines in multi-database settings, demonstrating its effectiveness and robustness. In addition, our method ranks highest among models excluding those using long chain-of-thought reasoning LLMs. This work bridges the gap between current research and real-world scenarios, providing a practical solution for robust and scalable schema linking. The codes are available at https://github.com/Satissss/LinkAlign.

PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos

Hanxiao Jiang,Hao-Yu Hsu,Kaifeng Zhang,Hsin-Ni Yu,Shenlong Wang,Yunzhu Li

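Task: Propose PhysTwin, a framework that creates physically and visually realistic, real-time interactive digital twins from sparse videos.

Motivation: Digital twins have great potential in robotics, content creation, and XR, but existing methods struggle to achieve high-fidelity reconstruction and real-time interaction.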

Details

Method: Combines a physics-informed representation (spring-mass models, generative shape models, and Gaussian splats for rendering) with a multi-stage, optimization-based inverse modeling framework. Result: PhysTwin outperforms competing methods in reconstruction, rendering, future prediction, and simulation under novel interactions. Conclusion: PhysTwin provides an effective solution for interactive real-time simulation and robotic motion planning. Abstract: Creating a physical digital twin of a real-world object has immense potential in robotics, content creation, and XR. In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects under interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering; and (2) a novel multi-stage, optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. Our method integrates an inverse physics framework with visual perception cues, enabling high-fidelity reconstruction even from partial, occluded, and limited viewpoints. PhysTwin supports modeling various deformable objects, including ropes, stuffed animals, cloth, and delivery packages. Experiments show that PhysTwin outperforms competing methods in reconstruction, rendering, future prediction, and simulation under novel interactions. We further demonstrate its applications in interactive real-time simulation and model-based robotic motion planning.
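
As background for the spring-mass component, here is a minimal one-dimensional sketch of a single spring joining two equal point masses, advanced with semi-implicit Euler integration. The integrator, the damping term, and all parameter names are assumptions chosen for illustration; the abstract does not specify how PhysTwin's simulator is implemented.

```python
def spring_step(x1, x2, v1, v2, rest_length, stiffness, mass, dt, damping=0.0):
    """Advance two point masses joined by one spring by a single time step."""
    stretch = (x2 - x1) - rest_length
    # Hooke's law plus damping on the relative velocity.
    # A positive force pulls mass 1 toward mass 2; mass 2 feels the opposite force.
    force = stiffness * stretch + damping * (v2 - v1)
    # Semi-implicit Euler: update velocities first, then positions.
    v1 = v1 + (force / mass) * dt
    v2 = v2 - (force / mass) * dt
    x1 = x1 + v1 * dt
    x2 = x2 + v2 * dt
    return x1, x2, v1, v2


# Usage: a spring stretched to twice its rest length pulls the masses together.
state = spring_step(0.0, 2.0, 0.0, 0.0, rest_length=1.0, stiffness=10.0, mass=1.0, dt=0.01)
```

A deformable object is modeled as many such springs connecting neighboring points, and inverse modeling then amounts to fitting parameters such as `stiffness` so that simulated motion matches the observed video.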

LANGALIGN: Enhancing Non-English Language Models via Cross-Lingual Embedding Alignment

Jong Myoung Kim,Young-Jun Lee,Ho-Jin Choi,Sangkeun Jung

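Task: Propose LANGALIGN, a method that improves non-English language models by aligning English embedding vectors with those of the target language.

Motivation: Because of practical constraints, many developers still rely on embedding-based models, and English datasets are often used as seed data for non-English models, so data quality directly affects performance.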

Details

Method: Aligns English embedding vectors with those of the target language at the interface between the language model and the task header. Result: Experiments on Korean, Japanese, and Chinese show that LANGALIGN significantly improves performance, and that it can be applied in reverse to convert target language data into a format an English-based model can process. Conclusion: LANGALIGN is an effective method for improving non-English language models and has potential for use in both directions. Abstract: While Large Language Models have gained attention, many service developers still rely on embedding-based models due to practical constraints. In such cases, the quality of fine-tuning data directly impacts performance, and English datasets are often used as seed data for training non-English models. In this study, we propose LANGALIGN, which enhances target language processing by aligning English embedding vectors with those of the target language at the interface between the language model and the task header. Experiments on Korean, Japanese, and Chinese demonstrate that LANGALIGN significantly improves performance across all three languages. Additionally, we show that LANGALIGN can be applied in reverse to convert target language data into a format that an English-based model can process.

Shot Sequence Ordering for Video Editing: Benchmarks, Metrics, and Cinematology-Inspired Computing Methods

Yuzhi Li,Haojun Xu,Feng Tian

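Task: Address the Shot Sequence Ordering (SSO) task in AI-assisted video editing to improve video storytelling and the viewing experience.

Motivation: High-quality video creation depends on professional editing skills and an understanding of visual language, and the lack of public benchmark datasets has held back research on SSO.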

Details

Method: Introduces two new benchmark datasets (AVE-Order and ActivityNet-Order), adopts the Kendall Tau distance as an evaluation metric, and proposes the Kendall Tau Distance-Cross Entropy Loss and the concept of Cinematology Embedding. Result: Experiments show that the proposed loss function and method substantially improve SSO accuracy. Conclusion: The public datasets and new methods advance SSO research and provide practical tools for AI-assisted video editing. Abstract: With the rising popularity of short video platforms, the demand for video production has increased substantially. However, high-quality video creation continues to rely heavily on professional editing skills and a nuanced understanding of visual language. To address this challenge, the Shot Sequence Ordering (SSO) task in AI-assisted video editing has emerged as a pivotal approach for enhancing video storytelling and the overall viewing experience. Nevertheless, the progress in this field has been impeded by a lack of publicly available benchmark datasets. In response, this paper introduces two novel benchmark datasets, AVE-Order and ActivityNet-Order. Additionally, we employ the Kendall Tau distance as an evaluation metric for the SSO task and propose the Kendall Tau Distance-Cross Entropy Loss. We further introduce the concept of Cinematology Embedding, which incorporates movie metadata and shot labels as prior knowledge into the SSO model, and construct the AVE-Meta dataset to validate the method's effectiveness. Experimental results indicate that the proposed loss function and method substantially enhance SSO task accuracy. All datasets are publicly accessible at https://github.com/litchiar/ShotSeqBench.
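
The Kendall Tau distance used as the SSO metric counts the pairs of shots whose relative order differs between a predicted sequence and the reference sequence. The sketch below is a generic implementation under the assumption that shots are identified by unique, comparable IDs; the function name and the `normalize` flag are hypothetical, and the paper's combined loss is not reproduced here.

```python
from itertools import combinations


def kendall_tau_distance(predicted, reference, normalize=True):
    """Count shot pairs ordered differently in `predicted` than in `reference`."""
    if len(set(reference)) != len(reference) or sorted(predicted) != sorted(reference):
        raise ValueError("both orderings must contain the same unique shot IDs")
    position = {shot: index for index, shot in enumerate(predicted)}
    # A pair is discordant when the reference puts a before b but the prediction does not.
    discordant = sum(
        1 for a, b in combinations(reference, 2) if position[a] > position[b]
    )
    if not normalize:
        return discordant
    n = len(reference)
    total_pairs = n * (n - 1) // 2
    return discordant / total_pairs if total_pairs else 0.0
```

With normalization, 0.0 means the predicted order matches the reference exactly and 1.0 means it is fully reversed.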

ZeroLM: Data-Free Transformer Architecture Search for Language Models

Zhen-Song Chen,Hong-Wei Ding,Xian-Jia Wang,Witold Pedrycz

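Task: Propose a new zero-cost proxy method that quantifies the model capacity of Transformer architectures and optimizes the contributions of their sub-modules.

Motivation: Existing zero-cost proxies perform poorly at architecture ranking, especially for Transformer models, and current automated proxy discovery approaches suffer from long search times, susceptibility to overfitting, and structural complexity.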

Details

Method: Quantifies model capacity through efficient weight statistics computation and decomposes Transformer architectures into functionally distinct sub-modules to optimize their contributions. Result: On the FlexiBERT benchmark, achieves a Spearman's rho of 0.76 and a Kendall's tau of 0.53, with exceptional computational efficiency and robust performance. Conclusion: The method offers a practical solution for large-scale architecture search while remaining efficient and robust. Abstract: Neural architecture search (NAS) provides a systematic framework for automating the design of neural network architectures, yet its widespread adoption is hindered by prohibitive computational requirements. Existing zero-cost proxy methods, while reducing search overhead, demonstrate inadequate performance in architecture ranking tasks, particularly for Transformer-based models where they often underperform simple parameter counting metrics. Current automated proxy discovery approaches suffer from extended search times, susceptibility to data overfitting, and structural complexity. This paper introduces a novel zero-cost proxy methodology that quantifies model capacity through efficient weight statistics computation while decomposing Transformer architectures into functionally distinct sub-modules, thereby optimizing the balance of their contributions to overall performance. Our comprehensive evaluation demonstrates the superiority of this approach, achieving a Spearman's rho of 0.76 and Kendall's tau of 0.53 on the FlexiBERT benchmark. The proposed method exhibits exceptional computational efficiency while maintaining robust performance across diverse NAS benchmark tasks, offering a practical solution for large-scale architecture search.
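
Zero-cost proxies are judged by how well their scores rank architectures relative to trained accuracy, which is what Spearman's rho measures. The sketch below computes rho as the Pearson correlation of average ranks; the function names and the example values in the usage comment are assumptions for illustration and are unrelated to the paper's data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(proxy_scores, accuracies):
    """Spearman rank correlation between proxy scores and measured accuracies."""
    if len(proxy_scores) != len(accuracies) or len(proxy_scores) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    ra, rb = average_ranks(proxy_scores), average_ranks(accuracies)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(ra, rb))
    var_a = sum((a - mean_a) ** 2 for a in ra)
    var_b = sum((b - mean_b) ** 2 for b in rb)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / math.sqrt(var_a * var_b)


# Usage with made-up numbers: spearman_rho([0.1, 0.4, 0.35, 0.8], [60, 70, 65, 90])
```

A rho of 1.0 means the proxy orders every architecture exactly as the trained accuracies do, so a proxy can be useful for search even when its raw values are on an arbitrary scale.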

PIM: Physics-Informed Multi-task Pre-training for Improving Inertial Sensor-Based Human Activity Recognition

Dominique Nshimyimana,Vitor Fortes Rey,Sungho Suh,Bo Zhou,Paul Lukowicz

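Task: Propose a physics-informed multi-task pre-training (PIM) framework for IMU-based human activity recognition (HAR).

Motivation: Address the fact that conventional self-supervised learning methods for HAR ignore physical mechanisms and constraints, by generating pretext tasks from physical knowledge.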

Details

Method: Uses physics-based equations to compute features such as movement speed, angles of movement, and symmetry between sensor placements, and uses them as pretext tasks for self-supervised learning. Result: On four HAR benchmark datasets, PIM outperforms existing methods in accuracy and F1 score, with especially large gains when few labels are available. Conclusion: With pretext tasks driven by physical knowledge, PIM significantly improves HAR model performance, particularly when data is scarce. Abstract: Human activity recognition (HAR) with deep learning models relies on large amounts of labeled data, often challenging to obtain due to associated cost, time, and labor. Self-supervised learning (SSL) has emerged as an effective approach to leverage unlabeled data through pretext tasks, such as masked reconstruction and multitask learning with signal processing-based data augmentations, to pre-train encoder models. However, such methods are often derived from computer vision approaches that disregard physical mechanisms and constraints that govern wearable sensor data and the phenomena they reflect. In this paper, we propose a physics-informed multi-task pre-training (PIM) framework for IMU-based HAR. PIM generates pretext tasks based on the understanding of basic physical aspects of human motion, including movement speed, angles of movement, and symmetry between sensor placements. Given a sensor signal, we calculate corresponding features using physics-based equations and use them as pretext tasks for SSL. This enables the model to capture fundamental physical characteristics of human activities, which is especially relevant for multi-sensor systems. Experimental evaluations on four HAR benchmark datasets demonstrate that the proposed method outperforms existing state-of-the-art methods, including data augmentation and masked reconstruction, in terms of accuracy and F1 score. We have observed gains of almost 10% in macro F1 score and accuracy with only 2 to 8 labeled examples per class and up to 3% when there is no reduction in the amount of training data.

Yazhou Zhang,Chunwang Zou,Bo Wang,Jing Qin

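Task: Propose an innovative multi-modal Commander-GPT framework for sarcasm detection.

Motivation: Traditional single-modal sarcasm detection performs poorly, and multi-modal approaches, although receiving growing attention, still face challenges; how to use multi-modal information effectively needs further exploration.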

Details

Method: Decomposes sarcasm detection into six sub-tasks, has a central commander assign the best-suited large language model to each sub-task, and aggregates the results. Result: On the MMSD and MMSD 2.0 datasets, using four multi-modal large language models and six prompting strategies, the F1 score improves by 19.3%. Conclusion: The proposed framework achieves state-of-the-art performance without fine-tuning or ground-truth rationales. Abstract: Sarcasm detection, as a crucial research direction in the field of Natural Language Processing (NLP), has attracted widespread attention. Traditional sarcasm detection tasks have typically focused on single-modal approaches (e.g., text), but due to the implicit and subtle nature of sarcasm, such methods often fail to yield satisfactory results. In recent years, researchers have shifted the focus of sarcasm detection to multi-modal approaches. However, effectively leveraging multi-modal information to accurately identify sarcastic content remains a challenge that warrants further exploration. Leveraging the powerful integrated processing capabilities of Multi-Modal Large Language Models (MLLMs) for various information sources, we propose an innovative multi-modal Commander-GPT framework. Inspired by military strategy, we first decompose the sarcasm detection task into six distinct sub-tasks. A central commander (decision-maker) then assigns the best-suited large language model to address each specific sub-task. Ultimately, the detection results from each model are aggregated to identify sarcasm. We conducted extensive experiments on MMSD and MMSD 2.0, utilizing four multi-modal large language models and six prompting strategies. Our experiments demonstrate that our approach achieves state-of-the-art performance, with a 19.3% improvement in F1 score, without necessitating fine-tuning or ground-truth rationales.

Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images

Yara AlaaEldin,Francesca Odone

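Task: Use monocular cameras to predict depth and semantic maps for UAVs in low-altitude unstructured environments.

Motivation: Understanding the geometric and semantic properties of a scene is crucial for autonomous navigation, particularly for UAVs.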

Details

Method: Proposes a joint deep-learning architecture that performs depth prediction and semantic segmentation together, accurately and rapidly. Result: Validated on the MidAir and Aeroscapes benchmark datasets, the architecture is competitive with or superior to existing single-task and joint methods while running fast (20.2 FPS) with a low memory footprint. Conclusion: The joint architecture performs well for UAV navigation and is both efficient and practical. Abstract: Understanding the geometric and semantic properties of the scene is crucial in autonomous navigation and particularly challenging in the case of Unmanned Aerial Vehicle (UAV) navigation. Such information may be obtained by estimating depth and semantic segmentation maps of the surrounding environment, and for their practical use in autonomous navigation, the procedure must be performed as close to real-time as possible. In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture that can perform the two tasks accurately and rapidly, and validate its effectiveness on MidAir and Aeroscapes benchmark datasets. Our joint architecture proves to be competitive with or superior to the other single and joint architecture methods while running fast, at 20.2 FPS on a single NVIDIA Quadro P5000 GPU, with a low memory footprint. All codes for training and prediction can be found at https://github.com/Malga-Vision/Co-SemDepth

Unsupervised Acquisition of Discrete Grammatical Categories

David Ph. Shakouri,Crit Cremers,Niels O. Schiller

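Task: Simulate language acquisition with a multi-agent system to investigate how abstract grammatical knowledge can be acquired from language exemplars.

Motivation: Study how abstract grammatical rules can be learned from language exemplars alone, without direct access to the internal knowledge of the mother language model.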

Details

Method: Uses a multi-agent system (an adult language model and a daughter language model) and applies hierarchical cluster analysis to the utterances generated by the mother language model to extract discrete grammatical rules. Result: Experiments show that the system extracts non-trivial grammatical knowledge from the input data, and a test set validates the parameter configuration. Conclusion: The computational laboratory environment can be used to simulate language acquisition and to acquire grammatical structures resembling those of natural languages. Abstract: This article presents experiments performed using a computational laboratory environment for language acquisition experiments. It implements a multi-agent system consisting of two agents: an adult language model and a daughter language model that aims to learn the mother language. Crucially, the daughter agent does not have access to the internal knowledge of the mother language model but only to the language exemplars the mother agent generates. These experiments illustrate how this system can be used to acquire abstract grammatical knowledge. We demonstrate how statistical analyses of patterns in the input data corresponding to grammatical categories yield discrete grammatical rules. These rules are subsequently added to the grammatical knowledge of the daughter language model. To this end, hierarchical agglomerative cluster analysis was applied to the utterances consecutively generated by the mother language model. It is argued that this procedure can be used to acquire structures resembling grammatical categories proposed by linguists for natural languages. Thus, it is established that non-trivial grammatical knowledge has been acquired. Moreover, the parameter configuration of this computational laboratory environment determined using training data generated by the mother language model is validated in a second experiment with a test set similarly resulting in the acquisition of non-trivial categories.

Histomorphology-driven multi-instance learning for breast cancer WSI classification

Baizhi Wang,Rui Yan,Wenxin Ma,Xu Zhang,Yuhao Wang,Xiaolong Li,Yunjie Gu,Zihang Jiang,S. Kevin Zhou

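Task: Propose a new framework that explicitly incorporates histomorphology (tumor cellularity, cellular morphology, and tissue architecture) into whole slide image (WSI) classification.

Motivation: Existing WSI classification methods struggle to make effective use of histomorphology information, which limits their ability to capture key and fine-grained pathological features.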

Details

Method: The framework has three key components: (1) estimating the importance of tumor-related histomorphology information at the patch level based on medical prior knowledge; (2) generating representative cluster-level features through histomorphology-driven cluster pooling; and (3) WSI-level classification through histomorphology-driven multi-instance aggregation. Result: Experiments show that the framework significantly improves WSI classification, achieving high accuracy in molecular subtyping and cancer subtyping. Conclusion: By incorporating histomorphology information, the framework strengthens the model's ability to capture key and fine-grained pathological patterns and thereby improves classification. Abstract: Histomorphology is crucial in breast cancer diagnosis. However, existing whole slide image (WSI) classification methods struggle to effectively incorporate histomorphology information, limiting their ability to capture key and fine-grained pathological features. To address this limitation, we propose a novel framework that explicitly incorporates histomorphology (tumor cellularity, cellular morphology, and tissue architecture) into WSI classification. Specifically, our approach consists of three key components: (1) estimating the importance of tumor-related histomorphology information at the patch level based on medical prior knowledge; (2) generating representative cluster-level features through histomorphology-driven cluster pooling; and (3) enabling WSI-level classification through histomorphology-driven multi-instance aggregation. With the incorporation of histomorphological information, our framework strengthens the model's ability to capture key and fine-grained pathological patterns, thereby enhancing WSI classification performance. Experimental results demonstrate its effectiveness, achieving high diagnostic accuracy for molecular subtyping and cancer subtyping. The code will be made available at https://github.com/Badgewho/HMDMIL.

Predicting the Road Ahead: A Knowledge Graph based Foundation Model for Scene Understanding in Autonomous Driving

Hongkuan Zhou,Stefan Schmid,Yicong Li,Lavdim Halilaj,Xiangtong Yao,Wei cao

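Task: Propose FM4SU, a method for training a symbolic foundation model for scene understanding in autonomous driving.

Motivation: Current approaches are limited in their ability to comprehend the spatio-temporal evolution of complex driving scenes.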

Details

Method: Uses knowledge graphs to capture sensory observations and domain knowledge, extracts a bird's eye view symbolic representation, and uses pre-trained language models to learn the co-occurrence among scene elements. Result: Experiments on the nuScenes dataset show that fine-tuned models achieve significantly higher accuracy in all tasks, with the T5 model reaching 86.7% accuracy in next scene prediction. Conclusion: FM4SU offers a promising foundation for developing more comprehensive models for scene understanding in autonomous driving. Abstract: The autonomous driving field has seen remarkable advancements in various topics, such as object recognition, trajectory prediction, and motion planning. However, current approaches face limitations in effectively comprehending the complex evolutions of driving scenes over time. This paper proposes FM4SU, a novel methodology for training a symbolic foundation model (FM) for scene understanding in autonomous driving. It leverages knowledge graphs (KGs) to capture sensory observation along with domain knowledge such as road topology, traffic rules, or complex interactions between traffic participants. A bird's eye view (BEV) symbolic representation is extracted from the KG for each driving scene, including the spatio-temporal information among the objects across the scenes. The BEV representation is serialized into a sequence of tokens and given to pre-trained language models (PLMs) for learning an inherent understanding of the co-occurrence among driving scene elements and generating predictions on the next scenes. We conducted a number of experiments using the nuScenes dataset and KG in various scenarios. The results demonstrate that fine-tuned models achieve significantly higher accuracy in all tasks. The fine-tuned T5 model achieved a next scene prediction accuracy of 86.7%. This paper concludes that FM4SU offers a promising foundation for developing more comprehensive models for scene understanding in autonomous driving.

Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting

Maochen Yang,Zekun Li,Jian Zhang,Lei Qi,Yinghuan Shi

Task: Propose TMTB, a semi-supervised crowd counting framework designed to make effective use of unlabeled data.

Motivation: Address the high annotation cost of dense scenes and improve how effectively and accurately unlabeled data is used.

Details

Method: Combine a data augmentation technique (background inpainting) with a Visual State Space Model (VSSM) backbone, and add an Anti-Noise classification head. Result: Outperforms state-of-the-art methods by a large margin on four benchmark datasets. Conclusion: Through data augmentation and model improvements, the TMTB framework significantly improves semi-supervised crowd counting performance. Abstract: Semi-supervised crowd counting is crucial for addressing the high annotation costs of densely populated scenes. Although several methods based on pseudo-labeling have been proposed, it remains challenging to effectively and accurately utilize unlabeled data. In this paper, we propose a novel framework called Taste More Taste Better (TMTB), which emphasizes both data and model aspects. Firstly, we explore a data augmentation technique well-suited for the crowd counting task. By inpainting the background regions, this technique can effectively enhance data diversity while preserving the fidelity of the entire scenes. Secondly, we introduce the Visual State Space Model as the backbone to capture the global context information from crowd scenes, which is crucial for extremely crowded, low-light, and adverse weather scenarios. In addition to the traditional regression head for exact prediction, we employ an Anti-Noise classification head to provide less exact but more accurate supervision, since the regression head is sensitive to noise in manual annotations. We conduct extensive experiments on four benchmark datasets and show that our method outperforms state-of-the-art methods by a large margin. Code is publicly available at https://github.com/syhien/taste_more_taste_better.

Construction Identification and Disambiguation Using BERT: A Case Study of NPN

Wesley Scivetti,Nathan Schneider

Task: Investigate how well BERT represents the form and meaning of the English NPN (noun-preposition-noun) construction.

Motivation: Test the Construction Grammar hypothesis that linguistic knowledge consists chiefly of form-meaning pairs (constructions), and explore whether BERT captures such constructions, in particular the rare and polysemous NPN construction.

Details

Method: Build a benchmark dataset of semantically annotated NPN instances and distractors, train and evaluate probing classifiers, and analyze the sensitivity of BERT embeddings to meaning and form. Result: The probing classifiers discriminate the NPN construction from distractors reasonably well and disambiguate its senses, indicating that BERT embeddings carry information about the construction's semantics; deliberately permuting the word order causes instances to be rejected, showing sensitivity to form. Conclusion: BERT latently encodes knowledge of the NPN construction that goes beyond surface syntactic patterns and lexical cues. Abstract: Construction Grammar hypothesizes that knowledge of a language consists chiefly of knowledge of form-meaning pairs ("constructions") that include vocabulary, general grammar rules, and even idiosyncratic patterns. Recent work has shown that transformer language models represent at least some constructional patterns, including ones where the construction is rare overall. In this work, we probe BERT's representation of the form and meaning of a minor construction of English, the NPN (noun-preposition-noun) construction -- exhibited in such expressions as face to face and day to day -- which is known to be polysemous. We construct a benchmark dataset of semantically annotated corpus instances (including distractors that superficially resemble the construction). With this dataset, we train and evaluate probing classifiers. They achieve decent discrimination of the construction from distractors, as well as sense disambiguation among true instances of the construction, revealing that BERT embeddings carry indications of the construction's semantics. Moreover, artificially permuting the word order of true construction instances causes them to be rejected, indicating sensitivity to matters of form. We conclude that BERT does latently encode at least some knowledge of the NPN construction going beyond a surface syntactic pattern and lexical cues.

Geometric Constrained Non-Line-of-Sight Imaging

Xueying Liu,Lianfang Wang,Jun Liu,Yong Wang,Yuping Duan

Task: Propose a joint albedo-surface reconstruction method for jointly estimating normals and albedo in non-line-of-sight (NLOS) imaging.

Motivation: Normal reconstruction is crucial in NLOS imaging, but jointly estimating normals and albedo substantially increases the complexity and computational difficulty of the problem.

Details

Method: Use the Frobenius norm of the shape operator to control the variation rate of the normal field; this is the first application of regularization methods to reconstructing the surface normals of hidden objects. Result: The method is robust and effective on both synthetic and experimental datasets, with higher reconstruction accuracy and a 30x speedup over the existing surface reconstruction approach. Conclusion: By improving the accuracy of the normal field, the method enhances detail representation and achieves high-precision reconstruction of hidden object geometry. Abstract: Normal reconstruction is crucial in non-line-of-sight (NLOS) imaging, as it provides key geometric and lighting information about hidden objects, which significantly improves reconstruction accuracy and scene understanding. However, jointly estimating normals and albedo expands the problem from matrix-valued functions to tensor-valued functions, which substantially increases complexity and computational difficulty. In this paper, we propose a novel joint albedo-surface reconstruction method, which utilizes the Frobenius norm of the shape operator to control the variation rate of the normal field. It is the first attempt to apply regularization methods to the reconstruction of surface normals for hidden objects. By improving the accuracy of the normal field, it enhances detail representation and achieves high-precision reconstruction of hidden object geometry. The proposed method demonstrates robustness and effectiveness on both synthetic and experimental datasets. On transient data captured within 15 seconds, our surface normal-regularized reconstruction model produces more accurate surfaces than recently proposed methods and is 30 times faster than the existing surface reconstruction approach.

Synthetic Function Demonstrations Improve Generation in Low-Resource Programming Languages

Nick McKenna,Xinnuo Xu,Jack Williams,Nick Wilson,Benjamin Van Durme,Christian Poelitz

Task: Propose a new method for generating high-quality training data for low-resource programming languages.

Motivation: Low-resource programming languages (such as Excel formulas) lack sufficient training data, which limits model performance.

Details

Method: Use a teacher model to generate fully synthetic, textbook-quality demonstrations, then finetune an underperforming student model on them. Result: On two question-answering datasets recast into the Excel domain, the finetuned model outperforms standard RAG approaches. Conclusion: Synthetic data generation combined with finetuning offers clear advantages for low-resource programming languages. Abstract: A key consideration when training an LLM is whether the target language is more or less resourced, whether this is English compared to Welsh, or Python compared to Excel. Typical training data for programming languages consist of real program demonstrations coupled with human-written comments. Here we present novel approaches to the creation of such data for low-resource programming languages. We generate fully-synthetic, textbook-quality demonstrations of common library functions in an example domain of Excel formulas, using a teacher model. We then finetune an underperforming student model, and show improvement on 2 question-answering datasets recast into the Excel domain. We show advantages of finetuning over standard, off-the-shelf RAG approaches, which can offer only modest improvement due to the unfamiliar target domain.

SymmCompletion: High-Fidelity and High-Consistency Point Cloud Completion with Symmetry Guidance

Hongyu Yan,Zijun Li,Kunming Luo,Li Lu,Ping Tan

Task: Recover a complete point shape from a partial point cloud.

Motivation: Existing methods achieve good global completeness but tend to lose the original geometric details, and they suffer from geometric inconsistency between the existing point cloud and the reconstructed missing parts.

Details

Method: Propose SymmCompletion, consisting of a Local Symmetry Transformation Network (LSTNet) and a Symmetry-Guidance Transformer (SGFormer), which uses symmetry guidance to generate geometry-aligned partial-missing pairs and refine the initial point cloud. Result: Qualitative and quantitative evaluations on several benchmark datasets show that the method outperforms state-of-the-art completion networks. Conclusion: SymmCompletion produces high-fidelity, geometrically consistent final point clouds. Abstract: Point cloud completion aims to recover a complete point shape from a partial point cloud. Although existing methods can form satisfactory point clouds in global completeness, they often lose the original geometry details and face the problem of geometric inconsistency between existing point clouds and reconstructed missing parts. To tackle this problem, we introduce SymmCompletion, a highly effective completion method based on symmetry guidance. Our method comprises two primary components: a Local Symmetry Transformation Network (LSTNet) and a Symmetry-Guidance Transformer (SGFormer). First, LSTNet efficiently estimates point-wise local symmetry transformation to transform key geometries of partial inputs into missing regions, thereby generating geometry-aligned partial-missing pairs and initial point clouds. Second, SGFormer leverages the geometric features of partial-missing pairs as the explicit symmetric guidance that can constrain the refinement process for initial point clouds. As a result, SGFormer can exploit provided priors to form high-fidelity and geometry-consistent final point clouds. Qualitative and quantitative evaluations on several benchmark datasets demonstrate that our method outperforms state-of-the-art completion networks.

AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning

Alan Dao,Dinh Bach Vu,Bui Quang Huy

Task: Propose AlphaSpace, a method that enhances the spatial reasoning of large language models for 3D Cartesian space navigation.

Motivation: Existing models perform poorly on spatial reasoning tasks, so a more effective way to improve this capability is needed.

Details

Method: Adopt a semantics-based tokenization strategy that encodes height information through specialized semantic tokens, combined with symbolic synthetic reasoning data. Result: AlphaSpace significantly outperforms existing models on manipulation subtasks, reaching 66.67% accuracy versus 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet. Conclusion: AlphaSpace is an effective method for enhancing spatial reasoning and significantly improves model performance in 3D space. Abstract: This paper presents AlphaSpace, a novel methodology designed to enhance the spatial reasoning capabilities of large language models (LLMs) for 3D Cartesian space navigation. AlphaSpace employs a semantics-based tokenization strategy, encoding height information through specialized semantic tokens, and integrates primarily symbolic synthetic reasoning data. This approach enables LLMs to accurately manipulate objects by positioning them at specific [x, y, z] coordinates. Experimental results demonstrate that AlphaSpace significantly outperforms existing models on manipulation subtasks, achieving a total accuracy of 66.67%, compared to 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet.

Finsler Multi-Dimensional Scaling: Manifold Learning for Asymmetric Dimensionality Reduction and Embedding

Thomas Dagès,Simon Weber,Ya-Wei Eileen Lin,Ronen Talmon,Daniel Cremers,Michael Lindenbaum,Alfred M. Bruckstein,Ron Kimmel

Task: Extend the multi-dimensional scaling (MDS) problem to Finsler manifolds in order to handle asymmetric data.

Motivation: The metric of a Riemannian manifold is symmetric, so it cannot represent asymmetric data effectively; a more general approach is needed.

Details

Method: Define a canonical Finsler space for embedding asymmetric data while retaining the theoretical convergence guarantees. Result: Finsler embeddings perform well on asymmetric data and suit applications such as data visualisation, dimensionality reduction, directed graph embedding, and link prediction. Conclusion: Finsler manifolds provide an intuitive and theoretically grounded way to handle asymmetric data, extending the scope of MDS. Abstract: Dimensionality reduction is a fundamental task that aims to simplify complex data by reducing its feature dimensionality while preserving essential patterns, with core applications in data analysis and visualisation. To preserve the underlying data structure, multi-dimensional scaling (MDS) methods focus on preserving pairwise dissimilarities, such as distances. They optimise the embedding to have pairwise distances as close as possible to the data dissimilarities. However, the current standard is limited to embedding data in Riemannian manifolds. Motivated by the lack of asymmetry in the Riemannian metric of the embedding space, this paper extends the MDS problem to a natural asymmetric generalisation of Riemannian manifolds called Finsler manifolds. Inspired by Euclidean space, we define a canonical Finsler space for embedding asymmetric data. Due to its simplicity with respect to geodesics, data representation in this space is both intuitive and simple to analyse. We demonstrate that our generalisation benefits from the same theoretical convergence guarantees. We reveal the effectiveness of our Finsler embedding across various types of non-symmetric data, highlighting its value in applications such as data visualisation, dimensionality reduction, directed graph embedding, and link prediction.
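
To make the idea of an asymmetric embedding distance concrete, here is a minimal sketch. It assumes a Randers-type metric (Euclidean length plus a linear drift term along a fixed vector `omega`), which is one simple Finsler metric; the abstract does not specify the paper's canonical space, so the metric, the function names, and the stress definition below are illustrative assumptions rather than the authors' formulation.

```python
import math

def randers_distance(x, y, omega):
    """Asymmetric distance from x to y: Euclidean length plus a drift term.

    Assumed metric for illustration; omega must have norm < 1 so that
    every distance stays positive.
    """
    diff = [b - a for a, b in zip(x, y)]
    euclid = math.sqrt(sum(d * d for d in diff))
    drift = sum(w * d for w, d in zip(omega, diff))
    return euclid + drift

def asymmetric_stress(points, dissim, omega):
    """Sum of squared mismatches over all ordered pairs (i, j), i != j."""
    total = 0.0
    for i, xi in enumerate(points):
        for j, xj in enumerate(points):
            if i != j:
                total += (randers_distance(xi, xj, omega) - dissim[i][j]) ** 2
    return total

points = [(0.0, 0.0), (3.0, 4.0)]
omega = (0.5, 0.0)
d_forward = randers_distance(points[0], points[1], omega)   # going "with" omega
d_backward = randers_distance(points[1], points[0], omega)  # going "against" omega
```

Because the drift term changes sign when the direction is reversed, `d_forward` and `d_backward` differ, which is what allows an embedding to fit a non-symmetric dissimilarity matrix. An MDS-style solver would minimise `asymmetric_stress` over `points`.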

I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Andrey Galichin,Alexey Dontsov,Polina Druzhinina,Anton Razzhigaev,Oleg Y. Rogov,Elena Tutubalina,Ivan Oseledets

Task: Explore the internal reasoning mechanisms of large language models (LLMs), in particular the reasoning features of the DeepSeek-R1 series.

Motivation: Although LLMs perform well on reasoning tasks, their internal reasoning mechanisms are still poorly understood, and the specific features that drive reasoning need to be uncovered.

Details

Method: Use Sparse Autoencoders (SAEs) to decompose the model's latent representations, then extract and validate features directly related to reasoning ability. Result: Features related to reasoning ability are identified and validated, and steering these features systematically improves the model's reasoning performance. Conclusion: The work offers the first mechanistic account of reasoning in LLMs and opens a new route to understanding and improving model reasoning. Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language processing. Recent advances have led to the development of a new class of reasoning LLMs; for example, open-source DeepSeek-R1 has achieved state-of-the-art performance by integrating deep thinking and complex reasoning. Despite these impressive capabilities, the internal reasoning mechanisms of such models remain unexplored. In this work, we employ Sparse Autoencoders (SAEs), a method to learn a sparse decomposition of latent representations of a neural network into interpretable features, to identify features that drive reasoning in the DeepSeek-R1 series of models. First, we propose an approach to extract candidate "reasoning features" from SAE representations. We validate these features through empirical analysis and interpretability methods, demonstrating their direct correlation with the model's reasoning abilities. Crucially, we demonstrate that steering these features systematically enhances reasoning performance, offering the first mechanistic account of reasoning in LLMs. Code available at https://github.com/AIRI-Institute/SAE-Reasoning
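
As a rough illustration of what a sparse autoencoder and feature steering look like mechanically, the sketch below uses a generic ReLU autoencoder on a toy 2-dimensional activation. The weights, the feature count, and the `steer` helper are hypothetical and are not taken from the paper or its repository; real SAEs are trained on model activations with a sparsity penalty.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def sae_encode(x, w_enc, b_enc):
    """Map an activation vector to non-negative feature activations."""
    return relu([a + b for a, b in zip(matvec(w_enc, x), b_enc)])

def sae_decode(features, w_dec, b_dec):
    """Reconstruct the activation as a weighted sum of decoder directions.

    w_dec holds one row per feature (that feature's direction).
    """
    out = list(b_dec)
    for act, direction in zip(features, w_dec):
        for k, d in enumerate(direction):
            out[k] += act * d
    return out

def steer(x, w_dec, feature_idx, strength):
    """Push the activation along one feature's decoder direction."""
    return [a + strength * d for a, d in zip(x, w_dec[feature_idx])]

# Toy weights (hypothetical): 2-d activations, 3 features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
features = sae_encode(x, w_enc, b_enc)          # only feature 0 is active
reconstruction = sae_decode(features, w_dec, b_dec)
steered = steer(x, w_dec, feature_idx=1, strength=3.0)
```

In this toy setup, steering along feature 1 turns that feature on in the steered activation; the paper's claim is that doing the analogous thing with reasoning-related features changes reasoning behaviour.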

Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

Yufei Zhan,Yousong Zhu,Shurong Zheng,Hongyin Zhao,Fan Yang,Ming Tang,Jinqiao Wang

Task: Propose Vision-R1, a new vision-guided reinforcement learning algorithm for improving Large Vision-Language Models (LVLMs).

Motivation: Constructing high-quality human-annotated preference data and developing robust reward models are costly and challenging, so a method is needed that optimizes LVLMs without relying on these resources.

Details

Method: Propose the Vision-R1 algorithm, which uses vision feedback as the reward signal and requires only instruction data, with no specialized reward model or handcrafted preference dataset. Multi-dimensional feedback and dynamically adjusted reward criteria enable continuous model improvement. Result: Applying Vision-R1 to 7B LVLMs yields significant gains of up to 50%, even surpassing a state-of-the-art model 10x larger. Conclusion: Vision-R1 is an efficient reinforcement learning algorithm that does not depend on expensive annotated data and significantly improves LVLM performance. Abstract: Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm: pretraining and supervised fine-tuning. Recently, preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy to enhance capabilities of LVLMs. However, constructing high-quality human-annotated preference data and developing robust reward models to mimic these preferences are both costly and challenging. Motivated by this observation, we propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. It only leverages curated instruction data, eliminating the need for specialized reward models and handcrafted preference datasets. We incorporate a criterion-driven reward function that further integrates multi-dimensional feedback to evaluate model completions comprehensively based on the vision task logic. Furthermore, we introduce a progressive rule refinement strategy that dynamically adjusts the reward criteria during training, enabling continuous model improvement and mitigating reward hacking. Extensive experiments on both in-distribution and out-of-distribution benchmarks demonstrate that fine-tuning the 7B LVLMs with Vision-R1 achieves consistent performance gains, with improvements of up to 50%, surpassing the state-of-the-art 10x size model.

AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration

Zhexuan Wang,Yutong Wang,Xuebo Liu,Liang Ding,Miao Zhang,Jie Liu,Min Zhang

Task: Propose AgentDropout, a method that improves communication efficiency and task performance by optimizing the communication topology of multi-agent systems.

Motivation: Multi-agent systems face low communication efficiency and suboptimal task performance in collaborative problem-solving, so efficient communication topologies need to be designed.

Details

Method: Propose AgentDropout, which optimizes the adjacency matrices of the communication graphs to identify and remove redundant agents and communication links. Result: Compared with existing methods, AgentDropout reduces prompt token consumption by 21.6% and completion token consumption by 18.4% on average, and improves task performance by 1.14. Conclusion: AgentDropout shows strong domain transferability and structure robustness, demonstrating its reliability and effectiveness. Abstract: Multi-agent systems (MAS) based on large language models (LLMs) have demonstrated significant potential in collaborative problem-solving. However, they still face substantial challenges of low communication efficiency and suboptimal task performance, making the careful design of the agents' communication topologies particularly important. Inspired by the management theory that roles in an efficient team are often dynamically adjusted, we propose AgentDropout, which identifies redundant agents and communication across different communication rounds by optimizing the adjacency matrices of the communication graphs and eliminates them to enhance both token efficiency and task performance. Compared to state-of-the-art methods, AgentDropout achieves an average reduction of 21.6% in prompt token consumption and 18.4% in completion token consumption, along with a performance improvement of 1.14 on the tasks. Furthermore, the extended experiments demonstrate that AgentDropout achieves notable domain transferability and structure robustness, revealing its reliability and effectiveness. We release our code at https://github.com/wangzx1219/AgentDropout.
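
The pruning step can be pictured on a single round's weighted adjacency matrix. The sketch below assumes the edge weights have already been learned and uses a hypothetical scoring rule (total incoming plus outgoing weight) and a fixed edge threshold; the abstract does not describe how AgentDropout scores agents or edges, so treat both rules as placeholders.

```python
def drop_agents(adj, num_drop):
    """Zero out all edges of the agents with the lowest total edge weight.

    adj[i][j] is the (assumed, already learned) weight of the message
    edge from agent i to agent j in one communication round.
    """
    n = len(adj)
    score = [sum(adj[i]) + sum(adj[j][i] for j in range(n)) for i in range(n)]
    dropped = set(sorted(range(n), key=lambda i: score[i])[:num_drop])
    pruned = [
        [0.0 if (i in dropped or j in dropped) else adj[i][j] for j in range(n)]
        for i in range(n)
    ]
    return pruned, dropped

def drop_edges(adj, threshold):
    """Remove communication links whose weight falls below the threshold."""
    return [[w if w >= threshold else 0.0 for w in row] for row in adj]

adj = [
    [0.0, 0.9, 0.1],
    [0.8, 0.0, 0.2],
    [0.1, 0.1, 0.0],
]
pruned, dropped = drop_agents(adj, num_drop=1)   # agent 2 contributes least
pruned = drop_edges(pruned, threshold=0.5)
```

Every removed agent or edge is a message that no longer has to be placed in another agent's prompt, which is where the token savings reported in the abstract would come from.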

Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

Xu Zheng,Ziqiao Weng,Yuanhuiyi Lyu,Lutao Jiang,Haiwei Xue,Bin Ren,Danda Paudel,Nicu Sebe,Luc Van Gool,Xuming Hu

Task: Survey the current state of retrieval-augmented generation (RAG) in computer vision (CV), focusing on two main directions: visual understanding and visual generation.

Motivation: Incorporating authoritative external knowledge bases improves the understanding and generation capabilities of vision models and compensates for the limitations of relying solely on internal model knowledge.

Details

Method: Systematically review RAG methods for visual understanding tasks (such as image recognition and medical report generation) and visual generation tasks (such as image, video, and 3D generation), and examine recent advances in embodied AI. Result: The survey summarizes how RAG is currently applied in CV, identifies the limitations of current approaches, and proposes future research directions. Conclusion: RAG has great potential in CV but is still at an early stage, and further research is needed to advance it. Abstract: Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI), particularly in enhancing the capabilities of large language models (LLMs) by enabling access to external, reliable, and up-to-date knowledge sources. In the context of AI-Generated Content (AIGC), RAG has proven invaluable by augmenting model outputs with supplementary, relevant information, thus improving their quality. Recently, the potential of RAG has extended beyond natural language processing, with emerging methods integrating retrieval-augmented strategies into the computer vision (CV) domain. These approaches aim to address the limitations of relying solely on internal model knowledge by incorporating authoritative external knowledge bases, thereby improving both the understanding and generation capabilities of vision models. This survey provides a comprehensive review of the current state of retrieval-augmented techniques in CV, focusing on two main areas: (I) visual understanding and (II) visual generation. In the realm of visual understanding, we systematically review tasks ranging from basic image recognition to complex applications such as medical report generation and multimodal question answering. For visual content generation, we examine the application of RAG in tasks related to image, video, and 3D generation. Furthermore, we explore recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains. Given that the integration of retrieval-augmented techniques in CV is still in its early stages, we also highlight the key limitations of current approaches and propose future research directions to drive the development of this promising area.

xKV: Cross-Layer SVD for KV-Cache Compression

Chi-Chih Chang,Chien-Yu Lin,Yash Akhauri,Wei-Cheng Lin,Kai-Chiang Wu,Luis Ceze,Mohamed S. Abdelfattah

Task: Propose xKV, a post-training method that compresses the KV-Cache with Singular Value Decomposition (SVD) to reduce the memory consumption of large language models (LLMs) with long context windows.

Motivation: Existing KV-Cache merging methods either require expensive pretraining or rely on unrealistic assumptions (such as high cosine similarity across layers); xKV instead exploits the alignment of dominant singular vectors across layers to provide a simple and efficient solution.

Details

Method: xKV applies Singular Value Decomposition (SVD) to the KV-Cache of grouped layers and consolidates it into a shared low-rank subspace, significantly reducing KV-Cache size. Result: On the RULER long-context benchmark, xKV achieves up to 6.8x higher compression rates than the existing inter-layer technique while improving accuracy by 2.7%; on coding tasks, combined with MLA, it achieves a 3x compression rate without performance degradation. Conclusion: xKV is an efficient and versatile method that significantly alleviates the memory bottleneck in long-context LLM inference. Abstract: Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption to store the Key and Value states (KV-Cache). Recent studies attempted to merge KV-cache from multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on assumptions of high per-token cosine similarity across layers which generally does not hold in practice. We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple post-training method that applies Singular Value Decomposition (SVD) on the KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through extensive evaluations on the RULER long-context benchmark with widely-used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates than the state-of-the-art inter-layer technique while improving accuracy by 2.7%. Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA) (e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rate on coding tasks without performance degradation. These results highlight xKV's strong capability and versatility in addressing memory bottlenecks for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.
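
A minimal NumPy sketch of cross-layer SVD compression follows. It assumes the caches of a layer group are concatenated along the feature axis and factored into one shared token basis plus a small per-layer matrix; the concatenation axis, the function names, and the synthetic data are assumptions made for illustration and are not taken from the xKV code base.

```python
import numpy as np

def compress_group(kv_layers, rank):
    """Factor a group of per-layer caches into a shared low-rank basis.

    kv_layers: list of (tokens, dim) arrays, one per layer in the group.
    Returns a shared (tokens, rank) basis and one (rank, dim) matrix per layer.
    """
    stacked = np.concatenate(kv_layers, axis=1)            # (tokens, G * dim)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    basis = u[:, :rank] * s[:rank]                          # shared across layers
    dim = kv_layers[0].shape[1]
    per_layer = [vt[:rank, i * dim:(i + 1) * dim] for i in range(len(kv_layers))]
    return basis, per_layer

def reconstruct(basis, per_layer):
    """Rebuild each layer's cache from the shared basis."""
    return [basis @ m for m in per_layer]

# Synthetic caches that truly share a rank-4 subspace across 3 layers.
rng = np.random.default_rng(0)
tokens, dim, rank = 64, 16, 4
shared = rng.standard_normal((tokens, rank))
layers = [shared @ rng.standard_normal((rank, dim)) for _ in range(3)]

basis, per_layer = compress_group(layers, rank)
approx = reconstruct(basis, per_layer)
max_err = max(float(np.abs(a - b).max()) for a, b in zip(approx, layers))
stored = basis.size + sum(m.size for m in per_layer)
original = sum(layer.size for layer in layers)
```

Here the stored factors hold 448 numbers instead of 3072, and reconstruction is exact up to floating-point error because the synthetic layers share a subspace by construction. On real caches the truncation is lossy, and the achievable rank depends on how well the dominant singular vectors align across layers.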

OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models

Dvir Samuel,Matan Levy,Nir Darshan,Gal Chechik,Rami Ben-Ari

Task: Decompose a video into semantic layers, including the background and individual objects together with their associated effects (such as shadows and reflections).

Motivation: Existing methods require extensive training or costly self-supervised optimization, whereas this work proposes a training-free approach.

Details

Method: Use pre-trained video diffusion models, adapt zero-shot image inpainting techniques for video object removal, and use self-attention maps to capture information about objects and their effects. Result: OmnimatteZero achieves superior background reconstruction and runs in real time. Conclusion: OmnimatteZero is an efficient, training-free method that can quickly decompose and recombine video layers. Abstract: Omnimatte aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. We accomplish this by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. We then show that self-attention maps capture information about the object and its footprints and use them to inpaint the object's effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.

Big Help or Big Brother? Auditing Tracking, Profiling, and Personalization in Generative AI Assistants

Yash Vekaria,Aurelio Loris Canino,Jonathan Levitsky,Alex Ciechonski,Patricia Callejo,Anna Maria Mandalari,Zubair Shafiq

Task: Study the design and operation of generative AI browser assistants, in particular how they collect, store, process, and share user data.

Motivation: Generative AI browser assistants offer rich functionality but raise serious privacy concerns, so their data collection and user profiling behavior needs to be understood in depth.

Details

Method: Audit tracking, profiling, and personalization by the ten most popular generative AI browser assistant extensions using network traffic analysis and a novel prompting framework. Result: The assistants largely depend on server-side APIs and automatically collect and share webpage content (including sensitive information); some also share user identifiers and prompts with third-party trackers. Conclusion: Generative AI browser assistants collect personal and sensitive information for profiling and personalization with little to no safeguards, posing significant privacy risks. Abstract: Generative AI (GenAI) browser assistants integrate powerful capabilities of GenAI in web browsers to provide rich experiences such as question answering, content summarization, and agentic navigation. These assistants, available today as browser extensions, can not only track detailed browsing activity such as search and click data, but can also autonomously perform tasks such as filling forms, raising significant privacy concerns. It is crucial to understand the design and operation of GenAI browser extensions, including how they collect, store, process, and share user data. To this end, we study their ability to profile users and personalize their responses based on explicit or inferred demographic attributes and interests of users. We perform network traffic analysis and use a novel prompting framework to audit tracking, profiling, and personalization by the ten most popular GenAI browser assistant extensions. We find that instead of relying on local in-browser models, these assistants largely depend on server-side APIs, which can be auto-invoked without explicit user interaction. When invoked, they collect and share webpage content, often the full HTML DOM and sometimes even the user's form inputs, with their first-party servers. Some assistants also share identifiers and user prompts with third-party trackers such as Google Analytics. The collection and sharing continues even if a webpage contains sensitive information such as health or personal information such as name or SSN entered in a web form. We find that several GenAI browser assistants infer demographic attributes such as age, gender, income, and interests and use this profile--which carries across browsing contexts--to personalize responses. In summary, our work shows that GenAI browser assistants can and do collect personal and sensitive information for profiling and personalization with little to no safeguards.

Qiao Liang,Yanjiang Liu,Ben He,Yaojie Lu,Hongyu Lin,Jia Zheng,Xianpei Han,Le Sun,Yingfei Sun

Task: Quantify how the prior knowledge of the vision encoder affects the performance of Multi-modal Large Language Models (MLLMs).

Motivation: Most existing research treats MLLMs as unified systems optimized through end-to-end training, and the impact of the vision encoder's prior knowledge is seldom studied.

Details

Method: Introduce a new metric, $Rank_e$, to quantify the effect of prior knowledge, and design VisPRE, a two-stage training framework that strengthens the vision encoder's prior knowledge. Result: Experiments show that augmenting the vision encoder's prior knowledge substantially improves the visual understanding of MLLMs, especially in scenarios involving uncommon visual entities. Conclusion: The vision encoder's prior knowledge has an important effect on MLLM performance, and VisPRE offers a new strategy for improving it. Abstract: Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of the vision encoder's prior knowledge is seldom investigated. In this work, we introduce a novel metric, $Rank_e$, to quantify the effect of the vision encoder's prior knowledge on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient--particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting the vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.

State Fourier Diffusion Language Model (SFDLM): A Scalable, Novel Iterative Approach to Language Modeling

Andrew Kiruluta,Andreas Lemos

Task: Propose a fully diffusion-based discrete text generation model that avoids transformers and large convolution modules.

Motivation: Transformers dominate text generation, but their self-attention mechanism is computationally expensive, so a more efficient alternative is explored.

Details

Method: Combine structured state space dynamics in the time domain with a novel Complex Fourier Multi Layer Perceptron module in the frequency domain, capturing dependencies through local state space updates and global Fourier mixing. Result: The model effectively captures both short- and long-range dependencies for text generation. Conclusion: The diffusion-driven model offers an efficient, lower-cost alternative for text generation. Abstract: In recent years, diffusion-based methods have emerged as a powerful paradigm for generative modeling. Although discrete diffusion for natural language processing has been explored to a lesser extent, it shows promise for tasks requiring iterative denoising of token-based data. In standard approaches to text generation, transformers dominate, but their reliance on self-attention often incurs high computational costs. This paper introduces a fully diffusion-driven discrete text generation model built without any transformer or large convolution modules. Instead, the model integrates structured state space dynamics in the time domain with a novel Complex Fourier Multi Layer Perceptron module that operates in the frequency domain. The forward noising process randomly samples the vocabulary to replace tokens with a controlled probability, while the learned reverse model systematically reverts corrupted sequences toward their original states. By composing local state space updates with global Fourier-based mixing, the approach effectively captures both short- and long-range dependencies.
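
The forward noising process described in the abstract (replace each token with a random vocabulary sample with a controlled probability) can be sketched in a few lines. The function name, the uniform sampling over the vocabulary, and the use of a single fixed probability per call are assumptions; the abstract does not give the noise schedule.

```python
import random

def forward_noise(tokens, vocab_size, replace_prob, rng):
    """Replace each token with a uniformly sampled vocabulary id with
    probability replace_prob; otherwise keep the token unchanged.

    A sampled replacement may coincide with the original token by chance.
    """
    noisy = []
    for tok in tokens:
        if rng.random() < replace_prob:
            noisy.append(rng.randrange(vocab_size))
        else:
            noisy.append(tok)
    return noisy

rng = random.Random(0)
clean = [5, 17, 42, 8, 99, 3]
unchanged = forward_noise(clean, vocab_size=100, replace_prob=0.0, rng=rng)
fully_noised = forward_noise(clean, vocab_size=100, replace_prob=1.0, rng=rng)
```

A diffusion schedule would call this with a probability that grows with the timestep, and the reverse model would be trained to recover `clean` from the corrupted sequence.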

Tianyi Shang,Zhenyu Li,Pengjie Xu,Zhaojun Deng,Ruirui Zhang

Task: Propose Des4Pos, a two-stage text-driven remote sensing localization framework for environment description-based localization in large-scale point cloud maps.

Motivation: Current point cloud encoders fall short in capturing local details and long-range spatial relationships, and there is a significant modality gap between text and point cloud representations.

Details

Method: Des4Pos uses a Multi-scale Fusion Attention Mechanism (MFAM) to enhance local geometric features and a bidirectional LSTM module to strengthen global spatial relationships, and introduces a Stepped Text Encoder (STE) and a Cascaded Residual Attention (CRA) module to align and fuse cross-modal features. Result: On the KITTI360Pose test set, Des4Pos reaches 40% top-1 accuracy and 77% top-10 accuracy in text-to-point-cloud place recognition (5-meter radius threshold), each 7% higher than the best existing methods. Conclusion: Through its two-stage framework and cross-modal feature alignment, Des4Pos significantly improves text-based point cloud localization. Abstract: Environment description-based localization in large-scale point cloud maps constructed through remote sensing is critically significant for the advancement of large-scale autonomous systems, such as delivery robots operating in the last mile. However, current approaches encounter challenges due to the inability of point cloud encoders to effectively capture local details and long-range spatial relationships, as well as a significant modality gap between text and point cloud representations. To address these challenges, we present Des4Pos, a novel two-stage text-driven remote sensing localization framework. In the coarse stage, the point-cloud encoder utilizes the Multi-scale Fusion Attention Mechanism (MFAM) to enhance local geometric features, followed by a bidirectional Long Short-Term Memory (LSTM) module to strengthen global spatial relationships. Concurrently, the Stepped Text Encoder (STE) integrates cross-modal prior knowledge from CLIP [1] and aligns text and point-cloud features using this prior knowledge, effectively bridging modality discrepancies. In the fine stage, we introduce a Cascaded Residual Attention (CRA) module to fuse cross-modal features and predict relative localization offsets, thereby achieving greater localization precision. Experiments on the KITTI360Pose test set demonstrate that Des4Pos achieves state-of-the-art performance in text-to-point-cloud place recognition. Specifically, it attains a top-1 accuracy of 40% and a top-10 accuracy of 77% under a 5-meter radius threshold, surpassing the best existing methods by 7% and 7%, respectively.
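
For readers unfamiliar with the reported metric, the sketch below shows one plausible reading of "top-k accuracy under a radius threshold": a query counts as a hit if any of its k highest-ranked candidate positions lies within the radius of the ground-truth position. The exact KITTI360Pose evaluation protocol is not given in the abstract, so this definition, the 2-D positions, and the function names are assumptions.

```python
import math

def topk_hit(ranked_positions, truth, k, radius):
    """True if any of the first k candidates is within radius of the truth."""
    return any(math.dist(p, truth) <= radius for p in ranked_positions[:k])

def topk_accuracy(queries, k, radius):
    """Fraction of queries with at least one hit among their top-k candidates.

    queries: list of (ranked_positions, truth) pairs, best candidate first.
    """
    hits = sum(topk_hit(ranked, truth, k, radius) for ranked, truth in queries)
    return hits / len(queries)

queries = [
    ([(0.0, 0.0), (50.0, 50.0)], (3.0, 3.0)),   # best candidate is close
    ([(40.0, 0.0), (1.0, 1.0)], (0.0, 0.0)),    # only the second candidate is close
]
top1 = topk_accuracy(queries, k=1, radius=5.0)
top2 = topk_accuracy(queries, k=2, radius=5.0)
```

Under this reading, the gap between the reported top-1 (40%) and top-10 (77%) figures reflects queries whose correct location is retrieved but not ranked first.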

Junwei Kuang,Liang Yang,Shaoze Cui,Weiguo Fan

Task: Develop a model that identifies users' social support needs in online health Q&A communities.

Motivation: Online social support that does not match a user's needs may be ineffective or even harmful, so those needs must be identified accurately.

Details

Method: Propose the HA-SOS framework, which combines answer-enhanced semi-supervised learning, an LLM-based text data augmentation technique, and a unified training process. Result: HA-SOS significantly outperforms existing question classification models and semi-supervised learning approaches. Conclusion: HA-SOS contributes new methods to social support research and question classification, and helps platform managers meet user needs more precisely. Abstract: Patients are increasingly turning to online health Q&A communities for social support to improve their well-being. However, when the support received does not align with their specific needs, it may prove ineffective or even detrimental. This necessitates a model capable of identifying the social support needs in questions. However, training such a model is challenging due to the scarcity and class imbalance issues of labeled data. To overcome these challenges, we follow the computational design science paradigm to develop a novel framework, Hybrid Approach for SOcial Support need classification (HA-SOS). HA-SOS integrates an answer-enhanced semi-supervised learning approach, a text data augmentation technique leveraging large language models (LLMs) with a reliability- and diversity-aware sample selection mechanism, and a unified training process to automatically label social support needs in questions. Extensive empirical evaluations demonstrate that HA-SOS significantly outperforms existing question classification models and alternative semi-supervised learning approaches. This research contributes to the literature on social support, question classification, semi-supervised learning, and text data augmentation. In practice, our HA-SOS framework helps online Q&A platform managers and answerers better understand users' social support needs, enabling them to provide timely, personalized answers and interventions.

DualCP: Rehearsal-Free Domain-Incremental Learning via Dual-Level Concept Prototype

Qiang Wang,Yuhang He,SongLin Dong,Xiang Song,Jizhou Han,Haoyu Luo,Yihong Gong

Task: Design DualCP, a method that addresses the conflict between new and old knowledge in Rehearsal-Free Domain-Incremental Learning (RFDIL).

Motivation: Inspired by the incremental cognitive process of the human brain, and given practical concerns about privacy and training time, the work proposes a more practical RFDIL method.

Details

Method: Propose DualCP (Dual-level Concept Prototypes), comprising a Concept Prototype Generator (CPG) and a Coarse-to-Fine calibrator (C2F), together with a Dual Dot-Regression (DDR) loss function for optimizing the C2F module. Result: Experiments on the DomainNet, CDDB, and CORe50 datasets confirm the effectiveness of the method. Conclusion: DualCP effectively resolves the conflict between new and old knowledge in RFDIL and has practical value. Abstract: Domain-Incremental Learning (DIL) enables vision models to adapt to changing conditions in real-world environments while maintaining the knowledge acquired from previous domains. Given privacy concerns and training time, Rehearsal-Free DIL (RFDIL) is more practical. Inspired by the incremental cognitive process of the human brain, we design Dual-level Concept Prototypes (DualCP) for each class to address the conflict between learning new knowledge and retaining old knowledge in RFDIL. To construct DualCP, we propose a Concept Prototype Generator (CPG) that generates both coarse-grained and fine-grained prototypes for each class. Additionally, we introduce a Coarse-to-Fine calibrator (C2F) to align image features with DualCP. Finally, we propose a Dual Dot-Regression (DDR) loss function to optimize our C2F module. Extensive experiments on the DomainNet, CDDB, and CORe50 datasets demonstrate the effectiveness of our method.

From Text to Talent: A Pipeline for Extracting Insights from Candidate Profiles

Paolo Frazzetto,Muhammad Uzair Ul Haq,Flavia Fabris,Alessandro Sperduti

Task: Propose a new pipeline that uses large language models and graph similarity measures to suggest ideal candidates for specific job openings.

Motivation: Existing studies focus on automating candidate selection, while the role of multiple vacancies in recruitment remains understudied.

Details

Method: Represents candidate profiles as multimodal embeddings and applies large language models and graph similarity measures. Result: The approach captures nuanced relationships between job requirements and candidate attributes and streamlines the hiring process. Conclusion: The work demonstrates the potential of LLMs and graph-based methods in recruitment and points to new directions for machine learning in human resources. Abstract: The recruitment process is undergoing a significant transformation with the increasing use of machine learning and natural language processing techniques. While previous studies have focused on automating candidate selection, the role of multiple vacancies in this process remains understudied. This paper addresses this gap by proposing a novel pipeline that leverages Large Language Models and graph similarity measures to suggest ideal candidates for specific job openings. Our approach represents candidate profiles as multimodal embeddings, enabling the capture of nuanced relationships between job requirements and candidate attributes. The proposed approach has significant implications for the recruitment industry, enabling companies to streamline their hiring processes and identify top talent more efficiently. Our work contributes to the growing body of research on the application of machine learning in human resources, highlighting the potential of LLMs and graph-based methods in revolutionizing the recruitment landscape.
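
As a rough illustration of similarity-based candidate suggestion, the sketch below ranks candidates against a job opening by the cosine similarity of their embedding vectors. The toy vectors, the candidate names, and the use of cosine similarity are assumptions made for illustration only; the abstract does not specify the embedding model or the graph similarity measure used in the pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_candidates(job_embedding, candidate_embeddings):
    """Return candidate names sorted from most to least similar to the job."""
    scored = [
        (cosine_similarity(job_embedding, emb), name)
        for name, emb in candidate_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Hypothetical 3-dimensional embeddings, not taken from the paper.
job = [1.0, 0.0, 1.0]
candidates = {
    "candidate_a": [0.9, 0.1, 0.8],
    "candidate_b": [0.0, 1.0, 0.0],
    "candidate_c": [0.5, 0.5, 0.5],
}
ranking = rank_candidates(job, candidates)
print(ranking)
```

With several vacancies, the same scoring could be repeated per job to build a job-by-candidate similarity matrix, which is one plausible starting point for a graph-based comparison.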

SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Yue Li,Qi Ma,Runyi Yang,Huapeng Li,Mengjiao Ma,Bin Ren,Nikola Popovic,Nicu Sebe,Ender Konukoglu,Theo Gevers,Luc Van Gool,Martin R. Oswald,Danda Pani Paudel

Task: Introduce SceneSplat, a large-scale indoor scene understanding approach built on 3D Gaussian Splatting (3DGS), together with a self-supervised learning scheme.

Motivation: Existing methods rely on 2D or textual modalities; there is no model that learns semantics end-to-end from 3D data alone, nor the data needed to train one. 3DGS has become the de facto standard for 3D scene representation, but integrating semantic reasoning into it effectively remains challenging.

Details

Method: Proposes SceneSplat, which operates natively on 3DGS, and a self-supervised learning scheme that learns 3D features from unlabeled scenes. It also builds the SceneSplat-7K dataset of 6868 indoor scenes. Result: Experiments on SceneSplat-7K show that the proposed methods significantly outperform the baselines. Conclusion: SceneSplat fills the gap in end-to-end semantic learning from 3D data and provides a standardized benchmark for 3DGS-based indoor scene understanding. Abstract: Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training, or together at inference. This highlights a clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable fashion remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 6868 scenes derived from 7 established datasets such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 119 GPU-days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed methods over the established baselines.

Variance Control via Weight Rescaling in LLM Pre-training

Louis Owen,Abhay Kumar,Nilabhra Roy Chowdhury,Fabian Güra

Task: Propose the Layer Index Rescaling (LIR) weight initialization scheme and the Target Variance Rescaling (TVR) variance control strategy to improve weight initialization and variance management in large language model (LLM) pre-training.

Motivation: Although the importance of weight initialization and variance control in neural networks is well studied, work specific to LLM pre-training is sparse, and more effective strategies are needed to improve model performance.

Details

Method: Introduces the LIR weight initialization scheme and the TVR variance control strategy and validates them on a 1B-parameter LLaMA model. Result: The techniques substantially improve downstream task performance (by up to 4.6%) and reduce extreme activation values, easing the challenges of quantization and low-precision training. Conclusion: LIR and TVR improve weight initialization and variance management in LLM pre-training, raising both model performance and training stability. Abstract: The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of initial variance control has been well documented in neural networks in general, the literature on initialization and management of its growth during LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme, and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B-parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: https://github.com/bluorion-com/weight_rescaling.
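
The abstract does not give the formulas for LIR or TVR, so the following is only a generic sketch of what rescaling a weight tensor to a target variance can look like. It assumes that rescaling means multiplying all weights by a single factor sqrt(target_var / current_var); the function names, the flat-list representation, and the target value are hypothetical and are not taken from the authors' repository.

```python
import math
import random


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def rescale_to_target_variance(weights, target_var):
    """Scale weights by one factor so that their variance equals target_var.

    Multiplying every weight by s multiplies the variance by s**2,
    so s = sqrt(target_var / current_var).
    """
    current_var = variance(weights)
    if current_var == 0.0:
        return list(weights)  # nothing to rescale
    scale = math.sqrt(target_var / current_var)
    return [w * scale for w in weights]


random.seed(0)
# Hypothetical weights whose variance has drifted above the target.
weights = [random.gauss(0.0, 0.5) for _ in range(1000)]
target = 0.02 ** 2
rescaled = rescale_to_target_variance(weights, target)
print(variance(weights), variance(rescaled))
```

In a training loop, such a step would presumably be applied to selected weight matrices at some interval; which matrices and how often are details the abstract leaves open.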

PolarFree: Polarization-based Reflection-free Imaging

Mingde Yao,Menglu Wang,King-Man Tam,Lingen Li,Tianfan Xue,Jinwei Gu

Task: Use polarization information to remove reflections from RGB images.

Motivation: Existing methods rely on small-scale or synthetic datasets that fail to capture the diversity and complexity of real scenes, so a large-scale dataset and a more effective reflection removal method are needed.

Details

Method: Builds the large-scale PolaRGB dataset and proposes PolarFree, a diffusion-based method that uses polarization information to generate reflection-free cues. Result: PolarFree significantly improves image clarity in challenging reflective scenes, setting a new benchmark for polarized imaging and reflection removal. Conclusion: The PolaRGB dataset and the PolarFree method provide an effective solution for reflection removal in real-world scenes. Abstract: Reflection removal is challenging due to complex light interactions, where reflections obscure important details and hinder scene understanding. Polarization naturally provides a powerful cue to distinguish between reflected and transmitted light, enabling more accurate reflection removal. However, existing methods often rely on small-scale or synthetic datasets, which fail to capture the diversity and complexity of real-world scenarios. To this end, we construct a large-scale dataset, PolaRGB, for Polarization-based reflection removal of RGB images, which enables us to train models that generalize effectively across a wide range of real-world scenarios. The PolaRGB dataset contains 6,500 well-aligned mixed-transmission image pairs, 8x larger than existing polarization datasets, and is the first to include both RGB and polarization images captured across diverse indoor and outdoor environments with varying lighting conditions. In addition, to fully exploit the potential of polarization cues for reflection removal, we introduce PolarFree, which leverages a diffusion process to generate reflection-free cues for accurate reflection removal. Extensive experiments show that PolarFree significantly enhances image clarity in challenging reflective scenarios, setting a new benchmark for polarized imaging and reflection removal. Code and dataset are available at https://github.com/mdyao/PolarFree.

Large Language Models (LLMs) for Source Code Analysis: applications, models and datasets

Hamed Jelodar,Mohammad Meymani,Roozbeh Razavi-Far

Task: Explore the role of large language models (LLMs) in source code analysis, covering applications, models, datasets, and challenges.

Motivation: As software systems grow more complex, integrating LLMs into code analysis workflows becomes essential for improving efficiency, accuracy, and automation.

Details

Method: Surveys scholarly articles on the use of LLMs for source code analysis to uncover research developments, current trends, and the intellectual structure of the field. Result: Summarizes the applications of LLMs in code analysis, the models and datasets used, and the challenges faced. Conclusion: The study provides a summary of valuable tools, datasets, and key challenges for future work. Abstract: Large language models (LLMs) and transformer-based architectures are increasingly utilized for source code analysis. As software systems grow in complexity, integrating LLMs into code analysis workflows becomes essential for enhancing efficiency, accuracy, and automation. This paper explores the role of LLMs for different code analysis tasks, focusing on three key aspects: 1) what they can analyze and their applications, 2) what models are used, and 3) what datasets are used and the challenges they face. Regarding the goal of this research, we investigate scholarly articles that explore the use of LLMs for source code analysis to uncover research developments, current trends, and the intellectual structure of this emerging field. Additionally, we summarize limitations and highlight essential tools, datasets, and key challenges, which could be valuable for future work.

Ziming Wei,Bingqian Lin,Yunshuang Nie,Jiaqi Chen,Shikui Ma,Hang Xu,Xiaodan Liang

Task: Propose a Rewriting-driven AugMentation (RAM) paradigm to address data scarcity in Vision-Language Navigation (VLN).

Motivation: Data scarcity limits the generalization of VLN agents to unseen environments; existing methods rely on simulator or web data, which suffer from limited diversity or noise.

Details

Method: Rewrites human-annotated training data to create unseen observation-instruction pairs, using Object-Enriched Observation Rewriting and Observation-Contrast Instruction Rewriting, combined with a mixing-then-focusing training strategy. Result: Shows superior performance and generalization in both discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset). Conclusion: The RAM paradigm improves VLN generalization in a simulator-free and labor-saving manner. Abstract: Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.

Autonomous Radiotherapy Treatment Planning Using DOLA: A Privacy-Preserving, LLM-Based Optimization Agent

Humza Nusrat,Bing Luo,Ryan Hall,Joshua Kim,Hassan Bagher-Ebadian,Anthony Doemer,Benjamin Movsas,Kundan Thind

Task: Develop DOLA, an autonomous large language model (LLM)-based agent for optimizing radiotherapy treatment plans.

Motivation: Address the complexity and subjective decision-making in radiotherapy treatment planning while protecting patient privacy.

Details

Method: Integrates the LLaMa3.1 LLM with a commercial treatment planning system, using chain-of-thought prompting, retrieval-augmented generation (RAG), and reinforcement learning (RL). Result: The 70B-parameter model outperforms the 8B model, RAG improves on the No-RAG baseline by 19.8%, and RL accelerates convergence. Conclusion: DOLA is the first successful deployment of a locally hosted LLM agent for autonomous plan optimization within a commercial radiotherapy planning system, offering a scalable and privacy-preserving solution for clinical workflows. Abstract: Radiotherapy treatment planning is a complex and time-intensive process, often impacted by inter-planner variability and subjective decision-making. To address these challenges, we introduce Dose Optimization Language Agent (DOLA), an autonomous large language model (LLM)-based agent designed for optimizing radiotherapy treatment plans while rigorously protecting patient privacy. DOLA integrates the LLaMa3.1 LLM directly with a commercial treatment planning system, utilizing chain-of-thought prompting, retrieval-augmented generation (RAG), and reinforcement learning (RL). Operating entirely within secure local infrastructure, this agent eliminates external data sharing. We evaluated DOLA using a retrospective cohort of 18 prostate cancer patients prescribed 60 Gy in 20 fractions, comparing model sizes (8 billion vs. 70 billion parameters) and optimization strategies (No-RAG, RAG, and RAG+RL) over 10 planning iterations. The 70B model demonstrated significantly improved performance, achieving approximately 16.4% higher final scores than the 8B model. The RAG approach outperformed the No-RAG baseline by 19.8%, and incorporating RL accelerated convergence, highlighting the synergy of retrieval-based memory and reinforcement learning. Optimal temperature hyperparameter analysis identified 0.4 as providing the best balance between exploration and exploitation. This proof-of-concept study represents the first successful deployment of locally hosted LLM agents for autonomous optimization of treatment plans within a commercial radiotherapy planning system. By extending human-machine interaction through interpretable natural language reasoning, DOLA offers a scalable and privacy-conscious framework, with significant potential for clinical implementation and workflow improvement.

PanopticSplatting: End-to-End Panoptic Gaussian Splatting

Yuxuan Xie,Xuan Yu,Changjian Jiang,Sitong Mao,Shunbo Zhou,Rui Fan,Rong Xiong,Yue Wang

Task: Build an end-to-end system for open-vocabulary panoptic reconstruction.

Motivation: Existing methods are multi-stage and depend on hand-designed components, leading to accumulated errors and inefficiency.

Details

Method: Proposes PanopticSplatting, which uses query-guided Gaussian segmentation with local cross attention, together with label blending and label warping. Result: Outperforms both NeRF-based and Gaussian-based methods on the ScanNet-V2 and ScanNet++ datasets. Conclusion: PanopticSplatting is efficient and robust, and generalizes easily to many variants of Gaussian splatting. Abstract: Open-vocabulary panoptic reconstruction is a challenging task for simultaneous scene reconstruction and understanding. Recently, methods have been proposed for 3D scene understanding based on Gaussian splatting. However, these methods are multi-staged, suffering from accumulated errors and dependence on hand-designed components. To streamline the pipeline and achieve global optimization, we propose PanopticSplatting, an end-to-end system for open-vocabulary panoptic reconstruction. Our method introduces query-guided Gaussian segmentation with local cross attention, lifting 2D instance masks without cross-frame association in an end-to-end way. The local cross attention within the view frustum effectively reduces the training memory, making our model more accessible to large scenes with more Gaussians and objects. In addition, to address the challenge of noisy labels in 2D pseudo masks, we propose label blending to promote consistent 3D segmentation with fewer noisy floaters, as well as label warping on 2D predictions, which enhances multi-view coherence and segmentation accuracy. Our method demonstrates strong performance in 3D scene panoptic reconstruction on the ScanNet-V2 and ScanNet++ datasets, compared with both NeRF-based and Gaussian-based panoptic reconstruction methods. Moreover, PanopticSplatting can be easily generalized to numerous variants of Gaussian splatting, and we demonstrate its robustness on different Gaussian base models.

FairFlow: Mitigating Dataset Biases through Undecided Learning

Jiali Cheng,Hadi Amiri

Task: Propose a debiasing framework, FairFlow, that mitigates dataset biases in language models.

Motivation: Language models are prone to dataset biases (such as shortcuts and spurious correlations), which degrade performance on new data.

Details

Method: FairFlow has two key components: a suite of data and model perturbation operations that generate different biased views, and a contrastive objective that learns debiased and robust representations from those views. Result: Experiments show that FairFlow outperforms existing debiasing methods, particularly on out-of-domain and hard test samples, without compromising in-domain performance. Conclusion: FairFlow is an effective debiasing framework that markedly improves model robustness on data with unknown biases. Abstract: Language models are prone to dataset biases, known as shortcuts and spurious correlations in data, which often result in performance drops on new data. We present a new debiasing framework called "FairFlow" that mitigates dataset biases by learning to be undecided in its predictions for data samples or representations associated with known or unknown biases. The framework introduces two key components: a suite of data and model perturbation operations that generate different biased views of input samples, and a contrastive objective that learns debiased and robust representations from the resulting biased views of samples. Experiments show that FairFlow outperforms existing debiasing methods, particularly against out-of-domain and hard test samples, without compromising the in-domain performance.

Vehicular Road Crack Detection with Deep Learning: A New Online Benchmark for Comprehensive Evaluation of Existing Algorithms

Nachuan Ma,Zhengfei Song,Qiang Hu,Chuang-Wei Liu,Yu Han,Yanting Zhang,Rui Fan,Lihua Xie

Task: Systematically review state-of-the-art deep learning techniques for road crack detection and create the UDTIRI-Crack dataset as a benchmark.

Motivation: Make road crack detection more efficient, accurate, and objective so that it can replace manual visual inspection, and fill the lack of systematic reviews of existing techniques.

Details

Method: Reviews supervised, unsupervised, semi-supervised, and weakly-supervised methods, creates the UDTIRI-Crack dataset, and runs experiments on detection performance, computational efficiency, and generalizability. Result: Compares the performance of multiple algorithms, explores the feasibility of foundation models and large language models, and summarizes existing challenges and future trends. Conclusion: The review offers practical guidance for developing next-generation intelligent road inspection vehicles and releases UDTIRI-Crack as a benchmark. Abstract: In the emerging field of urban digital twins (UDTs), advancing intelligent road inspection (IRI) vehicles with automatic road crack detection systems is essential for maintaining civil infrastructure. Over the past decade, deep learning-based road crack detection methods have been developed to detect cracks more efficiently, accurately, and objectively, with the goal of replacing manual visual inspection. Nonetheless, there is a lack of systematic reviews on state-of-the-art (SoTA) deep learning techniques, especially data-fusion and label-efficient algorithms for this task. This paper thoroughly reviews the SoTA deep learning-based algorithms, including (1) supervised, (2) unsupervised, (3) semi-supervised, and (4) weakly-supervised methods developed for road crack detection. Also, we create a dataset called UDTIRI-Crack, comprising 2,500 high-quality images from seven public annotated sources, as the first extensive online benchmark in this field. Comprehensive experiments are conducted to compare the detection performance, computational efficiency, and generalizability of public SoTA deep learning-based algorithms for road crack detection. In addition, the feasibility of foundation models and large language models (LLMs) for road crack detection is explored. Afterwards, the existing challenges and future development trends of deep learning-based road crack detection algorithms are discussed. We believe this review can serve as practical guidance for developing intelligent road detection vehicles with the next-generation road condition assessment systems. The released benchmark UDTIRI-Crack is available at https://udtiri.com/submission/.

V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

Yiming Zhao,Yu Zeng,Yukun Qi,YaoYang Liu,Lin Chen,Zehui Chen,Xikun Bao,Jie Zhao,Feng Zhao

Task: Propose the Video Visual Prompt Benchmark (V2P-Bench) to evaluate the multimodal interaction capabilities of large vision-language models in video understanding.

Motivation: Current benchmarks rely solely on text prompts and lack precise spatial and temporal references, which limits the experience and efficiency of human-model interaction.

Details

Method: Designs V2P-Bench with 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, to enable fine-grained instance-level understanding. Result: Even the strongest existing models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), far below the human experts' 88.3%. Conclusion: V2P-Bench provides a foundation for advancing multimodal human-model interaction and video understanding evaluation. Abstract: Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently. However, current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: https://github.com/gaotiexinqu/V2P-Bench.

Unified Geometry and Color Compression Framework for Point Clouds via Generative Diffusion Priors

Tianxin Huang,Gim Hee Lee

Task: Propose a test-time unified geometry and color compression framework for 3D point clouds.

Motivation: Existing learning-based compression methods usually handle geometry and color attributes separately, which makes them hard to apply directly to colored point clouds, and limited training datasets restrict their generalizability.

Details

Method: Adapts a pre-trained generative diffusion model, via prompt tuning, to compress the original colored point cloud into a sparse set of 'seeds'; decompression is achieved through multiple denoising steps. Result: Experiments on objects and indoor scenes show that the method outperforms existing baselines for geometry and color compression. Conclusion: The method offers an efficient and general solution for colored point cloud compression. Abstract: With the growth of 3D applications and the rapid increase in sensor-collected 3D point cloud data, there is a rising demand for efficient compression algorithms. Most existing learning-based compression methods handle geometry and color attributes separately, treating them as distinct tasks, making these methods challenging to apply directly to point clouds with colors. In addition, the limited capacities of training datasets also limit their generalizability across points with different distributions. In this work, we introduce a test-time unified geometry and color compression framework for 3D point clouds. Instead of training a compression model based on specific datasets, we adapt a pre-trained generative diffusion model to compress original colored point clouds into sparse sets, termed 'seeds', using prompt tuning. Decompression is then achieved through multiple denoising steps with separate sampling processes. Experiments on objects and indoor scenes demonstrate that our method has superior performance compared to existing baselines for the compression of geometry and color.

Energy-Aware LLMs: A step towards sustainable AI for downstream applications

Nguyen Phuc Tran,Brigitte Jaumard,Oscar Delgado

Task: Study how to balance energy efficiency and model performance for an LLM during fault ticket analysis in communication networks.

Motivation: Although LLMs have brought innovation to communication networks, their heavy demand for computational resources and their energy consumption need to be addressed.

Details

Method: Proposes an end-to-end pipeline that combines quantization and pruning techniques and evaluates it on root cause analysis and response feedback tasks. Result: An appropriate combination of quantization and pruning reduces energy consumption while significantly improving model performance. Conclusion: The study offers a feasible approach for the efficient use of LLMs in communication networks. Abstract: Advanced Large Language Models (LLMs) have revolutionized various fields, including communication networks, sparking an innovation wave that has led to new applications and services, and significantly enhanced solution schemes. Despite all these impressive developments, most LLMs typically require huge computational resources, resulting in very high energy consumption. Thus, this research study proposes an end-to-end pipeline that investigates the trade-off between energy efficiency and model performance for an LLM during fault ticket analysis in communication networks. It further evaluates the pipeline performance using two real-world datasets for the tasks of root cause analysis and response feedback in a communication network. Our results show that an appropriate combination of quantization and pruning techniques is able to reduce energy consumption while significantly improving model performance.
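
To make the two compression steps concrete, here is a minimal sketch that prunes a weight vector by magnitude and then quantizes the survivors to 8-bit integers. The abstract does not say which pruning or quantization variants the pipeline uses, so magnitude pruning, symmetric per-tensor int8 quantization, the 50% sparsity, and the toy weights are all assumptions chosen for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


# Hypothetical weights, not taken from the paper.
weights = [0.02, -0.5, 0.31, -0.04, 0.9, 0.0015, -0.27, 0.6]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
print(pruned)
print(q)
```

Pruning reduces the number of non-zero weights and quantization reduces the bits per weight; how much either step saves in energy depends on hardware and runtime support, which this sketch does not model.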

Anomize: Better Open Vocabulary Video Anomaly Detection

Fei Li,Wenxuan Liu,Jingjing Chen,Ruixu Zhang,Yuran Wang,Xian Zhong,Zheng Wang

Task: Open Vocabulary Video Anomaly Detection (OVVAD) aims to detect and classify both base and novel anomalies.

Motivation: Existing methods struggle with detection ambiguity and categorization confusion for novel anomalies.

Details

Method: The proposed Anomize framework leverages multiple levels of visual data and textual information to mitigate detection ambiguity, and incorporates label relations to reduce categorization confusion. Result: Anomize achieves superior performance on UCF-Crime and XD-Violence datasets. Conclusion: The framework effectively addresses challenges in OVVAD, demonstrating its effectiveness. Abstract: Open Vocabulary Video Anomaly Detection (OVVAD) seeks to detect and classify both base and novel anomalies. However, existing methods face two specific challenges related to novel anomalies. The first challenge is detection ambiguity, where the model struggles to assign accurate anomaly scores to unfamiliar anomalies. The second challenge is categorization confusion, where novel anomalies are often misclassified as visually similar base instances. To address these challenges, we explore supplementary information from multiple sources to mitigate detection ambiguity by leveraging multiple levels of visual data alongside matching textual information. Furthermore, we propose incorporating label relations to guide the encoding of new labels, thereby improving alignment between novel videos and their corresponding labels, which helps reduce categorization confusion. The resulting Anomize framework effectively tackles these issues, achieving superior performance on UCF-Crime and XD-Violence datasets, demonstrating its effectiveness in OVVAD.

Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM

Codefuse,Ling Team,Wenting Cai,Yuchen Cao,Chaoyu Chen,Chen Chen,Siba Chen,Qing Cui,Peng Di,Junpeng Fang,Zi Gong,Ting Guo,Zhengyu He,Yang Huang,Cong Li,Jianguo Li,Zheng Li,Shijie Lian,BingChang Liu,Songshan Luo,Shuo Mao,Min Shen,Jian Wu,Jiaolong Yang,Wenjie Yang,Tong Ye,Hang Yu,Wei Zhang,Zhenduo Zhang,Hailin Zhao,Xunjin Zheng,Jun Zhou

Task: Build an efficient code large language model with comprehensive performance (Ling-Coder-Lite).

Motivation: Existing code LLMs trade off performance against efficiency, and this limitation needs to be overcome.

Details

Method: Uses an efficient Mixture-of-Experts (MoE) architecture together with high-quality data curation methods (based on program analytics). Result: Ling-Coder-Lite performs on par with state-of-the-art models of similar size on 12 representative coding benchmarks while reducing deployment resources by 50%. Conclusion: Ling-Coder-Lite balances performance and efficiency; the models and data are open-sourced to support further research. Abstract: Recent advancements in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. It is still challenging to build a code LLM with comprehensive performance yet ultimate efficiency. Many attempts have been released in the open source community to break the trade-off between performance and efficiency, such as the Qwen Coder series and the DeepSeek Coder series. This paper introduces yet another attempt in this area, namely Ling-Coder-Lite. We leverage the efficient Mixture-of-Experts (MoE) architecture along with a set of high-quality data curation methods (especially those based on program analytics) to build an efficient yet powerful code LLM. Ling-Coder-Lite exhibits on-par performance on 12 representative coding benchmarks compared to state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, while offering competitive latency and throughput. In practice, we achieve a 50% reduction in deployment resources compared to the similar-sized dense model without performance loss. To facilitate further research and development in this area, we open-source our models as well as a substantial portion of high-quality data for the annealing and post-training stages. The models and data can be accessed at https://huggingface.co/inclusionAI/Ling-Coder-lite.
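
For readers unfamiliar with why a Mixture-of-Experts layer can be cheaper than a dense one, the sketch below shows generic top-k routing: a router scores all experts, but only the k best are evaluated for each input. The abstract does not describe Ling-Coder-Lite's routing, so the number of experts, k = 2, the renormalized softmax gates, and the scalar toy experts are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k experts with the highest router probability and
    renormalize their gate weights so they sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    gates = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Hypothetical scalar "experts"; real experts would be feed-forward networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [2.0, 0.5, -1.0, 2.0]
gates = top_k_route(logits, k=2)
output = moe_forward(3.0, experts, logits, k=2)
print(gates, output)
```

Only two of the four experts run for this input, which is the source of the compute savings relative to a dense layer with the same total parameter count.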

M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving

Xuesong Chen,Shaoshuai Shi,Tao Ma,Jingqiu Zhou,Simon See,Ka Chun Cheung,Hongsheng Li

Task: Propose M3Net, a multimodal multi-task network that simultaneously handles detection, segmentation, and 3D occupancy prediction for autonomous driving.

Motivation: Current algorithms typically handle each sub-task separately, which is inefficient; existing multi-task learning methods do not resolve the conflicts between tasks.

Details

Method: M3Net takes multimodal data as input and handles multiple tasks via query-token interactions; it introduces the Modality-Adaptive Feature Integration (MAFI) module and the Task-oriented Channel Scaling (TCS) module. Result: Achieves state-of-the-art multi-task learning performance on the nuScenes benchmarks. Conclusion: Through its module design, M3Net effectively resolves conflicts in multi-task learning and improves the efficiency of autonomous driving perception systems. Abstract: The perception system for autonomous driving generally needs to handle multiple diverse sub-tasks. However, current algorithms typically tackle individual sub-tasks separately, which leads to low efficiency when aiming at obtaining full-perception results. Some multi-task learning methods try to unify multiple tasks with one model, but do not solve the conflicts in multi-task learning. In this paper, we introduce M3Net, a novel multimodal and multi-task network that simultaneously tackles detection, segmentation, and 3D occupancy prediction for autonomous driving and achieves superior performance to single-task models. M3Net takes multimodal data as input and handles multiple tasks via query-token interactions. To enhance the integration of multi-modal features for multi-task learning, we first propose the Modality-Adaptive Feature Integration (MAFI) module, which enables single-modality features to predict channel-wise attention weights for their high-performing tasks, respectively. Based on integrated features, we then develop task-specific query initialization strategies to accommodate the needs of detection/segmentation and 3D occupancy prediction. Leveraging the properly initialized queries, a shared decoder transforms queries and BEV features layer-wise, facilitating multi-task learning. Furthermore, we propose a Task-oriented Channel Scaling (TCS) module in the decoder to mitigate conflicts between optimizing for different tasks. Additionally, our proposed multi-task querying and TCS module support both Transformer-based and Mamba-based decoders, demonstrating their flexibility across architectures. M3Net achieves state-of-the-art multi-task learning performance on the nuScenes benchmarks.

Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization

Zefeng Zhang,Hengzhu Tang,Jiawei Sheng,Zhenyu Zhang,Yiming Ren,Zhenyang Li,Dawei Yin,Duohe Ma,Tingwen Liu

Task: Address modality bias in multimodal large language models.

Motivation: Multimodal large language models tend to rely on a single modality and overlook critical information in other modalities, leading to incorrect focus and irrelevant responses.

Details

Method: Proposes a preference-optimization approach, including the debiased dataset RLAIFVBias and a Noise-Aware Preference Optimization algorithm. Result: Experiments validate the approach, which both mitigates modality bias and significantly reduces hallucinations. Conclusion: The preference optimization paradigm is an effective way to address modality bias. Abstract: Multimodal Large Language Models excel in various tasks, yet often struggle with modality bias, where the model tends to rely heavily on a single modality and overlook critical information in other modalities, leading to incorrect focus and irrelevant responses. In this paper, we propose using the paradigm of preference optimization to solve the modality bias problem, including RLAIFVBias, a debiased preference optimization dataset, and a Noise-Aware Preference Optimization algorithm. Specifically, we first construct the dataset by introducing perturbations to reduce the informational content of certain modalities, compelling the model to rely on a specific modality when generating negative responses. To address the inevitable noise in automatically constructed data, we combine the noise-robust Mean Absolute Error with the Binary Cross Entropy in Direct Preference Optimization by a negative Box-Cox transformation, and dynamically adjust the algorithm's noise robustness based on the evaluated noise levels in the data. Extensive experiments validate our approach, demonstrating not only its effectiveness in mitigating modality bias but also its significant role in minimizing hallucinations.
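
A negative Box-Cox transformation is a standard way to interpolate between cross entropy and mean absolute error: for a probability p and a parameter q, the loss (1 - p^q) / q tends to -log(p) as q approaches 0 and equals 1 - p at q = 1. The sketch below illustrates that interpolation only. It assumes p is the model's probability of preferring the chosen response (a sigmoid of a hypothetical preference margin); how the paper computes the margin and how it sets q from the estimated noise level are not given in the abstract.

```python
import math


def box_cox_loss(p, q):
    """Negative Box-Cox transform of a probability p in (0, 1].

    q -> 0 recovers the cross-entropy term -log(p);
    q = 1 gives the MAE-style term 1 - p.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if q == 0.0:
        return -math.log(p)
    return (1.0 - p ** q) / q


def sigmoid(z):
    """Logistic function mapping a real-valued margin to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical preference margin between chosen and rejected responses.
p = sigmoid(0.8)
losses = {q: box_cox_loss(p, q) for q in (0.0, 0.5, 1.0)}
print(losses)
```

Larger q penalizes confidently wrong pairs less severely than cross entropy does, which is why it is more tolerant of mislabeled preference pairs, at the cost of weaker gradients on hard examples.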

PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding

Hongjia Zhai,Hai Li,Zhenzhe Li,Xiaokun Pan,Yijia He,Guofeng Zhang

Task: Propose PanoGS, a new method for 3D panoptic open-vocabulary scene understanding.

Motivation: Existing methods cannot distinguish 3D instance-level information and usually only predict a heatmap between scene features and a text query.

Details

Method: Adopts a pyramid tri-plane to model the latent continuous parametric feature space and uses a 3D feature decoder to regress the multi-view fused 2D feature cloud; proposes language-guided graph cuts that combine reconstructed geometry and language cues to group 3D Gaussian primitives into super-primitives; performs graph-clustering-based segmentation with SAM-guided edge affinity computation. Result: Shows better or more competitive performance on widely used datasets. Conclusion: PanoGS is an effective approach to 3D panoptic open-vocabulary scene understanding. Abstract: Recently, 3D Gaussian Splatting (3DGS) has shown encouraging performance for open vocabulary scene understanding tasks. However, previous methods cannot distinguish 3D instance-level information and usually predict only a heatmap between the scene feature and text query. In this paper, we propose PanoGS, a novel and effective 3D panoptic open vocabulary scene understanding approach. Technically, to learn accurate 3D language features that can scale to large indoor scenarios, we adopt the pyramid tri-plane to model the latent continuous parametric feature space and use a 3D feature decoder to regress the multi-view fused 2D feature cloud. In addition, we propose language-guided graph cuts that synergistically leverage reconstructed geometry and learned language cues to group 3D Gaussian primitives into a set of super-primitives. To obtain 3D-consistent instances, we perform graph-clustering-based segmentation with SAM-guided edge affinity computation between different super-primitives. Extensive experiments on widely used datasets show better or more competitive performance on 3D panoptic open vocabulary scene understanding. Project page: https://zju3dv.github.io/panogs.

Human-AI Interaction and User Satisfaction: Empirical Evidence from Online Reviews of AI Products

Stefan Pasch,Sun-Young Ha

Task: Analyze more than 100,000 user reviews to examine how Human-AI Interaction (HAI) principles affect user satisfaction.

Motivation: Fill the gap in large-scale empirical evidence on how HAI principles affect user satisfaction.

Details

Method: Identifies seven core HAI dimensions based on industry guidelines and analyzes their coverage and sentiment in user reviews on G2.com. Result: Sentiment on four HAI dimensions (adaptability, customization, error recovery, and security) is positively associated with user satisfaction; users with different professional backgrounds focus on different aspects, but the effect of HAI dimensions on satisfaction is not moderated by job role. Conclusion: HAI principles affect user satisfaction consistently across job roles, so AI system design should attend to these dimensions. Abstract: Human-AI Interaction (HAI) guidelines and design principles have become increasingly important in both industry and academia to guide the development of AI systems that align with user needs and expectations. However, large-scale empirical evidence on how HAI principles shape user satisfaction in practice remains limited. This study addresses that gap by analyzing over 100,000 user reviews of AI-related products from G2.com, a leading review platform for business software and services. Based on widely adopted industry guidelines, we identify seven core HAI dimensions and examine their coverage and sentiment within the reviews. We find that the sentiment on four HAI dimensions (adaptability, customization, error recovery, and security) is positively associated with overall user satisfaction. Moreover, we show that engagement with HAI dimensions varies by professional background: users with technical job roles are more likely to discuss system-focused aspects, such as reliability, while non-technical users emphasize interaction-focused features like customization and feedback. Interestingly, the relationship between HAI sentiment and overall satisfaction is not moderated by job role, suggesting that once an HAI dimension has been identified by users, its effect on satisfaction is consistent across job roles.

End-to-End Implicit Neural Representations for Classification

Alexander Gielisse,Jan van Gemert

Task: Improve the classification performance of implicit neural representations (INRs) through a better initialization method and learning-rate scheme.

Motivation: Current INR-based classification performs significantly worse than pixel-based methods such as CNNs, and existing work focuses mainly on symmetry equivariance with limited gains.

Details

Method: Proposes an end-to-end strategy that combines SIREN initialization with a learned learning-rate scheme and applies a simple Transformer model. Result: On CIFAR-10, accuracy improves from 38.8% to 59.6% without augmentations and from 63.4% to 64.7% with augmentations; the method is the first to perform high-resolution classification on the Imagenette and ImageNet-1K datasets. Conclusion: Without explicitly handling symmetries, the method substantially improves INR classification performance and sets a first baseline on high-resolution datasets. Abstract: Implicit neural representations (INRs) such as NeRF and SIREN encode a signal in neural network parameters and show excellent results for signal reconstruction. Using INRs for downstream tasks, such as classification, is, however, not straightforward. Inherent symmetries in the parameters pose challenges, and current works primarily focus on designing architectures that are equivariant to these symmetries. However, INR-based classification still significantly underperforms compared to pixel-based methods like CNNs. This work presents an end-to-end strategy for initializing SIRENs together with a learned learning-rate scheme, to yield representations that improve classification accuracy. We show that a simple, straightforward Transformer model applied to a meta-learned SIREN, without incorporating explicit symmetry equivariances, outperforms the current state-of-the-art. On the CIFAR-10 SIREN classification task, we improve the state-of-the-art without augmentations from 38.8% to 59.6%, and from 63.4% to 64.7% with augmentations. We demonstrate scalability on the high-resolution Imagenette dataset, achieving reasonable reconstruction quality with a classification accuracy of 60.8%, and are the first to do INR classification on the full ImageNet-1K dataset, where we achieve a SIREN classification performance of 23.6%. To the best of our knowledge, no other SIREN classification approach has managed to set a classification baseline for any high-resolution image dataset. Our code is available at https://github.com/SanderGielisse/MWT

Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

Weixiang Zhao,Xingyu Sui,Jiahe Guo,Yulin Hu,Yang Deng,Yanyan Zhao,Bing Qin,Wanxiang Che,Tat-Seng Chua,Ting Liu

Task: Study how acquiring deliberative reasoning affects the foundational capabilities and inference costs of Large Reasoning Models (LRMs).

Motivation: Although LRMs excel at reasoning tasks, their deliberative reasoning capabilities are found to significantly degrade foundational capabilities and increase costs, which calls for adaptive reasoning methods that mitigate these problems.

Details

Method: Systematically evaluates different model families (DeepSeek, Qwen, LLaMA) and scales (7B to 671B), and introduces adaptive reasoning modes (Zero-Thinking, Less-Thinking, and Summary-Thinking). Result: Deliberative reasoning reduces model helpfulness and harmlessness and increases inference costs, but adaptive reasoning effectively alleviates these drawbacks. Conclusion: It is critical to develop versatile LRMs that dynamically allocate inference-time compute according to task characteristics. Abstract: Recent advancements in Large Reasoning Models (LRMs), such as OpenAI's o1/o3 and DeepSeek-R1, have demonstrated remarkable performance in specialized reasoning tasks through human-like deliberative thinking and long chain-of-thought reasoning. However, our systematic evaluation across various model families (DeepSeek, Qwen, and LLaMA) and scales (7B to 671B) reveals that acquiring these deliberative reasoning capabilities significantly reduces the foundational capabilities of LRMs, including notable declines in helpfulness and harmlessness, alongside substantially increased inference costs. Importantly, we demonstrate that adaptive reasoning, employing modes like Zero-Thinking, Less-Thinking, and Summary-Thinking, can effectively alleviate these drawbacks. Our empirical insights underline the critical need for developing more versatile LRMs capable of dynamically allocating inference-time compute according to specific task characteristics.

An Image-like Diffusion Method for Human-Object Interaction Detection

Xiaofei Hui,Haoxuan Qu,Hossein Rahmani,Jun Liu

Task: Propose HOI-IDiff, a new framework that generates human-object interaction (HOI) detection outputs through an image-like diffusion process.

Motivation: HOI detection involves high ambiguity and indeterminacy, which occlusions and cluttered backgrounds further exacerbate.

Details

Method: Recasts detection outputs as images and generates them with an image-like diffusion process, using a customized HOI diffusion process and a slice patchification model architecture. Result: Extensive experiments demonstrate the framework's efficacy. Conclusion: HOI-IDiff addresses the ambiguity and indeterminacy of HOI detection by generating outputs with image-like diffusion. Abstract: Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across different human-object pairs. Additionally, the indeterminacy can be further exacerbated by issues such as occlusions and cluttered backgrounds. To handle such a challenging task, in this work, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Thus, inspired by the strong image generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff. In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, which are specifically tailored to generate our recast "HOI images". Extensive experiments demonstrate the efficacy of our framework.

Qiao Liang,Yanjiang Liu,Ben He,Yaojie Lu,Hongyu Lin,Jia Zheng,Xianpei Han,Le Sun,Yingfei Sun

Task: Study how the prior knowledge of the vision encoder affects the performance of Multi-modal Large Language Models (MLLMs).

Motivation: Existing research usually treats MLLMs as unified systems optimized through end-to-end training, and the impact of the vision encoder's prior knowledge is seldom explored.

Details

Method: Introduces a new metric, $Rank_e$, to quantify the effect of the vision encoder's prior knowledge on MLLM performance, and proposes VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge. Result: The analysis shows a positive correlation between prior knowledge and MLLM performance, and VisPRE substantially improves the visual understanding of MLLMs. Conclusion: Augmenting the vision encoder's prior knowledge is an effective strategy for improving MLLM performance, especially for uncommon visual entities. Abstract: Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of the vision encoder's prior knowledge is seldom investigated. In this work, we introduce a novel metric, $Rank_e$, to quantify the effect of the vision encoder's prior knowledge on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient, particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting the vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.

MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation

Jiaxin Huang,Runnan Chen,Ziwen Li,Zhengqing Gao,Xiao He,Yandong Guo,Mingming Gong,Tongliang Liu

Task: Extend the 2D image reasoning segmentation capabilities of multimodal large language models (MLLMs) to 3D scene understanding.

Motivation: Although MLLMs perform impressively on 2D image reasoning segmentation, applying these capabilities to 3D scenes remains underexplored.

Details

Method: Proposes the MLLM-For3D framework, which generates multi-view pseudo segmentation masks and text embeddings and projects them into 3D space, and introduces a spatial consistency strategy and a Token-for-Query approach. Result: Without any labeled 3D training data, MLLM-For3D outperforms existing methods on multiple indoor scene benchmarks. Conclusion: MLLM-For3D effectively interprets user intent, understands 3D scenes, and reasons about spatial relationships, showing the potential of 2D-to-3D knowledge transfer. Abstract: Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. While recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation, adapting these capabilities to 3D scenes remains underexplored. In this paper, we introduce MLLM-For3D, a simple yet effective framework that transfers knowledge from 2D MLLMs to 3D scene understanding. Specifically, we utilize MLLMs to generate multi-view pseudo segmentation masks and corresponding text embeddings, then unproject 2D masks into 3D space and align them with the text embeddings. The primary challenge lies in the absence of 3D context and spatial consistency across multiple views, causing the model to hallucinate objects that do not exist and fail to target objects consistently. Training the 3D model with such irrelevant objects leads to performance degradation. To address this, we introduce a spatial consistency strategy to enforce that segmentation masks remain coherent in the 3D space, effectively capturing the geometry of the scene. Moreover, we develop a Token-for-Query approach for multimodal semantic alignment, enabling consistent identification of the same object across different views. Extensive evaluations on various challenging indoor scene benchmarks demonstrate that, even without any labeled 3D training data, MLLM-For3D outperforms existing 3D reasoning segmentation methods, effectively interpreting user intent, understanding 3D scenes, and reasoning about spatial relationships.

(G)I-DLE: Generative Inference via Distribution-preserving Logit Exclusion with KL Divergence Minimization for Constrained Decoding

Hanwool Lee

Task: Propose (G)I-DLE, a new constrained decoding method that uses KL divergence minimization to preserve the conditional probability distribution of autoregressive language models while excluding undesirable tokens.

Motivation: Conventional methods constrain decoding by setting banned tokens' logits to negative infinity, which can distort the conversion from raw logits to posterior probabilities and increase output variance.

Details

Method: (G)I-DLE re-normalizes the allowed token probabilities to minimize this distortion. Result: Experiments on the K2-Eval dataset show that (G)I-DLE both raises mean evaluation scores and substantially reduces the variance of output quality. Conclusion: (G)I-DLE is an effective constrained decoding method that improves output quality without distorting the probability distribution. Abstract: We propose (G)I-DLE, a new approach to constrained decoding that leverages KL divergence minimization to preserve the intrinsic conditional probability distribution of autoregressive language models while excluding undesirable tokens. Unlike conventional methods that naively set banned tokens' logits to $-\infty$, which can distort the conversion from raw logits to posterior probabilities and increase output variance, (G)I-DLE re-normalizes the allowed token probabilities to minimize such distortion. We validate our method on the K2-Eval dataset, specifically designed to assess Korean language fluency, logical reasoning, and cultural appropriateness. Experimental results on Qwen2.5 models (ranging from 1.5B to 14B) demonstrate that (G)I-DLE not only boosts mean evaluation scores but also substantially reduces the variance of output quality.
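As a rough illustration of the re-normalization step only, the sketch below zeroes out banned tokens and rescales the remaining probabilities so that their ratios are preserved. It also checks a standard identity: the KL divergence from this restricted distribution to the original one equals the negative log of the allowed probability mass. The function names, logits, and banned set are illustrative assumptions; the full (G)I-DLE procedure is not specified in the abstract.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def renormalize_allowed(probs, banned):
    """Zero out banned token ids and rescale the rest so they sum to 1."""
    allowed_mass = sum(p for i, p in enumerate(probs) if i not in banned)
    return [0.0 if i in banned else p / allowed_mass for i, p in enumerate(probs)]


def kl_divergence(q, p):
    """KL(q || p), skipping entries where q is zero."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


logits = [2.0, 1.0, 0.5, -1.0]  # toy next-token logits
banned = {1}                    # hypothetical banned token id
p = softmax(logits)
q = renormalize_allowed(p, banned)

# Relative preferences among allowed tokens are unchanged: q[0] / q[2] == p[0] / p[2].
# KL(q || p) equals -log(allowed mass), so less banned mass means less divergence.
print(q, kl_divergence(q, p), -math.log(1.0 - p[1]))
```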

TCFG: Tangential Damping Classifier-free Guidance

Mingi Kwon,Shin seong Kim,Jaeseok Jeong,Yi Ting Hsiao,Youngjung Uh

Task: Propose a geometry-based method that filters the singular vectors of the conditional and unconditional scores to improve classifier-free guidance (CFG) in text-to-image synthesis.

Motivation: The unconditional score, which estimates the transition between adjacent timesteps, may interfere with the trajectory toward the specific condition and degrade the quality and alignment of generated images.

Details

Method: Uses singular value decomposition (SVD) to filter the singular vectors of the conditional and unconditional scores, aligning the unconditional score with the conditional score and refining the sampling trajectory. Result: Improves image quality with negligible additional computation and offers deeper insight into score function behavior in diffusion models. Conclusion: The method refines CFG from a geometric perspective, yielding more accurate and contextually coherent image synthesis. Abstract: Diffusion models have achieved remarkable success in text-to-image synthesis, largely attributed to the use of classifier-free guidance (CFG), which enables high-quality, condition-aligned image generation. CFG combines the conditional score (e.g., text-conditioned) with the unconditional score to control the output. However, the unconditional score is in charge of estimating the transition between manifolds of adjacent timesteps from $x_t$ to $x_{t-1}$, which may inadvertently interfere with the trajectory toward the specific condition. In this work, we introduce a novel approach that leverages a geometric perspective on the unconditional score to enhance CFG performance when conditional scores are available. Specifically, we propose a method that filters the singular vectors of both conditional and unconditional scores using singular value decomposition. This filtering process aligns the unconditional score with the conditional score, thereby refining the sampling trajectory to stay closer to the manifold. Our approach improves image quality with negligible additional computation. We provide deeper insights into the score function behavior in diffusion models and present a practical technique for achieving more accurate and contextually coherent image synthesis.
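For context, the standard CFG combination that the abstract refers to can be sketched as below. This shows only the usual guidance formula, uncond + scale * (cond - uncond), on toy vectors; it is not the SVD-based filtering proposed in the paper, whose details the abstract does not give. The function name and the toy values are assumptions.

```python
def cfg_combine(cond, uncond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy score estimates for a two-dimensional sample (illustrative values only).
cond_score = [1.0, 2.0]
uncond_score = [0.5, 1.0]

# scale = 0 returns the unconditional score, scale = 1 the conditional score,
# and scale > 1 extrapolates past the conditional score along (cond - uncond).
guided = cfg_combine(cond_score, uncond_score, 2.0)
print(guided)
```

Because the unconditional score enters every guided step, any component of it that points away from the condition affects the whole trajectory, which is the interference the abstract sets out to reduce.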

Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

Ziming Wei,Bingqian Lin,Yunshuang Nie,Jiaqi Chen,Shikui Ma,Hang Xu,Xiaodan Liang

Task: Propose a Rewriting-driven AugMentation (RAM) paradigm to address data scarcity in Vision-Language Navigation (VLN).

Motivation: Data scarcity severely limits how well VLN agents generalize to unseen environments; existing methods rely on simulator or web data, which suffer from limited diversity or noise.

Details

Method: Creates new observation-instruction pairs by rewriting human-annotated training data, combining Vision-Language Models (VLMs), Large Language Models (LLMs), and Text-to-Image Generation Models (T2IMs) for observation synthesis and instruction rewriting, and proposes a mixing-then-focusing training strategy. Result: Superior performance and generalization on discrete and continuous environment datasets (R2R, REVERIE, R4R, R2R-CE). Conclusion: The RAM paradigm improves generalization in VLN in a simulator-free and labor-saving manner. Abstract: Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in a simulator-free and labor-saving manner to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.

AGIR: Assessing 3D Gait Impairment with Reasoning based on LLMs

Diwei Wang,Cédric Bobenrieth,Hyewon Seo

Task: Develop AGIR, a pipeline that combines a pre-trained VQ-VAE motion tokenizer with a Large Language Model (LLM) to assess gait impairment in neurodegenerative diseases.

Motivation: Clinical gait assessment is subjective and imprecise, and existing deep learning methods lack interpretability, which limits their use in clinical decision-making.

Details

Method: Proposes the AGIR pipeline, comprising a VQ-VAE motion tokenizer and an LLM, with a two-stage supervised fine-tuning (SFT) strategy that enhances the LLM's motion comprehension and pathology knowledge. Result: Validation shows that AGIR is robust and accurate on an existing dataset, assigning gait impairment scores from motion input with clinically meaningful rationales. Conclusion: AGIR offers an interpretable and accurate solution for gait impairment assessment with potential for clinical use. Abstract: Assessing gait impairment plays an important role in early diagnosis, disease monitoring, and treatment evaluation for neurodegenerative diseases. Despite its widespread use in clinical practice, gait assessment is limited by subjectivity and a lack of precision. While recent deep learning-based approaches have consistently improved classification accuracies, they often lack interpretability, hindering their utility in clinical decision-making. To overcome these challenges, we introduce AGIR, a novel pipeline consisting of a pre-trained VQ-VAE motion tokenizer and a subsequent Large Language Model (LLM) fine-tuned over pairs of motion tokens and Chain-of-Thought (CoT) reasoning. To fine-tune an LLM for pathological gait analysis, we first introduce a multimodal dataset by adding rationales dedicated to MDS-UPDRS gait score assessment to an existing PD gait dataset. We then introduce a two-stage supervised fine-tuning (SFT) strategy to enhance the LLM's motion comprehension with pathology-specific knowledge. This strategy includes: 1) a generative stage that aligns gait motions with analytic descriptions through bidirectional motion-description generation, and 2) a reasoning stage that integrates logical Chain-of-Thought (CoT) reasoning for impairment assessment with the UPDRS gait score. Validation on an existing dataset and comparisons with state-of-the-art methods confirm the robustness and accuracy of our pipeline, demonstrating its ability to assign gait impairment scores from motion input with clinically meaningful rationales.

AgentRxiv: Towards Collaborative Autonomous Research

Samuel Schmidgall,Michael Moor

Task: Develop AgentRxiv, a framework that lets LLM agent laboratories collaborate, share insights, and iteratively improve research through a shared preprint server.

Motivation: Existing agent workflows conduct research in isolation and cannot continuously improve upon prior results, which limits progress in scientific discovery.

Details

Method: Introduces the AgentRxiv framework, which lets agent laboratories upload and retrieve research reports to collaborate and iteratively build on each other's work. Result: Agents using AgentRxiv achieve an 11.4% relative improvement over baseline on MATH-500 and an average improvement of 3.3% on benchmarks in other domains; with multiple laboratories collaborating, the relative improvement on MATH-500 reaches 13.7%. Conclusion: Autonomous agents may help design future AI systems alongside humans, and AgentRxiv could accelerate scientific discovery. Abstract: Progress in scientific discovery is rarely the result of a single "Eureka" moment, but is rather the product of hundreds of scientists incrementally working together toward a common goal. While existing agent workflows are capable of producing research autonomously, they do so in isolation, without the ability to continuously improve upon prior research results. To address these challenges, we introduce AgentRxiv, a framework that lets LLM agent laboratories upload and retrieve reports from a shared preprint server in order to collaborate, share insights, and iteratively build on each other's research. We task agent laboratories to develop new reasoning and prompting techniques and find that agents with access to their prior research achieve higher performance improvements compared to agents operating in isolation (11.4% relative improvement over baseline on MATH-500). We find that the best performing strategy generalizes to benchmarks in other domains (improving on average by 3.3%). Multiple agent laboratories sharing research through AgentRxiv are able to work together towards a common goal, progressing more rapidly than isolated laboratories and achieving higher overall accuracy (13.7% relative improvement over baseline on MATH-500). These findings suggest that autonomous agents may play a role in designing future AI systems alongside humans. We hope that AgentRxiv allows agents to collaborate toward research goals and enables researchers to accelerate discovery.

LocDiffusion: Identifying Locations on Earth by Diffusing in the Hilbert Space

Zhangyu Wang,Jielu Zhang,Zhongliang Zhou,Qian Cao,Nemin Wu,Zeping Liu,Lan Mu,Yang Song,Yiqun Xie,Ni Lao,Gengchen Mai

Task: Perform image geolocalization through a diffusion mechanism.

Motivation: Existing methods degrade significantly when the spatial distribution of test images does not align with their classification or retrieval choices, a limitation that needs to be addressed.

Details

Method: Proposes the Spherical Harmonics Dirac Delta (SHDD) representation and combines it with the CS-UNet architecture and the LocDiffusion model to diffuse geolocation information in a hidden location embedding space. Result: LocDiffusion achieves competitive image geolocalization performance and generalizes significantly better to unseen geolocations. Conclusion: Diffusion combined with the SHDD representation and the CS-UNet architecture offers a novel and efficient solution for image geolocalization. Abstract: Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. Existing methods approach it either via grid-based classification or via image retrieval. Their performance significantly suffers when the spatial distribution of test images does not align with such choices. To address these limitations, we propose to leverage diffusion as a mechanism for image geolocalization. To avoid the problematic manifold reprojection step in diffusion, we developed a novel spherical positional encoding-decoding framework, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking. We call this type of position encoding Spherical Harmonics Dirac Delta (SHDD) Representation. We also propose a novel SirenNet-based architecture called CS-UNet to learn the conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. We train a conditional latent diffusion model called LocDiffusion that generates geolocations under the guidance of images; to the best of our knowledge, it is the first generative model for image geolocalization that diffuses geolocation information in a hidden location embedding space. We evaluate our method against SOTA image geolocalization baselines. LocDiffusion achieves competitive geolocalization performance and demonstrates significantly stronger generalizability to unseen geolocations.

Decoupling Angles and Strength in Low-rank Adaptation

Massimo Bini,Leander Girrbach,Zeynep Akata

Task: Propose DeLoRA, a new parameter-efficient finetuning method that addresses the limited robustness of existing methods to hyperparameter choices and training duration.

Motivation: Existing PEFT methods such as LoRA are unstable under varying hyperparameter choices and extended training, while other methods such as ETHER are more robust but have limited adaptation expressive power.

Details

Method: DeLoRA normalizes and scales learnable low-rank matrices, decoupling angular learning from adaptation strength and thereby improving robustness. Result: On image generation, natural language understanding, and instruction tuning tasks, DeLoRA matches or surpasses existing PEFT methods while being more robust. Conclusion: DeLoRA is an efficient and robust parameter-efficient finetuning method applicable to a variety of tasks. Abstract: Parameter-Efficient FineTuning (PEFT) methods have recently gained significant popularity thanks to the widespread availability of large-scale pretrained models. These methods allow for quick adaptation to downstream tasks with minimal computational cost. However, popular finetuning methods such as LoRA exhibit limited robustness when it comes to hyperparameter choices or extended training regimes, preventing optimal out-of-the-box performance. In contrast, bounded approaches, such as ETHER, provide greater robustness but are limited to extremely low-rank adaptations and fixed-strength transformations, reducing their adaptation expressive power. In this work, we propose Decoupled Low-rank Adaptation (DeLoRA), a novel finetuning method that normalizes and scales learnable low-rank matrices. By bounding the distance of the transformation, DeLoRA effectively decouples the angular learning from the adaptation strength, enhancing robustness without compromising performance. Through evaluations on subject-driven image generation, natural language understanding, and instruction tuning, we show that DeLoRA matches or surpasses the performance of competing PEFT methods, while exhibiting stronger robustness. Code is available at https://github.com/ExplainableML/DeLoRA.
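One plausible reading of "normalizes and scales learnable low-rank matrices" is sketched below. Each rank-one component b_i a_i^T is divided by the product of its vector norms, and the sum is scaled by a single strength value divided by the rank. The abstract does not give the exact formula, so the update rule, the function names, and the toy matrices are all assumptions. Under this form the Frobenius norm of the update can never exceed the strength, and rescaling a vector leaves the update unchanged, which illustrates the decoupling of direction from strength.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def delora_update(b_cols, a_rows, strength):
    """Assumed update: (strength / r) * sum_i b_i a_i^T / (||b_i|| * ||a_i||)."""
    r = len(b_cols)
    d_out, d_in = len(b_cols[0]), len(a_rows[0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for b, a in zip(b_cols, a_rows):
        scale = strength / (r * norm(b) * norm(a))
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += scale * b[i] * a[j]
    return delta


def frobenius(m):
    return math.sqrt(sum(x * x for row in m for x in row))


# Toy rank-2 factors for a 3x2 weight update (illustrative values only).
b_cols = [[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]]
a_rows = [[3.0, 4.0], [1.0, 0.0]]
delta = delora_update(b_cols, a_rows, strength=2.0)

# Each normalized rank-one term has unit Frobenius norm, so the total is bounded by strength.
print(frobenius(delta))
```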

PHT-CAD: Efficient CAD Parametric Primitive Analysis with Progressive Hierarchical Tuning

Ke Niu,Yuwen Chen,Haiyang Yu,Zhuofan Chen,Xianghui Que,Bin Li,Xiangyang Xue

Task: Propose PHT-CAD, a novel 2D parametric primitive analysis framework that addresses structural constraint reasoning and advanced semantic understanding in 2D engineering drawing analysis.

Motivation: 2D Parametric Primitive Analysis (PPA) matters for industrial manufacturing but remains underexplored because of the challenges of structural constraint reasoning and advanced semantic understanding.

Details

Method: Proposes the Efficient Hybrid Parametrization (EHP) representation and develops the PHT-CAD framework, which uses Vision-Language Models (VLMs) for modality alignment and reasoning and introduces four regression heads to predict primitive components; training follows the three-stage Progressive Hierarchical Tuning (PHT) paradigm. Result: Experiments demonstrate the effectiveness of PHT-CAD and the practical significance of the ParaCAD benchmark for advancing 2D PPA research. Conclusion: PHT-CAD and ParaCAD provide a new solution and resource for 2D engineering drawing analysis and advance research in the field. Abstract: Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing, yet 2D Parametric Primitive Analysis (PPA) remains underexplored due to two key challenges: structural constraint reasoning and advanced semantic understanding. To tackle these challenges, we first propose an Efficient Hybrid Parametrization (EHP) for better representing 2D engineering drawings. EHP contains four types of atomic components (i.e., point, line, circle, and arc). Additionally, we propose PHT-CAD, a novel 2D PPA framework that harnesses the modality alignment and reasoning capabilities of Vision-Language Models (VLMs) for precise engineering drawing analysis. In PHT-CAD, we introduce four dedicated regression heads to predict corresponding atomic components. To train PHT-CAD, a three-stage training paradigm, Progressive Hierarchical Tuning (PHT), is proposed to progressively enhance PHT-CAD's capability to perceive individual primitives, infer structural constraints, and align annotation layers with their corresponding geometric representations. Considering that existing datasets lack complete annotation layers and real-world engineering drawings, we introduce ParaCAD, the first large-scale benchmark that explicitly integrates both the geometric and annotation layers. ParaCAD comprises over 10 million annotated drawings for training and 3,000 real-world industrial drawings with complex topological structures and physical constraints for testing. Extensive experiments demonstrate the effectiveness of PHT-CAD and highlight the practical significance of ParaCAD in advancing 2D PPA research.

Bridging Writing Manner Gap in Visual Instruction Tuning by Creating LLM-aligned Instructions

Dong Jing,Nanyi Fei,Zhiwu Lu

Task: Assess how instruction quality in the visual instruction tuning stage affects Large Multi-modal Models (LMMs), and propose a writing-manner alignment method to improve instruction quality.

Motivation: A substantial writing manner gap exists between visual instructions and the base Large Language Models (LLMs), which degrades model performance; aligning the writing manner is expected to improve it.

Details

Method: Directly uses the base LLM to align the writing manner of visual instructions, producing LLM-aligned instructions that narrow the writing manner gap. Result: Experiments show that the method narrows the writing manner gap, and that LLaVA-7B and QwenVL show enhanced resistance to hallucinations and comprehensive improvements across 15 visual and language benchmarks. Conclusion: Aligning the writing manner effectively improves LMM performance and reduces capability degradation. Abstract: In the realm of Large Multi-modal Models (LMMs), the instruction quality during the visual instruction tuning stage significantly influences the performance of modality alignment. In this paper, we assess the instruction quality from a unique perspective termed Writing Manner, which encompasses the selection of vocabulary, grammar, and sentence structure to convey specific semantics. We argue that there exists a substantial writing manner gap between the visual instructions and the base Large Language Models (LLMs) within LMMs. This gap forces the pre-trained base LLMs to deviate from their original writing styles, leading to capability degradation of both base LLMs and LMMs. To bridge the writing manner gap while preserving the original semantics, we propose directly leveraging the base LLM to align the writing manner of soft-format visual instructions with that of the base LLM itself, resulting in novel LLM-aligned instructions. The manual writing manner evaluation results demonstrate that our approach successfully minimizes the writing manner gap. By utilizing LLM-aligned instructions, the baseline models LLaVA-7B and QwenVL demonstrate enhanced resistance to hallucinations and non-trivial comprehensive improvements across all 15 visual and language benchmarks.

LongDiff: Training-Free Long Video Generation in One Go

Zhuoling Li,Hossein Rahmani,Qiuhong Ke,Jun Liu

Task: Propose LongDiff, a training-free method that addresses the difficulty short video generation models have in maintaining temporal consistency and visual details when generating long videos.

Motivation: Existing video diffusion models are designed mainly for short video generation and struggle to maintain temporal consistency and visual details in long videos.

Details

Method: Proposes LongDiff, with two components, Position Mapping (PM) and Informative Frame Selection (IFS), to tackle temporal position ambiguity and information dilution. Result: Experiments show that LongDiff enables off-the-shelf video diffusion models to generate high-quality long videos. Conclusion: LongDiff is a training-free method that substantially improves existing models on long video generation. Abstract: Video diffusion models have recently achieved remarkable results in video generation. Despite their encouraging performance, most of these models are mainly designed and trained for short video generation, leading to challenges in maintaining temporal consistency and visual details in long video generation. In this paper, we propose LongDiff, a novel training-free method consisting of two carefully designed components, Position Mapping (PM) and Informative Frame Selection (IFS), to tackle two key challenges that hinder short-to-long video generation generalization: temporal position ambiguity and information dilution. Our LongDiff unlocks the potential of off-the-shelf video diffusion models to achieve high-quality long video generation in one go. Extensive experiments demonstrate the efficacy of our method.

Solving Situation Puzzles with Large Language Model and External Reformulation

Kun Li,Xinwei Chen,Tianyou Song,Chengrui Zhou,Zhuoran Liu,Zhenyan Zhang,Jiangjian Guo,Qing Shan

Task: Propose an external reformulation method to improve large language models (LLMs) on multi-turn dialogue reasoning tasks.

Motivation: LLMs such as ChatGPT perform poorly on reasoning that requires multiple rounds of dialogue (e.g., situation puzzles), tending to repeat questions or fixate on specific details.

Details

Method: Uses external reformulation, in which the situation puzzle is reformulated after several rounds of Q&A or when the model makes an incorrect guess. Result: Experiments show that the method outperforms directly using LLMs on metrics such as win rate and number of question/guess attempts. Conclusion: Strategic problem reformulation can enhance LLM reasoning in complex interactive scenarios. Abstract: In recent years, large language models (LLMs) have shown an impressive ability to perform arithmetic and symbolic reasoning tasks. However, we found that LLMs (e.g., ChatGPT) cannot perform well on reasoning that requires multiple rounds of dialogue, especially when solving situation puzzles. Specifically, after several rounds of Q&A, LLMs tend to ask very detailed questions focusing on a specific aspect, or to ask the same or similar questions again. To help LLMs get out of this dilemma, we propose a novel external reformulation methodology, where the situation puzzle is reformulated after several rounds of Q&A or when the LLM raises an incorrect guess. Experiments show that our method outperforms directly using LLMs for solving situation puzzles (e.g., in win rate and number of question/guess attempts), highlighting the potential of strategic problem reformulation to enhance the reasoning capabilities of LLMs in complex interactive scenarios.
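The control flow described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `ask`, `judge`, and `reformulate` are hypothetical callables standing in for the player LLM, the puzzle host, and the external rewriter, and the interval of three rounds is an arbitrary choice, not a value from the paper.

```python
def solve_puzzle(puzzle, ask, judge, reformulate, reformulate_every=3, max_rounds=12):
    """Q&A loop that rewrites the puzzle periodically or after a wrong guess.

    ask(statement, history) -> ("question" or "guess", text)   # hypothetical player LLM
    judge(kind, text) -> "yes", "no", "correct", or "incorrect"  # hypothetical host
    reformulate(statement, history) -> new statement            # hypothetical rewriter
    """
    statement, history = puzzle, []
    for round_no in range(1, max_rounds + 1):
        kind, text = ask(statement, history)
        verdict = judge(kind, text)
        history.append((kind, text, verdict))
        if verdict == "correct":
            return True, round_no
        if verdict == "incorrect" or round_no % reformulate_every == 0:
            statement = reformulate(statement, history)
    return False, max_rounds


# Deterministic stand-ins so the sketch runs without any model.
def demo_ask(statement, history):
    # Guess only once the reformulated statement carries confirmed facts.
    if "Known facts" in statement:
        return "guess", "the man had hiccups"
    return "question", f"question {len(history) + 1}"


def demo_judge(kind, text):
    if kind == "guess":
        return "correct" if text == "the man had hiccups" else "incorrect"
    return "yes"


def demo_reformulate(statement, history):
    facts = [text for kind, text, verdict in history if kind == "question" and verdict == "yes"]
    return statement + " Known facts: " + "; ".join(facts)


solved, rounds = solve_puzzle(
    "A man asks a bartender for water, is threatened, and says thank you.",
    demo_ask, demo_judge, demo_reformulate,
)
print(solved, rounds)
```

Keeping the reformulation outside the player model means the rewritten statement, not the growing dialogue history, carries the confirmed facts into later rounds.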

Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes

Kelly O. Marshall,Omid Poursaeed,Sergiu Oprea,Amit Kumar,Anushrut Jignasu,Chinmay Hegde,Yilei Li,Rakesh Ranjan

Task: Propose Decorum, a method for controlling 3D indoor scene generation with natural language.

Motivation: Existing methods offer limited control over scene layout, visual features, and style preferences, and support only simple text inputs.

Details

Method: Adopts language-based representations, uses Large Language Models (LLMs) to model language-to-language mappings, and uses multimodal LLMs for a novel furniture retrieval method. Result: Evaluations on the 3D-FRONT dataset show improvements over existing work in text-conditioned scene synthesis and object retrieval. Conclusion: Natural-language control in Decorum substantially improves the flexibility and quality of 3D indoor scene generation. Abstract: 3D indoor scene generation is an important problem for the design of digital and real-world environments. To automate this process, a scene generation model should be able to not only generate plausible scene layouts, but also take into consideration visual features and style preferences. Existing methods for this task exhibit very limited control over these attributes, only allowing text inputs in the form of simple object-level descriptions or pairwise spatial relationships. Our proposed method Decorum enables users to control the scene generation process with natural language by adopting language-based representations at each stage. This enables us to harness recent advancements in Large Language Models (LLMs) to model language-to-language mappings. In addition, we show that using a text-based representation allows us to select furniture for our scenes using a novel object retrieval method based on multimodal LLMs. Evaluations on the benchmark 3D-FRONT dataset show that our methods achieve improvements over existing work in text-conditioned scene synthesis and object retrieval.

On the Perception Bottleneck of VLMs for Chart Understanding

Junteng Liu,Weihao Zeng,Xiwen Zhang,Yijun Wang,Zifei Shan,Junxian He

Task: Study the perception bottleneck of large vision-language models (LVLMs) in chart understanding.

Motivation: The perception capabilities of existing LVLMs are a critical bottleneck in chart understanding, limiting models' ability to analyze and reason about numerical data, textual elements, and complex visual components.

Details

Method: Decomposes the perception bottleneck into a vision encoder bottleneck and an extraction bottleneck, and enhances the vision encoder under a contrastive learning framework. Result: Experiments show that visual representations contain richer information than linear extractors capture, and that enhancing the vision encoder significantly mitigates the perception bottleneck and improves chart understanding. Conclusion: Improving the vision encoder effectively mitigates the perception bottleneck of LVLMs in chart understanding and improves model performance. Abstract: Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded within visual representations is substantially richer than what is typically captured by linear extractors, such as the widely used retrieval accuracy metric; and (2) while instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck, demanding focused attention and improvement. Therefore, we further enhance the visual encoder to mitigate the vision encoder bottleneck under a contrastive learning framework. Empirical results demonstrate that our approach significantly mitigates the perception bottleneck and improves the ability of LVLMs to comprehend charts. Code is publicly available at https://github.com/hkust-nlp/Vision4Chart.

DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation

Peng Chen,Xiaobao Wei,Ming Lu,Hui Chen,Feng Tian

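Task: Propose DiffusionTalker, which uses personalizer-guided distillation to address personalization, efficiency, and compactness in real-time speech-driven 3D facial animation.

Motivation: Existing diffusion-based methods improve the diversity of facial animation, but they lack personalized speaking styles, are inefficient, and have large model sizes.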

Details

Method: Uses a contrastive personalizer to learn identity and emotion embeddings, and improves efficiency and compactness through a personalizer enhancer and iterative distillation. Result: Achieves more than 8x inference speedup and reduces model storage by 86.4% while minimizing performance loss. Conclusion: DiffusionTalker outperforms existing methods in personalization, efficiency, and compactness; the code will be open-sourced. Abstract: Real-time speech-driven 3D facial animation has been attractive in academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the nondeterministic fact of speech-driven 3D face animation and employ the diffusion model for the task. Existing diffusion-based methods can improve the diversity of facial animation. However, personalized speaking styles conveying accurate lip language are still lacking; in addition, efficiency and compactness still need to be improved. In this work, we propose DiffusionTalker to address the above limitations via personalizer-guided distillation. In terms of personalization, we introduce a contrastive personalizer that learns identity and emotion embeddings to capture speaking styles from audio. We further propose a personalizer enhancer during distillation to enhance the influence of embeddings on facial animation. For efficiency, we use iterative distillation to reduce the steps required for animation generation and achieve more than 8x speedup in inference. To achieve compactness, we distill the large teacher model into a smaller student model, reducing our model's storage by 86.4% while minimizing performance loss. After distillation, users can derive their identity and emotion embeddings from audio to quickly create personalized animations that reflect specific speaking styles. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released at: https://github.com/ChenVoid/DiffusionTalker.

StableGS: A Floater-Free Framework for 3D Gaussian Splatting

Luchao Wang,Qian Ren,Kaiming He,Hua Wang,Zhi Chen,Yaohua Tang

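Task: Propose the StableGS framework to address local minima and floater artifacts in 3D Gaussian Splatting (3DGS) training.

Motivation: 3DGS performs well in novel view synthesis, but its coupled opacity-color optimization during training easily falls into local minima, producing floater artifacts that degrade visual fidelity.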

Details

Method: Introduces cross-view depth consistency constraints and a dual-opacity GS model that decouples geometry and material properties of translucent objects, and integrates DUSt3R depth estimation to improve geometric stability in weakly-textured regions. Result: StableGS significantly improves the training stability of 3DGS and outperforms existing state-of-the-art methods on open-source datasets. Conclusion: StableGS fundamentally addresses the instability of 3DGS training and offers a reliable approach to high-quality novel view synthesis. Abstract: Recent years have witnessed remarkable success of 3D Gaussian Splatting (3DGS) in novel view synthesis, surpassing prior differentiable rendering methods in both quality and efficiency. However, its training process suffers from coupled opacity-color optimization that frequently converges to local minima, producing floater artifacts that degrade visual fidelity. We present StableGS, a framework that eliminates floaters through cross-view depth consistency constraints while introducing a dual-opacity GS model to decouple geometry and material properties of translucent objects. To further enhance reconstruction quality in weakly-textured regions, we integrate DUSt3R depth estimation, significantly improving geometric stability. Our method fundamentally addresses 3DGS training instabilities, outperforming existing state-of-the-art methods across open-source datasets.

MAO: Efficient Model-Agnostic Optimization of Prompt Tuning for Vision-Language Models

Haoyang Li,Siyu Zhou,Liang Wang,Guodong Long

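Task: Improve the efficiency of CLIP-based prompt tuning by proposing a plug-and-play Model-Agnostic Optimization (MAO) method.

Motivation: Existing research usually improves performance by reconstructing the model architecture (e.g., additional loss calculation and meta-networks), which increases complexity and training cost.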

Details

Method: Proposes a Data-Driven Enhancement framework to optimize the distribution of the initial data, and introduces an Alterable Regularization module to strengthen the task-specific feature processing pipeline. Result: MAO significantly improves performance while maintaining low computational cost. Conclusion: MAO is an efficient and high-performing optimization method for prompt tuning. Abstract: Though CLIP-based prompt tuning significantly enhances pre-trained Vision-Language Models, existing research focuses on reconstructing the model architecture, e.g., additional loss calculation and meta-networks. These approaches generally lead to increased complexity and extended training cost. To maintain the efficiency of the tuning process, we propose plug-and-play Model-Agnostic Optimization (MAO) for prompt tuning. Without altering any components of the prompt tuning backbone, we introduce a Data-Driven Enhancement framework to optimize the distribution of the initial data, and incorporate an Alterable Regularization module to boost the task-specific feature processing pipeline, thereby improving overall performance while maintaining low computational cost. Extensive experiments on MAO demonstrate its outstanding performance and efficiency. The code of MAO is available at: https://github.com/JREion/M.A.O .

Global-Local Tree Search for Language Guided 3D Scene Generation

Wei Deng,Mengshi Qi,Huadong Ma

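Task: Generate 3D indoor scenes with a large Vision-Language Model (VLM), treating the task as a planning problem subject to spatial and layout common-sense constraints.

Motivation: There are few studies on 3D indoor scene generation with VLMs, even though VLMs such as GPT-4 have achieved remarkable success across many fields, so exploring their use for this task is promising.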

Details

Method: Proposes a new global-local tree search algorithm that decomposes the scene structure hierarchically (room, region, floor object, and supported object levels), discretizes the top-down view space into a dense grid, and uses the VLM to produce object positions. Result: Quantitative and qualitative experiments show that the method generates more plausible 3D scenes than existing approaches. Conclusion: The method successfully applies VLMs to 3D scene generation and solves the spatial planning problem through tree search and hierarchical decomposition, outperforming existing approaches. Abstract: Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there are few studies on 3D indoor scene generation with VLMs. This paper considers this task as a planning problem subject to spatial and layout common sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places each object sequentially and explores multiple placements during each placement process, where the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically, i.e. room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and supported objects placed on different floor objects. Locally, we also decompose the sub-task, the placement of each object, into multiple steps. The algorithm searches the tree of problem space. To leverage the VLM model to produce positions of objects, we discretize the top-down view space as a dense grid and fill each cell with diverse emojis to make the cells distinct. We prompt the VLM with the emoji grid and the VLM produces a reasonable location for the object by describing the position with the names of the emojis. The quantitative and qualitative experimental results illustrate our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at https://github.com/dw-dengwei/TreeSearchGen .
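The emoji-grid idea in the abstract above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the emoji pool, the grid size, and the helper names `build_emoji_grid` and `cell_center` are hypothetical, and the paper's actual prompt format and grid resolution are not given in the abstract.

```python
from typing import Dict, List, Tuple

# Hypothetical pool of (name, emoji) pairs; the paper does not list its emojis.
EMOJI_POOL: List[Tuple[str, str]] = [
    ("apple", "🍎"), ("car", "🚗"), ("dog", "🐶"), ("moon", "🌙"),
    ("ball", "⚽"), ("balloon", "🎈"), ("cactus", "🌵"), ("piano", "🎹"),
    ("butterfly", "🦋"), ("key", "🔑"), ("pizza", "🍕"), ("rocket", "🚀"),
]


def build_emoji_grid(rows: int, cols: int) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Fill a rows x cols top-down grid with distinct emojis.

    Returns the grid as prompt text and a lookup from emoji name to (row, col).
    """
    if rows * cols > len(EMOJI_POOL):
        raise ValueError("not enough distinct emojis for this grid size")
    lines: List[str] = []
    lookup: Dict[str, Tuple[int, int]] = {}
    for r in range(rows):
        row_symbols = []
        for c in range(cols):
            name, symbol = EMOJI_POOL[r * cols + c]
            row_symbols.append(symbol)
            lookup[name] = (r, c)
        lines.append(" ".join(row_symbols))
    return "\n".join(lines), lookup


def cell_center(cell: Tuple[int, int], room_width: float, room_depth: float,
                rows: int, cols: int) -> Tuple[float, float]:
    """Map a grid cell back to the metric (x, y) center of that cell."""
    r, c = cell
    return ((c + 0.5) * room_width / cols, (r + 0.5) * room_depth / rows)
```

In such a setup, the grid text would be placed in the VLM prompt, and an emoji name in the model's reply (for example "pizza") would be resolved to a cell with `lookup` and then to room coordinates with `cell_center`.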

Self-Attention Diffusion Models for Zero-Shot Biomedical Image Segmentation: Unlocking New Frontiers in Medical Imaging

Abderrachid Hamrani,Anuradha Godavarty

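Task: Propose a zero-shot medical image segmentation method that requires no annotated data.

Motivation: Address the challenges of zero-shot and unsupervised learning in medical image segmentation and reduce reliance on costly annotated data.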

Details

Method: Uses a self-attention diffusion model (ADZUS) that combines generative and discriminative capabilities to perform zero-shot segmentation. Result: Achieves state-of-the-art performance on several medical imaging datasets, with Dice scores from 88.7% to 92.9% and IoU scores from 66.3% to 93.3%. Conclusion: ADZUS demonstrates the potential of zero-shot medical image segmentation but requires substantial computational resources; it may help advance AI-driven medical imaging. Abstract: Producing high-quality segmentation masks for medical images is a fundamental challenge in biomedical image analysis. Recent research has explored large-scale supervised training to enable segmentation across various medical imaging modalities and unsupervised training to facilitate segmentation without dense annotations. However, constructing a model capable of segmenting diverse medical images in a zero-shot manner without any annotations remains a significant hurdle. This paper introduces the Attention Diffusion Zero-shot Unsupervised System (ADZUS), a novel approach that leverages self-attention diffusion models for zero-shot biomedical image segmentation. ADZUS harnesses the intrinsic capabilities of pre-trained diffusion models, utilizing their generative and discriminative potentials to segment medical images without requiring annotated training data or prior domain-specific knowledge. The ADZUS architecture is detailed, with its integration of self-attention mechanisms that facilitate context-aware and detail-sensitive segmentations being highlighted. Experimental results across various medical imaging datasets, including skin lesion segmentation, chest X-ray infection segmentation, and white blood cell segmentation, reveal that ADZUS achieves state-of-the-art performance. Notably, ADZUS reached Dice scores ranging from 88.7% to 92.9% and IoU scores from 66.3% to 93.3% across different segmentation tasks, demonstrating significant improvements in handling novel, unseen medical imagery. It is noteworthy that while ADZUS demonstrates high effectiveness, it demands substantial computational resources and extended processing times. The model's efficacy in zero-shot settings underscores its potential to reduce reliance on costly annotations and seamlessly adapt to new medical imaging tasks, thereby expanding the diagnostic capabilities of AI-driven medical imaging technologies.
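For readers unfamiliar with the two metrics reported above, the following minimal sketch computes the standard Dice and IoU scores for a pair of flattened binary masks. The function name `dice_and_iou` and the flat-list mask representation are assumptions made for illustration; this is not the paper's evaluation code.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Return (Dice, IoU) for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treat as a perfect match by convention.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou
```

On the same pair of masks the two scores are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. The Dice and IoU ranges quoted in the abstract span different tasks, so their endpoints need not correspond to one another.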

Junyuan Gao,Jiahe Song,Jiang Wu,Runchuan Zhu,Guanlin Shen,Shasha Wang,Xingjian Wei,Haote Yang,Songyang Zhang,Weijia Li,Bin Wang,Dahua Lin,Lijun Wu,Conghui He

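Task: Propose PM4Bench, a Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Models (LVLMs).

Motivation: Address the limitations of existing multilingual benchmarks: language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation.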

Details

Method: Uses a parallel corpus design covering 10 languages and embeds text and queries in images, requiring models to "see", "read", and "think" simultaneously. Result: Evaluating 11 mainstream LVLMs reveals significant cross-linguistic performance disparities, particularly in vision settings, with OCR capability as a key factor. Conclusion: PM4Bench fills gaps in existing benchmarks and exposes the cross-lingual performance problems of LVLMs. Abstract: Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It includes the vision setting where text and queries are embedded in images, requiring LVLMs to simultaneously "see", "read", and "think", aligning with real-world applications. Additionally, PM4Bench incorporates safety evaluations, addressing a critical oversight in existing multilingual benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing significant cross-linguistic performance disparities, particularly in vision settings, and identifying OCR capability as a key determinant of these imbalances. We will release PM4Bench at https://github.com/opendatalab/PM4Bench .

Training A Neural Network For Partially Occluded Road Sign Identification In The Context Of Autonomous Vehicles

Gulnaz Gimaletdinova,Dim Shaiakhmetov,Madina Akpaeva,Mukhammadmuso Abduzhabbarov,Kadyrmamat Momunov

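Task: Study how partial occlusion affects traffic sign recognition.

Motivation: The growth of autonomous vehicles and computer vision underscores the importance of accurate traffic sign recognition, but partial occlusion makes the recognition task considerably harder.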

Details

Method: Collects a dataset of 5,746 images containing both fully visible and partially occluded traffic signs, and compares a custom CNN with transfer-learning models. Result: The custom CNN reaches 96% accuracy and VGG16 with full layer unfreezing reaches 99%. Models trained only on fully visible signs lose effectiveness on occluded signs. Conclusion: Training sets need to include partially occluded data to ensure robustness in complex scenarios and improve the safety of autonomous driving. Abstract: The increasing number of autonomous vehicles and the rapid development of computer vision technologies underscore the particular importance of conducting research on the accuracy of traffic sign recognition. Numerous studies in this field have already achieved significant results, demonstrating high effectiveness in addressing traffic sign recognition tasks. However, the task becomes considerably more complex when a sign is partially obscured by surrounding objects, such as tree branches, billboards, or other elements of the urban environment. In our study, we investigated how partial occlusion of traffic signs affects their recognition. For this purpose, we collected a dataset comprising 5,746 images, including both fully visible and partially occluded signs, and made it publicly available. Using this dataset, we compared the performance of our custom convolutional neural network (CNN), which achieved 96% accuracy, with models trained using transfer learning. The best result was obtained by VGG16 with full layer unfreezing, reaching 99% accuracy. Additional experiments revealed that models trained solely on fully visible signs lose effectiveness when recognizing occluded signs. This highlights the critical importance of incorporating real-world data with partial occlusion into training sets to ensure robust model performance in complex practical scenarios and to enhance the safety of autonomous driving.

Safeguarding Mobile GUI Agent via Logic-based Action Verification

Jungjae Lee,Dongjae Lee,Chihun Choi,Youngmin Im,Jaeyoung Wi,Kihong Heo,Sangeun Oh,Sunjae Lee,Insik Shin

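Task: Develop VeriSafe Agent (VSA), a formal verification system that ensures the actions of mobile GUI agents strictly align with user intent.

Motivation: Large Foundation Models (LFMs) used in mobile GUI agents are unreliable and error-prone, so a formal verification method is needed to improve their reliability.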

Details

Method: VSA uses an autoformalization technique to translate natural language instructions into a formally verifiable specification, expressed in a domain-specific language (DSL), and performs runtime rule-based verification. Result: On 300 user instructions across 18 mobile apps, VSA verifies agent actions with 94.3%-98.33% accuracy, a 20.4%-25.6% improvement over existing methods, and raises the task completion rate by 90%-130%. Conclusion: VSA is the first to bring formal verification to GUI agents, significantly improving the reliability and task completion rate of LFM-driven automation. Abstract: Large Foundation Models (LFMs) have unlocked new possibilities in human-computer interaction, particularly with the rise of mobile Graphical User Interface (GUI) Agents capable of interpreting GUIs. These agents promise to revolutionize mobile computing by allowing users to automate complex mobile tasks through simple natural language instructions. However, the inherent probabilistic nature of LFMs, coupled with the ambiguity and context-dependence of mobile tasks, makes LFM-based automation unreliable and prone to errors. To address this critical challenge, we introduce VeriSafe Agent (VSA): a formal verification system that serves as a logically grounded safeguard for Mobile GUI Agents. VSA is designed to deterministically ensure that an agent's actions strictly align with user intent before conducting an action. At its core, VSA introduces a novel autoformalization technique that translates natural language user instructions into a formally verifiable specification, expressed in our domain-specific language (DSL). This enables runtime, rule-based verification, allowing VSA to detect and prevent erroneous actions before executing an action, either by providing corrective feedback or halting unsafe behavior. To the best of our knowledge, VSA is the first attempt to bring the rigor of formal verification to GUI agents, effectively bridging the gap between LFM-driven automation and formal software verification. We implement VSA using off-the-shelf LLM services (GPT-4o) and evaluate its performance on 300 user instructions across 18 widely used mobile apps. The results demonstrate that VSA achieves 94.3%-98.33% accuracy in verifying agent actions, representing a significant 20.4%-25.6% improvement over existing LLM-based verification methods, and consequently increases the GUI agent's task completion rate by 90%-130%.

SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction

Zhengyuan Li,Kai Cheng,Anindita Ghosh,Uttaran Bhattacharya,Liangyan Gui,Aniket Bera

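Task: Propose a text-driven 3D human motion editing method based on a multi-task training paradigm that incorporates a motion similarity prediction task.

Motivation: Existing methods struggle to achieve precise control in motion editing, leading to misalignment between motion semantics and language instructions.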

Details

Method: Adopts a multi-task training paradigm that combines motion editing with motion similarity prediction, and designs a Diffusion-Transformer-based architecture. Result: Experiments show state-of-the-art performance in both editing alignment and fidelity. Conclusion: Multi-task training and the new architecture significantly improve text-driven 3D motion editing. Abstract: Text-based 3D human motion editing is a critical yet challenging task in computer vision and graphics. While training-free approaches have been explored, the recent release of the MotionFix dataset, which includes source-text-motion triplets, has opened new avenues for training, yielding promising results. However, existing methods struggle with precise control, often leading to misalignment between motion semantics and language instructions. In this paper, we introduce a related task, motion similarity prediction, and propose a multi-task training paradigm, where we train the model jointly on motion editing and motion similarity prediction to foster the learning of semantically meaningful representations. To complement this task, we design an advanced Diffusion-Transformer-based architecture that separately handles motion similarity prediction and motion editing. Extensive experiments demonstrate the state-of-the-art performance of our approach in both editing alignment and fidelity.

Verbal Process Supervision Elicits Better Coding Agents

Hao-Yuan Chen,Cheng-Pong Huang,Jui-Ming Yao

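Task: Propose CURA, a code understanding and reasoning agent system that combines language models with a reasoning architecture to solve complex software engineering tasks.

Motivation: Although large language models have made significant progress in code generation, complex software engineering tasks remain challenging and call for stronger reasoning capabilities.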

Details

Method: Introduces CURA, which combines language models with a reasoning architecture and is enhanced with verbal process supervision (VPS). Result: CURA improves on baseline models by 3.65% on challenging benchmarks such as BigCodeBench, and reaches state-of-the-art performance when paired with the o3-mini model and VPS. Conclusion: CURA shows the potential of combining reasoning-driven architectures with LLM-based code generation, offering a new approach to complex software engineering tasks. Abstract: The emergence of large language models and their applications as AI agents have significantly advanced state-of-the-art code generation benchmarks, transforming modern software engineering tasks. However, even with test-time computed reasoning models, these systems still struggle with complex software engineering challenges. This work introduces CURA, a code understanding and reasoning agent system enhanced with verbal process supervision (VPS), achieving a 3.65% improvement over baseline models on challenging benchmarks like BigCodeBench. Furthermore, CURA, when paired with the o3-mini model and VPS techniques, attains state-of-the-art performance. This work represents a step forward in integrating reasoning-driven architectures with LLM-based code generation, enabling agentic reasoning for language models to solve complex software engineering tasks.

MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps

Valentin Gabeff,Haozhe Qi,Brendan Flaherty,Gencer Sumbül,Alexander Mathis,Devis Tuia

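Task: Present the MammAlps dataset for wildlife behavior monitoring and design two complementary benchmark tasks: multimodal animal behavior recognition and ecology-oriented long-term event analysis.

Motivation: The lack of annotated video datasets limits the development of video understanding models for wildlife behavior monitoring; closer integration of machine learning and ecology is needed.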

Details

Method: Builds the MammAlps dataset from multimodal data (video, audio, 2D segmentation maps, and individual tracks) collected by 9 camera traps, and designs a hierarchical multimodal behavior recognition benchmark and an ecology-oriented benchmark. Result: MammAlps contains over 14 hours of video with audio and 8.5 hours of densely labeled individual tracks, proposes two complementary benchmark tasks, and releases code and data. Conclusion: The MammAlps dataset and benchmarks help bridge the gap between machine learning and ecology and advance research on wildlife behavior monitoring. Abstract: Monitoring wildlife is essential for ecology and ethology, especially in light of the increasing human impact on ecosystems. Camera traps have emerged as habitat-centric sensors enabling the study of wildlife populations at scale with minimal disturbance. However, the lack of annotated video datasets limits the development of powerful video understanding models needed to process the vast amount of fieldwork data collected. To advance research in wild animal behavior monitoring, we present MammAlps, a multimodal and multi-view dataset of wildlife behavior monitoring from 9 camera-traps in the Swiss National Park. MammAlps contains over 14 hours of video with audio, 2D segmentation maps and 8.5 hours of individual tracks densely labeled for species and behavior. Based on 6135 single animal clips, we propose the first hierarchical and multimodal animal behavior recognition benchmark using audio, video and reference scene segmentation maps as inputs. Furthermore, we also propose a second ecology-oriented benchmark aiming at identifying activities, species, number of individuals and meteorological conditions from 397 multi-view and long-term ecological events, including false positive triggers. We advocate that both tasks are complementary and contribute to bridging the gap between machine learning and ecology. Code and data are available at: https://github.com/eceo-epfl/MammAlps

Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models

Bin Li,Dehong Gao,Yeyuan Wang,Linbo Jin,Shanqing Yu,Xiaoyan Cai,Libin Yang

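Task: Propose an Instruction-Aligned Visual Attention (IAVA) approach to reduce hallucinations produced by Large Vision-Language Models (LVLMs) when describing images.

Motivation: LVLMs tend to hallucinate when describing images, generating answers that include non-existent objects, because they over-focus on irrelevant image tokens.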

Details

Method: Identifies irrelevant tokens by comparing changes in attention weights under two different instructions, and uses contrastive decoding to dynamically adjust the logits from the original image tokens and the irrelevant tokens. Result: IAVA outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA and effectively reduces object hallucinations. Conclusion: By reducing over-attention to irrelevant information, IAVA significantly mitigates hallucinations in LVLMs. Abstract: Despite the significant success of Large Vision-Language Models (LVLMs), these models still suffer hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question and distort the output. To address this, we propose an Instruction-Aligned Visual Attention (IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions. By applying contrastive decoding, we dynamically adjust the logits generated from original image tokens and irrelevant image tokens, reducing the model's over-attention to irrelevant information. The experimental results demonstrate that IAVA consistently outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA in mitigating object hallucinations. Our IAVA approach is available online at https://github.com/Lee-lab558/IAVA.
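The contrastive decoding step mentioned above can be sketched in its commonly used form, where logits from a "distracting" forward pass are subtracted from the original logits. This is a hedged illustration only: the weighting `(1 + alpha) * original - alpha * irrelevant`, the parameter `alpha`, and the function names are assumptions borrowed from the general contrastive decoding literature, and the abstract does not state that IAVA uses exactly this formula.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original: List[float], irrelevant: List[float],
                       alpha: float = 1.0) -> List[float]:
    """Amplify what the original pass predicts and penalize what the
    irrelevant-token pass predicts (assumed weighting, see lead-in)."""
    if len(original) != len(irrelevant):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [(1.0 + alpha) * o - alpha * i for o, i in zip(original, irrelevant)]
```

Tokens that score highly only because of irrelevant image regions are pushed down, while tokens supported by the full image are pushed up; `softmax` then turns the adjusted logits into a next-token distribution.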

PG-SAM: Prior-Guided SAM with Medical for Multi-organ Segmentation

Yiheng Zhong,Zihong Luo,Chengzhi Liu,Feilong Tang,Zelin Peng,Ming Hu,Yingzhen Hu,Jionglong Su,Zongyuan Ge,Imran Razzak

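Task: Improve the accuracy and robustness of medical image segmentation with a fine-grained modality prior aligner.

Motivation: The accuracy and robustness of SAM drop significantly on medical image segmentation. Existing methods provide more detailed priors through modality fusion, but text granularity and the domain gap limit the accuracy of those priors.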

Details

Method: Proposes Prior-Guided SAM (PG-SAM), which uses a fine-grained modality prior aligner and text from a medical LLM, together with multi-level feature fusion and an iterative mask optimizer. Result: PG-SAM achieves state-of-the-art performance on the Synapse dataset. Conclusion: PG-SAM improves medical image segmentation through fine-grained modality alignment and high-quality semantic information. Abstract: Segment Anything Model (SAM) demonstrates powerful zero-shot capabilities; however, its accuracy and robustness significantly decrease when applied to medical image segmentation. Existing methods address this issue through modality fusion, integrating textual and image information to provide more detailed priors. In this study, we argue that the granularity of text and the domain gap affect the accuracy of the priors. Furthermore, the discrepancy between high-level abstract semantics and pixel-level boundary details in images can introduce noise into the fusion process. To address this, we propose Prior-Guided SAM (PG-SAM), which employs a fine-grained modality prior aligner to leverage specialized medical knowledge for better modality alignment. The core of our method lies in efficiently addressing the domain gap with fine-grained text from a medical LLM. Meanwhile, it also enhances the priors' quality after modality alignment, ensuring more accurate segmentation. In addition, our decoder enhances the model's expressive capabilities through multi-level feature fusion and iterative mask optimizer operations, supporting unprompted learning. We also propose a unified pipeline that effectively supplies high-quality semantic information to SAM. Extensive experiments on the Synapse dataset demonstrate that the proposed PG-SAM achieves state-of-the-art performance. Our anonymous code is released at https://github.com/logan-0623/PG-SAM.

Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures

Abdoul Majid O. Thiombiano,Brahim Hnich,Ali Ben Mrad,Mohamed Wiem Mkaouer

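Task: Propose Distil-xLSTM, an xLSTM-based Small Language Model that is compute and scale efficient, trained by distilling knowledge from a Large Language Model (LLM).

Motivation: NLP is currently dominated by Transformer models, but new recurrent architectures such as xLSTM and Mamba sometimes outperform attention-based models, which motivates exploring their potential.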

Details

Method: Uses the recurrent sequence mixing components of xLSTM to approximate the attention parametrization of a Transformer model, and trains the small model through knowledge distillation. Result: Distil-xLSTM shows good performance while being compute and scale efficient. Conclusion: Distil-xLSTM demonstrates the potential of recurrent models for NLP tasks, especially in resource-constrained settings. Abstract: The current era of Natural Language Processing (NLP) is dominated by Transformer models. However, novel architectures relying on recurrent mechanisms, such as xLSTM and Mamba, have been proposed as alternatives to attention-based models. Although computation is done differently than with the attention mechanism, these recurrent models yield good results and sometimes even outperform state-of-the-art attention-based models. In this work, we propose Distil-xLSTM, an xLSTM-based Small Language Model (SLM) trained by distilling knowledge from a Large Language Model (LLM) that shows promising results while being compute and scale efficient. Our Distil-xLSTM focuses on approximating a transformer-based model attention parametrization using its recurrent sequence mixing components and shows good results with minimal training.
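Knowledge distillation, as referenced above, is often implemented with a temperature-scaled KL divergence between teacher and student output distributions. The sketch below shows that generic soft-label formulation only; the abstract does not specify Distil-xLSTM's actual loss, and the function names and the default `temperature` are assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions,
    scaled by temperature**2 as in the classic soft-label recipe."""
    eps = 1e-12  # guards against log(0) if a probability underflows
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(max(t, eps)) - math.log(max(s, eps)))
        for t, s in zip(p_teacher, p_student)
    )
    return kl * temperature ** 2
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative ranking of unlikely tokens rather than only its top prediction.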

CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation

Jungsoo Lee,Debasmit Das,Munawar Hayat,Sungha Choi,Kyuwoong Hwang,Fatih Porikli

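Task: Propose CustomKD, a novel knowledge distillation method that uses large vision foundation models (LVFMs) to improve the performance of edge models (e.g., MobileNetV3).

Motivation: Although LVFMs such as DINOv2 and CLIP have great potential for knowledge distillation, their use for improving edge models remains underexplored. The main challenge is the discrepancy in model capacity and architecture.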

Details

Method: CustomKD customizes the well-generalized features of LVFMs to a given student model to reduce model discrepancy, aligning teacher features with student features so that the student can more easily absorb them. Result: CustomKD significantly improves edge models in scenarios with unlabeled data, such as unsupervised domain adaptation and semi-supervised learning, achieving new state-of-the-art results. Conclusion: CustomKD is a simple yet effective knowledge distillation method that significantly improves edge model performance, especially in scenarios with unlabeled data. Abstract: We propose a novel knowledge distillation approach, CustomKD, that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models (e.g., MobileNetV3). Despite recent advancements in LVFMs, such as DINOv2 and CLIP, their potential in knowledge distillation for enhancing edge models remains underexplored. While knowledge distillation is a promising approach for improving the performance of edge models, the discrepancy in model capacities and heterogeneous architectures between LVFMs and edge models poses a significant challenge. Our observation indicates that although utilizing larger backbones (e.g., ViT-S to ViT-L) in teacher models improves their downstream task performances, the knowledge distillation from the large teacher models fails to bring as much performance gain for student models as for teacher models due to the large model discrepancy. Our simple yet effective CustomKD customizes the well-generalized features inherent in LVFMs to a given student model in order to reduce model discrepancies. Specifically, beyond providing well-generalized original knowledge from teachers, CustomKD aligns the features of teachers to those of students, making it easy for students to understand and overcome the large model discrepancy overall. CustomKD significantly improves the performances of edge models in scenarios with unlabeled data such as unsupervised domain adaptation (e.g., OfficeHome and DomainNet) and semi-supervised learning (e.g., CIFAR-100 with 400 labeled samples and ImageNet with 1% labeled samples), achieving the new state-of-the-art performances.

Dense Retrieval for Low Resource Languages -- the Case of Amharic Language

Tilahun Yeshambel,Moncef Garouani,Serge Molina,Josiane Mothe

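Task: Examine the difficulties and results of using dense retrievers on Amharic, a low-resource language.

Motivation: Amharic is a low-resource language with 120 million speakers, so studying the challenges and outcomes of information retrieval for it is important.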

Details

Method: Describes the efforts made and difficulties faced by Addis Ababa University in Amharic information retrieval. Result: Reports some results of using dense retrievers on Amharic. Conclusion: Summarizes the challenges and potential outcomes of applying dense retrievers to low-resource languages. Abstract: This paper reports some difficulties and some results when using dense retrievers on Amharic, a low-resource language spoken by 120 million people. The efforts made and difficulties faced by Addis Ababa University toward Amharic information retrieval will be developed during the presentation.

Surface-Aware Distilled 3D Semantic Features

Lukas Uzolas,Elmar Eisemann,Petr Kellnhofer

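Task: Learn a surface-aware embedding space to resolve semantic ambiguity in 3D shape correspondence matching.

Motivation: Existing methods that match semantic features from pre-trained vision models struggle to distinguish instances of the same semantic class (such as "left hand" versus "right hand"), which leads to substantial mapping errors.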

Details

Method: Proposes a self-supervised method that learns a surface-aware embedding space with a contrastive loss, preserving the semantic content of the pre-trained model while disambiguating features located far apart on the surface. Result: Performs strongly on correspondence matching benchmarks and supports downstream applications such as part segmentation, pose alignment, and motion transfer. Conclusion: The method needs only a small number of unpaired training meshes to infer features for new 3D shapes, resolves the semantic ambiguity, and improves performance on 3D tasks. Abstract: Many 3D tasks such as pose alignment, animation, motion transfer, and 3D reconstruction rely on establishing correspondences between 3D shapes. This challenge has recently been approached by matching of semantic features from pre-trained vision models. However, despite their power, these features struggle to differentiate instances of the same semantic class such as "left hand" versus "right hand" which leads to substantial mapping errors. To solve this, we learn a surface-aware embedding space that is robust to these ambiguities. Importantly, our approach is self-supervised and requires only a small number of unpaired training meshes to infer features for new 3D shapes at test time. We achieve this by introducing a contrastive loss that preserves the semantic content of the features distilled from foundational models while disambiguating features located far apart on the shape's surface. We observe superior performance in correspondence matching benchmarks and enable downstream applications including in-part segmentation, pose alignment, and motion transfer. The project site is available at https://lukas.uzolas.com/SurfaceAware3DFeaturesSite.

AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

Haoyu Wang,Christopher M. Poskitt,Jun Sun

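Task: Propose AgentSpec, a lightweight domain-specific language for specifying and enforcing runtime constraints on LLM agents.

Motivation: The autonomy of LLM agents introduces safety risks, and existing mitigation methods fall short in robustness, interpretability, and adaptability.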

Details

Method: Designs the AgentSpec language, in which users define structured rules (triggers, predicates, and enforcement mechanisms) that keep agents within predefined safety boundaries. Result: AgentSpec's adaptability and effectiveness are validated across code execution, embodied agents, and autonomous driving: it prevents over 90% of unsafe code executions, eliminates all hazardous actions by embodied agents, and enforces 100% compliance in autonomous driving. Conclusion: By combining interpretability, modularity, and efficiency, AgentSpec provides a practical and scalable solution for LLM agent safety. Abstract: Agents built on LLMs are increasingly deployed across diverse domains, automating complex decision-making and task execution. However, their autonomy introduces safety risks, including security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigation methods, such as model-based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. To address these challenges, we propose AgentSpec, a lightweight domain-specific language for specifying and enforcing runtime constraints on LLM agents. With AgentSpec, users define structured rules that incorporate triggers, predicates, and enforcement mechanisms, ensuring agents operate within predefined safety boundaries. We implement AgentSpec across multiple domains, including code execution, embodied agents, and autonomous driving, demonstrating its adaptability and effectiveness. Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs). Despite its strong safety guarantees, AgentSpec remains computationally lightweight, with overheads in milliseconds. By combining interpretability, modularity, and efficiency, AgentSpec provides a practical and scalable solution for enforcing LLM agent safety across diverse applications. We also automate the generation of rules using LLMs and assess their effectiveness. Our evaluation shows that the rules generated by OpenAI o1 achieve a precision of 95.56% and recall of 70.96% for embodied agents, successfully identifying 87.26% of the risky code, and prevent AVs from breaking laws in 5 out of 8 scenarios.
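The trigger / predicate / enforcement structure described above can be illustrated with a small Python sketch. This is not AgentSpec's actual syntax or API: the `Rule` dataclass, the `check_action` function, the enforcement labels, and the example rule are all hypothetical, chosen only to show how a runtime check of this shape could work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Action = Dict[str, str]  # hypothetical action record, e.g. {"type": ..., "command": ...}


@dataclass
class Rule:
    name: str
    trigger: str                         # action type that activates the rule
    predicate: Callable[[Action], bool]  # True when the action is unsafe
    enforcement: str                     # hypothetical labels: "block" or "ask_user"


def check_action(action: Action, rules: List[Rule]) -> str:
    """Return the enforcement of the first rule that fires, else "allow"."""
    for rule in rules:
        if rule.trigger == action.get("type") and rule.predicate(action):
            return rule.enforcement
    return "allow"


# Hypothetical example rule: stop recursive deletes issued through a shell tool.
EXAMPLE_RULES = [
    Rule(
        name="no_recursive_delete",
        trigger="shell",
        predicate=lambda a: "rm -rf" in a.get("command", ""),
        enforcement="block",
    ),
]
```

Calling `check_action` before each tool invocation keeps the check outside the model itself, which is one plausible reason such rule-based enforcement can stay interpretable and cheap at runtime.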

Minh-Tuan Tran,Trung Le,Xuan-May Le,Thanh-Toan Do,Dinh Phung

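Task: Propose NRR-DD, a dataset distillation method that balances instance-specific features and class-general features.

Motivation: Existing methods fail to balance instance-specific and class-general features, which limits model performance.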

Details

Method: Uses Non-Critical Region Refinement (NRR-DD) to preserve instance-specific details while enriching non-critical regions with class-general information, and introduces Distance-Based Representative (DBR) knowledge transfer. Result: NRR-DD achieves state-of-the-art performance on both small- and large-scale datasets, and delivers comparable results while storing only two distances per instance. Conclusion: NRR-DD effectively balances the two feature types, improves model performance, and is efficient. Abstract: Dataset distillation has become a popular method for compressing large datasets into smaller, more efficient representations while preserving critical information for model training. Data features are broadly categorized into two types: instance-specific features, which capture unique, fine-grained details of individual examples, and class-general features, which represent shared, broad patterns across a class. However, previous approaches often struggle to balance these features: some focus solely on class-general patterns, neglecting finer instance details, while others prioritize instance-specific features, overlooking the shared characteristics essential for class-level understanding. In this paper, we introduce the Non-Critical Region Refinement Dataset Distillation (NRR-DD) method, which preserves instance-specific details and fine-grained regions in synthetic data while enriching non-critical regions with class-general information. This approach enables models to leverage all pixel information, capturing both feature types and enhancing overall performance. Additionally, we present Distance-Based Representative (DBR) knowledge transfer, which eliminates the need for soft labels in training by relying on the distance between synthetic data predictions and one-hot encoded labels. Experimental results show that NRR-DD achieves state-of-the-art performance on both small- and large-scale datasets. Furthermore, by storing only two distances per instance, our method delivers comparable results across various settings. The code is available at https://github.com/tmtuan1307/NRR-DD.

ArchSeek: Retrieving Architectural Case Studies Using Vision-Language Models

Danrui Li,Yichao Shi,Yaluo Wang,Ziying Shi,Mubbasir Kapadia

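Task: Develop ArchSeek, a case study search system for architectural design professionals that supports text and image queries as well as recommendations.

Motivation: Traditional text-based search tools struggle to capture the visual and complex nature of architectural knowledge, which makes search slow and imprecise.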

Details

Method: Use vision-language models and cross-modal embeddings to enable text and image queries with fine-grained control, along with interaction-based design case recommendations. Result: ArchSeek gives architects a more efficient, personalized way to discover design inspiration, with potential applications in other visually driven design fields. Conclusion: By combining visual understanding with recommendation capability, ArchSeek significantly improves the efficiency and precision of architectural case study search. Abstract: Efficiently searching for relevant case studies is critical in architectural design, as designers rely on precedent examples to guide or inspire their ongoing projects. However, traditional text-based search tools struggle to capture the inherently visual and complex nature of architectural knowledge, often leading to time-consuming and imprecise exploration. This paper introduces ArchSeek, an innovative case study search system with recommendation capability, tailored for architecture design professionals. Powered by the visual understanding capabilities of vision-language models and cross-modal embeddings, it enables text and image queries with fine-grained control, and interaction-based design case recommendations. It offers architects a more efficient, personalized way to discover design inspirations, with potential applications across other visually driven design fields. The source code is available at https://github.com/danruili/ArchSeek.
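
Retrieval over cross-modal embeddings, as described above, usually reduces to ranking stored vectors by similarity to a query vector. The sketch below assumes that text and image inputs have already been mapped into one shared embedding space by some vision-language encoder; the three-dimensional toy vectors, the case names, and the `search` function are illustrative assumptions, not ArchSeek's implementation.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: List[float],
           cases: Dict[str, List[float]],
           top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank stored case embeddings by similarity to the query embedding."""
    scored = [(name, cosine(query, vec)) for name, vec in cases.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy embeddings standing in for encoder outputs (hypothetical values).
CASES = {
    "brick_library": [0.9, 0.1, 0.0],
    "glass_tower": [0.0, 0.2, 0.9],
    "timber_pavilion": [0.6, 0.6, 0.1],
}
```

Because both query types land in the same vector space, the same `search` call would serve a text query and an image query alike.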

TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model

Cheng Yang,Yang Sui,Jinqi Xiao,Lingyi Huang,Yu Gong,Chendi Li,Jinghua Yan,Yu Bai,Ponnuswamy Sadayappan,Xia Hu,Bo Yuan

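Task: Propose TopV, a visual token pruning method that improves the inference efficiency and memory usage of vision-language models (VLMs).

Motivation: Existing visual token pruning methods rely on heuristic criteria and are incompatible with FlashAttention and the KV cache, which limits their practical use.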

Details

Method: Formulate token pruning as an optimization problem and introduce a visual-aware cost function (feature similarity, relative spatial distance, and absolute central distance) to identify important tokens while remaining compatible with FlashAttention. Result: Experiments show that TopV outperforms existing methods in both performance and efficiency and significantly reduces KV cache size. Conclusion: TopV is an efficient visual token pruning method that requires no additional training, is compatible with existing techniques, and significantly improves inference efficiency. Abstract: Vision-Language Models (VLMs) demand substantial computational resources during inference, largely due to the extensive visual input tokens for representing visual information. Previous studies have noted that visual tokens tend to receive less attention than text tokens, suggesting their lower importance during inference and potential for pruning. However, their methods encounter several challenges: reliance on greedy heuristic criteria for token importance and incompatibility with FlashAttention and KV cache. To address these issues, we introduce TopV, a compatible TOken Pruning with inference Time Optimization for fast and low-memory VLM, achieving efficient pruning without additional training or fine-tuning. Instead of relying on attention scores, we formulate token pruning as an optimization problem, accurately identifying important visual tokens while remaining compatible with FlashAttention. Additionally, since we only perform this pruning once during the prefilling stage, it effectively reduces KV cache size. Our optimization framework incorporates a visual-aware cost function considering factors such as Feature Similarity, Relative Spatial Distance, and Absolute Central Distance, to measure the importance of each source visual token, enabling effective pruning of low-importance tokens. Extensive experiments demonstrate that our method outperforms previous token pruning methods, validating the effectiveness and efficiency of our approach.

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache

Dayou Du,Shijie Cao,Jianyi Cheng,Ting Cao,Mao Yang

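Task: Propose BitDecoding, a GPU-optimized framework that addresses the inefficiency of decoding with a low-bit KV cache.

Motivation: The wide adoption of long-context large language models (LLMs) increases the memory and compute cost of the KV cache in autoregressive decoding. Low-bit quantization reduces memory overhead, but existing implementations fail to deliver the expected speedup because of quantization and dequantization overheads and underuse of Tensor Cores.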

Details

Method: BitDecoding uses a Tensor Cores-Centric BitFusion Scheme to ensure data layout compatibility, combined with an efficient parallel decoding kernel and a fine-grained asynchronous pipeline, to speed up decoding with a low-bit KV cache. Result: It achieves speedups of up to 7.5x, 4.8x, and 8.9x on RTX 4090, A100, and H100, respectively, over FP16 FlashDecoding-v2, outperforms QServe, and reduces single-batch decoding latency by 3x on LLaMA-3.1-8B. Conclusion: BitDecoding significantly improves decoding efficiency in long-context generation and offers an efficient path to practical use of low-bit KV caches. Abstract: The growing adoption of long-context Large Language Models (LLMs) has introduced significant memory and computational challenges in autoregressive decoding due to the expanding Key-Value (KV) cache. KV cache quantization has emerged as a promising solution, with prior work showing that 4-bit or even 2-bit quantization can maintain model accuracy while reducing memory costs. However, despite these benefits, preliminary implementations for the low-bit KV cache struggle to deliver the expected speedup due to quantization and dequantization overheads and the lack of Tensor Cores utilization. In this work, we propose BitDecoding, a GPU-optimized framework that unlocks Tensor Cores for efficient decoding with low-bit KV cache. Efficiently leveraging Tensor Cores for low-bit KV cache is challenging due to the dynamic nature of KV cache generation at each decoding step. BitDecoding addresses these challenges with a Tensor Cores-Centric BitFusion Scheme that ensures data layout compatibility to enable high utilization of Tensor Cores. Additionally, BitDecoding incorporates a warp-efficient parallel decoding kernel and a fine-grained asynchronous pipeline, minimizing dequantization overhead and improving computational efficiency. Experiments show that BitDecoding achieves up to 7.5x speedup on RTX 4090, 4.8x on A100, and 8.9x on H100, compared to FP16 FlashDecoding-v2. It also outperforms the state-of-the-art low-bit KV cache implementation (QServe) by up to 4.3x. On LLaMA-3.1-8B with a 128K sequence length, BitDecoding reduces single-batch decoding latency by 3x, demonstrating its effectiveness in long-context generation scenarios.
The code is available at https://github.com/DD-DuDa/BitDecoding.
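
To make the quantization and dequantization overhead mentioned above concrete, the following sketch shows plain asymmetric 4-bit quantization of one cache vector, with two 4-bit codes packed per byte. It is a CPU-side illustration in pure Python under assumed conventions (per-vector scale and minimum, high nibble first); it does not reproduce BitDecoding's GPU kernels or its data layout.

```python
from typing import List, Tuple


def quantize_4bit(values: List[float]) -> Tuple[List[int], float, float]:
    """Map floats to integer codes in [0, 15] using a per-vector scale and minimum."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 when all values are equal
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def pack(codes: List[int]) -> bytes:
    """Pack two 4-bit codes into each byte (high nibble first)."""
    padded = codes + [0] if len(codes) % 2 else codes
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


def unpack(packed: bytes, count: int) -> List[int]:
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes: List[int] = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:count]


def dequantize(codes: List[int], scale: float, lo: float) -> List[float]:
    """Reconstruct approximate floats; the error is at most scale / 2 per value."""
    return [lo + c * scale for c in codes]
```

The unpack and dequantize steps must run every time the cache is read during decoding, which is the per-step cost that the abstract says existing low-bit implementations struggle to hide.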

TrackID3x3: A Dataset and Algorithm for Multi-Player Tracking with Identification and Pose Estimation in 3x3 Basketball Full-court Videos

Kazuhiro Yamada,Li Yin,Qingrui Hu,Ning Ding,Shunsuke Iwashita,Jun Ichikawa,Kiwamu Kotani,Calvin Yeung,Keisuke Fujii

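Task: Introduce the TrackID3x3 dataset for multi-object tracking, player identification, and pose estimation in 3x3 basketball.

Motivation: Existing datasets and methods mainly target mainstream team sports and overlook fixed-camera setups and less mainstream sports.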

Details

Method: Introduce the TrackID3x3 dataset with three subsets, and define the Track-ID task together with a baseline, the Track-ID algorithm. Result: Benchmark experiments demonstrate robust results and reveal remaining challenges. Conclusion: The TrackID3x3 dataset and evaluation benchmarks provide a solid foundation for automated analytics in 3x3 basketball. Abstract: Multi-object tracking, player identification, and pose estimation are fundamental components of sports analytics, essential for analyzing player movements, performance, and tactical strategies. However, existing datasets and methodologies primarily target mainstream team sports such as soccer and conventional 5-on-5 basketball, often overlooking scenarios involving fixed-camera setups commonly used at amateur levels, less mainstream sports, or datasets that explicitly incorporate pose annotations. In this paper, we propose the TrackID3x3 dataset, the first publicly available comprehensive dataset specifically designed for multi-player tracking, player identification, and pose estimation in 3x3 basketball scenarios. The dataset comprises three distinct subsets (Indoor fixed-camera, Outdoor fixed-camera, and Drone camera footage), capturing diverse full-court camera perspectives and environments. We also introduce the Track-ID task, a simplified variant of the game state reconstruction task that excludes field detection and focuses exclusively on fixed-camera scenarios. To evaluate performance, we propose a baseline algorithm, called the Track-ID algorithm, tailored to assess tracking and identification quality. Furthermore, our benchmark experiments, utilizing recent multi-object tracking algorithms (e.g., BoT-SORT-ReID) and top-down pose estimation methods (HRNet, RTMPose, and SwinPose), demonstrate robust results and highlight remaining challenges. Our dataset and evaluation benchmarks provide a solid foundation for advancing automated analytics in 3x3 basketball. Dataset and code will be available at https://github.com/open-starlab/TrackID3x3.

REALM: A Dataset of Real-World LLM Use Cases

Jingwen Cheng,Kshitish Ghate,Wenyue Hua,William Yang Wang,Hong Shen,Fei Fang

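Task: Use the REALM dataset to analyze real-world applications of large language models (such as the GPT series) and the characteristics of their users.

Motivation: Although large language models have had a major impact on industrial applications, a comprehensive understanding of their real-world use remains limited.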

Details

Method: Collect over 94,000 LLM use cases from Reddit and news articles to build the REALM dataset, then categorize and analyze them. Result: REALM reveals the diverse applications of LLMs and the relationship between users' occupations and usage types, providing data for studying their societal roles. Conclusion: REALM lays a foundation for future research on LLM applications across domains and their societal impact. Abstract: Large Language Models, such as the GPT series, have driven significant industrial applications, leading to economic and societal transformations. However, a comprehensive understanding of their real-world applications remains limited. To address this, we introduce REALM, a dataset of over 94,000 LLM use cases collected from Reddit and news articles. REALM captures two key dimensions: the diverse applications of LLMs and the demographics of their users. It categorizes LLM applications and explores how users' occupations relate to the types of applications they use. By integrating real-world data, REALM offers insights into LLM adoption across different domains, providing a foundation for future research on their evolving societal roles. A dedicated dashboard https://realm-e7682.web.app/ presents the data.

Voxel-based Point Cloud Geometry Compression with Space-to-Channel Context

Bojun Liu,Yangzhi Ma,Ao Luo,Li Li,Dong Liu

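Task: Propose a stage-wise Space-to-Channel (S2C) context model for point cloud geometry compression that addresses the restricted receptive field of voxel-based methods and their difficulty with high-bit-depth point clouds.

Motivation: Voxel-based methods are efficient for point cloud geometry compression, but their restricted receptive field hurts performance on high-bit-depth point clouds.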

Details

Method: Introduce a stage-wise S2C context model that combines a channel-wise autoregressive strategy with Geometry Residual Coding (GRC), using the spherical coordinate system and a Residual Probability Approximation (RPA) module. Result: Experiments show that the S2C model saves bits while maintaining or improving reconstruction quality and reduces computational complexity. Conclusion: The S2C context model effectively addresses the limitations of voxel-based methods and improves point cloud compression performance. Abstract: Voxel-based methods are among the most efficient for point cloud geometry compression, particularly with dense point clouds. However, they face limitations due to a restricted receptive field, especially when handling high-bit-depth point clouds. To overcome this issue, we introduce a stage-wise Space-to-Channel (S2C) context model for both dense point clouds and low-level sparse point clouds. This model utilizes a channel-wise autoregressive strategy to effectively integrate neighborhood information at a coarse resolution. For high-level sparse point clouds, we further propose a level-wise S2C context model that addresses resolution limitations by incorporating Geometry Residual Coding (GRC) for consistent-resolution cross-level prediction. Additionally, we use the spherical coordinate system for its compact representation and enhance our GRC approach with a Residual Probability Approximation (RPA) module, which features a large kernel size. Experimental results show that our S2C context model not only achieves bit savings while maintaining or improving reconstruction quality but also reduces computational complexity compared to state-of-the-art voxel-based compression methods.

EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments

Sara Fish,Julia Shephard,Minkai Li,Ran I. Shorrer,Yannai A. Gonczarowski

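Task: Develop benchmarks that evaluate how LLM agents act, learn, and strategize in unknown environments.

Motivation: Provide LLM agents with test environments that model complex economic problems, in order to assess their abilities and tendencies.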

Details

Method: Synthetically generate decision-making tasks with scalable difficulty levels and propose litmus tests, a new kind of quantitative measure. Result: A suite of benchmarks and litmus tests for evaluating the performance and tendencies of LLM agents across a range of economic problems. Conclusion: These tools provide an important means of evaluating LLM agents on complex economic problems. Abstract: We develop benchmarks for LLM agents that act in, learn from, and strategize in unknown environments, the specifications of which the LLM agent must learn over time from deliberate exploration. Our benchmarks consist of decision-making tasks derived from key problems in economics. To forestall saturation, the benchmark tasks are synthetically generated with scalable difficulty levels. Additionally, we propose litmus tests, a new kind of quantitative measure for LLMs and LLM agents. Unlike benchmarks, litmus tests quantify differences in character, values, and tendencies of LLMs and LLM agents, by considering their behavior when faced with tradeoffs (e.g., efficiency versus equality) where there is no objectively right or wrong behavior. Overall, our benchmarks and litmus tests assess the abilities and tendencies of LLM agents in tackling complex economic problems in diverse settings spanning procurement, scheduling, task allocation, and pricing. These applications should grow in importance as such agents are further integrated into the economy.

CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI

Siyuan Cheng,Lingjuan Lyu,Zhenting Wang,Xiangyu Zhang,Vikash Sehwag

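Task: Propose Co-Spy, a new framework for more general and robust detection of AI-generated synthetic images.

Motivation: Existing methods for distinguishing real from AI-generated images generalize poorly and are susceptible to post-processing.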

Details

Method: Enhance semantic features and artifact features, adaptively integrate them, and evaluate with the newly built Co-Spy-Bench dataset. Result: Under identical training conditions, Co-Spy improves average accuracy over existing methods by approximately 11% to 34%. Conclusion: The Co-Spy framework is more general and robust for synthetic image detection. Abstract: With the rapid advancement of generative AI, it is now possible to synthesize high-quality images in a few seconds. Despite the power of these technologies, they raise significant concerns regarding misuse. Current efforts to distinguish between real and AI-generated images may lack generalization, being effective for only certain types of generative models and susceptible to post-processing techniques like JPEG compression. To overcome these limitations, we propose a novel framework, Co-Spy, that first enhances existing semantic features (e.g., the number of fingers in a hand) and artifact features (e.g., pixel value differences), and then adaptively integrates them to achieve more general and robust synthetic image detection. Additionally, we create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models, including the latest models like FLUX. We also collect 50k synthetic images in the wild from the Internet to enable evaluation in a more practical setting. Our extensive evaluations demonstrate that our detector outperforms existing methods under identical training conditions, achieving an average accuracy improvement of approximately 11% to 34%. The code is available at https://github.com/Megum1/Co-Spy.

Reasoning to Learn from Latent Thoughts

Yangjun Ruan,Neil Band,Chris J. Maddison,Tatsunori Hashimoto

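Task: Improve the data efficiency of language model pretraining by modeling and inferring the latent thoughts behind text generation.

Motivation: As compute scales, human-written text may become the bottleneck for scaling language models, so more data-efficient methods are needed.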

Details

Method: Treat web text as the compressed outcome of a human thought process and infer the latent thoughts to make pretraining data more efficient. Result: Inferring latent thoughts significantly improves data efficiency (from 5.7% to 25.4% on MATH), and an EM algorithm enables bootstrapped performance gains. Conclusion: Latent thought inference opens new scaling opportunities for data-constrained pretraining, particularly through inference compute and EM iterations. Abstract: Compute scaling for language model (LM) pretraining has outpaced the growth of human-written texts, leading to concerns that data will become the bottleneck to LM scaling. To continue scaling pretraining in this data-constrained regime, we propose that explicitly modeling and inferring the latent thoughts that underlie the text generation process can significantly improve pretraining data efficiency. Intuitively, our approach views web text as the compressed final outcome of a verbose human thought process, and holds that the latent thoughts contain important contextual knowledge and reasoning steps that are critical to data-efficient learning. We empirically demonstrate the effectiveness of our approach through data-constrained continued pretraining for math. We first show that synthetic data approaches to inferring latent thoughts significantly improve data efficiency, outperforming training on the same amount of raw data (5.7% → 25.4% on MATH). Furthermore, we demonstrate latent thought inference without a strong teacher, where an LM bootstraps its own performance by using an EM algorithm to iteratively improve the capability of the trained LM and the quality of thought-augmented pretraining data. We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data, with increasing gains from additional inference compute when performing the E-step. The gains from inference scaling and EM iterations suggest new opportunities for scaling data-constrained pretraining.

LGPS: A Lightweight GAN-Based Approach for Polyp Segmentation in Colonoscopy Images

Fiseha B. Tesema,Alejandro Guerra Manzanares,Tianxiang Cui,Qian Zhang,Moses Solomon,Sean He

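Task: Propose LGPS, a lightweight GAN-based framework for colorectal polyp segmentation.

Motivation: Address the high computational cost, difficulty in segmenting small or low-contrast polyps, and limited cross-dataset generalization of existing deep learning methods for polyp segmentation.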

Details

Method: LGPS combines a MobileNetV2 backbone, modified residual blocks with Squeeze-and-Excitation modules, Convolutional Conditional Random Fields (ConvCRF), and a hybrid loss function. Result: Validated on five benchmark datasets, LGPS achieves a Dice of 0.7299 and an IoU of 0.7867 on the PolypGen test dataset, outperforming existing methods with only 1.07 million parameters. Conclusion: LGPS's lightweight design and strong performance indicate its potential for improving early colorectal cancer diagnosis. Abstract: Colorectal cancer (CRC) is a major global cause of cancer-related deaths, with early polyp detection and removal during colonoscopy being crucial for prevention. While deep learning methods have shown promise in polyp segmentation, challenges such as high computational costs, difficulty in segmenting small or low-contrast polyps, and limited generalizability across datasets persist. To address these issues, we propose LGPS, a lightweight GAN-based framework for polyp segmentation. LGPS incorporates three key innovations: (1) a MobileNetV2 backbone enhanced with modified residual blocks and Squeeze-and-Excitation (ResE) modules for efficient feature extraction; (2) Convolutional Conditional Random Fields (ConvCRF) for precise boundary refinement; and (3) a hybrid loss function combining Binary Cross-Entropy, Weighted IoU Loss, and Dice Loss to address class imbalance and enhance segmentation accuracy. LGPS is validated on five benchmark datasets and compared with state-of-the-art (SOTA) methods. On the largest and most challenging PolypGen test dataset, LGPS achieves a Dice of 0.7299 and an IoU of 0.7867, outperforming all SOTA works and demonstrating robust generalization. With only 1.07 million parameters, LGPS is 17 times smaller than the smallest existing model, making it highly suitable for real-time clinical applications. Its lightweight design and strong performance underscore its potential for improving early CRC diagnosis. Code is available at https://github.com/Falmi/LGPS/.
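
A hybrid loss of the kind named above (Binary Cross-Entropy plus an IoU term plus a Dice term) can be sketched on flattened masks as follows. The abstract does not give the term weights or the weighting scheme inside the IoU loss, so this sketch assumes equal weights and a plain soft IoU; `hybrid_loss` is a hypothetical name, not the authors' code.

```python
import math
from typing import List


def hybrid_loss(pred: List[float], target: List[float], eps: float = 1e-7) -> float:
    """Sum of BCE, soft IoU loss, and soft Dice loss over flattened masks.

    `pred` holds per-pixel foreground probabilities in [0, 1];
    `target` holds the ground-truth labels (0.0 or 1.0).
    """
    # Binary cross-entropy, averaged over pixels.
    bce = -sum(
        t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
        for p, t in zip(pred, target)
    ) / len(pred)

    # Soft overlap statistics shared by the Dice and IoU terms.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)

    dice_loss = 1.0 - (2.0 * intersection + eps) / (total + eps)
    iou_loss = 1.0 - (intersection + eps) / (total - intersection + eps)

    # Equal weighting is an assumption made for this sketch.
    return bce + iou_loss + dice_loss
```

The overlap-based terms depend on the ratio of intersection to mask size rather than on the pixel count, which is why they are commonly paired with BCE when foreground pixels are rare.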

Toward building next-generation Geocoding systems: a systematic review

Zhengcong Yin,Daniel W. Goldberg,Binbin Lin,Bing Zhou,Diya Li,Andong Ma,Ziqian Ming,Heng Cai,Zhe Zhang,Shaohua Wang,Shanzhen Gao,Joey Ying Lee,Xiao Li,Da Huo

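Task: Review the evolving requirements, construction methods, and future improvements of geocoding systems.

Motivation: The quality of geocoding systems is critical for downstream applications, so next-generation systems are needed to meet diverse demands.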

Details

Method: Analyze the functional components of geocoding systems and review existing approaches, from traditional rule-based methods to advanced techniques such as information retrieval, natural language processing, and large language models. Result: Identifies opportunities to improve next-generation geocoding systems using recent technologies. Conclusion: Through technology integration and innovation, next-generation geocoding systems are expected to improve significantly in performance and range of application. Abstract: Geocoding systems are widely used in both scientific research for spatial analysis and everyday life through location-based services. The quality of geocoded data significantly impacts subsequent processes and applications, underscoring the need for next-generation systems. In response to this demand, this review first examines the evolving requirements for geocoding inputs and outputs across various scenarios these systems must address. It then provides a detailed analysis of how to construct such systems by breaking them down into key functional components and reviewing a broad spectrum of existing approaches, from traditional rule-based methods to advanced techniques in information retrieval, natural language processing, and large language models. Finally, we identify opportunities to improve next-generation geocoding systems in light of recent technological advances.

Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module

Yishen Liu,Shengda Liu,Hudan Pan

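Task: Propose CA-TriNet, a multimodal model that improves the accuracy of medical report generation and avoids overfitting.

Motivation: General large models struggle to capture the specialized knowledge needed for medical report generation, and the repetition and similarity in medical data make it hard to extract meaningful features and easy to overfit.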

Details

Method: Combine transformer architectures with a Multi-LSTM network: a co-attention module links a vision transformer and a text transformer and is augmented with an adaptive weight operator, while a Triple-LSTM module refines generated sentences using targeted image objects. Result: Evaluations on three public datasets show that CA-TriNet outperforms state-of-the-art models in comprehensive ability and even surpasses pre-trained large language models on some metrics. Conclusion: Through multimodal co-attention and adaptive weighting, CA-TriNet significantly improves medical report generation. Abstract: Medical report generation requires specialized expertise that general large models often fail to accurately capture. Moreover, the inherent repetition and similarity in medical data make it difficult for models to extract meaningful features, resulting in a tendency to overfit. In this paper, we propose a multimodal model, Co-Attention Triple-LSTM Network (CA-TriNet), a deep learning model that combines transformer architectures with a Multi-LSTM network. Its Co-Attention module synergistically links a vision transformer with a text transformer to better differentiate medical images with similarities, augmented by an adaptive weight operator to catch and amplify image labels with minor similarities. Furthermore, its Triple-LSTM module refines generated sentences using targeted image objects. Extensive evaluations over three public datasets have demonstrated that CA-TriNet outperforms state-of-the-art models in terms of comprehensive ability, and even outperforms pre-trained large language models on some metrics.

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Weihao Zeng,Yuzhen Huang,Qian Liu,Wei Liu,Keqing He,Zejun Ma,Junxian He

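Task: Study zero reinforcement learning (RL) training on 10 different base models, including its effect on reasoning accuracy and response length.

Motivation: Prior work focuses mainly on the Qwen2.5 model series, but these models already have strong instruction-following and self-reflection abilities and may not be representative.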

Details

Method: Apply zero RL training to 10 diverse base models using key design strategies such as adjusting the format reward and controlling query difficulty. Result: Reasoning accuracy and response length improve substantially in most settings, but different base models show distinct behavior patterns during training. Conclusion: The authors share the key designs that enable successful zero RL training and open-source the code, models, and analysis tools to support further research. Abstract: DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where the training may directly start from the base models, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.
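
A rule-based reward with an adjustable format term, as mentioned above, might look like the following sketch. It assumes answers are written as `\boxed{...}` and compared by exact string match; the `format_penalty` parameter and both function names are hypothetical and do not reproduce the paper's reward definition.

```python
import re
from typing import Optional


def extract_boxed(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def reward(response: str, gold: str, format_penalty: float = 0.0) -> float:
    """Rule-based reward: 1 for a correct boxed answer, 0 for a wrong one.

    A response with no boxed answer receives -format_penalty. Setting the
    penalty to 0 disables the format term entirely.
    """
    answer = extract_boxed(response)
    if answer is None:
        return -format_penalty
    return 1.0 if answer == gold.strip() else 0.0
```

Varying `format_penalty` changes how strongly the policy is pushed toward the expected output format relative to answer correctness, which is the kind of knob the abstract refers to as adjusting the format reward.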

Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models

Jianlong Jin,Chenglong Zhao,Ruixin Zhang,Sheng Shang,Jianqing Xu,Jingyun Zhang,ShaoMing Wang,Yang Zhao,Shouhong Ding,Wei Jia,Yunsheng Wu

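Task: Propose a polynomial-based palm crease representation and a conditional diffusion model to generate palmprint datasets with high identity consistency and intra-class variation.

Motivation: Existing methods simulate palm creases with Bézier curves, but the generated data differs substantially from real data, which degrades recognition models.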

Details

Method: Introduce a polynomial representation of palm creases, a crease-conditioned diffusion model, and K-step noise-sharing sampling. Result: Experiments show that, for the first time, recognition models trained only on synthetic datasets outperform those trained on real datasets, and performance improves as the number of generated identities grows. Conclusion: The method significantly narrows the gap between synthetic and real palmprints and provides high-quality synthetic data for palmprint recognition. Abstract: Palmprint recognition is significantly limited by the lack of large-scale publicly available datasets. Previous methods have adopted Bézier curves to simulate the palm creases, which then serve as input for conditional GANs to generate realistic palmprints. However, without employing real data fine-tuning, the performance of the recognition model trained on these synthetic datasets would drastically decline, indicating a large gap between generated and real palmprints. This is primarily due to the utilization of an inaccurate palm crease representation and challenges in balancing intra-class variation with identity consistency. To address this, we introduce a polynomial-based palm crease representation that provides a new palm crease generation mechanism more closely aligned with the real distribution. We also propose the palm creases conditioned diffusion model with a novel intra-class variation control method. By applying our proposed K-step noise-sharing sampling, we are able to synthesize palmprint datasets with large intra-class variation and high identity consistency. Experimental results show that, for the first time, recognition models trained solely on our synthetic datasets, without any fine-tuning, outperform those trained on real datasets. Furthermore, our approach achieves superior recognition performance as the number of generated identities increases.
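
As a rough illustration of what a polynomial crease representation means, the sketch below samples points along a curve y = c0 + c1*x + c2*x^2 + ... using Horner's rule. The coefficient ordering, the normalized x range, and the function name are assumptions for illustration; the paper's actual parameterization and coefficient distribution are not given in the abstract.

```python
from typing import List, Tuple


def crease_points(coeffs: List[float],
                  n: int = 5,
                  x_start: float = 0.0,
                  x_end: float = 1.0) -> List[Tuple[float, float]]:
    """Sample n (x, y) points on the polynomial with coefficients c0, c1, c2, ...

    Coefficients are ordered from the constant term upward, and n must be >= 2.
    """
    points = []
    for i in range(n):
        x = x_start + (x_end - x_start) * i / (n - 1)
        y = 0.0
        for c in reversed(coeffs):  # Horner's rule: ((c2 * x) + c1) * x + c0
            y = y * x + c
        points.append((x, y))
    return points
```

For example, `crease_points([0.2, 0.5, -0.3])` traces a gently curved line that could be rasterized into a conditioning image for a generator.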

Exploring Training and Inference Scaling Laws in Generative Retrieval

Hongru Cai,Yongqi Li,Ruifeng Yuan,Wenjie Wang,Zhen Zhang,Wenjie Li,Tat-Seng Chua

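Task: Systematically study training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference compute jointly affect retrieval performance.

Motivation: Generative retrieval is an emerging paradigm, but the mechanisms behind its performance and scalability remain unclear and need in-depth study.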

Details

Method: Propose a new evaluation metric inspired by contrastive entropy and generation loss, and experimentally analyze the effects of model size, training data, and inference compute. Result: n-gram-based methods align well with scaling laws, larger LLMs and more inference compute yield substantial gains, and LLaMA models outperform T5 models. Conclusion: The interaction of model size, data availability, and inference compute reveals the potential of generative retrieval and offers new insights for future system design. Abstract: Generative retrieval has emerged as a novel paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers. Although promising, the mechanisms that underpin its performance and scalability remain largely unclear. We conduct a systematic investigation of training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence retrieval performance. To address the lack of suitable metrics, we propose a novel evaluation measure inspired by contrastive entropy and generation loss, providing a continuous performance signal that enables robust comparisons across diverse generative retrieval methods. Our experiments show that n-gram-based methods demonstrate strong alignment with both training and inference scaling laws, especially when paired with larger LLMs. Furthermore, increasing inference computation yields substantial performance gains, revealing that generative retrieval can significantly benefit from higher compute budgets at inference. Across these settings, LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval. Taken together, our findings underscore that model sizes, data availability, and inference computation interact to unlock the full potential of generative retrieval, offering new insights for designing and optimizing future systems.

Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control

Basim Azam,Naveed Akhtar

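Task: Propose a new technique for responsible text-to-image (T2I) generation that simultaneously accounts for a wide range of fairness and safety concepts.

Motivation: Existing methods for responsible generation handle concepts individually, lack interpretability, and may compromise model performance.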

Details

Method: Use knowledge distillation and concept whitening to learn an interpretable composite responsible space through an external plug-and-play mechanism, and use it to modulate generated content at inference. Result: Modules for the text embedding space and the diffusion model latent space prove effective, with strong experimental results. Conclusion: The paper presents a scalable and interpretable approach to responsible T2I generation that preserves model performance. Abstract: Ethical issues around text-to-image (T2I) models demand a comprehensive control over the generative content. Existing techniques addressing these issues for responsible T2I models aim for the generated content to be fair and safe (non-violent/explicit). However, these methods remain bounded to handling the facets of responsibility concepts individually, while also lacking in interpretability. Moreover, they often require alteration to the original model, which compromises the model performance. In this work, we propose a unique technique to enable responsible T2I generation by simultaneously accounting for an extensive range of concepts for fair and safe content generation in a scalable manner. The key idea is to distill the target T2I pipeline with an external plug-and-play mechanism that learns an interpretable composite responsible space for the desired concepts, conditioned on the target T2I pipeline. We use knowledge distillation and concept whitening to enable this. At inference, the learned space is utilized to modulate the generative content. A typical T2I pipeline presents two plug-in points for our approach, namely the text embedding space and the diffusion model latent space. We develop modules for both points and show the effectiveness of our approach with a range of strong results.

Towards Training-free Anomaly Detection with Vision and Language Foundation Models

Jinjin Zhang,Guodong Wang,Yizhou Jin,Di Huang

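Task: Propose LogSAD, a training-free multimodal framework that detects both logical and structural anomalies.

Motivation: Existing methods focus on local structural anomalies and neglect compositional anomalies that involve logical constraints.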

Details

Method: Use a GPT-4V-based match-of-thought architecture to generate matching proposals, combined with multi-granularity anomaly detection and a calibration module. Result: LogSAD achieves state-of-the-art performance without training, outperforming supervised methods. Conclusion: LogSAD is a robust and effective unified framework for logical and structural anomaly detection. Abstract: Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e., GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
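
Aligning anomaly scores from different detectors is needed because their raw outputs sit on different scales. One common way to do this, shown below as an assumed stand-in for the paper's calibration module, is to standardize each detector's score against its scores on known-normal reference samples and then take the maximum. The detector names and the max-fusion rule are illustrative choices, not details from the abstract.

```python
from statistics import mean, pstdev
from typing import Dict, List


def calibrate(score: float, reference: List[float]) -> float:
    """Express a raw score as a z-score relative to normal reference scores."""
    mu = mean(reference)
    sigma = pstdev(reference) or 1.0  # avoid division by zero for constant references
    return (score - mu) / sigma


def fuse(scores: Dict[str, float], references: Dict[str, List[float]]) -> float:
    """Combine detectors by taking the largest calibrated score."""
    return max(calibrate(scores[name], references[name]) for name in scores)
```

After calibration, a patch-level detector that outputs values near 0.2 and a logic-level detector that outputs values near 20 can be compared directly.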

TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering

Chun Gu,Xiaofei Wei,Li Zhang,Xiatian Zhu

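Task: Recover scene geometry, material properties, and lighting from multi-view images.

Motivation: Existing inverse rendering methods typically use pre-defined, non-learnable importance samplers that struggle to match the spatially and directionally varying integrand, leading to high variance and suboptimal performance.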

Details

Method: Propose a spatially and directionally aware importance sampler parameterized by normalizing flows, combined with a tensorial representation that learns the spatial characteristics of the scene. Result: Experiments confirm TensoFlow's superiority on both synthetic and real-world benchmarks. Conclusion: The proposed method captures the characteristics of complex scenes more accurately and flexibly, significantly improving inverse rendering performance. Abstract: Inverse rendering aims to recover scene geometry, material properties, and lighting from multi-view images. Given the complexity of light-surface interactions, importance sampling is essential for the evaluation of the rendering equation, as it reduces variance and enhances the efficiency of Monte Carlo sampling. Existing inverse rendering methods typically use manually pre-defined, non-learnable importance samplers, struggling to effectively match the spatially and directionally varied integrand and resulting in high variance and suboptimal performance. To address this limitation, we propose the concept of learning a spatially and directionally aware importance sampler for the rendering equation to accurately and flexibly capture the unconstrained complexity of a typical scene. We further formulate TensoFlow, a generic approach for sampler learning in inverse rendering, enabling it to closely match the integrand of the rendering equation spatially and directionally. Concretely, our sampler is parameterized by normalizing flows, allowing both directional sampling of incident light and probability density function (PDF) inference. To capture the characteristics of the sampler spatially, we learn a tensorial representation of the scene space, which imposes spatial conditions, together with reflected direction, leading to spatially and directionally aware sampling distributions. Our model can be optimized by minimizing the difference between the integrand and our normalizing flow. Extensive experiments validate the superiority of TensoFlow over prior alternatives on both synthetic and real-world benchmarks.
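
The reason a sampler that matches the integrand reduces variance can be seen in a one-dimensional toy problem. The sketch below estimates the integral of f(x) = 3x^2 over [0, 1] (exact value 1) with uniform samples and with samples drawn from the density p(x) = 2x, which is closer in shape to f. It is a generic Monte Carlo illustration with a hand-picked density, not the normalizing-flow sampler the paper learns.

```python
import random
from typing import List, Tuple


def integrand(x: float) -> float:
    """Toy integrand; its integral over [0, 1] is exactly 1."""
    return 3.0 * x * x


def mean_and_variance(samples: List[float]) -> Tuple[float, float]:
    """Sample mean and unbiased sample variance."""
    m = sum(samples) / len(samples)
    v = sum((s - m) ** 2 for s in samples) / (len(samples) - 1)
    return m, v


def uniform_estimate(n: int, rng: random.Random) -> Tuple[float, float]:
    """Monte Carlo with x ~ Uniform(0, 1), so the pdf is 1 everywhere."""
    return mean_and_variance([integrand(rng.random()) for _ in range(n)])


def importance_estimate(n: int, rng: random.Random) -> Tuple[float, float]:
    """Monte Carlo with x ~ p(x) = 2x, sampled by inverting the CDF x^2."""
    samples = []
    for _ in range(n):
        u = 1.0 - rng.random()  # in (0, 1], keeps x strictly positive
        x = u ** 0.5
        samples.append(integrand(x) / (2.0 * x))  # f(x) / p(x)
    return mean_and_variance(samples)
```

Both estimators converge to 1, but the per-sample variance is 0.8 for uniform sampling and 0.125 for the importance sampler, since f/p = 1.5x varies far less than f itself.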

Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models

Haotian Zhai,Xinyu Chen,Can Zhang,Tianming Sha,Ruirui Li

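Task: Propose CRG, a zero-shot test-time adaptation method, to address the performance degradation of vision-language models under distribution shift.

Motivation: Existing cache-based TTA methods depend on the accuracy of cached feature labels, are vulnerable to noisy pseudo-labels, and lack an effective mechanism for modeling class distributions.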

Details

Method: Combines learnable residual parameters, which improve the quality of cached features, with Gaussian Discriminant Analysis (GDA), which dynamically models intra-class feature distributions. Result: On 13 benchmarks, CRG outperforms existing TTA methods and shows strong robustness and adaptability. Conclusion: By combining a reliable caching mechanism with dynamic distribution modeling, CRG significantly improves the performance of vision-language models under distribution shift. Abstract: Test-time adaptation (TTA) of visual language models has recently attracted significant attention as a solution to the performance degradation caused by distribution shifts in downstream tasks. However, existing cache-based TTA methods have certain limitations. They mainly rely on the accuracy of cached feature labels, and the presence of noisy pseudo-labels can cause these features to deviate from their true distribution. This makes cache retrieval methods based on similarity matching highly sensitive to outliers or extreme samples. Moreover, current methods lack effective mechanisms to model class distributions, which limits their ability to fully exploit the potential of cached information. To address these challenges, we introduce a comprehensive and reliable caching mechanism and propose a novel zero-shot TTA method called "Cache, Residual, Gaussian" (CRG). This method not only employs learnable residual parameters to better align positive and negative visual prototypes with text prototypes, thereby optimizing the quality of cached features, but also incorporates Gaussian Discriminant Analysis (GDA) to dynamically model intra-class feature distributions, further mitigating the impact of noisy features. Experimental results on 13 benchmarks demonstrate that CRG outperforms state-of-the-art TTA methods, showcasing exceptional robustness and adaptability.
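
To make the GDA component concrete, the sketch below fits class means and a shared variance to a handful of cached features and classifies by the resulting Gaussian log-score. It is a simplified illustration, not CRG itself: the shared diagonal covariance, the toy 2D features, and the function names `fit_gda` and `predict_gda` are all assumptions introduced here.

```python
import math


def fit_gda(features, labels):
    # Estimate per-class means, a shared per-dimension variance, and class priors.
    classes = sorted(set(labels))
    dim = len(features[0])
    means = {}
    for c in classes:
        rows = [f for f, y in zip(features, labels) if y == c]
        means[c] = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    variance = [0.0] * dim
    for f, y in zip(features, labels):
        for d in range(dim):
            variance[d] += (f[d] - means[y][d]) ** 2
    # Small constant keeps the variance strictly positive (assumed regularizer).
    variance = [v / len(features) + 1e-6 for v in variance]
    priors = {c: labels.count(c) / len(labels) for c in classes}
    return means, variance, priors


def predict_gda(x, means, variance, priors):
    # Pick the class with the highest Gaussian log-score (shared terms dropped).
    def score(c):
        dist = sum((x[d] - means[c][d]) ** 2 / variance[d] for d in range(len(x)))
        return math.log(priors[c]) - 0.5 * dist
    return max(means, key=score)


# Toy "cached features" with pseudo-labels (made-up values for illustration).
cached = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [3.0, 3.1], [2.9, 2.8], [3.2, 3.0]]
pseudo_labels = [0, 0, 0, 1, 1, 1]
model = fit_gda(cached, pseudo_labels)
print(predict_gda([0.2, 0.1], *model), predict_gda([2.9, 3.1], *model))
```

Because each class is summarized by distribution statistics rather than by individual cached samples, a single mislabeled or extreme feature has less influence than it would under nearest-neighbor style similarity matching.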

Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models

Zichen Miao,Wei Chen,Qiang Qiu

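Task: Tune the attention operation of large pre-trained Transformers by learning a small set of combination coefficients that construct a more expressive filter subspace.

Motivation: Existing parameter-efficient fine-tuning methods are mainly designed from a tensor-decomposition perspective, whereas this work re-expresses the attention operation as a graph convolution to enhance the expressiveness of Transformers.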

Details

Method: Represents multi-head attention maps as a convolutional filter subspace, tunes the subspace by learning combination coefficients, and stabilizes fine-tuning with a residual parameterization and a regularization design. Result: Experiments show that the tuned filter subspace effectively expands the feature space of multi-head attention and improves Transformer performance with a negligible parameter increase. Conclusion: The method outperforms baselines in parameter-efficient fine-tuning and can be combined seamlessly with existing methods. Abstract: Transformer-based large pre-trained models have shown remarkable generalization ability, and various parameter-efficient fine-tuning (PEFT) methods have been proposed to customize these models on downstream tasks with minimal computational and memory budgets. Previous PEFT methods are primarily designed from a tensor-decomposition perspective that tries to effectively tune the linear transformation by finding the smallest subset of parameters to train. Our study adopts an orthogonal view by representing the attention operation as a graph convolution and formulating the multi-head attention maps as a convolutional filter subspace, with each attention map as a subspace element. In this paper, we propose to tune the large pre-trained transformers by learning a small set of combination coefficients that construct a more expressive filter subspace from the original multi-head attention maps. We show analytically and experimentally that the tuned filter subspace can effectively expand the feature space of the multi-head attention and further enhance the capacity of transformers. We further stabilize the fine-tuning with a residual parameterization of the tunable subspace coefficients, and enhance the generalization with a regularization design by directly applying dropout on the tunable coefficient during training. The tunable coefficients take a tiny number of parameters and can be combined with previous PEFT methods in a plug-and-play manner. Extensive experiments show that our approach outperforms PEFT baselines with negligible additional parameters.
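
The idea of recombining attention maps with a small coefficient matrix can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the "identity plus delta" form of the residual parameterization is an assumption about what the abstract describes, dropout on the coefficients is omitted, and the name `combine_attention_maps` is hypothetical.

```python
def combine_attention_maps(attn_maps, delta):
    """Mix H attention maps (each N x N) using coefficients (I + delta).

    attn_maps: list of H maps, each a list of N rows of N floats.
    delta:     H x H residual coefficients; all zeros reproduces the input maps.
    """
    num_heads = len(attn_maps)
    size = len(attn_maps[0])
    combined = []
    for i in range(num_heads):
        new_map = [[0.0] * size for _ in range(size)]
        for j in range(num_heads):
            # Residual parameterization (assumed form): identity plus learned delta.
            coeff = (1.0 if i == j else 0.0) + delta[i][j]
            for r in range(size):
                for c in range(size):
                    new_map[r][c] += coeff * attn_maps[j][r][c]
        combined.append(new_map)
    return combined


maps = [
    [[0.5, 0.5], [0.25, 0.75]],   # head 0
    [[1.0, 0.0], [0.0, 1.0]],     # head 1
]
zero_delta = [[0.0, 0.0], [0.0, 0.0]]
tuned_delta = [[0.0, 0.5], [0.0, 0.0]]   # head 0 borrows half of head 1

unchanged = combine_attention_maps(maps, zero_delta)
tuned = combine_attention_maps(maps, tuned_delta)
print(tuned[0])
```

With H heads the coefficient matrix has only H x H entries per layer, which is why the abstract can describe the added parameter count as tiny. Starting from a zero delta means fine-tuning begins exactly at the pre-trained model's behavior.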

SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking

Wenrui Cai,Qingjie Liu,Yunhong Wang

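Task: Propose SPMTrack, a novel visual tracker based on a mixture of experts, to handle diverse relation modeling more flexibly.

Motivation: Existing single-model trackers struggle to handle all kinds of relation modeling between image patches at once, especially since background and foreground regions differ markedly in the attention they require.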

Details

Method: Uses a mixture of experts tailored for tracking (TMoE) to extend relation modeling to spatio-temporal context, and employs it as a parameter-efficient fine-tuning method. Result: Experiments on seven datasets show that SPMTrack significantly outperforms current state-of-the-art trackers. Conclusion: Through TMoE, SPMTrack achieves more flexible relation modeling and efficient training while preserving the generalization ability of the pretrained model. Abstract: Most state-of-the-art trackers adopt a one-stream paradigm, using a single Vision Transformer for joint feature extraction and relation modeling of template and search region images. However, relation modeling between different image patches exhibits significant variations. For instance, background regions dominated by target-irrelevant information require reduced attention allocation, while foreground, particularly boundary areas, need to be emphasized. A single model may not effectively handle all kinds of relation modeling simultaneously. In this paper, we propose a novel tracker called SPMTrack based on mixture-of-experts tailored for the visual tracking task (TMoE), combining the capability of multiple experts to handle diverse relation modeling more flexibly. Benefiting from TMoE, we extend relation modeling from image pairs to spatio-temporal context, further improving tracking accuracy with minimal increase in model parameters. Moreover, we employ TMoE as a parameter-efficient fine-tuning method, substantially reducing trainable parameters, which enables us to train SPMTrack of varying scales efficiently and preserve the generalization ability of pretrained models to achieve superior performance. We conduct experiments on seven datasets, and experimental results demonstrate that our method significantly outperforms current state-of-the-art trackers. The source code is available at https://github.com/WenRuiCai/SPMTrack.

GranQ: Granular Zero-Shot Quantization with Unified Layer-Channel Awareness

Inpyo Hong,Youngwan Jo,Hyojeong Lee,Sunghyun Ahn,Sanghyun Park

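Task: Propose GranQ, a novel zero-shot quantization method, to address activation loss in low-bit settings.

Motivation: Existing zero-shot quantization methods suffer significant activation loss in low-bit settings because of their coarse-grained scaling strategies.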

Details

Method: GranQ dynamically adjusts quantization granularity through layer-channel awareness and uses vectorized activation quantization to reduce computational overhead. Result: GranQ outperforms existing zero-shot quantization methods, including those that rely on quantization-aware training. Conclusion: GranQ opens new directions for zero-shot quantization research beyond conventional data generation and model training approaches. Abstract: Zero-shot quantization (ZSQ) enables neural network compression without training data, which is crucial in restricted data access environments. However, existing ZSQ methods suffer from significant activation loss in low-bit environments owing to their coarse-grained scaling strategy. To address this issue, we propose GranQ, a novel ZSQ approach that leverages layer-channel awareness to minimize the quantization error. Unlike conventional layer- or channel-wise quantization, GranQ dynamically adjusts quantization granularity by considering both layer- and channel-level activation distributions. This enables fine-grained quantization while minimizing activation distortion. Additionally, we introduce vectorized activation quantization, which enables efficient parallel computation and reduces computational overhead while preserving accuracy. GranQ achieves superior performance compared with state-of-the-art ZSQ methods that employ quantization-aware training. With these findings, we anticipate that GranQ will inspire novel research directions beyond conventional ZSQ approaches focused on data generation and model training.
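
Why a coarse-grained scale loses activations at low bit-widths can be seen in a small experiment that compares one scale per layer with one scale per channel. This sketch shows only the generic granularity effect; it is not GranQ's layer-channel-aware procedure, and the activation values, the symmetric 4-bit quantizer, and all names are assumptions made for illustration.

```python
def scale_for(values, bits):
    # Symmetric scale so that the largest magnitude maps to the top integer level.
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


def fake_quantize(values, scale, bits):
    # Quantize to signed integers, clamp, then map back to floats.
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Two channels of one layer with very different ranges (made-up activations).
channels = [[0.01, -0.02, 0.03, 0.015], [1.0, -2.0, 3.0, 1.5]]
flat = [v for ch in channels for v in ch]
BITS = 4

# Coarse: a single scale for the whole layer.
layer_scale = scale_for(flat, BITS)
per_layer = [q for ch in channels for q in fake_quantize(ch, layer_scale, BITS)]

# Fine: one scale per channel.
per_channel = [q for ch in channels for q in fake_quantize(ch, scale_for(ch, BITS), BITS)]

per_layer_error = mean_squared_error(flat, per_layer)
per_channel_error = mean_squared_error(flat, per_channel)
print("per-layer MSE:  ", per_layer_error)
print("per-channel MSE:", per_channel_error)
```

With the shared scale, every value in the small-range channel rounds to zero, which is the kind of activation loss the abstract attributes to coarse scaling. The per-channel scales preserve that channel's structure at the cost of storing more scale factors.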

PS-EIP: Robust Photometric Stereo Based on Event Interval Profile

Kazuma Kitazawa,Takahito Aoto,Satoshi Ikehata,Tsuyoshi Takatani

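Task: Propose PS-EIP, a robust photometric stereo method based on event interval profiles, which recovers pixelwise surface normals from a time-series profile of event intervals.

Motivation: The existing EventPS method treats each event interval independently and is sensitive to noise, shadows, and non-Lambertian reflections, so a more robust method is needed.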

Details

Method: Exploits the continuity of the event interval profile and introduces a profile-shape-based outlier detection method to improve robustness to outliers such as shadows and specular reflections. Result: Experiments show that PS-EIP, without relying on deep learning, significantly improves robustness to outliers and outperforms EventPS-FCN, the deep-learning variant of EventPS. Conclusion: PS-EIP is a robust photometric stereo method that handles noise and outliers effectively and is suitable for real event data. Abstract: Recently, EventPS, an energy-efficient photometric stereo method using an event camera, has been proposed to recover surface normals from events triggered by changes in logarithmic Lambertian reflections under a moving directional light source. However, EventPS treats each event interval independently, making it sensitive to noise, shadows, and non-Lambertian reflections. This paper proposes Photometric Stereo based on Event Interval Profile (PS-EIP), a robust method that recovers pixelwise surface normals from a time-series profile of event intervals. By exploiting the continuity of the profile and introducing an outlier detection method based on profile shape, our approach enhances robustness against outliers from shadows and specular reflections. Experiments using real event data from 3D-printed objects demonstrate that PS-EIP significantly improves robustness to outliers compared to EventPS's deep-learning variant, EventPS-FCN, without relying on deep learning.

Human-Object Interaction with Vision-Language Model Guided Relative Movement Dynamics

Zekai Deng,Ye Shi,Kaiyang Ji,Lan Xu,Shaoli Huang,Jingya Wang

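Task: Propose a unified Human-Object Interaction framework that controls interactions with static scenes and dynamic objects through language commands.

Motivation: Existing methods fall short in physical realism and in supporting diverse interaction types, so a more general solution is needed.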

Details

Method: Uses Vision-Language Models to translate language commands into Relative Movement Dynamics (RMD) diagrams, which guide goal-conditioned reinforcement learning to carry out the interactions. Result: The framework supports long-horizon, multi-round interactions and handles a wide range of HOI tasks in experiments. Conclusion: The proposed framework effectively handles diverse HOI tasks while keeping long-term, multi-round interactions stable. Abstract: Human-Object Interaction (HOI) is vital for advancing simulation, animation, and robotics, enabling the generation of long-term, physically plausible motions in 3D environments. However, existing methods often fall short of achieving physics realism and supporting diverse types of interactions. To address these challenges, this paper introduces a unified Human-Object Interaction framework that provides unified control over interactions with static scenes and dynamic objects using language commands. The interactions between human and object parts can always be described as the continuous stable Relative Movement Dynamics (RMD) between human and object parts. By leveraging the world knowledge and scene perception capabilities of Vision-Language Models (VLMs), we translate language commands into RMD diagrams, which are used to guide goal-conditioned reinforcement learning for sequential interaction with objects. Our framework supports long-horizon interactions among dynamic, articulated, and static objects. To support the training and evaluation of our framework, we present a new dataset named Interplay, which includes multi-round task plans generated by VLMs, covering both static and dynamic HOI tasks. Extensive experiments demonstrate that our proposed framework can effectively handle a wide range of HOI tasks, showcasing its ability to maintain long-term, multi-round transitions. For more details, please refer to our project webpage: https://rmd-hoi.github.io/.

Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

Jinjin Zhang,Qiuyu Huang,Junjie Liu,Xiefan Guo,Di Huang

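Task: Propose Diffusion-4K, a framework for directly generating ultra-high-resolution (4K) images.

Motivation: Address the lack of a publicly available 4K image synthesis dataset and improve the detail and quality of ultra-high-resolution image generation.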

Details

Method: Builds the Aesthetic-4K benchmark dataset and proposes a wavelet-based fine-tuning method. Result: Diffusion-4K performs strongly in ultra-high-resolution image synthesis, particularly in fine detail and text prompt adherence. Conclusion: Diffusion-4K offers clear advantages for ultra-high-resolution image synthesis and is applicable to modern large-scale diffusion models. Abstract: In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
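
The abstract does not say which wavelet or loss the fine-tuning uses. As a generic illustration of what a wavelet decomposition separates, the sketch below applies a one-level 1D Haar transform, splitting a signal into a low-frequency approximation and high-frequency details and then reconstructing it. The choice of the Haar wavelet, the 1D setting, and the function names are assumptions made for illustration only.

```python
import math

_S = 1.0 / math.sqrt(2.0)


def haar_dwt(signal):
    # One-level Haar transform; assumes an even-length signal.
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    # Inverse transform: interleave the reconstructed sample pairs.
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


signal = [4.0, 4.0, 2.0, 6.0]
approx, detail = haar_dwt(signal)
reconstructed = haar_idwt(approx, detail)
print("approx:", approx)
print("detail:", detail)
print("reconstructed:", reconstructed)
```

The flat pair (4, 4) produces a zero detail coefficient, while the pair (2, 6) produces a large one. Fine texture therefore lives in the detail band, which is presumably why a wavelet view is attractive when the goal is highly detailed 4K output.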

Cost-Sensitive Learning for Long-Tailed Temporal Action Segmentation

Zhanzhong Pang,Fadime Sener,Shrinivas Ramasubramanian,Angela Yao

Task: Temporal action segmentation in untrimmed procedural videos to densely label frames into action classes.

Motivation: Addressing the bi-level learning bias (class-level and transition-level) caused by long-tailed distributions in procedural videos.

Details

Method: Introducing a constrained optimization problem with a novel cost-sensitive loss function (weighted cross-entropy) based on learning states of actions and transitions. Result: Significant improvements in per-class frame-wise and segment-wise performance on three benchmarks. Conclusion: The proposed approach effectively alleviates bi-level learning bias, enhancing temporal action segmentation performance. Abstract: Temporal action segmentation in untrimmed procedural videos aims to densely label frames into action classes. These videos inherently exhibit long-tailed distributions, where actions vary widely in frequency and duration. In temporal action segmentation approaches, we identified a bi-level learning bias. This bias encompasses (1) a class-level bias, stemming from class imbalance favoring head classes, and (2) a transition-level bias arising from variations in transitions, prioritizing commonly observed transitions. As a remedy, we introduce a constrained optimization problem to alleviate both biases. We define learning states for action classes and their associated transitions and integrate them into the optimization process. We propose a novel cost-sensitive loss function formulated as a weighted cross-entropy loss, with weights adaptively adjusted based on the learning state of actions and their transitions. Experiments on three challenging temporal segmentation benchmarks and various frameworks demonstrate the effectiveness of our approach, resulting in significant improvements in both per-class frame-wise and segment-wise performance.
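
A weighted cross-entropy loss of the kind the abstract names can be sketched as below. The weights here are fixed by hand; the paper adapts them from the learning states of actions and transitions, which this sketch does not reproduce. The function name, the normalization by the sum of weights, and the toy logits are assumptions.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """Frame-wise cross-entropy where each frame is weighted by its class weight.

    logits:        list of per-frame score lists (one score per class).
    targets:       list of ground-truth class indices, one per frame.
    class_weights: list of non-negative weights, one per class.
    """
    total = 0.0
    weight_sum = 0.0
    for frame_logits, y in zip(logits, targets):
        peak = max(frame_logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in frame_logits))
        nll = log_z - frame_logits[y]
        total += class_weights[y] * nll
        weight_sum += class_weights[y]
    return total / weight_sum


# Frame 0 is a confidently predicted head class; frame 1 is an uncertain tail class.
frame_logits = [[2.0, 0.0], [0.0, 0.0]]
frame_targets = [0, 1]

plain = weighted_cross_entropy(frame_logits, frame_targets, [1.0, 1.0])
tail_upweighted = weighted_cross_entropy(frame_logits, frame_targets, [1.0, 3.0])
print(plain, tail_upweighted)
```

Raising the weight of the tail class makes its poorly fit frame dominate the loss, so gradient updates focus on the class that is currently learned worst. An adaptive scheme would recompute such weights during training rather than fixing them in advance.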

Context-Enhanced Memory-Refined Transformer for Online Action Detection

Zhanzhong Pang,Fadime Sener,Angela Yao

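Task: Online action detection (OAD), which detects actions in streaming video using past observations.

Motivation: Existing OAD methods have a training-inference discrepancy that hinders learning: training uses short-term memory of varying lengths, while inference relies on full-length short-term memory.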

Details

Method: Proposes the Context-enhanced Memory-Refined Transformer (CMeRT), which improves frame representations with a context-enhanced encoder and improves performance through near-future generation in a memory-refined decoder. Result: CMeRT achieves state-of-the-art performance in online detection and anticipation on THUMOS'14, CrossTask, and EPIC-Kitchens-100. Conclusion: By resolving the training-inference discrepancy, CMeRT significantly improves online action detection performance. Abstract: Online Action Detection (OAD) detects actions in streaming videos using past observations. State-of-the-art OAD approaches model past observations and their interactions with an anticipated future. The past is encoded using short- and long-term memories to capture immediate and long-range dependencies, while anticipation compensates for missing future context. We identify a training-inference discrepancy in existing OAD methods that hinders learning effectiveness. The training uses varying lengths of short-term memory, while inference relies on a full-length short-term memory. As a remedy, we propose a Context-enhanced Memory-Refined Transformer (CMeRT). CMeRT introduces a context-enhanced encoder to improve frame representations using additional near-past context. It also features a memory-refined decoder to leverage near-future generation to enhance performance. CMeRT achieves state-of-the-art performance in online detection and anticipation on THUMOS'14, CrossTask, and EPIC-Kitchens-100.

NeRFPrior: Learning Neural Radiance Field as a Prior for Indoor Scene Reconstruction

Wenyuan Zhang,Emily Yue-ting Jia,Junsheng Zhou,Baorui Ma,Kanle Shi,Yu-Shen Liu

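Task: Use a neural radiance field (NeRF) as a prior to learn signed distance fields (SDF) via volume rendering for high-quality surface reconstruction.

Motivation: Current priors require large-scale pre-training and provide only geometric clues without considering the importance of color, which limits the quality and efficiency of surface reconstruction.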

Details

Method: Proposes NeRFPrior, which uses a NeRF prior to provide geometric and color clues and optimizes SDF learning with a multi-view consistency constraint and a depth consistency loss. Result: Experiments show that the method outperforms state-of-the-art methods on widely used benchmarks. Conclusion: By combining geometric and color clues, NeRFPrior significantly improves the quality and efficiency of surface reconstruction. Abstract: Recently, it has been shown that priors are vital for neural implicit functions to reconstruct high-quality surfaces from multi-view RGB images. However, current priors require large-scale pre-training, and merely provide geometric clues without considering the importance of color. In this paper, we present NeRFPrior, which adopts a neural radiance field as a prior to learn signed distance fields using volume rendering for surface reconstruction. Our NeRF prior can provide both geometric and color clues, and can also be trained quickly on the same scene without additional data. Based on the NeRF prior, we are able to learn a signed distance function (SDF) by explicitly imposing a multi-view consistency constraint on each ray intersection for surface inference. Specifically, at each ray intersection, we use the density in the prior as a coarse geometry estimation, while using the color near the surface as a clue to check its visibility from another view angle. For the textureless areas where the multi-view consistency constraint does not work well, we further introduce a depth consistency loss with confidence weights to infer the SDF. Our experimental results outperform the state-of-the-art methods under the widely used benchmarks.

MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction

Wenyuan Zhang,Yixiao Yang,Han Huang,Liang Han,Kanle Shi,Yu-Shen Liu

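Task: Propose MonoInstance, a method that explores the uncertainty of monocular depth estimates to provide enhanced geometric priors for neural rendering and reconstruction.

Motivation: In multi-view tasks, monocular depth priors suffer from inconsistent predictions and cross-view inconsistency, and current methods fail to exploit these priors effectively.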

Details

Method: Aligns instance depths from multiple views in a common 3D space, converts the uncertainty of monocular depth into a density measure within noisy point clouds, and introduces a constraint term for high-uncertainty regions. Result: Experiments show that MonoInstance significantly improves reconstruction and novel view synthesis on various benchmarks. Conclusion: MonoInstance is a general strategy that integrates seamlessly into various multi-view neural rendering frameworks and effectively addresses the inconsistency of monocular depth priors. Abstract: Monocular depth priors have been widely adopted by neural rendering in multi-view based tasks such as 3D reconstruction and novel view synthesis. However, due to the inconsistent prediction on each view, how to more effectively leverage monocular cues in a multi-view context remains a challenge. Current methods treat the entire estimated depth map indiscriminately, and use it as ground truth supervision, while ignoring the inherent inaccuracy and cross-view inconsistency in monocular priors. To resolve these issues, we propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors for neural rendering and reconstruction. Our key insight lies in aligning the depths of each segmented instance from multiple views within a common 3D space, thereby casting the uncertainty estimation of monocular depths into a density measure within noisy point clouds. For high-uncertainty areas where depth priors are unreliable, we further introduce a constraint term that encourages the projected instances to align with corresponding instance masks on nearby views. MonoInstance is a versatile strategy which can be seamlessly integrated into various multi-view neural rendering frameworks. Our experimental results demonstrate that MonoInstance significantly improves the performance in both reconstruction and novel view synthesis under various benchmarks.

MaSS13K: A Matting-level Semantic Segmentation Benchmark

Chenxi Xie,Minghan Li,Hui Zeng,Jun Luo,Lei Zhang

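Task: Build MaSS13K, a large-scale, high-resolution semantic segmentation dataset, and propose MaSSFormer, a method designed for high-resolution semantic segmentation.

Motivation: Existing datasets have limited resolution and lack precise mask details and boundaries, so they cannot meet the needs of applications such as image editing and AR/VR.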

Details

Method: Proposes MaSSFormer, which uses an efficient pixel decoder to aggregate high-level semantic features and low-level texture features, together with a new learning paradigm that combines high-quality masks with pseudo labels. Result: The MaSS13K dataset provides precise masks of high complexity, and MaSSFormer performs strongly on the benchmark. Conclusion: The MaSS13K dataset and the MaSSFormer model advance research on high-resolution, high-quality semantic segmentation. Abstract: High-resolution semantic segmentation is essential for applications such as image editing, bokeh imaging, AR/VR, etc. Unfortunately, existing datasets often have limited resolution and lack precise mask details and boundaries. In this work, we build a large-scale, matting-level semantic segmentation dataset, named MaSS13K, which consists of 13,348 real-world images, all at 4K resolution. MaSS13K provides high-quality mask annotations of a number of objects, which are categorized into seven categories: human, vegetation, ground, sky, water, building, and others. MaSS13K features precise masks, with an average mask complexity 20-50 times higher than existing semantic segmentation datasets. We consequently present a method specifically designed for high-resolution semantic segmentation, namely MaSSFormer, which employs an efficient pixel decoder that aggregates high-level semantic features and low-level texture features across three stages, aiming to produce high-resolution masks with minimal computational cost. Finally, we propose a new learning paradigm, which integrates the high-quality masks of the seven given categories with pseudo labels from new classes, enabling MaSSFormer to transfer its accurate segmentation capability to other classes of objects. Our proposed MaSSFormer is comprehensively evaluated on the MaSS13K benchmark together with 14 representative segmentation models. We expect that our meticulously annotated MaSS13K dataset and the MaSSFormer model can facilitate the research of high-resolution and high-quality semantic segmentation. Datasets and codes can be found at https://github.com/xiechenxi99/MaSS13K.

MoST: Efficient Monarch Sparse Tuning for 3D Representation Learning

Xu Han,Yuan Tang,Jinfeng Xu,Xianzhi Li

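Task: Propose Monarch Sparse Tuning (MoST), a reparameterization-based method designed for parameter-efficient fine-tuning in 3D representation learning.

Motivation: Existing adapter-based and prompt-tuning 3D PEFT methods add inference overhead and have limited compatibility.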

Details

Method: Introduces the Point Monarch family of structured matrices to capture local geometric features of irregular point clouds, and reparameterizes the dense update weight matrices as sparse Point Monarch matrices. Result: MoST achieves state-of-the-art results on multiple benchmarks, e.g., 97.5% accuracy on ScanObjectNN (PB_50_RS) and 96.2% on ModelNet40 classification. Conclusion: MoST is a simple, effective, and highly generalizable PEFT method that significantly reduces the parameter count while retaining strong performance. Abstract: We introduce Monarch Sparse Tuning (MoST), the first reparameterization-based parameter-efficient fine-tuning (PEFT) method tailored for 3D representation learning. Unlike existing adapter-based and prompt-tuning 3D PEFT methods, MoST introduces no additional inference overhead and is compatible with many 3D representation learning backbones. At its core, we present a new family of structured matrices for 3D point clouds, Point Monarch, which can capture local geometric features of irregular points while offering high expressiveness. MoST reparameterizes the dense update weight matrices as our sparse Point Monarch matrices, significantly reducing parameters while retaining strong performance. Experiments on various backbones show that MoST is simple, effective, and highly generalizable. It captures local features in point clouds, achieving state-of-the-art results on multiple benchmarks, e.g., 97.5% acc. on ScanObjectNN (PB_50_RS) and 96.2% on ModelNet40 classification, while it can also combine with other matrix decompositions (e.g., Low-rank, Kronecker) to further reduce parameters.
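
The abstract does not define Point Monarch, so the sketch below shows only the generic Monarch-style structure that the name suggests: two block-diagonal factors interleaved with a fixed stride permutation. Treat it as an assumption about the general shape of such matrices, not as the paper's construction; the function names and the parameter-count helper are hypothetical.

```python
def block_diag_matvec(blocks, x):
    # Multiply a block-diagonal matrix (list of b x b blocks) by a vector.
    b = len(blocks[0])
    out = []
    for k, block in enumerate(blocks):
        segment = x[k * b:(k + 1) * b]
        out.extend(sum(block[i][j] * segment[j] for j in range(b)) for i in range(b))
    return out


def stride_permute(x, b):
    # View x as a b x b row-major grid and transpose it (its own inverse).
    return [x[j * b + i] for i in range(b) for j in range(b)]


def monarch_matvec(left_blocks, right_blocks, x):
    # Apply the right factor, permute, apply the left factor, permute back.
    b = len(left_blocks)
    y = block_diag_matvec(right_blocks, x)
    y = stride_permute(y, b)
    y = block_diag_matvec(left_blocks, y)
    return stride_permute(y, b)


def monarch_param_count(b):
    # Two factors, each with b blocks of size b x b, for a (b*b) x (b*b) matrix.
    return 2 * b ** 3


identity = [[1, 0], [0, 1]]
swap = [[0, 1], [1, 0]]
x = [0, 1, 2, 3]
same = monarch_matvec([identity, identity], [identity, identity], x)
mixed = monarch_matvec([swap, swap], [identity, identity], x)
print(same, mixed, monarch_param_count(4), 4 ** 4)
```

For a 16 x 16 update (b = 4) the structured form stores 128 values instead of 256, and the gap widens as the dimension grows. Because the product of the factors is still an ordinary matrix, it can be merged into the frozen weights after training, which is consistent with the abstract's claim of no extra inference overhead.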

DiffusedWrinkles: A Diffusion-Based Model for Data-Driven Garment Animation

Raquel Vidaurre,Elena Garces,Dan Casas

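Task: Propose a data-driven method based on a 2D image diffusion model for generating 3D garment animations.

Motivation: Existing methods (such as fully connected networks, graph neural networks, or generative adversarial networks) have difficulty handling parametric garments with fine wrinkle detail, so a method is needed that synthesizes high-quality 3D animations while being agnostic to the garment mesh topology.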

Details

Method: Represents 3D garment deformations as a 2D layout-consistent texture that encodes 3D offsets relative to a parametric garment template, and trains a novel conditional diffusion model. Result: Synthesizes high-quality 3D animations for a variety of garments and body shapes and can generate temporally coherent sequences. Conclusion: The method has clear advantages in generating high-quality 3D garment animations and adapts flexibly to different poses, shapes, and designs. Abstract: We present a data-driven method for learning to generate animations of 3D garments using a 2D image diffusion model. In contrast to existing methods, typically based on fully connected networks, graph neural networks, or generative adversarial networks, which have difficulty coping with parametric garments with fine wrinkle detail, our approach is able to synthesize high-quality 3D animations for a wide variety of garments and body shapes, while being agnostic to the garment mesh topology. Our key idea is to represent 3D garment deformations as a 2D layout-consistent texture that encodes 3D offsets with respect to a parametric garment template. Using this representation, we encode a large dataset of garments simulated in various motions and shapes and train a novel conditional diffusion model that is able to synthesize high-quality pose-shape-and-design dependent 3D garment deformations. Since our model is generative, we can synthesize various plausible deformations for a given target pose, shape, and design. Additionally, we show that we can further condition our model using an existing garment state, which enables the generation of temporally coherent sequences.

Do Your Best and Get Enough Rest for Continual Learning

Hankyul Kang,Gregor Seifer,Donghyun Lee,Jongbin Ryu

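Task: Improve the continual learning ability of neural networks based on the forgetting curve theory.

Motivation: Address long-term memory retention and catastrophic forgetting in continual learning.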

Details

Method: Proposes the view-batch model, which adjusts the learning schedule to optimize the recall interval, combining a replay method with self-supervised learning. Result: Experiments show that the method significantly improves a range of state-of-the-art continual learning methods. Conclusion: The view-batch model makes effective use of the forgetting curve theory to strengthen the long-term memory of neural networks. Abstract: According to the forgetting curve theory, we can enhance memory retention by learning extensive data and taking adequate rest. This means that in order to effectively retain new knowledge, it is essential to learn it thoroughly and ensure sufficient rest so that our brain can memorize without forgetting. The main takeaway from this theory is that learning extensive data at once necessitates sufficient rest before learning the same data again. This aspect of human long-term memory retention can be effectively utilized to address the continual learning of neural networks. Retaining new knowledge for a long period of time without catastrophic forgetting is the critical problem of continual learning. Therefore, based on Ebbinghaus' theory, we introduce the view-batch model that adjusts the learning schedules to optimize the recall interval between retraining the same samples. The proposed view-batch model allows the network to get enough rest to learn extensive knowledge from the same samples with a recall interval of sufficient length. To this end, we specifically present two approaches: 1) a replay method that guarantees the optimal recall interval, and 2) a self-supervised learning that acquires extensive knowledge from a single training sample at a time. We empirically show that these approaches of our method are aligned with the forgetting curve theory, which can enhance long-term memory. In our experiments, we also demonstrate that our method significantly improves many state-of-the-art continual learning methods in various protocols and scenarios. We open-source this project at https://github.com/hankyul2/ViewBatchModel.

Exploring State Space Model in Wavelet Domain: An Infrared and Visible Image Fusion Network via Wavelet Transform and State Space Model

Tianpei Zhang,Yiming Zhu,Jufeng Zhao,Guangmang Cui,Yuchen Zheng

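Task: Propose Wavelet-Mamba (W-Mamba), which combines the wavelet transform with a state space model (SSM), for infrared and visible image fusion (IVIF).

Motivation: Current methods do not fully combine frequency-domain features with global semantic information, resulting in insufficient global feature extraction and inadequate preservation of local texture details.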

Details

Method: Introduces a Wavelet-SSM module that combines wavelet-based frequency-domain feature extraction with SSM-based global information extraction, and proposes a cross-modal feature attention modulation mechanism. Result: Experiments show that the method surpasses current state-of-the-art methods in both visual quality and quantitative performance. Conclusion: Wavelet-Mamba effectively addresses global and local feature extraction and improves infrared and visible image fusion. Abstract: Deep learning techniques have revolutionized infrared and visible image fusion (IVIF), showing remarkable efficacy on complex scenarios. However, current methods do not fully combine frequency domain features with global semantic information, which will result in suboptimal extraction of global features across modalities and insufficient preservation of local texture details. To address these issues, we propose Wavelet-Mamba (W-Mamba), which integrates wavelet transform with the state-space model (SSM). Specifically, we introduce a Wavelet-SSM module, which incorporates wavelet-based frequency domain feature extraction and global information extraction through SSM, thereby effectively capturing both global and local features. Additionally, we propose a cross-modal feature attention modulation, which facilitates efficient interaction and fusion between different modalities. The experimental results indicate that our method achieves both visually compelling results and superior performance compared to current state-of-the-art methods. Our code is available at https://github.com/Lmmh058/W-Mamba.

PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition

Hongen Liu,Cheng Cui,Yuning Du,Yi Liu,Gang Pan

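Task: Convert mathematical expressions in document images into structured symbolic formats such as LaTeX.

Motivation: Meet the need for mathematical formula recognition in document intelligence with a solution that is both accurate and efficient.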

Details

Method: Proposes PP-FormulaNet, comprising the high-accuracy PP-FormulaNet-L and the high-efficiency PP-FormulaNet-S, and develops a Formula Mining System for data augmentation. Result: PP-FormulaNet-L is 6% more accurate than UniMERNet, and PP-FormulaNet-S is more than 16 times faster. Conclusion: PP-FormulaNet delivers significant gains in both accuracy and efficiency, suits a wide range of document processing scenarios, and its code and models are open source. Abstract: Formula recognition is an important task in document intelligence. It involves converting mathematical expressions from document images into structured symbolic formats that computers can easily work with. LaTeX is the most common format used for this purpose. In this work, we present PP-FormulaNet, a state-of-the-art formula recognition model that excels in both accuracy and efficiency. To meet the diverse needs of applications, we have developed two specialized models: PP-FormulaNet-L, tailored for high-accuracy scenarios, and PP-FormulaNet-S, optimized for high-efficiency contexts. Our extensive evaluations reveal that PP-FormulaNet-L attains accuracy levels that surpass those of prominent models such as UniMERNet by a significant 6%. Conversely, PP-FormulaNet-S operates at speeds that are over 16 times faster. These advancements facilitate seamless integration of PP-FormulaNet into a broad spectrum of document processing environments that involve intricate mathematical formulas. Furthermore, we introduce a Formula Mining System, which is capable of extracting a vast amount of high-quality formula data. This system further enhances the robustness and applicability of our formula recognition model. Code and models are publicly available at PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR) and PaddleX (https://github.com/PaddlePaddle/PaddleX).

LiDAR Remote Sensing Meets Weak Supervision: Concepts, Methods, and Perspectives

Yuan Gao,Shaobo Xia,Pu Wang,Xiaohuan Xi,Sheng Nie,Cheng Wang

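Task: Systematically review the application of weakly supervised learning to interpretation and inversion tasks in LiDAR remote sensing.

Motivation: Interpretation and inversion tasks in LiDAR remote sensing usually depend on costly, time-consuming precise annotations or on scarce supervisory signals, and weakly supervised learning offers a new way to address this.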

Details

Method: Adopts a unified weakly supervised learning perspective to review recent research progress in LiDAR interpretation and inversion. Result: Summarizes the development and application of weakly supervised techniques in LiDAR remote sensing and discusses future research directions. Conclusion: Weakly supervised learning has significant potential in LiDAR remote sensing, and future research should explore its applications further. Abstract: LiDAR (Light Detection and Ranging) enables rapid and accurate acquisition of three-dimensional spatial data, widely applied in remote sensing areas such as surface mapping, environmental monitoring, urban modeling, and forestry inventory. LiDAR remote sensing primarily includes data interpretation and LiDAR-based inversion. However, LiDAR interpretation typically relies on dense and precise annotations, which are costly and time-consuming. Similarly, LiDAR inversion depends on scarce supervisory signals and expensive field surveys for annotations. To address this challenge, weakly supervised learning has gained significant attention in recent years, with many methods emerging to tackle LiDAR remote sensing tasks using incomplete, inaccurate, and inexact annotations, as well as annotations from other domains. Existing review articles treat LiDAR interpretation and inversion as separate tasks. This review, for the first time, adopts a unified weakly supervised learning perspective to systematically examine research on both LiDAR interpretation and inversion. We summarize the latest advancements, provide a comprehensive review of the development and application of weakly supervised techniques in LiDAR remote sensing, and discuss potential future research directions in this field.

Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance

Sicong Feng,Jielong Yang,Li Peng

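Task: Propose a mask-guided video generation method that addresses the high cost, heavy data requirements, and consistency problems of text-to-video generation.

Motivation: Current text-to-video generation models face high training costs, large data requirements, and difficulty keeping the text consistent with the motion of the foreground object.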

Details

Method: Controls video generation through mask motion sequences, uses foreground masks for precise text-position matching and motion trajectory control, and adopts a first-frame sharing strategy and an autoregressive extension approach. Result: Performs well on tasks such as video editing and artistic video generation, with better consistency and quality than existing methods. Conclusion: Mask-guided video generation achieves high-quality, consistent video generation with limited training data. Abstract: Recent advances in diffusion models bring new vitality to visual content creation. However, current text-to-video generation models still face significant challenges such as high training costs, substantial data requirements, and difficulties in maintaining consistency between given text and motion of the foreground object. To address these challenges, we propose mask-guided video generation, which can control video generation through mask motion sequences, while requiring limited training data. Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control. Through mask motion sequences, we guide the video generation process to maintain consistent foreground objects throughout the sequence. Additionally, through a first-frame sharing strategy and autoregressive extension approach, we achieve more stable and longer video generation. Extensive qualitative and quantitative experiments demonstrate that this approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality. Our generated results can be viewed in the supplementary materials.

PDDM: Pseudo Depth Diffusion Model for RGB-PD Semantic Segmentation Based in Complex Indoor Scenes

Xinhua Xu,Hong Liu,Jianbing Wu,Jinfu Liu

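Task: Explore the practicality of using pseudo depth in place of real depth for semantic segmentation, and propose an RGB-PD segmentation pipeline and a Pseudo Depth Aggregation Module (PDAM).

Motivation: RGB-D datasets are expensive to collect and hard to align, whereas pseudo depth removes the dependence on RGB-D sensors and the alignment process.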

Details

Method: Designs an RGB-PD segmentation pipeline and the PDAM module, and proposes a diffusion-based Pseudo Depth Diffusion Model (PDDM). Result: Pseudo depth clearly improves segmentation performance, and PDDM outperforms other methods by +6.98 mIoU on NYUv2 and +2.11 mIoU on SUNRGB-D. Conclusion: Pseudo depth is an effective alternative to real depth, and PDDM performs strongly in semantic segmentation. Abstract: The integration of RGB and depth modalities significantly enhances the accuracy of segmenting complex indoor scenes, with depth data from RGB-D cameras playing a crucial role in this improvement. However, collecting an RGB-D dataset is more expensive than an RGB dataset due to the need for specialized depth sensors. Aligning depth and RGB images also poses challenges due to sensor positioning and issues like missing data and noise. In contrast, Pseudo Depth (PD) from high-precision depth estimation algorithms can eliminate the dependence on RGB-D sensors and alignment processes, as well as provide effective depth information and show significant potential in semantic segmentation. Therefore, to explore the practicality of utilizing pseudo depth instead of real depth for semantic segmentation, we design an RGB-PD segmentation pipeline to integrate RGB and pseudo depth and propose a Pseudo Depth Aggregation Module (PDAM) for fully exploiting the informative clues provided by the diverse pseudo depth maps. The PDAM aggregates multiple pseudo depth maps into a single modality, making it easily adaptable to other RGB-D segmentation methods. In addition, the pre-trained diffusion model serves as a strong feature extractor for RGB segmentation tasks, but multi-modal diffusion-based segmentation methods remain unexplored. Therefore, we present a Pseudo Depth Diffusion Model (PDDM) that adopts a large-scale text-image diffusion model as a feature extractor and a simple yet effective fusion strategy to integrate pseudo depth. To verify the applicability of pseudo depth and our PDDM, we perform extensive experiments on the NYUv2 and SUNRGB-D datasets. The experimental results demonstrate that pseudo depth can effectively enhance segmentation performance, and our PDDM achieves state-of-the-art performance, outperforming other methods by +6.98 mIoU on NYUv2 and +2.11 mIoU on SUNRGB-D.

DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds

Youyu Chen,Junjun Jiang,Kui Jiang,Xiao Tang,Zhihao Li,Xianming Liu,Yinyu Nie

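Task: Propose DashGaussian, a scheduling scheme that accelerates 3D Gaussian Splatting (3DGS) optimization by removing redundant complexity.

Motivation: The time cost of 3DGS optimization is dominated by the rendering resolution and the number of primitives, together termed the optimization complexity; existing methods perform redundant computation.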

Details

Method: Formulates 3DGS optimization as progressively fitting higher-frequency components of the training views and proposes a dynamic rendering resolution scheme that reduces the optimization complexity; also schedules the growth of the primitive count to stay synchronized with the rendering resolution. Result: Experiments show the method accelerates the optimization of various 3DGS backbones by 45.7% on average while preserving rendering quality. Conclusion: By dynamically scheduling the optimization complexity, DashGaussian substantially improves the efficiency of 3DGS optimization without compromising rendering quality. Abstract: 3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where the rendering resolution and the primitive number, together termed the optimization complexity, dominate the time cost in primitive optimization. In this paper, we propose DashGaussian, a scheduling scheme over the optimization complexity of 3DGS that strips redundant complexity to accelerate 3DGS optimization. Specifically, we formulate 3DGS optimization as progressively fitting 3DGS to higher levels of frequency components in the training views, and propose a dynamic rendering resolution scheme that largely reduces the optimization complexity based on this formulation. Besides, we argue that a specific rendering resolution should cooperate with a proper primitive number for a better balance between computing redundancy and fitting quality, where we schedule the growth of the primitives to synchronize with the rendering resolution. Extensive experiments show that our method accelerates the optimization of various 3DGS backbones by 45.7% on average while preserving the rendering quality.

Xusheng Cao,Haori Lu,Linlan Huang,Fei Yang,Xialei Liu,Ming-Ming Cheng

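Task: Propose a Knowledge Graph Enhanced Generative Multi-modal model (KG-GMM) to address catastrophic forgetting in continual learning.

Motivation: In continual learning, models tend to forget previously learned knowledge, which causes misclassification as they adapt to new tasks.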

Details

Method: Builds an evolving knowledge graph, uses relationships in the graph to augment class labels, and applies a Knowledge Graph Augmented Inference method to reduce the loss of information about old classes. Result: Experiments show KG-GMM achieves state-of-the-art results in both conventional and few-shot continual learning settings. Conclusion: Knowledge graphs effectively preserve knowledge and reduce forgetting in continual learning. Abstract: Continual learning in computer vision faces the critical challenge of catastrophic forgetting, where models struggle to retain prior knowledge while adapting to new tasks. Although recent studies have attempted to leverage the generalization capabilities of pre-trained models to mitigate overfitting on current tasks, models still tend to forget details of previously learned categories as tasks progress, leading to misclassification. To address these limitations, we introduce a novel Knowledge Graph Enhanced Generative Multi-modal model (KG-GMM) that builds an evolving knowledge graph throughout the learning process. Our approach utilizes relationships within the knowledge graph to augment the class labels and assigns different relations to similar categories to enhance model differentiation. During testing, we propose a Knowledge Graph Augmented Inference method that locates specific categories by analyzing relationships within the generated text, thereby reducing the loss of detailed information about old classes when learning new knowledge and alleviating forgetting. Experiments demonstrate that our method effectively leverages relational information to help the model correct mispredictions, achieving state-of-the-art results in both conventional CIL and few-shot CIL settings, confirming the efficacy of knowledge graphs at preserving knowledge in continual learning scenarios.

Offline Meteorology-Pollution Coupling Global Air Pollution Forecasting Model with Bilinear Pooling

Xu Fan,Yuetan Lin,Bing Gong,Hao Li

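Task: Develop a deep-learning-based offline coupling framework for global air pollution forecasting.

Motivation: Traditional physics-based models and existing deep learning methods are limited in real-time forecasting efficiency and in their computational resource demands.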

Details

Method: Proposes an offline coupling framework based on bilinear pooling that couples meteorological fields with pollutants offline. Result: The model needs only 13% of the parameters of online coupling models to reach comparable performance, and it outperforms the existing CAMS model on several metrics. Conclusion: Offline coupling of meteorological fields with pollutants significantly reduces forecast error and offers a new paradigm for real-time global air pollution warning systems. Abstract: Air pollution has become a major threat to human health, making accurate forecasting crucial for pollution control. Traditional physics-based models forecast global air pollution by coupling meteorology and pollution processes, using either online or offline methods depending on whether they are fully integrated with meteorological models and run simultaneously. However, the high computational demands of both methods severely limit real-time prediction efficiency. Existing deep learning (DL) solutions employ online coupling strategies for global air pollution forecasting, which finetune pollution forecasting based on pretrained atmospheric models, requiring substantial training resources. This study pioneers a DL-based offline coupling framework that utilizes bilinear pooling to achieve offline coupling between meteorological fields and pollutants. The proposed model requires only 13% of the parameters of DL-based online coupling models while achieving competitive performance. Compared with the state-of-the-art global air pollution forecasting model CAMS, our approach demonstrates superiority in 63% of variables across all forecast time steps and 85% of variables in predictions exceeding 48 hours. This work pioneers experimental validation of the effectiveness of meteorological fields in DL-based global air pollution forecasting, demonstrating that offline coupling meteorological fields with pollutants can achieve a 15% relative reduction in RMSE across all pollution variables. The research establishes a new paradigm for real-time global air pollution warning systems and delivers critical technical support for developing more efficient and comprehensive AI-powered global atmospheric forecasting frameworks.
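
The abstract above names bilinear pooling as the coupling operator but does not spell out its form. As a minimal sketch, assuming the classic formulation (a flattened outer product of two feature vectors, followed by the commonly used signed square root and L2 normalization), the coupling of one meteorological feature vector with one pollutant feature vector might look like this. The function names, feature values, and feature meanings are hypothetical and are not taken from the paper.

```python
import math


def bilinear_pool(met_feat, pol_feat):
    """Flattened outer product: one term per (meteorology, pollutant) feature pair."""
    return [m * p for m in met_feat for p in pol_feat]


def signed_sqrt_l2(vec):
    """Signed square root followed by L2 normalization, a common post-processing step."""
    rooted = [math.copysign(math.sqrt(abs(v)), v) for v in vec]
    norm = math.sqrt(sum(v * v for v in rooted))
    return [v / norm for v in rooted] if norm > 0 else rooted


# Hypothetical per-grid-cell features (values are made up for illustration).
met_feat = [1.0, 2.0]        # e.g. encoded wind and temperature
pol_feat = [3.0, 4.0, 5.0]   # e.g. encoded pollutant concentrations

fused = bilinear_pool(met_feat, pol_feat)
normalized = signed_sqrt_l2(fused)
print(fused)
```

The fused vector has `len(met_feat) * len(pol_feat)` entries, so every pollutant feature interacts multiplicatively with every meteorological feature; this pairwise interaction is what distinguishes bilinear pooling from simple concatenation.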

Sherry X. Chen,Misha Sra,Pradeep Sen

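Task: Propose Instruct-CLIP, a self-supervised method that improves the alignment between instructions and image edits in existing datasets.

Motivation: Existing datasets built with text-to-image generative models contain instructions that are misaligned with the edited results, which degrades the models trained on them.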

Details

Method: Uses Instruct-CLIP to learn the semantic changes between original and edited images, and adapts it to noisy latent images and diffusion timesteps so that it can guide the training of latent diffusion models. Result: Corrects the InstructPix2Pix dataset, yielding over 120K refined samples, and trains an editing model that follows instructions more closely. Conclusion: Instruct-CLIP effectively improves the alignment between instructions and image edits and provides a high-quality dataset and model for related tasks. Abstract: Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to challenges in creating large, high-quality training datasets. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) [19] and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset and get over 120K refined samples, which we then use to fine-tune their model, guided by our novel Instruct-CLIP-based loss function. The resulting model can produce edits that are more aligned with the given instructions. Our code and dataset are available at https://github.com/SherryXTChen/Instruct-CLIP.git.

VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

Wencheng Zhu,Yuexin Wang,Hongxuan Li,Pengfei Zhu,Danqing Song,Qinghua Hu

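Task: Propose a simple yet effective video-to-text discretization framework that improves temporal modeling and generalization in video recognition.

Motivation: Existing methods rely mainly on parameter-efficient fine-tuning of image-text pre-trained models, but suffer from limited interpretability and poor generalization.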

Details

Method: Uses the frozen text encoder to build a visual codebook that converts temporal visual data into textual tokens, and introduces a confidence-aware fusion module and learnable text prompts. Result: Validated on HMDB-51, UCF-101, SSv2, and Kinetics-400, where it outperforms state-of-the-art methods. Conclusion: Through explicit video modeling and adaptive codebook updates, the framework significantly improves both the performance and the interpretability of video recognition. Abstract: Vision-language models bridge visual and linguistic understanding and have proven to be powerful for video recognition tasks. Existing approaches primarily rely on parameter-efficient fine-tuning of image-text pre-trained models, yet they often suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these issues, we propose a simple yet effective video-to-text discretization framework. Our method repurposes the frozen text encoder to construct a visual codebook from video class labels due to the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. This codebook effectively transforms temporal visual data into textual tokens via feature lookups and offers interpretable video representations through explicit video modeling. Then, to enhance robustness against irrelevant or noisy frames, we introduce a confidence-aware fusion module that dynamically weights keyframes by assessing their semantic relevance via the codebook. Furthermore, our method incorporates learnable text prompts to conduct adaptive codebook updates. Extensive experiments on HMDB-51, UCF-101, SSv2, and Kinetics-400 have validated the superiority of our approach, achieving more competitive improvements over state-of-the-art methods. The code will be publicly available at https://github.com/isxinxin/VTD-CLIP.
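
To make the "feature lookup" and "confidence-aware fusion" ideas in this abstract more concrete, here is a minimal sketch under stated assumptions: the codebook is a list of text embeddings (one per class label), each frame is discretized to its most similar codebook entry by cosine similarity, and the softmax probability of that entry is used as the frame's confidence weight. The function names, the temperature value, and the toy vectors are all hypothetical; the paper's actual modules may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(values, temperature):
    """Numerically stable softmax with a temperature."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def discretize(frames, codebook, temperature=0.1):
    """Map each frame to its nearest codebook index and a confidence score."""
    tokens, confidences = [], []
    for frame in frames:
        probs = softmax([cosine(frame, code) for code in codebook], temperature)
        best = max(range(len(codebook)), key=lambda i: probs[i])
        tokens.append(best)
        confidences.append(probs[best])
    return tokens, confidences


def fuse(frames, confidences):
    """Confidence-weighted average of frame features."""
    total = sum(confidences)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(confidences, frames)) / total
            for d in range(dim)]


# Toy data: two class-label embeddings and three frame features.
codebook = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

tokens, confidences = discretize(frames, codebook)
video_feature = fuse(frames, confidences)
print(tokens, confidences)
```

In this toy case the third frame is equally similar to both codebook entries, so it receives the lowest confidence and contributes least to the fused video feature, which is the intended effect of down-weighting ambiguous or noisy frames.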

Fast and Physically-based Neural Explicit Surface for Relightable Human Avatars

Jiacheng Wu,Ruiqi Zhang,Jie Chen,Hui Zhang

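Task: Efficiently model relightable human avatars from sparse-view videos.

Motivation: Current methods represent dynamic geometry and reflectance with neural implicit representations, which are costly because volume rendering requires dense sampling.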

Details

Method: Proposes Physically-based Neural Explicit Surface (PhyNES), which uses compact neural material maps and connects signed distance fields to explicit surfaces for efficient geometry inference. Result: PhyNES matches SOTA methods in relighting quality while significantly improving rendering speed, memory efficiency, and reconstruction quality. Conclusion: PhyNES efficiently models dynamic geometry and materials with 2D neural representations and supports real-time physically-based rendering, making it suitable for AR/VR applications. Abstract: Efficiently modeling relightable human avatars from sparse-view videos is crucial for AR/VR applications. Current methods use neural implicit representations to capture dynamic geometry and reflectance, which incur high costs due to the need for dense sampling in volume rendering. To overcome these challenges, we introduce Physically-based Neural Explicit Surface (PhyNES), which employs compact neural material maps based on the Neural Explicit Surface (NES) representation. PhyNES organizes human models in a compact 2D space, enhancing material disentanglement efficiency. By connecting Signed Distance Fields to explicit surfaces, PhyNES enables efficient geometry inference around a parameterized human shape model. This approach models dynamic geometry, texture, and material maps as 2D neural representations, enabling efficient rasterization. PhyNES effectively captures physical surface attributes under varying illumination, enabling real-time physically-based rendering. Experiments show that PhyNES achieves relighting quality comparable to SOTA methods while significantly improving rendering speed, memory efficiency, and reconstruction quality.

U-REPA: Aligning Diffusion U-Nets to ViTs

Yuchuan Tian,Hanting Chen,Mengyu Zheng,Yuchen Liang,Chao Xu,Yunhe Wang

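Task: Propose U-REPA, a representation alignment paradigm that addresses the challenges of adapting REPA to U-Net architectures.

Motivation: REPA performs very well in DiT training but has not been validated on U-Net architectures, which converge faster. Adapting it raises three challenges: differing block functionalities, inconsistent spatial dimensions, and a space gap between U-Net and ViT.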

Details

Method: Proposes U-REPA, which selects the middle stage of the U-Net as the alignment point, upsamples U-Net features after passing them through MLPs, and introduces a manifold loss to address the difficulty of tokenwise similarity alignment. Result: U-REPA achieves excellent generation quality and convergence speed, reaching FID < 1.5 on ImageNet 256×256 and surpassing REPA with only half the training epochs. Conclusion: U-REPA successfully adapts REPA to U-Net and significantly improves generation efficiency and performance. Abstract: Representation Alignment (REPA) that aligns Diffusion Transformer (DiT) hidden-states with ViT visual encoders has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture that shows faster convergence compared to DiTs. However, adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) space gaps between U-Net and ViT hinder the effectiveness of tokenwise alignment. To counter these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows: First, we observe that, due to skip connections, the middle stage of U-Net is the best alignment option. Second, we propose upsampling U-Net features after passing them through MLPs. Third, we observe difficulty when performing tokenwise similarity alignment, and further introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA achieves excellent generation quality and greatly accelerates the convergence speed. With CFG guidance interval, U-REPA could reach $FID<1.5$ in 200 epochs or 1M iterations on ImageNet 256 $\times$ 256, and needs only half the total epochs to perform better than REPA. Codes are available at https://github.com/YuchuanTian/U-REPA.
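
The abstract describes the manifold loss only as a term that "regularizes the relative similarity between samples". One plausible reading, sketched below purely as an assumption, is to compare the sample-to-sample cosine similarity matrix of the U-Net features with that of the ViT features and penalize their mean squared difference. The function names and toy features are hypothetical, and the paper's actual loss may be defined differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_matrix(features):
    """Pairwise cosine similarities between all samples in a batch."""
    return [[cosine(f_i, f_j) for f_j in features] for f_i in features]


def manifold_loss(unet_features, vit_features):
    """Mean squared difference between the two sample-similarity matrices."""
    sim_unet = similarity_matrix(unet_features)
    sim_vit = similarity_matrix(vit_features)
    n = len(sim_unet)
    return sum((sim_unet[i][j] - sim_vit[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


# Toy batch of two samples; the two feature spaces have different widths.
unet_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
vit_same_structure = [[2.0, 0.0], [0.0, 3.0]]   # samples orthogonal, as in U-Net
vit_collapsed = [[1.0, 0.0], [1.0, 0.0]]        # samples identical

loss_matched = manifold_loss(unet_features, vit_same_structure)
loss_mismatched = manifold_loss(unet_features, vit_collapsed)
print(loss_matched, loss_mismatched)
```

Because only sample-to-sample similarities are compared, the two feature spaces do not need matching dimensions or a token-by-token correspondence, which is one way such a loss could sidestep the tokenwise alignment difficulty the abstract mentions.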

Panorama Generation From NFoV Image Done Right

Dian Zheng,Cheng Zhang,Xiao-Ming Wu,Cao Li,Chengfei Lv,Jian-Fang Hu,Wei-Shi Zheng

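Task: Generate 360-degree panoramas from narrow field of view (NFoV) images.

Motivation: Existing methods mostly evaluate generated panoramas with InceptionNet- or CLIP-based metrics, which favor perceptual image quality and are not suited to evaluating distortion.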

Details

Method: Proposes Distort-CLIP to evaluate distortion and uses it to reveal a "visual cheating" phenomenon; proposes the PanoDecouple framework, which decouples panorama generation into distortion guidance and content completion. Result: PanoDecouple outperforms existing methods on both distortion and visual metrics. Conclusion: The decoupled approach effectively resolves the trade-off between distortion accuracy and visual quality. Abstract: Generating 360-degree panoramas from a narrow field of view (NFoV) image is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet or CLIP based metrics, which tend to perceive the image quality and are not suitable for evaluating the distortion. In this work, we first propose a distortion-specific CLIP, named Distort-CLIP, to accurately evaluate the panorama distortion and discover the "visual cheating" phenomenon in previous works (i.e., tending to improve the visual results by sacrificing distortion accuracy). This phenomenon arises because prior methods employ a single network to learn the distinct panorama distortion and content completion at once, which leads the model to prioritize optimizing the latter. To address the phenomenon, we propose PanoDecouple, a decoupled diffusion model framework, which decouples the panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. Specifically, we design a DistortNet for distortion guidance by imposing panorama-specific distortion prior and a modified condition registration mechanism; and a ContentNet for content completion by imposing perspective image information. Additionally, a distortion correction loss function with Distort-CLIP is introduced to constrain the distortion explicitly. The extensive experiments validate that PanoDecouple surpasses existing methods both in distortion and visual metrics.

4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video

Qiang Hu,Zihan Zheng,Houqiang Zhong,Sihua Fu,Li Song,Xiaoyun Zhang,Guangtao Zhai,Yanfeng Wang

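Task: Propose 4DGC, a rate-aware 4D Gaussian compression framework that reduces the storage and transmission requirements of dynamic 3D Gaussian Splatting (3DGS).

Motivation: When handling dynamic 3DGS representation and compression, existing methods neglect motion information and the rate-distortion (RD) trade-off, which degrades performance and adds model redundancy.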

Details

Method: Introduces a motion-aware dynamic Gaussian representation that combines a compact motion grid with sparse compensated Gaussians, and adopts an end-to-end compression scheme with differentiable quantization and an implicit entropy model. Result: 4DGC significantly reduces storage size while outperforming existing methods in RD performance across multiple datasets. Conclusion: By jointly optimizing the rate-distortion trade-off, 4DGC effectively addresses dynamic 3DGS compression and provides an efficient solution for free-viewpoint video (FVV). Abstract: 3D Gaussian Splatting (3DGS) has substantial potential for enabling photorealistic Free-Viewpoint Video (FVV) experiences. However, the vast number of Gaussians and their associated attributes poses significant challenges for storage and transmission. Existing methods typically handle dynamic 3DGS representation and compression separately, neglecting motion information and the rate-distortion (RD) trade-off during training, leading to performance degradation and increased model redundancy. To address this gap, we propose 4DGC, a novel rate-aware 4D Gaussian compression framework that significantly reduces storage size while maintaining superior RD performance for FVV. Specifically, 4DGC introduces a motion-aware dynamic Gaussian representation that utilizes a compact motion grid combined with sparse compensated Gaussians to exploit inter-frame similarities. This representation effectively handles large motions, preserving quality and reducing temporal redundancy. Furthermore, we present an end-to-end compression scheme that employs differentiable quantization and a tiny implicit entropy model to compress the motion grid and compensated Gaussians efficiently. The entire framework is jointly optimized using a rate-distortion trade-off. Extensive experiments demonstrate that 4DGC supports variable bitrates and consistently outperforms existing methods in RD performance across multiple datasets.

Breaking the Encoder Barrier for Seamless Video-Language Understanding

Handong Li,Yiyuan Zhang,Longteng Guo,Xiangyu Yue,Jing Liu

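Task: Propose ELVA, a video large language model (Video-LLM) that needs no vision encoder and directly models fine-grained video-language interactions.

Motivation: Existing Video-LLMs built on an encoder-decoder framework are computationally expensive, suffer from resolution biases, and struggle to capture fine-grained multimodal interactions.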

Details

Method: Uses token merging to build a bottom-up hierarchical representation, introduces a video guidance supervisor for direct spatiotemporal representation learning, and balances performance and efficiency with a hybrid-resolution mechanism. Result: With only 7M publicly available video-text pairs, ELVA performs on par with encoder-based Video-LLMs while cutting computation by up to 95% and inference latency by 92%. Conclusion: ELVA offers a scalable and efficient solution for real-time video understanding. Abstract: Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces resolution biases, and struggles to capture fine-grained multimodal interactions. To overcome these limitations, we propose ELVA, an encoder-free Video-LLM that directly models nuanced video-language interactions without relying on a vision encoder. ELVA employs token merging to construct a bottom-up hierarchical representation and incorporates a video guidance supervisor for direct spatiotemporal representation learning. Additionally, a hybrid-resolution mechanism strategically integrates high- and low-resolution frames as inputs to achieve an optimal balance between performance and efficiency. With only 7M publicly available video-text pairs, ELVA achieves performance on par with encoder-based Video-LLMs while reducing FLOPs by up to 95% and inference latency by 92%, offering a scalable and efficient solution for real-time video understanding.

Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation

Dingcheng Zhen,Shunshun Yin,Shiyang Qin,Hou Yi,Ziwei Zhang,Siyuan Liu,Gan Qi,Ming Tao

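Task: Propose Teller, the first autoregressive framework for real-time audio-driven portrait animation (i.e., talking head generation).

Motivation: Address two challenges in realistic talking head generation: lengthy animation times and preserving the natural movement of body parts.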

Details

Method: Teller generates motion autoregressively through Facial Motion Latent Generation (FMLG) and refines movement authenticity with an Efficient Temporal Module (ETM). FMLG uses a Residual VQ model to map facial motion latents to discrete motion tokens and combines them with audio embeddings; ETM captures finer motion details. Result: Teller is markedly faster at inference than diffusion-based models (0.92 s to generate 1 s of video) and reaches real-time streaming at up to 25 FPS. Experiments show it outperforms existing methods in quality and realism. Conclusion: Teller is the first efficient, real-time audio-driven portrait animation framework and excels in motion detail and realism. Abstract: In this work, we introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a. talking head. Beyond the challenge of lengthy animation times, a critical challenge in realistic talking head generation lies in preserving the natural movement of diverse body parts. To this end, we propose Teller, the first streaming audio-driven portrait animation framework with autoregressive motion generation. Specifically, Teller first decomposes facial and body detail animation into two components: Facial Motion Latent Generation (FMLG) based on an autoregressive transformer, and movement authenticity refinement using an Efficient Temporal Module (ETM). Concretely, FMLG employs a Residual VQ model to map the facial motion latent from the implicit keypoint-based model into discrete motion tokens, which are then temporally sliced with audio embeddings. This enables the AR transformer to learn real-time, stream-based mappings from audio to motion. Furthermore, Teller incorporates ETM to capture finer motion details. This module ensures the physical consistency of body parts and accessories, such as neck muscles and earrings, improving the realism of these movements. Teller is designed to be efficient, surpassing the inference speed of diffusion-based models (Hallo 20.93s vs. Teller 0.92s for one second video generation), and achieves a real-time streaming performance of up to 25 FPS. Extensive experiments demonstrate that our method outperforms recent audio-driven portrait animation models, especially in small movements, as validated by human evaluations with a significant margin in quality and realism.
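
The abstract relies on a Residual VQ model to turn continuous motion latents into discrete tokens. The sketch below shows generic residual vector quantization, not Teller's specific model: each stage picks the nearest codeword for the current residual and passes what is left to the next stage. The codebooks, the latent vector, and the function names are hypothetical toy values.

```python
def nearest_index(vector, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])))


def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        index = nearest_index(residual, codebook)
        tokens.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens, residual


def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    reconstruction = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(tokens, codebooks):
        reconstruction = [r + c for r, c in zip(reconstruction, codebook[index])]
    return reconstruction


# Toy two-stage codebooks: a coarse stage followed by a fine correction stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]],
]
motion_latent = [1.2, 1.0]  # hypothetical facial motion latent

tokens, leftover = rvq_encode(motion_latent, codebooks)
approx = rvq_decode(tokens, codebooks)
print(tokens, approx)
```

The token list (one index per stage) is the discrete representation an autoregressive transformer could be trained to predict, and adding stages shrinks the leftover residual at the cost of more tokens per latent.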

CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection

Zhichao Sun,Huazhang Hu,Yidong Ma,Gang Liu,Nemo Chen,Xu Tang,Yongchao Xu

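Task: Propose CQ-DINO, a category-query-based object detection framework that addresses the limitations of conventional classification-based detectors on vast vocabulary object detection.

Motivation: On vast vocabulary detection tasks, conventional classification-based detectors suffer from positive gradient dilution and hard negative gradient dilution, which degrade performance.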

Details

Method: Reformulates classification as a contrastive task between object queries and learnable category queries, and uses image-guided query selection to shrink the negative space and rebalance gradient distributions. Result: Improves performance on the V3Det benchmark by 2.1% AP while remaining competitive on COCO. Conclusion: CQ-DINO provides a scalable solution for real-world detection systems that require wide category coverage. Abstract: With the exponential growth of data, traditional object detection methods are increasingly struggling to handle vast vocabulary object detection tasks effectively. We analyze two key limitations of classification-based detectors: positive gradient dilution, where rare positive categories receive insufficient learning signals, and hard negative gradient dilution, where discriminative gradients are overwhelmed by numerous easy negatives. To address these challenges, we propose CQ-DINO, a category query-based object detection framework that reformulates classification as a contrastive task between object queries and learnable category queries. Our method introduces image-guided query selection, which reduces the negative space by adaptively retrieving top-K relevant categories per image via cross-attention, thereby rebalancing gradient distributions and facilitating implicit hard example mining. Furthermore, CQ-DINO flexibly integrates explicit hierarchical category relationships in structured datasets (e.g., V3Det) or learns implicit category correlations via self-attention in generic datasets (e.g., COCO). Experiments demonstrate that CQ-DINO achieves superior performance on the challenging V3Det benchmark (surpassing previous methods by 2.1% AP) while maintaining competitiveness on COCO. Our work provides a scalable solution for real-world detection systems requiring wide category coverage. The dataset and code will be publicly available at https://github.com/RedAIGC/CQ-DINO.
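
The two ideas in this abstract, top-K category selection per image and classification as a similarity contest between object queries and category queries, can be sketched as follows. This is a deliberately simplified illustration: relevance is scored with a plain dot product against a pooled image feature rather than the cross-attention the abstract mentions, and all names and vectors are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_top_k(image_feature, category_queries, k):
    """Keep the k categories most relevant to this image (simplified scoring)."""
    scores = [dot(image_feature, query) for query in category_queries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def classify(object_query, category_queries, selected):
    """Score an object query only against the selected category queries."""
    logits = {i: dot(object_query, category_queries[i]) for i in selected}
    best = max(logits, key=logits.get)
    return best, logits


# Toy vocabulary of four learnable category queries (values are made up).
category_queries = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
image_feature = [0.9, 0.8, 0.0]   # hypothetical pooled image feature
object_query = [1.0, 0.1, 0.0]    # hypothetical decoder object query

selected = select_top_k(image_feature, category_queries, k=2)
predicted, logits = classify(object_query, category_queries, selected)
print(selected, predicted)
```

With K = 2 the object query competes against one negative instead of three, which illustrates how shrinking the candidate set reduces the number of easy negatives contributing to the gradient in a vast vocabulary.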

A Simple yet Effective Layout Token in Large Language Models for Document Understanding

Zhaoqing Zhu,Chuwei Luo,Zirui Shao,Feiyu Gao,Hangdi Xing,Qi Zheng,Ji Zhang

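Task: Propose LayTokenLLM, a new method that addresses the limitations of existing approaches to combining layout and text in document understanding.

Motivation: Existing methods represent layout information as text tokens interleaved with the text content, which requires additional position IDs. This limits the model's capacity to learn from text and introduces untrained position IDs during long-context inference, hurting performance.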

Details

Method: LayTokenLLM represents layout information as a single token per text segment and uses a specialized positional encoding scheme that shares position IDs between text and layout tokens, so no additional position IDs are needed. It also introduces a new pre-training objective, NTLP, to strengthen cross-modality learning between text and layout tokens. Result: Experiments show LayTokenLLM outperforms existing layout-integrated LLMs and multimodal LLMs on multi-page document understanding tasks and on most single-page tasks. Conclusion: By simplifying the layout representation and sharing position IDs, LayTokenLLM improves document understanding performance while resolving the long-context inference problem. Abstract: Recent methods that integrate spatial layouts with text for document understanding in large language models (LLMs) have shown promising results. A commonly used method is to represent layout information as text tokens and interleave them with text content as inputs to the LLMs. However, such a method still demonstrates limitations, as it requires additional position IDs for tokens that are used to represent layout information. Due to the constraint on max position IDs, assigning them to layout information reduces those available for text content, reducing the capacity for the model to learn from the text during training, while also introducing a large number of potentially untrained position IDs during long-context inference, which can hinder performance on document understanding tasks. To address these issues, we propose LayTokenLLM, a simple yet effective method for document understanding. LayTokenLLM represents layout information as a single token per text segment and uses a specialized positional encoding scheme. It shares position IDs between text and layout tokens, eliminating the need for additional position IDs. This design maintains the model's capacity to learn from text while mitigating long-context issues during inference. Furthermore, a novel pre-training objective called Next Interleaved Text and Layout Token Prediction (NTLP) is devised to enhance cross-modality learning between text and layout tokens. Extensive experiments show that LayTokenLLM outperforms existing layout-integrated LLMs and MLLMs of similar scales on multi-page document understanding tasks, as well as most single-page tasks.
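
The position-ID budget argument in this abstract can be illustrated with a small counting sketch. The abstract does not say exactly how position IDs are shared, so the scheme below is an assumption: each segment's single layout token reuses the position ID of that segment's first text token. The baseline assumes a fixed number of layout text tokens per segment, each consuming its own position ID. Function names and segment sizes are hypothetical.

```python
def interleaved_position_ids(segment_lengths, layout_tokens_per_segment):
    """Baseline: layout is written as text tokens, each taking a fresh position ID."""
    ids = []
    next_id = 0
    for text_length in segment_lengths:
        for _ in range(layout_tokens_per_segment + text_length):
            ids.append(next_id)
            next_id += 1
    return ids


def shared_position_ids(segment_lengths):
    """Assumed sharing scheme: one layout token per segment reuses the ID
    of the first text token in that segment."""
    ids = []
    next_id = 0
    for text_length in segment_lengths:
        ids.append(next_id)              # layout token (shared ID)
        for _ in range(text_length):
            ids.append(next_id)          # text token
            next_id += 1
    return ids


# Two text segments with 3 and 2 text tokens; the baseline spends
# 4 tokens per bounding box (a made-up figure for illustration).
segment_lengths = [3, 2]

baseline_ids = interleaved_position_ids(segment_lengths, layout_tokens_per_segment=4)
shared_ids = shared_position_ids(segment_lengths)
print(max(baseline_ids) + 1, max(shared_ids) + 1)
```

In this toy document the baseline consumes 13 position IDs for 5 text tokens, while the shared scheme consumes exactly 5, so the full position range stays available for text regardless of how many segments the document has.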

On the Perception Bottleneck of VLMs for Chart Understanding

Junteng Liu,Weihao Zeng,Xiwen Zhang,Yijun Wang,Zifei Shan,Junxian He

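Task: Study the perception bottleneck of large vision-language models (LVLMs) in chart understanding.

Motivation: The perception capability of existing LVLMs is a critical bottleneck in chart understanding, limiting their ability to analyze and reason about numerical data, textual elements, and complex visual components.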

Details

Method: Decomposes the perception bottleneck into a vision encoder bottleneck and an extraction bottleneck, and enhances the vision encoder under a contrastive learning framework. Result: Experiments show that visual representations embed richer information than linear extractors capture, and that enhancing the vision encoder significantly mitigates the perception bottleneck and improves LVLMs' chart understanding. Conclusion: Improving the vision encoder effectively mitigates the perception bottleneck of LVLMs in chart understanding and improves model performance. Abstract: Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded within visual representations is substantially richer than what is typically captured by linear extractors, such as the widely used retrieval accuracy metric; (2) while instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck, demanding focused attention and improvement. Therefore, we further enhance the visual encoder to mitigate the vision encoder bottleneck under a contrastive learning framework. Empirical results demonstrate that our approach significantly mitigates the perception bottleneck and improves the ability of LVLMs to comprehend charts. Code is publicly available at https://github.com/hkust-nlp/Vision4Chart.

ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation

Guosheng Zhao,Xiaofeng Wang,Chaojun Ni,Zheng Zhu,Wenkang Qin,Guan Huang,Xingang Wang

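Task: Propose ReconDreamer++, a framework that raises rendering quality in closed-loop autonomous driving simulation by improving the fidelity of generated data and the representation of structured elements such as the ground surface.

Motivation: Existing methods such as ReconDreamer leave a significant gap between generated data and real sensor observations, particularly in the fidelity of structured elements.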

Details

Method: Introduces the Novel Trajectory Deformable Network (NTDNet), which uses learnable spatial deformation mechanisms, and preserves geometric prior knowledge in 3D Gaussians while optimizing appearance attributes. Result: Validated on multiple datasets; on Waymo in particular, its performance is comparable to Street Gaussians on the original trajectory and it significantly outperforms ReconDreamer on novel trajectories. Conclusion: By narrowing the domain gap and refining the representation of structured elements, ReconDreamer++ significantly improves rendering quality, especially in reconstructing structured elements such as the road surface. Abstract: Combining reconstruction models with generative models has emerged as a promising paradigm for closed-loop simulation in autonomous driving. For example, ReconDreamer has demonstrated remarkable success in rendering large-scale maneuvers. However, a significant gap remains between the generated data and real-world sensor observations, particularly in terms of fidelity for structured elements, such as the ground surface. To address these challenges, we propose ReconDreamer++, an enhanced framework that significantly improves the overall rendering quality by mitigating the domain gap and refining the representation of the ground surface. Specifically, ReconDreamer++ introduces the Novel Trajectory Deformable Network (NTDNet), which leverages learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and original sensor observations. Moreover, for structured elements such as the ground surface, we preserve geometric prior knowledge in 3D Gaussians, and the optimization process focuses on refining appearance attributes while preserving the underlying geometric structure. Experimental evaluations conducted on multiple datasets (Waymo, nuScenes, PandaSet, and EUVS) confirm the superior performance of ReconDreamer++. Specifically, on Waymo, ReconDreamer++ achieves performance comparable to Street Gaussians for the original trajectory while significantly outperforming ReconDreamer on novel trajectories. In particular, it achieves substantial improvements, including a 6.1% increase in NTA-IoU, a 23.0% improvement in FID, and a remarkable 4.5% gain in the ground surface metric NTL-IoU, highlighting its effectiveness in accurately reconstructing structured elements such as the road surface.

Chenfei Liao,Kaiyu Lei,Xu Zheng,Junha Moon,Zhixiong Wang,Yixuan Wang,Danda Pani Paudel,Luc Van Gool,Xuming Hu

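Task: Propose a robustness benchmark for multi-modal semantic segmentation (MMSS).

Motivation: A gap remains between existing research and real-world deployment, mainly due to variability and uncertainty in multi-modal data quality, and there is no standardized benchmark for evaluating robustness.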

Details

Method: Surveys the existing literature and categorizes representative methods, then proposes a benchmark that evaluates models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM). Result: Proposes four metrics ($mIoU^{Avg}_{EMM}$, $mIoU^{E}_{EMM}$, $mIoU^{Avg}_{RMM}$, and $mIoU^{E}_{RMM}$) to quantify model robustness. Conclusion: This work provides the first dedicated benchmark for MMSS robustness, offering new insights and tools for the field. Abstract: Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality. Robustness has thus become essential for practical MMSS applications. However, the absence of standardized benchmarks for evaluating robustness hinders further advancement. To address this, we first survey existing MMSS literature and categorize representative methods to provide a structured overview. We then introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM). From a probabilistic standpoint, we model modality failure under two conditions: (1) all damaged combinations are equally probable; (2) each modality fails independently following a Bernoulli distribution. Based on these, we propose four metrics ($mIoU^{Avg}_{EMM}$, $mIoU^{E}_{EMM}$, $mIoU^{Avg}_{RMM}$, and $mIoU^{E}_{RMM}$) to assess model robustness under EMM and RMM. This work provides the first dedicated benchmark for MMSS robustness, offering new insights and tools to advance the field. Source code is available at https://github.com/Chenfei-Liao/Multi-Modal-Semantic-Segmentation-Robustness-Benchmark.
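
The two failure models in the abstract (equally probable damaged combinations, and independent Bernoulli failures) can be illustrated with a small sketch. This is not the benchmark's implementation: the set of "damaged combinations" is assumed here to be every subset with at least one modality missing and at least one surviving, the Bernoulli weights are renormalised over that set, and the modality names and mIoU scores are made up for illustration.

```python
from itertools import combinations


def damaged_subsets(modalities):
    """Surviving-modality subsets with at least one modality missing and one left.

    Assumption: this is what "damaged combinations" means; the paper may differ.
    """
    n = len(modalities)
    return [frozenset(c) for k in range(1, n) for c in combinations(modalities, k)]


def miou_avg(miou_by_subset, modalities):
    """Condition (1): every damaged combination is equally probable."""
    subsets = damaged_subsets(modalities)
    return sum(miou_by_subset[s] for s in subsets) / len(subsets)


def miou_expected(miou_by_subset, modalities, p_fail):
    """Condition (2): each modality fails independently with probability p_fail.

    Weights are renormalised over the damaged subsets (an assumption).
    """
    n = len(modalities)
    subsets = damaged_subsets(modalities)
    weights = {s: (1 - p_fail) ** len(s) * p_fail ** (n - len(s)) for s in subsets}
    total = sum(weights.values())
    return sum(weights[s] * miou_by_subset[s] for s in subsets) / total


# Hypothetical modalities and scores, for illustration only.
modalities = ["rgb", "depth", "lidar"]
scores = {
    frozenset(["rgb"]): 50.0,
    frozenset(["depth"]): 30.0,
    frozenset(["lidar"]): 40.0,
    frozenset(["rgb", "depth"]): 60.0,
    frozenset(["rgb", "lidar"]): 62.0,
    frozenset(["depth", "lidar"]): 48.0,
}

print(miou_avg(scores, modalities))
print(miou_expected(scores, modalities, p_fail=0.1))
```

With a failure probability of 0.5 every subset receives the same weight, so the two quantities coincide; a small failure probability shifts the weight toward subsets that lose only one modality.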

Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models

Jinho Jeong,Sangmin Han,Jinwoo Kim,Seon Joo Kim

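Task: Propose LSRNA, a new framework for higher-resolution (exceeding 1K) image generation via super-resolution in the latent space.

Motivation: Existing diffusion models tend to produce structural distortions or content repetition beyond their training resolution, while reference-based methods suffer from quality degradation when upsampling in latent space or over-smoothing when upsampling in RGB space.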

Details

Method: Combines Latent space Super-Resolution (LSR) for manifold alignment with Region-wise Noise Addition (RNA) to enhance high-frequency details. Result: Experiments show that LSRNA outperforms existing reference-based methods across various resolutions and metrics, and confirm the critical role of latent space upsampling in preserving detail and sharpness. Conclusion: Through latent space super-resolution and region-wise noise addition, LSRNA effectively addresses the problems of higher-resolution image generation and improves generation quality. Abstract: In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation using diffusion models by leveraging super-resolution directly in the latent space. Existing diffusion models struggle with scaling beyond their training resolutions, often leading to structural distortions or content repetition. Reference-based methods address these issues by upsampling a low-resolution reference to guide higher-resolution generation. However, they face significant challenges: upsampling in latent space often causes manifold deviation, which degrades output quality. On the other hand, upsampling in RGB space tends to produce overly smoothed outputs. To overcome these limitations, LSRNA combines Latent space Super-Resolution (LSR) for manifold alignment and Region-wise Noise Addition (RNA) to enhance high-frequency details. Our extensive experiments demonstrate that integrating LSRNA outperforms state-of-the-art reference-based methods across various resolutions and metrics, while showing the critical role of latent space upsampling in preserving detail and sharpness. The code is available at https://github.com/3587jjh/LSRNA.

InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment

Yunhong Lu,Qichao Wang,Hengyuan Cao,Xierui Wang,Xiaoyin Xu,Min Zhang

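Task: Align text-to-image (T2I) diffusion models with human preferences using direct preference optimization (DPO).

Motivation: Existing methods suffer from low training efficiency and subpar generation quality, mainly due to the long Markov chain of diffusion models and the intractability of the reverse process.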

Details

Method: Proposes DDIM-InPO, which treats the diffusion model as a single-step generative model and uses reparameterization and inversion techniques to selectively fine-tune the outputs of specific latent variables. Result: Experiments show that DDIM-InPO surpasses all baselines on human preference evaluation tasks with only 400 fine-tuning steps. Conclusion: DDIM-InPO is an efficient preference alignment method for diffusion models that significantly improves training efficiency and generation quality. Abstract: Without using explicit reward, direct preference optimization (DPO) employs paired human preference data to fine-tune generative models, a method that has garnered considerable attention in large language models (LLMs). However, exploration of aligning text-to-image (T2I) diffusion models with human preferences remains limited. In comparison to supervised fine-tuning, existing methods that align diffusion models suffer from low training efficiency and subpar generation quality due to the long Markov chain process and the intractability of the reverse process. To address these limitations, we introduce DDIM-InPO, an efficient method for direct preference alignment of diffusion models. Our approach conceptualizes the diffusion model as a single-step generative model, allowing us to fine-tune the outputs of specific latent variables selectively. In order to accomplish this objective, we first assign implicit rewards to any latent variable directly via a reparameterization technique. Then we construct an Inversion technique to estimate appropriate latent variables for preference optimization. This modification process enables the diffusion model to only fine-tune the outputs of latent variables that have a strong correlation with the preference dataset. Experimental results indicate that our DDIM-InPO achieves state-of-the-art performance with just 400 steps of fine-tuning, surpassing all preference aligning baselines for T2I diffusion models in human preference evaluation tasks.
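
For readers unfamiliar with DPO, the sketch below shows the generic pairwise DPO objective as it is usually written for language models: the implicit reward of a sample is the scaled log-ratio between the fine-tuned model and a frozen reference. It is not the DDIM-InPO objective, which the abstract does not spell out; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one preference pair (illustrative, not DDIM-InPO).

    logp_w / logp_l: log-likelihood of the preferred / rejected sample
    under the model being fine-tuned.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference.
    """
    # Implicit reward difference between the preferred and rejected samples.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the model equals the reference, the loss is log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
# Raising the preferred sample's likelihood relative to the reference lowers it.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss only depends on how far the model has moved away from the reference on each sample of the pair, which is why no explicit reward model is needed.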

StableGS: A Floater-Free Framework for 3D Gaussian Splatting

Luchao Wang,Qian Ren,Kaiming He,Hua Wang,Zhi Chen,Yaohua Tang

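Task: Propose the StableGS framework to address training instability in 3D Gaussian Splatting (3DGS).

Motivation: 3DGS training suffers from coupled opacity-color optimization that easily falls into local minima, producing visual artifacts such as floaters that degrade visual fidelity.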

Details

Method: Eliminates floaters through cross-view depth consistency constraints and introduces a dual-opacity GS model to decouple the geometry and material properties of translucent objects; integrates DUSt3R depth estimation to improve geometric stability in weakly textured regions. Result: Significantly improves 3DGS training stability and outperforms existing state-of-the-art methods on open-source datasets. Conclusion: StableGS fundamentally addresses the training instability of 3DGS, improving visual quality and geometric stability. Abstract: Recent years have witnessed remarkable success of 3D Gaussian Splatting (3DGS) in novel view synthesis, surpassing prior differentiable rendering methods in both quality and efficiency. However, its training process suffers from coupled opacity-color optimization that frequently converges to local minima, producing floater artifacts that degrade visual fidelity. We present StableGS, a framework that eliminates floaters through cross-view depth consistency constraints while introducing a dual-opacity GS model to decouple geometry and material properties of translucent objects. To further enhance reconstruction quality in weakly-textured regions, we integrate DUSt3R depth estimation, significantly improving geometric stability. Our method fundamentally addresses 3DGS training instabilities, outperforming existing state-of-the-art methods across open-source datasets.

Hiding Images in Diffusion Models by Editing Learned Score Functions

Haoyu Chen,Yunqiao Yang,Nan Zhong,Kede Ma

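Task: Explore the potential of hiding data in diffusion models and propose a simple yet effective method.

Motivation: Current methods are limited in extraction accuracy, model fidelity, and hiding efficiency, mainly because the hiding and extraction processes are entangled with multiple denoising diffusion steps.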

Details

Method: Embeds images at specific timesteps of the reverse diffusion process by editing the learned score functions, with efficient fine-tuning that combines gradient-based parameter selection and low-rank adaptation. Result: The method extracts high-quality images, replicates the original model's behavior, hides images significantly faster than prior methods, and supports multi-recipient scenarios. Conclusion: The method achieves efficient, high-quality data hiding in diffusion models and has broad application potential. Abstract: Hiding data using neural networks (i.e., neural steganography) has achieved remarkable success across both discriminative classifiers and generative adversarial networks. However, the potential of data hiding in diffusion models remains relatively unexplored. Current methods exhibit limitations in achieving high extraction accuracy, model fidelity, and hiding efficiency due primarily to the entanglement of the hiding and extraction processes with multiple denoising diffusion steps. To address these limitations, we describe a simple yet effective approach that embeds images at specific timesteps in the reverse diffusion process by editing the learned score functions. Additionally, we introduce a parameter-efficient fine-tuning method that combines gradient-based parameter selection with low-rank adaptation to enhance model fidelity and hiding efficiency. Comprehensive experiments demonstrate that our method extracts high-quality images at human-indistinguishable levels, replicates the original model behaviors at both sample and population levels, and embeds images orders of magnitude faster than prior methods. Besides, our method naturally supports multi-recipient scenarios through independent extraction channels.

MuMA: 3D PBR Texturing via Multi-Channel Multi-View Generation and Agentic Post-Processing

Lingting Zhu,Jingrui Ye,Runze Zhang,Zeyu Hu,Yingda Yin,Lanjiong Li,Jinnan Chen,Shengju Qian,Xin Wang,Qingmin Liao,Lequan Yu

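Task: Propose MuMA, a method for 3D PBR texturing through multi-channel multi-view generation and agentic post-processing.

Motivation: Current 3D generation methods fall short in physically based rendering (PBR) texturing, mainly due to limited data and the challenges of modeling multi-channel materials.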

Details

Method: 1) Models shaded and albedo appearance channels, using the shaded channel to integrate intrinsic decomposition modules; 2) uses multimodal large language models to emulate artists' techniques for material assessment and selection. Result: Experiments show that MuMA outperforms existing methods in visual quality and material fidelity. Conclusion: Through multi-channel modeling and agentic post-processing, MuMA significantly improves 3D PBR texture generation. Abstract: Current methods for 3D generation still fall short in physically based rendering (PBR) texturing, primarily due to limited data and challenges in modeling multi-channel materials. In this work, we propose MuMA, a method for 3D PBR texturing through Multi-channel Multi-view generation and Agentic post-processing. Our approach features two key innovations: 1) We opt to model shaded and albedo appearance channels, where the shaded channel enables the integration of intrinsic decomposition modules for material properties. 2) Leveraging multimodal large language models, we emulate artists' techniques for material assessment and selection. Experiments demonstrate that MuMA achieves superior results in visual quality and material fidelity compared to existing methods.

SIT-FER: Integration of Semantic-, Instance-, Text-level Information for Semi-supervised Facial Expression Recognition

Sixian Ding,Xu Jiang,Zhongjing Du,Jiaqi Cui,Xinyi Zeng,Yan Wang

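Task: Propose a semi-supervised deep learning framework that combines semantic-, instance-, and text-level information to generate high-quality pseudo-labels for facial expression recognition.

Motivation: Existing semi-supervised facial expression recognition methods rely mainly on unreliable semantic-level pseudo-labels, which hurts their performance and practical utility.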

Details

Method: Computes the similarity between facial visual features and textual and instance features, combines the results with semantic-level probabilities to generate high-quality pseudo-labels, and uses text embeddings to strengthen supervised training. Result: Significantly outperforms existing semi-supervised methods on three datasets and even exceeds fully supervised baselines. Conclusion: Through multi-level information fusion and text-embedding supervision, the proposed framework effectively improves semi-supervised facial expression recognition. Abstract: Semi-supervised deep facial expression recognition (SS-DFER) has gained increasing research interest due to the difficulty in accessing sufficient labeled data in practical settings. However, existing SS-DFER methods mainly utilize generated semantic-level pseudo-labels for supervised learning, the unreliability of which compromises their performance and undermines their practical utility. In this paper, we propose a novel SS-DFER framework that simultaneously incorporates semantic-, instance-, and text-level information to generate high-quality pseudo-labels. Specifically, for the unlabeled data, considering the comprehensive knowledge within the textual descriptions and instance representations, we respectively calculate the similarities between the facial vision features and the corresponding textual and instance features to obtain the probabilities at the text- and instance-level. Combined with the semantic-level probability, these three-level probabilities are aggregated to obtain the final pseudo-labels. Furthermore, to enhance the utilization of one-hot labels for the labeled data, we also incorporate text embeddings excavated from textual descriptions to co-supervise model training, enabling facial visual features to exhibit semantic correlations in the text space. Experiments on three datasets demonstrate that our method significantly outperforms current state-of-the-art SS-DFER methods and even exceeds fully supervised baselines. The code will be available at https://github.com/PatrickStarL/SIT-FER.

CFReID: Continual Few-shot Person Re-Identification

Hao Ni,Lianli Gao,Pengpeng Zeng,Heng Tao Shen,Jingkuan Song

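Task: Propose a new paradigm, Continual Few-shot ReID (CFReID), in which models are trained incrementally under few-shot conditions and tested on all seen domains.

Motivation: Real-world surveillance systems evolve dynamically and require models to continually handle new data from different domains, whereas existing Lifelong ReID (LReID) requires large-scale labeled data, which is infeasible due to privacy and cost.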

Details

Method: Proposes the Stable Distribution Alignment (SDA) framework, consisting of two modules: Meta Distribution Alignment (MDA) and Prototype-based Few-shot Adaptation (PFA). Result: SDA significantly improves few-shot learning and anti-forgetting capabilities; using only 5% of the data (32 IDs), it significantly outperforms LReID methods that require 700 to 1,000 IDs. Conclusion: The SDA framework effectively addresses knowledge learning and forgetting under few-shot conditions, offering a feasible solution for dynamic surveillance systems. Abstract: Real-world surveillance systems are dynamically evolving, requiring a person Re-identification model to continuously handle newly incoming data from various domains. To cope with these dynamics, Lifelong ReID (LReID) has been proposed to learn and accumulate knowledge across multiple domains incrementally. However, LReID models need to be trained on large-scale labeled data for each unseen domain, which are typically inaccessible due to privacy and cost concerns. In this paper, we propose a new paradigm called Continual Few-shot ReID (CFReID), which requires models to be incrementally trained using few-shot data and tested on all seen domains. Under few-shot conditions, CFReID faces two core challenges: 1) learning knowledge from few-shot data of unseen domains, and 2) avoiding catastrophic forgetting of seen domains. To tackle these two challenges, we propose a Stable Distribution Alignment (SDA) framework from a feature distribution perspective. Specifically, our SDA is composed of two modules, i.e., Meta Distribution Alignment (MDA) and Prototype-based Few-shot Adaptation (PFA). To support the study of CFReID, we establish an evaluation benchmark for CFReID on five publicly available ReID datasets. Extensive experiments demonstrate that our SDA can enhance the few-shot learning and anti-forgetting capabilities under few-shot conditions. Notably, our approach, using only 5% of the data, i.e., 32 IDs, significantly outperforms LReID's state-of-the-art performance, which requires 700 to 1,000 IDs.

MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

Zhenyu Pan,Han Liu

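Task: Propose MetaSpatial, a reinforcement learning-based framework that enhances 3D spatial reasoning in vision-language models (VLMs) for real-time 3D scene generation.

Motivation: Address VLMs' lack of internalized 3D spatial reasoning and the inefficiency of traditional supervised fine-tuning (SFT) for layout generation tasks.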

Details

Method: Uses a multi-turn reinforcement learning optimization mechanism that combines physics-aware constraints with rendered image evaluations to refine 3D layouts progressively. Result: MetaSpatial significantly improves spatial consistency and formatting stability; the generated 3D layouts are more realistic, aligned, and functionally coherent. Conclusion: Validates the effectiveness of reinforcement learning for 3D spatial reasoning, with applications in the metaverse, AR/VR, digital twins, and game development. Abstract: We present MetaSpatial, the first reinforcement learning (RL)-based framework designed to enhance 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene generation without the need for hard-coded optimizations. MetaSpatial addresses two core challenges: (i) the lack of internalized 3D spatial reasoning in VLMs, which limits their ability to generate realistic layouts, and (ii) the inefficiency of traditional supervised fine-tuning (SFT) for layout generation tasks, as perfect ground truth annotations are unavailable. Our key innovation is a multi-turn RL-based optimization mechanism that integrates physics-aware constraints and rendered image evaluations, ensuring generated 3D layouts are coherent, physically plausible, and aesthetically consistent. Methodologically, MetaSpatial introduces an adaptive, iterative reasoning process, where the VLM refines spatial arrangements over multiple turns by analyzing rendered outputs, improving scene coherence progressively. Empirical evaluations demonstrate that MetaSpatial significantly enhances the spatial consistency and formatting stability of models of various scales. Post-training, object placements are more realistic, aligned, and functionally coherent, validating the effectiveness of RL for 3D spatial reasoning in metaverse, AR/VR, digital twins, and game development applications. Our code, data, and training pipeline are publicly available at https://github.com/PzySeere/MetaSpatial.

Global-Local Tree Search for Language Guided 3D Scene Generation

Wei Deng,Mengshi Qi,Huadong Ma

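Task: Use large vision-language models (VLMs) for 3D indoor scene generation.

Motivation: There are few studies on 3D indoor scene generation with VLMs, and the task involves spatial and layout common-sense constraints.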

Details

Method: Proposes a global-local tree search algorithm that generates object positions by hierarchically decomposing the scene structure (room, region, floor object, supported object) and discretizing space into a grid labeled with emojis. Result: Experiments show that the method generates more plausible 3D scenes than existing methods. Conclusion: The method offers a new approach to applying VLMs to 3D scene generation, and its open-source code supports further research. Abstract: Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there are few studies on 3D indoor scene generation with VLMs. This paper considers this task as a planning problem subject to spatial and layout common sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places each object sequentially and explores multiple placements during each placement process, where the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically, i.e., room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and supported objects placed on different floor objects. Locally, we also decompose the sub-task, the placement of each object, into multiple steps. The algorithm searches the tree of problem space. To leverage the VLM to produce positions of objects, we discretize the top-down view space as a dense grid and fill each cell with diverse emojis to make the cells distinct. We prompt the VLM with the emoji grid and the VLM produces a reasonable location for the object by describing the position with the name of emojis. The quantitative and qualitative experimental results illustrate that our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at https://github.com/dw-dengwei/TreeSearchGen .
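
The emoji-grid idea can be sketched in a few lines: fill a top-down grid with distinct emojis, render it as prompt text, and map an emoji name returned by the model back to a position. This is only an illustration of the discretization step; the emoji set, function names, and the choice of returning the cell centre are assumptions, and no VLM call or tree search is included.

```python
# Hypothetical emoji vocabulary; the paper's actual set is not given in the abstract.
EMOJIS = [
    ("🍎", "apple"), ("🍌", "banana"), ("🍇", "grapes"),
    ("🍉", "watermelon"), ("🍓", "strawberry"), ("🍒", "cherries"),
    ("🍍", "pineapple"), ("🥝", "kiwi"), ("🥕", "carrot"),
]


def build_emoji_grid(rows, cols, emojis=EMOJIS):
    """Assign one distinct (symbol, name) pair to every cell of the grid."""
    if rows * cols > len(emojis):
        raise ValueError("not enough distinct emojis for this grid size")
    return [[emojis[r * cols + c] for c in range(cols)] for r in range(rows)]


def render(grid):
    """Text form of the grid that could be placed in a prompt."""
    return "\n".join(" ".join(symbol for symbol, _ in row) for row in grid)


def cell_center(grid, name, room_width, room_depth):
    """Map an emoji name (as a model might answer) to the centre of its cell."""
    rows, cols = len(grid), len(grid[0])
    for r, row in enumerate(grid):
        for c, (_, cell_name) in enumerate(row):
            if cell_name == name:
                return ((c + 0.5) * room_width / cols, (r + 0.5) * room_depth / rows)
    raise KeyError(name)


grid = build_emoji_grid(3, 3)
print(render(grid))
print(cell_center(grid, "strawberry", room_width=3.0, room_depth=3.0))
```

Naming a cell by its emoji lets the model answer with a single word rather than numeric coordinates, which is the property the abstract relies on.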

Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding

Xiangrui Liu,Yan Shu,Zheng Liu,Ao Li,Yang Tian,Bo Zhao

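Task: Propose Video-XL-Pro, an efficient method that addresses the difficulty multimodal large language models (MLLMs) have with extremely long video understanding.

Motivation: Existing MLLMs perform poorly on long video understanding despite advanced token compression techniques.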

Details

Method: Built on ReCoT (a learnable reconstructive token compression module), which comprises a Dynamic Token Synthesizer (DTS) and Semantic-Guided Masking (SGM), combined with a video dataset pruning strategy and a query-aware selector. Result: Video-XL-Pro outperforms most 7B models on multiple long video understanding benchmarks and can process over 8K frames on a single A100 GPU. Conclusion: Video-XL-Pro is an efficient, lightweight solution that significantly improves long video understanding. Abstract: Despite advanced token compression techniques, existing multimodal large language models (MLLMs) still struggle with hour-long video understanding. In this work, we propose Video-XL-Pro, an efficient method for extremely long video understanding, built upon Reconstructive Compression of Tokens (ReCoT), a learnable module that leverages self-supervised learning to generate comprehensive and compact video tokens. ReCoT introduces two key components: (i) Dynamic Token Synthesizer (DTS): DTS generates pseudo-video tokens from static image tokens by learning intra-token relationships, which are then used in masked video modeling. (ii) Semantic-Guided Masking (SGM): SGM adaptively masks redundant visual tokens to facilitate more effective reconstructive learning. To improve training efficiency in MLLM fine-tuning, we introduce a video-specific dataset pruning strategy and design a simple yet effective Query-aware Selector that enables the model to precisely locate query-relevant video tokens. With only 3B parameters, Video-XL-Pro outperforms most 7B models trained on larger datasets across multiple long video understanding benchmarks. Moreover, it can process over 8K frames on a single A100 GPU while maintaining high-quality performance.

Explaining Domain Shifts in Language: Concept erasing for Interpretable Image Classification

Zequn Zeng,Yudi Su,Jianqiao Sun,Tiansheng Wen,Hao Zhang,Zhengjue Wang,Bo Chen,Hongwei Liu,Jiawei Ma

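Task: Propose LanCE, a language-guided concept-erasing framework that removes the influence of domain-specific concepts on model generalization.

Motivation: Domain-specific concepts weaken model generalization and hinder use in high-stakes applications, so a method is needed to reduce their influence on predictions.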

Details

Method: Uses pre-trained vision-language models (VLMs) and large language models (LLMs) to simulate descriptors of unseen visual domains, and introduces a domain descriptor orthogonality (DDO) regularizer to reduce the influence of domain-specific concepts. Result: Evaluation on four standard benchmarks and three new benchmarks shows that DDO significantly improves the out-of-distribution (OOD) generalization of concept-based models. Conclusion: With the DDO regularizer, the LanCE framework effectively improves generalization and applies to a variety of concept-based models. Abstract: Concept-based models can map black-box representations to human-understandable concepts, which makes the decision-making process more transparent and allows users to understand the reasons behind predictions. However, domain-specific concepts often affect the final predictions, which undermines the model's generalization capabilities and prevents the model from being used in high-stakes applications. In this paper, we propose a novel Language-guided Concept-Erasing (LanCE) framework. In particular, we empirically demonstrate that pre-trained vision-language models (VLMs) can approximate distinct visual domain shifts via domain descriptors, while prompting large language models (LLMs) can easily simulate a wide range of descriptors of unseen visual domains. Then, we introduce a novel plug-in domain descriptor orthogonality (DDO) regularizer to mitigate the impact of these domain-specific concepts on the final predictions. Notably, the DDO regularizer is agnostic to the design of concept-based models and we integrate it into several prevailing models. Through evaluation of domain generalization on four standard benchmarks and three newly introduced benchmarks, we demonstrate that DDO can significantly improve the out-of-distribution (OOD) generalization over the previous state-of-the-art concept-based models. Our code is available at https://github.com/joeyz0z/LanCE.

Junyuan Gao,Jiahe Song,Jiang Wu,Runchuan Zhu,Guanlin Shen,Shasha Wang,Xingjian Wei,Haote Yang,Songyang Zhang,Weijia Li,Bin Wang,Dahua Lin,Lijun Wu,Conghui He

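Task: Propose PM4Bench, a parallel multilingual, multi-modal, multi-task benchmark for large vision language models (LVLMs).

Motivation: Address the limitations of existing multilingual benchmarks: language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation.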

Details

Method: Designs a parallel corpus across 10 languages to support fair and accurate cross-lingual comparison, embeds text and queries in images to mimic real-world applications, and adds safety evaluations. Result: Evaluates 11 mainstream LVLMs, revealing significant cross-lingual performance disparities, particularly in vision settings, and identifying OCR capability as a key factor. Conclusion: PM4Bench fills gaps in existing benchmarks and provides a more comprehensive tool for multilingual and multimodal evaluation of LVLMs. Abstract: Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It includes the vision setting where text and queries are embedded in images, requiring LVLMs to simultaneously "see", "read", and "think", aligning with real-world applications. Additionally, PM4Bench incorporates safety evaluations, addressing a critical oversight in existing multilingual benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing significant cross-linguistic performance disparities, particularly in vision settings, and identifying OCR capability as a key determinant of these imbalances. We will release PM4Bench at https://github.com/opendatalab/PM4Bench .

Can Text-to-Video Generation help Video-Language Alignment?

Luca Zanella,Massimiliano Mancini,Willi Menapace,Sergey Tulyakov,Yiming Wang,Elisa Ricci

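Task: Study whether synthetic videos can help address the bias introduced by negative captions in video-language alignment models.

Motivation: In existing methods, negative captions may introduce linguistic biases, and existing databases lack videos with the fine-grained variations needed to cover all possible negatives.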

Details

Method: Proposes SynViTA, which dynamically weights the contribution of each synthetic video and introduces a semantic consistency loss. Result: On average, SynViTA improves over existing methods on multiple test sets and benchmarks. Conclusion: SynViTA is a promising first step toward using synthetic videos when learning video-language models. Abstract: Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. While a solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize this issue is linked to noise (semantic and visual) in the generated videos and develop a method, SynViTA, that accounts for it. SynViTA dynamically weights the contribution of each synthetic video based on how similar its target caption is w.r.t. the real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions, rather than differences in video appearance. Experiments show that, on average, SynViTA improves over existing methods on VideoCon test sets and SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, being a first promising step for using synthetic videos when learning video-language models.

Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model

Leheng Zhang,Weiyi You,Kexuan Shi,Shuhang Gu

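Task: Propose a diffusion-based super-resolution method with uncertainty-guided noise weighting to make better use of low-resolution information.

Motivation: Diffusion-based methods surpass GAN-based methods in perceptual quality, but existing work focuses on the noise schedule or sampling process and neglects effective use of low-resolution (LR) information.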

Details

Method: Uses uncertainty estimates to guide region-specific noise level control and modifies the network architecture, yielding the Uncertainty-guided Perturbation Super-Resolution (UPSR) model. Result: Experiments show that UPSR outperforms current state-of-the-art methods both quantitatively and qualitatively, with reduced model size and training overhead. Conclusion: Uncertainty-guided noise weighting effectively improves diffusion-based super-resolution, particularly through region-specific noise control. Abstract: Diffusion-based image super-resolution methods have demonstrated significant advantages over GAN-based approaches, particularly in terms of perceptual quality. Building upon a lengthy Markov chain, diffusion-based methods possess remarkable modeling capacity, enabling them to achieve outstanding performance in real-world scenarios. Unlike previous methods that focus on modifying the noise schedule or sampling process to enhance performance, our approach emphasizes the improved utilization of LR information. We find that different regions of the LR image can be viewed as corresponding to different timesteps in a diffusion process, where flat areas are closer to the target HR distribution but edge and texture regions are farther away. In these flat areas, applying a slight noise is more advantageous for the reconstruction. We associate this characteristic with uncertainty and propose to apply an uncertainty estimate to guide region-specific noise level control, a technique we refer to as Uncertainty-guided Noise Weighting. Pixels with lower uncertainty (i.e., flat regions) receive reduced noise to preserve more LR information, therefore improving performance. Furthermore, we modify the network architecture of previous methods to develop our Uncertainty-guided Perturbation Super-Resolution (UPSR) model. Extensive experimental results demonstrate that, despite reduced model size and training overhead, the proposed UPSR method outperforms current state-of-the-art methods across various datasets, both quantitatively and qualitatively.
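
A minimal sketch of the noise-weighting idea, assuming a per-pixel uncertainty map is already available: pixels with low uncertainty (flat regions) get a smaller noise scale, and pixels with high uncertainty get the full scale. How UPSR estimates uncertainty and the exact weighting function are not stated in the abstract, so the linear mapping, the `w_min` floor, and all names below are illustrative assumptions.

```python
import random


def noise_weights(uncertainty, w_min=0.5):
    """Map per-pixel uncertainty linearly to a noise weight in [w_min, 1].

    Assumption: a min-max normalised linear mapping; the paper's function may differ.
    """
    flat = [u for row in uncertainty for u in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # No contrast in the uncertainty map: fall back to full noise everywhere.
        return [[1.0 for _ in row] for row in uncertainty]
    return [[w_min + (1.0 - w_min) * (u - lo) / (hi - lo) for u in row]
            for row in uncertainty]


def perturb(lr_image, uncertainty, sigma, w_min=0.5, seed=0):
    """Add Gaussian noise whose standard deviation is sigma times the pixel weight."""
    rng = random.Random(seed)
    weights = noise_weights(uncertainty, w_min)
    return [[px + sigma * w * rng.gauss(0.0, 1.0) for px, w in zip(img_row, w_row)]
            for img_row, w_row in zip(lr_image, weights)]


# Toy 2x2 example: the top-left pixel is "flat", the top-right is "textured".
uncertainty = [[0.0, 1.0], [0.5, 0.25]]
print(noise_weights(uncertainty))
print(perturb([[0.2, 0.8], [0.5, 0.4]], uncertainty, sigma=0.1))
```

Keeping a non-zero floor such as `w_min` means flat regions are still perturbed slightly, in line with the abstract's remark that slight noise is advantageous there.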

LookCloser: Frequency-aware Radiance Field for Tiny-Detail Scene

Xiaoyu Zhang,Weihong Pan,Chong Bao,Xiyu Zhang,Xiaojun Xiang,Hanqing Jiang,Hujun Bao

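Task: Propose FA-NeRF, a frequency-aware NeRF framework that simultaneously captures a scene's overall structure and high-definition details.

Motivation: Humans perceive their environment through information at multiple frequencies, whereas existing NeRF frameworks model either high-frequency local views or low-frequency scene structure and cannot balance the two.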

Details

Method: Proposes a 3D frequency quantification method to analyze the scene's frequency distribution, combined with a frequency grid and a frequency-aware feature re-weighting strategy. Result: Experiments show that FA-NeRF significantly outperforms existing methods in modeling entire scenes while preserving fine details. Conclusion: Through its frequency-aware approach, FA-NeRF effectively balances scene structure and detail modeling, improving NeRF performance. Abstract: Humans perceive and comprehend their surroundings through information spanning multiple frequencies. In immersive scenes, people naturally scan their environment to grasp its overall structure while examining fine details of objects that capture their attention. However, current NeRF frameworks primarily focus on modeling either high-frequency local views or the broad structure of scenes with low-frequency information, which limits their ability to balance both. We introduce FA-NeRF, a novel frequency-aware framework for view synthesis that simultaneously captures the overall scene structure and high-definition details within a single NeRF model. To achieve this, we propose a 3D frequency quantification method that analyzes the scene's frequency distribution, enabling frequency-aware rendering. Our framework incorporates a frequency grid for fast convergence and querying, and a frequency-aware feature re-weighting strategy to balance features across different frequency contents. Extensive experiments show that our method significantly outperforms existing approaches in modeling entire scenes while preserving fine details.

AIM2PC: Aerial Image to 3D Building Point Cloud Reconstruction

Soulaimene Turki,Daniel Panangian,Houda Chaabouni-Chouayakh,Ksenia Bittner

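Task: Propose AIM2PC, a new method for reconstructing complete 3D building point clouds from single-view aerial images.

Motivation: Existing methods focus mainly on rooftops and overlook geometric details, and there is a lack of complete 3D point cloud datasets and reliable camera pose information.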

Details

Method: Uses a dataset with complete 3D point clouds and camera poses, conditions on binary masks and Sobel edge maps, and reconstructs progressively with a CDPM-based point cloud diffusion model. Result: The method reconstructs complete 3D building point clouds, including wall information, and outperforms existing baseline techniques. Conclusion: AIM2PC addresses the limitations of existing methods and provides a public dataset to support further research. Abstract: Three-dimensional urban reconstruction of buildings from single-view images has attracted significant attention over the past two decades. However, recent methods primarily focus on rooftops from aerial images, often overlooking essential geometrical details. Additionally, there is a notable lack of datasets containing complete 3D point clouds for entire buildings, along with challenges in obtaining reliable camera pose information for aerial images. This paper addresses these challenges by presenting a novel methodology, AIM2PC, which utilizes our generated dataset that includes complete 3D point clouds and determined camera poses. Our approach takes features from a single aerial image as input and concatenates them with essential additional conditions, such as binary masks and Sobel edge maps, to enable more edge-aware reconstruction. By incorporating a point cloud diffusion model based on Centered denoising Diffusion Probabilistic Models (CDPM), we project these concatenated features onto the partially denoised point cloud using our camera poses at each diffusion step. The proposed method is able to reconstruct the complete 3D building point cloud, including wall information, and demonstrates superior performance compared to existing baseline techniques. To allow further comparisons with our methodology, the dataset has been made available at https://github.com/Soulaimene/AIM2PCDataset
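
As a rough illustration of the conditioning inputs mentioned in the abstract, the sketch below computes a Sobel edge map for a small grayscale image and stacks it with a binary mask per pixel. It uses the standard 3x3 Sobel kernels; stacking raw intensities rather than learned image features, the zeroed border, and the function names are simplifying assumptions and not the AIM2PC pipeline.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_edges(gray):
    """Gradient magnitude for interior pixels; border pixels are left at 0."""
    h, w = len(gray), len(gray[0])
    edges = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    v = gray[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * v
                    gy += SOBEL_Y[dy][dx] * v
            edges[y][x] = math.hypot(gx, gy)
    return edges


def stack_conditions(gray, mask):
    """Per-pixel (intensity, mask, edge) tuples, i.e. channel-wise concatenation."""
    edges = sobel_edges(gray)
    return [[(gray[y][x], mask[y][x], edges[y][x]) for x in range(len(gray[0]))]
            for y in range(len(gray))]


# Toy image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
mask = [[1, 1, 1, 1] for _ in range(4)]
print(sobel_edges(image)[1])
print(stack_conditions(image, mask)[1][1])
```

The edge channel responds only near the step between the dark and bright columns, which is the kind of boundary cue the abstract describes as enabling more edge-aware reconstruction.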

DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels

Erjian Guo,Zhen Zhao,Zicheng Wang,Tong Chen,Yunyi Liu,Luping Zhou

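Task: Establish the first benchmark for noisy labels in medical visual question answering (Med-VQA) and propose the DiN framework to handle noisy labels.

Motivation: Medical images contain critical clinical information, but the problems of noisy labels and limited high-quality datasets remain underexplored.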

Details

Method: Designs semantic noise types by simulating human mislabeling and proposes the DiN framework, comprising the Answer Diffuser (AD) module (coarse-to-fine answer generation with a diffusion model), the Answer Condition Generator (ACG) module (generating task-specific conditional information), and the Noisy Label Refinement (NLR) module (a robust loss function and dynamic answer adjustment). Result: The DiN framework improves Med-VQA accuracy through its diffusion model and noisy label refinement. Conclusion: The DiN framework offers an effective solution to noisy labels in Med-VQA, with experiments confirming its performance gains. Abstract: Medical Visual Question Answering (Med-VQA) systems benefit the interpretation of medical images containing critical clinical information. However, the challenge of noisy labels and limited high-quality datasets remains underexplored. To address this, we establish the first benchmark for noisy labels in Med-VQA by simulating human mislabeling with semantically designed noise types. More importantly, we introduce the DiN framework, which leverages a diffusion model to handle noisy labels in Med-VQA. Unlike the dominant classification-based VQA approaches that directly predict answers, our Answer Diffuser (AD) module employs a coarse-to-fine process, refining answer candidates with a diffusion model for improved accuracy. The Answer Condition Generator (ACG) further enhances this process by generating task-specific conditional information via integrating answer embeddings with fused image-question features. To address label noise, our Noisy Label Refinement (NLR) module introduces a robust loss function and dynamic answer adjustment to further boost the performance of the AD module.

HiRes-FusedMIM: A High-Resolution RGB-DSM Pre-trained Model for Building-Level Remote Sensing Applications

Guneet Mutreja,Philipp Schuegraf,Ksenia Bittner

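Task: Propose HiRes-FusedMIM, a pre-trained model that combines high-resolution RGB and DSM data to improve understanding of urban environments.

Motivation: Existing self-supervised learning models overlook the importance of high-resolution digital surface models (DSMs) for urban environment analysis, particularly at the building level.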

Details

Method: Uses a dual-encoder simple masked image modeling (SimMIM) architecture with a multi-objective loss function to learn joint representations from RGB and DSM data. Result: HiRes-FusedMIM outperforms existing methods on several building-related datasets, demonstrating the value of DSM data in pre-training and the advantage of the dual-encoder architecture. Conclusion: By combining RGB and DSM data, HiRes-FusedMIM substantially improves building-level analysis and provides a strong tool for related research and applications. Abstract: Recent advances in self-supervised learning have led to the development of foundation models that have significantly advanced performance in various computer vision tasks. However, despite their potential, these models often overlook the crucial role of high-resolution digital surface models (DSMs) in understanding urban environments, particularly for building-level analysis, which is essential for applications like digital twins. To address this gap, we introduce HiRes-FusedMIM, a novel pre-trained model specifically designed to leverage the rich information contained within high-resolution RGB and DSM data. HiRes-FusedMIM utilizes a dual-encoder simple masked image modeling (SimMIM) architecture with a multi-objective loss function that combines reconstruction and contrastive objectives, enabling it to learn powerful, joint representations from both modalities. We conducted a comprehensive evaluation of HiRes-FusedMIM on a diverse set of downstream tasks, including classification, semantic segmentation, and instance segmentation. Our results demonstrate that: 1) HiRes-FusedMIM outperforms previous state-of-the-art geospatial methods on several building-related datasets, including WHU Aerial and LoveDA, demonstrating its effectiveness in capturing and leveraging fine-grained building information; 2) Incorporating DSMs during pre-training consistently improves performance compared to using RGB data alone, highlighting the value of elevation information for building-level analysis; 3) The dual-encoder architecture of HiRes-FusedMIM, with separate encoders for RGB and DSM data, significantly outperforms a single-encoder model on the Vaihingen segmentation task, indicating the benefits of learning specialized representations for each modality. To facilitate further research and applications in this direction, we will publicly release the trained model weights.

UniPCGC: Towards Practical Point Cloud Geometry Compression via an Efficient Unified Approach

Kangli Wang,Wei Gao

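Task: Propose UniPCGC, an efficient unified point cloud geometry compression framework that supports lossy compression, lossless compression, variable rate, and variable complexity.

Motivation: Existing learning-based point cloud compression methods suffer from high complexity, limited compression modes, and no support for variable rate, which limits their practical use.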

Details

Method: Proposes the UniPCGC framework, comprising the Uneven 8-Stage Lossless Coder (UELC) for the lossless mode and the Variable Rate and Complexity Module (VRCM) for the lossy mode, dynamically combined into a unified framework. Result: Achieves a compression ratio gain of 8.1% on lossless compression and a BD-Rate gain of 14.02% on lossy compression, while supporting variable rate and variable complexity. Conclusion: UniPCGC outperforms existing methods in performance and flexibility, advancing practical point cloud compression. Abstract: Learning-based point cloud compression methods have made significant progress in terms of performance. However, these methods still encounter challenges including high complexity, limited compression modes, and a lack of support for variable rate, which restrict the practical application of these methods. In order to promote the development of practical point cloud compression, we propose an efficient unified point cloud geometry compression framework, dubbed UniPCGC. It is a lightweight framework that supports lossy compression, lossless compression, variable rate and variable complexity. First, we introduce the Uneven 8-Stage Lossless Coder (UELC) in the lossless mode, which allocates more computational complexity to groups with higher coding difficulty, and merges groups with lower coding difficulty. Second, the Variable Rate and Complexity Module (VRCM) is achieved in the lossy mode through joint adoption of a rate modulation module and dynamic sparse convolution. Finally, through the dynamic combination of UELC and VRCM, we achieve lossy compression, lossless compression, variable rate and complexity within a unified framework. Compared to the previous state-of-the-art method, our method achieves a compression ratio (CR) gain of 8.1% on lossless compression, and a Bjontegaard Delta Rate (BD-Rate) gain of 14.02% on lossy compression, while also supporting variable rate and variable complexity.

Distilling Stereo Networks for Performant and Efficient Leaner Networks

Rafia Rahim,Samuel Woerz,Andreas Zell

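Task: Apply knowledge distillation to stereo matching networks to improve their performance and inference speed.

Motivation: Although knowledge distillation is widely used in vision tasks, it has been little studied for stereo matching networks, mainly because of the complexity of their architectures.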

Details

Method: Combines state-of-the-art stereo matching methods with general knowledge distillation techniques, designing the complete distillation pipeline from the backbone to the distillation points and loss functions. Result: The student network outperforms PSMNet, CFNet, and LEAStereo while being 8x, 5x, and 8x faster, respectively, and generalizes better on the ETH3D and Middlebury datasets. Conclusion: A systematically designed distillation pipeline yields stereo matching networks that are both more efficient and better performing. Abstract: Knowledge distillation has been quite popular in vision for tasks like classification and segmentation; however, not much work has been done on distilling state-of-the-art stereo matching methods despite their range of applications. One reason for its lack of use in stereo matching networks is the inherent complexity of these networks, where a typical network is composed of multiple two- and three-dimensional modules. In this work, we systematically combine the insights from state-of-the-art stereo methods with general knowledge-distillation techniques to develop a joint framework for stereo network distillation with competitive results and faster inference. Moreover, we show, via a detailed empirical analysis, that distilling knowledge from the stereo network requires careful design of the complete distillation pipeline, from the backbone to the right selection of distillation points and corresponding loss functions. This results in student networks that are not only leaner and faster but also give excellent performance. For instance, our student network, while performing better than performance-oriented methods like PSMNet [1], CFNet [2], and LEAStereo [3] on the benchmark SceneFlow dataset, is 8x, 5x, and 8x faster, respectively. Furthermore, compared to speed-oriented methods with inference times under 100 ms, our student networks perform better than all the tested methods. In addition, our student network also shows better generalization capabilities when tested on unseen datasets like ETH3D and Middlebury.
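For readers unfamiliar with the "general knowledge-distillation techniques" the abstract refers to, here is a minimal sketch of the classic temperature-softened distillation loss. It is a generic formulation, not the paper's stereo-specific pipeline: the distillation points and loss functions the authors select are not given in the abstract, and the function names and default temperature below are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor keeps gradient magnitudes comparable across temperatures,
    as in the standard soft-target formulation.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = 0.0
    for pt, ps in zip(p_teacher, p_student):
        if pt > 0.0:
            # Clamp the student probability to avoid log(0) on extreme logits.
            kl += pt * (math.log(pt) - math.log(max(ps, 1e-12)))
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge; in a stereo setting the same idea could be applied per pixel over disparity candidates, though that mapping is an assumption here.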

Benchmarking Post-Hoc Unknown-Category Detection in Food Recognition

Lubnaa Abdur Rahman,Ioannis Papathanail,Lorenzo Brigato,Stavroula Mougiakakou

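Task: Conduct an empirical analysis of post-hoc out-of-distribution (OOD) detection methods for fine-grained food recognition.

Motivation: Food recognition models struggle to distinguish known from unknown samples, causing misclassifications in real-world applications such as automatic dietary assessment systems; this needs improvement.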

Details

Method: Evaluates a range of post-hoc OOD detection methods, with a focus on virtual logit matching (ViM). Result: ViM performs best, combining logits with feature-space representations; transformer-based architectures outperform convolutional models in OOD detection. Conclusion: ViM and transformer architectures are advantageous for OOD detection in food recognition, and models with higher ID accuracy perform better. Abstract: Food recognition models often struggle to distinguish between seen and unseen samples, frequently misclassifying samples from unseen categories by assigning them an in-distribution (ID) label. This misclassification presents significant challenges when deploying these models in real-world applications, particularly within automatic dietary assessment systems, where incorrect labels can lead to cascading errors throughout the system. Ideally, such models should prompt the user when an unknown sample is encountered, allowing for corrective action. Given no prior research exploring food recognition in real-world settings, in this work we conduct an empirical analysis of various post-hoc out-of-distribution (OOD) detection methods for fine-grained food recognition. Our findings indicate that virtual logit matching (ViM) performed the best overall, likely due to its combination of logits and feature-space representations. Additionally, our work reinforces prior notions in the OOD domain, noting that models with higher ID accuracy performed better across the evaluated OOD detection methods. Furthermore, transformer-based architectures consistently outperformed convolution-based models in detecting OOD samples across various methods.
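To make "post-hoc OOD detection" concrete, the sketch below scores a sample from a trained classifier's logits alone, with no retraining. It shows two widely used logit-based scores (maximum softmax probability and log-sum-exp, the negative of the energy score); these are simpler than ViM, which also uses feature-space information, and they are not necessarily among the methods the paper evaluates. The threshold and function names are hypothetical.

```python
import math


def max_softmax_probability(logits):
    """Confidence of the top class; low values suggest an unknown sample."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)


def log_sum_exp(logits):
    """Negative energy score; higher values suggest an in-distribution sample."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def is_unknown(logits, threshold, score_fn=max_softmax_probability):
    """Flag a sample as unknown when its score falls below the threshold."""
    return score_fn(logits) < threshold
```

In a dietary assessment app, `is_unknown` returning `True` would be the point at which the system prompts the user instead of committing to a label; the threshold would normally be tuned on held-out ID data.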

EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation

Qiang Qu,Ming Li,Xiaoming Chen,Tongliang Liu

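Task: Use event streams as motion cues to turn static human images into dynamic sequences.

Motivation: Conventional video data used as motion cues suffers from low temporal resolution, motion blur, overexposure, and inaccuracy in low light, whereas event cameras offer high temporal resolution, wide dynamic range, and resistance to motion blur that address these problems.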

Details

Method: Proposes the EvAnimate framework, which converts asynchronous event streams into 3-channel slices through a specialized event representation, generates high-quality video with a dual-branch architecture, and uses data augmentation strategies to improve cross-person generalization. Result: Experiments show that EvAnimate achieves high temporal fidelity and robust performance in scenarios where conventional video-derived cues perform poorly. Conclusion: By using event streams as motion cues, EvAnimate substantially improves the quality and consistency of generated dynamic sequences. Abstract: Conditional human animation transforms a static reference image into a dynamic sequence by applying motion cues such as poses. These motion cues are typically derived from video data but are susceptible to limitations including low temporal resolution, motion blur, overexposure, and inaccuracies under low-light conditions. In contrast, event cameras provide data streams with exceptionally high temporal resolution, a wide dynamic range, and inherent resistance to motion blur and exposure issues. In this work, we propose EvAnimate, a framework that leverages event streams as motion cues to animate static human images. Our approach employs a specialized event representation that transforms asynchronous event streams into 3-channel slices with controllable slicing rates and appropriate slice density, ensuring compatibility with diffusion models. Subsequently, a dual-branch architecture generates high-quality videos by harnessing the inherent motion dynamics of the event streams, thereby enhancing both video quality and temporal consistency. Specialized data augmentation strategies further enhance cross-person generalization. Finally, we establish a new benchmark, including simulated event data for training and validation, and a real-world event dataset capturing human actions under normal and extreme scenarios. The experimental results demonstrate that EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.

ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset

Zihao Chen,Hsuanyu Wu,Chi-Hsi Kung,Yi-Ting Chen,Yan-Tsung Peng

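Task: Introduce and evaluate ATARS, the first aerial dataset for multi-label atomic activity analysis, and study the task of multi-label temporal atomic activity recognition.

Motivation: Existing atomic activity datasets support only an egocentric view and provide only video-level annotations; they cannot cover traffic activities across an entire intersection, and manually trimming videos is time-consuming and laborious.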

Details

Method: Introduces the ATARS dataset, which provides per-frame atomic activity labels, and proposes the multi-label temporal atomic activity recognition task. Result: Experiments show that ATARS poses unique challenges, such as recognizing the activities of extremely small objects, and offer insights into directions for future improvement. Conclusion: ATARS fills the gap in atomic activity analysis from an aerial view and provides an important foundation for future research. Abstract: Traffic Atomic Activity, which describes traffic patterns for topological intersection dynamics, is a crucial topic for the advancement of intelligent driving systems. However, existing atomic activity datasets are collected from an egocentric view, which cannot support scenarios where traffic activities in an entire intersection are required. Moreover, existing datasets only provide video-level atomic activity annotations, which require exhausting effort to manually trim the videos for recognition and limit their applications to untrimmed videos. To bridge this gap, we introduce the Aerial Traffic Atomic Activity Recognition and Segmentation (ATARS) dataset, the first aerial dataset designed for multi-label atomic activity analysis. We offer atomic activity labels for each frame, which accurately record the intervals for traffic activities. Moreover, we propose a novel task, Multi-label Temporal Atomic Activity Recognition, enabling the study of accurate temporal localization for atomic activity and easing the burden of manual video trimming for recognition. We conduct extensive experiments to evaluate existing state-of-the-art models on both atomic activity recognition and temporal atomic activity segmentation. The results highlight the unique challenges of our ATARS dataset, such as recognizing extremely small objects' activities. We further provide a comprehensive discussion analyzing these challenges and offer valuable insights for future directions in improving atomic activity recognition from an aerial view. Our source code and dataset are available at https://github.com/magecliff96/ATARS/

Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models

Bin Li,Dehong Gao,Yeyuan Wang,Linbo Jin,Shanqing Yu,Xiaoyan Cai,Libin Yang

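Task: Propose an Instruction-Aligned Visual Attention (IAVA) approach to reduce the hallucinations that large vision-language models (LVLMs) produce when describing images.

Motivation: LVLMs are prone to hallucination when describing images, generating answers that include non-existent objects, because the model over-attends to irrelevant image tokens.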

Details

Method: Identifies irrelevant tokens by comparing changes in attention weights under two different instructions, and applies contrastive decoding to dynamically adjust the logits from the original image tokens and the irrelevant tokens. Result: IAVA outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA, effectively reducing object hallucinations. Conclusion: By reducing over-attention to irrelevant information, IAVA substantially improves the accuracy of the model's answers. Abstract: Despite the significant success of Large Vision-Language Models (LVLMs), these models still suffer from hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question and distort the output. To address this, we propose an Instruction-Aligned Visual Attention (IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions. By applying contrastive decoding, we dynamically adjust the logits generated from original image tokens and irrelevant image tokens, reducing the model's over-attention to irrelevant information. The experimental results demonstrate that IAVA consistently outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA in mitigating object hallucinations. Our IAVA approach is available online at https://github.com/Lee-lab558/IAVA.
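The contrastive decoding step can be pictured with a small sketch. It assumes the common form used in the contrastive decoding literature, `(1 + alpha) * original - alpha * irrelevant`, where the second set of logits comes from a forward pass that sees only the irrelevant image tokens. The abstract does not give IAVA's exact adjustment rule, so the formula, the `alpha` parameter, and the function names are all assumptions.

```python
def contrastive_logits(original_logits, irrelevant_logits, alpha=1.0):
    """Amplify evidence from the full image and subtract what the
    irrelevant tokens alone would predict (assumed contrastive form)."""
    return [(1 + alpha) * lo - alpha * li
            for lo, li in zip(original_logits, irrelevant_logits)]


def pick_token(logits):
    """Greedy decoding: index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary of three tokens. Token 0 is favored mostly by the
# irrelevant tokens, so the contrastive adjustment demotes it.
original = [2.0, 1.9, 0.0]
irrelevant = [2.5, 0.5, 0.0]
adjusted = contrastive_logits(original, irrelevant, alpha=1.0)
```

With these toy numbers, greedy decoding on `original` selects token 0, while decoding on `adjusted` selects token 1, which illustrates how a token driven by irrelevant visual evidence can be suppressed.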

LeanStereo: A Leaner Backbone based Stereo Network

Rafia Rahim,Samuel Woerz,Andreas Zell

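Task: Propose a fast end-to-end stereo matching method to address the high compute and memory bandwidth requirements of existing deep learning methods.

Motivation: Existing end-to-end deep stereo matching methods perform well but have high compute and memory requirements, which limits their applicability in practice.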

Details

Method: Integrates a leaner backbone and recovers the lost performance with a cost volume based on learned attention weights combined with a LogL1 loss. Result: The proposed method requires 4x fewer operations and is 9 to 14x faster than existing methods, with comparable performance. Conclusion: The method substantially improves speed while maintaining high performance, making it suitable for practical applications. Abstract: Recently, end-to-end deep-network-based stereo matching methods have gained popularity, mainly because of their performance. However, this improvement in performance comes at the cost of increased computational and memory bandwidth requirements, thus necessitating specialized hardware (GPUs); even then, these methods have large inference times compared to classical methods. This limits their applicability in real-world applications. We desire high-accuracy stereo methods, albeit with reasonable inference time. To this end, we propose a fast end-to-end stereo matching method. The majority of this speedup comes from integrating a leaner backbone. To recover the performance lost because of a leaner backbone, we propose to use a learned-attention-weights-based cost volume combined with LogL1 loss for stereo matching. Using LogL1 loss not only improves the overall performance of the proposed network but also leads to faster convergence. We do a detailed empirical evaluation of different design choices and show that our method requires 4x fewer operations and is also about 9 to 14x faster compared to state-of-the-art methods like ACVNet [1], LEAStereo [2] and CFNet [3] while giving comparable performance.

AMD-Hummingbird: Towards an Efficient Text-to-Video Model

Takashi Isobe,He Cui,Dong Zhou,Mengmeng Ge,Dong Li,Emad Barsoum

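Task: Propose Hummingbird, a lightweight text-to-video (T2V) generation framework that balances computational efficiency and visual quality.

Motivation: Existing models struggle to achieve both high visual quality and computational efficiency on resource-limited devices, and most work overlooks the need for small, efficient models in real-world deployment.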

Details

Method: Prunes existing models and improves visual quality through visual feedback learning, and introduces a data processing pipeline based on LLMs and VQA models. Result: Model parameters are reduced by 50% and speed improves 31x; the model achieves the highest score on VBench, supports generating 26-frame videos, and needs only 4 GPUs for training. Conclusion: Hummingbird offers an efficient, high-performing, and flexible solution for T2V generation. Abstract: Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g., iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.

Advancing Cross-Organ Domain Generalization with Test-Time Style Transfer and Diversity Enhancement

Biwen Meng,Xi Long,Wanrong Yang,Ruochen Liu,Yi Tian,Yalin Zheng,Jingxin Liu

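Task: Propose a test-time style transfer method (T3s) and a cross-domain style diversification module (CSDM) to address domain shift in multi-domain or cross-domain tasks in computational pathology.

Motivation: Because of the complexity of the domain shift problem, existing models degrade on multi-domain or cross-domain tasks, so model generalization needs to improve.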

Details

Method: Uses a bidirectional mapping mechanism to project source- and target-domain features into a unified feature space, introduces CSDM to ensure orthogonality between style bases, and applies data augmentation and low-rank adaptation to improve feature alignment and sensitivity. Result: The method's effectiveness is validated on three unseen datasets. Conclusion: The proposed T3s and CSDM effectively improve model adaptability and generalization on multi-domain or cross-domain tasks. Abstract: Deep learning has made significant progress in addressing challenges in various fields including computational pathology (CPath). However, due to the complexity of the domain shift problem, the performance of existing models will degrade, especially when it comes to multi-domain or cross-domain tasks. In this paper, we propose a test-time style transfer method (T3s) that uses a bidirectional mapping mechanism to project the features of the source and target domains into a unified feature space, enhancing the generalization ability of the model. To further increase the style expression space, we introduce a cross-domain style diversification module (CSDM) to ensure the orthogonality between style bases. In addition, data augmentation and low-rank adaptation techniques are used to improve feature alignment and sensitivity, enabling the model to adapt to multi-domain inputs effectively. Our method has demonstrated effectiveness on three unseen datasets.
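One simple way to encourage orthogonality between a set of basis vectors is to penalize the squared cosine similarity of every pair. The sketch below shows that idea in plain Python; it is only an illustration of the orthogonality constraint, since the abstract does not say how CSDM enforces it, and the function name and penalty form are assumptions.

```python
import math


def orthogonality_penalty(bases):
    """Sum of squared cosine similarities over all unordered pairs of bases.

    The penalty is 0.0 when every pair is orthogonal and grows as bases
    become more aligned (hypothetical regularizer, not the paper's loss).
    """
    unit = []
    for basis in bases:
        norm = math.sqrt(sum(v * v for v in basis))
        if norm == 0.0:
            raise ValueError("style bases must be non-zero vectors")
        unit.append([v / norm for v in basis])
    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cos * cos
    return penalty
```

Minimizing such a term alongside the main objective pushes the style bases to span distinct directions, which matches the stated goal of enlarging the style expression space.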

Adapting Video Diffusion Models for Time-Lapse Microscopy

Alexander Holmberg,Nils Mechtel,Wei Ouyang

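Task: Generate highly realistic time-lapse microscopy videos of HeLa cell division by domain-adapting video diffusion models.

Motivation: Although generative models for natural video have advanced significantly, their application to microscopy remains underexplored.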

Details

Method: Fine-tunes a pretrained video diffusion model on microscopy-specific sequences, exploring three conditioning strategies: text prompts, numeric embeddings, and image-conditioned generation. Result: Fine-tuning substantially improves video realism and accurately captures key behaviors such as cell division and migration; the model also generates coherent sequences beyond the training horizon. Conclusion: The results show that domain-specific fine-tuning of generative video models can produce biologically plausible synthetic microscopy data, supporting applications such as in-silico hypothesis testing and data augmentation. Abstract: We present a domain adaptation of video diffusion models to generate highly realistic time-lapse microscopy videos of cell division in HeLa cells. Although state-of-the-art generative video models have advanced significantly for natural videos, they remain underexplored in microscopy domains. To address this gap, we fine-tune a pretrained video diffusion model on microscopy-specific sequences, exploring three conditioning strategies: (1) text prompts derived from numeric phenotypic measurements (e.g., proliferation rates, migration speeds, cell-death frequencies), (2) direct numeric embeddings of phenotype scores, and (3) image-conditioned generation, where an initial microscopy frame is extended into a complete video sequence. Evaluation using biologically meaningful morphological, proliferation, and migration metrics demonstrates that fine-tuning substantially improves realism and accurately captures critical cellular behaviors such as mitosis and migration. Notably, the fine-tuned model also generalizes beyond the training horizon, generating coherent cell dynamics even in extended sequences. However, precisely controlling specific phenotypic characteristics remains challenging, highlighting opportunities for future work to enhance conditioning methods. Our results demonstrate the potential for domain-specific fine-tuning of generative video models to produce biologically plausible synthetic microscopy data, supporting applications such as in-silico hypothesis testing and data augmentation.

Unified Uncertainty-Aware Diffusion for Multi-Agent Trajectory Modeling

Guillem Capellera,Antonio Rubio,Luis Ferraz,Antonio Agudo

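Task: Propose U2Diff, a unified diffusion model for trajectory completion that provides state-wise uncertainty estimates and error probability estimates.

Motivation: Existing trajectory modeling methods focus mostly on forecasting future states, neglect tasks such as trajectory completion, and lack state-wise uncertainty estimates and error probability estimates for multi-modal scenes.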

Details

Method: Augments the denoising loss with the negative log-likelihood of the predicted noise, propagates latent-space uncertainty to the real state space, and introduces a Rank Neural Network for post-processing. Result: Outperforms existing methods on four sports datasets (NBA, Basketball-U, Football-U, Soccer-U), demonstrating the effectiveness of the uncertainty and error probability estimates. Conclusion: U2Diff performs well on trajectory completion and forecasting while providing practical uncertainty and error probability estimates. Abstract: Multi-agent trajectory modeling has primarily focused on forecasting future states, often overlooking broader tasks like trajectory completion, which are crucial for real-world applications such as correcting tracking data. Existing methods also generally predict agents' states without offering any state-wise measure of uncertainty. Moreover, popular multi-modal sampling methods lack any error probability estimates for each generated scene under the same prior observations, making it difficult to rank the predictions during inference time. We introduce U2Diff, a unified diffusion model designed to handle trajectory completion while providing state-wise uncertainty estimates jointly. This uncertainty estimation is achieved by augmenting the simple denoising loss with the negative log-likelihood of the predicted noise and propagating latent space uncertainty to the real state space. Additionally, we incorporate a Rank Neural Network in post-processing to enable error probability estimation for each generated mode, demonstrating a strong correlation with the error relative to ground truth. Our method outperforms the state-of-the-art solutions in trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), highlighting the effectiveness of uncertainty and error probability estimation. Video at https://youtu.be/ngw4D4eJToE
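The idea of augmenting the simple denoising loss with a negative log-likelihood term can be sketched as follows. The example assumes a diagonal Gaussian over the noise with a predicted log-variance per element and a scalar weight between the two terms; the abstract does not specify the likelihood family or the weighting, so `pred_log_var`, `nll_weight`, and the function names are assumptions.

```python
import math


def simple_denoising_loss(true_noise, pred_noise):
    """Mean squared error between the true and predicted noise."""
    return sum((t - p) ** 2 for t, p in zip(true_noise, pred_noise)) / len(true_noise)


def gaussian_nll(true_noise, pred_noise, pred_log_var):
    """Mean Gaussian negative log-likelihood, dropping the constant 0.5*log(2*pi)."""
    terms = [0.5 * (lv + (t - p) ** 2 / math.exp(lv))
             for t, p, lv in zip(true_noise, pred_noise, pred_log_var)]
    return sum(terms) / len(terms)


def uncertainty_aware_loss(true_noise, pred_noise, pred_log_var, nll_weight=1.0):
    """Simple denoising loss plus a weighted NLL term (assumed combination)."""
    return (simple_denoising_loss(true_noise, pred_noise)
            + nll_weight * gaussian_nll(true_noise, pred_noise, pred_log_var))
```

The NLL term is minimized when the predicted variance matches the squared error, so a model trained this way is rewarded for reporting large variance exactly where its noise prediction is poor; that is the sense in which the variance acts as an uncertainty estimate.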

Training-Free Personalization via Retrieval and Reasoning on Fingerprints

Deepayan Das,Davide Talon,Yiming Wang,Massimiliano Mancini,Elisa Ricci

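Task: Explore training-free personalization to address the limitations of vision language models (VLMs) in understanding user-specific concepts.

Motivation: Existing personalization methods rely on training procedures that are costly or give a poor user experience, so a training-free solution is needed.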

Details

Method: Proposes R2P, which achieves personalization through concept fingerprint extraction, retrieval and reasoning, cross-modal verification, and pairwise multimodal matching. Result: R2P outperforms existing methods on multiple benchmarks, performing especially well under visual ambiguity. Conclusion: R2P is an efficient, training-free personalization method that substantially improves VLM performance on understanding user-specific concepts. Abstract: Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but heavily rely on training procedures that can be either costly or unpleasant to individual users. We depart from existing work, and for the first time explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint, i.e., key attributes uniquely defining the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), for concept identification highlighting challenges in visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.

Generative Dataset Distillation using Min-Max Diffusion Model

Junqiao Fan,Yunjiao Zhou,Min Chang Jordan Ren,Jianfei Yang

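Task: Address generative dataset distillation, using a diffusion model to synthesize images and optimizing generation efficiency.

Motivation: Generative models are inefficient for dataset distillation: diffusion models are slow to generate images, so sample count must be balanced against quality.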

Details

Method: Uses a diffusion model to generate a surrogate dataset, applies a min-max loss to control diversity and representativeness, and proposes a diffusion step reduction strategy to optimize performance. Result: The model placed second in the generative track of the ECCV2024 dataset distillation challenge. Conclusion: The proposed method balances generation efficiency and image quality, validating its effectiveness. Abstract: In this paper, we address the problem of generative dataset distillation that utilizes generative models to synthesize images. The generator may produce any number of images under a preserved evaluation time. In this work, we leverage the popular diffusion model as the generator to compute a surrogate dataset, boosted by a min-max loss to control the dataset's diversity and representativeness during training. However, the diffusion model is time-consuming when generating images, as it requires an iterative generation process. We observe a critical trade-off between the number of image samples and the image quality controlled by the diffusion steps and propose Diffusion Step Reduction to achieve optimal performance. This paper details our comprehensive method and its performance. Our model achieved 2nd place in the generative track of The First Dataset Distillation Challenge of ECCV2024 (https://www.dd-challenge.com/#/), demonstrating its superior performance.
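The trade-off between sample count and diffusion steps follows from simple arithmetic: if each denoising step costs roughly the same time, a fixed generation budget buys fewer images as the step count rises. The sketch below makes that explicit; the budget and per-step cost are made-up numbers, and the function name is hypothetical.

```python
def images_within_budget(time_budget_ms, diffusion_steps, ms_per_step):
    """Number of whole images that fit in the budget, assuming each image
    needs `diffusion_steps` sequential steps of `ms_per_step` each."""
    if diffusion_steps <= 0 or ms_per_step <= 0:
        raise ValueError("diffusion_steps and ms_per_step must be positive")
    return time_budget_ms // (diffusion_steps * ms_per_step)


# Illustrative numbers only: a 10-minute budget at 40 ms per step.
budget_ms = 600_000
counts = {steps: images_within_budget(budget_ms, steps, 40)
          for steps in (10, 25, 50)}
```

Under these assumed costs, cutting the step count from 50 to 10 yields five times as many images, at whatever quality the shorter schedule can deliver; choosing the step count that balances the two is the role the abstract assigns to Diffusion Step Reduction.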

Dig2DIG: Dig into Diffusion Information Gains for Image Fusion

Bing Cao,Baoshuo Cai,Changqing Zhang,Qinghua Hu

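Task: Propose a diffusion-based dynamic image fusion framework that achieves higher-quality fusion by quantifying each modality's information contribution at different denoising steps.

Motivation: Existing diffusion models for image fusion typically use predefined multimodal guidance, which cannot dynamically capture the changing importance of each modality and lacks theoretical guarantees.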

Details

Method: Reveals a spatio-temporal imbalance in image denoising, proposes diffusion information gains (DIG) as a quantification method, and theoretically derives a dynamic image fusion framework. Result: In experiments on multiple fusion scenarios, the method outperforms existing diffusion-based approaches in both fusion quality and inference efficiency. Conclusion: By dynamically quantifying modality information contributions, the proposed framework substantially improves image fusion and comes with theoretical guarantees. Abstract: Image fusion integrates complementary information from multi-source images to generate more informative results. Recently, the diffusion model, which demonstrates unprecedented generative potential, has been explored in image fusion. However, these approaches typically incorporate predefined multimodal guidance into diffusion, failing to capture the dynamically changing significance of each modality, while lacking theoretical guarantees. To address this issue, we reveal a significant spatio-temporal imbalance in image denoising; specifically, the diffusion model produces dynamic information gains in different image regions with denoising steps. Based on this observation, we Dig into the Diffusion Information Gains (Dig2DIG) and theoretically derive a diffusion-based dynamic image fusion framework that provably reduces the upper bound of the generalization error. Accordingly, we introduce diffusion information gains (DIG) to quantify the information contribution of each modality at different denoising steps, thereby providing dynamic guidance during the fusion process. Extensive experiments on multiple fusion scenarios confirm that our method outperforms existing diffusion-based approaches in terms of both fusion quality and inference efficiency.

Towards Human-Understandable Multi-Dimensional Concept Discovery

Arne Grobrügge,Niklas Kühl,Gerhard Satzger,Philipp Spitzer

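Task: Propose HU-MCD, a method that aims to improve the understandability of concepts in concept-based explainable AI (C-XAI).

Motivation: The concept explanations produced by conventional MCD are hard for humans to understand, which limits their practical use.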

Details

Method: Uses the Segment Anything Model for concept identification and a CNN-specific input masking technique to reduce noise. Result: Experiments show that HU-MCD provides more precise and reliable concept explanations than existing C-XAI methods. Conclusion: HU-MCD substantially improves concept understandability while maintaining explanation faithfulness. Abstract: Concept-based eXplainable AI (C-XAI) aims to overcome the limitations of traditional saliency maps by converting pixels into human-understandable concepts that are consistent across an entire dataset. A crucial aspect of C-XAI is completeness, which measures how well a set of concepts explains a model's decisions. Among C-XAI methods, Multi-Dimensional Concept Discovery (MCD) effectively improves completeness by breaking down the CNN latent space into distinct and interpretable concept subspaces. However, MCD's explanations can be difficult for humans to understand, raising concerns about their practical utility. To address this, we propose Human-Understandable Multi-dimensional Concept Discovery (HU-MCD). HU-MCD uses the Segment Anything Model for concept identification and implements a CNN-specific input masking technique to reduce noise introduced by traditional masking methods. These changes to MCD, paired with the completeness relation, enable HU-MCD to enhance concept understandability while maintaining explanation faithfulness. Our experiments, including human subject studies, show that HU-MCD provides more precise and reliable explanations than existing C-XAI methods. The code is available at https://github.com/grobruegge/hu-mcd.

Robust Lane Detection with Wavelet-Enhanced Context Modeling and Adaptive Sampling

Kunyang Li,Ming Hou

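Task: Propose a lane detection method based on a Wavelet-Enhanced Feature Pyramid Network (WE-FPN) to handle challenges such as extreme weather, illumination changes, occlusions, and complex curves.

Motivation: Existing methods such as CLRNet perform poorly under adverse conditions, so the robustness and accuracy of lane detection need to improve.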

Details

Method: Integrates a wavelet-based non-local block to strengthen global context modeling, designs an adaptive preprocessing module to improve lighting conditions, and uses an attention-guided sampling strategy to refine spatial features. Result: Significantly outperforms baselines on the CULane and TuSimple datasets, especially under adverse conditions. Conclusion: WE-FPN improves the robustness and accuracy of lane detection in complex scenes and suits real-world driving environments. Abstract: Lane detection is critical for autonomous driving and advanced driver assistance systems (ADAS). While recent methods like CLRNet achieve strong performance, they struggle under adverse conditions such as extreme weather, illumination changes, occlusions, and complex curves. We propose a Wavelet-Enhanced Feature Pyramid Network (WE-FPN) to address these challenges. A wavelet-based non-local block is integrated before the feature pyramid to improve global context modeling, especially for occluded and curved lanes. Additionally, we design an adaptive preprocessing module to enhance lane visibility under poor lighting. An attention-guided sampling strategy further refines spatial features, boosting accuracy on distant and curved lanes. Experiments on CULane and TuSimple demonstrate that our approach significantly outperforms baselines in challenging scenarios, achieving better robustness and accuracy in real-world driving conditions.
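As background for the wavelet component, the sketch below implements a single-level 1D Haar transform and its inverse. Haar is chosen here only because it is the simplest wavelet; the abstract does not state which wavelet WE-FPN uses or how the non-local block consumes the sub-bands, so this is a generic illustration with hypothetical function names.

```python
import math


def haar_dwt(signal):
    """Single-level Haar transform: returns (approximation, detail) bands."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleaves the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out
```

The approximation band is a half-resolution summary of the signal and the detail band holds the local differences, so a block operating on the approximation sees wider context per element at lower cost, while the detail band preserves edges such as lane boundaries for reconstruction.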

OCCO: LVM-guided Infrared and Visible Image Fusion Framework based on Object-aware and Contextual COntrastive Learning

Hui Li,Congcong Bian,Zeyang Zhang,Xiaoning Song,Xi Li,Xiao-Jun Wu

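Task: Propose OCCO, an image fusion framework guided by a large vision model (LVM), to balance fused image quality against downstream task performance.

Motivation: Existing fusion methods struggle to deliver both high-quality fused images and strong downstream task performance; this conflict needs to be resolved.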

Details

Method: Uses a pre-trained LVM for semantic guidance, combined with object-aware and contextual contrastive learning (OCCO), and designs a feature interaction fusion network to reduce information conflicts caused by modality differences. Result: Comparison with eight state-of-the-art methods on four datasets validates OCCO's effectiveness, and it performs strongly on downstream vision tasks. Conclusion: Through semantic guidance and contrastive learning, the OCCO framework substantially improves fused image quality and downstream task performance. Abstract: Image fusion is a crucial technique in the field of computer vision, and its goal is to generate high-quality fused images and improve the performance of downstream tasks. However, existing fusion methods struggle to balance these two factors. Achieving high quality in fused images may result in lower performance in downstream visual tasks, and vice versa. To address this drawback, a novel LVM (large vision model)-guided fusion framework with Object-aware and Contextual COntrastive learning is proposed, termed OCCO. The pre-trained LVM is utilized to provide semantic guidance, allowing the network to focus solely on fusion tasks while emphasizing learning salient semantic features in the form of contrastive learning. Additionally, a novel feature interaction fusion network is also designed to resolve information conflicts in fused images caused by modality differences. By learning the distinction between positive samples and negative samples in the latent feature space (contextual space), the integrity of target information in the fused image is improved, thereby benefiting downstream performance. Finally, compared with eight state-of-the-art methods on four datasets, the effectiveness of the proposed method is validated, and exceptional performance is also demonstrated on downstream visual tasks.

Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

Nina Shvetsova,Arsha Nagrani,Bernt Schiele,Hilde Kuehne,Christian Rupprecht

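Task: Propose the "Unbiased through Textual Description (UTD)" video benchmark, built from unbiased subsets of existing video classification and retrieval datasets, for a more robust assessment of video understanding capabilities.

Motivation: Current video benchmarks may give inaccurate assessments because of object bias or single-frame bias, where recognizing objects or using a single frame is enough for a correct prediction.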

Details

Method: Uses vision-language models (VLMs) and large language models (LLMs) to generate frame-wise textual descriptions of videos, then filters and analyzes these descriptions to remove representation bias along three dimensions: concept bias, temporal bias, and common sense vs. dataset bias. Result: Systematically analyzes 12 popular video classification and retrieval datasets and creates new object-debiased test splits for them; also benchmarks 30 state-of-the-art video models on the original and debiased splits. Conclusion: Releases the "UTD-descriptions" and "UTD-splits" datasets to support future development of more robust video understanding benchmarks and models. Abstract: We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects) and leverage them to examine representation biases across three dimensions: 1) concept bias - determining if a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias - assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias - evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. We conduct a systematic analysis of 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release: "UTD-descriptions", a dataset with our rich structured descriptions for each dataset, and "UTD-splits", a dataset of object-debiased test splits.

LLGS: Unsupervised Gaussian Splatting for Image Enhancement and Reconstruction in Pure Dark Environment

Haoran Wang,Jingwei Huang,Lu Yang,Tianchen Deng,Gaojing Zhang,Mingrui Li

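Task: Propose LLGS, an unsupervised multi-view stereoscopic system based on 3D Gaussian Splatting, for image enhancement and scene reconstruction in low-light environments.

Motivation: The original 3D Gaussian Splatting lacks color representation in low-light environments, and existing single-view enhancement methods rely on pre-trained data and lack scene generalization, which limits applications in robotics.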

Details

Method: Introduces a decomposable Gaussian representation, M-Color, that characterizes color information separately for targeted enhancement; proposes an unsupervised optimization method that uses direction-based enhancement to ensure multi-view consistency. Result: Experiments on real-world datasets show that LLGS outperforms existing methods in both low-light enhancement and 3D Gaussian Splatting. Conclusion: LLGS effectively addresses multi-view consistency in low-light environments and broadens the applicability of 3D Gaussian Splatting. Abstract: 3D Gaussian Splatting has shown remarkable capabilities in novel view rendering tasks and exhibits significant potential for multi-view optimization. However, the original 3D Gaussian Splatting lacks color representation for inputs in low-light environments. Simply using enhanced images as inputs would lead to issues with multi-view consistency, and current single-view enhancement systems rely on pre-trained data, lacking scene generalization. These problems limit the application of 3D Gaussian Splatting in low-light conditions in the field of robotics, including high-fidelity modeling and feature matching. To address these challenges, we propose an unsupervised multi-view stereoscopic system based on Gaussian Splatting, called Low-Light Gaussian Splatting (LLGS). This system aims to enhance images in low-light environments while reconstructing the scene. Our method introduces a decomposable Gaussian representation called M-Color, which separately characterizes color information for targeted enhancement. Furthermore, we propose an unsupervised optimization method with zero-knowledge priors, using direction-based enhancement to ensure multi-view consistency. Experiments conducted on real-world datasets demonstrate that our system outperforms state-of-the-art methods in both low-light enhancement and 3D Gaussian Splatting.

Robust face recognition based on the wing loss and the $\ell_1$ regularization

Yaoyao Yun,Jianwen Xu

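Task: Propose a new wing-constrained sparse coding model (WCSC) and its weighted version (WWCSC) for face recognition in complex environments.

Motivation: The recognition rate of existing sparse sampling models drops significantly on heavily occluded or damaged face images, so a more robust method is needed.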

Details

Method: Solves the minimization problems with the alternating direction method of multipliers (ADMM) and tests performance on four well-known facial databases. Result: WWCSC retains a very high recognition rate on heavily occluded or damaged images and outperforms other methods. Conclusion: WWCSC shows strong robustness in face recognition and is suited to complex environments. Abstract: In recent years, sparse sampling techniques based on regression analysis have seen extensive application in face recognition research, and numerous such models have been explored. Nevertheless, the recognition rates of most of these models decrease significantly when confronted with highly occluded and highly damaged face images. In this paper, a new wing-constrained sparse coding model (WCSC) and its weighted version (WWCSC) are introduced to deal with the face recognition problem in complex circumstances, where the alternating direction method of multipliers (ADMM) algorithm is employed to solve the corresponding minimization problems. In addition, the performance of the proposed method is examined on four well-known facial databases, namely the ORL facial database, the Yale facial database, the AR facial database and the FERET facial database. Compared to other methods in the literature, the WWCSC has a very high recognition rate even in complex situations where face images have high occlusion or high damage, which illustrates the robustness of the WWCSC method in facial recognition.
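
The abstract does not state the objective, so the following is only a hedged sketch. It uses the wing loss as it is commonly defined for facial landmark regression (logarithmic for small residuals, linear for large ones) and pairs it with an $\ell_1$ penalty on the coding coefficients. The objective form, the parameters `w`, `eps` and `lam`, and all function names are assumptions for illustration; the ADMM solver and the weighted variant are not shown.

```python
import math


def wing_loss(residual, w=10.0, eps=2.0):
    """Wing loss of one scalar residual (w and eps are illustrative defaults)."""
    x = abs(residual)
    # The offset c makes the logarithmic and linear pieces meet at |x| = w.
    c = w - w * math.log(1.0 + w / eps)
    if x < w:
        return w * math.log(1.0 + x / eps)
    return x - c


def sparse_coding_objective(y, dictionary, alpha, lam=0.1, w=10.0, eps=2.0):
    """Assumed objective: sum of wing losses on y - D @ alpha, plus lam * ||alpha||_1.

    `dictionary` is a list of rows, one row per entry of y.
    """
    data_term = 0.0
    for y_i, row in zip(y, dictionary):
        prediction = sum(d * a for d, a in zip(row, alpha))
        data_term += wing_loss(y_i - prediction, w, eps)
    l1_term = lam * sum(abs(a) for a in alpha)
    return data_term + l1_term
```

Because the linear branch grows more slowly than a squared error, large residuals such as those produced by occluded pixels contribute less to the objective than they would under least squares, which is one plausible reading of the robustness claim.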

Leveraging Land Cover Priors for Isoprene Emission Super-Resolution

Christopher Ummerle,Antonio Giganti,Sara Mandelli,Paolo Bestagini,Stefano Tubaro

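Task: Propose a deep learning-based super-resolution framework that uses land cover information to improve the spatial accuracy of Biogenic Volatile Organic Compound (BVOC) emissions.

Motivation: Satellite data has limited spatial resolution, which restricts its use in atmospheric modeling and climate research, so more accurate methods are needed.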

Details

Method: Uses land cover priors as emission drivers and captures spatial patterns with a deep learning super-resolution framework. Result: Experiments show that incorporating land cover data significantly improves emission super-resolution accuracy, particularly in heterogeneous landscapes. Conclusion: The method offers a cost-effective, data-driven approach for atmospheric chemistry and climate modeling and improves the usability of satellite emission data. Abstract: Remote sensing plays a crucial role in monitoring Earth's ecosystems, yet satellite-derived data often suffer from limited spatial resolution, restricting their applicability in atmospheric modeling and climate research. In this work, we propose a deep learning-based Super-Resolution (SR) framework that leverages land cover information to enhance the spatial accuracy of Biogenic Volatile Organic Compounds (BVOCs) emissions, with a particular focus on isoprene. Our approach integrates land cover priors as emission drivers, capturing spatial patterns more effectively than traditional methods. We evaluate the model's performance across various climate conditions and analyze statistical correlations between isoprene emissions and key environmental information such as cropland and tree cover data. Additionally, we assess the generalization capabilities of our SR model by applying it to unseen climate zones and geographical regions. Experimental results demonstrate that incorporating land cover data significantly improves emission SR accuracy, particularly in heterogeneous landscapes. This study contributes to atmospheric chemistry and climate modeling by providing a cost-effective, data-driven approach to refining BVOC emission maps. The proposed method enhances the usability of satellite-based emissions data, supporting applications in air quality forecasting, climate impact assessments, and environmental studies.

Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark

Bingchen Miao,Yang Wu,Minghe Gao,Qifan Yu,Wendong Bu,Wenqiao Zhang,Yunfei Li,Siliang Tang,Tat-Seng Chua,Juncheng Li

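Task: Propose Similar, a step-wise, multi-dimensional generalist reward model that gives Generalist Virtual Agents (GVAs) fine-grained training signals and action selection for inference-time scaling.

Motivation: Current training paradigms for generalist virtual agents are limited by their reliance on outcome supervision and human annotation, so more efficient training methods are needed.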

Details

Method: Systematically defines five dimensions for evaluating agent actions, designs the MCTS-P algorithm to collect and annotate data automatically, trains the Similar model with the Triple-M strategy, and introduces the SRM benchmark. Result: Experiments show that Similar, through step-wise multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals. Conclusion: Similar performs well in both training and inference-time scaling and offers a new direction for GVA development. Abstract: The development of Generalist Virtual Agents (GVAs) powered by Multimodal Large Language Models (MLLMs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a Step-wise Multi-dimensional Generalist Reward Model, which offers fine-grained signals for agent training and can choose better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with the Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRMTrain, which serves as the training set for Similar, and SRMEval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference-time scaling. The code is available at https://github.com/Galery23/Similar-v1.

Structure-Aware Correspondence Learning for Relative Pose Estimation

Yihan Chen,Wenfei Yang,Huan Ren,Shifeng Zhang,Tianzhu Zhang,Feng Wu

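Task: Propose a structure-aware correspondence learning method for relative pose estimation.

Motivation: Existing 3D correspondence-based methods rely on explicit feature matching and perform poorly when visible regions overlap little or when feature estimation for invisible regions is unreliable.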

Details

Method: Designs a structure-aware keypoint extraction module and a structure-aware correspondence estimation module, and estimates pose by using the two jointly. Result: Significantly outperforms existing methods on the CO3D, Objaverse and LineMOD datasets, for example reducing the mean angular error on CO3D by 5.7 degrees. Conclusion: The method estimates 3D-3D correspondences for unseen objects without explicit feature matching, enabling precise relative pose estimation. Abstract: Relative pose estimation provides a promising way of achieving object-agnostic pose estimation. Despite the success of existing 3D correspondence-based methods, the reliance on explicit feature matching suffers from small overlaps in visible regions and unreliable feature estimation for invisible regions. Inspired by humans' ability to assemble two object parts that have small or no overlapping regions by considering object structure, we propose a novel Structure-Aware Correspondence Learning method for Relative Pose Estimation, which consists of two key modules. First, a structure-aware keypoint extraction module is designed to locate a set of keypoints that can represent the structure of objects with different shapes and appearance, under the guidance of a keypoint-based image reconstruction loss. Second, a structure-aware correspondence estimation module is designed to model the intra-image and inter-image relationships between keypoints to extract structure-aware features for correspondence estimation. By jointly leveraging these two modules, the proposed method can naturally estimate 3D-3D correspondences for unseen objects without explicit feature matching for precise relative pose estimation. Experimental results on the CO3D, Objaverse and LineMOD datasets demonstrate that the proposed method significantly outperforms prior methods, i.e., with a 5.7° reduction in mean angular error on the CO3D dataset.

Feature Calibration enhanced Parameter Synthesis for CLIP-based Class-incremental Learning

Juncen Guo,Xiaoguang Zhu,Lianlong Sun,Liangyu Teng,Di Li,Yang Liu,Liang Song

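Task: Propose Feature Calibration enhanced Parameter Synthesis (FCPS) to address catastrophic forgetting and generalization in class-incremental learning (CIL).

Motivation: Traditional CIL methods are mainly based on visual features and struggle with complex scenarios; Vision-Language Models (VLMs) have great potential, but existing methods struggle to balance catastrophic forgetting against generalization.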

Details

Method: Iteratively adjusts the proportion of original visual features in the final classification through a feature calibration mechanism, and balances new and old knowledge through parameter integration. Result: Experiments on benchmarks such as CIFAR100 and ImageNet100 validate the superiority of the method. Conclusion: FCPS effectively addresses catastrophic forgetting in CIL while preserving the model's generalization capability. Abstract: Class-incremental Learning (CIL) enables models to continuously learn new class knowledge while memorizing previous classes, facilitating their adaptation and evolution in dynamic environments. Traditional CIL methods are mainly based on visual features, which limits their ability to handle complex scenarios. In contrast, Vision-Language Models (VLMs) show promising potential to promote CIL by integrating pretrained knowledge with textual features. However, previous methods make it difficult to overcome catastrophic forgetting while preserving the generalization capabilities of VLMs. To tackle these challenges, we propose Feature Calibration enhanced Parameter Synthesis (FCPS) in this paper. Specifically, our FCPS employs a specific parameter adjustment mechanism to iteratively refine the proportion of original visual features participating in the final class determination, ensuring the model's foundational generalization capabilities. Meanwhile, parameter integration across different tasks achieves a balance between learning new class knowledge and retaining old knowledge. Experimental results on popular benchmarks (e.g., CIFAR100 and ImageNet100) validate the superiority of the proposed method.

Any6D: Model-free 6D Pose Estimation of Novel Objects

Taeyeop Lee,Bowen Wen,Minjun Kang,Gyuree Kang,In So Kweon,Kuk-Jin Yoon

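Task: Propose Any6D, a model-free 6D object pose estimation framework that needs only a single RGB-D anchor image to estimate the 6D pose and size of unknown objects in novel scenes.

Motivation: Existing methods rely on textured 3D models or multi-view data; Any6D aims to improve 2D-3D alignment and metric scale estimation through a joint object alignment process, for better pose accuracy.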

Details

Method: Uses a render-and-compare strategy to generate and refine pose hypotheses, combined with a joint object alignment process, for robustness to occlusions, non-overlapping views, diverse lighting, and large cross-environment variations. Result: Validated on five datasets (REAL275, Toyota-Light, HO3D, YCBINEOAT and LM-O), where it significantly outperforms existing methods. Conclusion: Any6D achieves efficient and robust 6D pose estimation without object models or multi-view data. Abstract: We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. Unlike existing methods that rely on textured 3D models or multiple viewpoints, Any6D leverages a joint object alignment process to enhance 2D-3D alignment and metric scale estimation for improved pose accuracy. Our approach integrates a render-and-compare strategy to generate and refine pose hypotheses, enabling robust performance in scenarios with occlusions, non-overlapping views, diverse lighting conditions, and large cross-environment variations. We evaluate our method on five challenging datasets: REAL275, Toyota-Light, HO3D, YCBINEOAT, and LM-O, demonstrating its effectiveness in significantly outperforming state-of-the-art methods for novel object pose estimation. Project page: https://taeyeop.com/any6d

Human Motion Unlearning

Edoardo De Matteis,Matteo Migliarini,Alessio Sampieri,Indro Spinelli,Fabio Galasso

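Task: Propose the task of human motion unlearning, to prevent the synthesis of toxic animations while preserving general text-to-motion generative performance.

Motivation: Toxic motions can be generated from explicit text prompts or from implicit toxic combinations of safe motions (e.g., "kicking" is "loading and swinging a leg"), which makes unlearning them challenging.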

Details

Method: Builds the first motion unlearning benchmark by filtering toxic motions from the HumanML3D and Motion-X datasets; proposes baselines that process spatio-temporal signals, and a novel motion unlearning model based on Latent Code Replacement (LCR). Result: LCR is training-free, suits the discrete latent spaces of state-of-the-art text-to-motion diffusion models, is simple, and outperforms the baselines. Conclusion: LCR outperforms the baselines both qualitatively and quantitatively, providing an effective solution for motion unlearning. Abstract: We introduce the task of human motion unlearning to prevent the synthesis of toxic animations while preserving the general text-to-motion generative performance. Unlearning toxic motions is challenging as those can be generated from explicit text prompts and from implicit toxic combinations of safe motions (e.g., "kicking" is "loading and swinging a leg"). We propose the first motion unlearning benchmark by filtering toxic motions from the large and recent text-to-motion datasets of HumanML3D and Motion-X. We propose baselines by adapting state-of-the-art image unlearning techniques to process spatio-temporal signals. Finally, we propose a novel motion unlearning model based on Latent Code Replacement, which we dub LCR. LCR is training-free and suitable for the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms baselines qualitatively and quantitatively. Project page: https://www.pinlab.org/hmu.

NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping

Tianyi Wang,Harry Cheng,Xiao Zhang,Yinglong Wang

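Task: Propose NullSwap, a novel proactive defense that perturbs the identity features of source images to block Deepfake face swapping attacks.

Motivation: Existing proactive perturbation methods suffer from visual degradation, limited effectiveness against face swapping, and strong dependence on generative models, so a more effective black-box defense is needed.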

Details

Method: Designs an identity extraction module, a perturbation block, and a feature block, combined with dynamic loss weighting, to generate identity-guided perturbations that protect the source image identity. Result: Experiments show that NullSwap effectively fools a variety of identity recognition models and outperforms existing methods at preventing face swapping models from generating images with the correct source identity. Conclusion: NullSwap is a black-box defense that does not depend on generative models and effectively protects source image identities against Deepfake face swapping. Abstract: Because passive detection of high-quality Deepfake images suffers from performance bottlenecks as generative models advance, proactive perturbations offer a promising approach to disabling Deepfake manipulations by inserting signals into benign images. However, existing proactive perturbation approaches remain unsatisfactory in several aspects: 1) visual degradation due to direct element-wise addition; 2) limited effectiveness against face swapping manipulation; 3) unavoidable reliance on white- and grey-box settings to involve generative models during training. In this study, we analyze the essence of Deepfake face swapping and argue the necessity of protecting source identities rather than target images, and we propose NullSwap, a novel proactive defense approach that cloaks source image identities and nullifies face swapping under a pure black-box scenario. We design an Identity Extraction module to obtain facial identity features from the source image, while a Perturbation Block is then devised to generate identity-guided perturbations accordingly. Meanwhile, a Feature Block extracts shallow-level image features, which are then fused with the perturbation in the Cloaking Block for image reconstruction. Furthermore, to ensure adaptability across different identity extractors in face swapping algorithms, we propose Dynamic Loss Weighting to adaptively balance identity losses. Experiments demonstrate the outstanding ability of our approach to fool various identity recognition models, outperforming state-of-the-art proactive perturbations in preventing face swapping models from generating images with correct source identities.

Hardware-Rasterized Ray-Based Gaussian Splatting

Samuel Rota Bulò,Nemanja Bartolovic,Lorenzo Porzi,Peter Kontschieder

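Task: Propose a hardware-rasterized rendering method for ray-based 3D Gaussian Splatting (RayGS) that achieves fast, high-quality novel view synthesis.

Motivation: Existing methods cannot render at the high frame rates required by quality-sensitive applications such as Virtual and Mixed Reality, and differing rendering scales during training and testing cause MIP-related issues.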

Details

Method: Through a mathematically rigorous and geometrically intuitive derivation, efficiently estimates all quantities needed to render RayGS models, structured around standard hardware rasterization shaders. Result: Achieves significant performance gains across benchmark scenes while retaining the state-of-the-art appearance quality of RayGS. Conclusion: The method is the first to render RayGS models at high frame rates, supports quality-sensitive applications, and resolves aliasing in multi-scale rendering. Abstract: We present a novel, hardware-rasterized rendering approach for ray-based 3D Gaussian Splatting (RayGS), obtaining both fast and high-quality results for novel view synthesis. Our work contains a mathematically rigorous and geometrically intuitive derivation of how to efficiently estimate all relevant quantities for rendering RayGS models, structured with respect to standard hardware rasterization shaders. Our solution is the first to enable rendering of RayGS models at sufficiently high frame rates to support quality-sensitive applications like Virtual and Mixed Reality. Our second contribution enables alias-free rendering for RayGS, by addressing MIP-related issues arising when rendering diverging scales during training and testing. We demonstrate significant performance gains across different benchmark scenes, while retaining the state-of-the-art appearance quality of RayGS.

OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad

Luyao Tang,Yuxuan Yuan,Chaoqi Chen,Zeyu Zhang,Yue Huang,Kun Zhang

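Task: Study how to improve the generalization of foundation models (FMs) to out-of-domain data.

Motivation: The generalization ability of foundation models drops significantly under distribution shifts, weak supervision, or malicious attacks in the open world, while most existing methods are task-related or model-specific and lack universality and transferability.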

Details

Method: Proposes a novel framework, the Object-Concept-Relation Triad (OCRT), which extracts sparse, high-level concepts and intricate relational structures from raw visual inputs through unsupervised decoupling and iterative refinement. Result: Experiments show that OCRT substantially improves the generalizability and robustness of SAM and CLIP across multiple downstream tasks. Conclusion: The OCRT framework offers a general, transferable solution for improving the generalization of foundation models in the open world. Abstract: Although foundation models (FMs) claim to be powerful, their generalization ability significantly decreases when faced with distribution shifts, weak supervision, or malicious attacks in the open world. On the other hand, most domain generalization or adversarial fine-tuning methods are task-related or model-specific, ignoring the universality in practical applications and the transferability between FMs. This paper delves into the problem of generalizing FMs to out-of-domain data. We propose a novel framework, the Object-Concept-Relation Triad (OCRT), that enables FMs to extract sparse, high-level concepts and intricate relational structures from raw visual inputs. The key idea is to bind objects in visual scenes and a set of object-centric representations through unsupervised decoupling and iterative refinement. To be specific, we project the object-centric representations onto a semantic concept space that the model can readily interpret and estimate their importance to filter out irrelevant elements. Then, a concept-based graph, which has a flexible degree, is constructed to incorporate the set of concepts and their corresponding importance, enabling the extraction of high-order factors from informative concepts and facilitating relational reasoning among these concepts. Extensive experiments demonstrate that OCRT can substantially boost the generalizability and robustness of SAM and CLIP across multiple downstream tasks.

Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining

Guanglu Dong,Tianheng Zheng,Yuanzhouhan Cao,Linbo Qing,Chao Ren

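Task: Propose CSUD, an unsupervised image deraining framework based on a channel consistency prior and a self-reconstruction strategy, to address the difficulty of obtaining real paired data and poor generalization.

Motivation: Real paired data is hard to obtain and existing models generalize poorly, which limits the performance of deep image deraining models in real-world applications.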

Details

Method: Proposes a Channel Consistency Loss (CCLoss) and a Self-Reconstruction (SR) strategy, and strengthens the deraining network by generating high-quality pseudo clean and rainy image pairs. Result: Experiments on multiple synthetic and real-world datasets show that CSUD outperforms other state-of-the-art unsupervised methods in deraining performance and generalizes better. Conclusion: By introducing the channel consistency prior and the self-reconstruction strategy, CSUD significantly improves the performance and generalization of unsupervised image deraining. Abstract: Recently, deep image deraining models based on paired datasets have made remarkable progress. However, they cannot be well applied in real-world applications due to the difficulty of obtaining real paired datasets and their poor generalization performance. In this paper, we propose a novel Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining framework, CSUD, to tackle the aforementioned challenges. During training with unpaired data, CSUD is capable of generating high-quality pseudo clean and rainy image pairs which are used to enhance the performance of the deraining network. Specifically, to preserve more image background details while transferring rain streaks from rainy images to the unpaired clean images, we propose a novel Channel Consistency Loss (CCLoss) by introducing the Channel Consistency Prior (CCP) of rain streaks into the training process, thereby ensuring that the generated pseudo rainy images closely resemble the real ones. Furthermore, we propose a novel Self-Reconstruction (SR) strategy to alleviate the redundant information transfer problem of the generator, further improving the deraining performance and the generalization capability of our method. Extensive experiments on multiple synthetic and real-world datasets demonstrate that the deraining performance of CSUD surpasses other state-of-the-art unsupervised methods and that CSUD exhibits superior generalization capability.

Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis

Inseung Hwang,Kiseok Choi,Hyunho Ha,Min H. Kim

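Task: Improve noise suppression and super-resolution reconstruction for polarization imaging by developing dedicated datasets and a noise analysis model.

Motivation: Polarization imaging suffers from increased noise because of low light efficiency and low spatial resolution, and the lack of tailored datasets and noise statistics limits further improvement.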

Details

Method: Introduces two dedicated datasets, PolarNS and PolarBurstSR, and proposes a polarization noise analysis model for quantifying noise propagation. Result: The datasets and analysis model provide a comprehensive evaluation benchmark for polarization imaging and show the advantage of training tailored to polarization. Conclusion: The work provides an important benchmark and theoretical support for super-resolution reconstruction and noise suppression in polarization imaging. Abstract: Snapshot polarization imaging calculates polarization states from linearly polarized subimages. To achieve this, a polarization camera employs a double Bayer-patterned sensor to capture both color and polarization. It demonstrates low light efficiency and low spatial resolution, resulting in increased noise and compromised polarization measurements. Although burst super-resolution effectively reduces noise and enhances spatial resolution, applying it to polarization imaging poses challenges due to the lack of tailored datasets and reliable ground truth noise statistics. To address these issues, we introduce PolarNS and PolarBurstSR, two innovative datasets developed specifically for polarization imaging. PolarNS provides characterization of polarization noise statistics, facilitating thorough analysis, while PolarBurstSR functions as a benchmark for burst super-resolution in polarization images. These datasets, collected under various real-world conditions, enable comprehensive evaluation. Additionally, we present a model for analyzing polarization noise to quantify noise propagation, tested on a large dataset captured in a darkroom environment. As part of our application, we compare the latest burst super-resolution models, highlighting the advantages of training tailored to polarization compared to RGB-based methods. This work establishes a benchmark for polarization burst super-resolution and offers critical insights into noise propagation, thereby enhancing polarization image reconstruction.

Revisiting Automatic Data Curation for Vision Foundation Models in Digital Pathology

Boqi Chen,Cédric Vincent-Cuaz,Lydia A. Schoenpflug,Manuel Madeira,Lisa Fournier,Vaishnavi Subramanian,Sonali Andani,Samuel Ruiperez-Campillo,Julia E. Vogt,Raphaëlle Luisier,Dorina Thanou,Viktor H. Koelzer,Pascal Frossard,Gabriele Campanella,Gunnar Rätsch

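Task: Investigate the potential of unsupervised automatic data curation at the tile level to optimize the pre-training data of vision foundation models.

Motivation: Data selection currently relies mainly on expert knowledge and overlooks tile-level detail, which may hurt model performance.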

Details

Method: Clusters pre-extracted tile embeddings with hierarchical clustering trees, uniformly samples balanced datasets, and proposes tailored batch sampling strategies. Result: The improved sampling strategy yields better performance on a range of clinically relevant downstream tasks. Conclusion: Unsupervised tile-level data curation can effectively improve the performance of vision foundation models. Abstract: Vision foundation models (FMs) are accelerating the development of digital pathology algorithms and transforming biomedical research. These models learn, in a self-supervised manner, to represent histological features in highly heterogeneous tiles extracted from whole-slide images (WSIs) of real-world patient samples. The performance of these FMs is significantly influenced by the size, diversity, and balance of the pre-training data. However, data selection has been primarily guided by expert knowledge at the WSI level, focusing on factors such as disease classification and tissue types, while largely overlooking the granular details available at the tile level. In this paper, we investigate the potential of unsupervised automatic data curation at the tile level, taking into account 350 million tiles. Specifically, we apply hierarchical clustering trees to pre-extracted tile embeddings, allowing us to sample balanced datasets uniformly across the embedding space of the pretrained FM. We further identify that these datasets are subject to a trade-off between size and balance, potentially compromising the quality of representations learned by FMs, and propose tailored batch sampling strategies to mitigate this effect. We demonstrate the effectiveness of our method through improved performance on a diverse range of clinically relevant downstream tasks.

Accenture-NVS1: A Novel View Synthesis Dataset

Thomas Sugg,Kyle O'Brien,Lekh Poudel,Alex Dumouchelle,Michelle Jou,Marc Bosch,Deva Ramanan,Srinivasa Narasimhan,Shubham Tulsiani

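Task: Introduce the ACC-NVS1 dataset for research on novel view synthesis from airborne and ground imagery.

Motivation: Supplement existing datasets and provide additional resources for comprehensive research, rather than serve as a benchmark.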

Details

Method: Collects airborne and ground images of six diverse real-world scenes in Austin and Pittsburgh, 148,000 images in total. Result: Addresses challenges such as varying altitudes and transient objects. Conclusion: ACC-NVS1 provides a rich supplementary resource for novel view synthesis research. Abstract: This paper introduces ACC-NVS1, a specialized dataset designed for research on Novel View Synthesis specifically for airborne and ground imagery. Data for ACC-NVS1 was collected in Austin, TX and Pittsburgh, PA in 2023 and 2024. The collection encompasses six diverse real-world scenes captured from both airborne and ground cameras, resulting in a total of 148,000 images. ACC-NVS1 addresses challenges such as varying altitudes and transient objects. This dataset is intended to supplement existing datasets, providing additional resources for comprehensive research, rather than serving as a benchmark.

Shaokai Ye,Haozhe Qi,Alexander Mathis,Mackenzie W. Mathis

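Task: Evaluate and improve multi-modal large language models (MLLMs) on action recognition.

Motivation: The complexity of human behavior calls for mapping it onto a semantically rich structure such as language, and MLLMs offer the potential to do so.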

Details

Method: Reformulates the EPIC-KITCHENS-100 dataset as video multiple question answering (EPIC-KITCHENS-100-MQA) and proposes a series of improvements. Result: The improved MLLMs reach state-of-the-art performance on the EPIC-KITCHENS-100 validation set and exceed GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Conclusion: MLLMs are promising for complex action tasks and perform well on several video benchmarks. Abstract: Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. The recent development of multi-modal large language models (MLLMs) is a promising candidate for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, to the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art results on the EPIC-KITCHENS-100 validation set and outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.

GS-Marker: Generalizable and Robust Watermarking for 3D Gaussian Splatting

Lijiang Li,Jinglu Wang,Xiang Ming,Yan Lu

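Task: Propose a single-pass watermark embedding method for 3D Gaussian Splatting (3DGS) models that addresses the limited generalization and robustness of existing methods.

Motivation: In the Generative AI era the need to protect 3D models is increasingly urgent, but existing methods generalize poorly and are inefficient because the renderer disrupts gradient flow and complicates training.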

Details

Method: Proposes the GS-Marker framework, which comprises a 3D encoder, distortion layers, and a 2D decoder, and introduces an Adaptive Marker Control mechanism to improve training. Result: Experiments show that GS-Marker outperforms existing methods in decoding accuracy and model fidelity while significantly reducing computation time. Conclusion: GS-Marker provides an efficient, generalizable and robust solution for watermarking 3D models. Abstract: In the Generative AI era, safeguarding 3D models has become increasingly urgent. While invisible watermarking is well-established for 2D images with encoder-decoder frameworks, generalizable and robust solutions for 3D remain elusive. The main difficulty arises from the renderer between the 3D encoder and 2D decoder, which disrupts direct gradient flow and complicates training. Existing 3D methods typically rely on per-scene iterative optimization, resulting in time inefficiency and limited generalization. In this work, we propose a single-pass watermarking approach for 3D Gaussian Splatting (3DGS), a well-known yet underexplored representation for watermarking. We identify two major challenges: (1) ensuring effective training generalized across diverse 3D models, and (2) reliably extracting watermarks from free-view renderings, even under distortions. Our framework, named GS-Marker, incorporates a 3D encoder to embed messages, distortion layers to enhance resilience against various distortions, and a 2D decoder to extract watermarks from renderings. A key innovation is the Adaptive Marker Control mechanism that adaptively perturbs the initially optimized 3DGS, escaping local minima and improving both training stability and convergence. Extensive experiments show that GS-Marker outperforms per-scene training approaches in terms of decoding accuracy and model fidelity, while also significantly reducing computation time.

Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings

Cong Liu,Liang Hou,Mingwu Zheng,Xin Tao,Pengfei Wan,Di Zhang,Kun Gai

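Task: Propose a novel two-dimensional randomized positional encodings (RPE-2D) framework to address resolution generalization in Diffusion Transformers.

Motivation: Existing methods have not fully resolved the mismatch between positional encodings at test time and at training time, which degrades high-resolution image generation.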

Details

Method: RPE-2D learns the positional order of image patches rather than the specific distances between them, combined with random data augmentation and micro-conditioning. Result: On ImageNet, RPE-2D achieves state-of-the-art resolution generalization across multiple resolutions and supports low-resolution image generation and multi-stage training acceleration. Conclusion: The RPE-2D framework effectively improves resolution generalization and performs well across multiple tasks. Abstract: Resolution generalization in image generation tasks enables the production of higher-resolution images with lower training resolution overhead. However, a significant challenge in resolution generalization, particularly in the widely used Diffusion Transformers, lies in the mismatch between the positional encodings encountered during testing and those used during training. While existing methods have employed techniques such as interpolation, extrapolation, or their combinations, none have fully resolved this issue. In this paper, we propose a novel two-dimensional randomized positional encodings (RPE-2D) framework that focuses on learning the positional order of image patches instead of the specific distances between them, enabling seamless high- and low-resolution image generation without requiring high- and low-resolution image training. Specifically, RPE-2D independently selects positions over a broader range along both the horizontal and vertical axes, ensuring that all position encodings encountered during the inference phase have been trained, thus improving resolution generalization. Additionally, we propose a random data augmentation technique to enhance the modeling of position order. To address the issue of image cropping caused by the augmentation, we introduce corresponding micro-conditioning to enable the model to perceive the specific cropping patterns. On the ImageNet dataset, our proposed RPE-2D achieves state-of-the-art resolution generalization performance, outperforming existing competitive methods when trained at a resolution of $256 \times 256$ and inferred at $384 \times 384$ and $512 \times 512$, as well as when scaling from $512 \times 512$ to $768 \times 768$ and $1024 \times 1024$. It also exhibits outstanding capabilities in low-resolution image generation, multi-stage training acceleration, and multi-resolution inheritance.
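The position-selection idea in the abstract above can be illustrated with a minimal sketch. It assumes that "selecting positions over a broader range" means drawing distinct indices from a range larger than the patch grid and sorting them so that order is preserved; the function names, the `max_len` parameter, and the uniform sampling scheme are hypothetical and are not taken from the paper.

```python
import random


def sample_axis_positions(n, max_len, rng):
    """Draw n distinct indices from range(max_len) and sort them.

    Sorting keeps the relative order of patches while the gaps vary.
    """
    if n > max_len:
        raise ValueError("n must not exceed max_len")
    return sorted(rng.sample(range(max_len), n))


def sample_rpe2d_grid(height, width, max_len, seed=None):
    """Build a height x width grid of (row_pos, col_pos) index pairs.

    Rows and columns are sampled independently (an assumption based on
    the abstract's description of the two axes).
    """
    rng = random.Random(seed)
    rows = sample_axis_positions(height, max_len, rng)
    cols = sample_axis_positions(width, max_len, rng)
    return [[(r, c) for c in cols] for r in rows]


if __name__ == "__main__":
    grid = sample_rpe2d_grid(height=4, width=4, max_len=64, seed=0)
    for row in grid:
        print(row)
```

Because the indices are sorted, a patch to the left of another always receives a smaller column index, while the spacing between indices changes from sample to sample. Over many training samples this exposes the model to indices across the whole range, which is the property the abstract relies on for larger inference grids.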

FG$^2$: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching

Zimin Xia,Alexandre Alahi

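Task: Propose a novel fine-grained cross-view localization method that estimates the 3-degrees-of-freedom pose of a ground-level image within an aerial image by matching fine-grained features between the two.

Motivation: Estimate the pose of a ground-level image relative to an aerial image, using fine-grained feature matching to improve localization accuracy.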

Details

Method: Map ground image features to a 3D point cloud, select features along the height dimension to pool them onto a Bird's-Eye-View (BEV) plane, then compute the relative pose through sparse matching and Procrustes alignment. Result: Mean localization error on the VIGOR cross-area test set is reduced by 28%. Conclusion: Through weakly supervised learning, the method learns semantically consistent matches between ground and aerial views and substantially improves localization accuracy. Abstract: We propose a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image in an aerial image of the surroundings by matching fine-grained features between the two images. The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image. To generate the ground points, we first map ground image features to a 3D point cloud. Our method then learns to select features along the height dimension to pool the 3D points to a Bird's-Eye-View (BEV) plane. This selection enables us to trace which feature in the ground image contributes to the BEV representation. Next, we sample a set of sparse matches from computed point correspondences between the two point planes and compute their relative pose using Procrustes alignment. Compared to the previous state-of-the-art, our method reduces the mean localization error by 28% on the VIGOR cross-area test set. Qualitative results show that our method learns semantically consistent matches across ground and aerial views through weakly supervised learning from the camera pose.
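The final step in the abstract above, recovering a relative pose from matched points with Procrustes alignment, can be sketched for the planar case. This is a generic, unweighted closed-form solution for a 2D rotation and translation (3 degrees of freedom); the function name is hypothetical, and the paper's own formulation (for example, any match weighting) is not reproduced here.

```python
import math


def procrustes_2d(src, dst):
    """Least-squares rigid alignment of matched 2D points.

    Returns (theta, tx, ty) such that rotating each src point by theta
    and translating by (tx, ty) best matches the corresponding dst point.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched point pairs")
    n = len(src)
    cxs = sum(p[0] for p in src) / n
    cys = sum(p[1] for p in src) / n
    cxd = sum(p[0] for p in dst) / n
    cyd = sum(p[1] for p in dst) / n

    dot = 0.0    # sum of a . b over centered pairs
    cross = 0.0  # sum of a x b over centered pairs
    for (xs, ys), (xd, yd) in zip(src, dst):
        ax, ay = xs - cxs, ys - cys
        bx, by = xd - cxd, yd - cyd
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx

    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    tx = cxd - (c * cxs - s * cys)
    ty = cyd - (s * cxs + c * cys)
    return theta, tx, ty


if __name__ == "__main__":
    ground = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
    angle, shift = 0.3, (2.0, -1.0)
    aerial = [
        (math.cos(angle) * x - math.sin(angle) * y + shift[0],
         math.sin(angle) * x + math.cos(angle) * y + shift[1])
        for x, y in ground
    ]
    print(procrustes_2d(ground, aerial))
```

The closed form follows from maximizing the correlation between the centered, rotated source points and the centered targets, which gives the rotation angle as the arctangent of the summed cross products over the summed dot products.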

SFDLA: Source-Free Document Layout Analysis

Sebastian Tewes,Yufan Chen,Omar Moured,Jiaming Zhang,Rainer Stiefelhagen

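Task: Propose Source-Free Document Layout Analysis (SFDLA), which adapts a pre-trained source model to an unlabeled target domain without access to source data.

Motivation: Existing document layout analysis (DLA) methods require large-scale source data and target labels, limiting their use in privacy-sensitive and resource-constrained domains.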

Details

Method: Propose the Document Layout Analysis Adapter (DLAdapter) framework for source-free adaptation across document domains. Result: From PubLayNet to DocLayNet, the method improves on the source-only baseline by 4.21% and on existing source-free methods by 2.26%. Conclusion: The work gives the DLA community a benchmark and tools for source-free document understanding to support future research. Abstract: Document Layout Analysis (DLA) is a fundamental task in document understanding. However, existing DLA and adaptation methods often require access to large-scale source data and target labels. These requirements severely limit their real-world applicability, particularly in privacy-sensitive and resource-constrained domains, such as financial statements, medical records, and proprietary business documents. According to our observation, directly transferring source-domain fine-tuned models to target domains often results in a significant performance drop (Avg. -32.64%). In this work, we introduce Source-Free Document Layout Analysis (SFDLA), which aims to adapt a pre-trained source DLA model to an unlabeled target domain, without access to any source data. To address this challenge, we establish the first SFDLA benchmark, covering three major DLA datasets for geometric- and content-aware adaptation. Furthermore, we propose Document Layout Analysis Adapter (DLAdapter), a novel framework that is designed to improve source-free adaptation across document domains. Our method achieves a +4.21% improvement over the source-only baseline and a +2.26% gain over existing source-free methods from PubLayNet to DocLayNet. We believe this work will inspire the DLA community to further investigate source-free document understanding. To support future research of the community, the benchmark, models, and code will be publicly available at https://github.com/s3setewe/sfdla-DLAdapter.

Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition

Yifei Zhang,Chang Liu,Jin Wei,Xiaomeng Yang,Yu Zhou,Can Ma,Xiangyang Ji

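Task: Propose a Linguistics-aware Masked Image Modeling (LMIM) approach that captures both visual and linguistic information for scene text recognition (STR).

Motivation: Current STR methods typically need large-scale annotated datasets to capture linguistic features, while self-supervised learning, lacking annotations, struggles to disentangle linguistic features tied to the global context.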

Details

Method: Design a linguistics alignment module that channels linguistic information into the MIM decoding process through a separate branch, extracting vision-independent features as linguistic guidance. Result: Achieves state-of-the-art performance on multiple benchmarks; attention visualizations show that visual and linguistic information are captured simultaneously. Conclusion: By integrating visual and linguistic information, LMIM significantly improves the robustness of scene text recognition. Abstract: Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.

Self-Supervised Learning based on Transformed Image Reconstruction for Equivariance-Coherent Feature Representation

Qin Wang,Benjamin Bruns,Hanno Scharr,Kai Krajsek

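Task: Propose a self-supervised learning method that learns transformations independently by reconstructing images that have undergone unknown transformations, thereby strengthening feature equivariance.

Motivation: Existing self-supervised learning methods constrain feature equivariance by design, yet equivariance is essential in many computer vision tasks.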

Details

Method: Generate image pairs and split the extracted features into two sets, one trained with an invariance loss and the other with an auxiliary-task loss (reconstructing intermediate transformed images); the two losses are linearly combined with weights. Result: Strongly outperforms other methods on synthetic tasks with natural images, and can be combined with augmentation-based methods (such as iBOT or DINOv2) to learn a balance of invariant and equivariant features. Conclusion: The method performs strongly across a range of realistic downstream computer vision tasks, improving on the baselines on almost all benchmarks. Abstract: The equivariant behaviour of features is essential in many computer vision tasks, yet popular self-supervised learning (SSL) methods tend to constrain equivariance by design. We propose a self-supervised learning approach where the system learns transformations independently by reconstructing images that have undergone previously unseen transformations. Specifically, the model is tasked to reconstruct intermediate transformed images, e.g. translated or rotated images, without prior knowledge of these transformations. This auxiliary task encourages the model to develop equivariance-coherent features without relying on predefined transformation rules. To this end, we apply transformations to the input image, generating an image pair, and then split the extracted features into two sets per image. One set is used with a usual SSL loss encouraging invariance, the other with our loss based on the auxiliary task to reconstruct the intermediate transformed images. Our loss and the SSL loss are linearly combined with weighted terms. Evaluating on synthetic tasks with natural images, our proposed method strongly outperforms all competitors, regardless of whether they are designed to learn equivariance. Furthermore, when trained alongside augmentation-based methods as the invariance tasks, such as iBOT or DINOv2, we successfully learn a balanced combination of invariant and equivariant features. Our approach performs strongly on a rich set of realistic computer vision downstream tasks, almost always improving over all baselines.

EgoSurgery-HTS: A Dataset for Egocentric Hand-Tool Segmentation in Open Surgery Videos

Nathan Darjana,Ryo Fujii,Hideo Saito,Hiroki Kajita

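Task: Introduce and evaluate the EgoSurgery-HTS dataset for segmenting surgical tools, hands, and hand-tool interactions.

Motivation: A pixel-level understanding of hands and surgical tools allows more accurate modeling of surgical procedures and human behavior in the operating room.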

Details

Method: Provide EgoSurgery-HTS, a dataset with pixel-wise annotations, and evaluate existing segmentation methods on it. Result: Significantly improves the accuracy of hand and hand-tool segmentation compared with existing datasets. Conclusion: EgoSurgery-HTS provides finer-grained data for surgical video analysis and supports further research in the area. Abstract: Egocentric open-surgery videos capture rich, fine-grained details essential for accurately modeling surgical procedures and human behavior in the operating room. A detailed, pixel-level understanding of hands and surgical tools is crucial for interpreting a surgeon's actions and intentions. We introduce EgoSurgery-HTS, a new dataset with pixel-wise annotations and a benchmark suite for segmenting surgical tools, hands, and interacting tools in egocentric open-surgery videos. Specifically, we provide a labeled dataset for (1) tool instance segmentation of 14 distinct surgical tools, (2) hand instance segmentation, and (3) hand-tool segmentation to label hands and the tools they manipulate. Using EgoSurgery-HTS, we conduct extensive evaluations of state-of-the-art segmentation methods and demonstrate significant improvements in the accuracy of hand and hand-tool segmentation in egocentric open-surgery videos compared to existing datasets. The dataset will be released at https://github.com/Fujiry0/EgoSurgery.

Good Keypoints for the Two-View Geometry Estimation Problem

Konstantin Pakulev,Alexander Vakhitov,Gonzalo Ferrer

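Task: Propose a new theoretical model for scoring feature points (keypoints) and design a new keypoint detector, BoNeSS-ST.

Motivation: Study which properties of local features contribute to downstream performance, in order to improve the design of feature detectors and descriptors.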

Details

Method: Develop a theoretical model showing that a good keypoint should be repeatable and have a small expected measurement error, and design the BoNeSS-ST keypoint detector on that basis. Result: BoNeSS-ST outperforms prior self-supervised local feature detectors on both planar homography and epipolar geometry estimation. Conclusion: The theoretical model and the BoNeSS-ST design show that keypoint repeatability and small measurement error matter for the accuracy of geometry estimation. Abstract: Local features are essential to many modern downstream applications. Therefore, it is of interest to determine the properties of local features that contribute to the downstream performance for a better design of feature detectors and descriptors. In our work, we propose a new theoretical model for scoring feature points (keypoints) in the context of the two-view geometry estimation problem. The model determines two properties that a good keypoint for solving the homography estimation problem should have: be repeatable and have a small expected measurement error. This result provides key insights into why maximizing the number of correspondences doesn't always lead to better homography estimation accuracy. We use the developed model to design a method that detects keypoints that benefit homography estimation, introducing the Bounded NeSS-ST (BoNeSS-ST) keypoint detector. The novelty of BoNeSS-ST comes from strong theoretical foundations, a more accurate keypoint scoring due to subpixel refinement, and a cost designed for superior robustness to low saliency keypoints. As a result, BoNeSS-ST outperforms prior self-supervised local feature detectors in both planar homography and epipolar geometry estimation problems.

Frequency Dynamic Convolution for Dense Image Prediction

Linwei Chen,Lin Gu,Liang Li,Chenggang Yan,Ying Fu

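Task: Propose FDConv, a method that learns a fixed parameter budget in the Fourier domain to address the high similarity in the frequency responses of dynamic-convolution weights.

Motivation: Dynamic convolution achieves adaptive weight selection through parallel weights and an attention mechanism, but the frequency responses of those weights are highly similar, giving high parameter cost with limited adaptability.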

Details

Method: FDConv divides a fixed parameter budget into frequency-based groups and enhances adaptability with Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). Result: Experiments on object detection, segmentation, and classification validate FDConv; applied to ResNet-50, it outperforms previous methods with only a modest parameter increase. Conclusion: FDConv is a flexible, efficient solution that integrates seamlessly into a variety of architectures for modern vision tasks. Abstract: While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency response of these weights tends to exhibit high similarity, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. We demonstrate that when applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantial increases in parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv seamlessly integrates into a variety of architectures, including ConvNeXt and Swin-Transformer, offering a flexible and efficient solution for modern vision tasks. The code is made publicly available at https://github.com/Linwei-Chen/FDConv.

Leveraging Perturbation Robustness to Enhance Out-of-Distribution Detection

Wenxi Chen,Raymond A. Yeh,Shaoshuai Mou,Yan Gu

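Task: Propose a post-hoc method (PRO) for identifying inputs that deviate from the training data distribution (OOD detection).

Motivation: Safely deploying deep computer vision models in open-world environments requires effective OOD detection.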

Details

Method: Based on the observation that prediction confidence for OOD inputs drops more readily under perturbation, propose an adversarial score function that uses gradient descent to search for the local minimum score near the original input. Result: On the OpenOOD benchmark, PRO improves OOD detection and is the leading post-hoc method for small-scale models; on an adversarially trained CIFAR-10 model, it reduces FPR@95 by more than 10%. Conclusion: PRO is a post-hoc method that improves OOD detection without complex modifications to the model. Abstract: Out-of-distribution (OOD) detection is the task of identifying inputs that deviate from the training data distribution. This capability is essential for safely deploying deep computer vision models in open-world environments. In this work, we propose a post-hoc method, Perturbation-Rectified OOD detection (PRO), based on the insight that prediction confidence for OOD inputs is more susceptible to reduction under perturbation than that for in-distribution (IND) inputs. Based on this observation, we propose an adversarial score function that searches for the local minimum scores near the original inputs by applying gradient descent. This procedure enhances the separability between IND and OOD samples. Importantly, the approach improves OOD detection performance without complex modifications to the underlying model architectures. We conduct extensive experiments using the OpenOOD benchmark. Our approach further pushes the limit of softmax-based OOD detection and is the leading post-hoc method for small-scale models. On a CIFAR-10 model with adversarial training, PRO effectively detects near-OOD inputs, achieving a reduction of more than 10% on FPR@95 compared to state-of-the-art methods.
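The score search described in the abstract above can be sketched on a toy model. The sketch assumes a linear softmax classifier, uses the maximum softmax probability as the confidence score, and takes the smallest score seen along a short gradient-descent path as the rectified score. The classifier, step size, step count, and function names are all hypothetical choices for illustration, not the paper's settings.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_of(weights, x):
    """Logits of a bias-free linear classifier; weights has one row per class."""
    return [sum(w_d * x_d for w_d, x_d in zip(w, x)) for w in weights]


def max_softmax(weights, x):
    """Confidence score: the largest softmax probability."""
    return max(softmax(logits_of(weights, x)))


def perturbed_score(weights, x, step=0.05, n_steps=5):
    """Follow the negative gradient of the confidence score for a few
    steps and return the smallest score seen, including the start."""
    best = max_softmax(weights, x)
    x = list(x)
    dims = range(len(x))
    for _ in range(n_steps):
        probs = softmax(logits_of(weights, x))
        k = max(range(len(probs)), key=probs.__getitem__)
        # d p_k / d x = p_k * (w_k - sum_j p_j * w_j)
        mean_w = [sum(p * w[d] for p, w in zip(probs, weights)) for d in dims]
        grad = [probs[k] * (weights[k][d] - mean_w[d]) for d in dims]
        x = [x_d - step * g_d for x_d, g_d in zip(x, grad)]
        best = min(best, max_softmax(weights, x))
    return best


if __name__ == "__main__":
    W = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
    sample = [1.5, 0.2]
    print(max_softmax(W, sample), perturbed_score(W, sample))
```

By construction the returned score never exceeds the unperturbed confidence. The abstract's premise is that this drop tends to be larger for OOD inputs than for in-distribution ones, which is what would widen the gap between the two groups.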

LGI-DETR: Local-Global Interaction for UAV Object Detection

Zifa Chen

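Task: Design a local-global information interaction DETR (LGI-DETR) for UAV images, addressing the poor performance of existing end-to-end object detectors on such images.

Motivation: Most existing UAV object detectors are not end-to-end, and existing end-to-end detectors are designed mainly for natural scenes, so applying them directly to UAV images gives unsatisfactory results.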

Details

Method: Propose LGI-DETR, which contains a local spatial enhancement module (LSE) and a global information injection module (GII) and realizes local-global interaction through cross-layer bidirectional feature enhancement. Result: On VisDrone2019 and UAVDT, LGI-DETR outperforms the SOTA model; compared with the baseline, AP and AP50 improve by 1.9% and 2.4%, respectively. Conclusion: The local-global interaction mechanism of LGI-DETR improves small-object detection in UAV images. Abstract: UAVs have been widely used in various fields. However, most of the existing object detectors used in drones are not end-to-end and require the design of various complex components and careful fine-tuning. Most of the existing end-to-end object detectors are designed for natural scenes. It is not ideal to apply them directly to UAV images. In order to solve the above challenges, we design a local-global information interaction DETR for UAVs, namely LGI-DETR. It performs cross-layer bidirectional enhancement of low-level and high-level feature information, a fusion method that is especially effective for small object detection. At the initial stage of the encoder, we propose a local spatial enhancement module (LSE), which enhances the high-level features with rich low-level local spatial information and reduces the loss of local information during the transmission of high-level information. At the final stage of the encoder, we propose a novel global information injection module (GII) designed to integrate rich high-level global semantic representations with low-level feature maps. This hierarchical fusion mechanism effectively addresses the inherent limitations of local receptive fields by propagating contextual information across the feature hierarchy. Experimental results on two challenging UAV image object detection benchmarks, VisDrone2019 and UAVDT, show that our proposed model outperforms the SOTA model. Compared to the baseline model, AP and AP50 improved by 1.9% and 2.4%, respectively.

NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting

Yulong Zheng,Zicheng Jiang,Shengfeng He,Yandu Sun,Junyu Dong,Huaidong Zhang,Yong Du

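Task: Propose NexusGS, a method based on 3D Gaussian Splatting (3DGS) that enhances novel view synthesis from sparse-view images.

Motivation: Existing methods such as NeRF and 3DGS perform poorly in few-shot scenarios because of limited supervision; NexusGS addresses this by directly embedding depth information.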

Details

Method: NexusGS comprises three key steps (Epipolar Depth Nexus, Flow-Resilient Depth Blending, and Flow-Filtered Depth Pruning), which use optical flow and camera poses to compute depth maps. Result: Experiments show that NexusGS significantly improves depth accuracy and rendering quality, surpassing existing methods. Conclusion: By introducing depth priors and a point cloud densification strategy, NexusGS improves novel view synthesis under sparse views. Abstract: Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) have noticeably advanced photo-realistic novel view synthesis using images from densely spaced camera viewpoints. However, these methods struggle in few-shot scenarios due to limited supervision. In this paper, we present NexusGS, a 3DGS-based approach that enhances novel view synthesis from sparse-view images by directly embedding depth information into point clouds, without relying on complex manual regularizations. Exploiting the inherent epipolar geometry of 3DGS, our method introduces a novel point cloud densification strategy that initializes 3DGS with a dense point cloud, reducing randomness in point placement while preventing over-smoothing and overfitting. Specifically, NexusGS comprises three key steps: Epipolar Depth Nexus, Flow-Resilient Depth Blending, and Flow-Filtered Depth Pruning. These steps leverage optical flow and camera poses to compute accurate depth maps, while mitigating the inaccuracies often associated with optical flow. By incorporating epipolar depth priors, NexusGS ensures reliable dense point cloud coverage and supports stable 3DGS training under sparse-view conditions. Experiments demonstrate that NexusGS significantly enhances depth accuracy and rendering quality, surpassing state-of-the-art methods by a considerable margin. Furthermore, we validate the superiority of our generated point clouds by substantially boosting the performance of competing methods. Project page: https://usmizuki.github.io/NexusGS/.

Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective

Duowang Zhu,Xiaohu Huang,Haiyan Huang,Hao Zhou,Zhenfeng Shao

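Task: Reframe the change detection and captioning tasks through video modeling.

Motivation: Existing methods treat bi-temporal images as separate frames, but image feature encoding cannot attend effectively to changed regions, and different tasks require differently designed change extractors, so a unified framework is lacking.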

Details

Method: Treat the bi-temporal images as video frames and insert learnable perception frames that interact with them directly and perceive their differences, avoiding intricate change extractors. Result: Performs strongly across multiple tasks and benchmarks with only about 6%-13% of the parameters and 8%-34% of the FLOPs of existing methods. Conclusion: Change3D can serve as an alternative to 2D-based models and facilitate future research. Abstract: In this paper, we present Change3D, a framework that reconceptualizes the change detection and captioning tasks through video modeling. Recent methods have achieved remarkable success by regarding each pair of bi-temporal images as separate frames. They employ a shared-weight image encoder to extract spatial features and then use a change extractor to capture differences between the two images. However, image feature encoding, being a task-agnostic process, cannot attend to changed regions effectively. Furthermore, different change extractors designed for various change detection and captioning tasks make it difficult to have a unified framework. To tackle these challenges, Change3D regards the bi-temporal images as comprising two frames akin to a tiny video. By integrating learnable perception frames between the bi-temporal images, a video encoder enables the perception frames to interact with the images directly and perceive their differences. Therefore, we can get rid of the intricate change extractors, providing a unified framework for different change detection and captioning tasks. We verify Change3D on multiple tasks, encompassing change detection (including binary change detection, semantic change detection, and building damage assessment) and change captioning, across eight standard benchmarks. Without bells and whistles, this simple yet effective framework can achieve superior performance with an ultra-light video model comprising only ~6%-13% of the parameters and ~8%-34% of the FLOPs compared to state-of-the-art methods. We hope that Change3D could be an alternative to 2D-based models and facilitate future research.

CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos

Yang Liu,Hongjin Wang,Zepu Wang,Xiaoguang Zhu,Jing Liu,Peng Sun,Rui Tang,Jianwei Du,Victor C. M. Leung,Liang Song

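Task: Use Causal Representation Consistency Learning (CRCL) to mine latent scene-robust causal variables in unsupervised video anomaly detection.

Motivation: Existing unsupervised video anomaly detection methods cannot handle label-independent data shifts (e.g., scene changes) in real-world scenarios and may fail to respond to slight anomalies because deep neural networks overgeneralize.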

Details

Method: Propose CRCL, combining scene-debiasing learning and causality-inspired normality learning to strip scene bias from deep representations and learn causal video normality. Result: Benchmark experiments show that CRCL outperforms conventional deep representation learning, copes with label-independent biases in multi-scene settings, and maintains stable performance with limited training data. Conclusion: Through causal learning, CRCL improves the robustness and generalization of video anomaly detection. Abstract: Video Anomaly Detection (VAD) remains a fundamental yet formidable task in the video understanding community, with promising applications in areas such as information forensics and public safety protection. Due to the rarity and diversity of anomalies, existing methods only use easily collected regular events to model the inherent normality of normal spatial-temporal patterns in an unsupervised manner. Previous studies have shown that existing unsupervised VAD models are incapable of handling label-independent data offsets (e.g., scene changes) in real-world scenarios and may fail to respond to light anomalies due to the overgeneralization of deep neural networks. Inspired by causality learning, we argue that there exist causal factors that can adequately generalize the prototypical patterns of regular events and present significant deviations when anomalous instances occur. In this regard, we propose Causal Representation Consistency Learning (CRCL) to implicitly mine potential scene-robust causal variables in unsupervised video normality learning. Specifically, building on the structural causal models, we propose scene-debiasing learning and causality-inspired normality learning to strip away entangled scene bias in deep representations and learn causal video normality, respectively. Extensive experiments on benchmarks validate the superiority of our method over conventional deep representation learning. Moreover, ablation studies and extension validation show that the CRCL can cope with label-independent biases in multi-scene settings and maintain stable performance with only limited training data available.

SKDU at De-Factify 4.0: Vision Transformer with Data Augmentation for AI-Generated Image Detection

Shrikant Malviya,Neelanjan Bhowmik,Stamos Katsigiannis

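Task: Explore the potential of pre-trained vision-language models (e.g., ViT) combined with advanced data augmentation strategies for detecting AI-generated images.

Motivation: Use pre-trained models and data augmentation to improve the robustness and generalization of AI-generated image detection.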

Details

Method: Fine-tune a ViT model on the Defactify-4.0 dataset, applying perturbations such as flipping, rotation, Gaussian noise injection, and JPEG compression. Result: The ViT-based pipeline significantly outperforms competing methods on the validation and test datasets, reaching state-of-the-art performance. Conclusion: A pre-trained ViT model combined with data augmentation performs very well at detecting AI-generated images. Abstract: The aim of this work is to explore the potential of pre-trained vision-language models, e.g. Vision Transformers (ViT), enhanced with advanced data augmentation strategies for the detection of AI-generated images. Our approach leverages a fine-tuned ViT model trained on the Defactify-4.0 dataset, which includes images generated by state-of-the-art models such as Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and MidJourney. We employ perturbation techniques like flipping, rotation, Gaussian noise injection, and JPEG compression during training to improve model robustness and generalisation. The experimental results demonstrate that our ViT-based pipeline achieves state-of-the-art performance, significantly outperforming competing methods on both validation and test datasets.

Jeonghyeon Kim,Sangheum Hwang

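Task: Study how multi-modal fine-tuning (MMFT) improves out-of-distribution detection (OoDD) performance.

Motivation: Existing methods usually freeze the pretrained weights or tune them only partially, so they fail to exploit multi-modal representations fully, which limits OoDD performance.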

Details

Method: Propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings. Result: On ImageNet-1k OoD benchmark datasets, combined with post-hoc approaches (e.g., NegLabel), the method significantly outperforms existing methods and achieves state-of-the-art performance. Conclusion: Multi-modal fine-tuning with cross-modal alignment regularization improves OoDD performance while maintaining high ID accuracy. Abstract: Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. We investigate the limitation of naïve fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.

DAGait: Generalized Skeleton-Guided Data Alignment for Gait Recognition

Zhengxian Wu,Chuanrui Zhang,Hangrui Xu,Peng Jiao,Haoqian Wang

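Task: Propose a skeleton-guided silhouette alignment strategy to improve gait recognition accuracy in complex environments.

Motivation: Existing gait recognition methods perform well on laboratory datasets but degrade significantly in real-world scenes, mainly because of spatio-temporal distribution inconsistencies.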

Details

Method: Use prior knowledge of the skeletons to apply affine transformations to the silhouettes, aligning the data. Result: Average performance improves by 7.9% on the Gait3D dataset, and by up to 24.0% on cross-domain datasets. Conclusion: The alignment strategy significantly improves gait recognition in complex environments and provides empirical support for the importance of data alignment in gait recognition. Abstract: Gait recognition is emerging as a promising and innovative area within the field of computer vision, widely applied to remote person identification. Although existing gait recognition methods have achieved substantial success in controlled laboratory datasets, their performance often declines significantly when transitioning to wild datasets. We argue that the performance gap can be primarily attributed to the spatio-temporal distribution inconsistencies present in wild datasets, where subjects appear at varying angles, positions, and distances across the frames. To achieve accurate gait recognition in the wild, we propose a skeleton-guided silhouette alignment strategy, which uses prior knowledge of the skeletons to perform affine transformations on the corresponding silhouettes. To the best of our knowledge, this is the first study to explore the impact of data alignment on gait recognition. We conducted extensive experiments across multiple datasets and network architectures, and the results demonstrate the significant advantages of our proposed alignment strategy. Specifically, on the challenging Gait3D dataset, our method achieved an average performance improvement of 7.9% across all evaluated networks. Furthermore, our method achieves substantial improvements on cross-domain datasets, with accuracy improvements of up to 24.0%.
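A skeleton-guided affine alignment of the kind described in the abstract above might look like the following sketch. It assumes two skeleton keypoints per frame (neck and mid-hip), rescales so the torso has a fixed length, and translates the mid-hip to a fixed canonical location. The choice of joints, the canonical values, and the function names are hypothetical; the abstract does not specify how the transformation is derived.

```python
import math


def skeleton_affine(neck, hip, target_hip=(32.0, 40.0), target_torso=20.0):
    """Build a 2x3 affine matrix (uniform scale plus translation).

    The matrix maps the mid-hip to target_hip and rescales the
    neck-to-hip distance to target_torso.
    """
    torso = math.hypot(neck[0] - hip[0], neck[1] - hip[1])
    if torso == 0.0:
        raise ValueError("neck and hip coincide; cannot estimate scale")
    s = target_torso / torso
    tx = target_hip[0] - s * hip[0]
    ty = target_hip[1] - s * hip[1]
    return [[s, 0.0, tx], [0.0, s, ty]]


def apply_affine(matrix, points):
    """Apply a 2x3 affine matrix to a list of (x, y) points."""
    (a, b, tx), (c, d, ty) = matrix
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


if __name__ == "__main__":
    neck, hip = (50.0, 30.0), (50.0, 70.0)
    silhouette_outline = [(40.0, 20.0), (60.0, 20.0), (60.0, 110.0), (40.0, 110.0)]
    m = skeleton_affine(neck, hip)
    print(apply_affine(m, silhouette_outline))
```

Applying the same matrix to every silhouette pixel (or passing it to an image-warping routine) places subjects at a consistent position and scale across frames, which is the kind of inconsistency the abstract attributes to in-the-wild data.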

3DSwapping: Texture Swapping For 3D Object From Single Reference Image

Xiao Cao,Beibei Lin,Bo Wang,Zhiyong Huang,Robby T. Tan

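Task: Propose 3DSwapping, a 3D texture swapping method for efficient and versatile customization of 3D object textures.

Motivation: Existing 2D editing methods require frame-by-frame manipulation, causing inconsistencies across views, while text-driven 3D editing methods struggle to preserve the texture characteristics of the reference image.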

Details

Method: 3DSwapping combines progressive generation, view-consistency gradient guidance, and prompt-tuning-based gradient guidance. Result: Qualitative and quantitative evaluations confirm that the method achieves high-fidelity texture transfer while preserving structural coherence across views. Conclusion: With its three novel strategies, 3DSwapping achieves more consistent and effective 3D texture swapping. Abstract: 3D texture swapping allows for the customization of 3D object textures, enabling efficient and versatile visual transformations in 3D editing. While no dedicated method exists, adapted 2D editing and text-driven 3D editing approaches can serve this purpose. However, 2D editing requires frame-by-frame manipulation, causing inconsistencies across views, while text-driven 3D editing struggles to preserve texture characteristics from reference images. To tackle these challenges, we introduce 3DSwapping, a 3D texture swapping method that integrates: 1) progressive generation, 2) view-consistency gradient guidance, and 3) prompt-tuned gradient guidance. To ensure view consistency, our progressive generation process starts by editing a single reference image and gradually propagates the edits to adjacent views. Our view-consistency gradient guidance further reinforces consistency by conditioning the generation model on feature differences between consistent and inconsistent outputs. To preserve texture characteristics, we introduce prompt-tuning-based gradient guidance, which learns a token that precisely captures the difference between the reference image and the 3D object. This token then guides the editing process, ensuring more consistent texture preservation across views. Overall, 3DSwapping integrates these novel strategies to achieve higher-fidelity texture transfer while preserving structural coherence across multiple viewpoints. Extensive qualitative and quantitative evaluations confirm that our three novel components enable convincing and effective 2D texture swapping for 3D objects. Code will be available upon acceptance.

MC-LLaVA: Multi-Concept Personalized Vision-Language Model

Ruichuan An,Sihan Yang,Ming Lu,Renrui Zhang,Kai Zeng,Yulin Luo,Jiajun Cao,Hao Liang,Ying Chen,Qi She,Shanghang Zhang,Wentao Zhang

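Task: Propose MC-LLaVA, a multi-concept personalization paradigm that strengthens vision-language models in multi-concept scenarios.

Motivation: Existing vision-language models focus mainly on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability.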

Details

Method: Adopt a multi-concept instruction tuning strategy combined with a personalized textual prompt and a personalized visual prompt, and contribute a high-quality multi-concept instruction tuning dataset. Result: MC-LLaVA achieves impressive multi-concept personalized responses, with enhanced recognition and grounding capabilities. Conclusion: MC-LLaVA paves the way for vision-language models to become better user-specific assistants. Abstract: Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA.

HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation

Zunnan Xu,Zhentao Yu,Zixiang Zhou,Jun Zhou,Xiaoyu Jin,Fa-Ting Hong,Xiaozhong Ji,Junwei Zhu,Chengfei Cai,Shiyu Tang,Qin Lin,Xiu Li,Qinglin Lu

Task: Propose HunyuanPortrait, a diffusion-based method that uses implicit representations for highly controllable and lifelike portrait animation.

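Motivation: Control facial expression and head pose in portrait animation from a single portrait image and video templates, addressing the shortcomings of existing methods in detail richness and temporal consistency.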

Details

Method: Uses pre-trained encoders to decouple portrait motion information from identity, encodes motion with an implicit representation that serves as the control signal, and injects that signal into the denoising UNet through adapter layers. Result: HunyuanPortrait outperforms existing methods in temporal consistency and controllability and effectively disentangles appearance and motion across different image styles. Conclusion: HunyuanPortrait is an efficient and general portrait animation method with broad application potential. Abstract: We introduce HunyuanPortrait, a diffusion-based condition control method that employs implicit representations for highly controllable and lifelike portrait animation. Given a single portrait image as an appearance reference and video clips as driving templates, HunyuanPortrait can animate the character in the reference image by the facial expression and head pose of the driving videos. In our framework, we utilize pre-trained encoders to achieve the decoupling of portrait motion information and identity in videos. To do so, implicit representation is adopted to encode motion information and is employed as control signals in the animation phase. By leveraging the power of Stable Video Diffusion as the main building block, we carefully design adapter layers to inject control signals into the denoising UNet through attention mechanisms. These bring spatial richness of details and temporal consistency. HunyuanPortrait also exhibits strong generalization performance, which can effectively disentangle appearance and motion under different image styles. Our framework outperforms existing methods, demonstrating superior temporal consistency and controllability. Our project is available at https://kkakkkka.github.io/HunyuanPortrait.

Exploring the Integration of Key-Value Attention Into Pure and Hybrid Transformers for Semantic Segmentation

DeShin Hwa,Tobias Holmes,Klaus Drechsler

Task: Evaluate the performance of KV Transformers on semantic segmentation of medical images.

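Motivation: Transformer architectures perform well in image processing, but their reliance on large datasets and heavy compute limits their use; KV Transformers show potential for reducing complexity and memory usage.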

Details

Method: Directly compares traditional Transformer and KV Transformer variants built on the same base architectures. Result: The KV variants notably reduce parameter count and compute operations, and most of them perform on par with their QKV implementations. Conclusion: KV Transformers offer practical advantages for medical image semantic segmentation, especially where local inference is required. Abstract: While CNNs were long considered state of the art for image processing, the introduction of Transformer architectures has challenged this position. While achieving excellent results in image classification and segmentation, Transformers remain inherently reliant on large training datasets and remain computationally expensive. A newly introduced Transformer derivative named KV Transformer shows promising results in synthetic, NLP, and image classification tasks, while reducing complexity and memory usage. This is especially conducive to use cases where local inference is required, such as medical screening applications. We endeavoured to further evaluate the merit of KV Transformers on semantic segmentation tasks, specifically in the domain of medical imaging. By directly comparing traditional and KV variants of the same base architectures, we provide further insight into the practical tradeoffs of reduced model complexity. We observe a notable reduction in parameter count and multiply-accumulate operations, while achieving similar performance from most of the KV variant models when directly compared to their QKV implementation.

Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation

Yanda Chen,Gongwei Chen,Miao Zhang,Weili Guan,Liqiang Nie

Task: Propose curriculum coarse-to-fine selection (CCFS), a new method for efficient high-IPC dataset distillation.

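Motivation: Current dataset distillation loses effectiveness in high-IPC settings, and combining distilled data with real data suffers from an incompatibility problem.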

Details

Method: Uses a curriculum selection framework with a coarse-to-fine strategy to dynamically select real data suited to the current synthetic dataset. Result: Improves over the state of the art by 6.6%, 5.8%, and 3.4% on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, and comes close to full-dataset training on Tiny-ImageNet. Conclusion: CCFS effectively addresses dataset distillation in high-IPC settings and significantly improves performance. Abstract: Dataset distillation (DD) excels in synthesizing a small number of images per class (IPC) but struggles to maintain its effectiveness in high-IPC settings. Recent works on dataset distillation demonstrate that combining distilled and real data can mitigate the effectiveness decay. However, our analysis of the combination paradigm reveals that the current one-shot and independent selection mechanism induces an incompatibility issue between distilled and real images. To address this issue, we introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. CCFS employs a curriculum selection framework for real data selection, where we leverage a coarse-to-fine strategy to select appropriate real data based on the current synthetic dataset in each curriculum. Extensive experiments validate CCFS, surpassing the state-of-the-art by +6.6% on CIFAR-10, +5.8% on CIFAR-100, and +3.4% on Tiny-ImageNet under high-IPC settings. Notably, CCFS achieves 60.2% test accuracy on ResNet-18 with a 20% compression ratio of Tiny-ImageNet, closely matching full-dataset training with only 0.3% degradation. Code: https://github.com/CYDaaa30/CCFS.

Efficient Self-Supervised Adaptation for Medical Image Analysis

Moein Sorkhei,Emir Konuk,Jingyu Guo,Christos Matsoukas,Kevin Smith

Task: Study how parameter-efficient fine-tuning techniques (such as LoRA) can improve self-supervised adaptation (SSA) in the medical domain.

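Motivation: Self-supervised adaptation is computationally prohibitive when transferring foundation models to medical domains; parameter-efficient fine-tuning has been validated for supervised adaptation, but its effectiveness for self-supervised adaptation remains unclear.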

Details

Method: Proposes the efficient self-supervised adaptation (ESSA) framework, which applies parameter-efficient fine-tuning techniques (such as APLA) to reduce computational cost and improve adaptation performance. Result: APLA outperforms full-parameter SSA and supervised fine-tuning across diverse medical tasks while reducing GPU memory by up to 40.1% and increasing training throughput by 25.2%, with inference efficiency maintained. Conclusion: Through parameter-efficient fine-tuning, the ESSA framework significantly improves the efficiency and performance of self-supervised adaptation and offers a practical solution for model transfer in the medical domain. Abstract: Self-supervised adaptation (SSA) improves foundation model transfer to medical domains but is computationally prohibitive. Although parameter-efficient fine-tuning methods such as LoRA have been explored for supervised adaptation, their effectiveness for SSA remains unknown. In this work, we introduce efficient self-supervised adaptation (ESSA), a framework that applies parameter-efficient fine-tuning techniques to SSA with the aim of reducing computational cost and improving adaptation performance. Among the methods tested, Attention Projection Layer Adaptation (APLA) sets a new state-of-the-art, consistently surpassing full-parameter SSA and supervised fine-tuning across diverse medical tasks, while reducing GPU memory by up to 40.1% and increasing training throughput by 25.2%, all while maintaining inference efficiency.

Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes

Hyeonggon Ryu,Seongyu Kim,Joon Son Chung,Arda Senocak

Task: Propose a unified model that can simultaneously localize speech and non-speech sounds in a visual scene.

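Motivation: Existing methods typically handle speech or non-speech sounds independently, or at best sequentially without mixing, and so cannot capture the complexity of mixed audio in the real world.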

Details

Method: Introduces a "mix-and-separate" framework whose audio-visual alignment objectives jointly learn correspondence and disentanglement, trained on mixed audio. Result: The model outperforms existing methods on simultaneous grounding of mixed audio sources and achieves comparable or better performance on standard segmentation and cross-modal retrieval tasks. Conclusion: The mix-and-separate approach effectively handles grounding of mixed audio sources and shows broad applicability. Abstract: We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing. This limitation prevents them from capturing the complexity of real-world audio sources that are often mixed. Our approach introduces a 'mix-and-separate' framework with audio-visual alignment objectives that jointly learn correspondence and disentanglement using mixed audio. Through these objectives, our model learns to produce distinct embeddings for each audio type, enabling effective disentanglement and grounding across mixed audio sources. Additionally, we created a new dataset to evaluate simultaneous grounding of mixed audio sources, demonstrating that our model outperforms prior methods. Our approach also achieves comparable or better performance in standard segmentation and cross-modal retrieval tasks, highlighting the benefits of our mix-and-separate approach.

Efficient and Accurate Scene Text Recognition with Cascaded-Transformers

Savas Ozkan,Andrea Maracani,Hyowon Kim,Sijun Cho,Eunchung Noh,Jeongwon Min,Jung Min Cho,Mete Ozay

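Task: Propose an efficient scene text recognition (STR) system that reduces compute and memory requirements through a cascaded-transformers structure.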

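Motivation: Vision transformers perform well on scene text recognition, but their high compute and memory demands limit deployment in resource-constrained applications.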

Details

Method: Uses a cascaded-transformers structure that progressively reduces the vision token size, eliminating redundant tokens to lower computational cost. Result: The system keeps accuracy comparable to state-of-the-art baselines (92.77 to 92.68) while almost halving computational complexity. Conclusion: The proposed method significantly improves efficiency and suits resource-constrained environments. Abstract: In recent years, vision transformers with a text decoder have demonstrated remarkable performance on Scene Text Recognition (STR) due to their ability to capture long-range dependencies and contextual relationships with high learning capacity. However, the computational and memory demands of these models are significant, limiting their deployment in resource-constrained applications. To address this challenge, we propose an efficient and accurate STR system. Specifically, we focus on improving the efficiency of encoder models by introducing a cascaded-transformers structure. This structure progressively reduces the vision token size during the encoding step, effectively eliminating redundant tokens and reducing computational cost. Our experimental results confirm that our STR system achieves comparable performance to state-of-the-art baselines while substantially decreasing computational requirements. In particular, for large models, the accuracy remains nearly the same (92.77 to 92.68), while computational complexity is almost halved with our structure.

CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models

Weichen Fan,Amber Yijia Zheng,Raymond A. Yeh,Ziwei Liu

Task: Study the effect of Classifier-Free Guidance (CFG) on flow matching models and propose an improved method, CFG-Zero*.

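Motivation: In the early stages of training, CFG is found to direct samples toward incorrect trajectories, so it needs to be refined to improve model performance.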

Details

Method: Proposes CFG-Zero*, which combines an optimized scale with zero-initialization of the first few ODE solver steps. Result: CFG-Zero* outperforms CFG on text-to-image and text-to-video generation tasks. Conclusion: CFG-Zero* guides flow matching models effectively and improves generation quality. Abstract: Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion/flow models to improve image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding Flow Matching models. (Code is available at github.com/WeichenFan/CFG-Zero-star)
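The two ingredients named in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the closed-form least-squares scale, the way the scale enters the guided velocity, the plain Euler integrator, and every function name here are assumptions chosen only to make the idea concrete.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def optimized_scale(v_cond: Vector, v_uncond: Vector, eps: float = 1e-8) -> float:
    # Assumption: the scalar is the least-squares s minimizing ||v_cond - s * v_uncond||^2.
    return dot(v_cond, v_uncond) / (dot(v_uncond, v_uncond) + eps)


def guided_velocity(v_cond: Vector, v_uncond: Vector, w: float,
                    step: int, zero_init_steps: int = 1) -> Vector:
    # Zero-init: return a zero velocity for the first few solver steps.
    if step < zero_init_steps:
        return [0.0] * len(v_cond)
    # Assumed guidance form: standard CFG with the unconditional branch rescaled by s.
    s = optimized_scale(v_cond, v_uncond)
    return [s * u + w * (c - s * u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x0: Vector,
                 velocity_fn: Callable[[Vector, float], Tuple[Vector, Vector]],
                 num_steps: int, w: float, zero_init_steps: int = 1) -> Vector:
    # velocity_fn is a hypothetical model wrapper returning (v_cond, v_uncond) at (x, t).
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v_c, v_u = velocity_fn(x, step * dt)
        v = guided_velocity(v_c, v_u, w, step, zero_init_steps)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With `zero_init_steps=0` and the scale fixed to 1, `guided_velocity` reduces to ordinary CFG, which makes the two changes easy to ablate separately.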

Online 3D Scene Reconstruction Using Neural Object Priors

Thomas Chabal,Shizhe Chen,Jean Ponce,Cordelia Schmid

Task: Reconstruct object-level scene representations online from RGB-D video sequences.

Motivation: Current object-aware neural implicit representations are limited in online reconstruction efficiency and shape completion.

Details

Method: Proposes a feature grid interpolation mechanism and an object library construction method, using shape priors to initialize object models in new videos. Result: On synthetic environments, real-world sequences, and laboratory videos, the method outperforms existing techniques in reconstruction accuracy and completeness. Conclusion: The proposed method significantly improves online object-level scene reconstruction. Abstract: This paper addresses the problem of reconstructing a scene online at the level of objects given an RGB-D video sequence. While current object-aware neural implicit representations hold promise, they are limited in online reconstruction efficiency and shape completion. Our main contributions to alleviate the above limitations are twofold. First, we propose a feature grid interpolation mechanism to continuously update grid-based object-centric neural implicit representations as new object parts are revealed. Second, we construct an object library with previously mapped objects in advance and leverage the corresponding shape priors to initialize geometric object models in new videos, subsequently completing them with novel views as well as synthesized past views to avoid losing original object details. Extensive experiments on synthetic environments from the Replica dataset, real-world ScanNet sequences and videos captured in our laboratory demonstrate that our approach outperforms state-of-the-art neural implicit models for this task in terms of reconstruction accuracy and completeness.

Building Blocks for Robust and Effective Semi-Supervised Real-World Object Detection

Moussa Kassem Sbeyti,Nadja Klein,Azarm Nowzad,Fikret Sivrikaya,Sahin Albayrak

Task: Study how to optimize the performance of semi-supervised object detection (SSOD) under real-world conditions.

Motivation: Address the key challenges SSOD faces in real applications, including class imbalance, label noise, and labeling errors.

Details

Method: Proposes four building blocks (RCC, RCF, GLC, PLS) that target data augmentation, batch sampling, label correction, and pseudo-label selection, respectively. Result: Experiments on autonomous driving datasets show performance gains of up to 6%. Conclusion: Data-centric building blocks enable robust and effective SSOD in complex real-world scenarios. Abstract: Semi-supervised object detection (SSOD) based on pseudo-labeling significantly reduces dependence on large labeled datasets by effectively leveraging both labeled and unlabeled data. However, real-world applications of SSOD often face critical challenges, including class imbalance, label noise, and labeling errors. We present an in-depth analysis of SSOD under real-world conditions, uncovering causes of suboptimal pseudo-labeling and key trade-offs between label quality and quantity. Based on our findings, we propose four building blocks that can be seamlessly integrated into an SSOD framework. Rare Class Collage (RCC): a data augmentation method that enhances the representation of rare classes by creating collages of rare objects. Rare Class Focus (RCF): a stratified batch sampling strategy that ensures a more balanced representation of all classes during training. Ground Truth Label Correction (GLC): a label refinement method that identifies and corrects false, missing, and noisy ground truth labels by leveraging the consistency of teacher model predictions. Pseudo-Label Selection (PLS): a selection method for removing low-quality pseudo-labeled images, guided by a novel metric estimating the missing detection rate while accounting for class rarity. We validate our methods through comprehensive experiments on autonomous driving datasets, resulting in an increase of up to 6% in SSOD performance. Overall, our investigation and novel, data-centric, and broadly applicable building blocks enable robust and effective SSOD in complex, real-world scenarios. Code is available at https://mos-ks.github.io/publications.

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models

Meng Cao,Pengfei Hu,Yingyao Wang,Jihao Gu,Haoran Tang,Haoze Zhao,Jiahua Dong,Wangbo Yu,Ge Zhang,Ian Reid,Xiaodan Liang

Task: Propose Video SimpleQA, the first comprehensive benchmark dedicated to evaluating the factuality of Large Video Language Models (LVLMs).

Motivation: Current LVLMs show potential for multi-modal understanding, but evaluating their factuality in video contexts remains an unsolved challenge.

Details

Method: Designs the Video SimpleQA benchmark around five key features: 1) external knowledge required; 2) fact-seeking questions; 3) definitive short-form answers; 4) verification against external sources; 5) temporal reasoning required. Result: An evaluation of 41 state-of-the-art LVLMs reveals notable factuality deficiencies; the best model reaches an F-score of only 54.4%. Conclusion: Video SimpleQA provides an effective tool for evaluating LVLM factuality and exposes the limitations of current models and directions for improvement. Abstract: Recent advancements in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in video contexts remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation of LVLMs. Our work is distinguished from existing video benchmarks by the following key features: 1) Knowledge required: demanding integration of external knowledge beyond the explicit narrative; 2) Fact-seeking question: targeting objective, undisputed events or relationships, avoiding subjective interpretation; 3) Definitive & short-form answer: Answers are crafted as unambiguous and definitively correct in a short format, enabling automated evaluation through LLM-as-a-judge frameworks with minimal scoring variance; 4) External-source verified: All annotations undergo rigorous validation against authoritative external references to ensure reliability; 5) Temporal reasoning required: The annotated question types encompass both static single-frame understanding and dynamic temporal reasoning, explicitly evaluating LVLMs' factuality under long-context dependencies. We extensively evaluate 41 state-of-the-art LVLMs and summarize key findings as follows: 1) Current LVLMs exhibit notable deficiencies in factual adherence, particularly for open-source models. The best-performing model Gemini-1.5-Pro achieves merely an F-score of 54.4%; 2) Test-time compute paradigms show insignificant performance gains, revealing fundamental constraints for enhancing factuality through post-hoc computation; 3) Retrieval-Augmented Generation demonstrates consistent improvements at the cost of additional inference time overhead, presenting a critical efficiency-performance trade-off.

CoMP: Continual Multimodal Pre-training for Vision Foundation Models

Yitong Chen,Lingchen Meng,Wujian Peng,Zuxuan Wu,Yu-Gang Jiang

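Task: Improve vision foundation models (VFMs) through continual multimodal pre-training so that they can process visual inputs of varying sizes and produce visual representations better aligned with language representations.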

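Motivation: The original pre-training of existing vision foundation models may not adapt them adequately to multimodal tasks, particularly with respect to vision-language alignment.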

Details

Method: Proposes the CoMP multimodal pre-training pipeline, comprising a Continual Rotary Position Embedding that supports native-resolution continual pre-training and an Alignment Loss between visual and textual features. Result: CoMP-SigLIP scores 66.7 on ChartQA and 75.9 on DocVQA while maintaining strong results on ImageNet-1K and ADE20K. Conclusion: CoMP significantly improves vision foundation models on multimodal understanding and other downstream tasks. Abstract: Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to support native resolution continual pre-training, and an Alignment Loss between visual and textual features through language prototypes to align multimodal representations. By three-stage training, our VFMs achieve remarkable improvements not only in multimodal understanding but also in other downstream tasks such as classification and segmentation. Remarkably, CoMP-SigLIP achieves scores of 66.7 on ChartQA and 75.9 on DocVQA with a 0.5B LLM, while maintaining an 87.4% accuracy on ImageNet-1K and a 49.5 mIoU on ADE20K under frozen chunk evaluation.

Enrico Pallotta,Sina Mokhtarzadeh Azar,Shuai Li,Olga Zatsarynna,Juergen Gall

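Task: Propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that enhances the richness and accuracy of future video frame prediction.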

Motivation: RGB frames alone cannot fully capture the complexity of the real world, so complementary data modalities need to be incorporated.

Details

Method: Builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to share information across modalities. Result: Performs strongly on standard datasets such as Cityscapes and BAIR, generalizes to other modalities on SYNTHIA and ERA5-Land, and achieves state-of-the-art performance. Conclusion: SyncVP performs well in multi-modal settings and remains robust when only one modality is present, giving it broad application potential. Abstract: Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality. We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications.

Training-free Diffusion Acceleration with Bottleneck Sampling

Ye Tian,Xin Xia,Yuxi Ren,Shanchuan Lin,Xing Wang,Xuefeng Xiao,Yunhai Tong,Ling Yang,Bin Cui

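Task: Propose Bottleneck Sampling, a training-free framework that exploits low-resolution priors to reduce the computational overhead of diffusion model inference.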

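Motivation: Diffusion models excel at visual content generation, but their high inference cost limits deployment. Existing acceleration methods often sacrifice output quality or require costly retraining.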

Details

Method: Follows a high-low-high denoising workflow, with high-resolution denoising in the initial and final stages and low-resolution operation in the intermediate stage, and refines the resolution transition points while adaptively shifting the denoising timesteps. Result: Inference is accelerated by up to 3× for image generation and 2.5× for video generation, with output quality comparable to standard full-resolution sampling. Conclusion: Bottleneck Sampling is an efficient, training-free acceleration method for diffusion model inference. Abstract: Diffusion models have demonstrated remarkable capabilities in visual content generation but remain challenging to deploy due to their high computational cost during inference. This computational burden primarily arises from the quadratic complexity of self-attention with respect to image or video resolution. While existing acceleration methods often compromise output quality or necessitate costly retraining, we observe that most diffusion models are pre-trained at lower resolutions, presenting an opportunity to exploit these low-resolution priors for more efficient inference without degrading performance. In this work, we introduce Bottleneck Sampling, a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. Bottleneck Sampling follows a high-low-high denoising workflow: it performs high-resolution denoising in the initial and final stages while operating at lower resolutions in intermediate steps. To mitigate aliasing and blurring artifacts, we further refine the resolution transition points and adaptively shift the denoising timesteps at each stage. We evaluate Bottleneck Sampling on both image and video generation tasks, where extensive experiments demonstrate that it accelerates inference by up to 3× for image generation and 2.5× for video generation, all while maintaining output quality comparable to the standard full-resolution sampling process across multiple evaluation metrics. Code is available at: https://github.com/tyfeld/Bottleneck-Sampling
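The high-low-high workflow can be pictured as a per-step resolution schedule. The sketch below is only an illustration under stated assumptions: the step counts, the resolutions, and the cost model (self-attention cost taken as proportional to the square of the pixel count) are hypothetical and are not taken from the paper, and the timestep shifting and transition refinements are not modeled.

```python
from typing import List, Tuple

Resolution = Tuple[int, int]  # (height, width)


def bottleneck_schedule(num_steps: int, head_steps: int, tail_steps: int,
                        high_res: Resolution, low_res: Resolution) -> List[Resolution]:
    # High resolution for the first and last stages, low resolution in between.
    if head_steps < 0 or tail_steps < 0 or head_steps + tail_steps > num_steps:
        raise ValueError("head_steps + tail_steps must fit within num_steps")
    middle = num_steps - head_steps - tail_steps
    return [high_res] * head_steps + [low_res] * middle + [high_res] * tail_steps


def relative_attention_cost(schedule: List[Resolution], high_res: Resolution) -> float:
    # Assumed cost model: attention cost per step grows with (height * width) ** 2.
    full = len(schedule) * (high_res[0] * high_res[1]) ** 2
    used = sum((h * w) ** 2 for h, w in schedule)
    return used / full


schedule = bottleneck_schedule(10, 2, 2, (1024, 1024), (512, 512))
cost = relative_attention_cost(schedule, (1024, 1024))
```

Under this toy cost model the example schedule spends well under half of the full-resolution attention cost; real speedups depend on the whole network and on the refinements the abstract describes.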

Video-T1: Test-Time Scaling for Video Generation

Fangfu Liu,Hanyang Wang,Yimo Cai,Kaiyan Zhang,Xiaohang Zhan,Yueqi Duan

Task: Explore Test-Time Scaling (TTS), that is, spending more inference-time compute, as a way to improve quality in video generation.

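Motivation: With growing training data, model size, and compute, video generation has achieved impressive results in digital creation, but scaling video foundation models through expensive training is inefficient. The work therefore studies how to use more compute at inference time to improve generation quality.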

Details

Method: Reinterprets test-time scaling as a search problem from Gaussian noise space to the target video distribution, builds the search space, and uses test-time verifiers and heuristic algorithms to guide the search. Proposes a linear search strategy and a more efficient Tree-of-Frames (ToF) method. Result: On text-conditioned video generation benchmarks, increasing test-time compute significantly improves video quality. Conclusion: Test-time scaling can significantly improve video generation quality without additional training cost, offering a new optimization direction for video generation. Abstract: With the scale capability of increasing training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have expanded the scaling to test-time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training costs, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? In this work, we reinterpret the test-time scaling of video generation as a searching problem to sample better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide the search process. Given a text prompt, we first explore an intuitive linear search strategy by increasing noise candidates at inference time. As full-step denoising all frames simultaneously requires heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in the quality of videos. Project page: https://liuff19.github.io/Video-T1
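The linear search strategy mentioned in the abstract resembles best-of-N sampling: draw several noise candidates, generate a video from each, and keep the one the verifier scores highest. The sketch below assumes exactly that reading; `generate` and `verify` are hypothetical callables standing in for a video model and a test-time verifier, and Tree-of-Frames is not covered.

```python
import random
from typing import Any, Callable, Tuple


def linear_search(generate: Callable[[str, int], Any],
                  verify: Callable[[str, Any], float],
                  prompt: str, num_candidates: int,
                  seed: int = 0) -> Tuple[Any, float]:
    # Draw num_candidates noise seeds, fully generate each, keep the best-scored video.
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    rng = random.Random(seed)
    best_video, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise_seed = rng.randrange(2 ** 32)   # stands in for sampling initial Gaussian noise
        video = generate(prompt, noise_seed)  # hypothetical full denoising run
        score = verify(prompt, video)         # hypothetical verifier feedback
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score
```

Because the candidates drawn for a budget of N are a prefix of those drawn for any larger budget with the same seed, the best score never decreases as the budget grows, at the price of compute that scales linearly with N.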

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

Mingze Xu,Mingfei Gao,Shiyu Li,Jiasen Lu,Zhe Gan,Zhengfeng Lai,Meng Cao,Kai Kang,Yinfei Yang,Afshin Dehghan

Task: Introduce SlowFast-LLaVA-1.5 (SF-LLaVA-1.5), a family of lightweight video large language models for long-form video understanding.

Motivation: Meet the demand for lightweight, mobile-friendly video large language models with efficient long-range temporal context modeling.

Details

Method: Employs the two-stream SlowFast mechanism with a streamlined training pipeline and a high-quality data mixture, providing models from 1B to 7B parameters. Result: Performs well on a wide range of video and image benchmarks, reaches state-of-the-art results on long-form video understanding tasks (e.g., LongVideoBench and MLVU), and stands out at small scales (1B and 3B). Conclusion: SF-LLaVA-1.5 is an efficient, high-performing family of video large language models that is particularly suited to long-form video understanding and lightweight applications. Abstract: We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. This model family employs the two-stream SlowFast mechanism, enabling efficient modeling of long-range temporal context to meet the demand for lightweight, mobile-friendly Video LLMs. We provide models ranging from 1B to 7B parameters, optimized through a streamlined training pipeline and a high-quality data mixture composed of publicly available datasets. Experimental results demonstrate that SF-LLaVA-1.5 achieves competitive performance on a wide range of video and image benchmarks, with robust results across all model sizes. Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales (1B and 3B) across various video benchmarks.

DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation

Karim Abou Zeid,Kadir Yilmaz,Daan de Geus,Alexander Hermans,David Adrian,Timm Linder,Bastian Leibe

Task: Explore how to effectively integrate features from vision foundation models (VFMs) into 3D point cloud segmentation models.

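Motivation: Although 2D images are often available alongside 3D point cloud data, existing 3D methods rely mainly on 3D data, leaving the potential of VFMs in 3D vision largely untapped.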

Details

Method: Proposes DITR, which extracts 2D foundation model features, projects them to 3D, and injects them into a 3D point cloud segmentation model; also proposes pretraining the 3D backbone by distilling knowledge from 2D VFMs. Result: DITR achieves state-of-the-art results on indoor and outdoor 3D semantic segmentation benchmarks, and the pretraining method improves downstream performance across various datasets. Conclusion: DITR demonstrates the potential of VFMs in 3D vision and offers a simple, effective solution for 2D-3D fusion. Abstract: Vision foundation models (VFMs) trained on large-scale image datasets provide high-quality features that have significantly advanced 2D visual recognition. However, their potential in 3D vision remains largely untapped, despite the common availability of 2D images alongside 3D point cloud datasets. While significant research has been dedicated to 2D-3D fusion, recent state-of-the-art 3D methods predominantly focus on 3D data, leaving the integration of VFMs into 3D models underexplored. In this work, we challenge this trend by introducing DITR, a simple yet effective approach that extracts 2D foundation model features, projects them to 3D, and finally injects them into a 3D point cloud segmentation model. DITR achieves state-of-the-art results on both indoor and outdoor 3D semantic segmentation benchmarks. To enable the use of VFMs even when images are unavailable during inference, we further propose to distill 2D foundation models into a 3D backbone as a pretraining task. By initializing the 3D backbone with knowledge distilled from 2D VFMs, we create a strong basis for downstream 3D segmentation tasks, ultimately boosting performance across various datasets.
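The "project them to 3D" step can be illustrated with a standard pinhole camera model: each 3D point is projected into the image and picks up the 2D feature at the pixel it lands on. This is a generic sketch, not DITR's code; the nearest-pixel lookup, the zero-vector fallback for points outside the view, and all names are assumptions.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in camera coordinates


def project_point(p: Point, fx: float, fy: float,
                  cx: float, cy: float) -> Optional[Tuple[float, float]]:
    # Pinhole projection; points at or behind the camera plane have no pixel.
    x, y, z = p
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def lift_features(points: Sequence[Point],
                  feature_map: Sequence[Sequence[Sequence[float]]],
                  fx: float, fy: float, cx: float, cy: float) -> List[List[float]]:
    # feature_map[row][col] is a per-pixel feature vector from a 2D model (hypothetical input).
    height, width = len(feature_map), len(feature_map[0])
    dim = len(feature_map[0][0])
    lifted = []
    for p in points:
        uv = project_point(p, fx, fy, cx, cy)
        feature = [0.0] * dim  # assumed fallback for points not visible in the image
        if uv is not None:
            col = math.floor(uv[0] + 0.5)  # nearest pixel column
            row = math.floor(uv[1] + 0.5)  # nearest pixel row
            if 0 <= row < height and 0 <= col < width:
                feature = list(feature_map[row][col])
        lifted.append(feature)
    return lifted
```

The resulting per-point features could then be concatenated with or added to a 3D network's own features; how DITR injects them is described in the paper rather than in this sketch.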

Aether: Geometric-Aware Unified World Modeling

Aether Team,Haoyi Zhu,Yifan Wang,Jianjun Zhou,Wenzheng Chang,Yang Zhou,Zizun Li,Junyi Chen,Chunhua Shen,Jiangmiao Pang,Tong He

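Task: Propose Aether, a framework that unifies geometric reconstruction and generative modeling to enable geometry-aware reasoning in world models.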

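Motivation: Address the challenge of combining geometric reconstruction with generative modeling, moving AI systems toward human-like spatial reasoning.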

Details

Method: Jointly optimizes 4D dynamic reconstruction, action-conditioned video prediction, and goal-conditioned visual planning, sharing knowledge through task-interleaved feature learning. Result: Aether shows strong synthetic-to-real generalization despite never seeing real-world data and performs well on zero-shot tasks. Conclusion: Aether opens new directions for physically reasonable world modeling and its applications. Abstract: The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to seamlessly translate predictions into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.

Tuning-Free Amodal Segmentation via the Occlusion-Free Bias of Inpainting Models

Jae Joong Lee,Bedrich Benes,Raymond A. Yeh

Task: Perform tuning-free amodal segmentation with pretrained diffusion-based inpainting models.

Motivation: Existing methods rely on manually annotated or synthetic data that lack diversity and scale, which limits performance.

Details

Method: Exploits the "occlusion-free bias" of inpainting models to reconstruct occluded regions and then applies segmentation, with no additional training. Result: Experiments on five datasets demonstrate generalizability and robustness, with masks on average 5.3% more accurate than the state of the art. Conclusion: Presents an efficient tuning-free amodal segmentation method that outperforms existing techniques. Abstract: Amodal segmentation aims to predict segmentation masks for both the visible and occluded regions of an object. Most existing works formulate this as a supervised learning problem, requiring manually annotated amodal masks or synthetic training data. Consequently, their performance depends on the quality of the datasets, which often lack diversity and scale. This work introduces a tuning-free approach that repurposes pretrained diffusion-based inpainting models for amodal segmentation. Our approach is motivated by the "occlusion-free bias" of inpainting models, i.e., the inpainted objects tend to be complete objects without occlusions. Specifically, we reconstruct the occluded regions of an object via inpainting and then apply segmentation, all without additional training or fine-tuning. Experiments on five datasets demonstrate the generalizability and robustness of our approach. On average, our approach achieves 5.3% more accurate masks over the state-of-the-art.

Equivariant Image Modeling

Ruixiao Dong,Mengde Xu,Zigang Geng,Li Li,Han Hu,Shuyang Gu

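Task: Propose a new equivariant image modeling framework that uses the translation invariance of natural visual signals to resolve conflicts in the joint optimization of subtasks in generative models.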

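Motivation: Existing generative models suffer conflicts when jointly optimizing subtasks, and existing solutions cannot resolve them without sacrificing efficiency or scalability.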

Details

Method: Introduces column-wise tokenization and windowed causal attention to strengthen translational symmetry and enforce consistent contextual relationships. Result: On class-conditioned ImageNet generation at 256x256 resolution, performance is comparable to state-of-the-art AR models while using fewer computational resources. Conclusion: The framework is the first to achieve task-aligned decomposition in generative modeling, offering new insights into efficient parameter sharing and conflict-free optimization. Abstract: Current generative models, such as autoregressive and diffusion approaches, decompose high-dimensional data distribution learning into a series of simpler subtasks. However, inherent conflicts arise during the joint optimization of these subtasks, and existing solutions fail to resolve such conflicts without sacrificing efficiency or scalability. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks by leveraging the translation invariance of natural visual signals. Our method introduces (1) column-wise tokenization which enhances translational symmetry along the horizontal axis, and (2) windowed causal attention which enforces consistent contextual relationships across positions. Evaluated on class-conditioned ImageNet generation at 256x256 resolution, our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources. Systematic analysis demonstrates that enhanced equivariance reduces inter-task conflicts, significantly improving zero-shot generalization and enabling ultra-long image synthesis. This work establishes the first framework for task-aligned decomposition in generative modeling, offering insights into efficient parameter sharing and conflict-free optimization. The code and models are publicly available at https://github.com/drx-code/EquivariantModeling.
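The two mechanisms in the abstract can be pictured in a few lines of plain Python. The sketch is a simplified reading, not the released implementation: here a "column token" is just one raw pixel column, and the window rule (each position attends to itself and the preceding `window - 1` positions) is an assumed definition of windowed causal attention.

```python
from typing import List, Sequence


def column_tokens(image: Sequence[Sequence[float]]) -> List[List[float]]:
    # One token per image column, ordered left to right (simplified stand-in for a tokenizer).
    return [list(col) for col in zip(*image)]


def windowed_causal_mask(num_tokens: int, window: int) -> List[List[bool]]:
    # mask[i][j] is True when position i may attend to position j:
    # j must not lie in the future, and must be within the last `window` positions.
    if window < 1:
        raise ValueError("window must be at least 1")
    return [[0 <= i - j < window for j in range(num_tokens)]
            for i in range(num_tokens)]


tokens = column_tokens([[1.0, 2.0, 3.0],
                        [4.0, 5.0, 6.0]])
mask = windowed_causal_mask(len(tokens), window=2)
```

Because every position sees the same relative context under this mask (apart from the first few columns), the prediction subtask looks the same at each column, which is the kind of consistency across positions the abstract attributes to windowed causal attention.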

Target-Aware Video Diffusion Models

Taeksoo Kim,Hanbyul Joo

Task: Propose a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target while performing a desired action.

Motivation: Existing controllable image-to-video diffusion models usually rely on dense structural or motion cues to guide the actor's movements, whereas this method needs only a simple mask indicating the target and leverages the generalization capabilities of pretrained models to produce plausible actions, which makes it particularly suitable for human-object interaction (HOI) scenarios.

Details

Method: Extends a baseline model to take the target mask as an additional input, introduces a special token that encodes the target's spatial information, and fine-tunes the model with a novel cross-attention loss. Result: Experiments show that the target-aware model outperforms existing solutions in generating videos where actors interact accurately with the target. Conclusion: The method proves effective in downstream applications such as video content creation and zero-shot 3D HOI motion synthesis. Abstract: We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask and the desired action is described via a text prompt. Unlike existing controllable image-to-video diffusion models that often rely on dense structural or motion cues to guide the actor's movements toward the target, our target-aware model requires only a simple mask to indicate the target, leveraging the generalization capabilities of pretrained models to produce plausible actions. This makes our method particularly effective for human-object interaction (HOI) scenarios, where providing precise action guidance is challenging, and further enables the use of video diffusion models for high-level action planning in applications such as robotics. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using a novel cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant transformer blocks and attention regions. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.

Temporal Flexibility in Spiking Neural Networks: Towards Generalization Across Time Steps and Deployment Friendliness

Kangrui Du,Yuhang Wu,Shikuang Deng,Shi Gu

Task: Explore and train spiking neural networks (SNNs) that generalize across different time steps.

Motivation: Address the fixed-time-step constraint that current direct training methods impose on SNNs, so as to improve deployment on time-step-free event-driven chips and enable an energy-performance balance based on dynamic inference time steps.

Details

Method: Proposes Mixed Time-step Training (MTT), which assigns random time steps to different SNN stages and transmits spikes between stages via communication modules. Result: Experiments show that MTT-trained models gain remarkable temporal flexibility, suit both event-driven and clock-driven deployment, and perform well on multiple datasets. Conclusion: To the authors' knowledge, this is the first work to report results of large-scale SNN deployment in fully event-driven scenarios. Abstract: Spiking Neural Networks (SNNs), models inspired by neural mechanisms in the brain, allow for energy-efficient implementation on neuromorphic hardware. However, SNNs trained with current direct training approaches are constrained to a specific time step. This "temporal inflexibility" 1) hinders SNNs' deployment on time-step-free fully event-driven chips and 2) prevents energy-performance balance based on dynamic inference time steps. In this study, we first explore the feasibility of training SNNs that generalize across different time steps. We then introduce Mixed Time-step Training (MTT), a novel method that improves the temporal flexibility of SNNs, making SNNs adaptive to diverse temporal structures. During each iteration of MTT, random time steps are assigned to different SNN stages, with spikes transmitted between stages via communication modules. After training, the weights are deployed and evaluated on both time-stepped and fully event-driven platforms. Experimental results show that models trained by MTT gain remarkable temporal flexibility, friendliness for both event-driven and clock-driven deployment (nearly lossless on N-MNIST and 10.1% higher than standard methods on CIFAR10-DVS), enhanced network generalization, and near-SOTA performance. To the best of our knowledge, this is the first work to report the results of large-scale SNN deployment in fully event-driven scenarios.

On-Device Federated Continual Learning on RISC-V-based Ultra-Low-Power SoC for Intelligent Nano-Drone Swarms

Lars Kröger,Cristian Cioflan,Victor Kartsch,Luca Benini

Task: Propose a regularization-based on-device federated continual learning algorithm for multiple nano-drones performing face recognition.

Motivation: Address the constrained computational resources, limited device lifetime, and catastrophic forgetting that on-device learning faces on battery-powered embedded platforms.

Details

Method: Uses a RISC-V-based 10-core ultra-low-power SoC and optimizes the computational requirements of on-device learning. Result: Classification accuracy improves by 24% over naive fine-tuning, at 178 ms per local epoch and 10.5 s per global epoch. Conclusion: Demonstrates the effectiveness of the architecture for on-device learning tasks. Abstract: RISC-V-based architectures are paving the way for efficient On-Device Learning (ODL) in smart edge devices. When applied across multiple nodes, ODL enables the creation of intelligent sensor networks that preserve data privacy. However, developing ODL-capable, battery-operated embedded platforms presents significant challenges due to constrained computational resources and limited device lifetime, besides intrinsic learning issues such as catastrophic forgetting. We address these challenges by proposing a regularization-based On-Device Federated Continual Learning algorithm tailored for multiple nano-drones performing face recognition tasks. We demonstrate our approach on a RISC-V-based 10-core ultra-low-power SoC, optimizing the ODL computational requirements. We improve the classification accuracy by 24% over naive fine-tuning, requiring 178 ms per local epoch and 10.5 s per global epoch, demonstrating the effectiveness of the architecture for this task.

Bayesian generative models can flag performance loss, bias, and out-of-distribution image content

Miguel López-Pérez,Marco Miani,Valery Naranjo,Søren Hauberg,Aasa Feragen

Task: Propose a new uncertainty quantification method (SLUG) for variational autoencoders (VAEs) to address the sensitivity of generative models to distribution shift in medical imaging tasks.

Motivation: Generative models are widely used in medical imaging but are sensitive to distribution shift, which can lead to bias (e.g., underrepresentation), while available uncertainty quantification methods remain limited.

Details

Method: Combines Laplace approximations with stochastic trace estimators into SLUG, which handles high-dimensional image data efficiently. Result: SLUG's uncertainty score correlates strongly with reconstruction error and racial underrepresentation bias, and it detects out-of-distribution image content (e.g., ink, rulers). Conclusion: SLUG provides an effective UQ method for generative models that helps flag bias and detect out-of-distribution data. Abstract: Generative models are popular for medical imaging tasks such as anomaly detection, feature extraction, data visualization, or image generation. Since they are parameterized by deep learning models, they are often sensitive to distribution shifts and unreliable when applied to out-of-distribution data, creating a risk of, e.g., underrepresentation bias. This behavior can be flagged using uncertainty quantification methods for generative models, but their availability remains limited. We propose SLUG: a new UQ method for VAEs that combines recent advances in Laplace approximations with stochastic trace estimators to scale gracefully with image dimensionality. We show that our UQ score, unlike the VAE's encoder variances, correlates strongly with reconstruction error and racial underrepresentation bias for dermatological images. We also show how pixel-wise uncertainty can detect out-of-distribution image content such as ink, rulers, and patches, which are known to induce learning shortcuts in predictive models.
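
The abstract does not say which stochastic trace estimator SLUG uses. As a generic illustration of the idea, the sketch below implements the classic Hutchinson estimator with Rademacher probes; the function name `hutchinson_trace` and its parameters are assumptions made here, and nothing in it is specific to SLUG or to Laplace approximations.

```python
import numpy as np

def hutchinson_trace(matvec, dim, num_probes=64, seed=0):
    """Estimate trace(A) using only matrix-vector products v -> A @ v.

    For probes z with independent +/-1 entries, E[z^T A z] = trace(A),
    so averaging z^T (A z) over several probes approximates the trace
    without ever forming A explicitly.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe
        total += z @ matvec(z)
    return total / num_probes

# Usage: a small explicit matrix stands in for an implicit operator.
a = np.diag([1.0, 2.0, 3.0])
estimate = hutchinson_trace(lambda v: a @ v, dim=3)
```

The appeal for high-dimensional images is that the cost scales with the number of probes times the cost of one matrix-vector product, rather than with the square of the pixel count.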

What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models

Keyon Vafa,Sarah Bentley,Jon Kleinberg,Sendhil Mullainathan

Task: Propose a mathematical framework and benchmark task for evaluating the steerability of generative models.

Motivation: Existing metrics focus mainly on a model's producibility, but practical value depends on whether a user can get the model to produce an output that satisfies a specific goal.

Details

Method: Introduces a mathematical framework for evaluating steerability, designs a benchmark task (users reproduce a sampled model output), and tests text-to-image models and large language models in a large-scale user study. Result: Although the models produce high-quality outputs, they perform poorly on steerability; a steering mechanism for image models built with reinforcement learning techniques improves benchmark performance by more than 2x. Conclusion: Improving the steerability of generative models deserves attention, and significant gains are technically achievable. Abstract: How should we evaluate the quality of generative models? Many existing metrics focus on a model's producibility, i.e., the quality and breadth of outputs it can generate. However, the actual value from using a generative model stems not just from what it can produce but whether a user with a specific goal can produce an output that satisfies that goal. We refer to this property as steerability. In this paper, we first introduce a mathematical framework for evaluating steerability independently from producibility. Steerability is more challenging to evaluate than producibility because it requires knowing a user's goals. We address this issue by creating a benchmark task that relies on one key idea: sample an output from a generative model and ask users to reproduce it. We implement this benchmark in a large-scale user study of text-to-image models and large language models. Despite the ability of these models to produce high-quality outputs, they all perform poorly on steerability. This suggests that we need to focus on improving the steerability of generative models. We show such improvements are indeed possible: through reinforcement learning techniques, we create an alternative steering mechanism for image models that achieves more than 2x improvement on this benchmark.

Judge Anything: MLLM as a Judge Across Any Modality

Shu Pu,Yaochen Wang,Dongping Chen,Yuhang Chen,Guohao Wang,Qi Qin,Zhongyi Zhang,Zhiyuan Zhang,Zetong Zhou,Shuang Gong,Yi Gui,Yao Wan,Philip S. Yu

Task: Evaluate generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks.

Motivation: The complexity of cross-modal interactions makes evaluating multimodal tasks challenging, which calls for a unified approach.

Details

Method: Introduces two benchmarks (TaskAnything and JudgeAnything) and the automated platform OmniArena to evaluate the performance and judging capabilities of multimodal large language models (MLLMs). Result: As judges, MLLMs do better on MMU tasks (averaging 66.55% in Pair Comparison and 42.79% in Score Evaluation) than on MMG tasks (averaging 53.37% and 30.05%), exposing cross-modality biases and hallucination issues. Conclusion: Fairer evaluation protocols and stronger alignment with human preferences are needed; OmniArena offers a solution for automated evaluation. Abstract: Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in the Pair Comparison setting and 42.79% in the Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in the Pair Comparison setting and 30.05% in the Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://urrealhero.github.io/judgeanythingweb/.

Splat-LOAM: Gaussian Splatting LiDAR Odometry and Mapping

Emanuele Giacomini,Luca Di Giammarino,Lorenzo De Rebotti,Giorgio Grisetti,Martin R. Oswald

Task: Develop a LiDAR odometry and mapping method based on Gaussian splatting.

Motivation: Address the trade-off in existing methods between accuracy on one side and memory and processing time on the other.

Details

Method: Uses Gaussian splatting and spherical projection, refining the scene representation from LiDAR measurements alone. Result: Matches current registration performance while achieving SOTA mapping results with minimal GPU requirements. Conclusion: The method is efficient and well suited to real-time robotics estimation tasks, with potential for further exploration and adoption. Abstract: LiDARs provide accurate geometric measurements, making them valuable for ego-motion estimation and reconstruction tasks. Despite their success, managing an accurate and lightweight representation of the environment still poses challenges. Both classic and NeRF-based solutions have to trade off accuracy against memory and processing time. In this work, we build on recent advancements in Gaussian Splatting methods to develop a novel LiDAR odometry and mapping pipeline that exclusively relies on Gaussian primitives for its scene representation. Leveraging spherical projection, we drive the refinement of the primitives uniquely from LiDAR measurements. Experiments show that our approach matches the current registration performance, while achieving SOTA results for mapping tasks with minimal GPU requirements. This efficiency makes it a strong candidate for further exploration and potential adoption in real-time robotics estimation tasks.
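
The spherical projection mentioned in this abstract is a standard way to turn a LiDAR point cloud into a range image. The sketch below shows the usual textbook formulation in NumPy; the function name, the image size, and the field-of-view parameters are illustrative assumptions, and the abstract does not specify how Splat-LOAM itself parameterizes the projection.

```python
import numpy as np

def spherical_projection(points, height, width, fov_up_deg, fov_down_deg):
    """Project (N, 3) LiDAR points into a (height, width) range image.

    Columns index azimuth, rows index elevation; each pixel keeps the
    closest range that falls into it, and empty pixels stay at infinity.
    """
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)

    ranges = np.linalg.norm(points, axis=1)
    keep = ranges > 0  # drop degenerate points at the sensor origin
    points, ranges = points[keep], ranges[keep]

    azimuth = np.arctan2(points[:, 1], points[:, 0])  # in [-pi, pi]
    elevation = np.arcsin(points[:, 2] / ranges)

    u = 0.5 * (1.0 - azimuth / np.pi) * width
    v = (fov_up - elevation) / (fov_up - fov_down) * height

    inside = (v >= 0) & (v < height)  # discard points outside the vertical FOV
    rows = np.floor(v[inside]).astype(int)
    cols = np.floor(u[inside]).astype(int) % width  # wrap around in azimuth

    image = np.full((height, width), np.inf)
    np.minimum.at(image, (rows, cols), ranges[inside])
    return image

cloud = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
range_image = spherical_projection(cloud, 16, 360, 10.0, -20.0)
```

Working in this image domain gives a dense 2D grid of measurements, which is one plausible reason a rendering-based representation such as Gaussian primitives can be refined directly against LiDAR data.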

MM-UNet: Meta Mamba UNet for Medical Image Segmentation

Bin Xie,Yan Yan,Gady Agam

Task: Propose MM-UNet, a unified U-shaped encoder-decoder architecture, to address the limitations of state space models (SSMs) in medical image segmentation.

Motivation: SSMs excel at long-sequence modeling but struggle with 3D spatial structures and high-variance data in medical image segmentation.

Details

Method: Designs MM-UNet with hybrid modules that integrate SSMs within residual connections and introduces a bi-directional scan order strategy. Result: MM-UNet achieves Dice scores of 91.0% on AMOS2022 and 87.1% on Synapse, outperforming existing methods. Conclusion: With architectural design optimizations, SSMs are effective for medical image segmentation. Abstract: State Space Models (SSMs) have recently demonstrated outstanding performance in long-sequence modeling, particularly in natural language processing. However, their direct application to medical image segmentation poses several challenges. SSMs, originally designed for 1D sequences, struggle with 3D spatial structures in medical images due to discontinuities introduced by flattening. Additionally, SSMs have difficulty fitting high-variance data, which is common in medical imaging. In this paper, we analyze the intrinsic limitations of SSMs in medical image segmentation and propose a unified U-shaped encoder-decoder architecture, Meta Mamba UNet (MM-UNet), designed to leverage the advantages of SSMs while mitigating their drawbacks. MM-UNet incorporates hybrid modules that integrate SSMs within residual connections, reducing variance and improving performance. Furthermore, we introduce a novel bi-directional scan order strategy to alleviate discontinuities when processing medical images. Extensive experiments on the AMOS2022 and Synapse datasets demonstrate the superiority of MM-UNet over state-of-the-art methods. MM-UNet achieves a Dice score of 91.0% on AMOS2022, surpassing nnUNet by 3.2%, and a Dice score of 87.1% on Synapse. These results confirm the effectiveness of integrating SSMs in medical image segmentation through architectural design optimizations.
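
For readers unfamiliar with the Dice score reported in this abstract, the following sketch computes it for a pair of binary masks. It is the standard definition, 2|P ∩ T| / (|P| + |T|); the small `eps` term is a common convention assumed here so that two empty masks score 1, and the paper's exact evaluation protocol (for example per-organ averaging) is not described in the abstract.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two binary masks of the same shape."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined (and equal to 1) when both masks are empty.
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# One of two predicted voxels overlaps the single target voxel: 2*1 / (2+1).
score = dice_score(np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0]))
```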

Echo-E$^3$Net: Efficient Endo-Epi Spatio-Temporal Network for Ejection Fraction Estimation

Moein Heidari,Afshin Bozorgpour,AmirHossein Zarif-Fakharnia,Dorit Merhof,Ilker Hacihaliloglu

Task: Propose an efficient Endo-Epi spatio-temporal network (Echo-E$^3$Net) for estimating left ventricular ejection fraction (LVEF).

Motivation: Conventional LVEF estimation is time-consuming and operator-dependent; existing deep learning models are computationally demanding, which hinders real-time use, and they often overlook the interplay between spatial and temporal features.

Details

Method: Proposes Echo-E$^3$Net, comprising the Endo-Epi Cardial Border Detector (E$^2$CBD) and Endo-Epi Feature Aggregator (E$^2$FA) modules, combined with a multi-component loss function to improve spatio-temporal feature learning. Result: On the EchoNet-Dynamic dataset, RMSE is 5.15 and R$^2$ is 0.82, with only 6.8 million parameters and 8.49 GFLOPs. Conclusion: Echo-E$^3$Net is efficient and needs no pre-training or data augmentation, making it suitable for real-time clinical ultrasound applications. Abstract: Left ventricular ejection fraction (LVEF) is a critical metric for assessing cardiac function, widely used in diagnosing heart failure and guiding clinical decisions. Despite its importance, conventional LVEF estimation remains time-consuming and operator-dependent. Recent deep learning advancements have enhanced automation, yet many existing models are computationally demanding, hindering their feasibility for real-time clinical applications. Additionally, the interplay between spatial and temporal features is crucial for accurate estimation but is often overlooked. In this work, we propose Echo-E$^3$Net, an efficient Endo-Epi spatio-temporal network tailored for LVEF estimation. Our method introduces the Endo-Epi Cardial Border Detector (E$^2$CBD) module, which enhances feature extraction by leveraging spatial and temporal landmark cues. Complementing this, the Endo-Epi Feature Aggregator (E$^2$FA) distills statistical descriptors from backbone feature maps, refining the final EF prediction. These modules, along with a multi-component loss function tailored to align with the clinical definition of EF, collectively enhance spatial-temporal representation learning, ensuring robust and efficient EF estimation. We evaluate Echo-E$^3$Net on the EchoNet-Dynamic dataset, achieving an RMSE of 5.15 and an R$^2$ score of 0.82, setting a new benchmark in efficiency with 6.8 million parameters and only 8.49 GFLOPs. Our model operates without pre-training, data augmentation, or ensemble methods, making it well-suited for real-time point-of-care ultrasound (PoCUS) applications. Our code is publicly available at https://github.com/moeinheidari7829/Echo-E3Net.
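
The abstract refers to the clinical definition of EF and reports RMSE and R$^2$. The sketch below spells out those three quantities using their standard definitions: EF as the percentage of end-diastolic volume ejected, RMSE, and the coefficient of determination. The function names and example volumes are illustrative assumptions, and the code says nothing about how Echo-E$^3$Net's multi-component loss is built.

```python
import numpy as np

def ejection_fraction(edv, esv):
    """EF in percent from end-diastolic (edv) and end-systolic (esv) volumes."""
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv - esv) / edv

def rmse(pred, true):
    """Root mean squared error between predicted and reference values."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return float(np.sqrt(np.mean((pred - true) ** 2)))

def r2_score(pred, true):
    """Coefficient of determination of predictions against references."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    ss_res = np.sum((true - pred) ** 2)
    ss_tot = np.sum((true - true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical volumes in mL: (120 - 50) / 120 is roughly 58.3 percent.
ef = ejection_fraction(edv=120.0, esv=50.0)
```

Since EF is expressed in percentage points, an RMSE of 5.15 means predictions are typically off by about five EF points.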

Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion

Yu Sun,Yin Li,Ruixiao Sun,Chunhui Liu,Fangming Zhou,Ze Jin,Linjie Wang,Xiang Shen,Zhuolin Hao,Hongyu Xiong

Task: Propose kNN-based Latent Space Broadening (LSB) and Vision-Language Modeling with Audio Enhancement (VLMAE) to improve multimodal models in recommendation, search, and advertising systems.

Motivation: Traditional statistical-based active learning methods struggle to detect overconfident misclassifications and to distinguish semantically similar items, and pre-trained multimodal architectures focus mainly on text and images, neglecting audio information.

Details

Method: Proposes LSB to enhance active learning efficiency, and VLMAE, a mid-fusion approach that integrates audio into vision-language models. Result: The system is deployed in production, leading to significant business gains. Conclusion: The proposed methods effectively address key challenges in multimodal modeling, improving model performance and business metrics. Abstract: Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective in distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially in short-video platforms, yet most pre-trained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models. This system has been deployed in production, leading to significant business gains.
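
The abstract names kNN-based Latent Space Broadening but gives no algorithmic detail. One plausible reading, offered here purely as an assumption, is that samples already picked for annotation are used as seeds and their nearest neighbors in embedding space are added to the candidate pool. The sketch below illustrates that reading with cosine similarity; the function name, the choice of cosine similarity, and the toy embeddings are all hypothetical.

```python
import numpy as np

def broaden_with_knn(seed_emb, pool_emb, k):
    """Return indices of pool items among the k nearest neighbors of any seed.

    seed_emb: (S, D) embeddings of samples already selected for labeling.
    pool_emb: (P, D) embeddings of the unlabeled pool.
    """
    seed = seed_emb / np.linalg.norm(seed_emb, axis=1, keepdims=True)
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    similarity = seed @ pool.T  # cosine similarity, shape (S, P)
    nearest = np.argsort(-similarity, axis=1)[:, :k]
    return np.unique(nearest)  # sorted, de-duplicated pool indices

pool = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
seeds = np.array([[1.0, 0.05]])
extra_candidates = broaden_with_knn(seeds, pool, k=2)
```

Under this reading, neighbors of a hard sample are likely to share its failure mode even when the model is confident about them, which would speak to the overconfident-misclassification limitation the abstract raises.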

Vishwesh Ramanathan,Tony Xu,Pushpak Pati,Faruk Ahmed,Maged Goubran,Anne L. Martel

Task: Propose the ModalTune framework to address fine-tuning for multi-modal, multi-task, and pan-cancer modeling in digital pathology.

Motivation: Prediction tasks in digital pathology are challenged by the massive size of whole-slide images (WSIs) and weak training signals, and existing approaches suffer from catastrophic forgetting during fine-tuning and under-utilize information shared across modalities.

Details

Method: Introduces the Modal Adapter to integrate new modalities without modifying SLFM weights, and uses large language models (LLMs) to encode labels as text to capture semantic relationships. Result: ModalTune achieves SOTA results across four cancer types while remaining competitive in pan-cancer settings, and generalizes well to two OOD datasets. Conclusion: ModalTune is the first unified fine-tuning framework for multi-modal, multi-task, and pan-cancer modeling in digital pathology. Abstract: Prediction tasks in digital pathology are challenging due to the massive size of whole-slide images (WSIs) and the weak nature of training signals. Advances in computing, data availability, and self-supervised learning (SSL) have paved the way for slide-level foundation models (SLFMs) that can improve prediction tasks in low-data regimes. However, working with these models is challenging, with issues such as catastrophic forgetting during fine-tuning and under-utilization of shared information between tasks and modalities. To overcome these two challenges, we propose ModalTune, a novel fine-tuning framework which introduces the Modal Adapter to integrate new modalities without modifying SLFM weights. Additionally, we use large language models (LLMs) to encode labels as text, capturing semantic relationships and enhancing generalization across multiple tasks and cancer types in a single training recipe. ModalTune achieves state-of-the-art (SOTA) results against both uni-modal and multi-modal models across four cancer types, jointly improving survival and cancer subtype prediction while remaining competitive in pan-cancer settings. Additionally, we show ModalTune is highly generalizable to two out-of-distribution (OOD) datasets. To our knowledge, this is the first unified fine-tuning framework for multi-modal, multi-task, and pan-cancer modeling in digital pathology.

Leveraging Audio Representations for Vibration-Based Crowd Monitoring in Stadiums

Yen Cheng Chang,Jesse Codling,Yiwen Dong,Jiale Zhang,Jiasi Chen,Hae Young Noh,Pei Zhang

Task: Predict crowd behavior in sports stadiums from floor vibration data.

Motivation: Existing camera- and microphone-based crowd monitoring is disruptive and raises privacy concerns; vibration sensing is less intrusive but lacks training data.

Details

Method: Proposes ViLA, which is pre-trained without supervision on audio data (YouTube8M) to learn wave behaviors and then fine-tuned on a small amount of vibration data, reducing reliance on domain-specific data. Result: Experiments show that audio pre-training reduces the vibration model's error by up to 5.8x. Conclusion: Cross-modal pre-training lets ViLA overcome the shortage of vibration data and improves the accuracy of crowd behavior prediction. Abstract: Crowd monitoring in sports stadiums is important to enhance public safety and improve the audience experience. Existing approaches mainly rely on cameras and microphones, which can cause significant disturbances and often raise privacy concerns. In this paper, we sense floor vibration, which provides a less disruptive and less intrusive way of crowd sensing, to predict crowd behavior. However, since the vibration-based crowd monitoring approach is newly developed, one main challenge is the lack of training data due to sports stadiums being large public spaces with complex physical activities. We present ViLA (Vibration Leverage Audio), a vibration-based method that reduces the dependency on labeled data by pre-training with unlabeled cross-modality data. ViLA is first pre-trained on audio data in an unsupervised manner and then fine-tuned with a minimal amount of in-domain vibration data. By leveraging publicly available audio datasets, ViLA learns the wave behaviors from audio and then adapts the representation to vibration, reducing the reliance on domain-specific vibration data. Our real-world experiments demonstrate that pre-training the vibration model using publicly available audio data (YouTube8M) achieved up to a 5.8x error reduction compared to the model without audio pre-training.

GS-LTS: 3D Gaussian Splatting-Based Adaptive Modeling for Long-Term Service Robots

Bin Fu,Jialin Li,Bin Zhang,Ruiping Wang,Xilin Chen

Task: Propose GS-LTS, a 3D Gaussian Splatting (3DGS)-based system that lets indoor robots carry out tasks over the long term in dynamic environments.

Motivation: Existing 3DGS methods focus mainly on static scenes and cannot meet the needs of long-term service robots, which must handle dynamic scene changes.

Details

Method: Detects changes via single-image change detection, collects multi-view observations with a rule-based policy, and efficiently updates the scene representation through Gaussian editing. Result: Experiments show GS-LTS outperforms the baseline in reconstruction, navigation, and scene updates, with faster and higher-quality updates. Conclusion: GS-LTS offers an effective solution for applying 3DGS to long-term robotic operations. Abstract: 3D Gaussian Splatting (3DGS) has garnered significant attention in robotics for its explicit, high-fidelity dense scene representation, demonstrating strong potential for robotic applications. However, 3DGS-based methods in robotics primarily focus on static scenes, with limited attention to the dynamic scene changes essential for long-term service robots. These robots demand sustained task execution and efficient scene updates, challenges that current approaches fail to meet. To address these limitations, we propose GS-LTS (Gaussian Splatting for Long-Term Service), a 3DGS-based system enabling indoor robots to manage diverse tasks in dynamic environments over time. GS-LTS detects scene changes (e.g., object addition or removal) via single-image change detection, employs a rule-based policy to autonomously collect multi-view observations, and efficiently updates the scene representation through Gaussian editing. Additionally, we propose a simulation-based benchmark that automatically generates scene change data as compact configuration scripts, providing a standardized, user-friendly evaluation benchmark. Experimental results demonstrate GS-LTS's advantages in reconstruction, navigation, and scene updates, which are faster and of higher quality than the image training baseline, advancing 3DGS for long-term robotic operations. Code and benchmark are available at: https://vipl-vsu.github.io/3DGS-LTS.

RDTF: Resource-efficient Dual-mask Training Framework for Multi-frame Animated Sticker Generation

Zhiqiang Yuan,Ting Zhang,Ying Deng,Jiapei Zhang,Yeshuang Zhu,Zexi Jia,Jie Zhou,Jinchao Zhang

Task: Study whether, under constrained resources, training a small video generation model from scratch rather than using parameter-efficient tuning improves downstream performance.

Motivation: Under constrained resources, parameter-efficient tuning methods (such as Adapter or LoRA) can yield poor fitting ability or inference that deviates from the target domain, which motivates the search for a more effective approach.

Details

Method: Builds a discrete frame generation network and proposes a dual-mask based data utilization strategy and a difficulty-adaptive curriculum learning method. Result: Experiments show the method is quantitatively and qualitatively superior to parameter-efficient tuning methods (such as I2V-Adapter and SimDA). Conclusion: Under constrained resources, training a small model from scratch, combined with data utilization and curriculum learning strategies, is feasible and outperforms parameter-efficient tuning. Abstract: Recently, great progress has been made in video generation technology, attracting the widespread attention of scholars. To apply this technology to downstream applications under resource-constrained conditions, researchers usually fine-tune the pre-trained models based on parameter-efficient tuning methods such as Adapter or LoRA. Although these methods can transfer the knowledge from the source domain to the target domain, fewer training parameters lead to poor fitting ability, and the knowledge from the source domain may lead to the inference process deviating from the target domain. In this paper, we argue that under constrained resources, training a smaller video generation model from scratch using only million-level samples can outperform parameter-efficient tuning on larger models in downstream applications: the core lies in the effective utilization of data and curriculum strategy. Taking animated sticker generation (ASG) as a case study, we first construct a discrete frame generation network for stickers with low frame rates, ensuring that its parameters meet the requirements of model training under constrained resources. In order to provide data support for models trained from scratch, we propose a dual-mask based data utilization strategy, which improves the availability and expands the diversity of limited data. To facilitate convergence under the dual-mask situation, we propose a difficulty-adaptive curriculum learning method, which decomposes the sample entropy into static and adaptive components so as to obtain samples from easy to difficult. Experiments demonstrate that our resource-efficient dual-mask training framework is quantitatively and qualitatively superior to parameter-efficient tuning methods such as I2V-Adapter and SimDA, verifying the feasibility of our method on downstream tasks under constrained resources. Code will be available.

Hierarchy-Aware and Channel-Adaptive Semantic Communication for Bandwidth-Limited Data Fusion

Lei Guo,Wei Chen,Yuxuan Sun,Bo Ai,Nikolaos Pappas,Tony Quek

Task: Propose a hierarchy-aware and channel-adaptive semantic communication approach for bandwidth-limited reconstruction of high-resolution hyperspectral images (HR-HSI).

Motivation: Traditional fusion techniques consume substantial bandwidth when reconstructing HR-HSI, and transmitting raw data directly is inefficient, so a more efficient fusion approach is needed.

Details

Method: Uses a hierarchical correlation module to preserve both the overall structure and the details of the image, together with a Transformer-based channel-adaptive attention mechanism that dynamically integrates and transmits deep and shallow features. Result: On the CAVE and Washington DC Mall datasets, the method improves peak signal-to-noise ratio (PSNR) by up to 2 dB while cutting bandwidth consumption by two-thirds. Conclusion: The method efficiently achieves high-quality HR-HSI reconstruction in bandwidth-constrained environments. Abstract: Obtaining high-resolution hyperspectral images (HR-HSI) is costly and data-intensive, making it necessary to fuse low-resolution hyperspectral images (LR-HSI) with high-resolution RGB images (HR-RGB) for practical applications. However, traditional fusion techniques, which integrate detailed information into the reconstruction, significantly increase bandwidth consumption compared to directly transmitting raw data. To overcome these challenges, we propose a hierarchy-aware and channel-adaptive semantic communication approach for bandwidth-limited data fusion. A hierarchical correlation module is proposed to preserve both the overall structural information and the details of the image required for super-resolution. This module efficiently combines deep semantic and shallow features from LR-HSI and HR-RGB. To further reduce bandwidth usage while preserving reconstruction quality, a channel-adaptive attention mechanism based on Transformer is proposed to dynamically integrate and transmit the deep and shallow features, enabling efficient data transmission and high-quality HR-HSI reconstruction. Experimental results on the CAVE and Washington DC Mall datasets demonstrate that our method outperforms single-source transmission, achieving up to a 2 dB improvement in peak signal-to-noise ratio (PSNR). Additionally, it reduces bandwidth consumption by two-thirds, confirming its effectiveness in bandwidth-constrained environments for HR-HSI reconstruction tasks.
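
To put the reported 2 dB PSNR gain in perspective, the sketch below computes PSNR from its standard definition and converts a decibel gain into the equivalent reduction in mean squared error. The function names and the assumption that pixel values are scaled to [0, 1] (`max_value=1.0`) are choices made for this illustration, not details from the paper.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with peak value max_value."""
    reference = np.asarray(reference, float)
    reconstruction = np.asarray(reconstruction, float)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

# A constant error of 0.1 on a [0, 1] image gives MSE 0.01, i.e. about 20 dB.
value = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))
ratio = mse_ratio_for_gain(2.0)  # roughly 1.58
```

Because PSNR is logarithmic, a 2 dB improvement corresponds to cutting the mean squared reconstruction error by a factor of about 1.58.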

Assessing workflow impact and clinical utility of AI-assisted brain aneurysm detection: a multi-reader study

Tommaso Di Noto,Sofyan Jankowski,Francesco Puccinelli,Guillaume Marie,Sebastien Tourbier,Yasser Aleman-Gomez,Oscar Esteban,Ricardo Corredor-Jerez,Guillaume Saliou,Patric Hagmann,Meritxell Bach Cuadra,Jonas Richiardi

Task: Assess the clinical utility of an AI-assisted model for brain aneurysm detection.

Motivation: Although AI algorithms for anomaly detection in radiology abound, their actual effect and impact in clinical settings are rarely evaluated.

Details

Method: Trains and validates an AI model on an open-access TOF-MRA dataset (N=460) and compares the performance of two radiologists with different levels of experience, with and without AI assistance. Result: The AI model reaches state-of-the-art results on the test set (sensitivity 74%, false positive rate 1.6) but does not significantly improve the readers' sensitivity, and AI assistance significantly increases reading time. Conclusion: The study stresses the importance of validating AI algorithms in clinical settings and reminds the community to examine real-world effectiveness and workflow impact. Abstract: Despite the plethora of AI-based algorithms developed for anomaly detection in radiology, subsequent integration into clinical settings is rarely evaluated. In this work, we assess the applicability and utility of an AI-based model for brain aneurysm detection by comparing the performance of two readers with different levels of experience (2 and 13 years). We aim to answer the following questions: 1) Do the readers improve their performance when assisted by the AI algorithm? 2) How much does the AI algorithm impact routine clinical workflow? We reuse and enlarge our open-access Time-Of-Flight Magnetic Resonance Angiography dataset (N=460). We use 360 subjects for training/validating our algorithm and 100 as an unseen test set for the reading session. Even though our model reaches state-of-the-art results on the test set (sensitivity=74%, false positive rate=1.6), we show that neither the junior nor the senior reader significantly increases their sensitivity (p=0.59, p=1, respectively). In addition, we find that reading time for both readers is significantly higher in the "AI-assisted" setting than in the "Unassisted" setting (+15 seconds, on average; p=3x10^(-4) junior, p=3x10^(-5) senior). The confidence reported by the readers is unchanged across the two settings, indicating that the AI assistance does not influence the certainty of the diagnosis. Our findings highlight the importance of clinical validation of AI algorithms in a clinical setting involving radiologists. This study should serve as a reminder to the community to always examine the real-world effectiveness and workflow impact of proposed algorithms.
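
As a reminder of how detection metrics like those in this abstract are usually computed, the sketch below defines lesion-level sensitivity and average false positives per case. Reading the reported "false positive rate=1.6" as false positives per case is an assumption made here, and the counts in the usage lines are invented for illustration; they are not taken from the study.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real findings that were detected."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no positive findings to evaluate")
    return true_positives / total

def false_positives_per_case(false_positives, num_cases):
    """Average number of false alarms per examined subject."""
    if num_cases <= 0:
        raise ValueError("num_cases must be positive")
    return false_positives / num_cases

# Hypothetical counts chosen only to reproduce the shape of the reported figures.
sens = sensitivity(true_positives=37, false_negatives=13)              # 0.74
fp_rate = false_positives_per_case(false_positives=160, num_cases=100)  # 1.6
```

Each false alarm has to be reviewed and dismissed by the reader, so the false-positive load is one plausible contributor to the longer reading times reported in the AI-assisted setting.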

DVG-Diffusion: Dual-View Guided Diffusion Model for CT Reconstruction from X-Rays

Xing Xie,Jiawei Liu,Huijie Fan,Zhi Han,Yandong Tang,Liangqiong Qu

Task: Reconstruct a 3D CT volume directly from few-view 2D X-ray images with an end-to-end deep learning network.

Motivation: Because X-ray images are merely projection views of the 3D CT volume, reconstructing the volume directly from a few X-rays is challenging.

Details

Method: Proposes a dual-view guided diffusion model (DVG-Diffusion) that couples a real input X-ray view with a synthesized new X-ray view to jointly guide CT reconstruction. Result: Experiments show the method strikes an effective balance between high fidelity and perceptual quality in CT reconstruction and outperforms state-of-the-art methods. Conclusion: Through view parameter-guided encoding and dual-view guided CT reconstruction, DVG-Diffusion effectively improves CT reconstruction quality. Abstract: Directly reconstructing a 3D CT volume from few-view 2D X-rays using an end-to-end deep learning network is a challenging task, as X-ray images are merely projection views of the 3D CT volume. In this work, we facilitate the complex 2D X-ray image to 3D CT mapping by incorporating new view synthesis, and reduce the learning difficulty through view-guided feature alignment. Specifically, we propose a dual-view guided diffusion model (DVG-Diffusion), which couples a real input X-ray view and a synthesized new X-ray view to jointly guide CT reconstruction. First, a novel view parameter-guided encoder captures features from X-rays that are spatially aligned with CT. Next, we concatenate the extracted dual-view features as conditions for the latent diffusion model to learn and refine the CT latent representation. Finally, the CT latent representation is decoded into a CT volume in pixel space. By incorporating view parameter-guided encoding and dual-view guided CT reconstruction, our DVG-Diffusion can achieve an effective balance between high fidelity and perceptual quality for CT reconstruction. Experimental results demonstrate that our method outperforms state-of-the-art methods. Based on the experiments, we also present a comprehensive analysis and discussion of views and reconstruction.

FundusGAN: A Hierarchical Feature-Aware Generative Framework for High-Fidelity Fundus Image Generation

Qingshan Hou,Meng Wang,Peng Cao,Zou Ke,Xiaoli Liu,Huazhu Fu,Osmar R. Zaiane

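Task: Propose FundusGAN, a hierarchical feature-aware generative framework for high-fidelity fundus image synthesis, to address the need for large-scale datasets when pre-training ophthalmology foundation models.

Motivation: Existing ophthalmology foundation models such as RetFound require massive amounts of data for pre-training, which raises the barrier to development and deployment; FundusGAN aims to address this data scarcity.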

Details

Method: A Feature Pyramid Network extracts multi-scale information and is combined with a modified StyleGAN generator (using dilated convolutions and strategic upsampling adjustments) to preserve retinal structures and enhance pathological detail. Result: FundusGAN performs strongly on the DDR, DRIVE, and IDRiD datasets (SSIM: 0.8863, FID: 54.2, KID: 0.0436 on DDR), and the generated images significantly improve disease classification accuracy (by up to 6.49%). Conclusion: FundusGAN is an effective tool for addressing data scarcity in ophthalmological AI research; it reduces dependence on large-scale clinical datasets and improves the robustness and generalization of diagnostic systems. Abstract: Recent advancements in ophthalmology foundation models such as RetFound have demonstrated remarkable diagnostic capabilities but require massive datasets for effective pre-training, creating significant barriers for development and deployment. To address this critical challenge, we propose FundusGAN, a novel hierarchical feature-aware generative framework specifically designed for high-fidelity fundus image synthesis. Our approach leverages a Feature Pyramid Network within its encoder to comprehensively extract multi-scale information, capturing both large anatomical structures and subtle pathological features. The framework incorporates a modified StyleGAN-based generator with dilated convolutions and strategic upsampling adjustments to preserve critical retinal structures while enhancing pathological detail representation. Comprehensive evaluations on the DDR, DRIVE, and IDRiD datasets demonstrate that FundusGAN consistently outperforms state-of-the-art methods across multiple metrics (SSIM: 0.8863, FID: 54.2, KID: 0.0436 on DDR). Furthermore, disease classification experiments reveal that augmenting training data with FundusGAN-generated images significantly improves diagnostic accuracy across multiple CNN architectures (up to 6.49% improvement with ResNet50). These results establish FundusGAN as a valuable foundation model component that effectively addresses data scarcity challenges in ophthalmological AI research, enabling more robust and generalizable diagnostic systems while reducing dependency on large-scale clinical data collection.

Multi-Disease-Aware Training Strategy for Cardiac MR Image Segmentation

Hong Zheng,Yucheng Chen,Nan Mu,Xiaoning Li

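Task: Propose a Multi-Disease-Aware Training Strategy (MTS) and a data processing technique to improve the segmentation of irregularly shaped organs, such as the right ventricle, in cardiac magnetic resonance images (CMRIs).

Motivation: Existing deep learning methods perform poorly on irregularly shaped organs such as the right ventricle, mainly because the models generalize insufficiently to distribution shifts across slices, cardiac phases, and disease conditions.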

Details

Method: A Multi-Disease-Aware Training Strategy (MTS) is proposed, the CMRI datasets are restructured into multi-disease datasets, and a specialized data processing technique is designed. Result: Experiments show that models trained with MTS perform strongly on right ventricle segmentation and remain robust on data from unknown diseases. Conclusion: MTS and the data processing technique effectively improve the segmentation of irregularly shaped organs and strengthen model generalization. Abstract: Accurate segmentation of the ventricles from cardiac magnetic resonance images (CMRIs) is crucial for enhancing the diagnosis and analysis of heart conditions. Deep learning-based segmentation methods have recently garnered significant attention due to their impressive performance. However, these segmentation methods are typically good at partitioning regularly shaped organs, such as the left ventricle (LV) and the myocardium (MYO), whereas they perform poorly on irregularly shaped organs, such as the right ventricle (RV). In this study, we argue that this limitation of segmentation models stems from their insufficient generalization ability to address the distribution shift of segmentation targets across slices, cardiac phases, and disease conditions. To overcome this issue, we present a Multi-Disease-Aware Training Strategy (MTS) and restructure the introduced CMRI datasets into multi-disease datasets. Additionally, we propose a specialized data processing technique for preprocessing input images to support the MTS. To validate the effectiveness of our method, we performed control group experiments and cross-validation tests. The experimental results show that (1) network models trained using our proposed strategy achieved superior segmentation performance, particularly in RV segmentation, and (2) these networks exhibited robust performance even when applied to data from unknown diseases.

Real-time Global Illumination for Dynamic 3D Gaussian Scenes

Chenxiao Hu,Meng Gai,Guoping Wang,Sheng Li

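Task: Propose a real-time global illumination method and a rendering pipeline for dynamic 3D Gaussian models and meshes.

Motivation: Address the challenge of achieving high-quality lighting effects together with real-time performance in dynamic scenes.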

Details

Method: A fast compound stochastic ray-tracing algorithm built on a surface light transport model, together with an optimized 3D Gaussian rasterizer. Result: Real-time rendering of dynamic scenes (over 40 fps), with support for interactive material editing and dynamic multi-light illumination. Conclusion: The work demonstrates the potential of 3D Gaussians in real-time applications with dynamic lighting and offers insights into performance and optimization. Abstract: We present a real-time global illumination approach along with a pipeline for dynamic 3D Gaussian models and meshes. Building on a formulated surface light transport model for 3D Gaussians, we address key performance challenges with a fast compound stochastic ray-tracing algorithm and an optimized 3D Gaussian rasterizer. Our pipeline integrates multiple real-time techniques to accelerate performance and achieve high-quality lighting effects. Our approach enables real-time rendering of dynamic scenes with interactively editable materials and dynamic lighting in diverse multi-light settings, capturing mutual multi-bounce light transport (indirect illumination) between 3D Gaussians and meshes. Additionally, we present a real-time renderer with an interactive user interface, validating our approach and demonstrating its practicality and high efficiency with over 40 fps in scenes including both 3D Gaussians and meshes. Furthermore, our work highlights the potential of 3D Gaussians in real-time applications with dynamic lighting, offering insights into performance and optimization.

Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning

Jianjian Yin,Tao Chen,Gensheng Pei,Yazhou Yao,Liqiang Nie,Xiansheng Hua

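Task: Propose a Multi-Constraint Consistency Learning (MCCL) approach for the staged enhancement of the encoder and decoder in semi-supervised semantic segmentation.

Motivation: Existing methods focus mainly on image-augmentation based prediction consistency and fail to make full use of potential supervisory information.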

Details

Method: A feature knowledge alignment (FKA) strategy and a self-adaptive intervention (SAI) module are designed to strengthen consistency learning in the encoder and decoder, respectively. Result: New state-of-the-art performance on the Pascal VOC2012 and Cityscapes datasets. Conclusion: By enhancing the encoder and decoder in stages, MCCL significantly improves semi-supervised semantic segmentation. Abstract: Consistency regularization has prevailed in semi-supervised semantic segmentation and achieved promising performance. However, existing methods typically concentrate on enhancing image-augmentation based prediction consistency and optimizing the segmentation network as a whole, resulting in insufficient utilization of potential supervisory information. In this paper, we propose a Multi-Constraint Consistency Learning (MCCL) approach to facilitate the staged enhancement of the encoder and decoder. Specifically, we first design a feature knowledge alignment (FKA) strategy to promote the feature consistency learning of the encoder from image augmentation. Our FKA encourages the encoder to derive consistent features for strongly and weakly augmented views from the perspectives of point-to-point alignment and prototype-based intra-class compactness. Moreover, we propose a self-adaptive intervention (SAI) module to increase the discrepancy of aligned intermediate feature representations, promoting feature-perturbation based prediction consistency learning. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on the Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance. The source code and models are made available at https://github.com/NUST-Machine-Intelligence-Laboratory/MCCL.

Cat-AIR: Content and Task-Aware All-in-One Image Restoration

Jiachen Jiang,Tianyu Ding,Ke Zhang,Jinxin Zhou,Tianyi Chen,Ilya Zharkov,Zhihui Zhu,Luming Liang

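Task: Develop Cat-AIR, an all-in-one image restoration model that handles multiple degradation types without prior knowledge of the degradation.

Motivation: Existing methods struggle to handle multiple degradation types both efficiently and effectively, so a more flexible and efficient solution is needed.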

Details

Method: Cat-AIR uses an alternating spatial-channel attention mechanism that combines cross-layer channel attention and cross-feature spatial attention, allocating computation according to content and task complexity. Result: Cat-AIR achieves state-of-the-art performance across a range of restoration tasks with less computation. Conclusion: Cat-AIR sets a new benchmark for efficient all-in-one image restoration. Abstract: All-in-one image restoration seeks to recover high-quality images from various types of degradation using a single model, without prior knowledge of the corruption source. However, existing methods often struggle to effectively and efficiently handle multiple degradation types. We present Cat-AIR, a novel Content And Task-aware framework for All-in-one Image Restoration. Cat-AIR incorporates an alternating spatial-channel attention mechanism that adaptively balances the local and global information for different tasks. Specifically, we introduce cross-layer channel attentions and cross-feature spatial attentions that allocate computations based on content and task complexity. Furthermore, we propose a smooth learning strategy that allows for seamless adaptation to new restoration tasks while maintaining performance on existing ones. Extensive experiments demonstrate that Cat-AIR achieves state-of-the-art results across a wide range of restoration tasks while requiring fewer FLOPs than previous methods, establishing new benchmarks for efficient all-in-one image restoration.

PathoHR: Breast Cancer Survival Prediction on High-Resolution Pathological Images

Yang Luo,Shiru Wang,Jun Liu,Jiaxuan Xiao,Rundong Xue,Zeyu Zhang,Hao Zhang,Yu Lu,Yang Zhao,Yutong Xie

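Task: Propose PathoHR, a new method for improving the accuracy of breast cancer survival prediction.

Motivation: Because of tumor heterogeneity, it is challenging to extract representative features from pathology images that reflect a tumor's aggressiveness and survival outcomes.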

Details

Method: A high-resolution Vision Transformer (ViT) is incorporated to enhance WSI representation, similarity metrics are systematically evaluated to optimize feature learning, and the effect of enhancing small image patches is verified. Result: By combining enhanced image resolution with optimized feature learning, PathoHR achieves more accurate and efficient breast cancer survival prediction. Conclusion: PathoHR offers a promising approach for computational pathology that predicts breast cancer survival more accurately. Abstract: Breast cancer survival prediction in computational pathology presents a remarkable challenge due to tumor heterogeneity. For instance, different regions of the same tumor in the pathology image can show distinct morphological and molecular characteristics. This makes it difficult to extract representative features from whole slide images (WSIs) that truly reflect the tumor's aggressive potential and likely survival outcomes. In this paper, we present PathoHR, a novel pipeline for accurate breast cancer survival prediction that enhances pathological images of any size to enable more effective feature learning. Our approach entails (1) the incorporation of a plug-and-play high-resolution Vision Transformer (ViT) to enhance patch-wise WSI representation, enabling more detailed and comprehensive feature extraction, (2) the systematic evaluation of multiple advanced similarity metrics for comparing WSI-extracted features, optimizing the representation learning process to better capture tumor characteristics, and (3) the demonstration that smaller image patches enhanced following the proposed pipeline can achieve equivalent or superior prediction accuracy compared to raw larger patches, while significantly reducing computational overhead. Experimental findings validate that PathoHR provides a potential way of integrating enhanced image resolution with optimized feature learning to advance computational pathology, offering a promising direction for more accurate and efficient breast cancer survival prediction. Code will be available at https://github.com/AIGeeksGroup/PathoHR.

Metaphor-based Jailbreaking Attacks on Text-to-Image Models

Chenyu Zhang,Yiwen Ma,Lanjun Wang,Wenhui Li,Yi Tu,An-An Liu

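Task: Propose a metaphor-based jailbreaking attack method (MJA) that balances attack effectiveness and query efficiency.

Motivation: Existing LLM-based attack methods lack explicit guidance and rely on a large number of queries, which limits their practicality.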

Details

Method: MJA consists of an LLM-based multi-agent generation module (MLAG) and an adversarial prompt optimization module (APO); it generates diverse adversarial prompts through metaphors and optimizes attack efficiency. Result: Experiments show that MJA achieves higher attack effectiveness with fewer queries, and that its adversarial prompts transfer strongly. Conclusion: MJA effectively addresses the limitations of existing methods and offers a new perspective on the safety of T2I models. Abstract: To mitigate misuse, text-to-image (T2I) models commonly incorporate safety filters to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attack methods use LLMs to generate adversarial prompts that effectively bypass safety filters while generating sensitive images, revealing the safety vulnerabilities within the T2I model. However, existing LLM-based attack methods lack explicit guidance, relying on substantial queries to achieve a successful attack, which limits their practicality in real-world scenarios. In this work, we introduce MJA, a metaphor-based jailbreaking attack method inspired by the Taboo game, aiming to balance attack effectiveness and query efficiency by generating metaphor-based adversarial prompts. Specifically, MJA consists of two modules: an LLM-based multi-agent generation module (MLAG) and an adversarial prompt optimization module (APO). MLAG decomposes the generation of metaphor-based adversarial prompts into three subtasks: metaphor retrieval, context matching, and adversarial prompt generation. Subsequently, MLAG coordinates three LLM-based agents to generate diverse adversarial prompts by exploring various metaphors and contexts. To enhance the attack efficiency, APO first trains a surrogate model to predict the attack results of adversarial prompts and then designs an acquisition strategy to adaptively identify optimal adversarial prompts. Experiments demonstrate that MJA achieves better attack effectiveness while requiring fewer queries compared to baseline methods. Moreover, our adversarial prompts exhibit strong transferability across various open-source and commercial T2I models. This paper includes model-generated content that may contain offensive or distressing material.

Anomaly Detection and Localization for Speech Deepfakes via Feature Pyramid Matching

Emma Coletta,Davide Salvi,Viola Negroni,Daniele Ugo Leonzio,Paolo Bestagini

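Task: Propose an interpretable one-class detection framework that reframes speech deepfake detection as an anomaly detection task.

Motivation: Existing speech deepfake detection methods rely on supervised learning and suffer from limited generalization and a lack of explainability.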

Details

Method: A one-class detection framework trained only on real speech is combined with a Student-Teacher Feature Pyramid Matching system and Discrepancy Scaling to produce interpretable anomaly maps in the time and frequency domains. Result: Experiments show that the method outperforms the baseline models, validating the framing of speech deepfake detection as an anomaly detection problem. Conclusion: The proposed framework improves detection performance while also providing interpretability, offering a new approach to speech deepfake detection. Abstract: The rise of AI-driven generative models has enabled the creation of highly realistic speech deepfakes - synthetic audio signals that can imitate target speakers' voices - raising critical security concerns. Existing methods for detecting speech deepfakes primarily rely on supervised learning, which suffers from two critical limitations: limited generalization to unseen synthesis techniques and a lack of explainability. In this paper, we address these issues by introducing a novel interpretable one-class detection framework, which reframes speech deepfake detection as an anomaly detection task. Our model is trained exclusively on real speech to characterize its distribution, enabling the classification of out-of-distribution samples as synthetically generated. Additionally, our framework produces interpretable anomaly maps during inference, highlighting anomalous regions across both time and frequency domains. This is done through a Student-Teacher Feature Pyramid Matching system, enhanced with Discrepancy Scaling to improve generalization capabilities across unseen data distributions. Extensive evaluations demonstrate the superior performance of our approach compared to the considered baselines, validating the effectiveness of framing speech deepfake detection as an anomaly detection problem.
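
The way a student-teacher feature pyramid can yield a time-frequency anomaly map may be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each pyramid level is a [time][frequency] grid of feature vectors, measures the per-cell discrepancy between L2-normalized teacher and student features, upsamples each level by nearest neighbor, and sums the levels. The summation, the function names, and the omission of Discrepancy Scaling are all assumptions.

```python
import math

def _normalize(vec):
    # L2-normalize a feature vector; zero vectors are returned unchanged.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def layer_discrepancy(teacher, student):
    """Per-cell 0.5 * ||t_hat - s_hat||^2 for two [time][freq] grids of vectors."""
    out = []
    for t_row, s_row in zip(teacher, student):
        row = []
        for t_vec, s_vec in zip(t_row, s_row):
            t_hat, s_hat = _normalize(t_vec), _normalize(s_vec)
            row.append(0.5 * sum((a - b) ** 2 for a, b in zip(t_hat, s_hat)))
        out.append(row)
    return out

def upsample_nearest(grid, n_time, n_freq):
    # Nearest-neighbor resize of a 2D grid to (n_time, n_freq).
    src_t, src_f = len(grid), len(grid[0])
    return [[grid[i * src_t // n_time][j * src_f // n_freq]
             for j in range(n_freq)] for i in range(n_time)]

def anomaly_map(teacher_pyramid, student_pyramid, n_time, n_freq):
    """Sum the upsampled per-level discrepancies into one anomaly map."""
    total = [[0.0] * n_freq for _ in range(n_time)]
    for t_feats, s_feats in zip(teacher_pyramid, student_pyramid):
        up = upsample_nearest(layer_discrepancy(t_feats, s_feats), n_time, n_freq)
        for i in range(n_time):
            for j in range(n_freq):
                total[i][j] += up[i][j]
    return total
```

Because the student is trained only on real speech, cells where it fails to reproduce the teacher's features receive high values, which is what makes the map readable as a localization of suspicious time-frequency regions.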

Multiple-Particle Autofocusing Algorithm Using Axial Resolution and Morphological Analyses Based on Digital Holography

Wei-Na Li,Yi Zhou,Jiatai Chen,Hongjie Ou,XiangSheng Xie

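Task: Propose an autofocusing algorithm that obtains, relatively accurately, the 3D position of each particle (particularly its axial position) and the particle count in a dense transparent particle solution from its hologram.

Motivation: Solve the autofocusing problem for dense particles and provide accurate axial position information.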

Details

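Method: First, morphological analysis and constrained intensity are applied to the raw reconstructed images to obtain information on candidate focused particles; second, axial resolution is used to identify the real focused particles; finally, all focused particles are determined based on the mean intensity and equivalent diameter of each candidate focused particle. Result: The method rapidly provides relatively accurate axial position information and solves the autofocusing problem for dense particles. Conclusion: The proposed method effectively solves the dense-particle autofocusing problem and provides accurate axial position information. Abstract: We propose an autofocusing algorithm to obtain, relatively accurately, the 3D position of each particle, particularly its axial location, and the particle number of a dense transparent particle solution via its hologram. First, morphological analyses and constrained intensity are used on raw reconstructed images to obtain information on candidate focused particles. Second, axial resolution is used to obtain the real focused particles. Based on the mean intensity and equivalent diameter of each candidate focused particle, all focused particles are eventually secured. Our proposed method can rapidly provide relatively accurate ground-truth axial positions to solve the autofocusing problem that occurs with dense particles.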
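
The two per-candidate descriptors named above, mean intensity and equivalent diameter, can be computed as in the sketch below. It assumes a candidate particle is given as the list of pixels in one connected region and uses the standard definition of equivalent diameter (the diameter of a circle with the same area as the region). The input layout and function name are hypothetical, and how the paper thresholds or combines these values is not stated in the abstract, so no selection rule is shown.

```python
import math

def particle_descriptors(pixels):
    """Return (mean_intensity, equivalent_diameter) for one candidate region.

    pixels: list of (row, col, intensity) tuples for a single connected
    region in a reconstructed image plane (assumed layout).
    """
    area = len(pixels)  # region area in pixels
    mean_intensity = sum(p[2] for p in pixels) / area
    # Diameter of the circle whose area equals the region's area.
    equivalent_diameter = math.sqrt(4.0 * area / math.pi)
    return mean_intensity, equivalent_diameter
```

Both descriptors are in pixel units here; converting the diameter to physical units would require the reconstruction's pixel pitch, which the abstract does not give.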

Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for FCL

Xiaoming Qi,Jingyang Zhang,Huazhu Fu,Guanyu Yang,Shuo Li,Yueming Jin

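Task: Propose a new server-side federated continual learning pattern (FedDAH) to address the catastrophic forgetting and biased optimization caused by dynamic, asynchronous task streams in the medical domain.

Motivation: Existing server-side FCL methods face catastrophic forgetting and biased optimization when handling dynamic, asynchronous task streams, especially in medical scenarios.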

Details

Method: A dynamic allocation hypernetwork (DAHyper) manages the mapping between tasks and model parameters, and adaptive model recalibration (AMR) is introduced to address biased optimization. Result: Experiments on the AMOS dataset show that FedDAH outperforms other FCL methods. Conclusion: FedDAH effectively addresses the challenges posed by dynamic task streams in the medical domain and improves federated continual learning. Abstract: Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenarios. Existing server-side FCL methods in the natural domain construct a continually learnable server model by client aggregation on all involved tasks. However, they are challenged by: (1) catastrophic forgetting of previously learned tasks, leading to error accumulation in the server model and making it difficult to sustain comprehensive knowledge across all tasks; (2) biased optimization due to asynchronous tasks handled across different clients, leading to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step to propose a novel server-side FCL pattern in the medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH). It facilitates collaborative learning under the distinct and dynamic task streams across clients. To alleviate catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper) where a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH over other FCL methods on sites with different task streams. The code is available at https://github.com/jinlab-imvr/FedDAH.

WISE: A Framework for Gigapixel Whole-Slide-Image Lossless Compression

Yu Mao,Jun Wang,Nan Guan,Chun Jason Xue

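Task: Study methods for the lossless compression of whole-slide images (WSIs).

Motivation: Although WSIs avoid the need to physically store slides, their data volume is enormous, making storage and maintenance costly and unsustainable.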

Details

Method: A lossless compressor named WISE is developed; it uses a hierarchical encoding strategy to extract effective bits and reduce image entropy, combined with a dictionary-based method to handle irregular frequency patterns. Result: WISE compresses gigapixel WSI images by 36 times on average and by up to 136 times. Conclusion: WISE is a simple and effective lossless compression method that addresses the shortcomings of existing compressors on WSI images. Abstract: Whole-Slide Images (WSIs) have revolutionized medical analysis by presenting high-resolution images of the whole tissue slide. Despite avoiding the physical storage of the slides, WSIs require considerable data volume, which makes the storage and maintenance of WSI records costly and unsustainable. To this end, this work presents the first investigation of lossless compression of WSI images. Interestingly, we find that most existing compression methods fail to compress WSI images effectively. Furthermore, our analysis reveals that the failure of existing compressors is mainly due to information irregularity in WSI images. To resolve this issue, we developed a simple yet effective lossless compressor called WISE, specifically designed for WSI images. WISE employs a hierarchical encoding strategy to extract effective bits, reducing the entropy of the image, and then adopts a dictionary-based method to handle the irregular frequency patterns. Through extensive experiments, we show that WISE can effectively compress gigapixel WSI images by 36 times on average and by up to 136 times.

Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving

Junhao Ge,Zuhong Liu,Longteng Fan,Yifan Jiang,Jiaqi Su,Yiming Li,Zhejun Zhang,Siheng Chen

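Task: Develop SceneCrafter, an autonomous driving simulator based on 3D Gaussian Splatting, to efficiently generate realistic driving logs across diverse traffic scenarios.

Motivation: Existing autonomous driving simulators fall short in sensor realism, efficiency, and interactivity, and cannot meet the demand of end-to-end models for diverse data.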

Details

Method: The SceneCrafter simulator is built on 3D Gaussian Splatting (3DGS) and supports efficient generation of realistic driving logs as well as closed-loop evaluation. Result: Experiments show that SceneCrafter efficiently generates diverse data and significantly improves the generalization of end-to-end models. Conclusion: SceneCrafter is an efficient, realistic, and highly interactive autonomous driving simulator suitable for data generation and model evaluation. Abstract: End-to-end (E2E) autonomous driving (AD) models require diverse, high-quality data to perform well across various driving scenarios. However, collecting large-scale real-world data is expensive and time-consuming, making high-fidelity synthetic data essential for enhancing data diversity and model robustness. Existing driving simulators for synthetic data generation have significant limitations: game-engine-based simulators struggle to produce realistic sensor data, while NeRF-based and diffusion-based methods face efficiency challenges. Additionally, recent simulators designed for closed-loop evaluation provide limited interaction with other vehicles, failing to simulate complex real-world traffic dynamics. To address these issues, we introduce SceneCrafter, a realistic, interactive, and efficient AD simulator based on 3D Gaussian Splatting (3DGS). SceneCrafter not only efficiently generates realistic driving logs across diverse traffic scenarios but also enables robust closed-loop evaluation of end-to-end models. Experimental results demonstrate that SceneCrafter serves as both a reliable evaluation platform and an efficient data generator that significantly improves end-to-end model generalization.

Efficient Deep Learning Approaches for Processing Ultra-Widefield Retinal Imaging

Siwon Kim,Wooyung Yun,Jeongbin Oh,Soomok Lee

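Task: Apply deep learning methods to classify an ultra-widefield (UWF) retinal imaging dataset.

Motivation: UWF images can accurately diagnose various retinal diseases, but manual processing is time-consuming and labor-intensive, and automation faces two challenges: computational resource requirements and the accuracy of CFP methods.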

Details

Method: Strategic data augmentation and model ensembling are used to solve the problem efficiently on low-performance computing units. Result: These methods are shown to make effective use of UWF images while balancing performance and computational resources. Conclusion: The approach offers a viable solution for medical settings with limited resources. Abstract: Deep learning has emerged as the predominant solution for classifying medical images. We intend to apply these developments to the ultra-widefield (UWF) retinal imaging dataset. Since UWF images can accurately diagnose various retinal diseases, it is very important to classify them accurately and prevent the diseases with early treatment. However, processing images manually is time-consuming and labor-intensive, and there are two challenges to automating this process. First, high performance usually requires high computational resources. Artificial intelligence medical technology is better suited for places with limited medical resources, but using high-performance processing units in such environments is challenging. Second, there is the problem of the accuracy of colour fundus photography (CFP) methods. In general, the UWF method provides more information for retinal diagnosis than the CFP method, but most of the research has been conducted based on the CFP method. Thus, we demonstrate that these problems can be efficiently addressed on low-performance units using methods such as strategic data augmentation and model ensembles, which balance performance and computational resources while utilizing UWF images.

SNRAware: Improved Deep Learning MRI Denoising with SNR Unit Training and G-factor Map Augmentation

Hui Xue,Sarah M. Hooper,Iain Pierce,Rhodri H. Davies,John Stairs,Joseph Naegele,Adrienne E. Campbell-Washburn,Charlotte Manisty,James C. Moon,Thomas A. Treibel,Peter Kellman,Michael S. Hansen

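Task: Develop and evaluate a new deep learning MR denoising method that uses quantitative noise distribution information from the reconstruction process to improve denoising performance and generalization.

Motivation: Improve the performance and generalization of denoising models by simulating large, high-quality, and diverse synthetic datasets and by supplying quantitative information about the noise distribution.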

Details

Method: Fourteen different transformer and convolutional models, based on two backbone architectures, are trained on a large dataset (2,885,236 images), and the SNRAware training scheme is proposed. Result: The best model performs strongly in in-distribution testing; in out-of-distribution testing, it generalizes well across imaging sequences, changing contrast, different anatomies, and field strengths, and substantially improves CNR. Conclusion: The proposed SNRAware training scheme significantly improves the performance and generalization of deep learning MR denoising models and applies to a range of clinical scenarios. Abstract: To develop and evaluate a new deep learning MR denoising method that leverages quantitative noise distribution information from the reconstruction process to improve denoising performance and generalization. This retrospective study trained 14 different transformer and convolutional models with two backbone architectures on a large dataset of 2,885,236 images from 96,605 cardiac retro-gated cine complex series acquired at 3T. The proposed training scheme, termed SNRAware, leverages knowledge of the MRI reconstruction process to improve denoising performance by simulating large, high-quality, and diverse synthetic datasets, and providing quantitative information about the noise distribution to the model. In-distribution testing was performed on a hold-out dataset of 3000 samples with performance measured using PSNR and SSIM, with ablation comparison without the noise augmentation. Out-of-distribution tests were conducted on cardiac real-time cine, first-pass cardiac perfusion, and neuro and spine MRI, all acquired at 1.5T, to test model generalization across imaging sequences, dynamically changing contrast, different anatomies, and field strengths. The best model found in the in-distribution test generalized well to out-of-distribution samples, delivering 6.5x and 2.9x CNR improvement for real-time cine and perfusion imaging, respectively. Further, a model trained with 100% cardiac cine data generalized well to a T1 MPRAGE neuro 3D scan and T2 TSE spine MRI.

Decoupling Angles and Strength in Low-rank Adaptation

Massimo Bini,Leander Girrbach,Zeynep Akata

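Task: Propose DeLoRA, a new parameter-efficient finetuning method that addresses the limited robustness and limited adaptation expressivity of existing methods.

Motivation: Existing parameter-efficient finetuning methods such as LoRA lack robustness to hyperparameter choices and extended training, while other methods such as ETHER have limited adaptation expressivity.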

Details

Method: By normalizing and scaling learnable low-rank matrices, DeLoRA decouples angular learning from adaptation strength, which improves robustness. Result: On image generation, natural language understanding, and instruction tuning tasks, DeLoRA matches or surpasses existing methods while being more robust. Conclusion: DeLoRA is an efficient and robust parameter-efficient finetuning method suited to a variety of downstream tasks. Abstract: Parameter-Efficient FineTuning (PEFT) methods have recently gained significant popularity thanks to the widespread availability of large-scale pretrained models. These methods allow for quick adaptation to downstream tasks with minimal computational cost. However, popular finetuning methods such as LoRA exhibit limited robustness when it comes to hyperparameter choices or extended training regimes, preventing optimal out-of-the-box performance. In contrast, bounded approaches, such as ETHER, provide greater robustness but are limited to extremely low-rank adaptations and fixed-strength transformations, reducing their adaptation expressive power. In this work, we propose Decoupled Low-rank Adaptation (DeLoRA), a novel finetuning method that normalizes and scales learnable low-rank matrices. By bounding the distance of the transformation, DeLoRA effectively decouples the angular learning from the adaptation strength, enhancing robustness without compromising performance. Through evaluations on subject-driven image generation, natural language understanding, and instruction tuning, we show that DeLoRA matches or surpasses the performance of competing PEFT methods, while exhibiting stronger robustness. Code is available at https://github.com/ExplainableML/DeLoRA.
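
The idea of normalizing and scaling low-rank factors can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact parameterization: the update is taken to be the average of r unit-norm rank-one terms multiplied by a single strength `lam`. The function names, the 1/r averaging, and `lam` are hypothetical, and the published method may include further scaling terms.

```python
import math

def delora_style_update(b_cols, a_rows, lam):
    """Build a weight update from r normalized rank-one terms.

    b_cols: r vectors of length d_out (columns of the assumed factor B).
    a_rows: r vectors of length d_in (rows of the assumed factor A).
    lam:    scalar adaptation strength (assumed name).
    """
    r = len(b_cols)
    d_out, d_in = len(b_cols[0]), len(a_rows[0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for b, a in zip(b_cols, a_rows):
        norm_b = math.sqrt(sum(x * x for x in b))
        norm_a = math.sqrt(sum(x * x for x in a))
        # Dividing by both norms keeps only the direction of each rank-one term.
        scale = lam / (r * norm_b * norm_a)
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += scale * b[i] * a[j]
    return delta

def frobenius_norm(matrix):
    return math.sqrt(sum(x * x for row in matrix for x in row))
```

With this form, each normalized rank-one term has Frobenius norm 1, so by the triangle inequality the whole update has norm at most `lam`. Rescaling a factor leaves the update unchanged, which is one way to read the decoupling of direction (angles) from strength described in the abstract.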

ZECO: ZeroFusion Guided 3D MRI Conditional Generation

Feiran Wang,Bin Duan,Jiachen Tao,Nikhil Sharma,Dawen Cai,Yan Yan

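Task: Propose ZECO, a ZeroFusion guided 3D MRI conditional generation framework that generates high-fidelity MRI images with corresponding 3D segmentation masks to mitigate data scarcity.

Motivation: Medical image segmentation is essential for MRI diagnosis and treatment planning, but obtaining precise segmentation masks requires expertise and considerable time, which keeps datasets small in clinical practice.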

Details

Method: A Spatial Transformation Module encodes MRI images into a compact latent space, and the ZeroFusion method progressively maps 3D masks to MRI images, avoiding overfitting and improving model performance. Result: ZECO outperforms existing models in quantitative and qualitative evaluations on brain MRI datasets across multiple modalities, demonstrating a strong ability to generate high-quality MRI images. Conclusion: Through its conditional generation framework, ZECO effectively mitigates data scarcity and offers a new approach for medical image segmentation. Abstract: Medical image segmentation is crucial for enhancing diagnostic accuracy and treatment planning in Magnetic Resonance Imaging (MRI). However, acquiring precise lesion masks for segmentation model training demands specialized expertise and significant time investment, leading to a small dataset scale in clinical practice. In this paper, we present ZECO, a ZeroFusion guided 3D MRI conditional generation framework that extracts, compresses, and generates high-fidelity MRI images with corresponding 3D segmentation masks to mitigate data scarcity. To effectively capture inter-slice relationships within volumes, we introduce a Spatial Transformation Module that encodes MRI images into a compact latent space for the diffusion process. Moving beyond unconditional generation, our novel ZeroFusion method progressively maps 3D masks to MRI images in latent space, enabling robust training on limited datasets while avoiding overfitting. ZECO outperforms state-of-the-art models in both quantitative and qualitative evaluations on brain MRI datasets across various modalities, showcasing its exceptional capability in synthesizing high-quality MRI images conditioned on segmentation masks.

GI-SLAM: Gaussian-Inertial SLAM

Xulang Liu,Ning Tan

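Task: Propose GI-SLAM, a 3D Gaussian Splatting SLAM system that incorporates inertial measurement unit (IMU) data to improve the accuracy, robustness, and efficiency of camera tracking.

Motivation: Existing 3D Gaussian Splatting SLAM methods overlook IMU data, which is a critical source of information for dense SLAM.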

Details

Method: An IMU loss is introduced that integrates IMU data seamlessly into the deep learning framework of 3D Gaussian Splatting SLAM, with support for multiple sensor configurations. Result: On the EuRoC and TUM-RGBD datasets, GI-SLAM achieves performance competitive with existing real-time methods. Conclusion: By incorporating IMU data, GI-SLAM significantly improves 3D Gaussian Splatting SLAM and works with a variety of sensor configurations. Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful representation of geometry and appearance for dense Simultaneous Localization and Mapping (SLAM). Through rapid, differentiable rasterization of 3D Gaussians, many 3DGS SLAM methods achieve near real-time rendering and accelerated training. However, these methods largely overlook inertial data, which is a critical piece of information collected from the inertial measurement unit (IMU). In this paper, we present GI-SLAM, a novel Gaussian-inertial SLAM system which consists of an IMU-enhanced camera tracking module and a realistic 3D Gaussian-based scene representation for mapping. Our method introduces an IMU loss that seamlessly integrates into the deep learning framework underpinning 3D Gaussian Splatting SLAM, effectively enhancing the accuracy, robustness, and efficiency of camera tracking. Moreover, our SLAM system supports a wide range of sensor configurations, including monocular, stereo, and RGBD cameras, both with and without IMU integration. Our method achieves competitive performance compared with existing state-of-the-art real-time methods on the EuRoC and TUM-RGBD datasets.

PALATE: Peculiar Application of the Law of Total Expectation to Enhance the Evaluation of Deep Generative Models

Tadeusz Dziarmaga,Marcin Kądziołka,Artur Kasymov,Marcin Mazur

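Task: Propose PALATE, a method for improving the evaluation of deep generative models (DGMs) that addresses how existing metrics trade off fidelity, diversity, and novelty.

Motivation: Current evaluation methods for deep generative models struggle to balance fidelity, diversity, and novelty, and existing solutions such as FLD face computational efficiency problems.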

Details

Method: PALATE is built on a peculiar application of the law of total expectation, combined with the MMD baseline metric and the DINOv2 feature extractor. Result: PALATE performs strongly in experiments, providing a computationally efficient and holistic evaluation framework that matches or surpasses existing methods. Conclusion: PALATE offers an efficient and holistic solution for evaluating deep generative models, with notable contributions to detecting sample memorization and assessing generalization. Abstract: Deep generative models (DGMs) have caused a paradigm shift in the field of machine learning, yielding noteworthy advancements in domains such as image synthesis, natural language processing, and other related areas. However, a comprehensive evaluation of these models that accounts for the trichotomy between fidelity, diversity, and novelty in generated samples remains a formidable challenge. A recently introduced solution that has emerged as a promising approach in this regard is the Feature Likelihood Divergence (FLD), a method that offers a theoretically motivated practical tool, yet also exhibits some computational challenges. In this paper, we propose PALATE, a novel enhancement to the evaluation of DGMs that addresses limitations of existing metrics. Our approach is based on a peculiar application of the law of total expectation to random variables representing accessible real data. When combined with the MMD baseline metric and DINOv2 feature extractor, PALATE offers a holistic evaluation framework that matches or surpasses state-of-the-art solutions while providing superior computational efficiency and scalability to large-scale datasets. Through a series of experiments, we demonstrate the effectiveness of the PALATE enhancement, contributing a computationally efficient, holistic evaluation approach that advances the field of DGM assessment, especially in detecting sample memorization and evaluating generalization capabilities.
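
The MMD baseline metric mentioned above compares two sets of feature vectors through a kernel. The sketch below shows only that baseline, as a biased estimate of squared MMD with a Gaussian (RBF) kernel; the kernel choice, the bandwidth `sigma`, and the function names are assumptions, the features are assumed to be precomputed (for example by a feature extractor such as DINOv2), and the PALATE enhancement itself is not reproduced here.

```python
import math

def rbf_kernel(x, y, sigma):
    # Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2)).
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(real_feats, gen_feats, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(p_set, q_set):
        total = sum(rbf_kernel(p, q, sigma) for p in p_set for q in q_set)
        return total / (len(p_set) * len(q_set))
    return (mean_kernel(real_feats, real_feats)
            + mean_kernel(gen_feats, gen_feats)
            - 2.0 * mean_kernel(real_feats, gen_feats))
```

A value near zero indicates that the generated features are distributed like the real ones under the chosen kernel, while larger values indicate a mismatch. On its own this does not distinguish novel samples from memorized training samples, which is the gap the abstract says PALATE targets.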

k-NN as a Simple and Effective Estimator of Transferability

Moein Sorkhei,Christos Matsoukas,Johan Fredin Haslum,Kevin Smith

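Task: Assess how accurately existing transfer learning metrics predict transfer performance in new settings.

Motivation: Study how transfer learning behaves in new settings (domain shift, different tasks, architecture changes) and test the predictive power of existing transferability metrics.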

Details

Method: Compare 23 transferability metrics across 16 datasets in more than 42,000 experiments. Result: None of the existing metrics performs well across the board, whereas a simple k-nearest-neighbor evaluation surpasses them while being more computationally efficient and easier to implement. Conclusion: k-NN evaluation predicts transfer performance better than existing metrics and is recommended as an alternative. Abstract: How well can one expect transfer learning to work in a new setting where the domain is shifted, the task is different, and the architecture changes? Many transfer learning metrics have been proposed to answer this question. But how accurate are their predictions in a realistic new setting? We conducted an extensive evaluation involving over 42,000 experiments comparing 23 transferability metrics across 16 different datasets to assess their ability to predict transfer performance. Our findings reveal that none of the existing metrics performs well across the board. However, we find that a simple k-nearest neighbor evaluation -- as is commonly used to evaluate feature quality for self-supervision -- not only surpasses existing metrics, but also offers better computational efficiency and ease of implementation.
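
To make the idea of a k-nearest-neighbor evaluation concrete, here is a minimal sketch that scores a pretrained model by the leave-one-out k-NN accuracy of its frozen features on the target dataset. The leave-one-out protocol, Euclidean distance, `k=3`, and the function name `knn_transferability` are assumptions for illustration; the abstract above does not specify the authors' exact protocol.

```python
from collections import Counter

def knn_transferability(features, labels, k=3):
    """Leave-one-out k-NN accuracy on frozen features of the target dataset."""
    correct = 0
    for i, (query, true_label) in enumerate(zip(features, labels)):
        # Squared Euclidean distance from the query to every other sample.
        dists = [
            (sum((q - f) ** 2 for q, f in zip(query, feat)), labels[j])
            for j, feat in enumerate(features)
            if j != i
        ]
        dists.sort(key=lambda pair: pair[0])
        # Majority vote among the k nearest neighbors.
        votes = Counter(label for _, label in dists[:k])
        predicted = votes.most_common(1)[0][0]
        correct += predicted == true_label
    return correct / len(features)

labels = [0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical features from model A: the two classes form separate clusters.
feats_a = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
# Hypothetical features from model B: the classes are interleaved along a line.
feats_b = [[0], [2], [4], [6], [1], [3], [5], [7]]

score_a = knn_transferability(feats_a, labels)
score_b = knn_transferability(feats_b, labels)
print(score_a, score_b)
```

Under this sketch, the model with the higher score would be ranked as the better candidate for transfer, and no fine-tuning is needed to obtain the score.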

Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding

Tianyu Chen,Xingcheng Fu,Yisen Gao,Haodong Qian,Yuecen Wei,Kun Yan,Haoyi Zhou,Jianxin Li

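Task: Develop a geometry-aware vision-language model (Galaxy-Walker) for universe-level vision understanding tasks.

Motivation: Current vision-language models are confined to Euclidean space and lack support for astrophysical geometries (such as spherical and hyperbolic spaces), which makes universe-level tasks hard to handle.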

Details

Method: Propose a geometry prompt that generates geometry tokens via random walks on a multi-scale physical graph, together with a geometry adapter that compresses and reshapes space anisotropy in a mixture-of-experts manner. Result: Galaxy-Walker excels at galaxy property estimation (R² scores up to 0.91) and morphology classification (F1 improvements of up to 0.17), significantly outperforming domain-specific models and general-purpose VLMs. Conclusion: Through its geometry-aware design, Galaxy-Walker addresses the geometric challenges of universe-level vision understanding and achieves state-of-the-art performance. Abstract: Modern vision-language models (VLMs) build their patch embeddings and convolution backbones within vector spaces, especially Euclidean ones, from their very foundation. When expanding VLMs to a galaxy scale for understanding astronomical phenomena, the integration of spherical space for planetary orbits and hyperbolic spaces for black holes raises two formidable challenges: (a) the current pre-training model is confined to Euclidean space rather than a comprehensive geometric embedding, and (b) the predominant architecture lacks suitable backbones for anisotropic physical geometries. In this paper, we introduce Galaxy-Walker, a geometry-aware VLM, for universe-level vision understanding tasks. We propose a geometry prompt that generates geometry tokens by random walks across diverse spaces on a multi-scale physical graph, along with a geometry adapter that compresses and reshapes the space anisotropy in a mixture-of-experts manner. Extensive experiments demonstrate the effectiveness of our approach, with Galaxy-Walker achieving state-of-the-art performance in both galaxy property estimation ($R^2$ scores up to $0.91$) and morphology classification tasks (up to $+0.17$ F1 improvement in challenging features), significantly outperforming both domain-specific models and general-purpose VLMs.

Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration

Taejin Jeong,Joohyeok Kim,Jaehoon Joo,Yeonwoo Jung,Hyeonmin Kim,Seong Jae Hwang

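Task: Propose a new framework, V-ViT, that improves calibration in glaucoma diagnosis by incorporating disease-specific characteristics.

Motivation: Glaucoma diagnosis is highly subjective and depends on many factors. Existing AI models have improved accuracy while their calibration has degraded, so a method is needed to address this.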

Details

Method: Propose the V-ViT framework, which integrates binocular data and metadata and introduces an MC dropout-based voting system to cope with high subjectivity. Result: V-ViT achieves state-of-the-art performance on all metrics, including accuracy, confirming its effectiveness on the calibration problem. Conclusion: By incorporating disease characteristics and handling subjectivity, V-ViT effectively addresses calibration in glaucoma diagnosis. Abstract: Glaucoma is an incurable ophthalmic disease that damages the optic nerve, leads to vision loss, and ranks among the leading causes of blindness worldwide. Diagnosing glaucoma typically involves fundus photography, optical coherence tomography (OCT), and visual field testing. However, the high cost of OCT often leads to reliance on fundus photography and visual field testing, both of which exhibit inherent inter-observer variability. This stems from glaucoma being a multifaceted disease that is influenced by various factors. As a result, glaucoma diagnosis is highly subjective, emphasizing the necessity of calibration, which aligns predicted probabilities with actual disease likelihood. Proper calibration is essential to prevent overdiagnosis or misdiagnosis, which are critical concerns for high-risk diseases. Although AI has significantly improved diagnostic accuracy, overconfidence in models has worsened calibration performance. Recent studies have begun focusing on calibration for glaucoma. Nevertheless, previous studies have not fully considered glaucoma's systemic nature and the high subjectivity in its diagnostic process. To overcome these limitations, we propose V-ViT (Voting-based ViT), a novel framework that enhances calibration by incorporating disease-specific characteristics. V-ViT integrates binocular data and metadata, reflecting the multifaceted nature of glaucoma diagnosis. Additionally, we introduce an MC dropout-based Voting System to address high subjectivity. Our approach achieves state-of-the-art performance across all metrics, including accuracy, demonstrating that our proposed methods are effective in addressing calibration issues. We validate our method using a custom dataset including binocular data.
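
The notion of calibration in the abstract above (predicted probabilities that match actual disease likelihood) can be illustrated with a minimal sketch of the expected calibration error (ECE) for binary predictions, plus a simple average over several stochastic forward passes as a stand-in for MC dropout voting. The averaging rule, the 10-bin ECE, the function names, and all numbers below are assumptions for illustration and are not taken from the paper, whose voting system and metrics may differ.

```python
def mc_vote(pass_probs):
    """Average the positive-class probability over several stochastic passes."""
    return sum(pass_probs) / len(pass_probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin-size-weighted gap between mean confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, pred == y))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += len(b) / len(probs) * abs(avg_conf - accuracy)
    return ece

# Made-up outputs of three dropout passes for each of four cases.
passes = [[0.7, 0.8, 0.75], [0.65, 0.85, 0.75], [0.75, 0.75, 0.75], [0.6, 0.9, 0.75]]
voted = [mc_vote(p) for p in passes]
labels = [1, 1, 1, 0]
overconfident = [0.99, 0.99, 0.99, 0.99]

print(expected_calibration_error(voted, labels))
print(expected_calibration_error(overconfident, labels))
```

Both sets of predictions have the same accuracy (three of four correct), but the overconfident one reports 0.99 confidence and therefore has a much larger ECE. This is the gap between accuracy and calibration that the abstract describes.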

Robust Tube-based Control Strategy for Vision-guided Autonomous Vehicles

Der-Hau Lee

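Task: Propose an interpolation tube-based constrained iterative linear quadratic regulator (itube-CILQR) algorithm for vision-based lane keeping in autonomous vehicles.

Motivation: Improve robustness during high-speed cornering, reduce system conservatism, and increase computational speed.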

Details

Method: Apply the itube-CILQR algorithm and verify its feasibility through numerical and vision-based experiments. Result: itube-CILQR takes 3.16 ms on average to generate a control signal and is better suited to lane keeping than variational CILQR-based methods and conventional MPC approaches. Conclusion: itube-CILQR performs better on the lane-keeping task with high computational efficiency, and the influence of conservatism on system behavior is examined through the interpolation variable trajectories. Abstract: A robust control strategy for autonomous vehicles can improve system stability, enhance riding comfort, and prevent driving accidents. This paper presents a novel interpolation tube-based constrained iterative linear quadratic regulator (itube-CILQR) algorithm for autonomous computer-vision-based vehicle lane-keeping. The goal of the algorithm is to enhance robustness during high-speed cornering on tight turns. The advantages of itube-CILQR over the standard tube-approach include reduced system conservatism and increased computational speed. Numerical and vision-based experiments were conducted to examine the feasibility of the proposed algorithm. The proposed itube-CILQR algorithm is better suited to vehicle lane-keeping than variational CILQR-based methods and model predictive control (MPC) approaches using a classical interior-point solver. Specifically, in evaluation experiments, itube-CILQR achieved an average runtime of 3.16 ms to generate a control signal to guide a self-driving vehicle; itube-MPC typically required a 4.67-times longer computation time to complete the same task. Moreover, the influence of conservatism on system behavior was investigated by exploring the interpolation variable trajectories derived from the proposed itube-CILQR algorithm during lane-keeping maneuvers.

Dual-domain Multi-path Self-supervised Diffusion Model for Accelerated MRI Reconstruction

Yuxuan Zhang,Jinkui Hao,Bo Zhou

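Task: Propose a new framework, DMSM, for accelerated MRI reconstruction that improves accuracy, efficiency, and explainability.

Motivation: Existing diffusion models rely on fully sampled data for training, are computationally expensive, and lack uncertainty estimation, which limits their clinical applicability.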

Details

Method: Combine a self-supervised dual-domain diffusion model training scheme, a lightweight hybrid attention network, and a multi-path inference strategy. Result: DMSM outperforms baseline methods on two human MRI datasets, particularly in preserving fine anatomical structures and suppressing artifacts under high acceleration factors. Conclusion: DMSM removes the dependency on fully sampled data and provides uncertainty estimates, improving clinical practicality. Abstract: Magnetic resonance imaging (MRI) is a vital diagnostic tool, but its inherently long acquisition times reduce clinical efficiency and patient comfort. Recent advancements in deep learning, particularly diffusion models, have improved accelerated MRI reconstruction. However, existing diffusion models often rely on fully sampled data for training, incur high computational costs, and lack uncertainty estimation, limiting their clinical applicability. To overcome these challenges, we propose a novel framework, called Dual-domain Multi-path Self-supervised Diffusion Model (DMSM), that integrates a self-supervised dual-domain diffusion model training scheme, a lightweight hybrid attention network for the reconstruction diffusion model, and a multi-path inference strategy, to enhance reconstruction accuracy, efficiency, and explainability. Unlike traditional diffusion-based models, DMSM eliminates the dependency on training from fully sampled data, making it more practical for real-world clinical settings. We evaluated DMSM on two human MRI datasets, demonstrating that it achieves favorable performance over several supervised and self-supervised baselines, particularly in preserving fine anatomical structures and suppressing artifacts under high acceleration factors. Additionally, our model generates uncertainty maps that correlate reasonably well with reconstruction errors, offering valuable clinically interpretable guidance and potentially enhancing diagnostic confidence.
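
One common way to obtain an uncertainty map from several inference runs is to reconstruct the same scan multiple times and take the per-pixel spread. The sketch below assumes that aggregation rule (per-pixel mean as the final image, per-pixel standard deviation as the uncertainty map). The abstract above does not state how DMSM combines its inference paths, so the rule, the function name `aggregate_paths`, and the toy 2x2 images are all hypothetical.

```python
from statistics import mean, pstdev

def aggregate_paths(reconstructions):
    """Combine several reconstructions of one image (lists of rows) into a
    per-pixel mean image and a per-pixel standard-deviation map."""
    n_rows = len(reconstructions[0])
    n_cols = len(reconstructions[0][0])
    mean_img = [
        [mean([r[i][j] for r in reconstructions]) for j in range(n_cols)]
        for i in range(n_rows)
    ]
    std_map = [
        [pstdev([r[i][j] for r in reconstructions]) for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return mean_img, std_map

# Three toy 2x2 reconstructions that disagree only in the bottom-right pixel.
paths = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.0], [3.0, 6.0]],
    [[1.0, 2.0], [3.0, 5.0]],
]
mean_img, std_map = aggregate_paths(paths)
print(mean_img)
print(std_map)
```

Pixels where the runs agree get zero uncertainty, while the pixel where they disagree stands out in the map. That is the kind of signal one would then compare against the actual reconstruction error.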

Learning to segment anatomy and lesions from disparately labeled sources in brain MRI

Meva Himmetoglu,Ilja Ciernik,Ender Konukoglu

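Task: Propose a method that segments both healthy tissue and lesions in brain MRI images.

Motivation: Because lesions disrupt the anatomy and jointly labeled training datasets are lacking, existing algorithms struggle to segment healthy tissue and lesions at the same time.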

Details

Method: Decouple healthy tissue and lesion segmentation into two paths, use multi-sequence acquisitions with an attention mechanism to merge information, and apply an image-specific adaptation at inference to reduce the adverse influence of lesion regions on healthy tissue predictions. Training combines meta-learning and co-training. Result: On a publicly available brain glioblastoma dataset, the model segments several anatomical structures and lesions better than existing methods. Conclusion: The method handles lesion-caused disruption of the anatomy effectively and achieves high segmentation performance without jointly labeled training data. Abstract: Segmenting healthy tissue structures alongside lesions in brain Magnetic Resonance Images (MRI) remains a challenge for today's algorithms due to lesion-caused disruption of the anatomy and lack of jointly labeled training datasets, where both healthy tissues and lesions are labeled on the same images. In this paper, we propose a method that is robust to lesion-caused disruptions and can be trained from disparately labeled training sets, i.e., without requiring jointly labeled samples, to automatically segment both. In contrast to prior work, we decouple healthy tissue and lesion segmentation into two paths to leverage multi-sequence acquisitions and merge information with an attention mechanism. During inference, an image-specific adaptation reduces adverse influences of lesion regions on healthy tissue predictions. During training, the adaptation is taken into account through meta-learning, and co-training is used to learn from disparately labeled training images. Our model shows improved performance on several anatomical structures and lesions on a publicly available brain glioblastoma dataset compared to state-of-the-art segmentation methods.

A semantic communication-based workload-adjustable transceiver for wireless AI-generated content (AIGC) delivery

Runze Cheng,Yao Sun,Lan Zhang,Lei Feng,Lei Zhang,Muhammad Ali Imran

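Task: Propose a semantic communication-based resource-aware workload-adjustable transceiver (ROUTE) for AI-generated content (AIGC) delivery in dynamic wireless networks.

Motivation: Address the challenges of delivering AIGC services over wireless networks, namely unstable channels, limited bandwidth resources, and unevenly distributed computational resources.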

Details

Method: Use semantic communication to prioritize the semantic information of generated content, and apply modified diffusion models to adjust the computing workload and semantic density. Result: Simulations show that ROUTE outperforms conventional AIGC approaches in latency and content quality. Conclusion: By optimizing semantic communication and the allocation of computational resources, ROUTE improves AIGC delivery performance in wireless networks. Abstract: With the significant advances in generative AI (GAI) and the proliferation of mobile devices, providing high-quality AI-generated content (AIGC) services via wireless networks is becoming the future direction. However, the primary challenges of AIGC service delivery in wireless networks lie in unstable channels, limited bandwidth resources, and unevenly distributed computational resources. In this paper, we employ semantic communication (SemCom) in diffusion-based GAI models to propose a Resource-aware wOrkload-adjUstable TransceivEr (ROUTE) for AIGC delivery in dynamic wireless networks. Specifically, to relieve the communication resource bottleneck, SemCom is utilized to prioritize semantic information of the generated content. Then, to improve computational resource utilization both at the edge and locally, and to reduce AIGC semantic distortion in transmission, modified diffusion-based models are applied to adjust the computing workload and semantic density in cooperative content generation. Simulations verify the superiority of our proposed ROUTE in terms of latency and content quality compared to conventional AIGC approaches.

AdaWorld: Learning Adaptable World Models with Latent Actions

Shenyuan Gao,Siyuan Zhou,Yilun Du,Jun Zhang,Chuang Gan

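Task: Propose AdaWorld, an innovative world model learning approach that enables efficient adaptation.

Motivation: Existing world models depend on large amounts of action-labeled data and costly training, so they adapt poorly to new environments through limited interactions, which restricts their broader use.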

Details

Method: Extract latent actions from videos in a self-supervised manner, capturing the key transitions between frames, and develop an autoregressive world model conditioned on these latent actions. Result: AdaWorld performs strongly in simulation quality and visual planning and adapts efficiently to new actions. Conclusion: By learning latent actions through self-supervision, AdaWorld significantly improves the adaptability and performance of world models. Abstract: World models aim to learn action-controlled prediction models and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action-labeled data and costly training, making it challenging to adapt to novel environments with heterogeneous actions through limited interactions. This limitation can hinder their applicability across broader domains. To overcome this challenge, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.