2025 03 26

SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment

Ruoxi Cheng,Shuirong Cao

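Task: Propose SRMIR, a method built on shadow reward models and introspective reasoning, to address limitations in aligning large language models (LLMs) with human preferences.

Motivation: Current alignment methods rely on costly human annotation, incur an alignment tax, and produce shallow alignment that is vulnerable to jailbreak attacks; alignment datasets are also unevenly distributed.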

Details

Method: Construct a balanced safety Chain of Draft (CoD) dataset, train a set of specialized reward models, and optimize the policy with Group Relative Policy Optimization (GRPO).

Result: Experiments show that SRMIR significantly outperforms existing methods; the categorized integration strategy achieves better alignment at a higher computational cost.

Conclusion: Through introspective reasoning and shadow reward models, SRMIR effectively addresses these alignment problems and improves model safety and performance.

Abstract: Aligning large language models (LLMs) with human preferences and values is vital for their application. However, current alignment methods face three main limitations: (1) reliance on costly human annotation; (2) alignment tax; (3) shallow alignment vulnerable to jailbreak attacks. Additionally, current alignment datasets often suffer from uneven distributions, leading to overrepresentation of some topics and neglect of others. To address these issues, we propose SRMIR (Shadow Reward Models Based on Introspective Reasoning), inspired by shadow models in membership inference attacks. We first construct a balanced safety Chain of Draft (CoD) dataset across seven harmful types, using structured prompts that leverage the introspective reasoning capabilities of LLMs, then train a set of specialized reward models to guide policy optimization through Group Relative Policy Optimization (GRPO). We apply two strategies, a linear combination and a categorized approach, to integrate shadow reward models for policy optimization. By comparison, we find that the latter achieves superior alignment despite higher computational costs. Experiments across several LLMs demonstrate that SRMIR significantly outperforms existing methods.
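
The linear-combination strategy and the group-relative step of GRPO mentioned above can be illustrated with a minimal sketch. It assumes each shadow reward model returns a scalar score per harm category; the category names, weights, and scores are hypothetical, and the use of the population standard deviation is a choice made here for illustration, not a detail taken from the paper.

```python
from statistics import mean, pstdev

def combine_rewards(scores, weights):
    """Linear combination of per-category shadow reward scores (weights are assumed)."""
    return sum(weights[category] * score for category, score in scores.items())

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical scores from two shadow reward models for three sampled responses.
group_scores = [
    {"violence": 0.9, "privacy": 0.8},
    {"violence": 0.2, "privacy": 0.6},
    {"violence": 0.5, "privacy": 0.1},
]
weights = {"violence": 0.5, "privacy": 0.5}  # hypothetical equal weighting

rewards = [combine_rewards(s, weights) for s in group_scores]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.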

LookAhead Tuning: Safer Language Models via Partial Answer Previews

Kangwei Liu,Mengru Wang,Yujie Luo,Lin Yuan,Mengshu Sun,Ningyu Zhang,Lei Liang,Zhiqiang Zhang,Jun Zhou,Huajun Chen

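Task: Preserve the safety of large language models during fine-tuning by means of LookAhead Tuning.

Motivation: Fine-tuning large language models (LLMs) weakens their existing safety alignment, so a method is needed that keeps models safe while adapting them to specific domains.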

Details

Method: Propose LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes.

Result: Experiments show that LookAhead Tuning effectively maintains model safety without harming downstream task performance.

Conclusion: LookAhead Tuning is a reliable and efficient solution for the safe and effective adaptation of LLMs.

Abstract: Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often undermines their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes. Both methods aim to preserve the model's inherent safety mechanisms by minimizing perturbations to initial token distributions. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs. Code is released at https://github.com/zjunlp/LookAheadTuning.
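
One plausible reading of "previewing partial answer prefixes" is that the first few tokens of each target answer are exposed in the training prompt. The sketch below illustrates that reading only: the function name `preview_prefix`, the prompt template, the whitespace tokenization, and the prefix length `m` are all assumptions, not the released implementation.

```python
def preview_prefix(instruction, answer, m=6):
    """Return a training example whose instruction previews the first m words of the answer.

    Whitespace splitting stands in for a real tokenizer (an assumption).
    """
    prefix = " ".join(answer.split()[:m])
    new_instruction = f"{instruction}\nThe answer begins with: {prefix}"
    return {"instruction": new_instruction, "output": answer}

example = preview_prefix(
    "Solve 2+2.",
    "The result of the sum is four.",
    m=3,
)
print(example["instruction"])
```

Because the prompt already reveals how the answer starts, the loss on the first answer tokens should be small, which matches the stated goal of minimizing perturbations to the initial token distributions.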

LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment

Varsha Embar,Ritvik Shrivastava,Vinay Damodaran,Travis Mehlinger,Yu-Chung Hsiao,Karthik Raghunathan

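Task: Automate call driver generation to deliver actionable insights for contact centers.

Motivation: Improve contact center self-service tools, streamline administrative processes, and increase agent productivity.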

Details

Method: Present a cost-efficient LLM system design, including an evaluation of proprietary, open-weight, and fine-tuned models, cost-optimization strategies, and a deployment cost analysis.

Result: The system supports tasks such as topic modeling, incoming call classification, trend detection, and FAQ generation.

Conclusion: The system provides an efficient, cost-optimized solution for contact center agents and administrators.

Abstract: Large Language Models have transformed the Contact Center industry, manifesting in enhanced self-service tools, streamlined administrative processes, and augmented agent productivity. This paper delineates our system that automates call driver generation, which serves as the foundation for tasks such as topic modeling, incoming call classification, trend detection, and FAQ generation, delivering actionable insights for contact center agents and administrators to consume. We present a cost-efficient LLM system design, with 1) a comprehensive evaluation of proprietary, open-weight, and fine-tuned models, 2) cost-efficient strategies, and 3) the corresponding cost analysis when deployed in production environments.

Masks and Mimicry: Strategic Obfuscation and Impersonation Attacks on Authorship Verification

Kenneth Alperin,Rohan Leekha,Adaku Uchendu,Trang Nguyen,Srilakshmi Medarametla,Carlos Levya Capote,Seth Aycock,Charlie Dagli

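Task: Evaluate the robustness of authorship verification models against attacks based on large language models (LLMs).

Motivation: As AI technologies spread, LLMs improve authorship identification but also give malicious actors new attack vectors, so the adversarial robustness of these models needs to be studied.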

Details

Method: Perturb an accurate authorship verification model and test it under untargeted attacks (authorship obfuscation) and targeted attacks (authorship impersonation).

Result: Maximum attack success rates reach 92% for obfuscation and 78% for impersonation.

Conclusion: Authorship verification models have significant vulnerabilities to LLM-based attacks and need further improvement to be secure.

Abstract: The increasing use of Artificial Intelligence (AI) technologies, such as Large Language Models (LLMs), has led to nontrivial improvements in various tasks, including accurate authorship identification of documents. However, while LLMs improve such defense techniques, they also simultaneously provide a vehicle for malicious actors to launch new attack vectors. To combat this security risk, we evaluate the adversarial robustness of authorship models (specifically an authorship verification model) to potent LLM-based attacks. These attacks include untargeted methods (authorship obfuscation) and targeted methods (authorship impersonation). For these attacks, the objective is to mask or mimic the writing style of an author, respectively, while preserving the original texts' semantics. Thus, we perturb an accurate authorship verification model, and achieve maximum attack success rates of 92% and 78% for obfuscation and impersonation attacks, respectively.

Understanding and Improving Information Preservation in Prompt Compression for LLMs

Weronika Łajewska,Momchil Hardalov,Laura Aina,Neha Anna John,Hang Su,Lluís Màrquez

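Task: Propose a holistic evaluation framework for in-depth analysis of prompt compression methods.

Motivation: In information-intensive tasks, prompt length grows quickly, which increases computational requirements, degrades performance, and introduces biases from irrelevant or redundant information.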

Details

Method: Evaluate prompt compression methods along three key aspects (downstream task performance, grounding in the input context, and information preservation), and modify soft prompting methods to control the granularity of the compressed information.

Result: The modified soft prompting methods improve downstream task performance by up to 23%, grounding by more than 8 BERTScore points, and the number of entities preserved in compression by 2.7x.

Conclusion: Controlling the granularity of the compressed information can significantly improve the effectiveness of prompt compression methods.

Abstract: Recent advancements in large language models (LLMs) have enabled their successful application to a broad range of tasks. However, in information-intensive tasks, the prompt length can grow fast, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. Recently, various prompt compression techniques have been introduced to optimize the trade-off between reducing input length and retaining performance. We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods. We focus on three key aspects, besides compression ratio: (i) downstream task performance, (ii) grounding in the input context, and (iii) information preservation. Through this framework, we investigate state-of-the-art soft and hard compression methods, showing that they struggle to preserve key details from the original prompt, limiting their performance on complex tasks. We demonstrate that modifying soft prompting methods to better control the granularity of the compressed information can significantly improve their effectiveness: up to +23% in downstream task performance, more than +8 BERTScore points in grounding, and 2.7x more entities preserved in compression.

Where is this coming from? Making groundedness count in the evaluation of Document VQA models

Armineh Nourbakhsh,Siddharth Parekh,Pranav Shetty,Zhao Jin,Sameena Shah,Carolyn Rose

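Task: Propose a new evaluation methodology that measures the semantic and multimodal groundedness of Document Visual Question Answering (VQA) model outputs.

Motivation: Existing evaluation metrics do not account for the semantic and multimodal groundedness of model outputs, so hallucinations and semantic errors are treated the same as well-grounded outputs and scores fail to reflect the model's reasoning capabilities.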

Details

Method: Propose a parameterized evaluation methodology that considers the semantic characteristics of a prediction and its multimodal placement in the document; users can configure the score according to their preferences.

Result: Human judgment validates the new scoring methodology, and the paper shows its potential impact on existing leaderboards.

Conclusion: The new method is a better indicator of model robustness and tends to give higher rewards to better-calibrated answers.

Abstract: Document Visual Question Answering (VQA) models have evolved at an impressive rate over the past few years, coming close to or matching human performance on some benchmarks. We argue that common evaluation metrics used by popular benchmarks do not account for the semantic and multimodal groundedness of a model's outputs. As a result, hallucinations and major semantic errors are treated the same way as well-grounded outputs, and the evaluation scores do not reflect the reasoning capabilities of the model. In response, we propose a new evaluation methodology that accounts for the groundedness of predictions with regard to the semantic characteristics of the output as well as the multimodal placement of the output within the input document. Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences. We validate our scoring methodology using human judgment and show its potential impact on existing popular leaderboards. Through extensive analyses, we demonstrate that our proposed method produces scores that are a better indicator of a model's robustness and tends to give higher rewards to better-calibrated answers.

Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

Haebin Shin,Lei Ji,Xiao Liu,Yeyun Gong

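Task: Propose VocAgnoLM, a new method that addresses vocabulary mismatch between teacher and student models.

Motivation: Vocabulary mismatch causes teacher and student models to diverge in token sequences and output distributions during language modeling, which hurts training.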

Details

Method: Use two key techniques: Token-level Lexical Alignment and Teacher Guided Loss.

Result: In experiments with a 1B student model and 7B teacher models, VocAgnoLM improves performance substantially (a 46% improvement over naive continual pretraining with Qwen2.5-Math-Instruct as the teacher).

Conclusion: VocAgnoLM provides a robust solution to vocabulary mismatch in language modeling and benefits from stronger teacher models.

Abstract: Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the loss of the teacher model to guide effective student training. We demonstrate its effectiveness in language modeling with a 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
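
To make the idea of aligning token sequences across mismatched vocabularies concrete, the sketch below maps each student token to the teacher tokens that cover the same characters. Alignment by character-span overlap is an assumption made here for illustration; the paper's Token-level Lexical Alignment may be defined differently, and the token splits are invented.

```python
def char_spans(tokens):
    """Return (start, end) character offsets for tokens that concatenate to the text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def align(student_tokens, teacher_tokens):
    """Map each student token index to the teacher token indices whose spans overlap it."""
    if "".join(student_tokens) != "".join(teacher_tokens):
        raise ValueError("both tokenizations must cover the same text")
    teacher_spans = char_spans(teacher_tokens)
    mapping = []
    for s_start, s_end in char_spans(student_tokens):
        overlapping = [
            j for j, (t_start, t_end) in enumerate(teacher_spans)
            if t_start < s_end and s_start < t_end  # half-open interval overlap
        ]
        mapping.append(overlapping)
    return mapping

# Hypothetical tokenizations of the same string under two vocabularies.
student = ["un", "believ", "able", " math"]
teacher = ["unbe", "lievable", " ", "math"]
mapping = align(student, teacher)
print(mapping)  # [[0], [0, 1], [1], [2, 3]]
```

With such a mapping, a per-token teacher signal (for example, its loss on the overlapping tokens) could be aggregated onto each student token even though the two models never share token IDs.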

MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks

Wenhao You,Bryan Hooi,Yiwei Wang,Youke Wang,Zong Ke,Ming-Hsuan Yang,Zi Huang,Yujun Cai

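Task: Propose MIRAGE, a multimodal jailbreak framework that uses narrative-driven context and role immersion to circumvent the safety mechanisms of Multimodal Large Language Models (MLLMs).

Motivation: Although safety mechanisms have progressed in filtering harmful text inputs, MLLMs remain vulnerable to multimodal jailbreaks that exploit their cross-modal reasoning capabilities.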

Details

Method: Decompose the harmful query into environment, role, and action triplets, then use Stable Diffusion to build a multi-turn visual storytelling sequence that progressively lowers the model's defences and guides its reasoning.

Result: MIRAGE achieves state-of-the-art performance on six mainstream MLLMs, improving attack success rates by up to 17.5% over the best baselines.

Conclusion: The study exposes critical weaknesses in current multimodal safety mechanisms and underscores the need for more robust defences against cross-modal threats.

Abstract: While safety mechanisms have significantly progressed in filtering harmful text inputs, MLLMs remain vulnerable to multimodal jailbreaks that exploit their cross-modal reasoning capabilities. We present MIRAGE, a novel multimodal jailbreak framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models (MLLMs). By systematically decomposing the toxic query into environment, role, and action triplets, MIRAGE constructs a multi-turn visual storytelling sequence of images and text using Stable Diffusion, guiding the target model through an engaging detective narrative. This process progressively lowers the model's defences and subtly guides its reasoning through structured contextual cues, ultimately eliciting harmful responses. In extensive experiments on the selected datasets with six mainstream MLLMs, MIRAGE achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. Moreover, we demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards. These results highlight critical weaknesses in current multimodal safety mechanisms and underscore the urgent need for more robust defences against cross-modal threats.

Language Model Uncertainty Quantification with Attention Chain

Yinghao Li,Rushi Qiang,Lama Moukheiber,Chao Zhang

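Task: Propose an efficient method (UQAC) for quantifying the predictive uncertainty of large language models (LLMs) on complex reasoning tasks.

Motivation: Existing research focuses on short, directly answerable questions, whereas uncertainty quantification (UQ) for complex reasoning tasks is difficult because answers depend on intermediate reasoning steps and probability estimates become inflated.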

Details

Method: UQAC builds an "attention chain" through a backtracking procedure, identifying the tokens that are semantically crucial to the final answer; similarity filtering and probability thresholding then refine the chain so that the marginal probabilities of the answer tokens can be approximated.

Result: UQAC's efficiency and reliability are validated on multiple reasoning benchmarks.

Conclusion: UQAC quantifies predictive uncertainty in complex reasoning tasks efficiently and reliably.

Abstract: Accurately quantifying a large language model's (LLM) predictive uncertainty is crucial for judging the reliability of its answers. While most existing research focuses on short, directly answerable questions with closed-form outputs (e.g., multiple-choice), involving intermediate reasoning steps in LLM responses is increasingly important. This added complexity complicates uncertainty quantification (UQ) because the probabilities assigned to answer tokens are conditioned on a vast space of preceding reasoning tokens. Direct marginalization is infeasible, and the dependency inflates probability estimates, causing overconfidence in UQ. To address this, we propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization. UQAC iteratively constructs an "attention chain" of tokens deemed "semantically crucial" to the final answer via a backtracking procedure. Starting from the answer tokens, it uses attention weights to identify the most influential predecessors, then iterates this process until reaching the input tokens. Similarity filtering and probability thresholding further refine the resulting chain, allowing us to approximate the marginal probabilities of the answer tokens, which serve as the LLM's confidence. We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs, demonstrating that it consistently delivers reliable UQ estimates with high computational efficiency.
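
The backtracking procedure described above (start from the answer tokens, follow attention weights to the most influential predecessors, stop at the input tokens) can be sketched on a toy attention matrix. This is a simplified illustration under assumptions: a single attention matrix, a fixed `top_k`, and no similarity filtering or probability thresholding; the function name and all numbers are hypothetical.

```python
def attention_chain(attn, answer_idx, num_input, top_k=1):
    """Backtrack from answer tokens to input tokens along the strongest attention links.

    attn[i][j] is the attention weight from token i to an earlier token j (j < i).
    Tokens with index < num_input are treated as input tokens and are not expanded.
    """
    chain = set(answer_idx)
    frontier = set(answer_idx)
    while frontier:
        next_frontier = set()
        for i in frontier:
            if i < num_input:  # reached the input; stop backtracking here
                continue
            # Pick the top_k most attended predecessors of token i.
            preds = sorted(range(i), key=lambda j: attn[i][j], reverse=True)[:top_k]
            for j in preds:
                if j not in chain:
                    chain.add(j)
                    next_frontier.add(j)
        frontier = next_frontier
    return sorted(chain)

# Toy causal attention: tokens 0-1 are input, 2-4 are reasoning, 5 is the answer.
attn = [
    [],
    [1.0],
    [0.7, 0.3],
    [0.1, 0.2, 0.7],
    [0.6, 0.1, 0.1, 0.2],
    [0.05, 0.05, 0.1, 0.7, 0.1],
]
chain = attention_chain(attn, answer_idx=[5], num_input=2)
print(chain)  # [0, 2, 3, 5]
```

In this toy case token 4 never enters the chain, which is the point of the procedure: only the tokens on the chain would need to be considered when approximating the answer's marginal probability.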

Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education

Hayate Iso,Pouya Pezeshkpour,Nikita Bhutani,Estevam Hruschka

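Task: Study the performance and fairness of large language models (LLMs) in matching job descriptions with resumes.

Motivation: LLMs have the potential to automate hiring, but their inherent biases may lead to unfair hiring practices and undermine workplace diversity.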

Details

Method: Within the English language and U.S. context, evaluate how factors such as gender, race, and educational background influence LLM decisions.

Result: Recent models have reduced explicit biases related to gender and race, but implicit biases concerning educational background remain significant.

Conclusion: Ongoing evaluation and advanced bias mitigation strategies are needed to ensure the fair use of LLMs in hiring.

Abstract: Large Language Models (LLMs) offer the potential to automate hiring by matching job descriptions with candidate resumes, streamlining recruitment processes, and reducing operational costs. However, biases inherent in these models may lead to unfair hiring practices, reinforcing societal prejudices and undermining workplace diversity. This study examines the performance and fairness of LLMs in job-resume matching tasks within the English language and U.S. context. It evaluates how factors such as gender, race, and educational background influence model decisions, providing critical insights into the fairness and reliability of LLMs in HR applications. Our findings indicate that while recent models have reduced biases related to explicit attributes like gender and race, implicit biases concerning educational background remain significant. These results highlight the need for ongoing evaluation and the development of advanced bias mitigation strategies to ensure equitable hiring practices when using LLMs in industry settings.

Protein Structure-Function Relationship: A Kernel-PCA Approach for Reaction Coordinate Identification

Parisa Mollaei,Amir Barati Farimani

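Task: Propose a Kernel-PCA model that captures structure-function relationships in a protein and ranks reaction coordinates.

Motivation: Use machine learning to uncover meaningful patterns in high-dimensional protein data in support of protein structure-function analysis.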

Details

Method: Combine kernel methods with principal component analysis (PCA) on molecular dynamics (MD) simulation data, and use a network-based approach to analyze the dynamic behavior of residues.

Result: The model accurately identifies reaction coordinates in a G protein-coupled receptor and reveals correlations in the dynamic behavior of residues associated with a specific protein property.

Conclusion: The model is a powerful tool for protein structure-function analysis and visualization.

Abstract: In this study, we propose a Kernel-PCA model designed to capture structure-function relationships in a protein. This model also enables ranking of reaction coordinates according to their impact on protein properties. By leveraging machine learning techniques, including Kernel and principal component analysis (PCA), our model uncovers meaningful patterns in high-dimensional protein data obtained from molecular dynamics (MD) simulations. The effectiveness of our model in accurately identifying reaction coordinates has been demonstrated through its application to a G protein-coupled receptor. Furthermore, this model utilizes a network-based approach to uncover correlations in the dynamic behavior of residues associated with a specific protein property. These findings underscore the potential of our model as a powerful tool for protein structure-function analysis and visualization.

Overtrained Language Models Are Harder to Fine-Tune

Jacob Mitchell Springer,Sachin Goyal,Kaiyue Wen,Tanishq Kumar,Xiang Yue,Sadhika Malladi,Graham Neubig,Aditi Raghunathan

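Task: Study how pre-training of large language models affects downstream task performance.

Motivation: Challenge the assumption that better pre-training performance necessarily improves downstream models, and show that extended pre-training can degrade final performance.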

Details

Method: Use controlled experiments and theoretical analysis to study the systematic increase in the sensitivity of pre-trained parameters to modifications.

Result: Extended pre-training makes models more sensitive to modifications such as fine-tuning and degrades final performance (for example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens performs over 2% worse than its 2.3T-token counterpart).

Conclusion: Pre-training design should be reassessed with the model's downstream adaptability in mind.

Abstract: Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.

Towards Terminology Management Automation for Arabic

Mahdi Nasser,Laura Sayyah,Fadi A. Zaraket

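Task: Automate terminology management for Arabic by extracting lists of parallel terms that match foreign-language terms to their Arabic counterparts.

Motivation: Improve the consistency and accuracy of terminology in translated Arabic academic books and support cross-lingual text processing.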

Details

Method: Exploit naturally occurring term translations, compute several similarity metrics (lexicographic, phonetic, morphological, and semantic), and experiment with heuristic, machine learning, and post-processing approaches.

Result: The best approach achieves 94.9% precision and 92.4% recall.

Conclusion: Automating terminology management aims to reduce processing time and ensure consistent, correct terminology.

Abstract: This paper presents a method and supporting tools for automation of terminology management for Arabic. The tools extract lists of parallel terminology matching terms in foreign languages to their Arabic counterparts from field-specific texts. This has significant implications as it can be used to improve consistent translation and use of terms in specialized Arabic academic books, and provides automated aid for enhancing cross-lingual text processing. This automation of terminology management aims to reduce processing time, and ensure use of consistent and correct terminology. The extraction takes advantage of naturally occurring term translations. It considers several candidate phrases of varying lengths that co-occur next to the foreign terms. Then it computes several similarity metrics, including lexicographic, phonetic, morphological, and semantic ones, to decide whether a candidate matches. We experiment with heuristic, machine learning, and ML with post-processing approaches. This paper reports on a novel curated dataset for the task, existing expert-reviewed industry parallel corpora, and the performance of the three approaches. The best approach achieved 94.9% precision and 92.4% recall.

A Survey of Large Language Model Agents for Question Answering

Murong Yue

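Task: Survey the development of large language model (LLM)-based agents for question answering (QA).

Motivation: Traditional agents face significant limitations, including substantial data requirements and difficulty generalizing to new environments; LLM-based agents address these by using LLMs as their core reasoning engine.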

Details

Method: Systematically review the design of LLM agents for QA tasks, organized around the key stages of planning, question understanding, information retrieval, and answer generation.

Result: By interacting with external environments, LLM agents achieve better QA results than traditional QA pipelines and naive LLM QA systems.

Conclusion: The paper identifies ongoing challenges for LLM agent QA systems and explores future research directions to improve their performance.

Abstract: This paper surveys the development of large language model (LLM)-based agents for question answering (QA). Traditional agents face significant limitations, including substantial data requirements and difficulty in generalizing to new environments. LLM-based agents address these challenges by leveraging LLMs as their core reasoning engine. These agents achieve superior QA results compared to traditional QA pipelines and naive LLM QA systems by enabling interaction with external environments. We systematically review the design of LLM agents in the context of QA tasks, organizing our discussion across key stages: planning, question understanding, information retrieval, and answer generation. Additionally, this paper identifies ongoing challenges and explores future research directions to enhance the performance of LLM agent QA systems.

SCI-IDEA: Context-Aware Scientific Ideation Using Token and Sentence Embeddings

Farhana Keya,Gollam Rabby,Prasenjit Mitra,Sahar Vahdati,Sören Auer,Yaser Jaradeh

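Task: Propose SCI-IDEA, a framework based on large language models (LLMs) and Aha Moment detection for iterative scientific idea generation.

Motivation: Although LLMs trained on scientific corpora have advanced AI-supported idea generation, producing context-aware, high-quality, and innovative scientific ideas remains challenging.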

Details

Method: SCI-IDEA combines LLM prompting strategies with Aha Moment detection, extracts essential facets from research publications, and assesses generated ideas on novelty, excitement, feasibility, and effectiveness.

Result: Experiments show average scores of 6.84, 6.86, 6.89, and 6.84 (on a 1-10 scale) for novelty, excitement, feasibility, and effectiveness, respectively, validating the framework's effectiveness.

Conclusion: SCI-IDEA supports structured and flexible exploration of scientific ideas while maintaining ethical standards, showing potential for scientific innovation.

Abstract: Every scientific discovery starts with an idea inspired by prior work, interdisciplinary concepts, and emerging challenges. Recent advancements in large language models (LLMs) trained on scientific corpora have driven interest in AI-supported idea generation. However, generating context-aware, high-quality, and innovative ideas remains challenging. We introduce SCI-IDEA, a framework that uses LLM prompting strategies and Aha Moment detection for iterative idea refinement. SCI-IDEA extracts essential facets from research publications, assessing generated ideas on novelty, excitement, feasibility, and effectiveness. Comprehensive experiments validate SCI-IDEA's effectiveness, achieving average scores of 6.84, 6.86, 6.89, and 6.84 (on a 1-10 scale) across novelty, excitement, feasibility, and effectiveness, respectively. Evaluations employed GPT-4o, GPT-4.5, DeepSeek-32B (each under 2-shot prompting), and DeepSeek-70B (3-shot prompting), with token-level embeddings used for Aha Moment detection. Similarly, it achieves scores of 6.87, 6.86, 6.83, and 6.87 using GPT-4o under 5-shot prompting, GPT-4.5 under 3-shot prompting, DeepSeek-32B under zero-shot chain-of-thought prompting, and DeepSeek-70B under 5-shot prompting with sentence-level embeddings. We also address ethical considerations such as intellectual credit, potential misuse, and balancing human creativity with AI-driven ideation. Our results highlight SCI-IDEA's potential to facilitate the structured and flexible exploration of context-aware scientific ideas, supporting innovation while maintaining ethical standards.

Jiali Cheng,Hadi Amiri

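Task: Study the performance of large language models (LLMs) on fine-grained linguistic annotation tasks.

Motivation: Although LLMs are proficient at generating coherent text, their ability to perform fine-grained linguistic annotation, which requires precise syntactic and semantic understanding, remains in question; this affects their reliability for detailed linguistic analysis.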

Details

Method: Run a series of experiments to evaluate recent LLMs on linguistic annotation tasks, focusing on their handling of complex linguistic structures.

Result: Recent LLMs, including the most capable one tested (Llama3-70b), make notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses.

Conclusion: The results provide insights to inform future improvements in LLM design and development.

Abstract: Large language models (LLMs) are the foundation of many AI applications today. However, despite their remarkable proficiency in generating coherent text, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses in input texts. These tasks require precise syntactic and semantic understanding of input text, and when LLMs underperform on specific linguistic structures, it raises concerns about their reliability for detailed linguistic analysis and whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs on fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that the most capable LLM (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. Our results provide insights to inform future advancements in LLM design and development.

PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping

Sarah Pungitore,Shashank Yadav,Vignesh Subbian

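Task: Develop and apply an evaluation framework (PHEONA) for assessing large language models (LLMs) on computational phenotyping tasks.

Motivation: Traditional computational phenotyping is time- and resource-intensive, and the advantages of LLMs for text-based tasks have not yet been fully explored in this setting.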

Details

Method: Develop the PHEONA framework and apply it to concept classification for Acute Respiratory Failure (ARF) respiratory support therapies.

Result: High classification accuracy on the sample concepts tested suggests that LLM-based methods could improve computational phenotyping processes.

Conclusion: LLM-based methods show promise for computational phenotyping, and PHEONA provides guidance for related research.

Abstract: Computational phenotyping is essential for biomedical research but often requires significant time and resources, especially since traditional methods typically involve extensive manual data review. While machine learning and natural language processing advancements have helped, further improvements are needed. Few studies have explored using Large Language Models (LLMs) for these tasks despite known advantages of LLMs for text-based tasks. To facilitate further research in this area, we developed an evaluation framework, Evaluation of PHEnotyping for Observational Health Data (PHEONA), that outlines context-specific considerations. We applied and demonstrated PHEONA on concept classification, a specific task within a broader phenotyping process for Acute Respiratory Failure (ARF) respiratory support therapies. From the sample concepts tested, we achieved high classification accuracy, suggesting the potential for LLM-based methods to improve computational phenotyping processes.

MARS: Memory-Enhanced Agents with Reflective Self-improvement

Xuechen Liang,Meiling Tao,Yinghui Xia,Jianhui Wang,Kun Li,Yijin Wang,Jingsong Yang,Tianyu Shi,Yuantao Wang,Miao Zhang,Xueqian Wang

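Task: Propose MARS, a framework that addresses continuous decision-making, lack of long-term memory, and limited context windows for large language models in dynamic environments.

Motivation: Large language models have made significant advances in natural language processing but still face challenges with continuous decision-making, missing long-term memory, and limited context windows.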

Details

Method: The MARS framework comprises three agents (the User, the Assistant, and the Checker) and integrates iterative feedback, reflective mechanisms, and a memory optimization mechanism based on the Ebbinghaus forgetting curve.

Result: The framework significantly improves the agents' ability to handle multi-tasking and long-span information.

Conclusion: MARS effectively addresses key challenges that large language models face in dynamic environments.

Abstract: Large language models (LLMs) have made significant advances in the field of natural language processing, but they still face challenges such as continuous decision-making, lack of long-term memory, and limited context windows in dynamic environments. To address these issues, this paper proposes an innovative framework, Memory-Enhanced Agents with Reflective Self-improvement (MARS). The MARS framework comprises three agents: the User, the Assistant, and the Checker. By integrating iterative feedback, reflective mechanisms, and a memory optimization mechanism based on the Ebbinghaus forgetting curve, it significantly enhances the agents' capabilities in handling multi-tasking and long-span information.
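
A memory optimization mechanism based on the Ebbinghaus forgetting curve can be illustrated with the commonly used exponential form R = exp(-t / S), where t is the time since a memory was stored and S is its strength. The sketch below is an assumption-laden illustration: the memory record layout, the strength values, and the retention threshold are hypothetical and not taken from the MARS paper.

```python
import math

def retention(elapsed, strength):
    """Ebbinghaus-style retention R = exp(-t / S) for elapsed time t and strength S."""
    return math.exp(-elapsed / strength)

def keep_memories(memories, now, threshold=0.3):
    """Keep only memories whose estimated retention is still at or above the threshold."""
    return [
        m for m in memories
        if retention(now - m["t"], m["strength"]) >= threshold
    ]

# Hypothetical memory store: a stronger memory decays more slowly.
memories = [
    {"text": "user prefers metric units", "t": 0, "strength": 10.0},
    {"text": "small talk about the weather", "t": 0, "strength": 1.0},
]
kept = keep_memories(memories, now=5)
print([m["text"] for m in kept])  # ['user prefers metric units']
```

In such a design, reinforcing a memory (for example, when it is recalled) would typically increase its strength, so frequently used information stays in the limited context while stale items are dropped.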

CoMAC: Conversational Agent for Multi-Source Auxiliary Context with Sparse and Symmetric Latent Interactions

Junfeng Liu,Christopher T. Symons,Ranga Raju Vatsavai

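Task: Propose CoMAC, a new method that efficiently extracts relevant information from multi-source auxiliary data to generate conversational responses.

Motivation: Existing methods extract relevant information from multiple data sources inefficiently, which limits the wide adoption of conversational AI tools.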

Details

Method: CoMAC uses specialized encoding streams and post-fusion grounding networks to process multiple data sources, and a novel text similarity metric that enables bi-directional information sharing.

Result: Experiments show that CoMAC significantly outperforms two state-of-the-art methods in relevant persona and knowledge prediction accuracy and in response generation quality.

Conclusion: By efficiently extracting and integrating multi-source information, CoMAC significantly improves the quality and adaptability of conversation generation.

Abstract: Recent advancements in AI-driven conversational agents have demonstrated the immense potential of AI applications. Effective response generation is crucial to the success of these agents. While extensive research has focused on leveraging multiple auxiliary data sources (e.g., knowledge bases and personas) to enhance response generation, existing methods often struggle to efficiently extract relevant information from these sources. There are still clear limitations in the ability to combine versatile conversational capabilities with adherence to known facts and adaptation to large variations in user preferences and belief systems, which continues to hinder the wide adoption of conversational AI tools. This paper introduces a novel method, Conversational Agent for Multi-Source Auxiliary Context with Sparse and Symmetric Latent Interactions (CoMAC), for conversation generation, which employs specialized encoding streams and post-fusion grounding networks for multiple data sources to identify relevant persona and knowledge information for the conversation. CoMAC also leverages a novel text similarity metric that allows bi-directional information sharing among multiple sources and focuses on a selective subset of meaningful words. Our experiments show that CoMAC improves the relevant persona and knowledge prediction accuracies and response generation quality significantly over two state-of-the-art methods.

Machine-assisted writing evaluation: Exploring pre-trained language models in analyzing argumentative moves

Wenjuan Qin,Weiran Wang,Yuming Yang,Tao Gui

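Task: Assess the reliability of pre-trained language models (PLMs) in analyzing argumentative moves in a longitudinal learner corpus, and use PLM-generated annotations to illustrate developmental patterns and predict writing quality.

Motivation: Prior studies on argumentative moves mostly rely on qualitative analysis and manual coding, which limits their efficiency and generalizability.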

Details

Method: Collect and annotate 1643 argumentative texts from 235 English learners in China, split them into training, validation, and application sets, and analyze them with PLMs such as BERT.

Result: PLMs analyze argumentative moves with high reliability (overall F1 score of 0.743) and effectively capture developmental patterns and predict writing quality.

Conclusion: The application of PLMs shows the potential of artificial intelligence in language education: it can improve the efficiency and accuracy of assessment and promote a data-driven, personalized learning environment.

Abstract: The study investigates the efficacy of pre-trained language models (PLMs) in analyzing argumentative moves in a longitudinal learner corpus. Prior studies on argumentative moves often rely on qualitative analysis and manual coding, limiting their efficiency and generalizability. The study aims to: 1) assess the reliability of PLMs in analyzing argumentative moves; 2) utilize PLM-generated annotations to illustrate developmental patterns and predict writing quality. A longitudinal corpus of 1643 argumentative texts from 235 English learners in China is collected and annotated into six move types: claim, data, counter-claim, counter-data, rebuttal, and non-argument. The corpus is divided into training, validation, and application sets annotated by human experts and PLMs. We use BERT as one of the implementations of PLMs. The results indicate a robust reliability of PLMs in analyzing argumentative moves, with an overall F1 score of 0.743, surpassing existing models in the field. Additionally, PLM-labeled argumentative moves effectively capture developmental patterns and predict writing quality. Over time, students exhibit an increase in the use of data and counter-claims and a decrease in non-argument moves. While low-quality texts are characterized by a predominant use of claims and data supporting only one side's position, mid- and high-quality texts demonstrate an integrative perspective with a higher ratio of counter-claims, counter-data, and rebuttals. This study underscores the transformative potential of integrating artificial intelligence into language education, enhancing the efficiency and accuracy of evaluating students' writing. The successful application of PLMs can catalyze the development of educational technology, promoting a more data-driven and personalized learning environment that supports diverse educational needs.

Iterative Hypothesis Generation for Scientific Discovery with Monte Carlo Nash Equilibrium Self-Refining Trees

Gollam Rabby,Diyana Muhammed,Prasenjit Mitra,Sören Auer

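Task: Propose MC-NEST, a new framework for generating scientific hypotheses that combines Monte Carlo Tree Search with Nash Equilibrium strategies.

Motivation: Traditional approaches rely on human intuition and domain expertise, while purely LLM-based methods struggle to generate hypotheses that are both innovative and reliable.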

Details

Method: MC-NEST dynamically balances exploration and exploitation through Monte Carlo Tree Search and Nash Equilibrium strategies, iteratively refining hypotheses. Result: In experiments across multiple domains, MC-NEST outperforms existing methods on novelty, clarity, significance, and verifiability metrics. Conclusion: MC-NEST sets a new benchmark for automated hypothesis generation and supports structured human-AI collaboration, emphasizing transparency and human supervision. Abstract: Scientific hypothesis generation is a fundamentally challenging task in research, requiring the synthesis of novel and empirically grounded insights. Traditional approaches rely on human intuition and domain expertise, while purely large language model (LLM) based methods often struggle to produce hypotheses that are both innovative and reliable. To address these limitations, we propose the Monte Carlo Nash Equilibrium Self-Refine Tree (MC-NEST), a novel framework that integrates Monte Carlo Tree Search with Nash Equilibrium strategies to iteratively refine and validate hypotheses. MC-NEST dynamically balances exploration and exploitation through adaptive sampling strategies, which prioritize high-potential hypotheses while maintaining diversity in the search space. We demonstrate the effectiveness of MC-NEST through comprehensive experiments across multiple domains, including biomedicine, social science, and computer science. MC-NEST achieves average scores of 2.65, 2.74, and 2.80 (on a 1-3 scale) for novelty, clarity, significance, and verifiability metrics on the social science, computer science, and biomedicine datasets, respectively, outperforming state-of-the-art prompt-based methods, which achieve 2.36, 2.51, and 2.52 on the same datasets. These results underscore MC-NEST's ability to generate high-quality, empirically grounded hypotheses across diverse domains. Furthermore, MC-NEST facilitates structured human-AI collaboration, ensuring that LLMs augment human creativity rather than replace it. By addressing key challenges such as iterative refinement and the exploration-exploitation balance, MC-NEST sets a new benchmark in automated hypothesis generation.
Additionally, MC-NEST's ethical design enables responsible AI use, emphasizing transparency and human supervision in hypothesis generation.
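
The exploration-exploitation balance mentioned in the abstract can be illustrated with the standard UCT (UCB1) selection rule used in generic Monte Carlo Tree Search. This is only a sketch under assumptions: the abstract does not state MC-NEST's actual selection rule or how Nash Equilibrium strategies enter it, and the names `uct_score`, `select_child`, the constant `c`, and the example values are hypothetical.

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """UCB1-style score: mean value plus an exploration bonus.

    Generic MCTS rule, not MC-NEST's published rule (an assumption).
    """
    if visits == 0:
        return math.inf  # unvisited candidates are always tried first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits, c=1.4):
    """Return the index of the child with the highest UCT score.

    `children` is a list of (total_value, visits) pairs.
    """
    scores = [uct_score(v, n, parent_visits, c) for v, n in children]
    return max(range(len(children)), key=scores.__getitem__)

# Hypothetical candidate hypotheses as (total_value, visits) pairs.
children = [(5.0, 2), (8.0, 4), (0.0, 0)]
best = select_child(children, parent_visits=6)
```

A larger `c` favors rarely visited hypotheses, while a smaller `c` favors those with a high mean score so far.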

Substance over Style: Evaluating Proactive Conversational Coaching Agents

Vidya Srinivas,Xuhai Xu,Xin Liu,Kumar Ayush,Isaac Galatzer-Levy,Shwetak Patel,Daniel McDuff,Tim Althoff

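Task: Study the design and evaluation of multi-turn conversational coaching agents.

Motivation: Traditional NLP research mostly targets single-turn conversational tasks, whereas coaching conversations involve initially undefined goals, multi-turn interaction, and subjective evaluation, which call for new approaches.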

Details

Method: Design and implement five multi-turn coaching agents with distinct conversational styles, and collect feedback on 155 conversations through a user study. Result: Users value core functionality, and purely stylistic components are viewed negatively; user feedback is significantly misaligned with evaluations by experts and a language model. Conclusion: The study offers insights into the design and evaluation of coaching agents and advances human-centered NLP applications. Abstract: While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges with initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, and mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in the absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.

DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models

Suyoung Bae,YunSeok Choi,Jee-Hyong Lee

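Task: Propose DeCAP, a method for debiasing large language models (LLMs) in zero-shot question answering.

Motivation: Existing zero-shot methods are efficient but neither consider context nor prevent bias propagation in the answers, so LLM performance degrades on socially sensitive questions.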

Details

Method: DeCAP debiases through Context-Adaptive Prompt Generation, which comprises Question Ambiguity Detection and Neutral Answer Guidance Generation. Result: Experiments on eight LLMs show that DeCAP achieves state-of-the-art performance on zero-shot debiased QA. Conclusion: DeCAP effectively improves the fairness and accuracy of LLMs across diverse QA settings. Abstract: While Large Language Models (LLMs) excel in zero-shot Question Answering (QA), they tend to expose biases in their internal knowledge when faced with socially sensitive questions, leading to a degradation in performance. Existing zero-shot methods are efficient but fail to consider context and prevent bias propagation in the answers. To address this, we propose DeCAP, a method for debiasing LLMs using Context-Adaptive Prompt Generation. DeCAP leverages a Question Ambiguity Detection to take appropriate debiasing actions based on the context and a Neutral Answer Guidance Generation to guide the LLMs to make objective judgments about the context, suppressing the propagation of bias from their internal knowledge. Our various experiments across eight LLMs show that DeCAP achieves state-of-the-art zero-shot debiased QA performance. This demonstrates DeCAP's efficacy in enhancing the fairness and accuracy of LLMs in diverse QA settings.

A Real-Time Human Action Recognition Model for Assisted Living

Yixuan Wang,Paul Stynes,Pramod Pathak,Cristina Muntean

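Task: Predict health risks for elderly residents in assisted living environments with a real-time human action recognition model.

Motivation: Ensuring the safety and well-being of elderly and vulnerable populations in assisted living environments is critical, and computer vision offers an innovative solution.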

Details

Method: Combine a deep learning model with a live video prediction and alert system, training and comparing four state-of-the-art HAR models (UniFormerV2, TimeSformer, I3D, SlowFast). Result: TimeSformer performs best on macro F1 score (95.33%), recall (95.49%), and precision (95.19%), with higher inference efficiency. Conclusion: The study provides an effective approach to improving the health and safety of the elderly and people with chronic illnesses in assisted living environments, fostering smarter communities and industry innovation. Abstract: Ensuring the safety and well-being of elderly and vulnerable populations in assisted living environments is a critical concern. Computer vision presents an innovative and powerful approach to predicting health risks through video monitoring, employing human action recognition (HAR) technology. However, real-time prediction of human actions with high performance and efficiency is a challenge. This research proposes a real-time human action recognition model that combines a deep learning model and a live video prediction and alert system, in order to predict falls, staggering and chest pain for residents in assisted living. Six thousand RGB video samples from the NTU RGB+D 60 dataset were selected to create a dataset with four classes: Falling, Staggering, Chest Pain, and Normal, with the Normal class comprising 40 daily activities. A transfer learning technique was applied to train four state-of-the-art HAR models on a GPU server, namely, UniFormerV2, TimeSformer, I3D, and SlowFast. Results of the four models are presented in this paper based on class-wise and macro performance metrics, inference efficiency, model complexity and computational costs. TimeSformer is proposed for developing the real-time human action recognition model, leveraging its leading macro F1 score (95.33%), recall (95.49%), and precision (95.19%) along with significantly higher inference throughput compared to the others. This research provides insights to enhance safety and health of the elderly and people with chronic illnesses in assisted living environments, fostering sustainable care, smarter communities and industry innovation.
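
The macro precision, recall, and F1 figures quoted above are unweighted averages of per-class scores. The sketch below shows one common way to compute them, assuming macro F1 is the mean of per-class F1 values (the paper's exact evaluation code is not given in the abstract). The function name `macro_scores` and the toy labels are hypothetical; only the four class names come from the abstract.

```python
def macro_scores(y_true, y_pred, classes):
    """Return macro-averaged (precision, recall, F1) over `classes`."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels for illustration only (not data from the paper).
classes = ["Falling", "Staggering", "Chest Pain", "Normal"]
y_true = ["Falling", "Falling", "Staggering", "Chest Pain", "Normal", "Normal"]
y_pred = ["Falling", "Staggering", "Staggering", "Chest Pain", "Normal", "Normal"]
precision, recall, f1 = macro_scores(y_true, y_pred, classes)
```

Because every class counts equally, a rare class such as Falling influences the macro scores as much as the much larger Normal class.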

Enhancing Small Language Models for Cross-Lingual Generalized Zero-Shot Classification with Soft Prompt Tuning

Fred Philippy,Siwen Guo,Cedric Lothritz,Jacques Klein,Tegawendé F. Bissyandé

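Task: Propose RoSPrompt, a lightweight and data-efficient approach for training soft prompts that enhance cross-lingual zero-shot classification (ZSC).

Motivation: Pretrained language models (PLMs) perform well in zero-shot classification but rely on large training datasets or external knowledge, which limits their use in multilingual and low-resource scenarios. Existing prompt-based methods struggle to exploit labeled data from related classification tasks, especially across different languages or distributions.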

Details

Method: Introduce RoSPrompt, a soft prompt training method designed for small multilingual PLMs that leverages high-resource languages to improve performance in low-resource languages without extensive fine-tuning or high computational cost. Result: Evaluation on datasets covering 106 languages shows strong cross-lingual transfer performance and robust generalization to unseen classes. Conclusion: RoSPrompt is an efficient and adaptable solution that significantly improves the performance and generalization of cross-lingual zero-shot classification. Abstract: In NLP, Zero-Shot Classification (ZSC) has become essential for enabling models to classify text into categories unseen during training, particularly in low-resource languages and domains where labeled data is scarce. While pretrained language models (PLMs) have shown promise in ZSC, they often rely on large training datasets or external knowledge, limiting their applicability in multilingual and low-resource scenarios. Recent approaches leveraging natural language prompts reduce the dependence on large training datasets but struggle to effectively incorporate available labeled data from related classification tasks, especially when these datasets originate from different languages or distributions. Moreover, existing prompt-based methods typically rely on manually crafted prompts in a specific language, limiting their adaptability and effectiveness in cross-lingual settings. To address these challenges, we introduce RoSPrompt, a lightweight and data-efficient approach for training soft prompts that enhance cross-lingual ZSC while ensuring robust generalization across data distribution shifts. RoSPrompt is designed for small multilingual PLMs, enabling them to leverage high-resource languages to improve performance in low-resource settings without requiring extensive fine-tuning or high computational costs. We evaluate our approach on multiple multilingual PLMs across datasets covering 106 languages, demonstrating strong cross-lingual transfer performance and robust generalization capabilities over unseen classes.

SG-Tailor: Inter-Object Commonsense Relationship Reasoning for Scene Graph Manipulation

Haoliang Shang,Hanyu Wu,Guangyao Zhai,Boyang Sun,Fangjinhua Wang,Federico Tombari,Marc Pollefeys

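Task: Study how to achieve conflict-free scene graph modification and node addition with the SG-Tailor model.

Motivation: Modifying complex relationships and adding nodes in scene graphs is computationally intractable, and existing methods do not address the resulting conflicts.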

Details

Method: Propose SG-Tailor, an autoregressive model that predicts conflict-free relationships between nodes and uses a Cut-And-Stitch strategy to resolve conflicts. Result: SG-Tailor significantly outperforms other methods in experiments and integrates seamlessly into scene generation and robotic manipulation tasks. Conclusion: SG-Tailor offers an effective way to resolve conflicts in scene graph modification and has broad application potential. Abstract: Scene graphs capture complex relationships among objects, serving as strong priors for content generation and manipulation. Yet, reasonably manipulating scene graphs -- whether by adding nodes or modifying edges -- remains a challenging and untouched task. Tasks such as adding a node to the graph or reasoning about a node's relationships with all others are computationally intractable, as even a single edge modification can trigger conflicts due to the intricate interdependencies within the graph. To address these challenges, we introduce SG-Tailor, an autoregressive model that predicts the conflict-free relationship between any two nodes. SG-Tailor not only infers inter-object relationships, including generating commonsense edges for newly added nodes, but also resolves conflicts arising from edge modifications to produce coherent, manipulated graphs for downstream tasks. For node addition, the model queries the target node and other nodes from the graph to predict the appropriate relationships. For edge modification, SG-Tailor employs a Cut-And-Stitch strategy to solve the conflicts and globally adjust the graph. Extensive experiments demonstrate that SG-Tailor outperforms competing methods by a large margin and can be seamlessly integrated as a plug-in module for scene generation and robotic manipulation tasks.

KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models

Zhiwei Wang,Zhongxin Liu,Ying Li,Hongyu Sun,Meng Xu,Yuqing Zhang

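Task: Study how to reduce knowledge-shortcut hallucinations of large language models in natural language generation tasks.

Motivation: Model hallucination is a major challenge in natural language generation; knowledge-shortcut hallucinations in particular are prevalent in generative models and undermine their robustness and reliability.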

Details

Method: Propose a high similarity pruning algorithm for data preprocessing to reduce spurious correlations in the data, and design a dedicated detection method to evaluate the mitigation strategy. Result: Experiments show the approach effectively reduces knowledge-shortcut hallucinations, particularly in fine-tuning tasks, without negatively affecting model performance on question answering. Conclusion: The work offers a new paradigm for mitigating specific hallucination issues in generative models, improving their robustness and reliability in real-world applications. Abstract: The emergence of large language models (LLMs) has significantly advanced the development of natural language processing (NLP), especially in text generation tasks like question answering. However, model hallucinations remain a major challenge in natural language generation (NLG) tasks due to their complex causes. We systematically expand on the causes of factual hallucinations from the perspective of knowledge shortcuts, analyzing hallucinations arising from correct and defect-free data and demonstrating that knowledge-shortcut hallucinations are prevalent in generative models. To mitigate this issue, we propose a high similarity pruning algorithm at the data preprocessing level to reduce spurious correlations in the data. Additionally, we design a specific detection method for knowledge-shortcut hallucinations to evaluate the effectiveness of our mitigation strategy. Experimental results show that our approach effectively reduces knowledge-shortcut hallucinations, particularly in fine-tuning tasks, without negatively impacting model performance in question answering. This work introduces a new paradigm for mitigating specific hallucination issues in generative models, enhancing their robustness and reliability in real-world applications.
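
To make the idea of similarity-based pruning at the data preprocessing level concrete, here is a minimal greedy sketch. It is not the paper's algorithm: the abstract does not specify the similarity measure, the threshold, or the pruning order, so the word-level Jaccard similarity, the `threshold` value, and the names `jaccard` and `prune_similar` are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (assumed measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def prune_similar(samples, threshold=0.8):
    """Greedily keep a sample only if it is below `threshold`
    similarity to every sample already kept."""
    kept = []
    for s in samples:
        if all(jaccard(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

# Toy training samples for illustration only.
samples = [
    "the capital of France is Paris",
    "The capital of France is Paris",
    "the capital of Spain is Madrid",
]
pruned = prune_similar(samples)
```

In this toy run the second sample is dropped as a near-duplicate of the first, while the third is kept because it shares only half of its vocabulary with the first.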

Improving Food Image Recognition with Noisy Vision Transformer

Tonmoy Ghosh,Edward Sazonov

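Task: Study the performance gains of Noisy Vision Transformers (NoisyViT) in food image classification.

Motivation: Food image recognition is challenging because of high variability and complexity; by injecting noise, NoisyViT reduces task complexity and adjusts the entropy of the system, which may improve classification performance.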

Details

Method: Fine-tune NoisyViT on three benchmark datasets (Food2K, Food-101, CNFOOD-241) and compare it against state-of-the-art models. Result: NoisyViT reaches Top-1 accuracies of 95%, 99.5%, and 96.6% on Food2K, Food-101, and CNFOOD-241, respectively, significantly outperforming existing methods. Conclusion: NoisyViT shows potential for dietary assessment, nutritional monitoring, and healthcare applications, laying the groundwork for future advances in vision-based food computing. Abstract: Food image recognition is a challenging task in computer vision due to the high variability and complexity of food images. In this study, we investigate the potential of Noisy Vision Transformers (NoisyViT) for improving food classification performance. By introducing noise into the learning process, NoisyViT reduces task complexity and adjusts the entropy of the system, leading to enhanced model accuracy. We fine-tune NoisyViT on three benchmark datasets: Food2K (2,000 categories, ~1M images), Food-101 (101 categories, ~100K images), and CNFOOD-241 (241 categories, ~190K images). The performance of NoisyViT is evaluated against state-of-the-art food recognition models. Our results demonstrate that NoisyViT achieves Top-1 accuracies of 95%, 99.5%, and 96.6% on Food2K, Food-101, and CNFOOD-241, respectively, significantly outperforming existing approaches. This study underscores the potential of NoisyViT for dietary assessment, nutritional monitoring, and healthcare applications, paving the way for future advancements in vision-based food computing. Code for reproducing NoisyViT for food recognition is available at NoisyViT_Food.

DomainCQA: Crafting Expert-Level QA from Domain-Specific Charts

Ling Zhong,Yujing Lu,Jing Yang,Weiming Li,Peng Wei,Yongheng Wang,Manni Duan,Qing Zhang

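Task: Build a domain-specific chart question answering (CQA) benchmark to evaluate multimodal large language models (MLLMs) on domain-specific tasks.

Motivation: Current CQA benchmarks focus on general-purpose tasks and fail to capture domain-specific challenges, which limits the evaluation of MLLMs in specialized domains.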

Details

Method: Propose the DomainCQA methodology and validate it by building AstroChart, a CQA benchmark for astronomy. Result: Evaluation shows that the main challenges for existing MLLMs are chart reasoning and deeper analysis that combines chart information with domain knowledge, rather than domain knowledge itself. Conclusion: DomainCQA provides a scalable and rigorous evaluation framework for domain-specific applications, helping improve MLLMs in specialized domains. Abstract: Chart Question Answering (CQA) benchmarks are essential for evaluating the capability of Multimodal Large Language Models (MLLMs) to interpret visual data. However, current benchmarks focus primarily on the evaluation of general-purpose CQA but fail to adequately capture domain-specific challenges. We introduce DomainCQA, a systematic methodology for constructing domain-specific CQA benchmarks, and demonstrate its effectiveness by developing AstroChart, a CQA benchmark in the field of astronomy. Our evaluation shows that chart reasoning and combining chart information with domain knowledge for deeper analysis and summarization, rather than domain-specific knowledge, pose the primary challenge for existing MLLMs, highlighting a critical gap in current benchmarks. By providing a scalable and rigorous framework, DomainCQA enables more precise assessment and improvement of MLLMs for domain-specific applications.

DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model

Kangwei Liu,Junwu Liu,Yun Cao,Jinlin Guo,Xiaowei Yi

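Task: Propose DisentTalk, a framework that decomposes 3DMM expression parameters through data-driven semantic disentanglement to enable fine-grained facial control.

Motivation: Existing methods are limited in either temporal consistency or spatial control, and integrating the two is hindered by incompatible control mechanisms and semantic entanglement of facial representations.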

Details

Method: Introduce a semantic disentanglement framework that decomposes 3DMM parameters, and develop a hierarchical latent diffusion architecture with region-aware attention mechanisms. Result: Experiments show the method outperforms existing methods on multiple metrics, including lip synchronization, expression quality, and temporal consistency. Conclusion: Through disentangled representations and a hierarchical diffusion architecture, DisentTalk balances spatial precision and temporal coherence. Abstract: Recent advances in talking face generation have significantly improved facial animation synthesis. However, existing approaches face fundamental limitations: 3DMM-based methods maintain temporal consistency but lack fine-grained regional control, while Stable Diffusion-based methods enable spatial manipulation but suffer from temporal inconsistencies. The integration of these approaches is hindered by incompatible control mechanisms and semantic entanglement of facial representations. This paper presents DisentTalk, introducing a data-driven semantic disentanglement framework that decomposes 3DMM expression parameters into meaningful subspaces for fine-grained facial control. Building upon this disentangled representation, we develop a hierarchical latent diffusion architecture that operates in 3DMM parameter space, integrating region-aware attention mechanisms to ensure both spatial precision and temporal coherence. To address the scarcity of high-quality Chinese training data, we introduce CHDTF, a Chinese high-definition talking face dataset. Extensive experiments show superior performance over existing methods across multiple metrics, including lip synchronization, expression quality, and temporal consistency. Project Page: https://kangweiiliu.github.io/DisentTalk.

FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models

Dahyun Jung,Seungyoon Lee,Hyeonseok Moon,Chanjun Park,Heuiseok Lim

Task: Introduce a new benchmark (FLEX) to evaluate the fairness of Large Language Models (LLMs) under extreme adversarial scenarios.

Motivation: Existing benchmarks may overlook intrinsic weaknesses of LLMs, which can generate biased responses even with simple adversarial instructions, leading to harmful societal impacts.

Details

Method: Develop FLEX, a benchmark integrating prompts designed to amplify potential biases, and compare its results with existing benchmarks. Result: FLEX reveals that traditional evaluations underestimate the inherent risks in LLMs, demonstrating the need for more stringent benchmarks. Conclusion: More rigorous evaluation benchmarks like FLEX are necessary to ensure the safety and fairness of LLMs. Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced interactions between users and models. These advancements concurrently underscore the need for rigorous safety evaluations due to the manifestation of social biases, which can lead to harmful societal impacts. Despite these concerns, existing benchmarks may overlook the intrinsic weaknesses of LLMs, which can generate biased responses even with simple adversarial instructions. To address this critical gap, we introduce a new benchmark, Fairness Benchmark in LLM under Extreme Scenarios (FLEX), designed to test whether LLMs can sustain fairness even when exposed to prompts constructed to induce bias. To thoroughly evaluate the robustness of LLMs, we integrate prompts that amplify potential biases into the fairness assessment. Comparative experiments between FLEX and existing benchmarks demonstrate that traditional evaluations may underestimate the inherent risks in models. This highlights the need for more stringent LLM evaluation benchmarks to guarantee safety and fairness.

Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval

Arun Reddy,Alexander Martin,Eugene Yang,Andrew Yates,Kate Sanders,Kenton Murray,Reno Kriz,Celso M. de Melo,Benjamin Van Durme,Rama Chellappa

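Task: Tackle the problem of text-to-video retrieval (T2VR).

Motivation: Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval.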

Details

Method: Propose Video-ColBERT, comprising fine-grained spatial and temporal token-wise interaction, query and visual expansions, and training with a dual sigmoid loss. Result: Outperforms other bi-encoder methods on common text-to-video retrieval benchmarks. Conclusion: Through its fine-grained interaction and training paradigm, Video-ColBERT produces strong yet compatible representations of video content. Abstract: In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
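
Late interaction, as popularized by ColBERT for text retrieval, scores a query against a document by matching each query token to its best document token and summing those maxima (often called MaxSim). The sketch below shows that generic scoring rule only; how Video-ColBERT combines spatial and temporal token interactions, its expansions, and its dual sigmoid loss are not detailed in the abstract and are not modeled here. The function names and toy embeddings are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: for each query token embedding,
    take its best dot-product match among document token embeddings,
    then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy 2-D token embeddings for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
video_a = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical frame or patch tokens
video_b = [[0.5, 0.5]]

score_a = maxsim_score(query, video_a)
score_b = maxsim_score(query, video_b)
```

Because query and video tokens are encoded independently and only interact through this cheap scoring step, the approach remains a bi-encoder method and video embeddings can be precomputed.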

Scaling Laws of Synthetic Data for Language Models

Zeyu Qin,Qingxiu Dong,Xingxing Zhang,Li Dong,Xiaolong Huang,Ziyi Yang,Mahmoud Khademi,Dongdong Zhang,Hany Hassan Awadalla,Yi R. Fung,Weizhu Chen,Minhao Cheng,Furu Wei

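Task: Study the scaling laws of synthetic data and their application to large language model pre-training.

Motivation: High-quality web data for pre-training is rapidly depleting; synthetic data is a potential alternative, but its scaling behavior remains unclear.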

Details

Method: Propose the SynthLLM framework, which uses a graph algorithm to extract and recombine high-level concepts from pre-training corpora to generate diverse, high-quality synthetic datasets. Result: Synthetic data generated by SynthLLM follows the rectified scaling law; performance plateaus near 300B tokens; larger models need fewer tokens to reach optimal performance. Conclusion: Synthetic data is a scalable and reliable alternative to pre-training corpora, offering a viable path toward continued improvement in model performance. Abstract: Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the rectified scaling law across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.

RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis

Yifei Feng,Mingxin Yang,Shuhui Yang,Sheng Zhang,Jiaao Yu,Zibo Zhao,Yuhong Liu,Jie Jiang,Chunchao Guo

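Task: Propose RomanTex, a multi-view texture generation framework that addresses the shortcomings of existing methods in multi-view image consistency and texture quality.

Motivation: Existing texture generation methods suffer from inconsistency and insufficient quality in 2D image generation and 3D texture synthesis; RomanTex aims to integrate 2D diffusion model priors with 3D representations to improve texture quality.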

Details

Method: Combine a multi-attention network with a 3D representation, introducing a 3D-aware Rotary Positional Embedding and a geometry-related Classifier-Free Guidance mechanism to strengthen the consistency and semantic correctness of generated textures. Result: Quantitative and qualitative evaluations and user studies show that RomanTex reaches state-of-the-art texture quality and consistency. Conclusion: By integrating 2D and 3D approaches, RomanTex significantly improves texture quality and consistency, providing an efficient solution for 3D asset generation. Abstract: Painting textures for existing geometries is a critical yet labor-intensive process in 3D asset generation. Recent advancements in text-to-image (T2I) models have led to significant progress in texture generation. Most existing research approaches this task by first generating images in 2D spaces using image diffusion models, followed by a texture baking process to achieve UV texture. However, these methods often struggle to produce high-quality textures due to inconsistencies among the generated multi-view images, resulting in seams and ghosting artifacts. In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects. To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. Additionally, we incorporate a decoupling characteristic in the multi-attention block to enhance the model's robustness in the image-to-texture task, enabling semantically-correct back-view synthesis. Furthermore, we introduce a geometry-related Classifier-Free Guidance (CFG) mechanism to further improve the alignment with both geometries and images. Quantitative and qualitative evaluations, along with comprehensive user studies, demonstrate that our method achieves state-of-the-art results in texture quality and consistency.

Context-Efficient Retrieval with Factual Decomposition

Yanhong Li,David Yunis,David McAllester,Jiawei Zhou

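Task: Study how pre-processing an external corpus into semi-structured "atomic facts" improves the efficiency of information retrieval.

Motivation: Retrieval from a dynamically expanding external corpus lets large language models (LLMs) incorporate current events, akin to episodic memory, but retrieval efficiency needs improvement.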

Details

Method: Pre-process the external corpus into semi-structured "atomic facts" and test their effect when the amount of retrieved text is limited. Result: The proposed form of atomic facts performs better on question answering tasks, especially when the amount of retrieved text is limited, reducing context size and improving inference efficiency. Conclusion: Pre-processing into atomic facts effectively improves retrieval efficiency, especially in resource-constrained settings. Abstract: There has recently been considerable interest in incorporating information retrieval into large language models (LLMs). Retrieval from a dynamically expanding external corpus of text allows a model to incorporate current events and can be viewed as a form of episodic memory. Here we demonstrate that pre-processing the external corpus into semi-structured "atomic facts" makes retrieval more efficient. More specifically, we demonstrate that our particular form of atomic facts improves performance on various question answering tasks when the amount of retrieved text is limited. Limiting the amount of retrieval reduces the size of the context and improves inference efficiency.

DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding

Lingyan Ran,Lidong Wang,Guangcong Wang,Peng Wang,Yanning Zhang

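Task: Address semantic awareness, wavelength spectrum diversity, and dataset scarcity in visible-to-infrared (V2IR) image translation.

Motivation: Existing methods treat V2IR as a conventional image synthesis task and overlook its specific challenges, such as semantic awareness and wavelength diversity.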

Details

Method: Propose the DiffV2IR framework, comprising a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM), and collect IR-500K, a large-scale infrared dataset. Result: DiffV2IR markedly improves V2IR translation performance, and experiments confirm its high-quality translations. Conclusion: By combining PLM, VLUM, and the IR-500K dataset, DiffV2IR effectively addresses the challenges of V2IR translation and is broadly applicable. Abstract: The task of translating visible-to-infrared images (V2IR) is inherently challenging due to three main obstacles: 1) achieving semantic-aware translation, 2) managing the diverse wavelength spectrum in infrared imagery, and 3) the scarcity of comprehensive infrared datasets. Current leading methods tend to treat V2IR as a conventional image-to-image synthesis challenge, often overlooking these specific issues. To address this, we introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM). PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning to learn the infrared transition from the full range to the target wavelength. To improve V2IR translation, VLUM incorporates unified Vision-Language Understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images compiled from various scenes and objects under various environmental conditions. Through the combination of PLM, VLUM, and the extensive IR-500K dataset, DiffV2IR markedly improves the performance of V2IR. Experiments validate DiffV2IR's excellence in producing high-quality translations, establishing its efficacy and broad applicability. The code, dataset, and DiffV2IR model will be available at https://github.com/LidongWang-26/DiffV2IR.

Hanlin Wu,Xufeng Duan,Zhenguang Cai

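Task: Compare how large audio-language models (LALMs) and humans integrate speaker characteristics during speech comprehension.

Motivation: Investigate whether LALMs process speaker-contextualized language in ways that parallel human cognitive mechanisms.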

Details

Method: Compare the processing patterns of two LALMs (Qwen2-Audio and Ultravox 0.5) with human electroencephalography (EEG) responses, using surprisal and entropy metrics from the models. Result: Qwen2-Audio shows increased surprisal for speaker-incongruent content, and its surprisal values significantly predict human N400 responses, while Ultravox 0.5 shows limited sensitivity to speaker characteristics. Conclusion: Current LALMs show both potential and limitations in processing speaker-contextualized language, revealing differences in social-linguistic processing mechanisms between humans and LALMs. Abstract: Voice-based AI development faces unique challenges in processing both linguistic and paralinguistic information. This study compares how large audio-language models (LALMs) and humans integrate speaker characteristics during speech comprehension, asking whether LALMs process speaker-contextualized language in ways that parallel human cognitive mechanisms. We compared two LALMs' (Qwen2-Audio and Ultravox 0.5) processing patterns with human EEG responses. Using surprisal and entropy metrics from the models, we analyzed their sensitivity to speaker-content incongruency across social stereotype violations (e.g., a man claiming to regularly get manicures) and biological knowledge violations (e.g., a man claiming to be pregnant). Results revealed that Qwen2-Audio exhibited increased surprisal for speaker-incongruent content and its surprisal values significantly predicted human N400 responses, while Ultravox 0.5 showed limited sensitivity to speaker characteristics. Importantly, neither model replicated the human-like processing distinction between social violations (eliciting N400 effects) and biological violations (eliciting P600 effects). These findings reveal both the potential and limitations of current LALMs in processing speaker-contextualized language, and suggest differences in social-linguistic processing mechanisms between humans and LALMs.
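
Surprisal and entropy are standard information-theoretic quantities computed from a model's next-token distribution: surprisal is the negative log probability of the token that actually occurred, and entropy is the expected surprisal over all candidates. The sketch below uses a hand-written distribution; the probabilities, the example continuation, and the function names are hypothetical and do not come from Qwen2-Audio, Ultravox 0.5, or the study's stimuli.

```python
import math

def surprisal(probs, token):
    """Surprisal in bits of the observed token: -log2 p(token)."""
    return -math.log2(probs[token])

def entropy(probs):
    """Shannon entropy in bits of the next-token distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Hypothetical next-token distribution after a male speaker says "I am ...".
next_token = {"tired": 0.5, "hungry": 0.25, "pregnant": 0.125, "late": 0.125}

s = surprisal(next_token, "pregnant")
h = entropy(next_token)
```

A speaker-sensitive model would be expected to assign lower probability, and therefore higher surprisal, to a continuation that conflicts with the speaker's voice.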

Color Conditional Generation with Sliced Wasserstein Guidance

Alexander Lobashev,Maria Larchenko,Dmitry Guskov

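Task: Propose a training-free method for image generation conditioned on the color distribution of a reference image.

Motivation: Address the semantically incoherent colors that existing methods produce in generated images.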

Details

Method: Modify the sampling process of a diffusion model to incorporate the differentiable Sliced 1-Wasserstein distance for matching color distributions. Result: Outperforms existing techniques in color similarity and semantic coherence. Conclusion: SW-Guidance matches the reference colors while preserving the semantic consistency of the generated image. Abstract: We propose SW-Guidance, a training-free approach for image generation conditioned on the color distribution of a reference image. While it is possible to generate an image with fixed colors by first creating an image from a text prompt and then applying a color style transfer method, this approach often results in semantically meaningless colors in the generated image. Our method solves this problem by modifying the sampling process of a diffusion model to incorporate the differentiable Sliced 1-Wasserstein distance between the color distribution of the generated image and the reference palette. Our method outperforms state-of-the-art techniques for color-conditional generation in terms of color similarity to the reference, producing images that not only match the reference colors but also maintain semantic coherence with the original text prompt. Our source code is available at https://github.com/alobashev/sw-guidance/.
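
The Sliced 1-Wasserstein distance named in the abstract can be estimated by projecting both point sets onto random directions, sorting the projections, and averaging the absolute differences, since the 1D Wasserstein distance between equal-size samples reduces to comparing sorted values. The sketch below is a plain Monte Carlo estimate on RGB triples, assuming equal sample counts. It is not the paper's implementation: guidance would need this quantity inside an automatic differentiation framework, which is not shown, and the function names, projection count, and toy colors are hypothetical.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a direction on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sliced_w1(xs, ys, n_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance
    between two equal-size point sets."""
    assert len(xs) == len(ys), "this sketch assumes equal sample counts"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_unit_vector(dim, rng)
        px = sorted(sum(a * t for a, t in zip(x, theta)) for x in xs)
        py = sorted(sum(a * t for a, t in zip(y, theta)) for y in ys)
        # 1D W1 between equal-size samples: mean gap between sorted values.
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections

# Toy RGB palettes in [0, 1] for illustration only.
reds = [[0.9, 0.1, 0.1], [0.8, 0.2, 0.1], [1.0, 0.0, 0.0]]
blues = [[0.1, 0.1, 0.9], [0.2, 0.1, 0.8], [0.0, 0.0, 1.0]]

d_same = sliced_w1(reds, reds)
d_diff = sliced_w1(reds, blues)
```

Sorting makes each 1D comparison cost O(n log n), which is why the sliced variant is attractive as a per-step guidance signal compared with solving a full optimal transport problem.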

The Greatest Good Benchmark: Measuring LLMs' Alignment with Utilitarian Moral Dilemmas

Giovanni Franco Gabriel Marraffini,Andrés Cotton,Noe Fabian Hsueh,Axel Fridman,Juan Wisznia,Luciano Del Corro

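Task: Evaluate the moral judgments of large language models in utilitarian dilemmas.

Motivation: Designing language models that are beneficial to humanity and free from harm requires ways of making decisions that maximize human well-being.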

Details

Method: Introduce the Greatest Good Benchmark and analyze the moral preferences of 15 diverse LLMs. Result: LLMs generally prefer impartial beneficence and reject instrumental harm, and their moral preferences diverge from established moral theories and lay population standards. Conclusion: The work reveals the "artificial moral compass" of LLMs, offering insights into their moral alignment. Abstract: The question of how to make decisions that maximise the well-being of all persons is highly relevant to designing language models that are beneficial to humanity and free from harm. We introduce the Greatest Good Benchmark to evaluate the moral judgments of LLMs using utilitarian dilemmas. Our analysis across 15 diverse LLMs reveals consistently encoded moral preferences that diverge from established moral theories and lay population moral standards. Most LLMs have a marked preference for impartial beneficence and rejection of instrumental harm. These findings showcase the 'artificial moral compass' of LLMs, offering insights into their moral alignment.

Color Transfer with Modulated Flows

Maria Larchenko,Alexander Lobashev,Dmitry Guskov,Vladimir Vladimirovich Palyulin

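Task: Propose ModFlows, a color transfer method based on rectified flows that adjusts the color distribution of a target image to match a reference image.

Motivation: Achieve efficient, high-quality color transfer through optimal transport and invertible transformations.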

Details

Method: 利用流的双射性质构建中间颜色分布数据集，训练编码器预测修正流模型的权重。 Result: 方法能够处理4K图像，在内容和风格相似性上达到最优性能。 Conclusion: ModFlows是一种高效且无需微调的颜色迁移方法，代码已开源。 Abstract: In this work, we introduce Modulated Flows (ModFlows), a novel approach for color transfer between images based on rectified flows. The primary goal of the color transfer is to adjust the colors of a target image to match the color distribution of a reference image. Our technique is based on optimal transport and executes color transfer as an invertible transformation within the RGB color space. The ModFlows utilizes the bijective property of flows, enabling us to introduce a common intermediate color distribution and build a dataset of rectified flows. We train an encoder on this dataset to predict the weights of a rectified model for new images. After training on a set of optimal transport plans, our approach can generate plans for new pairs of distributions without additional fine-tuning. We additionally show that the trained encoder provides an image embedding, associated only with its color style. The presented method is capable of processing 4K images and achieves the state-of-the-art performance in terms of content and style similarity. Our source code is available at https://github.com/maria-larchenko/modflows

1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training

Han Zhao,Haotian Wang,Yiping Peng,Sitong Zhao,Xiaoyu Tian,Shuaiting Chen,Yunjie Ji,Xiangang Li

Task: Build AM-DeepSeek-R1-Distilled, a large-scale, high-quality dataset for reasoning tasks, and train strong models on it.

Motivation: Provide high-quality, challenging reasoning problems to advance reasoning-oriented large language models (LLMs).

Details

Method: Collects problems from multiple open-source datasets, applies semantic deduplication and cleaning, generates responses by distillation from reasoning models, and verifies them rigorously (reference-answer checks for math problems, test cases for code problems, and a reward model for other tasks). Result: AM-Distill-Qwen-32B and AM-Distill-Qwen-72B outperform the compared models on multiple benchmarks. Conclusion: Releases 1.4 million problems and their responses to foster research on reasoning-oriented LLMs. Abstract: AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality and challenging reasoning problems. These problems are collected from a multitude of open-source datasets and subjected to semantic deduplication and meticulous cleaning to eliminate test set contamination. All responses within the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have undergone rigorous verification procedures. Mathematical problems are validated by checking against reference answers, code problems are verified using test cases, and other tasks are evaluated with the aid of a reward model. The AM-Distill-Qwen-32B model, which was trained through only simple Supervised Fine-Tuning (SFT) using this batch of data, outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. Additionally, the AM-Distill-Qwen-72B model surpassed the DeepSeek-R1-Distill-Llama-70B model on all benchmarks as well. We are releasing these 1.4 million problems and their corresponding responses to the research community with the objective of fostering the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset is published at https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M.
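
The three verification routes described in the abstract (reference answers for math, test cases for code, a reward model for everything else) can be pictured as a simple dispatcher. The sketch below is illustrative only: the sample layout, the `normalize` helper, the callable `solution`, and the 0.5 reward threshold are all assumptions, not details from the dataset's actual pipeline.

```python
def normalize(answer):
    """Crude answer normalization (an assumed convention for this sketch)."""
    return answer.strip().strip("$").rstrip(".").lower()


def verify(sample, reward_model=None, threshold=0.5):
    """Return True if a distilled response passes the check for its task kind."""
    kind = sample["kind"]
    if kind == "math":
        # Math: compare the final answer against the reference answer.
        return normalize(sample["answer"]) == normalize(sample["reference"])
    if kind == "code":
        # Code: the solution (here a Python callable) must pass every test case.
        return all(sample["solution"](*args) == expected
                   for args, expected in sample["tests"])
    # Other tasks: accept when the (hypothetical) reward model scores high enough.
    return reward_model(sample["prompt"], sample["response"]) >= threshold
```

A real pipeline would sandbox code execution and use a stronger answer matcher; the point here is only the per-task routing.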

Zhongyu Yang,Jun Chen,Dannong Xu,Junjie Fei,Xiaoqian Shen,Liangbing Zhao,Chun-Mei Feng,Mohamed Elhoseiny

Task: Automate the generation of multimodal Wikipedia-style articles.

Motivation: Traditional knowledge discovery and collection require substantial human effort, and existing methods focus on text-only generation, overlooking the importance of multimodal content.

Details

Method: Proposes WikiAutoGen, a system that integrates text and images and introduces a multi-perspective self-reflection mechanism to improve accuracy and comprehensiveness. Result: On the WikiSeek benchmark, WikiAutoGen improves on previous methods by 8%-29% and produces more accurate, coherent, and visually enriched articles. Conclusion: Through multimodal integration and self-reflection, WikiAutoGen markedly improves the quality of automated article generation. Abstract: Knowledge discovery and collection are intelligence-intensive tasks that traditionally require significant human effort to ensure high-quality outputs. Recent research has explored multi-agent frameworks for automating Wikipedia-style article generation by retrieving and synthesizing information from the internet. However, these methods primarily focus on text-only generation, overlooking the importance of multimodal content in enhancing informativeness and engagement. In this work, we introduce WikiAutoGen, a novel system for automated multimodal Wikipedia-style article generation. Unlike prior approaches, WikiAutoGen retrieves and integrates relevant images alongside text, enriching both the depth and visual appeal of generated content. To further improve factual accuracy and comprehensiveness, we propose a multi-perspective self-reflection mechanism, which critically assesses retrieved content from diverse viewpoints to enhance reliability, breadth, and coherence. Additionally, we introduce WikiSeek, a benchmark comprising Wikipedia articles with topics paired with both textual and image-based representations, designed to evaluate multimodal knowledge generation on more challenging topics. Experimental results show that WikiAutoGen outperforms previous methods by 8%-29% on our WikiSeek benchmark, producing more accurate, coherent, and visually enriched Wikipedia-style articles. We show some of our generated examples at https://wikiautogen.github.io/

Exploring Cultural Nuances in Emotion Perception Across 15 African Languages

Ibrahim Said Ahmad,Shiran Dudy,Tadesse Destaw Belay,Idris Abdulmumin,Seid Muhie Yimam,Shamsuddeen Hassan Muhammad,Kenneth Church

Task: Analyze cross-linguistic characteristics of emotion expression in 15 African languages.

Motivation: Emotion expression in African languages is understudied, which limits the development of emotion detection tools; culturally inclusive NLP systems are needed.

Details

Method: Examines four dimensions of emotion expression: text length, sentiment polarity, emotion co-occurrence, and intensity variation. Result: Finds language-specific patterns, such as longer Somali texts and more negative sentiment in Nigerian languages, while emotion co-occurrence points to universal psychological connections. Conclusion: Emotion detection needs language-specific approaches, with room for transfer learning across related languages. Abstract: Understanding how emotions are expressed across languages is vital for building culturally-aware and inclusive NLP systems. However, emotion expression in African languages is understudied, limiting the development of effective emotion detection tools in these languages. In this work, we present a cross-linguistic analysis of emotion expression in 15 African languages. We examine four key dimensions of emotion representation: text length, sentiment polarity, emotion co-occurrence, and intensity variations. Our findings reveal diverse language-specific patterns in emotional expression -- with Somali texts typically longer, while others like IsiZulu and Algerian Arabic show more concise emotional expression. We observe a higher prevalence of negative sentiment in several Nigerian languages compared to lower negativity in languages like IsiXhosa. Further, emotion co-occurrence analysis demonstrates strong cross-linguistic associations between specific emotion pairs (anger-disgust, sadness-fear), suggesting universal psychological connections. Intensity distributions show multimodal patterns with significant variations between language families; Bantu languages display similar yet distinct profiles, while Afroasiatic languages and Nigerian Pidgin demonstrate wider intensity ranges. These findings highlight the need for language-specific approaches to emotion detection while identifying opportunities for transfer learning across related languages.
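
Emotion co-occurrence, one of the four dimensions in the abstract, can be counted directly from multi-label annotations. The following is a minimal sketch, assuming each text carries a set of emotion labels; the function name and input layout are illustrative rather than taken from the paper.

```python
from collections import Counter
from itertools import combinations


def emotion_cooccurrence(label_sets):
    """Count how often each unordered pair of emotions labels the same text."""
    counts = Counter()
    for labels in label_sets:
        # Sort so that each pair has one canonical (alphabetical) key.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts
```

Running this per language and normalizing by corpus size would allow pairs such as anger-disgust to be compared across languages.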

Clustering data by reordering them

Axel Descamps,Sélène Forget,Aliénor Lahlou,Claire Lavergne,Camille Berthelot,Guillaume Stirnemann,Rodolphe Vuilleumier,Nicolas Chéron

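Task: Propose a new algorithm, based on the similarity between elements, for grouping elements into families for analysis.

Motivation: Address the variety of problems in a data-driven world, across different fields of analysis, by explicitly accounting for noise.

Details

Method: Reorders the data according to the distance between elements and then performs the analysis automatically with easily understandable parameters. Result: The algorithm was successfully applied to classify biomolecule conformations, gene sequences, cells, images, and experimental conditions. Conclusion: By grouping on similarity and handling noise, the new algorithm offers an effective tool for data analysis in many fields. Abstract: Grouping elements into families to analyse them separately is a standard analysis procedure in many areas of science. We propose herein a new algorithm based on the simple idea that members of a family look like each other and do not resemble elements foreign to the family. After reordering the data according to the distance between elements, the analysis is automatically performed with easily-understandable parameters. Noise is explicitly taken into account to deal with the variety of problems of a data-driven world. We applied the algorithm to sort biomolecule conformations, gene sequences, cells, images, and experimental conditions.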

HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection

Maryam Bala,Amina Imam Abubakar,Abdulhamid Abubakar,Abdulkadir Shehu Bichi,Hafsa Kabir Ahmad,Sani Abdullahi Sani,Idris Abdulmumin,Shamsuddeen Hassan Muhamad,Ibrahim Said Ahmad

Task: Identify hallucinations and related overgeneration errors in the outputs of large language models (LLMs).

Motivation: Provide a nuanced, model-aware understanding of hallucination occurrence and severity.

Details

Method: Uses natural language inference and fine-tunes a ModernBERT model on a synthetic dataset of 400 samples. Result: The model reaches an IoU score of 0.032 and a correlation score of 0.422, indicating a moderately positive correlation between model confidence and the actual presence of hallucinations. Conclusion: Hallucination detection is intricate and the performance is in line with expectations, since hallucination boundaries are hard to pin down precisely. Abstract: This paper presents our findings of the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes, MU-SHROOM, which focuses on identifying hallucinations and related overgeneration errors in large language models (LLMs). The shared task involves detecting specific text spans that constitute hallucinations in the outputs generated by LLMs in 14 languages. To address this task, we aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English. We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples, achieving an Intersection over Union (IoU) score of 0.032 and a correlation score of 0.422. These results indicate a moderately positive correlation between the model's confidence scores and the actual presence of hallucinations. The IoU score indicates that our model has a relatively low overlap between the predicted hallucination span and the truth annotation. The performance is unsurprising, given the intricate nature of hallucination detection. Hallucinations often manifest subtly, relying on context, making pinpointing their exact boundaries formidable.
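
To make the Intersection over Union (IoU) figure concrete, the sketch below computes a character-level IoU between predicted and annotated hallucination spans. It is a simplified illustration: the half-open `(start, end)` offsets and the convention that two empty span sets score 1.0 are assumptions, and the shared task's official scorer may differ.

```python
def span_iou(pred_spans, gold_spans):
    """Character-level IoU between two lists of half-open (start, end) spans."""
    pred = {i for start, end in pred_spans for i in range(start, end)}
    gold = {i for start, end in gold_spans for i in range(start, end)}
    if not pred and not gold:
        # Assumed convention: no hallucination predicted and none annotated.
        return 1.0
    return len(pred & gold) / len(pred | gold)
```

For instance, a prediction covering characters 0-10 against a gold span covering 5-15 shares 5 of 15 distinct characters, giving an IoU of one third; a low score such as the reported 0.032 means predicted and annotated spans rarely overlap.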

Uncertainty-Aware Decomposed Hybrid Networks

Sina Ditzel,Achref Jaziri,Iuliia Pliushch,Visvanathan Ramesh

Task: Propose a hybrid approach that combines the adaptability of neural networks with domain-specific quasi-invariant operators to make image recognition more robust.

Motivation: Current models depend on large amounts of labeled data and lack robustness, so more transparent and robust methods are needed.

Details

Method: Decomposes recognition into multiple task-specific operators, supported by a novel confidence measurement that prioritizes reliable features and accounts for noise. Result: Performs well in traffic sign detection experiments, especially in semi-supervised and unsupervised scenarios. Conclusion: The method improves transparency and robustness and suits data-constrained applications. Abstract: The robustness of image recognition algorithms remains a critical challenge, as current models often depend on large quantities of labeled data. In this paper, we propose a hybrid approach that combines the adaptability of neural networks with the interpretability, transparency, and robustness of domain-specific quasi-invariant operators. Our method decomposes the recognition into multiple task-specific operators that focus on different characteristics, supported by a novel confidence measurement tailored to these operators. This measurement enables the network to prioritize reliable features and accounts for noise. We argue that our design enhances transparency and robustness, leading to improved performance, particularly in low-data regimes. Experimental results in traffic sign detection highlight the effectiveness of the proposed method, especially in semi-supervised and unsupervised scenarios, underscoring its potential for data-constrained applications.

A multitask transformer to sign language translation using motion gesture primitives

Fredy Alejandro Mendoza López,Jefferson Rodriguez,Fabio Martínez

Task: Automatically translate between spatiotemporal sign language representations and natural text language.

Motivation: Reduce the communication barrier faced by the deaf community, whose sign language has no written form, and improve on existing translation methods.

Details

Method: Proposes a multitask Transformer architecture that combines a gloss learning representation with a dense motion representation to enhance gesture and motion information. Result: Outperforms existing methods on the CoL-SLTD dataset with BLEU-4 scores of 72.64% (split 1) and 14.64% (split 2), and reaches a BLEU-4 of 11.58% on the RWTH-PHOENIX-Weather 2014 T dataset. Conclusion: By introducing an intermediate textual representation and motion information, the method improves the accuracy and robustness of sign language translation. Abstract: The absence of effective communication with the deaf population represents the main social gap in this community. Furthermore, sign language, the main communication tool of deaf people, is unlettered, i.e., there is no formal written representation. Consequently, a main challenge today is automatic translation between spatiotemporal sign representations and natural text language. Recent approaches are based on encoder-decoder architectures, where the most relevant strategies integrate attention modules to enhance non-linear correspondences. Moreover, many of these approaches require complex training and architectural schemes to achieve reasonable predictions because of the absence of intermediate text projections. However, they are still limited by the redundant background information of the video sequences. This work introduces a multitask transformer architecture that includes a gloss learning representation to achieve a more suitable translation. The proposed approach also includes a dense motion representation that enhances gestures and includes kinematic information, a key component in sign language. From this representation it is possible to avoid background information and exploit the geometry of the signs. In addition, it includes spatiotemporal representations that facilitate the alignment between gestures and glosses as an intermediate textual representation. The proposed approach outperforms the state of the art on the CoL-SLTD dataset, achieving a BLEU-4 of 72.64% in split 1 and a BLEU-4 of 14.64% in split 2. Additionally, the strategy was validated on the RWTH-PHOENIX-Weather 2014 T dataset, achieving a competitive BLEU-4 of 11.58%.

Anomaly Detection Using Computer Vision: A Comparative Analysis of Class Distinction and Performance Metrics

Md. Barkat Ullah Tusher,Shartaz Khan Akash,Amirul Islam Showmik

Task: Study computer-vision-based anomaly detection, focusing on distinguishing three classes (authorized personnel, intruders, and non-human entities) and evaluating performance.

Motivation: Optimize deep learning for real-time face recognition and classification, improving the accuracy and efficiency of anomaly detection in high-security environments.

Details

Method: Combines OpenCV with deep learning, builds a convolutional neural network in TensorFlow, uses a MobileNetV2 model to optimize real-time performance, and preprocesses the dataset with image augmentation and normalization. Result: Classification accuracies are 90.20% (authorized personnel), 98.60% (intruders), and 75.80% (non-human entities), at an average processing rate of 30 frames per second. Conclusion: Advanced feature selection and data augmentation significantly improve detection performance and point to ways of optimizing real-time anomaly detection systems for high-security environments. Abstract: This paper showcases an experimental study on anomaly detection using computer vision. The study focuses on class distinction and performance evaluation, combining OpenCV with deep learning techniques while employing a TensorFlow-based convolutional neural network for real-time face recognition and classification. The system effectively distinguishes among three classes: authorized personnel (admin), intruders, and non-human entities. A MobileNetV2-based deep learning model is utilized to optimize real-time performance, ensuring high computational efficiency without compromising accuracy. Extensive dataset preprocessing, including image augmentation and normalization, enhances the model's generalization capabilities. Our analysis demonstrates classification accuracies of 90.20% for admin, 98.60% for intruders, and 75.80% for non-human detection, while maintaining an average processing rate of 30 frames per second. The study leverages transfer learning, batch normalization, and Adam optimization to achieve stable and robust learning, and a comparative analysis of class differentiation strategies highlights the impact of feature extraction techniques and training methodologies. The results indicate that advanced feature selection and data augmentation significantly enhance detection performance, particularly in distinguishing human from non-human scenes. As an experimental study, this research provides critical insights into optimizing deep learning-based surveillance systems for high-security environments and improving the accuracy and efficiency of real-time anomaly detection.

AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation

Itay Nakash,Nitay Calderon,Eyal Ben David,Elad Hoffer,Roi Reichart

Task: Reduce the computational overhead and latency of large language models (LLMs) in focused domains through vocabulary adaptation (AdaptiVocab).

Motivation: The broad applicability of general-purpose LLMs comes at a high computational cost, yet general capabilities are unnecessary in focused domains, where efficiency can be gained by optimizing the vocabulary.

Details

Method: Proposes AdaptiVocab, which replaces vocabulary tokens with domain-specific n-gram tokens to reduce the number of tokens needed for input and output generation, followed by a lightweight fine-tuning phase. Result: Across three niche domains, AdaptiVocab reduces token usage by over 25% while maintaining performance. Conclusion: AdaptiVocab is an effective vocabulary adaptation method that markedly improves LLM efficiency in focused domains without sacrificing performance. Abstract: Large Language Models (LLMs) have shown impressive versatility as general purpose models. However, their broad applicability comes at a high-cost computational overhead, particularly in auto-regressive decoding where each step requires a forward pass. In domain-specific settings, general-purpose capabilities are unnecessary and can be exchanged for efficiency. In this work, we take a novel perspective on domain adaptation, reducing latency and computational costs by adapting the vocabulary to focused domains of interest. We introduce AdaptiVocab, an end-to-end approach for vocabulary adaptation, designed to enhance LLM efficiency in low-resource domains. AdaptiVocab can be applied to any tokenizer and architecture, modifying the vocabulary by replacing tokens with domain-specific n-gram-based tokens, thereby reducing the number of tokens required for both input processing and output generation. AdaptiVocab initializes new n-token embeddings using an exponentially weighted combination of existing embeddings and employs a lightweight fine-tuning phase that can be efficiently performed on a single GPU. We evaluate two 7B LLMs across three niche domains, assessing efficiency, generation quality, and end-task performance. Our results show that AdaptiVocab reduces token usage by over 25% without compromising performance.
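
The abstract says new n-gram token embeddings are initialized from an exponentially weighted combination of existing embeddings but does not give the weights. The sketch below shows one plausible form under stated assumptions: weights proportional to `base ** position`, so later constituent tokens count more. Both the direction of the weighting and the `base` parameter are hypothetical and may not match AdaptiVocab's actual scheme.

```python
def init_ngram_embedding(token_embeddings, base=2.0):
    """Combine the embeddings of an n-gram's constituent tokens into one vector.

    token_embeddings: list of equal-length float lists, in token order.
    base: growth factor of the (assumed) exponential weights.
    """
    raw = [base ** i for i in range(len(token_embeddings))]
    total = sum(raw)
    weights = [r / total for r in raw]  # normalized to sum to 1
    dim = len(token_embeddings[0])
    return [sum(w * emb[k] for w, emb in zip(weights, token_embeddings))
            for k in range(dim)]
```

With `base=1.0` this reduces to a plain average, which makes the effect of the exponential weighting easy to compare.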

Your ViT is Secretly an Image Segmentation Model

Tommie Kerssies,Niccolò Cavagnero,Alexander Hermans,Narges Norouzi,Giuseppe Averta,Bastian Leibe,Gijs Dubbelman,Daan de Geus

Task: Explore how to perform image segmentation with a plain Vision Transformer (ViT) architecture, without task-specific components.

Motivation: Existing methods introduce inductive biases through task-specific components such as convolutional adapters and Transformer decoders, but the study shows that a ViT with large-scale pre-training can learn these biases itself.

Details

Method: Proposes the Encoder-only Mask Transformer (EoMT), which uses the plain ViT architecture for image segmentation and reaches high performance through large-scale models and pre-training. Result: EoMT matches existing methods in segmentation accuracy while being significantly faster (e.g., 4x faster with ViT-L). Conclusion: Compute is better spent on scaling the ViT itself than on adding architectural complexity; EoMT offers the best balance between accuracy and speed. Abstract: Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.

HausaNLP at SemEval-2025 Task 2: Entity-Aware Fine-tuning vs. Prompt Engineering in Entity-Aware Machine Translation

Abdulhamid Abubakar,Hamidatu Abdulkadir,Ibrahim Rabiu Abdullahi,Abubakar Auwal Khalid,Ahmad Mustapha Wali,Amina Aminu Umar,Maryam Bala,Sani Abdullahi Sani,Ibrahim Said Ahmad,Shamsuddeen Hassan Muhammad,Idris Abdulmumin,Vukosi Marivate

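Task: Develop entity-aware machine translation models that accurately translate English sentences into target languages, with particular attention to named entities.

Motivation: Named entities often pose challenges for machine translation systems, so dedicated approaches are needed.

Details

Method: Employs several different systems and describes the experimental results in detail. Result: The paper presents the experimental results and discusses the insights gained from them. Conclusion: Summarizes the findings on entity-aware machine translation and suggests directions for future improvement. Abstract: This paper presents our findings for SemEval 2025 Task 2, a shared task on entity-aware machine translation (EA-MT). The goal of this task is to develop translation models that can accurately translate English sentences into target languages, with a particular focus on handling named entities, which often pose challenges for MT systems. The task covers 10 target languages with English as the source. In this paper, we describe the different systems we employed, detail our results, and discuss insights gained from our experiments.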

Compositional Caching for Training-free Open-vocabulary Attribute Detection

Marco Garosi,Alessandro Conti,Gaowen Liu,Elisa Ricci,Massimiliano Mancini

Task: Propose a training-free method (ComCa) for open-vocabulary attribute detection.

Motivation: Overcome the limitations and poor scalability of existing methods, which rely on manual annotation and a predefined attribute set.

Details

Method: Builds an auxiliary image cache using web-scale databases and large language models, and refines the predictions of vision-language models with soft attribute labels aggregated by image similarity. Result: Significantly outperforms zero-shot and cache-based baselines on public datasets and is competitive with training-based methods. Conclusion: A carefully designed training-free method can effectively address open-vocabulary attribute detection. Abstract: Attribute detection is crucial for many computer vision tasks, as it enables systems to describe properties such as color, texture, and material. Current approaches often rely on labor-intensive annotation processes which are inherently limited: objects can be described at an arbitrary level of detail (e.g., color vs. color shades), leading to ambiguities when the annotators are not instructed carefully. Furthermore, they operate within a predefined set of attributes, reducing scalability and adaptability to unforeseen downstream applications. We present Compositional Caching (ComCa), a training-free method for open-vocabulary attribute detection that overcomes these constraints. ComCa requires only the list of target attributes and objects as input, using them to populate an auxiliary cache of images by leveraging web-scale databases and Large Language Models to determine attribute-object compatibility. To account for the compositional nature of attributes, cache images receive soft attribute labels. Those are aggregated at inference time based on the similarity between the input and cache images, refining the predictions of underlying Vision-Language Models (VLMs). Importantly, our approach is model-agnostic, compatible with various VLMs. Experiments on public datasets demonstrate that ComCa significantly outperforms zero-shot and cache-based baselines, competing with recent training-based methods, proving that a carefully designed training-free approach can successfully address open-vocabulary attribute detection.
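
The inference step in the abstract (soft labels of cache images aggregated by similarity to the input, then used to refine VLM predictions) can be sketched as a similarity-weighted vote blended with the VLM scores. This is a generic cache-model illustration rather than ComCa's exact formulation: the softmax temperature, the blending weight `alpha`, and the data layout are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def refine_with_cache(query_emb, cache, vlm_scores, alpha=0.5, temperature=0.1):
    """Blend VLM attribute scores with similarity-weighted cache soft labels.

    cache: list of (embedding, {attribute: soft_label}) pairs.
    vlm_scores: {attribute: score} from the underlying vision-language model.
    """
    sims = [cosine(query_emb, emb) for emb, _ in cache]
    top = max(sims)
    # Softmax over similarities: closer cache images get more weight.
    weights = [math.exp((s - top) / temperature) for s in sims]
    total = sum(weights)
    refined = {}
    for attr, score in vlm_scores.items():
        cache_score = sum(w / total * labels.get(attr, 0.0)
                          for w, (_, labels) in zip(weights, cache))
        refined[attr] = (1 - alpha) * score + alpha * cache_score
    return refined
```

Because the cache contribution depends only on embeddings and soft labels, the same routine works on top of any VLM that yields per-attribute scores, which mirrors the model-agnostic claim.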

Writing as a testbed for open ended agents

Sian Gooding,Lucia Lopez-Rivilla,Edward Grefenstette

Task: Investigate the potential of large language models (LLMs) as collaborative co-writers, particularly on open-ended writing tasks.

Motivation: Open-ended tasks are challenging for LLMs because the solution space is vast and success lacks a clear definition; writing provides an ideal testbed for studying them.

Details

Method: Analyses three LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o) in terms of action diversity, human alignment, and iterative improvement capabilities. Result: Establishes a framework for evaluating autonomous writing agents and highlights fundamental challenges and potential solutions for building strong systems in open-ended domains. Conclusion: The study provides a benchmark for applying LLMs to open-ended tasks and points to directions for future work. Abstract: Open-ended tasks are particularly challenging for LLMs due to the vast solution space, demanding both expansive exploration and adaptable strategies, especially when success lacks a clear, objective definition. Writing, with its vast solution space and subjective evaluation criteria, provides a compelling testbed for studying such problems. In this paper, we investigate the potential of LLMs to act as collaborative co-writers, capable of suggesting and implementing text improvements autonomously. We analyse three prominent LLMs - Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o - focusing on how their action diversity, human alignment, and iterative improvement capabilities impact overall performance. This work establishes a framework for benchmarking autonomous writing agents and, more broadly, highlights fundamental challenges and potential solutions for building systems capable of excelling in diverse open-ended domains.

HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models

Mingzhen Huang,Fu-Jen Chu,Bugra Tekin,Kevin J Liang,Haoyu Ma,Weiyao Wang,Xingyu Chen,Pierre Gleize,Hongfei Xue,Siwei Lyu,Kris Kitani,Matt Feiszli,Hao Tang

Task: Propose HOIGPT, a token-based generative method that unifies perception and generation of 3D hand-object interactions (HOI).

Motivation: Offer the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a range of conditional signals (such as text, objects, and partial sequences).

Details

Method: Uses a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions, and introduces a physically grounded HOI tokenizer and a motion-aware language model. Result: Sets new state-of-the-art results on text generation (+2.01% R Precision) and HOI generation (-2.56 FID). Conclusion: HOIGPT provides an efficient, unified solution for 3D HOI perception and generation. Abstract: We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interaction (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences). At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R Precision) and HOI generation (-2.56 FID) across multiple tasks and benchmarks.

Gemma 3 Technical Report

Gemma Team,Aishwarya Kamath,Johan Ferret,Shreya Pathak,Nino Vieillard,Ramona Merhej,Sarah Perrin,Tatiana Matejovicova,Alexandre Ramé,Morgane Rivière,Louis Rouillard,Thomas Mesnard,Geoffrey Cideron,Jean-bastien Grill,Sabela Ramos,Edouard Yvinec,Michelle Casbon,Etienne Pot,Ivo Penchev,Gaël Liu,Francesco Visin,Kathleen Kenealy,Lucas Beyer,Xiaohai Zhai,Anton Tsitsulin,Robert Busa-Fekete,Alex Feng,Noveen Sachdeva,Benjamin Coleman,Yi Gao,Basil Mustafa,Iain Barr,Emilio Parisotto,David Tian,Matan Eyal,Colin Cherry,Jan-Thorsten Peter,Danila Sinopalnikov,Surya Bhupatiraju,Rishabh Agarwal,Mehran Kazemi,Dan Malkin,Ravin Kumar,David Vilar,Idan Brusilovsky,Jiaming Luo,Andreas Steiner,Abe Friesen,Abhanshu Sharma,Abheesht Sharma,Adi Mayrav Gilady,Adrian Goedeckemeyer,Alaa Saade,Alex Feng,Alexander Kolesnikov,Alexei Bendebury,Alvin Abdagic,Amit Vadi,András György,André Susano Pinto,Anil Das,Ankur Bapna,Antoine Miech,Antoine Yang,Antonia Paterson,Ashish Shenoy,Ayan Chakrabarti,Bilal Piot,Bo Wu,Bobak Shahriari,Bryce Petrini,Charlie Chen,Charline Le Lan,Christopher A. Choquette-Choo,CJ Carey,Cormac Brick,Daniel Deutsch,Danielle Eisenbud,Dee Cattle,Derek Cheng,Dimitris Paparas,Divyashree Shivakumar Sreepathihalli,Doug Reid,Dustin Tran,Dustin Zelle,Eric Noland,Erwin Huizenga,Eugene Kharitonov,Frederick Liu,Gagik Amirkhanyan,Glenn Cameron,Hadi Hashemi,Hanna Klimczak-Plucińska,Harman Singh,Harsh Mehta,Harshal Tushar Lehri,Hussein Hazimeh,Ian Ballantyne,Idan Szpektor,Ivan Nardini,Jean Pouget-Abadie,Jetha Chan,Joe Stanton,John Wieting,Jonathan Lai,Jordi Orbay,Joseph Fernandez,Josh Newlan,Ju-yeong Ji,Jyotinder Singh,Kat Black,Kathy Yu,Kevin Hui,Kiran Vodrahalli,Klaus Greff,Linhai Qiu,Marcella Valentine,Marina Coelho,Marvin Ritter,Matt Hoffman,Matthew Watson,Mayank Chaturvedi,Michael Moynihan,Min Ma,Nabila Babar,Natasha Noy,Nathan Byrd,Nick Roy,Nikola Momchev,Nilay Chauhan,Noveen Sachdeva,Oskar Bunyan,Pankil Botarda,Paul Caron,Paul Kishan Rubenstein,Phil Culliton,Philipp Schmid,Pier Giuseppe Sessa,Pingmei Xu,Piotr Stanczyk,Pouya Tafti,Rakesh Shivanna,Renjie Wu,Renke Pan,Reza Rokni,Rob Willoughby,Rohith Vallu,Ryan Mullins,Sammy Jerome,Sara Smoot,Sertan Girgin,Shariq Iqbal,Shashir Reddy,Shruti Sheth,Siim Põder,Sijal Bhatnagar,Sindhu Raghuram Panyam,Sivan Eiger,Susan Zhang,Tianqi Liu,Trevor Yacovone,Tyler Liechty,Uday Kalra,Utku Evci,Vedant Misra,Vincent Roseberry,Vlad Feinberg,Vlad Kolesnikov,Woohyun Han,Woosuk Kwon,Xi Chen,Yinlam Chow,Yuvein Zhu,Zichuan Wei,Zoltan Egyed,Victor Cotruta,Minh Giang,Phoebe Kirk,Anand Rao,Kat Black,Nabila Babar,Jessica Lo,Erica Moreira,Luiz Gustavo Martins,Omar Sanseviero,Lucas Gonzalez,Zach Gleicher,Tris Warkentin,Vahab Mirrokni,Evan Senter,Eli Collins,Joelle Barral,Zoubin Ghahramani,Raia Hadsell,Yossi Matias,D. Sculley,Slav Petrov,Noah Fiedel,Noam Shazeer,Oriol Vinyals,Jeff Dean,Demis Hassabis,Koray Kavukcuoglu,Clement Farabet,Elena Buchatskaya,Jean-Baptiste Alayrac,Rohan Anil,Dmitry Lepikhin,Sebastian Borgeaud,Olivier Bachem,Armand Joulin,Alek Andreev,Cassidy Hardin,Robert Dadashi,Léonard Hussenot

Task: Introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, with vision understanding, multilingual support, and long-context handling.

Motivation: Extend the capabilities of the Gemma family and improve performance on vision, language, and long-context tasks while optimizing memory usage.

Details

Method: Changes the model architecture to reduce KV-cache memory by increasing the ratio of local to global attention layers, and trains with distillation. Result: Gemma 3 outperforms Gemma 2 in both pre-trained and instruction-finetuned versions, and some models approach Gemini-1.5-Pro. Conclusion: Gemma 3 performs strongly on multimodal tasks, and the models are released to the community. Abstract: We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
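
A back-of-the-envelope calculation shows why raising the local-to-global layer ratio and keeping the local span short shrinks the KV cache: local layers only need to keep the last `window` tokens, while global layers keep the whole context. The sketch below is a rough estimate under assumed parameters; the layer layout (repeating blocks of local layers followed by one global layer) and all numeric values used with it are hypothetical, not Gemma 3's published configuration.

```python
def kv_cache_bytes(context_len, n_layers, local_per_global, window,
                   n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size for a stack that interleaves local and global layers.

    local_per_global: number of local layers per global layer (0 means all global).
    window: number of tokens a local (sliding-window) layer keeps.
    """
    n_global = n_layers // (local_per_global + 1)
    n_local = n_layers - n_global
    # Keys and values: 2 tensors per token per layer.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value
    cached_tokens = n_global * context_len + n_local * min(context_len, window)
    return per_token * cached_tokens
```

For example, with 12 layers, a context of 1,000 tokens, and a 100-token window, moving from all-global layers to five local layers per global layer cuts the cached tokens from 12,000 to 3,000, a fourfold reduction that grows with context length.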

FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing

Yufan Ren,Zicong Jiang,Tong Zhang,Søren Forchhammer,Sabine Süsstrunk

Task: Propose a text-guided image editing method based on frequency-selective optimization.

Motivation: Editing with existing text-to-image (T2I) models often causes unwanted changes such as loss of detail and color shifts, because the optimization does not distinguish between frequency bands.

Details

Method: Uses wavelet transforms to decompose images into frequency bands, enabling frequency-selective optimization within local spatial regions, and extends the approach to 3D texture editing. Result: Quantitative evaluations and user studies show that the method produces high-quality, precise edits. Conclusion: Selectively optimizing specific frequency bands addresses the shortcomings of existing methods and improves editing precision. Abstract: Text-guided image editing using Text-to-Image (T2I) models often fails to yield satisfactory results, frequently introducing unintended modifications, such as the loss of local detail and color changes. In this paper, we analyze these failure cases and attribute them to the indiscriminate optimization across all frequency bands, even though only specific frequencies may require adjustment. To address this, we introduce a simple yet effective approach that enables the selective optimization of specific frequency bands within localized spatial regions for precise edits. Our method leverages wavelets to decompose images into different spatial resolutions across multiple frequency bands, enabling precise modifications at various levels of detail. To extend the applicability of our approach, we provide a comparative analysis of different frequency-domain techniques. Additionally, we extend our method to 3D texture editing by performing frequency decomposition on the triplane representation, enabling frequency-aware adjustments for 3D textures. Quantitative evaluations and user studies demonstrate the effectiveness of our method in producing high-quality and precise edits.
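
The wavelet decomposition the abstract relies on can be illustrated with a single level of the 2D Haar transform, which splits an image into one low-frequency band and three detail bands. This is a minimal sketch: the choice of the Haar wavelet, the averaging normalization, and the even-sized grayscale input are assumptions, and the paper may use a different wavelet family and apply it in latent space.

```python
def haar_1d(values):
    """One Haar step: pairwise averages (low-pass) and half-differences (high-pass)."""
    half = len(values) // 2
    low = [(values[2 * i] + values[2 * i + 1]) / 2 for i in range(half)]
    high = [(values[2 * i] - values[2 * i + 1]) / 2 for i in range(half)]
    return low, high


def haar_2d(image):
    """One-level 2D Haar transform of a list-of-rows image with even height and width.

    Returns (ll, lh, hl, hh): ll is the coarse approximation; the other three
    bands hold detail along the columns, along the rows, and along both.
    """
    row_low, row_high = [], []
    for row in image:
        low, high = haar_1d(row)
        row_low.append(low)
        row_high.append(high)

    def along_columns(matrix):
        low_cols, high_cols = zip(*(haar_1d(list(col)) for col in zip(*matrix)))
        # Transpose back so each result is again a list of rows.
        return [list(r) for r in zip(*low_cols)], [list(r) for r in zip(*high_cols)]

    ll, lh = along_columns(row_low)
    hl, hh = along_columns(row_high)
    return ll, lh, hl, hh
```

A frequency-selective edit would then modify only some of these bands (for instance the detail bands inside a masked region) before inverting the transform.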

SemEval-2025 Task 9: The Food Hazard Detection Challenge

Korbinian Randl,John Pavlopoulos,Aron Henriksson,Tony Lindgren,Juli Bakagianni

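Task: Explore text-based food hazard prediction with long-tail distributed classes, split into two subtasks: predicting whether a text implies one of ten food-hazard categories and identifying the associated food category, and a more fine-grained classification of hazard and product labels.

Motivation: Address the challenge of long-tail data in food hazard text classification and test whether synthetic data is effective for oversampling.

Details

Method: Uses large language models to generate synthetic data for oversampling, and compares fine-tuned encoder-decoder, encoder-only, and decoder-only systems. Result: Synthetic data is highly effective for oversampling long-tail distributions, and the three system types perform comparably on both subtasks. Conclusion: Synthetic data and a range of models handle the long-tail problem in food hazard text classification, and a new annotated dataset has been released. Abstract: In this challenge, we explored text-based food hazard prediction with long tail distributed classes. The task was divided into two subtasks: (1) predicting whether a web text implies one of ten food-hazard categories and identifying the associated food category, and (2) providing a more fine-grained classification by assigning a specific label to both the hazard and the product. Our findings highlight that large language model-generated synthetic data can be highly effective for oversampling long-tail distributions. Furthermore, we find that fine-tuned encoder-only, encoder-decoder, and decoder-only systems achieve comparable maximum performance across both subtasks. During this challenge, we gradually released (under CC BY-NC-SA 4.0) a novel set of 6,644 manually labeled food-incident reports.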
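
Oversampling a long-tail distribution with synthetic data starts from deciding how many extra examples each rare class needs. The helper below is a simple sketch of that bookkeeping, assuming the goal is to bring every class up to a common target count; the function name and the balancing rule are illustrative, and the generation of the synthetic texts by an LLM is left out.

```python
from collections import Counter


def synthetic_quota(labels, target=None):
    """Return how many synthetic samples each class needs to reach `target`.

    If `target` is None, classes are topped up to the size of the largest class.
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    return {cls: max(0, target - n) for cls, n in counts.items()}
```

Head classes that already meet the target get a quota of zero, so only the tail is topped up.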

Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces

Chenyangguang Zhang,Alexandros Delitzas,Fangjinhua Wang,Ruida Zhang,Xiangyang Ji,Marc Pollefeys,Francis Engelmann

Task: Predict functional 3D scene graphs for real-world indoor environments.

Motivation: Traditional 3D scene graphs capture only the spatial relationships between objects, whereas functional 3D scene graphs capture objects, interactive elements, and their functional relationships.

Details

Method: Leverage foundation models, including visual language models (VLMs) and large language models (LLMs), to encode functional knowledge. Result: The method significantly outperforms adapted baselines (such as Open3DSG and ConceptGraph) on an extended SceneFun3D dataset and the newly collected FunGraph3D dataset. Conclusion: Functional 3D scene graphs are effective at modeling complex scene functionality and support downstream tasks such as 3D question answering and robotic manipulation. Abstract: We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. Unlike traditional 3D scene graphs that focus on spatial relationships of objects, functional 3D scene graphs capture objects, interactive elements, and their functional relationships. Due to the lack of training data, we leverage foundation models, including visual language models (VLMs) and large language models (LLMs), to encode functional knowledge. We evaluate our approach on an extended SceneFun3D dataset and a newly collected dataset, FunGraph3D, both annotated with functional 3D scene graphs. Our method significantly outperforms adapted baselines, including Open3DSG and ConceptGraph, demonstrating its effectiveness in modeling complex scene functionalities. We also demonstrate downstream applications such as 3D question answering and robotic manipulation using functional 3D scene graphs. See our project page at https://openfungraph.github.io

Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy

Athiya Deviyani,Fernando Diaz

Task: Propose a contextual meta-evaluation method for automatic evaluation metrics that compares their local metric accuracy.

Motivation: Existing meta-evaluation approaches focus on the absolute and relative quality of metrics across arbitrary system outputs, whereas in practice metrics are applied in highly specific contexts, so a meta-evaluation closer to real usage is needed.

Details

Method: Compare the local metric accuracy of evaluation metrics across translation, speech recognition, and ranking tasks. Result: Local metric accuracy varies markedly across evaluation contexts, in both absolute value and relative effectiveness. Conclusion: Context-specific metric evaluation should be preferred over global evaluation. Abstract: Meta-evaluation of automatic evaluation metrics (assessing evaluation metrics themselves) is crucial for accurately benchmarking natural language processing systems and has implications for scientific inquiry, production model development, and policy enforcement. While existing approaches to metric meta-evaluation focus on general statements about the absolute and relative quality of metrics across arbitrary system outputs, in practice, metrics are applied in highly contextual settings, often measuring the performance for a highly constrained set of system outputs. For example, we may only be interested in evaluating a specific model or class of models. We introduce a method for contextual metric meta-evaluation by comparing the local metric accuracy of evaluation metrics. Across translation, speech recognition, and ranking tasks, we demonstrate that the local metric accuracies vary both in absolute value and relative effectiveness as we shift across evaluation contexts. This observed variation highlights the importance of adopting context-specific metric evaluations over global ones.
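
One plausible reading of "local metric accuracy" is the fraction of output pairs, drawn from a single evaluation context, on which the metric orders the pair the same way as a human judgment. The abstract does not give the formal definition, so the sketch below is an assumption, and the name `local_metric_accuracy` and the tuple layout are hypothetical.

```python
def local_metric_accuracy(pairs):
    """Pairwise agreement between a metric and human preference in one context.

    pairs: list of (metric_score_a, metric_score_b, human_prefers_a).
    A tie in metric scores is counted as the metric preferring output b
    (an arbitrary choice made for this sketch).
    """
    if not pairs:
        raise ValueError("a context needs at least one pair")
    agree = sum((score_a > score_b) == prefers_a for score_a, score_b, prefers_a in pairs)
    return agree / len(pairs)
```

Computing this separately for each context (for instance, only pairs produced by one class of models) is what would expose the variation across contexts that the abstract reports.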

Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery

Sara Al-Emadi,Yin Yang,Ferda Ofli

Task: Study the generalisability and robustness of object detectors under real-world distribution shifts, such as shifts across climate zones and disaster types.

Motivation: In real-world deployment, target distributions often differ from the training data and performance degrades, so models need to be more robust to unseen conditions.

Details

Method: Introduce Real-World Distribution Shifts (RWDS), a suite of three benchmarking datasets covering shifts across climate zones and across disasters and geographic regions. Result: To the authors' knowledge, these are the first DG benchmarking datasets for object detection in real-world, high-impact contexts. Conclusion: RWDS is a valuable resource for evaluating the robustness and generalisation of future object detection models. Abstract: Object detectors have achieved remarkable performance in many applications; however, these deep learning models are typically designed under the i.i.d. assumption, meaning they are trained and evaluated on data sampled from the same (source) distribution. In real-world deployment, however, target distributions often differ from source data, leading to substantial performance degradation. Domain Generalisation (DG) seeks to bridge this gap by enabling models to generalise to Out-Of-Distribution (OOD) data without access to target distributions during training, enhancing robustness to unseen conditions. In this work, we examine the generalisability and robustness of state-of-the-art object detectors under real-world distribution shifts, focusing particularly on spatial domain shifts. Despite the need, a standardised benchmark dataset specifically designed for assessing object detection under realistic DG scenarios is currently lacking. To address this, we introduce Real-World Distribution Shifts (RWDS), a suite of three novel DG benchmarking datasets that focus on humanitarian and climate change applications. These datasets enable the investigation of domain shifts across (i) climate zones and (ii) various disasters and geographic regions. To our knowledge, these are the first DG benchmarking datasets tailored for object detection in real-world, high-impact contexts. We aim for these datasets to serve as valuable resources for evaluating the robustness and generalisation of future object detection models. Our datasets and code are available at https://github.com/RWGAI/RWDS.

A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950

Zhao Fang,Liang-Chun Wu,Xuening Kong,Spencer Dean Stewart

Task: Compare large language models (LLMs) with traditional natural language processing (NLP) tools on word segmentation, part-of-speech tagging, and named entity recognition for Chinese texts from 1900 to 1950.

Motivation: Historical Chinese documents are hard to analyze because of their logographic script, the absence of natural word boundaries, and significant linguistic change.

Details

Method: Using a sample dataset from the Shanghai Library Republican Journal corpus, compare traditional tools (such as Jieba and spaCy) with LLMs (such as GPT-4o, Claude 3.5, and the GLM series). Result: LLMs outperform traditional methods on all metrics but at considerably higher computational cost, reflecting a trade-off between accuracy and efficiency. LLMs also cope better with genre-specific challenges such as poetry and with temporal variation. Conclusion: The contextual learning capabilities of LLMs can reduce the need for domain-specific training data and thereby advance NLP approaches to historical texts. Abstract: This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods in all metrics, albeit at considerably higher computational costs, highlighting a trade-off between accuracy and efficiency. Additionally, LLMs better handle genre-specific challenges such as poetry and temporal variations (i.e., pre-1920 versus post-1920 texts), demonstrating that their contextual learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.

FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images

Rong Wang,Fabian Prada,Ziyan Wang,Zhongshi Jiang,Chengxiang Yin,Junxuan Li,Shunsuke Saito,Igor Santesteban,Javier Romero,Rohan Joshi,Hongdong Li,Jason Saragih,Yaser Sheikh

Task: Propose a new method for reconstructing personalized 3D human avatars with realistic animation from only a few images.

Motivation: Because body shapes, poses, and clothing types vary widely, existing methods usually require lengthy per-subject optimization, which limits practical use.

Details

Method: Learn a universal prior from over a thousand clothed humans to enable instant feedforward generation and zero-shot generalization; jointly infer personalized avatar shape, skinning weights, and pose-dependent deformations; and design a 3D canonicalization process with multi-frame feature aggregation. Result: Experiments show more authentic 3D reconstruction and animation than prior work, with direct generalization to phone-captured inputs. Conclusion: Joint inference and the canonicalization design improve geometric fidelity and reduce deformation artifacts, giving the method practical potential. Abstract: We present a novel method for reconstructing personalized 3D human avatars with realistic animation from only a few images. Due to the large variations in body shapes, poses, and cloth types, existing methods mostly require hours of per-subject optimization during inference, which limits their practical applications. In contrast, we learn a universal prior from over a thousand clothed humans to achieve instant feedforward generation and zero-shot generalization. Specifically, instead of rigging the avatar with shared skinning weights, we jointly infer personalized avatar shape, skinning weights, and pose-dependent deformations, which effectively improves overall geometric fidelity and reduces deformation artifacts. Moreover, to normalize pose variations and resolve coupled ambiguity between canonical shapes and skinning weights, we design a 3D canonicalization process to produce pixel-aligned initial conditions, which helps to reconstruct fine-grained geometric details. We then propose a multi-frame feature aggregation to robustly reduce artifacts introduced in canonicalization and fuse a plausible avatar preserving person-specific identities. Finally, we train the model in an end-to-end framework on a large-scale capture dataset, which contains diverse human subjects paired with high-quality 3D scans. Extensive experiments show that our method generates more authentic reconstruction and animation than state-of-the-art methods, and can be directly generalized to inputs from casually taken phone photos. Project page and code are available at https://github.com/rongakowang/FRESA.

Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking

Xiaoyu Tian,Sitong Zhao,Haotian Wang,Shuaiting Chen,Yunjie Ji,Yiping Peng,Han Zhao,Xiangang Li

Task: Propose Multi-round Thinking, a test-time scaling approach that iteratively refines a model's reasoning to improve performance.

Motivation: Current large language models are limited in handling long texts and in reinforcement learning training efficiency, so a simple and effective way to improve performance is needed.

Details

Method: Use the previous round's answer as part of the prompt for the next round, iteratively refining the model's reasoning. Result: Performance improves across multiple models and benchmarks (such as AIME 2024 and MATH-500); for example, QwQ-32B's accuracy on AIME 2024 rises from 80.3% to 82.1%. Conclusion: Multi-round Thinking is a broadly applicable, simple approach that yields stable performance gains and has potential for future test-time scaling work. Abstract: Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Despite this, current models are constrained by limitations in handling long texts and reinforcement learning (RL) training efficiency. To address these issues, we propose a simple yet effective test-time scaling approach, Multi-round Thinking. This method iteratively refines model reasoning by leveraging previous answers as prompts for subsequent rounds. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements on various benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. For instance, the accuracy of QwQ-32B improved from 80.3% (Round 1) to 82.1% (Round 2) on the AIME 2024 dataset, while DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. These results confirm that Multi-round Thinking is a broadly applicable, straightforward approach to achieving stable enhancements in model performance, underscoring its potential for future developments in test-time scaling techniques. The key prompt: {Original question prompt} The assistant's previous answer is: {last round answer} , and please re-answer.
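
The key prompt quoted in the abstract can be turned into a small loop. This is a minimal sketch, not the authors' code: `generate` is a hypothetical callable standing in for any model call, the function names are invented for illustration, and the whitespace around the inserted answer is normalised here as an assumption.

```python
def next_round_prompt(question, last_answer):
    """Fill the abstract's key-prompt template with the previous answer."""
    return (
        f"{question} The assistant's previous answer is: {last_answer}, "
        "and please re-answer."
    )


def multi_round(question, generate, rounds=2):
    """Run `rounds` rounds; each later round sees only the previous answer.

    generate: hypothetical function mapping a prompt string to an answer string.
    """
    answer = generate(question)  # Round 1 uses the original question alone.
    for _ in range(rounds - 1):
        answer = generate(next_round_prompt(question, answer))
    return answer
```

Note that only the final answer of the previous round is carried forward, not its full reasoning trace, which keeps the prompt for each round short.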

On Symmetries in Convolutional Weights

Bilal Alsallakh,Timothy Wroge,Vivek Miglani,Narine Kokhlikyan

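Task: Explore the symmetry of the mean k x k weight kernel in each layer of convolutional neural networks.

Motivation: Understand why the mean kernels of internal layers tend to be symmetric about their centers rather than favoring specific directions, and what effect this has.

Details

Method: Analyze how symmetry emerges across datasets and models and how it relates to architectural choices. Result: Symmetry correlates with desirable properties such as shift and flip consistency and may be an inherent inductive bias of convolutional neural networks. Conclusion: Symmetry plays an important role in convolutional neural networks and may influence their performance and design. Abstract: We explore the symmetry of the mean k x k weight kernel in each layer of various convolutional neural networks. Unlike individual neurons, the mean kernels in internal layers tend to be symmetric about their centers instead of favoring specific directions. We investigate why this symmetry emerges in various datasets and models, and how it is impacted by certain architectural choices. We show how symmetry correlates with desirable properties such as shift and flip consistency, and might constitute an inherent inductive bias in convolutional neural networks.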
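
To make "the mean kernel is symmetric about its center" concrete, the sketch below averages a set of k x k kernels and measures how far the result is from its own 180-degree rotation. Point symmetry is only one possible notion of symmetry and is an assumption here; the abstract does not say which measure the authors use, and the function names are hypothetical.

```python
def mean_kernel(kernels):
    """Element-wise mean of a list of k x k kernels (nested lists)."""
    k = len(kernels[0])
    n = len(kernels)
    return [[sum(ker[i][j] for ker in kernels) / n for j in range(k)] for i in range(k)]


def center_asymmetry(kernel):
    """Mean absolute difference between a kernel and its 180-degree rotation.

    Zero means the kernel is point-symmetric about its center.
    """
    k = len(kernel)
    total = sum(
        abs(kernel[i][j] - kernel[k - 1 - i][k - 1 - j])
        for i in range(k)
        for j in range(k)
    )
    return total / (k * k)
```

Two individually directional kernels that point in opposite directions average to a point-symmetric mean kernel, which matches the abstract's contrast between individual neurons and layer means.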

Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

Seungone Kim,Ian Wu,Jinu Lee,Xiang Yue,Seongyun Lee,Mingyeong Moon,Kiril Gashteovski,Carolin Lawrence,Julia Hockenmaier,Graham Neubig,Sean Welleck

Task: Study whether spending more test-time compute improves a language model's evaluation capability.

Motivation: As language model outputs become more natural, judging their quality gets harder; scaling test-time compute has proven effective for math and code, so the paper asks whether evaluation can benefit in the same way.

Details

Method: Use reasoning models (language models that natively generate long chain-of-thought reasoning) as evaluators, leveraging more test-time compute by (1) using reasoning models and (2) prompting them to evaluate not only the response as a whole (outcome evaluation) but also each step in the response (process evaluation). Result: Evaluator performance improves monotonically as more reasoning tokens are generated, and the more accurate evaluators can be used to rerank multiple generations, showing that compute spent at evaluation time can be as effective as compute spent at generation time. Conclusion: More test-time compute improves the evaluation capability of language models, and evaluation-time compute can be as effective as generation-time compute. Abstract: As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.
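
Reranking multiple generations with a process evaluator can be sketched as follows. Everything here is an illustrative assumption: `evaluate_steps` is a hypothetical callable that returns one score per reasoning step, and aggregating step scores with `min` (a response is only as good as its weakest step) is a design choice made for this sketch, not something the abstract specifies.

```python
def process_score(step_scores):
    """Aggregate per-step scores into one response score (assumed: minimum)."""
    return min(step_scores)


def rerank(candidates, evaluate_steps):
    """Return the candidate whose weakest step scores highest.

    candidates: list of responses, each a non-empty list of reasoning steps.
    evaluate_steps: hypothetical function mapping a response to a list of
    per-step scores, e.g. produced by a reasoning model acting as evaluator.
    """
    return max(candidates, key=lambda response: process_score(evaluate_steps(response)))
```

An outcome evaluator would instead score each response once as a whole; the process variant spends more evaluation-time compute because it scores every step.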

Face Spoofing Detection using Deep Learning

Najeebullah,Maaz Salman,Zar Nawab Khan Swati

Task: Evaluate MobileNetV2, ResNET50, and Vision Transformer (ViT) on image spoof detection.

Motivation: Digital image spoofing is a significant security threat to biometric authentication systems that rely on facial recognition, so detection needs to improve.

Details

Method: Using a dataset of 150,986 images split into training, testing, and validation sets, compare the three models on accuracy, precision, recall, and F1 score. Result: MobileNetV2 performs best on the test set (91.59% accuracy); on the validation set both MobileNetV2 and ViT do well, with MobileNetV2 slightly ahead (97.17% versus 96.36% accuracy). Conclusion: MobileNetV2 offers balanced performance and robustness and suits real-world deployment; model selection matters in security-sensitive settings. Abstract: Digital image spoofing has emerged as a significant security threat in biometric authentication systems, particularly those relying on facial recognition. This study evaluates the performance of three vision-based models, MobileNetV2, ResNET50, and Vision Transformer (ViT), for spoof detection in image classification, utilizing a dataset of 150,986 images divided into training (140,002), testing (10,984), and validation (39,574) sets. Spoof detection is critical for enhancing the security of image recognition systems, and this research compares the models' effectiveness through accuracy, precision, recall, and F1 score metrics. Results reveal that MobileNetV2 outperforms other architectures on the test dataset, achieving an accuracy of 91.59%, precision of 91.72%, recall of 91.59%, and F1 score of 91.58%, compared to ViT's 86.54%, 88.28%, 86.54%, and 86.39%, respectively. On the validation dataset, MobileNetV2 and ViT excel, with MobileNetV2 slightly ahead at 97.17% accuracy versus ViT's 96.36%. MobileNetV2 demonstrates faster convergence during training and superior generalization to unseen data, despite both models showing signs of overfitting. These findings highlight MobileNetV2's balanced performance and robustness, making it the preferred choice for spoof detection applications where reliability on new data is essential. The study underscores the importance of model selection in security-sensitive contexts and suggests MobileNetV2 as a practical solution for real-world deployment.
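
In the reported numbers, recall equals accuracy for both models (91.59% and 86.54%). That is what support-weighted averaging produces by construction, since weighting each class's recall by its share of the samples sums to the overall fraction of correct predictions. Whether the authors actually used weighted averaging is an assumption; the sketch below only shows the computation, with hypothetical names and labels.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n         # class share of the dataset
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1
```

With macro (unweighted) averaging the two quantities would generally differ on an imbalanced dataset, so the averaging mode is worth stating when such metrics are reported.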

CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation

Nengbo Wang,Xiaotian Han,Jagdip Singh,Jing Ma,Vipin Chaudhary

Task: Propose CausalRAG, a new framework that integrates causal graphs into the retrieval process to address the limitations of traditional RAG systems.

Motivation: Traditional RAG systems disrupt contextual integrity and rely too heavily on semantic similarity, which limits their performance.

Details

Method: CausalRAG constructs and traces causal relationships to preserve contextual continuity and improve retrieval precision. Result: CausalRAG outperforms regular RAG and graph-based RAG approaches across several metrics. Conclusion: Grounding retrieval in causal reasoning is a promising approach to knowledge-intensive tasks. Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP), particularly through Retrieval-Augmented Generation (RAG), which enhances LLM capabilities by integrating external knowledge. However, traditional RAG systems face critical limitations, including disrupted contextual integrity due to text chunking, and over-reliance on semantic similarity for retrieval. To address these issues, we propose CausalRAG, a novel framework that incorporates causal graphs into the retrieval process. By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, leading to more accurate and interpretable responses. We evaluate CausalRAG against regular RAG and graph-based RAG approaches, demonstrating its superiority across several metrics. Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.

Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding

Hao Guo,Jianfei Zhu,Wei Fan,Chunzhi Yi,Feng Jiang

Task: Propose Multi-ref EC, a task framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects.

Motivation: Existing REC approaches are limited to object category descriptions and single-attribute intention descriptions, which makes them a poor fit for human-robot interaction in real-world scenarios.

Details

Method: Introduce the SIGAR dataset, which combines state and intention expressions with embodied references, and use multi-attribute references to improve localization. Result: Experiments show that multi-attribute references improve localization performance and that single-attribute references are insufficient for natural interaction scenarios. Conclusion: Multi-attribute reference expressions are important for advancing visual-language understanding. Abstract: Referring expression comprehension (REC) aims at achieving object localization based on natural language descriptions. However, existing REC approaches are constrained by object category descriptions and single-attribute intention descriptions, hindering their application in real-world scenarios. In natural human-robot interactions, users often express their desires through individual states and intentions, accompanied by guiding gestures, rather than detailed object descriptions. To address this challenge, we propose Multi-ref EC, a novel task framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects. We introduce the State-Intention-Gesture Attributes Reference (SIGAR) dataset, which combines state and intention expressions with embodied references. Through extensive experiments with various baseline models on SIGAR, we demonstrate that properly ordered multi-attribute references contribute to improved localization performance, revealing that single-attribute reference is insufficient for natural human-robot interaction scenarios. Our findings underscore the importance of multi-attribute reference expressions in advancing visual-language understanding.

Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation

Krisztian Balog,Donald Metzler,Zhen Qin

Task: Examine the biases that can arise when large language models (LLMs) serve as rankers, judges, and AI-assisted content creators in information retrieval (IR).

Motivation: As LLMs are widely adopted in IR, the biases that may result from interactions between these LLM-based components need close study.

Details

Method: Synthesize existing research and design new experiments that analyze how LLM-based rankers and assistants influence LLM-based judges. Result: The paper gives the first empirical evidence that LLM judges are significantly biased towards LLM-based rankers and shows that their ability to discern subtle system performance differences is limited; it finds no evidence of bias against AI-generated content. Conclusion: The LLM-driven information ecosystem needs a holistic view, and the paper offers initial guidelines and a research agenda for the reliable use of LLMs in IR evaluation. Abstract: Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This widespread adoption necessitates a critical examination of potential biases arising from the interplay between these LLM-based components. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges. We provide the first empirical evidence of LLM judges exhibiting significant bias towards LLM-based rankers. Furthermore, we observe limitations in LLM judges' ability to discern subtle system performance differences. Contrary to some previous findings, our preliminary study does not find evidence of bias against AI-generated content. These results highlight the need for a more holistic view of the LLM-driven information ecosystem. To this end, we offer initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation.

Adaptive Multi-Order Graph Regularized NMF with Dual Sparsity for Hyperspectral Unmixing

Hui Chen,Liangyu Liu,Xianchao Xiu,Wanquan Liu

Task: Propose an adaptive multi-order graph regularized nonnegative matrix factorization method (MOGNMF) for hyperspectral unmixing.

Motivation: Existing NMF methods with graph learning mostly consider first-order or second-order nearest neighbor relationships and require manual parameter tuning, so they fail to characterize intrinsic data structures.

Details

Method: Introduce multi-order graph regularization to exploit global and local information together, learn the multi-order graph parameters adaptively, and embed dual sparsity (an ℓ1/2-norm and an ℓ2,1-norm). Result: Experiments on simulated and real hyperspectral data show that the method delivers better unmixing results. Conclusion: Adaptive multi-order graph regularization combined with dual sparsity improves hyperspectral unmixing performance. Abstract: Hyperspectral unmixing (HU) is a critical yet challenging task in remote sensing. However, existing nonnegative matrix factorization (NMF) methods with graph learning mostly focus on first-order or second-order nearest neighbor relationships and usually require manual parameter tuning, which fails to characterize intrinsic data structures. To address the above issues, we propose a novel adaptive multi-order graph regularized NMF method (MOGNMF) with three key features. First, multi-order graph regularization is introduced into the NMF framework to exploit global and local information comprehensively. Second, the parameters associated with the multi-order graph are learned adaptively through a data-driven approach. Third, dual sparsity is embedded to obtain better robustness, i.e., $\ell_{1/2}$-norm on the abundance matrix and $\ell_{2,1}$-norm on the noise matrix. To solve the proposed model, we develop an alternating minimization algorithm whose subproblems have explicit solutions, thus ensuring effectiveness. Experiments on simulated and real hyperspectral data indicate that the proposed method delivers better unmixing results.
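
For reference, the two sparsity terms named in the abstract are commonly defined as below. This is a sketch of the usual definitions, not the paper's objective: the symbols $A$ (abundance matrix with entries $a_{ij}$) and $E$ (noise matrix with columns $e_j$) are assumptions, and some authors sum the $\ell_{2,1}$-norm over rows rather than columns.

```latex
% l_{1/2} regularizer on the abundance matrix A (entry-wise square roots)
\|A\|_{1/2} = \sum_{i,j} |a_{ij}|^{1/2}

% l_{2,1}-norm on the noise matrix E (sum of column-wise l_2 norms)
\|E\|_{2,1} = \sum_{j} \|e_j\|_2 = \sum_{j} \Bigl( \sum_{i} e_{ij}^{2} \Bigr)^{1/2}
```

The first term pushes individual abundances towards zero more aggressively than an $\ell_1$ penalty, while the second encourages whole columns of $E$ to vanish, so the noise is concentrated in a few columns.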

Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning

Sky CH-Wang,Darshan Deshpande,Smaranda Muresan,Anand Kannappan,Rebecca Qian

Task: Propose BLUR, a benchmark that evaluates the search and reasoning abilities of general AI assistants over multi-modal and multilingual inputs.

Motivation: Close the performance gap of general AI assistants on complex search and reasoning tasks and encourage progress.

Details

Method: Build a set of 573 real-world validated questions covering multi-modal and multilingual inputs, and release part of it through a public leaderboard. Result: Humans score 98% on average, while the best-performing system scores around 56%, a substantial gap. Conclusion: BLUR gives general AI assistants a challenging testbed that should drive further progress. Abstract: We introduce Browsing Lost Unformed Recollections, a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. BLUR introduces a set of 573 real-world validated questions that demand searching and reasoning across multi-modal and multilingual inputs, as well as proficient tool use, in order to excel. Humans easily ace these questions (scoring on average 98%), while the best-performing system scores around 56%. To facilitate progress toward addressing this challenging and aspirational use case for general AI assistants, we release 350 questions through a public leaderboard, retain the answers to 250 of them, and have the rest as a private test set.

Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing

Ruiyi Wang,Yushuo Zheng,Zicheng Zhang,Chunyi Li,Shuaicheng Liu,Guangtao Zhai,Xiaohong Liu

Task: Propose a new hazing-dehazing pipeline, consisting of the HazeGen and DiffDehaze frameworks, to address the reliance of existing dehazing methods on pre-trained models and the inefficiency of generative diffusion models.

Motivation: Existing dehazing methods depend heavily on pre-trained models and their training data, and the potential of generative diffusion models for dehazing remains underused because sampling is slow.

Details

Method: Combine HazeGen (which uses a pre-trained text-to-image diffusion model to generate high-quality hazy images) with DiffDehaze (which uses AccSamp to accelerate sampling, with AlignOp reducing the number of steps while preserving fidelity). Result: Experiments show better dehazing performance and visual quality than existing methods. Conclusion: The proposed pipeline addresses the limitations of existing methods and improves dehazing efficiency and quality. Abstract: Existing real-world image dehazing methods primarily attempt to fine-tune pre-trained models or adapt their inference procedures, thus heavily relying on the pre-trained models and associated training data. Moreover, restoring heavily distorted information under dense haze requires generative diffusion models, whose potential in dehazing remains underutilized partly due to their lengthy sampling processes. To address these limitations, we introduce a novel hazing-dehazing pipeline consisting of a Realistic Hazy Image Generation framework (HazeGen) and a Diffusion-based Dehazing framework (DiffDehaze). Specifically, HazeGen harnesses robust generative diffusion priors of real-world hazy images embedded in a pre-trained text-to-image diffusion model. By employing specialized hybrid training and blended sampling strategies, HazeGen produces realistic and diverse hazy images as high-quality training data for DiffDehaze. To alleviate the inefficiency and fidelity concerns associated with diffusion-based methods, DiffDehaze adopts an Accelerated Fidelity-Preserving Sampling process (AccSamp). The core of AccSamp is the Tiled Statistical Alignment Operation (AlignOp), which can provide a clean and faithful dehazing estimate within a small fraction of sampling steps to reduce complexity and enable effective fidelity guidance. Extensive experiments demonstrate the superior dehazing performance and visual quality of our approach over existing methods. The code is available at https://github.com/ruiyi-w/Learning-Hazing-to-Dehazing.

QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition

Yuxuan Hu,Xiaodong Chen,Cuiping Li,Hong Chen,Jing Zhang

Task: Propose QUAD, a framework that uses activation decomposition via Singular Value Decomposition (SVD) to enable effective 4-bit quantization and address activation outliers in large language models (LLMs).

Motivation: Existing quantization methods lose accuracy on medium-sized LLMs (such as Llama-3-8B) because of activation outliers, so a more effective approach is needed.

Details

Method: QUAD estimates activation singular vectors offline to construct an orthogonal transformation matrix P, shifts outliers to additional dimensions kept in full precision, and quantizes the remaining components to 4-bit. Result: QUAD reaches 94% to 96% accuracy under W4A4 quantization, and 98% with W4A4/A8 plus parameter-efficient fine-tuning. Conclusion: QUAD is an efficient quantization method with small accuracy loss that suits medium-sized LLMs. Abstract: Large Language Models (LLMs) excel in diverse applications but suffer inefficiency due to massive scale. While quantization reduces computational costs, existing methods degrade accuracy in medium-sized LLMs (e.g., Llama-3-8B) due to activation outliers. To address this, we propose QUAD (Quantization with Activation Decomposition), a framework leveraging Singular Value Decomposition (SVD) to suppress activation outliers for effective 4-bit quantization. QUAD estimates activation singular vectors offline using calibration data to construct an orthogonal transformation matrix P, shifting outliers to additional dimensions in full precision while quantizing the remaining components to 4-bit. Additionally, QUAD enables parameter-efficient fine-tuning via adaptable full-precision outlier weights, narrowing the accuracy gap between quantized and full-precision models. Experiments demonstrate that QUAD achieves 94% to 96% accuracy under W4A4 quantization and 98% accuracy with W4A4/A8 and parameter-efficient fine-tuning for Llama-3 and Qwen-2.5 models. Our code is available at https://github.com/hyx1999/Quad.
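
The motivation, that activation outliers hurt low-bit quantization, can be seen with plain symmetric per-tensor 4-bit quantization. The sketch below is not QUAD and involves no SVD; it is a generic illustration with hypothetical function names, showing that a single large value stretches the quantization step so far that the ordinary values collapse to zero.

```python
def quantize_int4(values):
    """Symmetric per-tensor 4-bit quantization; returns dequantized values.

    The scale maps the largest magnitude to 7, and integers are clamped
    to the signed 4-bit range [-8, 7].
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / 7
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def mean_abs_error(original, reconstructed):
    """Average absolute reconstruction error."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)


activations = [0.5, -0.3, 0.8, -0.6]
clean = quantize_int4(activations)
# Same values quantized alongside one outlier; compare only the shared entries.
with_outlier = quantize_int4(activations + [40.0])[: len(activations)]

error_clean = mean_abs_error(activations, clean)
error_outlier = mean_abs_error(activations, with_outlier)
```

Keeping the outlier in a separate full-precision path, which is the general strategy the abstract describes, would let the remaining values use the tighter scale.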

DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

Fucai Ke,Vijay Kumar B G,Xingjian Leng,Zhixi Cai,Zaid Khan,Weiqing Wang,Pari Delir Haghighi,Hamid Rezatofighi,Manmohan Chandraker

Task: Propose DWIM, a method that improves tool usage and the training workflow to raise performance on visual reasoning (VR) tasks.

Motivation: Existing compositional visual reasoning methods hit a performance bottleneck because frozen large language models (LLMs) lack tool awareness, and applying LLMs directly to VR is hampered by limited training data and imperfect tools.

Details

Method: DWIM has two parts: i) discrepancy-aware training workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) instruct-masking fine-tuning, which guides the model to clone only effective actions and thus produce more practical solutions. Result: Experiments show state-of-the-art performance across various VR tasks and strong generalization on multiple widely used datasets. Conclusion: By improving workflow generation and the fine-tuning strategy, DWIM addresses the limitations of existing compositional visual reasoning methods and improves VR performance. Abstract: Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face limitations, as frozen LLMs lack tool awareness in VR, leading to performance bottlenecks. While leveraging LLMs for reasoning is widely used in other domains, they are not directly applicable to VR due to limited training data, imperfect tools that introduce errors and reduce data collection efficiency in VR, and challenges in fine-tuning on noisy workflows. To address these challenges, we propose DWIM: i) Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) Instruct-Masking fine-tuning, which guides the model to only clone effective actions, enabling the generation of more practical solutions. Our experiments demonstrate that DWIM achieves state-of-the-art performance across various VR tasks, exhibiting strong generalization on multiple widely-used datasets.

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen,Tianpeng Li,Haoze Sun,Yijie Zhou,Chenzheng Zhu,Fan Yang,Zenan Zhou,Weipeng Chen,Haofen Wang,Jeff Z. Pan,Wen Zhang,Huajun Chen

Task: Propose ReSearch, a framework that uses reinforcement learning to train LLMs to integrate reasoning with external search without supervised data on reasoning steps.

Motivation: Integrating reasoning with external search remains hard for LLMs, especially on complex multi-hop questions.

Details

Method: Treat search operations as integral parts of the reasoning chain, let text-based thinking decide when and how to search, and train the model with reinforcement learning. Result: The models generalize strongly across multiple benchmarks, and training naturally elicits advanced reasoning behaviours such as reflection and self-correction. Conclusion: ReSearch improves LLM performance on complex reasoning tasks and shows the potential of training without supervised reasoning data. Abstract: Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.

Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications

Ben Rahman

Task: Propose a new context-aware semantic segmentation framework that combines large language models (LLMs) with vision backbones to address the difficulty current models have in capturing contextual and semantic relationships between objects.

Motivation: Current semantic segmentation models (such as CNN and Transformer architectures) are good at identifying pixel-level features but struggle to distinguish semantically similar objects or to understand complex contextual scenes.

Details

Method: Propose a hybrid model that uses the Swin Transformer for visual feature extraction and GPT-4 text embeddings to enrich semantic understanding, adds a cross-attention mechanism to align vision and language features, and uses graph neural networks (GNNs) to model object relationships. Result: On benchmark datasets such as COCO and Cityscapes, the model outperforms existing methods in both pixel-level accuracy (mIoU) and contextual understanding (mAP). Conclusion: The work bridges the gap between vision and language and lays groundwork for intelligent, context-aware vision systems in autonomous driving, medical imaging, and robotics. Abstract: Semantic segmentation has made significant strides in pixel-level image understanding, yet it remains limited in capturing contextual and semantic relationships between objects. Current models, such as CNN and Transformer-based architectures, excel at identifying pixel-level features but fail to distinguish semantically similar objects (e.g., "doctor" vs. "nurse" in a hospital scene) or understand complex contextual scenarios (e.g., differentiating a running child from a regular pedestrian in autonomous driving). To address these limitations, we propose a novel Context-Aware Semantic Segmentation framework that integrates Large Language Models (LLMs) with state-of-the-art vision backbones. Our hybrid model leverages the Swin Transformer for robust visual feature extraction and GPT-4 for enriching semantic understanding through text embeddings. A Cross-Attention Mechanism is introduced to align vision and language features, enabling the model to reason about context more effectively. Additionally, Graph Neural Networks (GNNs) are employed to model object relationships within the scene, capturing dependencies that are overlooked by traditional models. Experimental results on benchmark datasets (e.g., COCO, Cityscapes) demonstrate that our approach outperforms the existing methods in both pixel-level accuracy (mIoU) and contextual understanding (mAP). This work bridges the gap between vision and language, paving the path for more intelligent and context-aware vision systems in applications including autonomous driving, medical imaging, and robotics.

Multi-agent Application System in Office Collaboration Scenarios

Songtao Sun,Jingyi Li,Yuanfei Dong,Haoguang Liu,Chenxin Xu,Fuyang Li,Qiang Liu

Task: Design and validate a multi-agent application system that improves office collaboration efficiency and work quality.

Motivation: Integrate artificial intelligence, machine learning, and natural language processing to handle task allocation, progress monitoring, and information sharing in office collaboration.

Details

Method: Propose an agent architecture that separates Plan and Solver, and use multi-turn query rewriting and business tool retrieval to strengthen the agent's multi-intent and multi-turn dialogue capabilities. Result: The system performs well in query understanding, task planning, and tool calling, and its effectiveness is validated in real business applications. Conclusion: The system is expected to play a larger role in handling complex interactions in dynamic environments and large-scale multi-agent systems. Abstract: This paper introduces a multi-agent application system designed to enhance office collaboration efficiency and work quality. The system integrates artificial intelligence, machine learning, and natural language processing technologies, achieving functionalities such as task allocation, progress monitoring, and information sharing. The agents within the system are capable of providing personalized collaboration support based on team members' needs and incorporate data analysis tools to improve decision-making quality. The paper also proposes an intelligent agent architecture that separates Plan and Solver, and through techniques such as multi-turn query rewriting and business tool retrieval, it enhances the agent's multi-intent and multi-turn dialogue capabilities. Furthermore, the paper details the design of tools and multi-turn dialogue in the context of office collaboration scenarios, and validates the system's effectiveness through experiments and evaluations. Ultimately, the system has demonstrated outstanding performance in real business applications, particularly in query understanding, task planning, and tool calling. Looking forward, the system is expected to play a more significant role in addressing complex interaction issues within dynamic environments and large-scale multi-agent systems.

Multiscale Feature Importance-based Bit Allocation for End-to-End Feature Coding for Machines

Junle Liu,Yun Zhang,Zixi Guo

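Task: Propose a Multiscale Feature Importance-based Bit Allocation (MFIBA) method for end-to-end Feature Coding for Machines (FCM).

Motivation: The importance of features varies with scale, object size, and image instance, so a more reasonable bit allocation method is needed.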

Details

Method: Proposes a Multiscale Feature Importance Prediction (MFIP) module and a task loss-rate model, and develops the MFIBA method on top of them. Result: Combined with ELIC, MFIBA saves 38.202% of the bitrate in object detection, and 17.212% and 36.492% in instance segmentation and keypoint detection, respectively. Conclusion: MFIBA generalizes and adapts well across machine vision tasks and FCM base codecs. Abstract: Feature Coding for Machines (FCM) aims to compress intermediate features effectively for remote intelligent analytics, which is crucial for future intelligent visual applications. In this paper, we propose a Multiscale Feature Importance-based Bit Allocation (MFIBA) for end-to-end FCM. First, we find that the importance of features for machine vision tasks varies with the scales, object size, and image instances. Based on this finding, we propose a Multiscale Feature Importance Prediction (MFIP) module to predict the importance weight for each scale of features. Secondly, we propose a task loss-rate model to establish the relationship between the task accuracy losses of using compressed features and the bitrate of encoding these features. Finally, we develop an MFIBA for end-to-end FCM, which is able to assign coding bits of multiscale features more reasonably based on their importance. Experimental results demonstrate that when combined with a retained Efficient Learned Image Compression (ELIC), the proposed MFIBA achieves an average of 38.202% bitrate savings in object detection compared to the anchor ELIC. Moreover, the proposed MFIBA achieves an average of 17.212% and 36.492% feature bitrate savings for instance segmentation and keypoint detection, respectively. When the proposed MFIBA is applied to the LIC-TCM, it achieves an average of 18.103%, 19.866% and 19.597% bitrate savings on three machine vision tasks, respectively, which validates that the proposed MFIBA has good generalizability and adaptability to different machine vision tasks and FCM base codecs.

Lean Formalization of Generalization Error Bound by Rademacher Complexity

Sho Sonoda,Kazumi Kasaura,Yuma Mizuno,Kei Tsukamoto,Naoto Onda

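Task: Formalize the generalization error bound based on Rademacher complexity in the Lean 4 theorem prover.

Motivation: Generalization error measures the gap between a learning machine's performance on training data and on unseen test data. Rademacher complexity estimates this error from the complexity of the hypothesis class and applies to a wide range of machine learning settings, including deep learning and kernel methods.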

Details

Method: Formalizes key concepts and theorems, including the empirical and population Rademacher complexities, and establishes the generalization error bound through formal proofs of McDiarmid's inequality, Hoeffding's lemma, and symmetrization arguments. Result: Rademacher complexity and the related theorems are formalized, and a rigorous proof of the generalization error bound is established. Conclusion: The formalization verifies the broad applicability of Rademacher complexity in machine learning and provides a reliable tool for theoretical analysis. Abstract: We formalize the generalization error bound using Rademacher complexity in the Lean 4 theorem prover. Generalization error quantifies the gap between a learning machine's performance on given training data versus unseen test data, and Rademacher complexity serves as an estimate of this error based on the complexity of learning machines, or hypothesis class. Unlike traditional methods such as PAC learning and VC dimension, Rademacher complexity is applicable across diverse machine learning scenarios including deep learning and kernel methods. We formalize key concepts and theorems, including the empirical and population Rademacher complexities, and establish generalization error bounds through formal proofs of McDiarmid's inequality, Hoeffding's lemma, and symmetrization arguments.
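
To make the central quantity concrete: under one common convention, the empirical Rademacher complexity of a class F on a sample x_1..x_n is the expectation, over independent uniform signs sigma_i in {-1, +1}, of sup over f in F of (1/n) * sum_i sigma_i * f(x_i). The sketch below estimates that expectation by Monte Carlo for a finite class. It is an illustration only and is not taken from the paper's Lean development; the function names, the threshold-classifier class, and the number of draws are assumptions made for this example.

```python
import random


def empirical_rademacher(outputs, num_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    outputs[k][i] holds f_k(x_i), the value of the k-th hypothesis on the
    i-th sample point. The class is assumed finite (illustrative choice).
    """
    rng = random.Random(seed)
    n = len(outputs[0])
    total = 0.0
    for _ in range(num_draws):
        # Draw one vector of independent Rademacher signs.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        # Supremum over the (finite) class of the signed empirical mean.
        total += max(
            sum(s * v for s, v in zip(sigma, row)) / n for row in outputs
        )
    return total / num_draws


def threshold_class(xs, thresholds):
    """Hypothetical toy class: x -> +1 if x >= t else -1, one row per t."""
    return [[1.0 if x >= t else -1.0 for x in xs] for t in thresholds]


if __name__ == "__main__":
    xs = [i / 10 for i in range(20)]
    single = threshold_class(xs, [0.5])
    richer = threshold_class(xs, [0.5, 1.0, 1.5])
    print(empirical_rademacher(single), empirical_rademacher(richer))
```

With a fixed seed the same sign vectors are reused, so enlarging the class can only increase the estimate, which mirrors the monotonicity of the true quantity in the hypothesis class.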

ISPDiffuser: Learning RAW-to-sRGB Mappings with Texture-Aware Diffusion Models and Histogram-Guided Color Consistency

Yang Ren,Hai Jiang,Menglong Yang,Wei Li,Shuaicheng Liu

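Task: Propose ISPDiffuser, a diffusion-based decoupled framework for mapping RAW data to high-quality sRGB images.

Motivation: Existing learning-based methods still fall short in detail reconstruction and color consistency, so a more effective solution is needed.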

Details

Method: Decomposes the RAW-to-sRGB mapping into detail reconstruction in grayscale space and color consistency mapping from grayscale to sRGB, using a texture-aware diffusion model and a histogram-guided color consistency module. Result: ISPDiffuser outperforms existing methods both quantitatively and visually. Conclusion: By decoupling detail and color mapping, ISPDiffuser markedly improves the quality of RAW-to-sRGB mapping. Abstract: RAW-to-sRGB mapping, or the simulation of the traditional camera image signal processor (ISP), aims to generate DSLR-quality sRGB images from raw data captured by smartphone sensors. Despite achieving comparable results to sophisticated handcrafted camera ISP solutions, existing learning-based methods still struggle with detail disparity and color distortion. In this paper, we present ISPDiffuser, a diffusion-based decoupled framework that separates the RAW-to-sRGB mapping into detail reconstruction in grayscale space and color consistency mapping from grayscale to sRGB. Specifically, we propose a texture-aware diffusion model that leverages the generative ability of diffusion models to focus on local detail recovery, in which a texture enrichment loss is further proposed to prompt the diffusion model to generate more intricate texture details. Subsequently, we introduce a histogram-guided color consistency module that utilizes a color histogram as guidance to learn precise color information for grayscale to sRGB color consistency mapping, with a color consistency loss designed to constrain the learned color information. Extensive experimental results show that the proposed ISPDiffuser outperforms state-of-the-art competitors both quantitatively and visually. The code is available at https://github.com/RenYangSCU/ISPDiffuser.

Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models

Ilias Stogiannidis,Steven McDonagh,Sotirios A. Tsaftaris

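Task: Evaluate and improve the performance of Vision-Language Models (VLMs) on spatial reasoning tasks.

Motivation: Existing VLM benchmarks fail to isolate spatial reasoning from other tasks such as object detection or semantic comprehension, so spatial reasoning ability is under-evaluated.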

Details

Method: Analyzes the core elements of spatial reasoning (spatial relations, orientation and navigation, mental rotation, and spatial visualization) and evaluates 13 state-of-the-art VLMs on synthetic and real-world images. Result: The average accuracy of current VLMs on spatial reasoning tasks is close to random chance. Conclusion: The study exposes serious shortcomings of VLMs in spatial reasoning and provides a platform and direction for future research. Abstract: Vision-Language Models (VLMs) have recently emerged as powerful tools, excelling in tasks that integrate visual and textual comprehension, such as image captioning, visual question answering, and image-text retrieval. However, existing benchmarks for VLMs that include spatial components often fail to isolate spatial reasoning from related tasks such as object detection or semantic comprehension. In this paper, we address these deficiencies with a multi-faceted approach towards understanding spatial reasoning. Informed by the diverse and multi-dimensional nature of human spatial reasoning abilities, we present a detailed analysis that first delineates the core elements of spatial reasoning: spatial relations, orientation and navigation, mental rotation, and spatial visualization, and then assesses the performance of these models in both synthetic and real-world images, bridging controlled and naturalistic contexts. We analyze 13 state-of-the-art Vision-Language Models, uncovering pivotal insights into their spatial reasoning performance. Our results reveal profound shortcomings in current VLMs, with average accuracy across the 13 models approximating random chance, highlighting spatial reasoning as a persistent obstacle. This work not only exposes the pressing need to advance spatial reasoning within VLMs but also establishes a solid platform for future exploration. Code available on GitHub (https://github.com/stogiannidis/srbench) and dataset available on HuggingFace (https://huggingface.co/datasets/stogiannidis/srbench).

Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment

Guanglu Dong,Xiangyu Liao,Mingyang Li,Guihuan Guo,Chao Ren

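Task: Propose a semantic feature discrimination method (SFD) for perceptual super-resolution (SR) that generates more realistic and semantically relevant textures.

Motivation: Existing GAN-based SR methods usually perform coarse-grained discrimination directly on images and ignore their semantic information, which makes it hard to learn fine-grained, semantically relevant texture details.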

Details

Method: Designs a feature discriminator (Feat-D) and a text-guided discrimination method (TG-D) that align the semantic feature distributions of SR images and high-quality images through adversarial learning. Result: Experiments show that SFD performs well on classical SISR, real-world SISR, and OU NR-IQA tasks. Conclusion: SFD effectively improves the perceptual quality of SR images, and the authors also propose SFD-IQA, an OU NR-IQA method that needs no additional training. Abstract: Generative Adversarial Networks (GANs) have been widely applied to image super-resolution (SR) to enhance the perceptual quality. However, most existing GAN-based SR methods perform coarse-grained discrimination directly on images and ignore the semantic information of images, making it challenging for the super resolution networks (SRN) to learn fine-grained and semantic-related texture details. To alleviate this issue, we propose a semantic feature discrimination method, SFD, for perceptual SR. Specifically, we first design a feature discriminator (Feat-D), to discriminate the pixel-wise middle semantic features from CLIP, aligning the feature distributions of SR images with those of high-quality images. Additionally, we propose a text-guided discrimination method (TG-D) by introducing learnable prompt pairs (LPP) in an adversarial manner to perform discrimination on the more abstract output feature of CLIP, further enhancing the discriminative ability of our method. With both Feat-D and TG-D, our SFD can effectively distinguish between the semantic feature distributions of low-quality and high-quality images, encouraging SRN to generate more realistic and semantic-relevant textures. Furthermore, based on the trained Feat-D and LPP, we propose a novel opinion-unaware no-reference image quality assessment (OU NR-IQA) method, SFD-IQA, greatly improving OU NR-IQA performance without any additional targeted training. Extensive experiments on classical SISR, real-world SISR, and OU NR-IQA tasks demonstrate the effectiveness of our proposed methods.

CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning

Hao Yu,Zhuokai Zhao,Shen Yan,Lukasz Korycki,Jianyu Wang,Baosheng He,Jiayi Liu,Lizhu Zhang,Xiangjun Fan,Hanchao Yu

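Task: Propose CAFe, a contrastive-autoregressive fine-tuning framework that improves large vision-language models (LVLMs) on both representation learning and generation tasks.

Motivation: Existing LVLMs excel at generative tasks but are limited in high-fidelity representation learning tasks (such as producing image or text embeddings for retrieval), and fine-tuned models may lose their generative capabilities.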

Details

Method: By combining a contrastive objective with autoregressive language modeling, CAFe unifies representation learning and generation. Result: CAFe achieves state-of-the-art results on multimodal retrieval and multimodal generation benchmarks, including object hallucination (OH) mitigation. Conclusion: CAFe offers a unified framework for future multimodal models that improves retrieval precision while still generating coherent outputs. Abstract: The rapid advancement of large vision-language models (LVLMs) has driven significant progress in multimodal tasks, enabling models to interpret, reason, and generate outputs across both visual and textual domains. While excelling in generative tasks, existing LVLMs often face limitations in tasks requiring high-fidelity representation learning, such as generating image or text embeddings for retrieval. Recent work has proposed fine-tuning LVLMs for representational learning, but the fine-tuned model often loses its generative capabilities due to the representational learning training paradigm. To address this trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks, including object hallucination (OH) mitigation. CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.
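
One way to picture "integrating a contrastive objective with autoregressive language modeling" is a weighted sum of a next-token cross-entropy term and an InfoNCE-style term. The sketch below shows that combination in plain Python on toy numbers. The one-directional InfoNCE form, the temperature of 0.07, the `weight` parameter, and all function names are assumptions for illustration; the abstract does not state how CAFe actually balances the two objectives.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def info_nce(sim, temperature=0.07):
    """Contrastive term: sim[i][j] is the similarity between query i and
    candidate j, and the matching pair is assumed to sit on the diagonal."""
    n = len(sim)
    return -sum(
        log_softmax([s / temperature for s in sim[i]])[i] for i in range(n)
    ) / n


def lm_loss(token_logits, targets):
    """Autoregressive term: mean cross-entropy of the target token ids."""
    return -sum(
        log_softmax(logits)[t] for logits, t in zip(token_logits, targets)
    ) / len(targets)


def joint_loss(sim, token_logits, targets, weight=1.0):
    """Hypothetical joint objective: language modeling plus weighted InfoNCE."""
    return lm_loss(token_logits, targets) + weight * info_nce(sim)


if __name__ == "__main__":
    sim = [[0.9, 0.1], [0.2, 0.8]]          # toy image-text similarities
    token_logits = [[2.0, 0.1, 0.1], [0.1, 0.1, 2.0]]
    print(joint_loss(sim, token_logits, targets=[0, 2], weight=0.5))
```

Keeping the language-modeling term in the objective is what lets the fine-tuned model retain generation ability while the contrastive term shapes the embedding space.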

Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval

Haoqiang Lin,Haokun Wen,Xuemeng Song,Meng Liu,Yupeng Hu,Liqiang Nie

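Task: Propose FTI4CIR, a fine-grained textual inversion network for zero-shot composed image retrieval (ZS-CIR).

Motivation: Because annotating training data is expensive, existing zero-shot CIR methods map an image to a single pseudo-word token through coarse-grained textual inversion, which may fail to capture the full content of the image.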

Details

Method: FTI4CIR comprises fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization. The former maps an image into subject-oriented and attribute-oriented pseudo-word tokens, and the latter aligns the pseudo-word tokens with the real-word token embedding space using a BLIP-generated caption template. Result: Experiments on three benchmark datasets demonstrate the superiority of the method. Conclusion: Through fine-grained textual inversion and semantic regularization, FTI4CIR markedly improves zero-shot composed image retrieval. Abstract: Composed Image Retrieval (CIR) allows users to search target images with a multimodal query, comprising a reference image and a modification text that describes the user's modification demand over the reference image. Nevertheless, due to the expensive labor cost of training data annotation, recent researchers have shifted to the challenging task of zero-shot CIR (ZS-CIR), which targets fulfilling CIR without annotated triplets. Pioneering ZS-CIR studies focus on converting the CIR task into a standard text-to-image retrieval task by pre-training a textual inversion network that can map a given image into a single pseudo-word token. Despite their significant progress, their coarse-grained textual inversion may be insufficient to capture the full content of the image accurately. To overcome this issue, in this work, we propose a novel Fine-grained Textual Inversion Network for ZS-CIR, named FTI4CIR. In particular, FTI4CIR comprises two main components: fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization. The former maps the image into a subject-oriented pseudo-word token and several attribute-oriented pseudo-word tokens to comprehensively express the image in the textual form, while the latter works on jointly aligning the fine-grained pseudo-word tokens to the real-word token embedding space based on a BLIP-generated image caption template. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our proposed method.

BIMII-Net: Brain-Inspired Multi-Iterative Interactive Network for RGB-T Road Scene Semantic Segmentation

Hanshuo Qiu,Jie Jiang,Ruoli Yang,Lixin Zhan,Jizhao Liu

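Task: Propose BIMII-Net, a new RGB-T road scene semantic segmentation network that addresses the shortcomings of existing methods in multi-modal information fusion and in handling differences across feature levels.

Motivation: Existing RGB-T semantic segmentation models usually rely on simple addition or concatenation strategies or ignore the differences between information at different levels, which limits performance in complex environments such as inadequate illumination or occlusion.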

Details

Method: Proposes a deep continuous-coupled neural network (DCCNN) architecture, designs a cross explicit attention-enhanced fusion module (CEAEF-Module), and builds a complementary interactive multi-layer decoder structure (comprising the SFI-Module, DFI-Module, and MFE-Module). Result: BIMII-Net achieves SOTA performance in the brain-inspired computing domain, outperforms most existing RGB-T semantic segmentation methods, and generalizes well across multiple RGB-T datasets. Conclusion: Brain-inspired computing models are effective for multi-modal image segmentation, and BIMII-Net markedly improves segmentation results through joint optimization of multiple modules. Abstract: RGB-T road scene semantic segmentation enhances visual scene understanding in complex environments characterized by inadequate illumination or occlusion by fusing information from RGB and thermal images. Nevertheless, existing RGB-T semantic segmentation models typically depend on simple addition or concatenation strategies or ignore the differences between information at different levels. To address these issues, we propose a novel RGB-T road scene semantic segmentation network called Brain-Inspired Multi-Iteration Interaction Network (BIMII-Net). First, to meet the requirements of accurate texture and local information extraction in road scenarios like autonomous driving, we propose a deep continuous-coupled neural network (DCCNN) architecture based on a brain-inspired model. Second, to enhance the interaction and expression capabilities among multi-modal information, we design a cross explicit attention-enhanced fusion module (CEAEF-Module) in the feature fusion stage of BIMII-Net to effectively integrate features at different levels. Finally, we construct a complementary interactive multi-layer decoder structure, incorporating the shallow-level feature iteration module (SFI-Module), the deep-level feature iteration module (DFI-Module), and the multi-feature enhancement module (MFE-Module) to collaboratively extract texture details and global skeleton information, with multi-module joint supervision further optimizing the segmentation results. Experimental results demonstrate that BIMII-Net achieves state-of-the-art (SOTA) performance in the brain-inspired computing domain and outperforms most existing RGB-T semantic segmentation methods. It also exhibits strong generalization capabilities on multiple RGB-T datasets, proving the effectiveness of brain-inspired computer models in multi-modal image segmentation tasks.

Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation

Zhuoran Zhao,Linlin Yang,Pengzhan Sun,Pan Hui,Angela Yao

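Task: Systematically study the gap between synthetic and real data in 3D hand pose estimation.

Motivation: Although synthetic data works well for face recognition and body pose estimation, a significant synthetic-to-real gap remains for hand pose estimation.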

Details

Method: Proposes a data synthesis pipeline that generates high-quality synthetic data and analyzes key components (such as the forearm, image frequency statistics, hand pose, and object occlusions). Result: When the key components are integrated, synthetic hand data reaches the same level of accuracy as real data. Conclusion: The study offers a feasible path toward hand pose estimation using synthetic data alone. Abstract: Recent synthetic 3D human datasets for the face, body, and hands have pushed the limits on photorealism. Face recognition and body pose estimation have achieved state-of-the-art performance using synthetic training data alone, but for the hand, there is still a large synthetic-to-real gap. This paper presents the first systematic study of the synthetic-to-real gap of 3D hand pose estimation. We analyze the gap and identify key components such as the forearm, image frequency statistics, hand pose, and object occlusions. To facilitate our analysis, we propose a data synthesis pipeline to synthesize high-quality data. We demonstrate that synthetic hand data can achieve the same level of accuracy as real data when integrating our identified components, paving the path to use synthetic data alone for hand pose estimation. Code and data are available at: https://github.com/delaprada/HandSynthesis.git.

A Comprehensive Analysis of Mamba for 3D Volumetric Medical Image Segmentation

Chaohan Wang,Yutong Xie,Qi Chen,Yuyin Zhou,Qi Wu

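Task: Study Mamba's performance in 3D medical image segmentation, examining whether it can replace Transformers, whether it improves multi-scale representation learning, and whether complex scanning strategies are needed.

Motivation: Mamba offers more computationally efficient long-range dependency modeling than Transformers, but its effectiveness in high-resolution 3D medical image segmentation is still debated and needs to be verified.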

Details

Method: Proposes the UlikeMamba network, which combines 3D depthwise convolutions and multi-scale Mamba blocks, and evaluates it on the AMOS, TotalSegmentator, and BraTS benchmarks. Result: UlikeMamba surpasses the Transformer-based network in accuracy and computational efficiency, the multi-scale Mamba block performs better on complex tasks, and the Tri-scan strategy is notably effective in challenging scenarios. Conclusion: Mamba shows the potential to outperform existing models in 3D medical image segmentation, pointing to more efficient and accurate medical image analysis. Abstract: Mamba, with its selective State Space Models (SSMs), offers a more computationally efficient solution than Transformers for long-range dependency modeling. However, there is still a debate about its effectiveness in high-resolution 3D medical image segmentation. In this study, we present a comprehensive investigation into Mamba's capabilities in 3D medical image segmentation by tackling three pivotal questions: Can Mamba replace Transformers? Can it elevate multi-scale representation learning? Is complex scanning necessary to unlock its full potential? We evaluate Mamba's performance across three large public benchmarks: AMOS, TotalSegmentator, and BraTS. Our findings reveal that UlikeMamba, a U-shape Mamba-based network, consistently surpasses UlikeTrans, a U-shape Transformer-based network, particularly when enhanced with custom-designed 3D depthwise convolutions, boosting accuracy and computational efficiency. Further, our proposed multi-scale Mamba block demonstrates superior performance in capturing both fine-grained details and global context, especially in complex segmentation tasks, surpassing Transformer-based counterparts. We also critically assess complex scanning strategies, finding that simpler methods often suffice, while our Tri-scan approach delivers notable advantages in the most challenging scenarios. By integrating these advancements, we introduce a new network for 3D medical image segmentation, positioning Mamba as a transformative force that outperforms leading models such as nnUNet, CoTr, and U-Mamba, offering competitive accuracy with superior computational efficiency. This study provides key insights into Mamba's unique advantages, paving the way for more efficient and accurate approaches to 3D medical imaging.

LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text

Weizhi Chen,Jingbo Chen,Yupeng Deng,Jiansheng Chen,Yuman Feng,Zhihao Xi,Diyou Liu,Kai Li,Yu Meng

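Task: Address the technical bottleneck in handling long text and the "hallucination" problem caused by insufficient short-text information in remote sensing vision-language foundation models (VLFM).

Motivation: Existing datasets are limited in semantic granularity and lack long-text support, which hurts model performance and applicability.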

Details

Method: Proposes the LRSCLIP model and the LRS2M dataset, which achieve fine-grained cross-modal feature alignment by integrating multi-source remote sensing data and adopting a large language model labeling strategy. Result: In zero-shot long-text cross-modal retrieval, LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline; on short-text tasks it outperforms the current best model, GeoRSCLIP. Conclusion: LRSCLIP combines fine-grained semantic understanding with global feature matching and provides a new benchmark model and data support for remote sensing multimodal learning. Abstract: This study addresses the technical bottlenecks in handling long text and the "hallucination" issue caused by insufficient short text information in remote sensing vision-language foundation models (VLFM). We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) By integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs, providing both short and long texts for the first time, thus solving the problem of semantic granularity limitations in existing datasets; (2) We design the LRSCLIP architecture based on Long-CLIP's KPS module, which extends CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline in the zero-shot long-text cross-modal retrieval task. For the zero-shot short-text cross-modal retrieval task, LRSCLIP achieves improvements over the current best model, GeoRSCLIP, with increases of 0.17%, 0.67%, and 0.92% in Text to Image R@1, Image to Text R@1, and mR on RSITMD, respectively, and 0.04%, 2.93%, and 1.28% on RSICD. In the zero-shot image classification task (average accuracy=75.75%) and semantic localization task (Rmi=0.7653), LRSCLIP achieves state-of-the-art performance. These results validate the dual advantages of fine-grained semantic understanding and global feature matching in LRSCLIP. This work provides a new benchmark model and data support for remote sensing multimodal learning. The related code has been open-sourced and is available at https://github.com/MitsuiChen14/LRSCLIP.

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

Jiaqi Liao,Zhengyuan Yang,Linjie Li,Dianqi Li,Kevin Lin,Yu Cheng,Lijuan Wang

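Task: Study Text-to-Image In-Context Learning (T2I-ICL) and propose a new framework that strengthens the contextual reasoning of multimodal large language models (MLLMs).

Motivation: Existing unified MLLMs struggle with effective contextual reasoning in T2I-ICL scenarios.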

Details

Method: Proposes the ImageGen-CoT framework, which includes automatic construction of a high-quality dataset, fine-tuning of MLLMs, and a hybrid scaling strategy at test time. Result: Experiments show that fine-tuning with the ImageGen-CoT dataset improves SEED-X's performance on T2I-ICL tasks by 80%. Conclusion: ImageGen-CoT markedly improves MLLMs on T2I-ICL tasks, and the authors plan to open-source the code and model weights. Abstract: In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation. To avoid generating unstructured, ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities. To further enhance performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach. This approach first generates multiple ImageGen-CoT chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80% performance gain for SEED-X on T2I-ICL tasks. See our project page at https://ImageGen-CoT.github.io/. Code and model weights will be open-sourced.
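
The hybrid scaling idea, several reasoning chains followed by several sampled images per chain, can be sketched as a two-level loop. Everything below is hypothetical scaffolding: `gen_chain`, `gen_image`, and `score` stand in for model calls that the abstract does not specify, and picking the highest-scoring candidate is an assumption, since the abstract does not say how the final image is selected.

```python
import random


def hybrid_scaling(prompt, gen_chain, gen_image, score,
                   num_chains=3, images_per_chain=2, seed=0):
    """Sample num_chains reasoning chains, then images_per_chain images for
    each chain, and return the best-scoring candidate plus the total count.

    gen_chain(prompt, rng) -> chain, gen_image(chain, rng) -> image and
    score(prompt, image) -> float are caller-supplied stand-ins (assumed).
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(num_chains):
        chain = gen_chain(prompt, rng)           # first level: reasoning chain
        for _ in range(images_per_chain):
            image = gen_image(chain, rng)        # second level: image sample
            candidates.append((score(prompt, image), chain, image))
    best = max(candidates, key=lambda c: c[0])   # assumed selection rule
    return best, len(candidates)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model.
    def toy_chain(prompt, rng):
        return f"{prompt} | plan #{rng.randint(0, 9)}"

    def toy_image(chain, rng):
        return {"from": chain, "quality": rng.random()}

    def toy_score(prompt, image):
        return image["quality"]

    best, total = hybrid_scaling("a red cube", toy_chain, toy_image, toy_score)
    print(total, best[1])
```

The total number of generated images is `num_chains * images_per_chain`, so the two knobs trade diversity of reasoning against diversity of rendering under a fixed budget.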

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Yuchao Gu,Weijia Mao,Mike Zheng Shou

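Task: Study long-context video modeling and propose Frame AutoRegressive (FAR) as a baseline for video autoregressive modeling.

Motivation: Video generation has not yet fully exploited long temporal contexts, whereas long-context autoregressive modeling has significantly advanced language generation.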

Details

Method: Proposes FAR, which models temporal causal dependencies between continuous frames, introduces FlexRoPE to strengthen long-context modeling, and proposes a long short-term context modeling strategy. Result: FAR achieves state-of-the-art performance in both short- and long-video generation. Conclusion: FAR provides a simple yet effective baseline for video autoregressive modeling. Abstract: Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models temporal causal dependencies between continuous frames, achieving better convergence than Token AR and video diffusion transformers. Building on FAR, we observe that long-context vision modeling faces challenges due to visual redundancy. Existing RoPE lacks effective temporal decay for remote context and fails to extrapolate well to long video sequences. Additionally, training on long videos is computationally expensive, as vision tokens grow much faster than language tokens. To tackle these issues, we propose balancing locality and long-range dependency. We introduce FlexRoPE, a test-time technique that adds flexible temporal decay to RoPE, enabling extrapolation to 16x longer vision contexts. Furthermore, we propose long short-term context modeling, where a high-resolution short-term context window ensures fine-grained temporal consistency, while an unlimited long-term context window encodes long-range information using fewer tokens. With this approach, we can train on long video sequences with a manageable token context length. We demonstrate that FAR achieves state-of-the-art performance in both short- and long-video generation, providing a simple yet effective baseline for video autoregressive modeling.

ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning

Chau Pham,Juan C. Caicedo,Bryan A. Plummer

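Task: Propose ChA-MAEViT, an MAE-based method that enhances cross-channel feature learning in Multi-Channel Imaging (MCI).

Motivation: Conventional MAE methods assume redundancy across image channels, which suits random patch masking, but in MCI the channels carry complementary information with little overlap, so cross-channel interactions are under-learned.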

Details

Method: Uses four strategies: dynamic channel-patch masking, memory tokens, a hybrid token fusion module, and a channel-aware decoder. Result: Significantly outperforms existing MCI-ViT methods on multiple datasets, with gains of 3.0-21.5%. Conclusion: Cross-channel interactions are essential for MCI, and ChA-MAEViT's design effectively improves feature learning. Abstract: Prior work using Masked Autoencoders (MAEs) typically relies on random patch masking based on the assumption that images have significant redundancies across different channels, allowing for the reconstruction of masked content using cross-channel correlations. However, this assumption does not hold in Multi-Channel Imaging (MCI), where channels may provide complementary information with minimal feature overlap. Thus, these MAEs primarily learn local structures within individual channels from patch reconstruction, failing to fully leverage cross-channel interactions and limiting their MCI effectiveness. In this paper, we present ChA-MAEViT, an MAE-based method that enhances feature learning across MCI channels via four key strategies: (1) dynamic channel-patch masking, which compels the model to reconstruct missing channels in addition to masked patches, thereby enhancing cross-channel dependencies and improving robustness to varying channel configurations; (2) memory tokens, which serve as long-term memory aids to promote information sharing across channels, addressing the challenges of reconstructing structurally diverse channels; (3) a hybrid token fusion module, which merges fine-grained patch tokens with a global class token to capture richer representations; and (4) a Channel-Aware Decoder, a lightweight decoder that utilizes channel tokens to effectively reconstruct image patches. Experiments on satellite and microscopy datasets, CHAMMI, JUMP-CP, and So2Sat, show that ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%, highlighting the importance of cross-channel interactions in MCI.
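
As a rough picture of dynamic channel-patch masking, the sketch below builds a boolean mask over a (channels x patches) token grid: a random number of whole channels is hidden, and a fixed fraction of patches is hidden in every remaining channel. The sampling scheme, the `patch_ratio` and `max_dropped_channels` parameters, and the function name are assumptions for illustration, not the procedure used in ChA-MAEViT.

```python
import random


def channel_patch_mask(num_channels, num_patches, patch_ratio=0.5,
                       max_dropped_channels=1, seed=None):
    """Return (mask, dropped), where mask[c][p] is True if the token for
    channel c and patch p is hidden from the encoder.

    max_dropped_channels should stay below num_channels so that at least
    one channel remains partially visible (assumed constraint).
    """
    rng = random.Random(seed)
    # Channel-level masking: hide between 0 and max_dropped_channels channels.
    num_dropped = rng.randint(0, max_dropped_channels)
    dropped = set(rng.sample(range(num_channels), num_dropped))
    num_hidden = int(patch_ratio * num_patches)
    mask = []
    for c in range(num_channels):
        if c in dropped:
            mask.append([True] * num_patches)    # whole channel must be rebuilt
        else:
            # Patch-level masking inside a visible channel.
            hidden = set(rng.sample(range(num_patches), num_hidden))
            mask.append([p in hidden for p in range(num_patches)])
    return mask, dropped


if __name__ == "__main__":
    mask, dropped = channel_patch_mask(4, 16, max_dropped_channels=2, seed=1)
    for c, row in enumerate(mask):
        print(c, "dropped" if c in dropped else sum(row), "hidden")
```

Because a dropped channel has no visible tokens of its own, reconstructing it forces the model to rely on the other channels, which is the cross-channel dependency the abstract describes.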

Divide-and-Conquer: Dual-Hierarchical Optimization for Semantic 4D Gaussian Spatting

Zhiying Yan,Yiyuan Liang,Shilv Cai,Tao Zhang,Sheng Zhong,Luxin Yan,Xu Zou

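Task: Propose Dual-Hierarchical Optimization (DHO), a method for reconstructing and understanding dynamic scenes.

Motivation: Applying static methods directly to dynamic scenes ignores temporal features, and existing Gaussian Splatting-based methods tend to produce noise and artifacts when handling dynamic parts.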

Details

Method: Uses a divide-and-conquer strategy built on Hierarchical Gaussian Flow and Hierarchical Gaussian Guidance to handle the rendering and features of static and dynamic parts separately. Result: Outperforms the baselines on both synthetic and real-world datasets and supports various downstream tasks. Conclusion: DHO effectively addresses noise and artifacts in dynamic scene understanding and delivers strong performance. Abstract: Semantic 4D Gaussians can be used for reconstructing and understanding dynamic scenes, which have more temporal variations than static scenes. Directly applying static methods to understand dynamic scenes will fail to capture the temporal features. Few works focus on dynamic scene understanding based on Gaussian Splatting, since once the same update strategy is employed for both dynamic and static parts, regardless of the distinction and interaction between Gaussians, significant artifacts and noise appear. We propose Dual-Hierarchical Optimization (DHO), which consists of Hierarchical Gaussian Flow and Hierarchical Gaussian Guidance in a divide-and-conquer manner. The former implements effective division of static and dynamic rendering and features. The latter helps to mitigate the issue of dynamic foreground rendering distortion in textured complex scenes. Extensive experiments show that our method consistently outperforms the baselines on both synthetic and real-world datasets, and supports various downstream tasks. Project Page: https://sweety-yan.github.io/DHO.

BADGR: Bundle Adjustment Diffusion Conditioned by GRadients for Wide-Baseline Floor Plan Reconstruction

Yuguang Li,Ivaylo Boyadzhiev,Zixuan Liu,Linda Shapiro,Alex Colburn

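Task: Reconstruct precise camera poses and floor plan layouts from wide-baseline RGB panoramas.

Motivation: Reconstructing camera poses and floor plan layouts from wide-baseline RGB panoramas remains a difficult, unsolved problem.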

Details

Method: Proposes BADGR, a novel diffusion model that jointly performs reconstruction and bundle adjustment (BA), using 1D floor boundary predictions from images of varying input densities to refine poses and layouts. Result: BADGR significantly outperforms existing methods in experiments and supports a range of input densities. Conclusion: By combining a diffusion model with bundle adjustment, BADGR effectively addresses the reconstruction of camera poses and floor plan layouts. Abstract: Reconstructing precise camera poses and floor plan layouts from wide-baseline RGB panoramas is a difficult and unsolved problem. We introduce BADGR, a novel diffusion model that jointly performs reconstruction and bundle adjustment (BA) to refine poses and layouts from a coarse state, using 1D floor boundary predictions from dozens of images of varying input densities. Unlike a guided diffusion model, BADGR is conditioned on dense per-entity outputs from a single-step Levenberg-Marquardt (LM) optimizer and is trained to predict camera and wall positions while minimizing reprojection errors for view-consistency. The objective of layout generation from the denoising diffusion process complements BA optimization by providing additional learned layout-structural constraints on top of the co-visible features across images. These constraints help BADGR to make plausible guesses on spatial relations which help constrain the pose graph, such as wall adjacency and collinearity, and to learn to mitigate errors from dense boundary observations with global contexts. BADGR trains exclusively on 2D floor plans, simplifying data acquisition, enabling robust augmentation, and supporting a variety of input densities. Our experiments and analysis validate our method, which significantly outperforms the state-of-the-art pose and floor plan layout reconstruction with different input densities.

Stop Walking in Circles! Bailing Out Early in Projected Gradient Descent

Philip Doldo,Derek Everett,Amol Khanna,Andre T Nguyen,Edward Raff

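Task: Propose a cycle-detection-based early termination method for PGD to speed up adversarial robustness evaluation.

Motivation: PGD is computationally expensive for adversarial robustness evaluation, especially when thousands of iterations are required, which makes evaluation inefficient.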

Details

Method: Exploits the geometry of how PGD is implemented in practice to terminate early via cycle detection. Result: The method substantially speeds up PGD without sacrificing attack strength and yields exactly the same robustness estimate as standard PGD. Conclusion: The method enables robustness evaluations that were previously computationally intractable. Abstract: Projected Gradient Descent (PGD) under the $L_\infty$ ball has become one of the de facto methods used in adversarial robustness evaluation for computer vision (CV) due to its reliability and efficacy, making a strong and easy-to-implement iterative baseline. However, PGD is computationally demanding to apply, especially when using thousands of iterations is the current best-practice recommendation to generate an adversarial example for a single image. In this work, we introduce a simple novel method for early termination of PGD based on cycle detection by exploiting the geometry of how PGD is implemented in practice and show that it can produce large speedup factors while providing the exact same estimate of model robustness as standard PGD. This method substantially speeds up PGD without sacrificing any attack strength, enabling evaluations of robustness that were previously computationally intractable.
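
The intuition can be sketched on a toy problem. With signed-gradient steps of fixed size and projection onto the $L_\infty$ ball, a deterministic PGD update depends only on the current iterate, so revisiting an earlier iterate means all later iterates repeat and further steps cannot change the best loss found. The code below stops on the first exact repeat. This exact-repeat rule, the toy quadratic loss, and all names are assumptions for illustration; the abstract does not describe the paper's actual detection criterion.

```python
def sign(v):
    """Sign of a float as -1, 0, or 1."""
    return (v > 0) - (v < 0)


def pgd_linf(x0, loss_fn, grad_fn, eps, alpha, max_iters, detect_cycles=True):
    """Signed-gradient ascent on loss_fn inside the L-infinity ball of radius
    eps around x0. Returns (best loss seen, number of iterations run)."""
    x = list(x0)
    best = loss_fn(x)
    seen = {tuple(x)}
    for step in range(1, max_iters + 1):
        g = grad_fn(x)
        # Fixed-size signed step, then projection back onto the eps-ball.
        x = [min(max(xi + alpha * sign(gi), ci - eps), ci + eps)
             for xi, gi, ci in zip(x, g, x0)]
        best = max(best, loss_fn(x))
        key = tuple(x)
        if detect_cycles and key in seen:
            # The update is deterministic in x, so a repeated iterate means
            # every later iterate has already been evaluated.
            return best, step
        seen.add(key)
    return best, max_iters


if __name__ == "__main__":
    target = [0.1, -0.3]  # toy maximizer of the loss (illustrative)

    def loss(x):
        return -sum((a - b) ** 2 for a, b in zip(x, target))

    def grad(x):
        return [-2.0 * (a - b) for a, b in zip(x, target)]

    fast = pgd_linf([0.0, 0.0], loss, grad, eps=1.0, alpha=0.25, max_iters=1000)
    full = pgd_linf([0.0, 0.0], loss, grad, eps=1.0, alpha=0.25, max_iters=1000,
                    detect_cycles=False)
    print(fast, full)
```

In this toy run both calls report the same best loss, while the cycle-detecting call stops after a handful of iterations instead of the full budget. Variants with momentum or random restarts would need the extra state included in the repeat check.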

Multi-Object Sketch Animation by Scene Decomposition and Motion Planning

Jingyu Liu,Zijie Xin,Yuhan Fu,Ruixiang Zhao,Bangxiang Lan,Xirong Li

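Task: Propose MoSketch, an iterative-optimization-based method for multi-object sketch animation that addresses the challenges of moving from single-object to multi-object animation.

Motivation: Current sketch animation methods perform poorly in multi-object scenarios, where they face two main challenges: object-aware motion modeling and complex motion optimization.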

Details

Method: An iterative optimization method based on Score Distillation Sampling (SDS) with four modules: LLM-based scene decomposition, LLM-based motion planning, a motion refinement network, and compositional SDS. Result: Experiments show that MoSketch outperforms existing methods on multi-object sketch animation. Conclusion: MoSketch opens a new direction for multi-object sketch animation, and the code will be released. Abstract: Sketch animation, which brings static sketches to life by generating dynamic video sequences, has found widespread applications in GIF design, cartoon production, and daily entertainment. While current sketch animation methods perform well in single-object sketch animation, they struggle in multi-object scenarios. By analyzing their failures, we summarize two challenges of transitioning from single-object to multi-object sketch animation: object-aware motion modeling and complex motion optimization. For multi-object sketch animation, we propose MoSketch based on iterative optimization through Score Distillation Sampling (SDS), without any other data for training. We propose four modules: LLM-based scene decomposition, LLM-based motion planning, motion refinement network and compositional SDS, to tackle the two challenges in a divide-and-conquer strategy. Extensive qualitative and quantitative experiments demonstrate the superiority of our method over existing sketch animation approaches. MoSketch takes a pioneering step towards multi-object sketch animation, opening new avenues for future research and applications. The code will be released.

ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

Dohwan Ko,Sihyeon Kim,Yumin Suh,Vijay Kumar B. G,Minseo Yoon,Manmohan Chandraker,Hyunwoo J. Kim

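Task: Build a spatio-temporal reasoning dataset and benchmark (STKit and STKit-Bench) and improve the spatio-temporal reasoning of vision-language models (VLMs) through kinematic instruction tuning.

Motivation: Existing VLMs have improved at spatial reasoning but still fall short when analyzing kinematic elements such as the traveled distance and speed of moving objects.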

Details

Method: Propose an automatic pseudo-label generation pipeline that uses 4D reconstruction to scale up the data, and build ST-VLM through kinematic instruction tuning. Result: ST-VLM performs strongly on STKit-Bench and outperforms baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Conclusion: By integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning and generalizes well across domains and tasks. Abstract: Spatio-temporal reasoning is essential in understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements like traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline to generate pseudo-labels using 4D reconstruction in real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.
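
The kinematic quantities named in the abstract (traveled distance and speed) reduce to simple arithmetic once an object's 3D positions are known. The sketch below assumes positions in meters and timestamps in seconds; the function names are hypothetical and this is not the STKit labeling code.

```python
import math


def traveled_distance(points):
    """Sum of straight-line segment lengths along a 3D trajectory."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def average_speed(points, timestamps):
    """Traveled distance divided by elapsed time."""
    duration = timestamps[-1] - timestamps[0]
    if duration <= 0:
        raise ValueError("timestamps must span a positive duration")
    return traveled_distance(points) / duration


# An object moves 5 m in the x-y plane, then 12 m along z, over 2 s.
trajectory = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
times = [0.0, 1.0, 2.0]
distance = traveled_distance(trajectory)   # 17.0
speed = average_speed(trajectory, times)   # 8.5
```

Note that traveled distance accumulates along the path, so it can exceed the straight-line displacement between the first and last positions (13 m in this example).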

Can Vision-Language Models Answer Face to Face Questions in the Real-World?

Reza Pourreza,Rishit Dagli,Apratim Bhattacharyya,Sunny Panchal,Guillaume Berger,Roland Memisevic

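Task: Evaluate how well existing AI models can converse with users given real-time video and audio input.

Motivation: Explore whether AI models can interact with users in real-time settings, which is a prerequisite for real-world AI assistants and humanoid robots.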

Details

Method: Introduce the Qualcomm Interactive Video Dataset (IVD), evaluate models on a real-time question-answering task, and analyze the effect of fine-tuning. Result: Existing models fall far below human performance, but fine-tuning can significantly narrow the gap for many perceptual skills. Conclusion: Existing models still fall short on real-time interaction, but fine-tuning improves their performance, pointing a way forward for future AI assistants. Abstract: AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in the ability to converse with users in real-time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real-time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Qualcomm Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real-time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources of the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce this gap.

Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection

Farzad Beizaee,Gregory A. Lodygensky,Christian Desrosiers,Jose Dolz

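Task: Propose DeCo-Diff, a reformulated diffusion model that selectively corrects anomalous regions for unsupervised anomaly detection.

Motivation: Existing diffusion models struggle to maintain structural integrity and to selectively restore anomalous regions in unsupervised anomaly detection, especially in multi-class scenarios.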

Details

Method: By modeling anomalies as noise in the latent space, DeCo-Diff selectively corrects anomalous regions while preserving normal ones. Result: The method accurately identifies and localizes anomalies in complex images, improving pixel-level AUPRC by 11-14%. Conclusion: DeCo-Diff substantially improves unsupervised anomaly detection, particularly in multi-class scenarios. Abstract: Recent advances in diffusion models have spurred research into their application for reconstruction-based unsupervised anomaly detection. However, these methods may struggle with maintaining structural integrity and recovering the anomaly-free content of abnormal regions, especially in multi-class scenarios. Furthermore, diffusion models are inherently designed to generate images from pure noise and struggle to selectively alter anomalous regions of an image while preserving normal ones. This leads to potential degradation of normal regions during reconstruction, hampering the effectiveness of anomaly detection. This paper introduces a reformulation of the standard diffusion model geared toward selective region alteration, allowing the accurate identification of anomalies. By modeling anomalies as noise in the latent space, our proposed Deviation correction diffusion (DeCo-Diff) model preserves the normal regions and encourages transformations exclusively on anomalous areas. This selective approach enhances the reconstruction quality, facilitating effective unsupervised detection and localization of anomaly regions. Comprehensive evaluations demonstrate the superiority of our method in accurately identifying and localizing anomalies in complex images, with pixel-level AUPRC improvements of 11-14% over state-of-the-art models on well-known anomaly detection datasets. The code is available at https://github.com/farzad-bz/DeCo-Diff

From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting

Zhiwei Huang,Hailin Yu,Yichun Shentu,Jin Yuan,Guofeng Zhang

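Task: Propose STDLoc, a new camera relocalization method that uses Feature Gaussian as the scene representation.

Motivation: Existing methods usually rely on a pose prior or require a coarse-to-fine localization pipeline; STDLoc aims for efficient and accurate relocalization without any pose prior.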

Details

Method: Adopt a sparse-to-dense localization paradigm, with a matching-oriented Gaussian sampling strategy and a scene-specific detector, followed by dense feature matching for accurate pose estimation. Result: On indoor and outdoor datasets, STDLoc outperforms current state-of-the-art methods in localization accuracy and recall. Conclusion: With its new scene representation and localization paradigm, STDLoc substantially improves camera relocalization performance. Abstract: This paper presents a novel camera relocalization method, STDLoc, which leverages Feature Gaussian as scene representation. STDLoc is a full relocalization pipeline that can achieve accurate relocalization without relying on any pose prior. Unlike previous coarse-to-fine localization methods that require image retrieval first and then feature matching, we propose a novel sparse-to-dense localization paradigm. Based on this scene representation, we introduce a novel matching-oriented Gaussian sampling strategy and a scene-specific detector to achieve efficient and robust initial pose estimation. Furthermore, based on the initial localization results, we align the query feature map to the Gaussian feature field by dense feature matching to enable accurate localization. The experiments on indoor and outdoor datasets show that STDLoc outperforms current state-of-the-art localization methods in terms of localization accuracy and recall.

Show and Segment: Universal Medical Image Segmentation via In-Context Learning

Yunhe Gao,Di Liu,Zhuowei Li,Yunsheng Li,Dongdong Chen,Mu Zhou,Dimitris N. Metaxas

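Task: Propose Iris, a medical image segmentation framework that adapts flexibly to new tasks from reference examples without fine-tuning.

Motivation: Address the difficulty current deep learning methods have in generalizing across the diversity of medical image segmentation tasks.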

Details

Method: Use a lightweight context task encoding module to distill task information from reference image-label pairs, and use the resulting context embedding to guide segmentation of target objects. Result: Across twelve datasets, Iris performs strongly compared with task-specific models on in-distribution tasks; on seven held-out datasets it shows superior generalization to out-of-distribution data and unseen classes. Conclusion: Beyond efficient task adaptation, Iris can automatically discover anatomical relationships across datasets and modalities, offering insights into medical objects without explicit anatomical supervision. Abstract: Medical image segmentation remains challenging due to the vast diversity of anatomical structures, imaging modalities, and segmentation tasks. While deep learning has made significant advances, current approaches struggle to generalize as they require task-specific training or fine-tuning on unseen classes. We present Iris, a novel In-context Reference Image guided Segmentation framework that enables flexible adaptation to novel tasks through the use of reference examples without fine-tuning. At its core, Iris features a lightweight context task encoding module that distills task-specific information from reference context image-label pairs. This rich context embedding information is used to guide the segmentation of target objects. By decoupling task encoding from inference, Iris supports diverse strategies from one-shot inference and context example ensemble to object-level context example retrieval and in-context tuning. Through comprehensive evaluation across twelve datasets, we demonstrate that Iris performs strongly compared to task-specific models on in-distribution tasks. On seven held-out datasets, Iris shows superior generalization to out-of-distribution data and unseen classes. Further, Iris's task encoding module can automatically discover anatomical relationships across datasets and modalities, offering insights into medical objects without explicit anatomical supervision.

ImageSet2Text: Describing Sets of Images through Text

Piera Riccio,Francesco Galati,Kajetan Schweighofer,Noa Garcia,Nuria Oliver

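Task: Propose ImageSet2Text, a new method that uses vision-language foundation models to automatically generate natural language descriptions of image sets.

Motivation: Inspired by concept bottleneck models (CBMs) and visual question answering (VQA) chains, the work aims to improve the accuracy and interpretability of image-set descriptions by iteratively extracting key concepts and encoding them into a structured graph.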

Details

Method: Based on VQA chains, iteratively extract key concepts from image subsets, encode them into a structured graph, and refine them using an external knowledge graph and CLIP-based validation. Result: The descriptions are evaluated on accuracy, completeness, readability, and overall quality and benchmarked against existing vision-language models; new datasets for large-scale group image captioning are also introduced. Conclusion: Through iterative refinement and a structured representation, ImageSet2Text produces interpretable natural language descriptions of image sets. Abstract: We introduce ImageSet2Text, a novel approach that leverages vision-language foundation models to automatically create natural language descriptions of image sets. Inspired by concept bottleneck models (CBMs) and based on visual-question answering (VQA) chains, ImageSet2Text iteratively extracts key concepts from image subsets, encodes them into a structured graph, and refines insights using an external knowledge graph and CLIP-based validation. This iterative process enhances interpretability and enables accurate and detailed set-level summarization. Through extensive experiments, we evaluate ImageSet2Text's descriptions on accuracy, completeness, readability and overall quality, benchmarking it against existing vision-language models and introducing new datasets for large-scale group image captioning.

VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction

Zizhi Chen,Minghao Han,Xukun Zhang,Shuwei Ma,Tao Liu,Xing Wei,Lihua Zhang

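Task: Improve cancer survival analysis through multimodal learning that combines pathology images and genomic sequences, and enable survival prediction from whole-slide images (WSI) alone in under-resourced regions.

Motivation: Remove the barrier to multimodal learning in clinical practice caused by limited access to genomic sequencing.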

Details

Method: Propose the VGAT framework, which integrates Visual Question Answering (VQA) techniques to reconstruct the genomic modality and uses a cluster-based visual prompt module to enhance discriminative WSI regions. Result: Across five TCGA datasets, VGAT outperforms existing WSI-only methods, showing that genomic-informed inference is viable without sequencing. Conclusion: VGAT offers a clinically feasible route to multimodal research in resource-constrained settings. Abstract: Multimodal learning combining pathology images and genomic sequences enhances cancer survival analysis but faces clinical implementation barriers due to limited access to genomic sequencing in under-resourced regions. To enable survival prediction using only whole-slide images (WSI), we propose the Visual-Genomic Answering-Guided Transformer (VGAT), a framework integrating Visual Question Answering (VQA) techniques for genomic modality reconstruction. By adapting VQA's text feature extraction approach, we derive stable genomic representations that circumvent dimensionality challenges in raw genomic data. Simultaneously, a cluster-based visual prompt module selectively enhances discriminative WSI patches, addressing noise from unfiltered image regions. Evaluated across five TCGA datasets, VGAT outperforms existing WSI-only methods, demonstrating the viability of genomic-informed inference without sequencing. This approach bridges multimodal research and clinical feasibility in resource-constrained settings. The code is available at https://github.com/CZZZZZZZZZZZZZZZZZ/VGAT.

EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models

Yufei Cai,Hu Han,Yuxiang Wei,Shiguang Shan,Xilin Chen

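Task: Propose EfficientMT, an efficient end-to-end framework for video motion transfer.

Motivation: Existing motion transfer methods rely on sample-specific optimization, which is computationally expensive, and motion controllability remains limited.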

Details

Method: Using a small set of synthetic paired motion transfer samples, repurpose the backbone of a T2V model to extract temporal information, and introduce a scaler module and a temporal integration mechanism. Result: EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Conclusion: EfficientMT is an efficient, general video motion transfer framework that requires no test-time optimization. Abstract: The progress in generative models has led to significant advances in text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explored the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on a sample-specific optimization strategy, resulting in high computational burdens. In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be available at https://github.com/PrototypeNx/EfficientMT.

DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image

Hyeongjin Nam,Donghwan Kim,Jeongtaek Oh,Kyoung Mu Lee

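Task: Reconstruct 3D cloth and the human body separately from a single image.

Motivation: Existing methods usually treat the clothed human as a single object and ignore the distinction between cloth and body, while occlusion makes accurate reconstruction of geometry and texture challenging.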

Details

Method: Use 3D template models as regularization to alleviate occlusion, and introduce a purpose-built cloth diffusion model to improve reconstruction of cloth appearance. Result: Qualitative and quantitative experiments show that the method is highly effective at reconstructing both 3D cloth and the human body. Conclusion: By reconstructing cloth and body separately, DeClotH addresses the occlusion problem and improves reconstruction quality with a purpose-built model. Abstract: Most existing methods of 3D clothed human reconstruction from a single image treat the clothed human as a single object without distinguishing between cloth and human body. In this regard, we present DeClotH, which separately reconstructs 3D cloth and human body from a single image. This task remains largely unexplored due to the extreme occlusion between cloth and the human body, making it challenging to infer accurate geometries and textures. Moreover, while recent 3D human reconstruction methods have achieved impressive results using text-to-image diffusion models, directly applying such an approach to this problem often leads to incorrect guidance, particularly in reconstructing 3D cloth. To address these challenges, we propose two core designs in our framework. First, to alleviate the occlusion issue, we leverage 3D template models of cloth and human body as regularizations, which provide strong geometric priors to prevent erroneous reconstruction caused by the occlusion. Second, we introduce a cloth diffusion model specifically designed to provide contextual information about cloth appearance, thereby enhancing the reconstruction of 3D cloth. Qualitative and quantitative experiments demonstrate that our proposed approach is highly effective in reconstructing both 3D cloth and the human body. More qualitative results are provided at https://hygenie1228.github.io/DeClotH/.

Interpretable Generative Models through Post-hoc Concept Bottlenecks

Akshay Kulkarni,Ge Yan,Chung-En Sun,Tuomas Oikarinen,Tsui-Wei Weng

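Task: Propose two low-cost methods (CB-AE and CC) for building interpretable generative models.

Motivation: Existing approaches to interpretable generative models based on concept bottleneck models (CBMs) are inefficient and scale poorly, requiring expensive generative model training and large amounts of manually annotated concept supervision.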

Details

Method: Use post-hoc techniques (CB-AE and CC) for efficient, scalable training that needs no real data and only minimal to no concept supervision. Result: On datasets such as CelebA, CelebA-HQ, and CUB, the methods substantially outperform prior work in interpretability and steerability (about 25% on average) while training 4-15x faster. Conclusion: A large-scale user study validates the interpretability and steerability of the methods. Abstract: Concept bottleneck models (CBM) aim to produce inherently interpretable models that rely on human-understandable concepts for their predictions. However, existing approaches to design interpretable generative models based on CBMs are not yet efficient and scalable, as they require expensive generative model training from scratch as well as real images with labor-intensive concept supervision. To address these challenges, we present two novel and low-cost methods to build interpretable generative models through post-hoc techniques and we name our approaches: concept-bottleneck autoencoder (CB-AE) and concept controller (CC). Our proposed approaches enable efficient and scalable training without the need for real data and require only minimal to no concept supervision. Additionally, our methods generalize across modern generative model families including generative adversarial networks and diffusion models. We demonstrate the superior interpretability and steerability of our methods on numerous standard datasets like CelebA, CelebA-HQ, and CUB with large improvements (average ~25%) over the prior work, while being 4-15x faster to train. Finally, a large-scale user study is performed to validate the interpretability and steerability of our methods.

MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation

Yukang Lin,Hokit Fung,Jianjin Xu,Zeping Ren,Adela S. M. Lau,Guosheng Yin,Xiu Li

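Task: Propose MVPortrait, a novel two-stage text-guided framework for generating expressive multi-view portrait animations.

Motivation: Existing portrait animation methods do well at lip synchronization but lack explicit control over head movements and facial expressions and cannot produce multi-view videos, which limits controllability and expressiveness. Text-guided portrait animation also remains underexplored.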

Details

Method: A two-stage framework: the first stage trains FLAME motion and emotion diffusion models on text input; the second stage trains a multi-view video generation model conditioned on a reference portrait image and multi-view FLAME rendering sequences. Result: Experiments show that MVPortrait outperforms existing methods in motion and emotion control and in view consistency. Conclusion: MVPortrait is the first controllable portrait animation framework compatible with text, speech, and video as driving signals, using FLAME as an intermediate representation to generate expressive multi-view animations. Abstract: Recent portrait animation methods have made significant strides in generating realistic lip synchronization. However, they often lack explicit control over head movements and facial expressions, and cannot produce videos from multiple viewpoints, resulting in less controllable and expressive animations. Moreover, text-guided portrait animation remains underexplored, despite its user-friendly nature. We present a novel two-stage text-guided framework, MVPortrait (Multi-view Vivid Portrait), to generate expressive multi-view portrait animations that faithfully capture the described motion and emotion. MVPortrait is the first to introduce FLAME as an intermediate representation, effectively embedding facial movements, expressions, and view transformations within its parameter space. In the first stage, we separately train the FLAME motion and emotion diffusion models based on text input. In the second stage, we train a multi-view video generation model conditioned on a reference portrait image and multi-view FLAME rendering sequences from the first stage. Experimental results show that MVPortrait outperforms existing methods in terms of motion and emotion control, as well as view consistency. Furthermore, by leveraging FLAME as a bridge, MVPortrait becomes the first controllable portrait animation framework that is compatible with text, speech, and video as driving signals.

Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing

Jaihoon Kim,Taehoon Yoon,Jisung Hwang,Minhyuk Sung

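Task: Propose an inference-time scaling approach for pretrained flow models.

Motivation: Flow models, an alternative to diffusion models, offer fast generation and high-quality outputs, but their deterministic generative process means the efficient inference-time scaling methods used for diffusion models cannot be applied directly.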

Details

Method: Three key ideas: 1) SDE-based generation, 2) interpolant conversion, and 3) Rollover Budget Forcing (RBF). Result: Experiments show that VP-SDE-based generation markedly improves inference-time scaling for flow models, and that RBF combined with VP-SDE performs best. Conclusion: The approach enables efficient inference-time scaling for flow models and outperforms previous approaches. Abstract: We propose an inference-time scaling approach for pretrained flow models. Recently, inference-time scaling has gained significant attention in LLMs and diffusion models, improving sample quality or better aligning outputs with user preferences by leveraging additional computation. For diffusion models, particle sampling has allowed more efficient scaling due to the stochasticity at intermediate denoising steps. In contrast, while flow models have gained popularity as an alternative to diffusion models--offering faster generation and high-quality outputs in state-of-the-art image and video generative models--efficient inference-time scaling methods used for diffusion models cannot be directly applied due to their deterministic generative process. To enable efficient inference-time scaling for flow models, we propose three key ideas: 1) SDE-based generation, enabling particle sampling in flow models, 2) Interpolant conversion, broadening the search space and enhancing sample diversity, and 3) Rollover Budget Forcing (RBF), an adaptive allocation of computational resources across timesteps to maximize budget utilization. Our experiments show that SDE-based generation, particularly variance-preserving (VP) interpolant-based generation, improves the performance of particle sampling methods for inference-time scaling in flow models. Additionally, we demonstrate that RBF with VP-SDE achieves the best performance, outperforming all previous inference-time scaling approaches.

Exploring Textual Semantics Diversity for Image Transmission in Semantic Communication Systems using Visual Language Model

Peishan Huang,Dong Li

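Task: Propose a multi-text transmission semantic communication (Multi-SC) system to improve the reconstruction accuracy of transmitted image semantic signals.

Motivation: Conventional semantic communication systems extract too few semantic features, which leads to low reconstruction accuracy and limits practical use.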

Details

Method: Use a visual language model (VLM) to assist the transmission of image semantic signals: divide the image into blocks, extract multiple pieces of text with a modified Large Language and Vision Assistant (LLaVA), and combine semantic segmentation tags with the semantic text for image recovery. Result: Simulations show that the proposed text semantics diversity scheme significantly improves reconstruction accuracy. Conclusion: By extracting multiple texts and combining them with semantic tags, the Multi-SC system addresses the reconstruction accuracy problem in image semantic transmission. Abstract: In recent years, the rapid development of machine learning has brought reforms and challenges to traditional communication systems. Semantic communication has emerged as an effective strategy for extracting relevant semantic signals, such as semantic segmentation labels and image features, for image transmission. However, an insufficient number of extracted semantic features can result in low reconstruction accuracy, which hinders practical applications and remains challenging to solve. In order to fill this gap, this letter proposes a multi-text transmission semantic communication (Multi-SC) system, which uses the visual language model (VLM) to assist in the transmission of image semantic signals. Unlike previous image transmission semantic communication systems, the proposed system divides the image into multiple blocks and extracts multiple text information from the image using a modified Large Language and Vision Assistant (LLaVA), and combines semantic segmentation tags with semantic text for image recovery. Simulation results show that the proposed text semantics diversity scheme can significantly improve the reconstruction accuracy compared with related works.

TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception

Zhiying Song,Lei Yang,Fuxi Wen,Jun Li

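Task: Address spatio-temporal and semantic feature alignment in multi-agent cooperative perception.

Motivation: Cooperative perception can enhance a vehicle's sensing capability, but inter-agent latency misaligns spatial and semantic features and complicates real-time data fusion.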

Details

Method: Propose the TraF-Align framework, which learns the flow path of features by predicting feature-level object trajectories and generates temporally ordered sampling points to align features. Result: On the V2V4Real and DAIR-V2X-Seq datasets, TraF-Align sets a new benchmark for asynchronous cooperative perception. Conclusion: TraF-Align effectively resolves spatial and semantic misalignment and improves cooperative perception performance. Abstract: Cooperative perception presents significant potential for enhancing the sensing capabilities of individual vehicles; however, inter-agent latency remains a critical challenge. Latencies cause misalignments in both spatial and semantic features, complicating the fusion of real-time observations from the ego vehicle with delayed data from others. To address these issues, we propose TraF-Align, a novel framework that learns the flow path of features by predicting the feature-level trajectory of objects from past observations up to the ego vehicle's current time. By generating temporally ordered sampling points along these paths, TraF-Align directs attention from the current-time query to relevant historical features along each trajectory, supporting the reconstruction of current-time features and promoting semantic interaction across multiple frames. This approach corrects spatial misalignment and ensures semantic consistency across agents, effectively compensating for motion and achieving coherent feature fusion. Experiments on two real-world datasets, V2V4Real and DAIR-V2X-Seq, show that TraF-Align sets a new benchmark for asynchronous cooperative perception.

LangBridge: Interpreting Image as a Combination of Language Embeddings

Jiaqi Liao,Yuwei Niu,Fanqing Meng,Hao Li,Changyao Tian,Yinuo Du,Yuwen Xiong,Dianqi Li,Xizhou Zhu,Li Yuan,Jifeng Dai,Yu Cheng

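Task: Investigate the mechanism of vision-language alignment in Large Vision-Language Models (LVLMs) and propose LangBridge, a new adapter that transfers efficiently across models.

Motivation: Existing LVLMs use a shallow MLP for vision-language alignment, but how it works is poorly understood and the adapter must be retrained for each LLM.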

Details

Method: Propose the LangBridge adapter, which explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings, enabling pretraining-free transfer across models. Result: Experiments show that a LangBridge adapter maintains performance across LLMs of different sizes without retraining. Conclusion: With an interpretable vision-language alignment mechanism and a plug-and-play design, LangBridge enables efficient cross-model transfer. Abstract: Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA's paradigm, mainstream LVLMs typically employ a shallow MLP for visual-language alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While this approach has proven effective, the underlying mechanisms of how MLPs bridge the modality gap remain poorly understood. Although some research has explored how LLMs process transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter requires retraining whenever switching LLM backbones. To address these limitations, we first investigate the working principles of MLP adapters and discover that they progressively learn to project visual embeddings into subspaces spanned by corresponding text embeddings. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This innovative design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results demonstrate that a LangBridge adapter pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B while maintaining competitive performance. Overall, LangBridge enables interpretable vision-language alignment by grounding visual representations in LLM vocabulary embeddings, while its plug-and-play design ensures efficient reuse across multiple LLMs with nearly no performance degradation. See our project page at https://LangBridge.github.io/
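
The core idea stated in the abstract, expressing a visual token as a linear combination of vocabulary embeddings, can be sketched in a few lines. Everything beyond that idea is an assumption made for illustration: the softmax normalization of the coefficients, the function names, and the toy two-word vocabulary are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project_to_vocab_space(coeff_logits, vocab_embeddings):
    """Return sum_i w_i * E[i], with w = softmax(coeff_logits).

    coeff_logits: one score per vocabulary entry (hypothetical adapter output).
    vocab_embeddings: list of embedding vectors, one per vocabulary entry.
    """
    weights = softmax(coeff_logits)
    dim = len(vocab_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, vocab_embeddings))
        for d in range(dim)
    ]


# Toy vocabulary with two 2-D embeddings and equal coefficient scores.
toy_vocab = [[1.0, 0.0], [0.0, 1.0]]
token = project_to_vocab_space([0.0, 0.0], toy_vocab)
```

Under this formulation the adapter only produces coefficients over the vocabulary, so moving to another LLM amounts to swapping in that model's embedding table, which is one plausible reading of why the abstract describes the transfer as pretraining-free.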

Mingxiao Tu,Hoijoon Jung,Alireza Moghadam,Jineel Raythatha,Lachlan Allan,Jeremy Hsu,Andre Kyme,Jinman Kim

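Task: Develop a multi-modal network (mPSE-CT) for precise estimation of a patient's in-bed 3D pose and shape in perioperative care.

Motivation: Conventional methods give inaccurate estimates because of occlusion and complex patient positioning, which can affect clinical outcomes.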

Details

Method: Fuse geometric features from CT scans with depth maps, using a shape estimation module, a pose estimation module, and a parameters mixing module. Result: mPSE-CT outperforms the best-performing prior method by 23% in pose estimation and 49.16% in shape estimation. Conclusion: mPSE-CT has the potential to improve clinical outcomes in challenging perioperative environments. Abstract: In perioperative care, precise in-bed 3D patient pose and shape estimation (PSE) can be vital in optimizing patient positioning in preoperative planning, enabling accurate overlay of medical images for augmented reality-based surgical navigation, and mitigating risks of prolonged immobility during recovery. Conventional PSE methods relying on modalities such as RGB-D, infrared, or pressure maps often struggle with occlusions caused by bedding and complex patient positioning, leading to inaccurate estimation that can affect clinical outcomes. To address these challenges, we present the first multi-modal in-bed patient 3D PSE network that fuses detailed geometric features extracted from routinely acquired computed tomography (CT) scans with depth maps (mPSE-CT). mPSE-CT incorporates a shape estimation module that utilizes probabilistic correspondence alignment, a pose estimation module with a refined neural network, and a final parameters mixing module. This multi-modal network robustly reconstructs occluded body regions and enhances the accuracy of the estimated 3D human mesh model. We validated mPSE-CT using proprietary whole-body rigid phantom and volunteer datasets in clinical scenarios. mPSE-CT outperformed the best-performing prior method by 23% and 49.16% in pose and shape estimation respectively, demonstrating its potential for improving clinical outcomes in challenging perioperative environments.

M$^2$CD: A Unified MultiModal Framework for Optical-SAR Change Detection with Mixture of Experts and Self-Distillation

Ziyuan Liu,Jiawei Zhang,Wenyu Wang,Yuantao Gu

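Task: Propose M$^2$CD, a unified multimodal change detection framework that handles the cross-modal data distribution between optical and synthetic aperture radar (SAR) images.

Motivation: Existing methods mainly target optical images, but in extreme scenarios such as disaster response SAR images are more suitable, and existing methods struggle with the cross-modal gap between optical and SAR images.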

Details

Method: Propose the M$^2$CD framework, which integrates Mixture of Experts (MoE) modules to handle multimodal data and adds an Optical-to-SAR guided path (O2SP) with self-distillation during training to reduce the feature space discrepancy between modalities. Result: Experiments show that the MiT-b1 version of M$^2$CD outperforms all existing methods on optical-SAR change detection. Conclusion: M$^2$CD effectively addresses cross-modal change detection between optical and SAR images and clearly outperforms existing methods. Abstract: Most existing change detection (CD) methods focus on optical images captured at different times, and deep learning (DL) has achieved remarkable success in this domain. However, in extreme scenarios such as disaster response, synthetic aperture radar (SAR), with its active imaging capability, is more suitable for providing post-event data. This introduces new challenges for CD methods, as existing weight-sharing Siamese networks struggle to effectively learn the cross-modal data distribution between optical and SAR images. To address this challenge, we propose a unified MultiModal CD framework, M$^2$CD. We integrate Mixture of Experts (MoE) modules into the backbone to explicitly handle diverse modalities, thereby enhancing the model's ability to learn multimodal data distributions. Additionally, we innovatively propose an Optical-to-SAR guided path (O2SP) and implement self-distillation during training to reduce the feature space discrepancy between different modalities, further alleviating the model's learning burden. We design multiple variants of M$^2$CD based on both CNN and Transformer backbones. Extensive experiments validate the effectiveness of the proposed framework, with the MiT-b1 version of M$^2$CD outperforming all state-of-the-art (SOTA) methods in optical-SAR CD tasks.
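
For readers unfamiliar with Mixture of Experts, the sketch below shows a generic dense (softly gated) MoE layer: a gate scores each expert for the current input and the output is the gate-weighted sum of the expert outputs. It is a textbook illustration only; the gating form, the function name `moe_forward`, and the toy experts are assumptions and do not describe how M$^2$CD wires MoE into its backbone.

```python
import math


def moe_forward(x, experts, gate_weights):
    """Generic soft-gated mixture of experts.

    x: input feature vector.
    experts: list of callables, each mapping a vector to a vector.
    gate_weights: one weight row per expert; gate logit = row . x.
    Returns (mixed_output, gate_probabilities).
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    gates = [e / z for e in exps]
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(dim)]
    return mixed, gates


# Two toy experts; a zero gate matrix makes both experts equally weighted.
toy_experts = [lambda v: [2.0 * a for a in v], lambda v: [-a for a in v]]
mixed, gates = moe_forward([1.0, 2.0], toy_experts, [[0.0, 0.0], [0.0, 0.0]])
```

In a multimodal setting the gate can learn to route optical and SAR features to different experts, which is the intuition behind using MoE to handle diverse modalities.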

A Prototype-Guided Coarse Annotations Refining Approach for Whole Slide Images

Bingjian Yao,Weiping Lin,Yan He,Zheng Wang,Liangsheng Wang

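Task: Propose a prototype-guided approach for refining coarse annotations of pathology images into fine-grained ones.

Motivation: Existing methods rely on extensive training samples or clean datasets and fail to capture both local and global semantic patterns in pathology images, which limits their precision.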

Details

Method: Build non-redundant prototypes with a local-to-global approach, combined with a prototype-guided pseudo-labeling module and a dynamic data sampling and re-finetuning strategy. Result: The method significantly outperforms existing methods on three public WSI datasets. Conclusion: The prototype-guided approach effectively improves the precision of pathology image annotation. Abstract: The fine-grained annotations in whole slide images (WSIs) show the boundaries of various pathological regions. However, generating such detailed annotation is often costly, whereas the coarse annotations are relatively simpler to produce. Existing methods for refining coarse annotations often rely on extensive training samples or clean datasets, and fail to capture both intra-slide and inter-slide latent semantic patterns, limiting their precision. In this paper, we propose a prototype-guided approach. Specifically, we introduce a local-to-global approach to construct non-redundant representative prototypes by jointly modeling intra-slide local semantics and inter-slide contextual relationships. Then a prototype-guided pseudo-labeling module is proposed for refining coarse annotations. Finally, we employ a dynamic data sampling and re-finetuning strategy to train a patch classifier. Extensive experiments on three publicly available WSI datasets, covering lymph, liver, and colorectal cancers, demonstrate that our method significantly outperforms existing state-of-the-art (SOTA) methods. The code will be available.
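
One simple way to picture prototype-guided pseudo-labeling is nearest-prototype assignment: each patch feature receives the label of the most similar prototype. The sketch below uses cosine similarity and hypothetical names (`pseudo_label`, the `tumor` and `normal` classes); it is an assumed illustration, not the paper's module, whose prototype construction and labeling rule are not detailed in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pseudo_label(patch_features, prototypes):
    """Assign each patch the label of its most similar prototype.

    prototypes: dict mapping a class label to a list of prototype vectors.
    """
    labels = []
    for feature in patch_features:
        best_score, best_label = max(
            (
                (cosine(feature, proto), label)
                for label, protos in prototypes.items()
                for proto in protos
            ),
            key=lambda pair: pair[0],
        )
        labels.append(best_label)
    return labels


toy_prototypes = {"tumor": [[1.0, 0.0]], "normal": [[0.0, 1.0]]}
patches = [[0.9, 0.1], [0.2, 0.8]]
labels = pseudo_label(patches, toy_prototypes)
```

A practical pipeline would typically keep only confident assignments, for example by thresholding the best similarity, before using them to train a patch classifier.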

EmoHead: Emotional Talking Head via Manipulating Semantic Expression Parameters

Xuli Shen,Hua Cai,Dingding Yu,Weilin Shen,Qing Xu,Xiangyang Xue

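Task: Generate emotion-specific talking head videos from audio input.

Motivation: Emotion is a highly abstract concept with ambiguous boundaries, so disentangled expression parameters are needed to generate emotionally expressive talking head videos.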

Details

Method: Propose EmoHead, which synthesizes videos from semantic expression parameters, predicts those parameters with an audio-expression module, and refines facial movements using a pre-trained hyperplane. Result: Experiments show that semantic expression parameters improve reconstruction quality and controllability. Conclusion: By disentangling expression parameters and refining facial movements, EmoHead generates emotion-consistent talking head videos. Abstract: Generating emotion-specific talking head videos from audio input is an important and complex challenge for human-machine interaction. However, emotion is a highly abstract concept with ambiguous boundaries, and it necessitates disentangled expression parameters to generate emotionally expressive talking head videos. In this work, we present EmoHead to synthesize talking head videos via semantic expression parameters. To predict expression parameters for arbitrary audio input, we apply an audio-expression module that can be specified by an emotion tag. This module aims to enhance correlation from audio input across various emotions. Furthermore, we leverage a pre-trained hyperplane to refine facial movements by probing along the vertical direction. Finally, the refined expression parameters regularize neural radiance fields and facilitate the emotion-consistent generation of talking head videos. Experimental results demonstrate that semantic expression parameters lead to better reconstruction quality and controllability.

COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting

Jiaxin Zhang,Junjun Jiang,Youyu Chen,Kui Jiang,Xianming Liu

Task: Improve 3D segmentation based on 3D Gaussian Splatting (3DGS) with the COB-GS method so that object boundaries are clearly delineated.

Motivation: 3DGS struggles to delineate object boundaries accurately during segmentation because of the volumetric nature of Gaussian primitives and the lack of semantic guidance.

Details

Method: Introduces a boundary-adaptive Gaussian splitting technique and visual optimization, jointly optimizing semantic and visual information. Result: COB-GS substantially improves segmentation accuracy and robustness while preserving high visual quality. Conclusion: Through cooperative optimization of semantic and visual information, COB-GS effectively resolves blurry boundaries in 3DGS segmentation. Abstract: Accurate object segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D segmentation based on 3D Gaussian Splatting (3DGS) struggles with accurately delineating object boundaries, as Gaussian primitives often span across object edges due to their inherent volume and the lack of semantic guidance during training. In order to tackle these challenges, we introduce Clear Object Boundaries for 3DGS Segmentation (COB-GS), which aims to improve segmentation accuracy by clearly delineating blurry boundaries of interwoven Gaussian primitives within the scene. Unlike existing approaches that remove ambiguous Gaussians and sacrifice visual quality, COB-GS, as a 3DGS refinement method, jointly optimizes semantic and visual information, allowing the two different levels to cooperate with each other effectively. Specifically, for the semantic guidance, we introduce a boundary-adaptive Gaussian splitting technique that leverages semantic gradient statistics to identify and split ambiguous Gaussians, aligning them closely with object boundaries. For the visual optimization, we rectify the degraded suboptimal texture of the 3DGS scene, particularly along the refined boundary structures. Experimental results show that COB-GS substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding clear boundaries while preserving high visual quality. Code is available at https://github.com/ZestfulJX/COB-GS.

Towards Robust Time-of-Flight Depth Denoising with Confidence-Aware Diffusion Model

Changyong He,Jin Zeng,Jiawei Zhang,Jiajie Guo

Task: Propose DepthCAD, a new method for depth denoising of ToF sensors.

Motivation: Existing methods based on deep neural networks perform poorly under severe noise corruption because they lack sufficient prior knowledge of the ToF data distribution.

Details

Method: Uses the rich prior knowledge in Stable Diffusion to ensure global structural smoothness, and steers the diffusion process with confidence guidance to maintain local metric accuracy. Result: Experiments show state-of-the-art performance in ToF depth denoising, and evaluation on real data verifies robustness. Conclusion: DepthCAD is an effective ToF denoising method that combines diffusion-model priors with confidence guidance and markedly improves denoising quality. Abstract: Time-of-Flight (ToF) sensors efficiently capture scene depth, but the nonlinear depth construction procedure often results in extremely large noise variance or even invalid areas. Recent methods based on deep neural networks (DNNs) achieve enhanced ToF denoising accuracy but tend to struggle when presented with severe noise corruption due to limited prior knowledge of ToF data distribution. In this paper, we propose DepthCAD, a novel ToF denoising approach that ensures global structural smoothness by leveraging the rich prior knowledge in Stable Diffusion and maintains local metric accuracy by steering the diffusion process with confidence guidance. To adapt the pretrained image diffusion model to ToF depth denoising, we apply the diffusion on raw ToF correlation measurements with dynamic range normalization before converting to depth maps. Experimental results validate the state-of-the-art performance of the proposed scheme, and the evaluation on real data further verifies its robustness against real-world ToF noise.

SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors

Yiqing Li,Xuan Wang,Jiawei Wu,Yikun Ma,Zhi Jin

Task: Synthesize novel views of large-scale scenes from sparse input images (as few as five).

Motivation: Existing methods perform poorly under sparse input conditions and produce noticeable artifacts, so improvement is needed.

Details

Method: Proposes SparseGS-W, a framework based on 3D Gaussian Splatting that combines geometric priors with constrained diffusion priors and introduces a Constrained Novel-View Enhancement module and an Occlusion Handling module. Result: On the PhotoTourism and Tanks and Temples datasets, SparseGS-W achieves state-of-the-art performance on multiple metrics. Conclusion: SparseGS-W handles sparse input effectively and achieves high-quality novel view synthesis for complex scenes. Abstract: Synthesizing novel views of large-scale scenes from unconstrained in-the-wild images is an important but challenging task in computer vision. Existing methods, which optimize per-image appearance and transient occlusion through implicit neural networks from dense training views (approximately 1000 images), struggle to perform effectively under sparse input conditions, resulting in noticeable artifacts. To this end, we propose SparseGS-W, a novel framework based on 3D Gaussian Splatting that enables the reconstruction of complex outdoor scenes and handles occlusions and appearance changes with as few as five training images. We leverage geometric priors and constrained diffusion priors to compensate for the lack of multi-view information from extremely sparse input. Specifically, we propose a plug-and-play Constrained Novel-View Enhancement module to iteratively improve the quality of rendered novel views during the Gaussian optimization process. Furthermore, we propose an Occlusion Handling module, which flexibly removes occlusions utilizing the inherent high-quality inpainting capability of constrained diffusion priors. Both modules are capable of extracting appearance features from any user-provided reference image, enabling flexible modeling of illumination-consistent scenes. Extensive experiments on the PhotoTourism and Tanks and Temples datasets demonstrate that SparseGS-W achieves state-of-the-art performance not only in full-reference metrics, but also in commonly used non-reference metrics such as FID, ClipIQA, and MUSIQ.

G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation

Juntao Jian,Xiuping Liu,Zixuan Chen,Manyi Li,Jian Liu,Ruizhen Hu

Task: Propose G-DexGrasp, a retrieval-augmented generation method that produces high-quality dexterous hand configurations for unseen object categories and language-based task instructions.

Motivation: Existing methods have made progress in dexterous grasping synthesis but struggle to generalize to unseen object categories and diverse task instructions.

Details

Method: Retrieves generalizable grasping priors (the fine-grained contact part and the affordance-related distribution), then combines a generative model with refinement optimization to produce plausible grasp configurations. Result: Experiments validate the method's generalization ability and show better performance than existing approaches. Conclusion: Through retrieval-augmented generation, G-DexGrasp markedly improves dexterous grasping on unseen objects and task instructions. Abstract: Recent advances in dexterous grasping synthesis have demonstrated significant progress in producing reasonable and plausible grasps for many task purposes. However, it remains challenging to generalize to unseen object categories and diverse task instructions. In this paper, we propose G-DexGrasp, a retrieval-augmented generation approach that can produce high-quality dexterous hand configurations for unseen object categories and language-based task instructions. The key is to retrieve generalizable grasping priors, including the fine-grained contact part and the affordance-related distribution of relevant grasping instances, for the following synthesis pipeline. Specifically, the fine-grained contact part and affordance act as generalizable guidance to infer reasonable grasping configurations for unseen objects with a generative model, while the relevant grasping distribution acts as regularization to guarantee the plausibility of synthesized grasps during the subsequent refinement optimization. Our comparison experiments validate the effectiveness of our key designs for generalization and demonstrate remarkable performance against existing approaches. Project page: https://g-dexgrasp.github.io/

GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting

Shujuan Li,Yu-Shen Liu,Zhizhong Han

Task: Propose a new method that bridges the gap between 3D Gaussians and unsigned distance functions (UDFs) through 2D Gaussian planes and self-supervised learning, enabling high-accuracy reconstruction of open surfaces.

Motivation: Existing methods struggle to learn continuous, implicit UDF representations through 3D Gaussian splatting (3DGS), which limits surface reconstruction of complex objects.

Details

Method: Fits thin, flat 2D Gaussian planes to surfaces, supervises unsigned distances both near and far from surfaces using self-supervision and gradient-based inference, and introduces new constraints to stabilize learning. Result: On several benchmarks and real data, the method outperforms prior work in accuracy, efficiency, completeness, and boundary sharpness. Conclusion: The method effectively closes the representation gap between 3D Gaussians and UDFs and markedly improves the quality of open-surface reconstruction. Abstract: Reconstructing open surfaces from multi-view images is vital in digitalizing complex objects in daily life. A widely used strategy is to learn unsigned distance functions (UDFs) by checking if their appearance conforms to the image observations through neural rendering. However, it is still hard to learn continuous and implicit UDF representations through 3D Gaussian splatting (3DGS) due to the discrete and explicit scene representation, i.e., 3D Gaussians. To resolve this issue, we propose a novel approach to bridge the gap between 3D Gaussians and UDFs. Our key idea is to overfit thin and flat 2D Gaussian planes on surfaces, and then leverage self-supervision and gradient-based inference to supervise unsigned distances in areas both near and far from surfaces. To this end, we introduce novel constraints and strategies to constrain the learning of 2D Gaussians to pursue more stable optimization and more reliable self-supervision, addressing the challenges brought by the complicated gradient field on or near the zero level set of UDFs. We report numerical and visual comparisons with the state-of-the-art on widely used benchmarks and real data to show our advantages in terms of accuracy, efficiency, completeness, and sharpness of reconstructed open surfaces with boundaries. Project page: https://lisj575.github.io/GaussianUDF/

AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset

Haiyu Zhang,Xinyuan Chen,Yaohui Wang,Xihui Liu,Yunhong Wang,Yu Qiao

Task: Propose AccVideo, an efficient method that reduces the inference steps of video diffusion models and accelerates generation.

Motivation: Video diffusion models need many inference steps and are computationally expensive, and existing diffusion distillation methods face challenges of their own, so a more efficient solution is needed.

Details

Method: Uses a pretrained video diffusion model to generate multiple valid denoising trajectories as a synthetic dataset, designs a trajectory-based few-step guidance, and introduces an adversarial training strategy to align output distributions. Result: Experiments show that AccVideo generates 8.5x faster than the teacher model with comparable performance, and produces videos of higher quality and resolution than previous acceleration methods. Conclusion: AccVideo is an efficient, high-quality method for accelerating video diffusion models. Abstract: Diffusion models have achieved remarkable progress in the field of video generation. However, their iterative denoising nature requires a large number of inference steps to generate a video, which is slow and computationally expensive. In this paper, we begin with a detailed analysis of the challenges present in existing diffusion distillation methods and propose a novel efficient method, namely AccVideo, to reduce the inference steps for accelerating video diffusion models with a synthetic dataset. We leverage the pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which eliminates the use of useless data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that utilizes key data points from the denoising trajectories to learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each diffusion timestep, we introduce an adversarial training strategy to align the output distribution of the student model with that of our synthetic dataset, thereby enhancing the video quality. Extensive experiments demonstrate that our model achieves an 8.5x improvement in generation speed compared to the teacher model while maintaining comparable performance. Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution, i.e., 5-second, 720x1280, 24fps.

Noisier2Inverse: Self-Supervised Learning for Image Reconstruction with Correlated Noise

Nadja Gruber,Johannes Schwab,Markus Haltmeier,Ander Biguri,Clemens Dlaska,Gyeongha Hwang

Task: Propose Noisier2Inverse, a correction-free self-supervised deep learning method for general inverse problems.

Motivation: When measurement noise is statistically correlated (as in CT, microscopy, and seismic imaging), conventional methods need ground truth samples, whereas Noisier2Inverse learns a reconstruction function without them.

Details

Method: Generates noisier data from which the network learns, with a loss in measurement space that trains the network to recover an extrapolated image, avoiding an extrapolation step at inference. Result: Numerical experiments show that the method clearly outperforms previous self-supervised approaches that account for correlated noise. Conclusion: Noisier2Inverse is an effective self-supervised method for inverse problems with correlated noise and requires no ground truth samples. Abstract: We propose Noisier2Inverse, a correction-free self-supervised deep learning approach for general inverse problems. The proposed method learns a reconstruction function without the need for ground truth samples and is applicable in cases where measurement noise is statistically correlated. This includes computed tomography, where detector imperfections or photon scattering create correlated noise patterns, as well as microscopy and seismic imaging, where physical interactions during measurement introduce dependencies in the noise structure. Similar to Noisier2Noise, a key step in our approach is the generation of noisier data from which the reconstruction network learns. However, unlike Noisier2Noise, the proposed loss function operates in measurement space and is trained to recover an extrapolated image instead of the original noisy one. This eliminates the need for an extrapolation step during inference, which would otherwise suffer from ill-posedness. We numerically demonstrate that our method clearly outperforms previous self-supervised approaches that account for correlated noise.

A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition

Yaomin Shen,Xiaojian Lin,Wei Fan

Task: Recognize human intent by integrating multiple modalities, such as language text, body gestures, and tones.

Motivation: Existing methods fail to adequately capture the intrinsic connections between modalities and overlook the semantic representation of intent.

Details

Method: Proposes the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework, consisting of an anchor-based multimodal embedding module and a semantic synchronization strategy. Result: A-MESS achieves state-of-the-art performance in experiments and offers substantial insight into multimodal representation and downstream tasks. Conclusion: The A-MESS framework effectively addresses modality connection and semantic representation in multimodal intent recognition. Abstract: In the domain of multimodal intent recognition (MIR), the objective is to recognize human intent by integrating a variety of modalities, such as language text, body gestures, and tones. However, existing approaches have difficulty adequately capturing the intrinsic connections between the modalities and overlook the corresponding semantic representations of intent. To address these limitations, we present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. Furthermore, we develop a Semantic Synchronization (SS) strategy with the Triplet Contrastive Learning pipeline, which optimizes the process by synchronizing multimodal representation with label descriptions produced by the large language model. Comprehensive experiments indicate that our A-MESS achieves state-of-the-art performance and provides substantial insight into multimodal representation and downstream tasks.
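
The abstract names a Triplet Contrastive Learning pipeline but does not give its loss. As a rough illustration only, the sketch below uses the standard triplet margin loss, which may differ from the authors' formulation. The roles assigned here are assumptions: the anchor is a fused multimodal embedding, the positive is the embedding of the matching label description, and the negative is the embedding of a different label description. The function name and the margin value are hypothetical.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on Euclidean distances (illustrative only)."""
    d_pos = math.dist(anchor, positive)  # distance to the matching description
    d_neg = math.dist(anchor, negative)  # distance to a non-matching description
    # The loss is zero once the negative is at least `margin` farther than the positive.
    return max(0.0, d_pos - d_neg + margin)


# Toy 2D embeddings: the anchor sits close to the positive and far from the negative.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
negative = [1.0, 0.0]
loss_good = triplet_margin_loss(anchor, positive, negative)
loss_bad = triplet_margin_loss(anchor, negative, positive)
```

With these toy vectors the well-separated triplet gives zero loss, while swapping positive and negative gives a positive loss that training would push down.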

TeLL Me what you cant see

Saverio Cavasin,Pietro Biasetton,Mattia Tamiazzo,Mauro Conti,Simone Milani

Task: Propose a novel forensic mugshot augmentation framework that improves the accuracy of person identification in criminal investigations.

Motivation: Law enforcement agencies often find that scarce or outdated high-quality images reduce the accuracy and success rate of person searches.

Details

Method: Generates additional high-quality images through customizable data augmentation techniques while maintaining the biometric integrity and consistency of the original data. Result: Experiments show that the method significantly improves identification accuracy and robustness across various forensic scenarios. Conclusion: The framework is a trustworthy tool for law enforcement applications. Abstract: During criminal investigations, images of persons of interest directly influence the success of identification procedures. However, law enforcement agencies often face challenges related to the scarcity of high-quality images or their obsolescence, which can affect the accuracy and success of people searching processes. This paper introduces a novel forensic mugshot augmentation framework aimed at addressing these limitations. Our approach enhances the identification probability of individuals by generating additional, high-quality images through customizable data augmentation techniques, while maintaining the biometric integrity and consistency of the original data. Several experimental results show that our method significantly improves identification accuracy and robustness across various forensic scenarios, demonstrating its effectiveness as a trustworthy tool for law enforcement applications. Index Terms: Digital Forensics, Person re-identification, Feature extraction, Data augmentation, Visual-Language models.

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Shijie Ma,Yuying Ge,Teng Wang,Yuxin Guo,Yixiao Ge,Ying Shan

Task: Explore the synergy between generative models and discriminative models (such as CLIP) to improve the representation of fine-grained visual details.

Motivation: The discriminative model CLIP excels at high-level semantics but falls short in perceiving fine-grained visual details, and the conditional reconstruction mechanism of generative models remains underexplored.

Details

Method: Empirically studies conditioning mechanisms, denoising configurations, and generation paradigms, proposes a two-stage training strategy and lightweight denoisers, and arrives at the GenHancer method. Result: GenHancer consistently outperforms prior methods on the MMVP-VLM benchmark (e.g., 6.0% on OpenAICLIP) and can be used to enhance the vision-centric performance of multimodal large language models. Conclusion: Through systematic exploration, GenHancer effectively extracts fine-grained knowledge from generative models and offers an efficient solution for enhancing visual representations. Abstract: The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically found that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore critical factors, we delve into three aspects: (1) Conditioning mechanisms: We found that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observed that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through our in-depth explorations, we have finally arrived at an effective method, namely GenHancer, which consistently outperforms prior arts on the MMVP-VLM benchmark, e.g., 6.0% on OpenAICLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All the models and codes are made publicly available.

Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage

Zhengwentai Sun,Heyuan Li,Xihe Yang,Keru Zheng,Shuliang Ning,Yihao Zhi,Hongjie Liao,Chenghong Li,Shuguang Cui,Xiaoguang Han

Task: Propose a new disentangled and controllable human image synthesis task that explicitly separates and manipulates four factors: viewpoint, pose, clothing, and identity.

Motivation: Existing methods focus mainly on facial synthesis or near-frontal body generation and have limited ability to control several key factors at once in a disentangled manner.

Details

Method: Develops an end-to-end generative model, then proposes a stage-by-stage framework that decomposes human image generation into three sequential steps: clothed A-pose generation, back-view synthesis, and pose and view control. Result: The stage-by-stage approach outperforms end-to-end models in visual fidelity and disentanglement quality, and significantly improves controllability and generalization. Conclusion: The stage-by-stage framework offers a scalable solution for real-world tasks and performs especially well in in-the-wild scenarios. Abstract: Achieving fine-grained controllability in human image synthesis is a long-standing challenge in computer vision. Existing methods primarily focus on either facial synthesis or near-frontal body generation, with limited ability to simultaneously control key factors such as viewpoint, pose, clothing, and identity in a disentangled manner. In this paper, we introduce a new disentangled and controllable human synthesis task, which explicitly separates and manipulates these four factors within a unified framework. We first develop an end-to-end generative model trained on MVHumanNet for factor disentanglement. However, the domain gap between MVHumanNet and in-the-wild data produces unsatisfactory results, motivating the exploration of a virtual try-on (VTON) dataset as a potential solution. Through experiments, we observe that simply incorporating the VTON dataset as additional data to train the end-to-end model degrades performance, primarily due to the inconsistency in data forms between the two datasets, which disrupts the disentanglement process. To better leverage both datasets, we propose a stage-by-stage framework that decomposes human image generation into three sequential steps: clothed A-pose generation, back-view synthesis, and pose and view control. This structured pipeline enables better dataset utilization at different stages, significantly improving controllability and generalization, especially for in-the-wild scenarios. Extensive experiments demonstrate that our stage-by-stage approach outperforms end-to-end models in both visual fidelity and disentanglement quality, offering a scalable solution for real-world tasks. Additional demos are available on the project page: https://taited.github.io/discohuman-project/.

Pose-Based Fall Detection System: Efficient Monitoring on Standard CPUs

Vinayak Mali,Saurabh Jaiswal

Task: Propose a fall detection system for elderly people that requires no additional sensors or high-performance hardware.

Motivation: Existing fall detection systems depend on dedicated hardware or high computational resources.

Details

Method: Combines pose estimation, threshold-based analysis, and a voting mechanism, using the MediaPipe framework for real-time processing. Result: The system runs efficiently on standard CPUs, reducing false positives while maintaining high accuracy. Conclusion: Provides a low-cost, efficient fall detection solution suited to settings such as old age homes. Abstract: Falls among elderly residents in assisted living homes pose significant health risks, often leading to injuries and a decreased quality of life. Current fall detection solutions typically rely on sensor-based systems that require dedicated hardware, or on video-based models that demand high computational resources and GPUs for real-time processing. In contrast, this paper presents a robust fall detection system that does not require any additional sensors or high-powered hardware. The system uses pose estimation techniques, combined with threshold-based analysis and a voting mechanism, to effectively distinguish between fall and non-fall activities. For pose detection, we leverage MediaPipe, a lightweight and efficient framework that enables real-time processing on standard CPUs with minimal computational overhead. By analyzing motion, body position, and key pose points, the system processes pose features with a 20-frame buffer, minimizing false positives and maintaining high accuracy even in real-world settings. This unobtrusive, resource-efficient approach provides a practical solution for enhancing resident safety in old age homes, without the need for expensive sensors or high-end computational resources.
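
The combination of per-frame thresholds and voting over a 20-frame buffer can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two pose features, their thresholds, and the vote count of 12 are assumptions chosen for the example. Only the 20-frame buffer size comes from the abstract, and the pose features would in practice be derived from MediaPipe landmarks.

```python
from collections import deque


def frame_flag(torso_angle_deg, hip_drop_speed, angle_thr=60.0, speed_thr=0.5):
    """Hypothetical per-frame threshold test on two pose-derived features."""
    return torso_angle_deg > angle_thr and hip_drop_speed > speed_thr


def make_fall_voter(buffer_size=20, min_votes=12):
    """Return an update function that votes over the most recent frames."""
    buffer = deque(maxlen=buffer_size)  # old frames drop out automatically

    def update(frame_looks_like_fall):
        buffer.append(bool(frame_looks_like_fall))
        # Decide only once the buffer is full, then require enough positive votes.
        return len(buffer) == buffer_size and sum(buffer) >= min_votes

    return update


# A sustained run of positive frames triggers a detection on the 20th frame.
voter = make_fall_voter()
sustained = [voter(frame_flag(75.0, 0.8)) for _ in range(20)]

# Isolated positive frames (one in every seven) never reach the vote count.
noisy_voter = make_fall_voter()
sporadic = [noisy_voter(i % 7 == 0) for i in range(40)]
```

Voting over a buffer trades a short detection delay for fewer false positives, since a single noisy frame cannot trigger an alarm.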

Adaptive Weighted Parameter Fusion with CLIP for Class-Incremental Learning

Juncen Guo,Xiaoguang Zhu,Liangyu Teng,Hao Yang,Jing Liu,Yang Liu,Liang Song

Task: Design an adaptive weighted parameter fusion method with CLIP to address catastrophic forgetting in class-incremental learning.

Motivation: In class-incremental learning, a model forgets knowledge of old classes when it optimizes for new ones, causing catastrophic forgetting, so it must trade off retaining old knowledge against accommodating new information.

Details

Method: Proposes adaptive weighted parameter fusion with CLIP that accounts for the variability of data distributions across tasks, and introduces a balance factor that balances data distribution alignment against task distinguishability. Result: The superiority of the proposed method is validated on several traditional benchmarks. Conclusion: The method effectively addresses catastrophic forgetting in class-incremental learning while retaining the effective information in the parameter matrix. Abstract: Class-incremental Learning (CIL) enables the model to incrementally absorb knowledge from new classes and build a generic classifier across all previously encountered classes. When the model optimizes with new classes, the knowledge of previous classes is inevitably erased, leading to catastrophic forgetting. Addressing this challenge requires making a trade-off between retaining old knowledge and accommodating new information. However, this balancing process often requires sacrificing some information, which can lead to a partial loss in the model's ability to discriminate between classes. To tackle this issue, we design an adaptive weighted parameter fusion with Contrastive Language-Image Pre-training (CLIP), which not only takes into account the variability of the data distribution of different tasks, but also retains all the effective information of the parameter matrix to the greatest extent. In addition, we introduce a balance factor that can balance the data distribution alignment and distinguishability of adjacent tasks. Experimental results on several traditional benchmarks validate the superiority of the proposed method.

Improved Alignment of Modalities in Large Vision Language Models

Kartik Jangra,Aman Kumar Singh,Yashwani Mann,Geetanjali Rathee

Task: Propose a training strategy for auto-regressive vision-language models that unifies vision-language tasks such as image captioning and visual question answering.

Motivation: Existing methods require very large language models or very large datasets, which is inefficient, and a unified approach to multi-task alignment is lacking.

Details

Method: Designs four training stages to align the vision model with the language model, and modifies attention masks to improve the quality of visual features. Result: The model outperforms the VILA-13 billion model on the COCO and Flickr30k datasets, comes close to GIT-2, and completes training within 12 hours. Conclusion: The proposed training strategy is efficient and general, and adapts to downstream tasks such as visual question answering in healthcare. Abstract: Recent advancements in vision-language models have achieved remarkable results in making language models understand vision inputs. However, a unified approach to align these models across diverse tasks such as image captioning and visual question answering remains a challenge. Existing methods require either very large language models or very large datasets, which makes inefficient use of existing models. This paper addresses this gap and devises a training strategy for auto-regressive vision-language models to unify vision-language tasks like image captioning and visual question answering. We propose four training stages for aligning the vision model with the language model; in other words, the language model is given the ability to process visual inputs. We also devise different attention masks for training transformer-based language models that improve the quality of visual features. Further, we report several findings: 1) the attention mask should not be applied on visual inputs, 2) the language model converges faster on AI-generated data, 3) more work should be done in the alignment stage during the pre-training of the model, 4) the model can easily adapt to downstream tasks such as visual question answering on healthcare datasets like PathVQA. After one epoch of training in each stage, the model outperforms larger models such as the VILA-13 billion model on common benchmarks, measured by CIDEr scores on the COCO and Flickr30k datasets, and scores very close to GIT-2 on the same datasets despite being a much smaller model trained on a much smaller dataset. All training uses available best practices such as multi-GPU parallel training, lower-precision training with 16-bit floating-point numbers, faster attention (SDPA), and gradient accumulation, and completes within 12 hours.
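
One way to read the finding that the attention mask should not be applied on visual inputs is a prefix-style mask: visual tokens attend to each other freely, while text tokens stay causal. The sketch below builds such a mask under that assumption. The abstract does not specify the exact mask, so the layout (visual tokens first, then text) and the function name are illustrative, not the authors' design.

```python
def build_prefix_mask(num_visual, num_text):
    """Return an n x n boolean matrix; mask[q][k] is True if query q may attend to key k.

    Assumed layout: positions [0, num_visual) are visual tokens, the rest are text.
    """
    n = num_visual + num_text
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            if k < num_visual:
                # Visual keys are visible to every query, including other visual tokens.
                row.append(True)
            else:
                # Text keys follow the usual causal rule: no attention to future tokens.
                row.append(k <= q)
        mask.append(row)
    return mask


mask = build_prefix_mask(num_visual=2, num_text=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In a framework with scaled dot-product attention, a boolean matrix like this would typically be passed as the attention mask in place of the plain lower-triangular causal mask.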

Scene-agnostic Pose Regression for Visual Localization

Junwei Zheng,Ruiping Liu,Yufan Chen,Zhenfang Chen,Kailun Yang,Jiaming Zhang,Rainer Stiefelhagen

Task: Propose a new task, Scene-agnostic Pose Regression (SPR), to overcome the limitations of Absolute Pose Regression (APR) and Relative Pose Regression (RPR).

Motivation: APR lacks adaptability to unknown environments, RPR requires a large image retrieval database, and Visual Odometry (VO) suffers from severe accumulated error in open trajectories.

Details

Method: Proposes the SPR-Mamba model, which addresses the SPR task in a dual-branch manner, and creates the large-scale 360SPR dataset. Result: In unknown scenes of the 360SPR and 360Loc datasets, the SPR method outperforms APR, RPR, and VO. Conclusion: The SPR paradigm, dataset, and model perform well on pose regression without retraining or reliance on a database. Abstract: Absolute Pose Regression (APR) predicts 6D camera poses but lacks the adaptability to unknown environments without retraining, while Relative Pose Regression (RPR) generalizes better yet requires a large image retrieval database. Visual Odometry (VO) generalizes well in unseen environments but suffers from accumulated error in open trajectories. To address this dilemma, we introduce a new task, Scene-agnostic Pose Regression (SPR), which can achieve accurate pose regression in a flexible way while eliminating the need for retraining or databases. To benchmark SPR, we created a large-scale dataset, 360SPR, with over 200K photorealistic panoramas, 3.6M pinhole images and camera poses in 270 scenes at three different sensor heights. Furthermore, an SPR-Mamba model is initially proposed to address SPR in a dual-branch manner. Extensive experiments and studies demonstrate the effectiveness of our SPR paradigm, dataset, and model. In the unknown scenes of both 360SPR and 360Loc datasets, our method consistently outperforms APR, RPR and VO. The dataset and code are available at https://junweizheng93.github.io/publications/SPR/SPR.html.

Tiling artifacts and trade-offs of feature normalization in the segmentation of large biological images

Elena Buglakova,Anwai Archit,Edoardo D'Imprima,Julia Mahamid,Constantin Pape,Anna Kreshuk

Task: Investigate tiling artifacts caused by normalization layers in the segmentation of large images, and propose a solution.

Motivation: In the segmentation of large images (as in microscopy, medical imaging, or remote sensing), sliding window inference often produces tiling artifacts that degrade prediction quality.

Details

Method: Analyzes how normalization layers cause artifacts, proposes indicators to detect the problem, and compares normalization strategies (such as BatchRenorm). Result: BatchRenorm effectively removes tiling artifacts and improves transfer performance. Conclusion: BatchRenorm is the most suitable normalization strategy for the tiling artifact problem and improves the reusability of trained networks. Abstract: Segmentation of very large images is a common problem in microscopy, medical imaging or remote sensing. The problem is usually addressed by sliding window inference, which can theoretically lead to seamlessly stitched predictions. However, in practice many of the popular pipelines still suffer from tiling artifacts. We investigate the root cause of these issues and show that they stem from the normalization layers within the neural networks. We propose indicators to detect normalization issues and further explore the trade-offs between artifact-free and high-quality predictions, using three diverse microscopy datasets as examples. Finally, we propose to use BatchRenorm as the most suitable normalization strategy, which effectively removes tiling artifacts and enhances transfer performance, thereby improving the reusability of trained networks for new datasets.
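
A toy 1D example can show why normalization that depends on per-tile statistics produces seams under sliding window inference. The sketch below is not the paper's indicator and does not implement BatchRenorm. It only contrasts normalizing each tile with its own mean and standard deviation (instance-norm-like behaviour) with normalizing every tile using one fixed set of statistics (as a layer with stored running statistics would at inference). The signal values and tile size are made up for illustration.

```python
from statistics import mean, pstdev


def normalize(values, mu=None, sigma=None):
    """Normalize with the given statistics, or with the values' own if none are given."""
    mu = mean(values) if mu is None else mu
    sigma = pstdev(values) if sigma is None else sigma
    return [(v - mu) / (sigma + 1e-8) for v in values]


# A 1D "image" whose right half is much brighter than its left half.
signal = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
tiles = [signal[:4], signal[4:]]

# Per-tile statistics: each tile is normalized on its own, so the brightness
# difference between the two halves disappears and a seam appears at the tile border.
per_tile = [x for tile in tiles for x in normalize(tile)]

# Fixed statistics: the same mapping is applied to every tile, so stitching the
# tiles reproduces the result of processing the whole signal at once.
mu, sigma = mean(signal), pstdev(signal)
fixed = [x for tile in tiles for x in normalize(tile, mu, sigma)]
whole = normalize(signal)
```

With per-tile statistics the first element of both tiles maps to the same value even though the inputs differ by 10, which is the kind of tile-dependent output that shows up as a stitching artifact.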

Practical Fine-Tuning of Autoregressive Models on Limited Handwritten Texts

Jan Kohút,Michal Hradiš

Task: Study how progressive fine-tuning can adapt OCR models to reduce manual annotation effort and improve recognition accuracy.

Motivation: In OCR applications, users typically correct automatic recognition results step by step. This process is an opportunity for progressive model adaptation, but stability and reliability must be ensured.

Details

Method: Progressively fine-tunes state-of-the-art transformer-based models, studies the roles of model components (encoder and decoder), and proposes reliable stopping criteria and confidence-based selection of informative lines. Result: Experiments show a 10% relative CER improvement with only 16 lines, rising to 40% with 256 lines, and confidence-based selection can halve annotation costs. Conclusion: Progressive fine-tuning of transformer models effectively improves OCR performance and substantially reduces manual annotation needs while remaining stable and reliable. Abstract: A common use case for OCR applications involves users uploading documents and progressively correcting automatic recognition to obtain the final transcript. This correction phase presents an opportunity for progressive adaptation of the OCR model, making it crucial to adapt early, while ensuring stability and reliability. We demonstrate that state-of-the-art transformer-based models can effectively support this adaptation, gradually reducing the annotator's workload. Our results show that fine-tuning can reliably start with just 16 lines, yielding a 10% relative improvement in CER, and scale up to 40% with 256 lines. We further investigate the impact of model components, clarifying the roles of the encoder and decoder in the fine-tuning process. To guide adaptation, we propose reliable stopping criteria, considering both direct approaches and global trend analysis. Additionally, we show that OCR models can be leveraged to cut annotation costs by half through confidence-based selection of informative lines, achieving the same performance with fewer annotations.
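
The reported gains are relative improvements in character error rate (CER). As a reminder of what that metric means, the sketch below computes CER from the Levenshtein edit distance and then the relative improvement between two CER values. The function names and the example strings are illustrative, and the numbers in the usage lines are made up rather than taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # delete r
                            curr[j - 1] + 1,          # insert h
                            prev[j - 1] + (r != h)))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_improvement(cer_before, cer_after):
    """Fraction by which the CER dropped relative to its starting value."""
    return (cer_before - cer_after) / cer_before


# One wrong character out of four gives a CER of 0.25.
example_cer = cer("abcd", "abxd")
# A drop from 10% CER to 9% CER is a 10% relative improvement.
example_gain = relative_improvement(0.10, 0.09)
```

A relative improvement is measured against the model's own starting error, so a 10% relative gain on an already accurate model corresponds to a small absolute change in CER.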

Dance Like a Chicken: Low-Rank Stylization for Human Motion Diffusion

Haim Sawdayee,Chuan Guo,Guy Tevet,Bing Zhou,Jian Wang,Amit H. Bermano

Task: Propose LoRA-MDM, a lightweight framework that generalizes motion stylization to complex actions while maintaining editability.

Motivation: With scarce style-specific data, existing methods produce low-quality, out-of-distribution results and need improvement.

Details

Method: Adapts the generative prior through low-rank adaptation so that it includes the reference style, learning the style from only a few samples. Result: LoRA-MDM strikes a favorable balance between text fidelity and style consistency, and supports style blending and motion editing. Conclusion: By adapting the generative prior, LoRA-MDM achieves stylization of complex actions while preserving the distribution structure and therefore editability. Abstract: Text-to-motion generative models span a wide range of 3D human actions but struggle with nuanced stylistic attributes such as a "Chicken" style. Due to the scarcity of style-specific data, existing approaches pull the generative prior towards a reference style, which often results in out-of-distribution, low-quality generations. In this work, we introduce LoRA-MDM, a lightweight framework for motion stylization that generalizes to complex actions while maintaining editability. Our key insight is that adapting the generative prior to include the style, while preserving its overall distribution, is more effective than modifying each individual motion during generation. Building on this idea, LoRA-MDM learns to adapt the prior to include the reference style using only a few samples. The style can then be used in the context of different textual prompts for generation. The low-rank adaptation shifts the motion manifold in a semantically meaningful way, enabling realistic style infusion even for actions not present in the reference samples. Moreover, preserving the distribution structure enables advanced operations such as style blending and motion editing. We compare LoRA-MDM to state-of-the-art stylized motion generation methods and demonstrate a favorable balance between text fidelity and style consistency.
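
The low-rank adaptation mentioned above follows the general LoRA idea: a frozen weight matrix W is adjusted by a trainable product B A of two thin matrices. The sketch below shows only that generic update on a tiny hand-written matrix. It is not the LoRA-MDM code, and the sizes, values, and helper names are chosen purely for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A), where B is (out x r) and A is (r x in)."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Frozen 3x3 weight (identity here) and a rank-1 update: 6 trainable values instead of 9.
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [[1.0], [0.0], [2.0]]   # 3 x 1
a = [[0.5, 0.0, -0.5]]      # 1 x 3

adapted = lora_weight(w, b, a)
unchanged = lora_weight(w, b, a, scale=0.0)  # scale 0 recovers the original prior
```

Because the update is a separate additive term, it can be scaled or dropped without touching the frozen weights, which fits the abstract's point about preserving the prior's overall distribution.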

Improved tissue sodium concentration quantification in breast cancer by reducing partial volume effects: a preliminary study

Olgica Zaric,Carmen Leser,Vladimir Juras,Alex Farr,Malina Gologan,Stanislas Rapacchi,Laura Villazan Garcia,Christian Singer,Siegfried Trattnig,Christian Licht,Ramona Woitek

Task: Investigate the feasibility of compressed sensing (CS) methods for improving 23Na-MRI image quality and the accuracy of tissue sodium concentration (TSC) quantification in patients with breast cancer.

Motivation: Partial volume effects (PVE) are a major source of error in TSC quantification with 23Na-MRI, and compressed sensing algorithms may reduce PVE.

Details

Method: Reconstructs 23Na-MRI images with weighted total variation (wTV), directional total variation (dTV), anatomically guided total variation (AG-TV), and adaptive combine (ADC), then performs image quality assessment and TSC quantification. Result: All methods produced good-quality images, and TSC values from dTV differed significantly from those of the other methods. Conclusion: The type of image reconstruction and the parameters used affect tumor appearance and TSC estimates, most likely because the methods differ in how robustly they reduce PVE. Abstract: Introduction: In sodium (23Na) MRI, partial volume effects (PVE) are one of the most common causes of errors in the quantification of tissue sodium concentration (TSC) in vivo. Advanced image reconstruction algorithms, such as compressed sensing (CS), have been shown to potentially reduce PVE. Therefore, we investigated the feasibility of CS-based methods for image quality and TSC quantification accuracy improvement in patients with breast cancer (BC). Subjects and Methods: Three healthy participants and 12 female participants with BC were examined on a 7T MRI scanner in this study. We reconstructed 23Na-MRI images using the weighted total variation (wTV), directional total variation (dTV), anatomically guided total variation (AG-TV), and adaptive combine (ADC) reconstruction and performed image quality assessment. We evaluated agreement in tumor volumes delineated on sodium data using the Dice score and performed TSC quantification for different image reconstruction approaches. Results: All methods provided sodium images of the breast with good quality. The mean Dice scores for wTV, dTV, and AG-TV were 65%, 72%, and 75%, respectively. In the breast tumors, average TSC values were 83.0, 72.0, 80.0, and 84.0 mmol/L, respectively. There was a significant difference between dTV and wTV (p<0.001), as well as between dTV and AG-TV (p<0.001) and dTV and ADC algorithm (p<0.001). Conclusion: The results of this study showed that there are differences in tumor appearance and TSC estimations that might depend on the type of image reconstruction and parameters used, most likely due to differences in their robustness in reducing PVE.
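
Agreement between tumor delineations is reported above as a Dice score. For reference, the sketch below computes the Dice coefficient for two binary masks represented as sets of voxel coordinates. The masks are toy data and the set-based representation is an illustrative choice, since real masks would normally be arrays.

```python
def dice_score(mask_a, mask_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two collections of voxel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(a & b) / (len(a) + len(b))


# Two toy delineations that share two of their three voxels each.
tumor_a = {(0, 0), (0, 1), (1, 1)}
tumor_b = {(0, 1), (1, 1), (2, 2)}
overlap = dice_score(tumor_a, tumor_b)
```

A Dice score of 1 means identical masks and 0 means no overlap, so the reported values of 65% to 75% indicate substantial but incomplete agreement between delineations.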

Video Anomaly Detection with Contours - A Study

Mia Siemon,Ivan Nikolov,Thomas B. Moeslund,Kamal Nasrollahi

Task: Study pose-based video anomaly detection that uses 2D contours to learn the recurrent motion patterns of normal human behavior.

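Motivation: Prior work assumes that abnormal events mostly stem from uncommon human behavior, but relying solely on skeleton representations limits the object categories that can be covered.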

Details

Method: Formulate the problem as a regression task and a classification task, explore two contour data representation techniques, and use shallow neural networks to reduce computational complexity. Result: Experiments on six benchmark datasets indicate that the approach is a promising new direction for pose-based video anomaly detection. Conclusion: Replacing skeleton representations with 2D contours leaves room for future research to cover more object categories. Abstract: In Pose-based Video Anomaly Detection, prior art is rooted in the assumption that abnormal events can be mostly regarded as a result of uncommon human behavior. Opposed to utilizing skeleton representations of humans, however, we investigate the potential of learning recurrent motion patterns of normal human behavior using 2D contours. Keeping all advantages of pose-based methods, such as increased object anonymization, the shift from human skeletons to contours is hypothesized to leave the opportunity to cover more object categories open for future research. We propose formulating the problem as a regression and a classification task, and additionally explore two distinct data representation techniques for contours. To further reduce the computational complexity of Pose-based Video Anomaly Detection solutions, all methods in this study are based on shallow Neural Networks from the field of Deep Learning, and evaluated on the three most prominent benchmark datasets within Video Anomaly Detection and their human-related counterparts, totaling six datasets. Our results indicate that this novel perspective on Pose-based Video Anomaly Detection marks a promising direction for future research.

SACB-Net: Spatial-awareness Convolutions for Medical Image Registration

Xinxing Cheng,Tianyang Zhang,Wenqi Lu,Qingjie Meng,Alejandro F. Frangi,Jinming Duan

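Task: Propose a 3D Spatial-Awareness Convolution Block (SACB) that enhances spatial information in feature representations, and build on it a pyramid flow estimator (SACB-Net) for multi-scale flow composition, particularly for large deformations.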

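Motivation: Existing deep learning methods rely on spatially shared convolution kernels and therefore struggle to capture spatially varying information in non-local regions of feature maps, which leads to suboptimal deformation field estimates.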

Details

Method: Estimate spatial clusters in feature maps from feature similarity and parameterize adaptive convolution kernels for the different regions, generating kernels (weights and biases) tailored to spatial variations. Result: Experiments on the brain IXI and LPBA datasets and on abdomen CT datasets show that SACB and SACB-Net outperform existing learning-based registration methods. Conclusion: SACB and SACB-Net capture spatially varying information effectively and improve registration performance. Abstract: Deep learning-based image registration methods have shown state-of-the-art performance and rapid inference speeds. Despite these advances, many existing approaches fall short in capturing spatially varying information in non-local regions of feature maps due to the reliance on spatially-shared convolution kernels. This limitation leads to suboptimal estimation of deformation fields. In this paper, we propose a 3D Spatial-Awareness Convolution Block (SACB) to enhance the spatial information within feature representations. Our SACB estimates the spatial clusters within feature maps by leveraging feature similarity and subsequently parameterizes the adaptive convolution kernels across diverse regions. This adaptive mechanism generates the convolution kernels (weights and biases) tailored to spatial variations, thereby enabling the network to effectively capture spatially varying information. Building on SACB, we introduce a pyramid flow estimator (named SACB-Net) that integrates SACBs to facilitate multi-scale flow composition, particularly addressing large deformations. Experimental results on the brain IXI and LPBA datasets as well as Abdomen CT datasets demonstrate the effectiveness of SACB and the superiority of SACB-Net over the state-of-the-art learning-based registration methods. The code is available at https://github.com/x-xc/SACB_Net.

Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation

Hongcheng Gao,Jiashu Qu,Jingyi Tang,Baolong Bi,Yue Liu,Hongyu Chen,Li Liang,Li Su,Qingming Huang

Task: Study the hallucination problem of large multimodal models (LMMs) in the video modality and propose a mitigation.

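Motivation: The dynamic and complex nature of video makes LMM hallucination (responses that appear correct but are actually incorrect) more severe, which limits reliability and applicability.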

Details

Method: Present HAVEN, a benchmark that evaluates LMM hallucinations along three dimensions (hallucination causes, hallucination aspects, and question formats), and study 7 influential factors experimentally; additionally, propose a video-thinking model that combines supervised reasoning fine-tuning (SRFT) and direct preference optimization (TDPO) to reduce hallucinations. Result: The method improves the baseline by 7.65% in accuracy on hallucination evaluation and reduces the bias score by 4.5%. Conclusion: The video-thinking model mitigates LMM hallucinations in the video modality, and HAVEN provides a benchmark for evaluating them. Abstract: The hallucination of large multimodal models (LMMs), providing responses that appear correct but are actually incorrect, limits their reliability and applicability. This paper aims to study the hallucination problem of LMMs in video modality, which is dynamic and more challenging compared to static modalities like images and text. From this motivation, we first present a comprehensive benchmark termed HAVEN for evaluating hallucinations of LMMs in video understanding tasks. It is built upon three dimensions, i.e., hallucination causes, hallucination aspects, and question formats, resulting in 6K questions. Then, we quantitatively study 7 influential factors on hallucinations, e.g., duration time of videos, model sizes, and model reasoning, via experiments of 16 LMMs on the presented benchmark. In addition, inspired by recent thinking models like OpenAI o1, we propose a video-thinking model to mitigate the hallucinations of LMMs via supervised reasoning fine-tuning (SRFT) and direct preference optimization (TDPO), where SRFT enhances reasoning capabilities while TDPO reduces hallucinations in the thinking process. Extensive experiments and analyses demonstrate the effectiveness. Remarkably, it improves the baseline by 7.65% in accuracy on hallucination evaluation and reduces the bias score by 4.5%. The code and data are public at https://github.com/Hongcheng-Gao/HAVEN.

DynOPETs: A Versatile Benchmark for Dynamic Object Pose Estimation and Tracking in Moving Camera Scenarios

Xiangting Meng,Jiaqi Yang,Mingshu Chen,Chenxin Yan,Yujiao Shi,Wenchao Ding,Laurent Kneip

Task: Present and validate DynOPETs, a new dataset for object pose estimation in scenarios with dynamic objects and moving cameras.

Motivation: Address the scarcity of annotated object pose data for dynamic scenes and support the development of robust pose estimation models.

Details

Method: Develop a data annotation pipeline that combines pose estimation and pose tracking to generate pseudo-labels, which are then refined through pose graph optimization. Result: The resulting dataset provides accurate pose annotations for dynamic objects observed from moving cameras and is validated with 18 state-of-the-art methods. Conclusion: DynOPETs fills a gap in the field, is expected to accelerate related research, and will be made publicly available. Abstract: In the realm of object pose estimation, scenarios involving both dynamic objects and moving cameras are prevalent. However, the scarcity of corresponding real-world datasets significantly hinders the development and evaluation of robust pose estimation models. This is largely attributed to the inherent challenges in accurately annotating object poses in dynamic scenes captured by moving cameras. To bridge this gap, this paper presents a novel dataset DynOPETs and a dedicated data acquisition and annotation pipeline tailored for object pose estimation and tracking in such unconstrained environments. Our efficient annotation method innovatively integrates pose estimation and pose tracking techniques to generate pseudo-labels, which are subsequently refined through pose graph optimization. The resulting dataset offers accurate pose annotations for dynamic objects observed from moving cameras. To validate the effectiveness and value of our dataset, we perform comprehensive evaluations using 18 state-of-the-art methods, demonstrating its potential to accelerate research in this challenging domain. The dataset will be made publicly available to facilitate further exploration and advancement in the field.

Burst Image Super-Resolution with Mamba

Ozan Unal,Steven Marty,Dengxin Dai

Task: Enhance the resolution of a keyframe using a burst of multiple low-resolution frames.

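Motivation: Existing transformer-based methods are effective but suffer from the quadratic complexity of self-attention; Mamba, with its linear complexity, is a potential alternative.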

Details

Method: Propose BurstMamba, an architecture split into a spatial module (keyframe super-resolution) and a temporal module (subpixel prior extraction), combined with optical flow-based serialization and a wavelet-based reparameterization. Result: Achieves SOTA performance on the SyntheticSR, RealBSR-RGB, and RealBSR-RAW benchmarks. Conclusion: BurstMamba balances computational efficiency against burst information integration and offers a new direction for BISR. Abstract: Burst image super-resolution (BISR) aims to enhance the resolution of a keyframe by leveraging information from multiple low-resolution images captured in quick succession. In the deep learning era, BISR methods have evolved from fully convolutional networks to transformer-based architectures, which, despite their effectiveness, suffer from the quadratic complexity of self-attention. We see Mamba as the next natural step in the evolution of this field, offering a comparable global receptive field and selective information routing with only linear time complexity. In this work, we introduce BurstMamba, a Mamba-based architecture for BISR. Our approach decouples the task into two specialized branches: a spatial module for keyframe super-resolution and a temporal module for subpixel prior extraction, striking a balance between computational efficiency and burst information integration. To further enhance burst processing with Mamba, we propose two novel strategies: (i) optical flow-based serialization, which aligns burst sequences only during state updates to preserve subpixel details, and (ii) a wavelet-based reparameterization of the state-space update rules, prioritizing high-frequency features for improved burst-to-keyframe information passing. Our framework achieves SOTA performance on public benchmarks of SyntheticSR, RealBSR-RGB, and RealBSR-RAW.

Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation

Niccolo Avogaro,Thomas Frick,Mattia Rigotti,Andrea Bartezzaghi,Filip Janicki,Cristiano Malossi,Konrad Schindler,Roy Assaf

Task: Study how to effectively prompt large Vision-Language Models (VLMs) for semantic segmentation.

Motivation: Examine how VLMs perform on semantic segmentation and analyze the complementarity of text prompts and visual prompts.

Details

Method: Systematically evaluate several recent models on the MESS dataset collection and introduce a scalable prompting scheme, few-shot prompted semantic segmentation. Result: VLMs lag specialist models trained for a specific segmentation task by about 30% IoU on average, but text and visual prompts are complementary; PromptMatcher, which combines both, outperforms the best text-prompted VLM by 2.5% and the top visual-prompted VLM by 3.5%. Conclusion: By combining text and visual prompts, PromptMatcher achieves the best performance on few-shot prompted semantic segmentation. Abstract: Large Vision-Language Models (VLMs) are increasingly being regarded as foundation models that can be instructed to solve diverse tasks by prompting, without task-specific training. We examine the seemingly obvious question: how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models guided by either text or visual prompts on the out-of-distribution MESS dataset collection. We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning. It turns out that VLMs lag far behind specialist models trained for a specific segmentation task, by about 30% on average on the Intersection-over-Union metric. Moreover, we find that text prompts and visual prompts are complementary: each one of the two modes fails on many examples that the other one can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to an 11% improvement in performance. Motivated by our findings, we propose PromptMatcher, a remarkably simple training-free baseline that combines both text and visual prompts, achieving state-of-the-art results outperforming the best text-prompted VLM by 2.5%, and the top visual-prompted VLM by 3.5% on few-shot prompted semantic segmentation.
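
The comparison above is stated in terms of the Intersection-over-Union metric. The following is a minimal sketch of a per-class IoU averaged over classes, assuming flat lists of integer class labels per pixel; the function name `mean_iou` and the choice to skip classes absent from both prediction and target are assumptions for illustration, not the paper's evaluation protocol.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes present in pred or target.

    pred and target are equal-length sequences of integer class labels,
    one label per pixel (a flattened segmentation map).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 4-pixel example with two classes.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```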

OpenSDI: Spotting Diffusion-Generated Images in the Open World

Yabin Wang,Zhiwu Huang,Xiaopeng Hong

Task: Pose the OpenSDI challenge and build the OpenSDID dataset for detecting and localizing diffusion-generated images in the open world.

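Motivation: Existing datasets are inadequate for detecting and localizing diffusion-generated images in open-world settings, so a more comprehensive benchmark is needed.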

Details

Method: Propose the Synergizing Pretrained Models (SPM) scheme, which combines multiple pretrained foundation models, and introduce the MaskCLIP model. Result: On OpenSDID, MaskCLIP significantly outperforms existing methods, with relative improvements of 14.23% in IoU on localization and 2.05% in accuracy on detection. Conclusion: The OpenSDID dataset and the SPM scheme provide an effective solution for detecting and localizing diffusion-generated images in the open world. Abstract: This paper identifies OpenSDI, a challenge for spotting diffusion-generated images in open-world settings. In response to this challenge, we define a new benchmark, the OpenSDI dataset (OpenSDID), which stands out from existing datasets due to its diverse use of large vision-language models that simulate open-world diffusion-based manipulations. Another outstanding feature of OpenSDID is its inclusion of both detection and localization tasks for images manipulated globally and locally by diffusion models. To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. This approach exploits a collaboration mechanism with multiple pretrained foundation models to enhance generalization in the OpenSDI context, moving beyond traditional training by synergizing multiple pretrained models through prompting and attending strategies. Building on this scheme, we introduce MaskCLIP, an SPM-based model that aligns Contrastive Language-Image Pre-Training (CLIP) with Masked Autoencoder (MAE). Extensive evaluations on OpenSDID show that MaskCLIP significantly outperforms current state-of-the-art methods for the OpenSDI challenge, achieving remarkable relative improvements of 14.23% in IoU (14.11% in F1) and 2.05% in accuracy (2.38% in F1) compared to the second-best model in localization and detection tasks, respectively. Our dataset and code are available at https://github.com/iamwangyabin/OpenSDI.

RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models

Mehdi Moshtaghi,Siavash H. Khajavi,Joni Pajarinen

Task: Evaluate the ability of Vision-Language Models (VLMs) to understand RGB-thermal image pairs.

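Motivation: Existing evaluation is largely limited to RGB benchmarks and does not cover infrared vision tasks, and existing datasets are either task-specific or lack high-quality annotations.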

Details

Method: Introduce RGB-Th-Bench, with 1,600+ expert-annotated Yes/No questions across 14 skill dimensions and two accuracy metrics, one at question level and one at skill level. Result: An evaluation of 19 state-of-the-art VLMs shows weak thermal image understanding, with performance constrained by RGB-based capabilities; the lack of large-scale thermal datasets in pre-training is an important cause. Conclusion: RGB-Th-Bench exposes the gap between visible and thermal image understanding in multimodal learning and calls for further research. Abstract: We introduce RGB-Th-Bench, the first benchmark designed to evaluate the ability of Vision-Language Models (VLMs) to comprehend RGB-Thermal image pairs. While VLMs have demonstrated remarkable progress in visual reasoning and multimodal understanding, their evaluation has been predominantly limited to RGB-based benchmarks, leaving a critical gap in assessing their capabilities in infrared vision tasks. Existing visible-infrared datasets are either task-specific or lack high-quality annotations necessary for rigorous model evaluation. To address these limitations, RGB-Th-Bench provides a comprehensive evaluation framework covering 14 distinct skill dimensions, with a total of 1,600+ expert-annotated Yes/No questions. The benchmark employs two accuracy metrics: a standard question-level accuracy and a stricter skill-level accuracy, which evaluates model robustness across multiple questions within each skill dimension. This design ensures a thorough assessment of model performance, including resilience to adversarial and hallucinated responses. We conduct extensive evaluations on 19 state-of-the-art VLMs, revealing significant performance gaps in RGB-Thermal understanding. Our results show that even the strongest models struggle with thermal image comprehension, with performance heavily constrained by their RGB-based capabilities. Additionally, the lack of large-scale application-specific and expert-annotated thermal-caption-pair datasets in pre-training is an important reason for the observed performance gap. RGB-Th-Bench highlights the urgent need for further advancements in multimodal learning to bridge the gap between visible and thermal image understanding. The dataset is available through this link, and the evaluation code will also be made publicly available.
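
The abstract names two metrics but does not define the stricter one precisely. The sketch below illustrates one plausible reading, under the loud assumption that a skill group counts as correct only when every question in the group is answered correctly; the grouping key, the function name, and this all-or-nothing rule are hypothetical and may differ from the benchmark's actual definition.

```python
from collections import defaultdict


def question_and_skill_accuracy(results):
    """Return (question-level accuracy, skill-level accuracy).

    results: list of (group_key, is_correct) pairs, where group_key
    identifies the set of questions that must all be right together.
    ASSUMPTION: a group is correct only if all its questions are correct.
    """
    if not results:
        raise ValueError("results must not be empty")
    per_group = defaultdict(list)
    for group_key, is_correct in results:
        per_group[group_key].append(bool(is_correct))
    question_acc = sum(1 for _, ok in results if ok) / len(results)
    skill_acc = sum(1 for answers in per_group.values() if all(answers)) / len(per_group)
    return question_acc, skill_acc


# Hypothetical answers: one mistake in skill "a" sinks the whole group.
demo = [("a", True), ("a", False), ("b", True), ("b", True)]
print(question_and_skill_accuracy(demo))
```

Under this reading the skill-level score can never exceed the question-level score by rewarding lucky guesses, which is consistent with the abstract's description of it as the stricter metric.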

BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction

Jan Kohút,Martin Dočekal,Michal Hradiš,Marek Vaško

Task: Introduce and evaluate BiblioPage, a dataset for extracting structured bibliographic metadata from scanned title pages.

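Motivation: Manual digitization of bibliographic metadata is time-consuming and labor-intensive, and the absence of dedicated datasets hinders automation.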

Details

Method: Collect and annotate approximately 2,000 title pages, then evaluate object detection models (such as YOLO and DETR) and visual large language models (such as Llama 3.2-Vision and GPT-4o). Result: Object detection models combined with transformer-based OCR reach a maximum mAP of 52 and an F1 score of 59; the best visual large language model reaches an F1 score of 67. Conclusion: BiblioPage provides a real-world benchmark for bibliographic metadata extraction and contributes to document understanding and information extraction. Abstract: Manual digitization of bibliographic metadata is time consuming and labor intensive, especially for historical and real-world archives with highly variable formatting across documents. Despite advances in machine learning, the absence of dedicated datasets for metadata extraction hinders automation. To address this gap, we introduce BiblioPage, a dataset of scanned title pages annotated with structured bibliographic metadata. The dataset consists of approximately 2,000 monograph title pages collected from 14 Czech libraries, spanning a wide range of publication periods, typographic styles, and layout structures. Each title page is annotated with 16 bibliographic attributes, including title, contributors, and publication metadata, along with precise positional information in the form of bounding boxes. To extract structured information from this dataset, we evaluated object detection models such as YOLO and DETR combined with transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59. Additionally, we assess the performance of various visual large language models, including Llama 3.2-Vision and GPT-4o, with the best model reaching an F1 score of 67. BiblioPage serves as a real-world benchmark for bibliographic metadata extraction, contributing to document understanding, document question answering, and document information extraction. Dataset and evaluation scripts are available at: https://github.com/DCGM/biblio-dataset

CoSimGen: Controllable Diffusion Model for Simultaneous Image and Mask Generation

Rupak Bose,Chinedu Innocent Nwoye,Aditya Bhat,Nicolas Padoy

Task: Propose CoSimGen, a diffusion-based framework for controllable simultaneous generation of images and segmentation masks.

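Motivation: Address the inability of existing generative models to produce high-quality images and masks simultaneously, and improve control over the generated outputs.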

Details

Method: Condition generation on text prompts, spatial embeddings, and spectral embeddings, and combine a contrastive triplet loss with diffusion and adversarial losses to enhance controllability and training efficiency. Result: State-of-the-art performance across 4 datasets, with the lowest KID of 0.11 and LPIPS of 0.53. Conclusion: CoSimGen performs well at simultaneous image and mask generation, with high controllability and high-quality outputs. Abstract: The acquisition of annotated datasets with paired images and segmentation masks is a critical challenge in domains such as medical imaging, remote sensing, and computer vision. Manual annotation demands significant resources, faces ethical constraints, and depends heavily on domain expertise. Existing generative models often target single-modality outputs, either images or segmentation masks, failing to address the need for high-quality, simultaneous image-mask generation. Additionally, these models frequently lack adaptable conditioning mechanisms, restricting control over the generated outputs and limiting their applicability for dataset augmentation and rare scenario simulation. We propose CoSimGen, a diffusion-based framework for controllable simultaneous image and mask generation. Conditioning is intuitively achieved through (1) text prompts grounded in class semantics, (2) spatial embedding of context prompts to provide spatial coherence, and (3) spectral embedding of timestep information to model noise levels during diffusion. To enhance controllability and training efficiency, the framework incorporates contrastive triplet loss between text and class embeddings, alongside diffusion and adversarial losses. Initial low-resolution outputs (128 x 128) are super-resolved to 512 x 512, producing high-fidelity images and masks with strict adherence to conditions. We evaluate CoSimGen on metrics such as FID, KID, LPIPS, Class FID, and positive predictive value for image fidelity and semantic alignment of generated samples over 4 diverse datasets. CoSimGen achieves state-of-the-art performance across all datasets, achieving the lowest KID of 0.11 and LPIPS of 0.53 across datasets.
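
The abstract mentions a contrastive triplet loss between text and class embeddings without giving its form. As a minimal sketch, the standard margin-based triplet loss is shown below; treating the text embedding as the anchor, using Euclidean distance, and the margin value are all assumptions for illustration rather than details confirmed by the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    ASSUMPTION: anchor would be a text embedding, positive the matching
    class embedding, negative a non-matching class embedding.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings: the positive is already much closer than
# the negative, so the loss is zero.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

The loss is zero once the matching embedding is closer to the anchor than the non-matching one by at least the margin, so training pressure applies only to triplets that are still confused.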

fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models

Saurav Sharma,Didier Mutter,Nicolas Padoy

Task: Propose fine-CLIP to improve zero-shot surgical action triplet recognition.

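Motivation: Existing vision-language models such as CLIP perform poorly on fine-grained surgical activities such as action triplets, because they rely on global image features that overlook fine-grained semantics and contextual details, and they do not exploit the hierarchical structure of triplets.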

Details

Method: fine-CLIP learns object-centric features and exploits the triplet hierarchy by combining hierarchical prompt modeling, LoRA-based vision backbone adaptation, and a graph-based condensation strategy. Result: On the CholecT50 dataset, fine-CLIP shows significant improvements in F1 and mAP, particularly in recognizing unseen triplets. Conclusion: By modeling fine-grained features and hierarchical structure, fine-CLIP improves zero-shot surgical triplet recognition. Abstract: While vision-language models like CLIP have advanced zero-shot surgical phase recognition, they struggle with fine-grained surgical activities, especially action triplets. This limitation arises because current CLIP formulations rely on global image features, which overlook the fine-grained semantics and contextual details crucial for complex tasks like zero-shot triplet recognition. Furthermore, these models do not explore the hierarchical structure inherent in triplets, reducing their ability to generalize to novel triplets. To address these challenges, we propose fine-CLIP, which learns object-centric features and leverages the hierarchy in triplet formulation. Our approach integrates three components: hierarchical prompt modeling to capture shared semantics, LoRA-based vision backbone adaptation for enhanced feature extraction, and a graph-based condensation strategy that groups similar patch features into meaningful object clusters. Since triplet classification is a challenging task, we introduce an alternative yet meaningful base-to-novel generalization benchmark with two settings on the CholecT50 dataset: Unseen-Target, assessing adaptability to triplets with novel anatomical structures, and Unseen-Instrument-Verb, where models need to generalize to novel instrument-verb interactions. fine-CLIP shows significant improvements in F1 and mAP, enhancing zero-shot recognition of novel surgical triplets.

Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection

Andrii Yermakov,Jan Cech,Jiri Matas

Task: Detect partially manipulated facial deepfakes, which make subtle alterations to specific facial features while retaining the overall context.

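Motivation: Partially manipulated facial deepfakes are harder to detect than fully synthetic faces, which calls for a generalizable and robust detection method.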

Details

Method: Use the ViT-L/14 visual encoder of CLIP with parameter-efficient fine-tuning (PEFT) techniques such as LN-tuning, together with a tailored preprocessing pipeline and regularization strategies (L2 normalization and metric learning on a hyperspherical manifold). Result: Trained on FaceForensics++ and evaluated cross-dataset on Celeb-DF-v2, DFDC, FFIW, and others, the method achieves detection accuracy comparable to or better than much more complex state-of-the-art techniques. Conclusion: The work demonstrates the efficacy of CLIP's visual encoder for facial deepfake detection and establishes a simple, powerful baseline for future research. Abstract: This paper tackles the challenge of detecting partially manipulated facial deepfakes, which involve subtle alterations to specific facial features while retaining the overall context, posing a greater detection difficulty than fully synthetic faces. We leverage the Contrastive Language-Image Pre-training (CLIP) model, specifically its ViT-L/14 visual encoder, to develop a generalizable detection method that performs robustly across diverse datasets and unknown forgery techniques with minimal modifications to the original model. The proposed approach utilizes parameter-efficient fine-tuning (PEFT) techniques, such as LN-tuning, to adjust a small subset of the model's parameters, preserving CLIP's pre-trained knowledge and reducing overfitting. A tailored preprocessing pipeline optimizes the method for facial images, while regularization strategies, including L2 normalization and metric learning on a hyperspherical manifold, enhance generalization. Trained on the FaceForensics++ dataset and evaluated in a cross-dataset fashion on Celeb-DF-v2, DFDC, FFIW, and others, the proposed method achieves competitive detection accuracy comparable to or outperforming much more complex state-of-the-art techniques. This work highlights the efficacy of CLIP's visual encoder in facial deepfake detection and establishes a simple, powerful baseline for future research, advancing the field of generalizable deepfake detection. The code is available at: https://github.com/yermandy/deepfake-detection
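
The regularization described above places features on a hypersphere via L2 normalization. A minimal sketch of that single step follows, assuming plain Python lists as feature vectors; the function names and the epsilon guard are illustrative choices, not taken from the released code.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (projection onto the unit sphere)."""
    norm = math.sqrt(sum(x * x for x in vector))
    scale = max(norm, eps)  # eps avoids division by zero for an all-zero vector
    return [x / scale for x in vector]


def cosine_similarity(a, b):
    """Dot product of the L2-normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Hypothetical 2-D features.
print(l2_normalize([3.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

After normalization only the direction of a feature matters, so distances between samples reduce to angles, which is the setting in which metric learning on a hypersphere operates.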

Optimization of MedSAM model based on bounding box adaptive perturbation algorithm

Boyi Li,Ye Yuan,Wenjun Tan

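Task: Propose a bounding box adaptive perturbation algorithm that optimizes the training process of the MedSAM model to address its limitations in medical image segmentation.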

Motivation: MedSAM makes segmentation errors on small targets and complex structures in medical images, and performs poorly when the bounding box prompt is reduced.

Details

Method: Propose a bounding box adaptive perturbation algorithm to optimize the training process. Result: The approach aims to reduce segmentation errors for small targets and to improve accuracy under reduced bounding box prompts. Conclusion: The method is intended to improve the robustness and reliability of MedSAM on complex medical imaging tasks. Abstract: The MedSAM model, built upon the SAM framework, enhances medical image segmentation through generalizable training but still exhibits notable limitations. First, constraints in the perturbation window settings during training can cause MedSAM to incorrectly segment small tissues or organs together with adjacent structures, leading to segmentation errors. Second, when dealing with medical image targets characterized by irregular shapes and complex structures, segmentation often relies on narrowing the bounding box to refine segmentation intent. However, MedSAM's performance under reduced bounding box prompts remains suboptimal. To address these challenges, this study proposes a bounding box adaptive perturbation algorithm to optimize the training process. The proposed approach aims to reduce segmentation errors for small targets and enhance the model's accuracy when processing reduced bounding box prompts, ultimately improving the robustness and reliability of the MedSAM model for complex medical imaging tasks.
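
The abstract does not spell out the adaptive perturbation rule. Purely as an illustration of the general idea, the sketch below jitters a box prompt by an amount capped relative to the box's shorter side, so small targets receive small perturbations; the scaling rule, the default values, and the function name `perturb_box` are hypothetical and should not be read as the authors' algorithm.

```python
import random


def perturb_box(box, image_size, max_shift=20, ratio=0.1, rng=random):
    """Jitter an (x0, y0, x1, y1) box prompt during training.

    HYPOTHETICAL adaptive rule: the per-coordinate shift is capped at
    `ratio` times the shorter box side, and never exceeds `max_shift`.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    shift = min(max_shift, int(ratio * min(x1 - x0, y1 - y0)))

    def jitter(value, upper):
        # Random integer offset in [-shift, shift], clamped to the image.
        return max(0, min(upper, value + rng.randint(-shift, shift)))

    return (jitter(x0, width), jitter(y0, height), jitter(x1, width), jitter(y1, height))


# A 4-pixel box gets no perturbation; a 60-pixel box moves by at most 6.
print(perturb_box((10, 10, 14, 14), (100, 100)))
print(perturb_box((20, 20, 80, 80), (100, 100)))
```

Because the shift is at most one tenth of the shorter side, the perturbed corners can never cross each other, and a fixed-size window no longer swamps a very small target.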

High-Quality Spatial Reconstruction and Orthoimage Generation Using Efficient 2D Gaussian Splatting

Qian Wang,Zhihao Zhan,Jialei He,Zhituo Tu,Xiang Zhu,Jie Yuan

Task: Propose a method based on 2D Gaussian Splatting (2DGS) for generating True Digital Orthophoto Maps (TDOMs) without an explicit Digital Surface Model (DSM) or occlusion detection.

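Motivation: Traditional TDOM generation relies on DSM construction and occlusion detection, which are complex, computationally expensive, and prone to errors.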

Details

Method: Use depth map generation to recover spatial information for each pixel, and apply a divide-and-conquer strategy for efficient Gaussian Splatting training and rendering. Result: Experiments demonstrate efficient large-scale scene reconstruction and high-precision terrain modeling. Conclusion: The method provides accurate spatial data for map-based applications, supporting better planning and decision-making. Abstract: Highly accurate geometric precision and dense image features characterize True Digital Orthophoto Maps (TDOMs), which are in great demand for applications such as urban planning, infrastructure management, and environmental monitoring. Traditional TDOM generation methods need sophisticated processes, such as Digital Surface Models (DSM) and occlusion detection, which are computationally expensive and prone to errors. This work presents an alternative technique rooted in 2D Gaussian Splatting (2DGS), free of explicit DSM and occlusion detection. With depth map generation, spatial information for every pixel within the TDOM is retrieved, so the scene can be reconstructed with high precision. A divide-and-conquer strategy achieves excellent GS training and rendering of high-resolution TDOMs at a lower resource cost, which preserves higher rendering quality on complex terrain and thin structures without a decrease in efficiency. Experimental results demonstrate the efficiency of large-scale scene reconstruction and high-precision terrain modeling. This approach provides accurate spatial data, which assists users in better planning and decision-making based on maps.

Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations

Jungin Park,Jiyoung Lee,Kwanghoon Sohn

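Task: Learn fine-grained view-invariant video representations from unpaired egocentric (first-person) and exocentric (third-person) videos.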

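Motivation: Egocentric and exocentric videos differ substantially in perspective, motion patterns, and context, which has left view-invariant representation learning underexplored.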

Details

Method: Propose a masked ego-exo modeling approach, Bootstrap Your Own Views (BYOV), which uses self-view masking and cross-view masking predictions to learn view-invariant and powerful representations concurrently. Result: BYOV significantly surpasses existing approaches on four downstream tasks, with notable gains across all metrics. Conclusion: By capturing the compositional nature of human actions, BYOV enables robust cross-view understanding and offers an effective approach to view-invariant video representation learning. Abstract: View-invariant representation learning from egocentric (first-person, ego) and exocentric (third-person, exo) videos is a promising approach toward generalizing video understanding systems across multiple viewpoints. However, this area has been underexplored due to the substantial differences in perspective, motion patterns, and context between ego and exo views. In this paper, we propose a novel masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment, called Bootstrap Your Own Views (BYOV), for fine-grained view-invariant video representation learning from unpaired ego-exo videos. We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding. Specifically, self-view masking and cross-view masking predictions are designed to learn view-invariant and powerful representations concurrently. Experimental results demonstrate that our BYOV significantly surpasses existing approaches with notable gains across all metrics in four downstream ego-exo video tasks. The code is available at https://github.com/park-jungin/byov.

Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models

Ilias Stogiannidis,Steven McDonagh,Sotirios A. Tsaftaris

Task: Evaluate the performance of Vision-Language Models (VLMs) on spatial reasoning tasks.

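Motivation: Existing VLM benchmarks fail to isolate spatial reasoning from related tasks such as object detection or semantic comprehension, so a more comprehensive approach to understanding spatial reasoning ability is needed.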

Details

Method: Delineate the core elements of spatial reasoning (spatial relations, orientation and navigation, mental rotation, and spatial visualization) and evaluate 13 state-of-the-art VLMs on both synthetic and real-world images. Result: Current VLMs perform poorly on spatial reasoning, with average accuracy approximating random chance, which reveals a profound shortcoming in this area. Conclusion: The study exposes the spatial reasoning deficiencies of VLMs and provides a platform for future research. Abstract: Vision-Language Models (VLMs) have recently emerged as powerful tools, excelling in tasks that integrate visual and textual comprehension, such as image captioning, visual question answering, and image-text retrieval. However, existing benchmarks for VLMs include spatial components, which often fail to isolate spatial reasoning from related tasks such as object detection or semantic comprehension. In this paper, we address these deficiencies with a multi-faceted approach towards understanding spatial reasoning. Informed by the diverse and multi-dimensional nature of human spatial reasoning abilities, we present a detailed analysis that first delineates the core elements of spatial reasoning: spatial relations, orientation and navigation, mental rotation, and spatial visualization, and then assesses the performance of these models in both synthetic and real-world images, bridging controlled and naturalistic contexts. We analyze 13 state-of-the-art Vision-Language Models, uncovering pivotal insights into their spatial reasoning performance. Our results reveal profound shortcomings in current VLMs, with average accuracy across the 13 models approximating random chance, highlighting spatial reasoning as a persistent obstacle. This work not only exposes the pressing need to advance spatial reasoning within VLMs but also establishes a solid platform for future exploration. Code available on GitHub (https://github.com/stogiannidis/srbench) and dataset available on HuggingFace (https://huggingface.co/datasets/stogiannidis/srbench).

EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction

Chengjie Ge,Xueyang Fu,Peng He,Kunyu Wang,Chengzhi Cao,Zheng-Jun Zha

Task: Propose EventMamba, a model designed specifically for event-based video reconstruction (EBVR).

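Motivation: Existing Mamba-based vision models overlook the nuances of event-driven tasks, in particular the importance of spatial translation invariance and local event relationships in video reconstruction.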

Details

Method: EventMamba introduces random window offset (RWO) and a consistent traversal serialization approach to improve local connectivity in the spatial and temporal domains. Result: Tests on multiple datasets show that EventMamba markedly improves both reconstruction quality and computation speed compared with Transformer-based methods. Conclusion: EventMamba retains Mamba's strong modeling capability while significantly preserving the spatio-temporal locality of event data. Abstract: Leveraging its robust linear global modeling capability, Mamba has notably excelled in computer vision. Despite its success, existing Mamba-based vision models have overlooked the nuances of event-driven tasks, especially in video reconstruction. Event-based video reconstruction (EBVR) demands spatial translation invariance and close attention to local event relationships in the spatio-temporal domain. Unfortunately, conventional Mamba algorithms apply static window partitions and standard reshape scanning methods, leading to significant losses in local connectivity. To overcome these limitations, we introduce EventMamba, a specialized model designed for EBVR tasks. EventMamba innovates by incorporating random window offset (RWO) in the spatial domain, moving away from the restrictive fixed partitioning. Additionally, it features a new consistent traversal serialization approach in the spatio-temporal domain, which maintains the proximity of adjacent events both spatially and temporally. These enhancements enable EventMamba to retain Mamba's robust modeling capabilities while significantly preserving the spatio-temporal locality of event data. Comprehensive testing on multiple datasets shows that EventMamba markedly enhances video reconstruction, drastically improving computation speed while delivering superior visual quality compared to Transformer-based methods.
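
The abstract contrasts fixed window partitions with a random window offset in the spatial domain. As a rough sketch of the general idea only (how EventMamba actually samples and applies the offset is not stated in the abstract), the partition grid below is shifted by a random offset before pixels are assigned to windows; the function name and the dictionary output are assumptions for illustration.

```python
import random


def partition_with_offset(height, width, window, offset=None, rng=random):
    """Assign each (row, col) of a height x width grid to a window.

    With offset=(0, 0) this is a fixed partition. A random offset shifts
    the window grid, so window borders fall in different places on each
    call. ASSUMPTION: offsets are drawn uniformly from [0, window).
    """
    if offset is None:
        offset = (rng.randrange(window), rng.randrange(window))
    dy, dx = offset
    windows = {}
    for row in range(height):
        for col in range(width):
            key = ((row + dy) // window, (col + dx) // window)
            windows.setdefault(key, []).append((row, col))
    return offset, windows


# Fixed partition of a 4x4 grid into 2x2 windows, then a shifted one.
_, fixed = partition_with_offset(4, 4, 2, offset=(0, 0))
_, shifted = partition_with_offset(4, 4, 2, offset=(1, 1))
print(len(fixed), len(shifted))
```

Because the borders move between calls, two neighboring pixels that a fixed grid always separates will share a window under some offsets, which is the locality benefit the abstract attributes to moving away from fixed partitioning.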

CamSAM2: Segment Anything Accurately in Camouflaged Videos

Yuli Zhou,Guolei Sun,Yawei Li,Yuqian Fu,Luca Benini,Ender Konukoglu

Task: Propose Camouflaged SAM2 (CamSAM2) to improve SAM2 on video camouflaged object segmentation (VCOS).

Motivation: SAM2 segments camouflaged videos suboptimally, especially when given simple prompts such as points and boxes.

Details

Method: Introduce a decamouflaged token, implicit and explicit object-aware fusion modules (IOF and EOF), and object prototype generation (OPG). Result: CamSAM2 substantially outperforms SAM2 on three VCOS datasets, with gains of 12.2 mDice on MoCA-Mask and 19.6 mDice on SUN-SEG-Hard. Conclusion: CamSAM2 substantially improves SAM2 on VCOS while adding only a negligible number of learnable parameters. Abstract: Video camouflaged object segmentation (VCOS), aiming at segmenting camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real-world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2's capability of segmenting camouflaged videos is suboptimal, especially when given simple prompts such as point and box. To address the problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2's ability to handle camouflaged scenes without modifying SAM2's parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high-quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with click prompt on MoCA-Mask and 19.6 mDice gains with mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone. The code will be available at https://github.com/zhoustan/CamSAM2.

PCM: Picard Consistency Model for Fast Parallel Sampling of Diffusion Models

Junhyuk So,Jiwoong Shin,Chaeyeon Jang,Eunhyeok Park

Task: Propose a new parallelization scheme, the Picard Consistency Model (PCM), to reduce the number of generation steps in Picard iteration.

Motivation: Diffusion models generate slowly because denoising is sequential; existing parallel sampling methods such as Picard iteration reduce the number of sequential steps but do not guarantee faster convergence.

Details

Method: Building on consistency models, train PCM directly to predict the fixed-point solution (the final output) from any stage of the convergence trajectory, and introduce model switching to ensure exact convergence. Result: On tasks including image generation and robotic control, PCM achieves up to a 2.71x speedup over sequential sampling and up to a 1.77x speedup over Picard iteration. Conclusion: PCM substantially accelerates diffusion model generation while ensuring exact convergence. Abstract: Recently, diffusion models have achieved significant advances in vision, text, and robotics. However, they still face slow generation speeds due to sequential denoising processes. To address this, a parallel sampling method based on Picard iteration was introduced, effectively reducing sequential steps while ensuring exact convergence to the original output. Nonetheless, Picard iteration does not guarantee faster convergence, which can still result in slow generation in practice. In this work, we propose a new parallelization scheme, the Picard Consistency Model (PCM), which significantly reduces the number of generation steps in Picard iteration. Inspired by the consistency model, PCM is directly trained to predict the fixed-point solution, or the final output, at any stage of the convergence trajectory. Additionally, we introduce a new concept called model switching, which addresses PCM's limitations and ensures exact convergence. Extensive experiments demonstrate that PCM achieves up to a 2.71x speedup over sequential sampling and a 1.77x speedup over Picard iteration across various tasks, including image generation and robotic control.
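
For readers unfamiliar with the baseline that PCM accelerates, the sketch below shows plain Picard (fixed-point) iteration on an Euler-discretized ODE. It is not PCM itself: there is no learned model, and the function names and the scalar test equation dx/dt = -x are assumptions chosen for illustration. Each sweep evaluates the drift at every time step of the current trajectory guess, which is the part that can run in parallel, and the iterate matches the sequential result after at most one sweep per time step.

```python
def sequential_euler(f, x0, ts):
    """Reference: standard step-by-step Euler integration."""
    xs = [x0]
    for i in range(len(ts) - 1):
        h = ts[i + 1] - ts[i]
        xs.append(xs[-1] + h * f(xs[-1], ts[i]))
    return xs


def picard_euler(f, x0, ts, tol=1e-12, max_iters=None):
    """Picard iteration over the whole trajectory.

    Returns the trajectory and the number of sweeps performed.
    """
    n = len(ts)
    max_iters = max_iters or n      # n - 1 sweeps suffice; one more confirms
    xs = [x0] * n                   # initial guess: constant trajectory
    for k in range(1, max_iters + 1):
        # All drift evaluations use the previous guess, so they are
        # independent of each other and could be batched on parallel hardware.
        drifts = [(ts[i + 1] - ts[i]) * f(xs[i], ts[i]) for i in range(n - 1)]
        new_xs = [x0]
        for d in drifts:            # prefix sum of the drifts
            new_xs.append(new_xs[-1] + d)
        change = max(abs(a - b) for a, b in zip(new_xs, xs))
        xs = new_xs
        if change <= tol:
            return xs, k
    return xs, max_iters


ts = [0.1 * i for i in range(11)]
seq = sequential_euler(lambda x, t: -x, 1.0, ts)
par, sweeps = picard_euler(lambda x, t: -x, 1.0, ts)
```

The number of sweeps needed before `change` falls below `tol` is what determines the practical speedup, and the abstract's point is that nothing in plain Picard iteration guarantees this number is small.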

FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion

Pihai Sun,Junjun Jiang,Yuanqi Yao,Youyu Chen,Wenbo Zhao,Kui Jiang,Xianming Liu

Task: Propose FUSE, a frequency-decoupled self-supervised encoder, to address the generalizability problem in image-event joint depth estimation.

Motivation: Existing methods fuse features poorly because annotated data are scarce and the frequency content of images and event streams is mismatched.

Details

Method: Combine Parameter-efficient Self-supervised Transfer (PST) with a Frequency-Decoupled Fusion module (FreDFuse) to achieve cross-modal knowledge transfer and frequency-decoupled feature fusion. Result: Abs.Rel improves by 14% on MVSEC and by 24.9% on DENSE, and the model shows strong zero-shot adaptability. Conclusion: Through self-supervision and frequency decoupling, FUSE substantially improves the generalizability and real-world deployability of image-event depth estimation. Abstract: Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components: The Parameter-efficient Self-supervised Transfer (PST) establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity by enabling joint encoding without depth ground truth. Complementing this, we propose the Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches through physics-aware fusion. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14% and 24.9% improvements in Abs.Rel on MVSEC and DENSE datasets. The framework exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur, significantly advancing real-world deployment capabilities. The source code for our method is publicly available at: https://github.com/sunpihai-up/FUSE

Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings

Chengan Che,Chao Wang,Tom Vercauteren,Sophia Tsoka,Luis C. Garcia-Peraza-Herrera

Task: Build Surg-3M, a large-scale surgical video dataset, and develop SurgFM, a self-supervised foundation model pretrained on it, for downstream tasks such as surgical phase recognition, action recognition, and tool presence detection.

Motivation: Traditional surgical datasets are small, which limits progress in computer-assisted surgery; larger and more diverse datasets are needed to support research.

Details

Method: Collect high-resolution surgical videos with a novel aggregation pipeline to build the Surg-3M dataset; develop SurgFM by combining ConvNeXt, DINO, and an augmented distillation method. Result: SurgFM outperforms existing models on several downstream tasks, including surgical phase recognition (+8.9pp, +4.7pp, and +3.9pp of Jaccard), action recognition (+3.1pp of mAP), and tool presence detection (+4.6pp of mAP). Conclusion: Surg-3M and SurgFM are important resources for developing autonomous robotic surgery systems and have the potential to accelerate progress in the field. Abstract: Advancements in computer-assisted surgical procedures heavily rely on accurate visual data interpretation from camera systems used during surgeries. Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos with less than 100K images. To address these constraints, a new dataset called Surg-3M has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos and more than 3 million high-quality images from multiple procedure types, Surg-3M offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel tasks. To demonstrate the effectiveness of this dataset, we present SurgFM, a self-supervised foundation model pretrained on Surg-3M that achieves impressive results in downstream tasks such as surgical phase recognition, action recognition, and tool presence detection. Combining key components from ConvNeXt, DINO, and an innovative augmented distillation method, SurgFM exhibits exceptional performance compared to specialist architectures across various benchmarks. Our experimental results show that SurgFM outperforms state-of-the-art models in multiple downstream tasks, including significant gains in surgical phase recognition (+8.9pp, +4.7pp, and +3.9pp of Jaccard in AutoLaparo, M2CAI16, and Cholec80), action recognition (+3.1pp of mAP in CholecT50) and tool presence detection (+4.6pp of mAP in Cholec80). Moreover, even when using only half of the data, SurgFM outperforms state-of-the-art models in AutoLaparo and achieves state-of-the-art performance in Cholec80. Both Surg-3M and SurgFM have significant potential to accelerate progress towards developing autonomous robotic surgery systems.

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Haoyu Fu,Diankun Zhang,Zongchuang Zhao,Jianfeng Cui,Dingkang Liang,Chong Zhang,Dingyuan Zhang,Hongwei Xie,Bing Wang,Xiang Bai

Task: Propose ORION, an end-to-end autonomous driving framework that uses vision-language instructed action generation to address limited causal reasoning capability.

Motivation: Current end-to-end autonomous driving methods perform poorly in interactive closed-loop evaluation because of limited causal reasoning, and the gap between the semantic reasoning space of existing vision-language models and the action space leads to unsatisfactory closed-loop results.

Details

Method: Combine a QT-Former to aggregate long-term history context, a large language model for driving scenario reasoning, and a generative planner for precise trajectory prediction, while aligning the reasoning space with the action space. Result: On Bench2Drive, ORION achieves a Driving Score of 77.74 and a Success Rate of 54.62%, substantially outperforming existing methods. Conclusion: Through vision-language instructed action generation and alignment of the reasoning and action spaces, ORION substantially improves the closed-loop performance of end-to-end autonomous driving. Abstract: End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, the problem remains open: few VLMs for E2E methods perform well in closed-loop evaluation due to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space and the action space to implement a unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive datasets, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.

OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations

Christina Kassab,Sacha Morin,Martin Büchner,Matías Mattamala,Kumaraditya Gupta,Abhinav Valada,Liam Paull,Maurice Fallon

Task: Propose the OpenLex3D benchmark for evaluating open-vocabulary 3D scene representations.

Motivation: Existing evaluation is limited to closed-set semantics and does not capture the richness of language.

Details

Method: Provide new label annotations for 23 scenes, introducing synonymous object categories and nuanced descriptions, and define an open-set 3D semantic segmentation task and an object retrieval task. Result: Existing methods are evaluated, with failure cases and avenues for improvement highlighted. Conclusion: OpenLex3D is publicly available and provides a new evaluation standard for open-vocabulary 3D representations. Abstract: 3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, the evaluation of these representations is limited to closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark to evaluate 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for 23 scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymical object categories and additional nuanced descriptions. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we provide insights on feature precision, segmentation, and downstream capabilities. We evaluate various existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases and avenues for improvement. The benchmark is publicly available at: https://openlex3d.github.io/.

BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts

Suzhe Xu,Jialin Peng,Chengyuan Zhang

Task: Propose BiPrompt-SAM, a dual-modal prompt segmentation framework that effectively combines the strengths of point prompts and text prompts.

Motivation: Existing methods rarely explore how to combine point prompts and text prompts, two complementary modalities, for optimal segmentation performance.

Details

Method: Fuse point and text prompts through an explicit selection mechanism: use SAM to generate multiple mask candidates, obtain a semantic guidance mask from the text prompt, and select the best candidate based on similarity metrics. Result: The method reaches 89.55% mDice and 81.46% mIoU on Endovis17 and significantly outperforms existing methods on the RefCOCO series datasets. Conclusion: BiPrompt-SAM effectively combines the spatial precision of point prompts with the semantic richness of text prompts and offers a new perspective on multi-modal prompt fusion. Abstract: Segmentation is a fundamental task in computer vision, with prompt-driven methods gaining prominence due to their flexibility. The recent Segment Anything Model (SAM) has demonstrated powerful point-prompt segmentation capabilities, while text-based segmentation models offer rich semantic understanding. However, existing approaches rarely explore how to effectively combine these complementary modalities for optimal segmentation performance. This paper presents BiPrompt-SAM, a novel dual-modal prompt segmentation framework that fuses the advantages of point and text prompts through an explicit selection mechanism. Specifically, we leverage SAM's inherent ability to generate multiple mask candidates, combined with a semantic guidance mask from text prompts, and explicitly select the most suitable candidate based on similarity metrics. This approach can be viewed as a simplified Mixture of Experts (MoE) system, where the point and text modules act as distinct "experts," and the similarity scoring serves as a rudimentary "gating network." We conducted extensive evaluations on both the Endovis17 medical dataset and RefCOCO series natural image datasets. On Endovis17, BiPrompt-SAM achieved 89.55% mDice and 81.46% mIoU, comparable to state-of-the-art specialized medical segmentation models. On the RefCOCO series datasets, our method attained 87.1%, 86.5%, and 85.8% IoU, significantly outperforming existing approaches. Experiments demonstrate that our explicit dual-selection method effectively combines the spatial precision of point prompts with the semantic richness of text prompts, particularly excelling in scenarios involving semantically complex objects, multiple similar objects, and partial occlusions. BiPrompt-SAM not only provides a simple yet effective implementation but also offers a new perspective on multi-modal prompt fusion.
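
The explicit selection step described in this abstract can be sketched in a few lines: score each point-prompted mask candidate against the text-derived guidance mask and keep the best one. The sketch assumes IoU as the similarity metric and represents masks as sets of (row, col) pixels; both choices, and the names `iou` and `select_mask`, are illustrative assumptions rather than details confirmed by the abstract.

```python
def iou(a, b):
    """IoU of two masks represented as sets of (row, col) pixels."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_mask(candidates, text_mask):
    """Return (index, score) of the candidate most similar to text_mask."""
    scores = [iou(c, text_mask) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]


# Three hypothetical candidates from a point prompt, one text guidance mask.
candidates = [{(0, 0)}, {(0, 0), (0, 1)}, {(1, 1)}]
text_mask = {(0, 0), (0, 1), (0, 2)}
best_index, best_score = select_mask(candidates, text_mask)
```

In the mixture-of-experts reading given in the abstract, `scores` plays the role of the rudimentary gating signal.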

Konyul Park,Yecheol Kim,Daehun Kim,Jun Won Choi

Task: Propose MoME, an efficient and robust LiDAR-camera 3D object detector that achieves robust performance through a mixture of experts approach.

Motivation: Existing sensor fusion architectures degrade significantly under severe sensor failures, mainly because of inter-modality dependencies.

Details

Method: Use three parallel expert decoders to fully decouple modality dependencies, and propose an Adaptive Query Router (AQR) to select the most suitable decoder. Result: On the nuScenes-R benchmark, MoME achieves state-of-the-art performance under extreme weather and sensor failure conditions. Conclusion: By decoupling modality dependencies and adaptively selecting decoders, MoME substantially improves robustness in sensor failure scenarios. Abstract: Modern autonomous driving perception systems utilize complementary multi-modal sensors, such as LiDAR and cameras. Although sensor fusion architectures enhance performance in challenging environments, they still suffer significant performance drops under severe sensor failures, such as LiDAR beam reduction, LiDAR drop, limited field of view, camera drop, and occlusion. This limitation stems from inter-modality dependencies in current sensor fusion frameworks. In this study, we introduce an efficient and robust LiDAR-camera 3D object detector, referred to as MoME, which can achieve robust performance through a mixture of experts approach. Our MoME fully decouples modality dependencies using three parallel expert decoders, which use camera features, LiDAR features, or a combination of both to decode object queries, respectively. We propose a Multi-Expert Decoding (MED) framework, where each query is decoded selectively using one of three expert decoders. MoME utilizes an Adaptive Query Router (AQR) to select the most appropriate expert decoder for each query based on the quality of camera and LiDAR features. This ensures that each query is processed by the best-suited expert, resulting in robust performance across diverse sensor failure scenarios. We evaluated the performance of MoME on the nuScenes-R benchmark. Our MoME achieved state-of-the-art performance in extreme weather and sensor failure conditions, significantly outperforming the existing models across various sensor failure scenarios.

LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation

Vladan Stojnić,Yannis Kalantidis,Jiří Matas,Giorgos Tolias

Task: Propose LPOSS+, a training-free method for open-vocabulary semantic segmentation that uses Vision-and-Language Models (VLMs) and label propagation.

Motivation: VLMs are optimized for cross-modal alignment rather than intra-modal similarity; the method addresses this and uses pixel-level label propagation to improve segmentation accuracy.

Details

Method: Combine the initial predictions of a VLM with the intra-modal similarities of a Vision Model (VM), jointly optimize the predictions through label propagation, and refine them at the pixel level. Result: LPOSS+ achieves state-of-the-art performance among training-free methods across a range of datasets. Conclusion: Through whole-image inference and pixel-level refinement, LPOSS+ substantially improves the accuracy of open-vocabulary semantic segmentation. Abstract: We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets. Code: https://github.com/vladan-stojnic/LPOSS
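
To make the label propagation idea concrete, the sketch below runs the classic iterative update F <- alpha * S * F + (1 - alpha) * Y on a tiny graph, where Y holds initial per-node class scores and S is a symmetrically normalized affinity matrix. In the setting of the abstract, Y would come from the VLM and the affinities from the VM; here both are hand-written toy values, and the function name, `alpha`, and the iteration count are assumptions, not the LPOSS implementation.

```python
import math


def label_propagation(W, Y, alpha=0.9, iters=200):
    """Propagate class scores Y over a graph with symmetric affinities W."""
    n, c = len(Y), len(Y[0])
    deg = [sum(row) for row in W]
    # Symmetric normalization S = D^(-1/2) W D^(-1/2); zero entries stay zero.
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) if W[i][j] else 0.0
          for j in range(n)] for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iters):
        F = [[alpha * sum(S[i][k] * F[k][j] for k in range(n))
              + (1 - alpha) * Y[i][j] for j in range(c)] for i in range(n)]
    return F


# Toy graph: patches 0-1 and 2-3 are strongly similar, 1-2 only weakly.
W = [[0, 1, 0, 0],
     [1, 0, 0.1, 0],
     [0, 0.1, 0, 1],
     [0, 0, 1, 0]]
# Initial scores: patch 1 is slightly mislabeled, patch 2 is undecided.
Y = [[1.0, 0.0], [0.4, 0.6], [0.5, 0.5], [0.0, 1.0]]
F = label_propagation(W, Y)
labels = [max(range(2), key=row.__getitem__) for row in F]
```

On this toy input the weakly mislabeled patch 1 is pulled toward its strongly connected neighbor, which is the effect the abstract attributes to incorporating patch-to-patch relationships.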

Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models

Kartik Thakral,Tamar Glaser,Tal Hassner,Mayank Vatsa,Richa Singh

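Task: Propose FADE, a method for adjacency-aware unlearning in diffusion models, to address the failure of existing text-to-image generative models to preserve knowledge of semantically related concepts when a specific target concept is removed.

Motivation: Existing unlearning algorithms struggle to preserve knowledge of semantically related concepts when removing a specific target concept (the adjacency problem), so a finer-grained method is needed.

Details

Method: FADE has two components: (1) the Concept Neighborhood, which identifies an adjacency set of related concepts; and (2) Mesh Modules, which combine Expungement, Adjacency, and Guidance loss components to erase the target concept precisely while preserving fidelity on related and unrelated concepts. Result: Evaluation on Stanford Dogs, Oxford Flowers, CUB, I2P, Imagenette, and ImageNet1k shows that FADE removes target concepts effectively with minimal impact on related concepts, improving retention performance by at least 12% over existing methods. Conclusion: Through adjacency-aware unlearning, FADE substantially improves both the precision of target concept removal and the retention of related concepts in text-to-image generative models. Abstract: Existing unlearning algorithms in text-to-image generative models often fail to preserve the knowledge of semantically related concepts when removing specific target concepts: a challenge known as adjacency. To address this, we propose FADE (Fine-grained Attenuation for Diffusion Erasure), introducing adjacency-aware unlearning in diffusion models. FADE comprises two components: (1) the Concept Neighborhood, which identifies an adjacency set of related concepts, and (2) Mesh Modules, employing a structured combination of Expungement, Adjacency, and Guidance loss components. These enable precise erasure of target concepts while preserving fidelity across related and unrelated concepts. Evaluated on datasets like Stanford Dogs, Oxford Flowers, CUB, I2P, Imagenette, and ImageNet1k, FADE effectively removes target concepts with minimal impact on correlated concepts, achieving at least a 12% improvement in retention performance over state-of-the-art methods.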

SITA: Structurally Imperceptible and Transferable Adversarial Attacks for Stylized Image Generation

Jingdan Kang,Haoxin Yang,Yan Cai,Huaidong Zhang,Xuemiao Xu,Yong Du,Shengfeng He

Task: Propose SITA, a structurally imperceptible and transferable adversarial attack method, to protect artworks from unauthorized stylized generation.

Motivation: Current adversarial attack methods suffer from poor transferability, high computational cost, and noticeable noise that degrades the aesthetic quality of the artwork.

Details

Method: Use a CLIP-based destylization loss to disrupt the robust style representation of the image, without a surrogate diffusion model, and embed the noise within the structural details of the image. Result: SITA significantly outperforms existing methods in transferability, computational efficiency, and noise imperceptibility. Conclusion: SITA offers superior protection for artworks while preserving visual quality. Abstract: Image generation technology has brought significant advancements across various fields but has also raised concerns about data misuse and potential rights infringements, particularly with respect to creating visual artworks. Current methods aimed at safeguarding artworks often employ adversarial attacks. However, these methods face challenges such as poor transferability, high computational costs, and the introduction of noticeable noise, which compromises the aesthetic quality of the original artwork. To address these limitations, we propose Structurally Imperceptible and Transferable Adversarial (SITA) attacks. SITA leverages a CLIP-based destylization loss, which decouples and disrupts the robust style representation of the image. This disruption hinders style extraction during stylized image generation, thereby impairing the overall stylization process. Importantly, SITA eliminates the need for a surrogate diffusion model, leading to significantly reduced computational overhead. The method's robust style feature disruption ensures high transferability across diverse models. Moreover, SITA introduces perturbations by embedding noise within the imperceptible structural details of the image. This approach effectively protects against style extraction without compromising the visual quality of the artwork. Extensive experiments demonstrate that SITA offers superior protection for artworks against unauthorized use in stylized generation. It significantly outperforms existing methods in terms of transferability, computational efficiency, and noise imperceptibility. Code is available at https://github.com/A-raniy-day/SITA.

In the Blink of an Eye: Instant Game Map Editing using a Generative-AI Smart Brush

Vitaly Gnatyuk,Valeriia Koriukina,Ilya Levoshevich,Pavel Nurminskiy,Guenter Wallner

Task: Apply modern AI techniques to high-resolution texture manipulation in complex 3D game environments.

Motivation: Automated creation of 3D game map art remains underexplored because of its unique complexity and domain-specific challenges; existing work mostly focuses on simpler data distributions.

Details

Method: Propose a novel Smart Brush tool, with variants based on generative adversarial networks and diffusion models, for efficient and context-aware map editing. Result: The GAN-based brush produces the sharpest and most detailed outputs while preserving image context, outperforming the other evaluated models. Conclusion: The hybrid workflow improves artistic flexibility and production efficiency, helping to bridge the gap between automation and creative control in game development. Abstract: With video games steadily increasing in complexity, automated generation of game content has found widespread interest. However, the task of 3D gaming map art creation remains underexplored to date due to its unique complexity and domain-specific challenges. While recent works have addressed related topics such as retro-style level generation and procedural terrain creation, these works primarily focus on simpler data distributions. To the best of our knowledge, we are the first to demonstrate the application of modern AI techniques for high-resolution texture manipulation in complex, highly detailed AAA 3D game environments. We introduce a novel Smart Brush for map editing, designed to assist artists in seamlessly modifying selected areas of a game map with minimal effort. By leveraging generative adversarial networks and diffusion models, we propose two variants of the brush that enable efficient and context-aware generation. Our hybrid workflow aims to enhance both artistic flexibility and production efficiency, enabling the refinement of environments without manually reworking every detail, thus helping to bridge the gap between automation and creative control in game development. A comparative evaluation of our two methods with adapted versions of several state-of-the-art models shows that our GAN-based brush produces the sharpest and most detailed outputs while preserving image context, whereas the evaluated state-of-the-art models tend towards blurrier results and exhibit difficulties in maintaining contextual consistency.

PAVE: Patching and Adapting Video Large Language Models

Zhuoming Liu,Yiquan Li,Khoi Duc Nguyen,Yiwu Zhong,Yin Li

Task: Propose PAVE, a framework for adapting pre-trained video large language models (Video LLMs) to downstream tasks that involve additional modalities or data types.

Motivation: Pre-trained Video LLMs perform well on multimodal tasks but are hard to adapt to tasks involving new modalities or data types.

Details

Method: Introduce lightweight adapters ("patches") that add a small number of parameters and operations without modifying the base model's architecture or pre-trained weights. Result: PAVE significantly improves the base model on downstream tasks, surpassing existing task-specific models while adding only about 0.1% in FLOPs and parameters. Conclusion: PAVE is a flexible and efficient framework that supports multi-task learning and generalizes across different Video LLMs. Abstract: Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models while incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at https://github.com/dragonlzm/PAVE.

Unpaired Object-Level SAR-to-Optical Image Translation for Aircraft with Keypoints-Guided Diffusion Models

Ruixi You,Hecheng Jia,Feng Xu

Task: Propose a keypoint-guided diffusion model (KeypointDiff) for SAR-to-optical image translation of unpaired aircraft targets.

Motivation: SAR image interpretation relies on expert knowledge, and existing research focuses mostly on scene-level translation; object-level methods are lacking, mainly because paired data are scarce and contour and texture details are hard to preserve.

Details

Method: Design a keypoint-guided diffusion model that adds supervision on target class and azimuth angle, uses a training strategy for unpaired data, and introduces a class-angle guidance module (CAGM) together with adversarial and consistency losses. Result: Experiments show that the method outperforms existing approaches on multiple metrics and generalizes zero-shot to untrained aircraft types. Conclusion: KeypointDiff provides an efficient solution for object-level SAR-to-optical translation and downstream tasks. Abstract: Synthetic Aperture Radar (SAR) imagery provides all-weather, all-day, and high-resolution imaging capabilities, but its unique imaging mechanism makes interpretation heavily reliant on expert knowledge, limiting interpretability, especially in complex target tasks. Translating SAR images into optical images is a promising solution to enhance interpretation and support downstream tasks. Most existing research focuses on scene-level translation, with limited work on object-level translation due to the scarcity of paired data and the challenge of accurately preserving contour and texture details. To address these issues, this study proposes a keypoint-guided diffusion model (KeypointDiff) for SAR-to-optical image translation of unpaired aircraft targets. This framework introduces supervision on target class and azimuth angle via keypoints, along with a training strategy for unpaired data. Based on the classifier-free guidance diffusion architecture, a class-angle guidance module (CAGM) is designed to integrate class and angle information into the diffusion generation process. Furthermore, adversarial loss and consistency loss are employed to improve image fidelity and detail quality, tailored for aircraft targets. During sampling, aided by a pre-trained keypoint detector, the model eliminates the requirement for manually labeled class and azimuth information, enabling automated SAR-to-optical translation. Experimental results demonstrate that the proposed method outperforms existing approaches across multiple metrics, providing an efficient and effective solution for object-level SAR-to-optical translation and downstream tasks. Moreover, the method exhibits strong zero-shot generalization to untrained aircraft types with the assistance of the keypoint detector.

Zhiyang Liu,Dong Yang,Minghao Zhang,Hanyu Sun,Hong Wu,Huiying Wang,Wen Shen,Chao Chai,Shuang Xia

Task: Develop a contrastive-learning-based foundation model for multi-modal head MRI to reduce the need for large amounts of annotated data.

Motivation: Deep learning has great potential in medical image analysis, but the lack of sufficient annotated data limits its practical application.

Details

Method: Propose a contrastive learning framework that integrates a mixed syntax and semantic similarity matching metric to reduce the dependence on very large datasets. Result: The proposed SeLIP model performs well on downstream tasks such as image-text retrieval, classification, and image segmentation. Conclusion: Considering the similarities among texts is important when developing medical image foundation models. Abstract: Although deep learning (DL) methods have shown tremendous potential in many medical image analysis tasks, the practical applications of medical DL models are limited by the lack of sufficient data samples with manual annotations. Noting that clinical radiology examinations are associated with radiology reports that describe the images, we propose to develop a foundation model for multi-modal head MRI by using contrastive learning on the images and the corresponding radiology findings. In particular, a contrastive learning framework is proposed, in which a mixed syntax and semantic similarity matching metric is integrated to reduce the need for the extremely large datasets required by conventional contrastive learning frameworks. Our proposed similarity enhanced contrastive language image pretraining (SeLIP) is able to effectively extract more useful features. Experiments revealed that our proposed SeLIP performs well in many downstream tasks, including image-text retrieval, classification, and image segmentation, which highlights the importance of considering the similarities among texts describing different images in developing medical image foundation models.

LENVIZ: A High-Resolution Low-Exposure Night Vision Benchmark Dataset

Manjushree Aithal,Rosaura G. VidalMata,Manikandtan Kartha,Gong Chen,Eashan Adhikarla,Lucas N. Kirsten,Zhicheng Fu,Nikhil A. Madhusudhana,Joe Nasti

Task: Introduce the LENVIZ dataset for low-light image enhancement and analyze state-of-the-art techniques.

Motivation: Low-light image enhancement is vital for applications like night vision and autonomous driving, but current methods face challenges due to limited illumination.

Details

Method: Create the LENVIZ dataset, a multi-exposure benchmark with 230K frames, diverse scenes, and expert-curated ground truth. Result: LENVIZ is the largest publicly available benchmark at up to 4K resolution, offering varied lighting and noise conditions. Conclusion: The dataset and analysis highlight areas for improvement in low-light image enhancement techniques. Abstract: Low-light image enhancement is crucial for a myriad of applications, from night vision and surveillance to autonomous driving. However, due to the inherent limitations that come hand in hand with capturing images in low-illumination environments, the task of enhancing such scenes still presents a formidable challenge. To advance research in this field, we introduce our Low Exposure Night Vision (LENVIZ) Dataset, a comprehensive multi-exposure benchmark dataset for low-light image enhancement comprising over 230K frames showcasing 24K real-world indoor and outdoor scenes, with and without humans. Captured using 3 different camera sensors, LENVIZ offers a wide range of lighting conditions, noise levels, and scene complexities, making it the largest publicly available benchmark at up to 4K resolution in the field. LENVIZ includes high-quality human-generated ground truth, for which each multi-exposure low-light scene has been meticulously curated and edited by expert photographers to ensure optimal image quality. Furthermore, we also conduct a comprehensive analysis of current state-of-the-art low-light image enhancement techniques on our dataset and highlight potential areas of improvement.

FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model

Jun Zhou,Jiahao Li,Zunnan Xu,Hanhui Li,Yiji Cheng,Fa-Ting Hong,Qin Lin,Qinglin Lu,Xiaodan Liang

Task: Propose FireEdit, a fine-grained instruction-based image editing framework, to address the challenges of complex scenarios, semantic consistency, and fine-grained editing.

Motivation: Existing instruction-based image editing methods still fall short in complex scenarios, semantic consistency, and fine-grained editing.

Details

Method: Enhance fine-grained visual perception by introducing a region-aware vision language model (VLM), and optimize the diffusion model with a Time-Aware Target Injection module and a Hybrid Visual Cross Attention module. Result: Experiments show that FireEdit outperforms existing methods in understanding editing instructions and maintaining semantic consistency. Conclusion: FireEdit shows clear advantages on instruction-based image editing and surpasses the state of the art. Abstract: Currently, instruction-based image editing methods have made significant progress by leveraging the powerful cross-modal understanding capabilities of vision language models (VLMs). However, they still face challenges in three key areas: 1) complex scenarios; 2) semantic consistency; and 3) fine-grained editing. To address these issues, we propose FireEdit, an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Specifically, we enhance the fine-grained visual perception capabilities of the VLM by introducing additional region tokens. Relying solely on the output of the LLM to guide the diffusion model may lead to suboptimal editing results. Therefore, we propose a Time-Aware Target Injection module and a Hybrid Visual Cross Attention module. The former dynamically adjusts the guidance strength at various denoising stages by integrating timestep embeddings with the text embeddings. The latter enhances visual details for image editing, thereby preserving semantic consistency between the edited result and the source image. By combining the VLM enhanced with fine-grained region tokens and the time-dependent diffusion model, FireEdit demonstrates significant advantages in comprehending editing instructions and maintaining high semantic consistency. Extensive experiments indicate that our approach surpasses the state-of-the-art instruction-based image editing methods. Our project is available at https://zjgans.github.io/fireedit.github.io.

Attention IoU: Examining Biases in CelebA using Attention Maps

Aaron Serianni,Tyler Zhu,Vikram V. Ramaswamy,Olga Russakovsky

Task: Propose Attention-IoU, a metric for revealing biases in the internal representations of computer vision models.

Motivation: Existing methods focus mainly on dataset distribution and model performance on subgroups and overlook the model's internal workings, so a new way to quantify bias inside the model is needed.

Details

Method: Develop the Attention-IoU metric and related scores from attention maps, and analyze those maps to reveal internal model biases and their potential causes. Result: Attention-IoU is validated on the Waterbirds dataset, and on CelebA it uncovers bias-related correlations beyond accuracy disparities. Conclusion: Attention-IoU can reveal biases inside a model and identify potential confounding variables that are not present in the dataset labels. Abstract: Computer vision models have been shown to exhibit and amplify biases across a wide array of datasets and tasks. Existing methods for quantifying bias in classification models primarily focus on dataset distribution and model performance on subgroups, overlooking the internal workings of a model. We introduce the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases within a model's internal representations and identify image features potentially causing the biases. First, we validate Attention-IoU on the synthetic Waterbirds dataset, showing that the metric accurately measures model bias. We then analyze the CelebA dataset, finding that Attention-IoU uncovers correlations beyond accuracy disparities. Through an investigation of individual attributes through the protected attribute of Male, we examine the distinct ways biases are represented in CelebA. Lastly, by subsampling the training set to change attribute correlations, we demonstrate that Attention-IoU reveals potential confounding variables not present in dataset labels.
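
The abstract does not spell out how intersection over union is extended from binary masks to continuous attention maps. One common generalization, shown here purely as an assumption and not as the paper's definition, is the weighted Jaccard index: the sum of element-wise minima divided by the sum of element-wise maxima. The function name `soft_iou` and the toy values are hypothetical.

```python
def soft_iou(a, b):
    """Weighted-Jaccard overlap of two non-negative maps of equal shape."""
    num = sum(min(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    den = sum(max(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return num / den if den else 0.0


# Toy 2x2 attention map compared with a binary region mask.
attention = [[0.8, 0.2], [0.0, 0.0]]
region = [[1.0, 0.0], [0.0, 0.0]]
score = soft_iou(attention, region)
```

For binary inputs this reduces to ordinary IoU, so a score near 1 indicates that the attention mass sits on the compared region and a score near 0 indicates that it sits elsewhere.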

Carlos Plou,Cesar Borja,Ruben Martinez-Cantin,Ana C. Murillo

Task: 提出一种名为FALCONEye的新型视频代理，用于在长达一小时的视频中检索信息并定位答案所在的帧。

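Motivation: Address the limitations of existing vision-language models (VLMs) in long-video information retrieval, in particular context window limits and the difficulty of pinpointing the answer frames.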

Details

Method: Combines a VLM with a large language model (LLM) in a meta-architecture with an efficient exploration algorithm that locates information using short clips, captions, and answer confidence. Result: FALCONEye outperforms the state of the art on FALCON-Bench and performs similarly or better on related benchmarks. Conclusion: FALCONEye offers an efficient and scalable solution for long-video information retrieval, and FALCON-Bench advances evaluation standards in the field. Abstract: Information retrieval in hour-long videos presents a significant challenge, even for state-of-the-art Vision-Language Models (VLMs), particularly when the desired information is localized within a small subset of frames. Long video data presents challenges for VLMs due to context window limitations and the difficulty of pinpointing frames containing the answer. Our novel video agent, FALCONEye, combines a VLM and a Large Language Model (LLM) to search relevant information along the video, and locate the frames with the answer. FALCONEye's novelty relies on 1) the proposed meta-architecture, which is better suited to tackle hour-long videos compared to short video approaches in the state-of-the-art; 2) a new efficient exploration algorithm to locate the information using short clips, captions and answer confidence; and 3) our state-of-the-art VLMs calibration analysis for the answer confidence. Our agent is built over a small-size VLM and a medium-size LLM being accessible to run on standard computational resources. We also release FALCON-Bench, a benchmark to evaluate long (average > 1 hour) Video Answer Search challenges, highlighting the need for open-ended question evaluation. Our experiments show FALCONEye's superior performance over the state of the art on FALCON-Bench, and similar or better performance in related benchmarks.

Xinpeng Li,Shijian Deng,Bolin Lai,Weiguo Pian,James M. Rehg,Yapeng Tian

Task: Propose an online multimodal social interaction understanding (Online-MMSI) framework that addresses existing models' dependence on future context.

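Motivation: AI agents in real-world scenarios must give real-time feedback, but existing models depend on future context and cannot be applied directly.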

Details

Method: Proposes the Online-MMSI-VLM framework, which combines multi-party conversation forecasting with a social-aware visual prompting strategy. Result: Achieves state-of-the-art performance on three tasks and two datasets, significantly outperforming baseline models. Conclusion: The Online-MMSI-VLM framework effectively addresses the challenges of online multimodal social interaction understanding. Abstract: Multimodal social interaction understanding (MMSI) is critical in human-robot interaction systems. In real-world scenarios, AI agents are required to provide real-time feedback. However, existing models often depend on both past and future contexts, which hinders them from applying to real-world problems. To bridge this gap, we propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. To address the challenges of missing the useful future context, we develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting with multi-modal large language models. First, to enrich linguistic context, the multi-party conversation forecasting simulates potential future utterances in a coarse-to-fine manner, anticipating upcoming speaker turns and then generating fine-grained conversational details. Second, to effectively incorporate visual social cues like gaze and gesture, social-aware visual prompting highlights the social dynamics in video with bounding boxes and body keypoints for each person and frame. Extensive experiments on three tasks and two datasets demonstrate that our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI. The code and pre-trained models will be publicly released at: https://github.com/Sampson-Lee/OnlineMMSI.

Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Tianhao Qi,Jianlong Yuan,Wanquan Feng,Shancheng Fang,Jiawei Liu,SiYu Zhou,Qian He,Hongtao Xie,Yongdong Zhang

Task: Propose Mask$^2$DiT to address fine-grained alignment between video segments and text annotations in multi-scene video generation.

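Motivation: Multi-scene video generation has broad application prospects but remains relatively underexplored, in particular the precise alignment between video segments and text annotations.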

Details

Method: Introduces a symmetric binary mask at each attention layer of the DiT architecture so that each text annotation applies only to its corresponding video segment while temporal coherence is preserved; a segment-level conditional mask is added to enable auto-regressive scene extension. Result: Experiments confirm that Mask$^2$DiT excels at maintaining visual consistency and semantic alignment. Conclusion: Mask$^2$DiT offers an effective solution for multi-scene video generation, achieving precise segment-level alignment and auto-regressive extension. Abstract: Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
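
One possible reading of the symmetric binary mask can be sketched as follows. The token layout (all text tokens first, then all video tokens), the function name `build_segment_mask`, and the toy segment lengths are assumptions made for illustration and are not taken from the paper.

```python
def build_segment_mask(text_lens, video_lens):
    """Return a symmetric 0/1 attention mask for n scenes.

    Assumed token layout (hypothetical): text tokens scene by scene,
    followed by video tokens scene by scene.
    """
    if len(text_lens) != len(video_lens):
        raise ValueError("need one text annotation per video segment")

    # Label every token with (kind, scene index).
    labels = []
    for i, n in enumerate(text_lens):
        labels += [("text", i)] * n
    for i, n in enumerate(video_lens):
        labels += [("video", i)] * n

    def allowed(p, q):
        (kind_p, seg_p), (kind_q, seg_q) = p, q
        if kind_p == "video" and kind_q == "video":
            return True          # visual tokens attend across all segments
        return seg_p == seg_q    # any pair involving text stays in its scene

    return [[1 if allowed(p, q) else 0 for q in labels] for p in labels]


# Two scenes: text lengths 2 and 1, two video tokens per scene.
mask = build_segment_mask(text_lens=[2, 1], video_lens=[2, 2])
```

In this sketch, text tokens of scene 0 can see video tokens of scene 0 but not of scene 1, while video tokens see each other across scenes, which mirrors the abstract's goal of segment-exclusive text conditioning with temporal coherence across visual tokens.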

Scaling Down Text Encoders of Text-to-Image Diffusion Models

Lifu Wang,Daqing Liu,Xinchen Liu,Xiaodong He

Task: Investigate whether diffusion models need large text encoders, and train smaller T5 encoder models via vision-based knowledge distillation.

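Motivation: Although T5-series encoders improve complex prompt understanding and text generation, they have too many parameters, and diffusion models using them do not respond to non-visual prompts, which indicates redundancy.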

Details

Method: Applies vision-based knowledge distillation to train smaller T5 encoder models on a dataset built around image quality, semantic understanding, and text rendering. Result: The distilled T5-base model generates images of quality comparable to T5-XXL while being 50 times smaller, significantly lowering GPU requirements. Conclusion: Vision-based knowledge distillation can effectively shrink the model while retaining high-quality text-to-image generation, making advanced models more accessible. Abstract: Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it also leads to a substantial increase in the number of parameters. Despite T5 series encoders being trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power. Therefore, it raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit its capabilities, we constructed our dataset based on three criteria: image quality, semantic understanding, and text-rendering. Our results demonstrate the scaling down pattern that the distilled T5-base model can generate images of comparable quality to those produced by T5-XXL, while being 50 times smaller in size. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.

CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning

Hao Yu,Zhuokai Zhao,Shen Yan,Lukasz Korycki,Jianyu Wang,Baosheng He,Jiayi Liu,Lizhu Zhang,Xiangjun Fan,Hanchao Yu

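Task: Propose CAFe, a contrastive-autoregressive fine-tuning framework that enhances large vision-language models (LVLMs) on both representation learning and generative tasks.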

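Motivation: Existing LVLMs excel at generative tasks but are limited in tasks that require high-fidelity representation learning (such as image or text embeddings for retrieval), and fine-tuned models often lose their generative capabilities.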

Details

Method: Combines a contrastive objective with autoregressive language modeling to unify representation learning and generation. Result: CAFe achieves state-of-the-art results on multimodal retrieval and multimodal generation benchmarks, including object hallucination (OH) mitigation. Conclusion: CAFe provides a new framework for future multimodal models that excel in both retrieval precision and coherent generation. Abstract: The rapid advancement of large vision-language models (LVLMs) has driven significant progress in multimodal tasks, enabling models to interpret, reason, and generate outputs across both visual and textual domains. While excelling in generative tasks, existing LVLMs often face limitations in tasks requiring high-fidelity representation learning, such as generating image or text embeddings for retrieval. Recent work has proposed finetuning LVLMs for representational learning, but the fine-tuned model often loses its generative capabilities due to the representational learning training paradigm. To address this trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks, including object hallucination (OH) mitigation. CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.
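
The combination of a contrastive objective with autoregressive language modeling can be illustrated with a small sketch. It assumes a one-directional InfoNCE loss over a similarity matrix plus a weighted next-token negative log-likelihood; the function names, the `temperature` and `weight` parameters, and the simple weighted sum are illustrative assumptions, not the paper's stated loss.

```python
import math


def info_nce(sim, temperature=0.07):
    """Contrastive loss over a square similarity matrix.

    Row i is a query; the matching item is assumed to sit on the diagonal.
    """
    loss = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(sim)


def lm_loss(token_probs):
    """Autoregressive loss: mean negative log-probability of target tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def combined_loss(sim, token_probs, weight=1.0, temperature=0.07):
    """Weighted sum of both objectives; `weight` is a hypothetical knob."""
    return info_nce(sim, temperature) + weight * lm_loss(token_probs)


# Toy batch of two image-text pairs and three target-token probabilities.
aligned = [[1.0, 0.0], [0.0, 1.0]]
total = combined_loss(aligned, [0.9, 0.8, 0.95], weight=0.5)
```

Training on a sum of this kind keeps a gradient signal for generation while the embeddings are shaped for retrieval, which is the trade-off the abstract says contrastive-only fine-tuning fails to preserve.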

TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization

Liang Pan,Zeshi Yang,Zhiyang Dou,Wenjia Wang,Buzhen Huang,Bo Dai,Taku Komura,Jingbo Wang

Task: Propose TokenHSI, a unified transformer-based policy for synthesizing diverse and physically plausible human-scene interactions (HSI).

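Motivation: Current methods mainly develop separate controllers, each specialized for a specific interaction task, which makes it hard to tackle complex HSI tasks that require integrating multiple skills.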

Details

Method: Models humanoid proprioception as a shared token and combines it with task tokens via a masking mechanism, enabling multi-skill unification and flexible adaptation. Result: Experiments show that the approach significantly improves versatility, adaptability, and extensibility across various HSI tasks. Conclusion: Through a unified policy and multi-skill integration, TokenHSI effectively addresses the challenges of complex HSI tasks. Abstract: Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks. Website: https://liangpan99.github.io/TokenHSI/

ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models

Fernando Julio Cendra,Kai Han

Task: Propose ICE, a new framework for automatically and systematically extracting intrinsic concepts from a single image.

Motivation: Address the inability of existing methods to reliably extract interpretable intrinsic concepts from a single image.

Details

Method: ICE has two stages: an automatic concept localization module that locates text-based concepts and their masks, and a concept decomposition module that decomposes object-level concepts into intrinsic and general concepts. Result: ICE shows superior performance on unsupervised intrinsic concept extraction from a single image. Conclusion: ICE provides a systematic, automatic method for a more granular and interpretable decomposition of visual elements. Abstract: The inherent ambiguity in defining visual concepts poses significant challenges for modern generative models, such as the diffusion-based Text-to-Image (T2I) models, in accurately learning concepts from a single image. Existing methods lack a systematic way to reliably extract the interpretable underlying intrinsic concepts. To address this challenge, we present ICE, short for Intrinsic Concept Extraction, a novel framework that exclusively utilizes a T2I model to automatically and systematically extract intrinsic concepts from a single image. ICE consists of two pivotal stages. In the first stage, ICE devises an automatic concept localization module to pinpoint relevant text-based concepts and their corresponding masks within the image. This critical stage streamlines concept initialization and provides precise guidance for subsequent analysis. The second stage delves deeper into each identified mask, decomposing the object-level concepts into intrinsic concepts and general concepts. This decomposition allows for a more granular and interpretable breakdown of visual elements. Our framework demonstrates superior performance on intrinsic concept extraction from a single image in an unsupervised manner. Project page: https://visual-ai.github.io/ice

Scaling Vision Pre-Training to 4K Resolution

Baifeng Shi,Boyi Li,Han Cai,Yao Lu,Sifei Liu,Marco Pavone,Jan Kautz,Song Han,Trevor Darrell,Pavlo Molchanov,Hongxu Yin

Task: Scale CLIP-style vision pre-training to 4K resolution at near-constant computational cost.

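Motivation: High-resolution perception of visual details is crucial for daily tasks, but existing pre-training methods are limited to low resolutions because of high computational cost.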

Details

Method: Proposes PS3, which learns high-resolution representations by selectively processing local regions and contrasting them with local detailed captions. Result: PS3 significantly improves high-resolution visual perception, and the resulting multi-modal LLM (MLLM), VILA-HD, outperforms existing methods. Conclusion: PS3 and VILA-HD show significant advantages on high-resolution vision tasks, and the authors propose 4KPro, a new 4K-resolution benchmark. Abstract: High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3 that scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representation, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When applying PS3 to multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training such as AnyRes and S^2 while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to the state of the art, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than latest token pruning approaches. Finally, we find current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and 2.96x speedup over Qwen2-VL.
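
As a rough illustration of selecting local high-resolution regions by saliency, the sketch below picks the top-k patches from a saliency grid. The top-k rule, the function name `select_top_k_patches`, and the toy grid are assumptions for illustration; the abstract does not specify how PS3 performs its selection.

```python
def select_top_k_patches(saliency, k):
    """Return (row, col) indices of the k most salient patches, highest first.

    `saliency` is a 2D grid of scores, one per patch (hypothetical input).
    Ties are broken by row, then column, so the result is deterministic.
    """
    cells = [
        (score, r, c)
        for r, row in enumerate(saliency)
        for c, score in enumerate(row)
    ]
    cells.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in cells[:k]]


# A 2x3 grid of patch saliency scores; only two patches get processed
# at high resolution, so cost depends on k rather than on image size.
saliency = [[0.1, 0.9, 0.2], [0.4, 0.3, 0.8]]
chosen = select_top_k_patches(saliency, k=2)
```

Fixing the number of processed patches is one way cost could stay near constant as resolution grows, which matches the abstract's claim in spirit only.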

Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better

Zihang Lai,Andrea Vedaldi

Task: Propose the Tracktention Layer, a new architectural component that improves temporal consistency in video prediction.

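Motivation: Traditional methods such as temporal attention and 3D convolution struggle with significant object motion and long-range temporal dependencies, so a more effective approach is needed.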

Details

Method: Designs the Tracktention Layer to explicitly integrate motion information through point tracks (sequences of corresponding points across frames), improving temporal alignment. Result: On video depth prediction and video colorization, the Tracktention Layer significantly improves temporal consistency, sometimes outperforming models natively designed for video prediction. Conclusion: The Tracktention Layer is an efficient, easily integrated method that significantly improves temporal consistency in video prediction tasks. Abstract: Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion and may not capture long-range temporal dependencies in dynamic scenes. To address this gap, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks, i.e., sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. It can be used to upgrade image-only models to state-of-the-art video ones, sometimes outperforming models natively designed for video prediction. We demonstrate this on video depth prediction and video colorization, where models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baselines.

AvatarArtist: Open-Domain 4D Avatarization

Hongyu Liu,Xuan Wang,Ziyu Wan,Yue Ma,Jingye Chen,Yanbo Fan,Yujun Shen,Yibing Song,Qifeng Chen

Task: Create a 4D avatar from a portrait image in an arbitrary style.

Motivation: Address the difficulty 4D GANs have with diverse data distributions by using the prior of 2D diffusion models to improve generation quality.

Details

Method: Combines generative adversarial networks (GANs) and diffusion models, using parametric triplanes as the intermediate 4D representation. Result: The proposed AvatarArtist model produces high-quality 4D avatars and is strongly robust to various source image domains. Conclusion: The synergy between GANs and diffusion models yields a general 4D avatar creator. Abstract: This work focuses on open-domain 4D avatarization, with the purpose of creating a 4D avatar from a portrait image in an arbitrary style. We select parametric triplanes as the intermediate 4D representation and propose a practical training paradigm that takes advantage of both generative adversarial networks (GANs) and diffusion models. Our design stems from the observation that 4D GANs excel at bridging images and triplanes without supervision yet usually face challenges in handling diverse data distributions. A robust 2D diffusion prior emerges as the solution, assisting the GAN in transferring its expertise across various domains. The synergy between these experts permits the construction of a multi-domain image-triplane dataset, which drives the development of a general 4D avatar creator. Extensive experiments suggest that our model, AvatarArtist, is capable of producing high-quality 4D avatars with strong robustness to various source image domains. The code, the data, and the models will be made publicly available to facilitate future studies.

FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

Xuan Ju,Weicai Ye,Quande Liu,Qiulin Wang,Xintao Wang,Pengfei Wan,Di Zhang,Kun Gai,Qiang Xu

Task: Develop FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions.

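Motivation: Existing video generation models offer limited fine-grained control, and integrating multiple conditions leads to branch conflicts, parameter redundancy, and suboptimal performance.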

Details

Method: Fuses multi-task conditions through a unified full-attention mechanism, leveraging its long-context learning ability to capture condition dynamics. Result: FullDiT reduces parameter overhead, avoids condition conflicts, and achieves state-of-the-art results in experiments. Conclusion: FullDiT demonstrates the efficacy and scalability of full attention in complex multi-task video generation. Abstract: Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and shows scalability and emergent ability. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full-attention in complex multi-task video generation.

CoLLM: A Large Language Model for Composed Image Retrieval

Chuong Huynh,Jinyu Yang,Ashish Tawari,Mubarak Shah,Son Tran,Raffay Hamid,Trishul Chilimbi,Abhinav Shrivastava

Task: Address data scarcity and complex modification texts in composed image retrieval (CIR).

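Motivation: Existing methods underperform because of scarce data and complex modification texts, so a more efficient solution that needs no manual annotation is required.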

Details

Method: Proposes the CoLLM framework, which uses LLMs to generate joint embeddings and builds triplets on the fly from image-caption pairs; also introduces the MTCIR dataset and refines existing benchmarks. Result: CoLLM achieves state-of-the-art performance on multiple CIR benchmarks, and MTCIR yields up to 15% performance improvement. Conclusion: CoLLM effectively addresses the data and text-complexity challenges of CIR and advances the field. Abstract: Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to 15% performance improvement. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.

Xiang Xu,Lingdong Kong,Hui Shuai,Wenwei Zhang,Liang Pan,Kai Chen,Ziwei Liu,Qingshan Liu

Task: Propose SuperFlow++, a new framework that improves LiDAR representation learning by integrating spatiotemporal cues.

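Motivation: Existing methods focus mainly on spatial alignment between LiDAR and camera sensors and overlook temporal dynamics, which are critical for capturing motion and scene continuity in driving scenarios.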

Details

Method: SuperFlow++ has four key components: a view consistency alignment module, a dense-to-sparse consistency regularization mechanism, a flow-based contrastive learning approach, and a temporal voting strategy. Result: Extensive evaluations on 11 heterogeneous LiDAR datasets show that SuperFlow++ outperforms existing methods across diverse tasks and driving conditions. Conclusion: SuperFlow++ sets a new benchmark for data-efficient LiDAR-based perception in autonomous driving and shows the potential of scalable 3D foundation models. Abstract: LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving. The code is publicly available at https://github.com/Xiangxu-0103/SuperFlow

PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model

Mingju Gao,Yike Pan,Huan-ang Gao,Zongzheng Zhang,Wenyi Li,Hao Dong,Hao Tang,Li Yi,Hao Zhao

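Task: Propose PartRM, a 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object.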

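Motivation: Existing methods such as Puppet-Master rely on large-scale pre-trained video diffusion models, which are impractical because of the limitations of 2D video representation and slow processing.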

Details

Method: PartRM builds on large 3D Gaussian reconstruction models, introduces the PartDrag-4D dataset and a multi-scale drag embedding module, and uses a two-stage training process to avoid catastrophic forgetting. Result: Experiments show that PartRM sets a new state of the art in part-level motion learning and can be applied to robotic manipulation tasks. Conclusion: By combining 4D reconstruction with multi-scale dynamics modeling, PartRM overcomes the limitations of existing methods, and its code, data, and models are publicly available for future research. Abstract: As interest grows in world models that predict future states from current observations and actions, accurately modeling part-level dynamics has become increasingly relevant for various applications. Existing approaches, such as Puppet-Master, rely on fine-tuning large-scale pre-trained video diffusion models, which are impractical for real-world use due to the limitations of 2D video representation and slow processing times. To overcome these challenges, we present PartRM, a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object. PartRM builds upon large 3D Gaussian reconstruction models, leveraging their extensive knowledge of appearance and geometry in static objects. To address data scarcity in 4D, we introduce the PartDrag-4D dataset, providing multi-view observations of part-level dynamics across over 20,000 states. We enhance the model's understanding of interaction conditions with a multi-scale drag embedding module that captures dynamics at varying granularities. To prevent catastrophic forgetting during fine-tuning, we implement a two-stage training process that focuses sequentially on motion and appearance learning. Experimental results show that PartRM establishes a new state-of-the-art in part-level motion learning and can be applied in manipulation tasks in robotics. Our code, data, and models are publicly available to facilitate future research.

Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

Sangwon Beak,Hyeonwoo Kim,Hanbyul Joo

Task: Learn 3D object-object spatial relationships (OOR) from synthetic 3D samples generated with pre-trained 2D diffusion models.

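Motivation: Images synthesized by 2D diffusion models implicitly contain plausible OOR cues, which allows a 3D dataset to be collected efficiently for learning OOR across unbounded object categories.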

Details

Method: Synthesizes diverse images that capture OOR cues, lifts them into 3D samples, and trains a score-based OOR diffusion model to learn the distribution of spatial relationships. Result: Experiments show that the method is robust across various OOR scenarios and applies to real-world 3D scene arrangement tasks. Conclusion: Synthetic data generated with 2D diffusion models makes it possible to learn 3D spatial relationships between objects effectively and to extend them to multi-object scenes. Abstract: We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture plausible and realistic OOR cues, enabling efficient ways to collect a 3D dataset to learn OOR for various unbounded object categories. Our approach begins by synthesizing diverse images that capture plausible OOR cues, which we then uplift into 3D samples. Leveraging our diverse collection of plausible 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. Additionally, we extend our pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations and preventing object collisions. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, along with its applicability to real-world 3D scene arrangement tasks using the OOR diffusion model.

EventFly: Event Camera Perception from Ground to the Sky

Lingdong Kong,Dongyue Lu,Xiang Xu,Lai Xing Ng,Wei Tsang Ooi,Benoit R. Cottereau

Task: Propose EventFly, a framework for cross-platform adaptation in event camera perception.

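Motivation: Address the challenges of deploying event cameras across platforms (such as vehicles, drones, and quadrupeds) that differ in motion dynamics, viewpoints, and class distributions.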

Details

Method: Comprises the Event Activation Prior (EAP), the EventBlend data-mixing strategy, and the EventMatch dual-discriminator technique. Result: Performs strongly on the EXPo benchmark, significantly outperforming existing adaptation methods. Conclusion: EventFly provides a highly adaptive, high-performing solution for event perception in complex environments. Abstract: Cross-platform adaptation in event-based dense perception is crucial for deploying event cameras across diverse settings, such as vehicles, drones, and quadrupeds, each with unique motion dynamics, viewpoints, and class distributions. In this work, we introduce EventFly, a framework for robust cross-platform adaptation in event camera perception. Our approach comprises three key components: i) Event Activation Prior (EAP), which identifies high-activation regions in the target domain to minimize prediction entropy, fostering confident, domain-adaptive predictions; ii) EventBlend, a data-mixing strategy that integrates source and target event voxel grids based on EAP-driven similarity and density maps, enhancing feature alignment; and iii) EventMatch, a dual-discriminator technique that aligns features from source, target, and blended domains for better domain-invariant learning. To holistically assess cross-platform adaptation abilities, we introduce EXPo, a large-scale benchmark with diverse samples across vehicle, drone, and quadruped platforms. Extensive experiments validate our effectiveness, demonstrating substantial gains over popular adaptation methods. We hope this work can pave the way for more adaptive, high-performing event perception across diverse and complex environments.

Is there a future for AI without representation?

Vincent C. Müller

Task: Examine the feasibility of AI without representation, in particular the proposals of Rodney Brooks.

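Motivation: Traditional AI holds that representation is necessary for intelligence, but recent cognitive science suggests that intelligent agents need not be central representation processors.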

Details

Method: Analyzes Brooks' approach to intelligent agents without central control and compares it with traditional AI's need for representation. Result: Brooks' proposal for non-centralized cognition without representation is promising for full-blown intelligent agents, but not for conscious, human-like AI. Conclusion: AI without representation is promising under a non-centralized control paradigm, but its applicability to intelligent agents must be distinguished from its applicability to human-like AI. Abstract: This paper investigates the prospects of AI without representation in general, and the proposals of Rodney Brooks in particular. What turns out to be characteristic of Brooks' proposal is the rejection of central control in intelligent agents; his systems have as much or as little representation as traditional AI. The traditional view that representation is necessary for intelligence presupposes that intelligence requires central control. However, much of recent cognitive science suggests that we should dispose of the image of intelligent agents as central representation processors. If this paradigm shift is achieved, Brooks' proposal for non-centralized cognition without representation appears promising for full-blown intelligent agents - though not for conscious agents and thus not for human-like AI.

Automated diagnosis of lung diseases using vision transformer: a comparative study on chest x-ray classification

Muhammad Ahmad,Sardar Usman,Ildar Batyrshin,Muhammad Muzammil,K. Sajid,M. Hasnain,Muhammad Jalal,Grigori Sidorov

Task: Classify chest X-ray images with deep learning models to diagnose lung diseases.

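Motivation: Lung disease is a global health problem for which early, accurate diagnosis is crucial, and radiographs are an important diagnostic tool.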

Details

Method: Applies five pre-trained deep learning models and two transfer learning algorithms to classify 3,475 chest X-ray images. Result: The Vision Transformer (ViT) achieves 99% accuracy in binary classification and 95.25% in multiclass classification. Conclusion: The approach can reduce reliance on human intervention and increase automation in lung disease diagnosis. Abstract: Background: Lung disease is a significant health issue, particularly in children and elderly individuals. It often results from lung infections and is one of the leading causes of mortality in children. Globally, lung-related diseases claim many lives each year, making early and accurate diagnoses crucial. Radiographs are valuable tools for the diagnosis of such conditions. The most prevalent lung diseases, including pneumonia, asthma, allergies, chronic obstructive pulmonary disease (COPD), bronchitis, emphysema, and lung cancer, represent significant public health challenges. Early prediction of these conditions is critical, as it allows for the identification of risk factors and implementation of preventive measures to reduce the likelihood of disease onset. Methods: In this study, we utilized a dataset comprising 3,475 chest X-ray images sourced from Mendeley Data provided by Talukder, M. A. (2023) [14], categorized into three classes: normal, lung opacity, and pneumonia. We applied five pre-trained deep learning models, including CNN, ResNet50, DenseNet, CheXNet, and U-Net, as well as two transfer learning algorithms such as Vision Transformer (ViT) and Shifted Window (Swin) to classify these images. This approach aims to address diagnostic issues in lung abnormalities by reducing reliance on human intervention through automated classification systems. Our analysis was conducted in both binary and multiclass settings. Results: In the binary classification, we focused on distinguishing between normal and viral pneumonia cases, whereas in the multi-class classification, all three classes (normal, lung opacity, and viral pneumonia) were included. Our proposed methodology (ViT) achieved remarkable performance, with accuracy rates of 99% for binary classification and 95.25% for multiclass classification.

LoRA Subtraction for Drift-Resistant Space in Exemplar-Free Continual Learning

Xuan Liu,Xiaobin Chang

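Task: Propose a method called Drift-Resistant Space (DRS) to address feature drift in exemplar-free continual learning.

Motivation: In exemplar-free continual learning, feature drift causes catastrophic forgetting; existing methods rely on static features or outdated statistics and cannot capture the dynamic evolution of the feature space.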

Details

Method: Propose Low-Rank Adaptation Subtraction (LoRA-), which builds the DRS by subtracting the LoRA weights of old tasks from the pre-trained weights, combined with a triplet loss to improve plasticity. Result: The method achieves state-of-the-art results on multiple datasets, especially for long task sequences. Conclusion: Through LoRA-, DRS handles feature drift effectively, improving model stability and efficiency while simplifying implementation. Abstract: In continual learning (CL), catastrophic forgetting often arises due to feature drift. This challenge is particularly prominent in the exemplar-free continual learning (EFCL) setting, where samples from previous tasks cannot be retained, making it difficult to preserve prior knowledge. To address this issue, some EFCL methods aim to identify feature spaces that minimize the impact on previous tasks while accommodating new ones. However, they rely on static features or outdated statistics stored from old tasks, which prevents them from capturing the dynamic evolution of the feature space in CL, leading to performance degradation over time. In this paper, we introduce the Drift-Resistant Space (DRS), which effectively handles feature drifts without requiring explicit feature modeling or the storage of previous tasks. A novel parameter-efficient fine-tuning approach called Low-Rank Adaptation Subtraction (LoRA-) is proposed to develop the DRS. This method subtracts the LoRA weights of old tasks from the initial pre-trained weight before processing new task data to establish the DRS for model training. Therefore, LoRA- enhances stability, improves efficiency, and simplifies implementation. Furthermore, stabilizing feature drifts allows for better plasticity by learning with a triplet loss. Our method consistently achieves state-of-the-art results, especially for long task sequences, across multiple datasets.
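The subtraction step described in the abstract can be illustrated with a minimal sketch. It assumes each old task contributes a standard LoRA update of the form B·A and that the drift-resistant weight is the pre-trained weight minus the sum of those updates; the function names and toy values are hypothetical, and this is not the authors' implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_subtract(w0, old_adapters):
    """Return W0 minus the low-rank updates B @ A of all old tasks (assumed form)."""
    w = [row[:] for row in w0]  # copy so the pre-trained weight is left untouched
    for b, a in old_adapters:
        delta = matmul(b, a)  # rank-r update learned on an old task
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                w[i][j] -= v
    return w


# Toy 2x2 example with two rank-1 adapters (values are illustrative only).
W0 = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0], [0.0]], [[0.5, 0.0]]),   # B1 (2x1), A1 (1x2)
    ([[0.0], [1.0]], [[0.0, 0.25]]),  # B2 (2x1), A2 (1x2)
]
W_drs = lora_subtract(W0, adapters)
```

In this reading, training on the new task would start from `W_drs` rather than from `W0`, which is how the abstract describes establishing the DRS.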

FACE: Few-shot Adapter with Cross-view Fusion for Cross-subject EEG Emotion Recognition

Haiqi Liu,C. L. Philip Chen,Tong Zhang

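Task: Propose a few-shot adapter method called FACE for cross-subject EEG emotion recognition.

Motivation: Existing methods for cross-subject EEG emotion recognition face large inter-subject variability and complex intra-subject variability, and they typically require large amounts of target-subject data or generalize poorly.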

Details

Method: FACE combines dynamic multi-view fusion with a few-shot adapter module, dynamically fusing global brain connectivity with local patterns and enhancing the adapter structure with meta-learning. Result: On three public EEG emotion recognition benchmarks, FACE generalizes better than existing methods. Conclusion: FACE offers a practical solution for cross-subject scenarios with limited labeled data. Abstract: Cross-subject EEG emotion recognition is challenged by significant inter-subject variability and intricately entangled intra-subject variability. Existing works have primarily addressed these challenges through domain adaptation or generalization strategies. However, they typically require extensive target subject data or demonstrate limited generalization performance to unseen subjects. Recent few-shot learning paradigms attempt to address these limitations but often encounter catastrophic overfitting during subject-specific adaptation with limited samples. This article introduces the few-shot adapter with a cross-view fusion method called FACE for cross-subject EEG emotion recognition, which leverages dynamic multi-view fusion and effective subject-specific adaptation. Specifically, FACE incorporates a cross-view fusion module that dynamically integrates global brain connectivity with localized patterns via subject-specific fusion weights to provide complementary emotional information. Moreover, the few-shot adapter module is proposed to enable rapid adaptation for unseen subjects while reducing overfitting by enhancing adapter structures with meta-learning. Experimental results on three public EEG emotion recognition benchmarks demonstrate FACE's superior generalization performance over state-of-the-art methods. FACE provides a practical solution for cross-subject scenarios with limited labeled data.

Abdul Qayyum,Moona Mazher,Devran Ugurlu,Jose Alonso Solis Lemus,Cristobal Rodero,Steven A Niederer

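Task: Propose a foundation model based on a self-supervised learning framework for whole-heart segmentation, addressing modality-specific bias and the need for large amounts of labeled data.

Motivation: Existing methods for whole-heart segmentation from CT and MRI scans suffer from modality-specific bias and require large amounts of labeled data.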

Details

Method: A self-supervised learning framework built on a student-teacher architecture uses an xLSTM backbone to capture long-range spatial dependencies and complex anatomical structures in 3D medical images, with multi-modal pretraining to improve generalization. Result: The model performs well on CT and MRI datasets with few labels, confirming its robustness and adaptability. Conclusion: The model has potential for automated whole-heart segmentation in medical imaging. Abstract: Whole-heart segmentation from CT and MRI scans is crucial for cardiovascular disease analysis, yet existing methods struggle with modality-specific biases and the need for extensive labeled datasets. To address these challenges, we propose a foundation model for whole-heart segmentation using a self-supervised learning (SSL) framework based on a student-teacher architecture. Our model is pretrained on a large, unlabeled dataset of CT and MRI scans, leveraging the xLSTM backbone to capture long-range spatial dependencies and complex anatomical structures in 3D medical images. By incorporating multi-modal pretraining, our approach ensures strong generalization across both CT and MRI modalities, mitigating modality-specific variations and improving segmentation accuracy in diverse clinical settings. The use of large-scale unlabeled data significantly reduces the dependency on manual annotations, enabling robust performance even with limited labeled data. We further introduce an xLSTM-UNet-based architecture for downstream whole-heart segmentation tasks, demonstrating its effectiveness on few-label CT and MRI datasets. Our results validate the robustness and adaptability of the proposed model, highlighting its potential for advancing automated whole-heart segmentation in medical imaging.

LookAhead Tuning: Safer Language Models via Partial Answer Previews

Kangwei Liu,Mengru Wang,Yujie Luo,Lin Yuan,Mengshu Sun,Ningyu Zhang,Lei Liang,Zhiqiang Zhang,Jun Zhou,Huajun Chen

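Task: Propose LookAhead Tuning to preserve the safety of large language models during fine-tuning.

Motivation: Fine-tuning weakens a model's safety alignment, so a method is needed to prevent this degradation.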

Details

Method: Training data is modified by previewing partial answer prefixes, using two simple, low-resource, and effective data-driven methods. Result: LookAhead Tuning preserves model safety without sacrificing downstream task performance. Conclusion: LookAhead Tuning is a reliable solution for the safe and efficient adaptation of large language models. Abstract: Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often undermines their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes. Both methods aim to preserve the model's inherent safety mechanisms by minimizing perturbations to initial token distributions. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs. Code is released at https://github.com/zjunlp/LookAheadTuning.

3D Structural Phenotype of the Optic Nerve Head at the Intersection of Glaucoma and Myopia - A Key to Improving Glaucoma Diagnosis in Myopic Populations

Swati Sharma,Fabian A. Braeu,Thanadet Chuangsuwanich,Tin A. Tun,Quan V Hoang,Rachel Chong,Shamira Perera,Ching-Lin Ho,Rahat Husain,Martin L. Buist,Tin Aung,Michaël J. A. Girard

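Task: Characterize the 3D structural phenotypes of the optic nerve head (ONH) in patients with glaucoma, high myopia, and concurrent high myopia and glaucoma, and evaluate their variations across these conditions.

Motivation: Study the structural differences of the ONH across ocular disease states to provide a basis for clinical diagnosis.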

Details

Method: Using optical coherence tomography (OCT) data, the retinal and connective tissues are segmented, their boundaries are converted into 3D point clouds, and an ensemble network is developed for classification. Result: The classification network achieves high accuracy on an independent test set (micro-average AUC 0.92 ± 0.03), the decoder reconstructs point clouds effectively (Chamfer loss 0.013 ± 0.002), and significant structural differences are revealed among the four ONH conditions. Conclusion: The ONH has distinct structural signatures across disease states, its morphology carries enough information for classification, and principal component analysis captures the structural patterns of each group. Abstract: Purpose: To characterize the 3D structural phenotypes of the optic nerve head (ONH) in patients with glaucoma, high myopia, and concurrent high myopia and glaucoma, and to evaluate their variations across these conditions. Participants: A total of 685 optical coherence tomography (OCT) scans from 754 subjects of Singapore-Chinese ethnicity, including 256 healthy (H), 94 highly myopic (HM), 227 glaucomatous (G), and 108 highly myopic with glaucoma (HMG) cases. Methods: We segmented the retinal and connective tissues from OCT volumes and converted their boundary edges into 3D point clouds. To classify the 3D point clouds into four ONH conditions, i.e., H, HM, G, and HMG, a specialized ensemble network was developed, consisting of an encoder to transform high-dimensional input data into a compressed latent vector, a decoder to reconstruct point clouds from the latent vector, and a classifier to categorize the point clouds into the four ONH conditions. Results: The classification network achieved high accuracy, distinguishing H, HM, G, and HMG classes with a micro-average AUC of 0.92 $\pm$ 0.03 on an independent test set. The decoder effectively reconstructed point clouds, achieving a Chamfer loss of 0.013 $\pm$ 0.002. Dimensionality reduction clustered ONHs into four distinct groups, revealing structural variations such as changes in retinal and connective tissue thickness, tilting and stretching of the disc and scleral canal opening, and alterations in optic cup morphology, including shallow or deep excavation, across the four conditions. Conclusions: This study demonstrated that ONHs exhibit distinct structural signatures across H, HM, G, and HMG conditions. The findings further indicate that ONH morphology provides sufficient information for classification into distinct clusters, with principal components capturing unique structural patterns within each group.
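For readers unfamiliar with the Chamfer loss reported for the decoder, the sketch below shows one common definition: the mean squared nearest-neighbour distance from each point set to the other, summed over both directions. The abstract does not state which normalization the authors use, so this variant and the toy point sets are assumptions.

```python
def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two point sets (lists of xyz tuples)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target point.
        return sum(min(sq_dist(s, d) for d in dst) for s in src) / len(src)

    return one_way(p, q) + one_way(q, p)


# Toy point clouds: the second point of the reconstruction is displaced along z.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
loss = chamfer_distance(original, reconstructed)
```

A perfect reconstruction gives a distance of zero, so smaller values indicate that the decoded point cloud lies closer to the input.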

TrackRAD2025 challenge dataset: Real-time tumor tracking for MRI-guided radiotherapy

Yiling Wang,Elia Lombardo,Adrian Thummerer,Tom Blöcker,Yu Fan,Yue Zhao,Christianna Iris Papadopoulou,Coen Hurkmans,Rob H. N. Tijssen,Pia A. W. Görts,Shyama U. Tetar,Davide Cusumano,Martijn P. W. Intven,Pim Borman,Marco Riboldi,Denis Dudáš,Hilary Byrne,Lorenzo Placidi,Marco Fusella,Michael Jameson,Miguel Palacios,Paul Cobussen,Tobias Finazzi,Cornelis J. A. Haasbeek,Paul Keall,Christopher Kurz,Guillaume Landry,Matteo Maspero

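Task: Present a multi-institutional real-time MRI time series dataset to support the development and evaluation of real-time tumor localization algorithms for MRI-guided radiotherapy.

Motivation: Real-time motion management is crucial for cancer treatment in MRI-guided radiotherapy, but public datasets to support algorithm development are lacking.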

Details

Method: Sagittal 2D cine MRI data were collected from 585 patients at six centers across different MRI-linac systems, with manual segmentation for a subset of cases. Result: The dataset comprises a public training set of 527 cases (477 unlabeled, 50 labeled) and a private test set of 58 cases (all labeled), and it is publicly available. Conclusion: The dataset is expected to advance real-time tumor localization algorithms for MRI-guided radiotherapy and improve the accuracy of motion management and adaptive treatment strategies. Abstract: Purpose: Magnetic resonance imaging (MRI) to visualize anatomical motion is becoming increasingly important when treating cancer patients with radiotherapy. Hybrid MRI-linear accelerator (MRI-linac) systems allow real-time motion management during irradiation. This paper presents a multi-institutional real-time MRI time series dataset from different MRI-linac vendors. The dataset is designed to support developing and evaluating real-time tumor localization (tracking) algorithms for MRI-guided radiotherapy within the TrackRAD2025 challenge (https://trackrad2025.grand-challenge.org/). Acquisition and validation methods: The dataset consists of sagittal 2D cine MRIs in 585 patients from six centers (3 Dutch, 1 German, 1 Australian, and 1 Chinese). Tumors in the thorax, abdomen, and pelvis acquired on two commercially available MRI-linacs (0.35 T and 1.5 T) were included. For 108 cases, irradiation targets or tracking surrogates were manually segmented on each temporal frame. The dataset was randomly split into a public training set of 527 cases (477 unlabeled and 50 labeled) and a private testing set of 58 cases (all labeled). Data Format and Usage Notes: The data is publicly available under the TrackRAD2025 collection: https://doi.org/10.57967/hf/4539. Both the images and segmentations for each patient are available in metadata format. Potential Applications: This novel clinical dataset will enable the development and evaluation of real-time tumor localization algorithms for MRI-guided radiotherapy. By enabling more accurate motion management and adaptive treatment strategies, this dataset has the potential to advance the field of radiotherapy significantly.

Stochastic Poisson Surface Reconstruction with One Solve using Geometric Gaussian Processes

Sidhanth Holalkere,David S. Bindel,Silvia Sellán,Alexander Terenin

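Task: Combine Gaussian process models with surface reconstruction to achieve single-stage probabilistic surface reconstruction.

Motivation: Existing methods require a two-stage computation (Gaussian process interpolation followed by a global solve of partial differential equations), which is computationally expensive and inflexible.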

Details

Method: Techniques from geometric Gaussian processes merge interpolation and surface reconstruction into a single stage that requires only one linear solve per sample. Result: The approach enables local spatial queries, supporting probabilistic collision detection, ray casting, and per-slice next-view planning, and it improves reconstruction quality. Conclusion: The method provides a cleaner, more principled, and more flexible probabilistic surface reconstruction pipeline. Abstract: Poisson Surface Reconstruction is a widely used algorithm for reconstructing a surface from an oriented point cloud. To facilitate applications where only partial surface information is available, or scanning is performed sequentially, a recent line of work proposes to incorporate uncertainty into the reconstructed surface via Gaussian process models. The resulting algorithms first perform Gaussian process interpolation, then solve a set of volumetric partial differential equations globally in space, resulting in a computationally expensive two-stage procedure. In this work, we apply recently developed techniques from geometric Gaussian processes to combine interpolation and surface reconstruction into a single stage, requiring only one linear solve per sample. The resulting reconstructed surface samples can be queried locally in space, without the use of problem-dependent volumetric meshes or grids. These capabilities enable one to (a) perform probabilistic collision detection locally around the region of interest, (b) perform ray casting without evaluating points not on the ray's trajectory, and (c) perform next-view planning on a per-slice basis. They also improve reconstruction quality, by not requiring one to approximate kernel matrix inverses with diagonal matrices as part of intermediate computations. Results show that our approach provides a cleaner, more principled, and more flexible stochastic surface reconstruction pipeline.

Risk-Based Thresholding for Reliable Anomaly Detection in Concentrated Solar Power Plants

Yorick Estievenart,Sukanya Patra,Souhaib Ben Taieb

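Task: Propose a framework that generates more reliable decision thresholds for anomaly detection on infrared images of Concentrated Solar Power (CSP) plants.

Motivation: High-temperature solar receivers in CSP plants face severe operational risks such as freezing, deformation, and corrosion, which lead to costly downtime and maintenance.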

Details

Method: Propose a framework that combines finite-sample coverage guarantees with an abstention mechanism, and use a density forecasting method to estimate anomaly scores. Result: The framework is validated in real deployments at two CSP plants, and the analysis gives the industry partner insights for optimizing maintenance operations. Conclusion: By generating reliable decision thresholds and providing a simulated dataset, the framework offers a practical solution for anomaly detection in CSP plants. Abstract: Efficient and reliable operation of Concentrated Solar Power (CSP) plants is essential for meeting the growing demand for sustainable energy. However, high-temperature solar receivers face severe operational risks, such as freezing, deformation, and corrosion, resulting in costly downtime and maintenance. To monitor CSP plants, cameras mounted on solar receivers record infrared images at irregular intervals ranging from one to five minutes throughout the day. Anomalous images can be detected by thresholding an anomaly score, where the threshold is chosen to optimize metrics such as the F1-score on a validation set. This work proposes a framework for generating more reliable decision thresholds with finite-sample coverage guarantees on any chosen risk function. Our framework also incorporates an abstention mechanism, allowing high-risk predictions to be deferred to domain experts. Second, we propose a density forecasting method to estimate the likelihood of an observed image given a sequence of previously observed images, using this likelihood as its anomaly score. Third, we analyze the deployment results of our framework across multiple training scenarios over several months for two CSP plants. This analysis provides valuable insights to our industry partner for optimizing maintenance operations. Finally, given the confidential nature of our dataset, we provide an extended simulated dataset, leveraging recent advancements in generative modeling to create diverse thermal images that simulate multiple CSP plants. Our code is publicly available.
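To give a feel for what a threshold with a finite-sample guarantee looks like, the sketch below uses the standard split-conformal quantile on anomaly scores of known-normal calibration images. This is a simplified special case (it bounds only the false-alarm rate, under an exchangeability assumption) and is not the paper's procedure, which targets arbitrary risk functions. The margin-based abstention rule, function names, and scores are hypothetical.

```python
import math


def conformal_threshold(normal_scores, alpha):
    """Threshold whose false-alarm rate on a new normal image is at most alpha,
    assuming calibration and test scores are exchangeable."""
    n = len(normal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(normal_scores)[k - 1]


def decide(score, threshold, margin):
    """Flag, pass, or abstain when the score is within `margin` of the threshold."""
    if abs(score - threshold) <= margin:
        return "abstain"  # hypothetical rule: defer borderline cases to an expert
    return "anomaly" if score > threshold else "normal"


# Illustrative anomaly scores of seven images known to be normal.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
tau = conformal_threshold(calibration, alpha=0.25)
```

The threshold is simply an order statistic of the calibration scores, so a stricter `alpha` needs more calibration images before a finite threshold exists.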

Out-of-distribution evaluations of channel agnostic masked autoencoders in fluorescence microscopy

Christian John Hurry,Jinjie Zhang,Olubukola Ishola,Emma Slade,Cuong Q. Nguyen

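Task: Propose an evaluation scheme and a channel-agnostic masked autoencoder (Campfire) to address distribution shift for computer vision models in high-content screening.

Motivation: Changes in experimental conditions, perturbagens, and fluorescent markers cause distribution shift in high-content screening; conventional transfer learning evaluations cannot separate the different sources of shift, which limits understanding of how model design and training affect generalization.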

Details

Method: The JUMP-CP dataset is used to isolate sources of distribution shift, and a channel-agnostic masked autoencoder, Campfire, is proposed that uses a shared decoder to scale to datasets with many fluorescent markers. Result: Campfire performs well on out-of-distribution experimental batches, perturbagens, and fluorescent markers, and transfers successfully from one cell type to another. Conclusion: The proposed evaluation scheme and the Campfire model address distribution shift in high-content screening and improve generalization. Abstract: Developing computer vision for high-content screening is challenging due to various sources of distribution-shift caused by changes in experimental conditions, perturbagens, and fluorescent markers. The impacts of different sources of distribution-shift are confounded in typical evaluations of models based on transfer learning, which limits interpretations of how changes to model design and training affect generalisation. We propose an evaluation scheme that isolates sources of distribution-shift using the JUMP-CP dataset, allowing researchers to evaluate generalisation with respect to specific sources of distribution-shift. We then present a channel-agnostic masked autoencoder $\mathbf{Campfire}$ which, via a shared decoder for all channels, scales effectively to datasets containing many different fluorescent markers, and show that it generalises to out-of-distribution experimental batches, perturbagens, and fluorescent markers, and also demonstrates successful transfer learning from one cell type to another.

PSO-UNet: Particle Swarm-Optimized U-Net Framework for Precise Multimodal Brain Tumor Segmentation

Shoffan Saifullah,Rafał Dreżewski

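Task: Develop PSO-UNet, a dynamic hyperparameter optimization method that combines Particle Swarm Optimization (PSO) with the U-Net architecture for brain tumor segmentation in medical images.

Motivation: Because of the complexity of multimodal MRI datasets and the diversity of tumor morphologies, brain tumor segmentation requires models that are both precise and computationally efficient.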

Details

Method: Particle Swarm Optimization (PSO) is combined with the U-Net architecture to optimize hyperparameters dynamically (number of filters, kernel size, and learning rate). Result: On the BraTS 2021 and Figshare datasets, PSO-UNet achieves Dice Similarity Coefficients (DSC) of 0.9578 and 0.9523 and IoU scores of 0.9194 and 0.9097, respectively, with significantly lower computational complexity (only 7.8 million parameters and a runtime of about 906 seconds). Conclusion: PSO-UNet generalizes well across multimodal MRI and tumor classes and shows clinical potential; future work will explore hybrid optimization strategies to further improve its robustness and scalability. Abstract: Medical image segmentation, particularly for brain tumor analysis, demands precise and computationally efficient models due to the complexity of multimodal MRI datasets and diverse tumor morphologies. This study introduces PSO-UNet, which integrates Particle Swarm Optimization (PSO) with the U-Net architecture for dynamic hyperparameter optimization. Unlike traditional manual tuning or alternative optimization approaches, PSO effectively navigates complex hyperparameter search spaces, explicitly optimizing the number of filters, kernel size, and learning rate. PSO-UNet substantially enhances segmentation performance, achieving Dice Similarity Coefficients (DSC) of 0.9578 and 0.9523 and Intersection over Union (IoU) scores of 0.9194 and 0.9097 on the BraTS 2021 and Figshare datasets, respectively. Moreover, the method reduces computational complexity significantly, utilizing only 7.8 million parameters and executing in approximately 906 seconds, markedly faster than comparable U-Net-based frameworks. These outcomes underscore PSO-UNet's robust generalization capabilities across diverse MRI modalities and tumor classifications, emphasizing its clinical potential and clear advantages over conventional hyperparameter tuning methods. Future research will explore hybrid optimization strategies and validate the framework against other bio-inspired algorithms to enhance its robustness and scalability.
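The hyperparameter search loop can be sketched with a textbook particle swarm optimizer. The code below is a generic PSO under stated assumptions, not the PSO-UNet implementation: the inertia and attraction weights are common defaults, only two hyperparameters are searched, and `toy_validation_loss` is a hypothetical stand-in for training a U-Net and returning 1 minus its Dice score.

```python
import random


def pso_minimize(objective, bounds, n_particles=10, n_iters=30, seed=0):
    """Minimal particle swarm optimizer; returns (best_position, best_value, history)."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and attraction weights (common defaults)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * len(bounds) for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    history = [gbest_val]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp to the search box so hyperparameters stay valid.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


def toy_validation_loss(h):
    """Hypothetical objective with its minimum at 32 filters and log10(lr) = -3."""
    n_filters, log_lr = h
    return ((n_filters - 32) / 32) ** 2 + (log_lr + 3) ** 2


best, best_val, history = pso_minimize(toy_validation_loss, [(8, 64), (-5, -1)])
```

In a real search each objective call is a full training run, so the swarm size and iteration count directly set the compute budget; integer-valued hyperparameters such as kernel size would also need rounding before use.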

HoGS: Unified Near and Far Object Reconstruction via Homogeneous Gaussian Splatting

Xinpeng Liu,Zeyi Huang,Fumio Okura,Yasuyuki Matsushita

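Task: Propose Homogeneous Gaussian Splatting (HoGS), a Gaussian splatting method based on homogeneous coordinates, to improve the rendering accuracy of distant objects in unbounded outdoor environments.

Motivation: Existing 3D Gaussian splatting (3DGS) relies on Cartesian coordinates, which limits rendering performance on distant objects; homogeneous coordinates from projective geometry can substantially improve this.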

Details

Method: Homogeneous coordinates are introduced into the 3DGS framework as Homogeneous Gaussian Splatting (HoGS), which uses projective geometry principles to represent near and distant objects in a unified way. Result: HoGS significantly improves reconstruction accuracy for distant objects while maintaining high-quality rendering of nearby objects, with fast training and real-time rendering. Conclusion: By introducing homogeneous coordinates, HoGS provides an efficient and unified solution for 3D reconstruction in unbounded outdoor environments. Abstract: Novel view synthesis has demonstrated impressive progress recently, with 3D Gaussian splatting (3DGS) offering efficient training time and photorealistic real-time rendering. However, reliance on Cartesian coordinates limits 3DGS's performance on distant objects, which is important for reconstructing unbounded outdoor environments. We found that, despite its ultimate simplicity, using homogeneous coordinates, a concept from projective geometry, for the 3DGS pipeline remarkably improves the rendering accuracy of distant objects. We therefore propose Homogeneous Gaussian Splatting (HoGS) incorporating homogeneous coordinates into the 3DGS framework, providing a unified representation for enhancing near and distant objects. HoGS effectively manages both expansive spatial positions and scales particularly in outdoor unbounded environments by adopting projective geometry principles. Experiments show that HoGS significantly enhances accuracy in reconstructing distant objects while maintaining high-quality rendering of nearby objects, along with fast training speed and real-time rendering capability. Our implementations are available on our project page https://kh129.github.io/hogs/.
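A small sketch of why homogeneous coordinates help with distant points: a point (x, y, z, w) maps to the Cartesian point (x/w, y/w, z/w), so arbitrarily far positions correspond to w approaching zero while all four stored numbers stay in a bounded range. This shows only the standard projective-geometry conversion; how HoGS parameterizes and optimizes Gaussians is not described in the abstract, and the function name is hypothetical.

```python
def homogeneous_to_cartesian(p):
    """Map a homogeneous point (x, y, z, w) with w != 0 to Cartesian (x/w, y/w, z/w)."""
    x, y, z, w = p
    return (x / w, y / w, z / w)


# The same direction (1, 2, 2) at increasing distance: only w changes, and it
# stays bounded while the Cartesian coordinates grow without bound.
near = homogeneous_to_cartesian((1.0, 2.0, 2.0, 1.0))
far = homogeneous_to_cartesian((1.0, 2.0, 2.0, 0.001))
```

Shrinking w by a factor of 1000 moves the point 1000 times farther along the same ray, which is the sense in which one representation covers both near and far geometry.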

Limited-angle x-ray nano-tomography with machine-learning enabled iterative reconstruction engine

Chonghang Zhao,Mingyuan Ge,Xiaogang Yang,Yong S. Chu,Hanfei Yan

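Task: Address the 'missing wedge' problem in tomography by using a convolutional neural network with perceptional knowledge as a regularizer.

Motivation: Geometrical constraints make the acquisition of projection images incomplete, which causes significant artifacts and poor resolution in the reconstructed image.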

Details

Method: Propose the Perception Fused Iterative Tomography Reconstruction Engine, which integrates a CNN with perceptional knowledge into an iterative solving engine and optimizes the solution with the Alternating Direction Method of Multipliers. Result: The method is validated on datasets from several x-ray microscopy techniques and significantly improves reconstruction quality, even with a missing wedge of over 100 degrees. Conclusion: The method is robust and general in addressing common challenges in 3D x-ray imaging. Abstract: A long-standing challenge in tomography is the 'missing wedge' problem, which arises when the acquisition of projection images within a certain angular range is restricted due to geometrical constraints. This incomplete dataset results in significant artifacts and poor resolution in the reconstructed image. To tackle this challenge, we propose an approach dubbed Perception Fused Iterative Tomography Reconstruction Engine, which integrates a convolutional neural network (CNN) with perceptional knowledge as a smart regularizer into an iterative solving engine. We employ the Alternating Direction Method of Multipliers to optimize the solution in both physics and image domains, thereby achieving a physically coherent and visually enhanced result. We demonstrate the effectiveness of the proposed approach using various experimental datasets obtained with different x-ray microscopy techniques. All show significantly improved reconstruction even with a missing wedge of over 100 degrees - a scenario where conventional methods fail. Notably, it also improves the reconstruction in the case of sparse projections, despite the network not being specifically trained for that. This demonstrates the robustness and generality of our method in addressing commonly occurring challenges in 3D x-ray imaging applications for real-world problems.

$L^2$FMamba: Lightweight Light Field Image Super-Resolution with State Space Model

Zeqiang Wei,Kai Jin,Zeyi Hou,Kuan Song,Xiuzhuang Zhou

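Task: Improve performance on the light field image super-resolution task.

Motivation: Transformers perform well on light field image super-resolution, but the high computational complexity of their core self-attention mechanism hinders further progress.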

Details

Method: Propose the LF-VSSM block to efficiently capture long-range spatial-angular dependencies in light field images, and build the lightweight network $L^2$FMamba on top of it. Result: On multiple light field datasets, the method reduces the number of parameters and the computational complexity while achieving better super-resolution performance and faster inference. Conclusion: LF-VSSM and $L^2$FMamba address the computational complexity of Transformers and improve light field image super-resolution. Abstract: Transformers bring significantly improved performance to the light field image super-resolution task due to their long-range dependency modeling capability. However, the inherently high computational complexity of their core self-attention mechanism has increasingly hindered their advancement in this task. To address this issue, we first introduce the LF-VSSM block, a novel module inspired by progressive feature extraction, to efficiently capture critical long-range spatial-angular dependencies in light field images. LF-VSSM successively extracts spatial features within sub-aperture images, spatial-angular features between sub-aperture images, and spatial-angular features between light field image pixels. On this basis, we propose a lightweight network, $L^2$FMamba (Lightweight Light Field Mamba), which integrates the LF-VSSM block to leverage light field features for super-resolution tasks while overcoming the computational challenges of Transformer-based approaches. Extensive experiments on multiple light field datasets demonstrate that our method reduces the number of parameters and complexity while achieving superior super-resolution performance with faster inference speed.

MARS: Memory-Enhanced Agents with Reflective Self-improvement

Xuechen Liang,Meiling Tao,Yinghui Xia,Jianhui Wang,Kun Li,Yijin Wang,Jingsong Yang,Tianyu Shi,Yuantao Wang,Miao Zhang,Xueqian Wang

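Task: Propose a framework called MARS to address the challenges large language models face in dynamic environments: continuous decision-making, lack of long-term memory, and limited context windows.

Motivation: Large language models have made significant advances in natural language processing, but they still struggle with continuous decision-making, lack of long-term memory, and limited context windows in dynamic environments.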

Details

Method: The MARS framework consists of three agents (User, Assistant, and Checker) and combines iterative feedback, reflective mechanisms, and a memory optimization mechanism based on the Ebbinghaus forgetting curve. Result: The agents' capabilities in multi-tasking and long-span information handling are significantly improved. Conclusion: Through its design and mechanisms, the MARS framework strengthens the performance of large language models in dynamic environments. Abstract: Large language models (LLMs) have made significant advances in the field of natural language processing, but they still face challenges such as continuous decision-making, lack of long-term memory, and limited context windows in dynamic environments. To address these issues, this paper proposes an innovative framework, Memory-Enhanced Agents with Reflective Self-improvement (MARS). The MARS framework comprises three agents: the User, the Assistant, and the Checker. By integrating iterative feedback, reflective mechanisms, and a memory optimization mechanism based on the Ebbinghaus forgetting curve, it significantly enhances the agents' capabilities in handling multi-tasking and long-span information.
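As an illustration of memory optimization driven by a forgetting curve, the sketch below uses the classic exponential form R = exp(-t / S), where t is the time since a memory was last accessed and S is its strength. The abstract does not say how MARS parameterizes the curve or decides what to keep, so the pruning rule, field names, and threshold here are hypothetical.

```python
import math


def retention(elapsed, strength):
    """Ebbinghaus-style retention R = exp(-t / S)."""
    return math.exp(-elapsed / strength)


def prune_memory(memories, now, threshold=0.3):
    """Keep memories whose estimated retention is still at least `threshold`."""
    return [m for m in memories
            if retention(now - m["last_access"], m["strength"]) >= threshold]


# Illustrative memory store: a reinforced fact and a weak, one-off item.
memories = [
    {"text": "user prefers short answers", "last_access": 0.0, "strength": 10.0},
    {"text": "one-off greeting", "last_access": 0.0, "strength": 1.0},
]
kept = prune_memory(memories, now=5.0)
```

A natural extension under the same assumption would be to increase `strength` each time a memory is recalled, so frequently used items decay more slowly and stay within the limited context window.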

Adaptive Wavelet Filters as Practical Texture Feature Amplifiers for Parkinson's Disease Screening in OCT

Xiaoqing Zhang,Hanfeng Shi,Xiangyu Li,Haili Ye,Tao Xu,Na Li,Yan Hu,Fan Lv,Jiangfan Chen,Jiang Liu

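Task: Propose AWFNet, a deep learning network based on an Adaptive Wavelet Filter (AWF) and a Balanced Confidence Loss (BC Loss), for automated Parkinson's disease (PD) screening.

Motivation: Retinal texture features are promising for PD screening, but existing methods do not fully exploit them, and frequency domain learning techniques can enhance the texture feature representations of deep neural networks.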

Details

Method: Design an Adaptive Wavelet Filter (AWF) that increases the diversity of texture features and combine it with a DNN to build AWFNet; propose a Balanced Confidence Loss (BC Loss) to further improve model performance. Result: AWFNet and BC Loss outperform existing methods in PD screening performance and trustworthiness. Conclusion: AWFNet and BC Loss provide an efficient and trustworthy solution for PD screening. Abstract: Parkinson's disease (PD) is a prevalent neurodegenerative disorder globally. The eye's retina is an extension of the brain and has great potential in PD screening. Recent studies have suggested that texture features extracted from retinal layers can be adopted as biomarkers for PD diagnosis under optical coherence tomography (OCT) images. Frequency domain learning techniques can enhance the feature representations of deep neural networks (DNNs) by decomposing frequency components involving rich texture features. Additionally, previous works have not exploited texture features for automated PD screening in OCT. Motivated by the above analysis, we propose a novel Adaptive Wavelet Filter (AWF) that serves as the Practical Texture Feature Amplifier to fully leverage the merits of texture features to boost the PD screening performance of DNNs with the aid of frequency domain learning. Specifically, AWF first enhances texture feature representation diversities via channel mixer, then emphasizes informative texture feature representations with the well-designed adaptive wavelet filtering token mixer. By combining the AWFs with the DNN stem, AWFNet is constructed for automated PD screening. Additionally, we introduce a novel Balanced Confidence (BC) Loss by mining the potential of sample-wise predicted probabilities of all classes and class frequency prior, to further boost the PD screening performance and trustworthiness of AWFNet. Extensive experiments demonstrate the superiority of our AWFNet and BC Loss over state-of-the-art methods in terms of PD screening performance and trustworthiness.

Wavelet-based Global-Local Interaction Network with Cross-Attention for Multi-View Diabetic Retinopathy Detection

Yongting Hu,Yuxin Lin,Chengliang Liu,Xiaoling Luo,Xiaoyan Dou,Qihao Xu,Yong Xu

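Task: Propose a novel multi-view diabetic retinopathy (DR) detection method that addresses the shortcomings of existing methods in lesion information learning and multi-view fusion.

Motivation: Multi-view DR detection is challenged by lesions of varying size and scattered location, and existing methods do not adequately consider the correlations and redundancies across views.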

Details

Method: A two-branch network obtains local lesion features and their global dependencies, the high-frequency component of the wavelet transform enhances lesion edge information, and a cross-view fusion module improves multi-view fusion. Result: Experiments on large public datasets demonstrate the effectiveness of the method. Conclusion: The proposed method performs well in lesion information learning and multi-view fusion, and the code is open source. Abstract: Multi-view diabetic retinopathy (DR) detection has recently emerged as a promising method to address the issue of incomplete lesions faced by single-view DR. However, it is still challenging due to the variable sizes and scattered locations of lesions. Furthermore, existing multi-view DR methods typically merge multiple views without considering the correlations and redundancies of lesion information across them. Therefore, we propose a novel method to overcome the challenges of difficult lesion information learning and inadequate multi-view fusion. Specifically, we introduce a two-branch network to obtain both local lesion features and their global dependencies. The high-frequency component of the wavelet transform is used to exploit lesion edge information, which is then enhanced by global semantics to facilitate difficult lesion learning. Additionally, we present a cross-view fusion module to improve multi-view fusion and reduce redundancy. Experimental results on large public datasets demonstrate the effectiveness of our method. The code is open sourced on https://github.com/HuYongting/WGLIN.

MATT-GS: Masked Attention-based 3DGS for Robot Perception and Object Detection

Jee Won Lee,Hansol Lim,SooYeun Yang,Jongseong Brad Choi

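Task: Propose a masked attention-based 3D Gaussian Splatting (3DGS) method to enhance robotic perception and object detection in industrial and smart factory environments.

Motivation: In complex industrial environments, robots need high-precision object recognition and manipulation, but existing methods fall short in capturing fine details and handling background clutter.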

Details

Method: U2-Net is used for background removal to isolate target objects, and a Sobel filter-based attention mechanism is integrated into the 3DGS framework to enhance fine details. Result: Experiments show that the method significantly outperforms the original 3DGS baseline in visual fidelity and detail preservation. Conclusion: The method improves robotic vision for object recognition and manipulation in complex industrial environments. Abstract: This paper presents a novel masked attention-based 3D Gaussian Splatting (3DGS) approach to enhance robotic perception and object detection in industrial and smart factory environments. U2-Net is employed for background removal to isolate target objects from raw images, thereby minimizing clutter and ensuring that the model processes only relevant data. Additionally, a Sobel filter-based attention mechanism is integrated into the 3DGS framework to enhance fine details - capturing critical features such as screws, wires, and intricate textures essential for high-precision tasks. We validate our approach using quantitative metrics, including L1 loss, SSIM, and PSNR, comparing the performance of the background-removed and attention-incorporated 3DGS model against the ground truth images and the original 3DGS training baseline. The results demonstrate significant improvements in visual fidelity and detail preservation, highlighting the effectiveness of our method in enhancing robotic vision for object recognition and manipulation in complex industrial settings.
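The Sobel filter underlying the attention mechanism can be sketched as a plain gradient-magnitude computation. The 3x3 kernels below are the standard Sobel kernels; how the resulting edge map is fed into 3DGS training is not specified in the abstract, so this shows only the edge-detection step, with a hypothetical function name and toy image.

```python
def sobel_magnitude(img):
    """Gradient magnitude of a 2D grayscale image (list of rows); borders are left at 0."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = sum(kx[a][b] * img[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            gy = sum(ky[a][b] * img[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            out[i][j] = (gx * gx + gy * gy) ** 0.5
    return out


# A vertical step edge: left half dark (0), right half bright (1).
image = [[0, 0, 1, 1] for _ in range(4)]
edges = sobel_magnitude(image)
```

High values mark pixels next to sharp intensity changes, which is where fine structures such as screws and wires would appear in a real image.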

ASP-VMUNet: Atrous Shifted Parallel Vision Mamba U-Net for Skin Lesion Segmentation

Muyi Bao,Shuchang Lyu,Zhaoyang Xu,Qi Zhao,Changyu Zeng,Wenpei Bai,Guangliang Cheng

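Task: Propose ASP-VMUNet, a new skin lesion segmentation framework that uses the Mamba architecture to overcome the limitations of conventional CNNs and Transformers.

Motivation: Conventional CNNs have limited receptive fields and Transformers carry a heavy computational burden, so an efficient and scalable solution is needed.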

Details

Method: The framework uses an atrous scan technique, a Parallel Vision Mamba layer, and a shift round operation, together with a CNN branch that refines the segmentation. Result: The method performs strongly on the ISIC16/17/18 and PH2 datasets, confirming the benefits of the hybrid architecture. Conclusion: ASP-VMUNet advances medical image segmentation and shows the potential of hybrid architectures in medical imaging. Abstract: Skin lesion segmentation is a critical challenge in computer vision, and accurately separating pathological features from healthy skin is essential for diagnostics. Traditional Convolutional Neural Networks (CNNs) are limited by narrow receptive fields, and Transformers face significant computational burdens. This paper presents a novel skin lesion segmentation framework, the Atrous Shifted Parallel Vision Mamba UNet (ASP-VMUNet), which integrates the efficient and scalable Mamba architecture to overcome limitations in traditional CNNs and computationally demanding Transformers. The framework introduces an atrous scan technique that minimizes background interference and expands the receptive field, enhancing Mamba's scanning capabilities. Additionally, the inclusion of a Parallel Vision Mamba (PVM) layer and a shift round operation optimizes feature segmentation and fosters rich inter-segment information exchange. A supplementary CNN branch with a Selective-Kernel (SK) Block further refines the segmentation by blending local and global contextual information. Tested on four benchmark datasets (ISIC16/17/18 and PH2), ASP-VMUNet demonstrates superior performance in skin lesion segmentation, validated by comprehensive ablation studies. This approach not only advances medical image segmentation but also highlights the benefits of hybrid architectures in medical imaging technology. Our code is available at https://github.com/BaoBao0926/ASP-VMUNet/tree/main.

Quantifying the Ease of Reproducing Training Data in Unconditional Diffusion Models

Masaya Hasegawa,Koji Yasuda

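Task: Quantify the ease of reproducing training data in unconditional diffusion models.

Motivation: Diffusion models may generate samples that closely resemble their training data, which raises copyright issues, so the ease of such reproduction needs to be quantified.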

Details

Method: Analyzes the average of a sample population following the Langevin equation in the reverse diffusion process to establish a one-to-one mapping between images and noise in the latent space, and uses the reversibility of the ODE to quantify the probability of reproduction. Result: Successfully quantifies the ease of reproducing training data by measuring the volume growth rate along the ODE. Conclusion: The method has low computational complexity and can be used to detect and modify easily memorized training samples, improving the quality of the training data. Abstract: Diffusion models, which have been advancing rapidly in recent years, may generate samples that closely resemble the training data. This phenomenon, known as memorization, may lead to copyright issues. In this study, we propose a method to quantify the ease of reproducing training data in unconditional diffusion models. The average of a sample population following the Langevin equation in the reverse diffusion process moves according to a first-order ordinary differential equation (ODE). This ODE establishes a 1-to-1 correspondence between images and their noisy counterparts in the latent space. Since the ODE is reversible and the initial noisy images are sampled randomly, the volume of an image's projected area represents the probability of generating those images. We examined the ODE, which projects images to latent space, and succeeded in quantifying the ease of reproducing training data by measuring the volume growth rate in this process. Given the relatively low computational complexity of this method, it allows us to enhance the quality of training data by detecting and modifying the easily memorized training samples.

TFIC: End-to-End Text-Focused Image Compression for Coding for Machines

Stefano Della Fiore,Alessandro Gnutti,Marco Dalai,Pierangelo Migliorati,Riccardo Leonardi

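Task: Design an image compression system targeted at Optical Character Recognition (OCR).

Motivation: Traditional image compression focuses on human perception, whereas Coding for Machines focuses on preserving the information needed for a specific machine task.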

Details

Method: Proposes an image compression system that focuses on preserving text features, with an encoding time only half that of the OCR module. Result: Significantly improves text extraction accuracy at low bitrates, even surpassing OCR performed on uncompressed images. Conclusion: The method can serve as a local pre-processing step and suits devices with limited computational capacity. Abstract: Traditional image compression methods aim to faithfully reconstruct images for human perception. In contrast, Coding for Machines focuses on compressing images to preserve information relevant to a specific machine task. In this paper, we present an image compression system designed to retain text-specific features for subsequent Optical Character Recognition (OCR). Our encoding process requires half the time needed by the OCR module, making it especially suitable for devices with limited computational capacity. In scenarios where on-device OCR is computationally prohibitive, images are compressed and later processed to recover the text content. Experimental results demonstrate that our method achieves significant improvements in text extraction accuracy at low bitrates, even improving over the accuracy of OCR performed on uncompressed images, thus acting as a local pre-processing step.

Single-Step Latent Consistency Model for Remote Sensing Image Super-Resolution

Xiaohui Sun,Jiangwei Mo,Hanlin Wu,Jie Ma

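Task: Propose LCMSR, a single-step diffusion model that improves both efficiency and visual quality in remote sensing image super-resolution.

Motivation: The iterative sampling of conventional diffusion models makes inference slow and hard to apply to real-time tasks.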

Details

Method: Works in two stages: a residual autoencoder is pretrained to encode the differential information between high- and low-resolution images, and consistency diffusion learning is then performed in the latent space. Result: LCMSR reduces the iterative steps of conventional diffusion models from 50-1000 to a single step, significantly improving efficiency while maintaining high-quality output. Conclusion: LCMSR strikes a good balance between efficiency and performance and suits real-time remote sensing image super-resolution. Abstract: Recent advancements in diffusion models (DMs) have greatly advanced remote sensing image super-resolution (RSISR). However, their iterative sampling processes often result in slow inference speeds, limiting their application in real-time tasks. To address this challenge, we propose the latent consistency model for super-resolution (LCMSR), a novel single-step diffusion approach designed to enhance both efficiency and visual quality in RSISR tasks. Our proposal is structured into two distinct stages. In the first stage, we pretrain a residual autoencoder to encode the differential information between high-resolution (HR) and low-resolution (LR) images, transitioning the diffusion process into a latent space to reduce computational costs. The second stage focuses on consistency diffusion learning, which aims to learn the distribution of residual encodings in the latent space, conditioned on LR images. The consistency constraint enforces that predictions at any two timesteps along the reverse diffusion trajectory remain consistent, enabling direct mapping from noise to data. As a result, the proposed LCMSR reduces the iterative steps of traditional diffusion models from 50-1000 or more to just a single step, significantly improving efficiency. Experimental results demonstrate that LCMSR effectively balances efficiency and performance, achieving inference times comparable to non-diffusion models while maintaining high-quality output.

RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation

Sheng Wang

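Task: Introduce the RoboFlamingo-Plus framework to improve how existing Vision-Language Models (VLMs) fuse RGB and depth information in 3D environments, and thereby improve robotic manipulation performance.

Motivation: Current methods still struggle to fuse depth and RGB information and to execute language-guided tasks, so a more effective solution is needed.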

Details

Method: Combines a pre-trained Vision Transformer (ViT) with a resampling technique; depth features are extracted by a pre-trained resampler and fused through cross-attention. Result: RoboFlamingo-Plus improves performance on robotic manipulation tasks by 10-20% over existing methods. Conclusion: RoboFlamingo-Plus markedly improves 3D scene understanding and language-guided task execution, an important step forward for robotics. Abstract: As robotic technologies advance towards more complex multimodal interactions and manipulation tasks, the integration of advanced Vision-Language Models (VLMs) has become a key driver in the field. Despite progress with current methods, challenges persist in fusing depth and RGB information within 3D environments and executing tasks guided by linguistic instructions. In response to these challenges, we have enhanced the existing RoboFlamingo framework by introducing RoboFlamingo-Plus, which incorporates depth data into VLMs to significantly improve robotic manipulation performance. Our research achieves a nuanced fusion of RGB and depth information by integrating a pre-trained Vision Transformer (ViT) with a resampling technique, closely aligning this combined data with linguistic cues for superior multimodal understanding. The novelty of RoboFlamingo-Plus lies in its adaptation of inputs for depth data processing, leveraging a pre-trained resampler for depth feature extraction, and employing cross-attention mechanisms for optimal feature integration. These improvements allow RoboFlamingo-Plus to not only deeply understand 3D environments but also easily perform complex, language-guided tasks in challenging settings. Experimental results show that RoboFlamingo-Plus boosts robotic manipulation by 10-20% over current methods, marking a significant advancement. Code and model weights are publicly available at RoboFlamingo-Plus.

One Framework to Rule Them All: Unifying RL-Based and RL-Free Methods in RLHF

Xin Cai

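Task: Study RL-based and RL-free methods for Reinforcement Learning from Human Feedback (RLHF) and Large Reasoning Models (LRMs).

Motivation: Reinterpret RL-based and RL-free algorithms through the lens of neural structured bandit prediction to reveal the deeper connection between them and to address shortcomings in existing RLHF research.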

Details

Method: Reinterprets the algorithms, derives the standard RLHF objective, and proposes the Generalized Reinforce Optimization (GRO) framework. Result: Proposes the GRO framework, which unifies RL-based and RL-free methods and offers a new perspective on RLHF. Conclusion: GRO opens a new direction for RLHF research; the author invites empirical validation and feedback from the community. Abstract: In this article, we primarily examine a variety of RL-based and RL-free methods designed to address Reinforcement Learning from Human Feedback (RLHF) and Large Reasoning Models (LRMs). We begin with a concise overview of the typical steps involved in RLHF and LRMs. Next, we reinterpret several RL-based and RL-free algorithms through the perspective of neural structured bandit prediction, providing a clear conceptual framework that uncovers a deeper connection between these seemingly distinct approaches. Following this, we briefly review some core principles of reinforcement learning, drawing attention to an often-overlooked aspect in existing RLHF studies. This leads to a detailed derivation of the standard RLHF objective within a full RL context, demonstrating its equivalence to neural structured bandit prediction. Finally, by reinvestigating the principles behind Proximal Policy Optimization (PPO), we pinpoint areas needing adjustment, which culminates in the introduction of the Generalized Reinforce Optimization (GRO) framework, seamlessly integrating RL-based and RL-free methods in RLHF. We look forward to the community's efforts to empirically validate GRO and invite constructive feedback.

SINR: Sparsity Driven Compressed Implicit Neural Representations

Dhananjaya Jayasundara,Sudarshan Rajagopalan,Yasiru Ranasinghe,Trac D. Tran,Vishal M. Patel

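Task: Propose SINR, a novel compression algorithm for implicit neural representations (INRs) that exploits patterns in the vector spaces formed by INR weights to compress them efficiently.

Motivation: The performance of existing INR compression methods depends on the quantization and entropy coding schemes used, which limits compression efficiency and generality.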

Details

Method: Encodes the vector spaces formed by INR weights as a high-dimensional sparse code within a dictionary; the dictionary atoms need not be learned or transmitted to recover the weights. Result: SINR substantially reduces the storage requirements of INRs, outperforms conventional compression baselines, and maintains high-quality decoding across several data modalities. Conclusion: SINR is an efficient and general INR compression method suited to a range of applications. Abstract: Implicit Neural Representations (INRs) are increasingly recognized as a versatile data modality for representing discretized signals, offering benefits such as infinite query resolution and reduced storage requirements. Existing signal compression approaches for INRs typically employ one of two strategies: 1. direct quantization with entropy coding of the trained INR; 2. deriving a latent code on top of the INR through a learnable transformation. Thus, their performance is heavily dependent on the quantization and entropy coding schemes employed. In this paper, we introduce SINR, an innovative compression algorithm that leverages the patterns in the vector spaces formed by weights of INRs. We compress these vector spaces using a high-dimensional sparse code within a dictionary. Further analysis reveals that the atoms of the dictionary used to generate the sparse code do not need to be learned or transmitted to successfully recover the INR weights. We demonstrate that the proposed approach can be integrated with any existing INR-based signal compression technique. Our results indicate that SINR achieves substantial reductions in storage requirements for INRs across various configurations, outperforming conventional INR-based compression baselines. Furthermore, SINR maintains high-quality decoding across diverse data modalities, including images, occupancy fields, and Neural Radiance Fields.

Prompt-Guided Dual-Path UNet with Mamba for Medical Image Segmentation

Shaolei Zhang,Jinyan Liu,Tianyi Qian,Xuesong Li

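Task: Propose PGM-UNet, a dual-path UNet architecture for medical image segmentation.

Motivation: Existing Mamba-based methods fall short in perceiving the original input data and in capturing local details.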

Details

Method: Uses a prompt-guided CNN-Mamba dual-path design comprising a dynamic visual prompt extraction module, a local-global information fusion network, and a multi-scale information extraction module. Result: Significantly outperforms existing methods on multiple medical image segmentation tasks. Conclusion: PGM-UNet improves medical image segmentation by combining local and global information. Abstract: Convolutional neural networks (CNNs) and transformers are widely employed in constructing UNet architectures for medical image segmentation tasks. However, CNNs struggle to model long-range dependencies, while transformers suffer from quadratic computational complexity. Recently, Mamba, a type of State Space Model, has gained attention for its exceptional ability to model long-range interactions while maintaining linear computational complexity. Despite the emergence of several Mamba-based methods, they still present the following limitations: first, their network designs generally lack perceptual capabilities for the original input data; second, they primarily focus on capturing global information, while often neglecting local details. To address these challenges, we propose a prompt-guided CNN-Mamba dual-path UNet, termed PGM-UNet, for medical image segmentation. Specifically, we introduce a prompt-guided residual Mamba module that adaptively extracts dynamic visual prompts from the original input data, effectively guiding Mamba in capturing global information. Additionally, we design a local-global information fusion network, comprising a local information extraction module, a prompt-guided residual Mamba module, and a multi-focus attention fusion module, which effectively integrates local and global information. Furthermore, inspired by Kolmogorov-Arnold Networks (KANs), we develop a multi-scale information extraction module to capture richer contextual information without altering the resolution. We conduct extensive experiments on the ISIC-2017, ISIC-2018, DIAS, and DRIVE datasets. The results demonstrate that the proposed method significantly outperforms state-of-the-art approaches in multiple medical image segmentation tasks.

GIViC: Generative Implicit Video Compression

Ge Gao,Siyue Teng,Tianhao Peng,Fan Zhang,David Bull

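Task: Propose GIViC, a video compression framework based on generative implicit neural representations (INRs), to push the performance limits of INR-based methods.

Motivation: Existing INR-based video codecs cannot match conventional or autoencoder-based methods under the same coding configuration.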

Details

Method: Uses a newly designed implicit diffusion process that performs diffusive sampling across coarse-to-fine spatiotemporal decompositions, combined with a Hierarchical Gated Linear Attention-based transformer (HGLA) that models global dependencies. Result: Under the Random Access (RA) configuration, GIViC achieves BD-rate savings of 15.94%, 22.46%, and 8.52% over VVC VTM, DCVC-FM, and NVRC, respectively. Conclusion: GIViC is the first INR-based video codec to outperform VTM under the RA configuration. Abstract: While video compression based on implicit neural representations (INRs) has recently demonstrated great potential, existing INR-based video codecs still cannot achieve state-of-the-art (SOTA) performance compared to their conventional or autoencoder-based counterparts given the same coding configuration. In this context, we propose a Generative Implicit Video Compression framework, GIViC, aiming at advancing the performance limits of this type of coding method. GIViC is inspired by the characteristics that INRs share with large language and diffusion models in exploiting long-term dependencies. Through the newly designed implicit diffusion process, GIViC performs diffusive sampling across coarse-to-fine spatiotemporal decompositions, gradually progressing from coarser-grained full-sequence diffusion to finer-grained per-token diffusion. A novel Hierarchical Gated Linear Attention-based transformer (HGLA) is also integrated into the framework, which dual-factorizes global dependency modeling along scale and sequential axes. The proposed GIViC model has been benchmarked against SOTA conventional and neural codecs using a Random Access (RA) configuration (YUV 4:2:0, GOPSize=32), and yields BD-rate savings of 15.94%, 22.46% and 8.52% over VVC VTM, DCVC-FM and NVRC, respectively. As far as we are aware, GIViC is the first INR-based video codec that outperforms VTM based on the RA coding configuration. The source code will be made available.

Single Shot AI-assisted quantification of KI-67 proliferation index in breast cancer

Deepti Madurai Muthu,Priyanka S,Lalitha Rani N,P. G. Kubendran Amos

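Task: Develop an AI-assisted method, built on the YOLOv8 framework, that automatically quantifies the Ki-67 proliferation marker in breast cancer.

Motivation: Conventional Ki-67 quantification methods, such as visual estimation and manual counting, suffer from interobserver variability and poor reproducibility, so a more efficient and objective alternative is needed.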

Details

Method: Uses the YOLOv8 object detection framework to automatically identify and score Ki-67-positive and Ki-67-negative tumor cells in high-resolution digital images. Result: The YOLOv8 Medium model performed best, with a mean Average Precision (mAP50) exceeding 85% for Ki-67-positive cells. Conclusion: The method offers an efficient, scalable, and objective alternative for Ki-67 evaluation; future work will develop user-friendly clinical interfaces and expand to multi-institutional datasets to improve generalizability. Abstract: Reliable quantification of Ki-67, a key proliferation marker in breast cancer, is essential for molecular subtyping and informed treatment planning. Conventional approaches, including visual estimation and manual counting, suffer from interobserver variability and limited reproducibility. This study introduces an AI-assisted method using the YOLOv8 object detection framework for automated Ki-67 scoring. High-resolution digital images (40x magnification) of immunohistochemically stained tumor sections were captured from Ki-67 hotspot regions and manually annotated by a domain expert to distinguish Ki-67-positive and negative tumor cells. The dataset was augmented and divided into training (80%), validation (10%), and testing (10%) subsets. Among the YOLOv8 variants tested, the Medium model achieved the highest performance, with a mean Average Precision at 50% Intersection over Union (mAP50) exceeding 85% for Ki-67-positive cells. The proposed approach offers an efficient, scalable, and objective alternative to conventional scoring methods, supporting greater consistency in Ki-67 evaluation. Future directions include developing user-friendly clinical interfaces and expanding to multi-institutional datasets to enhance generalizability and facilitate broader adoption in diagnostic practice.
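The abstract does not say how detections are turned into a final score, so the sketch below assumes the usual definition of the Ki-67 proliferation index: the percentage of Ki-67-positive tumor cells among all counted tumor cells. The detection format (a list of `(label, confidence)` pairs), the label strings, and the `conf_threshold` default are hypothetical and only stand in for whatever a detector actually returns.

```python
def ki67_index(detections, conf_threshold=0.25):
    """Percentage of Ki-67-positive cells among all counted tumor cells.

    detections:     iterable of (label, confidence) pairs, where label is
                    "positive" or "negative" (hypothetical class names).
    conf_threshold: detections below this confidence are ignored
                    (hypothetical default, not taken from the paper).
    Returns None when no cell passes the threshold, since the index is
    undefined for an empty count.
    """
    positive = 0
    negative = 0
    for label, confidence in detections:
        if confidence < conf_threshold:
            continue
        if label == "positive":
            positive += 1
        elif label == "negative":
            negative += 1
    counted = positive + negative
    if counted == 0:
        return None
    return 100.0 * positive / counted
```

Counting from thresholded detections keeps the score tied to individually inspectable boxes, which is what makes a detector-based index easier to audit than a visual estimate.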

MultimodalStudio: A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities

Federico Lincetto,Gianluca Agresti,Mattia Rossi,Pietro Zanuttigh

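Task: Explore the ability of implicit neural models to learn and transfer information across multimodal imaging data.

Motivation: RGB images are widely used to train volume rendering models and interest in other radiance modalities is growing, yet datasets and studies for those modalities remain scarce.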

Details

Method: Presents MultimodalStudio (MMS), which comprises the multimodal multi-view dataset MMS-DATA and the modular multimodal NeRF framework MMS-FW. Result: Experiments show that MMS-FW trained on MMS-DATA can transfer information between imaging modalities and produce higher-quality renderings than single modalities alone. Conclusion: MMS provides a dataset and a framework for research on multimodal volume rendering. Abstract: Neural Radiance Fields (NeRF) have shown impressive performances in the rendering of 3D scenes from arbitrary viewpoints. While RGB images are widely preferred for training volume rendering models, the interest in other radiance modalities is also growing. However, the capability of the underlying implicit neural models to learn and transfer information across heterogeneous imaging modalities has seldom been explored, mostly due to the limited training data availability. For this purpose, we present MultimodalStudio (MMS): it encompasses MMS-DATA and MMS-FW. MMS-DATA is a multimodal multi-view dataset containing 32 scenes acquired with 5 different imaging modalities: RGB, monochrome, near-infrared, polarization and multispectral. MMS-FW is a novel modular multimodal NeRF framework designed to handle multimodal raw data and able to support an arbitrary number of multi-channel devices. Through extensive experiments, we demonstrate that MMS-FW trained on MMS-DATA can transfer information between different imaging modalities and produce higher quality renderings than using single modalities alone. We publicly release the dataset and the framework, to promote the research on multimodal volume rendering and beyond.

Semi-SD: Semi-Supervised Metric Depth Estimation via Surrounding Cameras for Autonomous Driving

Yusen Xie,Zhengmin Huang,Shaojie Shen,Jun Ma

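Task: Propose Semi-SD, a novel metric depth estimation framework for surrounding-camera rigs in autonomous driving.

Motivation: Resolve the scale ambiguity in multi-camera setups and improve the quality of depth estimation.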

Details

Method: Uses a unified spatial-temporal-semantic fusion module to build fused visual features, applies cross-attention components to refine metric scale information and match temporal features, and supervises training with a semantic world model and a monocular depth estimation world model. Result: On the DDAD and nuScenes datasets, the method achieves state-of-the-art depth estimation quality for surrounding cameras. Conclusion: The Semi-SD framework effectively addresses multi-camera depth estimation and performs strongly. Abstract: In this paper, we introduce Semi-SD, a novel metric depth estimation framework tailored for surrounding camera equipment in autonomous driving. In this work, the input data consists of adjacent surrounding frames and camera parameters. We propose a unified spatial-temporal-semantic fusion module to construct the visual fused features. Cross-attention components for surrounding cameras and adjacent frames are utilized to focus on metric scale information refinement and temporal feature matching. Building on this, we propose a pose estimation framework using surrounding cameras, their corresponding estimated depths, and extrinsic parameters, which effectively addresses the scale ambiguity in multi-camera setups. Moreover, a semantic world model and a monocular depth estimation world model are integrated to supervise the depth estimation, which improves the quality of depth estimation. We evaluate our algorithm on the DDAD and nuScenes datasets, and the results demonstrate that our method achieves state-of-the-art performance in terms of surrounding camera based depth estimation quality. The source code will be available on https://github.com/xieyuser/Semi-SD.

On What Depends the Robustness of Multi-source Models to Missing Data in Earth Observation?

Francisco Mena,Diego Arenas,Miro Miranda,Andreas Dengel

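Task: Evaluate the predictive performance of six state-of-the-art multi-source models when a data source is missing or only a single source is available.

Motivation: Understand how effective multi-source models are under different missing-data conditions and which factors drive that effectiveness.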

Details

Method: Analyzes the predictions of six multi-source models in scenarios where a single source is missing or only a single source is available. Result: Model effectiveness is closely tied to the nature of the task, the complementarity among data sources, and the model design; in some cases removing a data source improves performance. Conclusion: The study prompts reflection on model complexity and on whether every data source is necessary, pointing toward more streamlined approaches for EO applications. Abstract: In recent years, the development of robust multi-source models has emerged in the Earth Observation (EO) field. These are models that leverage data from diverse sources to improve predictive accuracy when there is missing data. Despite these advancements, the factors influencing the varying effectiveness of such models remain poorly understood. In this study, we evaluate the predictive performance of six state-of-the-art multi-source models in prediction scenarios where either a single data source is missing or only a single source is available. Our analysis reveals that the efficacy of these models is intricately tied to the nature of the task, the complementarity among data sources, and the model design. Surprisingly, we observe instances where the removal of certain data sources leads to improved predictive performance, challenging the assumption that incorporating all available data is always beneficial. These findings prompt critical reflections on model complexity and the necessity of all collected data sources, potentially paving the way for more streamlined approaches in EO applications.

InterSliceBoost: Identifying Tissue Layers in Three-dimensional Ultrasound Images for Chronic Lower Back Pain (cLBP) Assessment

Zixue Zeng,Matthew Cartier,Xiaoyan Zhao,Pengyu Chen,Xin Meng,Zhiyu Sheng,Maryam Satarpour,John M Cormack,Allison C. Bean,Ryan P. Nussbaum,Maya Maurer,Emily Landis-Walkenhorst,Kang Kim,Ajay D. Wasan,Jiantao Pu

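Task: Develop and validate InterSliceBoost, a new approach for training a segmentation model on a partially annotated dataset without compromising segmentation performance.

Motivation: Existing studies of chronic lower back pain (cLBP) usually focus on specific tissues and lack a comprehensive layer-by-layer analysis, and manually annotating three-dimensional images is time-consuming and error-prone.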

Details

Method: InterSliceBoost has two components: an inter-slice generator and a segmentation model. The generator uses residual block-based encoders to extract features from adjacent image-mask pairs, computes differential features, and generates inter-slice image-mask pairs. The segmentation model is trained on the partially annotated dataset together with the generated inter-slice image-mask pairs. Result: Using only 33% of the image slices, InterSliceBoost achieves a mean Dice coefficient of 80.84% on the independent test set, significantly higher than the conventional model (p<0.05). Conclusion: InterSliceBoost can effectively segment the six tissue layers in 3D B-mode ultrasound images under partial annotation. Abstract: Available studies on chronic lower back pain (cLBP) typically focus on one or a few specific tissues rather than conducting a comprehensive layer-by-layer analysis. Since three-dimensional (3-D) images often contain hundreds of slices, manual annotation of these anatomical structures is both time-consuming and error-prone. We aim to develop and validate a novel approach called InterSliceBoost to enable the training of a segmentation model on a partially annotated dataset without compromising segmentation performance. The architecture of InterSliceBoost includes two components: an inter-slice generator and a segmentation model. The generator utilizes residual block-based encoders to extract features from adjacent image-mask pairs (IMPs). Differential features are calculated and input into a decoder to generate inter-slice IMPs. The segmentation model is trained on partially annotated datasets (e.g., skipping 1, 2, 3, or 7 images) and the generated inter-slice IMPs. To validate the performance of InterSliceBoost, we utilized a dataset of 76 B-mode ultrasound scans acquired on 29 subjects enrolled in an ongoing cLBP study. InterSliceBoost, trained on only 33% of the image slices, achieved a mean Dice coefficient of 80.84% across all six layers on the independent test set, with Dice coefficients of 73.48%, 61.11%, 81.87%, 95.74%, 83.52% and 88.74% for segmenting dermis, superficial fat, superficial fascial membrane, deep fat, deep fascial membrane, and muscle. This performance is significantly higher than the conventional model trained on fully annotated images (p<0.05). InterSliceBoost can effectively segment the six tissue layers depicted on 3-D B-mode ultrasound images in settings with partial annotations.
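For readers unfamiliar with the metric reported above, here is a minimal sketch of a per-layer Dice coefficient, 2|P∩T| / (|P| + |T|), computed on integer label maps. It assumes label 0 is background and labels 1..num_classes are tissue layers; the function names and the label encoding are hypothetical, and the paper's own evaluation code (for example, how scores are averaged across scans) may differ.

```python
def dice_per_class(pred, truth, num_classes):
    """Dice coefficient for each label 1..num_classes on 2D integer label maps.

    pred, truth: lists of rows holding integer labels (0 = background, assumed).
    Returns {label: score}; the score is None when the label is absent from
    both maps, because Dice is undefined in that case.
    """
    scores = {}
    for label in range(1, num_classes + 1):
        overlap = 0      # pixels labelled `label` in both maps
        pred_area = 0    # pixels labelled `label` in the prediction
        truth_area = 0   # pixels labelled `label` in the ground truth
        for pred_row, truth_row in zip(pred, truth):
            for p, t in zip(pred_row, truth_row):
                pred_area += p == label
                truth_area += t == label
                overlap += p == label and t == label
        total = pred_area + truth_area
        scores[label] = 2.0 * overlap / total if total else None
    return scores


def mean_dice(scores):
    """Average over the labels for which Dice is defined."""
    defined = [s for s in scores.values() if s is not None]
    return sum(defined) / len(defined) if defined else None
```

Reporting the per-layer scores alongside the mean, as the abstract does, matters because thin layers such as fascial membranes can score much lower than thick ones even when the mean looks healthy.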

GRN+: A Simplified Generative Reinforcement Network for Tissue Layer Analysis in 3D Ultrasound Images for Chronic Low-back Pain

Zixue Zeng,Xiaoyan Zhao,Matthew Cartier,Xin Meng,Jiantao Pu

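Task: Develop and validate GRN+, a multi-model framework that automates layer segmentation for 3D ultrasound image analysis.

Motivation: Manually distinguishing soft tissues for quantitative analysis is time-consuming and labor-intensive, so an automated method that depends less on large amounts of annotated data is needed.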

Details

Method: GRN+ combines a ResNet-based generator with a U-Net segmentation model, generates new images and matching masks through Segmentation-guided Enhancement (SGE), and stabilizes training with a two-stage backpropagation strategy. Result: With only 5% labeled data, GRN+ outperforms other semi-supervised methods on the Dice coefficient; on fully annotated datasets it improves the Dice coefficient by 2.16% at lower computational cost. Conclusion: By reducing computational cost and annotation dependence, GRN+ provides an efficient tool for 3D ultrasound analysis, particularly for patients with chronic low-back pain. Abstract: 3D ultrasound delivers high-resolution, real-time images of soft tissues, which is essential for pain research. However, manually distinguishing various tissues for quantitative analysis is labor-intensive. To streamline this process, we developed and validated GRN+, a novel multi-model framework that automates layer segmentation with minimal annotated data. GRN+ combines a ResNet-based generator and a U-Net segmentation model. Through a method called Segmentation-guided Enhancement (SGE), the generator produces new images and matching masks under the guidance of the segmentation model, with its weights adjusted according to the segmentation loss gradient. To prevent gradient explosion and secure stable training, a two-stage backpropagation strategy was implemented: the first stage propagates the segmentation loss through both the generator and segmentation model, while the second stage concentrates on optimizing the segmentation model alone, thereby refining mask prediction using the generated images. Tested on 69 fully annotated 3D ultrasound scans from 29 subjects with six manually labeled tissue layers, GRN+ outperformed all other semi-supervised methods in terms of the Dice coefficient using only 5% labeled data, despite not using unlabeled data for unsupervised training. Additionally, when applied to fully annotated datasets, GRN+ with SGE achieved a 2.16% higher Dice coefficient while incurring lower computational costs compared to other models. Overall, GRN+ provides accurate tissue segmentation while reducing both computational expenses and the dependency on extensive annotations, making it an effective tool for 3D ultrasound analysis in cLBP patients.

A Survey on Event-driven 3D Reconstruction: Development under Different Categories

Chuanzhi Xu,Haoxian Zhou,Haodong Chen,Vera Chung,Qiang Qu

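Task: Survey event-driven 3D reconstruction methods, covering stereo, monocular, and multimodal systems.

Motivation: Event cameras have drawn attention for 3D reconstruction thanks to their high temporal resolution, low latency, and high dynamic range, but the related methods have not been systematically organized.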

Details

Method: Categorizes methods into geometric, learning-based, and hybrid approaches, and covers emerging trends such as neural radiance fields and 3D Gaussian splatting. Result: Provides a comprehensive survey of event-driven 3D reconstruction and identifies research gaps and future directions. Conclusion: Event cameras hold great potential for 3D reconstruction; further work is needed on datasets, experiments, and event representation. Abstract: Event cameras have gained increasing attention for 3D reconstruction due to their high temporal resolution, low latency, and high dynamic range. They capture per-pixel brightness changes asynchronously, allowing accurate reconstruction under fast motion and challenging lighting conditions. In this survey, we provide a comprehensive review of event-driven 3D reconstruction methods, including stereo, monocular, and multimodal systems. We further categorize recent developments based on geometric, learning-based, and hybrid approaches. Emerging trends, such as neural radiance fields and 3D Gaussian splatting with event data, are also covered. The related works are structured chronologically to illustrate the innovations and progression within the field. To support future research, we also highlight key research gaps and future research directions in dataset, experiment, evaluation, event representation, etc.

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Zhi Hou,Tianyi Zhang,Yuwen Xiong,Haonan Duan,Hengjun Pu,Ronglei Tong,Chengyang Zhao,Xizhou Zhu,Yu Qiao,Jifeng Dai,Yuntao Chen

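Task: Propose Dita, a framework that directly denoises continuous action sequences through a unified multimodal diffusion process, to improve adaptability to heterogeneous action spaces.

Motivation: Existing vision-language-action models rely on compact action heads to predict discretized or continuous actions, which limits their adaptability to heterogeneous action spaces.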

Details

Method: Uses a Transformer architecture and a diffusion process, with in-context conditioning that provides fine-grained alignment between denoised actions and historical observations. Result: Performs well in both simulation and the real world, and is especially robust on long-horizon tasks and under environmental variation. Conclusion: Dita provides a lightweight, open-source, and versatile baseline for generalist robot policy learning. Abstract: While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By scaling the diffusion action denoiser alongside the Transformer's scalability, Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces. Such synergy enhances robustness against various variances and facilitates the successful execution of long-horizon tasks. Evaluations across extensive benchmarks demonstrate state-of-the-art or comparable performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variances and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight and open-source baseline for generalist robot policy learning. Project Page: https://robodita.github.io.

Domain-incremental White Blood Cell Classification with Privacy-aware Continual Learning

Pratibha Kumari,Afshin Bozorgpour,Daniel Reisenbüchler,Edgar Jost,Martina Crysandt,Christian Matek,Dorit Merhof

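Task: Propose a generative replay-based continual learning strategy that prevents catastrophic forgetting in foundation models for white blood cell classification.

Motivation: White blood cell classification is vital in hematology, but traditional deep learning models and foundation models suffer from catastrophic forgetting or performance degradation in dynamic environments.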

Details

Method: Uses lightweight generators to mimic past data through synthetic latent representations, enabling privacy-preserving replay. Result: Experiments show that conventional fine-tuning degrades performance on previously learned tasks, whereas the proposed continual learning strategy effectively mitigates catastrophic forgetting. Conclusion: The strategy offers a practical solution for white blood cell classification in clinical settings where data distributions change frequently. Abstract: White blood cell (WBC) classification plays a vital role in hematology for diagnosing various medical conditions. However, it faces significant challenges due to domain shifts caused by variations in sample sources (e.g., blood or bone marrow) and differing imaging conditions across hospitals. Traditional deep learning models often suffer from catastrophic forgetting in such dynamic environments, while foundation models, though generally robust, experience performance degradation when the distribution of inference data differs from that of the training data. To address these challenges, we propose a generative replay-based Continual Learning (CL) strategy designed to prevent forgetting in foundation models for WBC classification. Our method employs lightweight generators to mimic past data with a synthetic latent representation to enable privacy-preserving replay. To showcase the effectiveness, we carry out extensive experiments with a total of four datasets with different task ordering and four backbone models including ResNet50, RetCCL, CTransPath, and UNI. Experimental results demonstrate that conventional fine-tuning methods degrade performance on previously learned tasks and struggle with domain shifts. In contrast, our continual learning strategy effectively mitigates catastrophic forgetting, preserving model performance across varying domains. This work presents a practical solution for maintaining reliable WBC classification in real-world clinical settings, where data distributions frequently evolve.

GyralNet Subnetwork Partitioning via Differentiable Spectral Modularity Optimization

Yan Zhuang,Minheng Chen,Chao Cao,Tong Chen,Jing Zhang,Xiaowei Yu,Yanjun Lyu,Lu Zhang,Tianming Liu,Dajiang Zhu

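Task: Propose a fully differentiable subnetwork partitioning framework, based on a spectral modularity maximization strategy, for analyzing the modular organization of 3HGs within GyralNet.

Motivation: Existing 3HG analysis methods struggle with the sub-voxel scale of 3HGs, the computational complexity of cross-subject correspondence, and the oversimplification of treating 3HGs as independent nodes.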

Details

Method: Uses topological structural similarity and DTI-derived connectivity patterns as attribute features, and applies a spectral modularity maximization strategy. Result: Experiments on the HCP dataset confirm that the method partitions GyralNet effectively at the individual level while keeping communities consistent across subjects. Conclusion: The method offers a robust foundation for understanding brain connectivity. Abstract: Understanding the structural and functional organization of the human brain requires a detailed examination of cortical folding patterns, among which the three-hinge gyrus (3HG) has been identified as a key structural landmark. GyralNet, a network representation of cortical folding, models 3HGs as nodes and gyral crests as edges, highlighting their role as critical hubs in cortico-cortical connectivity. However, existing methods for analyzing 3HGs face significant challenges, including the sub-voxel scale of 3HGs at typical neuroimaging resolutions, the computational complexity of establishing cross-subject correspondences, and the oversimplification of treating 3HGs as independent nodes without considering their community-level relationships. To address these limitations, we propose a fully differentiable subnetwork partitioning framework that employs a spectral modularity maximization optimization strategy to modularize the organization of 3HGs within GyralNet. By incorporating topological structural similarity and DTI-derived connectivity patterns as attribute features, our approach provides a biologically meaningful representation of cortical organization. Extensive experiments on the Human Connectome Project (HCP) dataset demonstrate that our method effectively partitions GyralNet at the individual level while preserving the community-level consistency of 3HGs across subjects, offering a robust foundation for understanding brain connectivity.
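The abstract does not give the objective in closed form. A common way to make modularity differentiable, assumed here rather than taken from the paper, is to relax hard community labels into a soft assignment matrix S and evaluate Q = Tr(Sᵀ B S) / 2m, where B = A − d dᵀ / 2m is the modularity matrix, A the adjacency matrix, d the degree vector, and m the number of edges. The sketch below only evaluates Q for a given S on a small undirected graph; the function name is hypothetical, and a real pipeline would produce S from a trainable network and maximize Q by gradient ascent.

```python
def soft_modularity(adj, assign):
    """Modularity Q = Tr(S^T B S) / (2m) for a soft assignment matrix S.

    adj:    symmetric adjacency matrix of an undirected graph (list of rows).
    assign: one row per node with one membership weight per community;
            one-hot rows recover the classical hard-partition modularity.
    """
    n = len(adj)
    k = len(assign[0])
    degree = [sum(row) for row in adj]
    two_m = sum(degree)                    # equals twice the number of edges
    q = 0.0
    for i in range(n):
        for j in range(n):
            # Entry (i, j) of the modularity matrix B = A - d d^T / 2m.
            b_ij = adj[i][j] - degree[i] * degree[j] / two_m
            # Probability-like weight that nodes i and j share a community.
            same = sum(assign[i][c] * assign[j][c] for c in range(k))
            q += b_ij * same
    return q / two_m
```

Because Q is a smooth function of the entries of S, it can serve directly as a training objective, which is what makes an end-to-end differentiable partitioning framework possible in principle.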

AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers

Jiazhi Guan,Kaisiyuan Wang,Zhiliang Xu,Quanwei Yang,Yasheng Sun,Shengyi He,Borong Liang,Yukang Cao,Yingying Li,Haocheng Feng,Errui Ding,Jingdong Wang,Youjian Zhao,Hang Zhou,Ziwei Liu

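Task: Propose AudCast, an audio-driven human video generation framework that aims to generate holistic human videos with accurate lip-sync and delicate co-speech gestures.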

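Motivation: Existing methods mostly focus on facial movements, which leads to incoherent head and body dynamics, so a method that generates holistic human videos is needed.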

Details

Method: Adopts a cascaded Diffusion-Transformers (DiTs) paradigm consisting of an audio-conditioned Holistic Human DiT architecture and a Regional Refinement DiT. Result: Experiments show that the framework generates high-fidelity audio-driven human videos with temporal coherence and fine facial and hand details. Conclusion: AudCast addresses the limitations of existing methods and generates more coherent and detailed holistic human videos. Abstract: Despite the recent progress of audio-driven video generation, existing methods mostly focus on driving facial movements, leading to non-coherent head and body dynamics. Moving forward, it is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures w.r.t. given audio. In this work, we propose AudCast, a generalized audio-driven human video generation framework adopting a cascade Diffusion-Transformers (DiTs) paradigm, which synthesizes holistic human videos based on a reference image and a given audio. 1) First, an audio-conditioned Holistic Human DiT architecture is proposed to directly drive the movements of any human body with vivid gesture dynamics. 2) Then, to enhance hand and face details that are notoriously difficult to handle, a Regional Refinement DiT leverages regional 3D fitting as the bridge to reform the signals, producing the final results. Extensive experiments demonstrate that our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details. Resources can be found at https://guanjz20.github.io/projects/AudCast.

Unpaired Translation of Chest X-ray Images for Lung Opacity Diagnosis via Adaptive Activation Masks and Cross-Domain Alignment

Junzhi Ning,Dominic Marshall,Yijian Gao,Xiaodan Xing,Yang Nan,Yingying Fang,Sheng Zhang,Matthieu Komorowski,Guang Yang

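Task: Propose an unpaired CXR translation framework that converts CXRs with lung opacities into counterparts without lung opacities while preserving semantic features.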

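Motivation: Lung opacities in CXRs often obscure anatomical structures, hindering clear identification of lung borders and localization of lesions, which lowers segmentation and diagnostic accuracy.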

Details

Method: Uses adaptive activation masks to selectively modify lung opacity regions, and applies cross-domain alignment so that translated CXRs agree with the feature maps and prediction labels of a pre-trained lesion classifier. Result: Validated on the RSNA, MIMIC-CXR-JPG, and JSRT datasets, the method achieves better FID and KID scores than existing methods and clearly improves segmentation and lesion classification performance. Conclusion: The method advances CXR analysis through image translation and shows clinical potential, particularly for segmentation and lesion identification. Abstract: Chest X-ray radiographs (CXRs) play a pivotal role in diagnosing and monitoring cardiopulmonary diseases. However, lung opacities in CXRs frequently obscure anatomical structures, impeding clear identification of lung borders and complicating the localization of pathology. This challenge significantly hampers segmentation accuracy and precise lesion identification, which are crucial for diagnosis. To tackle these issues, our study proposes an unpaired CXR translation framework that converts CXRs with lung opacities into counterparts without lung opacities while preserving semantic features. Central to our approach is the use of adaptive activation masks to selectively modify opacity regions in lung CXRs. Cross-domain alignment ensures translated CXRs without opacity issues align with feature maps and prediction labels from a pre-trained CXR lesion classifier, facilitating the interpretability of the translation process. We validate our method using RSNA, MIMIC-CXR-JPG and JSRT datasets, demonstrating superior translation quality through lower Frechet Inception Distance (FID) and Kernel Inception Distance (KID) scores compared to existing methods (FID: 67.18 vs. 210.4, KID: 0.01604 vs. 0.225). Evaluations on RSNA opacity, MIMIC acute respiratory distress syndrome (ARDS) patient CXRs and JSRT CXRs show our method enhances segmentation accuracy of lung borders and improves lesion classification, further underscoring its potential in clinical settings (RSNA: mIoU: 76.58% vs. 62.58%, Sensitivity: 85.58% vs. 77.03%; MIMIC ARDS: mIoU: 86.20% vs. 72.07%, Sensitivity: 92.68% vs. 86.85%; JSRT: mIoU: 91.08% vs. 85.6%, Sensitivity: 97.62% vs. 95.04%). Our approach advances CXR imaging analysis, especially in investigating segmentation impacts through image translation techniques.
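For readers unfamiliar with the reported metrics, the sketch below shows how mIoU and sensitivity are commonly computed from binary masks and labels. The abstract does not say how the paper averages IoU, so per-image averaging is an assumption here, and the function names and toy data are hypothetical.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


def mean_iou(pred_masks, true_masks):
    """Average IoU over images (assumed averaging scheme)."""
    scores = [iou(p, t) for p, t in zip(pred_masks, true_masks)]
    return sum(scores) / len(scores)


def sensitivity(pred_labels, true_labels):
    """True-positive rate: TP / (TP + FN) for binary labels."""
    tp = sum(1 for p, t in zip(pred_labels, true_labels) if p and t)
    fn = sum(1 for p, t in zip(pred_labels, true_labels) if not p and t)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data: two 4-pixel masks and four image-level lesion labels.
pred_masks = [[1, 1, 0, 0], [1, 1, 1, 0]]
true_masks = [[1, 0, 1, 0], [1, 1, 1, 0]]
print(mean_iou(pred_masks, true_masks))
print(sensitivity([1, 0, 1, 1], [1, 1, 1, 0]))
```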

GENIUS: A Generative Framework for Universal Multimodal Search

Sungyeon Kim,Xinliang Zhu,Xiaofan Lin,Muhammet Bastan,Douglas Gray,Suha Kwak

Task: Propose GENIUS, a universal generative retrieval framework that supports diverse tasks across multiple modalities and domains.

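Motivation: Existing generative retrieval models are task-specific and underperform embedding-based retrieval, so a more general and efficient solution is needed.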

Details

Method: Introduces modality-decoupled semantic quantization, which converts multimodal data into discrete IDs, and proposes a query augmentation technique to improve generalization. Result: On the M-BEIR benchmark, GENIUS outperforms prior generative methods, maintains stable retrieval speed, and is competitive across multiple benchmarks. Conclusion: GENIUS approaches the performance of embedding-based retrieval while remaining efficient, demonstrating the potential of generative retrieval. Abstract: Generative retrieval is an emerging approach in information retrieval that generates identifiers (IDs) of target data based on a query, providing an efficient alternative to traditional embedding-based retrieval methods. However, existing models are task-specific and fall short of embedding-based retrieval in performance. This paper proposes GENIUS, a universal generative retrieval framework supporting diverse tasks across multiple modalities and domains. At its core, GENIUS introduces modality-decoupled semantic quantization, transforming multimodal data into discrete IDs encoding both modality and semantics. Moreover, to enhance generalization, we propose a query augmentation that interpolates between a query and its target, allowing GENIUS to adapt to varied query forms. Evaluated on the M-BEIR benchmark, it surpasses prior generative methods by a clear margin. Unlike embedding-based retrieval, GENIUS consistently maintains high retrieval speed across database sizes, with competitive performance across multiple benchmarks. With additional re-ranking, GENIUS often achieves results close to those of embedding-based methods while preserving efficiency.
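The query augmentation described above can be pictured as a convex combination of a query embedding and its target embedding. The abstract only states that the augmentation "interpolates between a query and its target", so the embedding space, the uniform sampling of the mixing weight, and all names below are assumptions made for illustration.

```python
import random


def interpolate(query, target, alpha):
    """Return (1 - alpha) * query + alpha * target for equal-length vectors."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(query) != len(target):
        raise ValueError("query and target must have the same dimension")
    return [(1.0 - alpha) * q + alpha * t for q, t in zip(query, target)]


def augment(query, target, rng, num_samples=3):
    """Draw several augmented queries with uniformly sampled mixing weights.

    Uniform sampling is a hypothetical choice, not taken from the paper.
    """
    return [interpolate(query, target, rng.random()) for _ in range(num_samples)]


rng = random.Random(0)  # seeded for reproducibility
samples = augment([0.0, 0.0], [2.0, 4.0], rng)
for vec in samples:
    print(vec)
```

Each augmented query lies on the segment between the original query and its target, so the retriever sees varied query forms that still point to the same target ID.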

Visuo-Tactile Object Pose Estimation for a Multi-Finger Robot Hand with Low-Resolution In-Hand Tactile Sensing

Lukas Mack,Felix Grüninger,Benjamin A. Richardson,Regine Lendway,Katherine J. Kuchenbecker,Joerg Stueckler

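Task: Propose a method that combines vision, proprioception, and low-resolution tactile measurements to improve 3D pose estimation of occluded objects held in a robot hand.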

Motivation: Occlusion by the robot's own hand makes 3D object pose estimation difficult, which calls for multimodal sensing.

Details

Method: Formulates visuo-tactile pose estimation probabilistically in a factor graph and optimizes the object pose with a robust cost function to reduce the influence of outliers. Result: Simulation and preliminary real-world tests show that low-resolution tactile measurements significantly improve pose estimates under high occlusion and high visual noise. Conclusion: The multimodal approach combining vision and touch effectively addresses pose estimation for objects held in a robot hand. Abstract: Accurate 3D pose estimation of grasped objects is an important prerequisite for robots to perform assembly or in-hand manipulation tasks, but object occlusion by the robot's own hand greatly increases the difficulty of this perceptual task. Here, we propose that combining visual information and proprioception with binary, low-resolution tactile contact measurements from across the interior surface of an articulated robotic hand can mitigate this issue. The visuo-tactile object-pose-estimation problem is formulated probabilistically in a factor graph. The pose of the object is optimized to align with the three kinds of measurements using a robust cost function to reduce the influence of visual or tactile outlier readings. The advantages of the proposed approach are first demonstrated in simulation: a custom 15-DoF robot hand with one binary tactile sensor per link grasps 17 YCB objects while observed by an RGB-D camera. This low-resolution in-hand tactile sensing significantly improves object-pose estimates under high occlusion and also high visual noise. We also show these benefits through grasping tests with a preliminary real version of our tactile hand, obtaining reasonable visuo-tactile estimates of object pose at approximately 13.3 Hz on average.
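To illustrate why a robust cost function limits the effect of outlier readings, the sketch below compares a Huber cost with a plain squared cost on a small set of residuals. The abstract does not name the robust function used, so Huber is an assumed stand-in, and the residual values and the `delta` threshold are hypothetical.

```python
def huber(residual, delta=1.0):
    """Huber cost: quadratic for small residuals, linear beyond delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def squared(residual):
    """Ordinary least-squares cost for comparison."""
    return 0.5 * residual * residual


def total_cost(residuals, cost):
    """Sum a per-residual cost over all measurement residuals."""
    return sum(cost(r) for r in residuals)


# Two well-aligned measurements and one outlier (hypothetical values).
residuals = [0.1, 0.2, 5.0]
print(total_cost(residuals, huber))
print(total_cost(residuals, squared))
```

Because the Huber cost grows only linearly for large residuals, the single outlier contributes far less to the total than under the squared cost, so it pulls the optimized pose less strongly.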