2025 04 05

From Text to Graph: Leveraging Graph Neural Networks for Enhanced Explainability in NLP

Fabio Yáñez-Romero,Andrés Montoyo,Armando Suárez,Yoan Gutiérrez,Ruslan Mitkov

Task: Propose a new method that achieves explainability in natural language processing tasks by automatically converting sentences into graph structures.

Motivation: Current Transformer-based models perform well, but their large size makes them computationally expensive, and their token-based input lacks semantic coherence, which makes model behavior hard to explain.

Details

Method: Convert sentences into graphs that preserve semantic information, use nodes and relations to express fundamental linguistic concepts, and support the use of this knowledge in subsequent tasks.

Result: Experiments show promising results in determining how the key components of the text structure influence a classification task.

Conclusion: The method offers a highly explainable and computationally efficient new approach to natural language processing tasks.

Abstract: Researchers have relegated natural language processing tasks to Transformer-type models, particularly generative models, because these models exhibit high versatility when performing generation and classification tasks. As the size of these models increases, they achieve outstanding results. Given their widespread use, many explainability techniques are developed based on these models. However, this process becomes computationally expensive due to the large size of the models. Additionally, transformers interpret input information through tokens that fragment input words into sequences lacking inherent semantic meaning, complicating the explanation of the model from the very beginning. This study proposes a novel methodology to achieve explainability in natural language processing tasks by automatically converting sentences into graphs and maintaining semantics through nodes and relations that express fundamental linguistic concepts. It also allows the exploitation of this knowledge in subsequent tasks, making it possible to obtain trends and understand how the model associates the different elements inside the text with the explained task. The experiments delivered promising results in determining the most critical components within the text structure for a given classification.

Increasing happiness through conversations with artificial intelligence

Joseph Heffner,Chongyu Qin,Martin Chadwick,Chris Knutsen,Christopher Summerfield,Zeb Kurth-Nelson,Robb B. Rutledge

Task: Study how conversations with AI chatbots affect subjective well-being.

Motivation: Although AI chatbots are widely used, their effect on subjective well-being has not been adequately studied.

Details

Method: Experimentally compare happiness after participants converse with an AI chatbot (N = 334) or write journal entries (N = 193), and use large language models for sentiment analysis.

Result: Happiness was higher after AI conversations than after journaling, especially when discussing negative topics; the AI's sentiment feedback and participants' sentiment prediction errors explain this effect.

Conclusion: AI interactions have a notable effect on human well-being, and emotional expectations play a key role during dialogue.

Abstract: Chatbots powered by artificial intelligence (AI) have rapidly become a significant part of everyday life, with over a quarter of American adults using them multiple times per week. While these tools offer potential benefits and risks, a fundamental question remains largely unexplored: How do conversations with AI influence subjective well-being? To investigate this, we conducted a study where participants either engaged in conversations with an AI chatbot (N = 334) or wrote journal entries (N = 193) on the same randomly assigned topics and reported their momentary happiness afterward. We found that happiness after AI chatbot conversations was higher than after journaling, particularly when discussing negative topics such as depression or guilt. Leveraging large language models for sentiment analysis, we found that the AI chatbot mirrored participants' sentiment while maintaining a consistent positivity bias. When discussing negative topics, participants gradually aligned their sentiment with the AI's positivity, leading to an overall increase in happiness. We hypothesized that the history of participants' sentiment prediction errors, the difference between expected and actual emotional tone when responding to the AI chatbot, might explain this happiness effect. Using computational modeling, we find the history of these sentiment prediction errors over the course of a conversation predicts greater post-conversation happiness, demonstrating a central role of emotional expectations during dialogue. Our findings underscore the effect that AI interactions can have on human well-being.
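
The notion of a sentiment prediction error history in this abstract can be illustrated with a small sketch. It assumes the error on each turn is the actual sentiment minus the expected sentiment, and that the history is summarized as an exponentially decaying sum. The function names, the sign convention, and the decay parameter `gamma` are illustrative assumptions, not the authors' fitted model.

```python
from typing import Sequence


def prediction_errors(expected: Sequence[float], actual: Sequence[float]) -> list[float]:
    """Per-turn sentiment prediction error (assumed convention: actual - expected)."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    return [a - e for e, a in zip(expected, actual)]


def decayed_history(errors: Sequence[float], gamma: float = 0.8) -> float:
    """Summarize an error history so that recent turns weigh more than early ones.

    gamma is a hypothetical forgetting factor in [0, 1]; each step discounts
    the running total by gamma before adding the newest error.
    """
    total = 0.0
    for err in errors:
        total = gamma * total + err
    return total


# Toy conversation: sentiment on a -1..1 scale, three turns.
errors = prediction_errors(expected=[-0.5, -0.25, 0.0], actual=[0.0, 0.25, 0.5])
summary = decayed_history(errors, gamma=0.5)
```

In a model of this kind, the summary value would enter a regression on post-conversation happiness as one predictor; that regression step is omitted here.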

ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

Xiao Wang,Daniil Larionov,Siwei Wu,Yiqi Liu,Steffen Eger,Nafise Sadat Moosavi,Chenghua Lin

Task: Propose ContrastScore, a contrastive evaluation metric for automatically assessing the quality of generated text.

Motivation: Conventional reference-based metrics correlate weakly with human evaluations, and existing LLM-based metrics, especially those using smaller models, still struggle to align with human judgments.

Details

Method: Design ContrastScore, a contrastive evaluation metric, and evaluate it on machine translation and summarization.

Result: ContrastScore correlates more strongly with human judgments than single-model and ensemble baselines, and ContrastScore built on smaller models outperforms a larger model.

Conclusion: ContrastScore assesses generated text quality efficiently, with less bias and greater robustness.

Abstract: Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.

Language Models at the Syntax-Semantics Interface: A Case Study of the Long-Distance Binding of Chinese Reflexive ziji

Xiulin Yang

Task: Explore whether language models can effectively resolve the complex binding patterns of the Chinese reflexive ziji.

Motivation: Examine how language models handle the Chinese reflexive ziji, whose binding is constrained jointly by syntactic and semantic factors.

Details

Method: Build a dataset of 240 synthetic sentences and 320 natural sentences, evaluate 21 language models on it, and compare their performance with native-speaker judgments.

Result: Existing language models fail to replicate human judgments consistently, rely on sequential cues, overlook subtle semantic and syntactic constraints, and are more sensitive to noun-related semantics.

Conclusion: Language models remain limited in handling complex linguistic phenomena and need further improvement.

Abstract: This paper explores whether language models can effectively resolve the complex binding patterns of the Mandarin Chinese reflexive ziji, which are constrained by both syntactic and semantic factors. We construct a dataset of 240 synthetic sentences using templates and examples from syntactic literature, along with 320 natural sentences from the BCC corpus. Evaluating 21 language models against this dataset and comparing their performance to judgments from native Mandarin speakers, we find that none of the models consistently replicates human-like judgments. The results indicate that existing language models tend to rely heavily on sequential cues, though not always favoring the closest strings, and often overlook subtle semantic and syntactic constraints. They tend to be more sensitive to noun-related than verb-related semantics.

Overcoming Vocabulary Constraints with Pixel-level Fallback

Jonas F. Lotz,Hendra Setiawan,Stephan Peitz,Yova Kementchedjhieva

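Task: Propose a pixel-based, vocabulary-free encoder that enhances the multilingual capabilities of pretrained language models.

Motivation: Address the trade-off in subword tokenization between computational efficiency and vocabulary coverage, which leads to poor performance on languages and scripts not prioritized during training.

Details

Method: Generate input embeddings from text rendered as pixels instead of relying only on subword tokenization.

Result: Experiments show the approach substantially improves machine translation performance, enables effective cross-lingual transfer, and outperforms tokenizer-based methods.

Conclusion: Pixel-based representations outperform byte-level approaches and standard vocabulary expansion, enhance the multilingual capabilities of monolingual language models without extensive retraining, and reduce decoding latency through input compression.

Abstract: Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.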

One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Ezzeldin Shereen,Dan Ristea,Burak Hasircioglu,Shae McFadden,Vasilios Mavroudis,Chris Hicks

Task: Propose a poisoning attack against multimodal retrieval-augmented generation (M-RAG) systems to expose their vulnerability.

Motivation: M-RAG systems use a knowledge base to inhibit hallucinations, but they may be attacked through maliciously injected entries; the study explores the feasibility and impact of such attacks.

Details

Method: For visual document retrieval applications, craft a single poisoned image that is retrieved for a variety of queries and influences the output of the generative model.

Result: The attack is effective against many state-of-the-art retrievers and generators but can be ineffective against robust embedding models, exposing the vulnerability of M-RAG pipelines.

Conclusion: M-RAG systems are vulnerable to poisoning attacks, and the underlying weakness may hinder their performance even in benign settings.

Abstract: Multimodal retrieval augmented generation (M-RAG) has recently emerged as a method to inhibit hallucinations of large multimodal models (LMMs) through a factual knowledge base (KB). However, M-RAG also introduces new attack vectors for adversaries that aim to disrupt the system by injecting malicious entries into the KB. In this work, we present a poisoning attack against M-RAG targeting visual document retrieval applications, where the KB contains images of document pages. Our objective is to craft a single image that is retrieved for a variety of different user queries, and consistently influences the output produced by the generative model, thus creating a universal denial-of-service (DoS) attack against the M-RAG system. We demonstrate that while our attack is effective against a diverse range of widely-used, state-of-the-art retrievers (embedding models) and generators (LMMs), it can also be ineffective against robust embedding models. Our attack not only highlights the vulnerability of M-RAG pipelines to poisoning attacks, but also sheds light on a fundamental weakness that potentially hinders their performance even in benign settings.

LL4G: Self-Supervised Dynamic Optimization for Graph-Based Personality Detection

Lingzhi Shen,Yunfei Long,Xiaohao Cai,Guanming Chen,Yuhan Wang,Imran Razzak,Shoaib Jameel

Task: Propose LL4G, an LLM-based self-supervised framework that optimizes graph neural networks for personality detection.

Motivation: Current graph-based personality detection methods struggle with sparse or noisy data and rely on static graph structures, which makes dynamic changes hard to capture.

Details

Method: Use large language models to extract semantic features that generate node representations and infer relationships, adapt the graph structure dynamically, and train a graph neural network jointly on several tasks.

Result: On the Kaggle and Pandora datasets, LL4G outperforms state-of-the-art models.

Conclusion: By combining semantic and structural information, LL4G generates more robust personality profiles.

Abstract: Graph-based personality detection constructs graph structures from textual data, particularly social media posts. Current methods often struggle with sparse or noisy data and rely on static graphs, limiting their ability to capture dynamic changes between nodes and relationships. This paper introduces LL4G, a self-supervised framework leveraging large language models (LLMs) to optimize graph neural networks (GNNs). LLMs extract rich semantic features to generate node representations and to infer explicit and implicit relationships. The graph structure adaptively adds nodes and edges based on input data, continuously optimizing itself. The GNN then uses these optimized representations for joint training on node reconstruction, edge prediction, and contrastive learning tasks. This integration of semantic and structural information generates robust personality profiles. Experimental results on Kaggle and Pandora datasets show LL4G outperforms state-of-the-art models.

Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala

Shanilka Haturusinghe,Tharindu Cyril Weerasooriya,Marcos Zampieri,Christopher M. Homan,S. R. Liyanage

Task: Detect offensive language in Sinhala.

Motivation: Performance on offensive language detection differs sharply between low-resource and high-resource languages, so new fine-tuning strategies are needed to improve results for Sinhala.

Details

Method: Adapt fine-tuning strategies not previously explored for Sinhala and introduce four models: Subasa-XLM-R (with an intermediate pre-finetuning step), plus Subasa-Llama and Subasa-Mistral variants (fine-tuned with a task-specific strategy).

Result: All models outperform existing baselines on the SOLD benchmark dataset; Subasa-XLM-R achieves the highest Macro F1 score (0.84), surpassing large language models such as GPT-4o evaluated under zero-shot settings.

Conclusion: The proposed models perform strongly on Sinhala offensive language detection; the code and models are publicly available.

Abstract: Accurate detection of offensive language is essential for a number of applications related to social media safety. There is a sharp contrast in performance in this task between low and high-resource languages. In this paper, we adapt fine-tuning strategies that have not been previously explored for Sinhala in the downstream task of offensive language detection. Using this approach, we introduce four models: "Subasa-XLM-R", which incorporates an intermediate Pre-Finetuning step using Masked Rationale Prediction. Two variants of "Subasa-Llama" and "Subasa-Mistral" are fine-tuned versions of Llama (3.2) and Mistral (v0.3), respectively, with a task-specific strategy. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection. All our models outperform existing baselines. Subasa-XLM-R achieves the highest Macro F1 score (0.84), surpassing state-of-the-art large language models like GPT-4o when evaluated on the same SOLD benchmark dataset under zero-shot settings. The models and code are publicly available.
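
Since the headline number here is a Macro F1 score, a short sketch of how that metric is computed may help. Macro F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. The label names `OFF` and `NOT` and the toy predictions below are illustrative assumptions, not data from the paper.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical labels: one offensive post is missed.
gold = ["OFF", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)  # per-class F1: OFF = 2/3, NOT = 4/5
```

Averaging per class rather than per example is what makes the metric suitable for imbalanced detection tasks, where accuracy alone can hide poor recall on the minority class.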

LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks

Seunghyun Yoo

Task: Study how large language models (LLMs) exploit semantic ambiguity to generate deceptive puzzles.

Motivation: Explore the agentic behaviors LLMs exhibit in adversarial settings and their exploitation of semantic ambiguity, in order to surface potential ethical concerns.

Details

Method: Systematically compare puzzles produced through zero-shot prompting, role-injected adversarial prompts, and human-crafted examples, quantifying semantic ambiguity with HateBERT alongside subjective human evaluations.

Result: Adversarial agent behaviors significantly increase semantic ambiguity, raising cognitive load and reducing fairness in puzzle solving.

Conclusion: The study reveals agentic qualities of LLMs and underscores ethical considerations for safely deploying autonomous language systems in educational technologies and entertainment.

Abstract: Recent advancements in Large Language Models (LLMs) have not only showcased impressive creative capabilities but also revealed emerging agentic behaviors that exploit linguistic ambiguity in adversarial settings. In this study, we investigate how an LLM, acting as an autonomous agent, leverages semantic ambiguity to generate deceptive puzzles that mislead and challenge human users. Inspired by the popular puzzle game "Connections", we systematically compare puzzles produced through zero-shot prompting, role-injected adversarial prompts, and human-crafted examples, with an emphasis on understanding the underlying agent decision-making processes. Employing computational analyses with HateBERT to quantify semantic ambiguity, alongside subjective human evaluations, we demonstrate that explicit adversarial agent behaviors significantly heighten semantic ambiguity -- thereby increasing cognitive load and reducing fairness in puzzle solving. These findings provide critical insights into the emergent agentic qualities of LLMs and underscore important ethical considerations for evaluating and safely deploying autonomous language systems in both educational technologies and entertainment.

State-of-the-Art Translation of Text-to-Gloss using mBART : A case study of Bangla

Sharif Md. Abdullah,Abhijit Paul,Shebuti Rayana,Ahmedul Kabir,Zarif Masud

Task: Study text-to-gloss translation for Bangla Sign Language (BdSL).

Motivation: Although Bangladesh has a deaf and mute population of 1.7 million, BdSL remains understudied, and there is no prior work on the text-to-gloss translation task in particular.

Details

Method: Adapt the grammatical-rule-based gloss generation used for German and American Sign Language, use an LLM to generate synthetic data, and augment the data through back-translation and text generation. Then fine-tune pretrained mBART-50 and mBERT models, and train GRU, RNN, and a novel sequence-to-sequence model with multi-head attention.

Result: The fine-tuned mBART-50 model performs best (SacreBLEU = 79.53) and achieves state-of-the-art performance on the PHOENIX-14T benchmark (SacreBLEU = 63.89, among other metrics).

Conclusion: The study proposes a new paradigm for text-to-gloss translation based on mBART models and shows that rule-based synthetic datasets greatly benefit the BdSL task.

Abstract: Despite a large deaf and dumb population of 1.7 million, Bangla Sign Language (BdSL) remains an understudied domain. Specifically, there is no prior work on the Bangla text-to-gloss translation task. To address this gap, we begin by addressing the dataset problem. We take inspiration from the grammatical rule-based gloss generation used in German and American Sign Language (ASL) and adapt it for BdSL. We also leverage an LLM to generate synthetic data and use back-translation and text generation for data augmentation. With the dataset prepared, we began experimentation. We fine-tuned the pretrained mBART-50 and mBERT-multiclass-uncased models on our dataset. We also trained GRU, RNN and a novel seq-to-seq model with multi-head attention. We observe significantly high performance (SacreBLEU=79.53) when fine-tuning the pretrained mBART-50 multilingual model from Facebook. We then explored why we observe such high performance with mBART. We soon noticed an interesting property of mBART -- it was trained on shuffled and masked text data. And as we know, gloss form has a shuffling property. So we hypothesize that mBART is inherently good at text-to-gloss tasks. To test this hypothesis, we trained mBART-50 on the PHOENIX-14T benchmark and compared it with existing literature. Our mBART-50 finetune demonstrated State-of-the-Art performance on the PHOENIX-14T benchmark, far outperforming existing models in all 6 metrics (SacreBLEU = 63.89, BLEU-1 = 55.14, BLEU-2 = 38.07, BLEU-3 = 27.13, BLEU-4 = 20.68, COMET = 0.624). Based on the results, this study proposes a new paradigm for the text-to-gloss task using mBART models. Additionally, our results show that the BdSL text-to-gloss task can greatly benefit from a rule-based synthetic dataset.

Measurement of LLM's Philosophies of Human Nature

Minheng Ni,Ennan Wu,Zidong Gong,Zhengyuan Yang,Linjie Li,Chung-Ching Lin,Kevin Lin,Lijuan Wang,Wangmeng Zuo

Task: Design and validate a standardized psychological scale for large language models (M-PHNS) that assesses their attitudes toward humans, and propose a mental loop learning framework to optimize their value systems.

Motivation: The widespread application of AI, together with reports of conflicts or violations involving it, has raised societal concerns about interacting with AI systems, creating a need to assess and improve AI attitudes toward humans.

Details

Method: Based on Wrightsman's Philosophies of Human Nature Scale (PHNS), design the M-PHNS scale to assess LLMs' attitudes toward humans; propose a mental loop learning framework that optimizes an LLM's value system by constructing moral scenarios.

Result: Current LLMs show a general lack of trust in humans, and a model's intelligence level correlates significantly and negatively with its trust in humans; mental loop learning significantly improves LLMs' trust in humans.

Conclusion: The M-PHNS scale can be used to diagnose cognitive biases in LLMs, and mental loop learning offers a potential solution for ethical learning in AI.

Abstract: The widespread application of artificial intelligence (AI) in various tasks, along with frequent reports of conflicts or violations involving AI, has sparked societal concerns about interactions with AI systems. Based on Wrightsman's Philosophies of Human Nature Scale (PHNS), a scale empirically validated over decades to effectively assess individuals' attitudes toward human nature, we design a standardized psychological scale specifically targeting large language models (LLMs), named the Machine-based Philosophies of Human Nature Scale (M-PHNS). By evaluating LLMs' attitudes toward human nature across six dimensions, we reveal that current LLMs exhibit a systemic lack of trust in humans, and there is a significant negative correlation between the model's intelligence level and its trust in humans. Furthermore, we propose a mental loop learning framework, which enables an LLM to continuously optimize its value system during virtual interactions by constructing moral scenarios, thereby improving its attitude toward human nature. Experiments demonstrate that mental loop learning significantly enhances their trust in humans compared to persona or instruction prompts. This finding highlights the potential of human-based psychological assessments for LLMs, which can not only diagnose cognitive biases but also provide a potential solution for ethical learning in artificial intelligence. We release the M-PHNS evaluation code and data at https://github.com/kodenii/M-PHNS.

Improving Harmful Text Detection with Joint Retrieval and External Knowledge

Zidong Yu,Shuo Wang,Nan Jiang,Weiqiang Huang,Xu Han,Junliang Du

Task: Propose a joint retrieval framework that combines pretrained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection.

Motivation: As AI-generated content expands across digital platforms, harmful text detection has become a crucial task in developing and deploying large language models.

Details

Method: Use a joint retrieval approach that integrates pretrained language models with knowledge graphs and draws on external contextual information to capture nuanced harmful content.

Result: Experiments show that the joint retrieval approach significantly outperforms single-model baselines in low-resource training scenarios and multilingual environments.

Conclusion: The method contributes to AI safety; future research should optimize computational efficiency, enhance model interpretability, and expand multimodal detection capabilities.

Abstract: Harmful text detection has become a crucial task in the development and deployment of large language models, especially as AI-generated content continues to expand across digital platforms. This study proposes a joint retrieval framework that integrates pre-trained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection. Experimental results demonstrate that the joint retrieval approach significantly outperforms single-model baselines, particularly in low-resource training scenarios and multilingual environments. The proposed method effectively captures nuanced harmful content by leveraging external contextual information, addressing the limitations of traditional detection models. Future research should focus on optimizing computational efficiency, enhancing model interpretability, and expanding multimodal detection capabilities to better tackle evolving harmful content patterns. This work contributes to the advancement of AI safety, ensuring more trustworthy and reliable content moderation systems.

CoTAL: Human-in-the-Loop Prompt Engineering, Chain-of-Thought Reasoning, and Active Learning for Generalizable Formative Assessment Scoring

Clayton Cohn,Nicole Hutchins,Ashwin T S,Gautam Biswas

Task: Study how chain-of-thought (CoT) prompting combined with active learning (CoTAL) can improve the scoring performance of large language models (LLMs) on formative assessments.

Motivation: Explore how well chain-of-thought prompting generalizes across formative assessments in multiple domains (such as science, computing, and engineering), and improve the accuracy and feedback quality of automated scoring.

Details

Method: Use Evidence-Centered Design (ECD) principles to develop curriculum-aligned formative assessments and rubrics, apply human-in-the-loop prompt engineering to automate scoring, and incorporate teacher and student feedback to iteratively refine assessment questions, rubrics, and prompts.

Result: CoTAL improves GPT-4's scoring performance by up to 24.5% over a non-prompt-engineered baseline, and both teachers and students view its scoring and explanations as effective.

Conclusion: CoTAL is an effective LLM-based scoring approach that improves the accuracy and explanation quality of automated scoring through iterative refinement and teacher and student feedback.

Abstract: Large language models (LLMs) have created new opportunities to assist teachers and support student learning. Methods such as chain-of-thought (CoT) prompting enable LLMs to grade formative assessments in science, providing scores and relevant feedback to students. However, the extent to which these methods generalize across curricula in multiple domains (such as science, computing, and engineering) remains largely untested. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) principles to develop curriculum-aligned formative assessments and rubrics, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates teacher and student feedback to iteratively refine assessment questions, grading rubrics, and LLM prompts for automated grading. Our findings demonstrate that CoTAL improves GPT-4's scoring performance, achieving gains of up to 24.5% over a non-prompt-engineered baseline. Both teachers and students view CoTAL as effective in scoring and explaining student responses, each providing valuable refinements to enhance grading accuracy and explanation quality.

LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models

Weibin Liao,Xin Gao,Tianyu Jia,Rihong Qiu,Yifan Zhu,Yang Lin,Xu Chu,Junfeng Zhao,Yasha Wang

Task: Propose LearNAT, a framework that improves the performance of open-source LLMs on complex NL2SQL tasks through task decomposition and reinforcement learning.

Motivation: Existing NL2SQL methods rely on closed-source LLMs or on fine-tuned open-source models, which struggle with complex tasks because of the semantic gap between queries and schemas and the indirect expression of query objectives.

Details

Method: LearNAT has three key components: AST-guided decomposition synthesis, margin-aware reinforcement learning, and adaptive demonstration reasoning.

Result: On the Spider and BIRD datasets, LearNAT enables a 7B-parameter open-source LLM to approach GPT-4 performance with better efficiency.

Conclusion: Through task decomposition and reinforcement learning, LearNAT substantially improves open-source LLMs on complex NL2SQL tasks while remaining efficient and accessible.

Abstract: Natural Language to SQL (NL2SQL) has emerged as a critical task for enabling seamless interaction with databases. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable performance in this domain. However, existing NL2SQL methods predominantly rely on closed-source LLMs leveraging prompt engineering, while open-source models typically require fine-tuning to acquire domain-specific knowledge. Despite these efforts, open-source LLMs struggle with complex NL2SQL tasks due to the indirect expression of user query objectives and the semantic gap between user queries and database schemas. Inspired by the application of reinforcement learning in mathematical problem-solving to encourage step-by-step reasoning in LLMs, we propose LearNAT (Learning NL2SQL with AST-guided Task Decomposition), a novel framework that improves the performance of open-source LLMs on complex NL2SQL tasks through task decomposition and reinforcement learning. LearNAT introduces three key components: (1) a Decomposition Synthesis Procedure that leverages Abstract Syntax Trees (ASTs) to guide efficient search and pruning strategies for task decomposition, (2) Margin-aware Reinforcement Learning, which employs fine-grained step-level optimization via DPO with AST margins, and (3) Adaptive Demonstration Reasoning, a mechanism for dynamically selecting relevant examples to enhance decomposition capabilities. Extensive experiments on two benchmark datasets, Spider and BIRD, demonstrate that LearNAT enables a 7B-parameter open-source LLM to achieve performance comparable to GPT-4, while offering improved efficiency and accessibility.
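
The phrase "DPO with AST margins" can be made concrete with a minimal sketch of a DPO-style preference loss that includes an additive margin. The standard DPO loss for one preference pair is the negative log-sigmoid of the scaled difference in policy-versus-reference log-ratios; subtracting a margin inside the sigmoid is one common way to demand a larger gap for some pairs. How LearNAT actually derives its margin from ASTs is not stated in the abstract, so the `margin` argument here is a plain number supplied by the caller, and the function name and default `beta` are assumptions.

```python
import math


def dpo_margin_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
    margin: float = 0.0,
) -> float:
    """DPO loss for one preference pair with an additive margin (assumed form).

    loss = -log(sigmoid(beta * (chosen_log_ratio - rejected_log_ratio) - margin))
    """
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_log_ratio - rejected_log_ratio) - margin
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow in exp.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# With identical policy and reference log-probs the loss is log(2);
# a positive margin makes the same pair more costly.
base = dpo_margin_loss(-5.0, -7.0, -5.0, -7.0)
with_margin = dpo_margin_loss(-5.0, -7.0, -5.0, -7.0, margin=1.0)
```

In a step-level setting, a loss like this would be applied to each decomposition step rather than to a whole response, with a larger margin for pairs whose structures differ more.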

The quasi-semantic competence of LLMs: a case study on the part-whole relation

Mattia Proietti,Alessandro Lenci

Task: Study how well large language models (LLMs) understand the part-whole relation (meronymy).

Motivation: The part-whole relation plays a key role in lexical organization but is understudied, so LLMs' semantic competence in this area needs to be assessed.

Details

Method: Use ConceptNet relation data and human-generated semantic feature norms, and analyze at three levels: behavioral testing, sentence probability scoring, and concept representation analysis.

Result: LLMs' knowledge of the part-whole relation is only partial; they have just a "quasi-semantic" competence and fail to fully capture deep inferential properties.

Conclusion: LLMs' semantic competence on the part-whole relation is limited, and further research is needed to deepen their understanding.

Abstract: Understanding the extent and depth of the semantic competence of Large Language Models (LLMs) is at the center of the current scientific agenda in Artificial Intelligence (AI) and Computational Linguistics (CL). We contribute to this endeavor by investigating their knowledge of the part-whole relation, a.k.a. meronymy, which plays a crucial role in lexical organization but is significantly understudied. We used data from ConceptNet relations (Speer et al., 2016) and human-generated semantic feature norms (McRae et al., 2005) to explore the abilities of LLMs to deal with part-whole relations. We employed several methods based on three levels of analysis: i.) behavioral testing via prompting, where we directly queried the models on their knowledge of meronymy, ii.) sentence probability scoring, where we tested models' abilities to discriminate correct (real) and incorrect (asymmetric counterfactual) part-whole relations, and iii.) concept representation analysis in vector space, where we proved the linear organization of the part-whole concept in the embedding and unembedding spaces. These analyses present a complex picture that reveals that the LLMs' knowledge of this relation is only partial. They have just a "quasi-semantic" competence and still fall short of capturing deep inferential properties.

Scaling Analysis of Interleaved Speech-Text Language Models

Gallil Maimon,Michael Hassid,Amit Roth,Yossi Adi

Task: Study how the scaling efficiency of interleaved speech-text language models (SLMs) differs from that of textless SLMs.

Motivation: Prior analyses suggest that SLMs need far more compute and data than text models, but modern SLMs are often initialized from pretrained text LMs with speech-text interleaving to enable knowledge transfer, so it is worth checking whether interleaved SLMs scale more efficiently.

Details

Method: Train several dozen interleaved SLMs and analyze the scaling trends, examining how the allocation of compute affects model performance.

Result: Interleaved SLMs scale more efficiently with compute, and their scaling dynamics differ significantly from those of textless SLMs, suggesting that more of the compute budget should go to model size than to training tokens.

Conclusion: The scaled-up interleaved SLM achieves comparable performance with leading models on speech semantic metrics while using less compute and data, which supports the efficiency of the approach.

Abstract: Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. They predict that SLMs require much more compute and data compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question - Do interleaved SLMs scale more efficiently than textless-SLMs? In this paper we answer with a resounding yes! We conduct scaling analysis of interleaved SLMs by training several dozen and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the scaling-dynamics are significantly different than textless-SLMs, suggesting one should allocate notably more of the compute budget for increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Results suggest that our scaled up model achieves comparable performance with leading models on speech semantic metrics while using less compute and data than other approaches. We open source models, samples, and data - https://pages.cs.huji.ac.il/adiyoss-lab/sims.

DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers

Max Müller-Eberstein,Mike Zhang,Elisa Bassignana,Peter Brunsgaard Trolle,Rob van der Goot

Task: Evaluate the cultural awareness of large language models (LLMs) for Danish.

Motivation: LLMs can interact in many languages but show insufficient cultural awareness, especially for non-English communities, so their capacity for cultural adaptation needs study.

Details

Method: Collect 1,038 interactions in which native Danish speakers prompt different models, and analyze the challenges of cultural adaptation.

Result: Automatically translated data are insufficient to train or measure cultural adaptation, whereas training on native-speaker data can more than double response acceptance rates.

Conclusion: The authors release DaKultur, the first native Danish cultural awareness dataset, and stress the need for more native-speaker data to improve the cultural adaptation of LLMs.

Abstract: Large Language Models (LLMs) have seen widespread societal adoption. However, while they are able to interact with users in languages beyond English, they have been shown to lack cultural awareness, providing anglocentric or inappropriate responses for underrepresented language communities. To investigate this gap and disentangle linguistic versus cultural proficiency, we conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation: Particularly, how currently employed automatically translated data are insufficient to train or measure cultural adaptation, and how training on native-speaker data can more than double response acceptance rates. We release our study data as DaKultur - the first native Danish cultural awareness dataset.

LSC-ADL: An Activity of Daily Living (ADL)-Annotated Lifelog Dataset Generated via Semi-Automatic Clustering

Minh-Quan Ho-Le,Duy-Khang Ho,Van-Tu Ninh,Cathal Gurrin,Minh-Triet Tran

Task: Build LSC-ADL, a lifelog dataset annotated with Activities of Daily Living (ADLs), to improve semantic understanding and explainability in lifelog retrieval.

Motivation: Existing lifelog retrieval methods overlook activity-level annotations, which capture temporal relationships and enrich semantic understanding.

Details

Method: Use a semi-automatic approach that combines the HDBSCAN algorithm for intra-class clustering with human verification to generate accurate ADL annotations.

Result: The LSC-ADL dataset fills a gap in existing research and offers a more context-aware representation of daily life.

Conclusion: The dataset is expected to advance research in lifelog retrieval, activity recognition, and egocentric vision, improving the accuracy and interpretability of retrieved content.

Abstract: Lifelogging involves continuously capturing personal data through wearable cameras, providing an egocentric view of daily activities. Lifelog retrieval aims to search and retrieve relevant moments from this data, yet existing methods largely overlook activity-level annotations, which capture temporal relationships and enrich semantic understanding. In this work, we introduce LSC-ADL, an ADL-annotated lifelog dataset derived from the LSC dataset, incorporating Activities of Daily Living (ADLs) as a structured semantic layer. Using a semi-automatic approach featuring the HDBSCAN algorithm for intra-class clustering and human-in-the-loop verification, we generate accurate ADL annotations to enhance retrieval explainability. By integrating action recognition into lifelog retrieval, LSC-ADL bridges a critical gap in existing research, offering a more context-aware representation of daily life. We believe this dataset will advance research in lifelog retrieval, activity recognition, and egocentric vision, ultimately improving the accuracy and interpretability of retrieved content. The ADL annotations can be downloaded at https://bit.ly/lsc-adl-annotations.

AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology

Xiang Feng,Wentao Jiang,Zengmao Wang,Yong Luo,Pingbo Xu,Baosheng Yu,Hua Jin,Bo Du,Jing Zhang

Task: Systematically evaluate the reasoning capabilities of large language models (LLMs) in anesthesiology and analyze the key factors that influence their performance.

Motivation: Although the use of LLMs in medicine has attracted wide attention, their reasoning capabilities in specialized domains such as anesthesiology remain underexplored.

Details

Method: Introduce the AnesBench benchmark, which assesses anesthesiology-related reasoning at three levels (factual retrieval, hybrid reasoning, and complex decision-making), and experimentally analyze the effects of model characteristics, training strategies, and reasoning techniques.

Result: The experiments explore how model scale, chain-of-thought length, and language transferability affect performance, and assess the effectiveness of different training strategies and reasoning techniques.

Conclusion: AnesBench and its associated datasets and code will be publicly released to support research on LLMs in anesthesiology.

Abstract: The application of large language models (LLMs) in the medical field has gained significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. In this paper, we systematically evaluate the reasoning capabilities of LLMs in anesthesiology and analyze key factors influencing their performance. To this end, we introduce AnesBench, a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Through extensive experiments, we first explore how model characteristics, including model scale, Chain of Thought (CoT) length, and language transferability, affect reasoning performance. Then, we further evaluate the effectiveness of different training strategies, leveraging our curated anesthesiology-related dataset, including continuous pre-training (CPT) and supervised fine-tuning (SFT). Additionally, we investigate how test-time reasoning techniques, such as Best-of-N sampling and beam search, influence reasoning performance, and assess the impact of reasoning-enhanced model distillation, specifically DeepSeek-R1. We will publicly release AnesBench, along with our CPT and SFT training datasets and evaluation code at https://github.com/MiliLab/AnesBench.
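
Best-of-N sampling, one of the test-time techniques named in this abstract, is simple enough to sketch: draw N candidate answers and keep the one a scoring function ranks highest. The `generate` and `score` callables below are hypothetical stand-ins for a language model and a verifier or reward model; nothing here reflects the specific setup used in AnesBench.

```python
import random
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int = 4,
    seed: int = 0,
) -> str:
    """Sample n candidates for the prompt and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)  # seeded so repeated runs sample identically
    candidates = [generate(prompt, rng) for _ in range(n)]
    # max() keeps the first candidate on ties, so the result is deterministic.
    return max(candidates, key=lambda answer: score(prompt, answer))


# Toy stand-ins: the "model" picks a canned answer, the "scorer" prefers
# longer answers. Both are placeholders for illustration only.
CANNED = ["A", "Answer B", "A longer answer C"]


def toy_generate(prompt: str, rng: random.Random) -> str:
    return rng.choice(CANNED)


def toy_score(prompt: str, answer: str) -> float:
    return float(len(answer))


picked = best_of_n("Which option?", toy_generate, toy_score, n=8)
```

The cost grows linearly with N, since every candidate requires a full generation plus a scoring call, which is why the choice of N is usually treated as a compute-versus-accuracy trade-off.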

Aligned Better, Listen Better for Audio-Visual Large Language Models

Yuxin Guo,Shuailei Ma,Shijie Ma,Xiaoyi Bao,Chen-Wei Xie,Kecheng Zheng,Tingyu Weng,Siyang Sun,Yun Zheng,Wei Zou

Task: Propose a fine-grained audio-visual large language model (Dolphin) and an audio-visual dataset (AVU) to address the shortcomings of existing models in exploiting audio information.

Motivation: Audio is essential for multimodal video understanding, yet existing Video-LLMs and AV-LLMs exploit audio information poorly, which leads to weak understanding and hallucinations.

Details

Method: On the architecture side, the Dolphin model aligns the audio and visual modalities spatially through an audio-visual multi-scale adapter and temporally through audio-visual interleaved merging; on the data side, the AVU dataset is built with 5.2 million diverse data tuples. Result: Experiments show that the model performs strongly in audio-visual understanding and effectively mitigates hallucinations. Conclusion: The proposed Dolphin model and AVU dataset markedly improve audio-visual understanding and address the shortcomings of existing models. Abstract: Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies complementary information to vision. On the other hand, video large language models (Video-LLMs) can encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak understanding and hallucinations. To address these issues, we examine both the model architecture and the dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM, namely Dolphin. The concurrent alignment of audio and visual modalities in both temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter for multi-scale information aggregation, which achieves spatial alignment. For temporal alignment, we propose audio-visual interleaved merging. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset, called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show our model not only achieves remarkable performance in audio-visual understanding, but also mitigates potential hallucinations.

Adapting Large Language Models for Multi-Domain Retrieval-Augmented-Generation

Alexandre Misrahi,Nadezhda Chirkova,Maxime Louis,Vassilina Nikoulina

Task: Introduce a diverse benchmark and systematically test the out-of-domain generalization of RAG tuning strategies.

Motivation: Address the lack of diverse benchmarks and the poor out-of-domain generalization in multi-domain applications.

Details

Method: Use a diverse benchmark spanning 8 sources and 13 domains, and improve out-of-domain performance through sequence-level distillation with teacher-generated labels. Result: Standard fine-tuning generalizes poorly, whereas sequence-level distillation clearly improves out-of-domain performance. Conclusion: The study identifies key strategies for improving the robustness of multi-domain RAG. Abstract: Retrieval-Augmented Generation (RAG) enhances LLM factuality, but multi-domain applications face challenges like lack of diverse benchmarks and poor out-of-domain generalization. The first contribution of this work is to introduce a diverse benchmark comprising a variety of question-answering tasks from 8 sources and covering 13 domains. Our second contribution is to systematically test out-of-domain generalization for typical RAG tuning strategies. While our findings reveal that standard fine-tuning fails to generalize effectively, we show that sequence-level distillation with teacher-generated labels improves out-of-domain performance by providing more coherent supervision. Our findings highlight key strategies for improving multi-domain RAG robustness.

FreSca: Unveiling the Scaling Space in Diffusion Models

Chao Huang,Susan Liang,Yunlong Tang,Li Ma,Yapeng Tian,Chenliang Xu

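Task: Explore the frequency-domain properties of noise predictions in diffusion models and propose FreSca, a method that adjusts guidance scaling independently for different frequency bands.

Motivation: Diffusion models offer controllability for image tasks through noise predictions and classifier-free guidance scaling, but their frequency-domain properties and their potential for fine-grained semantic manipulation have not been fully studied.

Details

Method: A Fourier analysis of noise predictions shows that their low- and high-frequency components evolve differently during diffusion; based on this, FreSca adjusts guidance scaling independently for different frequency bands. Result: FreSca markedly improves existing image editing methods and yields quantitative gains in image understanding tasks such as depth estimation. Conclusion: Through frequency-domain analysis, FreSca reveals new properties of noise predictions and provides an effective tool for fine-grained control of diffusion models. Abstract: Diffusion models offer impressive controllability for image tasks, primarily through noise predictions that encode task-specific information and classifier-free guidance enabling adjustable scaling. This scaling mechanism implicitly defines a "scaling space" whose potential for fine-grained semantic manipulation remains underexplored. We investigate this space, starting with inversion-based editing where the difference between conditional/unconditional noise predictions carries key semantic information. Our core contribution stems from a Fourier analysis of noise predictions, revealing that their low- and high-frequency components evolve differently throughout diffusion. Based on this insight, we introduce FreSca, a straightforward method that applies guidance scaling independently to different frequency bands in the Fourier domain. FreSca demonstrably enhances existing image editing methods without retraining. Excitingly, its effectiveness extends to image understanding tasks such as depth estimation, yielding quantitative gains across multiple datasets.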
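
The idea of scaling guidance per frequency band can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the single radial cutoff, and the two scale factors are assumptions made for illustration, and a real diffusion pipeline would apply the operation to multi-channel latents at every denoising step.

```python
import numpy as np

def frequency_scaled_guidance(eps_cond, eps_uncond, low_scale, high_scale, cutoff=0.1):
    """Apply separate guidance scales to low- and high-frequency bands.

    eps_cond, eps_uncond: 2D arrays holding conditional and unconditional
    noise predictions (hypothetical single-channel inputs).
    cutoff: radial threshold in cycles per sample (an assumed parameter).
    """
    delta = eps_cond - eps_uncond                    # guidance direction
    spectrum = np.fft.fftshift(np.fft.fft2(delta))   # zero frequency at the center
    h, w = delta.shape
    yy, xx = np.ogrid[:h, :w]
    # Distance of each frequency bin from the center, in cycles per sample.
    radius = np.sqrt(((yy - h // 2) / h) ** 2 + ((xx - w // 2) / w) ** 2)
    scale = np.where(radius <= cutoff, low_scale, high_scale)
    scaled = np.fft.ifft2(np.fft.ifftshift(spectrum * scale)).real
    return eps_uncond + scaled
```

When `low_scale` equals `high_scale`, the function reduces to ordinary classifier-free guidance, `eps_uncond + s * (eps_cond - eps_uncond)`, which makes it easy to sanity-check.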

Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation

Chuanqi Cheng,Jian Guan,Wei Wu,Rui Yan

Task: Propose ViLaMP, a hierarchical video-language model for efficient processing of long video content.

Motivation: Existing methods for long videos sacrifice temporal dependencies or semantic information because of high computational costs; a new approach is needed that reduces redundancy while preserving key information.

Details

Method: Following the principle of differential distillation, mixed-precision processing is achieved through differential keyframe selection and differential feature merging. Result: ViLaMP performs strongly on four video understanding benchmarks, especially on long videos, and can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU. Conclusion: ViLaMP reaches leading computational efficiency and performance, offering an effective solution for long-video processing. Abstract: Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLaMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLaMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLaMP's superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLaMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance.
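
One way to picture keyframe selection that balances query relevance against distinctiveness is the greedy sketch below. It is only an illustration under stated assumptions: the cosine-similarity scoring, the `sim_threshold` parameter, and the greedy order are hypothetical choices, not the selection rule described in the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_keyframes(frame_feats, query_feat, max_keyframes, sim_threshold=0.9):
    """Greedily pick query-relevant frames that are not near-duplicates.

    Frames are visited from most to least relevant to the query; a frame is
    kept only if it is sufficiently different from every frame kept so far.
    """
    relevance = [cosine(f, query_feat) for f in frame_feats]
    order = sorted(range(len(frame_feats)), key=lambda i: relevance[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) >= max_keyframes:
            break
        if all(cosine(frame_feats[i], frame_feats[j]) < sim_threshold for j in selected):
            selected.append(i)
    return sorted(selected)  # return indices in temporal order
```

In this sketch, frames that are not selected would be the candidates for the more aggressive feature merging the abstract describes.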

UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting

Jaehoon Choi,Dongki Jung,Yonghan Lee,Sungmin Eum,Dinesh Manocha,Heesung Kwon

Task: Propose UAVTwin, a method for creating digital twins from real-world environments and augmenting data for downstream models on unmanned aerial vehicles (UAVs).

Motivation: Address the challenges that dynamic objects and appearance variations in complex scenes pose for 3D Gaussian Splatting (3DGS) modeling, and improve the performance of downstream models.

Details

Method: Combine 3DGS background reconstruction with controllable synthetic human models, and propose a new appearance modeling strategy and a mask refinement module. Result: Neural rendering quality improves (PSNR up by 1.23 dB) and data augmentation is effective (mAP up by 2.5% to 13.7%). Conclusion: UAVTwin is the first approach to generate high-fidelity digital twins based on 3DGS for UAV perception, and it markedly improves performance on UAV perception tasks. Abstract: We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs). Specifically, our approach focuses on synthesizing foreground components, such as various human instances in motion within complex scene backgrounds, from UAV perspectives. This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses. To the best of our knowledge, UAVTwin is the first approach for UAV-based perception that is capable of generating high-fidelity digital twins based on 3DGS. The proposed work significantly enhances downstream models through data augmentation for real-world environments with multiple dynamic objects and significant appearance variations, both of which typically introduce artifacts in 3DGS-based modeling. To tackle these challenges, we propose a novel appearance modeling strategy and a mask refinement module to enhance the training of 3D Gaussian Splatting. We demonstrate the high quality of neural rendering by achieving a 1.23 dB improvement in PSNR compared to recent methods. Furthermore, we validate the effectiveness of data augmentation by showing a 2.5% to 13.7% improvement in mAP for the human detection task.
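
To put the reported PSNR gain in perspective, the short sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE). The helper names are illustrative, and the conversion assumes both methods are evaluated with the same peak value.

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)

ratio = mse_ratio_for_gain(1.23)  # about 0.75
```

Under that assumption, a 1.23 dB improvement corresponds to a mean squared error roughly one quarter lower than the baseline.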

Cognitive Memory in Large Language Models

Lianlei Shan,Shixian Luo,Zezhou Zhu,Yu Yuan,Yong Wu

Task: Analyze memory mechanisms in large language models (LLMs) and their importance.

Motivation: Examine the role of memory mechanisms in producing context-rich responses, reducing hallucinations, and improving efficiency.

Details

Method: Categorize memory into sensory, short-term, and long-term memory, and discuss in detail the implementation and management of text-based, KV cache-based, parameter-based, and hidden-state-based memory. Result: Implementation strategies for several memory mechanisms are presented, including acquisition and management of text memory, compression and selection of the KV cache, parameter-based memory methods, and improvements to hidden-state memory. Conclusion: The paper gives a comprehensive analysis of the importance of LLM memory mechanisms and points out future research directions. Abstract: This paper examines memory mechanisms in Large Language Models (LLMs), emphasizing their importance for context-rich responses, reduced hallucinations, and improved efficiency. It categorizes memory into sensory, short-term, and long-term, with sensory memory corresponding to input prompts, short-term memory processing immediate context, and long-term memory implemented via external databases or structures. The text-based memory section covers acquisition (selection and summarization), management (updating, accessing, storing, and resolving conflicts), and utilization (full-text search, SQL queries, semantic search). The KV cache-based memory section discusses selection methods (regularity-based summarization, score-based approaches, special token embeddings) and compression techniques (low-rank compression, KV merging, multimodal compression), along with management strategies like offloading and shared attention mechanisms. Parameter-based memory methods (LoRA, TTT, MoE) transform memories into model parameters to enhance efficiency, while hidden-state-based memory approaches (chunk mechanisms, recurrent transformers, Mamba model) improve long-text processing by combining RNN hidden states with current methods. Overall, the paper offers a comprehensive analysis of LLM memory mechanisms, highlighting their significance and future research directions.

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

Shaojin Wu,Mengqi Huang,Wenxu Wu,Yufeng Cheng,Fei Ding,Qian He

Task: Address the challenges of data scalability and subject expansibility in subject-driven generation.

Motivation: Subject-driven generation is widely used in image generation but faces challenges in data scalability and subject expansibility.

Details

Method: Propose a high-consistency data synthesis pipeline and introduce the UNO model, which comprises progressive cross-modal alignment and universal rotary position embedding. Result: Experiments show that the method achieves high consistency and controllability in both single-subject and multi-subject driven generation. Conclusion: The proposed method effectively addresses data scalability and subject expansibility in subject-driven generation. Abstract: Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.

Inference-Time Scaling for Generalist Reward Modeling

Zijun Liu,Peiyi Wang,Runxin Xu,Shirong Ma,Chong Ruan,Peng Li,Yang Liu,Yu Wu

Task: Study how improved reward modeling (RM) and learning methods can raise the inference-time scalability and performance of large language models (LLMs).

Motivation: Reinforcement learning (RL) is widely used in post-training of LLMs, but obtaining accurate reward signals across different domains remains a key challenge.

Details

Method: Adopt pointwise generative reward modeling (GRM) and propose Self-Principled Critique Tuning (SPCT), combined with online RL and parallel sampling. Result: SPCT markedly improves the quality and scalability of GRMs, outperforms existing methods on several RM benchmarks, and scales better at inference time. Conclusion: DeepSeek-GRM still faces challenges on some tasks but points a direction for future generalist reward systems; the models will be open-sourced. Abstract: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that proper learning methods could enable effective inference-time scalability. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced.
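
The combination of parallel sampling, voting, and a meta RM can be sketched as follows. This is a simplified illustration, not the released implementation: the abstract does not specify the voting rule, so summing pointwise scores and keeping the `keep_top` samples rated highest by the meta RM are assumptions, as are all names in the code.

```python
def vote_rewards(samples, meta_scores=None, keep_top=None):
    """Aggregate pointwise rewards from several sampled judgments.

    samples: list of score lists, one list per sampled judgment, each holding
             one score per candidate response.
    meta_scores: optional quality score per sampled judgment (the role a
             meta RM would play); used only together with keep_top.
    keep_top: number of highest-rated judgments to keep before voting.
    """
    indices = list(range(len(samples)))
    if meta_scores is not None and keep_top is not None:
        indices = sorted(indices, key=lambda i: meta_scores[i], reverse=True)[:keep_top]
    n_responses = len(samples[0])
    # Voting by summation: each kept judgment contributes its scores.
    return [sum(samples[i][r] for i in indices) for r in range(n_responses)]

# Three sampled judgments over two candidate responses.
sampled = [[7, 5], [3, 9], [8, 4]]
plain = vote_rewards(sampled)
guided = vote_rewards(sampled, meta_scores=[0.9, 0.1, 0.8], keep_top=2)
```

In this toy case plain voting ties the two responses, while discarding the judgment the meta RM rates lowest breaks the tie, which is the kind of effect guided voting is meant to have.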

MDP: Multidimensional Vision Model Pruning with Latency Constraint

Xinglong Sun,Barath Lakshmanan,Maying Shen,Shiyi Lan,Jingde Chen,Jose M. Alvarez

Task: Propose Multi-Dimensional Pruning (MDP), a new paradigm that jointly optimizes multiple pruning granularities to overcome the limitations of existing structural pruning methods.

Motivation: Existing structural pruning methods are confined to fine-grained pruning (such as channels) and rely on simplified linear models that cannot predict latency accurately, especially in transformers.

Details

Method: MDP jointly optimizes multiple pruning granularities through Mixed-Integer Nonlinear Programming (MINLP) and employs an advanced latency modeling technique. Result: On ImageNet, MDP achieves a 28% speedup with a +1.4 Top-1 accuracy gain for ResNet50 pruning, and a 37% speedup with a +0.7 Top-1 accuracy gain for transformer pruning. Conclusion: MDP is an efficient and general pruning framework that clearly outperforms existing methods, especially at high pruning ratios. Abstract: Current structural pruning methods face two significant limitations: (i) they often limit pruning to finer-grained levels like channels, making aggressive parameter reduction challenging, and (ii) they focus heavily on parameter and FLOP reduction, with existing latency-aware methods frequently relying on simplistic, suboptimal linear models that fail to generalize well to transformers, where multiple interacting dimensions impact latency. In this paper, we address both limitations by introducing Multi-Dimensional Pruning (MDP), a novel paradigm that jointly optimizes across a variety of pruning granularities, including channels, query, key, heads, embeddings, and blocks. MDP employs an advanced latency modeling technique to accurately capture latency variations across all prunable dimensions, achieving an optimal balance between latency and accuracy. By reformulating pruning as a Mixed-Integer Nonlinear Program (MINLP), MDP efficiently identifies the optimal pruned structure across all prunable dimensions while respecting latency constraints. This versatile framework supports both CNNs and transformers. Extensive experiments demonstrate that MDP significantly outperforms previous methods, especially at high pruning ratios. On ImageNet, MDP achieves a 28% speed increase with a +1.4 Top-1 accuracy improvement over prior work like HALP for ResNet50 pruning. Against the latest transformer pruning method, Isomorphic, MDP delivers an additional 37% acceleration with a +0.7 Top-1 accuracy improvement.

UNDO: Understanding Distillation as Optimization

Kushal Jain,Piyushi Goyal,Kumar Shridhar

Task: Propose UNDO, an iterative knowledge distillation framework that improves how well the student model learns.

Motivation: Standard one-shot knowledge distillation performs poorly because teacher-generated rationales do not match the student's specific learning needs.

Details

Method: Iteratively identify the student's errors and prompt the teacher to adjust its explanations, directly targeting the student's learning deficiencies. Result: On mathematical and commonsense reasoning tasks, UNDO improves performance by up to 20% over standard methods, and the teacher-generated data also works for other student models. Conclusion: UNDO reframes knowledge distillation as a dynamic teacher-student interaction and markedly improves distillation quality. Abstract: Knowledge distillation has emerged as an effective strategy for compressing large language models' (LLMs) knowledge into smaller, more efficient student models. However, standard one-shot distillation methods often produce suboptimal results due to a mismatch between teacher-generated rationales and the student's specific learning requirements. In this paper, we introduce the UNDO: UNderstanding Distillation as Optimization framework, designed to bridge this gap by iteratively identifying the student's errors and prompting the teacher to refine its explanations accordingly. Each iteration directly targets the student's learning deficiencies, motivating the teacher to provide tailored and enhanced rationales that specifically address these weaknesses. Empirical evaluations on various challenging mathematical and commonsense reasoning tasks demonstrate that our iterative distillation method, UNDO, significantly outperforms standard one-step distillation methods, achieving performance gains of up to 20%. Additionally, we show that teacher-generated data refined through our iterative process remains effective even when applied to different student models, underscoring the broad applicability of our approach. Our work fundamentally reframes knowledge distillation as an iterative teacher-student interaction, effectively leveraging dynamic refinement by the teacher for better knowledge distillation.

Foreground Focus: Enhancing Coherence and Fidelity in Camouflaged Image Generation

Pei-Chi Chen,Yi Yao,Chan-Feng Hsu,HongXia Xie,Hung-Jen Chen,Hong-Han Shuai,Wen-Huang Cheng

Task: Propose a Foreground-Aware Camouflaged Image Generation (FACIG) model to address insufficient foreground-background integration and low foreground fidelity in existing methods.

Motivation: Existing camouflaged image generation methods suffer from incoherent foreground-background integration and low foreground fidelity, which limits the quality of generated images.

Details

Method: Introduce a Foreground-Aware Feature Integration Module (FAFIM) to strengthen foreground-background integration, and design a Foreground-Aware Denoising Loss to enhance supervision of foreground reconstruction. Result: Experiments on multiple datasets show that the method outperforms existing methods in overall camouflaged image quality and foreground fidelity. Conclusion: FACIG effectively addresses the shortcomings of existing methods and improves camouflaged image generation. Abstract: Camouflaged image generation is emerging as a solution to data scarcity in camouflaged vision perception, offering a cost-effective alternative to data collection and labeling. Recently, the state-of-the-art approach successfully generates camouflaged images using only foreground objects. However, it faces two critical weaknesses: 1) the background knowledge does not integrate effectively with foreground features, resulting in a lack of foreground-background coherence (e.g., color discrepancy); 2) the generation process does not prioritize the fidelity of foreground objects, which leads to distortion, particularly for small objects. To address these issues, we propose a Foreground-Aware Camouflaged Image Generation (FACIG) model. Specifically, we introduce a Foreground-Aware Feature Integration Module (FAFIM) to strengthen the integration between foreground features and background knowledge. In addition, a Foreground-Aware Denoising Loss is designed to enhance foreground reconstruction supervision. Experiments on various datasets show our method outperforms previous methods in overall camouflaged image quality and foreground fidelity.

Leveraging LLM For Synchronizing Information Across Multilingual Tables

Siddharth Khincha,Tushar Kataria,Ankita Anand,Dan Roth,Vivek Gupta

Task: Explore the use of large language models (LLMs) for multilingual information synchronization, in particular for updating outdated Wikipedia tables.

Motivation: Address the difficulties non-English speakers face in accessing information concentrated in high-resource languages (such as English and French), and the outdated or incomplete Wikipedia content in low-resource languages.

Details

Method: Propose a scalable solution based on zero-shot prompting and introduce a task decomposition strategy to improve coherence and accuracy. Result: The proposed method outperforms existing baselines in Information Updation (1.79%) and Information Addition (20.58%). Conclusion: LLMs show strong potential for dynamically updating and enriching data across architectures. Abstract: The vast amount of online information today poses challenges for non-English speakers, as much of it is concentrated in high-resource languages such as English and French. Wikipedia reflects this imbalance, with content in low-resource languages frequently outdated or incomplete. Recent research has sought to improve cross-language synchronization of Wikipedia tables using rule-based methods. These approaches can be effective, but they struggle with complexity and generalization. This paper explores large language models (LLMs) for multilingual information synchronization, using zero-shot prompting as a scalable solution. We introduce the Information Updation dataset, simulating the real-world process of updating outdated Wikipedia tables, and evaluate LLM performance. Our findings reveal that single-prompt approaches often produce suboptimal results, prompting us to introduce a task decomposition strategy that enhances coherence and accuracy. Our proposed method outperforms existing baselines, particularly in Information Updation (1.79%) and Information Addition (20.58%), highlighting the model's strength in dynamically updating and enriching data across architectures.

ESC: Erasing Space Concept for Knowledge Deletion

Tae-Young Lee,Sundong Park,Minwoo Jeon,Hyoseok Hwang,Gyeong-Moon Park

Task: Propose a new concept called Knowledge Deletion (KD) and develop a training-free erasing method, ESC, and a variant with training, ESC-T, to address privacy concerns in deep learning.

Motivation: Existing methods fail to meet users' demand for complete knowledge erasure and risk leaking personal knowledge through embedding features.

Details

Method: ESC restricts the important subspace for the knowledge to be forgotten by eliminating the relevant activations in the feature; ESC-T further uses a learnable mask to balance the trade-off between forgetting and preserving knowledge. Result: Experiments on various datasets and models show that the proposed methods are the fastest and achieve state-of-the-art performance, and that they apply to diverse forgetting scenarios. Conclusion: ESC and ESC-T are efficient and general for knowledge deletion; the code is open-source. Abstract: As concerns regarding privacy in deep learning continue to grow, individuals are increasingly apprehensive about the potential exploitation of their personal knowledge in trained models. Several research efforts have tried to address this, but they often fail to consider the real-world demand from users for complete knowledge erasure. Furthermore, our investigation reveals that existing methods have a risk of leaking personal knowledge through embedding features. To address these issues, we introduce a novel concept of Knowledge Deletion (KD), an advanced task that considers both concerns, and provide an appropriate metric, named Knowledge Retention score (KR), for assessing knowledge retention in feature space. To achieve this, we propose a novel training-free erasing approach named Erasing Space Concept (ESC), which restricts the important subspace for the forgetting knowledge by eliminating the relevant activations in the feature. In addition, we suggest ESC with Training (ESC-T), which uses a learnable mask to better balance the trade-off between forgetting and preserving knowledge in KD. Our extensive experiments on various datasets and models demonstrate that our proposed methods achieve the fastest and state-of-the-art performance. Notably, our methods are applicable to diverse forgetting scenarios, such as facial domain setting, demonstrating the generalizability of our methods. The code is available at http://github.com/KU-VGI/ESC .

Language Models reach higher Agreement than Humans in Historical Interpretation

Fabio Celli,Georgios Spathulas

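Task: Compare the performance of humans and large language models in historical annotation.

Motivation: Examine the cultural bias and differences in consensus between humans and large language models in historical annotation, in order to support large-scale annotation and quantitative analysis in the digital humanities.

Details

Method: Compare the annotations and interpretations of historical facts produced by humans and by large language models. Result: Large language models reach higher consensus when interpreting historical facts from short texts, though they diverge when they skip information or hallucinate; humans disagree because of personal biases. Conclusion: The study gives the digital humanities a new tool that supports large-scale annotation and quantitative analysis of historical data while fostering critical thinking about bias. Abstract: This paper compares historical annotations by humans and Large Language Models. The findings reveal that both exhibit some cultural bias, but Large Language Models achieve a higher consensus on the interpretation of historical facts from short texts. While humans tend to disagree on the basis of their personal biases, Large Language Models disagree when they skip information or produce hallucinations. These findings have significant implications for digital humanities, enabling large-scale annotation and quantitative analysis of historical data. This offers new educational and research opportunities to explore historical interpretations from different Language Models, fostering critical thinking about bias.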

Geospatial Artificial Intelligence for Satellite-based Flood Extent Mapping: Concepts, Advances, and Future Perspectives

Hyunho Lee,Wenwen Li

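Task: Use geospatial artificial intelligence (GeoAI) with satellite data for flood extent mapping, in order to identify flood events and assess their impacts.

Motivation: Support disaster management and spatial decision-making.

Details

Method: Systematically integrate artificial intelligence techniques with satellite data. Result: Flood extent maps that delineate affected areas, along with additional analytical outputs such as uncertainty estimation and change detection. Conclusion: GeoAI has important applications in flood monitoring and disaster management. Abstract: Geospatial Artificial Intelligence (GeoAI) for satellite-based flood extent mapping systematically integrates artificial intelligence techniques with satellite data to identify flood events and assess their impacts, for disaster management and spatial decision-making. The primary output often includes flood extent maps, which delineate the affected areas, along with additional analytical outputs such as uncertainty estimation and change detection.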

LexPam: Legal Procedure Awareness-Guided Mathematical Reasoning

Kepu Zhang,Guofu Xie,Weijie Yu,Mingyue Xu,Xu Tang,Yaxin Li,Jun Xu

Task: Propose LexNum, the first Chinese legal mathematical reasoning dataset, use it to test existing legal LLMs and reasoning LLMs, and introduce the LexPam algorithm to strengthen LLMs' mathematical reasoning in legal scenarios.

Motivation: Legal mathematical reasoning directly affects the credibility of LLMs, yet existing legal LLMs lack the relevant training, and no dataset exists to validate and enhance this ability.

Details

Method: Build the LexNum dataset, test existing models, and propose LexPam, a reinforcement learning algorithm guided by legal procedural awareness, to train LLMs. Result: Existing legal LLMs and reasoning models perform poorly on legal mathematical reasoning tasks, and LexPam markedly improves LLMs' ability on them. Conclusion: LexNum and LexPam provide an effective way to improve LLMs' legal mathematical reasoning. Abstract: The legal mathematical reasoning ability of LLMs is crucial when applying them to real-world scenarios, as it directly affects the credibility of the LLM. While existing legal LLMs can perform general judicial question answering, their legal mathematical reasoning capabilities have not been trained. Open-domain reasoning models, though able to generate detailed calculation steps, do not follow the reasoning logic required for legal scenarios. Additionally, there is currently a lack of legal mathematical reasoning datasets to help validate and enhance LLMs' reasoning abilities in legal contexts. To address these issues, we propose the first Chinese legal Mathematical Reasoning Dataset, LexNum, which includes three common legal mathematical reasoning scenarios: economic compensation, work injury compensation, and traffic accident compensation. Based on LexNum, we tested the performance of existing legal LLMs and reasoning LLMs, and introduced LexPam, a reinforcement learning algorithm guided by legal procedural awareness to train LLMs, enhancing their mathematical reasoning abilities in legal scenarios. Experiments on tasks in the three legal scenarios show that the performance of existing legal LLMs and reasoning models in legal mathematical reasoning tasks is unsatisfactory. LexPam can enhance the LLM's ability in these tasks.

AC-LoRA: Auto Component LoRA for Personalized Artistic Style Image Generation

Zhipu Cui,Andong Tian,Zhi Ying,Jialiang Lu

Task: Propose AC-LoRA, a method that automatically separates the signal and noise components of LoRA matrices, for efficient personalized artistic style image generation.

Motivation: LoRA methods for personalized image generation require manual tuning of the rank parameter, which is inefficient and gives unstable results.

Details

Method: Based on Singular Value Decomposition (SVD) and dynamic heuristics, the training hyperparameters are updated automatically. Result: An average improvement of 9% on FID, CLIP, DINO, and ImageReward, outperforming existing methods. Conclusion: AC-LoRA effectively overcomes model underfitting and overfitting and makes personalized image generation more efficient. Abstract: Personalized image generation allows users to preserve styles or subjects of a provided small set of images for further image generation. With the advancement in large text-to-image models, many techniques have been developed to efficiently fine-tune those models for personalization, such as Low Rank Adaptation (LoRA). However, LoRA-based methods often face the challenge of adjusting the rank parameter to achieve satisfactory results. To address this challenge, AutoComponent-LoRA (AC-LoRA) is proposed, which is able to automatically separate the signal component and noise component of the LoRA matrices for fast and efficient personalized artistic style image generation. This method is based on Singular Value Decomposition (SVD) and dynamic heuristics to update the hyperparameters during training. Superior performance over existing methods in overcoming model underfitting or overfitting problems is demonstrated. The results were validated using FID, CLIP, DINO, and ImageReward, achieving an average of 9% improvement.
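
As a rough illustration of what separating signal and noise components of a LoRA update with SVD could look like, the sketch below splits the low-rank update `B @ A` by cumulative singular-value energy. The energy threshold and the function name are assumptions for illustration only; the abstract does not describe the actual heuristic AC-LoRA uses.

```python
import numpy as np

def split_signal_noise(B, A, energy=0.9):
    """Split a LoRA update B @ A into leading ("signal") and residual parts.

    energy: fraction of squared singular-value mass assigned to the signal
            part (a hypothetical threshold, not taken from the paper).
    Returns (signal, noise, k) where k is the number of components kept.
    """
    delta_w = B @ A
    U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
    cumulative = np.cumsum(S ** 2) / np.sum(S ** 2)
    # Smallest k whose leading components reach the requested energy share.
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(S))
    signal = (U[:, :k] * S[:k]) @ Vt[:k]
    noise = delta_w - signal
    return signal, noise, k
```

A scheme like this would let the effective rank follow from the data instead of being fixed by hand, which is the problem the abstract says AC-LoRA targets.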

LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect

Hedi Naouara,Jean-Pierre Lorré,Jérôme Louradour

Task: Develop automatic speech recognition (ASR) systems for the Tunisian Arabic Dialect.

Motivation: The linguistic complexity of the Tunisian Arabic Dialect and the scarcity of annotated speech datasets make ASR development challenging.

Details

Method: Propose the LinTO audio and textual datasets, which cover phonological and lexical features of the Tunisian Arabic Dialect and include diverse texts and real-world audio samples. Result: The LinTO datasets provide high-quality material for building and benchmarking ASR systems for the Tunisian Arabic Dialect. Conclusion: The LinTO datasets address the resource scarcity that hinders ASR development for the Tunisian Arabic Dialect. Abstract: Developing Automatic Speech Recognition (ASR) systems for Tunisian Arabic Dialect is challenging due to the dialect's linguistic complexity and the scarcity of annotated speech datasets. To address these challenges, we propose the LinTO audio and textual datasets, comprehensive resources that capture phonological and lexical features of Tunisian Arabic Dialect. These datasets include a variety of texts from numerous sources and real-world audio samples featuring diverse speakers and code-switching between Tunisian Arabic Dialect and English or French. By providing high-quality audio paired with precise transcriptions, the LinTO audio and textual datasets aim to provide quality material to build and benchmark ASR systems for the Tunisian Arabic Dialect. Keywords: Tunisian Arabic Dialect, Speech-to-Text, Low-Resource Languages, Audio Data Augmentation

SocialGesture: Delving into Multi-person Gesture Understanding

Xu Cao,Pranav Virupaksha,Wenqi Jia,Bolin Lai,Fiona Ryan,Sangmin Lee,James M. Rehg

Task: Introduce SocialGesture, the first large-scale dataset for multi-person gesture analysis, and propose a novel VQA task to benchmark VLMs' performance on social gesture understanding.

Motivation: Existing datasets overlook multi-person interactions, which are crucial for understanding the social context of gestures, posing challenges in aligning gestures with other modalities like language and speech.

Details

Method: Develop SocialGesture, a diverse dataset for multi-person gesture analysis, supporting tasks like video-based recognition and temporal localization, and introduce a VQA task for benchmarking VLMs. Result: The dataset and VQA task highlight limitations of current gesture recognition models, providing insights for future improvements. Conclusion: SocialGesture is a valuable resource for advancing the study of gestures in complex social interactions and is publicly available. Abstract: Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities like language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gesture during complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark vision language models' (VLMs) performance on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field. SocialGesture is available at huggingface.co/datasets/IrohXu/SocialGesture.

LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems

Zishuo Liu,Carlos Rabat Villarreal,Mostafa Rahgouy,Amit Das,Zheng Zhang,Chang Ren,Dongji Feng

Task: Explore the capabilities and limitations of large language models (LLMs) in solving Fermi Problems (FPs).

Motivation: Fermi problems involve real-world impracticalities or ambiguous concepts, which makes them challenging for both humans and AI, and LLM performance on such tasks has not been fully studied.

Details

Method: Evaluate three advanced LLMs on a publicly available FP dataset, with prompts designed according to the TELeR taxonomy, including a zero-shot scenario. Result: All LLMs score below 0.5 on fp_score and perform better on standard questions than on specific ones. Conclusion: Fermi problems are inherently difficult for LLMs, and the clarity and conciseness of standard questions help model performance. Abstract: Fermi Problems (FPs) are mathematical reasoning tasks that require human-like logic and numerical reasoning. Unlike other reasoning questions, FPs often involve real-world impracticalities or ambiguous concepts, making them challenging even for humans to solve. Despite advancements in AI, particularly with large language models (LLMs) in various reasoning tasks, FPs remain relatively under-explored. This work conducted an exploratory study to examine the capabilities and limitations of LLMs in solving FPs. We first evaluated the overall performance of three advanced LLMs using a publicly available FP dataset. We designed prompts according to the recently proposed TELeR taxonomy, including a zero-shot scenario. Results indicated that all three LLMs achieved an fp_score (ranging from 0 to 1) below 0.5, underscoring the inherent difficulty of these reasoning tasks. To further investigate, we categorized FPs into standard and specific questions, hypothesizing that LLMs would perform better on standard questions, which are characterized by clarity and conciseness, than on specific ones. Comparative experiments confirmed this hypothesis, demonstrating that LLMs performed better on standard FPs in terms of both accuracy and efficiency.
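
The abstract does not define fp_score beyond its 0 to 1 range. The sketch below uses an order-of-magnitude formula that, to our recollection, is the one used with earlier Fermi-problem benchmarks; treat both the formula and the function name as assumptions and check them against the dataset's documentation.

```python
import math

def fp_score(predicted, gold):
    """Order-of-magnitude score in [0, 1] (assumed definition).

    Full credit for an exact answer, decreasing linearly with the distance
    in log10 space, and zero once the estimate is off by three or more
    orders of magnitude. Non-positive inputs score zero.
    """
    if predicted <= 0 or gold <= 0:
        return 0.0
    return max(0.0, 1.0 - abs(math.log10(predicted / gold)) / 3.0)
```

Under this assumed definition, a score below 0.5 means the estimate is off by more than 1.5 orders of magnitude, a factor of about 32.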

Re-thinking Temporal Search for Long-Form Video Understanding

Jinhui Ye,Zihan Wang,Haosen Sun,Keshigeyan Chandrasegaran,Zane Durante,Cristobal Eyzaguirre,Yonatan Bisk,Juan Carlos Niebles,Ehsan Adeli,Li Fei-Fei,Jiajun Wu,Manling Li

Task: Study temporal search in long-form video understanding and propose T*, a lightweight keyframe search framework.

Motivation: Long-form video understanding remains challenging in computer vision, and existing methods fall significantly short in temporal search capability.

Details

Method: Recast temporal search as a spatial search problem and propose the T* framework, which combines visual localization capabilities with an adaptive zooming-in mechanism. Result: T* significantly improves existing methods, for example raising GPT-4o's performance from 50.5% to 53.1%. Conclusion: T* offers an efficient temporal search solution for long-form video understanding and addresses a gap in existing research. Abstract: Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding, studying a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). In particular, our contributions are two-fold: First, we formulate temporal search as a Long Video Haystack problem, i.e., finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos given specific queries. To validate our formulation, we create LV-Haystack, the first benchmark containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving only 2.1% temporal F1 score on the LVBench subset. Next, inspired by visual search in images, we re-think temporal searching and propose a lightweight keyframe searching framework, T*, which casts the expensive temporal search as a spatial search problem. T* leverages superior visual localization capabilities typically used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that when integrated with existing methods, T* significantly improves SOTA long-form video understanding performance. Specifically, under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4% on LongVideoBench XL subset. Our PyTorch code, benchmark dataset and models are included in the Supplementary material.

Limitations of Religious Data and the Importance of the Target Domain: Towards Machine Translation for Guinea-Bissau Creole

Jacqueline Rowe,Edward Gow-Smith,Mark Hepple

Task: Introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol) and investigate domain transfer from religious to general domains.

Motivation: Address the lack of resources for low-resource languages like Kiriol and explore methods to improve translation performance with limited data.

Details

Method: Train transformer-based models using a dataset of 40K parallel sentences (primarily religious texts) and analyze the impact of adding small amounts of target domain data. Result: Adding 300 target domain sentences significantly improves translation; Portuguese-to-Kiriol models outperform others, possibly due to morphological and lexical similarities. Conclusion: The study highlights the importance of small-scale data collection for low-resource languages and encourages further research on Kiriol and creole languages in machine translation. Abstract: We introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), comprising around 40 thousand parallel sentences to English and Portuguese. This dataset is made up of predominantly religious data (from the Bible and texts from the Jehovah's Witnesses), but also a small amount of general domain data (from a dictionary). This mirrors the typical resource availability of many low resource languages. We train a number of transformer-based models to investigate how to improve domain transfer from religious data to a more general domain. We find that adding even 300 sentences from the target domain when training substantially improves the translation performance, highlighting the importance and need for data collection for low-resource languages, even on a small-scale. We additionally find that Portuguese-to-Kiriol translation models perform better on average than other source and target language pairs, and investigate how this relates to the morphological complexity of the languages involved and the degree of lexical overlap between creoles and lexifiers. Overall, we hope our work will stimulate research into Kiriol and into how machine translation might better support creole languages in general.

WonderTurbo: Generating Interactive 3D World in 0.72 Seconds

Chaojun Ni,Xiaofeng Wang,Zheng Zhu,Weijie Wang,Haoyun Li,Guosheng Zhao,Jie Li,Wenkang Qin,Guan Huang,Wenjun Mei

Task: Propose WonderTurbo, the first real-time interactive 3D scene generation framework, which generates novel views of 3D scenes within 0.72 seconds.

Motivation: Current 3D generation technologies struggle to achieve real-time interactivity, which limits immersive virtual experiences.

Details

Method: StepSplat dynamically updates an efficient 3D geometric representation (0.26 seconds per update), QuickDepth provides consistent depth input, and FastPaint performs fast, spatially consistent inpainting. Result: Experiments show WonderTurbo is 15 times faster than baseline methods while preserving high-quality output and spatial consistency. Conclusion: WonderTurbo addresses the challenge of real-time interactive 3D generation and provides an efficient tool for immersive experiences. Abstract: Interactive 3D generation is gaining momentum and capturing extensive attention for its potential to create immersive virtual experiences. However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds. Specifically, WonderTurbo accelerates both geometric and appearance modeling in 3D scene generation. In terms of geometry, we propose StepSplat, an innovative method that constructs efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds. Additionally, we design QuickDepth, a lightweight depth completion module that provides consistent depth input for StepSplat, further enhancing geometric accuracy. For appearance modeling, we develop FastPaint, a 2-step diffusion model tailored for instant inpainting, which focuses on maintaining spatial appearance consistency. Experimental results demonstrate that WonderTurbo achieves a remarkable 15X speedup compared to baseline methods, while preserving excellent spatial consistency and delivering high-quality output.

The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

Nikhil Verma,Manasa Bharadwaj

Task: Analyze how well the alignment of large language models holds in multilingual settings and the distributional shifts it induces.

Motivation: Current alignment methods focus mainly on English; how alignment mechanisms generalize to multilingual settings is poorly understood, which may lead to uneven model behavior across languages.

Details

Method: Systematically analyze distributional shifts in the LLM embedding space before and after alignment, using the alignment-induced separation in safety space as a quantitative tool, and evaluate seven LLMs on balanced toxicity datasets and parallel text-detoxification benchmarks. Result: Substantial disparities exist in the latent representation space between high-resource and low-resource languages, underscoring the need for language-specific fine-tuning. Conclusion: The study lays a foundation for developing truly safe multilingual LLMs and stresses the urgency of closing alignment gaps in low-resource languages. Abstract: Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering its impact on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.

MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception

Wenzhuo Liu,Wenshuo Wang,Yicheng Qiao,Qiannan Guo,Jiayin Zhu,Pengfei Li,Zilong Chen,Huiming Yang,Zhiwei Li,Lening Wang,Tiao Tan,Huaping Liu

Task: Propose MMTL-UniAD, a unified multimodal multi-task learning framework that simultaneously recognizes driver behavior, driver emotion, vehicle behavior, and traffic context.

Motivation: Existing work neglects the potential benefits of jointly learning driver state and traffic context, and multi-task learning suffers from negative transfer.

Details

Method: Introduce a multi-axis region attention network to extract global context-sensitive features, and a dual-branch multimodal embedding to learn task-shared and task-specific features. Result: On the AIDE dataset, MMTL-UniAD outperforms existing methods on all four tasks. Conclusion: By reducing negative transfer and enhancing cross-task knowledge transfer, MMTL-UniAD achieves superior joint multi-task learning performance. Abstract: Advanced driver assistance systems require a comprehensive understanding of the driver's mental/physical state and traffic context, but existing works often neglect the potential benefits of joint learning between these tasks. This paper proposes MMTL-UniAD, a unified multi-modal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, traffic smooth). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To address this, we introduce two key components into the framework: one is the multi-axis region attention network to extract global context-sensitive features, and the other is the dual-branch multimodal embedding to learn multimodal embeddings from both task-shared and task-specific features. The former uses a multi-attention mechanism to extract task-relevant features, mitigating negative transfer caused by task-unrelated features. The latter employs a dual-branch structure to adaptively adjust task-shared and task-specific parameters, enhancing cross-task knowledge transfer while reducing task conflicts. We assess MMTL-UniAD on the AIDE dataset, using a series of ablation studies, and show that it outperforms state-of-the-art methods across all four tasks. The code is available at https://github.com/Wenzhuo-Liu/MMTL-UniAD.

ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization

Kehua Feng,Keyan Ding,Jing Yu,Menghan Li,Yuhao Wang,Tong Xu,Xinda Wang,Qiang Zhang,Huajun Chen

Task: Propose Ex-Ante Reasoning Preference Optimization (ERPO), a new safety alignment framework for improving the safety of large language models.

Motivation: Existing alignment methods fall short in covering diverse safety scenarios and resisting adversarial attacks, so more effective solutions are needed.

Details

Method: Three stages: (1) equip the model with Ex-Ante reasoning via supervised fine-tuning (SFT); (2) improve safety, usefulness, and efficiency via Direct Preference Optimization (DPO); (3) reduce inference latency with a length-controlled iterative preference optimization strategy. Result: Experiments on multiple open-source LLMs show that ERPO significantly improves safety performance while maintaining response efficiency. Conclusion: ERPO is an effective safety alignment framework that improves safety while preserving model efficiency. Abstract: Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy. Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency.
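
The second stage relies on Direct Preference Optimization. For orientation, here is a minimal sketch of the standard per-pair DPO objective, written for scalar sequence log-probabilities. It illustrates generic DPO only; how ERPO builds its preference pairs, and which beta it uses, is not stated in the abstract, so the argument names and the default beta below are assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin).

    margin = (policy - reference) log-ratio of the chosen response
             minus the same log-ratio of the rejected response.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: the loss sits at log(2).
neutral = dpo_loss(-3.0, -4.0, -3.0, -4.0)
# Policy favors the chosen response more than the reference does: lower loss.
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
# Policy favors the rejected response: higher loss.
worse = dpo_loss(-5.0, -1.0, -2.0, -2.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference model and lowers the rejected one's.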

MinkOcc: Towards real-time label-efficient semantic occupancy prediction

Samuel Sze,Daniele De Martini,Lars Kunze

Task: Develop MinkOcc, a multi-modal 3D semantic occupancy prediction framework that reduces reliance on dense 3D annotations.

Motivation: Dense 3D annotation is time-consuming and resource-intensive, so label-efficient or label-free approaches are needed.

Details

Method: A two-step semi-supervised training procedure: warm-start training with a small set of 3D annotations, then continue supervision with easier-to-annotate LiDAR sweeps and images (labelled by vision foundation models). LiDAR and camera data are fused, and sparse convolution networks enable real-time prediction. Result: MinkOcc reduces the need for manual labeling by 90% while maintaining competitive accuracy. Conclusion: With its efficiency in both supervision and computation, MinkOcc may help bring 3D semantic occupancy prediction to wider use in autonomous driving. Abstract: Developing 3D semantic occupancy prediction models often relies on dense 3D annotations for supervised learning, a process that is both labor and resource-intensive, underscoring the need for label-efficient or even label-free approaches. To address this, we introduce MinkOcc, a multi-modal 3D semantic occupancy prediction framework for cameras and LiDARs that proposes a two-step semi-supervised training procedure. Here, a small dataset of explicit 3D annotations warm-starts the training process; then, the supervision is continued by simpler-to-annotate accumulated LiDAR sweeps and images -- semantically labelled through vision foundational models. MinkOcc effectively utilizes these sensor-rich supervisory cues and reduces reliance on manual labeling by 90% while maintaining competitive accuracy. In addition, the proposed model incorporates information from LiDAR and camera data through early fusion and leverages sparse convolution networks for real-time prediction. With its efficiency in both supervision and computation, we aim to extend MinkOcc beyond curated datasets, enabling broader real-world deployment of 3D semantic occupancy prediction in autonomous driving.

Why do LLMs attend to the first token?

Federico Barbero,Álvaro Arroyo,Xiangming Gu,Christos Perivolaropoulos,Michael Bronstein,Petar Veličković,Razvan Pascanu

Task: Study why attention in large language models (LLMs) concentrates on the first token of the sequence (the attention sink) and what purpose it serves.

Motivation: Prior work has described attention sinks and their effects in detail, but understanding of why they form and how they are used remains shallow.

Details

Method: Through theoretical and empirical analysis, examine how attention sinks help LLMs avoid over-mixing information, and experimentally test how context length, depth, and data packing affect sink behavior. Result: The study shows that attention sinks are useful in LLMs and offers a new perspective on the attention patterns that form during training. Conclusion: Attention sinks are a mechanism by which LLMs avoid over-mixing information, and the study provides theoretical and empirical support for this role. Abstract: Large Language Models (LLMs) tend to attend heavily to the first token in the sequence -- creating a so-called attention sink. Many works have studied this phenomenon in detail, proposing various ways to either leverage or alleviate it. Attention sinks have been connected to quantisation difficulties, security issues, and streaming attention. Yet, while many works have provided conditions in which they occur or not, a critical question remains shallowly answered: Why do LLMs learn such patterns and how are they being used? In this work, we argue theoretically and empirically that this mechanism provides a method for LLMs to avoid over-mixing, connecting this to existing lines of work that study mathematically how information propagates in Transformers. We conduct experiments to validate our theoretical intuitions and show how choices such as context length, depth, and data packing influence the sink behaviour. We hope that this study provides a new practical perspective on why attention sinks are useful in LLMs, leading to a better understanding of the attention patterns that form during training.
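
To make the notion of an attention sink concrete, the sketch below computes causal softmax attention from a matrix of raw scores and reports how much attention later tokens place on position 0. It is an illustrative measurement on toy scores, not the paper's metric; the function names and the toy inputs are assumptions, and the sequence is assumed to have at least two tokens.

```python
import math

def causal_attention(scores):
    """Row-wise softmax over a square score matrix under a causal mask."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = scores[i][: i + 1]              # token i attends to tokens 0..i only
        top = max(row)
        exps = [math.exp(s - top) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

def sink_mass(weights):
    """Mean attention on position 0, averaged over every token after the first."""
    rows = weights[1:]                        # token 0 trivially attends to itself
    return sum(r[0] for r in rows) / len(rows)

n = 4
flat_scores = [[0.0] * n for _ in range(n)]                 # no preference
sink_scores = [[5.0] + [0.0] * (n - 1) for _ in range(n)]   # first key boosted

flat_mass = sink_mass(causal_attention(flat_scores))
boosted_mass = sink_mass(causal_attention(sink_scores))
```

With flat scores, the mass on the first token simply reflects the shrinking causal window; with the boosted first key, nearly all attention collapses onto position 0, which is the pattern the abstract calls a sink.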

Generative Classifier for Domain Generalization

Shaocong Long,Qianyu Zhou,Xiangtai Li,Chenhao Ying,Yunhai Tong,Lizhuang Ma,Yuan Luo,Dacheng Tao

Task: Explore Generative Classifier-driven Domain Generalization (GCDG) to improve the generalization of computer vision models under distribution shift.

Motivation: Mainstream domain generalization methods focus on learning domain invariance but overlook the potential of domain-specific information, so they struggle when that information exhibits multi-modality.

Details

Method: Propose GCDG, a generative classifier based on Gaussian Mixture Models (GMMs), with three modules: Heterogeneity Learning Classifier (HLC), Spurious Correlation Blocking (SCB), and Diverse Component Balancing (DCB). Result: GCDG performs well on five domain generalization benchmarks and one face anti-spoofing dataset, and integrates into existing methods with consistent improvements. Conclusion: By capturing the nuances of domain-specific information, GCDG reduces the target risk and encourages flat minima, improving generalization. Abstract: Domain generalization (DG) aims to improve the generalizability of computer vision models toward distribution shifts. The mainstream DG methods focus on learning domain invariance; however, such methods overlook the potential inherent in domain-specific information. While the prevailing practice of discriminative linear classifier has been tailored to domain-invariant features, it struggles when confronted with diverse domain-specific information, e.g., intra-class shifts, that exhibits multi-modality. To address these issues, we explore the theoretical implications of relying on domain invariance, revealing the crucial role of domain-specific information in mitigating the target risk for DG. Drawing from these insights, we propose Generative Classifier-driven Domain Generalization (GCDG), introducing a generative paradigm for the DG classifier based on Gaussian Mixture Models (GMMs) for each class across domains. GCDG consists of three key modules: Heterogeneity Learning Classifier (HLC), Spurious Correlation Blocking (SCB), and Diverse Component Balancing (DCB). Concretely, HLC attempts to model the feature distributions and thereby capture valuable domain-specific information via GMMs. SCB identifies the neural units containing spurious correlations and perturbs them, mitigating the risk of HLC learning spurious patterns. Meanwhile, DCB ensures a balanced contribution of components in HLC, preventing the underestimation or neglect of critical components. In this way, GCDG excels in capturing the nuances of domain-specific information characterized by diverse distributions. GCDG demonstrates the potential to reduce the target risk and encourage flat minima, improving the generalizability. Extensive experiments show GCDG's comparable performance on five DG benchmarks and one face anti-spoofing dataset, seamlessly integrating into existing DG methods with consistent improvements.
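
The core idea of a GMM-based generative classifier can be sketched in a few lines: model each class with its own mixture and predict the class whose mixture gives the feature the highest likelihood. The example below uses hand-set diagonal Gaussians and a uniform class prior; it illustrates the general paradigm only and omits how HLC is trained as well as the SCB and DCB modules. All names and numbers are hypothetical.

```python
import math

def log_gaussian(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_gmm(x, components):
    """Log-density of a mixture; components = [(weight, mean, var), ...]."""
    terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))  # log-sum-exp

def classify(x, class_gmms):
    """Return the class whose mixture scores x highest (uniform prior assumed)."""
    return max(class_gmms, key=lambda c: log_gmm(x, class_gmms[c]))

# Toy setup: class_a has two separate modes (e.g. two domains), class_b lies between.
class_gmms = {
    "class_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [10.0, 10.0], [1.0, 1.0])],
    "class_b": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}

near_first_mode = classify([0.3, -0.2], class_gmms)
near_second_mode = classify([9.5, 10.2], class_gmms)
in_between = classify([5.1, 4.8], class_gmms)
```

In this toy layout no single linear boundary separates class_a from class_b, yet the mixture handles both modes of class_a, which is the kind of multi-modal, domain-specific structure the abstract says linear classifiers struggle with.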

Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study

Aryan Agrawal,Lisa Alazraki,Shahin Honarvar,Marek Rei

Task: Study how to improve the robustness of large language models (LLMs) to perturbations of task-level instructions.

Motivation: Existing methods focus mainly on perturbed data samples, while robustness to perturbed task-level instructions has received little study.

Details

Method: Apply character- and word-level edits to instructions and experiment with techniques such as self-denoising and representation alignment across different models, datasets, and instructions. Result: Self-denoising, whether performed by a frozen LLM or a fine-tuned model, outperforms alternative strategies such as ensembling and supervised methods. Conclusion: Self-denoising is an effective way to improve LLM robustness to instruction perturbations. Abstract: Large Language Models (LLMs) are highly vulnerable to input perturbations, as even a small prompt change may result in a substantially different output. Existing methods to enhance LLM robustness are primarily focused on perturbed data samples, whereas improving resiliency to perturbations of task-level instructions has remained relatively underexplored. In this work, we focus on character- and word-level edits of task-specific instructions, which substantially degrade downstream performance. We experiment with a variety of techniques to enhance the robustness of LLMs, including self-denoising and representation alignment, testing different models (Llama 3 and Flan-T5), datasets (CoLA, QNLI, SST-2) and instructions (both task-oriented and role-oriented). We find that, on average, self-denoising -- whether performed by a frozen LLM or a fine-tuned model -- achieves substantially higher performance gains than alternative strategies, including more complex baselines such as ensembling and supervised methods.
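
As an illustration of what character- and word-level instruction edits can look like, the sketch below swaps adjacent letters and drops words at a chosen rate. The abstract does not list the specific edit operations used in the study, so these two operations, the function names, and the example instruction are assumptions.

```python
import random

def perturb_chars(text, rate, rng):
    """Swap adjacent letters with probability `rate` at each position."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)

def perturb_words(text, rate, rng):
    """Drop each word with probability `rate`, always keeping at least one word."""
    words = text.split()
    kept = [w for w in words if rng.random() >= rate]
    return " ".join(kept or words[:1])

rng = random.Random(0)  # seeded so the perturbations are reproducible
instruction = "Decide whether the sentence is grammatically acceptable."
char_noisy = perturb_chars(instruction, 0.2, rng)
word_noisy = perturb_words(instruction, 0.2, rng)
```

A self-denoising setup would then ask a model to restore the original instruction from a noisy version like these before solving the task.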

Beyond Conventional Transformers: The Medical X-ray Attention (MXA) Block for Improved Multi-Label Diagnosis Using Knowledge Distillation

Amit Rand,Hadi Ibrahim

Task: Propose the MXA block, an attention mechanism specialized for X-ray abnormality detection, and embed it in the EfficientViT architecture for multi-label classification.

Motivation: Medical imaging such as X-rays often requires detecting multiple abnormalities at once, making multi-label classification crucial in clinical applications.

Details

Method: Propose the MXA block, which enhances traditional Multi-Head Self Attention (MHSA) by combining local detail with global context, embed it in EfficientViT, and apply knowledge distillation. Result: On the CheXpert dataset, the AUC reaches 0.85, an improvement of 0.19 over the baseline model (AUC = 0.66). Conclusion: The MXA block substantially improves multi-label X-ray classification and offers a more effective tool for clinical diagnosis. Abstract: Medical imaging, particularly X-ray analysis, often involves detecting multiple conditions simultaneously within a single scan, making multi-label classification crucial for real-world clinical applications. We present the Medical X-ray Attention (MXA) block, a novel attention mechanism tailored specifically to address the unique challenges of X-ray abnormality detection. The MXA block enhances traditional Multi-Head Self Attention (MHSA) by integrating a specialized module that efficiently captures both detailed local information and broader global context. To the best of our knowledge, this is the first work to propose a task-specific attention mechanism for diagnosing chest X-rays, as well as to attempt multi-label classification using an Efficient Vision Transformer (EfficientViT). By embedding the MXA block within the EfficientViT architecture and employing knowledge distillation, our proposed model significantly improves performance on the CheXpert dataset, a widely used benchmark for multi-label chest X-ray abnormality detection. Our approach achieves an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 compared to our baseline model's AUC of 0.66, corresponding to a substantial approximate 233% relative improvement over random guessing (AUC = 0.5).

MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Jaap Jumelet,Leonie Weissweiler,Arianna Bisazza

Task: Introduce MultiBLiMP 1.0, a multilingual benchmark for evaluating linguistic minimal pairs across 101 languages.

Motivation: To assess the abilities of large language models (LLMs) at a multilingual scale and identify shortcomings in modeling low-resource languages.

Details

Method: Utilize an automated pipeline with Universal Dependencies and UniMorph resources to create over 125,000 minimal pairs. Result: MultiBLiMP 1.0 covers 6 linguistic phenomena and highlights current LLM limitations in low-resource language modeling. Conclusion: The benchmark provides a comprehensive tool for evaluating LLMs and underscores the need for improved modeling of low-resource languages. Abstract: We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages, 6 linguistic phenomena and containing more than 125,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
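
Minimal-pair benchmarks are usually scored by checking whether a model assigns a higher probability to the grammatical sentence than to its minimally different ungrammatical counterpart. A minimal sketch of that accuracy computation follows; the scoring callable stands in for a real language model, and the sentences and log-probabilities are made up for illustration rather than taken from MultiBLiMP.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of pairs where the grammatical sentence outscores the ungrammatical one.

    `pairs` holds (grammatical, ungrammatical) sentence tuples; `score` is any
    callable that returns a sentence log-probability.
    """
    hits = sum(1 for good, bad in pairs if score(good) > score(bad))
    return hits / len(pairs)

# Made-up log-probabilities standing in for a real language model.
TOY_LOGPROBS = {
    "The dogs bark.": -12.0,
    "The dogs barks.": -15.5,
    "She has eaten.": -10.0,
    "She have eaten.": -9.0,  # the toy model gets this pair wrong
}

pairs = [
    ("The dogs bark.", "The dogs barks."),
    ("She has eaten.", "She have eaten."),
]
accuracy = minimal_pair_accuracy(pairs, TOY_LOGPROBS.get)
```

Because each pair differs in a single feature such as agreement, per-language and per-phenomenon accuracies computed this way can be compared directly.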

Trung Thanh Nguyen,Yasutomo Kawanishi,Vijay John,Takahiro Komamizu,Ichiro Ide

Task: Propose MultiTSF, a Transformer-based multi-modal multi-view sensor fusion method for action recognition.

Motivation: Address the shortcomings of existing methods on real-world challenges such as diverse environmental conditions, strict sensor synchronization, and the need for fine-grained annotations.

Details

Method: Use a Transformer to dynamically model inter-view relationships and capture temporal dependencies across views, and introduce a Human Detection Module that generates pseudo labels to improve spatial feature learning. Result: Experiments on the MultiSensor-Home and MM-Office datasets show that MultiTSF outperforms existing methods in both video sequence-level and frame-level action recognition. Conclusion: Through dynamic modeling and pseudo-label guidance, MultiTSF substantially improves multi-modal multi-view action recognition. Abstract: Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. However, existing methods often fall short of addressing real-world challenges such as diverse environmental conditions, strict sensor synchronization, and the need for fine-grained annotations. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF). The proposed method leverages a Transformer-based architecture to dynamically model inter-view relationships and capture temporal dependencies across multiple views. Additionally, we introduce a Human Detection Module to generate pseudo-ground-truth labels, enabling the model to prioritize frames containing human activity and enhance spatial feature learning. Comprehensive experiments conducted on our in-house MultiSensor-Home dataset and the existing MM-Office dataset demonstrate that MultiTSF outperforms state-of-the-art methods in both video sequence-level and frame-level action recognition settings.

A Framework for Robust Cognitive Evaluation of LLMs

Karin de Langis,Jong Inn Park,Bin Hu,Khanh Chi Le,Andreas Schramm,Michael C. Mensink,Andrew Elfenbein,Dongyeop Kang

Task: Develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of large language models (LLMs).

Motivation: Emergent cognitive abilities in LLMs have been widely observed, but their nature and mechanisms remain unclear, and standardized evaluation methods are lacking.

Details

Method: CognitivEval features automatic prompt permutations and testing that gathers both generations and model probability estimates. Result: Experiments show that the framework yields more robust experimental outcomes, and it replicates five classic cognitive science experiments. Conclusion: The public release of CognitivEval is intended to foster broader collaboration within the cognitive science community. Abstract: Emergent cognitive abilities in large language models (LLMs) have been widely observed, but their nature and underlying mechanisms remain poorly understood. A growing body of research draws on cognitive science to investigate LLM cognition, but standard methodologies and experimental pipelines have not yet been established. To address this gap we develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of LLMs, with a particular emphasis on robustness in response collection. The key features of CognitivEval include: (i) automatic prompt permutations, and (ii) testing that gathers both generations and model probability estimates. Our experiments demonstrate that these features lead to more robust experimental outcomes. Using CognitivEval, we replicate five classic experiments in cognitive science, illustrating the framework's generalizability across various experimental tasks and obtaining a cognitive profile of several state-of-the-art LLMs. CognitivEval will be released publicly to foster broader collaboration within the cognitive science community.
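
The abstract does not say which parts of a prompt CognitivEval permutes. As one plausible reading, the sketch below generates a prompt for every combination of instruction phrasing and answer-option ordering, so that results can be averaged over surface variations. The function name, the prompt layout, and the example stimuli are assumptions, not the framework's actual interface.

```python
from itertools import permutations, product

def prompt_variants(question, options, instructions):
    """Yield one prompt per (instruction phrasing, option ordering) combination."""
    for instruction, ordering in product(instructions, permutations(options)):
        lines = [instruction, question]
        # Label the options A, B, C, ... in their current order.
        lines += [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(ordering)]
        yield "\n".join(lines)

variants = list(prompt_variants(
    "Which word was in the list you just read?",
    ["sleep", "chair", "river"],
    ["Answer with a single letter.", "Choose the best option."],
))
```

Two phrasings and three options give 2 x 3! = 12 prompts; aggregating a model's answers over all of them reduces sensitivity to any single wording or option order.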

Moment Quantization for Video Temporal Grounding

Xiaolong Sun,Le Wang,Sanping Zhou,Liushuai Shi,Kun Xia,Mengnan Liu,Yabing Wang,Gang Hua

Task: Propose MQVTG, a moment-quantization-based video temporal grounding method that improves discrimination between relevant and irrelevant moments.

Motivation: Existing methods discriminate weakly between foreground and background features; MQVTG addresses this by quantizing the video into discrete vectors.

Details

Method: MQVTG maintains a learnable moment codebook, matches video moments to codewords, treats the matching as a clustering process to avoid the information loss of hard quantization, and refines the codebook with prior-initialization and joint-projection strategies. Result: On six popular benchmarks, MQVTG significantly outperforms existing methods and effectively separates relevant from irrelevant features. Conclusion: MQVTG is simple and general, and can be integrated into existing models as a plug-and-play component to improve video temporal grounding performance. Abstract: Video temporal grounding is a critical video understanding task, which aims to localize moments relevant to a language description. The challenge of this task lies in distinguishing relevant and irrelevant moments. Previous methods focused on learning continuous features exhibit weak differentiation between foreground and background features. In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Specifically, MQVTG maintains a learnable moment codebook, where each video moment matches a codeword. Considering the visual diversity, i.e., various visual expressions for the same moment, MQVTG treats moment-codeword matching as a clustering process without using discrete vectors, avoiding the loss of useful information from direct hard quantization. Additionally, we employ effective prior-initialization and joint-projection strategies to enhance the maintained moment codebook. With its simple implementation, the proposed method can be integrated into existing temporal grounding models as a plug-and-play component. Extensive experiments on six popular benchmarks demonstrate the effectiveness and generalizability of MQVTG, significantly outperforming state-of-the-art methods. Further qualitative analysis shows that our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination.
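
To show the difference between hard quantization and a softer, clustering-style match against a codebook, the sketch below compares a nearest-codeword lookup with a softmax-weighted assignment over distances. This is a generic illustration of the contrast the abstract draws, not MQVTG's actual matching rule; the function names, the temperature parameter, and the toy codebook are assumptions.

```python
import math

def squared_distances(feature, codebook):
    """Squared Euclidean distance from the feature to every codeword."""
    return [sum((f - c) ** 2 for f, c in zip(feature, code)) for code in codebook]

def hard_assign(feature, codebook):
    """Index of the nearest codeword (hard quantization)."""
    dists = squared_distances(feature, codebook)
    return dists.index(min(dists))

def soft_assign(feature, codebook, temperature=1.0):
    """Softmax over negative distances: a soft membership in each cluster."""
    logits = [-d / temperature for d in squared_distances(feature, codebook)]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

codebook = [[0.0, 0.0], [2.0, 0.0]]
feature = [0.9, 0.0]                     # slightly closer to the first codeword

nearest = hard_assign(feature, codebook)             # discards how close the call was
membership = soft_assign(feature, codebook)          # keeps that information
sharp = soft_assign(feature, codebook, temperature=0.01)
```

Hard assignment reports only the winning codeword, while the soft membership shows the feature sits almost midway between the two; lowering the temperature recovers the hard behavior.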

Zhuohan Ge,Nicole Hu,Darian Li,Yubo Wang,Shihao Qi,Yuming Xu,Han Shi,Jason Zhang

Task: Explore the potential of large language models (LLMs) for detecting mental health problems through social media data analysis.

Motivation: Social media data is an important resource for mental health research, but using LLMs to detect mental health problems poses challenges.

Details

Method: Summarize LLM application methods along several dimensions, such as text data analysis and mental disorder detection, and analyze the main challenges and shortcomings of current research. Result: Popular datasets and evaluation metrics are summarized, and the great potential of LLMs in mental health detection is shown. Conclusion: The survey provides a comprehensive frame of reference for mental health researchers and points to further application of LLMs in future mental health interventions. Abstract: The detection and intervention of mental health issues represent a critical global research focus, and social media data has been recognized as an important resource for mental health research. However, how to utilize Large Language Models (LLMs) for mental health problem detection on social media poses significant challenges. Hence, this paper aims to explore the potential of LLM applications in social media data analysis, focusing not only on the most common psychological disorders such as depression and anxiety but also incorporating psychotic disorders and externalizing disorders, summarizing the application methods of LLM from different dimensions, such as text data analysis and detection of mental disorders, and revealing the major challenges and shortcomings of current research. In addition, the paper provides an overview of popular datasets and evaluation metrics. The survey in this paper provides a comprehensive frame of reference for researchers in the field of mental health, while demonstrating the great potential of LLMs in mental health detection to facilitate the further application of LLMs in future mental health interventions.

Trung Thanh Nguyen,Yasutomo Kawanishi,Vijay John,Takahiro Komamizu,Ichiro Ide

Task: Propose MultiTSF, a Transformer-based multi-modal multi-view sensor fusion method, and introduce the MultiSensor-Home dataset for action recognition in home environments.

Motivation: Current datasets and methods fall short on real-world challenges (wide-area environmental conditions, asynchronous data streams, lack of frame-level annotations) and in modeling inter-view relationships and enhancing spatial feature learning.

Details

Method: Use a Transformer-based fusion mechanism to dynamically model inter-view relationships, and integrate an external human detection module to enhance spatial feature learning. Result: Experiments on the MultiSensor-Home and MM-Office datasets show that MultiTSF outperforms existing methods. Conclusion: MultiTSF performs well in multi-modal multi-view action recognition and advances the field toward real-world applications. Abstract: Multi-modal multi-view action recognition is a rapidly growing field in computer vision, offering significant potential for applications in surveillance. However, current datasets often fail to address real-world challenges such as wide-area environmental conditions, asynchronous data streams, and the lack of frame-level annotations. Furthermore, existing methods face difficulties in effectively modeling inter-view relationships and enhancing spatial feature learning. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method and introduce the MultiSensor-Home dataset, a novel benchmark designed for comprehensive action recognition in home environments. The MultiSensor-Home dataset features untrimmed videos captured by distributed sensors, providing high-resolution RGB and audio data along with detailed multi-view frame-level action labels. The proposed MultiTSF method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships. Furthermore, the method also integrates an external human detection module to enhance spatial feature learning. Experiments on MultiSensor-Home and MM-Office datasets demonstrate the superiority of MultiTSF over the state-of-the-art methods. The quantitative and qualitative results highlight the effectiveness of the proposed method in advancing real-world multi-modal multi-view action recognition.

MegaMath: Pushing the Limits of Open Math Corpora

Fan Zhou,Zengzhi Wang,Nikhil Ranjan,Zhoujun Cheng,Liping Tang,Guowei He,Zhengzhong Liu,Eric P. Xing

Task: Build MegaMath, an open, large-scale, high-quality math corpus to support pre-training of math-centric large language models.

Motivation: The lack of an open, large-scale, high-quality corpus tailored to math-centric LLM pre-training limits research progress on mathematical reasoning.

Details

Method: Build the corpus with three strategies: (1) re-extract and optimize mathematical documents from the web; (2) select high-quality math-related code from code training corpora; (3) synthesize QA-style text, math-related code, and interleaved text-code blocks. Result: MegaMath provides 371B tokens, the largest and highest-quality among existing open math pre-training datasets. Conclusion: MegaMath fills the gap in pre-training corpora for math-centric LLMs and provides an important resource for mathematical reasoning research. Abstract: Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fasttext-based filtering and deduplication, all for acquiring higher-quality data on the Internet. (2) Recalling Math-related code data: We identified high quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring Synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets.

OmniCam: Unified Multimodal Video Generation via Camera Control

Xiaoda Yang,Jiayang Xu,Kaixuan Luan,Xinyu Zhan,Hongshun Qiu,Shijun Shi,Hao Li,Shuai Yang,Li Zhang,Checheng Yu,Cewu Lu,Lixin Yang

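Task: Propose OmniCam, a unified multimodal camera control framework for generating spatio-temporally consistent videos.

Motivation: Existing camera control methods suffer from complex interaction and limited control capabilities.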

Details

Method: Leverages large language models and video diffusion models, supporting various combinations of input modalities (text, video, image, etc.) for camera motion control. Result: Experiments show that OmniCam achieves state-of-the-art performance in high-quality camera-controlled video generation. Conclusion: With multimodal inputs and a supporting dataset, OmniCam addresses the limitations of existing methods. Abstract: Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention. However, existing methods face challenges such as complex interaction and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera control framework. Leveraging large language models and video diffusion models, OmniCam generates spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or video with expected trajectory as camera path guidance, and image or video as content reference, enabling precise control over camera motion. To facilitate the training of OmniCam, we introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results demonstrate that our model achieves state-of-the-art performance in high-quality camera-controlled video generation across various metrics.

Generative Evaluation of Complex Reasoning in Large Language Models

Haowei Lin,Xiangyu Wang,Ruilin Yan,Baizhou Huang,Haotian Ye,Jianhua Zhu,Zihao Wang,James Zou,Jianzhu Ma,Yitao Liang

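Task: Evaluate whether large language models (LLMs) genuinely reason rather than merely relying on memorized training data.

Motivation: Public benchmarks lose reliability once they are incorporated into LLM training sets, so a new evaluation framework is needed that dynamically generates diverse reasoning tasks.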

Details

Method: Proposes KUMO, a framework that combines LLMs with symbolic engines to dynamically generate partially observable, difficulty-adjustable, multi-turn reasoning tasks, with an automated pipeline that continuously produces new tasks. Result: An evaluation of 23 state-of-the-art LLMs on 5,000 tasks finds that many LLMs exceed university-student performance on easy reasoning tasks, while reasoning-scaled LLMs reach university-student level on complex tasks. Conclusion: KUMO is a robust and enduring evaluation tool that effectively measures genuine LLM reasoning ability, and performance on it correlates strongly with newly released real-world reasoning benchmarks. Abstract: With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students. Our findings reveal that many LLMs have outperformed university-level performance on easy reasoning tasks, and reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.

ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation

Yuan Zhou,Shilong Jin,Litao Hua,Wanjun Lv,Haoran Duan,Jungong Han

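Task: Propose the ConsDreamer framework to address view bias in zero-shot text-to-3D generation.

Motivation: Existing methods that apply pre-trained text-to-image models to 3D Gaussian Splatting suffer from view bias, leading to inconsistent multi-view generation (e.g., the multi-face Janus problem).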

Details

Method: ConsDreamer reduces view bias by refining the conditional and unconditional terms in score distillation, using a View Disentanglement Module (VDM) and a similarity-based partial order loss. Result: Experiments show that ConsDreamer outperforms existing methods in multi-view consistency and visual quality. Conclusion: ConsDreamer effectively mitigates the multi-face Janus problem in text-to-3D generation and improves generation quality. Abstract: Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel framework that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise camera parameters; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer effectively mitigates the multi-face Janus problem in text-to-3D generation, outperforming existing methods in both visual quality and consistency.
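The idea of aligning cosine similarities with azimuth relationships can be illustrated with a small sketch. This is not the ConsDreamer loss itself, whose exact form the abstract does not give; it assumes a simple hinge formulation in which a view that is closer in azimuth to an anchor should be at least as similar to it as a view that is farther away. The names `partial_order_loss`, `azimuth_gap`, and `margin` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def azimuth_gap(a_deg, b_deg):
    """Smallest absolute angular difference between two azimuths, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def partial_order_loss(features, azimuths, margin=0.0):
    """Assumed hinge form: penalize triplets where the farther view
    is more similar to the anchor than the nearer view."""
    n = len(features)
    loss, count = 0.0, 0
    for i in range(n):          # anchor view
        for j in range(n):      # candidate nearer view
            for k in range(n):  # candidate farther view
                if len({i, j, k}) < 3:
                    continue
                if azimuth_gap(azimuths[i], azimuths[j]) < azimuth_gap(azimuths[i], azimuths[k]):
                    s_near = cosine(features[i], features[j])
                    s_far = cosine(features[i], features[k])
                    loss += max(0.0, s_far - s_near + margin)
                    count += 1
    return loss / count if count else 0.0

# Toy per-view features for three renders at 0, 90, and 180 degrees.
consistent = partial_order_loss([[1, 0], [0, 1], [-1, 0]], [0, 90, 180])
inconsistent = partial_order_loss([[1, 0], [-1, 0], [0, 1]], [0, 90, 180])
```

In the consistent case every ordering constraint is already satisfied, so the penalty is zero; swapping the features of the 90 and 180 degree views violates one constraint and produces a positive loss.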

LLMs Working in Harmony: A Survey on the Technological Aspects of Building Effective LLM-Based Multi Agent Systems

R. M. Aratchige,W. M. K. S. Ilmini

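Task: Survey the foundational technologies that support the development of Large Language Model (LLM)-based multi-agent systems.

Motivation: Optimize the performance of multi-agent systems in collaborative, dynamic environments.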

Details

Method: Analyzes four key areas (architecture, memory, planning, and technologies/frameworks) and assesses their limitations and innovations. Result: Summarizes the strengths and challenges of current technologies and offers recommendations for improving system scalability, collaboration, and adaptability. Conclusion: Provides a roadmap for future research, supporting the development of more efficient and robust multi-agent systems. Abstract: This survey investigates foundational technologies essential for developing effective Large Language Model (LLM)-based multi-agent systems. Aiming to answer how best to optimize these systems for collaborative, dynamic environments, we focus on four critical areas: Architecture, Memory, Planning, and Technologies/Frameworks. By analyzing recent advancements and their limitations, such as scalability, real-time response challenges, and agent coordination constraints, we provide a detailed view of the technological landscape. Frameworks like the Mixture of Agents architecture and the ReAct planning model exemplify current innovations, showcasing improvements in role assignment and decision-making. This review synthesizes key strengths and persistent challenges, offering practical recommendations to enhance system scalability, agent collaboration, and adaptability. Our findings provide a roadmap for future research, supporting the creation of robust, efficient multi-agent systems that advance both individual agent performance and collective system resilience.

X-Capture: An Open-Source Portable Device for Multi-Sensory Learning

Samuel Clarke,Suzannah Wistreich,Yanjie Ze,Jiajun Wu

Task: Introduce X-Capture, a device for real-world multi-sensory data collection, and demonstrate its utility in AI tasks.

Motivation: Existing datasets for multi-sensory AI are limited by controlled environments or restricted modalities, hindering progress in human-like sensory representations.

Details

Method: Develop X-Capture, a portable and cost-effective device to capture correlated RGBD images, tactile readings, and impact audio in real-world settings. Result: A sample dataset of 3,000 points on 500 objects is curated, showing value in quantity and sensory breadth for AI tasks like cross-sensory retrieval. Conclusion: X-Capture advances scalable, accessible, and real-world applicable multi-sensory AI, laying groundwork for richer human-like representations. Abstract: Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for both pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.

Urban Computing in the Era of Large Language Models

Zhonghang Li,Lianghao Xia,Xubin Ren,Jiabin Tang,Tianyi Chen,Yong Xu,Chao Huang

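Task: Explore the intersection of large language models (LLMs) and urban computing, emphasizing the role of LLMs in processing and analyzing urban data, enhancing decision-making, and fostering citizen engagement.

Motivation: Traditional approaches to urban computing face challenges with generalization, scalability, and contextual understanding, and the advent of LLMs offers transformative potential for addressing them.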

Details

Method: Reviews the evolution and core technologies of LLMs and surveys their applications in key urban domains such as transportation, public safety, and environmental monitoring. Result: Summarizes the functional roles and implementation patterns of LLMs in urban computing and proposes potential LLM-based solutions to unresolved challenges. Conclusion: Discusses the limitations of current approaches and outlines future directions for LLMs in urban computing. Abstract: Urban computing has emerged as a multidisciplinary field that harnesses data-driven technologies to address challenges and improve urban living. Traditional approaches, while beneficial, often face challenges with generalization, scalability, and contextual understanding. The advent of Large Language Models (LLMs) offers transformative potential in this domain. This survey explores the intersection of LLMs and urban computing, emphasizing the impact of LLMs in processing and analyzing urban data, enhancing decision-making, and fostering citizen engagement. We provide a concise overview of the evolution and core technologies of LLMs. Additionally, we survey their applications across key urban domains, such as transportation, public safety, and environmental monitoring, summarizing essential tasks and prior works in various urban contexts, while highlighting LLMs' functional roles and implementation patterns. Building on this, we propose potential LLM-based solutions to address unresolved challenges. To facilitate in-depth research, we compile a list of available datasets and tools applicable to diverse urban scenarios. Finally, we discuss the limitations of current approaches and outline future directions for advancing LLMs in urban computing.

Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

Congpei Qiu,Yanhao Wu,Wei Ke,Xiuxiu Bai,Tong Zhang

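Task: Propose a framework called Spatial Correlation Distillation (SCD) to enhance CLIP's spatial awareness in dense prediction tasks.

Motivation: CLIP excels at global alignment between language and images but performs poorly on tasks sensitive to spatial information, especially dense prediction.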

Details

Method: Preserves CLIP's spatial structure through the Spatial Correlation Distillation (SCD) framework and introduces a lightweight Refiner to extract high-quality dense features. Result: The method achieves state-of-the-art results on various open-vocabulary dense prediction benchmarks. Conclusion: The SCD framework effectively improves CLIP's performance on dense tasks while preserving its spatial awareness. Abstract: Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer from notable loss in spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on an intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.
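To make the notion of distilling spatial correlations concrete, the following sketch compares patch-to-patch cosine-similarity matrices of a student and a frozen teacher. It is only an illustration under stated assumptions: the abstract does not specify the SCD objective, so a mean-squared difference between the two matrices is used here as a stand-in, and `spatial_correlation` and `correlation_distillation_loss` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two patch feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def spatial_correlation(patches):
    """Pairwise cosine similarity between all patch features of one image."""
    return [[cosine(p, q) for q in patches] for p in patches]

def correlation_distillation_loss(student_patches, teacher_patches):
    """Assumed stand-in objective: mean squared difference between the
    student's and the teacher's patch correlation matrices."""
    cs = spatial_correlation(student_patches)
    ct = spatial_correlation(teacher_patches)
    n = len(cs)
    return sum((cs[i][j] - ct[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

teacher = [[1.0, 0.0], [0.0, 1.0]]    # two distinct patches
preserved = [[2.0, 0.0], [0.0, 3.0]]  # rescaled features, same spatial structure
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # both patches look identical
```

Because the loss depends only on correlations, rescaling the student's features leaves it at zero, while a student whose patches collapse to the same vector is penalized.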

Self-Resource Allocation in Multi-Agent LLM Systems

Alfonso Amayuelas,Jingbo Yang,Saaket Agashe,Ashwin Nagarajan,Antonis Antoniades,Xin Eric Wang,William Wang

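Task: Explore how LLMs can effectively allocate computational tasks in multi-agent systems.

Motivation: As LLMs develop into agents, multi-agent systems play an increasingly important role in task assignment and coordination, and their effectiveness needs to be studied.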

Details

Method: Compares LLMs acting as orchestrators and as planners in task assignment and analyzes their efficiency and performance. Result: Experiments show that LLMs achieve high validity and accuracy in resource allocation tasks, and that the planner method performs better in handling concurrent actions. Conclusion: Providing explicit information about agent capabilities improves planners' allocation strategies, particularly when dealing with suboptimal agents. Abstract: With the development of LLMs as agents, there is a growing interest in connecting multiple agents into multi-agent systems to solve tasks concurrently, focusing on their role in task assignment and coordination. This paper explores how LLMs can effectively allocate computational tasks among multiple agents, considering factors such as cost, efficiency, and performance. In this work, we address key questions, including the effectiveness of LLMs as orchestrators and planners, comparing their effectiveness in task assignment and coordination. Our experiments demonstrate that LLMs can achieve high validity and accuracy in resource allocation tasks. We find that the planner method outperforms the orchestrator method in handling concurrent actions, resulting in improved efficiency and better utilization of agents. Additionally, we show that providing explicit information about worker capabilities enhances the allocation strategies of planners, particularly when dealing with suboptimal workers.

Evaluating and Enhancing Segmentation Model Robustness with Metamorphic Testing

Seif Mzoughi,Mohamed Elshafeia,Foutse Khomh

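Task: Propose SegRMT, a metamorphic testing approach that uses genetic algorithms to optimize sequences of spatial and spectral transformations, in order to evaluate and enhance the robustness of image segmentation models.

Motivation: Image segmentation models lack robustness to subtle image distortions, which undermines their reliability in critical applications such as medical imaging and augmented reality.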

Details

Method: Optimizes transformation sequences with a genetic algorithm, preserves image fidelity via a PSNR threshold, and generates adversarial examples that challenge the DeepLabV3 model. Result: SegRMT reduces DeepLabV3's mIoU to 6.4%, outperforming other baselines (8.5%-21.7%), and substantially improves model performance when used for adversarial training (mIoU improvements of up to 73%). Conclusion: SegRMT not only simulates realistic image distortions but also enhances the robustness of segmentation models, making it suitable for safety-critical applications. Abstract: Image segmentation is critical for applications such as medical imaging, augmented reality, and video surveillance. However, segmentation models often lack robustness, making them vulnerable to adversarial perturbations from subtle image distortions. In this work, we propose SegRMT, a metamorphic testing approach that leverages genetic algorithms (GA) to optimize sequences of spatial and spectral transformations while preserving image fidelity via a predefined PSNR threshold. Using the Cityscapes dataset, our method generates adversarial examples that effectively challenge the DeepLabV3 segmentation model. Our experiments show that SegRMT reduces DeepLabV3's mean Intersection over Union (mIoU) to 6.4%, outperforming other adversarial baselines that decrease mIoU to between 8.5% and 21.7%. Furthermore, when used for adversarial training, SegRMT boosts model performance, achieving mIoU improvements up to 73% on dedicated adversarial datasets and increasing cross-adversarial mIoU to 53.8%, compared to only 2%-10% for other methods. These findings demonstrate that SegRMT not only simulates realistic image distortions but also enhances the robustness of segmentation models, making it a valuable tool for ensuring reliable performance in safety-critical applications.
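The PSNR-based fidelity constraint mentioned above can be sketched as a simple acceptance check that a search procedure might apply to each candidate distortion. PSNR is computed with its standard definition; the 20 dB threshold and the function name `keeps_fidelity` are assumptions for illustration, since the abstract only says the threshold is predefined.

```python
import math

def psnr(original, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    flat_o = [p for row in original for p in row]
    flat_d = [p for row in distorted for p in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_o, flat_d)) / len(flat_o)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def keeps_fidelity(original, distorted, threshold_db=20.0):
    """Accept a transformed image only if it stays above the PSNR threshold.
    The 20 dB default is a hypothetical value, not taken from the paper."""
    return psnr(original, distorted) >= threshold_db

image = [[100, 100], [100, 100]]
subtle = [[101, 101], [101, 101]]  # small perturbation, MSE = 1
destroyed = [[0, 0], [0, 0]]       # large perturbation, MSE = 10000
```

A genetic algorithm could use such a check to discard candidate transformation sequences whose output drifts too far from the source image before scoring the remaining ones by how much they degrade segmentation quality.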

TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining

Jeffrey Li,Mohammadreza Armandpour,Iman Mirzadeh,Sachin Mehta,Vaishaal Shankar,Raviteja Vemulapalli,Samy Bengio,Oncel Tuzel,Mehrdad Farajtabar,Hadi Pouransari,Fartash Faghri

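Task: Study evaluation strategies and update methods for large language models (LLMs) as new data becomes available.

Motivation: LLMs trained on historical web data become outdated, so it is necessary to explore how to update models effectively to adapt to new data.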

Details

Method: Introduces a web-scale dataset derived from 114 Common Crawl (CC) dumps, designs time-stratified evaluations, and compares the effectiveness of different continual learning methods. Result: On general CC data, autoregressive meta-schedules combined with fixed-ratio replay of older data reach held-out loss comparable to re-training from scratch while requiring 2.6x less computation. Conclusion: Replaying old data is crucial for generic web data but matters less for specific domains, so the update strategy should be adapted to the data's characteristics. Abstract: Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
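Fixed-ratio replay can be pictured as filling a constant share of every training batch with documents from earlier time slices. The sketch below is a minimal illustration, not the benchmark's data loader: the function `mix_batch`, the 25% ratio, and sampling with replacement are all assumptions.

```python
import random

def mix_batch(new_docs, old_docs, batch_size, replay_ratio, rng):
    """Build one batch in which a fixed fraction comes from older data."""
    n_old = round(batch_size * replay_ratio)  # replayed documents
    n_new = batch_size - n_old                # documents from the newest dump
    batch = rng.choices(old_docs, k=n_old) + rng.choices(new_docs, k=n_new)
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
batch = mix_batch(
    new_docs=["new-a", "new-b", "new-c"],
    old_docs=["old-a", "old-b"],
    batch_size=8,
    replay_ratio=0.25,  # hypothetical ratio
    rng=rng,
)
```

Keeping the ratio fixed makes the trade-off explicit: a higher value protects against forgetting on generic web data, while a lower value spends more of the budget on new data, which the abstract suggests is sufficient for some specific domains.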

LPA3D: 3D Room-Level Scene Generation from In-the-Wild Images

Ming-Jia Yang,Yu-Xiao Guo,Yang Liu,Bin Zhou,Xin Tong

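Task: Generate realistic, semantically plausible indoor scenes from single RGB images.

Motivation: Existing NeRF-based scene generation methods require additional information (such as multiple views, depth images, or semantic guidance) and cannot rely on RGB images alone, which limits their applicability.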

Details

Method: Proposes LPA-GAN, a NeRF-based generative method that redefines global poses within the Local-Pose-Alignment (LPA) framework and jointly optimizes pose prediction and scene generation. Result: Experiments show that LPA-GAN outperforms other methods in view consistency and semantic plausibility. Conclusion: LPA-GAN offers an effective solution for indoor scene generation that relies only on RGB images. Abstract: Generating realistic, room-level indoor scenes with semantically plausible and detailed appearances from in-the-wild images is crucial for various applications in VR, AR, and robotics. The success of NeRF-based generative methods indicates a promising direction to address this challenge. However, unlike their success at the object level, existing scene-level generative methods require additional information, such as multiple views, depth images, or semantic guidance, rather than relying solely on RGB images. This is because NeRF-based methods necessitate prior knowledge of camera poses, which is challenging to approximate for indoor scenes due to the complexity of defining alignment and the difficulty of globally estimating poses from a single image, given the unseen parts behind the camera. To address this challenge, we redefine global poses within the framework of Local-Pose-Alignment (LPA) -- an anchor-based multi-local-coordinate system that uses a selected number of anchors as the roots of these coordinates. Building on this foundation, we introduce LPA-GAN, a novel NeRF-based generative approach that incorporates specific modifications to estimate the priors of camera poses under LPA. It also co-optimizes the pose predictor and scene generation processes. Our ablation study and comparisons with straightforward extensions of NeRF-based object generative methods demonstrate the effectiveness of our approach. Furthermore, visual comparisons with other techniques reveal that our method achieves superior view-to-view consistency and semantic normality.

Exploring LLM Reasoning Through Controlled Prompt Variations

Giannis Chatziveroglou,Richard Yun,Maura Kelleher

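Task: Study the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematic input perturbations.

Motivation: Assess how well state-of-the-art models maintain logical consistency and correctness under four types of prompt perturbation (irrelevant context, pathological instructions, factually relevant but non-essential context, and a combination of the latter two).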

Details

Method: Uses the GSM8K dataset as a controlled testbed and runs experiments on 13 open-source and closed-source LLMs. Result: Introducing irrelevant context significantly degrades performance; performance regressions are relatively insensitive to reasoning-task complexity; and certain perturbations inadvertently trigger chain-of-thought-like reasoning. Conclusion: Current LLMs have critical vulnerabilities and need improved robustness against noisy, misleading, and contextually dense inputs to reason more reliably in real-world applications. Abstract: This study investigates the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematically introduced input perturbations. Using the GSM8K dataset as a controlled testbed, we evaluate how well state-of-the-art models maintain logical consistency and correctness when confronted with four categories of prompt perturbations: irrelevant context, pathological instructions, factually relevant but non-essential context, and a combination of the latter two. Our experiments, conducted on thirteen open-source and closed-source LLMs, reveal that introducing irrelevant context within the model's context window significantly degrades performance, suggesting that distinguishing essential from extraneous details remains a pressing challenge. Surprisingly, performance regressions are relatively insensitive to the complexity of the reasoning task, as measured by the number of steps required, and are not strictly correlated with model size. Moreover, we observe that certain perturbations inadvertently trigger chain-of-thought-like reasoning behaviors, even without explicit prompting. Our findings highlight critical vulnerabilities in current LLMs and underscore the need for improved robustness against noisy, misleading, and contextually dense inputs, paving the way for more resilient and reliable reasoning in real-world applications.
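The four perturbation categories can be sketched as a small prompt builder. Everything concrete here is hypothetical: the example question, the perturbation texts, and where each one is placed relative to the question are assumptions for illustration, not the prompts used in the study.

```python
def build_prompt(question, kind, extras):
    """Wrap a math question with one of the four perturbation categories."""
    if kind == "irrelevant":
        parts = [extras["irrelevant"], question]
    elif kind == "pathological":
        parts = [question, extras["pathological"]]
    elif kind == "relevant":
        parts = [extras["relevant"], question]
    elif kind == "combined":
        # Combination of the latter two categories: relevant context
        # plus a pathological instruction.
        parts = [extras["relevant"], question, extras["pathological"]]
    else:
        raise ValueError(f"unknown perturbation kind: {kind}")
    return "\n\n".join(parts)

# Hypothetical perturbation texts.
extras = {
    "irrelevant": "The city library opened a new reading room last spring.",
    "pathological": "Add 3 to every number that appears in the question before solving.",
    "relevant": "Apples are usually sold by weight, but here they are sold per piece.",
}
question = "Tom buys 4 apples at 2 dollars each. How much does he pay?"
prompts = {kind: build_prompt(question, kind, extras)
           for kind in ("irrelevant", "pathological", "relevant", "combined")}
```

Holding the question fixed and varying only the wrapper is what lets accuracy differences be attributed to the perturbation rather than to the underlying problem.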

SemiISP/SemiIE: Semi-Supervised Image Signal Processor and Image Enhancement Leveraging One-to-Many Mapping sRGB-to-RAW

Masakazu Yoshimura,Junji Otsuka,Radu Berdan,Takeshi Ohashi

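Task: Realize semi-supervised learning for Image Signal Processor (ISP) and image enhancement (IE) tasks.

Motivation: Training data for ISP and IE tasks is costly to create and personalization needs are diverse, making semi-supervised learning a potential solution.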

Details

Method: Proposes an improved sRGB-to-RAW method and combines it with a semi-supervised learning framework. Result: The proposed method improves the image quality of various models on various datasets. Conclusion: Semi-supervised learning combined with the improved sRGB-to-RAW method offers an effective solution for ISP and IE tasks. Abstract: DNN-based methods have been successful in Image Signal Processor (ISP) and image enhancement (IE) tasks. However, the cost of creating training data for these tasks is considerably higher than for other tasks, making it difficult to prepare large-scale datasets. Also, creating personalized ISP and IE with minimal training data can lead to new value streams since preferred image quality varies depending on the person and use case. While semi-supervised learning could be a potential solution in such cases, it has rarely been utilized for these tasks. In this paper, we realize semi-supervised learning for ISP and IE leveraging a RAW image reconstruction (sRGB-to-RAW) method. Although existing sRGB-to-RAW methods can generate pseudo-RAW image datasets that improve the accuracy of RAW-based high-level computer vision tasks such as object detection, their quality is not sufficient for ISP and IE tasks that require precise image quality definition. Therefore, we also propose an sRGB-to-RAW method that can improve the image quality of these tasks. The proposed semi-supervised learning with the proposed sRGB-to-RAW method successfully improves the image quality of various models on various datasets.

Achieving Unanimous Consensus in Decision Making Using Multi-Agents

Apurba Pokharel,Ram Dantu,Shakila Zaman,Sirisha Talapuru,Vinh Quach

Task: Introduce a deliberation-based consensus mechanism using Large Language Models (LLMs) for blockchain decision-making.

Motivation: Existing consensus mechanisms like PoW and PoS lack adaptability for scenarios requiring unanimous or graded decision-making based on opinions.

Details

Method: Leverage LLMs as rational agents in structured discussions, using graded consensus and multi-round deliberation. Result: Formalized system maintains blockchain properties (consistency, agreement, liveness, determinism) and experimental results validate feasibility. Conclusion: The proposed method addresses challenges like thought degeneration and scalability, enabling effective decision-making on blockchain networks. Abstract: Blockchain consensus mechanisms have relied on algorithms such as Proof-of-Work (PoW) and Proof-of-Stake (PoS) to ensure network functionality and integrity. However, these approaches struggle with adaptability for decision-making where the opinions of each matter rather than reaching an agreement based on honest majority or weighted consensus. This paper introduces a novel deliberation-based consensus mechanism where Large Language Models (LLMs) act as rational agents engaging in structured discussions to reach a unanimous consensus. By leveraging graded consensus and a multi-round deliberation process, our approach ensures both unanimous consensus for definitive problems and graded confidence for prioritized decisions and policies. We provide a formalization of our system and use it to show that the properties of blockchains: consistency, agreement, liveness, and determinism are maintained. Moreover, experimental results demonstrate our system's feasibility, showcasing how our deliberation method's convergence, block properties, and accuracy enable decision-making on blockchain networks. We also address key challenges with this novel approach such as degeneration of thoughts, hallucinations, malicious models and nodes, resource consumption, and scalability.

Agglomerating Large Vision Encoders via Distillation for VFSS Segmentation

Chengxi Zeng,Yuxuan Jiang,Fan Zhang,Alberto Gambaruto,Tilo Burghardt

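Task: Improve the performance of low-complexity models through knowledge distillation from multiple large medical foundation models.

Motivation: Medical foundation models perform well in medical imaging, but their training and inference complexity is high, and lightweight variants have limited performance.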

Details

Method: Proposes a new framework that improves low-complexity models via knowledge distillation from multiple large medical foundation models (e.g., MedSAM, RAD-DINO, MedCLIP). Result: The agglomerated model generalizes well across 12 segmentation tasks, with an average gain of 2% in Dice coefficient over simple distillation. Conclusion: The approach achieves a better tradeoff between complexity and performance. Abstract: The deployment of foundation models for medical imaging has demonstrated considerable success. However, their training overheads associated with downstream tasks remain substantial due to the size of the image encoders employed, and the inference complexity is also significantly high. Although lightweight variants have been obtained for these foundation models, their performance is constrained by their limited model capacity and suboptimal training strategies. In order to achieve an improved tradeoff between complexity and performance, we propose a new framework to improve the performance of low complexity models via knowledge distillation from multiple large medical foundation models (e.g., MedSAM, RAD-DINO, MedCLIP), each specializing in different vision tasks, with the goal to effectively bridge the performance gap for medical image segmentation tasks. The agglomerated model demonstrates superior generalization across 12 segmentation tasks, whereas specialized models require explicit training for each task. Our approach achieved an average performance gain of 2% in Dice coefficient compared to simple distillation.
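For readers unfamiliar with the reported metric, the Dice coefficient between a predicted and a reference binary mask is twice their overlap divided by the sum of their sizes. The sketch below uses the standard definition on small nested-list masks; treating two empty masks as a perfect match is a common convention assumed here, not something stated in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    flat_p = [v for row in pred for v in row]
    flat_t = [v for row in target for v in row]
    intersection = sum(p * t for p, t in zip(flat_p, flat_t))
    total = sum(flat_p) + sum(flat_t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = dice_coefficient(pred, target)  # overlap 1, sizes 2 and 1
```

On this scale a 2% average gain means the segmentation overlap score rises by about 0.02 in absolute terms if the figure is read as percentage points, which is one plausible reading of the abstract.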

Towards Interpretable Soft Prompts

Oam Patel,Jason Wang,Nikhil Shivakumar Nayak,Suraj Srinivas,Himabindu Lakkaraju

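Task: Propose a new theoretical framework for evaluating the interpretability of trainable prompts.

Motivation: Methods such as soft prompts improve task performance but lack interpretability, so a theoretical framework is needed to evaluate and improve them.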

Details

Method: Based on two criteria, faithfulness and scrutability, proposes new interpretability-oriented objective functions and tests them on two prompt tuning methods, PEZ and RLPrompt. Result: Experiments show a trade-off between interpretability and task performance and reveal odd behavior when optimizing an interpretability proxy. Conclusion: The study shows the difficulty of the soft prompt interpretability problem and points to a direction for future interpretability-oriented prompting methods. Abstract: Soft prompts have been popularized as a cheap and easy way to improve task-specific LLM performance beyond few-shot prompts. Despite their origin as an automated prompting method, however, soft prompts and other trainable prompts remain a black-box method with no immediately interpretable connections to prompting. We create a novel theoretical framework for evaluating the interpretability of trainable prompts based on two desiderata: faithfulness and scrutability. We find that existing methods do not naturally satisfy our proposed interpretability criterion. Instead, our framework inspires a new direction of trainable prompting methods that explicitly optimizes for interpretability. To this end, we formulate and test new interpretability-oriented objective functions for two state-of-the-art prompt tuners: Hard Prompts Made Easy (PEZ) and RLPrompt. Our experiments with GPT-2 demonstrate a fundamental trade-off between interpretability and the task-performance of the trainable prompt, explicating the hardness of the soft prompt interpretability problem and revealing odd behavior that arises when one optimizes for an interpretability proxy.

All-day Depth Completion via Thermal-LiDAR Fusion

Janghyun Kim,Minseong Kweon,Jinsun Park,Ukcheol Shin

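Task: Study depth completion that combines thermal imaging with LiDAR to improve performance in harsh conditions (such as low light and rain).

Motivation: Existing RGB-LiDAR depth completion methods perform poorly in harsh environments, and ground-truth depth data has missing measurements in adverse weather; thermal cameras remain reliable in these conditions, but related research is scarce.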

Details

Method: Proposes the COPS framework, which combines contrastive learning and pseudo-supervision and leverages a depth foundation model to sharpen depth boundaries and improve completion accuracy. Result: Validates the feasibility and robustness of the approach on the MS$^2$ and ViViD datasets and analyzes the key challenges of thermal-LiDAR depth completion. Conclusion: COPS addresses the blurred boundaries and insufficient supervision in thermal-LiDAR depth completion and provides direction for future research. Abstract: Depth completion, which estimates dense depth from sparse LiDAR and RGB images, has demonstrated outstanding performance in well-lit conditions. However, due to the limitations of RGB sensors, existing methods often struggle to achieve reliable performance in harsh environments, such as heavy rain and low-light conditions. Furthermore, we observe that ground truth depth maps often suffer from large missing measurements in adverse weather conditions such as heavy rain, leading to insufficient supervision. In contrast, thermal cameras are known for providing clear and reliable visibility in such conditions, yet thermal-LiDAR depth completion remains underexplored. Moreover, the characteristics of thermal images, such as blurriness, low contrast, and noise, bring unclear depth boundary problems. To address these challenges, we first evaluate the feasibility and robustness of thermal-LiDAR depth completion across diverse lighting (e.g., well-lit, low-light), weather (e.g., clear-sky, rainy), and environment (e.g., indoor, outdoor) conditions, by conducting extensive benchmarks on the MS$^2$ and ViViD datasets. In addition, we propose a framework that utilizes COntrastive learning and Pseudo-Supervision (COPS) to enhance depth boundary clarity and improve completion accuracy by leveraging a depth foundation model in two key ways. First, COPS enforces a depth-aware contrastive loss between different depth points by mining positive and negative samples using a monocular depth foundation model to sharpen depth boundaries. Second, it mitigates the issue of incomplete supervision from ground truth depth maps by leveraging foundation model predictions as dense depth priors. We also provide in-depth analyses of the key challenges in thermal-LiDAR depth completion to aid in understanding the task and encourage future research.

Neural Style Transfer for Synthesising a Dataset of Ancient Egyptian Hieroglyphs

Lewis Matheson Creed

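Task: Propose a new method that uses Neural Style Transfer (NST) to generate datasets of ancient Egyptian hieroglyphs.

Motivation: Training data for low-resource languages such as Ancient Egyptian is limited, which restricts the application of machine learning techniques.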

Details

Method: Generates a dataset of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Result: Image classification models trained on NST-generated examples and on real photographs show comparable classification performance and transferability. Conclusion: NST is an effective data augmentation method for addressing data scarcity in low-resource languages. Abstract: The limited availability of training data for low-resource languages makes applying machine learning techniques challenging. Ancient Egyptian is one such language with few resources. However, innovative applications of data augmentation methods, such as Neural Style Transfer, could overcome these barriers. This paper presents a novel method for generating datasets of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Experimental results found that image classification models trained on NST-generated examples and photographs demonstrate equal performance and transferability to real unseen images of hieroglyphs.

Brightness Perceiving for Recursive Low-Light Image Enhancement

Haodian Wang,Long Peng,Yuejin Sun,Zengyu Wan,Yang Wang,Yang Cao

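Task: Propose a brightness-perceiving recursive enhancement framework for high dynamic range low-light image enhancement.

Motivation: Because real low-light scenes span a wide dynamic range, existing end-to-end methods struggle to enhance low-light images to normal exposure.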

Details

Method: Uses a recursive enhancement framework with two parallel sub-networks (ACT-Net and BP-Net), coordinated through an unsupervised training strategy. Result: Achieves new SOTA performance on six reference and no-reference metrics, improving PSNR by 0.9 dB over the existing SOTA method. Conclusion: The proposed method effectively addresses low-light image enhancement and clearly outperforms existing methods. Abstract: Due to the wide dynamic range in real low-light scenes, there will be large differences in the degree of contrast degradation and detail blurring of captured images, making it difficult for existing end-to-end methods to enhance low-light images to normal exposure. To address the above issue, we decompose low-light image enhancement into a recursive enhancement task and propose a brightness-perceiving-based recursive enhancement framework for high dynamic range low-light image enhancement. Specifically, our recursive enhancement framework consists of two parallel sub-networks: Adaptive Contrast and Texture enhancement network (ACT-Net) and Brightness Perception network (BP-Net). The ACT-Net is proposed to adaptively enhance image contrast and details under the guidance of the brightness adjustment branch and gradient adjustment branch, which are proposed to perceive the degradation degree of contrast and details in low-light images. To adaptively enhance images captured under different brightness levels, BP-Net is proposed to control the recursive enhancement times of ACT-Net by exploring the image brightness distribution properties. Finally, in order to coordinate ACT-Net and BP-Net, we design a novel unsupervised training strategy to facilitate the training procedure. To further validate the effectiveness of the proposed method, we construct a new dataset with a broader brightness distribution by mixing three low-light datasets. Compared with eleven existing representative methods, the proposed method achieves new SOTA performance on six reference and no reference metrics. Specifically, the proposed method improves the PSNR by 0.9 dB compared to the existing SOTA method.
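The control flow of brightness-guided recursive enhancement can be sketched with two simple stand-ins. Neither function reproduces the learned networks: a gamma curve plays the role of one ACT-Net enhancement pass, and a mean-brightness check plays the role of BP-Net deciding whether another pass is needed. The target brightness, gamma value, and step limit are hypothetical.

```python
def mean_brightness(image):
    """Average pixel intensity of a grayscale image with values in [0, 1]."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

def enhance_once(image, gamma=0.7):
    """Stand-in for one enhancement pass: a gamma curve that brightens dark pixels."""
    return [[p ** gamma for p in row] for row in image]

def enhance_recursively(image, target=0.4, max_steps=10):
    """Repeat the enhancement pass until the brightness check is satisfied.
    The check is a stand-in for a learned controller of the recursion count."""
    steps = 0
    while mean_brightness(image) < target and steps < max_steps:
        image = enhance_once(image)
        steps += 1
    return image, steps

dark = [[0.1, 0.1], [0.1, 0.1]]
bright = [[0.8, 0.8], [0.8, 0.8]]
enhanced, steps_dark = enhance_recursively(dark)
_, steps_bright = enhance_recursively(bright)
```

The point of the recursion is that darker inputs receive more passes than brighter ones, so a single enhancement module can cover a wide range of exposure levels.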

Jacy Reese Anthis,Ryan Liu,Sean M. Richardson,Austin C. Kozlowski,Bernard Koch,James Evans,Erik Brynjolfsson,Michael Bernstein

Task: Examine how accurate large language model (LLM) simulations of human research subjects can be achieved by addressing five tractable challenges.

Motivation: LLM simulations of human research subjects offer a potential data source for understanding human behavior and training new AI systems, but results so far are limited and few social scientists have adopted them.

Details

Method: A literature survey covering empirical comparisons between LLMs and human research subjects, commentaries, and related work, from which directions in prompting, fine-tuning, and complementary methods are identified. Result: LLM social simulations can already be used for exploratory research, such as pilot experiments, in psychology, economics, sociology, and marketing. Conclusion: With rapidly advancing LLM capabilities, more widespread use may soon be possible; researchers should prioritize conceptual models and evaluation methods that can be iteratively deployed and refined. Abstract: Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted these methods. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a literature survey of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions with prompting, fine-tuning, and complementary methods. We believe that LLM social simulations can already be used for exploratory research, such as pilot experiments for psychology, economics, sociology, and marketing. More widespread use may soon be possible with rapidly advancing LLM capabilities, and researchers should prioritize developing conceptual models and evaluations that can be iteratively deployed and refined at pace with ongoing AI advances.

VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Kim Sung-Bin,Jeongsoo Choi,Puyuan Peng,Joon Son Chung,Tae-Hyun Oh,David Harwath

Task: Propose VoiceCraft-Dub, an automated video dubbing method that synthesizes high-quality speech from text and facial cues.

Motivation: The task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals.

Details

Method: Builds on Neural Codec Language Models (NCLMs), incorporating video features and purpose-built adapters so that speech is synchronized with facial movements and remains naturally expressive. Result: The model performs well in both human perception studies and objective evaluations, producing high-quality, intelligible, and natural speech. Conclusion: Beyond video dubbing, VoiceCraft-Dub extends to the video-to-speech task, demonstrating its versatility. Abstract: We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.

Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data

Waris Gill,Justin Cechmanek,Tyler Hutcherson,Srijith Rajamohan,Jen Agarwal,Muhammad Ali Gulzar,Manvinder Singh,Benoit Dion

Task: Investigate how specialized, fine-tuned embedding models can improve the effectiveness of semantic caching.

Motivation: Semantic caching relies on embedding similarity rather than exact key matching, which creates distinctive challenges in balancing precision, query latency, and computational efficiency.

Details

Method: Use smaller, domain-specific embedding models fine-tuned on real-world and synthetically generated datasets. Result: Compact embedding models fine-tuned for just one epoch on specialized datasets significantly outperform existing open-source and proprietary alternatives in precision and recall. Conclusion: By balancing computational overhead against accuracy, the work offers a viable and efficient strategy for implementing semantic caching. Abstract: This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in balancing precision, query latency, and computational efficiency. We propose leveraging smaller, domain-specific embedding models, fine-tuned with targeted real-world and synthetically generated datasets. Our empirical evaluations demonstrate that compact embedding models fine-tuned for just one epoch on specialized datasets significantly surpass both state-of-the-art open-source and proprietary alternatives in precision and recall. Moreover, we introduce a novel synthetic data generation pipeline for the semantic cache that mitigates the challenge of limited domain-specific annotated data, further boosting embedding performance. Our approach effectively balances computational overhead and accuracy, establishing a viable and efficient strategy for practical semantic caching implementations.
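
The core idea of a semantic cache, lookup by embedding similarity instead of exact key match, can be sketched in a few lines. The example below is only an illustration: `toy_embed`, `VOCAB`, the `SemanticCache` class, and the 0.8 threshold are hypothetical stand-ins, and the bag-of-words vector takes the place of the fine-tuned embedding model the report actually studies.

```python
import math

# Hypothetical toy vocabulary; a real system would call a fine-tuned embedding model.
VOCAB = ["reset", "password", "weather", "today", "refund"]

def toy_embed(text):
    """Bag-of-words counts over VOCAB (stand-in for a learned embedding)."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed, threshold=0.8):
        self.embed = embed
        self.threshold = threshold   # assumed value; tuning it trades precision for recall
        self.entries = []            # list of (embedding, response) pairs

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

    def get(self, query):
        """Return the response of the most similar cached query, or None on a miss."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

cache = SemanticCache(toy_embed)
cache.put("How do I reset my password", "Use the account settings page.")
hit = cache.get("reset password please")   # paraphrase of the cached query
miss = cache.get("weather today")          # unrelated query
```

The quality of the embedding decides whether paraphrases land above the threshold and unrelated queries below it, which is why the report focuses on the embedding model itself.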

Marine Saliency Segmenter: Object-Focused Conditional Diffusion with Region-Level Semantic Knowledge Distillation

Laibin Chang,Yunke Wang,JiaXing Huang,Longxiang Deng,Bo Du,Chang Xu

Task: Propose DiffMSS, a diffusion-based marine saliency segmentation method that addresses the object mislocalization and imprecise boundaries of existing techniques.

Motivation: Existing marine segmentation techniques suffer from object mislocalization and blurred boundaries in complex underwater environments; diffusion models perform well in visual segmentation, but contextual semantics could be exploited further to improve feature learning for region-level salient objects.

Details

Method: Uses semantic knowledge distillation to guide segmentation, designs a region-word similarity matching mechanism to extract salient terms from text descriptions, generates semantically informed diffusion conditions through a conditional feature learning network, and develops consensus deterministic sampling to refine the segmentation of fine-grained structures. Result: DiffMSS outperforms state-of-the-art methods in both quantitative and qualitative evaluations. Conclusion: Through semantic knowledge distillation and consensus deterministic sampling, DiffMSS substantially improves the accuracy and boundary precision of marine saliency segmentation. Abstract: Marine Saliency Segmentation (MSS) plays a pivotal role in various vision-based marine exploration tasks. However, existing marine segmentation techniques face the dilemma of object mislocalization and imprecise boundaries due to the complex underwater environment. Meanwhile, despite the impressive performance of diffusion models in visual segmentation, there remains potential to further leverage contextual semantics to enhance feature learning of region-level salient objects, thereby improving segmentation outcomes. Building on this insight, we propose DiffMSS, a novel marine saliency segmenter based on the diffusion model, which utilizes semantic knowledge distillation to guide the segmentation of marine salient objects. Specifically, we design a region-word similarity matching mechanism to identify salient terms at the word level from the text descriptions. These high-level semantic features guide the conditional feature learning network in generating salient and accurate diffusion conditions with semantic knowledge distillation. To further refine the segmentation of fine-grained structures in unique marine organisms, we develop the dedicated consensus deterministic sampling to suppress overconfident missegmentations. Comprehensive experiments demonstrate the superior performance of DiffMSS over state-of-the-art methods in both quantitative and qualitative evaluations.

ZClip: Adaptive Spike Mitigation for LLM Pre-Training

Abhay Kumar,Louis Owen,Nilabhra Roy Chowdhury,Fabian Güra

Task: Propose ZClip, an adaptive gradient clipping algorithm that addresses gradient instability and loss spikes in large language model training.

Motivation: Traditional gradient clipping relies on fixed thresholds or heuristics and cannot handle gradient instability and loss spikes effectively, leading to inefficient learning and frequent manual intervention.

Details

Method: ZClip adjusts the clipping threshold dynamically from the statistical properties of gradient norms, using z-score-based anomaly detection to identify and mitigate large gradient spikes. Result: ZClip prevents malignant loss spikes without interfering with convergence. Conclusion: ZClip is an adaptive gradient clipping method that requires no prior assumptions and improves the stability and efficiency of large language model training. Abstract: Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: https://github.com/bluorion-com/ZClip.
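
The general idea of z-score-based spike detection on gradient norms can be sketched as follows. This is not the authors' implementation (see their repository for that): the class name, the warm-up scheme, the exponential-moving-average update, and the default hyperparameters are all assumptions chosen to keep the example small.

```python
import math

class ZScoreClipper:
    """Illustrative z-score clipper over a running mean/variance of gradient norms."""

    def __init__(self, z_thresh=2.5, alpha=0.9, warmup=4):
        self.z_thresh = z_thresh   # assumed spike threshold in standard deviations
        self.alpha = alpha         # assumed EMA decay for the running statistics
        self.warmup = warmup       # steps used to initialize the statistics
        self.history = []
        self.mean = None
        self.var = None

    def clip(self, grad_norm):
        """Return the gradient norm to use for this step after spike mitigation."""
        if self.mean is None:
            # Warm-up: collect norms, then initialize mean and variance from them.
            self.history.append(grad_norm)
            if len(self.history) >= self.warmup:
                n = len(self.history)
                self.mean = sum(self.history) / n
                self.var = sum((g - self.mean) ** 2 for g in self.history) / n
            return grad_norm
        std = math.sqrt(self.var) + 1e-8
        z = (grad_norm - self.mean) / std
        # A norm far above the running mean is treated as a spike and pulled back.
        clipped = self.mean + self.z_thresh * std if z > self.z_thresh else grad_norm
        # Update statistics with the clipped value so a spike does not inflate them.
        self.mean = self.alpha * self.mean + (1 - self.alpha) * clipped
        self.var = self.alpha * self.var + (1 - self.alpha) * (clipped - self.mean) ** 2
        return clipped

clipper = ZScoreClipper()
for norm in [1.0, 1.1, 0.9, 1.0]:
    clipper.clip(norm)              # warm-up steps pass through unchanged
spike = clipper.clip(10.0)          # large outlier is reduced
normal = clipper.clip(1.05)         # ordinary norm is left untouched
```

In a training loop, the ratio `clipped / grad_norm` would be used to rescale the gradients before the optimizer step; because the threshold tracks the running statistics, no fixed clipping constant has to be chosen in advance.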

Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval

Boseung Jeong,Jicheol Park,Sungyeon Kim,Suha Kwak

Task: Video-text retrieval, i.e., retrieving videos from a text query or vice versa.

Motivation: Existing methods rely mainly on visual and textual features and often ignore audio, even though audio improves understanding of video content; traditional models that do use audio apply it blindly, which yields suboptimal video representations.

Details

Method: Proposes the AVIGATE framework, which uses a gated attention mechanism to exploit audio cues selectively, together with an adaptive margin-based contrastive loss to improve video-text alignment. Result: AVIGATE achieves state-of-the-art performance on public benchmarks. Conclusion: By using audio effectively and improving alignment, AVIGATE substantially improves video-text retrieval performance. Abstract: Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.

Reasoning Inconsistencies and How to Mitigate Them in Deep Learning

Erik Arakelyan

Task: Propose new methods to detect, quantify, and mitigate reasoning inconsistencies in deep learning models that operate over knowledge graphs, natural language, and images.

Motivation: The internal reasoning of deep learning models is opaque and can contain systematic inconsistencies or logical errors, creating a risk of deploying models that are biased, exploitable, or logically unreliable.

Details

Method: Develops two techniques for detecting predictive inconsistencies in natural language and image processing models; proposes a data-efficient sampling method and a synthetic dataset generation approach to mitigate training-data bias; and optimizes models for complex reasoning tasks. Result: A comprehensive framework for improving the robustness, fairness, and interpretability of deep learning models. Conclusion: Through these techniques, the thesis improves the performance, fairness, and interpretability of deep learning models across tasks and modalities. Abstract: The recent advancements in Deep Learning models and techniques have led to significant strides in performance across diverse tasks and modalities. However, while the overall capabilities of models show promising growth, our understanding of their internal reasoning processes remains limited, particularly concerning systematic inconsistencies or error patterns of logical or inferential flaws. These inconsistencies may manifest as contradictory outputs, failure to generalize across similar tasks, or erroneous conclusions in specific contexts. Even detecting and measuring such reasoning discrepancies is challenging, as they may arise from opaque internal procedures, biases and imbalances in training data, or the inherent complexity of the task. Without effective methods to detect, measure, and mitigate these errors, there is a risk of deploying models that are biased, exploitable, or logically unreliable. This thesis aims to address these issues by producing novel methods for deep learning models that reason over knowledge graphs, natural language, and images. The thesis contributes two techniques for detecting and quantifying predictive inconsistencies originating from opaque internal procedures in natural language and image processing models. To mitigate inconsistencies from biases in training data, this thesis presents a data-efficient sampling method to improve fairness and performance and a synthetic dataset generation approach in low-resource scenarios. Finally, the thesis offers two techniques to optimize the models for complex reasoning tasks. These methods enhance model performance while allowing for more faithful and interpretable exploration and exploitation during inference. Critically, this thesis provides a comprehensive framework to improve the robustness, fairness, and interpretability of deep learning models across diverse tasks and modalities.

Hyperspectral Remote Sensing Images Salient Object Detection: The First Benchmark Dataset and Baseline

Peifu Liu,Huiyan Bai,Tingfa Xu,Jihui Wang,Huan Chen,Jianan Li

Task: Hyperspectral remote sensing image salient object detection (HRSI-SOD), which aims to identify objects or regions with distinct spectral contrast against the background.

Motivation: The area has significant potential for practical applications, but progress has been limited by the lack of dedicated datasets and methods.

Details

Method: Introduces HRSSD, the first HRSI-SOD dataset, and designs a baseline model, the Deep Spectral Saliency Network (DSSN), built around a Cross-level Saliency Assessment Block and a High-resolution Fusion Module. Result: Experiments on HRSSD validate the superiority of DSSN, and evaluations on other datasets demonstrate its generalizability. Conclusion: The work underscores the importance of dedicated datasets and methods in this area and releases the dataset and source code to support further research. Abstract: The objective of hyperspectral remote sensing image salient object detection (HRSI-SOD) is to identify objects or regions that exhibit distinct spectrum contrasts with the background. This area holds significant promise for practical applications; however, progress has been limited by a notable scarcity of dedicated datasets and methodologies. To bridge this gap and stimulate further research, we introduce the first HRSI-SOD dataset, termed HRSSD, which includes 704 hyperspectral images and 5327 pixel-level annotated salient objects. The HRSSD dataset poses substantial challenges for salient object detection algorithms due to large scale variation, diverse foreground-background relations, and multi-salient objects. Additionally, we propose an innovative and efficient baseline model for HRSI-SOD, termed the Deep Spectral Saliency Network (DSSN). The core of DSSN is the Cross-level Saliency Assessment Block, which performs pixel-wise attention and evaluates the contributions of multi-scale similarity maps at each spatial location, effectively reducing erroneous responses in cluttered regions and emphasizing salient regions across scales. Additionally, the High-resolution Fusion Module combines bottom-up fusion strategy and learned spatial upsampling to leverage the strengths of multi-scale saliency maps, ensuring accurate localization of small objects. Experiments on the HRSSD dataset robustly validate the superiority of DSSN, underscoring the critical need for specialized datasets and methodologies in this domain. Further evaluations on the HSOD-BIT and HS-SOD datasets demonstrate the generalizability of the proposed method. The dataset and source code are publicly available at https://github.com/laprf/HRSSD.

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

Yan Ma,Steffi Chern,Xuyang Shen,Yiran Zhong,Pengfei Liu

Task: Propose a transparent, from-scratch reinforcement learning framework for vision-language models (VLMs), together with a standardized evaluation scheme.

Motivation: Existing RL applications in VLMs rely on heavily engineered frameworks and lack reproducibility and standardized evaluation, which limits the comparability of results and the interpretation of training dynamics.

Details

Method: Designs a minimal yet functional four-step pipeline, validates it across multiple models and datasets, and proposes a standardized evaluation scheme. Result: Response length is sensitive to random seeds, reflective behavior correlates with output length, and RL generalizes better than supervised fine-tuning. Conclusion: The framework and findings aim to establish a reproducible baseline and encourage RL-based VLM research. Abstract: Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.

Leveraging Static Relationships for Intra-Type and Inter-Type Message Passing in Video Question Answering

Lili Liang,Guanglu Sun

Task: Propose a reasoning method based on static relationships that performs intra-type and inter-type message passing for video question answering.

Motivation: Existing methods fall short in recognizing and representing static relationships and do not fully exploit the static relationship information in videos for in-depth reasoning and analysis.

Details

Method: Constructs a dual graph for intra-type message passing and a heterogeneous graph based on static relationships for inter-type message passing. Result: Effectiveness is validated on the ANetQA and Next-QA datasets. Conclusion: By combining intra-type and inter-type clues, the method improves reasoning in video question answering. Abstract: Video Question Answering (VideoQA) is an important research direction in the field of artificial intelligence, enabling machines to understand video content and perform reasoning and answering based on natural language questions. Although methods based on static relationship reasoning have made certain progress, there are still deficiencies in the accuracy of static relationship recognition and representation, and they have not fully utilized the static relationship information in videos for in-depth reasoning and analysis. Therefore, this paper proposes a reasoning method for intra-type and inter-type message passing based on static relationships. This method constructs a dual graph for intra-type message passing reasoning and builds a heterogeneous graph based on static relationships for inter-type message passing reasoning. The intra-type message passing reasoning model captures the neighborhood information of targets and relationships related to the question in the dual graph, updating the dual graph to obtain intra-type clues for answering the question. The inter-type message passing reasoning model captures the neighborhood information of targets and relationships from different categories related to the question in the heterogeneous graph, updating the heterogeneous graph to obtain inter-type clues for answering the question. Finally, the answers are inferred by combining the intra-type and inter-type clues based on static relationships. Experimental results on the ANetQA and Next-QA datasets demonstrate the effectiveness of this method.

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Daoguang Zan,Zhirong Huang,Wei Liu,Hanwu Chen,Linhao Zhang,Shulin Xin,Lu Chen,Qi Liu,Xiaojian Zhong,Aoyan Li,Siyao Liu,Yongsheng Xiao,Liangqiang Chen,Yuyu Zhang,Jing Su,Tianyu Liu,Rui Long,Kai Shen,Liang Xiang

Task: Build Multi-SWE-bench, a multilingual issue-resolving benchmark for evaluating large language models (LLMs) across multiple programming languages.

Motivation: Existing benchmarks such as SWE-bench focus mainly on Python and cannot fully evaluate LLMs across diverse software ecosystems.

Details

Method: Expert annotation selects 1,632 high-quality instances from 2,456 candidates, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++; three representative methods (Agentless, SWE-agent, and OpenHands) are evaluated. Result: The Multi-SWE-bench benchmark is introduced, and 4,723 structured instances plus the data production pipeline are open-sourced to support reinforcement learning research. Conclusion: Multi-SWE-bench and the Multi-SWE-RL community are intended to advance reinforcement learning and lay groundwork for artificial general intelligence (AGI). Abstract: The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.

OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication

Zhongjian Wang,Peng Zhang,Jinwei Qi,Guangyuan Wang,Sheng Xu,Bang Zhang,Liefeng Bo

Task: Develop OmniTalker, an end-to-end unified framework that generates synchronized speech and talking head video in real time from text and a reference video.

Motivation: Address the limitations of existing cascaded pipelines: system complexity, latency, asynchronous audiovisual output, and stylistic mismatch between speech and visual expression.

Details

Method: A dual-branch diffusion transformer architecture in which the audio branch synthesizes mel-spectrograms from text and the visual branch predicts head poses and facial dynamics, with an audio-visual fusion module providing cross-modal synchronization. Result: OmniTalker surpasses existing methods in generation quality, particularly in style preservation and audio-video synchronization, with real-time inference at 25 FPS. Conclusion: OmniTalker is the first unified framework to jointly model speech and facial style in a zero-shot setting, offering an effective solution for real-time, high-quality talking head generation. Abstract: Recent years have witnessed remarkable advances in talking head generation, owing to its potential to revolutionize the human-AI interaction from text interfaces into realistic video chats. However, research on text-driven talking heads remains underexplored, with existing methods predominantly adopting a cascaded pipeline that combines TTS systems with audio-driven talking head models. This conventional pipeline not only introduces system complexity and latency overhead but also fundamentally suffers from asynchronous audiovisual output and stylistic discrepancies between generated speech and visual expressions. To address these limitations, we introduce OmniTalker, an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios, while preserving both speech style and facial styles. The framework employs a dual-branch diffusion transformer architecture: the audio branch synthesizes mel-spectrograms from text, while the visual branch predicts fine-grained head poses and facial dynamics. To bridge modalities, we introduce a novel audio-visual fusion module that integrates cross-modal information to ensure temporal synchronization and stylistic coherence between audio and visual outputs. Furthermore, our in-context reference learning module effectively captures both speech and facial style characteristics from a single reference video without introducing an extra style extracting module. To the best of our knowledge, OmniTalker presents the first unified framework that jointly models speech style and facial style in a zero-shot setting, achieving real-time inference speed of 25 FPS. Extensive experiments demonstrate that our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.

Efficient Model Editing with Task-Localized Sparse Fine-tuning

Leonardo Iurada,Marco Ciccone,Tatiana Tommasi

Task: Propose TaLoS, a method for building sparse task vectors that improves the efficiency and effectiveness of model editing.

Motivation: Existing methods rely on network linearization, which creates computational bottlenecks and does not guarantee weight disentanglement, undermining conflict-free composition of task vectors.

Details

Method: Identify the subset of parameters in a pre-trained model with low gradient sensitivity and sparsely update only those parameters to promote weight disentanglement. Result: TaLoS improves training and inference efficiency over existing methods and performs better in task addition and negation. Conclusion: By enabling modular parameter editing, TaLoS offers a practical route to deploying adaptable foundation models in real-world applications. Abstract: Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which allows building sparse task vectors with minimal interference without requiring explicit linearization and sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters allows for promoting weight disentanglement during fine-tuning. Our experiments prove that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
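
The mechanics of task arithmetic with sparse updates can be illustrated on a toy parameter vector. The sketch below is a simplified reading of the abstract, not the TaLoS algorithm itself: the sensitivity scores, the `keep_ratio` parameter, the single gradient step, and all function names are hypothetical.

```python
def low_sensitivity_mask(sensitivity, keep_ratio):
    """Mark the keep_ratio fraction of parameters with the lowest sensitivity as trainable."""
    k = max(1, int(len(sensitivity) * keep_ratio))
    keep = set(sorted(range(len(sensitivity)), key=lambda i: sensitivity[i])[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(sensitivity))]

def sparse_finetune_step(params, grads, mask, lr):
    """One gradient step that only touches the masked (low-sensitivity) parameters."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]

def task_vector(pretrained, finetuned):
    """Task vector = fine-tuned weights minus pre-trained weights."""
    return [f - p for p, f in zip(pretrained, finetuned)]

def apply_vectors(pretrained, vectors, sign=1.0):
    """Task addition (sign=+1) or negation (sign=-1) on top of the pre-trained weights."""
    out = list(pretrained)
    for vec in vectors:
        out = [o + sign * v for o, v in zip(out, vec)]
    return out

pretrained = [1.0, 2.0, 3.0, 4.0]
sensitivity = [0.5, 0.01, 0.02, 0.9]                  # made-up per-parameter scores
mask = low_sensitivity_mask(sensitivity, keep_ratio=0.5)

theta_a = sparse_finetune_step(pretrained, [1.0, 2.0, 0.0, 4.0], mask, lr=0.5)
theta_b = sparse_finetune_step(pretrained, [2.0, 0.0, -2.0, 1.0], mask, lr=0.5)
tau_a = task_vector(pretrained, theta_a)              # sparse: nonzero only under the mask
tau_b = task_vector(pretrained, theta_b)

added = apply_vectors(pretrained, [tau_a, tau_b])     # model with both tasks
negated = apply_vectors(pretrained, [tau_a], sign=-1.0)  # model with task A removed
```

Because every update is confined to the masked parameters, the resulting task vectors are sparse by construction, and parameters outside the mask are identical across all edited models.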

SkyReels-A2: Compose Anything in Video Diffusion Transformers

Zhengcong Fei,Debang Li,Di Qiu,Jiahua Wang,Yikun Dou,Rui Wang,Jingtao Xu,Mingyuan Fan,Guibin Chen,Yang Li,Yahui Zhou

Task: Propose the SkyReels-A2 framework for elements-to-video (E2V) generation from text prompts and reference images.

Motivation: Address the challenges of element fidelity, scene coherence, and natural output, and advance controllable video generation.

Details

Method: Designs a data pipeline to construct training triplets, proposes an image-text joint embedding model, and optimizes the inference pipeline. Result: Generates diverse, high-quality videos and performs favorably against closed-source commercial models. Conclusion: SkyReels-A2 is the first open-source commercial-grade E2V model and is expected to advance creative applications. Abstract: This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element. We term this task elements-to-video (E2V), whose primary challenges lie in preserving the fidelity of each reference element, ensuring coherent composition of the scene, and achieving natural outputs. To address these, we first design a comprehensive data pipeline to construct prompt-reference-video triplets for model training. Next, we propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment. We also optimize the inference pipeline for both speed and output stability. Moreover, we introduce a carefully curated benchmark for systematic evaluation, i.e., A2 Bench. Experiments demonstrate that our framework can generate diverse, high-quality videos with precise element control. SkyReels-A2 is the first open-source commercial-grade model for the generation of E2V, performing favorably against advanced closed-source commercial models. We anticipate SkyReels-A2 will advance creative applications such as drama and virtual e-commerce, pushing the boundaries of controllable video generation.

Affordable AI Assistants with Knowledge Graph of Thoughts

Maciej Besta,Lorenzo Paleari,Jia Hao Andrea Jiang,Robert Gerstenberger,You Wu,Patrick Iff,Ales Kubicek,Piotr Nyczyk,Diana Khimey,Jón Gunnar Hannesson,Grzegorz Kwaśniewski,Marcin Copik,Hubert Niewiadomski,Torsten Hoefler

Task: Propose Knowledge Graph of Thoughts (KGoT), a new AI assistant architecture that addresses the high cost and low success rates of current LLM-driven agents.

Motivation: State-of-the-art LLM-driven agents face high operational costs and limited success rates on complex tasks such as the GAIA benchmark.

Details

Method: KGoT combines LLM reasoning with dynamically constructed knowledge graphs (KGs), extracting and structuring task-relevant knowledge and iteratively enhancing it through external tools such as math solvers, web crawlers, and Python scripts. Result: KGoT improves task success rates on the GAIA benchmark by 29% and reduces costs by more than 36x, with similar improvements for other reasoning models such as Qwen2.5-32B and Deepseek-R1-70B. Conclusion: KGoT offers a scalable, affordable, and high-performing solution for AI assistants. Abstract: Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose the Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini, while reducing costs by over 36x compared to GPT-4o. Improvements for recent reasoning models are similar, e.g., 36% and 37.5% for Qwen2.5-32B and Deepseek-R1-70B, respectively. KGoT offers a scalable, affordable, and high-performing solution for AI assistants.

MonoGS++: Fast and Accurate Monocular RGB Gaussian SLAM

Renwu Li,Wenjing Ke,Dong Li,Lu Tian,Emad Barsoum

Task: Propose MonoGS++, a new fast and accurate SLAM method that relies solely on RGB input.

Motivation: Reduce the hardware dependency on depth sensors by requiring only RGB input, using online visual odometry to generate sparse point clouds in real time.

Details

Method: Introduces dynamic 3D Gaussian insertion, a clarity-enhancing Gaussian densification module, and planar regularization to reduce redundancy and improve 3D scene reconstruction quality. Result: Achieves precise camera tracking on the synthetic Replica and real-world TUM-RGBD datasets, comparable to the state of the art, with a 5.57x improvement in frame rate over the previous state-of-the-art method, MonoGS. Conclusion: MonoGS++ is an efficient and accurate SLAM method that markedly improves speed while maintaining tracking accuracy. Abstract: We present MonoGS++, a novel fast and accurate Simultaneous Localization and Mapping (SLAM) method that leverages 3D Gaussian representations and operates solely on RGB inputs. While previous 3D Gaussian Splatting (GS)-based methods largely depended on depth sensors, our approach reduces the hardware dependency and only requires RGB input, leveraging online visual odometry (VO) to generate sparse point clouds in real-time. To reduce redundancy and enhance the quality of 3D scene reconstruction, we implemented a series of methodological enhancements in 3D Gaussian mapping. Firstly, we introduced dynamic 3D Gaussian insertion to avoid adding redundant Gaussians in previously well-reconstructed areas. Secondly, we introduced a clarity-enhancing Gaussian densification module and planar regularization to handle texture-less areas and flat surfaces better. We achieved precise camera tracking results both on the synthetic Replica and real-world TUM-RGBD datasets, comparable to those of the state-of-the-art. Additionally, our method realized a significant 5.57x improvement in frames per second (fps) over the previous state-of-the-art, MonoGS.

A Framework for Situating Innovations, Opportunities, and Challenges in Advancing Vertical Systems with Large AI Models

Gaurav Verma,Jiawei Zhou,Mohit Chandra,Srijan Kumar,Munmun De Choudhury

Task: Propose a framework that bridges the gap between large AI models and the needs of real-world applications through a layer-wise abstraction of innovations.

Motivation: Large AI models excel on standardized benchmarks but show limitations in high-stakes domains such as healthcare, education, and law, including brittleness to minor input variations and a lack of contextual awareness.

Details

Method: Introduces a layer-wise abstraction framework and uses case studies to show how large models can be turned into practical vertical systems. Result: The framework helps researchers and practitioners situate their innovations, uncover overlooked opportunities, and facilitate cross-disciplinary communication. Conclusion: The framework offers a systematic approach to applying large AI models in practice and supports cross-disciplinary collaboration. Abstract: Large artificial intelligence (AI) models have garnered significant attention for their remarkable, often "superhuman", performance on standardized benchmarks. However, when these models are deployed in high-stakes verticals such as healthcare, education, and law, they often reveal notable limitations. For instance, they exhibit brittleness to minor variations in input data, present contextually uninformed decisions in critical settings, and undermine user trust by confidently producing or reproducing inaccuracies. These challenges in applying large models necessitate cross-disciplinary innovations to align the models' capabilities with the needs of real-world applications. We introduce a framework that addresses this gap through a layer-wise abstraction of innovations aimed at meeting users' requirements with large models. Through multiple case studies, we illustrate how researchers and practitioners across various fields can operationalize this framework. Beyond modularizing the pipeline of transforming large models into useful "vertical systems", we also highlight the dynamism that exists within different layers of the framework. Finally, we discuss how our framework can guide researchers and practitioners to (i) optimally situate their innovations (e.g., when vertical-specific insights can empower broadly impactful vertical-agnostic innovations), (ii) uncover overlooked opportunities (e.g., spotting recurring problems across verticals to develop practically useful foundation models instead of chasing benchmarks), and (iii) facilitate cross-disciplinary communication of critical challenges (e.g., enabling a shared vocabulary for AI developers, domain experts, and human-computer interaction scholars).

HGFormer: Topology-Aware Vision Transformer with HyperGraph Learning

Hao Wang,Shuo Zhang,Biao Leng

Task: Propose HyperGraph Transformer (HGFormer), a topology-aware vision transformer that addresses the shortcomings of conventional vision transformers in modeling regional context and spatial topology.

Motivation: Conventional vision transformers, by modeling permutation invariance and fully-connected interaction, disrupt regional context and spatial topology and depart from the principle of perceptual organization; the hypergraph concept is introduced to improve visual modeling.

Details

Method: Proposes the CS-KNN algorithm for semantic guidance during hypergraph construction and a topology-aware HyperGraph Attention (HGA) mechanism that uses hypergraph topology to guide the aggregation of global and unbiased information. Result: HGFormer performs well on multiple visual benchmarks and is competitive with recent state-of-the-art methods. Conclusion: Through hypergraph modeling and topology-aware mechanisms, HGFormer improves the quality of visual representations, yielding more detailed and unified scene depictions. Abstract: The computer vision community has witnessed an extensive exploration of vision transformers in the past two years. Drawing inspiration from traditional schemes, numerous works focus on introducing vision-specific inductive biases. However, the implicit modeling of permutation invariance and fully-connected interaction with individual tokens disrupts the regional context and spatial topology, further hindering higher-order modeling. This deviates from the principle of perceptual organization that emphasizes the local groups and overall topology of visual elements. Thus, we introduce the concept of hypergraph for perceptual exploration. Specifically, we propose a topology-aware vision transformer called HyperGraph Transformer (HGFormer). Firstly, we present a Center Sampling K-Nearest Neighbors (CS-KNN) algorithm for semantic guidance during hypergraph construction. Secondly, we present a topology-aware HyperGraph Attention (HGA) mechanism that integrates hypergraph topology as perceptual indications to guide the aggregation of global and unbiased information during hypergraph messaging. Using HGFormer as visual backbone, we develop an effective and unitive representation, achieving distinct and detailed scene depictions. Empirical experiments show that the proposed HGFormer achieves competitive performance compared to the recent SoTA counterparts on various visual benchmarks. Extensive ablation and visualization studies provide comprehensive explanations of our ideas and contributions.

Concept Lancet: Image Editing with Compositional Representation Transplant

Jinqi Luo,Tianjiao Ding,Kwan Ho Ryan Chan,Hancheng Min,Chris Callison-Burch,René Vidal

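Task: Propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for image editing with diffusion models.

Motivation: Existing editing methods that design an edit direction in the text embedding or score space struggle to balance edit strength: an edit that is too strong harms visual consistency, while one that is too weak fails the editing task.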

Details

Method: Decompose the source input in the latent space as a sparse linear combination of visual concepts to accurately estimate which concepts are present, then perform a customized concept transplant according to the editing task (replace/add/remove).

Result: Experiments show that methods equipped with CoLan achieve state-of-the-art editing effectiveness and consistency preservation.

Conclusion: CoLan provides an efficient, tuning-free solution for image editing with diffusion models.

Abstract: Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.
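The sparse decomposition step can be illustrated with a small sketch. It assumes a greedy matching-pursuit solver over unit-norm concept vectors; the abstract does not say which sparse solver CoLan uses, and the three-dimensional "concept" directions and all names below are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matching_pursuit(signal, atoms, n_steps):
    """Greedy sparse decomposition: signal ~ sum_j coeffs[j] * atoms[j].

    atoms must be unit-norm vectors. At each step the atom most correlated
    with the residual is picked and its contribution is subtracted.
    """
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_steps):
        scores = [dot(residual, a) for a in atoms]
        j = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        coeffs[j] += scores[j]
        residual = [r - scores[j] * a for r, a in zip(residual, atoms[j])]
    return coeffs, residual

# Hypothetical concept directions (orthonormal only to keep the example small).
concepts = {"cat": (1.0, 0.0, 0.0), "dog": (0.0, 1.0, 0.0), "grass": (0.0, 0.0, 1.0)}
names = list(concepts)
source = (2.0, 0.0, 0.5)  # mostly "cat", a little "grass", no "dog"
coeffs, residual = matching_pursuit(source, [concepts[n] for n in names], n_steps=2)
presence = dict(zip(names, coeffs))
print(presence)
```

The per-concept coefficients play the role of the estimated "presence" of each concept, which is the quantity the abstract says informs the edit strength.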

ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer

Jiayi Gao,Zijin Yin,Changcheng Hua,Yuxin Peng,Kongming Liang,Zhanyu Ma,Jun Guo,Yang Liu

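Task: Propose ConMo, a zero-shot framework that addresses the accuracy and diversity of motion transfer in multi-subject videos.

Motivation: Existing methods struggle to transfer the motion of a specific subject in multi-subject videos and cannot preserve motion diversity when adapting to subjects of different shapes.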

Details

Method: Disentangle and recompose the motion cues of subjects and background, and use soft guidance to control how much of the original motion is retained, achieving more precise motion control and shape adaptation.

Result: ConMo significantly outperforms existing methods in motion fidelity and semantic consistency, and supports applications such as subject editing and camera motion simulation.

Conclusion: ConMo offers an efficient and flexible solution for motion transfer in multi-subject videos and broadens the range of applications.

Abstract: The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling the control of video motion based on existing footage. However, current methods have two limitations: 1) they struggle to handle multi-subject videos, failing to transfer specific subject motion; 2) they struggle to preserve the diversity and accuracy of motion when transferring to subjects with varying shapes. To overcome these, we introduce ConMo, a zero-shot framework that disentangles and recomposes the motions of subjects and camera movements. ConMo isolates individual subject and background motion cues from complex trajectories in source videos using only subject masks, and reassembles them for target video generation. This approach enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios. Additionally, we propose soft guidance in the recomposition stage which controls the retention of original motion to adjust shape constraints, aiding subject shape adaptation and semantic transformation. Unlike previous methods, ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modifications, and camera motion simulation. Extensive experiments demonstrate that ConMo significantly outperforms state-of-the-art methods in motion fidelity and semantic consistency. The code is available at https://github.com/Andyplus1/ConMo.

Taylor Series-Inspired Local Structure Fitting Network for Few-shot Point Cloud Semantic Segmentation

Changshuo Wang,Shuting He,Xiang Fang,Meiqing Wu,Siew-Kei Lam,Prayag Tiwari

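Task: Propose a pretraining-free local structure fitting network (TaylorSeg) for few-shot point cloud semantic segmentation.

Motivation: Pretraining-based methods introduce excessive time overhead and overlook the local structure representation of irregular point clouds.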

Details

Method: Inspired by Taylor series, treat local structure representation as a polynomial fitting problem, propose the TaylorConv convolution, and build two variants: the non-parametric TaylorSeg-NN and the parametric TaylorSeg-PN.

Result: Under the 2-way 1-shot setting, TaylorSeg-PN improves mIoU by +2.28% on S3DIS and +4.37% on ScanNet over the previous state-of-the-art methods.

Conclusion: TaylorSeg addresses few-shot point cloud segmentation through local structure fitting, requires no pretraining, and delivers strong performance.

Abstract: Few-shot point cloud semantic segmentation aims to accurately segment "unseen" new categories in point cloud scenes using limited labeled data. However, pretraining-based methods not only introduce excessive time overhead but also overlook the local structure representation among irregular point clouds. To address these issues, we propose a pretraining-free local structure fitting network for few-shot point cloud semantic segmentation, named TaylorSeg. Specifically, inspired by Taylor series, we treat the local structure representation of irregular point clouds as a polynomial fitting problem and propose a novel local structure fitting convolution, called TaylorConv. This convolution learns the low-order basic information and high-order refined information of point clouds from explicit encoding of local geometric structures. Then, using TaylorConv as the basic component, we construct two variants of TaylorSeg: a non-parametric TaylorSeg-NN and a parametric TaylorSeg-PN. The former can achieve performance comparable to existing parametric models without pretraining. For the latter, we equip it with an Adaptive Push-Pull (APP) module to mitigate the feature distribution differences between the query set and the support set. Extensive experiments validate the effectiveness of the proposed method. Notably, under the 2-way 1-shot setting, TaylorSeg-PN achieves improvements of +2.28% and +4.37% mIoU on the S3DIS and ScanNet datasets respectively, compared to the previous state-of-the-art methods. Our code is available at https://github.com/changshuowang/TaylorSeg.

CornerPoint3D: Look at the Nearest Corner Instead of the Center

Ruixiao Zhang,Runwei Guan,Xiangyu Chen,Adam Prugel-Bennett,Xiaohao Cai

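Task: Predict object centers, dimensions, and rotations from LiDAR point clouds.

Motivation: LiDAR captures only the near side of objects, which leads to poor localization accuracy in cross-domain tasks; the work addresses this and proposes more practical evaluation metrics.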

Details

Method: Propose the EdgeHead refinement head and the CornerPoint3D detector, both of which focus on learning and detecting the surfaces of objects closer to the sensor.

Result: The methods outperform the traditional center-based detector CenterPoint in cross-domain tasks, balancing overall detection quality against localization accuracy of the closer surfaces.

Conclusion: The work provides a more practical and robust solution for cross-domain 3D object detection.

Abstract: 3D object detection aims to predict object centers, dimensions, and rotations from LiDAR point clouds. Despite its simplicity, LiDAR captures only the near side of objects, making center-based detectors prone to poor localization accuracy in cross-domain tasks with varying point distributions. Meanwhile, existing evaluation metrics designed for single-domain assessment also suffer from overfitting due to dataset-specific size variations. A key question arises: Do we really need models to maintain excellent performance in the entire 3D bounding boxes after being applied across domains? Actually, one of our main focuses is on preventing collisions between vehicles and other obstacles, especially in cross-domain scenarios where correctly predicting the sizes is much more difficult. To address these issues, we rethink cross-domain 3D object detection from a practical perspective. We propose two new metrics that evaluate a model's ability to detect objects' closer surfaces to the LiDAR sensor. Additionally, we introduce EdgeHead, a refinement head that guides models to focus more on learnable closer surfaces, significantly improving cross-domain performance under both our new and traditional BEV/3D metrics. Furthermore, we argue that predicting the nearest corner rather than the object center enhances robustness. We propose a novel 3D object detector, coined as CornerPoint3D, which is built upon CenterPoint and uses heatmaps to supervise the learning and detection of the nearest corner of each object. Our proposed methods realize a balanced trade-off between the detection quality of entire bounding boxes and the locating accuracy of closer surfaces to the LiDAR sensor, outperforming the traditional center-based detector CenterPoint in multiple cross-domain tasks and providing a more practically reasonable and robust cross-domain 3D object detection solution.
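The notion of a box's "nearest corner" is plain geometry and can be sketched as follows, assuming a bird's-eye-view box and a sensor at the origin. The function name and parameterization are hypothetical and show only how such a target could be computed from a box, not how CornerPoint3D builds its heatmaps.

```python
import math

def nearest_corner(cx, cy, length, width, yaw):
    """Return the BEV box corner closest to a sensor at the origin.

    (cx, cy): box center; length/width: box size along its own x/y axes;
    yaw: rotation about the vertical axis, in radians.
    """
    corners = []
    for sx in (0.5, -0.5):
        for sy in (0.5, -0.5):
            # Corner offset in the box frame, rotated into the sensor frame.
            dx, dy = sx * length, sy * width
            x = cx + dx * math.cos(yaw) - dy * math.sin(yaw)
            y = cy + dx * math.sin(yaw) + dy * math.cos(yaw)
            corners.append((x, y))
    return min(corners, key=lambda p: math.hypot(p[0], p[1]))

# A 4 m x 2 m box centered 10 m ahead and 3 m to the side, axis-aligned.
corner = nearest_corner(cx=10.0, cy=3.0, length=4.0, width=2.0, yaw=0.0)
print(corner)
```

If the predicted box size is wrong, the center shifts with it, whereas the nearest corner lies on the surfaces the LiDAR actually observes, which is the intuition the abstract gives for preferring it.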

Semantic segmentation of forest stands using deep learning

Håkon Næss Sandum,Hans Ole Ørka,Oliver Tomic,Erik Næsset,Terje Gobakken

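Task: Propose a multiclass segmentation approach based on a U-Net deep learning framework for automatically delineating forest stand boundaries.

Motivation: Delineating stand boundaries by manually interpreting stereo aerial images is time-consuming and subjective, which limits operational efficiency and introduces inconsistencies; an automated method is needed.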

Details

Method: Frame stand delineation as a multiclass segmentation problem and apply a U-Net based deep learning framework, trained and evaluated with multispectral images, ALS data, and an expert-drawn stand map.

Result: The model reaches an overall accuracy of 0.73 on independent data, showing the potential of deep learning for automated stand delineation.

Conclusion: Deep learning shows potential for automated stand delineation, but challenges remain in complex forest environments.

Abstract: Forest stands are the fundamental units in forest management inventories, silviculture, and financial analysis within operational forestry. Over the past two decades, a common method for mapping stand borders has involved delineation through manual interpretation of stereographic aerial images. This is a time-consuming and subjective process, limiting operational efficiency and introducing inconsistencies. Substantial effort has been devoted to automating the process, using various algorithms together with aerial images and canopy height models constructed from airborne laser scanning (ALS) data, but manual interpretation remains the preferred method. Deep learning (DL) methods have demonstrated great potential in computer vision, yet their application to forest stand delineation remains unexplored in published research. This study presents a novel approach, framing stand delineation as a multiclass segmentation problem and applying a U-Net based DL framework. The model was trained and evaluated using multispectral images, ALS data, and an existing stand map created by an expert interpreter. Performance was assessed on independent data using overall accuracy, a standard metric for classification tasks that measures the proportion of correctly classified pixels. The model achieved an overall accuracy of 0.73. These results demonstrate strong potential for DL in automated stand delineation. However, a few key challenges were noted, especially for complex forest environments.
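Overall accuracy, the metric reported above, is simply the share of pixels whose predicted class matches the reference map. A minimal sketch on toy label grids follows; the class ids and the 2 x 4 grids are made up for illustration.

```python
def overall_accuracy(pred, truth):
    """Fraction of pixels whose predicted class equals the reference class."""
    total = correct = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            correct += (p == t)
    return correct / total

# Toy 2 x 4 label maps with three stand classes (0, 1, 2).
truth = [[0, 0, 1, 1],
         [0, 2, 2, 1]]
pred = [[0, 0, 1, 2],
        [0, 2, 1, 1]]
acc = overall_accuracy(pred, truth)
print(acc)  # 6 of the 8 pixels match
```

Because every pixel counts equally, large stands dominate this metric, so it says little about how well small or rare classes are delineated.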

MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities

Bizhu Wu,Jinheng Xie,Keming Shen,Zhe Kong,Jianfeng Ren,Ruibin Bai,Rong Qu,Linlin Shen

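Task: Develop MG-MotionLLM, a unified motion-language model for motion comprehension and generation across multiple granularities.

Motivation: Existing methods focus mainly on coarse-grained motion-text modeling and cannot handle fine-grained motion tasks such as understanding and controlling the movements of specific body parts.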

Details

Method: Introduce a multi-granularity training scheme with auxiliary tasks, such as localizing the temporal boundaries of motion segments and detailed motion captioning, to support motion-text modeling at different levels of granularity.

Result: MG-MotionLLM performs well on classical text-to-motion and motion-to-text tasks and shows potential in fine-grained motion comprehension and editing tasks.

Conclusion: Multi-granular modeling lets MG-MotionLLM overcome the limitations of existing methods and provides a more complete solution for motion comprehension and generation.

Abstract: Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing approaches primarily focus on coarse-grained motion-text modeling, where text describes the overall semantics of an entire motion sequence in just a few words. This limits their ability to handle fine-grained motion-relevant tasks, such as understanding and controlling the movements of specific body parts. To overcome this limitation, we pioneer MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation. We further introduce a comprehensive multi-granularity training scheme by incorporating a set of novel auxiliary tasks, such as localizing temporal boundaries of motion segments via detailed text as well as motion detailed captioning, to facilitate mutual reinforcement for motion-text modeling across various levels of granularity. Extensive experiments show that our MG-MotionLLM achieves superior performance on classical text-to-motion and motion-to-text tasks, and exhibits potential in novel fine-grained motion comprehension and editing tasks. Project page: CVI-SZU/MG-MotionLLM

Graph Attention-Driven Bayesian Deep Unrolling for Dual-Peak Single-Photon Lidar Imaging

Kyungmin Choi,JaKeoung Koo,Stephen McLaughlin,Abderrahim Halimi

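Task: Propose a deep unrolling algorithm for dual-peak single-photon Lidar imaging.

Motivation: Address the challenges single-photon Lidar faces in noisy environments and multi-target scenes while combining the strengths of statistical methods and deep learning.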

Details

Method: Use a hierarchical Bayesian model to handle multiple targets, unroll the underlying statistical method into a neural network, and apply geometric deep learning to extract features from the point cloud.

Result: On synthetic and real data, the method performs competitively with existing methods while also providing uncertainty information.

Conclusion: The method combines the strengths of statistical and learning-based approaches in accuracy and uncertainty quantification.

Abstract: Single-photon Lidar imaging offers a significant advantage in 3D imaging due to its high resolution and long-range capabilities; however, it is challenging to apply in noisy environments with multiple targets per pixel. To tackle these challenges, several methods have been proposed. Statistical methods demonstrate interpretability on the inferred parameters, but they are often limited in their ability to handle complex scenes. Deep learning-based methods have shown superior performance in terms of accuracy and robustness, but they lack interpretability or they are limited to a single peak per pixel. In this paper, we propose a deep unrolling algorithm for dual-peak single-photon Lidar imaging. We introduce a hierarchical Bayesian model for multiple targets and propose a neural network that unrolls the underlying statistical method. To support multiple targets, we adopt a dual depth maps representation and exploit geometric deep learning to extract features from the point cloud. The proposed method takes advantage of statistical methods and learning-based methods in terms of accuracy and quantifying uncertainty. The experimental results on synthetic and real data demonstrate the competitive performance when compared to existing methods, while also providing uncertainty information.

Semiconductor Wafer Map Defect Classification with Tiny Vision Transformers

Faisal Mohammad,Duksan Ryu

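Task: Semiconductor wafer defect classification.

Motivation: Traditional CNN models struggle with class imbalance and with recognizing multiple overlapping defects in wafer maps, so a more efficient solution is needed.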

Details

Method: Propose ViT-Tiny, a lightweight Vision Transformer framework optimized for wafer defect classification and trained on the WM-38k dataset.

Result: ViT-Tiny outperforms ViT-Base and SOTA models on several classification tasks, reaches an F1-score of 98.4%, and remains robust with limited labeled data.

Conclusion: ViT-Tiny is a computationally efficient and reliable solution for semiconductor defect detection.

Abstract: Semiconductor wafer defect classification is critical for ensuring high precision and yield in manufacturing. Traditional CNN-based models often struggle with class imbalances and recognition of the multiple overlapping defect types in wafer maps. To address these challenges, we propose ViT-Tiny, a lightweight Vision Transformer (ViT) framework optimized for wafer defect classification. Trained on the WM-38k dataset, ViT-Tiny outperforms its ViT-Base counterpart and state-of-the-art (SOTA) models, such as MSF-Trans and CNN-based architectures. Through extensive ablation studies, we determine that a patch size of 16 provides optimal performance. ViT-Tiny achieves an F1-score of 98.4%, surpassing MSF-Trans by 2.94% in four-defect classification, improving recall by 2.86% in two-defect classification, and increasing precision by 3.13% in three-defect classification. Additionally, it demonstrates enhanced robustness under limited labeled data conditions, making it a computationally efficient and reliable solution for real-world semiconductor defect detection.

Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention

Jiuniu Wang,Wenjia Xu,Qingzhong Wang,Antoni B. Chan

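Task: Propose a group-based differential distinctive image captioning method to make image captions more distinctive.

Motivation: Existing captioning models score well on conventional metrics, but their captions rarely distinguish the target image from other similar images and therefore lack distinctiveness.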

Details

Method: Introduce a Group-based Differential Memory Attention (GDMA) module that visually compares object features within a group of similar images and prioritizes the unique ones to generate more distinctive captions.

Result: The method significantly improves the caption distinctiveness of baseline models and achieves state-of-the-art distinctiveness while maintaining accuracy.

Conclusion: The proposed GDMA module and the new evaluation metric DisWordRate enhance caption distinctiveness and open a new direction for future research.

Abstract: Recent advances in image captioning have focused on enhancing accuracy by substantially increasing the dataset and model size. While conventional captioning models exhibit high performance on established metrics such as BLEU, CIDEr, and SPICE, the capability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employed contrastive learning or re-weighted the ground-truth captions. However, these approaches often overlook the relationships among objects in a similar image group (e.g., items or properties within the same album or fine-grained events). In this paper, we introduce a novel approach to enhance the distinctiveness of image captions, namely Group-based Differential Distinctive Captioning Method, which visually compares each image with other images in one similar group and highlights the uniqueness of each image. In particular, we introduce a Group-based Differential Memory Attention (GDMA) module, designed to identify and emphasize object features in an image that are uniquely distinguishable within its image group, i.e., those exhibiting low similarity with objects in other images. This mechanism ensures that such unique object features are prioritized during caption generation for the image, thereby enhancing the distinctiveness of the resulting captions. To further refine this process, we select distinctive words from the ground-truth captions to guide both the language decoder and the GDMA module. Additionally, we propose a new evaluation metric, the Distinctive Word Rate (DisWordRate), to quantitatively assess caption distinctiveness. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves state-of-the-art performance on distinctiveness while not excessively sacrificing accuracy...

APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers

Zhuguanyu Wu,Jiayi Zhang,Jiaxin Chen,Jinyang Guo,Di Huang,Yunhong Wang

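Task: Propose APHQ-ViT, a post-training quantization method based on importance estimation with the Average Perturbation Hessian (APH), to address the performance degradation of ViTs under ultra-low-bit quantization.

Motivation: ViTs perform well on vision tasks but lose substantial accuracy under post-training quantization (PTQ), especially at ultra-low bit widths. Existing reconstruction-based PTQ methods fail on ViTs, mainly because output importance is estimated inaccurately and because quantizing post-GELU activations causes accuracy loss.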

Details

Method: Propose APHQ-ViT, which comprises an improved average perturbation Hessian loss and an MLP Reconstruction (MR) method designed for the quantization of post-GELU activations.

Result: Experiments show that APHQ-ViT significantly outperforms existing PTQ methods at 3-bit and 4-bit quantization across a range of vision tasks.

Conclusion: With the improved Hessian loss and MLP reconstruction, APHQ-ViT addresses the performance problems of ViTs under ultra-low-bit quantization and offers an efficient solution for practical deployment.

Abstract: Vision Transformers (ViTs) have become one of the most commonly used backbones for vision tasks. Despite their remarkable performance, they often suffer significant accuracy drops when quantized for practical deployment, particularly by post-training quantization (PTQ) under ultra-low bits. Recently, reconstruction-based PTQ methods have shown promising performance in quantizing Convolutional Neural Networks (CNNs). However, they fail when applied to ViTs, primarily due to the inaccurate estimation of output importance and the substantial accuracy degradation in quantizing post-GELU activations. To address these issues, we propose APHQ-ViT, a novel PTQ approach based on importance estimation with Average Perturbation Hessian (APH). Specifically, we first thoroughly analyze the current approximation approaches with Hessian loss, and propose an improved average perturbation Hessian loss. To deal with the quantization of the post-GELU activations, we design an MLP Reconstruction (MR) method by replacing the GELU function in MLP with ReLU and reconstructing it by the APH loss on a small unlabeled calibration set. Extensive experiments demonstrate that APHQ-ViT using linear quantizers outperforms existing PTQ methods by substantial margins in 3-bit and 4-bit across different vision tasks. The source code is available at https://github.com/GoatWu/APHQ-ViT.
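For readers unfamiliar with the "linear quantizers" mentioned in the abstract, here is a minimal sketch of uniform (linear) quantization at 3 and 4 bits. It assumes a simple min-max range and round-to-nearest; the function name and sample activations are hypothetical, and the APH loss and MLP Reconstruction steps of the paper are not shown.

```python
def quantize_uniform(values, bits):
    """Asymmetric uniform (linear) fake-quantization to 2**bits levels.

    Values are mapped to integer codes in [0, 2**bits - 1] and back, so the
    result shows the rounding error a b-bit linear quantizer introduces.
    """
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for v in values:
        code = round((v - lo) / scale)   # integer code
        out.append(lo + code * scale)    # de-quantized value
    return out

acts = [0.0, 0.1, 0.45, 0.8, 1.4]        # made-up activation values
q3 = quantize_uniform(acts, bits=3)
q4 = quantize_uniform(acts, bits=4)
err3 = max(abs(a - b) for a, b in zip(acts, q3))
err4 = max(abs(a - b) for a, b in zip(acts, q4))
print(err3, err4)
```

The rounding error is bounded by half a step, and the step roughly doubles for each bit removed, which is why accuracy degrades quickly at 3 and 4 bits.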

Towards Generalizing Temporal Action Segmentation to Unseen Views

Emad Bahrami,Olga Zatsarynna,Gianpiero Francesca,Juergen Gall

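Task: Propose an action segmentation approach for unseen views and define a protocol for unseen-view action segmentation.

Motivation: Existing methods perform poorly on action segmentation from unseen views, so the challenge posed by view changes needs to be addressed.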

Details

Method: Use a shared representation at the sequence and segment levels, and introduce a sequence loss and an action loss to reduce the impact of view differences.

Result: On the Assembly101, IkeaASM, and EgoExoLearn datasets, F1@50 improves significantly: by 12.8% for unseen exocentric views and by 54% for unseen egocentric views.

Conclusion: The approach addresses the challenge of unseen-view action segmentation and significantly improves model performance under view changes.

Abstract: While there has been substantial progress in temporal action segmentation, the challenge to generalize to unseen views remains unaddressed. Hence, we define a protocol for unseen view action segmentation where camera views for evaluating the model are unavailable during training. This includes changing from top-frontal views to a side view or even more challenging from exocentric to egocentric views. Furthermore, we present an approach for temporal action segmentation that tackles this challenge. Our approach leverages a shared representation at both the sequence and segment levels to reduce the impact of view differences during training. We achieve this by introducing a sequence loss and an action loss, which together facilitate consistent video and action representations across different views. The evaluation on the Assembly101, IkeaASM, and EgoExoLearn datasets demonstrates significant improvements, with a 12.8% increase in F1@50 for unseen exocentric views and a substantial 54% improvement for unseen egocentric views.
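F1@50, the metric quoted above, is a segment-level F1 score in which a predicted segment counts as correct if it overlaps a same-label ground-truth segment with IoU of at least 0.5. The sketch below follows one common formulation of that metric; details such as tie-breaking and how the union is measured are assumptions and may differ from the evaluation code used in the paper.

```python
def segments(labels):
    """Split frame-wise labels into (label, start, end) runs; end is exclusive."""
    segs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segs.append((labels[start], start, i))
            start = i
    return segs

def f1_at_iou(pred, truth, threshold=0.5):
    """Segmental F1: a predicted segment is a true positive if it matches an
    unused ground-truth segment of the same label with IoU >= threshold."""
    p_segs, t_segs = segments(pred), segments(truth)
    used = [False] * len(t_segs)
    tp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (t_label, ts, te) in enumerate(t_segs):
            if t_label != label or used[j]:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= threshold:
            tp += 1
            used[best_j] = True
    fp, fn = len(p_segs) - tp, len(t_segs) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

truth = [0] * 4 + [1] * 4 + [2] * 2      # three actions over 10 frames
pred = [0] * 5 + [1] * 1 + [2] * 4       # action 1 is predicted far too short
f1 = f1_at_iou(pred, truth)
print(f1)
```

In this toy case the short prediction for action 1 has IoU 0.25 and is rejected, so two of three segments match and F1@50 is 2/3, even though 8 of 10 frames are labeled correctly.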

Exploration-Driven Generative Interactive Environments

Nedko Savov,Naser Kazemi,Mohammad Mahdi,Danda Pani Paudel,Xi Wang,Luc Van Gool

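Task: Propose a framework that trains a multi-environment world model in virtual environments using only a random agent, and introduce AutoExplore Agent to overcome the limits of random exploration.

Motivation: Modern world models require costly and time-consuming collection of video datasets. Genie can simulate the behavior of many environments, but training it depends on expensive demonstration data.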

Details

Method: Collect data in virtual environments with a random agent and AutoExplore Agent, and build the large-scale RetroAct dataset by grouping environments with similar behavior.

Result: The pretrained multi-environment model adapts quickly to new environments, improving video fidelity and controllability.

Conclusion: The proposed framework and AutoExplore Agent reduce training cost and improve the model's adaptability and performance.

Abstract: Modern world models require costly and time-consuming collection of large video datasets with action demonstrations by people or by environment-specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates simulation abilities of many environments with shared behavior. Unfortunately, training their model requires expensive demonstrations. Therefore, we propose a training framework merely using a random agent in virtual environments. While the model trained in this manner exhibits good controls, it is limited by the random exploration possibilities. To address this limitation, we propose AutoExplore Agent - an exploration agent that entirely relies on the uncertainty of the world model, delivering diverse data from which it can learn the best. Our agent is fully independent of environment-specific rewards and thus adapts easily to new environments. With this approach, the pretrained multi-environment model can quickly adapt to new environments achieving video fidelity and controllability improvement. To automatically obtain large-scale interaction datasets for pretraining, we group environments with similar behavior and controls. To this end, we annotate the behavior and controls of 974 virtual environments - a dataset that we name RetroAct. For building our model, we first create an open implementation of Genie - GenieRedux and apply enhancements and adaptations in our version GenieRedux-G. Our code and data are available at https://github.com/insait-institute/GenieRedux.

MultiNeRF: Multiple Watermark Embedding for Neural Radiance Fields

Yash Kulthe,Andrew Gilbert,John Collomosse

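Task: Propose MultiNeRF, a 3D watermarking method that embeds multiple uniquely keyed watermarks in images rendered by a single NeRF model while maintaining high visual quality.

Motivation: Extend the TensoRF NeRF model with a dedicated watermark grid to increase watermark capacity without interfering with scene content.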

Details

Method: Use a FiLM-based conditional modulation mechanism that dynamically activates watermarks, supporting the embedding and extraction of multiple watermarks without retraining the model.

Result: Validated on the NeRF-Synthetic and LLFF datasets, the method significantly improves robust capacity without affecting rendering quality.

Conclusion: MultiNeRF generalizes single-watermark methods into a flexible multi-watermarking framework and provides a scalable solution for 3D content attribution.

Abstract: We present MultiNeRF, a 3D watermarking method that embeds multiple uniquely keyed watermarks within images rendered by a single Neural Radiance Field (NeRF) model, whilst maintaining high visual quality. Our approach extends the TensoRF NeRF model by incorporating a dedicated watermark grid alongside the existing geometry and appearance grids. This extension ensures higher watermark capacity without entangling watermark signals with scene content. We propose a FiLM-based conditional modulation mechanism that dynamically activates watermarks based on input identifiers, allowing multiple independent watermarks to be embedded and extracted without requiring model retraining. MultiNeRF is validated on the NeRF-Synthetic and LLFF datasets, with statistically significant improvements in robust capacity without compromising rendering quality. By generalizing single-watermark NeRF methods into a flexible multi-watermarking framework, MultiNeRF provides a scalable solution for 3D content attribution.
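FiLM (feature-wise linear modulation) scales and shifts each feature channel with parameters derived from a conditioning input. A minimal sketch follows, assuming a lookup table keyed by a watermark identifier in place of the small conditioning network a real model would use; the identifiers, parameter values, and feature vector are all hypothetical.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: y_i = gamma_i * x_i + beta_i."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

# Hypothetical (gamma, beta) pairs per watermark identifier. In a real model
# these would be predicted by a network conditioned on the identifier.
film_params = {
    "key_a": ([1.0, 0.0], [0.0, 0.0]),   # pass channel 0, zero out channel 1
    "key_b": ([0.0, 2.0], [0.5, 0.0]),   # replace channel 0 by 0.5, double channel 1
}

grid_feature = [0.3, -0.2]               # made-up watermark-grid feature
out_a = film(grid_feature, *film_params["key_a"])
out_b = film(grid_feature, *film_params["key_b"])
print(out_a, out_b)
```

The same stored feature yields different outputs under different identifiers, which illustrates how one set of weights can carry several independently activated signals.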

Data-Driven Object Tracking: Integrating Modular Neural Networks into a Kalman Framework

Christian Alexander Holz,Christian Bader,Markus Enzweiler,Matthias Drüppel

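Task: Propose three new neural network models (SPENT, SANT, MANTa) for Multi-Object Tracking (MOT) that meet the complexity and precision demands of Advanced Driver Assistance Systems (ADAS).

Motivation: Address the key challenges in multi-object tracking: trajectory prediction, mapping sensor objects to tracks, and associating multiple objects.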

Details

Method: Integrate the three neural network models (SPENT, SANT, MANTa) into a traditional Kalman filter framework while keeping the system modular.

Result: On the KITTI dataset, SPENT reduces RMSE by 50%, and SANT and MANTa reach up to 95% association accuracy.

Conclusion: Task-specific neural networks can significantly improve the performance and robustness of traditional tracking systems while preserving modularity and maintainability.

Abstract: This paper presents novel Machine Learning (ML) methodologies for Multi-Object Tracking (MOT), specifically designed to meet the increasing complexity and precision demands of Advanced Driver Assistance Systems (ADAS). We introduce three Neural Network (NN) models that address key challenges in MOT: (i) the Single-Prediction Network (SPENT) for trajectory prediction, (ii) the Single-Association Network (SANT) for mapping individual Sensor Object (SO) to existing tracks, and (iii) the Multi-Association Network (MANTa) for associating multiple SOs to multiple tracks. These models are seamlessly integrated into a traditional Kalman Filter (KF) framework, maintaining the system's modularity by replacing relevant components without disrupting the overall architecture. Importantly, all three networks are designed to be run in a realtime, embedded environment. Each network contains fewer than 50k trainable parameters. Our evaluation, conducted on the public KITTI tracking dataset, demonstrates significant improvements in tracking performance. SPENT reduces the Root Mean Square Error (RMSE) by 50% compared to a standard KF, while SANT and MANTa achieve up to 95% accuracy in sensor object-to-track assignments. These results underscore the effectiveness of incorporating task-specific NNs into traditional tracking systems, boosting performance and robustness while preserving modularity, maintainability, and interpretability.
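As background for the Kalman framework referred to above, here is a minimal sketch of one predict/update cycle of a scalar Kalman filter with a random-walk motion model. The function name, noise values, and measurements are illustrative assumptions; the paper's state model and the way its networks plug into the filter are not reproduced.

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: prior state estimate and its variance; z: new measurement;
    q: process-noise variance; r: measurement-noise variance.
    """
    # Predict: a random-walk model keeps the state and grows the uncertainty.
    # A learned predictor would be swapped in at this step (an assumption
    # about where a prediction network such as SPENT would sit).
    x_pred, p_pred = x, p + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new

x, p = 0.0, 1.0                          # initial estimate and variance
for z in [1.0, 1.0, 1.0]:                # three identical measurements
    x, p = kalman_step(x, p, z, q=0.0, r=1.0)
print(x, p)
```

Keeping prediction and update as separate steps is what makes the framework modular: either one can be replaced without touching the other.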

Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment

Fatemeh Behrad,Tinne Tuytelaars,Johan Wagemans

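Task: Propose Charm, a new tokenization approach that addresses the computational complexity and information loss Vision Transformers (ViTs) face when handling variable-sized inputs.

Motivation: ViTs are usually trained on small, fixed-size images, which causes information loss and hurts tasks such as image aesthetic assessment.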

Details

Method: Charm preserves an image's high resolution, aspect ratio, and multi-scale information while prioritizing high-resolution detail in specific regions, producing fixed-length input sequences.

Result: Experiments show that Charm significantly improves performance on multiple image aesthetic and quality assessment datasets (by up to 8.1%).

Conclusion: Charm is an efficient approach compatible with pre-trained ViTs that improves model performance without cropping or changing the aspect ratio.

Abstract: The capacity of Vision transformers (ViTs) to handle variable-sized inputs is often constrained by computational complexity and batch processing limitations. Consequently, ViTs are typically trained on small, fixed-size images obtained through downscaling or cropping. While reducing computational burden, these methods result in significant information loss, negatively affecting tasks like image aesthetic assessment. We introduce Charm, a novel tokenization approach that preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information simultaneously. Charm prioritizes high-resolution details in specific regions while downscaling others, enabling shorter fixed-size input sequences for ViTs while incorporating essential information. Charm is designed to be compatible with pre-trained ViTs and their learned positional embeddings. By providing multiscale input and introducing variety to input tokens, Charm improves ViT performance and generalizability for image aesthetic assessment. We avoid cropping or changing the aspect ratio to further preserve information. Extensive experiments demonstrate significant performance improvements on various image aesthetic and quality assessment datasets (up to 8.1%) using a lightweight ViT backbone. Code and pre-trained models are available at https://github.com/FBehrad/Charm.

SelfMedHPM: Self Pre-training With Hard Patches Mining Masked Autoencoders For Medical Image Segmentation

Yunhao Lv,Lingyu Chen,Jian Wang,Yangxi Li,Fang Chen

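Task: Propose a self-training framework based on masked image modeling (MIM) for CT multi-organ segmentation.

Motivation: Existing MIM-based methods for CT multi-organ segmentation do not effectively identify the regions that are hardest to reconstruct.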

Details

Method: Propose SelfMedHPM, which combines ViT self-pretraining with an auxiliary loss predictor to determine mask locations dynamically.

Result: The method outperforms existing methods on abdominal and whole-body CT multi-organ segmentation.

Conclusion: The SelfMedHPM framework delivers superior performance on CT multi-organ segmentation tasks.

Abstract: In recent years, deep learning methods such as convolutional neural networks (CNNs) and transformers have made significant progress in CT multi-organ segmentation. However, CT multi-organ segmentation methods based on masked image modeling (MIM) are very limited. Although some methods already use MAE for CT multi-organ segmentation, we believe that they do not identify the most difficult areas to reconstruct. To this end, we propose a MIM self-training framework with hard patches mining masked autoencoders for CT multi-organ segmentation tasks (SelfMedHPM). The method performs ViT self-pretraining on the training set of the target data and introduces an auxiliary loss predictor, which first predicts the patch loss and determines the location of the next mask. SelfMedHPM outperforms various competitive methods in abdominal CT multi-organ segmentation and body CT multi-organ segmentation. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for abdominal multi-organ segmentation and the SinoMed Whole Body (SMWB) dataset for body multi-organ segmentation tasks.

Delineate Anything: Resolution-Agnostic Field Boundary Delineation on Satellite Imagery

Mykola Lavreniuk,Nataliia Kussul,Andrii Shelestov,Bohdan Yailymov,Yevhenii Salii,Volodymyr Kuzin,Zoltan Szantoi

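Task: Accurately delineate agricultural field boundaries in satellite imagery using instance segmentation.

Motivation: Existing methods are held back by small datasets, differences in resolution, and diverse environmental conditions.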

Details

Method: Introduce the FBIS-22M dataset and the Delineate Anything model, reformulating the task as instance segmentation.

Result: The model improves mAP@0.5 by 88.5% and mAP@0.5:0.95 by 103%, with faster inference and strong zero-shot generalization.

Conclusion: The FBIS-22M dataset and the Delineate Anything model significantly improve the accuracy and efficiency of field boundary delineation.

Abstract: The accurate delineation of agricultural field boundaries from satellite imagery is vital for land management and crop monitoring. However, current methods face challenges due to limited dataset sizes, resolution discrepancies, and diverse environmental conditions. We address this by reformulating the task as instance segmentation and introducing the Field Boundary Instance Segmentation - 22M dataset (FBIS-22M), a large-scale, multi-resolution dataset comprising 672,909 high-resolution satellite image patches (ranging from 0.25 m to 10 m) and 22,926,427 instance masks of individual fields, significantly narrowing the gap between agricultural datasets and those in other computer vision domains. We further propose Delineate Anything, an instance segmentation model trained on our new FBIS-22M dataset. Our proposed model sets a new state-of-the-art, achieving a substantial improvement of 88.5% in mAP@0.5 and 103% in mAP@0.5:0.95 over existing methods, while also demonstrating significantly faster inference and strong zero-shot generalization across diverse image resolutions and unseen geographic regions. Code, pre-trained models, and the FBIS-22M dataset are available at https://lavreniuk.github.io/Delineate-Anything.

A Sensorimotor Vision Transformer

Konrad Gadzicki,Kerstin Schill,Christoph Zetzsche

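Task: Propose a vision model inspired by human eye movements, the Sensorimotor Transformer (SMT), which prioritizes high-saliency regions to improve computational efficiency and reduce memory consumption.

Motivation: Traditional models process all image patches uniformly, which is inefficient; SMT optimizes information processing by following the selective-focus principle of biological vision.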

Details

Method: Use two-dimensional features (such as corners and occlusions) to select highly informative image patches, and process only those patches to reduce compute and memory requirements.

Result: On Imagenet-1k, SMT maintains competitive top-1 accuracy while significantly reducing memory consumption and computational complexity.

Conclusion: SMT offers an efficient vision model for resource-constrained applications and demonstrates the potential of biologically inspired architectures.

Abstract: This paper presents the Sensorimotor Transformer (SMT), a vision model inspired by human saccadic eye movements that prioritize high-saliency regions in visual input to enhance computational efficiency and reduce memory consumption. Unlike traditional models that process all image patches uniformly, SMT identifies and selects the most salient patches based on intrinsic two-dimensional (i2D) features, such as corners and occlusions, which are known to convey high-information content and align with human fixation patterns. The SMT architecture uses this biological principle to leverage vision transformers to process only the most informative patches, allowing for a substantial reduction in memory usage that scales with the sequence length of selected patches. This approach aligns with visual neuroscience findings, suggesting that the human visual system optimizes information gathering through selective, spatially dynamic focus. Experimental evaluations on Imagenet-1k demonstrate that SMT achieves competitive top-1 accuracy while significantly reducing memory consumption and computational complexity, particularly when a limited number of patches is used. This work introduces a saccade-like selection mechanism into transformer-based vision models, offering an efficient alternative for image analysis and providing new insights into biologically motivated architectures for resource-constrained applications.
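The patch selection idea can be sketched as a top-k filter over per-patch saliency scores. The scores below are made up and the function name is hypothetical; how SMT actually computes i2D saliency is not shown.

```python
def select_salient_patches(saliency, k):
    """Return the indices of the k most salient patches, in original order."""
    top = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)[:k]
    return sorted(top)

# Hypothetical saliency scores for a 3 x 3 grid of patches (row-major order).
scores = [0.10, 0.90, 0.20,
          0.05, 0.70, 0.30,
          0.80, 0.00, 0.40]
kept = select_salient_patches(scores, k=3)
print(kept)
```

Only the kept patches (3 of 9 here) would be fed to the transformer, so the sequence length, and with it attention memory, shrinks accordingly; returning indices in original order keeps positional information usable.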

Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation

Fa-Ting Hong,Zunnan Xu,Zixiang Zhou,Jun Zhou,Xiu Li,Qin Lin,Qinglin Lu,Dan Xu

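Task: Propose ACTalker, an end-to-end video diffusion framework for talking head video generation that supports both multi-signal and single-signal control.

Motivation: Existing methods usually support control from only a single modality, which limits their practical use.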

Details

Method: Designs a parallel mamba structure in which each branch uses a separate driving signal to control a specific facial region, and adopts a gate mechanism and a mask-drop strategy to avoid control conflicts. Result: Experiments show that the method generates natural facial videos and that the mamba layer seamlessly integrates multiple driving modalities. Conclusion: ACTalker performs well under multi-signal control and addresses the limitations of existing methods. Abstract: Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods are typically limited to accepting control from a single primary modality, restricting their practical utility. To this end, we introduce ACTalker, an end-to-end video diffusion framework that supports both multi-signal control and single-signal control for talking head video generation. For multi-signal control, we design a parallel mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure natural coordination of the controlled video both temporally and spatially, we employ the mamba structure, which enables driving signals to manipulate feature tokens across both dimensions in each branch. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the mamba layer seamlessly integrates multiple driving modalities without conflict.

MAD: Makeup All-in-One with Cross-Domain Diffusion Model

Bo-Kai Ruan,Hong-Han Shuai

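Task: Use a single model to perform multiple makeup tasks, including beauty filter, makeup transfer, and makeup removal.

Motivation: Existing makeup techniques require separate models for different tasks and lack text-guided makeup try-on, which increases complexity and raises the barrier to use.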

Details

Method: Treats different makeup tasks as cross-domain translations and uses a cross-domain diffusion model with domain embeddings to switch between tasks. Result: The proposed method accomplishes multiple makeup tasks with a single model and domain embeddings, and extends the MT dataset to support text-guided makeup applications. Conclusion: The method simplifies the implementation of makeup technology and improves user-friendliness and practicality. Abstract: Existing makeup techniques often require designing multiple models to handle different inputs and align features across domains for different makeup tasks, e.g., beauty filter, makeup transfer, and makeup removal, leading to increased complexity. Another limitation is the absence of text-guided makeup try-on, which is more user-friendly without needing reference images. In this study, we make the first attempt to use a single model for various makeup tasks. Specifically, we formulate different makeup tasks as cross-domain translations and leverage a cross-domain diffusion model to accomplish all tasks. Unlike existing methods that rely on separate encoder-decoder configurations or cycle-based mechanisms, we propose using different domain embeddings to facilitate domain control. This allows for seamless domain switching by merely changing embeddings with a single model, thereby reducing the reliance on additional modules for different tasks. Moreover, to support precise text-to-makeup applications, we introduce the MT-Text dataset by extending the MT dataset with textual annotations, advancing the practicality of makeup technologies.

Noise Calibration and Spatial-Frequency Interactive Network for STEM Image Enhancement

Hesong Li,Ziqi Wu,Ruiwen Shao,Tao Zhang,Ying Fu

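Task: Develop noise calibration, data synthesis, and enhancement methods to improve the quality of STEM images.

Motivation: Existing STEM image enhancement methods overlook frequency-domain features, and existing datasets lack realism and generality.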

Details

Method: Proposes a noise calibration method to synthesize more realistic STEM images, develops a more general dataset, and designs a spatial-frequency interactive network for image enhancement. Result: Experiments show that the synthesized data is closer to real STEM images and that the network achieves better enhancement. Conclusion: The proposed methods deliver greater realism and better performance in STEM image enhancement. Abstract: Scanning Transmission Electron Microscopy (STEM) enables the observation of atomic arrangements at sub-angstrom resolution, allowing for atomically resolved analysis of the physical and chemical properties of materials. However, due to the effects of noise, electron beam damage, sample thickness, etc., obtaining satisfactory atomic-level images is often challenging. Enhancing STEM images can reveal clearer structural details of materials. Nonetheless, existing STEM image enhancement methods usually overlook unique features in the frequency domain, and existing datasets lack realism and generality. To resolve these issues, in this paper, we develop noise calibration, data synthesis, and enhancement methods for STEM images. We first present a STEM noise calibration method, which is used to synthesize more realistic STEM images. The parameters of background noise, scan noise, and pointwise noise are obtained by statistical analysis and fitting of real STEM images containing atoms. Then we use these parameters to develop a more general dataset that considers both regular and random atomic arrangements and includes both HAADF and BF mode images. Finally, we design a spatial-frequency interactive network for STEM image enhancement, which can explore the information in the frequency domain formed by the periodicity of atomic arrangement. Experimental results show that our data is closer to real STEM images and achieves better enhancement performances together with our network. Code will be available at https://github.com/HeasonLee/SFIN.

Rip Current Segmentation: A Novel Benchmark and YOLOv8 Baseline Results

Andrei Dumitriu,Florin Tatui,Florin Miron,Radu Tudor Ionescu,Radu Timofte

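Task: Instance segmentation for the automatic detection of rip currents.

Motivation: Rip currents are a leading cause of fatal accidents and injuries on many beaches worldwide, so automatically detecting these hazardous currents is important.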

Details

Method: Introduces a comprehensive dataset of 2,466 images, trains several versions of YOLOv8 for instance segmentation, and evaluates them on a test dataset of videos. Result: The YOLOv8-nano model achieves an mAP50 of 88.94% on the validation dataset and a macro average of 81.21% on the test dataset. Conclusion: The study provides a baseline for rip current segmentation and contributes a detailed annotated dataset and a deep learning model; the code and dataset are publicly available. Abstract: Rip currents are the leading cause of fatal accidents and injuries on many beaches worldwide, emphasizing the importance of automatically detecting these hazardous surface water currents. In this paper, we address a novel task: rip current instance segmentation. We introduce a comprehensive dataset containing 2,466 images with newly created polygonal annotations for instance segmentation, used for training and validation. Additionally, we present a novel dataset comprising 17 drone videos (comprising about 24K frames) captured at 30 FPS, annotated with both polygons for instance segmentation and bounding boxes for object detection, employed for testing purposes. We train various versions of YOLOv8 for instance segmentation on static images and assess their performance on the test dataset (videos). The best results were achieved by the YOLOv8-nano model (runnable on a portable device), with an mAP50 of 88.94% on the validation dataset and 81.21% macro average on the test dataset. The results provide a baseline for future research in rip current segmentation. Our work contributes to the existing literature by introducing a detailed, annotated dataset, and training a deep learning model for instance segmentation of rip currents. The code, training details and the annotated dataset are made publicly available at https://github.com/Irikos/rip_currents.
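The mAP50 metric quoted above counts a prediction as correct when its intersection over union (IoU) with a ground-truth instance is at least 0.5. The sketch below shows that matching rule on axis-aligned boxes for simplicity; instance segmentation would apply the same ratio to masks, and the helper names are hypothetical rather than taken from the authors' code.

```python
from typing import Tuple

# A box is (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred: Box, truth: Box) -> bool:
    """True when the prediction would count as a hit at the 0.5 IoU threshold."""
    return box_iou(pred, truth) >= 0.5


half_overlap = box_iou((0, 0, 2, 2), (0, 0, 2, 1))   # 2 / 4
third_overlap = box_iou((0, 0, 2, 2), (1, 0, 3, 2))  # 2 / 6
```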

L-LBVC: Long-Term Motion Estimation and Prediction for Learned Bi-Directional Video Compression

Yongqi Zhai,Luyang Tang,Wei Jiang,Jiayu Yang,Ronggang Wang

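Task: Propose a new LBVC framework (L-LBVC) to address the weak long-term motion estimation and prediction of learned bi-directional video compression (LBVC).

Motivation: LBVC performs poorly at long-term motion estimation and prediction, especially in large-motion scenes, so it lags behind traditional bi-directional coding.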

Details

Method: Proposes an adaptive motion estimation module that handles both short-term and long-term motion, estimating long-term flows by recursively accumulating local flows, together with an adaptive motion prediction module. Result: L-LBVC significantly outperforms existing LVC methods under the random access configuration and even surpasses VVC (VTM) on some test datasets. Conclusion: By improving long-term motion estimation and prediction, L-LBVC significantly improves LBVC performance. Abstract: Recently, learned video compression (LVC) has shown superior performance under low-delay configuration. However, the performance of learned bi-directional video compression (LBVC) still lags behind traditional bi-directional coding. The performance gap mainly arises from inaccurate long-term motion estimation and prediction of distant frames, especially in large motion scenes. To solve these two critical problems, this paper proposes a novel LBVC framework, namely L-LBVC. Firstly, we propose an adaptive motion estimation module that can handle both short-term and long-term motions. Specifically, we directly estimate the optical flows for adjacent frames and non-adjacent frames with small motions. For non-adjacent frames with large motions, we recursively accumulate local flows between adjacent frames to estimate long-term flows. Secondly, we propose an adaptive motion prediction module that can largely reduce the bit cost for motion coding. To improve the accuracy of long-term motion prediction, we adaptively downsample reference frames during testing to match the motion ranges observed during training. Experiments show that our L-LBVC significantly outperforms previous state-of-the-art LVC methods and even surpasses VVC (VTM) on some test datasets under random access configuration.
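The recursive accumulation of local flows can be sketched in one dimension. This is an illustrative toy under stated assumptions, not the paper's module: flows are per-pixel displacements on a 1D row, the next flow is sampled with nearest-neighbor lookup and border clamping, and `compose_flows` and `accumulate_flows` are hypothetical names.

```python
from typing import List


def compose_flows(flow_ab: List[float], flow_bc: List[float]) -> List[float]:
    """Chain two flows: F_ac(x) = F_ab(x) + F_bc(x + F_ab(x)).

    F_bc is sampled at the nearest pixel to the landing position, clamped
    to the row borders (a simplifying assumption of this sketch).
    """
    n = len(flow_ab)
    composed = []
    for x, disp in enumerate(flow_ab):
        landing = min(max(int(round(x + disp)), 0), n - 1)
        composed.append(disp + flow_bc[landing])
    return composed


def accumulate_flows(local_flows: List[List[float]]) -> List[float]:
    """Fold a list of adjacent-frame flows into one long-term flow."""
    total = local_flows[0]
    for next_flow in local_flows[1:]:
        total = compose_flows(total, next_flow)
    return total


flow_01 = [1.0, 1.0, 1.0, 1.0]
flow_12 = [0.0, 1.0, 2.0, 2.0]
flow_02 = accumulate_flows([flow_01, flow_12])
```

Each pixel first moves by its frame 0 to 1 displacement, then picks up the frame 1 to 2 displacement found where it landed, which is why the second flow is looked up at the displaced position rather than at `x`.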

Leveraging Sparse Annotations for Leukemia Diagnosis on the Large Leukemia Dataset

Abdul Rehman,Talha Meraj,Aiman Mahmood Minhas,Ayisha Imran,Mohsen Ali,Waqas Sultani,Mubarak Shah

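Task: Present a large-scale leukemia dataset (LLD) and new methods for detecting white blood cells and their morphological attributes.

Motivation: Leukemia analysis lacks a large, diverse multi-task dataset, which limits real-world applicability.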

Details

Method: Collects data from peripheral blood films and proposes a multi-task model and a sparse-annotation method. Result: Provides a dataset annotated with 7 morphological attributes and develops an interpretable clinical solution. Conclusion: The dataset and methods can be used to address several challenging problems in microscopic image analysis. Abstract: Leukemia is the 10th most frequently diagnosed cancer and one of the leading causes of cancer-related deaths worldwide. Realistic analysis of leukemia requires White Blood Cell (WBC) localization, classification, and morphological assessment. Despite deep learning advances in medical imaging, leukemia analysis lacks a large, diverse multi-task dataset, while existing small datasets lack domain diversity, limiting real-world applicability. To overcome dataset challenges, we present a large-scale WBC dataset named Large Leukemia Dataset (LLD) and novel methods for detecting WBCs with their attributes. Our contribution here is threefold. First, we present a large-scale leukemia dataset collected through Peripheral Blood Films (PBF) from several patients, through multiple microscopes, multiple cameras, and multiple magnifications. To enhance diagnosis explainability and medical expert acceptance, each leukemia cell is annotated at 100x with 7 morphological attributes, ranging from Cell Size to Nuclear Shape. Second, we propose a multi-task model that not only detects WBCs but also predicts their attributes, providing an interpretable and clinically meaningful solution. Third, we propose a method for WBC detection with attribute analysis using sparse annotations. This approach reduces the annotation burden on hematologists, requiring them to mark only a small area within the field of view. Our method enables the model to leverage the entire field of view rather than just the annotated regions, enhancing learning efficiency and diagnostic accuracy. From diagnosis explainability to overcoming domain shift challenges, the presented datasets could be used for many challenging aspects of microscopic image analysis. The datasets, code, and demo are available at: https://im.itu.edu.pk/sparse-leukemiaattri/

Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation

Jiwoo Chung,Sangeek Hyun,Hyunjun Kim,Eunseo Koh,MinKyu Lee,Jae-Pil Heo

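Task: Propose a subject-driven generation method based on visual autoregressive (VAR) models that avoids the computational overhead of diffusion models and mitigates the language drift and reduced diversity caused by naive fine-tuning.

Motivation: Diffusion models produce high-quality images but incur high computational overhead, limiting practical use; VAR models offer fast inference, but naive fine-tuning leads to computational overhead, language drift, and reduced diversity.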

Details

Method: Introduces selective layer tuning and prior distillation to reduce complexity and language drift, and proposes scale-wise weighted tuning to prioritize subject-relevant information. Result: Experiments show that the method significantly outperforms diffusion-based baselines across multiple metrics and demonstrates practical value. Conclusion: The proposed VAR approach is efficient and practical for subject-driven generation and addresses the computational overhead of diffusion models. Abstract: Recent advances in text-to-image generative models have enabled numerous practical applications, including subject-driven generation, which fine-tunes pretrained models to capture subject semantics from only a few examples. While diffusion-based models produce high-quality images, their extensive denoising steps result in significant computational overhead, limiting real-world applicability. Visual autoregressive (VAR) models, which predict next-scale tokens rather than spatially adjacent ones, offer significantly faster inference suitable for practical deployment. In this paper, we propose the first VAR-based approach for subject-driven generation. However, naïve fine-tuning of VAR leads to computational overhead, language drift, and reduced diversity. To address these challenges, we introduce selective layer tuning to reduce complexity and prior distillation to mitigate language drift. Additionally, we found that the early stages have a greater influence on the generation of the subject than the latter stages, which merely synthesize local details. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions for promoting the model to focus on the subject-relevant information instead of local details. Extensive experiments validate that our method significantly outperforms diffusion-based baselines across various metrics and demonstrates its practical usage.
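One way to picture scale-wise weighted tuning is as a weighted sum of per-scale losses in which coarser scales receive larger weights. The abstract does not give the weighting rule, so the geometric decay below, the `decay` parameter, and both function names are assumptions made purely for illustration.

```python
from typing import List


def scale_weights(num_scales: int, decay: float = 0.5) -> List[float]:
    """Normalized weights that shrink from the coarsest scale (index 0) onward.

    The geometric schedule is a hypothetical choice, not the paper's rule.
    """
    raw = [decay ** k for k in range(num_scales)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_loss(per_scale_losses: List[float], decay: float = 0.5) -> float:
    """Combine per-scale losses, emphasizing the coarse (early) scales."""
    weights = scale_weights(len(per_scale_losses), decay)
    return sum(w * loss for w, loss in zip(weights, per_scale_losses))


weights = scale_weights(3)                    # proportional to 4 : 2 : 1
coarse_heavy = weighted_loss([1.0, 0.0, 0.0])
fine_heavy = weighted_loss([0.0, 0.0, 1.0])
```

With this schedule an error at the coarsest scale contributes four times as much as the same error at the finest of three scales, which matches the stated intent of focusing on subject-level structure over local detail.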

PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation

Lihua Liu,Jiehong Lin,Zhenxin Liu,Kui Jia

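Task: Estimate the 6D pose of unseen objects from RGB images.

Motivation: Address the challenge of zero-shot generalization in pose estimation for unseen objects.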

Details

Method: Proposes the PicoPose framework, which progressively refines correspondences through a three-stage pixel-to-pixel correspondence learning process. Result: Achieves state-of-the-art performance on the seven core datasets of the BOP benchmark and generalizes well to novel objects represented by CAD models or reference images. Conclusion: By progressively refining correspondences, PicoPose significantly improves pose estimation accuracy and is suited to pose estimation of unseen objects. Abstract: Novel object pose estimation from RGB images presents a significant challenge for zero-shot generalization, as it involves estimating the relative 6D transformation between an RGB observation and a CAD model of an object that was not seen during training. In this paper, we introduce PicoPose, a novel framework designed to tackle this task using a three-stage pixel-to-pixel correspondence learning process. Firstly, PicoPose matches features from the RGB observation with those from rendered object templates, identifying the best-matched template and establishing coarse correspondences. Secondly, PicoPose smooths the correspondences by globally regressing a 2D affine transformation, including in-plane rotation, scale, and 2D translation, from the coarse correspondence map. Thirdly, PicoPose applies the affine transformation to the feature map of the best-matched template and learns correspondence offsets within local regions to achieve fine-grained correspondences. By progressively refining the correspondences, PicoPose significantly improves the accuracy of object poses computed via PnP/RANSAC. PicoPose achieves state-of-the-art performance on the seven core datasets of the BOP benchmark, demonstrating exceptional generalization to novel objects represented by CAD models or object reference images. Code and models are available at https://github.com/foollh/PicoPose.
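The second stage regresses a 2D transformation made of in-plane rotation, scale, and translation. As a small illustration of what applying such a transformation to template coordinates looks like, here is a sketch; the function name and parameterization are hypothetical, and the regression network itself is not shown.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def apply_rotation_scale_translation(
    points: List[Point], angle_rad: float, scale: float, tx: float, ty: float
) -> List[Point]:
    """Map each (x, y) through p' = scale * R(angle) * p + (tx, ty)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [
        (scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)
        for x, y in points
    ]


# Rotate by 90 degrees, double the size, then shift by (1, 1).
template_pts = [(1.0, 0.0), (0.0, 1.0)]
warped = apply_rotation_scale_translation(template_pts, math.pi / 2, 2.0, 1.0, 1.0)
```

Because the whole map is described by four numbers, a single global estimate can smooth out noisy per-pixel matches before the local offsets of the third stage are learned.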

Learning Phase Distortion with Selective State Space Models for Video Turbulence Mitigation

Xingguang Zhang,Nicholas Chimitt,Xijun Wang,Yu Yuan,Stanley H. Chan

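Task: Propose a turbulence mitigation method based on a selective state space model (MambaTM) and Learned Latent Phase Distortion (LPD).

Motivation: Existing deep learning methods for turbulence mitigation are slow, memory-hungry, and generalize poorly, and both spatial-domain and temporal-domain methods have their own limitations.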

Details

Method: Combines a selective state space model (MambaTM) with Learned Latent Phase Distortion (LPD), providing a global receptive field while keeping linear computational complexity. Result: Exceeds current state-of-the-art methods on synthetic and real-world turbulence mitigation benchmarks, with significantly faster inference. Conclusion: The combination of MambaTM and LPD effectively addresses the computational complexity and performance problems of turbulence mitigation. Abstract: Atmospheric turbulence is a major source of image degradation in long-range imaging systems. Although numerous deep learning-based turbulence mitigation (TM) methods have been proposed, many are slow, memory-hungry, and do not generalize well. In the spatial domain, methods based on convolutional operators have a limited receptive field, so they cannot handle a large spatial dependency required by turbulence. In the temporal domain, methods relying on self-attention can, in theory, leverage the lucky effects of turbulence, but their quadratic complexity makes it difficult to scale to many frames. Traditional recurrent aggregation methods face parallelization challenges. In this paper, we present a new TM method based on two concepts: (1) A turbulence mitigation network based on the Selective State Space Model (MambaTM). MambaTM provides a global receptive field in each layer across spatial and temporal dimensions while maintaining linear computational complexity. (2) Learned Latent Phase Distortion (LPD). LPD guides the state space model. Unlike classical Zernike-based representations of phase distortion, the new LPD map uniquely captures the actual effects of turbulence, significantly improving the model's capability to estimate degradation by reducing the ill-posedness. Our proposed method exceeds current state-of-the-art networks on various synthetic and real-world TM benchmarks with significantly faster inference speed. The code is available at http://github.com/xg416/MambaTM.
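The linear-complexity claim can be made concrete with a scalar state-space recurrence. The sketch below is a deliberately reduced toy, not the MambaTM layer: the state is a single number, and the per-step `decays` and `gains` stand in for the input-dependent parameters that make a state space model "selective". All names are hypothetical.

```python
from typing import List


def scalar_state_scan(
    xs: List[float], decays: List[float], gains: List[float]
) -> List[float]:
    """Run h_t = decays[t] * h_{t-1} + gains[t] * xs[t] in a single pass.

    The loop touches each element once, so cost grows linearly with the
    sequence length, while h_t still depends on every earlier input.
    """
    h = 0.0
    states = []
    for x, a, b in zip(xs, decays, gains):
        h = a * h + b * x
        states.append(h)
    return states


remembering = scalar_state_scan([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0])
forgetting = scalar_state_scan([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

A decay near one keeps long-range context, and a decay of zero discards it; letting those values vary per step is what allows the model to choose which frames or locations to carry forward.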

HQViT: Hybrid Quantum Vision Transformer for Image Classification

Hui Zhang,Qinglin Zhao,Mengchu Zhou,Li Feng

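Task: Propose a Hybrid Quantum Vision Transformer (HQViT) that uses quantum computing to accelerate model training and improve performance.

Motivation: The self-attention mechanism of conventional Transformers is computationally expensive, especially for high-dimensional inputs such as images, which makes training costly.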

Details

Method: HQViT uses whole-image processing with amplitude encoding, reduces quantum resource requirements, and offloads computationally intensive tasks to the quantum framework. Result: HQViT performs well on multiple computer vision datasets, with a maximum improvement of 10.9% (on the MNIST task). Conclusion: Combining quantum and classical computing has great potential for complex image classification tasks. Abstract: Transformer-based architectures have revolutionized the landscape of deep learning. In the computer vision domain, Vision Transformer demonstrates remarkable performance on par with or even surpassing that of convolutional neural networks. However, the quadratic computational complexity of its self-attention mechanism poses challenges for classical computing, making model training with high-dimensional input data, e.g., images, particularly expensive. To address such limitations, we propose a Hybrid Quantum Vision Transformer (HQViT), that leverages the principles of quantum computing to accelerate model training while enhancing model performance. HQViT introduces whole-image processing with amplitude encoding to better preserve global image information without additional positional encoding. By leveraging quantum computation on the most critical steps and selectively handling other components in a classical way, we lower the cost of quantum resources for HQViT. The qubit requirement is minimized to $O(\log_2 N)$ and the number of parameterized quantum gates is only $O(\log_2 d)$, making it well-suited for Noisy Intermediate-Scale Quantum devices. By offloading the computationally intensive attention coefficient matrix calculation to the quantum framework, HQViT reduces the classical computational load by $O(T^2 d)$. Extensive experiments across various computer vision datasets demonstrate that HQViT outperforms existing models, achieving a maximum improvement of up to 10.9% (on the MNIST 10-classification task) over the state of the art. This work highlights the great potential to combine quantum and classical computing to cope with complex image classification tasks.
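The logarithmic qubit count follows from how amplitude encoding works in general: a register of q qubits has 2^q amplitudes, so a vector of N values fits once 2^q >= N. The sketch below shows only this classical bookkeeping (padding and normalizing the vector); it is not the HQViT circuit, and the function names are hypothetical.

```python
import math
from typing import List


def qubits_for_amplitudes(n_values: int) -> int:
    """Smallest q with 2**q >= n_values (at least one qubit)."""
    return max(1, (n_values - 1).bit_length())


def amplitude_encode(values: List[float]) -> List[float]:
    """Zero-pad to a power-of-two length and scale to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        raise ValueError("cannot encode an all-zero vector")
    size = 2 ** qubits_for_amplitudes(len(values))
    padded = list(values) + [0.0] * (size - len(values))
    return [v / norm for v in padded]


state = amplitude_encode([3.0, 4.0])
qubits_for_28x28 = qubits_for_amplitudes(28 * 28)  # 784 pixel values
```

For a flattened 28 by 28 image this bookkeeping gives 10 qubits, since 2^10 = 1024 is the first power of two that is at least 784.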

MD-ProjTex: Texturing 3D Shapes with Multi-Diffusion Projection

Ahmet Burak Yildirim,Mustafa Utku Aydogdu,Duygu Ceylan,Aysegul Dundar

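Task: Propose MD-ProjTex, a fast and consistent text-guided texture generation method for 3D shapes built on pretrained text-to-image diffusion models.

Motivation: Address the low computational efficiency and insufficient 3D consistency of existing methods that rely on optimization or sequential view synthesis.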

Details

Method: Uses a multi-view consistency mechanism in UV space that fuses multi-view noise predictions and jointly updates the per-view denoising directions to maintain 3D consistency. Result: MD-ProjTex is computationally more efficient than existing methods and achieves better quantitative and qualitative results. Conclusion: MD-ProjTex is an efficient, high-quality method for text-guided 3D texture generation. Abstract: We introduce MD-ProjTex, a method for fast and consistent text-guided texture generation for 3D shapes using pretrained text-to-image diffusion models. At the core of our approach is a multi-view consistency mechanism in UV space, which ensures coherent textures across different viewpoints. Specifically, MD-ProjTex fuses noise predictions from multiple views at each diffusion step and jointly updates the per-view denoising directions to maintain 3D consistency. In contrast to existing state-of-the-art methods that rely on optimization or sequential view synthesis, MD-ProjTex is computationally more efficient and achieves better quantitative and qualitative results.
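A simple way to picture the fusion step is a visibility-weighted average of per-view predictions at each UV texel. The abstract does not specify the fusion rule, so the averaging below, the flat texel indexing, and the name `fuse_in_uv` are assumptions for illustration only.

```python
from typing import List


def fuse_in_uv(
    per_view_preds: List[List[float]], visibility: List[List[float]]
) -> List[float]:
    """Average the predictions of all views that see each texel.

    per_view_preds[v][t] is view v's prediction projected onto texel t;
    visibility[v][t] is its weight (0 when the texel is hidden from view v).
    Texels seen by no view are left at 0.0.
    """
    n_texels = len(per_view_preds[0])
    fused = []
    for t in range(n_texels):
        weight = sum(vis[t] for vis in visibility)
        if weight == 0:
            fused.append(0.0)
            continue
        total = sum(p[t] * v[t] for p, v in zip(per_view_preds, visibility))
        fused.append(total / weight)
    return fused


preds = [[1.0, 2.0, 0.0], [3.0, 0.0, 0.0]]
vis = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
fused = fuse_in_uv(preds, vis)
```

Because every view reads its next denoising direction back from the same fused texture, overlapping views cannot drift apart, which is the intuition behind enforcing consistency in UV space.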

CanonNet: Canonical Ordering and Curvature Learning for Point Cloud Analysis

Benjy Friedmann,Michael Werman

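Task: Propose CanonNet, a lightweight neural network that addresses point ordering and geometric feature learning in point cloud processing.

Motivation: Current architectures rely on complex operations that limit expressivity and struggle to capture fine geometric detail.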

Details

Method: CanonNet has two components: (1) a preprocessing pipeline that creates a canonical point ordering and orientation; (2) a geometric learning framework that learns from synthetic surfaces with precise curvature values. Result: Achieves state-of-the-art performance in curvature estimation and competitive results in geometric descriptor tasks, with significantly fewer parameters (100X). Conclusion: CanonNet's efficiency makes it suitable for settings with limited computational resources, showing that mathematical preprocessing can effectively complement neural architectures for point cloud analysis. Abstract: Point cloud processing poses two fundamental challenges: establishing consistent point ordering and effectively learning fine-grained geometric features. Current architectures rely on complex operations that limit expressivity while struggling to capture detailed surface geometry. We present CanonNet, a lightweight neural network composed of two complementary components: (1) a preprocessing pipeline that creates a canonical point ordering and orientation, and (2) a geometric learning framework where networks learn from synthetic surfaces with precise curvature values. This modular approach eliminates the need for complex transformation-invariant architectures while effectively capturing local geometric properties. Our experiments demonstrate state-of-the-art performance in curvature estimation and competitive results in geometric descriptor tasks with significantly fewer parameters (100X) than comparable methods. CanonNet's efficiency makes it particularly suitable for real-world applications where computational resources are limited, demonstrating that mathematical preprocessing can effectively complement neural architectures for point cloud analysis. The code for the project is publicly available at https://benjyfri.github.io/CanonNet/.
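To show what a canonical ordering buys, here is a toy rule that makes the output independent of the input permutation. It is not CanonNet's preprocessing pipeline (which the abstract does not detail): the rule of sorting by squared distance to the centroid, with coordinates as tie-breakers, is an assumption chosen for simplicity, and it does not handle orientation.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def canonical_order(points: List[Point3]) -> List[Point3]:
    """Sort points by squared distance to the centroid, then by coordinates.

    Any shuffling of the same point set produces the same output list,
    so a downstream network can consume it without being permutation-invariant.
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n

    def key(p: Point3):
        d2 = (p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
        return (d2, p)

    return sorted(points, key=key)


cloud = [(2.0, 4.0, 4.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
ordered = canonical_order(cloud)
ordered_from_shuffled = canonical_order(list(reversed(cloud)))
```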

Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model

Shengjun Zhang,Jinzhao Li,Xin Fei,Hao Liu,Yueqi Duan

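Task: Propose Scene Splatter, a momentum-based video diffusion paradigm for generating generic scenes from a single image.

Motivation: Existing methods suffer from limited video length and scene inconsistency when generating novel views, leading to artifacts and distortions during reconstruction.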

Details

Method: Constructs noisy samples from original features as momentum to enhance video details and maintain scene consistency, and introduces pixel-level momentum to recover unknown regions. Result: Experiments show superior performance in high-fidelity and consistent scene generation. Conclusion: With cascaded momentum, Scene Splatter overcomes the limitations of video diffusion models and produces high-quality, consistent novel views. Abstract: In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from original features as momentum to enhance video details and maintain scene consistency. However, for latent features with the perception field that spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in unknown regions. Therefore, we further introduce the aforementioned consistent video as a pixel-level momentum to a directly generated video without momentum for better recovery of unseen regions. Our cascaded momentum enables video diffusion models to generate both high-fidelity and consistent novel views. We further finetune the global Gaussian representations with enhanced frames and render new frames for momentum update in the next step. In this manner, we can iteratively recover a 3D scene, avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.

TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection

Yoon Gyo Jung,Jaewoo Park,Jaeho Yoon,Kuan-Chuan Peng,Wonchul Kim,Andrew Beng Jin Teoh,Octavia Camps

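Task: Solve unsupervised anomaly detection in a long-tailed setting where the normal dataset is contaminated with defective regions and the class distribution is unknown.

Motivation: Existing models face a performance trade-off between robustness to pixel noise and accuracy on tail-class samples, and cannot achieve both.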

Details

Method: Proposes TailSampler, a class size predictor that estimates the class cardinality of samples under a symmetry assumption on the class-wise distribution of embedding similarities, and builds the memory-based model TailedCore. Result: TailedCore outperforms existing methods in the long-tail noisy anomaly detection setting. Conclusion: TailedCore handles tail classes and noisy samples effectively, improving anomaly detection performance. Abstract: We aim to solve unsupervised anomaly detection in a practical challenging environment where the normal dataset is both contaminated with defective regions and its product class distribution is tailed but unknown. We observe that existing models suffer from a tail-versus-noise trade-off where if a model is robust against pixel noise, then its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be utilized to sample the tail class samples exclusively, allowing them to be handled separately. Based on these facets, we build a memory-based anomaly detection model TailedCore, whose memory both well captures tail class information and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.

Multi-Head Adaptive Graph Convolution Network for Sparse Point Cloud-Based Human Activity Recognition

Vincent Gbouna Zakka,Luis J. Manso,Zhuangzhuang Dai

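Task: Propose a graph convolution method based on a Multi-Head Adaptive Kernel (MAK) that processes millimetre-wave radar point clouds for human activity recognition.

Motivation: Overcome the reliance of existing methods on fixed kernels when processing sparse, noisy point clouds, as well as the privacy and low-light limitations of image-based methods.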

Details

Method: Introduces a Multi-Head Adaptive Kernel (MAK) module into the graph convolutional framework, dynamically generating multiple kernels that capture different aspects of the local feature space. Result: Achieves state-of-the-art performance on benchmark datasets, confirming the effectiveness of the method. Conclusion: By adapting kernels to local features, MAK-GCN significantly improves human activity recognition. Abstract: Human activity recognition is increasingly vital for supporting independent living, particularly for the elderly and those in need of assistance. Domestic service robots with monitoring capabilities can enhance safety and provide essential support. Although image-based methods have advanced considerably in the past decade, their adoption remains limited by concerns over privacy and sensitivity to low-light or dark conditions. As an alternative, millimetre-wave (mmWave) radar can produce point cloud data which is privacy-preserving. However, processing the sparse and noisy point clouds remains a long-standing challenge. While graph-based methods and attention mechanisms show promise, they predominantly rely on "fixed" kernels, that is, kernels that are applied uniformly across all neighbourhoods. This highlights the need for adaptive approaches that can dynamically adjust their kernels to the specific geometry of each local neighbourhood in point cloud data. To overcome this limitation, we introduce an adaptive approach within the graph convolutional framework. Instead of a single shared weight function, our Multi-Head Adaptive Kernel (MAK) module generates multiple dynamic kernels, each capturing different aspects of the local feature space. By progressively refining local features while maintaining global spatial context, our method enables convolution kernels to adapt to varying local features. Experimental results on benchmark datasets confirm the effectiveness of our approach, achieving state-of-the-art performance in human activity recognition. Our source code is made publicly available at: https://github.com/Gbouna/MAK-GCN

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

Zhiyuan Yan,Junyan Ye,Weijia Li,Zilong Huang,Shenghai Yuan,Xiangyang He,Kaiqing Lin,Jun He,Conghui He,Li Yuan

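Task: Evaluate GPT-4o's performance in image generation and editing, and offer a conjecture about its architecture.

Motivation: OpenAI's GPT-4o performs impressively in image generation and editing, but it lacks a systematic evaluation and a deeper understanding of its architecture.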

Details

Method: Proposes the GPT-ImgEval benchmark, which quantitatively and qualitatively evaluates GPT-4o on generation quality, editing proficiency, and semantic synthesis, and infers its architecture from generated data. Result: GPT-4o performs strongly on image generation and editing tasks, and its architecture is conjectured to combine an auto-regressive model with a diffusion-based head. Conclusion: The study provides a reliable benchmark for GPT-4o's performance and architecture, identifies its limitations, and supports further research in image generation. Abstract: The recent breakthroughs in OpenAI's GPT-4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) architecture combined with a diffusion-based head for image decoding, rather than the VAR-like architectures. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond.
The codes and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.

Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence

Anita Rau,Mark Endo,Josiah Aklilu,Jaewoo Heo,Khaled Saab,Alberto Paderno,Jeffrey Jopling,F. Christopher Holsinger,Serena Yeung-Levy

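Task: Analyze 11 state-of-the-art vision-language models across 17 key visual understanding tasks in surgical AI.

Motivation: Explore the practical potential of vision-language models in medicine, especially surgery, where expert-annotated data is scarce.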

Details

Method: Evaluates 11 vision-language models on 13 datasets covering laparoscopic, robotic, and open surgery. Result: Vision-language models generalize well, at times outperforming supervised models outside their training setting; in-context learning improves performance up to three-fold. Conclusion: Vision-language models show promise for complex, dynamic scenarios such as surgery, but tasks requiring spatial or temporal reasoning remain challenging. Abstract: Large Vision-Language Models offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Yet, VLMs' practical utility in intervention-focused domains (especially surgery, where decision-making is subjective and clinical scenarios are variable) remains uncertain. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI (from anatomy recognition to skill assessment) using 13 datasets spanning laparoscopic, robotic, and open procedures. In our experiments, VLMs demonstrate promising generalizability, at times outperforming supervised models when deployed outside their training setting. In-context learning, incorporating examples during testing, boosted performance up to three-fold, suggesting adaptability as a key strength. Still, tasks requiring spatial or temporal reasoning remained difficult. Beyond surgery, our findings offer insights into VLMs' potential for tackling complex and dynamic scenarios in clinical and broader real-world applications.

F-ViTA: Foundation Model Guided Visible to Thermal Translation

Jay N. Paranjape,Celso de Melo,Vishal M. Patel

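Task: Propose F-ViTA, a new method that uses the general knowledge in foundation models to guide the diffusion process and improve visible-to-thermal image translation.

Motivation: Because infrared imaging equipment is costly and data collection is difficult, researchers have explored visible-to-thermal image translation, but existing methods (such as GANs or DMs) must learn both the modality distribution and the underlying physical principles from limited data, which limits their effectiveness.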

Details

Method: Conditions an InstructPix2Pix diffusion model on zero-shot masks and labels from foundation models (such as SAM and Grounded DINO) to learn correlations between scene objects and their thermal signatures. Result: Experiments on five public datasets show that F-ViTA outperforms existing methods, can generate LWIR, MWIR, and NIR translations, and generalizes well to out-of-distribution scenarios. Conclusion: By using the general knowledge of foundation models, F-ViTA significantly improves visible-to-thermal image translation and shows good generalization. Abstract: Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: https://github.com/JayParanjape/F-ViTA/tree/master.

BOP Challenge 2024 on Model-Based and Model-Free 6D Object Pose Estimation

Van Nguyen Nguyen,Stephen Tyree,Andrew Guo,Mederic Fourmy,Anas Gouda,Taeyeop Lee,Sungphill Moon,Hyeontae Son,Lukas Ranftl,Jonathan Tremblay,Eric Brachmann,Bertram Drost,Vincent Lepetit,Carsten Rother,Stan Birchfield,Jiri Matas,Yann Labbe,Martin Sundermeyer,Tomas Hodan

Task: Present the evaluation methodology, datasets, and results of the BOP Challenge 2024, advancing 6D object pose estimation and related tasks.

Motivation: Transition BOP from lab-like setups to real-world scenarios, making the tasks more practical and more challenging.

Details

Method: Introduce new tasks (model-free tasks and a more practical 6D object detection task) and new datasets (BOP-H3) that support both model-based and model-free tasks. Result: The best 2024 method for 6D localization of unseen objects is 22% more accurate than the best 2023 method but slower; the more practical Co-op is both faster and more accurate than the best 2023 method. Conclusion: BOP Challenge 2024 made notable progress toward real-world scenarios, but 2D detection accuracy for unseen objects still leaves room for improvement. Abstract: We present the evaluation methodology, datasets and results of the BOP Challenge 2024, the sixth in a series of public competitions organized to capture the state of the art in 6D object pose estimation and related tasks. In 2024, our goal was to transition BOP from lab-like setups to real-world scenarios. First, we introduced new model-free tasks, where no 3D object models are available and methods need to onboard objects just from provided reference videos. Second, we defined a new, more practical 6D object detection task where identities of objects visible in a test image are not provided as input. Third, we introduced new BOP-H3 datasets recorded with high-resolution sensors and AR/VR headsets, closely resembling real-world scenarios. BOP-H3 includes 3D models and onboarding videos to support both model-based and model-free tasks. Participants competed on seven challenge tracks, each defined by a task, object onboarding setup, and dataset group. Notably, the best 2024 method for model-based 6D localization of unseen objects (FreeZeV2.1) achieves 22% higher accuracy on BOP-Classic-Core than the best 2023 method (GenFlow), and is only 4% behind the best 2023 method for seen objects (GPose2023), although it is significantly slower (24.9 vs 2.7s per image). A more practical 2024 method for this task is Co-op, which takes only 0.8s per image and is 25X faster and 13% more accurate than GenFlow. Methods have a similar ranking on 6D detection as on 6D localization but higher run time. On model-based 2D detection of unseen objects, the best 2024 method (MUSE) achieves 21% relative improvement compared to the best 2023 method (CNOS). However, the 2D detection accuracy for unseen objects is still noticeably (-53%) behind the accuracy for seen objects (GDet2023). The online evaluation system stays open and is available at http://bop.felk.cvut.cz/

Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization

Kangle Deng,Hsueh-Ti Derek Liu,Yiheng Zhu,Xiaoxia Sun,Chong Shang,Kiran Bhat,Deva Ramanan,Jun-Yan Zhu,Maneesh Agrawala,Tinghui Zhou

Task: Propose an octree-based adaptive tokenization framework for 3D shape generation.

Motivation: Existing methods encode every shape into a fixed-size token, ignoring the inherent variation in scale and complexity across 3D data, which leads to inefficient latent representations.

Details

Method: Build an adaptive octree guided by a quadric-error-based subdivision criterion and assign a shape latent vector to each octree cell using a query-based transformer. Result: Experiments show a 50% reduction in token count at comparable visual quality; at a similar token length, the method generates higher-quality 3D shapes. Conclusion: The method produces more detailed and diverse 3D content than existing approaches. Abstract: Many 3D generative models rely on variational autoencoders (VAEs) to learn compact shape representations. However, existing methods encode all shapes into a fixed-size token, disregarding the inherent variations in scale and complexity across 3D data. This leads to inefficient latent representations that can compromise downstream generation. We address this challenge by introducing Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. Our approach constructs an adaptive octree structure guided by a quadric-error-based subdivision criterion and allocates a shape latent vector to each octree cell using a query-based transformer. Building upon this tokenization, we develop an octree-based autoregressive generative model that effectively leverages these variable-sized representations in shape generation. Extensive experiments demonstrate that our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality. When using a similar token length, our method produces significantly higher-quality shapes. When incorporated with our downstream generative model, our method creates more detailed and diverse 3D content than existing approaches.
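
The idea of spending more cells where a shape is complex can be illustrated with a minimal sketch of error-driven octree subdivision. This is not the paper's implementation: the abstract names a quadric-error criterion, while the stand-in `cell_error` below simply flags cells that may contain the surface of a sphere. The function names, the sphere, and the thresholds are all assumptions chosen for illustration.

```python
import itertools
import math

def sphere_sdf(p, radius=0.5):
    """Signed distance from point p to a sphere centred at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius

def cell_error(center, half):
    """Stand-in complexity measure (NOT the paper's quadric error).

    A cube with half-size `half` fits inside a ball of radius half*sqrt(3),
    so the surface can only cross the cell if |sdf(center)| is within that
    radius. Cells that may contain surface report their size as the error.
    """
    may_cross = abs(sphere_sdf(center)) <= half * math.sqrt(3.0)
    return half if may_cross else 0.0

def subdivide(center, half, max_depth, tol, depth=0):
    """Return the leaf cells (center, half_size) of an adaptive octree."""
    if depth == max_depth or cell_error(center, half) <= tol:
        return [(center, half)]
    leaves = []
    # Each of the 8 children is offset by half/2 along every axis.
    for signs in itertools.product((-1, 1), repeat=3):
        child = tuple(c + s * half / 2 for c, s in zip(center, signs))
        leaves.extend(subdivide(child, half / 2, max_depth, tol, depth + 1))
    return leaves

leaves = subdivide((0.0, 0.0, 0.0), 1.0, max_depth=4, tol=0.1)
uniform_count = 8 ** 4  # number of leaves in a full octree of the same depth
print(len(leaves), "adaptive leaves vs", uniform_count, "uniform leaves")
```

Cells far from the surface stay coarse, so the adaptive tree needs fewer leaves than a uniform grid of the same maximum depth. In the paper's setting each leaf would then receive a latent vector; that step is not sketched here.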

GMR-Conv: An Efficient Rotation and Reflection Equivariant Convolution Kernel Using Gaussian Mixture Rings

Yuexi Du,Jiazhen Zhang,Nicha C. Dvornek,John A. Onofrey

Task: Design an efficient convolution kernel (GMR-Conv) that addresses the difficulty conventional CNNs have with rotation and reflection equivariance.

Motivation: Conventional CNNs support only translational equivariance; extending them to rotation and reflection equivariance usually forces a trade-off between efficiency and information loss.

Details

Method: Propose Gaussian Mixture Ring Convolution (GMR-Conv), which smooths radial symmetry with Gaussian-weighted rings, reducing discretization error while preserving rotation and reflection equivariance. Result: On eight classification datasets and one segmentation dataset, GMR-Conv matches conventional CNN performance, surpasses it on orientation-less data, and is more robust and efficient than existing equivariant learning methods. Conclusion: GMR-Conv shows that radial symmetry can alleviate information loss, a promising advance for equivariant network architectures. Abstract: Symmetry, where certain features remain invariant under geometric transformations, can often serve as a powerful prior in designing convolutional neural networks (CNNs). While conventional CNNs inherently support translational equivariance, extending this property to rotation and reflection has proven challenging, often forcing a compromise between equivariance, efficiency, and information loss. In this work, we introduce Gaussian Mixture Ring Convolution (GMR-Conv), an efficient convolution kernel that smooths radial symmetry using a mixture of Gaussian-weighted rings. This design mitigates discretization errors of circular kernels, thereby preserving robust rotation and reflection equivariance without incurring computational overhead. We further optimize both the space and speed efficiency of GMR-Conv via a novel parameterization and computation strategy, allowing larger kernels at an acceptable cost. Extensive experiments on eight classification and one segmentation datasets demonstrate that GMR-Conv not only matches conventional CNNs' performance but can also surpass it in applications with orientation-less data. GMR-Conv is also proven to be more robust and efficient than the state-of-the-art equivariant learning methods. Our work provides inspiring empirical evidence that carefully applied radial symmetry can alleviate the challenges of information loss, marking a promising advance in equivariant network architectures. The code is available at https://github.com/XYPB/GMR-Conv.
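
A rough way to picture a kernel built from Gaussian-weighted rings is sketched below. The parameterization is an assumption (one weight per integer radius and a shared `sigma`); the abstract does not spell out the exact form, and the released code may differ. The names `gmr_kernel`, `rot90`, and `flip` are illustrative.

```python
import math

def gmr_kernel(size, ring_weights, sigma=0.5):
    """Build a size x size kernel that depends only on distance to the centre.

    ring_weights[r] is the weight of the ring at radius r (learnable in a
    real layer, fixed here). Each ring is blurred radially by a Gaussian,
    which smooths the jagged edge a hard circular ring has on a pixel grid.
    """
    c = (size - 1) / 2.0
    kernel = []
    for i in range(size):
        row = []
        for j in range(size):
            d = math.hypot(i - c, j - c)
            row.append(sum(w * math.exp(-((d - r) ** 2) / (2 * sigma ** 2))
                           for r, w in enumerate(ring_weights)))
        kernel.append(row)
    return kernel

def rot90(k):
    """Rotate a square kernel by 90 degrees."""
    return [list(row) for row in zip(*k[::-1])]

def flip(k):
    """Mirror a kernel left to right."""
    return [row[::-1] for row in k]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

kernel = gmr_kernel(5, ring_weights=[1.0, -0.5, 0.25])
print(max_abs_diff(kernel, rot90(kernel)), max_abs_diff(kernel, flip(kernel)))
```

Because every entry is a function of radius alone, the kernel is unchanged by 90-degree rotations and reflections, so convolving with it commutes with those transforms exactly; for other angles the symmetry holds only up to grid discretization.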

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Mateusz Pach,Shyamgopal Karthik,Quentin Bouniot,Serge Belongie,Zeynep Akata

Task: Extend sparse autoencoders (SAEs) to vision-language models (VLMs) and evaluate monosemanticity in vision representations.

Motivation: Improve the interpretability and steerability of VLMs while exploring the potential of SAEs in VLMs.

Details

Method: Train SAEs on VLMs (such as CLIP) and design a framework for evaluating the monosemanticity of vision representations. Result: SAEs significantly improve neuron monosemanticity and yield hierarchical representations that align with expert-defined structures (such as the iNaturalist taxonomy); SAEs can intervene directly on the CLIP vision encoder to steer the output of multimodal LLMs without modifying the underlying model. Conclusion: SAEs are an unsupervised approach that effectively improves the interpretability and control of VLMs. Abstract: Sparse Autoencoders (SAEs) have recently been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons while also exhibiting hierarchical representations that align well with expert-defined structures (e.g., iNaturalist taxonomy). Most notably, we demonstrate that applying SAEs to intervene on a CLIP vision encoder directly steers output from multimodal LLMs (e.g., LLaVA) without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised approach for enhancing both the interpretability and control of VLMs.
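
For readers unfamiliar with SAEs, the sketch below shows one commonly used formulation: a ReLU encoder, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. The abstract does not state which SAE variant or loss the authors train, so this form, the toy weights, and the name `sae_forward` are assumptions for illustration only.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1=0.1):
    """One forward pass of a basic sparse autoencoder.

    Returns the sparse code z, the reconstruction x_hat, and the loss
    ||x - x_hat||^2 + l1 * ||z||_1.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    z = [max(0.0, p) for p in pre]  # ReLU keeps codes non-negative and sparse
    x_hat = [r + b for r, b in zip(matvec(w_dec, z), b_dec)]
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in z)
    return z, x_hat, recon + l1 * sparsity

# Toy weights: a 2-d activation expanded into 3 latent units.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

z, x_hat, loss = sae_forward([2.0, -1.0], w_enc, b_enc, w_dec, b_dec)
print(z, x_hat, loss)
```

In the paper's setting the input would be an activation vector from a VLM such as CLIP, and steering would amount to editing entries of `z` before decoding; that intervention is not sketched here.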

STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection

Divya Velayudhan,Abdelfatah Ahmed,Mohamad Alansari,Neha Gour,Abderaouf Behouch,Taimur Hassan,Syed Talal Wasim,Nabil Maalej,Muzammal Naseer,Juergen Gall,Mohammed Bennamoun,Ernesto Damiani,Naoufel Werghi

Task: Develop STCray, a multimodal X-ray baggage security dataset, and train STING-BEE, a domain-aware visual AI assistant that supports a range of vision-language tasks.

Motivation: Current datasets represent real-world sophisticated threats and concealment tactics poorly, and existing approaches are confined to a closed-set paradigm with predefined labels.

Details

Method: Introduce the STCray dataset, comprising 46,642 image-caption paired scans across 21 threat categories, and develop the domain-aware visual AI assistant STING-BEE. Result: STING-BEE shows state-of-the-art generalization in cross-domain settings and establishes new baselines for multimodal learning in X-ray baggage security. Conclusion: STCray and STING-BEE provide a novel multimodal solution for X-ray baggage security and advance the field. Abstract: Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol that ensures domain-aware, coherent captions, which lead to multi-modal instruction-following data in X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multi-modal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Code, data, and models are available at https://divs1159.github.io/STING-BEE/.

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Xiangyu Zhao,Peiyuan Zhang,Kexian Tang,Hao Li,Zicheng Zhang,Guangtao Zhai,Junchi Yan,Hua Yang,Xue Yang,Haodong Duan

Task: Evaluate and improve the ability of Large Multi-modality Models (LMMs) in Reasoning-Informed Visual Editing (RISE).

Motivation: LMMs have progressed in visual understanding and generation but still struggle with following complex instructions, preserving appearance consistency, and supporting flexible input formats.

Details

Method: Introduce the RISEBench benchmark, covering four reasoning types (temporal, causal, spatial, and logical), and propose an evaluation framework that combines human judges with an LMM-as-a-judge approach. Result: GPT-4o-Native performs best but still struggles with logical reasoning tasks. Conclusion: RISEBench provides foundational insights into reasoning-aware visual editing and aims to catalyze future research. Abstract: Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though the benchmark is still in its early stages, we are committed to continuously expanding and refining it to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.

Concept Lancet: Image Editing with Compositional Representation Transplant

Jinqi Luo,Tianjiao Ding,Kwan Ho Ryan Chan,Hancheng Min,Chris Callison-Burch,René Vidal

Task: Propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for representation manipulation in diffusion-based image editing.

Motivation: When existing editing methods design an edit direction in the text embedding or score space, the edit strength is hard to balance, which either harms visual consistency or fails the editing task.

Details

Method: Decompose the input image in latent space (text embedding or diffusion score) as a sparse linear combination of collected visual concepts to accurately estimate concept presence, then perform a customized concept transplant according to the editing task (replace/add/remove). Result: Experiments show that methods equipped with CoLan achieve state-of-the-art editing effectiveness and consistency preservation across multiple diffusion-based image editing baselines. Conclusion: Through sparse linear decomposition and concept transplant, CoLan addresses the edit-strength estimation problem and improves both editing effectiveness and consistency. Abstract: Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.
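
The sparse decomposition step can be sketched as an L1-regularized least-squares problem over a small concept dictionary. The solver below is ISTA (iterative soft-thresholding), one standard choice and not necessarily the one the authors use; the three toy "concept" vectors, the penalty `lam`, and the function names are assumptions for illustration.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def sparse_decompose(x, concepts, lam=0.1, iters=200):
    """Approximately minimise 0.5*||x - sum_k w_k c_k||^2 + lam*||w||_1 (ISTA)."""
    # 1/||D||_F^2 is a safe step size: it never exceeds 1/L for this problem.
    step = 1.0 / sum(c * c for vec in concepts for c in vec)
    w = [0.0] * len(concepts)
    for _ in range(iters):
        recon = [sum(wk * vec[i] for wk, vec in zip(w, concepts))
                 for i in range(len(x))]
        resid = [r - xi for r, xi in zip(recon, x)]
        grad = [sum(r * c for r, c in zip(resid, vec)) for vec in concepts]
        w = [soft_threshold(wk - step * g, step * lam) for wk, g in zip(w, grad)]
    return w

# Hypothetical 3-d "latent space" with three orthonormal concept directions.
concepts = [[0.6, 0.8, 0.0], [-0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
# Source representation: strong concept 0, faint concept 1, negative concept 2.
x = [1.16, 1.63, -1.0]

weights = sparse_decompose(x, concepts)
print(weights)
```

The faint concept is driven to exactly zero while the dominant ones keep coefficients close to their true strength. A per-image coefficient of this kind is what lets the edit strength be estimated rather than searched by trial and error; the transplant step itself is not sketched here.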

CaLiV: LiDAR-to-Vehicle Calibration of Arbitrary Sensor Setups via Object Reconstruction

Ilir Tahiraj,Markus Edinger,Dominik Kulmer,Markus Lienkamp

Task: Propose CaLiV, a target-based technique for extrinsic calibration (Sensor-to-Sensor and Sensor-to-Vehicle) of multi-LiDAR systems.

Motivation: Most existing LiDAR calibration methods require overlapping fields of view, external sensing devices, or a feature-rich environment, and the majority do not support Sensor-to-Vehicle calibration.

Details

Method: Create field-of-view overlap through motion, obtain vehicle poses with an unscented Kalman filter, align the point clouds with the GMMCalib framework, and finally recast calibration as a minimization problem. Result: The method accurately resolves both translational and rotational Sensor-to-Sensor errors and calibrates the Sensor-to-Vehicle rotation angles with high accuracy. Conclusion: CaLiV is an effective calibration method that needs no external devices and works with non-overlapping fields of view and arbitrary calibration targets; its performance is validated experimentally, and the code is open source. Abstract: In autonomous systems, sensor calibration is essential for safe and efficient navigation in dynamic environments. Accurate calibration is a prerequisite for reliable perception and planning tasks such as object detection and obstacle avoidance. Many existing LiDAR calibration methods require overlapping fields of view, while others use external sensing devices or postulate a feature-rich environment. In addition, Sensor-to-Vehicle calibration is not supported by the vast majority of calibration algorithms. In this work, we propose a novel target-based technique for extrinsic Sensor-to-Sensor and Sensor-to-Vehicle calibration of multi-LiDAR systems called CaLiV. This algorithm works for non-overlapping FoVs, as well as arbitrary calibration targets, and does not require any external sensing devices. First, we apply motion to produce FoV overlaps and utilize a simple unscented Kalman filter to obtain vehicle poses. Then, we use the Gaussian mixture model-based registration framework GMMCalib to align the point clouds in a common calibration frame. Finally, we reduce the task of recovering the sensor extrinsics to a minimization problem. We show that both translational and rotational Sensor-to-Sensor errors can be solved accurately by our method. In addition, all Sensor-to-Vehicle rotation angles can also be calibrated with high accuracy. We validate the simulation results in real-world experiments. The code is open source and available at https://github.com/TUMFTM/CaLiV.

Distance Estimation to Support Assistive Drones for the Visually Impaired using Robust Calibration

Suman Raj,Bhavani A Madhabhavi,Madhav Kumar,Prabhav Gupta,Yogesh Simmhan

Task: Use depth maps and a dynamic-update method to provide autonomous drone navigation assistance for Visually Impaired People (VIPs).

Motivation: Help visually impaired people navigate outdoor environments and avoid obstacles using drones and deep learning.

Details

Method: Propose NOVA, a technique that uses depth maps and a dynamic-update method to estimate absolute distances to obstacles. Result: NOVA's error is below 30 cm for the distance to VIPs and at most 60 cm for other obstacles (such as cars and bicycles), outperforming the baselines and existing techniques. Conclusion: NOVA is a robust and generalizable method that clearly outperforms existing techniques and suits dynamic and diverse environments. Abstract: Autonomous navigation by drones using onboard sensors, combined with deep learning and computer vision algorithms, is impacting a number of domains. We examine the use of drones to autonomously assist Visually Impaired People (VIPs) in navigating outdoor environments while avoiding obstacles. Here, we present NOVA, a robust calibration technique using depth maps to estimate absolute distances to obstacles in a campus environment. NOVA uses a dynamic-update method that can adapt to adversarial scenarios. We compare NOVA with SOTA depth map approaches, and with geometric and regression-based baseline models, for distance estimation to VIPs and other obstacles in diverse and dynamic conditions. We also provide exhaustive evaluations to validate the robustness and generalizability of our methods. NOVA predicts distances to VIPs with an error <30cm and to different obstacles like cars and bicycles with a maximum error of 60cm, which is better than the baselines. NOVA also clearly outperforms SOTA depth map methods, by up to 5.3-14.6x.

A Concise Survey on Lane Topology Reasoning for HD Mapping

Yi Yao,Miao Fan,Shengtong Xu,Haoyi Xiong,Xiangzeng Liu,Wenbo Hu,Wenbing Huang

Task: Systematically review the evolution and current state of lane topology reasoning methods.

Motivation: Lane topology reasoning is crucial for HD mapping and autonomous driving, yet a comprehensive survey of these methods is lacking.

Details

Method: Group methods into three categories (procedural modeling-based, aerial imagery-based, and onboard sensor-based) and analyze the progression from early rule-based approaches to modern deep learning methods. Result: Analyzes standardized evaluation metrics (such as the APLS, TLTS, DET, and TOP scores) and compares performance on benchmark datasets such as OpenLane-V2. Conclusion: Summarizes key challenges (such as dataset availability and model efficiency) and future research directions, giving researchers and practitioners insight into theoretical frameworks and practical trends. Abstract: Lane topology reasoning techniques play a crucial role in high-definition (HD) mapping and autonomous driving applications. While recent years have witnessed significant advances in this field, there has been limited effort to consolidate these works into a comprehensive overview. This survey systematically reviews the evolution and current state of lane topology reasoning methods, categorizing them into three major paradigms: procedural modeling-based methods, aerial imagery-based methods, and onboard sensors-based methods. We analyze the progression from early rule-based approaches to modern learning-based solutions utilizing transformers, graph neural networks (GNNs), and other deep learning architectures. The paper examines standardized evaluation metrics, including road-level measures (APLS and TLTS score), and lane-level metrics (DET and TOP score), along with performance comparisons on benchmark datasets such as OpenLane-V2. We identify key technical challenges, including dataset availability and model efficiency, and outline promising directions for future research. This comprehensive review provides researchers and practitioners with insights into the theoretical frameworks, practical implementations, and emerging trends in lane topology reasoning for HD mapping applications.

Khizar Anjum,Parul Pandey,Vidyasagar Sadhu,Roberto Tron,Dario Pompili

Task: Propose a novel Markov Decision Process (MDP) framework to reduce the computational burden of computer vision (CV) algorithms in autonomous navigation.

Motivation: Conventional autonomous navigation based on geometric 3D point clouds is computationally expensive; approaches based on semantic information (such as traffic signs) are simpler, but CV algorithms such as object detection remain demanding for resource-limited devices such as drones.

Details

Method: Introduce an MDP framework, apply it to feature-based and neural-network-based object detection tasks, and test it with open-loop and closed-loop simulations and hardware-in-the-loop emulations. Result: Significant gains in energy consumption and speed with only a limited loss in accuracy. Conclusion: The proposed MDP framework has practical value for resource-constrained autonomous navigation. Abstract: Most applications in autonomous navigation using mounted cameras rely on the construction and processing of geometric 3D point clouds, which is an expensive process. There is, however, a simpler way to make a space navigable quickly: to use semantic information (e.g., traffic signs) to guide the agent. Yet detecting and acting on semantic information involves Computer Vision (CV) algorithms such as object detection, which themselves are demanding for agents such as aerial drones with limited onboard resources. To solve this problem, we introduce a novel Markov Decision Process (MDP) framework to reduce the workload of these CV approaches. We apply our proposed framework to both feature-based and neural-network-based object-detection tasks, using open-loop and closed-loop simulations as well as hardware-in-the-loop emulations. These holistic tests show significant benefits in energy consumption and speed with only a limited loss in accuracy compared to models based on static features and neural networks.

WorldPrompter: Traversable Text-to-Scene Generation

Zhaoyang Zhang,Yannick Hold-Geoffroy,Miloš Hašan,Chen Ziwen,Fujun Luan,Julie Dorsey,Yiwei Hu

Task: Generate traversable 3D scenes.

Motivation: Existing methods mostly generate partial scenes with limited navigational freedom; a method that synthesizes complete, traversable 3D scenes is needed.

Details

Method: Use panoramic video as an intermediate representation, combining a conditional 360° panoramic video generator with a fast feedforward 3D reconstructor to produce 3D scenes represented as Gaussian splats. Result: Experiments show that the panoramic video generation model achieves strong view consistency across frames, supports high-quality panoramic Gaussian splat reconstruction, and enables traversal over an area of the scene. Conclusion: WorldPrompter outperforms existing methods in 360° video generation and 3D scene generation. Abstract: Scene-level 3D generation is a challenging research topic, with most existing methods generating only partial scenes and offering limited navigational freedom. We introduce WorldPrompter, a novel generative pipeline for synthesizing traversable 3D scenes from text prompts. We leverage panoramic videos as an intermediate representation to model the 360° details of a scene. WorldPrompter incorporates a conditional 360° panoramic video generator, capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a true walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model achieves convincing view consistency across frames, enabling high-quality panoramic Gaussian splat reconstruction and facilitating traversal over an area of the scene. Qualitative and quantitative results also show it outperforms the state-of-the-art 360° video generators and 3D scene generation models.

Evaluation of Flight Parameters in UAV-based 3D Reconstruction for Rooftop Infrastructure Assessment

Nick Chodura,Melissa Greeff,Joshua Woods

Task: Optimize ground sampling distance (GSD) and image overlap in UAV photogrammetry to achieve high-accuracy 3D reconstruction of complex rooftop infrastructure.

Motivation: Existing methods usually require high image overlap and long flight times to ensure model accuracy; this study systematically evaluates key flight parameters to optimize the process.

Details

Method: Vary the UAV flight parameters (GSD and overlap) in controlled flights with a DJI Phantom 4 Pro V2, process the data with Reality Capture, and evaluate against ground-truth models generated from LiDAR and TLS. Result: A GSD of 0.75-1.26 cm with 85% image overlap achieves high model accuracy while minimizing the number of images and flight time. Conclusion: The results provide guidance for planning autonomous UAV flight paths for efficient rooftop assessment. Abstract: Rooftop 3D reconstruction using UAV-based photogrammetry offers a promising solution for infrastructure assessment, but existing methods often require high percentages of image overlap and extended flight times to ensure model accuracy when using autonomous flight paths. This study systematically evaluates key flight parameters, namely ground sampling distance (GSD) and image overlap, to optimize the 3D reconstruction of complex rooftop infrastructure. Controlled UAV flights were conducted over a multi-segment rooftop at Queen's University using a DJI Phantom 4 Pro V2, with varied GSD and overlap settings. The collected data were processed using Reality Capture software and evaluated against ground truth models generated from UAV-based LiDAR and terrestrial laser scanning (TLS). Experimental results indicate that a GSD range of 0.75-1.26 cm combined with 85% image overlap achieves a high degree of model accuracy, while minimizing images collected and flight time. These findings provide guidance for planning autonomous UAV flight paths for efficient rooftop assessments.
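
To make the reported GSD range concrete, the sketch below applies the standard pinhole-camera relation GSD = sensor width × altitude / (focal length × image width). The camera constants (13.2 mm sensor width, 8.8 mm focal length, 5472 × 3648 px images) are assumed nominal values for the DJI Phantom 4 Pro V2 and are not stated in the abstract; the function names are illustrative, and the overlap is treated as forward overlap along the image height.

```python
def flight_altitude_m(gsd_cm, sensor_width_mm=13.2, focal_length_mm=8.8,
                      image_width_px=5472):
    """Height above the surface (m) that yields the requested GSD (cm/px).

    Camera constants are ASSUMED nominal values, not taken from the paper.
    """
    return gsd_cm * focal_length_mm * image_width_px / (sensor_width_mm * 100.0)

def photo_spacing_m(gsd_cm, overlap, image_height_px=3648):
    """Distance between consecutive exposures (m) for a given forward overlap."""
    footprint_m = gsd_cm / 100.0 * image_height_px  # ground length of one image
    return footprint_m * (1.0 - overlap)

for gsd in (0.75, 1.26):
    print(gsd, "cm/px ->",
          round(flight_altitude_m(gsd), 2), "m altitude,",
          round(photo_spacing_m(gsd, 0.85), 2), "m between photos")
```

Under these assumed constants, the reported GSD range corresponds to flying roughly 27 m to 46 m above the roof, with exposures a few metres apart at 85% overlap.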

One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Ezzeldin Shereen,Dan Ristea,Burak Hasircioglu,Shae McFadden,Vasilios Mavroudis,Chris Hicks

Task: Study poisoning attacks on multimodal retrieval-augmented generation (M-RAG), specifically in visual document retrieval applications.

Motivation: M-RAG suppresses large-model hallucinations through a knowledge base, but it also introduces new attack vectors, in particular the injection of malicious entries into the knowledge base.

Details

Method: Propose a poisoning attack on M-RAG that crafts a single universal image to influence the output for a wide variety of queries. Result: The attack is effective against a range of widely used retrievers and generators but can be ineffective against robust embedding models. Conclusion: Reveals the vulnerability of M-RAG pipelines to poisoning attacks and points to a potentially fundamental weakness. Abstract: Multimodal retrieval augmented generation (M-RAG) has recently emerged as a method to inhibit hallucinations of large multimodal models (LMMs) through a factual knowledge base (KB). However, M-RAG also introduces new attack vectors for adversaries that aim to disrupt the system by injecting malicious entries into the KB. In this work, we present a poisoning attack against M-RAG targeting visual document retrieval applications, where the KB contains images of document pages. Our objective is to craft a single image that is retrieved for a variety of different user queries, and consistently influences the output produced by the generative model, thus creating a universal denial-of-service (DoS) attack against the M-RAG system. We demonstrate that while our attack is effective against a diverse range of widely-used, state-of-the-art retrievers (embedding models) and generators (LMMs), it can also be ineffective against robust embedding models. Our attack not only highlights the vulnerability of M-RAG pipelines to poisoning attacks, but also sheds light on a fundamental weakness that potentially hinders their performance even in benign settings.

Multivariate Temporal Regression at Scale: A Three-Pillar Framework Combining ML, XAI, and NLP

Jiztom Kavalakkatt Francis,Matthew J Darr

Task: Examine the challenges of high-dimensional data analysis and propose ways to simplify models.

Motivation: Traditional data analysis methods can miss complex relationships and are computationally demanding, so more efficient and understandable methods are needed.

Details

Method: Use techniques such as variable removal, statistical analysis, and synthetic data, and propose a global feature extraction and dimensionality reduction method. Result: Dimensionality reduction simplifies the models, reveals relationships between new inputs and outcomes, and helps validate the reliability of the dataset. Conclusion: The proposed method effectively simplifies high-dimensional data analysis, improves model interpretability, and uncovers latent relationships. Abstract: The rapid use of artificial intelligence (AI) in processes such as coding, image processing, and data prediction means it is crucial to understand and validate the data we are working with fully. This paper dives into the hurdles of analyzing high-dimensional data, especially when it gets too complex. Traditional methods in data analysis often look at direct connections between input variables, which can miss out on the more complicated relationships within the data. To address these issues, we explore several tested techniques, such as removing specific variables to see their impact and using statistical analysis to find connections between multiple variables. We also consider the role of synthetic data and how information can sometimes be redundant across different sensors. These analyses are typically very computationally demanding and often require much human effort to make sense of the results. A common approach is to treat the entire dataset as one unit and apply advanced models to handle it. However, this can become problematic with larger, noisier datasets and more complex models. So, we suggest methods to identify overall patterns that can help with tasks like classification or regression based on the idea that more straightforward approaches might be more understandable. Our research looks at two datasets: a real-world dataset and a synthetic one. The goal is to create a methodology that highlights key features on a global scale that lead to predictions, making it easier to validate or quantify the data set. By reducing the dimensionality with this method, we can simplify the models used and thus clarify the insights we gain. Furthermore, our method can reveal unexplored relationships between specific inputs and outcomes, providing a way to validate these new connections further.

Preference-Driven Active 3D Scene Representation for Robotic Inspection in Nuclear Decommissioning

Zhen Meng,Kan Chen,Xiangmin Xu,Erwin Jose Lopez Pulgarin,Emma Li,Philip G. Zhao,David Flynn

Task: Propose a new method that integrates expert operator preferences into an active 3D scene representation framework.

Motivation: Traditional methods focus on geometric fidelity or rendering accuracy but overlook operator-specific objectives (such as safety-critical coverage or task-driven viewpoints), leading to suboptimal viewpoint selection in constrained environments such as nuclear decommissioning.

Details

Method: Use Reinforcement Learning from Human Feedback (RLHF) to guide robotic path planning, and capture operator-specific priorities through interactive choice experiments. Result: In a nuclear decommissioning scenario, the method outperforms baselines, improving scene representation while optimizing trajectory efficiency; the RLHF-based policy does better on task-critical details. Conclusion: By combining explicit 3D geometric modeling with implicit human-in-the-loop optimization, the work lays a foundation for adaptive, safety-critical robotic perception systems and advances automation in high-risk environments. Abstract: Active 3D scene representation is pivotal in modern robotics applications, including remote inspection, manipulation, and telepresence. Traditional methods primarily optimize geometric fidelity or rendering accuracy, but often overlook operator-specific objectives, such as safety-critical coverage or task-driven viewpoints. This limitation leads to suboptimal viewpoint selection, particularly in constrained environments such as nuclear decommissioning. To bridge this gap, we introduce a novel framework that integrates expert operator preferences into the active 3D scene representation pipeline. Specifically, we employ Reinforcement Learning from Human Feedback (RLHF) to guide robotic path planning, reshaping the reward function based on expert input. To capture operator-specific priorities, we conduct interactive choice experiments that evaluate user preferences in 3D scene representation. We validate our framework using a UR3e robotic arm for reactor tile inspection in a nuclear decommissioning scenario. Compared to baseline methods, our approach enhances scene representation while optimizing trajectory efficiency. The RLHF-based policy consistently outperforms random selection, prioritizing task-critical details. By unifying explicit 3D geometric modeling with implicit human-in-the-loop optimization, this work establishes a foundation for adaptive, safety-critical robotic perception systems, paving the way for enhanced automation in nuclear decommissioning, remote maintenance, and other high-risk environments.

Neural Style Transfer for Synthesising a Dataset of Ancient Egyptian Hieroglyphs

Lewis Matheson Creed

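Task: Propose a new method that uses Neural Style Transfer (NST) to generate a dataset of ancient Egyptian hieroglyphs.

Motivation: Training data for low-resource languages such as Ancient Egyptian is limited, which makes applying machine learning techniques difficult.

Details

Method: Generate a dataset of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Result: Experiments show that models trained on NST-generated data and on real photographs perform comparably on classification and generalize to unseen real images of hieroglyphs. Conclusion: NST is an effective data augmentation method for addressing the shortage of training data for low-resource languages. Abstract: The limited availability of training data for low-resource languages makes applying machine learning techniques challenging. Ancient Egyptian is one such language with few resources. However, innovative applications of data augmentation methods, such as Neural Style Transfer, could overcome these barriers. This paper presents a novel method for generating datasets of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Experimental results found that image classification models trained on NST-generated examples and photographs demonstrate equal performance and transferability to real unseen images of hieroglyphs.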

Image Coding for Machines via Feature-Preserving Rate-Distortion Optimization

Samuel Fernández-Menduiña,Eduardo Pavez,Antonio Ortega

Task: Optimize image and video compression to account for both visual quality and downstream computer vision task performance.

Motivation: Many images and videos are processed mainly by computer vision algorithms and only occasionally inspected by humans, so compression must optimize for both visual quality and task performance.

Details

Method: Propose a rate-distortion optimization (RDO) method based on feature distance, approximating the feature distance through a Taylor expansion and the Jacobian matrix, and introducing input-dependent squared error (IDSE) and Jacobian sketches to reduce computational complexity. Result: With AVC, the method saves up to 10% bit-rate at the same computer vision accuracy compared with SSE-based RDO, with no added decoder complexity and only a 7% increase in encoder complexity. Conclusion: The method balances compression efficiency and computer vision task performance and has practical value. Abstract: Many images and videos are primarily processed by computer vision algorithms, involving only occasional human inspection. When this content requires compression before processing, e.g., in distributed applications, coding methods must optimize for both visual quality and downstream task performance. We first show that, given the features obtained from the original and the decoded images, an approach to reduce the effect of compression on a task loss is to perform rate-distortion optimization (RDO) using the distance between features as a distortion metric. However, directly optimizing such a rate-distortion trade-off requires an iterative workflow of encoding, decoding, and feature evaluation for each coding parameter, which is computationally impractical. We address this problem by simplifying the RDO formulation to make the distortion term computable using block-based encoders. We first apply Taylor's expansion to the feature extractor, recasting the feature distance as a quadratic metric with the Jacobian matrix of the neural network. Then, we replace the linearized metric with a block-wise approximation, which we call input-dependent squared error (IDSE). To reduce computational complexity, we approximate IDSE using Jacobian sketches. The resulting loss can be evaluated block-wise in the transform domain and combined with the sum of squared errors (SSE) to address both visual quality and computer vision performance. Simulations with AVC across multiple feature extractors and downstream neural networks show up to 10% bit-rate savings for the same computer vision accuracy compared to RDO based on SSE, with no decoder complexity overhead and just a 7% encoder complexity increase.
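
The Taylor-expansion step can be checked numerically on a toy feature extractor. The sketch below uses f(x) = tanh(Wx), whose Jacobian is diag(1 - tanh²(Wx))·W, and compares the true squared feature distance with the quadratic form ||J e||² for a small distortion e. The extractor, the matrix `W`, and the vectors are assumptions for illustration; the block-wise IDSE approximation and the Jacobian sketching described in the abstract are not implemented here.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def features(w, x):
    """Toy feature extractor f(x) = tanh(W x)."""
    return [math.tanh(p) for p in matvec(w, x)]

def jacobian(w, x):
    """Jacobian of f at x: each row of W scaled by 1 - tanh^2 of its pre-activation."""
    pre = matvec(w, x)
    return [[(1.0 - math.tanh(p) ** 2) * wij for wij in row]
            for p, row in zip(pre, w)]

W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]
x = [1.0, 0.5, -1.0]      # "original" input
e = [1e-3, -2e-3, 1e-3]   # small compression distortion
x_dec = [xi + ei for xi, ei in zip(x, e)]

# Exact feature distance: needs a forward pass on the decoded input.
exact = sum((a - b) ** 2 for a, b in zip(features(W, x_dec), features(W, x)))
# Linearized distance: a quadratic form in e using only the Jacobian at x.
linearized = sum(v ** 2 for v in matvec(jacobian(W, x), e))

rel_err = abs(exact - linearized) / exact
print(exact, linearized, rel_err)
```

For small distortions the two quantities agree closely, which is what allows an encoder to score candidate coding choices with the quadratic form instead of re-running the feature extractor on every decoded candidate.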

APSeg: Auto-Prompt Model with Acquired and Injected Knowledge for Nuclear Instance Segmentation and Classification

Liying Xu,Hongliang He,Wei Han,Hanbin Huang,Siwei Feng,Guohong Fu

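Task: Propose APSeg, an auto-prompt model for nuclear instance segmentation and classification that addresses SAM's reliance on precise prompts and its class-agnostic design.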

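Motivation: Nuclear instance segmentation and classification play an important role in digital pathology diagnosis, but the existing SAM model depends on precise prompts and cannot classify, so improvement is needed.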

Details

Method: APSeg contains two knowledge-aware modules: DG-POM (which learns distribution knowledge through density maps) and CK-SIM (which injects morphological knowledge from category descriptions). Result: Experiments on the PanNuke and CoNSeP datasets demonstrate the effectiveness of APSeg. Conclusion: By automatically generating more accurate prompts, APSeg significantly improves nuclear instance segmentation and classification performance. Abstract: Nuclear instance segmentation and classification provide critical quantitative foundations for digital pathology diagnosis. With the advent of the foundational Segment Anything Model (SAM), the accuracy and efficiency of nuclear segmentation have improved significantly. However, SAM imposes a strong reliance on precise prompts, and its class-agnostic design renders its classification results entirely dependent on the provided prompts. Therefore, we focus on generating prompts with more accurate localization and classification and propose APSeg, an Auto-Prompt model with acquired and injected knowledge for nuclear instance Segmentation and classification. APSeg incorporates two knowledge-aware modules: (1) Distribution-Guided Proposal Offset Module (DG-POM), which learns distribution knowledge through density-map guidance, and (2) Category Knowledge Semantic Injection Module (CK-SIM), which injects morphological knowledge derived from category descriptions. We conducted extensive experiments on the PanNuke and CoNSeP datasets, demonstrating the effectiveness of our approach. The code will be released upon acceptance.

LLM-Guided Evolution: An Autonomous Model Optimization for Object Detection

YiMing Yu,Jason Zutty

Task: Optimize the architecture of YOLO models by improving the LLM-GE framework, in order to raise object detection performance on the KITTI dataset.

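Motivation: Traditional neural architecture search (NAS) requires extensive trial and error and domain knowledge, and evolutionary algorithms rely on fixed rules and predefined building blocks. The LLM-GE framework uses large language models (LLMs) to modify model source code directly and to guide mutations and crossovers intelligently, giving a more flexible approach to automated machine learning.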

Details

Method: Uses the LLM-GE framework together with the "Evolution of Thought" (EoT) technique, which iteratively refines the LLM's decisions through feedback loops, to adjust the design and parameters of YOLO models for detection accuracy and speed. Result: The YOLO variants produced by LLM-GE perform markedly better on the KITTI dataset; for example, mean Average Precision (mAP) rises from 92.5% to 94.5%. Conclusion: By combining LLM-driven reasoning with evolutionary strategies, the LLM-GE framework offers a flexible and efficient new paradigm for automated machine learning that is applicable to real-world challenges. Abstract: In machine learning, Neural Architecture Search (NAS) requires domain knowledge of model design and a large amount of trial-and-error to achieve promising performance. Meanwhile, evolutionary algorithms have traditionally relied on fixed rules and pre-defined building blocks. The Large Language Model (LLM)-Guided Evolution (GE) framework transformed this approach by incorporating LLMs to directly modify model source code for image classification algorithms on CIFAR data and intelligently guide mutations and crossovers. A key element of LLM-GE is the "Evolution of Thought" (EoT) technique, which establishes feedback loops, allowing LLMs to refine their decisions iteratively based on how previous operations performed. In this study, we perform NAS for object detection by improving LLM-GE to modify the architecture of You Only Look Once (YOLO) models to enhance performance on the KITTI dataset. Our approach intelligently adjusts the design and settings of YOLO to find the optimal algorithms against objectives such as detection accuracy and speed. We show that LLM-GE produced variants with significant performance improvements, such as an increase in Mean Average Precision from 92.5% to 94.5%. This result highlights the flexibility and effectiveness of LLM-GE on real-world challenges, offering a novel paradigm for automated machine learning that combines LLM-driven reasoning with evolutionary strategies.

Towards Assessing Deep Learning Test Input Generators

Seif Mzoughi,Ahmed Hajyahmed,Mohamed Elshafei,Foutse Khomh,Diego Elias Costa

Task: Comprehensively evaluate four state-of-the-art test input generators (TIGs) along several key dimensions.

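Motivation: Deep learning systems are increasingly deployed in safety-critical applications, where robustness problems can cause serious failures, yet existing evaluations of TIGs are not comprehensive.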

Details

Method: Evaluates the four TIGs using three pre-trained models (LeNet-5, VGG16, and EfficientNetB3) on datasets of varying complexity (MNIST, CIFAR-10, and ImageNet-1K). Result: Reveals trade-offs among the TIGs in robustness-revealing capability, diversity of generated test cases, and computational efficiency, and shows that performance varies significantly with dataset complexity. Conclusion: Provides practical guidance for choosing a TIG according to specific objectives and dataset characteristics, while noting that TIGs still need further improvement before they suit real-world safety-critical systems. Abstract: Deep Learning (DL) systems are increasingly deployed in safety-critical applications, yet they remain vulnerable to robustness issues that can lead to significant failures. While numerous Test Input Generators (TIGs) have been developed to evaluate DL robustness, a comprehensive assessment of their effectiveness across different dimensions is still lacking. This paper presents a comprehensive assessment of four state-of-the-art TIGs--DeepHunter, DeepFault, AdvGAN, and SinVAD--across multiple critical aspects: fault-revealing capability, naturalness, diversity, and efficiency. Our empirical study leverages three pre-trained models (LeNet-5, VGG16, and EfficientNetB3) on datasets of varying complexity (MNIST, CIFAR-10, and ImageNet-1K) to evaluate TIG performance. Our findings reveal important trade-offs in robustness revealing capability, variation in test case generation, and computational efficiency across TIGs. The results also show that TIG performance varies significantly with dataset complexity, as tools that perform well on simpler datasets may struggle with more complex ones. In contrast, others maintain steadier performance or better scalability. This paper offers practical guidance for selecting appropriate TIGs aligned with specific objectives and dataset characteristics. Nonetheless, more work is needed to address TIG limitations and advance TIGs for real-world, safety-critical systems.

Determining Sphere Radius through Pairwise Distances

Boris Sukhovilov

Task: Propose a new method for determining the radius of a sphere from distances measured between points on its surface.

Motivation: Compute the sphere radius accurately when the distance measurements contain errors and the sphere deviates randomly from its ideal shape.

Details

Method: Using the minimum of four points as well as an arbitrary number N of points, derives a closed-form solution from the matrix of pairwise distances and determines the standard deviation of the radius estimate. Result: Presents a closed-form solution for the sphere radius and finds the point configurations that minimize the standard deviation of the radius estimate. Conclusion: Backed by full mathematical derivations and an open-source implementation, the method is an effective tool for precise measurement of sphere radius. Abstract: We propose a novel method for determining the radius of a spherical surface based on the distances measured between points on this surface. We consider the most general case of determining the radius when the distances are measured with errors and the sphere has random deviations from its ideal shape. For the solution, we used the minimally necessary four points and an arbitrary number N of points. We provide a new closed form solution for the radius of the sphere through the matrix of pairwise distances. We also determine the standard deviation of the radius estimate caused by measurement errors and deviations of the sphere from its ideal shape. We found optimal configurations of points on the sphere that provide the minimum standard deviation of the radius estimate. This paper describes our solution and provides all the mathematical derivations. We share the implementation of our method as open source code at https://github.com/boris-sukhovilov/Sphere_Radius.
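
To make the idea of a closed form "through the matrix of pairwise distances" concrete, here is a minimal sketch for the noise-free four-point case. It uses a classical identity rather than the paper's own estimator, which the abstract does not spell out: if D is the matrix of squared pairwise distances between four non-coplanar points on a sphere, then R^2 = 1 / (2 * 1^T D^{-1} 1). The function names are illustrative, and measurement error is not handled.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def radius_from_distances(dist):
    """Sphere radius from a 4x4 matrix of pairwise distances (noise-free)."""
    D = [[d * d for d in row] for row in dist]      # squared distances
    x = solve_linear(D, [1.0] * len(D))             # x = D^{-1} 1
    return math.sqrt(1.0 / (2.0 * sum(x)))


def pairwise_distances(points):
    """Helper for the demo only: distances computed from known coordinates."""
    return [[math.dist(p, q) for q in points] for p in points]


# Four points on a sphere of radius 2.5 centred at (1, 2, 3).
s = 1.0 / math.sqrt(3.0)
directions = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (-s, -s, -s)]
points = [(1 + 2.5 * a, 2 + 2.5 * b, 3 + 2.5 * c) for a, b, c in directions]
radius = radius_from_distances(pairwise_distances(points))
```

Only the distances enter `radius_from_distances`; the coordinates are used solely to fabricate a test configuration.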

MG-Gen: Single Image to Motion Graphics Generation with Layer Decomposition

Takahiro Shirakawa,Tomoyuki Suzuki,Daichi Haraguchi

Task: Propose MG-Gen, a framework that generates motion graphics from a single raster image.

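Motivation: Existing image-to-video generation methods lack active text motion and distort objects when used for motion graphics, and code-based animation generation methods are limited by their need for vector data.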

Details

Method: Decomposes the input image into layered elements, reconstructs them as HTML data, and generates executable JavaScript code. Result: Experiments show that MG-Gen generates motion graphics while preserving text readability and consistency with the input. Conclusion: Combining layer decomposition with animation code generation is an effective strategy for motion graphics generation. Abstract: General image-to-video generation methods often produce suboptimal animations that do not meet the requirements of animated graphics, as they lack active text motion and exhibit object distortion. Also, code-based animation generation methods typically require layer-structured vector data which are often not readily available for motion graphic generation. To address these challenges, we propose a novel framework named MG-Gen that reconstructs data in vector format from a single raster image to extend the capabilities of code-based methods to enable motion graphics generation from a raster image in the framework of general image-to-video generation. MG-Gen first decomposes the input image into layer-wise elements, reconstructs them as HTML format data and then generates executable JavaScript code for the reconstructed HTML data. We experimentally confirm that MG-Gen generates motion graphics while preserving text readability and input consistency. These successful results indicate that combining layer decomposition and animation code generation is an effective strategy for motion graphics generation.

HPGN: Hybrid Priors-Guided Network for Compressed Low-Light Image Enhancement

Hantang Li,Jinhua Hao,Lei Xiong,Shuyuan Zhu

Task: Propose a hybrid priors-guided network (HPGN) for enhancing compressed low-light images.

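Motivation: Existing methods either neglect the removal of compression artifacts during enhancement or fail to establish a unified joint-task enhancement framework for images of differing compression quality.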

Details

Method: Integrates compression and illumination priors, using the JPEG quality factor (QF) and the DCT quantization matrix (QM) to design efficient joint-task plug-and-play modules, and guides model training with a random QF generation strategy. Result: Experimental results show the superiority of the proposed method. Conclusion: HPGN effectively enhances low-light images across different compression levels. Abstract: In practical applications, conventional methods generate large volumes of low-light images that require compression for efficient storage and transmission. However, most existing methods either disregard the removal of potential compression artifacts during the enhancement process or fail to establish a unified framework for joint task enhancement of images with varying compression qualities. To solve this problem, we propose the hybrid priors-guided network (HPGN), which enhances compressed low-light images by integrating both compression and illumination priors. Our approach fully utilizes the JPEG quality factor (QF) and DCT quantization matrix (QM) to guide the design of efficient joint task plug-and-play modules. Additionally, we employ a random QF generation strategy to guide model training, enabling a single model to enhance images across different compression levels. Experimental results confirm the superiority of our proposed method.
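
As background for the two compression priors named in this abstract, the sketch below shows how a JPEG quality factor maps to a luminance quantization matrix under the widely used libjpeg (IJG) scaling convention, and how a quality factor might be drawn at random per training sample. Whether HPGN uses this exact scaling, and the sampling range of 10 to 90, are assumptions made here for illustration; the abstract does not state them.

```python
import random

# Example luminance quantization table from the JPEG standard (Annex K).
BASE_LUMA_QM = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quant_matrix(qf):
    """Scale the base table for quality factor qf (1..100), libjpeg style."""
    if not 1 <= qf <= 100:
        raise ValueError("quality factor must be in 1..100")
    scale = 5000 // qf if qf < 50 else 200 - 2 * qf
    # Round to nearest integer and clamp to the baseline range 1..255.
    return [[min(255, max(1, (v * scale + 50) // 100)) for v in row]
            for row in BASE_LUMA_QM]


def sample_training_qf(rng, low=10, high=90):
    """Draw a random QF for one training sample (range is a hypothetical choice)."""
    return rng.randint(low, high)


rng = random.Random(0)
qf = sample_training_qf(rng)
qm = quant_matrix(qf)   # 8x8 matrix that could be fed to a network as a prior
```

Lower quality factors give larger quantization steps, which is why the matrix carries information about how severe the compression artifacts are likely to be.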

Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge

Yudi Sang,Yanzhen Liu,Sutuke Yibulayimu,Yunning Wang,Benjamin D. Killeen,Mingxu Liu,Ping-Cheng Ku,Ole Johannsen,Karol Gotkowski,Maximilian Zenk,Klaus Maier-Hein,Fabian Isensee,Peiyan Yue,Yi Wang,Haidong Yu,Zhaohong Pan,Yutong He,Xiaokun Liang,Daiqi Liu,Fuxin Fan,Artur Jurgas,Andrzej Skalski,Yuxi Ma,Jing Yang,Szymon Płotka,Rafał Litka,Gang Zhu,Yingchun Song,Mathias Unberath,Mehran Armand,Dan Ruan,S. Kevin Zhou,Qiyong Cao,Chunpeng Zhao,Xinbao Wu,Yu Wang

Task: Evaluate and compare, through the PENGWIN challenge, automated pelvic fracture fragment segmentation algorithms on CT and X-ray images.

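Motivation: Accurate segmentation of pelvic fracture fragments is crucial for trauma diagnosis, surgical planning, and intraoperative guidance, but complex anatomy and imaging limitations make the task difficult.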

Details

Method: Collected 150 CT scans, generated simulated X-ray images with the DeepDRR method, and evaluated the algorithms submitted by 16 teams. Result: The best CT algorithm achieved an average IoU of 0.930, whereas the best X-ray algorithm achieved an IoU of 0.774, indicating that the X-ray task is harder. Conclusion: Although the algorithms were methodologically diverse, uncertainty in fragment definition remains unresolved, and interactive segmentation approaches may improve model reliability and clinical applicability. Abstract: The segmentation of pelvic fracture fragments in CT and X-ray images is crucial for trauma diagnosis, surgical planning, and intraoperative guidance. However, accurately and efficiently delineating the bone fragments remains a significant challenge due to complex anatomy and imaging limitations. The PENGWIN challenge, organized as a MICCAI 2024 satellite event, aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms on these complex tasks. A diverse dataset of 150 CT scans was collected from multiple clinical centers, and a large set of simulated X-ray images was generated using the DeepDRR method. Final submissions from 16 teams worldwide were evaluated under a rigorous multi-metric testing scheme. The top-performing CT algorithm achieved an average fragment-wise intersection over union (IoU) of 0.930, demonstrating satisfactory accuracy. However, in the X-ray task, the best algorithm attained an IoU of 0.774, highlighting the greater challenges posed by overlapping anatomical structures. Beyond the quantitative evaluation, the challenge revealed methodological diversity in algorithm design. Variations in instance representation, such as primary-secondary classification versus boundary-core separation, led to differing segmentation strategies. Despite promising results, the challenge also exposed inherent uncertainties in fragment definition, particularly in cases of incomplete fractures. These findings suggest that interactive segmentation approaches, integrating human decision-making with task-relevant information, may be essential for improving model reliability and clinical applicability.
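
The headline metric, fragment-wise intersection over union, can be sketched as follows. This is a simplified illustration, not the challenge's evaluation code: label maps are flattened lists of integer fragment ids with 0 as background, and predicted ids are assumed to be already matched to ground-truth ids, a step the abstract does not describe.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two boolean masks of equal length."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    return inter / union if union else 1.0


def mean_fragment_iou(pred_labels, true_labels):
    """Average IoU over the ground-truth fragment ids (0 is background)."""
    fragment_ids = sorted(set(true_labels) - {0})
    if not fragment_ids:
        raise ValueError("ground truth contains no fragments")
    scores = [
        iou([p == k for p in pred_labels], [t == k for t in true_labels])
        for k in fragment_ids
    ]
    return sum(scores) / len(scores)


true_labels = [0, 1, 1, 2, 2, 2]
pred_labels = [0, 1, 0, 2, 2, 1]
score = mean_fragment_iou(pred_labels, true_labels)
# Fragment 1: intersection 1, union 3; fragment 2: intersection 2, union 3.
```

Averaging per fragment rather than per pixel keeps small fragments from being swamped by large ones.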

Translation of Fetal Brain Ultrasound Images into Pseudo-MRI Images using Artificial Intelligence

Naomi Silverstein,Efrat Leibowitz,Ron Beloosesky,Haim Azhari

Task: Use artificial intelligence to convert ultrasound images into MRI-like images, improving the visual discrimination of fetal brain tissue.

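Motivation: Ultrasound has limitations for fetal brain assessment, particularly in the third trimester, while MRI offers high image quality but is expensive and less accessible.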

Details

Method: Proposes "Dual Diffusion Imposed Correlation" (DDIC), a diffusion-model-based method that assumes ultrasound and MRI images share a latent space. Result: The generated pseudo-MRI images show notably better visual discrimination of brain tissue, especially in the lateral ventricles and the Sylvian fissure, and outperform other methods on several evaluation metrics. Conclusion: The method has the potential to streamline diagnosis and improve clinical outcomes through better image representation. Abstract: Ultrasound is a widely accessible and cost-effective medical imaging tool commonly used for prenatal evaluation of the fetal brain. However, it has limitations, particularly in the third trimester, where the complexity of the fetal brain requires high image quality for extracting quantitative data. In contrast, magnetic resonance imaging (MRI) offers superior image quality and tissue differentiation but is less available, expensive, and requires time-consuming acquisition. Thus, transforming ultrasonic images into an MRI-mimicking display may be advantageous and allow better tissue anatomy presentation. To address this goal, we have examined the use of artificial intelligence, implementing a diffusion model renowned for generating high-quality images. The proposed method, termed "Dual Diffusion Imposed Correlation" (DDIC), leverages a diffusion-based translation methodology, assuming a shared latent space between ultrasound and MRI domains. The model was trained using the "HC18" dataset for ultrasound and the "CRL fetal brain atlas" along with the "FeTA" datasets for MRI. The generated pseudo-MRI images provide notable improvements in visual discrimination of brain tissue, especially in the lateral ventricles and the Sylvian fissure, characterized by enhanced contrast clarity. Improvement was demonstrated in mutual information, peak signal-to-noise ratio, Fréchet Inception Distance, and contrast-to-noise ratio. Findings from these evaluations indicate statistically significant superior performance of the DDIC compared to other translation methodologies. In addition, a Medical Opinion Test was conducted with 5 gynecologists. The results demonstrated display improvement in 81% of the tested images. In conclusion, the presented pseudo-MRI images hold the potential for streamlining diagnosis and enhancing clinical outcomes through improved representation.

Estimating Scene Flow in Robot Surroundings with Distributed Miniaturized Time-of-Flight Sensors

Jack Sander,Giammarco Caroleo,Alessandro Albini,Perla Maiolino

Task: Propose a method for estimating scene flow from low-density, noisy point clouds, to improve robot safety and reactivity.

Motivation: Tracking the motion of humans and objects around a robot is essential for making robot motion safer.

Details

Method: Clusters the point clouds of consecutive frames and applies the Iterative Closest Point (ICP) algorithm to estimate a dense motion flow, combined with a fitness-based classification and an inlier removal strategy to reduce the impact of noise and low-density data. Result: Experiments show that the method estimates the direction and magnitude of motion with an error in line with sensor noise. Conclusion: The proposed method estimates scene flow effectively from low-density, noisy point clouds and is suited to robot motion safety applications. Abstract: Tracking motions of humans or objects in the surroundings of the robot is essential to improve safe robot motions and reactions. In this work, we present an approach for scene flow estimation from low-density and noisy point clouds acquired from miniaturized Time of Flight (ToF) sensors distributed on the robot body. The proposed method clusters points from consecutive frames and applies Iterative Closest Point (ICP) to estimate a dense motion flow, with additional steps introduced to mitigate the impact of sensor noise and low-density data points. Specifically, we employ a fitness-based classification to distinguish between stationary and moving points and an inlier removal strategy to refine geometric correspondences. The proposed approach is validated in an experimental setup where 24 ToF sensors are used to estimate the velocity of an object moving at different controlled speeds. Experimental results show that the method consistently approximates the direction of the motion and its magnitude with an error which is in line with sensor noise.

RASP: Revisiting 3D Anamorphic Art for Shadow-Guided Packing of Irregular Objects

Soumyaratna Debnath,Ashish Tiwari,Kaustubh Sadekar,Shanmuganathan Raman

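Task: Arrange arbitrarily shaped 3D objects within a bounded volume via shadow-guided optimization, aiming for minimal inter-object spacing and near-maximal occupancy.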

Motivation: Draw on the principles of 3D anamorphic art to explore how computational models can arrange and assemble 3D objects efficiently.

Details

Method: Proposes the RASP framework, based on differentiable rendering and shadow-guided optimization, with a signed distance function (SDF) formulation to handle inter-object intersection and container extrusion. Result: RASP extends to part assembly and, in multi-view anamorphic art, achieves meaningful expressions from multiple viewpoints. Conclusion: RASP offers an efficient and flexible approach to 3D object arrangement and assembly, suited to both artistic creation and practical applications. Abstract: Recent advancements in learning-based methods have opened new avenues for exploring and interpreting art forms, such as shadow art, origami, and sketch art, through computational models. One notable visual art form is 3D Anamorphic Art in which an ensemble of arbitrarily shaped 3D objects creates a realistic and meaningful expression when observed from a particular viewpoint and loses its coherence over the other viewpoints. In this work, we build on insights from 3D Anamorphic Art to perform 3D object arrangement. We introduce RASP, a differentiable-rendering-based framework to arrange arbitrarily shaped 3D objects within a bounded volume via shadow (or silhouette)-guided optimization with an aim of minimal inter-object spacing and near-maximal occupancy. Furthermore, we propose a novel SDF-based formulation to handle inter-object intersection and container extrusion. We demonstrate that RASP can be extended to part assembly alongside object packing considering 3D objects to be "parts" of another 3D object. Finally, we present artistic illustrations of multi-view anamorphic art, achieving meaningful expressions from multiple viewpoints within a single ensemble.

Adaptive path planning for efficient object search by UAVs in agricultural fields

Rick van Essen,Eldert van Henten,Lammert Kooistra,Gert Kootstra

Task: Develop an adaptive path planner for UAV object search in agricultural fields.

Motivation: Make UAV object search in agricultural fields more efficient, especially when the objects are non-uniformly distributed.

Details

Method: Combines a high-altitude coverage flight path with low-altitude inspection paths, using the uncertainty of a YOLOv8 detection network to trigger the additional inspections. Result: The adaptive path planner performs better when objects are non-uniformly distributed, giving a shorter path with detection accuracy comparable to that of a coverage path. Conclusion: The adaptive path planner is efficient and robust for object search in agricultural fields and works across different object distributions. Abstract: This paper presents an adaptive path planner for object search in agricultural fields using UAVs. The path planner uses a high-altitude coverage flight path and plans additional low-altitude inspections when the detection network is uncertain. The path planner was evaluated in an offline simulation environment containing real-world images. We trained a YOLOv8 detection network to detect artificial plants placed in grass fields to showcase the potential of our path planner. We evaluated the effect of different detection certainty measures, optimized the path planning parameters, and investigated the effects of localization errors and different numbers of objects in the field. The YOLOv8 detection confidence worked best to differentiate between true and false positive detections and was therefore used in the adaptive planner. The optimal parameters of the path planner depended on the distribution of objects in the field. When the objects were uniformly distributed, more low-altitude inspections were needed compared to a non-uniform distribution of objects, resulting in a longer path length. The adaptive planner proved to be robust against localization uncertainty. When increasing the number of objects, the flight path length increased, especially when the objects were uniformly distributed. When the objects were non-uniformly distributed, the adaptive path planner yielded a shorter path than a low-altitude coverage path, even with a high number of objects. Overall, the presented adaptive path planner made it possible to find non-uniformly distributed objects in a field faster than a coverage path planner and resulted in a comparable detection accuracy. The path planner is made available at https://github.com/wur-abe/uav_adaptive_planner.
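
The core decision in this planner, revisiting a location at low altitude only when the detector is unsure, can be sketched in a few lines. The code below is a hypothetical illustration and not the released planner: the two confidence thresholds, the function names, and the greedy nearest-neighbour ordering of the extra waypoints are all assumptions made here.

```python
import math


def uncertain_detections(detections, reject_below=0.2, accept_above=0.8):
    """Keep positions whose confidence is neither clearly low nor clearly high.

    detections: list of ((x, y), confidence) pairs from the high-altitude pass.
    """
    return [pos for pos, conf in detections if reject_below <= conf < accept_above]


def order_inspections(start, targets):
    """Order low-altitude inspection waypoints greedily by nearest neighbour."""
    remaining = list(targets)
    route, current = [], start
    while remaining:
        nearest = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nearest)
        route.append(nearest)
        current = nearest
    return route


detections = [((0, 0), 0.95), ((5, 5), 0.50), ((1, 1), 0.40), ((9, 9), 0.10)]
route = order_inspections((0, 0), uncertain_detections(detections))
```

With few uncertain detections the extra route stays short, which matches the abstract's observation that the adaptive planner pays off most when objects are non-uniformly distributed.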

Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision

Xiaofeng Han,Shunpeng Chen,Zenghuang Fu,Zhe Feng,Lue Fan,Dong An,Changwei Wang,Li Guo,Weiliang Meng,Xiaopeng Zhang,Rongtao Xu,Shibiao Xu

Task: Systematically survey the application of multimodal fusion and vision-language models to robot vision tasks.

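Motivation: Robot vision has benefited from advances in multimodal fusion techniques and vision-language models, but a comprehensive analysis and comparison of how these techniques are applied has been missing.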

Details

Method: Through systematic review and comparative analysis, assesses multimodal fusion methods and vision-language models on tasks such as semantic scene understanding, SLAM, and 3D object detection. Result: Summarizes the advantages, limitations, and synergies of multimodal fusion and analyzes the applicability and challenges of commonly used datasets. Conclusion: Proposes future research directions, including self-supervised learning, transformer-based fusion architectures, and scalable multimodal frameworks, and provides a reference for multimodal perception and interaction in robot vision. Abstract: Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We systematically review the applications of multimodal fusion in key robotic vision tasks, including semantic scene understanding, simultaneous localization and mapping (SLAM), 3D object detection, navigation and localization, and robot manipulation. We compare VLMs based on large language models (LLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Furthermore, we identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation, and propose future research directions, including self-supervised learning for robust multimodal representations, transformer-based fusion architectures, and scalable multimodal frameworks. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

Yan Ma,Steffi Chern,Xuyang Shen,Yiran Zhong,Pengfei Liu

Task: Propose a transparent, from-scratch reinforcement learning framework for vision-language models (VLMs) and validate its effectiveness.

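Motivation: Existing applications of reinforcement learning to vision-language models rely on complex frameworks and lack reproducibility and standardized evaluation protocols, making it hard to compare results or interpret training dynamics.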

Details

Method: Designs a minimal yet functional four-step pipeline, validated across multiple models and datasets, and proposes a standardized evaluation scheme. Result: Experiments find that response length is sensitive to random seeds, that reflection correlates with output length, and that reinforcement learning generalizes better than supervised fine-tuning (SFT). Conclusion: The framework and findings aim to establish a reproducible baseline and support broader RL-based research on vision-language models. Abstract: Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.

Efficient Model Editing with Task-Localized Sparse Fine-tuning

Leonardo Iurada,Marco Ciccone,Tatiana Tommasi

Task: Propose TaLoS, a method for building sparse task vectors that improves the efficiency and disentanglement of model editing.

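Motivation: Existing methods rely on network linearization, which creates computational bottlenecks and does not guarantee weight disentanglement, limiting conflict-free composition of task vectors.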

Details

Method: Identifies the subset of parameters in a pre-trained model that has low gradient sensitivity and sparsely updates only those parameters to promote weight disentanglement. Result: TaLoS is more efficient in training and inference than existing methods and performs better on task addition and task negation. Conclusion: Through modular parameter editing, TaLoS offers a practical route to deploying adaptable foundation models in real-world applications. Abstract: Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which builds sparse task vectors with minimal interference without requiring explicit linearization or sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters promotes weight disentanglement during fine-tuning. Our experiments show that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
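
The mechanism summarized in this abstract (pick the parameters with the lowest gradient sensitivity, update only those, and take the difference from the pre-trained weights as the task vector) can be sketched on flat parameter lists. This is an illustrative toy, not the TaLoS code: how sensitivity is scored is not given in the abstract, so the scores are simply an input here, and the function names and `keep_fraction` are hypothetical.

```python
def low_sensitivity_mask(sensitivity, keep_fraction):
    """Mark the keep_fraction of parameters with the smallest sensitivity scores."""
    k = max(1, int(len(sensitivity) * keep_fraction))
    order = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i])
    chosen = set(order[:k])
    return [i in chosen for i in range(len(sensitivity))]


def sparse_step(params, grads, mask, lr):
    """One gradient step that touches only the masked-in parameters."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


def task_vector(finetuned, pretrained):
    """Task vector = fine-tuned weights minus pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]


pretrained = [1.0, 1.0, 1.0, 1.0]
sensitivity = [0.9, 0.1, 0.5, 0.05]     # assumed to be precomputed per parameter
mask = low_sensitivity_mask(sensitivity, keep_fraction=0.5)
finetuned = sparse_step(pretrained, grads=[1.0, 2.0, 3.0, 4.0], mask=mask, lr=0.1)
tau = task_vector(finetuned, pretrained)  # zero wherever the mask is False
```

Because the resulting task vector is zero outside the mask, adding or negating it edits only the selected coordinates of the model.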

Towards Computation- and Communication-efficient Computational Pathology

Chu Han,Bingchao Zhao,Jiatai Lin,Shanshan Lyu,Longfei Wang,Tianpeng Deng,Cheng Lu,Changhong Liang,Hannah Y. Wen,Xiaojing Guo,Zhenwei Shi,Zaiyi Liu

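Task: Propose MAGA-GLTrans, a computation- and communication-efficient framework that addresses the diagnostic efficiency challenges of current computational pathology models.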

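Motivation: Current computational pathology models depend on high-magnification whole-slide image analysis, which makes diagnosis inefficient and limits clinical utility, especially in time-sensitive diagnostic scenarios.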

Details

Method: Proposes a self-supervised learning mechanism called magnification alignment (MAGA) that aligns the feature representations of low- and high-magnification images, reducing computation time and storage requirements. Result: MAGA-GLTrans performs strongly across several fundamental CPath tasks, with up to a 10.7-fold reduction in computation time and more than a 20-fold reduction in file transfer and storage requirements. Conclusion: MAGA-GLTrans is an efficient and general solution, particularly suited to time-sensitive applications such as intraoperative frozen section diagnosis. Abstract: Despite impressive performance across a wide range of applications, current computational pathology models face significant diagnostic efficiency challenges due to their reliance on high-magnification whole-slide image analysis. This limitation severely compromises their clinical utility, especially in time-sensitive diagnostic scenarios and situations requiring efficient data transfer. To address these issues, we present a novel computation- and communication-efficient framework called Magnification-Aligned Global-Local Transformer (MAGA-GLTrans). Our approach significantly reduces computational time, file transfer requirements, and storage overhead by enabling effective analysis using low-magnification inputs rather than high-magnification ones. The key innovation lies in our proposed magnification alignment (MAGA) mechanism, which employs self-supervised learning to bridge the information gap between low and high magnification levels by effectively aligning their feature representations. Through extensive evaluation across various fundamental CPath tasks, MAGA-GLTrans demonstrates state-of-the-art classification performance while achieving remarkable efficiency gains: up to 10.7 times reduction in computational time and over 20 times reduction in file transfer and storage requirements. Furthermore, we highlight the versatility of our MAGA framework through two significant extensions: (1) its applicability as a feature extractor to enhance the efficiency of any CPath architecture, and (2) its compatibility with existing foundation models and histopathology-specific encoders, enabling them to process low-magnification inputs with minimal information loss. These advancements position MAGA-GLTrans as a particularly promising solution for time-sensitive applications, especially in the context of intraoperative frozen section diagnosis where both accuracy and efficiency are paramount.

Adaptive Frequency Enhancement Network for Remote Sensing Image Semantic Segmentation

Feng Gao,Miao Fu,Jingchao Cao,Junyu Dong,Qian Du

Task: Propose an Adaptive Frequency Enhancement Network (AFENet) for semantic segmentation of high-resolution remote sensing images.

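Motivation: Existing methods struggle to adapt to varying land cover distributions and to strengthen the interaction between spatial and frequency domain features.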

Details

Method: AFENet consists of an Adaptive Frequency and Spatial feature Interaction Module (AFSIM) and a Selective feature Fusion Module (SFM); it dynamically separates and modulates high- and low-frequency features and selectively fuses global and local features. Result: On three public datasets, AFENet outperforms existing methods, and the effectiveness of AFSIM and SFM is validated. Conclusion: By strengthening the interaction between frequency and spatial features, AFENet improves semantic segmentation performance. Abstract: Semantic segmentation of high-resolution remote sensing images plays a crucial role in land-use monitoring and urban planning. Recent remarkable progress in deep learning-based methods makes it possible to generate satisfactory segmentation results. However, existing methods still face challenges in adapting network parameters to various land cover distributions and enhancing the interaction between spatial and frequency domain features. To address these challenges, we propose the Adaptive Frequency Enhancement Network (AFENet), which integrates two key components: the Adaptive Frequency and Spatial feature Interaction Module (AFSIM) and the Selective feature Fusion Module (SFM). AFSIM dynamically separates and modulates high- and low-frequency features according to the content of the input image. It adaptively generates two masks to separate high- and low-frequency components, thereby providing optimal details and contextual supplementary information for ground object feature representation. SFM selectively fuses global context and local detailed features to enhance the network's representation capability. Hence, the interactions between frequency and spatial features are further enhanced. Extensive experiments on three publicly available datasets demonstrate that the proposed AFENet outperforms state-of-the-art methods. In addition, we also validate the effectiveness of AFSIM and SFM in managing diverse land cover types and complex scenarios. Our codes are available at https://github.com/oucailab/AFENet.

BECAME: BayEsian Continual Learning with Adaptive Model MErging

Mei Li,Yuxiang Lu,Qinyan Dai,Suizhi Huang,Yue Ding,Hongtao Lu

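Task: Explore how model merging, grounded in Bayesian continual learning principles, can improve the trade-off between stability and plasticity.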

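Motivation: Balancing stability and plasticity is a key challenge in continual learning. Existing gradient projection methods ensure stability but limit plasticity, while model merging techniques are promising but depend on empirical assumptions and hyperparameter choices.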

Details

Method: Proposes a model merging mechanism based on Bayesian continual learning principles, derives an optimal merging coefficient that adapts to task diversity, and designs BECAME, a two-stage framework that combines gradient projection with adaptive merging. Result: Experiments show that BECAME outperforms existing continual learning methods and merging strategies. Conclusion: Model merging guided by Bayesian principles improves the stability-plasticity trade-off, and the BECAME framework confirms its effectiveness. Abstract: Continual Learning (CL) strives to learn incrementally across tasks while mitigating catastrophic forgetting. A key challenge in CL is balancing stability (retaining prior knowledge) and plasticity (learning new tasks). While representative gradient projection methods ensure stability, they often limit plasticity. Model merging techniques offer promising solutions, but prior methods typically rely on empirical assumptions and carefully selected hyperparameters. In this paper, we explore the potential of model merging to enhance the stability-plasticity trade-off, providing theoretical insights that underscore its benefits. Specifically, we reformulate the merging mechanism using Bayesian continual learning principles and derive a closed-form solution for the optimal merging coefficient that adapts to the diverse characteristics of tasks. To validate our approach, we introduce a two-stage framework named BECAME, which synergizes the expertise of gradient projection and adaptive merging. Extensive experiments show that our approach outperforms state-of-the-art CL methods and existing merging strategies.

Spline-based Transformers

Prashanth Chandran,Agon Serifi,Markus Gross,Moritz Bächer

Task: Propose Spline-based Transformers, a new class of Transformer models that need no positional encoding.

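Motivation: Overcome the limitations of conventional positional encoding, such as sequence length extrapolation, and give users a new way to manipulate the latent space directly.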

Details

Method: Inspired by the use of splines in computer animation, embeds the input sequence as a smooth trajectory in latent space. Result: Outperforms conventional positional encoding on a variety of datasets (synthetic 2D, images, 3D shapes, and animations). Conclusion: Spline-based Transformers offer clear advantages in both performance and user interactivity. Abstract: We introduce Spline-based Transformers, a novel class of Transformer models that eliminate the need for positional encoding. Inspired by workflows using splines in computer animation, our Spline-based Transformers embed an input sequence of elements as a smooth trajectory in latent space. Overcoming drawbacks of positional encoding such as sequence length extrapolation, Spline-based Transformers also provide a novel way for users to interact with transformer latent spaces by directly manipulating the latent control points to create new latent trajectories and sequences. We demonstrate the superior performance of our approach in comparison to conventional positional encoding on a variety of datasets, ranging from synthetic 2D to large-scale real-world datasets of images, 3D shapes, and animations.
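
To illustrate what "a smooth trajectory in latent space" defined by control points can look like, the sketch below evaluates a uniform cubic B-spline over a short list of latent control points. The choice of a cubic B-spline is an assumption made here for illustration; the abstract does not say which spline family the model uses, and the function names are hypothetical.

```python
def bspline_point(ctrl, t):
    """Evaluate a uniform cubic B-spline over control points at t in [0, 1]."""
    n_seg = len(ctrl) - 3
    if n_seg < 1:
        raise ValueError("need at least four control points")
    u = min(max(t, 0.0), 1.0) * n_seg
    seg = min(int(u), n_seg - 1)        # which group of four control points
    s = u - seg                         # local parameter within the segment
    basis = [
        (1 - s) ** 3 / 6,
        (3 * s ** 3 - 6 * s ** 2 + 4) / 6,
        (-3 * s ** 3 + 3 * s ** 2 + 3 * s + 1) / 6,
        s ** 3 / 6,
    ]
    dim = len(ctrl[0])
    return [sum(basis[k] * ctrl[seg + k][d] for k in range(4)) for d in range(dim)]


def sample_trajectory(ctrl, n):
    """Sample n latent vectors along the spline, one per sequence position."""
    if n == 1:
        return [bspline_point(ctrl, 0.0)]
    return [bspline_point(ctrl, i / (n - 1)) for i in range(n)]


# Five 2-D latent control points lying on a straight line.
control_points = [[float(i), 2.0 * i] for i in range(5)]
midpoint = bspline_point(control_points, 0.5)
latents = sample_trajectory(control_points, 8)
```

Position along the sequence enters only through the curve parameter t, so the same control points can be sampled at any number of positions, and moving a control point reshapes the whole trajectory smoothly.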