2025 04 06

From Text to Graph: Leveraging Graph Neural Networks for Enhanced Explainability in NLP

Fabio Yáñez-Romero,Andrés Montoyo,Armando Suárez,Yoan Gutiérrez,Ruslan Mitkov

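Task: Propose a novel method that achieves explainability in natural language processing tasks by automatically converting sentences into graph structures.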

Motivation: Transformer-type models perform well, but their token-based processing and large size make them hard to explain and computationally expensive.

Details

Method: Convert sentences into graph structures whose nodes and relations express fundamental linguistic concepts, preserving semantic integrity.

Result: Experiments show that the method effectively identifies the parts of the text structure most critical to a classification task.

Conclusion: The method offers an efficient and explainable solution for natural language processing tasks.

Abstract: Researchers have relegated natural language processing tasks to Transformer-type models, particularly generative models, because these models exhibit high versatility when performing generation and classification tasks. As the size of these models increases, they achieve outstanding results. Given their widespread use, many explainability techniques are developed based on these models. However, this process becomes computationally expensive due to the large size of the models. Additionally, transformers interpret input information through tokens that fragment input words into sequences lacking inherent semantic meaning, complicating the explanation of the model from the very beginning. This study proposes a novel methodology to achieve explainability in natural language processing tasks by automatically converting sentences into graphs and maintaining semantics through nodes and relations that express fundamental linguistic concepts. It also allows this knowledge to be exploited in subsequent tasks, making it possible to obtain trends and understand how the model associates the different elements inside the text with the explained task. The experiments delivered promising results in determining the most critical components within the text structure for a given classification.

Increasing happiness through conversations with artificial intelligence

Joseph Heffner,Chongyu Qin,Martin Chadwick,Chris Knutsen,Christopher Summerfield,Zeb Kurth-Nelson,Robb B. Rutledge

Task: Study how conversations with AI chatbots affect subjective well-being.

Motivation: AI chatbots are widely used, yet their effect on subjective well-being remains largely unexplored.

Details

Method: Compare participants' happiness after conversing with an AI chatbot (N = 334) and after journaling (N = 193), and use large language models for sentiment analysis.

Result: Happiness was higher after AI chatbot conversations than after journaling, particularly when discussing negative topics; the AI's emotional feedback and participants' sentiment prediction errors both significantly affected happiness.

Conclusion: AI interactions have a significant effect on human well-being, and emotional expectations play a key role during dialogue.

Abstract: Chatbots powered by artificial intelligence (AI) have rapidly become a significant part of everyday life, with over a quarter of American adults using them multiple times per week. While these tools offer potential benefits and risks, a fundamental question remains largely unexplored: How do conversations with AI influence subjective well-being? To investigate this, we conducted a study where participants either engaged in conversations with an AI chatbot (N = 334) or wrote journal entries (N = 193) on the same randomly assigned topics and reported their momentary happiness afterward. We found that happiness after AI chatbot conversations was higher than after journaling, particularly when discussing negative topics such as depression or guilt. Leveraging large language models for sentiment analysis, we found that the AI chatbot mirrored participants' sentiment while maintaining a consistent positivity bias. When discussing negative topics, participants gradually aligned their sentiment with the AI's positivity, leading to an overall increase in happiness. We hypothesized that the history of participants' sentiment prediction errors, the difference between expected and actual emotional tone when responding to the AI chatbot, might explain this happiness effect. Using computational modeling, we find the history of these sentiment prediction errors over the course of a conversation predicts greater post-conversation happiness, demonstrating a central role of emotional expectations during dialogue. Our findings underscore the effect that AI interactions can have on human well-being.
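
The "history of sentiment prediction errors" can be made concrete with a small sketch. The abstract does not specify the computational model, so everything here is an assumption for illustration: sentiment is scored on a [-1, 1] scale, the expectation is updated with a simple delta rule, and the history is summarized as a recency-weighted sum. The names `prediction_errors`, `weighted_history`, `learning_rate`, and `decay` are hypothetical.

```python
def prediction_errors(sentiments, learning_rate=0.5, initial_expectation=0.0):
    """Return one prediction error (actual minus expected sentiment) per turn.

    Assumption: the expectation follows a delta rule; the paper's actual
    model is not given in the abstract.
    """
    expectation = initial_expectation
    errors = []
    for actual in sentiments:
        error = actual - expectation
        errors.append(error)
        # Move the expectation part of the way toward what was observed.
        expectation += learning_rate * error
    return errors


def weighted_history(errors, decay=0.8):
    """Recency-weighted sum of errors; the latest error has weight 1."""
    total = 0.0
    for error in errors:
        total = decay * total + error
    return total


# Hypothetical conversation in which sentiment turns out more positive than expected.
errors = prediction_errors([1.0, 1.0])
summary = weighted_history(errors)
```

Under these assumptions, a run of positive prediction errors yields a positive summary value, which is the kind of quantity the abstract relates to post-conversation happiness.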

ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

Xiao Wang,Daniil Larionov,Siwei Wu,Yiqi Liu,Steffen Eger,Nafise Sadat Moosavi,Chenghua Lin

Task: Propose ContrastScore, a contrastive evaluation metric for automatically assessing the quality of generated text.

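Motivation: Conventional reference-based metrics correlate only weakly with human evaluations, and LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments.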

Details

Method: Design ContrastScore, a contrastive evaluation metric, and evaluate it on machine translation and summarization.

Result: ContrastScore correlates more strongly with human judgments than single-model and ensemble baselines, and ContrastScore built on smaller models even outperforms a larger model.

Conclusion: ContrastScore is an efficient, less biased, and robust automatic evaluation method.

Abstract: Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.

Language Models at the Syntax-Semantics Interface: A Case Study of the Long-Distance Binding of Chinese Reflexive ziji

Xiulin Yang

Task: Explore whether language models can effectively resolve the complex binding patterns of the Mandarin Chinese reflexive ziji.

Motivation: Examine how language models handle the reflexive ziji, whose binding is constrained by both syntactic and semantic factors.

Details

Method: Construct a dataset of 240 synthetic sentences and 320 natural sentences, evaluate 21 language models on it, and compare their performance with native-speaker judgments.

Result: None of the models consistently replicates human judgments; they tend to rely on sequential cues and overlook subtle syntactic and semantic constraints.

Conclusion: Language models remain limited in handling complex syntactic and semantic constraints and need further improvement.

Abstract: This paper explores whether language models can effectively resolve the complex binding patterns of the Mandarin Chinese reflexive ziji, which are constrained by both syntactic and semantic factors. We construct a dataset of 240 synthetic sentences using templates and examples from syntactic literature, along with 320 natural sentences from the BCC corpus. Evaluating 21 language models against this dataset and comparing their performance to judgments from native Mandarin speakers, we find that none of the models consistently replicates human-like judgments. The results indicate that existing language models tend to rely heavily on sequential cues, though not always favoring the closest strings, and often overlooking subtle semantic and syntactic constraints. They tend to be more sensitive to noun-related than verb-related semantics.

LSC-ADL: An Activity of Daily Living (ADL)-Annotated Lifelog Dataset Generated via Semi-Automatic Clustering

Minh-Quan Ho-Le,Duy-Khang Ho,Van-Tu Ninh,Cathal Gurrin,Minh-Triet Tran

Task: Introduce LSC-ADL, an ADL-annotated lifelog dataset, to improve semantic understanding and explainability in lifelog retrieval.

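Motivation: Existing lifelog retrieval methods overlook activity-level annotations, which capture temporal relationships and enrich semantic understanding.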

Details

Method: A semi-automatic approach that combines the HDBSCAN algorithm for intra-class clustering with human verification to produce accurate ADL annotations.

Result: LSC-ADL fills a gap in existing research by offering a more context-aware representation of daily life.

Conclusion: The dataset is expected to advance lifelog retrieval, activity recognition, and egocentric vision research, improving the accuracy and interpretability of retrieved content.

Abstract: Lifelogging involves continuously capturing personal data through wearable cameras, providing an egocentric view of daily activities. Lifelog retrieval aims to search and retrieve relevant moments from this data, yet existing methods largely overlook activity-level annotations, which capture temporal relationships and enrich semantic understanding. In this work, we introduce LSC-ADL, an ADL-annotated lifelog dataset derived from the LSC dataset, incorporating Activities of Daily Living (ADLs) as a structured semantic layer. Using a semi-automatic approach featuring the HDBSCAN algorithm for intra-class clustering and human-in-the-loop verification, we generate accurate ADL annotations to enhance retrieval explainability. By integrating action recognition into lifelog retrieval, LSC-ADL bridges a critical gap in existing research, offering a more context-aware representation of daily life. We believe this dataset will advance research in lifelog retrieval, activity recognition, and egocentric vision, ultimately improving the accuracy and interpretability of retrieved content. The ADL annotations can be downloaded at https://bit.ly/lsc-adl-annotations.

Overcoming Vocabulary Constraints with Pixel-level Fallback

Jonas F. Lotz,Hendra Setiawan,Stephan Peitz,Yova Kementchedjhieva

Task: Propose a pixel-based, vocabulary-free encoder that strengthens the multilingual capabilities of pretrained language models.

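Motivation: Subword tokenization must balance computational efficiency against vocabulary coverage, which leads to poor performance on languages and scripts not prioritized during training.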

Details

Method: Generate input embeddings from text rendered as pixels, as an alternative to conventional subword tokenization.

Result: The approach substantially improves machine translation performance, outperforms tokenizer-based methods, and transfers better across languages.

Conclusion: Pixel-based representations enhance the multilingual capabilities of monolingual language models without extensive retraining, and input compression reduces decoding latency.

Abstract: Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.

Aligned Better, Listen Better for Audio-Visual Large Language Models

Yuxin Guo,Shuailei Ma,Shijie Ma,Xiaoyi Bao,Chen-Wei Xie,Kecheng Zheng,Tingyu Weng,Siyang Sun,Yun Zheng,Wei Zou

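Task: Propose a fine-grained audio-visual large language model (Dolphin) and a dataset (AVU) to address existing models' under-use of audio information and their hallucinations.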

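Motivation: Audio is essential for multimodal video understanding, yet existing Video-LLMs and AV-LLMs exploit audio information poorly, leading to weak understanding and hallucinations.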

Details

Method: On the architecture side, Dolphin aligns audio and vision spatially through an audio-visual multi-scale adapter and temporally through audio-visual interleaved merging; on the data side, the AVU dataset contains 5.2 million diverse data tuples and introduces a new data partitioning strategy.

Result: The model performs strongly on audio-visual understanding and mitigates hallucinations.

Conclusion: Dolphin and AVU substantially improve audio-visual understanding and address limitations of existing models.

Abstract: Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies complementary information to vision. Besides, video large language models (Video-LLMs) can encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak understanding and hallucinations. To solve the issues, we delve into the model architecture and dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM, namely Dolphin. The concurrent alignment of audio and visual modalities in both temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter for multi-scale information aggregation, which achieves spatial alignment. For temporal alignment, we propose audio-visual interleaved merging. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset, called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show our model not only achieves remarkable performance in audio-visual understanding, but also mitigates potential hallucinations.

One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Ezzeldin Shereen,Dan Ristea,Burak Hasircioglu,Shae McFadden,Vasilios Mavroudis,Chris Hicks

Task: Present a poisoning attack against multimodal retrieval augmented generation (M-RAG) systems to expose their potential vulnerabilities.

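Motivation: M-RAG systems use a knowledge base to reduce hallucinations in large models, but malicious entries injected into that knowledge base can compromise system reliability.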

Details

Method: Design a poisoning attack on visual document retrieval applications in which a single image influences the results of many different queries, yielding a universal denial-of-service attack.

Result: The attack is effective against a range of state-of-the-art retrievers and generators, but has limited effect against robust embedding models.

Conclusion: The work exposes the vulnerability of M-RAG systems to poisoning attacks and points to a weakness that may hinder their performance even in benign settings.

Abstract: Multimodal retrieval augmented generation (M-RAG) has recently emerged as a method to inhibit hallucinations of large multimodal models (LMMs) through a factual knowledge base (KB). However, M-RAG also introduces new attack vectors for adversaries that aim to disrupt the system by injecting malicious entries into the KB. In this work, we present a poisoning attack against M-RAG targeting visual document retrieval applications, where the KB contains images of document pages. Our objective is to craft a single image that is retrieved for a variety of different user queries, and consistently influences the output produced by the generative model, thus creating a universal denial-of-service (DoS) attack against the M-RAG system. We demonstrate that while our attack is effective against a diverse range of widely-used, state-of-the-art retrievers (embedding models) and generators (LMMs), it can also be ineffective against robust embedding models. Our attack not only highlights the vulnerability of M-RAG pipelines to poisoning attacks, but also sheds light on a fundamental weakness that potentially hinders their performance even in benign settings.

FreSca: Unveiling the Scaling Space in Diffusion Models

Chao Huang,Susan Liang,Yunlong Tang,Li Ma,Yapeng Tian,Chenliang Xu

Task: Explore the frequency-based guidance scaling space in diffusion models to improve image editing and image understanding tasks.

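Motivation: The guidance scaling mechanism in diffusion models implicitly defines a "scaling space" whose potential for fine-grained semantic manipulation remains underexplored.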

Details

Method: Through a Fourier analysis of noise predictions, propose FreSca, which applies guidance scaling independently to different frequency bands.

Result: FreSca demonstrably improves existing image editing methods and yields quantitative gains on image understanding tasks such as depth estimation.

Conclusion: FreSca demonstrates the effectiveness of frequency-domain guidance scaling and suggests new directions for applying diffusion models.

Abstract: Diffusion models offer impressive controllability for image tasks, primarily through noise predictions that encode task-specific information and classifier-free guidance enabling adjustable scaling. This scaling mechanism implicitly defines a "scaling space" whose potential for fine-grained semantic manipulation remains underexplored. We investigate this space, starting with inversion-based editing where the difference between conditional/unconditional noise predictions carries key semantic information. Our core contribution stems from a Fourier analysis of noise predictions, revealing that its low- and high-frequency components evolve differently throughout diffusion. Based on this insight, we introduce FreSca, a straightforward method that applies guidance scaling independently to different frequency bands in the Fourier domain. FreSca demonstrably enhances existing image editing methods without retraining. Excitingly, its effectiveness extends to image understanding tasks such as depth estimation, yielding quantitative gains across multiple datasets.
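
The idea of scaling guidance per frequency band can be illustrated on a toy 1D signal. This is a sketch under stated assumptions, not the authors' implementation: it assumes the scaling is applied to the conditional-minus-unconditional difference, uses a hard cutoff between a "low" and a "high" band, and works on plain Python lists rather than image or latent tensors. The function name `frequency_scaled_guidance` and its parameters are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real-valued list."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def frequency_scaled_guidance(uncond, cond, low_scale, high_scale, cutoff):
    """Scale the guidance difference separately in low and high frequency bands."""
    delta = [c - u for c, u in zip(cond, uncond)]
    spectrum = dft(delta)
    n = len(delta)
    scaled = []
    for k, coeff in enumerate(spectrum):
        # Bins k and n - k hold the same frequency; treating them alike keeps the output real.
        freq = min(k, n - k)
        scale = low_scale if freq <= cutoff else high_scale
        scaled.append(scale * coeff)
    guided_delta = idft(scaled)
    return [u + d for u, d in zip(uncond, guided_delta)]


# Hypothetical toy "noise predictions".
uncond = [0.0, 0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 0.1]
cond = [0.5, 0.4, 0.1, 0.6, 0.2, 0.3, 0.4, 0.0]
guided = frequency_scaled_guidance(uncond, cond, low_scale=3.0, high_scale=1.5, cutoff=1)
```

When `low_scale` equals `high_scale`, the sketch reduces to ordinary classifier-free guidance with a single scale, `uncond + scale * (cond - uncond)`, which is a useful sanity check.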

LL4G: Self-Supervised Dynamic Optimization for Graph-Based Personality Detection

Lingzhi Shen,Yunfei Long,Xiaohao Cai,Guanming Chen,Yuhan Wang,Imran Razzak,Shoaib Jameel

Task: Propose LL4G, a self-supervised framework that uses large language models to optimize graph neural networks for personality detection.

Motivation: Existing methods struggle with sparse or noisy data, and static graphs cannot capture dynamic changes.

Details

Method: Use LLMs to extract semantic features that generate node representations and infer relationships, adapt the graph structure dynamically, and jointly train the GNN on node reconstruction, edge prediction, and contrastive learning tasks.

Result: Experiments on the Kaggle and Pandora datasets show that LL4G outperforms state-of-the-art models.

Conclusion: By combining semantic and structural information, LL4G produces more robust personality profiles.

Abstract: Graph-based personality detection constructs graph structures from textual data, particularly social media posts. Current methods often struggle with sparse or noisy data and rely on static graphs, limiting their ability to capture dynamic changes between nodes and relationships. This paper introduces LL4G, a self-supervised framework leveraging large language models (LLMs) to optimize graph neural networks (GNNs). LLMs extract rich semantic features to generate node representations and to infer explicit and implicit relationships. The graph structure adaptively adds nodes and edges based on input data, continuously optimizing itself. The GNN then uses these optimized representations for joint training on node reconstruction, edge prediction, and contrastive learning tasks. This integration of semantic and structural information generates robust personality profiles. Experimental results on Kaggle and Pandora datasets show LL4G outperforms state-of-the-art models.

UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting

Jaehoon Choi,Dongki Jung,Yonghan Lee,Sungmin Eum,Dinesh Manocha,Heesung Kwon

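Task: Propose UAVTwin, a method for creating digital twins from real-world environments and using data augmentation to train downstream models embedded in unmanned aerial vehicles (UAVs).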

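Motivation: Dynamic objects and appearance variations in complex scenes challenge 3D Gaussian Splatting (3DGS) modeling; addressing them improves downstream model performance.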

Details

Method: Combine 3DGS background reconstruction with controllable synthetic human models, and introduce a new appearance modeling strategy and a mask refinement module.

Result: A 1.23 dB improvement in PSNR, and a 2.5% to 13.7% improvement in mAP on the human detection task.

Conclusion: UAVTwin is, to the authors' knowledge, the first 3DGS-based approach for generating high-fidelity digital twins for UAV-based perception, and it markedly improves data augmentation for UAV perception tasks.

Abstract: We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs). Specifically, our approach focuses on synthesizing foreground components, such as various human instances in motion within complex scene backgrounds, from UAV perspectives. This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses. To the best of our knowledge, UAVTwin is the first approach for UAV-based perception that is capable of generating high-fidelity digital twins based on 3DGS. The proposed work significantly enhances downstream models through data augmentation for real-world environments with multiple dynamic objects and significant appearance variations, both of which typically introduce artifacts in 3DGS-based modeling. To tackle these challenges, we propose a novel appearance modeling strategy and a mask refinement module to enhance the training of 3D Gaussian Splatting. We demonstrate the high quality of neural rendering by achieving a 1.23 dB improvement in PSNR compared to recent methods. Furthermore, we validate the effectiveness of data augmentation by showing a 2.5% to 13.7% improvement in mAP for the human detection task.
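
To put the reported 1.23 dB PSNR gain in perspective, the sketch below computes PSNR from its standard definition and converts a dB gain into the implied ratio of mean squared errors. It assumes 8-bit pixel values (peak 255) and flat lists of pixels; the helper names `psnr` and `mse_ratio_for_gain` are hypothetical and not taken from the paper.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Ratio new_mse / old_mse implied by a PSNR gain of gain_db (same peak value)."""
    return 10.0 ** (-gain_db / 10.0)


ratio = mse_ratio_for_gain(1.23)
```

Because PSNR is logarithmic in the MSE, a 1.23 dB gain corresponds to an MSE ratio of about 0.75, that is, roughly a quarter less mean squared error at the same peak value.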

Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala

Shanilka Haturusinghe,Tharindu Cyril Weerasooriya,Marcos Zampieri,Christopher M. Homan,S. R. Liyanage

Task: Study how fine-tuning strategies can improve offensive language detection in low-resource languages such as Sinhala.

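Motivation: Accurate offensive language detection is essential for social media safety, yet performance on this task differs sharply between low-resource and high-resource languages.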

Details

Method: Introduce four models: Subasa-XLM-R (with an intermediate pre-finetuning step), and Subasa-Llama and Subasa-Mistral (fine-tuned from Llama and Mistral, respectively), all evaluated on the SOLD dataset.

Result: All models outperform existing baselines; Subasa-XLM-R achieves the highest Macro F1 score (0.84), surpassing large language models such as GPT-4o evaluated under zero-shot settings.

Conclusion: The proposed models perform strongly on Sinhala offensive language detection, and the code and models are publicly available.

Abstract: Accurate detection of offensive language is essential for a number of applications related to social media safety. There is a sharp contrast in performance in this task between low and high-resource languages. In this paper, we adapt fine-tuning strategies that have not been previously explored for Sinhala in the downstream task of offensive language detection. Using this approach, we introduce four models: "Subasa-XLM-R", which incorporates an intermediate Pre-Finetuning step using Masked Rationale Prediction. Two variants of "Subasa-Llama" and "Subasa-Mistral" are fine-tuned versions of Llama (3.2) and Mistral (v0.3), respectively, with a task-specific strategy. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection. All our models outperform existing baselines. Subasa-XLM-R achieves the highest Macro F1 score (0.84) surpassing state-of-the-art large language models like GPT-4o when evaluated on the same SOLD benchmark dataset under zero-shot settings. The models and code are publicly available.
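
Since the headline number is a Macro F1 score, a short sketch of how that metric is computed may help. Macro F1 averages the per-class F1 scores with equal weight, so the minority class counts as much as the majority class. The labels "OFF" and "NOT" and the toy predictions below are made up for illustration and are not drawn from the SOLD data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced example: one offensive post is missed.
y_true = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "NOT"]
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 5/6, but Macro F1 is 7/9 because the missed offensive post halves the recall of the smaller class, which is why the metric is commonly reported for imbalanced detection tasks.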

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

Shaojin Wu,Mengqi Huang,Wenxu Wu,Yufeng Cheng,Fei Ding,Qian He

Task: Address data scalability and subject expansibility in subject-driven generation.

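Motivation: Subject-driven generation is widely used in image generation but faces challenges in data scalability and in extending to multiple subjects.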

Details

Method: Propose a high-consistency data synthesis pipeline and the UNO model, which combines progressive cross-modal alignment with universal rotary position embedding.

Result: Experiments show high consistency and controllability in both single-subject and multi-subject generation.

Conclusion: The method addresses key challenges in subject-driven generation and has broad application potential.

Abstract: Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.

LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks

Seunghyun Yoo

Task: Investigate how large language models (LLMs), acting as autonomous agents, exploit semantic ambiguity to generate deceptive puzzles.

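Motivation: Examine the agentic behaviors LLMs exhibit in adversarial settings and their exploitation of semantic ambiguity, in order to surface potential ethical concerns.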

Details

Method: Systematically compare puzzles produced through zero-shot prompting, role-injected adversarial prompts, and human-crafted examples, using computational analysis with HateBERT alongside subjective human evaluation.

Result: Adversarial agent behaviors significantly heighten semantic ambiguity, increasing cognitive load and reducing fairness in puzzle solving.

Conclusion: The study sheds light on the agentic qualities of LLMs and underscores ethical considerations for safely deploying autonomous language systems in educational technology and entertainment.

Abstract: Recent advancements in Large Language Models (LLMs) have not only showcased impressive creative capabilities but also revealed emerging agentic behaviors that exploit linguistic ambiguity in adversarial settings. In this study, we investigate how an LLM, acting as an autonomous agent, leverages semantic ambiguity to generate deceptive puzzles that mislead and challenge human users. Inspired by the popular puzzle game "Connections", we systematically compare puzzles produced through zero-shot prompting, role-injected adversarial prompts, and human-crafted examples, with an emphasis on understanding the underlying agent decision-making processes. Employing computational analyses with HateBERT to quantify semantic ambiguity, alongside subjective human evaluations, we demonstrate that explicit adversarial agent behaviors significantly heighten semantic ambiguity, thereby increasing cognitive load and reducing fairness in puzzle solving. These findings provide critical insights into the emergent agentic qualities of LLMs and underscore important ethical considerations for evaluating and safely deploying autonomous language systems in both educational technologies and entertainment.

MDP: Multidimensional Vision Model Pruning with Latency Constraint

Xinglong Sun,Barath Lakshmanan,Maying Shen,Shiyi Lan,Jingde Chen,Jose M. Alvarez

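Task: Propose Multi-Dimensional Pruning (MDP), a new paradigm that addresses the limitations of existing structural pruning methods in pruning granularity and latency modeling.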

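Motivation: Existing methods are limited in pruning granularity and latency modeling, which makes aggressive parameter reduction difficult and yields inaccurate latency predictions for architectures such as transformers.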

Details

Method: MDP jointly optimizes across multiple pruning granularities (channels, queries, keys, heads, embeddings, and blocks), uses an advanced latency modeling technique, and reformulates pruning as a Mixed-Integer Nonlinear Program (MINLP).

Result: On ImageNet, MDP prunes ResNet50 to run 28% faster than HALP with a +1.4 Top-1 accuracy improvement; for transformer pruning it is 37% faster than Isomorphic with a +0.7 Top-1 accuracy improvement.

Conclusion: MDP is a general framework that significantly outperforms prior methods, especially in pruning efficiency and accuracy retention.

Abstract: Current structural pruning methods face two significant limitations: (i) they often limit pruning to finer-grained levels like channels, making aggressive parameter reduction challenging, and (ii) they focus heavily on parameter and FLOP reduction, with existing latency-aware methods frequently relying on simplistic, suboptimal linear models that fail to generalize well to transformers, where multiple interacting dimensions impact latency. In this paper, we address both limitations by introducing Multi-Dimensional Pruning (MDP), a novel paradigm that jointly optimizes across a variety of pruning granularities, including channels, query, key, heads, embeddings, and blocks. MDP employs an advanced latency modeling technique to accurately capture latency variations across all prunable dimensions, achieving an optimal balance between latency and accuracy. By reformulating pruning as a Mixed-Integer Nonlinear Program (MINLP), MDP efficiently identifies the optimal pruned structure across all prunable dimensions while respecting latency constraints. This versatile framework supports both CNNs and transformers. Extensive experiments demonstrate that MDP significantly outperforms previous methods, especially at high pruning ratios. On ImageNet, MDP achieves a 28% speed increase with a +1.4 Top-1 accuracy improvement over prior work like HALP for ResNet50 pruning. Against the latest transformer pruning method, Isomorphic, MDP delivers an additional 37% acceleration with a +0.7 Top-1 accuracy improvement.
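
The shape of the optimization problem, choosing integer sizes along several dimensions to maximize retained importance under a latency budget, can be sketched with a toy exhaustive search. This is not the authors' MINLP formulation or solver: the dimension names, the latency function (a product of dimensions, chosen only to show a nonlinear interaction), the importance function, and the helper `best_config` are all hypothetical.

```python
from itertools import product


def best_config(options, latency_fn, importance_fn, budget):
    """Enumerate every combination and keep the most important one within budget."""
    best, best_score = None, float("-inf")
    for values in product(*options.values()):
        choice = dict(zip(options.keys(), values))
        if latency_fn(choice) > budget:
            continue  # violates the latency constraint
        score = importance_fn(choice)
        if score > best_score:
            best, best_score = choice, score
    return best


# Hypothetical search space: how many blocks, heads, and embedding channels to keep.
options = {"blocks": [2, 4], "heads": [2, 4], "embed": [64, 128]}


def toy_latency(c):
    # Dimensions interact multiplicatively, so a linear latency model would not fit.
    return c["blocks"] * c["heads"] * c["embed"] / 256


def toy_importance(c):
    return 3 * c["blocks"] + 2 * c["heads"] + c["embed"] / 64


chosen = best_config(options, toy_latency, toy_importance, budget=4)
```

With these made-up numbers the search keeps 4 blocks and 4 heads and shrinks the embedding to 64, since the full configuration would exceed the budget. Enumeration only works for tiny spaces, which is presumably why the abstract describes solving the problem as a MINLP instead.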

State-of-the-Art Translation of Text-to-Gloss using mBART : A case study of Bangla

Sharif Md. Abdullah,Abhijit Paul,Shebuti Rayana,Ahmedul Kabir,Zarif Masud

Task: Address the dataset and methodology gaps in text-to-gloss translation for Bangla Sign Language (BdSL).

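Motivation: Although Bangladesh has a deaf and mute population of 1.7 million, BdSL is understudied, and text-to-gloss translation in particular has not been explored.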

Details

Method: Combine grammatical rule-based gloss generation, LLM-generated synthetic data, and data augmentation, and fine-tune pretrained models such as mBART-50 and mBERT.

Result: The fine-tuned mBART-50 performs strongly on both the BdSL dataset and the PHOENIX-14T benchmark, reaching state-of-the-art performance.

Conclusion: The study proposes a new paradigm for text-to-gloss tasks based on mBART models and shows that rule-based synthetic datasets benefit the BdSL task.

Abstract: Despite a large deaf and dumb population of 1.7 million, Bangla Sign Language (BdSL) remains an understudied domain. Specifically, there are no works on the Bangla text-to-gloss translation task. To address this gap, we begin by addressing the dataset problem. We take inspiration from grammatical rule based gloss generation used in German and American Sign Language (ASL) and adapt it for BdSL. We also leverage an LLM to generate synthetic data and use back-translation and text generation for data augmentation. With the dataset prepared, we started experimentation. We fine-tuned pretrained mBART-50 and mBERT-multiclass-uncased models on our dataset. We also trained GRU, RNN and a novel seq-to-seq model with multi-head attention. We observe significantly high performance (SacreBLEU = 79.53) with fine-tuning the pretrained mBART-50 multilingual model from Facebook. We then explored why we observe such high performance with mBART. We soon noticed an interesting property of mBART: it was trained on shuffled and masked text data. And as we know, gloss form has shuffling property. So we hypothesize that mBART is inherently good at text-to-gloss tasks. To test this hypothesis, we trained mBART-50 on the PHOENIX-14T benchmark and evaluated it against existing literature. Our mBART-50 finetune demonstrated State-of-the-Art performance on the PHOENIX-14T benchmark, far outperforming existing models in all 6 metrics (SacreBLEU = 63.89, BLEU-1 = 55.14, BLEU-2 = 38.07, BLEU-3 = 27.13, BLEU-4 = 20.68, COMET = 0.624). Based on the results, this study proposes a new paradigm for text-to-gloss task using mBART models. Additionally, our results show that BdSL text-to-gloss task can greatly benefit from rule-based synthetic dataset.

Foreground Focus: Enhancing Coherence and Fidelity in Camouflaged Image Generation

Pei-Chi Chen,Yi Yao,Chan-Feng Hsu,HongXia Xie,Hung-Jen Chen,Hong-Han Shuai,Wen-Huang Cheng

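Task: Propose a Foreground-Aware Camouflaged Image Generation (FACIG) model that addresses existing methods' weak foreground-background feature integration and low foreground fidelity.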

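Motivation: In existing camouflaged image generation methods, background knowledge integrates poorly with foreground features, causing color discrepancies, and foreground fidelity is low, with small objects especially prone to distortion.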

Details

Method: Introduce a Foreground-Aware Feature Integration Module (FAFIM) to strengthen the integration of foreground and background features, and design a Foreground-Aware Denoising Loss to enhance foreground reconstruction supervision.

Result: Experiments on multiple datasets show that FACIG outperforms existing methods in camouflaged image quality and foreground fidelity.

Conclusion: FACIG addresses the foreground-background integration and foreground fidelity problems, improving camouflaged image generation quality.

Abstract: Camouflaged image generation is emerging as a solution to data scarcity in camouflaged vision perception, offering a cost-effective alternative to data collection and labeling. Recently, the state-of-the-art approach successfully generates camouflaged images using only foreground objects. However, it faces two critical weaknesses: 1) the background knowledge does not integrate effectively with foreground features, resulting in a lack of foreground-background coherence (e.g., color discrepancy); 2) the generation process does not prioritize the fidelity of foreground objects, which leads to distortion, particularly for small objects. To address these issues, we propose a Foreground-Aware Camouflaged Image Generation (FACIG) model. Specifically, we introduce a Foreground-Aware Feature Integration Module (FAFIM) to strengthen the integration between foreground features and background knowledge. In addition, a Foreground-Aware Denoising Loss is designed to enhance foreground reconstruction supervision. Experiments on various datasets show our method outperforms previous methods in overall camouflaged image quality and foreground fidelity.

Measurement of LLM's Philosophies of Human Nature

Minheng Ni,Ennan Wu,Zidong Gong,Zhengyuan Yang,Linjie Li,Chung-Ching Lin,Kevin Lin,Lijuan Wang,Wangmeng Zuo

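Task: Design and validate a standardized psychological scale for large language models (LLMs), the M-PHNS, to assess their attitudes toward human nature, and propose a mental loop learning framework to optimize their value systems.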

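Motivation: The widespread application of AI, together with reports of conflicts or violations involving AI, has raised societal concerns about interacting with AI systems, creating a need to assess and improve AI attitudes toward humans.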

Details

Method: Building on Wrightsman's Philosophies of Human Nature Scale (PHNS), design the M-PHNS and evaluate LLMs' attitudes toward human nature across six dimensions; propose a mental loop learning framework that optimizes an LLM's value system by constructing moral scenarios.

Result: Current LLMs show a systemic lack of trust in humans, and a model's intelligence level correlates significantly and negatively with its trust in humans; mental loop learning significantly increases LLMs' trust in humans.

Conclusion: The M-PHNS offers a potential means of diagnosing cognitive biases in LLMs and supporting ethical learning, and the mental loop learning framework improves LLMs' attitudes toward human nature.

Abstract: The widespread application of artificial intelligence (AI) in various tasks, along with frequent reports of conflicts or violations involving AI, has sparked societal concerns about interactions with AI systems. Based on Wrightsman's Philosophies of Human Nature Scale (PHNS), a scale empirically validated over decades to effectively assess individuals' attitudes toward human nature, we design the standardized psychological scale specifically targeting large language models (LLM), named the Machine-based Philosophies of Human Nature Scale (M-PHNS). By evaluating LLMs' attitudes toward human nature across six dimensions, we reveal that current LLMs exhibit a systemic lack of trust in humans, and there is a significant negative correlation between the model's intelligence level and its trust in humans. Furthermore, we propose a mental loop learning framework, which enables LLM to continuously optimize its value system during virtual interactions by constructing moral scenarios, thereby improving its attitude toward human nature. Experiments demonstrate that mental loop learning significantly enhances their trust in humans compared to persona or instruction prompts. This finding highlights the potential of human-based psychological assessments for LLM, which can not only diagnose cognitive biases but also provide a potential solution for ethical learning in artificial intelligence. We release the M-PHNS evaluation code and data at https://github.com/kodenii/M-PHNS.

ESC: Erasing Space Concept for Knowledge Deletion

Tae-Young Lee,Sundong Park,Minwoo Jeon,Hyoseok Hwang,Gyeong-Moon Park

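Task: Introduce Knowledge Deletion (KD), a new task that addresses the need to completely erase personal knowledge from deep learning models, along with the Knowledge Retention score (KR) as an evaluation metric.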

Motivation: Existing methods fail to meet users' demand for complete knowledge erasure and risk leaking personal knowledge through embedding features.

Details

Method: Propose the training-free erasing approach Erasing Space Concept (ESC) and its trained variant ESC-T, which erase knowledge by restricting the important subspace in feature space.

Result: Across a variety of datasets and models, the proposed methods are the fastest and achieve state-of-the-art performance, and they apply to diverse forgetting scenarios.

Conclusion: ESC and ESC-T are efficient and general for the knowledge deletion task and address the shortcomings of existing methods.

Abstract: As concerns regarding privacy in deep learning continue to grow, individuals are increasingly apprehensive about the potential exploitation of their personal knowledge in trained models. Despite several research efforts to address this, they often fail to consider the real-world demand from users for complete knowledge erasure. Furthermore, our investigation reveals that existing methods have a risk of leaking personal knowledge through embedding features. To address these issues, we introduce a novel concept of Knowledge Deletion (KD), an advanced task that considers both concerns, and provides an appropriate metric, named Knowledge Retention score (KR), for assessing knowledge retention in feature space. To achieve this, we propose a novel training-free erasing approach named Erasing Space Concept (ESC), which restricts the important subspace for the forgetting knowledge by eliminating the relevant activations in the feature. In addition, we suggest ESC with Training (ESC-T), which uses a learnable mask to better balance the trade-off between forgetting and preserving knowledge in KD. Our extensive experiments on various datasets and models demonstrate that our proposed methods achieve the fastest and state-of-the-art performance. Notably, our methods are applicable to diverse forgetting scenarios, such as facial domain setting, demonstrating the generalizability of our methods. The code is available at http://github.com/KU-VGI/ESC.
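
One simple way to picture "eliminating the relevant activations" in a feature subspace is an orthogonal projection that removes a feature vector's components along a set of directions. The sketch below is only that generic projection, offered as an assumed stand-in: the abstract does not say how ESC identifies the important subspace or exactly how it suppresses it. The helper `erase_subspace` and the example vectors are hypothetical.

```python
def erase_subspace(feature, basis):
    """Remove the components of `feature` that lie in span(basis).

    Assumption: `basis` is a list of orthonormal direction vectors that
    stand for the subspace carrying the knowledge to forget.
    """
    erased = list(feature)
    for direction in basis:
        # Projection coefficient of the original feature onto this direction.
        coeff = sum(f * d for f, d in zip(feature, direction))
        erased = [e - coeff * d for e, d in zip(erased, direction)]
    return erased


# Hypothetical 3D feature; the first two axes play the role of the "forgetting" subspace.
feature = [3.0, 4.0, 5.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
erased = erase_subspace(feature, basis)
```

After the projection, the feature has no component left along the removed directions while the rest of the vector is untouched, which mirrors the stated goal of forgetting some knowledge while preserving the remainder.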

Improving Harmful Text Detection with Joint Retrieval and External Knowledge

Zidong Yu,Shuo Wang,Nan Jiang,Weiqiang Huang,Xu Han,Junliang Du

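Task: Propose a joint retrieval framework that combines pre-trained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection.

Motivation: As AI-generated content expands across digital platforms, harmful text detection has become a key task in developing and deploying large language models.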

Details

Method: Uses a joint retrieval approach that combines pre-trained language models with knowledge graphs and draws on external contextual information to capture nuanced harmful content. Result: Experiments show that joint retrieval significantly outperforms single-model baselines, particularly in low-resource training scenarios and multilingual environments. Conclusion: The method contributes to AI safety; future research should focus on optimizing computational efficiency, enhancing model interpretability, and expanding multimodal detection capabilities. Abstract: Harmful text detection has become a crucial task in the development and deployment of large language models, especially as AI-generated content continues to expand across digital platforms. This study proposes a joint retrieval framework that integrates pre-trained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection. Experimental results demonstrate that the joint retrieval approach significantly outperforms single-model baselines, particularly in low-resource training scenarios and multilingual environments. The proposed method effectively captures nuanced harmful content by leveraging external contextual information, addressing the limitations of traditional detection models. Future research should focus on optimizing computational efficiency, enhancing model interpretability, and expanding multimodal detection capabilities to better tackle evolving harmful content patterns. This work contributes to the advancement of AI safety, ensuring more trustworthy and reliable content moderation systems.

Geospatial Artificial Intelligence for Satellite-based Flood Extent Mapping: Concepts, Advances, and Future Perspectives

Hyunho Lee,Wenwen Li

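Task: Use Geospatial Artificial Intelligence (GeoAI) with satellite data to map flood extent, in order to identify flood events and assess their impacts.

Motivation: Support disaster management and spatial decision-making.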

Details

Method: Systematically integrates AI techniques with satellite data to produce flood extent maps and related analytical outputs such as uncertainty estimation and change detection. Result: Produces flood extent maps and other analytical outputs for disaster management and decision-making. Conclusion: GeoAI combined with satellite data has significant practical value for flood extent mapping and disaster management. Abstract: Geospatial Artificial Intelligence (GeoAI) for satellite-based flood extent mapping systematically integrates artificial intelligence techniques with satellite data to identify flood events and assess their impacts, for disaster management and spatial decision-making. The primary output often includes flood extent maps, which delineate the affected areas, along with additional analytical outputs such as uncertainty estimation and change detection.

CoTAL: Human-in-the-Loop Prompt Engineering, Chain-of-Thought Reasoning, and Active Learning for Generalizable Formative Assessment Scoring

Clayton Cohn,Nicole Hutchins,Ashwin T S,Gautam Biswas

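Task: Explore and validate the generalizability and effectiveness of the LLM-based Chain-of-Thought Prompting + Active Learning (CoTAL) approach for formative assessment across multiple domains.

Motivation: Although LLMs show promise for formative assessment in science, their generalizability across domains such as science, computing, and engineering has not been adequately validated.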

Details

Method: Applies Evidence-Centered Design (ECD) principles to develop curriculum-aligned assessments and rubrics, uses human-in-the-loop prompt engineering to automate scoring, and iteratively refines assessment questions, rubrics, and prompts using teacher and student feedback. Result: CoTAL improves GPT-4's scoring performance, with gains of up to 24.5% over a non-prompt-engineered baseline; both teachers and students regard it as effective in scoring and explaining student responses. Conclusion: CoTAL is an effective approach to multi-domain formative assessment, and iterative refinement can further improve grading accuracy and explanation quality. Abstract: Large language models (LLMs) have created new opportunities to assist teachers and support student learning. Methods such as chain-of-thought (CoT) prompting enable LLMs to grade formative assessments in science, providing scores and relevant feedback to students. However, the extent to which these methods generalize across curricula in multiple domains (such as science, computing, and engineering) remains largely untested. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) principles to develop curriculum-aligned formative assessments and rubrics, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates teacher and student feedback to iteratively refine assessment questions, grading rubrics, and LLM prompts for automated grading. Our findings demonstrate that CoTAL improves GPT-4's scoring performance, achieving gains of up to 24.5% over a non-prompt-engineered baseline. Both teachers and students view CoTAL as effective in scoring and explaining student responses, each providing valuable refinements to enhance grading accuracy and explanation quality.

AC-LoRA: Auto Component LoRA for Personalized Artistic Style Image Generation

Zhipu Cui,Andong Tian,Zhi Ying,Jialiang Lu

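Task: Propose AutoComponent-LoRA (AC-LoRA), a method that automatically separates the signal and noise components of LoRA matrices for efficient personalized artistic style image generation.

Motivation: Existing LoRA methods require manual tuning of the rank parameter and struggle to reach satisfactory results, so an automated solution is needed.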

Details

Method: Builds on Singular Value Decomposition (SVD) and dynamic heuristics to update hyperparameters automatically during training. Result: Outperforms existing methods in overcoming model underfitting or overfitting, with an average improvement of 9%. Conclusion: AC-LoRA is an efficient, automated method for personalized image generation that improves generation quality. Abstract: Personalized image generation allows users to preserve styles or subjects of a provided small set of images for further image generation. With the advancement in large text-to-image models, many techniques have been developed to efficiently fine-tune those models for personalization, such as Low Rank Adaptation (LoRA). However, LoRA-based methods often face the challenge of adjusting the rank parameter to achieve satisfactory results. To address this challenge, AutoComponent-LoRA (AC-LoRA) is proposed, which is able to automatically separate the signal component and noise component of the LoRA matrices for fast and efficient personalized artistic style image generation. This method is based on Singular Value Decomposition (SVD) and dynamic heuristics to update the hyperparameters during training. Superior performance over existing methods in overcoming model underfitting or overfitting problems is demonstrated. The results were validated using FID, CLIP, DINO, and ImageReward, achieving an average of 9% improvement.

LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models

Weibin Liao,Xin Gao,Tianyu Jia,Rihong Qiu,Yifan Zhu,Yang Lin,Xu Chu,Junfeng Zhao,Yasha Wang

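Task: Propose LearNAT, a framework that improves the performance of open-source large language models on complex NL2SQL tasks through task decomposition and reinforcement learning.

Motivation: Existing NL2SQL methods rely on closed-source LLMs or require fine-tuning open-source models, yet open-source models struggle on complex tasks because user query objectives are expressed indirectly and a semantic gap separates queries from database schemas.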

Details

Method: LearNAT comprises three key components: an AST-guided Decomposition Synthesis Procedure, Margin-aware Reinforcement Learning (optimized via DPO), and Adaptive Demonstration Reasoning. Result: On the Spider and BIRD benchmarks, LearNAT enables a 7B-parameter open-source model to achieve performance comparable to GPT-4 while improving efficiency and accessibility. Conclusion: Through task decomposition and reinforcement learning, LearNAT significantly improves open-source models on complex NL2SQL tasks and offers an efficient, accessible solution. Abstract: Natural Language to SQL (NL2SQL) has emerged as a critical task for enabling seamless interaction with databases. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable performance in this domain. However, existing NL2SQL methods predominantly rely on closed-source LLMs leveraging prompt engineering, while open-source models typically require fine-tuning to acquire domain-specific knowledge. Despite these efforts, open-source LLMs struggle with complex NL2SQL tasks due to the indirect expression of user query objectives and the semantic gap between user queries and database schemas. Inspired by the application of reinforcement learning in mathematical problem-solving to encourage step-by-step reasoning in LLMs, we propose LearNAT (Learning NL2SQL with AST-guided Task Decomposition), a novel framework that improves the performance of open-source LLMs on complex NL2SQL tasks through task decomposition and reinforcement learning. LearNAT introduces three key components: (1) a Decomposition Synthesis Procedure that leverages Abstract Syntax Trees (ASTs) to guide efficient search and pruning strategies for task decomposition, (2) Margin-aware Reinforcement Learning, which employs fine-grained step-level optimization via DPO with AST margins, and (3) Adaptive Demonstration Reasoning, a mechanism for dynamically selecting relevant examples to enhance decomposition capabilities. Extensive experiments on two benchmark datasets, Spider and BIRD, demonstrate that LearNAT enables a 7B-parameter open-source LLM to achieve performance comparable to GPT-4, while offering improved efficiency and accessibility.
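
The abstract mentions step-level optimization "via DPO with AST margins" but does not state the loss. The sketch below shows one common way to add a margin to the DPO objective, as a rough illustration only. The function name `margin_dpo_loss`, the `beta` default, and the idea that `margin` would be derived from an AST difference are all assumptions, not details taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def margin_dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    margin: float = 0.0,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair with an additive margin.

    Hypothetical assumption: `margin` is a non-negative number that grows
    with how different the preferred and rejected steps are (for example,
    a distance between their ASTs). With margin=0 this is plain DPO.
    """
    # Implicit rewards: scaled log-ratio of policy to reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The preferred step must beat the rejected one by at least `margin`.
    return -log_sigmoid(chosen_reward - rejected_reward - margin)
```

A larger margin raises the loss for the same pair, so pairs that differ more are pushed further apart during training.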

SocialGesture: Delving into Multi-person Gesture Understanding

Xu Cao,Pranav Virupaksha,Wenqi Jia,Bolin Lai,Fiona Ryan,Sangmin Lee,James M. Rehg

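Task: Build and release SocialGesture, the first large-scale dataset for multi-person gesture analysis, and propose a visual question answering task to evaluate vision-language models on social gesture understanding.

Motivation: Existing gesture recognition research overlooks multi-person interactions, which makes gestures hard to align with other modalities such as language and speech and limits the study of gesture understanding in social settings.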

Details

Method: Introduces the SocialGesture dataset, which covers diverse natural scenarios and supports video-based recognition and temporal localization, and proposes a visual question answering task. Result: SocialGesture provides a resource for studying gestures in complex social interactions and exposes limitations of current gesture recognition models. Conclusion: SocialGesture fills a gap in multi-person gesture analysis and points to directions for future improvement. Abstract: Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities like language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gesture during complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark the performance of vision language models (VLMs) on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field. SocialGesture is available at huggingface.co/datasets/IrohXu/SocialGesture.

The quasi-semantic competence of LLMs: a case study on the part-whole relation

Mattia Proietti,Alessandro Lenci

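Task: Investigate how well large language models (LLMs) understand the semantics of the part-whole relation (meronymy).

Motivation: The part-whole relation plays a key role in lexical organization but is understudied, so LLMs' competence in this area needs to be explored.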

Details

Method: Uses ConceptNet relations and human-generated semantic feature norms, and studies the question at three levels: behavioral testing, sentence probability scoring, and concept representation analysis. Result: LLMs' knowledge of the part-whole relation is only partial; they have just a "quasi-semantic" competence and fail to capture deep inferential properties. Conclusion: LLMs' semantic competence on the part-whole relation is limited and needs further improvement. Abstract: Understanding the extent and depth of the semantic competence of Large Language Models (LLMs) is at the center of the current scientific agenda in Artificial Intelligence (AI) and Computational Linguistics (CL). We contribute to this endeavor by investigating their knowledge of the part-whole relation, a.k.a. meronymy, which plays a crucial role in lexical organization but is significantly understudied. We used data from ConceptNet relations (Speer et al., 2016) and human-generated semantic feature norms (McRae et al., 2005) to explore the abilities of LLMs to deal with part-whole relations. We employed several methods based on three levels of analysis: i.) behavioral testing via prompting, where we directly queried the models on their knowledge of meronymy, ii.) sentence probability scoring, where we tested models' abilities to discriminate correct (real) and incorrect (asymmetric counterfactual) part-whole relations, and iii.) concept representation analysis in vector space, where we proved the linear organization of the part-whole concept in the embedding and unembedding spaces. These analyses present a complex picture that reveals that the LLMs' knowledge of this relation is only partial. They have just a "quasi-semantic" competence and still fall short of capturing deep inferential properties.

Re-thinking Temporal Search for Long-Form Video Understanding

Jinhui Ye,Zihan Wang,Haosen Sun,Keshigeyan Chandrasegaran,Zane Durante,Cristobal Eyzaguirre,Yonatan Bisk,Juan Carlos Niebles,Ehsan Adeli,Li Fei-Fei,Jiajun Wu,Manling Li

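Task: Study temporal search in long-form video understanding and propose T*, a lightweight keyframe search framework.

Motivation: Address the weak temporal search capabilities of existing long-context vision-language models in long-form video understanding.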

Details

Method: Introduces the LV-Haystack benchmark and designs the T* framework, which recasts temporal search as a spatial search problem. Result: T* significantly improves the performance of existing methods, for example GPT-4o and LLaVA-OneVision-72B on the LongVideoBench XL subset. Conclusion: T* offers an efficient temporal search solution for long-form video understanding and fills a gap in existing research. Abstract: Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding, studying a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). In particular, our contributions are two-fold: First, we formulate temporal search as a Long Video Haystack problem, i.e., finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos given specific queries. To validate our formulation, we create LV-Haystack, the first benchmark containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving only 2.1% temporal F1 score on the LVBench subset. Next, inspired by visual search in images, we re-think temporal searching and propose a lightweight keyframe searching framework, T*, which casts the expensive temporal search as a spatial search problem. T* leverages superior visual localization capabilities typically used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that when integrated with existing methods, T* significantly improves SOTA long-form video understanding performance. Specifically, under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4% on the LongVideoBench XL subset. Our PyTorch code, benchmark dataset and models are included in the Supplementary material.

Scaling Analysis of Interleaved Speech-Text Language Models

Gallil Maimon,Michael Hassid,Amit Roth,Yossi Adi

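Task: Analyze the scaling efficiency of interleaved Speech Language Models (SLMs) and how they differ from textless SLMs.

Motivation: Prior work suggests that SLMs need far more compute and data than text models, yet modern SLMs are often initialized from pre-trained text LMs with speech-text interleaving to enable knowledge transfer, so it is worth testing whether interleaved SLMs scale more efficiently.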

Details

Method: Trains several dozen interleaved SLMs, analyzes their scaling trends, and studies the roles of compute allocation, synthetic data, and TextLM model families. Result: Interleaved SLMs scale more efficiently with compute than textless SLMs, and their scaling dynamics differ markedly, suggesting that more of the compute budget should go to model size than to training tokens. Conclusion: Interleaved SLMs are more efficient in compute and data; the scaled-up model performs comparably to leading models on speech semantic metrics, and the models, samples, and data are open-sourced. Abstract: Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. It predicts that SLMs require much more compute and data compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question - Do interleaved SLMs scale more efficiently than textless-SLMs? In this paper we answer with a resounding yes! We conduct scaling analysis of interleaved SLMs by training several dozen models and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the scaling-dynamics are significantly different from those of textless-SLMs, suggesting one should allocate notably more of the compute budget for increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Results suggest that our scaled up model achieves comparable performance with leading models on speech semantic metrics while using less compute and data than other approaches. We open source models, samples, and data - https://pages.cs.huji.ac.il/adiyoss-lab/sims.

WonderTurbo: Generating Interactive 3D World in 0.72 Seconds

Chaojun Ni,Xiaofeng Wang,Zheng Zhu,Weijie Wang,Haoyun Li,Guosheng Zhao,Jie Li,Wenkang Qin,Guan Huang,Wenjun Mei

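Task: Propose WonderTurbo, a real-time interactive 3D scene generation framework that can generate novel perspectives of 3D scenes within 0.72 seconds.

Motivation: Current 3D generation techniques struggle to achieve real-time interactivity, which limits their potential for immersive virtual experiences.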

Details

Method: Accelerates geometric and appearance modeling with StepSplat (dynamically updated 3D geometric representations), QuickDepth (a lightweight depth completion module), and FastPaint (a 2-step diffusion model). Result: WonderTurbo is 15 times faster than baseline methods while preserving excellent spatial consistency and high-quality output. Conclusion: WonderTurbo addresses the challenge of real-time interactive 3D generation and offers an efficient solution for immersive experiences. Abstract: Interactive 3D generation is gaining momentum and capturing extensive attention for its potential to create immersive virtual experiences. However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds. Specifically, WonderTurbo accelerates both geometric and appearance modeling in 3D scene generation. In terms of geometry, we propose StepSplat, an innovative method that constructs efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds. Additionally, we design QuickDepth, a lightweight depth completion module that provides consistent depth input for StepSplat, further enhancing geometric accuracy. For appearance modeling, we develop FastPaint, a 2-step diffusion model tailored for instant inpainting, which focuses on maintaining spatial appearance consistency. Experimental results demonstrate that WonderTurbo achieves a remarkable 15X speedup compared to baseline methods, while preserving excellent spatial consistency and delivering high-quality output.

DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers

Max Müller-Eberstein,Mike Zhang,Elisa Bassignana,Peter Brunsgaard Trolle,Rob van der Goot

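Task: Evaluate the cultural awareness of large language models (LLMs) in Danish.

Motivation: LLMs show limited cultural awareness in multilingual interactions and may give inappropriate or anglocentric responses to non-English-speaking communities.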

Details

Method: Native Danish speakers prompt different models, yielding 1,038 interactions that are analyzed for cultural adaptation problems. Result: Automatically translated data are insufficient to train or measure cultural adaptation, whereas training on native-speaker data can more than double response acceptance rates. Conclusion: The study releases DaKultur, the first native Danish cultural awareness dataset, and stresses the need for more native-speaker data to improve LLMs' cultural adaptation. Abstract: Large Language Models (LLMs) have seen widespread societal adoption. However, while they are able to interact with users in languages beyond English, they have been shown to lack cultural awareness, providing anglocentric or inappropriate responses for underrepresented language communities. To investigate this gap and disentangle linguistic versus cultural proficiency, we conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation: Particularly, how currently employed automatically translated data are insufficient to train or measure cultural adaptation, and how training on native-speaker data can more than double response acceptance rates. We release our study data as DaKultur - the first native Danish cultural awareness dataset.

MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception

Wenzhuo Liu,Wenshuo Wang,Yicheng Qiao,Qiannan Guo,Jiayin Zhu,Pengfei Li,Zilong Chen,Huiming Yang,Zhiwei Li,Lening Wang,Tiao Tan,Huaping Liu

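Task: Propose MMTL-UniAD, a unified multimodal multi-task learning framework that simultaneously recognizes driver behavior, driver emotion, vehicle behavior, and traffic context.

Motivation: Existing work neglects the potential benefits of jointly learning driver state and traffic context, and suffers from negative transfer between tasks.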

Details

Method: Introduces a multi-axis region attention network to extract global context-sensitive features, and a dual-branch multimodal embedding to learn task-shared and task-specific features. Result: On the AIDE dataset, MMTL-UniAD outperforms existing methods on all four tasks. Conclusion: By improving feature extraction and adaptively adjusting parameters, MMTL-UniAD mitigates negative transfer between tasks and improves multi-task learning performance. Abstract: Advanced driver assistance systems require a comprehensive understanding of the driver's mental/physical state and traffic context, but existing works often neglect the potential benefits of joint learning between these tasks. This paper proposes MMTL-UniAD, a unified multi-modal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, traffic smooth). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To address this, we introduce two key components into the framework: one is the multi-axis region attention network to extract global context-sensitive features, and the other is the dual-branch multimodal embedding to learn multimodal embeddings from both task-shared and task-specific features. The former uses a multi-attention mechanism to extract task-relevant features, mitigating negative transfer caused by task-unrelated features. The latter employs a dual-branch structure to adaptively adjust task-shared and task-specific parameters, enhancing cross-task knowledge transfer while reducing task conflicts. We assess MMTL-UniAD on the AIDE dataset, using a series of ablation studies, and show that it outperforms state-of-the-art methods across all four tasks. The code is available at https://github.com/Wenzhuo-Liu/MMTL-UniAD.

AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology

Xiang Feng,Wentao Jiang,Zengmao Wang,Yong Luo,Pingbo Xu,Baosheng Yu,Hua Jin,Bo Du,Jing Zhang

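Task: Systematically evaluate the reasoning capabilities of large language models (LLMs) in anesthesiology and analyze the key factors that influence their performance.

Motivation: Although the application of LLMs in medicine has drawn wide attention, their reasoning capabilities in specialized domains such as anesthesiology remain underexplored.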

Details

Method: Introduces AnesBench, a cross-lingual benchmark that assesses anesthesiology-related reasoning at three levels, and runs experiments on the effects of model characteristics, training strategies, and reasoning techniques. Result: The experiments analyze how model scale, chain-of-thought length, and language transferability affect performance, and evaluate the effectiveness of different training strategies and reasoning techniques. Conclusion: AnesBench, together with the related datasets and code, will be publicly released to support research on LLMs in anesthesiology. Abstract: The application of large language models (LLMs) in the medical field has gained significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. In this paper, we systematically evaluate the reasoning capabilities of LLMs in anesthesiology and analyze key factors influencing their performance. To this end, we introduce AnesBench, a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Through extensive experiments, we first explore how model characteristics, including model scale, Chain of Thought (CoT) length, and language transferability, affect reasoning performance. Then, we further evaluate the effectiveness of different training strategies, leveraging our curated anesthesiology-related dataset, including continuous pre-training (CPT) and supervised fine-tuning (SFT). We also investigate how test-time reasoning techniques, such as Best-of-N sampling and beam search, influence reasoning performance, and assess the impact of reasoning-enhanced model distillation, specifically DeepSeek-R1. We will publicly release AnesBench, along with our CPT and SFT training datasets and evaluation code at https://github.com/MiliLab/AnesBench.
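
Best-of-N sampling, one of the test-time techniques the abstract names, can be sketched in a few lines. This is a generic illustration rather than the benchmark's evaluation code: `generate` and `score` are hypothetical callables standing in for a sampling LLM and a verifier or reward model.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidate answers and return the highest-scoring one.

    `generate(prompt, seed)` and `score(prompt, answer)` are placeholders
    (assumptions); any sampler and any scorer with these shapes will do.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent candidates, using the index as a sampling seed.
    candidates: List[str] = [generate(prompt, seed) for seed in range(n)]
    # Score every candidate and keep the best (first one wins on ties).
    scored = [(score(prompt, answer), answer) for answer in candidates]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score
```

Cost grows linearly with `n`, which is why the technique is usually compared against spending the same compute on a larger model or longer reasoning.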

MinkOcc: Towards real-time label-efficient semantic occupancy prediction

Samuel Sze,Daniele De Martini,Lars Kunze

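Task: Develop MinkOcc, a multi-modal 3D semantic occupancy prediction framework that reduces reliance on dense 3D annotations.

Motivation: Dense 3D annotation is labor- and resource-intensive, so label-efficient or label-free approaches are needed.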

Details

Method: Proposes a two-step semi-supervised training procedure: a small set of 3D annotations warm-starts training, then supervision continues with simpler-to-annotate accumulated LiDAR sweeps and images labelled by vision foundation models. LiDAR and camera data are fused, and sparse convolution networks enable real-time prediction. Result: MinkOcc reduces reliance on manual labeling by 90% while maintaining competitive accuracy. Conclusion: Efficient in both supervision and computation, MinkOcc is a step toward broader real-world deployment of 3D semantic occupancy prediction in autonomous driving. Abstract: Developing 3D semantic occupancy prediction models often relies on dense 3D annotations for supervised learning, a process that is both labor and resource-intensive, underscoring the need for label-efficient or even label-free approaches. To address this, we introduce MinkOcc, a multi-modal 3D semantic occupancy prediction framework for cameras and LiDARs that proposes a two-step semi-supervised training procedure. Here, a small dataset of explicit 3D annotations warm-starts the training process; then, the supervision is continued by simpler-to-annotate accumulated LiDAR sweeps and images -- semantically labelled through vision foundational models. MinkOcc effectively utilizes these sensor-rich supervisory cues and reduces reliance on manual labeling by 90% while maintaining competitive accuracy. In addition, the proposed model incorporates information from LiDAR and camera data through early fusion and leverages sparse convolution networks for real-time prediction. With its efficiency in both supervision and computation, we aim to extend MinkOcc beyond curated datasets, enabling broader real-world deployment of 3D semantic occupancy prediction in autonomous driving.

Adapting Large Language Models for Multi-Domain Retrieval-Augmented-Generation

Alexandre Misrahi,Nadezhda Chirkova,Maxime Louis,Vassilina Nikoulina

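Task: Introduce a diverse benchmark and systematically test how well Retrieval-Augmented Generation (RAG) generalizes across domains.

Motivation: Address the lack of diverse benchmarks and the poor out-of-domain generalization that multi-domain applications face.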

Details

Method: Uses a diverse benchmark drawn from 8 sources and covering 13 domains, and improves generalization through sequence-level distillation with teacher-generated labels. Result: Standard fine-tuning generalizes poorly out of domain, whereas sequence-level distillation improves out-of-domain performance. Conclusion: Sequence-level distillation is a key strategy for improving multi-domain RAG robustness. Abstract: Retrieval-Augmented Generation (RAG) enhances LLM factuality, but multi-domain applications face challenges like lack of diverse benchmarks and poor out-of-domain generalization. The first contribution of this work is to introduce a diverse benchmark comprising a variety of question-answering tasks from 8 sources and covering 13 domains. Our second contribution consists in systematically testing out-of-domain generalization for typical RAG tuning strategies. While our findings reveal that standard fine-tuning fails to generalize effectively, we show that sequence-level distillation with teacher-generated labels improves out-of-domain performance by providing more coherent supervision. Our findings highlight key strategies for improving multi-domain RAG robustness.
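
Sequence-level distillation with teacher-generated labels amounts to replacing the gold answers in the fine-tuning set with full answers written by a stronger teacher model. The sketch below shows only the data-building step, under stated assumptions: `retrieve` and `teacher_generate` are hypothetical callables, and the prompt template is invented for illustration rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RagExample:
    question: str
    passages: List[str]
    target: str  # teacher-written answer used as the training label


def build_prompt(question: str, passages: List[str]) -> str:
    """Illustrative prompt layout: numbered passages, then the question."""
    context = "\n".join("[%d] %s" % (i + 1, p) for i, p in enumerate(passages))
    return "Context:\n%s\n\nQuestion: %s\nAnswer:" % (context, question)


def build_distillation_set(
    questions: List[str],
    retrieve: Callable[[str], List[str]],
    teacher_generate: Callable[[str], str],
    top_k: int = 3,
) -> List[RagExample]:
    """Label each question with the teacher's full generated answer."""
    examples = []
    for question in questions:
        passages = retrieve(question)[:top_k]
        # The teacher sees the same retrieved context the student will see.
        target = teacher_generate(build_prompt(question, passages))
        examples.append(RagExample(question, passages, target))
    return examples
```

The student is then fine-tuned on `(build_prompt(question, passages), target)` pairs with an ordinary next-token loss; only the source of the labels changes.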

Generative Classifier for Domain Generalization

Shaocong Long,Qianyu Zhou,Xiangtai Li,Chenhao Ying,Yunhai Tong,Lizhuang Ma,Yuan Luo,Dacheng Tao

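Task: Propose Generative Classifier-driven Domain Generalization (GCDG), a generative-classifier approach that addresses the weakness of conventional domain-invariance methods in capturing domain-specific information.

Motivation: Conventional domain generalization methods rely too heavily on domain invariance and overlook domain-specific information, so they perform poorly when the data distribution is multi-modal.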

Details

Method: Builds a generative classifier from Gaussian Mixture Models (GMMs), combining three modules: Heterogeneity Learning Classifier (HLC), Spurious Correlation Blocking (SCB), and Diverse Component Balancing (DCB). Result: GCDG performs well on five domain generalization benchmarks and one face anti-spoofing dataset, and integrates seamlessly into existing methods. Conclusion: By capturing domain-specific information, GCDG reduces the target risk and improves generalizability. Abstract: Domain generalization (DG) aims to improve the generalizability of computer vision models toward distribution shifts. The mainstream DG methods focus on learning domain invariance; however, such methods overlook the potential inherent in domain-specific information. While the prevailing practice of discriminative linear classifier has been tailored to domain-invariant features, it struggles when confronted with diverse domain-specific information, e.g., intra-class shifts, that exhibits multi-modality. To address these issues, we explore the theoretical implications of relying on domain invariance, revealing the crucial role of domain-specific information in mitigating the target risk for DG. Drawing from these insights, we propose Generative Classifier-driven Domain Generalization (GCDG), introducing a generative paradigm for the DG classifier based on Gaussian Mixture Models (GMMs) for each class across domains. GCDG consists of three key modules: Heterogeneity Learning Classifier (HLC), Spurious Correlation Blocking (SCB), and Diverse Component Balancing (DCB). Concretely, HLC attempts to model the feature distributions and thereby capture valuable domain-specific information via GMMs. SCB identifies the neural units containing spurious correlations and perturbs them, mitigating the risk of HLC learning spurious patterns. Meanwhile, DCB ensures a balanced contribution of components in HLC, preventing the underestimation or neglect of critical components. In this way, GCDG excels in capturing the nuances of domain-specific information characterized by diverse distributions. GCDG demonstrates the potential to reduce the target risk and encourage flat minima, improving the generalizability. Extensive experiments show GCDG's comparable performance on five DG benchmarks and one face anti-spoofing dataset, seamlessly integrating into existing DG methods with consistent improvements.
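
To make the idea of a GMM-based generative classifier concrete, the sketch below scores a feature vector under one Gaussian mixture per class and picks the class with the highest posterior. It is a minimal illustration with fixed, diagonal-covariance mixtures; how GCDG fits the mixtures, and the SCB and DCB modules, are not covered, and all names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One mixture component: (weight, mean vector, per-dimension variance).
Component = Tuple[float, Sequence[float], Sequence[float]]


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float],
                      var: Sequence[float]) -> float:
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture(x: Sequence[float], components: List[Component]) -> float:
    """Log density of a mixture, computed with the log-sum-exp trick."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def classify(x: Sequence[float],
             class_mixtures: Dict[str, List[Component]],
             class_priors: Dict[str, float]) -> str:
    """Return argmax over classes of log p(class) + log p(x | class)."""
    scores = {
        label: math.log(class_priors[label]) + log_mixture(x, comps)
        for label, comps in class_mixtures.items()
    }
    return max(scores, key=scores.get)
```

Because each class owns several components, a class whose features form separate clusters (for example, one per source domain) can still be modelled, which a single linear decision boundary cannot do.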

Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation

Chuanqi Cheng,Jian Guan,Wei Wu,Rui Yan

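Task: Propose ViLaMP, a hierarchical video-language model that processes long videos efficiently while preserving key information.

Motivation: Existing methods are computationally expensive on long videos and tend to sacrifice temporal dependencies or semantic information.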

Details

Method: Applies the principle of differential distillation, combining differential keyframe selection with differential feature merging. Result: ViLaMP performs strongly on four video understanding benchmarks, particularly on long-form content, with high computational efficiency. Conclusion: By processing long videos at mixed precision, ViLaMP substantially improves computational efficiency while maintaining high performance. Abstract: Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLaMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLaMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLaMP's superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLaMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance.
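
The keyframe criterion described above (high query relevance, low redundancy with frames already chosen) can be approximated by a simple greedy rule. The sketch below is an assumption-laden illustration, not ViLaMP's actual algorithm: it uses cosine similarity on precomputed frame and query features, and the `max_similarity` threshold is a hypothetical parameter.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_keyframes(
    frame_feats: List[Sequence[float]],
    query_feat: Sequence[float],
    budget: int,
    max_similarity: float = 0.9,
) -> List[int]:
    """Greedily pick up to `budget` relevant, mutually distinct frames."""
    relevance = [cosine(f, query_feat) for f in frame_feats]
    # Visit frames from most to least relevant to the query.
    order = sorted(range(len(frame_feats)), key=lambda i: relevance[i], reverse=True)
    selected: List[int] = []
    for i in order:
        if len(selected) >= budget:
            break
        # Skip frames that nearly duplicate a keyframe already chosen.
        if all(cosine(frame_feats[i], frame_feats[j]) < max_similarity for j in selected):
            selected.append(i)
    return sorted(selected)  # return indices in temporal order
```

Frames that are not selected would, in the paper's scheme, be reduced to their most query-salient features rather than discarded; that merging step is outside this sketch.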

Beyond Conventional Transformers: The Medical X-ray Attention (MXA) Block for Improved Multi-Label Diagnosis Using Knowledge Distillation

Amit Rand,Hadi Ibrahim

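Task: Propose the MXA block, an attention mechanism specialized for multi-label X-ray classification, and embed it in the EfficientViT architecture.

Motivation: Medical imaging, X-rays in particular, often requires detecting several abnormalities at once, which makes multi-label classification essential for clinical applications.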

Details

Method: Designs the MXA block, which enhances conventional MHSA by combining detailed local information with broader global context, embeds it in the EfficientViT architecture, and applies knowledge distillation. Result: On the CheXpert dataset the model reaches an AUC of 0.85, an absolute gain of 0.19 over the baseline (AUC = 0.66), reported as an approximately 233% relative improvement over random guessing. Conclusion: The MXA block significantly improves multi-label X-ray classification and offers a more efficient tool for clinical diagnosis. Abstract: Medical imaging, particularly X-ray analysis, often involves detecting multiple conditions simultaneously within a single scan, making multi-label classification crucial for real-world clinical applications. We present the Medical X-ray Attention (MXA) block, a novel attention mechanism tailored specifically to address the unique challenges of X-ray abnormality detection. The MXA block enhances traditional Multi-Head Self Attention (MHSA) by integrating a specialized module that efficiently captures both detailed local information and broader global context. To the best of our knowledge, this is the first work to propose a task-specific attention mechanism for diagnosing chest X-rays, as well as to attempt multi-label classification using an Efficient Vision Transformer (EfficientViT). By embedding the MXA block within the EfficientViT architecture and employing knowledge distillation, our proposed model significantly improves performance on the CheXpert dataset, a widely used benchmark for multi-label chest X-ray abnormality detection. Our approach achieves an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 compared to our baseline model's AUC of 0.66, corresponding to an approximately 233% relative improvement over random guessing (AUC = 0.5).

Cognitive Memory in Large Language Models

Lianlei Shan,Shixian Luo,Zezhou Zhu,Yu Yuan,Yong Wu

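Task: Analyze memory mechanisms in large language models (LLMs), including their categorization, implementation methods, and management strategies.

Motivation: Examine why memory mechanisms matter for context-rich responses, reduced hallucinations, and improved efficiency in LLMs.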

Details

Method: Categorizes memory into sensory, short-term, and long-term memory, and discusses memory implementations based on text, the KV cache, parameters, and hidden states. Result: Surveys a range of memory management strategies and techniques, such as compression and shared attention mechanisms, and their use in LLMs. Conclusion: The paper gives a comprehensive account of research on LLM memory mechanisms, stresses their importance, and points to future research directions. Abstract: This paper examines memory mechanisms in Large Language Models (LLMs), emphasizing their importance for context-rich responses, reduced hallucinations, and improved efficiency. It categorizes memory into sensory, short-term, and long-term, with sensory memory corresponding to input prompts, short-term memory processing immediate context, and long-term memory implemented via external databases or structures. The text-based memory section covers acquisition (selection and summarization), management (updating, accessing, storing, and resolving conflicts), and utilization (full-text search, SQL queries, semantic search). The KV cache-based memory section discusses selection methods (regularity-based summarization, score-based approaches, special token embeddings) and compression techniques (low-rank compression, KV merging, multimodal compression), along with management strategies like offloading and shared attention mechanisms. Parameter-based memory methods (LoRA, TTT, MoE) transform memories into model parameters to enhance efficiency, while hidden-state-based memory approaches (chunk mechanisms, recurrent transformers, Mamba model) improve long-text processing by combining RNN hidden states with current methods. Overall, the paper offers a comprehensive analysis of LLM memory mechanisms, highlighting their significance and future research directions.

Trung Thanh Nguyen,Yasutomo Kawanishi,Vijay John,Takahiro Komamizu,Ichiro Ide

Task: Propose MultiTSF, a Transformer-based multi-modal multi-view sensor fusion method for action recognition.

Motivation: Address the shortcomings of existing methods on real-world challenges such as diverse environmental conditions, strict sensor synchronization, and the need for fine-grained annotations.

Details

Method: Use a Transformer to dynamically model inter-view relationships and capture temporal dependencies across views, and introduce a human detection module that generates pseudo-labels to improve feature learning. Result: Experiments on the MultiSensor-Home and MM-Office datasets show that MultiTSF outperforms existing methods in both video sequence-level and frame-level action recognition. Conclusion: Through dynamic modeling and pseudo-label guidance, MultiTSF substantially improves multi-modal multi-view action recognition. Abstract: Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. However, existing methods often fall short of addressing real-world challenges such as diverse environmental conditions, strict sensor synchronization, and the need for fine-grained annotations. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF). The proposed method leverages a Transformer to dynamically model inter-view relationships and capture temporal dependencies across multiple views. Additionally, we introduce a Human Detection Module to generate pseudo-ground-truth labels, enabling the model to prioritize frames containing human activity and enhance spatial feature learning. Comprehensive experiments conducted on our in-house MultiSensor-Home dataset and the existing MM-Office dataset demonstrate that MultiTSF outperforms state-of-the-art methods in both video sequence-level and frame-level action recognition settings.

Inference-Time Scaling for Generalist Reward Modeling

Zijun Liu,Peiyi Wang,Runxin Xu,Shirong Ma,Chong Ruan,Peng Li,Yang Liu,Yu Wu

Task: Study how to improve the inference-time scalability of reward modeling (RM) for large language models (LLMs) on general queries through better RM approaches and learning methods.

Motivation: Reinforcement learning (RL) is widely used in LLM post-training, but obtaining accurate reward signals across different domains remains a key challenge.

Details

Method: Adopt pointwise generative reward modeling (GRM) and Self-Principled Critique Tuning (SPCT), combined with online RL and parallel sampling. Result: The resulting DeepSeek-GRM models perform strongly on multiple RM benchmarks, outperforming existing methods without severe biases. Conclusion: SPCT significantly improves the quality and scalability of GRMs; generalist reward systems still need further work. Abstract: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that proper learning methods could enable effective inference-time scalability. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced.
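
The parallel-sampling and meta-RM voting step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `vote_with_meta_rm`, the top-k filter, and the choice to sum pointwise scores are all assumptions made for illustration, and the scores are made-up toy values.

```python
def vote_with_meta_rm(samples, meta_scores, k_meta):
    """Aggregate several sampled reward judgments into one score per response.

    samples:     list of dicts, each mapping a response id to the pointwise
                 score from one sampled reward generation (hypothetical data).
    meta_scores: one quality estimate per sample, standing in for a meta RM.
    k_meta:      how many of the best-rated samples are allowed to vote.
    """
    # Rank the sampled judgments by the (hypothetical) meta RM score.
    ranked = sorted(range(len(samples)), key=lambda i: meta_scores[i], reverse=True)
    kept = ranked[:k_meta]

    # Voting here means summing pointwise scores over the kept samples.
    totals = {}
    for i in kept:
        for response_id, score in samples[i].items():
            totals[response_id] = totals.get(response_id, 0) + score
    return totals


# Four sampled judgments for two candidate responses "a" and "b".
samples = [{"a": 7, "b": 5}, {"a": 6, "b": 8}, {"a": 9, "b": 4}, {"a": 2, "b": 10}]
meta_scores = [0.9, 0.4, 0.8, 0.1]
totals = vote_with_meta_rm(samples, meta_scores, k_meta=2)
best = max(totals, key=totals.get)
```

Raising the number of samples spends more inference compute on the same query, while the filter keeps low-quality judgments from dominating the vote.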

Moment Quantization for Video Temporal Grounding

Xiaolong Sun,Le Wang,Sanping Zhou,Liushuai Shi,Kun Xia,Mengnan Liu,Yabing Wang,Gang Hua

Task: Propose MQVTG, a moment-quantization-based video temporal grounding method that sharpens the distinction between relevant and irrelevant moments.

Motivation: Existing methods that learn continuous features discriminate weakly between foreground and background features and struggle to separate relevant from irrelevant moments.

Details

Method: Quantize the input video into discrete vectors and maintain a learnable moment codebook; treat the matching of video moments to codewords as a clustering process, which avoids the information loss of direct hard quantization; strengthen the codebook with prior-initialization and joint-projection strategies. Result: Experiments on six popular benchmarks show that MQVTG significantly outperforms existing methods and effectively separates relevant from irrelevant features. Conclusion: Moment quantization improves the discrimination between relevant and irrelevant moments in temporal grounding, and MQVTG is easy to integrate into existing models. Abstract: Video temporal grounding is a critical video understanding task, which aims to localize moments relevant to a language description. The challenge of this task lies in distinguishing relevant and irrelevant moments. Previous methods focused on learning continuous features exhibit weak differentiation between foreground and background features. In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Specifically, MQVTG maintains a learnable moment codebook, where each video moment matches a codeword. Considering the visual diversity, i.e., various visual expressions for the same moment, MQVTG treats moment-codeword matching as a clustering process without using discrete vectors, avoiding the loss of useful information from direct hard quantization. Additionally, we employ effective prior-initialization and joint-projection strategies to enhance the maintained moment codebook. With its simple implementation, the proposed method can be integrated into existing temporal grounding models as a plug-and-play component. Extensive experiments on six popular benchmarks demonstrate the effectiveness and generalizability of MQVTG, significantly outperforming state-of-the-art methods. Further qualitative analysis shows that our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination.
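
To make the contrast between hard quantization and clustering-style matching concrete, here is a minimal sketch. It assumes a softmax over negative squared distances as the soft assignment; the abstract does not specify the matching rule, so `soft_assign`, `hard_assign`, and the temperature parameter are illustrative choices, not the MQVTG formulation.

```python
import math


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hard_assign(moment, codebook):
    """Hard quantization: index of the single nearest codeword."""
    return min(range(len(codebook)), key=lambda i: squared_distance(moment, codebook[i]))


def soft_assign(moment, codebook, temperature=1.0):
    """Clustering-style matching: a weight for every codeword.

    The continuous moment feature itself is left untouched; only the
    assignment weights are computed, so no information is discarded.
    """
    logits = [-squared_distance(moment, code) / temperature for code in codebook]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


codebook = [[1.0, 0.0], [0.0, 1.0]]   # toy two-entry moment codebook
moment = [0.9, 0.1]                   # toy moment feature
weights = soft_assign(moment, codebook)
nearest = hard_assign(moment, codebook)
```

A training objective could then pull each moment toward its highest-weighted codeword without replacing the feature by the codeword.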

UNDO: Understanding Distillation as Optimization

Kushal Jain,Piyushi Goyal,Kumar Shridhar

Task: Propose UNDO, an iterative distillation framework that improves student learning by dynamically refining the teacher model's explanations.

Motivation: Standard one-shot distillation underperforms because teacher-generated rationales do not match the student's specific learning needs.

Details

Method: Iteratively identify the student's errors and prompt the teacher model to refine its explanations, directly targeting the student's learning deficiencies. Result: On mathematical and commonsense reasoning tasks, UNDO improves over standard methods by up to 20%, and the refined data also works for other student models. Conclusion: UNDO reframes knowledge distillation as an iterative teacher-student interaction and significantly improves distillation through dynamic refinement. Abstract: Knowledge distillation has emerged as an effective strategy for compressing large language models' (LLMs) knowledge into smaller, more efficient student models. However, standard one-shot distillation methods often produce suboptimal results due to a mismatch between teacher-generated rationales and the student's specific learning requirements. In this paper, we introduce the UNDO: UNderstanding Distillation as Optimization framework, designed to bridge this gap by iteratively identifying the student's errors and prompting the teacher to refine its explanations accordingly. Each iteration directly targets the student's learning deficiencies, motivating the teacher to provide tailored and enhanced rationales that specifically address these weaknesses. Empirical evaluations on various challenging mathematical and commonsense reasoning tasks demonstrate that our iterative distillation method, UNDO, significantly outperforms standard one-step distillation methods, achieving performance gains of up to 20%. Additionally, we show that teacher-generated data refined through our iterative process remains effective even when applied to different student models, underscoring the broad applicability of our approach. Our work fundamentally reframes knowledge distillation as an iterative teacher-student interaction, effectively leveraging dynamic refinement by the teacher for better knowledge distillation.
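
The iterative loop described in the abstract (train the student, find its errors, ask the teacher to refine) can be sketched as plain control flow. Everything below is a toy stand-in: the callables `teacher_explain`, `train_student`, and `student_answer` are hypothetical interfaces, and the toy teacher and student exist only so the loop can run end to end.

```python
def iterative_distillation(problems, teacher_explain, train_student, student_answer, max_rounds=3):
    """Refine teacher rationales until the student stops making errors.

    problems: dict mapping each question to its gold answer.
    """
    # Round 0: ordinary one-shot distillation data, no feedback yet.
    rationales = {q: teacher_explain(q, None) for q in problems}
    student = train_student(rationales)

    for _ in range(max_rounds):
        # Collect the questions the current student still gets wrong.
        mistakes = {}
        for question, gold in problems.items():
            predicted = student_answer(student, question)
            if predicted != gold:
                mistakes[question] = predicted
        if not mistakes:
            break
        # Ask the teacher to rewrite only the rationales that failed,
        # conditioning on the student's wrong answer.
        for question, wrong in mistakes.items():
            rationales[question] = teacher_explain(question, wrong)
        student = train_student(rationales)
    return rationales, student


# Toy stand-ins (assumptions for illustration only).
GOLD = {"2 + 2": "4", "3 * 3": "9"}

def toy_teacher(question, wrong_answer):
    if wrong_answer is None:
        return "hint only"
    return f"worked solution: {GOLD[question]}"

def toy_train(rationales):
    return dict(rationales)  # the toy student just memorizes its training data

def toy_answer(student, question):
    text = student[question]
    return text.split(": ")[1] if text.startswith("worked solution") else "?"

final_rationales, final_student = iterative_distillation(GOLD, toy_teacher, toy_train, toy_answer)
```

In a real setting the three callables would wrap LLM calls and a fine-tuning run; the loop structure is the only part this sketch is meant to convey.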

Trung Thanh Nguyen,Yasutomo Kawanishi,Vijay John,Takahiro Komamizu,Ichiro Ide

Task: Propose MultiTSF, a Transformer-based multi-modal multi-view action recognition method, and introduce the MultiSensor-Home dataset.

Motivation: Current datasets and methods face real-world challenges such as complex environmental conditions, asynchronous data streams, and missing frame-level annotations, and they struggle to model inter-view relationships and to enhance spatial feature learning.

Details

Method: Use a Transformer-based fusion mechanism to dynamically model inter-view relationships and integrate an external human detection module to enhance spatial feature learning. Result: Experiments on the MultiSensor-Home and MM-Office datasets show that MultiTSF outperforms existing methods. Conclusion: MultiTSF offers clear advantages for multi-modal multi-view action recognition and advances the field. Abstract: Multi-modal multi-view action recognition is a rapidly growing field in computer vision, offering significant potential for applications in surveillance. However, current datasets often fail to address real-world challenges such as wide-area environmental conditions, asynchronous data streams, and the lack of frame-level annotations. Furthermore, existing methods face difficulties in effectively modeling inter-view relationships and enhancing spatial feature learning. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method and introduce the MultiSensor-Home dataset, a novel benchmark designed for comprehensive action recognition in home environments. The MultiSensor-Home dataset features untrimmed videos captured by distributed sensors, providing high-resolution RGB and audio data along with detailed multi-view frame-level action labels. The proposed MultiTSF method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships. Furthermore, the method integrates an external human detection module to enhance spatial feature learning. Experiments on the MultiSensor-Home and MM-Office datasets demonstrate the superiority of MultiTSF over state-of-the-art methods. The quantitative and qualitative results highlight the effectiveness of the proposed method in advancing real-world multi-modal multi-view action recognition.

Leveraging LLM For Synchronizing Information Across Multilingual Tables

Siddharth Khincha,Tushar Kataria,Ankita Anand,Dan Roth,Vivek Gupta

Task: Explore large language models (LLMs) for multilingual information synchronization, in particular zero-shot prompting as a scalable solution.

Motivation: Address the imbalance that non-English speakers face, since much information is concentrated in high-resource languages (such as English and French) and Wikipedia content in low-resource languages is often outdated or incomplete.

Details

Method: Introduce the Information Updation dataset, which simulates the real-world process of updating outdated Wikipedia tables, and evaluate LLM performance on it; propose a task decomposition strategy to improve coherence and accuracy. Result: The proposed method outperforms existing baselines on Information Updation (1.79%) and Information Addition (20.58%), showing the model's strength in dynamically updating and enriching data. Conclusion: LLMs show promise for multilingual information synchronization, and task decomposition significantly improves performance. Abstract: The vast amount of online information today poses challenges for non-English speakers, as much of it is concentrated in high-resource languages such as English and French. Wikipedia reflects this imbalance, with content in low-resource languages frequently outdated or incomplete. Recent research has sought to improve cross-language synchronization of Wikipedia tables using rule-based methods. These approaches can be effective, but they struggle with complexity and generalization. This paper explores large language models (LLMs) for multilingual information synchronization, using zero-shot prompting as a scalable solution. We introduce the Information Updation dataset, simulating the real-world process of updating outdated Wikipedia tables, and evaluate LLM performance. Our findings reveal that single-prompt approaches often produce suboptimal results, prompting us to introduce a task decomposition strategy that enhances coherence and accuracy. Our proposed method outperforms existing baselines, particularly in Information Updation (1.79%) and Information Addition (20.58%), highlighting the model's strength in dynamically updating and enriching data across architectures.

OmniCam: Unified Multimodal Video Generation via Camera Control

Xiaoda Yang,Jiayang Xu,Kaixuan Luan,Xinyu Zhan,Hongshun Qiu,Shijun Shi,Hao Li,Shuai Yang,Li Zhang,Checheng Yu,Cewu Lu,Lixin Yang

Task: Propose OmniCam, a unified multimodal camera control framework for generating spatio-temporally consistent videos.

Motivation: Existing camera control methods suffer from complex interaction and limited control capabilities.

Details

Method: Leverage large language models and video diffusion models, supporting multiple combinations of input modalities (text, video, image). Result: Experiments show that OmniCam achieves state-of-the-art performance in high-quality camera-controlled video generation. Conclusion: OmniCam achieves precise control over camera motion through multimodal input and delivers strong performance. Abstract: Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention. However, existing methods face challenges such as complex interaction and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera control framework. Leveraging large language models and video diffusion models, OmniCam generates spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or video with expected trajectory as camera path guidance, and image or video as content reference, enabling precise control over camera motion. To facilitate the training of OmniCam, we introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results demonstrate that our model achieves state-of-the-art performance in high-quality camera-controlled video generation across various metrics.

Language Models reach higher Agreement than Humans in Historical Interpretation

Fabio Celli,Georgios Spathulas

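Task: Compare how humans and large language models perform on historical annotation.

Motivation: Examine differences in cultural bias and consensus between humans and large language models in historical annotation, in order to advance large-scale annotation and quantitative analysis in the digital humanities.

Details

Method: Compare annotations and interpretations of historical facts produced by humans and by large language models. Result: Large language models reach higher consensus when interpreting historical facts from short texts, but they skip information or hallucinate; humans disagree because of personal biases. Conclusion: The study gives the digital humanities new tools for large-scale annotation and quantitative analysis of historical data, while encouraging critical thinking about bias. Abstract: This paper compares historical annotations by humans and Large Language Models. The findings reveal that both exhibit some cultural bias, but Large Language Models achieve a higher consensus on the interpretation of historical facts from short texts. While humans tend to disagree on the basis of their personal biases, Large Language Models disagree when they skip information or produce hallucinations. These findings have significant implications for digital humanities, enabling large-scale annotation and quantitative analysis of historical data. This offers new educational and research opportunities to explore historical interpretations from different Language Models, fostering critical thinking about bias.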

ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation

Yuan Zhou,Shilong Jin,Litao Hua,Wanjun Lv,Haoran Duan,Jungong Han

Task: Propose the ConsDreamer framework to address view bias in zero-shot text-to-3D generation.

Motivation: Existing methods use 3D Gaussian Splatting with score distillation to produce multi-view renderings from pre-trained text-to-image (T2I) models, but view biases lead to inconsistent 3D generation, most notably the multi-face Janus problem.

Details

Method: ConsDreamer mitigates view bias by refining both the conditional and unconditional terms of score distillation, using a View Disentanglement Module (VDM) and a similarity-based partial order loss. Result: Experiments show that ConsDreamer effectively mitigates the multi-face Janus problem and outperforms existing methods in visual quality and consistency. Conclusion: ConsDreamer is an effective solution to view bias in text-to-3D generation. Abstract: Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel framework that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise camera parameters; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer effectively mitigates the multi-face Janus problem in text-to-3D generation, outperforming existing methods in both visual quality and consistency.

LexPam: Legal Procedure Awareness-Guided Mathematical Reasoning

Kepu Zhang,Guofu Xie,Weijie Yu,Mingyue Xu,Xu Tang,Yaxin Li,Jun Xu

Task: Propose LexNum, the first Chinese legal mathematical reasoning dataset, and build on it the LexPam algorithm to improve LLMs' mathematical reasoning in legal scenarios.

Motivation: Existing legal LLMs have not been trained for mathematical reasoning, and there is no dataset for validating and improving this ability.

Details

Method: Build the LexNum dataset, test existing models on it, and introduce the LexPam algorithm for training. Result: Existing models perform poorly on legal mathematical reasoning tasks, and LexPam improves LLM performance on them. Conclusion: LexNum and LexPam fill a gap in legal mathematical reasoning and improve the practicality and credibility of LLMs. Abstract: The legal mathematical reasoning ability of LLMs is crucial when applying them to real-world scenarios, as it directly affects the credibility of the LLM. While existing legal LLMs can perform general judicial question answering, their legal mathematical reasoning capabilities have not been trained. Open-domain reasoning models, though able to generate detailed calculation steps, do not follow the reasoning logic required for legal scenarios. Additionally, there is currently a lack of legal mathematical reasoning datasets to help validate and enhance LLMs' reasoning abilities in legal contexts. To address these issues, we propose the first Chinese legal Mathematical Reasoning Dataset, LexNum, which includes three common legal mathematical reasoning scenarios: economic compensation, work injury compensation, and traffic accident compensation. Based on LexNum, we tested the performance of existing legal LLMs and reasoning LLMs, and introduced LexPam, a reinforcement learning algorithm guided by legal procedural awareness to train LLMs, enhancing their mathematical reasoning abilities in legal scenarios. Experiments on tasks in the three legal scenarios show that the performance of existing legal LLMs and reasoning models in legal mathematical reasoning tasks is unsatisfactory. LexPam can enhance the LLM's ability in these tasks.

X-Capture: An Open-Source Portable Device for Multi-Sensory Learning

Samuel Clarke,Suzannah Wistreich,Yanjie Ze,Jiajun Wu

Task: Develop X-Capture, a low-cost portable device for collecting multi-sensory data (RGBD images, tactile readings, impact audio) in real-world environments.

Motivation: Existing datasets are usually limited to controlled environments, simulated objects, or restricted modality pairings, and cannot meet the need of AI and robotic systems for diverse, high-quality multi-sensory data.

Details

Method: Design and build the X-Capture device, which can be assembled with consumer-grade tools for under $1,000, and use it to collect multi-sensory data on 500 everyday objects in real-world environments. Result: A sample dataset of 3,000 data points; experiments show that both the quantity and the sensory breadth of the data are valuable for pretraining and fine-tuning multi-modal representations. Conclusion: X-Capture lays the groundwork for research on human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability. Abstract: Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for both pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.

LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect

Hedi Naouara,Jean-Pierre Lorré,Jérôme Louradour

Task: Develop automatic speech recognition (ASR) systems for Tunisian Arabic Dialect.

Motivation: The dialect's linguistic complexity and the scarcity of annotated speech datasets are the main challenges.

Details

Method: Present the LinTO audio and textual datasets, which capture the phonological and lexical features of the dialect and include texts from many sources and real-world audio samples. Result: High-quality audio paired with precise transcriptions, supporting the building and benchmarking of ASR systems. Conclusion: The LinTO datasets are an important resource for ASR research on Tunisian Arabic Dialect. Abstract: Developing Automatic Speech Recognition (ASR) systems for Tunisian Arabic Dialect is challenging due to the dialect's linguistic complexity and the scarcity of annotated speech datasets. To address these challenges, we propose the LinTO audio and textual datasets -- comprehensive resources that capture phonological and lexical features of Tunisian Arabic Dialect. These datasets include a variety of texts from numerous sources and real-world audio samples featuring diverse speakers and code-switching between Tunisian Arabic Dialect and English or French. By providing high-quality audio paired with precise transcriptions, the LinTO audio and textual datasets aim to provide qualitative material to build and benchmark ASR systems for the Tunisian Arabic Dialect. Keywords -- Tunisian Arabic Dialect, Speech-to-Text, Low-Resource Languages, Audio Data Augmentation

Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

Congpei Qiu,Yanhao Wu,Wei Ke,Xiuxiu Bai,Tong Zhang

Task: Propose the Spatial Correlation Distillation (SCD) framework to strengthen the spatial awareness of CLIP ViTs in dense prediction tasks.

Motivation: CLIP excels at global alignment with language but is insufficiently sensitive to spatial information, so it underperforms on dense prediction tasks that require precise spatial understanding.

Details

Method: Preserve CLIP's inherent spatial structure through the Spatial Correlation Distillation (SCD) framework and introduce a lightweight Refiner that extracts high-quality dense features. Result: The framework achieves state-of-the-art results on a range of open-vocabulary dense prediction benchmarks. Conclusion: SCD integrates visual-language and visual-centric improvements and significantly improves CLIP ViTs on dense prediction tasks. Abstract: Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer from notable loss in spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on an intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.
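
One way to picture distilling spatial correlations is to compare the patch-to-patch similarity structure of a student encoder with that of a frozen teacher. The sketch below assumes cosine similarity between patch features and a mean-squared-error penalty; the abstract does not give the SCD loss, so both choices and all function names are illustrative assumptions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def spatial_correlation(patches):
    """Pairwise cosine similarity between all patch features of one image."""
    return [[cosine(p, q) for q in patches] for p in patches]


def correlation_distillation_loss(student_patches, teacher_patches):
    """Mean squared difference between student and teacher correlation maps."""
    s = spatial_correlation(student_patches)
    t = spatial_correlation(teacher_patches)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # toy features for three patches
same_structure = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # rescaled, same spatial structure
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]    # all patches look alike

loss_same = correlation_distillation_loss(same_structure, teacher)
loss_collapsed = correlation_distillation_loss(collapsed, teacher)
```

Because the loss depends only on relations between patches, a student can change feature scale freely but is penalized when distinct regions collapse onto the same representation.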

LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems

Zishuo Liu,Carlos Rabat Villarreal,Mostafa Rahgouy,Amit Das,Zheng Zhang,Chang Ren,Dongji Feng

Task: Explore the capabilities and limitations of large language models (LLMs) in solving Fermi Problems (FPs).

Motivation: Fermi problems involve real-world impracticalities or ambiguous concepts, which makes them hard for both humans and AI, and LLM performance on them has not been studied enough.

Details

Method: Evaluate three advanced LLMs on a publicly available FP dataset, with prompts designed according to the TELeR taxonomy, including a zero-shot scenario. Result: All LLMs score below 0.5 on fp_score, and they perform better on standard questions than on specific ones. Conclusion: Fermi problems are inherently difficult for LLMs, but the clarity and conciseness of standard questions help model performance. Abstract: Fermi Problems (FPs) are mathematical reasoning tasks that require human-like logic and numerical reasoning. Unlike other reasoning questions, FPs often involve real-world impracticalities or ambiguous concepts, making them challenging even for humans to solve. Despite advancements in AI, particularly with large language models (LLMs) in various reasoning tasks, FPs remain relatively under-explored. This work conducted an exploratory study to examine the capabilities and limitations of LLMs in solving FPs. We first evaluated the overall performance of three advanced LLMs using a publicly available FP dataset. We designed prompts according to the recently proposed TELeR taxonomy, including a zero-shot scenario. Results indicated that all three LLMs achieved an fp_score (ranging from 0 to 1) below 0.5, underscoring the inherent difficulty of these reasoning tasks. To further investigate, we categorized FPs into standard and specific questions, hypothesizing that LLMs would perform better on standard questions, which are characterized by clarity and conciseness, than on specific ones. Comparative experiments confirmed this hypothesis, demonstrating that LLMs performed better on standard FPs in terms of both accuracy and efficiency.
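
The abstract does not define fp_score beyond its 0 to 1 range. The sketch below assumes an order-of-magnitude definition of the kind used in earlier Fermi-problem benchmark work, where an answer loses one third of a point per factor of ten of error; treat the formula as an assumption rather than the metric this paper necessarily uses.

```python
import math


def fp_score(predicted, gold):
    """Order-of-magnitude score in [0, 1] (assumed definition).

    1.0 for an exact match, decreasing linearly in |log10(predicted / gold)|,
    and 0.0 once the answer is off by three or more orders of magnitude.
    """
    if predicted <= 0 or gold <= 0:
        return 0.0  # the log ratio is undefined for non-positive values
    return max(0.0, 1.0 - abs(math.log10(predicted / gold)) / 3.0)


exact = fp_score(1000, 1000)
one_order_off = fp_score(10_000, 1000)
far_off = fp_score(1e7, 1000)
```

Under this assumed definition, a score below 0.5 means the answer is off by more than a factor of about 30.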

Evaluating and Enhancing Segmentation Model Robustness with Metamorphic Testing

Seif Mzoughi,Mohamed Elshafeia,Foutse Khomh

Task: Propose SegRMT, which uses a genetic algorithm to optimize sequences of spatial and spectral transformations and generate adversarial examples that challenge image segmentation models.

Motivation: Image segmentation models lack robustness to adversarial perturbations, which undermines their reliability in safety-critical applications such as medical imaging and augmented reality.

Details

Method: Use a genetic algorithm (GA) to optimize transformation sequences while preserving image fidelity through a predefined PSNR threshold, producing adversarial examples. Result: SegRMT reduces DeepLabV3's mIoU to 6.4%, a larger drop than other baselines achieve (8.5% to 21.7%), and when used for adversarial training it yields mIoU improvements of up to 73%. Conclusion: SegRMT both simulates realistic image distortions and improves the robustness of segmentation models, which suits safety-critical applications. Abstract: Image segmentation is critical for applications such as medical imaging, augmented reality, and video surveillance. However, segmentation models often lack robustness, making them vulnerable to adversarial perturbations from subtle image distortions. In this work, we propose SegRMT, a metamorphic testing approach that leverages genetic algorithms (GA) to optimize sequences of spatial and spectral transformations while preserving image fidelity via a predefined PSNR threshold. Using the Cityscapes dataset, our method generates adversarial examples that effectively challenge the DeepLabV3 segmentation model. Our experiments show that SegRMT reduces DeepLabV3's mean Intersection over Union (mIoU) to 6.4%, outperforming other adversarial baselines that decrease mIoU to between 8.5% and 21.7%. Furthermore, when used for adversarial training, SegRMT boosts model performance, achieving mIoU improvements up to 73% on dedicated adversarial datasets and increasing cross-adversarial mIoU to 53.8%, compared to only 2%-10% for other methods. These findings demonstrate that SegRMT not only simulates realistic image distortions but also enhances the robustness of segmentation models, making it a valuable tool for ensuring reliable performance in safety-critical applications.
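
The fidelity constraint mentioned in the abstract relies on PSNR, which has a standard definition: 10 * log10(MAX^2 / MSE). The sketch below computes it for flat lists of 8-bit pixel values and uses it as an accept/reject gate for candidate transformations. The 20 dB threshold is a placeholder, since the abstract does not state the value the authors use, and the pixel lists are toy data.

```python
import math


def psnr(original, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, distorted)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def passes_fidelity(original, distorted, threshold_db):
    """Keep a transformed image only if it stays close enough to the source."""
    return psnr(original, distorted) >= threshold_db


original = [100, 150, 200, 250]
subtle = [101, 149, 201, 249]    # every pixel changed by 1
destroyed = [0, 0, 0, 0]         # image content wiped out

THRESHOLD_DB = 20.0  # hypothetical value for illustration
subtle_ok = passes_fidelity(original, subtle, THRESHOLD_DB)
destroyed_ok = passes_fidelity(original, destroyed, THRESHOLD_DB)
```

In a genetic search, a gate like this would discard candidates whose transformation sequence degrades the image too much, so that only visually plausible distortions compete on how far they lower mIoU.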

Limitations of Religious Data and the Importance of the Target Domain: Towards Machine Translation for Guinea-Bissau Creole

Jacqueline Rowe,Edward Gow-Smith,Mark Hepple

Task: Build and evaluate a new dataset for machine translation of Guinea-Bissau Creole (Kiriol).

Motivation: Address data scarcity for low-resource languages such as Kiriol in machine translation, and explore how religious-domain data can improve general-domain translation.

Details

Method: Train several Transformer-based models on about 40 thousand parallel sentences (mostly religious texts plus a small amount of general-domain data) and study domain transfer. Result: Adding a small amount of target-domain data (as few as 300 sentences) substantially improves translation; Portuguese-to-Kiriol models perform best on average. Conclusion: The work underlines the importance of data collection for low-resource languages and aims to stimulate further research on machine translation for creole languages. Abstract: We introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), comprising around 40 thousand parallel sentences to English and Portuguese. This dataset is made up of predominantly religious data (from the Bible and texts from the Jehovah's Witnesses), but also a small amount of general domain data (from a dictionary). This mirrors the typical resource availability of many low-resource languages. We train a number of transformer-based models to investigate how to improve domain transfer from religious data to a more general domain. We find that adding even 300 sentences from the target domain when training substantially improves the translation performance, highlighting the importance and need for data collection for low-resource languages, even on a small scale. We additionally find that Portuguese-to-Kiriol translation models perform better on average than other source and target language pairs, and investigate how this relates to the morphological complexity of the languages involved and the degree of lexical overlap between creoles and lexifiers. Overall, we hope our work will stimulate research into Kiriol and into how machine translation might better support creole languages in general.

LPA3D: 3D Room-Level Scene Generation from In-the-Wild Images

Ming-Jia Yang,Yu-Xiao Guo,Yang Liu,Bin Zhou,Xin Tong

Task: Generate semantically plausible, richly detailed indoor scenes from RGB images alone.

Motivation: Existing NeRF-based scene generation methods need extra information (multiple views, depth images, or semantic guidance) and cannot rely on RGB images alone, mainly because prior knowledge of camera poses is hard to estimate.

Details

Method: Propose LPA-GAN, a NeRF-based generative method that redefines global poses within the Local-Pose-Alignment (LPA) framework and jointly optimizes pose prediction and scene generation. Result: Experiments show that LPA-GAN surpasses other methods in view-to-view consistency and semantic plausibility. Conclusion: LPA-GAN is an effective approach to generating indoor scenes from RGB images alone. Abstract: Generating realistic, room-level indoor scenes with semantically plausible and detailed appearances from in-the-wild images is crucial for various applications in VR, AR, and robotics. The success of NeRF-based generative methods indicates a promising direction to address this challenge. However, unlike their success at the object level, existing scene-level generative methods require additional information, such as multiple views, depth images, or semantic guidance, rather than relying solely on RGB images. This is because NeRF-based methods necessitate prior knowledge of camera poses, which is challenging to approximate for indoor scenes due to the complexity of defining alignment and the difficulty of globally estimating poses from a single image, given the unseen parts behind the camera. To address this challenge, we redefine global poses within the framework of Local-Pose-Alignment (LPA) -- an anchor-based multi-local-coordinate system that uses a selected number of anchors as the roots of these coordinates. Building on this foundation, we introduce LPA-GAN, a novel NeRF-based generative approach that incorporates specific modifications to estimate the priors of camera poses under LPA. It also co-optimizes the pose predictor and scene generation processes. Our ablation study and comparisons with straightforward extensions of NeRF-based object generative methods demonstrate the effectiveness of our approach. Furthermore, visual comparisons with other techniques reveal that our method achieves superior view-to-view consistency and semantic normality.

The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

Nikhil Verma,Manasa Bharadwaj

Task: Analyze how alignment mechanisms in large language models behave in multilingual settings and what effect they have.

Motivation: Current alignment methods focus mainly on English, so the effectiveness of alignment in multilingual settings is unclear and models show a monolingual bias.

Details

Method: Systematically analyze distributional shifts in the LLM embedding space before and after alignment, use the alignment-induced separation in safety space as a quantitative tool, and evaluate seven LLMs on balanced toxicity datasets and parallel text-detoxification benchmarks. Result: High-resource and low-resource languages differ substantially in the latent representation space, which highlights the need for language-specific fine-tuning. Conclusion: Language-specific fine-tuning is needed to ensure fair, reliable, and robust multilingual alignment, and the findings lay a foundation for developing truly safe multilingual LLMs. Abstract: Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering its impact on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.

SemiISP/SemiIE: Semi-Supervised Image Signal Processor and Image Enhancement Leveraging One-to-Many Mapping sRGB-to-RAW

Masakazu Yoshimura,Junji Otsuka,Radu Berdan,Takeshi Ohashi

Task: Realize semi-supervised learning for Image Signal Processor (ISP) and image enhancement (IE) tasks.

Motivation: Creating training data is expensive and personalization needs vary, which makes semi-supervised learning a potential solution, yet it has rarely been applied to ISP and IE tasks.

Details

Method: Propose an sRGB-to-RAW method that generates high-quality pseudo-RAW image datasets and use it for semi-supervised learning. Result: The proposed method improves the image quality of various models on various datasets. Conclusion: Semi-supervised learning combined with the improved sRGB-to-RAW method offers an efficient, high-quality solution for ISP and IE tasks. Abstract: DNN-based methods have been successful in Image Signal Processor (ISP) and image enhancement (IE) tasks. However, the cost of creating training data for these tasks is considerably higher than for other tasks, making it difficult to prepare large-scale datasets. Also, creating personalized ISP and IE with minimal training data can lead to new value streams since preferred image quality varies depending on the person and use case. While semi-supervised learning could be a potential solution in such cases, it has rarely been utilized for these tasks. In this paper, we realize semi-supervised learning for ISP and IE leveraging a RAW image reconstruction (sRGB-to-RAW) method. Although existing sRGB-to-RAW methods can generate pseudo-RAW image datasets that improve the accuracy of RAW-based high-level computer vision tasks such as object detection, their quality is not sufficient for ISP and IE tasks that require precise image quality definition. Therefore, we also propose an sRGB-to-RAW method that can improve the image quality of these tasks. The proposed semi-supervised learning with the proposed sRGB-to-RAW method successfully improves the image quality of various models on various datasets.

ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization

Kehua Feng,Keyan Ding,Jing Yu,Menghan Li,Yuhao Wang,Tong Xu,Xinda Wang,Qiang Zhang,Huajun Chen

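Task: Propose Ex-Ante Reasoning Preference Optimization (ERPO), a new safety alignment framework that addresses harmful content generation by large language models.

Motivation: Existing alignment methods struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks, so a more effective safety alignment method is needed.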

Details

Method: Three stages: (1) supervised fine-tuning (SFT) to introduce explicit ex-ante reasoning; (2) Direct Preference Optimization (DPO) to improve safety, usefulness, and efficiency; (3) a length-controlled iterative preference optimization strategy to reduce inference latency. Result: Experiments show that ERPO significantly improves safety while maintaining response efficiency. Conclusion: ERPO is an effective safety alignment framework for large language models that balances safety and efficiency. Abstract: Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy. Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency.
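
The second ERPO stage relies on Direct Preference Optimization. As a minimal sketch of the standard DPO objective for a single preference pair (not ERPO's actual training code), the loss can be written as below. The function name, the argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are sequence log-probabilities; beta is an assumed
    temperature controlling how far the policy may drift from the reference.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)) computed as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and reference agree exactly the margin is zero and the loss equals log 2; the loss falls as the policy favours the chosen response more strongly than the reference does.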

Agglomerating Large Vision Encoders via Distillation for VFSS Segmentation

Chengxi Zeng,Yuxuan Jiang,Fan Zhang,Alberto Gambaruto,Tilo Burghardt

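Task: Propose a new framework that improves the performance of low-complexity models via knowledge distillation from multiple large medical foundation models.

Motivation: Medical foundation models perform well in medical imaging, but their training and inference complexity is high, and lightweight variants are limited in performance.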

Details

Method: Improves low-complexity models via knowledge distillation from multiple large medical foundation models (e.g., MedSAM, RAD-DINO, MedCLIP). Result: The method generalizes better across 12 segmentation tasks, with an average Dice coefficient gain of 2% over simple distillation. Conclusion: The method achieves a better tradeoff between complexity and performance and is suited to medical image segmentation tasks. Abstract: The deployment of foundation models for medical imaging has demonstrated considerable success. However, their training overheads associated with downstream tasks remain substantial due to the size of the image encoders employed, and the inference complexity is also significantly high. Although lightweight variants have been obtained for these foundation models, their performance is constrained by their limited model capacity and suboptimal training strategies. In order to achieve an improved tradeoff between complexity and performance, we propose a new framework to improve the performance of low complexity models via knowledge distillation from multiple large medical foundation models (e.g., MedSAM, RAD-DINO, MedCLIP), each specializing in different vision tasks, with the goal to effectively bridge the performance gap for medical image segmentation tasks. The agglomerated model demonstrates superior generalization across 12 segmentation tasks, whereas specialized models require explicit training for each task. Our approach achieved an average performance gain of 2% in Dice coefficient compared to simple distillation.
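
The reported gain is measured with the Dice coefficient. For readers unfamiliar with the metric, here is a small sketch of how it is typically computed on flattened binary masks; the function name and the smoothing term `eps` are illustrative assumptions and are not taken from the paper.

```python
def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flattened binary masks (0/1 values).

    eps is an assumed smoothing constant that avoids division by zero
    when both masks are empty.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)
```

For example, a prediction `[1, 1, 0, 0]` against a ground truth `[1, 0, 0, 0]` shares one pixel out of three marked in total, giving a Dice score of about 0.667.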

Why do LLMs attend to the first token?

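Federico Barbero,Álvaro Arroyo,Xiangming Gu,Christos Perivolaropoulos,Michael Bronstein,Petar Veličković,Razvan Pascanu

Task: Study why attention sinks form in large language models (LLMs) and what purpose they serve.

Motivation: Although many studies have examined the attention sink phenomenon and its effects, why LLMs learn this pattern and how they use it remains poorly understood.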

Details

Method: Through theoretical and empirical analysis, examines how attention sinks help LLMs avoid over-mixing information, with experiments on how context length, model depth, and data packing affect sink behaviour. Result: Experiments validate the theoretical intuitions and show how these choices influence sink behaviour. Conclusion: The study offers a new perspective on why attention sinks form in LLMs, supporting a deeper understanding of the attention patterns that emerge during training. Abstract: Large Language Models (LLMs) tend to attend heavily to the first token in the sequence -- creating a so-called attention sink. Many works have studied this phenomenon in detail, proposing various ways to either leverage or alleviate it. Attention sinks have been connected to quantisation difficulties, security issues, and streaming attention. Yet, while many works have provided conditions in which they occur or not, a critical question remains shallowly answered: Why do LLMs learn such patterns and how are they being used? In this work, we argue theoretically and empirically that this mechanism provides a method for LLMs to avoid over-mixing, connecting this to existing lines of work that study mathematically how information propagates in Transformers. We conduct experiments to validate our theoretical intuitions and show how choices such as context length, depth, and data packing influence the sink behaviour. We hope that this study provides a new practical perspective on why attention sinks are useful in LLMs, leading to a better understanding of the attention patterns that form during training.
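
One simple way to quantify an attention sink is the average attention weight that later queries place on the first token. The sketch below illustrates that idea only; the function name and the input layout (a causal attention matrix as a list of rows, each summing to 1) are assumptions and not the paper's own measurement code.

```python
def first_token_attention_mass(attn):
    """Mean attention weight on key position 0 for one attention head.

    attn[i][j] is the weight from query i to key j under a causal mask,
    so row i has i + 1 entries that sum to 1.
    """
    # Query 0 can only attend to itself, so it is excluded from the mean.
    rows = attn[1:]
    if not rows:
        raise ValueError("need at least two positions")
    return sum(row[0] for row in rows) / len(rows)
```

A value close to 1 means the head sends nearly all of its attention to the first token, which is the sink pattern described in the abstract.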

All-day Depth Completion via Thermal-LiDAR Fusion

Janghyun Kim,Minseong Kweon,Jinsun Park,Ukcheol Shin

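Task: Study depth completion from thermal and LiDAR data to address the poor performance of RGB sensors in harsh environments such as low light and rain.

Motivation: Existing methods perform poorly in harsh environments, and ground-truth depth maps often have missing measurements in adverse weather; thermal cameras remain reliable under such conditions, but research on them is scarce.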

Details

Method: Proposes COPS, a framework combining contrastive learning and pseudo-supervision that uses a depth foundation model to sharpen depth boundaries and improve completion accuracy. Result: Extensive benchmarks on the MS^2 and ViViD datasets confirm the feasibility and robustness of thermal-LiDAR depth completion. Conclusion: The study offers a new method for thermal-LiDAR depth completion and analyzes its key challenges to encourage future research. Abstract: Depth completion, which estimates dense depth from sparse LiDAR and RGB images, has demonstrated outstanding performance in well-lit conditions. However, due to the limitations of RGB sensors, existing methods often struggle to achieve reliable performance in harsh environments, such as heavy rain and low-light conditions. Furthermore, we observe that ground truth depth maps often suffer from large missing measurements in adverse weather conditions such as heavy rain, leading to insufficient supervision. In contrast, thermal cameras are known for providing clear and reliable visibility in such conditions, yet research on thermal-LiDAR depth completion remains underexplored. Moreover, the characteristics of thermal images, such as blurriness, low contrast, and noise, bring unclear depth boundary problems. To address these challenges, we first evaluate the feasibility and robustness of thermal-LiDAR depth completion across diverse lighting (e.g., well-lit, low-light), weather (e.g., clear-sky, rainy), and environment (e.g., indoor, outdoor) conditions, by conducting extensive benchmarks on the MS$^2$ and ViViD datasets. In addition, we propose a framework that utilizes COntrastive learning and Pseudo-Supervision (COPS) to enhance depth boundary clarity and improve completion accuracy by leveraging a depth foundation model in two key ways. First, COPS enforces a depth-aware contrastive loss between different depth points by mining positive and negative samples using a monocular depth foundation model to sharpen depth boundaries. Second, it mitigates the issue of incomplete supervision from ground truth depth maps by leveraging foundation model predictions as dense depth priors. We also provide in-depth analyses of the key challenges in thermal-LiDAR depth completion to aid in understanding the task and encourage future research.

Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study

Aryan Agrawal,Lisa Alazraki,Shahin Honarvar,Marek Rei

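Task: Study how to improve the robustness of large language models (LLMs) to perturbations of task instructions.

Motivation: Existing methods focus mainly on perturbed data samples, while robustness to perturbed task instructions remains relatively underexplored.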

Details

Method: Applies character- and word-level edits to task instructions and experiments with techniques such as self-denoising and representation alignment across different models, datasets, and instructions. Result: Self-denoising, whether performed by a frozen LLM or a fine-tuned model, outperforms alternative strategies such as ensembling and supervised methods. Conclusion: Self-denoising is an effective way to make LLMs more robust to instruction perturbations. Abstract: Large Language Models (LLMs) are highly vulnerable to input perturbations, as even a small prompt change may result in a substantially different output. Existing methods to enhance LLM robustness are primarily focused on perturbed data samples, whereas improving resiliency to perturbations of task-level instructions has remained relatively underexplored. In this work, we focus on character- and word-level edits of task-specific instructions, which substantially degrade downstream performance. We experiment with a variety of techniques to enhance the robustness of LLMs, including self-denoising and representation alignment, testing different models (Llama 3 and Flan-T5), datasets (CoLA, QNLI, SST-2) and instructions (both task-oriented and role-oriented). We find that, on average, self-denoising -- whether performed by a frozen LLM or a fine-tuned model -- achieves substantially higher performance gains than alternative strategies, including more complex baselines such as ensembling and supervised methods.
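
To make "character-level edits of task instructions" concrete, the following sketch swaps adjacent letters at a configurable rate. It is one plausible perturbation among many, not the paper's exact procedure; the function name, the `rate` parameter, and the choice of adjacent swaps are all assumptions made for illustration.

```python
import random

def perturb_instruction(text, rate=0.1, seed=0):
    """Swap adjacent letters with probability `rate` (assumed perturbation)."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter moves at most once
        else:
            i += 1
    return "".join(chars)
```

With `rate=1.0` every eligible pair is swapped, so `perturb_instruction("ab cd", rate=1.0)` yields `"ba dc"`; with `rate=0.0` the instruction is returned unchanged. A perturbed instruction could then be passed to a model alongside the clean one to compare downstream accuracy.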

Brightness Perceiving for Recursive Low-Light Image Enhancement

Haodian Wang,Long Peng,Yuejin Sun,Zengyu Wan,Yang Wang,Yang Cao

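Task: Propose a brightness-perceiving recursive enhancement framework for high dynamic range low-light image enhancement.

Motivation: Because real low-light scenes span a wide dynamic range, existing end-to-end methods struggle to enhance low-light images to normal exposure.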

Details

Method: Proposes a recursive enhancement framework with two parallel sub-networks (ACT-Net and BP-Net) and an unsupervised training strategy. Result: Achieves new SOTA performance on six reference and no-reference metrics, improving PSNR by 0.9 dB. Conclusion: Through recursive enhancement and brightness perception, the method addresses contrast and detail degradation in low-light image enhancement. Abstract: Due to the wide dynamic range in real low-light scenes, there will be large differences in the degree of contrast degradation and detail blurring of captured images, making it difficult for existing end-to-end methods to enhance low-light images to normal exposure. To address the above issue, we decompose low-light image enhancement into a recursive enhancement task and propose a brightness-perceiving-based recursive enhancement framework for high dynamic range low-light image enhancement. Specifically, our recursive enhancement framework consists of two parallel sub-networks: Adaptive Contrast and Texture enhancement network (ACT-Net) and Brightness Perception network (BP-Net). The ACT-Net is proposed to adaptively enhance image contrast and details under the guidance of the brightness adjustment branch and gradient adjustment branch, which are proposed to perceive the degradation degree of contrast and details in low-light images. To adaptively enhance images captured under different brightness levels, BP-Net is proposed to control the recursive enhancement times of ACT-Net by exploring the image brightness distribution properties. Finally, in order to coordinate ACT-Net and BP-Net, we design a novel unsupervised training strategy to facilitate the training procedure. To further validate the effectiveness of the proposed method, we construct a new dataset with a broader brightness distribution by mixing three low-light datasets. Compared with eleven existing representative methods, the proposed method achieves new SOTA performance on six reference and no-reference metrics. Specifically, the proposed method improves the PSNR by 0.9 dB compared to the existing SOTA method.
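
Since the headline number is a PSNR gain, a short sketch of the metric may help put it in context. The function below follows the usual definition on flattened pixel values; the name and the default peak value of 255 (8-bit images) are assumptions for illustration.

```python
import math

def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a 0.9 dB improvement at a fixed peak value corresponds to an MSE that is lower by a factor of 10^(-0.09), roughly 19%.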

MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Jaap Jumelet,Leonie Weissweiler,Arianna Bisazza

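Task: Introduce and evaluate MultiBLiMP 1.0, a massively multilingual benchmark covering 101 languages and 6 linguistic phenomena.

Motivation: Evaluate large language models (LLMs) in multilingual settings with a large-scale benchmark, and expose the shortcomings of current state-of-the-art models on low-resource languages.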

Details

Method: Uses the large-scale linguistic resources of Universal Dependencies and UniMorph to generate more than 125,000 minimal pairs through a fully automated pipeline. Result: MultiBLiMP 1.0 evaluates the multilingual abilities of LLMs at an unprecedented scale and highlights the weaknesses of current state-of-the-art models on low-resource languages. Conclusion: MultiBLiMP 1.0 provides an important tool for evaluating multilingual language models and points to directions for future research. Abstract: We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages, 6 linguistic phenomena and containing more than 125,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
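
Minimal-pair benchmarks are usually scored by checking whether a model assigns a higher score to the grammatical sentence than to its ungrammatical twin. The abstract does not spell out the scoring details, so the sketch below assumes that common protocol; `minimal_pair_accuracy` and the `score` callable (for example, a sentence log-probability under a language model) are hypothetical names.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of (grammatical, ungrammatical) pairs ranked correctly.

    `score` is any callable mapping a sentence to a number, assumed here
    to be a language model's sentence log-probability.
    """
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)
```

In practice `score` would wrap a real model; for a quick check, a lookup table of made-up log-probabilities is enough to exercise the function.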

VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Kim Sung-Bin,Jeongsoo Choi,Puyuan Peng,Joon Son Chung,Tae-Hyun Oh,David Harwath

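Task: Propose VoiceCraft-Dub, an automated video dubbing method that synthesizes high-quality speech from text and facial cues.

Motivation: The task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals.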

Details

Method: Building on the success of Neural Codec Language Models (NCLMs), incorporates video features so that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. Adapters align facial features with the NCLM token space, and audio-visual fusion layers merge the information within the NCLM framework. Result: Experiments show high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. Conclusion: VoiceCraft-Dub also adapts to the video-to-speech task, showing its versatility across applications. Abstract: We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.

A Framework for Robust Cognitive Evaluation of LLMs

Karin de Langis,Jong Inn Park,Bin Hu,Khanh Chi Le,Andreas Schramm,Michael C. Mensink,Andrew Elfenbein,Dongyeop Kang

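Task: Develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of large language models (LLMs).

Motivation: Emergent cognitive abilities in LLMs have been widely observed, but their nature and mechanisms remain unclear, and standardized evaluation methods are lacking.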

Details

Method: CognitivEval features automatic prompt permutations and testing that gathers both generations and model probability estimates. Result: Experiments show that the framework makes experimental outcomes more robust, and it replicates five classic cognitive science experiments. Conclusion: CognitivEval generalizes across experimental tasks, can foster collaboration within the cognitive science community, and will be released publicly. Abstract: Emergent cognitive abilities in large language models (LLMs) have been widely observed, but their nature and underlying mechanisms remain poorly understood. A growing body of research draws on cognitive science to investigate LLM cognition, but standard methodologies and experimental pipelines have not yet been established. To address this gap we develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of LLMs, with a particular emphasis on robustness in response collection. The key features of CognitivEval include: (i) automatic prompt permutations, and (ii) testing that gathers both generations and model probability estimates. Our experiments demonstrate that these features lead to more robust experimental outcomes. Using CognitivEval, we replicate five classic experiments in cognitive science, illustrating the framework's generalizability across various experimental tasks and obtaining a cognitive profile of several state of the art LLMs. CognitivEval will be released publicly to foster broader collaboration within the cognitive science community.
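
The abstract does not define what "automatic prompt permutations" covers. One plausible reading, sketched here purely as an assumption, is presenting the same question with every ordering of its answer options so that results do not hinge on option position. The function name and prompt layout are hypothetical and are not CognitivEval's API.

```python
from itertools import permutations

def prompt_permutations(question, options):
    """Yield one prompt per ordering of the answer options (assumed design)."""
    for order in permutations(options):
        # Label options A, B, C, ... in their permuted order.
        lines = [question] + [
            f"{chr(ord('A') + i)}. {option}" for i, option in enumerate(order)
        ]
        yield "\n".join(lines)
```

Three options produce six prompts; aggregating a model's answers across them gives a response that is less sensitive to ordering effects. The number of prompts grows factorially, so sampling a subset would be needed for longer option lists.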

Marine Saliency Segmenter: Object-Focused Conditional Diffusion with Region-Level Semantic Knowledge Distillation

Laibin Chang,Yunke Wang,JiaXing Huang,Longxiang Deng,Bo Du,Chang Xu

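Task: Propose DiffMSS, a diffusion-based marine saliency segmentation method that addresses the object mislocalization and imprecise boundaries of existing techniques.

Motivation: Complex underwater environments cause existing marine segmentation techniques to mislocalize objects and produce imprecise boundaries, and although diffusion models perform well in visual segmentation, there is still room for improvement.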

Details

Method: Uses semantic knowledge distillation to guide segmentation, designs a region-word similarity matching mechanism to extract high-level semantic features, and develops consensus deterministic sampling to refine the segmentation of fine-grained structures. Result: DiffMSS outperforms state-of-the-art methods in both quantitative and qualitative evaluations. Conclusion: Through semantic knowledge distillation and consensus deterministic sampling, DiffMSS improves the accuracy and boundary precision of marine saliency segmentation. Abstract: Marine Saliency Segmentation (MSS) plays a pivotal role in various vision-based marine exploration tasks. However, existing marine segmentation techniques face the dilemma of object mislocalization and imprecise boundaries due to the complex underwater environment. Meanwhile, despite the impressive performance of diffusion models in visual segmentation, there remains potential to further leverage contextual semantics to enhance feature learning of region-level salient objects, thereby improving segmentation outcomes. Building on this insight, we propose DiffMSS, a novel marine saliency segmenter based on the diffusion model, which utilizes semantic knowledge distillation to guide the segmentation of marine salient objects. Specifically, we design a region-word similarity matching mechanism to identify salient terms at the word level from the text descriptions. These high-level semantic features guide the conditional feature learning network in generating salient and accurate diffusion conditions with semantic knowledge distillation. To further refine the segmentation of fine-grained structures in unique marine organisms, we develop the dedicated consensus deterministic sampling to suppress overconfident missegmentations. Comprehensive experiments demonstrate the superior performance of DiffMSS over state-of-the-art methods in both quantitative and qualitative evaluations.

Zhuohan Ge,Nicole Hu,Darian Li,Yubo Wang,Shihao Qi,Yuming Xu,Han Shi,Jason Zhang

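Task: Explore the potential of large language models (LLMs) for detecting mental health problems through social media data analysis.

Motivation: Social media data is an important resource for mental health research, but using LLMs to detect mental health problems from it remains challenging.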

Details

Method: Summarizes LLM application methods along several dimensions, such as text data analysis and detection of mental disorders, and analyzes the major challenges and shortcomings of current research. Result: Reviews popular datasets and evaluation metrics and shows the considerable potential of LLMs in mental health detection. Conclusion: Provides a comprehensive frame of reference for mental health researchers and supports the further application of LLMs in future mental health interventions. Abstract: The detection and intervention of mental health issues represent a critical global research focus, and social media data has been recognized as an important resource for mental health research. However, how to utilize Large Language Models (LLMs) for mental health problem detection on social media poses significant challenges. Hence, this paper aims to explore the potential of LLM applications in social media data analysis, focusing not only on the most common psychological disorders such as depression and anxiety but also incorporating psychotic disorders and externalizing disorders, summarizing the application methods of LLM from different dimensions, such as text data analysis and detection of mental disorders, and revealing the major challenges and shortcomings of current research. In addition, the paper provides an overview of popular datasets and evaluation metrics. The survey in this paper provides a comprehensive frame of reference for researchers in the field of mental health, while demonstrating the great potential of LLMs in mental health detection to facilitate the further application of LLMs in future mental health interventions.

Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval

Boseung Jeong,Jicheol Park,Sungyeon Kim,Suha Kwak

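Task: Propose AVIGATE, an audio-guided video representation learning framework for video-text retrieval.

Motivation: Existing methods rely mainly on visual and textual features and overlook audio, while traditional models that do use audio apply it blindly, resulting in suboptimal video representations.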

Details

Method: Uses a gated attention mechanism to selectively filter out uninformative audio signals, and proposes an adaptive margin-based contrastive loss to improve video-text alignment. Result: Achieves state-of-the-art performance on public benchmarks. Conclusion: By using audio effectively and improving alignment, AVIGATE raises video-text retrieval performance. Abstract: Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.

MegaMath: Pushing the Limits of Open Math Corpora

Fan Zhou,Zengzhi Wang,Nikhil Ranjan,Zhoujun Cheng,Liping Tang,Guowei He,Zhengzhong Liu,Eric P. Xing

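Task: Build MegaMath, an open, large-scale, high-quality math corpus to support math-centric pre-training of large language models.

Motivation: Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced LLM capabilities, yet no open corpus currently meets the demands of math-centric pre-training.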

Details

Method: Builds the corpus with three strategies: (1) re-extracting web data with math-oriented HTML optimizations, filtering, and deduplication; (2) selecting high-quality math-related code from code data; (3) synthesizing QA-style text, math-related code, and interleaved text-code blocks. Result: MegaMath delivers 371B tokens, the largest quantity and top quality among existing open math pre-training datasets. Conclusion: With diverse data sources and validated strategies, MegaMath fills the gap in math-centric pre-training corpora. Abstract: Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fasttext-based filtering and deduplication, all for acquiring higher-quality data on the Internet. (2) Recalling Math-related code data: We identified high quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring Synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets.

Hyperspectral Remote Sensing Images Salient Object Detection: The First Benchmark Dataset and Baseline

Peifu Liu,Huiyan Bai,Tingfa Xu,Jihui Wang,Huan Chen,Jianan Li

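Task: Hyperspectral remote sensing image salient object detection (HRSI-SOD), which aims to identify objects or regions with distinct spectral contrast against the background.

Motivation: The area holds significant promise for practical applications, but the lack of dedicated datasets and methods has limited progress.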

Details

Method: Introduces HRSSD, the first HRSI-SOD dataset, and a baseline model called the Deep Spectral Saliency Network (DSSN), built around a Cross-level Saliency Assessment Block and a High-resolution Fusion Module. Result: Experiments on HRSSD validate the superiority of DSSN, and evaluations on other datasets show that it generalizes. Conclusion: The study underscores the importance of dedicated datasets and methods in this domain and makes its dataset and source code publicly available. Abstract: The objective of hyperspectral remote sensing image salient object detection (HRSI-SOD) is to identify objects or regions that exhibit distinct spectrum contrasts with the background. This area holds significant promise for practical applications; however, progress has been limited by a notable scarcity of dedicated datasets and methodologies. To bridge this gap and stimulate further research, we introduce the first HRSI-SOD dataset, termed HRSSD, which includes 704 hyperspectral images and 5327 pixel-level annotated salient objects. The HRSSD dataset poses substantial challenges for salient object detection algorithms due to large scale variation, diverse foreground-background relations, and multi-salient objects. Additionally, we propose an innovative and efficient baseline model for HRSI-SOD, termed the Deep Spectral Saliency Network (DSSN). The core of DSSN is the Cross-level Saliency Assessment Block, which performs pixel-wise attention and evaluates the contributions of multi-scale similarity maps at each spatial location, effectively reducing erroneous responses in cluttered regions and emphasizing salient regions across scales. Additionally, the High-resolution Fusion Module combines bottom-up fusion strategy and learned spatial upsampling to leverage the strengths of multi-scale saliency maps, ensuring accurate localization of small objects. Experiments on the HRSSD dataset robustly validate the superiority of DSSN, underscoring the critical need for specialized datasets and methodologies in this domain. Further evaluations on the HSOD-BIT and HS-SOD datasets demonstrate the generalizability of the proposed method. The dataset and source code are publicly available at https://github.com/laprf/HRSSD.

Generative Evaluation of Complex Reasoning in Large Language Models

Haowei Lin,Xiangyu Wang,Ruilin Yan,Baizhou Huang,Haotian Ye,Jianhua Zhu,Zihao Wang,James Zou,Jianzhu Ma,Yitao Liang

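Task: Assess whether large language models (LLMs) genuinely reason rather than merely recall answers memorized from their training data.

Motivation: Public benchmarks lose reliability through data contamination once they enter LLM training sets, so a new evaluation method is needed to test genuine reasoning ability.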

Details

Method: Proposes KUMO, a framework that combines LLMs with symbolic engines to dynamically generate diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Result: Across 23 state-of-the-art LLMs evaluated on 5,000 tasks, many models exceed university-level performance on easy tasks and reasoning-scaled models reach university-level performance on complex tasks; performance on KUMO correlates strongly with newly released real-world reasoning benchmarks. Conclusion: KUMO is a robust, enduring assessment tool for testing genuine LLM reasoning capabilities. Abstract: With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students. Our findings reveal that many LLMs have outperformed university-level performance on easy reasoning tasks, and reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.

Leveraging Static Relationships for Intra-Type and Inter-Type Message Passing in Video Question Answering

Lili Liang,Guanglu Sun

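Task: Propose an intra-type and inter-type message passing reasoning method based on static relationships to improve video question answering accuracy.

Motivation: Existing methods based on static relationship reasoning fall short in recognizing and representing static relationships, and they do not fully use the static relationship information in videos for in-depth reasoning.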

Details

Method: Constructs a dual graph for intra-type message passing reasoning and a heterogeneous graph for inter-type message passing reasoning, then combines both kinds of clues to infer the answer. Result: Experiments on the ANetQA and Next-QA datasets demonstrate the effectiveness of the method. Conclusion: By making full use of static relationship information, the method improves reasoning accuracy in video question answering. Abstract: Video Question Answering (VideoQA) is an important research direction in the field of artificial intelligence, enabling machines to understand video content and perform reasoning and answering based on natural language questions. Although methods based on static relationship reasoning have made certain progress, there are still deficiencies in the accuracy of static relationship recognition and representation, and they have not fully utilized the static relationship information in videos for in-depth reasoning and analysis. Therefore, this paper proposes a reasoning method for intra-type and inter-type message passing based on static relationships. This method constructs a dual graph for intra-type message passing reasoning and builds a heterogeneous graph based on static relationships for inter-type message passing reasoning. The intra-type message passing reasoning model captures the neighborhood information of targets and relationships related to the question in the dual graph, updating the dual graph to obtain intra-type clues for answering the question. The inter-type message passing reasoning model captures the neighborhood information of targets and relationships from different categories related to the question in the heterogeneous graph, updating the heterogeneous graph to obtain inter-type clues for answering the question. Finally, the answers are inferred by combining the intra-type and inter-type clues based on static relationships. Experimental results on the ANetQA and Next-QA datasets demonstrate the effectiveness of this method.

LLMs Working in Harmony: A Survey on the Technological Aspects of Building Effective LLM-Based Multi Agent Systems

R. M. Aratchige,W. M. K. S. Ilmini

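Task: Survey the key technologies behind large language model (LLM)-based multi-agent systems.

Motivation: Optimize multi-agent systems for collaborative, dynamic environments.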

Details

Method: Analyzes advances and limitations in four key areas: architecture, memory, planning, and technologies/frameworks. Result: Synthesizes the strengths and challenges of current technology and offers recommendations for improving system scalability, collaboration, and adaptability. Conclusion: Provides a roadmap for future research that supports building more efficient, robust multi-agent systems. Abstract: This survey investigates foundational technologies essential for developing effective Large Language Model (LLM)-based multi-agent systems. Aiming to answer how best to optimize these systems for collaborative, dynamic environments, we focus on four critical areas: Architecture, Memory, Planning, and Technologies/Frameworks. By analyzing recent advancements and their limitations, such as scalability, real-time response challenges, and agent coordination constraints, we provide a detailed view of the technological landscape. Frameworks like the Mixture of Agents architecture and the ReAct planning model exemplify current innovations, showcasing improvements in role assignment and decision-making. This review synthesizes key strengths and persistent challenges, offering practical recommendations to enhance system scalability, agent collaboration, and adaptability. Our findings provide a roadmap for future research, supporting the creation of robust, efficient multi-agent systems that advance both individual agent performance and collective system resilience.

OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication

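Zhongjian Wang,Peng Zhang,Jinwei Qi,Guangyuan Wang,Sheng Xu,Bang Zhang,Liefeng Bo

Task: Develop OmniTalker, an end-to-end unified framework that generates synchronized speech and talking head video from text and a reference video in real time.

Motivation: Address the limitations of existing cascaded pipelines: system complexity, latency, asynchronous audiovisual output, and stylistic discrepancies.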

Details

Method: Uses a dual-branch diffusion transformer architecture with an audio branch and a visual branch, and integrates cross-modal information through an audio-visual fusion module. Result: OmniTalker surpasses existing methods in generation quality, style preservation, and audio-video synchronization, with real-time inference at 25 FPS. Conclusion: OmniTalker is the first unified framework to jointly model speech style and facial style in a zero-shot setting. Abstract: Recent years have witnessed remarkable advances in talking head generation, owing to its potential to revolutionize the human-AI interaction from text interfaces into realistic video chats. However, research on text-driven talking heads remains underexplored, with existing methods predominantly adopting a cascaded pipeline that combines TTS systems with audio-driven talking head models. This conventional pipeline not only introduces system complexity and latency overhead but also fundamentally suffers from asynchronous audiovisual output and stylistic discrepancies between generated speech and visual expressions. To address these limitations, we introduce OmniTalker, an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios, while preserving both speech style and facial styles. The framework employs a dual-branch diffusion transformer architecture: the audio branch synthesizes mel-spectrograms from text, while the visual branch predicts fine-grained head poses and facial dynamics. To bridge modalities, we introduce a novel audio-visual fusion module that integrates cross-modal information to ensure temporal synchronization and stylistic coherence between audio and visual outputs. Furthermore, our in-context reference learning module effectively captures both speech and facial style characteristics from a single reference video without introducing an extra style extracting module. To the best of our knowledge, OmniTalker presents the first unified framework that jointly models speech style and facial style in a zero-shot setting, achieving real-time inference speed of 25 FPS. Extensive experiments demonstrate that our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.

Urban Computing in the Era of Large Language Models

Zhonghang Li,Lianghao Xia,Xubin Ren,Jiabin Tang,Tianyi Chen,Yong Xu,Chao Huang

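Task: Explore the intersection of large language models (LLMs) and urban computing, and its applications.

Motivation: Traditional approaches are limited in generalization, scalability, and contextual understanding, and LLMs offer the potential to address these problems.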

Details

Method: Surveys the core technologies of LLMs and their applications in urban computing, including data processing, decision support, and citizen engagement. Result: Summarizes LLM applications in domains such as transportation, public safety, and environmental monitoring, and proposes potential solutions. Conclusion: Discusses the limitations of current approaches and outlines future directions for LLMs in urban computing. Abstract: Urban computing has emerged as a multidisciplinary field that harnesses data-driven technologies to address challenges and improve urban living. Traditional approaches, while beneficial, often face challenges with generalization, scalability, and contextual understanding. The advent of Large Language Models (LLMs) offers transformative potential in this domain. This survey explores the intersection of LLMs and urban computing, emphasizing the impact of LLMs in processing and analyzing urban data, enhancing decision-making, and fostering citizen engagement. We provide a concise overview of the evolution and core technologies of LLMs. Additionally, we survey their applications across key urban domains, such as transportation, public safety, and environmental monitoring, summarizing essential tasks and prior works in various urban contexts, while highlighting LLMs' functional roles and implementation patterns. Building on this, we propose potential LLM-based solutions to address unresolved challenges. To facilitate in-depth research, we compile a list of available datasets and tools applicable to diverse urban scenarios. Finally, we discuss the limitations of current approaches and outline future directions for advancing LLMs in urban computing.

SkyReels-A2: Compose Anything in Video Diffusion Transformers

Zhengcong Fei,Debang Li,Di Qiu,Jiahua Wang,Yikun Dou,Rui Wang,Jingtao Xu,Mingyuan Fan,Guibin Chen,Yang Li,Yahui Zhou

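Task: Propose SkyReels-A2, a framework for elements-to-video (E2V) generation from text prompts and reference images.

Motivation: Address the challenges of preserving reference-element fidelity, scene coherence, and natural output in video generation.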

Details

Method: Designs a data pipeline to construct training data, proposes an image-text joint embedding model, and optimizes the inference pipeline. Result: Generates diverse, high-quality videos and performs favorably against closed-source commercial models. Conclusion: SkyReels-A2 is the first open-source commercial-grade E2V model and is expected to advance creative applications. Abstract: This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element. We term this task elements-to-video (E2V), whose primary challenges lie in preserving the fidelity of each reference element, ensuring coherent composition of the scene, and achieving natural outputs. To address these, we first design a comprehensive data pipeline to construct prompt-reference-video triplets for model training. Next, we propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment. We also optimize the inference pipeline for both speed and output stability. Moreover, we introduce a carefully curated benchmark for systematic evaluation, i.e., A2 Bench. Experiments demonstrate that our framework can generate diverse, high-quality videos with precise element control. SkyReels-A2 is the first open-source commercial-grade model for the generation of E2V, performing favorably against advanced closed-source commercial models. We anticipate SkyReels-A2 will advance creative applications such as drama and virtual e-commerce, pushing the boundaries of controllable video generation.

Self-Resource Allocation in Multi-Agent LLM Systems

Alfonso Amayuelas,Jingbo Yang,Saaket Agashe,Ashwin Nagarajan,Antonis Antoniades,Xin Eric Wang,William Wang

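Task: Explore how LLMs can effectively allocate computational tasks in multi-agent systems.

Motivation: As LLMs develop into agents, the role of multi-agent systems in task assignment and coordination is becoming increasingly important.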

Details

Method: Compares the effectiveness of LLMs as orchestrators and as planners in task assignment, and validates their performance experimentally. Result: LLMs achieve high validity and accuracy on resource allocation tasks, and the planner method outperforms the orchestrator method when handling concurrent actions. Conclusion: Providing explicit information about worker capabilities improves planners' allocation strategies, particularly when dealing with suboptimal workers. Abstract: With the development of LLMs as agents, there is a growing interest in connecting multiple agents into multi-agent systems to solve tasks concurrently, focusing on their role in task assignment and coordination. This paper explores how LLMs can effectively allocate computational tasks among multiple agents, considering factors such as cost, efficiency, and performance. In this work, we address key questions, including the effectiveness of LLMs as orchestrators and planners, comparing their effectiveness in task assignment and coordination. Our experiments demonstrate that LLMs can achieve high validity and accuracy in resource allocation tasks. We find that the planner method outperforms the orchestrator method in handling concurrent actions, resulting in improved efficiency and better utilization of agents. Additionally, we show that providing explicit information about worker capabilities enhances the allocation strategies of planners, particularly when dealing with suboptimal workers.

MonoGS++: Fast and Accurate Monocular RGB Gaussian SLAM

Renwu Li,Wenjing Ke,Dong Li,Lu Tian,Emad Barsoum

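Task: Propose MonoGS++, a novel fast and accurate SLAM method that relies solely on RGB input.

Motivation: Reduce the hardware dependence on depth sensors by requiring only RGB input, while improving the quality and efficiency of 3D scene reconstruction.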

Details

Method: Uses online visual odometry to generate sparse point clouds, and introduces dynamic 3D Gaussian insertion, a clarity-enhancing Gaussian densification module, and planar regularization. Result: Achieves precise camera tracking on synthetic and real-world datasets, comparable to the state of the art, with a 5.57x improvement in frame rate. Conclusion: MonoGS++ reduces hardware dependence while substantially improving SLAM performance and efficiency. Abstract: We present MonoGS++, a novel fast and accurate Simultaneous Localization and Mapping (SLAM) method that leverages 3D Gaussian representations and operates solely on RGB inputs. While previous 3D Gaussian Splatting (GS)-based methods largely depended on depth sensors, our approach reduces the hardware dependency and only requires RGB input, leveraging online visual odometry (VO) to generate sparse point clouds in real-time. To reduce redundancy and enhance the quality of 3D scene reconstruction, we implemented a series of methodological enhancements in 3D Gaussian mapping. Firstly, we introduced dynamic 3D Gaussian insertion to avoid adding redundant Gaussians in previously well-reconstructed areas. Secondly, we introduced a clarity-enhancing Gaussian densification module and planar regularization to handle texture-less areas and flat surfaces better. We achieved precise camera tracking results both on the synthetic Replica and real-world TUM-RGBD datasets, comparable to those of the state-of-the-art. Additionally, our method realized a significant 5.57x improvement in frames per second (fps) over the previous state-of-the-art, MonoGS.

TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining

Jeffrey Li,Mohammadreza Armandpour,Iman Mirzadeh,Sachin Mehta,Vaishaal Shankar,Raviteja Vemulapalli,Samy Bengio,Oncel Tuzel,Mehrdad Farajtabar,Hadi Pouransari,Fartash Faghri

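Task: Study evaluation strategies and update methods for large language models (LLMs) as new data becomes available.

Motivation: LLMs trained on historical web data become outdated, so efficient ways of updating them with new data need to be explored.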

Details

Method: Introduces a web-scale dataset derived from 114 Common Crawl (CC) dumps, designs time-stratified evaluations, and compares continual learning methods. Result: On general CC data, autoregressive meta-schedules combined with fixed-ratio replay of older data reach a held-out loss comparable to retraining from scratch while requiring 2.6x less computation. Conclusion: Replaying old data is crucial on generic web data but matters less on specific domains, so the update strategy should be adapted to the data type. Abstract: Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
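
The fixed-ratio replay idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the benchmark's implementation: the function name `mix_with_replay`, the per-batch mixing, sampling with replacement, and the example ratio of 0.25 are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
from typing import List, Sequence


def mix_with_replay(
    new_docs: Sequence[str],
    old_docs: Sequence[str],
    batch_size: int,
    replay_ratio: float,
    rng: random.Random,
) -> List[str]:
    """Build one training batch with a fixed share of older (replayed) data.

    Hypothetical helper: `replay_ratio` is the fraction of the batch drawn
    from older data; the remainder is drawn from the newest data.
    """
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must be in [0, 1]")
    n_old = round(batch_size * replay_ratio)
    n_new = batch_size - n_old
    # Sample with replacement so small pools never raise an error.
    batch = rng.choices(new_docs, k=n_new) + rng.choices(old_docs, k=n_old)
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
new_docs = [f"new-{i}" for i in range(10)]
old_docs = [f"old-{i}" for i in range(10)]
batch = mix_with_replay(new_docs, old_docs, batch_size=8, replay_ratio=0.25, rng=rng)
```

With these assumed settings, each batch of 8 contains 2 replayed documents and 6 new ones. The abstract's finding suggests the appropriate ratio depends on the data: higher for generic web data, lower for specific domains.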

HGFormer: Topology-Aware Vision Transformer with HyperGraph Learning

Hao Wang,Shuo Zhang,Biao Leng

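Task: Propose HGFormer, a topology-aware vision transformer that addresses the inadequate modeling of regional context and spatial topology in conventional vision transformers.

Motivation: The way conventional vision transformers model permutation invariance and fully-connected interactions disrupts regional context and spatial topology, deviating from the principle of perceptual organization; the hypergraph concept is introduced to remedy this.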

Details

Method: Proposes the CS-KNN algorithm for semantic guidance during hypergraph construction, and a topology-aware HGA mechanism that uses hypergraph topology to guide the aggregation of global and unbiased information. Result: HGFormer performs competitively on multiple visual benchmarks and produces distinct, detailed scene depictions. Conclusion: By introducing hypergraphs and a topology-aware mechanism, HGFormer improves vision transformer performance and offers a new perspective on visual representation. Abstract: The computer vision community has witnessed an extensive exploration of vision transformers in the past two years. Drawing inspiration from traditional schemes, numerous works focus on introducing vision-specific inductive biases. However, the implicit modeling of permutation invariance and fully-connected interaction with individual tokens disrupts the regional context and spatial topology, further hindering higher-order modeling. This deviates from the principle of perceptual organization that emphasizes the local groups and overall topology of visual elements. Thus, we introduce the concept of hypergraph for perceptual exploration. Specifically, we propose a topology-aware vision transformer called HyperGraph Transformer (HGFormer). Firstly, we present a Center Sampling K-Nearest Neighbors (CS-KNN) algorithm for semantic guidance during hypergraph construction. Secondly, we present a topology-aware HyperGraph Attention (HGA) mechanism that integrates hypergraph topology as perceptual indications to guide the aggregation of global and unbiased information during hypergraph messaging. Using HGFormer as visual backbone, we develop an effective and unitive representation, achieving distinct and detailed scene depictions. Empirical experiments show that the proposed HGFormer achieves competitive performance compared to the recent SoTA counterparts on various visual benchmarks. Extensive ablation and visualization studies provide comprehensive explanations of our ideas and contributions.

Exploring LLM Reasoning Through Controlled Prompt Variations

Giannis Chatziveroglou,Richard Yun,Maura Kelleher

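Task: Study the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematic input perturbations.

Motivation: Assess how well state-of-the-art models maintain logical consistency and correctness under four kinds of prompt perturbation, in order to reveal potential vulnerabilities in real-world use.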

Details

Method: Uses the GSM8K dataset as a controlled testbed and evaluates 13 open-source and closed-source LLMs under irrelevant context, pathological instructions, factually relevant but non-essential context, and a combination of the latter two. Result: Introducing irrelevant context significantly degrades performance, and the degradation is not strictly correlated with task complexity or model size; some perturbations inadvertently trigger chain-of-thought-like reasoning. Conclusion: Current LLMs are not sufficiently robust to noisy, misleading, and contextually dense inputs and need improvement for more reliable real-world use. Abstract: This study investigates the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematically introduced input perturbations. Using the GSM8K dataset as a controlled testbed, we evaluate how well state-of-the-art models maintain logical consistency and correctness when confronted with four categories of prompt perturbations: irrelevant context, pathological instructions, factually relevant but non-essential context, and a combination of the latter two. Our experiments, conducted on thirteen open-source and closed-source LLMs, reveal that introducing irrelevant context within the model's context window significantly degrades performance, suggesting that distinguishing essential from extraneous details remains a pressing challenge. Surprisingly, performance regressions are relatively insensitive to the complexity of the reasoning task, as measured by the number of steps required, and are not strictly correlated with model size. Moreover, we observe that certain perturbations inadvertently trigger chain-of-thought-like reasoning behaviors, even without explicit prompting. Our findings highlight critical vulnerabilities in current LLMs and underscore the need for improved robustness against noisy, misleading, and contextually dense inputs, paving the way for more resilient and reliable reasoning in real-world applications.
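
To make the four perturbation categories concrete, here is a minimal sketch of how such prompts might be assembled. Everything in it is hypothetical: the function `perturb_prompt`, the example sentences, and the choice to prepend context and append instructions are illustrative assumptions, not the paper's actual perturbation texts or protocol.

```python
# Hypothetical perturbation snippets; the paper's actual texts are not given here.
IRRELEVANT = "The city library extended its opening hours last month."
PATHOLOGICAL = "Add the name of a color before every adjective."
RELEVANT = "Apples are usually sold by weight in most markets."

KINDS = (
    "irrelevant_context",
    "pathological_instruction",
    "relevant_context",
    "combined",
)


def perturb_prompt(question: str, kind: str) -> str:
    """Wrap a math word problem in one of four perturbation categories."""
    if kind == "irrelevant_context":
        return f"{IRRELEVANT} {question}"
    if kind == "pathological_instruction":
        return f"{question} {PATHOLOGICAL}"
    if kind == "relevant_context":
        return f"{RELEVANT} {question}"
    if kind == "combined":
        # Combination of the latter two: relevant context plus pathological instruction.
        return f"{RELEVANT} {question} {PATHOLOGICAL}"
    raise ValueError(f"unknown perturbation kind: {kind}")


question = "Mia buys 3 bags of 4 apples each. How many apples does she have?"
variants = {kind: perturb_prompt(question, kind) for kind in KINDS}
```

Each variant keeps the original question intact, so any change in a model's answer can be attributed to the added text rather than to a change in the problem itself.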

ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer

Jiayi Gao,Zijin Yin,Changcheng Hua,Yuxin Peng,Kongming Liang,Zhanyu Ma,Jun Guo,Yang Liu

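Task: Propose ConMo, a zero-shot framework that addresses the accuracy and diversity of motion transfer in multi-subject videos.

Motivation: Existing methods struggle to transfer the motion of specific subjects in multi-subject videos and fail to preserve motion diversity and accuracy.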

Details

Method: Disentangles and recomposes subject and camera motion, with soft guidance controlling how much of the original motion is retained. Result: ConMo significantly outperforms existing methods in motion fidelity and semantic consistency. Conclusion: ConMo provides a more accurate solution for motion transfer in multi-subject videos and supports a range of applications. Abstract: The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling the control of video motion based on existing footage. However, current methods have two limitations: 1) they struggle to handle multi-subject videos, failing to transfer specific subject motion; 2) they struggle to preserve the diversity and accuracy of motion when transferring to subjects with varying shapes. To overcome these, we introduce ConMo, a zero-shot framework that disentangles and recomposes the motions of subjects and camera movements. ConMo isolates individual subject and background motion cues from complex trajectories in source videos using only subject masks, and reassembles them for target video generation. This approach enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios. Additionally, we propose soft guidance in the recomposition stage which controls the retention of original motion to adjust shape constraints, aiding subject shape adaptation and semantic transformation. Unlike previous methods, ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modifications, and camera motion simulation. Extensive experiments demonstrate that ConMo significantly outperforms state-of-the-art methods in motion fidelity and semantic consistency. The code is available at https://github.com/Andyplus1/ConMo.

Achieving Unanimous Consensus in Decision Making Using Multi-Agents

Apurba Pokharel,Ram Dantu,Shakila Zaman,Sirisha Talapuru,Vinh Quach

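Task: Propose a novel deliberation-based consensus mechanism built on large language models (LLMs) to address the limited adaptability of traditional blockchain consensus mechanisms for decision-making.

Motivation: Traditional consensus mechanisms (such as PoW and PoS) adapt poorly to decisions in which every participant's opinion matters and agreement must be reached through reasoned discussion.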

Details

Method: Uses LLMs as rational agents, with graded consensus and a multi-round deliberation process, to reach unanimous and prioritized decisions. Result: The approach maintains the blockchain properties of consistency, agreement, liveness, and determinism, and experiments demonstrate its feasibility for decision-making. Conclusion: Deliberation-based consensus offers a new approach to blockchain decision-making, but challenges such as degeneration of thoughts, hallucinations, and malicious nodes must be addressed. Abstract: Blockchain consensus mechanisms have relied on algorithms such as Proof-of-Work (PoW) and Proof-of-Stake (PoS) to ensure network functionality and integrity. However, these approaches struggle with adaptability for decision-making where each participant's opinion matters, rather than reaching an agreement based on honest majority or weighted consensus. This paper introduces a novel deliberation-based consensus mechanism where Large Language Models (LLMs) act as rational agents engaging in structured discussions to reach a unanimous consensus. By leveraging graded consensus and a multi-round deliberation process, our approach ensures both unanimous consensus for definitive problems and graded confidence for prioritized decisions and policies. We provide a formalization of our system and use it to show that the properties of blockchains (consistency, agreement, liveness, and determinism) are maintained. Moreover, experimental results demonstrate our system's feasibility, showcasing how our deliberation method's convergence, block properties, and accuracy enable decision-making on blockchain networks. We also address key challenges with this novel approach such as degeneration of thoughts, hallucinations, malicious models and nodes, resource consumption, and scalability.

Taylor Series-Inspired Local Structure Fitting Network for Few-shot Point Cloud Semantic Segmentation

Changshuo Wang,Shuting He,Xiang Fang,Meiqing Wu,Siew-Kei Lam,Prayag Tiwari

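Task: Propose TaylorSeg, a pretraining-free local structure fitting network for few-shot point cloud semantic segmentation.

Motivation: Pretraining-based methods for few-shot point cloud semantic segmentation introduce excessive time overhead and overlook the local structure representation of irregular point clouds.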

Details

Method: Inspired by Taylor series, treats the local structure representation of irregular point clouds as a polynomial fitting problem, proposes a novel local structure fitting convolution called TaylorConv, and builds two TaylorSeg variants: the non-parametric TaylorSeg-NN and the parametric TaylorSeg-PN. Result: Under the 2-way 1-shot setting, TaylorSeg-PN improves mIoU by +2.28% on S3DIS and +4.37% on ScanNet over the previous state-of-the-art methods. Conclusion: With its local structure fitting convolution and Adaptive Push-Pull module, TaylorSeg improves few-shot point cloud semantic segmentation without requiring pretraining. Abstract: Few-shot point cloud semantic segmentation aims to accurately segment "unseen" new categories in point cloud scenes using limited labeled data. However, pretraining-based methods not only introduce excessive time overhead but also overlook the local structure representation among irregular point clouds. To address these issues, we propose a pretraining-free local structure fitting network for few-shot point cloud semantic segmentation, named TaylorSeg. Specifically, inspired by Taylor series, we treat the local structure representation of irregular point clouds as a polynomial fitting problem and propose a novel local structure fitting convolution, called TaylorConv. This convolution learns the low-order basic information and high-order refined information of point clouds from explicit encoding of local geometric structures. Then, using TaylorConv as the basic component, we construct two variants of TaylorSeg: a non-parametric TaylorSeg-NN and a parametric TaylorSeg-PN. The former can achieve performance comparable to existing parametric models without pretraining. For the latter, we equip it with an Adaptive Push-Pull (APP) module to mitigate the feature distribution differences between the query set and the support set. Extensive experiments validate the effectiveness of the proposed method. Notably, under the 2-way 1-shot setting, TaylorSeg-PN achieves improvements of +2.28% and +4.37% mIoU on the S3DIS and ScanNet datasets respectively, compared to the previous state-of-the-art methods. Our code is available at https://github.com/changshuowang/TaylorSeg.

Towards Interpretable Soft Prompts

Oam Patel,Jason Wang,Nikhil Shivakumar Nayak,Suraj Srinivas,Himabindu Lakkaraju

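Task: Propose a new theoretical framework for evaluating the interpretability of trainable prompts.

Motivation: Soft prompts and similar methods improve task performance, but they remain black boxes that lack interpretability.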

Details

Method: Proposes a new interpretability framework based on two criteria, faithfulness and scrutability, and designs objective functions that optimize for interpretability. Result: Experiments show a trade-off between interpretability and task performance. Conclusion: Reveals the difficulty of the soft prompt interpretability problem and suggests a new direction for prompting methods that optimize for interpretability. Abstract: Soft prompts have been popularized as a cheap and easy way to improve task-specific LLM performance beyond few-shot prompts. Despite their origin as an automated prompting method, however, soft prompts and other trainable prompts remain a black-box method with no immediately interpretable connections to prompting. We create a novel theoretical framework for evaluating the interpretability of trainable prompts based on two desiderata: faithfulness and scrutability. We find that existing methods do not naturally satisfy our proposed interpretability criterion. Instead, our framework inspires a new direction of trainable prompting methods that explicitly optimizes for interpretability. To this end, we formulate and test new interpretability-oriented objective functions for two state-of-the-art prompt tuners: Hard Prompts Made Easy (PEZ) and RLPrompt. Our experiments with GPT-2 demonstrate a fundamental trade-off between interpretability and the task-performance of the trainable prompt, explicating the hardness of the soft prompt interpretability problem and revealing odd behavior that arises when one optimizes for an interpretability proxy.

CornerPoint3D: Look at the Nearest Corner Instead of the Center

Ruixiao Zhang,Runwei Guan,Xiangyu Chen,Adam Prugel-Bennett,Xiaohao Cai

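Task: Improve localization accuracy in cross-domain 3D object detection, especially for the surfaces closest to the LiDAR sensor.

Motivation: LiDAR captures only the near side of objects, so center-based detectors localize poorly in cross-domain tasks; existing evaluation metrics also overfit because of dataset-specific size variations.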

Details

Method: Proposes two new metrics that evaluate a model's ability to detect the surfaces closest to the LiDAR sensor, introduces the EdgeHead refinement head, and proposes the CornerPoint3D detector, which predicts the nearest corner rather than the center. Result: Outperforms the traditional center-based detector CenterPoint in cross-domain tasks, balancing overall detection quality with localization accuracy on the closer surfaces. Conclusion: Provides a more practical and robust solution for cross-domain 3D object detection. Abstract: 3D object detection aims to predict object centers, dimensions, and rotations from LiDAR point clouds. Despite its simplicity, LiDAR captures only the near side of objects, making center-based detectors prone to poor localization accuracy in cross-domain tasks with varying point distributions. Meanwhile, existing evaluation metrics designed for single-domain assessment also suffer from overfitting due to dataset-specific size variations. A key question arises: Do we really need models to maintain excellent performance in the entire 3D bounding boxes after being applied across domains? Actually, one of our main focuses is on preventing collisions between vehicles and other obstacles, especially in cross-domain scenarios where correctly predicting the sizes is much more difficult. To address these issues, we rethink cross-domain 3D object detection from a practical perspective. We propose two new metrics that evaluate a model's ability to detect objects' closer-surfaces to the LiDAR sensor. Additionally, we introduce EdgeHead, a refinement head that guides models to focus more on learnable closer surfaces, significantly improving cross-domain performance under both our new and traditional BEV/3D metrics. Furthermore, we argue that predicting the nearest corner rather than the object center enhances robustness. We propose a novel 3D object detector, coined as CornerPoint3D, which is built upon CenterPoint and uses heatmaps to supervise the learning and detection of the nearest corner of each object. Our proposed methods realize a balanced trade-off between the detection quality of entire bounding boxes and the locating accuracy of closer surfaces to the LiDAR sensor, outperforming the traditional center-based detector CenterPoint in multiple cross-domain tasks and providing a more practically reasonable and robust cross-domain 3D object detection solution.

Neural Style Transfer for Synthesising a Dataset of Ancient Egyptian Hieroglyphs

Lewis Matheson Creed

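Task: Propose a new method that uses Neural Style Transfer (NST) to generate datasets of ancient Egyptian hieroglyphs.

Motivation: Training data for low-resource languages such as Ancient Egyptian is scarce, which limits the application of machine learning techniques.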

Details

Method: Generates a dataset of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Result: Image classification models trained on NST-generated examples perform on par with models trained on photographs and transfer equally well to real, unseen images of hieroglyphs. Conclusion: NST is an effective data augmentation method for machine learning tasks on low-resource languages. Abstract: The limited availability of training data for low-resource languages makes applying machine learning techniques challenging. Ancient Egyptian is one such language with few resources. However, innovative applications of data augmentation methods, such as Neural Style Transfer, could overcome these barriers. This paper presents a novel method for generating datasets of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Experimental results found that image classification models trained on NST-generated examples and photographs demonstrate equal performance and transferability to real unseen images of hieroglyphs.

Semantic segmentation of forest stands using deep learning

Håkon Næss Sandum,Hans Ole Ørka,Oliver Tomic,Erik Næsset,Terje Gobakken

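Task: Propose a multiclass segmentation approach based on a U-Net deep learning framework for automated delineation of forest stand borders.

Motivation: Manual interpretation of stereographic aerial images is time-consuming and subjective, limiting operational efficiency and introducing inconsistencies, so an automated solution is needed.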

Details

Method: Applies a U-Net-based deep learning framework, trained and evaluated using multispectral images, ALS data, and a stand map created by an expert interpreter. Result: The model achieves an overall accuracy of 0.73 on independent data. Conclusion: Despite challenges in complex forest environments, deep learning shows strong potential for automated stand delineation. Abstract: Forest stands are the fundamental units in forest management inventories, silviculture, and financial analysis within operational forestry. Over the past two decades, a common method for mapping stand borders has involved delineation through manual interpretation of stereographic aerial images. This is a time-consuming and subjective process, limiting operational efficiency and introducing inconsistencies. Substantial effort has been devoted to automating the process, using various algorithms together with aerial images and canopy height models constructed from airborne laser scanning (ALS) data, but manual interpretation remains the preferred method. Deep learning (DL) methods have demonstrated great potential in computer vision, yet their application to forest stand delineation remains unexplored in published research. This study presents a novel approach, framing stand delineation as a multiclass segmentation problem and applying a U-Net based DL framework. The model was trained and evaluated using multispectral images, ALS data, and an existing stand map created by an expert interpreter. Performance was assessed on independent data using overall accuracy, a standard metric for classification tasks that measures the proportion of correctly classified pixels. The model achieved an overall accuracy of 0.73. These results demonstrate strong potential for DL in automated stand delineation. However, a few key challenges were noted, especially for complex forest environments.
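
Overall accuracy, as described in the abstract, is the share of pixels whose predicted class matches the reference class. The sketch below computes it on a tiny made-up label map; the function name `overall_accuracy` and the example arrays are illustrative assumptions and are unrelated to the study's data.

```python
from typing import Sequence


def overall_accuracy(
    predicted: Sequence[Sequence[int]],
    reference: Sequence[Sequence[int]],
) -> float:
    """Fraction of pixels whose predicted class equals the reference class."""
    total = 0
    correct = 0
    for pred_row, ref_row in zip(predicted, reference):
        for pred_class, ref_class in zip(pred_row, ref_row):
            total += 1
            if pred_class == ref_class:
                correct += 1
    if total == 0:
        raise ValueError("label maps must contain at least one pixel")
    return correct / total


# Made-up 2x3 label maps with three stand classes (0, 1, 2).
predicted = [[0, 1, 1], [2, 2, 0]]
reference = [[0, 1, 2], [2, 2, 1]]
accuracy = overall_accuracy(predicted, reference)  # 4 of 6 pixels match
```

Because every pixel counts equally, overall accuracy can look favorable when a few large classes dominate the map, which is worth keeping in mind when reading a single summary figure such as 0.73.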

Jacy Reese Anthis,Ryan Liu,Sean M. Richardson,Austin C. Kozlowski,Bernard Koch,James Evans,Erik Brynjolfsson,Michael Bernstein

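Task: Examine how the promise of large language model (LLM) simulations of human behavior can be realized by addressing five tractable challenges.

Motivation: The potential of LLMs to simulate human research subjects has not yet been realized, and few social scientists have adopted these methods.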

Details

Method: Grounds the argument in a literature survey of empirical comparisons between LLMs and human research subjects and of commentaries on the topic, and identifies promising directions in prompting, fine-tuning, and complementary methods. Result: LLM social simulations can already be used for exploratory research, such as pilot experiments in psychology, economics, sociology, and marketing. Conclusion: With rapidly advancing LLM capabilities, more widespread use may soon be possible, and researchers should prioritize conceptual models and evaluations that can be iteratively deployed and refined. Abstract: Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted these methods. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a literature survey of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions with prompting, fine-tuning, and complementary methods. We believe that LLM social simulations can already be used for exploratory research, such as pilot experiments for psychology, economics, sociology, and marketing. More widespread use may soon be possible with rapidly advancing LLM capabilities, and researchers should prioritize developing conceptual models and evaluations that can be iteratively deployed and refined at pace with ongoing AI advances.

MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities

Bizhu Wu,Jinheng Xie,Keming Shen,Zhe Kong,Jianfeng Ren,Ruibin Bai,Rong Qu,Linlin Shen

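Task: Develop MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation.

Motivation: Existing methods focus mainly on coarse-grained motion-text modeling and cannot handle fine-grained motion-related tasks, such as understanding and controlling the movements of specific body parts.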

Details

Method: Introduces a multi-granularity training scheme with auxiliary tasks, such as localizing the temporal boundaries of motion segments and detailed motion captioning, to strengthen motion-text modeling across levels of granularity. Result: MG-MotionLLM performs strongly on classical text-to-motion and motion-to-text tasks and shows potential on fine-grained motion comprehension and editing tasks. Conclusion: The multi-granularity training scheme improves motion comprehension and generation, particularly on fine-grained tasks. Abstract: Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing approaches primarily focus on coarse-grained motion-text modeling, where text describes the overall semantics of an entire motion sequence in just a few words. This limits their ability to handle fine-grained motion-relevant tasks, such as understanding and controlling the movements of specific body parts. To overcome this limitation, we pioneer MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation. We further introduce a comprehensive multi-granularity training scheme by incorporating a set of novel auxiliary tasks, such as localizing temporal boundaries of motion segments via detailed text as well as motion detailed captioning, to facilitate mutual reinforcement for motion-text modeling across various levels of granularity. Extensive experiments show that our MG-MotionLLM achieves superior performance on classical text-to-motion and motion-to-text tasks, and exhibits potential in novel fine-grained motion comprehension and editing tasks. Project page: CVI-SZU/MG-MotionLLM

Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data

Waris Gill,Justin Cechmanek,Tyler Hutcherson,Srijith Rajamohan,Jen Agarwal,Muhammad Ali Gulzar,Manvinder Singh,Benoit Dion

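Task: Investigate how specialized, fine-tuned embedding models can make semantic caching more effective.

Motivation: Semantic caching relies on embedding similarity rather than exact key matching, so it must balance precision, query latency, and computational efficiency.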

Details

Method: Proposes using small, domain-specific embedding models fine-tuned on real-world and synthetically generated datasets. Result: Compact embedding models fine-tuned on domain-specific datasets surpass existing open-source and proprietary alternatives in precision and recall. Conclusion: The approach balances computational overhead and accuracy, offering an efficient strategy for practical semantic caching. Abstract: This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in balancing precision, query latency, and computational efficiency. We propose leveraging smaller, domain-specific embedding models, fine-tuned with targeted real-world and synthetically generated datasets. Our empirical evaluations demonstrate that compact embedding models fine-tuned for just one epoch on specialized datasets significantly surpass both state-of-the-art open-source and proprietary alternatives in precision and recall. Moreover, we introduce a novel synthetic data generation pipeline for the semantic cache that mitigates the challenge of limited domain-specific annotated data, further boosting embedding performance. Our approach effectively balances computational overhead and accuracy, establishing a viable and efficient strategy for practical semantic caching implementations.
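
The core idea of a semantic cache, lookup by embedding similarity instead of exact key match, can be sketched in a few lines. This is an illustrative sketch only: the `SemanticCache` class, the 0.9 threshold, and the letter-count `toy_embed` function are assumptions standing in for a real fine-tuned embedding model, and the linear scan stands in for a vector index.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Returns a cached response when a query is similar enough to a stored one."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed          # embedding function (assumed to be supplied)
        self.threshold = threshold  # minimum similarity for a hit (assumed value)
        self.entries: List[Tuple[Vector, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    def get(self, query: str) -> Optional[str]:
        if not self.entries:
            return None
        query_vec = self.embed(query)
        # Linear scan over stored embeddings; fine for a sketch.
        best_vec, best_response = max(
            self.entries, key=lambda entry: cosine_similarity(query_vec, entry[0])
        )
        if cosine_similarity(query_vec, best_vec) >= self.threshold:
            return best_response
        return None


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: counts of letters a-z."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


cache = SemanticCache(toy_embed, threshold=0.9)
cache.put("How do I reset my password?", "Use the reset link on the login page.")
hit = cache.get("how do i reset my password")   # same letters, so a cache hit
miss = cache.get("zzzz qqqq")                   # no shared letters, so None
```

The threshold is where the precision and recall trade-off discussed in the abstract shows up: a lower value returns more cached answers but risks serving a response to a query that only looks similar, which is why the quality of the embedding model matters.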

Graph Attention-Driven Bayesian Deep Unrolling for Dual-Peak Single-Photon Lidar Imaging

Kyungmin Choi,JaKeoung Koo,Stephen McLaughlin,Abderrahim Halimi

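Task: Propose a deep unrolling algorithm for dual-peak single-photon Lidar imaging.

Motivation: Address the challenges single-photon Lidar faces in noisy environments and multi-target scenes by combining the strengths of statistical methods and deep learning.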

Details

Method: Uses a hierarchical Bayesian model and a deep unrolling algorithm, combined with a dual depth map representation and geometric deep learning. Result: Performs competitively with existing methods on synthetic and real data while also providing uncertainty information. Conclusion: The method combines the strengths of statistical methods and deep learning, achieving high accuracy together with uncertainty quantification. Abstract: Single-photon Lidar imaging offers a significant advantage in 3D imaging due to its high resolution and long-range capabilities; however, it is challenging to apply in noisy environments with multiple targets per pixel. To tackle these challenges, several methods have been proposed. Statistical methods demonstrate interpretability on the inferred parameters, but they are often limited in their ability to handle complex scenes. Deep learning-based methods have shown superior performance in terms of accuracy and robustness, but they lack interpretability or they are limited to a single peak per pixel. In this paper, we propose a deep unrolling algorithm for dual-peak single-photon Lidar imaging. We introduce a hierarchical Bayesian model for multiple targets and propose a neural network that unrolls the underlying statistical method. To support multiple targets, we adopt a dual depth maps representation and exploit geometric deep learning to extract features from the point cloud. The proposed method takes advantage of statistical methods and learning-based methods in terms of accuracy and quantifying uncertainty. The experimental results on synthetic and real data demonstrate the competitive performance when compared to existing methods, while also providing uncertainty information.

ZClip: Adaptive Spike Mitigation for LLM Pre-Training

Abhay Kumar,Louis Owen,Nilabhra Roy Chowdhury,Fabian Güra

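Task: Propose ZClip, an adaptive gradient clipping algorithm that addresses gradient instability and loss spikes in large language model training.

Motivation: Traditional gradient clipping relies on fixed thresholds or heuristics and cannot handle gradient instability and loss spikes effectively, leading to inefficient learning and frequent manual intervention.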

Details

Method: ZClip dynamically adjusts the clipping threshold based on the statistical properties of gradient norms, using z-score-based anomaly detection to identify and mitigate large gradient spikes. Result: ZClip prevents malignant loss spikes without otherwise interfering with convergence. Conclusion: ZClip is an adaptive gradient clipping algorithm that makes no prior assumptions about gradient norms and improves the stability of LLM training. Abstract: Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: https://github.com/bluorion-com/ZClip.
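
The general idea of z-score-based adaptive clipping can be sketched as follows. This is not the authors' implementation (see their repository for that): the class name `ZScoreClipper`, the exponential moving averages, the warm-up period, and the default values `alpha=0.97` and `z_threshold=2.5` are all assumptions chosen for illustration.

```python
import math
from typing import List


class ZScoreClipper:
    """Tracks running statistics of gradient norms and clips z-score outliers."""

    def __init__(self, alpha: float = 0.97, z_threshold: float = 2.5,
                 warmup_steps: int = 5):
        self.alpha = alpha                # EMA decay (assumed value)
        self.z_threshold = z_threshold    # z-score that marks a spike (assumed value)
        self.warmup_steps = warmup_steps  # steps used to initialise the statistics
        self.mean = 0.0
        self.var = 0.0
        self._warmup: List[float] = []

    def clip_coefficient(self, grad_norm: float) -> float:
        """Return the factor to scale gradients by (1.0 means no clipping)."""
        if len(self._warmup) < self.warmup_steps:
            # During warm-up, only collect norms; never clip.
            self._warmup.append(grad_norm)
            if len(self._warmup) == self.warmup_steps:
                n = len(self._warmup)
                self.mean = sum(self._warmup) / n
                self.var = sum((g - self.mean) ** 2 for g in self._warmup) / n
            return 1.0

        std = math.sqrt(self.var)
        threshold = self.mean + self.z_threshold * std
        if std > 0.0 and grad_norm > threshold:
            coefficient = threshold / grad_norm  # shrink the norm to the threshold
            effective_norm = threshold
        else:
            coefficient = 1.0
            effective_norm = grad_norm

        # Update statistics with the clipped norm so a spike cannot inflate them.
        self.mean = self.alpha * self.mean + (1.0 - self.alpha) * effective_norm
        self.var = (self.alpha * self.var
                    + (1.0 - self.alpha) * (effective_norm - self.mean) ** 2)
        return coefficient


clipper = ZScoreClipper()
for norm in [1.0, 1.1, 0.9, 1.0, 1.0]:   # warm-up: coefficients are all 1.0
    clipper.clip_coefficient(norm)
spike_coefficient = clipper.clip_coefficient(10.0)   # well below 1.0: spike is clipped
normal_coefficient = clipper.clip_coefficient(1.0)   # 1.0: ordinary step is untouched
```

In a training loop, the returned coefficient would multiply every gradient before the optimizer step. Unlike a fixed threshold, the cutoff here follows the recent mean and spread of the norms, so it tightens or loosens as training dynamics change.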

Semiconductor Wafer Map Defect Classification with Tiny Vision Transformers

Faisal Mohammad,Duksan Ryu

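Task: Propose ViT-Tiny, a lightweight Vision Transformer framework for semiconductor wafer defect classification.

Motivation: Traditional CNN models struggle with class imbalance and with recognizing multiple overlapping defect types in wafer maps.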

Details

Method: Trains the ViT-Tiny framework on the WM-38k dataset and uses ablation studies to determine that a patch size of 16 performs best. Result: ViT-Tiny achieves an F1-score of 98.4%, surpassing MSF-Trans by 2.94% in four-defect classification, and improves recall by 2.86% in two-defect classification and precision by 3.13% in three-defect classification. Conclusion: ViT-Tiny is a computationally efficient and reliable solution for real-world semiconductor defect detection. Abstract: Semiconductor wafer defect classification is critical for ensuring high precision and yield in manufacturing. Traditional CNN-based models often struggle with class imbalances and recognition of the multiple overlapping defect types in wafer maps. To address these challenges, we propose ViT-Tiny, a lightweight Vision Transformer (ViT) framework optimized for wafer defect classification. Trained on the WM-38k dataset, ViT-Tiny outperforms its ViT-Base counterpart and state-of-the-art (SOTA) models, such as MSF-Trans and CNN-based architectures. Through extensive ablation studies, we determine that a patch size of 16 provides optimal performance. ViT-Tiny achieves an F1-score of 98.4%, surpassing MSF-Trans by 2.94% in four-defect classification, improving recall by 2.86% in two-defect classification, and increasing precision by 3.13% in three-defect classification. Additionally, it demonstrates enhanced robustness under limited labeled data conditions, making it a computationally efficient and reliable solution for real-world semiconductor defect detection.

Reasoning Inconsistencies and How to Mitigate Them in Deep Learning

Erik Arakelyan

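Task: Develop new methods to detect and mitigate reasoning inconsistencies in deep learning models and to optimize their performance on complex tasks.

Motivation: Although deep learning models have improved markedly in performance, the opacity of their internal reasoning makes systematic errors and logical flaws hard to detect and resolve, which can lead to bias or unreliability.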

Details

Method: Proposes new methods for models that reason over knowledge graphs, natural language, and images, including techniques for detecting predictive inconsistencies, fairer data sampling, synthetic dataset generation, and optimization for complex reasoning tasks. Result: Contributes several techniques: methods for detecting inconsistencies, a data sampling method that improves fairness and performance, and model techniques for complex reasoning tasks. Conclusion: The thesis provides a comprehensive framework that improves the robustness, fairness, and interpretability of deep learning models across diverse tasks and modalities. Abstract: The recent advancements in Deep Learning models and techniques have led to significant strides in performance across diverse tasks and modalities. However, while the overall capabilities of models show promising growth, our understanding of their internal reasoning processes remains limited, particularly concerning systematic inconsistencies or error patterns reflecting logical or inferential flaws. These inconsistencies may manifest as contradictory outputs, failure to generalize across similar tasks, or erroneous conclusions in specific contexts. Even detecting and measuring such reasoning discrepancies is challenging, as they may arise from opaque internal procedures, biases and imbalances in training data, or the inherent complexity of the task. Without effective methods to detect, measure, and mitigate these errors, there is a risk of deploying models that are biased, exploitable, or logically unreliable. This thesis aims to address these issues by producing novel methods for deep learning models that reason over knowledge graphs, natural language, and images. The thesis contributes two techniques for detecting and quantifying predictive inconsistencies originating from opaque internal procedures in natural language and image processing models. To mitigate inconsistencies from biases in training data, this thesis presents a data-efficient sampling method to improve fairness and performance and a synthetic dataset generation approach in low-resource scenarios. Finally, the thesis offers two techniques to optimize the models for complex reasoning tasks. These methods enhance model performance while allowing for more faithful and interpretable exploration and exploitation during inference. Critically, this thesis provides a comprehensive framework to improve the robustness, fairness, and interpretability of deep learning models across diverse tasks and modalities.

Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention

Jiuniu Wang,Wenjia Xu,Qingzhong Wang,Antoni B. Chan

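Task: Propose a group-based differential distinctive image captioning method to enhance the distinctiveness of image captions.

Motivation: Existing captioning models perform well on standard metrics, but their captions are weak at distinguishing the target image from similar images.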

Details

Method: Introduces a Group-based Differential Memory Attention (GDMA) module that visually compares object features within an image group and prioritizes unique object features, with distinctive words from the ground-truth captions guiding generation. Result: The method significantly improves the distinctiveness of baseline models and achieves state-of-the-art distinctiveness without excessively sacrificing accuracy. Conclusion: The proposed method effectively enhances caption distinctiveness and contributes a new evaluation metric and a direction for improvement in image captioning. Abstract: Recent advances in image captioning have focused on enhancing accuracy by substantially increasing the dataset and model size. While conventional captioning models exhibit high performance on established metrics such as BLEU, CIDEr, and SPICE, the capability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employed contrastive learning or re-weighted the ground-truth captions. However, these approaches often overlook the relationships among objects in a similar image group (e.g., items or properties within the same album or fine-grained events). In this paper, we introduce a novel approach to enhance the distinctiveness of image captions, namely Group-based Differential Distinctive Captioning Method, which visually compares each image with other images in one similar group and highlights the uniqueness of each image. In particular, we introduce a Group-based Differential Memory Attention (GDMA) module, designed to identify and emphasize object features in an image that are uniquely distinguishable within its image group, i.e., those exhibiting low similarity with objects in other images. This mechanism ensures that such unique object features are prioritized during caption generation for the image, thereby enhancing the distinctiveness of the resulting captions. To further refine this process, we select distinctive words from the ground-truth captions to guide both the language decoder and the GDMA module. Additionally, we propose a new evaluation metric, the Distinctive Word Rate (DisWordRate), to quantitatively assess caption distinctiveness. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves state-of-the-art performance on distinctiveness while not excessively sacrificing accuracy...

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

Yan Ma,Steffi Chern,Xuyang Shen,Yiran Zhong,Pengfei Liu

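Task: Propose a transparent, from-scratch reinforcement learning framework for vision-language models (VLMs) and validate its effectiveness.

Motivation: Existing RL applications in VLMs rely on complex frameworks and lack reproducibility and standardized evaluation protocols, making it difficult to compare results or interpret training dynamics.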

Details

Method: Designs a minimal yet functional four-step pipeline validated across multiple models and datasets, and proposes a standardized evaluation scheme. Result: Experiments find that response length is sensitive to random seeds, that reflection correlates with output length, and that RL outperforms supervised fine-tuning in generalization. Conclusion: The framework and findings aim to establish a reproducible baseline and support broader RL-based VLM research. Abstract: Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.

APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers

Zhuguanyu Wu,Jiayi Zhang,Jiaxin Chen,Jinyang Guo,Di Huang,Yunhong Wang

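Task: Propose APHQ-ViT, a new post-training quantization method based on the Average Perturbation Hessian (APH), to address the accuracy degradation of Vision Transformers (ViTs) under ultra-low-bit quantization.

Motivation: ViTs suffer significant accuracy drops when quantized for deployment, especially with post-training quantization (PTQ) at ultra-low bit widths, and existing methods do not solve this effectively.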

Details

Method: Estimates output importance with an improved APH loss and designs an MLP Reconstruction (MR) method to handle the quantization of post-GELU activations. Result: APHQ-ViT substantially outperforms existing PTQ methods at 3-bit and 4-bit quantization across different vision tasks. Conclusion: APHQ-ViT is an effective PTQ method that markedly improves ViT performance under ultra-low-bit quantization. Abstract: Vision Transformers (ViTs) have become one of the most commonly used backbones for vision tasks. Despite their remarkable performance, they often suffer significant accuracy drops when quantized for practical deployment, particularly by post-training quantization (PTQ) under ultra-low bits. Recently, reconstruction-based PTQ methods have shown promising performance in quantizing Convolutional Neural Networks (CNNs). However, they fail when applied to ViTs, primarily due to the inaccurate estimation of output importance and the substantial accuracy degradation in quantizing post-GELU activations. To address these issues, we propose APHQ-ViT, a novel PTQ approach based on importance estimation with Average Perturbation Hessian (APH). Specifically, we first thoroughly analyze the current approximation approaches with Hessian loss, and propose an improved average perturbation Hessian loss. To deal with the quantization of the post-GELU activations, we design an MLP Reconstruction (MR) method by replacing the GELU function in MLP with ReLU and reconstructing it by the APH loss on a small unlabeled calibration set. Extensive experiments demonstrate that APHQ-ViT using linear quantizers outperforms existing PTQ methods by substantial margins in 3-bit and 4-bit across different vision tasks. The source code is available at https://github.com/GoatWu/APHQ-ViT.
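
To make the 3-bit and 4-bit "linear quantizer" setting concrete, here is a minimal sketch of a generic uniform (min-max) quantizer. It only illustrates how few levels are available at these bit widths. It is not the APH loss or the MLP Reconstruction procedure, and the function names are hypothetical.

```python
def linear_quantize(values, num_bits):
    """Uniformly quantize floats to 2**num_bits levels, then dequantize them."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    if hi == lo:
        return list(values)  # constant input: nothing to quantize
    scale = (hi - lo) / levels
    dequantized = []
    for v in values:
        q = round((v - lo) / scale)        # integer code in [0, levels]
        dequantized.append(lo + q * scale)  # value represented by that code
    return dequantized


def max_abs_error(values, num_bits):
    """Largest rounding error introduced by quantizing at the given bit width."""
    deq = linear_quantize(values, num_bits)
    return max(abs(a - b) for a, b in zip(values, deq))
```

With 3 bits only 8 distinct values are representable, so the worst-case rounding error is half of `(max - min) / 7`. This is why reconstruction-based calibration matters at such low precision.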

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Daoguang Zan,Zhirong Huang,Wei Liu,Hanwu Chen,Linhao Zhang,Shulin Xin,Lu Chen,Qi Liu,Xiaojian Zhong,Aoyan Li,Siyao Liu,Yongsheng Xiao,Liangqiang Chen,Yuyu Zhang,Jing Su,Tianyu Liu,Rui Long,Kai Shen,Liang Xiang

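Task: Build a multilingual issue-resolving benchmark (Multi-SWE-bench) for evaluating large language models (LLMs) across multiple programming languages.

Motivation: Existing benchmarks such as SWE-bench focus mainly on Python and cannot fully evaluate LLMs across diverse software ecosystems.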

Details

Method: Expert annotators select 1,632 high-quality instances from 2,456 candidates, covering seven languages including Java and TypeScript, and models are evaluated with three representative methods (Agentless, SWE-agent, and OpenHands). Result: Introduces the Multi-SWE-bench benchmark, presents a comprehensive analysis of a series of state-of-the-art models on it, and open-sources 4,723 structured instances together with the data production pipeline. Conclusion: Multi-SWE-bench and the Multi-SWE-RL community are intended to advance reinforcement learning (RL) and lay groundwork toward artificial general intelligence (AGI). Abstract: The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.

Towards Generalizing Temporal Action Segmentation to Unseen Views

Emad Bahrami,Olga Zatsarynna,Gianpiero Francesca,Juergen Gall

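Task: Propose a protocol and a method for action segmentation on unseen views.

Motivation: Address the poor generalization of action segmentation models to unseen views.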

Details

Method: Uses shared representations at the sequence and segment levels, with a sequence loss and an action loss that reduce the impact of view differences. Result: Significant improvements on the Assembly101, IkeaASM, and EgoExoLearn datasets, with F1@50 gains of 12.8% for unseen exocentric views and 54% for unseen egocentric views. Conclusion: The proposed method effectively improves the generalization of action segmentation models to unseen views. Abstract: While there has been substantial progress in temporal action segmentation, the challenge to generalize to unseen views remains unaddressed. Hence, we define a protocol for unseen view action segmentation where camera views for evaluating the model are unavailable during training. This includes changing from top-frontal views to a side view or even more challenging from exocentric to egocentric views. Furthermore, we present an approach for temporal action segmentation that tackles this challenge. Our approach leverages a shared representation at both the sequence and segment levels to reduce the impact of view differences during training. We achieve this by introducing a sequence loss and an action loss, which together facilitate consistent video and action representations across different views. The evaluation on the Assembly101, IkeaASM, and EgoExoLearn datasets demonstrates significant improvements, with a 12.8% increase in F1@50 for unseen exocentric views and a substantial 54% improvement for unseen egocentric views.

Efficient Model Editing with Task-Localized Sparse Fine-tuning

Leonardo Iurada,Marco Ciccone,Tatiana Tommasi

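Task: Propose TaLoS, a method for building sparse task vectors that addresses the computational bottlenecks and weight-disentanglement problems of existing task-vector methods.

Motivation: Existing methods rely on network linearization to derive task vectors, which creates computational bottlenecks during training and inference; linearization also does not guarantee weight disentanglement, which hampers conflict-free composition of task vectors.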

Details

Method: Identifies the subset of pre-trained parameters with low gradient sensitivity and sparsely updates only those parameters to promote weight disentanglement, without explicit linearization or sharing information across tasks. Result: TaLoS improves training and inference efficiency over existing methods and performs better in task addition and negation. Conclusion: By enabling modular parameter editing, TaLoS offers a practical route to deploying adaptable foundation models in real-world applications. Abstract: Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which builds sparse task vectors with minimal interference without requiring explicit linearization or sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters promotes weight disentanglement during fine-tuning. Our experiments prove that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
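
The core idea of updating only low-sensitivity parameters can be sketched as below. This is a toy illustration under stated assumptions, not the TaLoS algorithm itself. Sensitivity is taken here as the largest absolute gradient across tasks, the kept fraction is a free parameter, and the helper names `low_sensitivity_mask` and `sparse_update` are hypothetical.

```python
def low_sensitivity_mask(grads_per_task, keep_fraction):
    """Return a 0/1 mask selecting the parameters with the lowest sensitivity.

    grads_per_task: one gradient list per task, one float per parameter.
    Sensitivity is the largest |gradient| across tasks (an assumption of this sketch).
    """
    num_params = len(grads_per_task[0])
    sensitivity = [max(abs(g[i]) for g in grads_per_task) for i in range(num_params)]
    k = max(1, int(keep_fraction * num_params))
    chosen = sorted(range(num_params), key=lambda i: sensitivity[i])[:k]
    mask = [0] * num_params
    for i in chosen:
        mask[i] = 1
    return mask


def sparse_update(params, grads, mask, lr):
    """Gradient step applied only where mask == 1; other parameters stay frozen."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]
```

The resulting task vector (fine-tuned minus pre-trained weights) is zero outside the mask, so task vectors built this way are sparse by construction.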

Exploration-Driven Generative Interactive Environments

Nedko Savov,Naser Kazemi,Mohammad Mahdi,Danda Pani Paudel,Xi Wang,Luc Van Gool

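Task: Propose a framework that trains multi-environment world models using only a random agent in virtual environments, and improve exploration with the AutoExplore Agent.

Motivation: Modern world models require costly, time-consuming collection of video datasets, and Genie, although it can simulate many environments, depends on expensive demonstration data.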

Details

Method: Collects data in virtual environments with a random agent and the AutoExplore Agent, and builds the large-scale RetroAct dataset by grouping and annotating environments. Result: The pretrained multi-environment model adapts quickly to new environments, with improved video fidelity and controllability. Conclusion: The proposed framework and AutoExplore Agent effectively reduce training cost and improve model adaptability. Abstract: Modern world models require costly and time-consuming collection of large video datasets with action demonstrations by people or by environment-specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates simulation abilities of many environments with shared behavior. Unfortunately, training their model requires expensive demonstrations. Therefore, we propose a training framework merely using a random agent in virtual environments. While the model trained in this manner exhibits good controls, it is limited by the random exploration possibilities. To address this limitation, we propose AutoExplore Agent - an exploration agent that entirely relies on the uncertainty of the world model, delivering diverse data from which it can learn the best. Our agent is fully independent of environment-specific rewards and thus adapts easily to new environments. With this approach, the pretrained multi-environment model can quickly adapt to new environments, achieving video fidelity and controllability improvements. To automatically obtain large-scale interaction datasets for pretraining, we group environments with similar behavior and controls. To this end, we annotate the behavior and controls of 974 virtual environments - a dataset that we name RetroAct. For building our model, we first create an open implementation of Genie, GenieRedux, and apply enhancements and adaptations in our version, GenieRedux-G. Our code and data are available at https://github.com/insait-institute/GenieRedux.

Affordable AI Assistants with Knowledge Graph of Thoughts

Maciej Besta,Lorenzo Paleari,Jia Hao Andrea Jiang,Robert Gerstenberger,You Wu,Patrick Iff,Ales Kubicek,Piotr Nyczyk,Diana Khimey,Jón Gunnar Hannesson,Grzegorz Kwaśniewski,Marcin Copik,Hubert Niewiadomski,Torsten Hoefler

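Task: Propose Knowledge Graph of Thoughts (KGoT), a new AI assistant architecture that addresses the high cost and low success rates of current LLM-driven agents.

Motivation: Current state-of-the-art LLM-driven agents face high costs and low success rates on complex tasks such as the GAIA benchmark.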

Details

Method: KGoT combines LLM reasoning with dynamically constructed knowledge graphs (KGs), extracting and structuring task-relevant knowledge and iteratively enhancing it through external tools such as math solvers, web crawlers, and Python scripts. Result: On the GAIA benchmark, KGoT improves task success rates by 29% and reduces costs by more than 36x, with similar improvements for models such as Qwen2.5-32B and Deepseek-R1-70B. Conclusion: KGoT offers a scalable, affordable, and high-performing solution for AI assistants. Abstract: Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose the Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini, while reducing costs by over 36x compared to GPT-4o. Improvements for recent reasoning models are similar, e.g., 36% and 37.5% for Qwen2.5-32B and Deepseek-R1-70B, respectively. KGoT offers a scalable, affordable, and high-performing solution for AI assistants.

MultiNeRF: Multiple Watermark Embedding for Neural Radiance Fields

Yash Kulthe,Andrew Gilbert,John Collomosse

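Task: Propose MultiNeRF, a 3D watermarking method that embeds multiple uniquely keyed watermarks in images rendered by a single NeRF model while maintaining high visual quality.

Motivation: Extend the TensoRF NeRF model to achieve higher watermark capacity without affecting scene content, and provide flexible embedding and extraction of multiple watermarks.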

Details

Method: Introduces a dedicated watermark grid and a FiLM-based conditional modulation mechanism that dynamically activates watermarks, supporting the embedding and extraction of multiple independent watermarks. Result: Validated on the NeRF-Synthetic and LLFF datasets, with significantly improved robust capacity and no loss of rendering quality. Conclusion: MultiNeRF generalizes single-watermark NeRF methods into a flexible multi-watermarking framework, providing a scalable solution for 3D content attribution. Abstract: We present MultiNeRF, a 3D watermarking method that embeds multiple uniquely keyed watermarks within images rendered by a single Neural Radiance Field (NeRF) model, whilst maintaining high visual quality. Our approach extends the TensoRF NeRF model by incorporating a dedicated watermark grid alongside the existing geometry and appearance grids. This extension ensures higher watermark capacity without entangling watermark signals with scene content. We propose a FiLM-based conditional modulation mechanism that dynamically activates watermarks based on input identifiers, allowing multiple independent watermarks to be embedded and extracted without requiring model retraining. MultiNeRF is validated on the NeRF-Synthetic and LLFF datasets, with statistically significant improvements in robust capacity without compromising rendering quality. By generalizing single-watermark NeRF methods into a flexible multi-watermarking framework, MultiNeRF provides a scalable solution for 3D content attribution.
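
FiLM (feature-wise linear modulation) scales and shifts each feature channel with parameters that depend on a conditioning input. The sketch below shows only that generic operation with a lookup table keyed by a watermark identifier. The table `FILM_PARAMS`, its values, and the function names are made up for illustration. In a real system the scale and shift would typically come from a learned network, and how MultiNeRF wires this into its watermark grid is not specified in the abstract.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature channel."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


# Hypothetical per-identifier modulation parameters (made up for illustration).
FILM_PARAMS = {
    "key_a": {"gamma": [1.0, 0.5, 2.0], "beta": [0.0, 0.1, -0.2]},
    "key_b": {"gamma": [0.0, 1.5, 1.0], "beta": [0.3, 0.0, 0.0]},
}


def modulate_watermark_features(features, identifier):
    """Apply the modulation associated with one watermark identifier."""
    params = FILM_PARAMS[identifier]
    return film(features, params["gamma"], params["beta"])
```

The same feature vector yields different outputs for different identifiers, which is the mechanism that lets one model carry several independently keyed signals.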

A Framework for Situating Innovations, Opportunities, and Challenges in Advancing Vertical Systems with Large AI Models

Gaurav Verma,Jiawei Zhou,Mohit Chandra,Srijan Kumar,Munmun De Choudhury

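Task: Propose a framework that addresses the limitations of large AI models in real-world applications through a layer-wise abstraction of innovations.

Motivation: Large AI models excel on standardized benchmarks, but in high-stakes domains such as healthcare, education, and law they reveal brittleness and a lack of contextual understanding, calling for cross-disciplinary innovations that fit real-world needs.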

Details

Method: Introduces a framework built on a layer-wise abstraction of innovations and demonstrates through multiple case studies how it can be operationalized. Result: The framework modularizes the pipeline for turning large models into useful "vertical systems", highlights the dynamism across its layers, and guides researchers in situating their innovations, uncovering opportunities, and facilitating cross-disciplinary communication. Conclusion: The framework gives researchers and practitioners a systematic way to address the challenges of applying large AI models in practice and to foster cross-disciplinary collaboration. Abstract: Large artificial intelligence (AI) models have garnered significant attention for their remarkable, often "superhuman", performance on standardized benchmarks. However, when these models are deployed in high-stakes verticals such as healthcare, education, and law, they often reveal notable limitations. For instance, they exhibit brittleness to minor variations in input data, present contextually uninformed decisions in critical settings, and undermine user trust by confidently producing or reproducing inaccuracies. These challenges in applying large models necessitate cross-disciplinary innovations to align the models' capabilities with the needs of real-world applications. We introduce a framework that addresses this gap through a layer-wise abstraction of innovations aimed at meeting users' requirements with large models. Through multiple case studies, we illustrate how researchers and practitioners across various fields can operationalize this framework. Beyond modularizing the pipeline of transforming large models into useful "vertical systems", we also highlight the dynamism that exists within different layers of the framework. Finally, we discuss how our framework can guide researchers and practitioners to (i) optimally situate their innovations (e.g., when vertical-specific insights can empower broadly impactful vertical-agnostic innovations), (ii) uncover overlooked opportunities (e.g., spotting recurring problems across verticals to develop practically useful foundation models instead of chasing benchmarks), and (iii) facilitate cross-disciplinary communication of critical challenges (e.g., enabling a shared vocabulary for AI developers, domain experts, and human-computer interaction scholars).

Data-Driven Object Tracking: Integrating Modular Neural Networks into a Kalman Framework

Christian Alexander Holz,Christian Bader,Markus Enzweiler,Matthias Drüppel

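Task: Propose three neural network models (SPENT, SANT, MANTa) for multi-object tracking (MOT) to meet the complexity and precision demands of Advanced Driver Assistance Systems (ADAS).

Motivation: Meet the growing complexity and precision demands of multi-object tracking, particularly in ADAS.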

Details

Method: Integrates the three neural network models (SPENT, SANT, MANTa) into a traditional Kalman filter framework while keeping the system modular. Result: On the KITTI dataset, SPENT reduces RMSE by 50% compared with a standard KF, and SANT and MANTa achieve up to 95% accuracy in object-to-track assignment. Conclusion: Task-specific neural networks can substantially boost the performance and robustness of traditional tracking systems while preserving modularity and maintainability. Abstract: This paper presents novel Machine Learning (ML) methodologies for Multi-Object Tracking (MOT), specifically designed to meet the increasing complexity and precision demands of Advanced Driver Assistance Systems (ADAS). We introduce three Neural Network (NN) models that address key challenges in MOT: (i) the Single-Prediction Network (SPENT) for trajectory prediction, (ii) the Single-Association Network (SANT) for mapping an individual Sensor Object (SO) to existing tracks, and (iii) the Multi-Association Network (MANTa) for associating multiple SOs to multiple tracks. These models are seamlessly integrated into a traditional Kalman Filter (KF) framework, maintaining the system's modularity by replacing relevant components without disrupting the overall architecture. Importantly, all three networks are designed to be run in a real-time, embedded environment. Each network contains fewer than 50k trainable parameters. Our evaluation, conducted on the public KITTI tracking dataset, demonstrates significant improvements in tracking performance. SPENT reduces the Root Mean Square Error (RMSE) by 50% compared to a standard KF, while SANT and MANTa achieve up to 95% accuracy in sensor object-to-track assignments. These results underscore the effectiveness of incorporating task-specific NNs into traditional tracking systems, boosting performance and robustness while preserving modularity, maintainability, and interpretability.
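
The modularity argument can be illustrated with a scalar Kalman filter whose prediction step is a pluggable function. This is a generic textbook sketch, not the paper's system. The `predict_fn` slot only indicates where a learned predictor such as SPENT could conceptually replace the motion model, and the noise values `q` and `r` are illustrative.

```python
def kalman_step(x, p, z, predict_fn, q=0.01, r=0.25):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: previous state estimate and its variance.
    z: new measurement.
    predict_fn: motion model mapping the old state to the predicted state;
        this is the slot where a learned predictor could be substituted.
    q, r: process and measurement noise variances (illustrative values).
    """
    # Predict. The variance update p + q assumes the motion model has unit
    # slope (e.g. x -> x + constant), which holds for the examples below.
    x_pred = predict_fn(x)
    p_pred = p + q
    # Update with the measurement.
    k = p_pred / (p_pred + r)          # Kalman gain
    x_new = x_pred + k * (z - x_pred)  # corrected state estimate
    p_new = (1.0 - k) * p_pred         # corrected variance
    return x_new, p_new
```

Swapping `predict_fn` changes the prediction component while the update equations, and therefore the rest of the tracking pipeline, stay untouched.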

Concept Lancet: Image Editing with Compositional Representation Transplant

Jinqi Luo,Tianjiao Ding,Kwan Ho Ryan Chan,Hancheng Min,Chris Callison-Burch,René Vidal

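Task: Propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for image editing with diffusion models.

Motivation: Existing editing methods that design an edit direction in the text embedding or score space struggle to balance edit strength, which harms visual consistency or causes the edit to fail.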

Details

Method: Decomposes the source input in the latent space as a sparse linear combination of visual concept representations to accurately estimate the presence of concepts in the image, then performs a customized concept transplant according to the editing task (replace/add/remove). Result: Experiments show that methods equipped with CoLan achieve state-of-the-art editing effectiveness and consistency preservation. Conclusion: The CoLan framework offers an efficient and accurate solution for diffusion-based image editing. Abstract: Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.

Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment

Fatemeh Behrad,Tinne Tuytelaars,Johan Wagemans

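Task: Propose Charm, a new tokenization approach that addresses the computational complexity and information loss Vision Transformers (ViTs) face with variable-sized inputs.

Motivation: ViTs are typically trained on small fixed-size images, which causes information loss and hurts tasks such as image aesthetic assessment.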

Details

Method: Charm preserves composition, high-resolution, aspect-ratio, and multi-scale information, prioritizing high-resolution detail in specific regions while downscaling others to produce a fixed-size input sequence. Result: Experiments show significant performance improvements (up to 8.1%) on various image aesthetic and quality assessment datasets. Conclusion: Charm is an efficient approach, compatible with pre-trained ViTs, that improves ViT performance and generalizability for image aesthetic assessment. Abstract: The capacity of Vision transformers (ViTs) to handle variable-sized inputs is often constrained by computational complexity and batch processing limitations. Consequently, ViTs are typically trained on small, fixed-size images obtained through downscaling or cropping. While reducing computational burden, these methods result in significant information loss, negatively affecting tasks like image aesthetic assessment. We introduce Charm, a novel tokenization approach that preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information simultaneously. Charm prioritizes high-resolution details in specific regions while downscaling others, enabling shorter fixed-size input sequences for ViTs while incorporating essential information. Charm is designed to be compatible with pre-trained ViTs and their learned positional embeddings. By providing multiscale input and introducing variety to input tokens, Charm improves ViT performance and generalizability for image aesthetic assessment. We avoid cropping or changing the aspect ratio to further preserve information. Extensive experiments demonstrate significant performance improvements on various image aesthetic and quality assessment datasets (up to 8.1%) using a lightweight ViT backbone. Code and pre-trained models are available at https://github.com/FBehrad/Charm.

SelfMedHPM: Self Pre-training With Hard Patches Mining Masked Autoencoders For Medical Image Segmentation

Yunhao Lv,Lingyu Chen,Jian Wang,Yangxi Li,Fang Chen

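Task: Propose selfMedHPM, a CT multi-organ segmentation method based on an MIM self-training framework that improves segmentation by mining hard patches.

Motivation: Existing MIM-based CT multi-organ segmentation methods do not effectively identify the regions that are hardest to reconstruct, which limits performance.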

Details

Method: Performs ViT self-pretraining and introduces an auxiliary loss predictor that dynamically determines mask locations, forming the selfMedHPM framework. Result: selfMedHPM outperforms various competitive methods on abdominal and whole-body CT multi-organ segmentation. Conclusion: By mining hard patches, selfMedHPM significantly improves CT multi-organ segmentation performance. Abstract: In recent years, deep learning methods such as convolutional neural networks (CNNs) and transformers have made significant progress in CT multi-organ segmentation. However, CT multi-organ segmentation methods based on masked image modeling (MIM) are very limited. Although there are already methods that use MAE for the CT multi-organ segmentation task, we believe that the existing methods do not identify the most difficult areas to reconstruct. To this end, we propose a MIM self-training framework with hard patches mining masked autoencoders for CT multi-organ segmentation tasks (selfMedHPM). The method performs ViT self-pretraining on the training set of the target data and introduces an auxiliary loss predictor, which first predicts the patch loss and determines the location of the next mask. selfMedHPM outperforms various competitive methods in abdominal CT multi-organ segmentation and body CT multi-organ segmentation. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for abdomen multi-organ segmentation and the SinoMed Whole Body (SMWB) dataset for body multi-organ segmentation tasks.

Delineate Anything: Resolution-Agnostic Field Boundary Delineation on Satellite Imagery

Mykola Lavreniuk,Nataliia Kussul,Andrii Shelestov,Bohdan Yailymov,Yevhenii Salii,Volodymyr Kuzin,Zoltan Szantoi

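Task: Accurately delineate agricultural field boundaries in satellite imagery using instance segmentation.

Motivation: Existing methods are challenged by small datasets, resolution discrepancies, and diverse environmental conditions.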

Details

Method: Introduces the FBIS-22M dataset and the Delineate Anything model, reformulating the task as instance segmentation. Result: The model improves mAP@0.5 by 88.5% and mAP@0.5:0.95 by 103% over existing methods, with faster inference and strong zero-shot generalization. Conclusion: The FBIS-22M dataset and the Delineate Anything model significantly improve the accuracy and efficiency of field boundary delineation. Abstract: The accurate delineation of agricultural field boundaries from satellite imagery is vital for land management and crop monitoring. However, current methods face challenges due to limited dataset sizes, resolution discrepancies, and diverse environmental conditions. We address this by reformulating the task as instance segmentation and introducing the Field Boundary Instance Segmentation - 22M dataset (FBIS-22M), a large-scale, multi-resolution dataset comprising 672,909 high-resolution satellite image patches (ranging from 0.25 m to 10 m) and 22,926,427 instance masks of individual fields, significantly narrowing the gap between agricultural datasets and those in other computer vision domains. We further propose Delineate Anything, an instance segmentation model trained on our new FBIS-22M dataset. Our proposed model sets a new state-of-the-art, achieving a substantial improvement of 88.5% in mAP@0.5 and 103% in mAP@0.5:0.95 over existing methods, while also demonstrating significantly faster inference and strong zero-shot generalization across diverse image resolutions and unseen geographic regions. Code, pre-trained models, and the FBIS-22M dataset are available at https://lavreniuk.github.io/Delineate-Anything.
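
For readers unfamiliar with the two reported metrics, both depend on the intersection-over-union (IoU) between a predicted and a ground-truth mask. mAP@0.5 counts a prediction as a match when IoU is at least 0.5, while mAP@0.5:0.95 averages over ten thresholds from 0.5 to 0.95 in steps of 0.05 (the COCO convention). The sketch below shows only the IoU and threshold logic on flattened binary masks, not a full mAP computation, and the function names are hypothetical.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two flattened binary masks of equal length."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used by the averaged metric.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(iou):
    """Thresholds at which a prediction with this IoU counts as a match."""
    return [t for t in IOU_THRESHOLDS if iou >= t]
```

A mask with IoU 0.5 matches only at the loosest threshold, which is why gains on mAP@0.5:0.95 indicate tighter boundaries rather than just more detections.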

A Sensorimotor Vision Transformer

Konrad Gadzicki,Kerstin Schill,Christoph Zetzsche

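Task: Propose the Sensorimotor Transformer (SMT), a vision model inspired by human eye movements that prioritizes high-saliency regions to improve computational efficiency and reduce memory consumption.

Motivation: Traditional models process all image patches uniformly, whereas SMT follows principles of biological vision and selects high-information patches to optimize resource use.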

Details

Method: SMT selects salient patches based on intrinsic two-dimensional features (such as corners and occlusions) and processes only those patches to reduce memory use and computational complexity. Result: On ImageNet-1k, SMT achieves competitive top-1 accuracy while significantly reducing memory consumption and computational complexity. Conclusion: SMT offers an efficient vision model for resource-constrained applications and demonstrates the potential of biologically inspired architectures. Abstract: This paper presents the Sensorimotor Transformer (SMT), a vision model inspired by human saccadic eye movements that prioritize high-saliency regions in visual input to enhance computational efficiency and reduce memory consumption. Unlike traditional models that process all image patches uniformly, SMT identifies and selects the most salient patches based on intrinsic two-dimensional (i2D) features, such as corners and occlusions, which are known to convey high-information content and align with human fixation patterns. The SMT architecture uses this biological principle to leverage vision transformers to process only the most informative patches, allowing for a substantial reduction in memory usage that scales with the sequence length of selected patches. This approach aligns with visual neuroscience findings, suggesting that the human visual system optimizes information gathering through selective, spatially dynamic focus. Experimental evaluations on ImageNet-1k demonstrate that SMT achieves competitive top-1 accuracy while significantly reducing memory consumption and computational complexity, particularly when a limited number of patches is used. This work introduces a saccade-like selection mechanism into transformer-based vision models, offering an efficient alternative for image analysis and providing new insights into biologically motivated architectures for resource-constrained applications.
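
The patch-selection step can be sketched as a top-k choice over per-patch saliency scores. In this illustration the saliency values are placeholder inputs; the paper derives them from i2D features, which are not computed here. The function names are hypothetical, and the cost ratio reflects only the quadratic growth of standard self-attention with sequence length.

```python
def select_salient_patches(saliency, k):
    """Return the indices of the k highest-saliency patches, in original order."""
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k])


def attention_cost_ratio(num_selected, num_total):
    """Ratio of pairwise self-attention interactions, which grow as n**2."""
    return (num_selected ** 2) / (num_total ** 2)
```

For example, keeping 49 of 196 patches (a quarter of the sequence) leaves one sixteenth of the pairwise attention interactions, which is where the memory and compute savings come from.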

Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation

Fa-Ting Hong,Zunnan Xu,Zixiang Zhou,Jun Zhou,Xiu Li,Qin Lin,Qinglin Lu,Dan Xu

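Task: Propose ACTalker, an end-to-end video diffusion framework that supports both multi-signal and single-signal control for talking head video generation.

Motivation: Existing methods typically accept control from only a single modality, which limits their practical use.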

Details

Method: Designs a parallel Mamba structure in which each branch uses a separate driving signal to control a specific facial region, combined with a gate mechanism and a mask-drop strategy. Result: Experiments show that the method generates natural-looking facial videos and that the Mamba layer seamlessly integrates multiple driving modalities. Conclusion: ACTalker performs well under multi-signal control and resolves control conflicts. Abstract: Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods are typically limited to accepting control from a single primary modality, restricting their practical utility. To this end, we introduce ACTalker, an end-to-end video diffusion framework that supports both multi-signal control and single-signal control for talking head video generation. For multiple control, we design a parallel mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure natural coordination of the controlled video both temporally and spatially, we employ the mamba structure, which enables driving signals to manipulate feature tokens across both dimensions in each branch. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the mamba layer seamlessly integrates multiple driving modalities without conflict.

MAD: Makeup All-in-One with Cross-Domain Diffusion Model

Bo-Kai Ruan,Hong-Han Shuai

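Task: Handle multiple makeup tasks, including beauty filter, makeup transfer, and makeup removal, with a single model.

Motivation: Existing methods need multiple models for different tasks and lack text-guided makeup try-on, which adds complexity and inconvenience for users.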

Details

Method: Treats the different makeup tasks as cross-domain translations and uses a cross-domain diffusion model with domain embeddings so that one model handles all tasks. Result: The proposed method accomplishes multiple makeup tasks with a single model and introduces the MT-Text dataset to support precise text-to-makeup applications. Conclusion: The method reduces model complexity, improves user-friendliness, and advances the practicality of makeup technologies. Abstract: Existing makeup techniques often require designing multiple models to handle different inputs and align features across domains for different makeup tasks, e.g., beauty filter, makeup transfer, and makeup removal, leading to increased complexity. Another limitation is the absence of text-guided makeup try-on, which is more user-friendly without needing reference images. In this study, we make the first attempt to use a single model for various makeup tasks. Specifically, we formulate different makeup tasks as cross-domain translations and leverage a cross-domain diffusion model to accomplish all tasks. Unlike existing methods that rely on separate encoder-decoder configurations or cycle-based mechanisms, we propose using different domain embeddings to facilitate domain control. This allows for seamless domain switching by merely changing embeddings with a single model, thereby reducing the reliance on additional modules for different tasks. Moreover, to support precise text-to-makeup applications, we introduce the MT-Text dataset by extending the MT dataset with textual annotations, advancing the practicality of makeup technologies.

Noise Calibration and Spatial-Frequency Interactive Network for STEM Image Enhancement

Hesong Li,Ziqi Wu,Ruiwen Shao,Tao Zhang,Ying Fu

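Task: Develop noise calibration, data synthesis, and enhancement methods to improve STEM image quality.

Motivation: Existing STEM image enhancement methods overlook frequency-domain features, and existing datasets lack realism and generality.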

Details

Method: Proposes a noise calibration method to synthesize more realistic STEM images, builds a more general dataset, and designs a spatial-frequency interactive network for image enhancement. Result: Experiments show that the synthesized data is closer to real STEM images and that the network achieves better enhancement performance. Conclusion: The proposed methods address the frequency-domain and dataset limitations in STEM image enhancement and improve image quality. Abstract: Scanning Transmission Electron Microscopy (STEM) enables the observation of atomic arrangements at sub-angstrom resolution, allowing for atomically resolved analysis of the physical and chemical properties of materials. However, due to the effects of noise, electron beam damage, sample thickness, etc., obtaining satisfactory atomic-level images is often challenging. Enhancing STEM images can reveal clearer structural details of materials. Nonetheless, existing STEM image enhancement methods usually overlook unique features in the frequency domain, and existing datasets lack realism and generality. To resolve these issues, in this paper, we develop noise calibration, data synthesis, and enhancement methods for STEM images. We first present a STEM noise calibration method, which is used to synthesize more realistic STEM images. The parameters of background noise, scan noise, and pointwise noise are obtained by statistical analysis and fitting of real STEM images containing atoms. Then we use these parameters to develop a more general dataset that considers both regular and random atomic arrangements and includes both HAADF and BF mode images. Finally, we design a spatial-frequency interactive network for STEM image enhancement, which can explore the information in the frequency domain formed by the periodicity of atomic arrangement. Experimental results show that our data is closer to real STEM images and achieves better enhancement performances together with our network. Code will be available at https://github.com/HeasonLee/SFIN.

Rip Current Segmentation: A Novel Benchmark and YOLOv8 Baseline Results

Andrei Dumitriu,Florin Tatui,Florin Miron,Radu Tudor Ionescu,Radu Timofte

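Task: Introduce and address a novel task: rip current instance segmentation.

Motivation: Rip currents are the leading cause of fatal accidents and injuries on many beaches worldwide, so automatically detecting these hazardous surface water currents is important.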

Details

Method: Introduces a dataset of 2,466 images and trains several versions of YOLOv8 for instance segmentation, evaluating them on the test dataset (videos). Result: The YOLOv8-nano model performs best, with an mAP50 of 88.94% on the validation dataset and a macro average of 81.21% on the test dataset. Conclusion: The study provides a baseline for future research on rip current segmentation and contributes a detailed annotated dataset and a deep learning model; the code and dataset are publicly available. Abstract: Rip currents are the leading cause of fatal accidents and injuries on many beaches worldwide, emphasizing the importance of automatically detecting these hazardous surface water currents. In this paper, we address a novel task: rip current instance segmentation. We introduce a comprehensive dataset containing 2,466 images with newly created polygonal annotations for instance segmentation, used for training and validation. Additionally, we present a novel dataset comprising 17 drone videos (comprising about 24K frames) captured at 30 FPS, annotated with both polygons for instance segmentation and bounding boxes for object detection, employed for testing purposes. We train various versions of YOLOv8 for instance segmentation on static images and assess their performance on the test dataset (videos). The best results were achieved by the YOLOv8-nano model (runnable on a portable device), with an mAP50 of 88.94% on the validation dataset and 81.21% macro average on the test dataset. The results provide a baseline for future research in rip current segmentation. Our work contributes to the existing literature by introducing a detailed, annotated dataset, and training a deep learning model for instance segmentation of rip currents. The code, training details and the annotated dataset are made publicly available at https://github.com/Irikos/rip_currents.

L-LBVC: Long-Term Motion Estimation and Prediction for Learned Bi-Directional Video Compression

Yongqi Zhai,Luyang Tang,Wei Jiang,Jiayu Yang,Ronggang Wang

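Task: Propose L-LBVC, a novel LBVC framework that addresses inaccurate long-term motion estimation and prediction.

Motivation: LBVC performs poorly at long-term motion estimation and prediction, especially in large-motion scenes.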

Details

Method: Proposes an adaptive motion estimation module that handles both short-term and long-term motion, and an adaptive motion prediction module that reduces the bit cost of motion coding. Result: Under random access configuration, L-LBVC significantly outperforms existing LVC methods and even surpasses VVC (VTM) on some test datasets. Conclusion: The L-LBVC framework effectively addresses the long-term motion problem of LBVC and significantly improves performance. Abstract: Recently, learned video compression (LVC) has shown superior performance under low-delay configuration. However, the performance of learned bi-directional video compression (LBVC) still lags behind traditional bi-directional coding. The performance gap mainly arises from inaccurate long-term motion estimation and prediction of distant frames, especially in large motion scenes. To solve these two critical problems, this paper proposes a novel LBVC framework, namely L-LBVC. Firstly, we propose an adaptive motion estimation module that can handle both short-term and long-term motions. Specifically, we directly estimate the optical flows for adjacent frames and non-adjacent frames with small motions. For non-adjacent frames with large motions, we recursively accumulate local flows between adjacent frames to estimate long-term flows. Secondly, we propose an adaptive motion prediction module that can largely reduce the bit cost for motion coding. To improve the accuracy of long-term motion prediction, we adaptively downsample reference frames during testing to match the motion ranges observed during training. Experiments show that our L-LBVC significantly outperforms previous state-of-the-art LVC methods and even surpasses VVC (VTM) on some test datasets under random access configuration.
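
Recursively accumulating local flows between adjacent frames can be sketched with standard flow composition: the flow from frame a to frame c is the flow from a to b plus the flow from b to c sampled at the position each pixel reaches in frame b. The abstract does not state how L-LBVC samples or warps flows, so the nearest-neighbour lookup with border clamping below is an assumption, and both function names are hypothetical.

```python
import numpy as np

def compose_flows(flow_ab, flow_bc):
    """Compose two dense flows of shape (H, W, 2), stored as (dx, dy) per pixel.
    Nearest-neighbour sampling with border clamping is an illustrative choice."""
    h, w, _ = flow_ab.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Position in frame b that each pixel of frame a is displaced to.
    xb = np.clip(np.rint(xs + flow_ab[..., 0]).astype(int), 0, w - 1)
    yb = np.clip(np.rint(ys + flow_ab[..., 1]).astype(int), 0, h - 1)
    # Add the b->c motion found at that position.
    return flow_ab + flow_bc[yb, xb]

def accumulate_flows(local_flows):
    """Chain adjacent-frame flows [f(0->1), f(1->2), ...] into one long-term flow."""
    total = local_flows[0]
    for flow in local_flows[1:]:
        total = compose_flows(total, flow)
    return total

# Three adjacent-frame flows, each shifting every pixel one column to the right.
step = np.zeros((4, 6, 2))
step[..., 0] = 1.0
long_term = accumulate_flows([step, step, step])
```

In this toy case every pixel ends up with a horizontal displacement of 3, which is the expected long-term motion over three frames.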

Leveraging Sparse Annotations for Leukemia Diagnosis on the Large Leukemia Dataset

Abdul Rehman,Talha Meraj,Aiman Mahmood Minhas,Ayisha Imran,Mohsen Ali,Waqas Sultani,Mubarak Shah

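Task: Present the Large Leukemia Dataset (LLD), a large-scale leukemia dataset, together with new methods for detecting white blood cells and their morphological attributes.

Motivation: Leukemia analysis lacks a large, diverse multi-task dataset, which limits the real-world applicability of deep learning methods.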

Details

Method: Collects data from peripheral blood films, proposes a multi-task model that detects white blood cells and predicts their attributes, and proposes a sparse-annotation method that reduces the annotation burden. Result: Presents a large-scale dataset annotated with multiple morphological attributes, along with an efficient multi-task model and a sparse-annotation method. Conclusion: The dataset and methods can improve the explainability and accuracy of leukemia diagnosis and apply to many challenges in microscopic image analysis. Abstract: Leukemia is the 10th most frequently diagnosed cancer and one of the leading causes of cancer-related deaths worldwide. Realistic analysis of leukemia requires White Blood Cell (WBC) localization, classification, and morphological assessment. Despite deep learning advances in medical imaging, leukemia analysis lacks a large, diverse multi-task dataset, while existing small datasets lack domain diversity, limiting real-world applicability. To overcome dataset challenges, we present a large-scale WBC dataset named Large Leukemia Dataset (LLD) and novel methods for detecting WBCs with their attributes. Our contribution here is threefold. First, we present a large-scale leukemia dataset collected through Peripheral Blood Films (PBF) from several patients, through multiple microscopes, multiple cameras, and multiple magnifications. To enhance diagnosis explainability and medical expert acceptance, each leukemia cell is annotated at 100x with 7 morphological attributes, ranging from Cell Size to Nuclear Shape. Secondly, we propose a multi-task model that not only detects WBCs but also predicts their attributes, providing an interpretable and clinically meaningful solution. Third, we propose a method for WBC detection with attribute analysis using sparse annotations. This approach reduces the annotation burden on hematologists, requiring them to mark only a small area within the field of view. Our method enables the model to leverage the entire field of view rather than just the annotated regions, enhancing learning efficiency and diagnostic accuracy. From diagnosis explainability to overcoming domain shift challenges, the presented datasets could be used for many challenging aspects of microscopic image analysis. The datasets, code, and demo are available at: https://im.itu.edu.pk/sparse-leukemiaattri/

Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation

Jiwoo Chung,Sangeek Hyun,Hyunjun Kim,Eunseo Koh,MinKyu Lee,Jae-Pil Heo

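Task: Propose a subject-driven generation method based on visual autoregressive (VAR) models, addressing the computational overhead of diffusion models as well as the language drift and reduced diversity that arise when fine-tuning.

Motivation: Diffusion models produce high-quality images but their computational overhead limits practical use; VAR models offer fast inference, but naive fine-tuning leads to computational overhead, language drift, and reduced diversity.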

Details

Method: Introduces selective layer tuning and prior distillation to reduce complexity and language drift, and proposes scale-wise weighted tuning to prioritize coarse-resolution information. Result: Experiments show that the method significantly outperforms diffusion-based baselines on multiple metrics and demonstrates practical applicability. Conclusion: The proposed VAR-based approach is efficient and practical for subject-driven generation and addresses the limitations of diffusion models. Abstract: Recent advances in text-to-image generative models have enabled numerous practical applications, including subject-driven generation, which fine-tunes pretrained models to capture subject semantics from only a few examples. While diffusion-based models produce high-quality images, their extensive denoising steps result in significant computational overhead, limiting real-world applicability. Visual autoregressive (VAR) models, which predict next-scale tokens rather than spatially adjacent ones, offer significantly faster inference suitable for practical deployment. In this paper, we propose the first VAR-based approach for subject-driven generation. However, naïve fine-tuning of VAR leads to computational overhead, language drift, and reduced diversity. To address these challenges, we introduce selective layer tuning to reduce complexity and prior distillation to mitigate language drift. Additionally, we found that the early stages have a greater influence on the generation of the subject than the latter stages, which merely synthesize local details. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions for promoting the model to focus on the subject-relevant information instead of local details. Extensive experiments validate that our method significantly outperforms diffusion-based baselines across various metrics and demonstrates its practical usage.
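
Scale-wise weighted tuning can be pictured as a weighted sum of per-scale training losses in which coarser scales receive larger weights. The abstract does not give the weighting schedule, so the geometric decay used here, the `decay` parameter, and the function name are all assumptions for illustration only.

```python
import numpy as np

def scale_weighted_loss(per_scale_losses, decay=0.5):
    """Combine per-scale losses (ordered coarsest first) into one scalar.
    The geometric weighting is an assumed schedule, not the paper's."""
    losses = np.asarray(per_scale_losses, dtype=float)
    weights = decay ** np.arange(len(losses))  # 1, decay, decay^2, ...
    weights = weights / weights.sum()          # normalize so the weights sum to 1
    return float((weights * losses).sum()), weights

# Four scales with equal raw loss: the coarsest scale dominates the total.
total, weights = scale_weighted_loss([1.0, 1.0, 1.0, 1.0], decay=0.5)
```

Setting `decay=1.0` recovers a plain average over scales, which corresponds to fine-tuning without any scale prioritization.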

PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation

Lihua Liu,Jiehong Lin,Zhenxin Liu,Kui Jia

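Task: Estimate the 6D pose of objects unseen during training from RGB images through a three-stage pixel-to-pixel correspondence learning process.

Motivation: Address pose estimation for unseen objects under zero-shot generalization and improve its accuracy.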

Details

Method: Proposes the PicoPose framework with three stages: feature matching, global regression of a 2D affine transformation, and learning of local correspondence offsets. Result: Achieves state-of-the-art performance on the seven core datasets of the BOP benchmark and generalizes well to novel objects represented by CAD models or reference images. Conclusion: By progressively refining correspondences, PicoPose significantly improves pose estimation accuracy and suits pose estimation for unseen objects. Abstract: Novel object pose estimation from RGB images presents a significant challenge for zero-shot generalization, as it involves estimating the relative 6D transformation between an RGB observation and a CAD model of an object that was not seen during training. In this paper, we introduce PicoPose, a novel framework designed to tackle this task using a three-stage pixel-to-pixel correspondence learning process. Firstly, PicoPose matches features from the RGB observation with those from rendered object templates, identifying the best-matched template and establishing coarse correspondences. Secondly, PicoPose smooths the correspondences by globally regressing a 2D affine transformation, including in-plane rotation, scale, and 2D translation, from the coarse correspondence map. Thirdly, PicoPose applies the affine transformation to the feature map of the best-matched template and learns correspondence offsets within local regions to achieve fine-grained correspondences. By progressively refining the correspondences, PicoPose significantly improves the accuracy of object poses computed via PnP/RANSAC. PicoPose achieves state-of-the-art performance on the seven core datasets of the BOP benchmark, demonstrating exceptional generalization to novel objects represented by CAD models or object reference images. Code and models are available at https://github.com/foollh/PicoPose.
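
The second stage regresses a 2D affine transformation made up of in-plane rotation, scale, and 2D translation. A minimal sketch of how those three quantities form a single homogeneous matrix and act on template coordinates follows; the function names and the rotate-then-scale-then-translate convention are assumptions, since the abstract does not fix a parameterization.

```python
import numpy as np

def affine_2d(theta, scale, tx, ty):
    """Build a 3x3 homogeneous matrix: rotate by theta (radians), scale uniformly,
    then translate by (tx, ty). The convention is assumed for illustration."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty],
                     [0.0,        0.0,       1.0]])

def apply_affine(matrix, points):
    """Apply the transform to an (N, 2) array of 2D points."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (homogeneous @ matrix.T)[:, :2]

# Rotate by 90 degrees, double the size, shift one unit along x.
A = affine_2d(np.pi / 2, 2.0, 1.0, 0.0)
mapped = apply_affine(A, np.array([[1.0, 0.0], [0.0, 1.0]]))
```

Here the point (1, 0) maps to (1, 2) and (0, 1) maps to (-1, 0), which shows how a single four-parameter transform moves every coarse correspondence coherently before local offsets are learned.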

Learning Phase Distortion with Selective State Space Models for Video Turbulence Mitigation

Xingguang Zhang,Nicholas Chimitt,Xijun Wang,Yu Yuan,Stanley H. Chan

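Task: Propose an atmospheric turbulence mitigation method based on a selective state space model (MambaTM) and Learned Latent Phase Distortion (LPD).

Motivation: Existing deep learning methods for turbulence mitigation are slow, memory-hungry, and generalize poorly, and both spatial-domain and temporal-domain approaches have their own limitations.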

Details

Method: Combines a selective state space model (MambaTM) with Learned Latent Phase Distortion (LPD) to obtain a global receptive field with linear computational complexity. Result: Exceeds current state-of-the-art methods on synthetic and real-world turbulence benchmarks with significantly faster inference. Conclusion: The combination of MambaTM and LPD effectively addresses the computational complexity and performance problems in turbulence mitigation. Abstract: Atmospheric turbulence is a major source of image degradation in long-range imaging systems. Although numerous deep learning-based turbulence mitigation (TM) methods have been proposed, many are slow, memory-hungry, and do not generalize well. In the spatial domain, methods based on convolutional operators have a limited receptive field, so they cannot handle a large spatial dependency required by turbulence. In the temporal domain, methods relying on self-attention can, in theory, leverage the lucky effects of turbulence, but their quadratic complexity makes it difficult to scale to many frames. Traditional recurrent aggregation methods face parallelization challenges. In this paper, we present a new TM method based on two concepts: (1) A turbulence mitigation network based on the Selective State Space Model (MambaTM). MambaTM provides a global receptive field in each layer across spatial and temporal dimensions while maintaining linear computational complexity. (2) Learned Latent Phase Distortion (LPD). LPD guides the state space model. Unlike classical Zernike-based representations of phase distortion, the new LPD map uniquely captures the actual effects of turbulence, significantly improving the model's capability to estimate degradation by reducing the ill-posedness. Our proposed method exceeds current state-of-the-art networks on various synthetic and real-world TM benchmarks with significantly faster inference speed. The code is available at http://github.com/xg416/MambaTM.

HQViT: Hybrid Quantum Vision Transformer for Image Classification

Hui Zhang,Qinglin Zhao,Mengchu Zhou,Li Feng

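Task: Propose a Hybrid Quantum Vision Transformer (HQViT) that uses quantum computing to accelerate model training and improve performance.

Motivation: The self-attention mechanism of classical Transformers has high computational complexity, which makes processing high-dimensional input data particularly expensive.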

Details

Method: HQViT uses whole-image processing with amplitude encoding and applies quantum computation only to the most critical steps, lowering the quantum resource requirement. Result: HQViT performs well across multiple computer vision datasets, with a maximum improvement of 10.9% (on the MNIST task). Conclusion: Combining quantum and classical computing has great potential for complex image classification tasks. Abstract: Transformer-based architectures have revolutionized the landscape of deep learning. In the computer vision domain, Vision Transformer demonstrates remarkable performance on par with or even surpassing that of convolutional neural networks. However, the quadratic computational complexity of its self-attention mechanism poses challenges for classical computing, making model training with high-dimensional input data, e.g., images, particularly expensive. To address such limitations, we propose a Hybrid Quantum Vision Transformer (HQViT) that leverages the principles of quantum computing to accelerate model training while enhancing model performance. HQViT introduces whole-image processing with amplitude encoding to better preserve global image information without additional positional encoding. By leveraging quantum computation on the most critical steps and selectively handling other components in a classical way, we lower the cost of quantum resources for HQViT. The qubit requirement is minimized to $O(\log_2 N)$ and the number of parameterized quantum gates is only $O(\log_2 d)$, making it well-suited for Noisy Intermediate-Scale Quantum devices. By offloading the computationally intensive attention coefficient matrix calculation to the quantum framework, HQViT reduces the classical computational load by $O(T^2 d)$. Extensive experiments across various computer vision datasets demonstrate that HQViT outperforms existing models, achieving a maximum improvement of up to 10.9% (on the MNIST 10-classification task) over the state of the art. This work highlights the great potential to combine quantum and classical computing to cope with complex image classification tasks.
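
The logarithmic qubit count follows from a general property of amplitude encoding: a real vector of length N is stored in the amplitudes of a state on ceil(log2 N) qubits after zero-padding and normalization. The sketch below simulates this classically with NumPy; it is not HQViT's circuit, the abstract does not define N and d precisely, and `amplitude_encode` is a hypothetical helper.

```python
import numpy as np

def amplitude_encode(values):
    """Return (state, n_qubits): the input flattened, zero-padded to the next
    power of two, and scaled to unit L2 norm, as amplitude encoding requires."""
    x = np.asarray(values, dtype=float).ravel()
    n_qubits = max(1, int(np.ceil(np.log2(len(x)))))
    padded = np.zeros(2 ** n_qubits)
    padded[:len(x)] = x
    norm = np.linalg.norm(padded)
    if norm == 0:
        raise ValueError("cannot amplitude-encode an all-zero vector")
    return padded / norm, n_qubits

# A 28x28 image (the MNIST resolution) has 784 pixels and fits in 10 qubits.
image = np.arange(1, 785, dtype=float).reshape(28, 28)
state, n_qubits = amplitude_encode(image)
```

Because the whole image becomes one state vector, no per-patch positional encoding is needed in this representation, which matches the motivation stated in the abstract.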

MD-ProjTex: Texturing 3D Shapes with Multi-Diffusion Projection

Ahmet Burak Yildirim,Mustafa Utku Aydogdu,Duygu Ceylan,Aysegul Dundar

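Task: Propose MD-ProjTex, a method for fast and consistent texture generation for 3D shapes using pretrained text-to-image diffusion models.

Motivation: Existing methods rely on optimization or sequential view synthesis, which leads to low computational efficiency and poor consistency.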

Details

Method: Uses a multi-view consistency mechanism that fuses noise predictions in UV space and jointly updates the per-view denoising directions to maintain 3D consistency. Result: MD-ProjTex is computationally more efficient than existing methods and achieves better quantitative and qualitative results. Conclusion: MD-ProjTex is an efficient and effective method for text-guided 3D texture generation. Abstract: We introduce MD-ProjTex, a method for fast and consistent text-guided texture generation for 3D shapes using pretrained text-to-image diffusion models. At the core of our approach is a multi-view consistency mechanism in UV space, which ensures coherent textures across different viewpoints. Specifically, MD-ProjTex fuses noise predictions from multiple views at each diffusion step and jointly updates the per-view denoising directions to maintain 3D consistency. In contrast to existing state-of-the-art methods that rely on optimization or sequential view synthesis, MD-ProjTex is computationally more efficient and achieves better quantitative and qualitative results.
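
Fusing per-view noise predictions in UV space can be sketched as a scatter-and-gather step: each view's prediction is accumulated into the texels it sees, overlapping contributions are combined, and the fused values are read back into every view. The abstract does not state the fusion rule, so the unweighted average below is an assumption (a real method would likely also account for visibility weights and for the variance reduction that averaging noise causes), and `fuse_in_uv` is a hypothetical name.

```python
import numpy as np

def fuse_in_uv(view_noise, view_texel_ids, n_texels):
    """view_noise: list of (P_i, C) arrays with per-pixel noise predictions.
    view_texel_ids: list of (P_i,) int arrays mapping each pixel to a UV texel.
    Returns the per-view predictions after averaging in UV space, plus the UV map."""
    channels = view_noise[0].shape[1]
    acc = np.zeros((n_texels, channels))
    count = np.zeros(n_texels)
    for noise, ids in zip(view_noise, view_texel_ids):
        np.add.at(acc, ids, noise)   # scatter-add handles repeated texel ids
        np.add.at(count, ids, 1.0)
    fused_uv = acc / np.maximum(count, 1.0)[:, None]  # unseen texels stay zero
    return [fused_uv[ids] for ids in view_texel_ids], fused_uv

# Two views that both see texel 0 but disagree on its noise value.
view_a = np.array([[1.0], [5.0]])
view_b = np.array([[3.0]])
fused_views, fused_uv = fuse_in_uv([view_a, view_b],
                                   [np.array([0, 1]), np.array([0])],
                                   n_texels=3)
```

After fusion both views carry the same value for the shared texel, so their denoising directions agree wherever they overlap on the surface.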

CanonNet: Canonical Ordering and Curvature Learning for Point Cloud Analysis

Benjy Friedmann,Michael Werman

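Task: Propose CanonNet, a lightweight neural network that addresses point ordering and geometric feature learning in point cloud processing.

Motivation: Current architectures rely on complex operations that limit expressivity and struggle to capture fine-grained geometric features.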

Details

Method: CanonNet has two parts: a preprocessing pipeline that creates a canonical point ordering and orientation, and a geometric learning framework that learns from synthetic surfaces with precise curvature values. Result: Achieves state-of-the-art performance in curvature estimation and competitive results in geometric descriptor tasks with significantly fewer parameters (100X). Conclusion: Mathematical preprocessing can effectively complement neural architectures, and CanonNet suits real-world applications with limited computational resources. Abstract: Point cloud processing poses two fundamental challenges: establishing consistent point ordering and effectively learning fine-grained geometric features. Current architectures rely on complex operations that limit expressivity while struggling to capture detailed surface geometry. We present CanonNet, a lightweight neural network composed of two complementary components: (1) a preprocessing pipeline that creates a canonical point ordering and orientation, and (2) a geometric learning framework where networks learn from synthetic surfaces with precise curvature values. This modular approach eliminates the need for complex transformation-invariant architectures while effectively capturing local geometric properties. Our experiments demonstrate state-of-the-art performance in curvature estimation and competitive results in geometric descriptor tasks with significantly fewer parameters (100X) than comparable methods. CanonNet's efficiency makes it particularly suitable for real-world applications where computational resources are limited, demonstrating that mathematical preprocessing can effectively complement neural architectures for point cloud analysis. The code for the project is publicly available at https://benjyfri.github.io/CanonNet/.

Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model

Shengjun Zhang,Jinzhao Li,Xin Fei,Hao Liu,Yueqi Duan

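Task: Propose Scene Splatter, a momentum-based video diffusion method for generating generic scenes from a single image.

Motivation: Existing methods suffer from limited video length and scene inconsistency when generating novel views, which causes artifacts and distortions during reconstruction.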

Details

Method: Constructs noisy samples as momentum to enhance video details and scene consistency, and introduces a pixel-level momentum to recover unknown regions. Result: Experiments show superior performance in high-fidelity and consistent scene generation. Conclusion: Scene Splatter iteratively recovers a 3D scene through cascaded momentum, overcoming the video length limitation. Abstract: In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from original features as momentum to enhance video details and maintain scene consistency. However, for latent features with the perception field that spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in unknown regions. Therefore, we further introduce the aforementioned consistent video as a pixel-level momentum to a directly generated video without momentum for better recovery of unseen regions. Our cascaded momentum enables video diffusion models to generate both high-fidelity and consistent novel views. We further finetune the global Gaussian representations with enhanced frames and render new frames for momentum update in the next step. In this manner, we can iteratively recover a 3D scene, avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.

TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection

Yoon Gyo Jung,Jaewoo Park,Jaeho Yoon,Kuan-Chuan Peng,Wonchul Kim,Andrew Beng Jin Teoh,Octavia Camps

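Task: Solve unsupervised anomaly detection in a long-tailed setting where the normal dataset is contaminated and the class distribution is unknown.

Motivation: Existing models face a performance trade-off between robustness to pixel noise and accuracy on tail-class samples, so tail classes and noisy samples need to be handled independently.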

Details

Method: Proposes TailSampler, a novel class size predictor, and TailedCore, a memory-based anomaly detection model, to handle tail classes and noisy samples separately. Result: TailedCore outperforms existing methods in the unsupervised long-tail noisy anomaly detection setting. Conclusion: The proposed method handles tail classes and noisy samples independently and improves anomaly detection performance. Abstract: We aim to solve unsupervised anomaly detection in a practical challenging environment where the normal dataset is both contaminated with defective regions and its product class distribution is tailed but unknown. We observe that existing models suffer from a tail-versus-noise trade-off where if a model is robust against pixel noise, then its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be utilized to sample the tail class samples exclusively, allowing them to be handled separately. Based on these facets, we build a memory-based anomaly detection model TailedCore, whose memory both well captures tail class information and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.

Multi-Head Adaptive Graph Convolution Network for Sparse Point Cloud-Based Human Activity Recognition

Vincent Gbouna Zakka,Luis J. Manso,Zhuangzhuang Dai

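Task: Propose a graph convolution method based on a Multi-Head Adaptive Kernel (MAK) for processing millimetre-wave radar point clouds for human activity recognition.

Motivation: Address the limitations of image-based methods regarding privacy and low-light conditions, and the inability of fixed kernels in existing graph convolution methods to adapt to the local geometry of point cloud data.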

Details

Method: Introduces a Multi-Head Adaptive Kernel (MAK) module within the graph convolutional framework that generates dynamic kernels capturing different aspects of the local feature space while maintaining global spatial context. Result: Achieves state-of-the-art human activity recognition performance on benchmark datasets. Conclusion: The proposed adaptive approach addresses the challenges of point cloud processing and offers a new direction for privacy-preserving human activity recognition. Abstract: Human activity recognition is increasingly vital for supporting independent living, particularly for the elderly and those in need of assistance. Domestic service robots with monitoring capabilities can enhance safety and provide essential support. Although image-based methods have advanced considerably in the past decade, their adoption remains limited by concerns over privacy and sensitivity to low-light or dark conditions. As an alternative, millimetre-wave (mmWave) radar can produce point cloud data which is privacy-preserving. However, processing the sparse and noisy point clouds remains a long-standing challenge. While graph-based methods and attention mechanisms show promise, they predominantly rely on "fixed" kernels, that is, kernels applied uniformly across all neighbourhoods. This highlights the need for adaptive approaches that can dynamically adjust their kernels to the specific geometry of each local neighbourhood in point cloud data. To overcome this limitation, we introduce an adaptive approach within the graph convolutional framework. Instead of a single shared weight function, our Multi-Head Adaptive Kernel (MAK) module generates multiple dynamic kernels, each capturing different aspects of the local feature space. By progressively refining local features while maintaining global spatial context, our method enables convolution kernels to adapt to varying local features. Experimental results on benchmark datasets confirm the effectiveness of our approach, achieving state-of-the-art performance in human activity recognition. Our source code is made publicly available at: https://github.com/Gbouna/MAK-GCN

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

Zhiyuan Yan,Junyan Ye,Weijia Li,Zilong Huang,Shenghai Yuan,Xiangyang He,Kaiqing Lin,Jun He,Conghui He,Li Yuan

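Task: Evaluate GPT-4o's performance in image generation and editing, and propose a speculation about its architecture.

Motivation: OpenAI's GPT-4o performs impressively in image generation and editing, but a systematic evaluation and an understanding of its architecture are lacking.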

Details

Method: Proposes the GPT-ImgEval benchmark, which quantitatively and qualitatively analyzes GPT-4o along three dimensions (generation quality, editing proficiency, and semantic synthesis), and uses a classification model to infer its architecture. Result: GPT-4o significantly surpasses existing methods in image generation control and output quality and shows strong knowledge reasoning; its architecture is inferred to combine an auto-regressive model with a diffusion model. Conclusion: The study provides a reliable benchmark and insights for future innovation in image generation, and releases code and datasets to foster reproducibility. Abstract: The recent breakthroughs in OpenAI's GPT4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) model combined with a diffusion-based head for image decoding, rather than the VAR-like architectures. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond.
The codes and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.

Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence

Anita Rau,Mark Endo,Josiah Aklilu,Jaewoo Heo,Khaled Saab,Alberto Paderno,Jeffrey Jopling,F. Christopher Holsinger,Serena Yeung-Levy

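Task: Analyze the performance of 11 state-of-the-art vision-language models on 17 visual understanding tasks in surgical AI.

Motivation: Explore the practical potential of vision-language models in medicine, especially surgery, where expert-annotated data is scarce.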

Details

Method: Comprehensively evaluates 11 vision-language models on 13 datasets spanning laparoscopic, robotic, and open procedures. Result: Vision-language models show promising generalizability, at times outperforming supervised models outside their training setting; in-context learning boosts performance up to three-fold. Conclusion: Vision-language models show potential in complex, dynamic scenarios, but tasks requiring spatial or temporal reasoning remain difficult. Abstract: Large Vision-Language Models offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Yet, VLMs' practical utility in intervention-focused domains--especially surgery, where decision-making is subjective and clinical scenarios are variable--remains uncertain. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI--from anatomy recognition to skill assessment--using 13 datasets spanning laparoscopic, robotic, and open procedures. In our experiments, VLMs demonstrate promising generalizability, at times outperforming supervised models when deployed outside their training setting. In-context learning, incorporating examples during testing, boosted performance up to three-fold, suggesting adaptability as a key strength. Still, tasks requiring spatial or temporal reasoning remained difficult. Beyond surgery, our findings offer insights into VLMs' potential for tackling complex and dynamic scenarios in clinical and broader real-world applications.

F-ViTA: Foundation Model Guided Visible to Thermal Translation

Jay N. Paranjape,Celso de Melo,Vishal M. Patel

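Task: Propose F-ViTA, a novel approach that uses the general knowledge in foundation models to guide the diffusion process for visible-to-thermal image translation.

Motivation: Collecting large thermal datasets is costly and time-consuming; existing methods typically rely on GANs or diffusion models, which struggle to learn both the modality distribution shift and the underlying physical principles from limited data.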

Details

Method: Conditions an InstructPix2Pix diffusion model on zero-shot masks and labels from foundation models (such as SAM and Grounded DINO) to learn correlations between scene objects and their thermal signatures. Result: Experiments on five public datasets show that F-ViTA outperforms existing methods, generalizes to out-of-distribution scenarios, and supports translation to multiple infrared bands. Conclusion: By leveraging foundation model knowledge, F-ViTA significantly improves visible-to-thermal image translation and has broad application potential. Abstract: Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: https://github.com/JayParanjape/F-ViTA/tree/master.

BOP Challenge 2024 on Model-Based and Model-Free 6D Object Pose Estimation

Van Nguyen Nguyen,Stephen Tyree,Andrew Guo,Mederic Fourmy,Anas Gouda,Taeyeop Lee,Sungphill Moon,Hyeontae Son,Lukas Ranftl,Jonathan Tremblay,Eric Brachmann,Bertram Drost,Vincent Lepetit,Carsten Rother,Stan Birchfield,Jiri Matas,Yann Labbe,Martin Sundermeyer,Tomas Hodan

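Task: Present the evaluation methodology, datasets, and results of the BOP Challenge 2024, moving 6D object pose estimation from lab-like setups toward real-world scenarios.

Motivation: Extend 6D object pose estimation from lab-like setups to scenarios closer to the real world, and introduce new tasks and datasets that support this goal.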

Details

Method: Introduce model-free tasks, a more practical 6D object detection task, and the new BOP-H3 datasets recorded with high-resolution sensors and AR/VR headsets. Result: The best 2024 method for model-based 6D localization of unseen objects is 22% more accurate than the best 2023 method, but run time and 2D detection of unseen objects still leave room for improvement. Conclusion: BOP Challenge 2024 advanced the state of 6D object pose estimation, while speed and 2D detection need further work. Abstract: We present the evaluation methodology, datasets and results of the BOP Challenge 2024, the sixth in a series of public competitions organized to capture the state of the art in 6D object pose estimation and related tasks. In 2024, our goal was to transition BOP from lab-like setups to real-world scenarios. First, we introduced new model-free tasks, where no 3D object models are available and methods need to onboard objects just from provided reference videos. Second, we defined a new, more practical 6D object detection task where identities of objects visible in a test image are not provided as input. Third, we introduced new BOP-H3 datasets recorded with high-resolution sensors and AR/VR headsets, closely resembling real-world scenarios. BOP-H3 include 3D models and onboarding videos to support both model-based and model-free tasks. Participants competed on seven challenge tracks, each defined by a task, object onboarding setup, and dataset group. Notably, the best 2024 method for model-based 6D localization of unseen objects (FreeZeV2.1) achieves 22% higher accuracy on BOP-Classic-Core than the best 2023 method (GenFlow), and is only 4% behind the best 2023 method for seen objects (GPose2023), although it is significantly slower (24.9 vs 2.7s per image). A more practical 2024 method for this task is Co-op, which takes only 0.8s per image and is 25X faster and 13% more accurate than GenFlow. Methods have a similar ranking on 6D detection as on 6D localization but higher run time. On model-based 2D detection of unseen objects, the best 2024 method (MUSE) achieves 21% relative improvement compared to the best 2023 method (CNOS). However, the 2D detection accuracy for unseen objects is still noticeably (-53%) behind the accuracy for seen objects (GDet2023).
The online evaluation system stays open and is available at http://bop.felk.cvut.cz/

Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization

Kangle Deng,Hsueh-Ti Derek Liu,Yiheng Zhu,Xiaoxia Sun,Chong Shang,Kiran Bhat,Deva Ramanan,Jun-Yan Zhu,Maneesh Agrawala,Tinghui Zhou

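Task: Propose an octree-based adaptive tokenization framework that adjusts the dimension of latent representations according to shape complexity.

Motivation: Existing methods encode all shapes into fixed-size tokens, ignoring the inherent variation in scale and complexity across 3D data; the resulting latent representations are inefficient and can compromise downstream generation.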

Details

Method: Build an adaptive octree guided by a quadric-error-based subdivision criterion and use a query-based transformer to assign a shape latent vector to each octree cell. Result: The method cuts token counts by 50% while maintaining visual quality; at a similar token length, the generated shapes are of significantly higher quality. Conclusion: The method generates more detailed and diverse 3D content than existing approaches. Abstract: Many 3D generative models rely on variational autoencoders (VAEs) to learn compact shape representations. However, existing methods encode all shapes into a fixed-size token, disregarding the inherent variations in scale and complexity across 3D data. This leads to inefficient latent representations that can compromise downstream generation. We address this challenge by introducing Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. Our approach constructs an adaptive octree structure guided by a quadric-error-based subdivision criterion and allocates a shape latent vector to each octree cell using a query-based transformer. Building upon this tokenization, we develop an octree-based autoregressive generative model that effectively leverages these variable-sized representations in shape generation. Extensive experiments demonstrate that our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality. When using a similar token length, our method produces significantly higher-quality shapes. When incorporated with our downstream generative model, our method creates more detailed and diverse 3D content than existing approaches.

GMR-Conv: An Efficient Rotation and Reflection Equivariant Convolution Kernel Using Gaussian Mixture Rings

Yuexi Du,Jiazhen Zhang,Nicha C. Dvornek,John A. Onofrey

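Task: Design an efficient convolution kernel (GMR-Conv) that smooths radial symmetry while preserving rotation and reflection equivariance.

Motivation: Conventional CNNs handle translational equivariance well, but rotation and reflection equivariance remain challenging and often force a compromise between equivariance, efficiency, and information loss.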

Details

Method: Propose Gaussian Mixture Ring Convolution (GMR-Conv), which smooths radial symmetry with Gaussian-weighted rings, reduces the discretization error of circular kernels, and optimizes space and speed efficiency. Result: On eight classification datasets and one segmentation dataset, GMR-Conv matches conventional CNN performance, surpasses it on orientation-less data, and is more robust and efficient than existing equivariant learning methods. Conclusion: Carefully designed radial symmetry can alleviate information loss, a promising advance for equivariant network architectures. Abstract: Symmetry, where certain features remain invariant under geometric transformations, can often serve as a powerful prior in designing convolutional neural networks (CNNs). While conventional CNNs inherently support translational equivariance, extending this property to rotation and reflection has proven challenging, often forcing a compromise between equivariance, efficiency, and information loss. In this work, we introduce Gaussian Mixture Ring Convolution (GMR-Conv), an efficient convolution kernel that smooths radial symmetry using a mixture of Gaussian-weighted rings. This design mitigates discretization errors of circular kernels, thereby preserving robust rotation and reflection equivariance without incurring computational overhead. We further optimize both the space and speed efficiency of GMR-Conv via a novel parameterization and computation strategy, allowing larger kernels at an acceptable cost. Extensive experiments on eight classification datasets and one segmentation dataset demonstrate that GMR-Conv not only matches conventional CNNs' performance but can also surpass it in applications with orientation-less data. GMR-Conv is also proven to be more robust and efficient than the state-of-the-art equivariant learning methods. Our work provides inspiring empirical evidence that carefully applied radial symmetry can alleviate the challenges of information loss, marking a promising advance in equivariant network architectures. The code is available at https://github.com/XYPB/GMR-Conv.
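
The idea of a kernel built from Gaussian-weighted rings can be illustrated with a short NumPy sketch. This is not the authors' parameterization: the function name `gaussian_ring_kernel`, the evenly spaced ring radii, the shared `sigma`, and the example weights are all assumptions made for illustration. In a network the ring weights would presumably be learned.

```python
import numpy as np

def gaussian_ring_kernel(size, ring_weights, sigma=0.6):
    """Build a radially symmetric 2D kernel as a weighted sum of Gaussian rings.

    Hypothetical helper: ring radii are spaced evenly from the kernel centre
    to its edge, and every ring shares the same width `sigma`.
    """
    half = (size - 1) / 2.0
    ys, xs = np.mgrid[0:size, 0:size]
    # Distance of every pixel from the kernel centre.
    radius = np.sqrt((xs - half) ** 2 + (ys - half) ** 2)
    ring_radii = np.linspace(0.0, half, num=len(ring_weights))
    kernel = np.zeros((size, size))
    for weight, r0 in zip(ring_weights, ring_radii):
        # Each ring is a Gaussian bump in the radial coordinate.
        kernel += weight * np.exp(-((radius - r0) ** 2) / (2.0 * sigma ** 2))
    return kernel

# Illustrative weights only; they stand in for learnable parameters.
kernel = gaussian_ring_kernel(size=7, ring_weights=[0.5, -1.0, 0.8, 0.2])
```

Because the kernel depends on pixel position only through the distance to the centre, it is unchanged by 90-degree rotations and by flips of the grid, which is the discrete counterpart of the rotation and reflection symmetry the abstract describes.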

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Mateusz Pach,Shyamgopal Karthik,Quentin Bouniot,Serge Belongie,Zeynep Akata

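Task: Apply sparse autoencoders (SAEs) to vision-language models (VLMs) to improve their interpretability and steerability.

Motivation: Explore the potential of SAEs in VLMs to improve the monosemanticity of vision representations and the controllability of the model.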

Details

Method: Train SAEs on VLMs such as CLIP and introduce a framework for evaluating monosemanticity in vision representations. Result: SAEs significantly improve neuron monosemanticity and exhibit hierarchical representations that align with expert-defined structures such as the iNaturalist taxonomy; intervening on the CLIP vision encoder through an SAE steers the output of multimodal LLMs without modifying the underlying model. Conclusion: SAEs are an effective unsupervised approach for improving the interpretability and control of VLMs. Abstract: Sparse Autoencoders (SAEs) have recently been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons while also exhibiting hierarchical representations that align well with expert-defined structures (e.g., iNaturalist taxonomy). Most notably, we demonstrate that applying SAEs to intervene on a CLIP vision encoder directly steers output from multimodal LLMs (e.g., LLaVA) without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised approach for enhancing both the interpretability and control of VLMs.
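
For readers unfamiliar with SAEs, the following NumPy sketch shows a generic sparse autoencoder forward pass and a latent intervention of the kind the abstract refers to as steering. It is a minimal illustration under stated assumptions, not the paper's architecture: the sizes, the random (untrained) weights, the L1 coefficient, and the helper names `encode`, `decode`, `sae_loss`, and `steer` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 8, 32  # hypothetical sizes; latents are overcomplete

# Untrained, randomly initialised parameters (illustration only).
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    """Map activations to non-negative latents with a ReLU encoder."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def decode(z):
    """Reconstruct activations as a linear combination of decoder rows."""
    return z @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse latents."""
    z = encode(x)
    return np.mean((x - decode(z)) ** 2) + l1_coeff * np.mean(np.abs(z).sum(axis=-1))

def steer(x, latent_index, value):
    """Overwrite one latent and decode, yielding edited activations."""
    z = encode(x)
    z[..., latent_index] = value
    return decode(z)

activations = rng.normal(size=(4, d_model))  # stand-in for encoder activations
steered = steer(activations, latent_index=3, value=5.0)
```

In this sketch, steering changes the reconstruction only along the decoder direction of the chosen latent, which is why an intervention of this kind can be applied without touching the weights of the model that consumes the activations.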

STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection

Divya Velayudhan,Abdelfatah Ahmed,Mohamad Alansari,Neha Gour,Abderaouf Behouch,Taimur Hassan,Syed Talal Wasim,Nabil Maalej,Muzammal Naseer,Juergen Gall,Mohammed Bennamoun,Ernesto Damiani,Naoufel Werghi

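Task: Develop STCray, a multimodal X-ray baggage security dataset, and train STING-BEE, a domain-aware visual AI assistant that supports a range of vision-language tasks.

Motivation: Current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels.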

Details

Method: Generate 46,642 image-caption paired scans across 21 threat categories with an X-ray scanner, and develop the domain-aware visual AI assistant STING-BEE. Result: STING-BEE performs strongly in multimodal learning, supports scene comprehension, threat localization, visual grounding, and visual question answering, and shows state-of-the-art generalization in cross-domain settings. Conclusion: STCray and STING-BEE provide a new multimodal learning benchmark for X-ray baggage security and demonstrate strong cross-domain generalization. Abstract: Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol that ensures domain-aware, coherent captions, that lead to the multi-modal instruction following data in X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multi-modal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Code, data, and models are available at https://divs1159.github.io/STING-BEE/.

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Xiangyu Zhao,Peiyuan Zhang,Kexian Tang,Hao Li,Zicheng Zhang,Guangtao Zhai,Junchi Yan,Hua Yang,Xue Yang,Haodong Duan

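Task: Introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE).

Motivation: Large multimodal models (LMMs) have progressed in visual understanding and generation, but general visual editing still poses challenges in following complex instructions, preserving appearance consistency, and supporting flexible input formats.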

Details

Method: Propose RISEBench, covering four key reasoning types (temporal, causal, spatial, and logical), together with an evaluation framework that uses both human judges and an LMM-as-a-judge approach to assess instruction reasoning, appearance consistency, and visual plausibility. Result: GPT-4o-Native performs best in the experiments but still struggles with logical reasoning tasks. Conclusion: RISEBench provides foundational insights into reasoning-aware visual editing and aims to catalyze future research; the benchmark will continue to be expanded and refined. Abstract: Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.

Concept Lancet: Image Editing with Compositional Representation Transplant

Jinqi Luo,Tianjiao Ding,Kwan Ho Ryan Chan,Hancheng Min,Chris Callison-Burch,René Vidal

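Task: Propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for image editing with diffusion models.

Motivation: Existing editing methods design an edit direction in the text embedding or score space, but they struggle to balance edit strength against visual consistency and require trial-and-error tuning.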

Details

Method: Decompose the input as a sparse linear combination of visual concept representations, perform a customized concept transplant, and build the concept representation dataset CoLan-150K. Result: Experiments show that methods equipped with CoLan achieve state-of-the-art editing effectiveness and consistency preservation. Conclusion: CoLan offers an efficient, tuning-free solution for image editing with diffusion models. Abstract: Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.
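
The sparse decomposition step described in the abstract can be sketched as an L1-regularized least-squares problem. The solver below (ISTA), the toy orthonormal dictionary, and the `transplant` rule for a "replace" edit are assumptions chosen for illustration; the abstract does not specify the solver or the exact transplant formula.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_decompose(source, dictionary, lam=0.05, n_iter=500):
    """Minimise 0.5*||source - dictionary @ w||^2 + lam*||w||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(dictionary, ord=2) ** 2  # 1 / Lipschitz constant
    w = np.zeros(dictionary.shape[1])
    for _ in range(n_iter):
        grad = dictionary.T @ (dictionary @ w - source)
        w = soft_threshold(w - step * grad, step * lam)
    return w

def transplant(source, dictionary, w, src_idx, tgt_idx):
    """Hypothetical 'replace' edit: swap the source concept's estimated
    contribution for the target concept at the same strength."""
    strength = w[src_idx]
    return source - strength * dictionary[:, src_idx] + strength * dictionary[:, tgt_idx]

# Toy dictionary with orthonormal columns standing in for concept vectors.
rng = np.random.default_rng(0)
dictionary, _ = np.linalg.qr(rng.normal(size=(16, 6)))
source = 2.0 * dictionary[:, 1] + 0.5 * dictionary[:, 4]

weights = sparse_decompose(source, dictionary)
edited = transplant(source, dictionary, weights, src_idx=1, tgt_idx=2)
```

The point of the sketch is that the edit strength is read off the estimated coefficient for each input instead of being tuned by hand, which is the trial-and-error cost the abstract sets out to avoid.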

CaLiV: LiDAR-to-Vehicle Calibration of Arbitrary Sensor Setups via Object Reconstruction

Ilir Tahiraj,Markus Edinger,Dominik Kulmer,Markus Lienkamp

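Task: Propose CaLiV, a target-based technique for extrinsic Sensor-to-Sensor and Sensor-to-Vehicle calibration of multi-LiDAR systems.

Motivation: Existing LiDAR calibration methods require overlapping fields of view or rely on external devices, and most do not support Sensor-to-Vehicle calibration.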

Details

Method: Use motion to produce field-of-view overlaps, obtain vehicle poses with an unscented Kalman filter, align the point clouds with the GMMCalib framework, and finally cast the calibration as a minimization problem. Result: The method accurately resolves translational and rotational Sensor-to-Sensor errors and calibrates Sensor-to-Vehicle rotation angles with high accuracy. Conclusion: CaLiV is an efficient calibration method that needs no external devices and supports non-overlapping fields of view; it has been validated in real-world experiments. Abstract: In autonomous systems, sensor calibration is essential for safe and efficient navigation in dynamic environments. Accurate calibration is a prerequisite for reliable perception and planning tasks such as object detection and obstacle avoidance. Many existing LiDAR calibration methods require overlapping fields of view, while others use external sensing devices or postulate a feature-rich environment. In addition, Sensor-to-Vehicle calibration is not supported by the vast majority of calibration algorithms. In this work, we propose a novel target-based technique for extrinsic Sensor-to-Sensor and Sensor-to-Vehicle calibration of multi-LiDAR systems called CaLiV. This algorithm works for non-overlapping FoVs, as well as arbitrary calibration targets, and does not require any external sensing devices. First, we apply motion to produce FoV overlaps and utilize a simple unscented Kalman filter to obtain vehicle poses. Then, we use the Gaussian mixture model-based registration framework GMMCalib to align the point clouds in a common calibration frame. Finally, we reduce the task of recovering the sensor extrinsics to a minimization problem. We show that both translational and rotational Sensor-to-Sensor errors can be solved accurately by our method. In addition, all Sensor-to-Vehicle rotation angles can also be calibrated with high accuracy. We validate the simulation results in real-world experiments. The code is open source and available at https://github.com/TUMFTM/CaLiV.

Distance Estimation to Support Assistive Drones for the Visually Impaired using Robust Calibration

Suman Raj,Bhavani A Madhabhavi,Madhav Kumar,Prabhav Gupta,Yogesh Simmhan

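Task: Use depth maps and a dynamic-update method to support autonomous drone navigation assistance for Visually Impaired People (VIPs).

Motivation: Combine deep learning and computer vision algorithms to make drone-based navigation assistance for visually impaired people more accurate and robust in complex, dynamic environments.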

Details

Method: Propose NOVA, which estimates absolute distances from depth maps and uses a dynamic-update method to adapt to adversarial scenarios. Result: NOVA estimates the distance to VIPs with an error below 30 cm and to other obstacles such as cars and bicycles with a maximum error of 60 cm, outperforming the baselines and existing depth map methods. Conclusion: NOVA is more robust and generalizes better in dynamic and diverse environments, clearly outperforming existing techniques. Abstract: Autonomous navigation by drones using onboard sensors, combined with deep learning and computer vision algorithms, is impacting a number of domains. We examine the use of drones to autonomously assist Visually Impaired People (VIPs) in navigating outdoor environments while avoiding obstacles. Here, we present NOVA, a robust calibration technique using depth maps to estimate absolute distances to obstacles in a campus environment. NOVA uses a dynamic-update method that can adapt to adversarial scenarios. We compare NOVA with SOTA depth map approaches, and with geometric and regression-based baseline models, for distance estimation to VIPs and other obstacles in diverse and dynamic conditions. We also provide exhaustive evaluations to validate the robustness and generalizability of our methods. NOVA predicts distances to VIPs with an error below 30 cm and to different obstacles like cars and bicycles with a maximum error of 60 cm, which is better than the baselines. NOVA also clearly outperforms SOTA depth map methods, by up to 5.3-14.6x.

A Concise Survey on Lane Topology Reasoning for HD Mapping

Yi Yao,Miao Fan,Shengtong Xu,Haoyi Xiong,Xiangzeng Liu,Wenbo Hu,Wenbing Huang

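Task: Systematically survey the evolution, current state, and future research directions of lane topology reasoning methods.

Motivation: Lane topology reasoning is crucial for HD mapping and autonomous driving, yet a comprehensive overview of these methods has been missing.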

Details

Method: Group methods into three paradigms (procedural modeling-based, aerial imagery-based, and onboard sensors-based) and analyze the progression from rule-based to deep learning approaches. Result: The survey summarizes standardized evaluation metrics and performance comparisons on benchmark datasets, and identifies challenges such as dataset availability and model efficiency. Conclusion: It gives researchers and practitioners a comprehensive view of the theoretical frameworks, practical implementations, and emerging trends in lane topology reasoning. Abstract: Lane topology reasoning techniques play a crucial role in high-definition (HD) mapping and autonomous driving applications. While recent years have witnessed significant advances in this field, there has been limited effort to consolidate these works into a comprehensive overview. This survey systematically reviews the evolution and current state of lane topology reasoning methods, categorizing them into three major paradigms: procedural modeling-based methods, aerial imagery-based methods, and onboard sensors-based methods. We analyze the progression from early rule-based approaches to modern learning-based solutions utilizing transformers, graph neural networks (GNNs), and other deep learning architectures. The paper examines standardized evaluation metrics, including road-level measures (APLS and TLTS score), and lane-level metrics (DET and TOP score), along with performance comparisons on benchmark datasets such as OpenLane-V2. We identify key technical challenges, including dataset availability and model efficiency, and outline promising directions for future research. This comprehensive review provides researchers and practitioners with insights into the theoretical frameworks, practical implementations, and emerging trends in lane topology reasoning for HD mapping applications.

Khizar Anjum,Parul Pandey,Vidyasagar Sadhu,Roberto Tron,Dario Pompili

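Task: Propose a novel Markov Decision Process (MDP) framework that reduces the computational workload of computer vision (CV) algorithms in autonomous navigation.

Motivation: Existing autonomous navigation methods rely on expensive processing of geometric 3D point clouds; navigating by semantic information such as traffic signs is simpler, but the CV algorithms it needs (e.g., object detection) are demanding for resource-limited agents such as aerial drones.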

Details

Method: Introduce a novel MDP framework, apply it to feature-based and neural-network-based object detection tasks, and test it in open-loop and closed-loop simulations and hardware-in-the-loop emulations. Result: Compared with models based on static features and neural networks, the approach shows significant benefits in energy consumption and speed with only a limited loss in accuracy. Conclusion: The proposed MDP framework effectively reduces the computational workload of CV algorithms and suits resource-limited autonomous navigation platforms. Abstract: Most applications in autonomous navigation using mounted cameras rely on the construction and processing of geometric 3D point clouds, which is an expensive process. However, there is another simpler way to make a space navigable quickly: to use semantic information (e.g., traffic signs) to guide the agent. However, detecting and acting on semantic information involves Computer Vision (CV) algorithms such as object detection, which themselves are demanding for agents such as aerial drones with limited onboard resources. To solve this problem, we introduce a novel Markov Decision Process (MDP) framework to reduce the workload of these CV approaches. We apply our proposed framework to both feature-based and neural-network-based object-detection tasks, using open-loop and closed-loop simulations as well as hardware-in-the-loop emulations. These holistic tests show significant benefits in energy consumption and speed with only a limited loss in accuracy compared to models based on static features and neural networks.

WorldPrompter: Traversable Text-to-Scene Generation

Zhaoyang Zhang,Yannick Hold-Geoffroy,Miloš Hašan,Chen Ziwen,Fujun Luan,Julie Dorsey,Yiwei Hu

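Task: Propose WorldPrompter, a new method for generating traversable 3D scenes from text prompts.

Motivation: Most existing methods generate only partial scenes with limited navigational freedom, so a method that generates complete, traversable 3D scenes is needed.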

Details

Method: Use panoramic video as an intermediate representation, combining a conditional 360° panoramic video generator with a fast feedforward 3D reconstructor to produce traversable 3D scenes. Result: Experiments show that the panoramic video generation model achieves view consistency across frames, supports high-quality panoramic Gaussian splat reconstruction, and enables traversal over an area of the scene. Conclusion: WorldPrompter outperforms state-of-the-art methods in both 360° video generation and 3D scene generation. Abstract: Scene-level 3D generation is a challenging research topic, with most existing methods generating only partial scenes and offering limited navigational freedom. We introduce WorldPrompter, a novel generative pipeline for synthesizing traversable 3D scenes from text prompts. We leverage panoramic videos as an intermediate representation to model the 360° details of a scene. WorldPrompter incorporates a conditional 360° panoramic video generator, capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a true walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model achieves convincing view consistency across frames, enabling high-quality panoramic Gaussian splat reconstruction and facilitating traversal over an area of the scene. Qualitative and quantitative results also show it outperforms the state-of-the-art 360° video generators and 3D scene generation models.

Evaluation of Flight Parameters in UAV-based 3D Reconstruction for Rooftop Infrastructure Assessment

Nick Chodura,Melissa Greeff,Joshua Woods

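Task: Systematically evaluate key flight parameters (ground sampling distance and image overlap) to optimize the 3D reconstruction of complex rooftop infrastructure.

Motivation: With autonomous flight paths, existing methods need high image overlap and long flight times to ensure model accuracy; this study tunes those parameters to improve efficiency.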

Details

Method: Fly controlled UAV missions with varied ground sampling distance and image overlap settings, process the collected data with Reality Capture software, and compare the results against ground truth models from UAV-based LiDAR and terrestrial laser scanning. Result: A ground sampling distance of 0.75-1.26 cm combined with 85% image overlap yields high model accuracy while minimizing the number of images and the flight time. Conclusion: The findings provide guidance for planning autonomous UAV flight paths for efficient rooftop infrastructure assessment. Abstract: Rooftop 3D reconstruction using UAV-based photogrammetry offers a promising solution for infrastructure assessment, but existing methods often require high percentages of image overlap and extended flight times to ensure model accuracy when using autonomous flight paths. This study systematically evaluates key flight parameters, namely ground sampling distance (GSD) and image overlap, to optimize the 3D reconstruction of complex rooftop infrastructure. Controlled UAV flights were conducted over a multi-segment rooftop at Queen's University using a DJI Phantom 4 Pro V2, with varied GSD and overlap settings. The collected data were processed using Reality Capture software and evaluated against ground truth models generated from UAV-based LiDAR and terrestrial laser scanning (TLS). Experimental results indicate that a GSD range of 0.75-1.26 cm combined with 85% image overlap achieves a high degree of model accuracy, while minimizing images collected and flight time. These findings provide guidance for planning autonomous UAV flight paths for efficient rooftop assessments.
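
To make the two flight parameters concrete, the sketch below uses the standard pinhole relation between flight height and GSD for a nadir-pointing camera, and converts an overlap fraction into the spacing between exposures. The camera constants (13.2 mm sensor width, 8.8 mm focal length, 5472 x 3648 px images) are nominal values assumed here for illustration and are not taken from the paper; the function names are likewise hypothetical.

```python
def ground_sampling_distance_cm(altitude_m, sensor_width_mm, focal_length_mm, image_width_px):
    """GSD in cm per pixel for a camera looking straight down."""
    return (altitude_m * 100.0) * sensor_width_mm / (focal_length_mm * image_width_px)

def altitude_for_gsd_m(gsd_cm, sensor_width_mm, focal_length_mm, image_width_px):
    """Height above the surface needed to reach a target GSD."""
    return (gsd_cm / 100.0) * focal_length_mm * image_width_px / sensor_width_mm

def photo_spacing_m(gsd_cm, image_dim_px, overlap):
    """Distance between exposures that gives the requested overlap fraction."""
    footprint_m = gsd_cm / 100.0 * image_dim_px  # ground footprint along this axis
    return footprint_m * (1.0 - overlap)

# Assumed nominal camera constants (check the manufacturer's datasheet).
SENSOR_WIDTH_MM = 13.2
FOCAL_LENGTH_MM = 8.8
IMAGE_WIDTH_PX = 5472
IMAGE_HEIGHT_PX = 3648

for target_gsd_cm in (0.75, 1.26):  # GSD range reported in the abstract
    height = altitude_for_gsd_m(target_gsd_cm, SENSOR_WIDTH_MM, FOCAL_LENGTH_MM, IMAGE_WIDTH_PX)
    spacing = photo_spacing_m(target_gsd_cm, IMAGE_HEIGHT_PX, overlap=0.85)
    print(f"GSD {target_gsd_cm} cm: height {height:.1f} m, exposure spacing {spacing:.1f} m")
```

Under these assumptions a finer GSD means flying lower, which shrinks the image footprint, so more exposures are needed at a fixed overlap. That is the accuracy versus flight-time trade-off the study quantifies.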

One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Ezzeldin Shereen,Dan Ristea,Burak Hasircioglu,Shae McFadden,Vasilios Mavroudis,Chris Hicks

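Task: Study poisoning attacks against multimodal retrieval augmented generation (M-RAG) systems, specifically in visual document retrieval applications.

Motivation: M-RAG systems use a knowledge base to suppress hallucinations of multimodal models, but injecting malicious entries into that knowledge base opens an attack vector, exposing a potential vulnerability.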

Details

Method: Design a poisoning attack on visual document retrieval in which a single image influences the output for a variety of queries, producing a universal denial-of-service attack. Result: The attack is effective against a range of state-of-the-art retrievers and generators but has limited effect on robust embedding models. Conclusion: M-RAG systems are vulnerable to poisoning attacks and may also have a performance weakness in benign settings. Abstract: Multimodal retrieval augmented generation (M-RAG) has recently emerged as a method to inhibit hallucinations of large multimodal models (LMMs) through a factual knowledge base (KB). However, M-RAG also introduces new attack vectors for adversaries that aim to disrupt the system by injecting malicious entries into the KB. In this work, we present a poisoning attack against M-RAG targeting visual document retrieval applications, where the KB contains images of document pages. Our objective is to craft a single image that is retrieved for a variety of different user queries, and consistently influences the output produced by the generative model, thus creating a universal denial-of-service (DoS) attack against the M-RAG system. We demonstrate that while our attack is effective against a diverse range of widely-used, state-of-the-art retrievers (embedding models) and generators (LMMs), it can also be ineffective against robust embedding models. Our attack not only highlights the vulnerability of M-RAG pipelines to poisoning attacks, but also sheds light on a fundamental weakness that potentially hinders their performance even in benign settings.

Multivariate Temporal Regression at Scale: A Three-Pillar Framework Combining ML, XAI, and NLP

Jiztom Kavalakkatt Francis,Matthew J Darr

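Task: Examine the challenges of analyzing high-dimensional data and propose a method for simplifying models.

Motivation: Traditional data analysis methods can miss complex relationships in high-dimensional data and are computationally expensive.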

Details

Method: Explore variable removal, statistical analysis, and synthetic data, and propose a technique for global feature identification and dimensionality reduction. Result: The resulting method simplifies models and reveals previously unexplored relationships between inputs and outcomes. Conclusion: The method supports data validation and quantification while improving model interpretability. Abstract: The rapid use of artificial intelligence (AI) in processes such as coding, image processing, and data prediction means it is crucial to understand and validate the data we are working with fully. This paper dives into the hurdles of analyzing high-dimensional data, especially when it gets too complex. Traditional methods in data analysis often look at direct connections between input variables, which can miss out on the more complicated relationships within the data. To address these issues, we explore several tested techniques, such as removing specific variables to see their impact and using statistical analysis to find connections between multiple variables. We also consider the role of synthetic data and how information can sometimes be redundant across different sensors. These analyses are typically very computationally demanding and often require much human effort to make sense of the results. A common approach is to treat the entire dataset as one unit and apply advanced models to handle it. However, this can become problematic with larger, noisier datasets and more complex models. So, we suggest methods to identify overall patterns that can help with tasks like classification or regression based on the idea that more straightforward approaches might be more understandable. Our research looks at two datasets: a real-world dataset and a synthetic one. The goal is to create a methodology that highlights key features on a global scale that lead to predictions, making it easier to validate or quantify the data set. By reducing the dimensionality with this method, we can simplify the models used and thus clarify the insights we gain. Furthermore, our method can reveal unexplored relationships between specific inputs and outcomes, providing a way to validate these new connections further.

Preference-Driven Active 3D Scene Representation for Robotic Inspection in Nuclear Decommissioning

Zhen Meng,Kan Chen,Xiangmin Xu,Erwin Jose Lopez Pulgarin,Emma Li,Philip G. Zhao,David Flynn

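Task: Propose a new framework that integrates expert operator preferences into active 3D scene representation.

Motivation: Traditional methods mainly optimize geometric fidelity or rendering accuracy and overlook operator-specific objectives such as safety-critical coverage or task-driven viewpoints, which leads to suboptimal viewpoint selection in constrained environments.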

Details

Method: Use Reinforcement Learning from Human Feedback (RLHF) to guide robotic path planning, and capture operator preferences through interactive choice experiments. Result: Validated in a nuclear decommissioning scenario, the RLHF-based policy outperforms random selection, optimizing trajectory efficiency and improving scene representation. Conclusion: The work lays a foundation for adaptive, safety-critical robotic perception systems and advances automation in high-risk environments. Abstract: Active 3D scene representation is pivotal in modern robotics applications, including remote inspection, manipulation, and telepresence. Traditional methods primarily optimize geometric fidelity or rendering accuracy, but often overlook operator-specific objectives, such as safety-critical coverage or task-driven viewpoints. This limitation leads to suboptimal viewpoint selection, particularly in constrained environments such as nuclear decommissioning. To bridge this gap, we introduce a novel framework that integrates expert operator preferences into the active 3D scene representation pipeline. Specifically, we employ Reinforcement Learning from Human Feedback (RLHF) to guide robotic path planning, reshaping the reward function based on expert input. To capture operator-specific priorities, we conduct interactive choice experiments that evaluate user preferences in 3D scene representation. We validate our framework using a UR3e robotic arm for reactor tile inspection in a nuclear decommissioning scenario. Compared to baseline methods, our approach enhances scene representation while optimizing trajectory efficiency. The RLHF-based policy consistently outperforms random selection, prioritizing task-critical details. By unifying explicit 3D geometric modeling with implicit human-in-the-loop optimization, this work establishes a foundation for adaptive, safety-critical robotic perception systems, paving the way for enhanced automation in nuclear decommissioning, remote maintenance, and other high-risk environments.

Neural Style Transfer for Synthesising a Dataset of Ancient Egyptian Hieroglyphs

Lewis Matheson Creed

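Task: Propose a new method that uses Neural Style Transfer (NST) to generate a dataset of ancient Egyptian hieroglyphs.

Motivation: Training data for low-resource languages such as Ancient Egyptian is limited, which restricts the use of machine learning techniques.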

Details

Method: Generate a dataset of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Result: Models trained on the NST-generated dataset perform on par with models trained on real images in classification and transfer to real unseen images. Conclusion: NST offers a viable way to generate data for low-resource languages. Abstract: The limited availability of training data for low-resource languages makes applying machine learning techniques challenging. Ancient Egyptian is one such language with few resources. However, innovative applications of data augmentation methods, such as Neural Style Transfer, could overcome these barriers. This paper presents a novel method for generating datasets of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Experimental results found that image classification models trained on NST-generated examples and photographs demonstrate equal performance and transferability to real unseen images of hieroglyphs.

Image Coding for Machines via Feature-Preserving Rate-Distortion Optimization

Samuel Fernández-Menduiña,Eduardo Pavez,Antonio Ortega

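Task: Optimize image and video compression so that it serves both visual quality and downstream computer vision task performance.

Motivation: Many images and videos are processed mainly by computer vision algorithms with only occasional human inspection, so compression methods must account for both visual quality and task performance.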

Details

Method: Simplify rate-distortion optimization (RDO) through a Taylor expansion and a block-wise approximation, yielding the input-dependent squared error (IDSE), which is combined with SSE. Result: In AVC simulations, the method achieves up to 10% bit-rate savings over SSE-based RDO with no decoder complexity overhead and only a 7% increase in encoder complexity. Conclusion: The method lowers the bit rate needed to maintain computer vision task performance, with only a limited increase in computational complexity. Abstract: Many images and videos are primarily processed by computer vision algorithms, involving only occasional human inspection. When this content requires compression before processing, e.g., in distributed applications, coding methods must optimize for both visual quality and downstream task performance. We first show that, given the features obtained from the original and the decoded images, an approach to reduce the effect of compression on a task loss is to perform rate-distortion optimization (RDO) using the distance between features as a distortion metric. However, directly optimizing such a rate-distortion trade-off requires an iterative workflow of encoding, decoding, and feature evaluation for each coding parameter, which is computationally impractical. We address this problem by simplifying the RDO formulation to make the distortion term computable using block-based encoders. We first apply Taylor's expansion to the feature extractor, recasting the feature distance as a quadratic metric with the Jacobian matrix of the neural network. Then, we replace the linearized metric with a block-wise approximation, which we call input-dependent squared error (IDSE). To reduce computational complexity, we approximate IDSE using Jacobian sketches. The resulting loss can be evaluated block-wise in the transform domain and combined with the sum of squared errors (SSE) to address both visual quality and computer vision performance. Simulations with AVC across multiple feature extractors and downstream neural networks show up to 10% bit-rate savings for the same computer vision accuracy compared to RDO based on SSE, with no decoder complexity overhead and just a 7% encoder complexity increase.
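
The linearization step in the abstract can be written out as a short derivation. The notation is assumed here, not taken from the paper: $x$ is the original image, $\hat{x}$ the decoded image, $f$ the feature extractor, $J_f(x)$ its Jacobian at $x$, $x_b$ the $b$-th block, $J_b$ the columns of $J_f(x)$ belonging to that block, and $S$ a random sketching matrix. The block-wise form, which drops cross-block terms, and the sketched form are one plausible reading of the abstract and are not a statement of the authors' exact definition.

```latex
\begin{align}
  f(\hat{x}) &\approx f(x) + J_f(x)\,(\hat{x} - x)
    && \text{(first-order Taylor expansion)} \\
  \lVert f(x) - f(\hat{x}) \rVert_2^2
    &\approx (x - \hat{x})^\top J_f(x)^\top J_f(x)\,(x - \hat{x})
    && \text{(quadratic metric)} \\
  &\approx \sum_b (x_b - \hat{x}_b)^\top J_b^\top J_b\,(x_b - \hat{x}_b)
    && \text{(block-wise approximation, assumed form)} \\
  J_b^\top J_b &\approx J_b^\top S^\top S\, J_b
    && \text{(Jacobian sketch, assumed form)}
\end{align}
```

The practical benefit is that the quadratic form depends only on the original image, so the encoder can evaluate the distortion for each candidate coding choice without decoding and re-running the feature extractor.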

APSeg: Auto-Prompt Model with Acquired and Injected Knowledge for Nuclear Instance Segmentation and Classification

Liying Xu,Hongliang He,Wei Han,Hanbin Huang,Siwei Feng,Guohong Fu

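Task: Propose APSeg, an auto-prompt model for nuclear instance segmentation and classification that addresses the Segment Anything Model's (SAM) reliance on precise prompts and the dependence of its classification results on those prompts.

Motivation: SAM has improved the accuracy and efficiency of nuclear segmentation, but its strong reliance on precise prompts and its class-agnostic design limit its classification ability.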

Details

Method: APSeg comprises two knowledge-aware modules: the Distribution-Guided Proposal Offset Module (DG-POM) and the Category Knowledge Semantic Injection Module (CK-SIM). Result: Experiments on the PanNuke and CoNSeP datasets demonstrate the effectiveness of APSeg. Conclusion: By incorporating distribution and morphological knowledge, APSeg improves nuclear instance segmentation and classification. Abstract: Nuclear instance segmentation and classification provide critical quantitative foundations for digital pathology diagnosis. With the advent of the foundational Segment Anything Model (SAM), the accuracy and efficiency of nuclear segmentation have improved significantly. However, SAM imposes a strong reliance on precise prompts, and its class-agnostic design renders its classification results entirely dependent on the provided prompts. Therefore, we focus on generating prompts with more accurate localization and classification and propose APSeg, an Auto-Prompt model with acquired and injected knowledge for nuclear instance Segmentation and classification. APSeg incorporates two knowledge-aware modules: (1) the Distribution-Guided Proposal Offset Module (DG-POM), which learns distribution knowledge through density-map guidance, and (2) the Category Knowledge Semantic Injection Module (CK-SIM), which injects morphological knowledge derived from category descriptions. We conducted extensive experiments on the PanNuke and CoNSeP datasets, demonstrating the effectiveness of our approach. The code will be released upon acceptance.

LLM-Guided Evolution: An Autonomous Model Optimization for Object Detection

YiMing Yu,Jason Zutty

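Task: Improve the LLM-GE framework to optimize YOLO model architectures for better object detection performance on the KITTI dataset.

Motivation: Traditional Neural Architecture Search (NAS) requires extensive trial and error and domain knowledge, and evolutionary algorithms rely on fixed rules and predefined building blocks; LLM-GE offers a more flexible and efficient approach by adding guidance from large language models (LLMs).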

Details

Method: Apply the LLM-GE framework with the "Evolution of Thought" (EoT) technique, iteratively refining the design and settings of YOLO models through feedback loops. Result: YOLO variants produced by LLM-GE perform significantly better on KITTI; for example, Mean Average Precision (mAP) rises from 92.5% to 94.5%. Conclusion: By combining LLM-driven reasoning with evolutionary strategies, LLM-GE offers a new paradigm for automated machine learning and shows flexibility and effectiveness on real-world problems. Abstract: In machine learning, Neural Architecture Search (NAS) requires domain knowledge of model design and a large amount of trial-and-error to achieve promising performance. Meanwhile, evolutionary algorithms have traditionally relied on fixed rules and pre-defined building blocks. The Large Language Model (LLM)-Guided Evolution (GE) framework transformed this approach by incorporating LLMs to directly modify model source code for image classification algorithms on CIFAR data and intelligently guide mutations and crossovers. A key element of LLM-GE is the "Evolution of Thought" (EoT) technique, which establishes feedback loops, allowing LLMs to refine their decisions iteratively based on how previous operations performed. In this study, we perform NAS for object detection by improving LLM-GE to modify the architecture of You Only Look Once (YOLO) models to enhance performance on the KITTI dataset. Our approach intelligently adjusts the design and settings of YOLO to find the optimal algorithms against objectives such as detection accuracy and speed. We show that LLM-GE produced variants with significant performance improvements, such as an increase in Mean Average Precision from 92.5% to 94.5%. This result highlights the flexibility and effectiveness of LLM-GE on real-world challenges, offering a novel paradigm for automated machine learning that combines LLM-driven reasoning with evolutionary strategies.

Towards Assessing Deep Learning Test Input Generators

Seif Mzoughi,Ahmed Hajyahmed,Mohamed Elshafei,Foutse Khomh,Diego Elias Costa

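Task: Comprehensively assess the effectiveness of four state-of-the-art Test Input Generators (TIGs) along several key dimensions.

Motivation: Deep learning systems are increasingly deployed in safety-critical applications, where robustness issues can cause serious failures, yet existing evaluations of TIGs are not comprehensive.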

Details

Method: Use three pre-trained models (LeNet-5, VGG16, EfficientNetB3) and datasets of varying complexity (MNIST, CIFAR-10, ImageNet-1K) to evaluate four TIGs (DeepHunter, DeepFault, AdvGAN, SinVAD) on fault-revealing capability, naturalness, diversity, and efficiency. Result: The TIGs exhibit important trade-offs in robustness-revealing capability, test case generation, and computational efficiency, and their performance varies significantly with dataset complexity. Conclusion: The paper offers practical guidance for selecting TIGs suited to specific objectives and dataset characteristics, but TIGs need further improvement to meet the demands of real-world safety-critical systems. Abstract: Deep Learning (DL) systems are increasingly deployed in safety-critical applications, yet they remain vulnerable to robustness issues that can lead to significant failures. While numerous Test Input Generators (TIGs) have been developed to evaluate DL robustness, a comprehensive assessment of their effectiveness across different dimensions is still lacking. This paper presents a comprehensive assessment of four state-of-the-art TIGs (DeepHunter, DeepFault, AdvGAN, and SinVAD) across multiple critical aspects: fault-revealing capability, naturalness, diversity, and efficiency. Our empirical study leverages three pre-trained models (LeNet-5, VGG16, and EfficientNetB3) on datasets of varying complexity (MNIST, CIFAR-10, and ImageNet-1K) to evaluate TIG performance. Our findings reveal important trade-offs in robustness-revealing capability, variation in test case generation, and computational efficiency across TIGs. The results also show that TIG performance varies significantly with dataset complexity: tools that perform well on simpler datasets may struggle with more complex ones, whereas others maintain steadier performance or better scalability. This paper offers practical guidance for selecting appropriate TIGs aligned with specific objectives and dataset characteristics. Nonetheless, more work is needed to address TIG limitations and advance TIGs for real-world, safety-critical systems.

Determining Sphere Radius through Pairwise Distances

Boris Sukhovilov

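Task: Propose a new method for determining the radius of a sphere from distances measured between points on its surface.

Motivation: Determine the sphere radius in the most general case, where the distances are measured with errors and the sphere deviates randomly from its ideal shape.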

Details

Method: Using the minimum of four points as well as an arbitrary number N of points, derive a closed-form solution for the sphere radius from the matrix of pairwise distances, and determine the standard deviation of the radius estimate caused by measurement errors and shape deviations. Result: A closed-form solution for the sphere radius, together with optimal point configurations that minimize the standard deviation of the radius estimate. Conclusion: Through its mathematical derivations and open-source implementation, the method offers an effective solution for accurate estimation of the sphere radius. Abstract: We propose a novel method for determining the radius of a spherical surface based on the distances measured between points on this surface. We consider the most general case of determining the radius when the distances are measured with errors and the sphere has random deviations from its ideal shape. For the solution, we used the minimally necessary four points and an arbitrary number N of points. We provide a new closed-form solution for the radius of the sphere through the matrix of pairwise distances. We also determine the standard deviation of the radius estimate caused by measurement errors and deviations of the sphere from its ideal shape. We found optimal configurations of points on the sphere that provide the minimum standard deviation of the radius estimate. This paper describes our solution and provides all the mathematical derivations. We share the implementation of our method as open source code at https://github.com/boris-sukhovilov/Sphere_Radius.
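
To make the idea of recovering a radius from pairwise distances concrete, the sketch below uses a classical identity for four non-coplanar points on a sphere: with D the matrix of squared pairwise distances and 1 the all-ones vector, R^2 = 1 / (2 * 1^T D^-1 1). This is an assumption chosen for illustration; the paper's own closed-form estimator, its N-point variant, and its error analysis may differ and are not reproduced here. The function names are hypothetical.

```python
import math

def solve(A, b):
    # Gaussian elimination with partial pivoting; returns x such that A x = b.
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def squared_distance_matrix(points):
    # D[i][j] is the squared Euclidean distance between points i and j.
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]

def sphere_radius_four_points(points):
    # Noise-free case for exactly four non-coplanar points on the sphere.
    D = squared_distance_matrix(points)
    u = solve(D, [1.0] * len(points))
    return math.sqrt(1.0 / (2.0 * sum(u)))

# Four points on a sphere of radius 2 centred at (1, 1, 1).
radius = sphere_radius_four_points([(3, 1, 1), (1, 3, 1), (1, 1, 3), (-1, 1, 1)])
```

Only the distances enter the computation, so the sphere's centre never has to be known. With noisy distances or more than four points a different estimator is needed, which is the case the paper addresses.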

MG-Gen: Single Image to Motion Graphics Generation with Layer Decomposition

Takahiro Shirakawa,Tomoyuki Suzuki,Daichi Haraguchi

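Task: Propose MG-Gen, a new framework that generates vector-format data from a single raster image to support code-based motion graphics generation.

Motivation: Existing image-to-video methods perform poorly on motion graphics, lacking active text motion and distorting objects, while code-based methods are limited by their need for vector data.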

Details

Method: MG-Gen first decomposes the input image into layer-wise elements, reconstructs them as HTML-format data, and then generates executable JavaScript code. Result: Experiments confirm that MG-Gen generates motion graphics while preserving text readability and input consistency. Conclusion: Combining layer decomposition with animation code generation is an effective strategy for motion graphics generation. Abstract: General image-to-video generation methods often produce suboptimal animations that do not meet the requirements of animated graphics, as they lack active text motion and exhibit object distortion. Also, code-based animation generation methods typically require layer-structured vector data, which are often not readily available for motion graphics generation. To address these challenges, we propose a novel framework named MG-Gen that reconstructs data in vector format from a single raster image, extending the capabilities of code-based methods to enable motion graphics generation from a raster image in the framework of general image-to-video generation. MG-Gen first decomposes the input image into layer-wise elements, reconstructs them as HTML-format data, and then generates executable JavaScript code for the reconstructed HTML data. We experimentally confirm that MG-Gen generates motion graphics while preserving text readability and input consistency. These successful results indicate that combining layer decomposition and animation code generation is an effective strategy for motion graphics generation.

HPGN: Hybrid Priors-Guided Network for Compressed Low-Light Image Enhancement

Hantang Li,Jinhua Hao,Lei Xiong,Shuyuan Zhu

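Task: Propose a hybrid priors-guided network (HPGN) for enhancing compressed low-light images.

Motivation: Existing methods either neglect the removal of compression artifacts during enhancement or fail to establish a unified joint-task enhancement framework for images of varying compression quality.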

Details

Method: Combine compression and illumination priors, using the JPEG quality factor (QF) and the DCT quantization matrix (QM) to design efficient joint-task plug-and-play modules, and guide model training with a random QF generation strategy. Result: Experimental results show the superiority of the proposed method. Conclusion: HPGN effectively enhances low-light images across different compression levels. Abstract: In practical applications, conventional methods generate large volumes of low-light images that require compression for efficient storage and transmission. However, most existing methods either disregard the removal of potential compression artifacts during the enhancement process or fail to establish a unified framework for joint task enhancement of images with varying compression qualities. To solve this problem, we propose the hybrid priors-guided network (HPGN), which enhances compressed low-light images by integrating both compression and illumination priors. Our approach fully utilizes the JPEG quality factor (QF) and DCT quantization matrix (QM) to guide the design of efficient joint task plug-and-play modules. Additionally, we employ a random QF generation strategy to guide model training, enabling a single model to enhance images across different compression levels. Experimental results confirm the superiority of our proposed method.

Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge

Yudi Sang,Yanzhen Liu,Sutuke Yibulayimu,Yunning Wang,Benjamin D. Killeen,Mingxu Liu,Ping-Cheng Ku,Ole Johannsen,Karol Gotkowski,Maximilian Zenk,Klaus Maier-Hein,Fabian Isensee,Peiyan Yue,Yi Wang,Haidong Yu,Zhaohong Pan,Yutong He,Xiaokun Liang,Daiqi Liu,Fuxin Fan,Artur Jurgas,Andrzej Skalski,Yuxi Ma,Jing Yang,Szymon Płotka,Rafał Litka,Gang Zhu,Yingchun Song,Mathias Unberath,Mehran Armand,Dan Ruan,S. Kevin Zhou,Qiyong Cao,Chunpeng Zhao,Xinbao Wu,Yu Wang

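Task: Evaluate and compare the performance of automated algorithms for pelvic fracture fragment segmentation.

Motivation: Accurate segmentation of pelvic fracture fragments is crucial for trauma diagnosis, surgical planning, and intraoperative guidance, but it remains a major challenge because of complex anatomy and imaging limitations.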

Details

Method: Through the PENGWIN challenge, 150 CT scans and simulated X-ray images were collected, and state-of-the-art algorithms from 16 teams were evaluated. Result: The best algorithm reached an average IoU of 0.930 on the CT segmentation task and 0.774 on the X-ray task, indicating that X-ray segmentation is more challenging. Conclusion: Despite encouraging results, uncertainty in fragment definition suggests that interactive segmentation approaches incorporating human decision-making may be essential for improving model reliability and clinical applicability. Abstract: The segmentation of pelvic fracture fragments in CT and X-ray images is crucial for trauma diagnosis, surgical planning, and intraoperative guidance. However, accurately and efficiently delineating the bone fragments remains a significant challenge due to complex anatomy and imaging limitations. The PENGWIN challenge, organized as a MICCAI 2024 satellite event, aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms on these complex tasks. A diverse dataset of 150 CT scans was collected from multiple clinical centers, and a large set of simulated X-ray images was generated using the DeepDRR method. Final submissions from 16 teams worldwide were evaluated under a rigorous multi-metric testing scheme. The top-performing CT algorithm achieved an average fragment-wise intersection over union (IoU) of 0.930, demonstrating satisfactory accuracy. However, in the X-ray task, the best algorithm attained an IoU of 0.774, highlighting the greater challenges posed by overlapping anatomical structures. Beyond the quantitative evaluation, the challenge revealed methodological diversity in algorithm design. Variations in instance representation, such as primary-secondary classification versus boundary-core separation, led to differing segmentation strategies. Despite promising results, the challenge also exposed inherent uncertainties in fragment definition, particularly in cases of incomplete fractures. These findings suggest that interactive segmentation approaches, integrating human decision-making with task-relevant information, may be essential for improving model reliability and clinical applicability.
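
The headline metric, fragment-wise intersection over union, is simple to state in code. The sketch below is a minimal illustration rather than the challenge's evaluation script: fragments are represented as sets of voxel or pixel coordinates, the pairing of predicted and ground-truth fragments by shared keys is assumed to be given, and the convention for two empty masks is an assumption.

```python
def iou(pred, truth):
    # pred and truth are sets of voxel (or pixel) coordinates for one fragment.
    union = pred | truth
    if not union:
        return 1.0  # both masks empty: treated as a perfect match (assumed convention)
    return len(pred & truth) / len(union)

def mean_fragment_iou(pred_fragments, truth_fragments):
    # Average IoU over ground-truth fragments; a fragment with no prediction scores 0.
    scores = [iou(pred_fragments.get(key, set()), mask)
              for key, mask in truth_fragments.items()]
    return sum(scores) / len(scores)

# Two toy 2D fragments, identified by hypothetical labels.
truth = {"ilium": {(0, 1), (1, 1), (2, 2)}, "sacrum": {(5, 5), (5, 6)}}
pred = {"ilium": {(0, 0), (0, 1), (1, 1)}, "sacrum": {(5, 5), (5, 6)}}
score = mean_fragment_iou(pred, truth)
```

Here the first fragment overlaps in 2 of 4 coordinates (IoU 0.5) and the second matches exactly (IoU 1.0), so the mean is 0.75.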

Translation of Fetal Brain Ultrasound Images into Pseudo-MRI Images using Artificial Intelligence

Naomi Silverstein,Efrat Leibowitz,Ron Beloosesky,Haim Azhari

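Task: Use artificial intelligence to translate ultrasound images into an MRI-like display that improves visual discrimination of fetal brain tissue.

Motivation: Ultrasound has limitations for fetal brain assessment, particularly in the third trimester, while MRI offers high image quality but is expensive and less accessible.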

Details

Method: Propose "Dual Diffusion Imposed Correlation" (DDIC), a diffusion-based method that assumes a shared latent space between the ultrasound and MRI domains. Result: The generated pseudo-MRI images markedly improve visual discrimination of brain tissue, especially in the lateral ventricles and the Sylvian fissure, and several evaluation metrics show DDIC outperforming other methods. Conclusion: Pseudo-MRI images have the potential to streamline diagnosis and improve clinical outcomes through better image representation. Abstract: Ultrasound is a widely accessible and cost-effective medical imaging tool commonly used for prenatal evaluation of the fetal brain. However, it has limitations, particularly in the third trimester, where the complexity of the fetal brain requires high image quality for extracting quantitative data. In contrast, magnetic resonance imaging (MRI) offers superior image quality and tissue differentiation but is less available, expensive, and requires time-consuming acquisition. Thus, transforming ultrasonic images into an MRI-mimicking display may be advantageous and allow better tissue anatomy presentation. To address this goal, we have examined the use of artificial intelligence, implementing a diffusion model renowned for generating high-quality images. The proposed method, termed "Dual Diffusion Imposed Correlation" (DDIC), leverages a diffusion-based translation methodology, assuming a shared latent space between ultrasound and MRI domains. The model was trained on the "HC18" dataset for ultrasound and on the "CRL fetal brain atlas" along with the "FeTA" datasets for MRI. The generated pseudo-MRI images provide notable improvements in visual discrimination of brain tissue, especially in the lateral ventricles and the Sylvian fissure, characterized by enhanced contrast clarity. Improvement was demonstrated in mutual information, peak signal-to-noise ratio, Fréchet Inception Distance, and contrast-to-noise ratio. Findings from these evaluations indicate statistically significant superior performance of DDIC compared to other translation methodologies. In addition, a Medical Opinion Test was conducted with 5 gynecologists. The results demonstrated display improvement in 81% of the tested images. In conclusion, the presented pseudo-MRI images hold the potential for streamlining diagnosis and enhancing clinical outcomes through improved representation.

Estimating Scene Flow in Robot Surroundings with Distributed Miniaturized Time-of-Flight Sensors

Jack Sander,Giammarco Caroleo,Alessandro Albini,Perla Maiolino

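Task: Propose a method for estimating scene flow from low-density, noisy point clouds in order to improve safe robot motions and reactions.

Motivation: Tracking the motion of humans or objects in the robot's surroundings improves the robot's safety and its ability to react.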

Details

Method: Cluster points and apply Iterative Closest Point (ICP) to estimate a dense motion flow, adding a fitness-based classification and an inlier removal strategy to reduce the impact of noisy, low-density data. Result: Experiments show that the method accurately estimates the direction and speed of motion, with an error in line with sensor noise. Conclusion: The proposed method effectively estimates scene flow from low-density, noisy point clouds and is suitable for robot environment perception. Abstract: Tracking motions of humans or objects in the surroundings of the robot is essential to improve safe robot motions and reactions. In this work, we present an approach for scene flow estimation from low-density and noisy point clouds acquired from miniaturized Time of Flight (ToF) sensors distributed on the robot body. The proposed method clusters points from consecutive frames and applies Iterative Closest Point (ICP) to estimate a dense motion flow, with additional steps introduced to mitigate the impact of sensor noise and low-density data points. Specifically, we employ a fitness-based classification to distinguish between stationary and moving points and an inlier removal strategy to refine geometric correspondences. The proposed approach is validated in an experimental setup where 24 ToF sensors are used to estimate the velocity of an object moving at different controlled speeds. Experimental results show that the method consistently approximates the direction of the motion and its magnitude with an error which is in line with sensor noise.
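
The core of the pipeline, aligning a cluster across two consecutive frames and turning the alignment into a velocity, can be sketched with a deliberately reduced ICP. The version below is an assumption-laden simplification, not the authors' method: it estimates translation only (no rotation), uses brute-force nearest neighbours, and omits the clustering, fitness-based classification, and inlier removal steps described in the abstract. Function names and the frame interval `dt` are hypothetical.

```python
def nearest(p, cloud):
    # Brute-force nearest neighbour of p in cloud (squared Euclidean distance).
    return min(cloud, key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))

def icp_translation(source, target, iterations=10):
    # Translation-only ICP: repeatedly match points and shift by the mean offset.
    t = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        moved = [[p[i] + t[i] for i in range(3)] for p in source]
        matches = [nearest(p, target) for p in moved]
        step = [sum(m[i] - p[i] for p, m in zip(moved, matches)) / len(moved)
                for i in range(3)]
        t = [t[i] + step[i] for i in range(3)]
    return t

def velocity(source, target, dt):
    # Flow of the cluster between two frames, expressed in units per second.
    return [c / dt for c in icp_translation(source, target)]

# A small cluster that moves by (0.1, 0.05, 0.0) between frames taken 0.1 s apart.
frame_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
frame_b = [(x + 0.1, y + 0.05, z) for x, y, z in frame_a]
v = velocity(frame_a, frame_b, dt=0.1)
```

With well-separated points and a small displacement the correspondences are correct on the first iteration, so the estimate converges immediately; with sparse, noisy ToF data the matching is far less reliable, which is why the additional filtering steps matter.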

RASP: Revisiting 3D Anamorphic Art for Shadow-Guided Packing of Irregular Objects

Soumyaratna Debnath,Ashish Tiwari,Kaustubh Sadekar,Shanmuganathan Raman

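Task: Use RASP, a differentiable-rendering framework, to arrange 3D objects with minimal inter-object spacing and maximal occupancy.

Motivation: Inspired by 3D anamorphic art, explore how computational models can produce meaningful artistic expression across multiple viewpoints.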

Details

Method: Propose the RASP framework, which combines shadow-guided optimization with an SDF (signed distance function) formulation to handle inter-object intersection and container extrusion. Result: Artistic illustrations of multi-view anamorphic art that yield meaningful expressions when observed from multiple viewpoints. Conclusion: RASP performs well at 3D object arrangement and part assembly and offers a new approach to multi-view art creation. Abstract: Recent advancements in learning-based methods have opened new avenues for exploring and interpreting art forms, such as shadow art, origami, and sketch art, through computational models. One notable visual art form is 3D Anamorphic Art, in which an ensemble of arbitrarily shaped 3D objects creates a realistic and meaningful expression when observed from a particular viewpoint and loses its coherence from other viewpoints. In this work, we build on insights from 3D Anamorphic Art to perform 3D object arrangement. We introduce RASP, a differentiable-rendering-based framework to arrange arbitrarily shaped 3D objects within a bounded volume via shadow (or silhouette)-guided optimization, with the aim of minimal inter-object spacing and near-maximal occupancy. Furthermore, we propose a novel SDF-based formulation to handle inter-object intersection and container extrusion. We demonstrate that RASP can be extended to part assembly alongside object packing by considering 3D objects to be "parts" of another 3D object. Finally, we present artistic illustrations of multi-view anamorphic art, achieving meaningful expressions from multiple viewpoints within a single ensemble.

Adaptive path planning for efficient object search by UAVs in agricultural fields

Rick van Essen,Eldert van Henten,Lammert Kooistra,Gert Kootstra

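Task: Develop an adaptive path planner for UAV-based object search in agricultural fields.

Motivation: Improve the efficiency and accuracy of UAV object search in agricultural fields, especially when objects are non-uniformly distributed.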

Details

Method: Combine a high-altitude coverage flight path with low-altitude inspections, use a YOLOv8 detection network to assess detection uncertainty, and optimize the path planning parameters. Result: The adaptive planner performs better when objects are non-uniformly distributed, yields shorter paths, and is robust to localization errors. Conclusion: The adaptive path planner finds non-uniformly distributed objects faster than a conventional coverage path planner while maintaining detection accuracy, showing potential for practical use. Abstract: This paper presents an adaptive path planner for object search in agricultural fields using UAVs. The path planner uses a high-altitude coverage flight path and plans additional low-altitude inspections when the detection network is uncertain. The path planner was evaluated in an offline simulation environment containing real-world images. We trained a YOLOv8 detection network to detect artificial plants placed in grass fields to showcase the potential of our path planner. We evaluated the effect of different detection certainty measures, optimized the path planning parameters, and investigated the effects of localization errors and different numbers of objects in the field. The YOLOv8 detection confidence worked best to differentiate between true and false positive detections and was therefore used in the adaptive planner. The optimal parameters of the path planner depended on the distribution of objects in the field: when the objects were uniformly distributed, more low-altitude inspections were needed than for a non-uniform distribution of objects, resulting in a longer path length. The adaptive planner proved to be robust against localization uncertainty. When increasing the number of objects, the flight path length increased, especially when the objects were uniformly distributed. When the objects were non-uniformly distributed, the adaptive path planner yielded a shorter path than a low-altitude coverage path, even with a high number of objects. Overall, the presented adaptive path planner made it possible to find non-uniformly distributed objects in a field faster than a coverage path planner and resulted in a comparable detection accuracy. The path planner is made available at https://github.com/wur-abe/uav_adaptive_planner.
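
The adaptive step, scheduling a low-altitude inspection only where the detector is uncertain, can be illustrated as follows. This is a hedged sketch and not the released planner: the confidence band (`low`, `high`), the detection record layout, and the greedy nearest-neighbour ordering are all assumptions made for the example.

```python
import math

def uncertain_detections(detections, low=0.2, high=0.8):
    # Keep detections whose confidence is neither clearly false nor clearly true.
    return [d for d in detections if low <= d["confidence"] < high]

def plan_inspections(start, detections, low=0.2, high=0.8):
    # Order the uncertain detections greedily by distance from the current position.
    todo = [d["xy"] for d in uncertain_detections(detections, low, high)]
    route, position = [], start
    while todo:
        nxt = min(todo, key=lambda q: math.dist(position, q))
        todo.remove(nxt)
        route.append(nxt)
        position = nxt
    return route

# Detections from the high-altitude pass (field coordinates in metres, hypothetical).
detections = [
    {"xy": (5.0, 5.0), "confidence": 0.50},
    {"xy": (1.0, 1.0), "confidence": 0.40},
    {"xy": (2.0, 2.0), "confidence": 0.95},  # confident: accepted without inspection
    {"xy": (9.0, 9.0), "confidence": 0.05},  # very low: discarded without inspection
]
route = plan_inspections((0.0, 0.0), detections)
```

The number of uncertain detections drives the extra path length, which is consistent with the abstract's observation that path length grows with the number of objects.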

Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision

Xiaofeng Han,Shunpeng Chen,Zenghuang Fu,Zhe Feng,Lue Fan,Dong An,Changwei Wang,Li Guo,Weiliang Meng,Xiaopeng Zhang,Rongtao Xu,Shibiao Xu

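Task: Systematically review the applications of multimodal fusion in key robot vision tasks, and compare vision-language models based on large language models with traditional multimodal fusion methods.

Motivation: Robot vision has benefited from advances in multimodal fusion and vision-language models, but a systematic survey and comparative analysis has been lacking.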

Details

Method: Through a literature review and comparative analysis, assess the use of multimodal fusion in tasks such as semantic scene understanding, SLAM, and 3D object detection, and analyze datasets and challenges. Result: A summary of the advantages, limitations, and synergies of multimodal fusion, with proposed future research directions such as self-supervised learning and transformer-based fusion architectures. Conclusion: The survey provides a valuable reference for multimodal perception and interaction in robot vision and identifies future research challenges and directions. Abstract: Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We systematically review the applications of multimodal fusion in key robotic vision tasks, including semantic scene understanding, simultaneous localization and mapping (SLAM), 3D object detection, navigation and localization, and robot manipulation. We compare VLMs based on large language models (LLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Furthermore, we identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation, and propose future research directions, including self-supervised learning for robust multimodal representations, transformer-based fusion architectures, and scalable multimodal frameworks. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

Yan Ma,Steffi Chern,Xuyang Shen,Yiran Zhong,Pengfei Liu

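Task: Propose a transparent, from-scratch reinforcement learning framework for vision-language models and validate its effectiveness.

Motivation: Existing reinforcement learning methods for vision-language models rely on complex frameworks and lack reproducibility and standardized evaluation protocols.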

Details

Method: Design a minimal yet functional four-step pipeline, validate it across multiple models and datasets, and propose a standardized evaluation scheme. Result: Experiments show that response length is sensitive to random seeds, that reflective behavior correlates with output length, and that reinforcement learning generalizes better than supervised fine-tuning. Conclusion: The framework and findings provide a reproducible baseline for reinforcement learning in vision-language models and encourage broader research participation. Abstract: Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.

Efficient Model Editing with Task-Localized Sparse Fine-tuning

Leonardo Iurada,Marco Ciccone,Tatiana Tommasi

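Task: Propose TaLoS, a method for building sparse task vectors that addresses the computational bottlenecks and weight disentanglement problems of existing task vector methods.

Motivation: Existing methods derive task vectors through network linearization, which creates computational bottlenecks during training and inference; moreover, linearization does not guarantee weight disentanglement, the key property for conflict-free composition of task vectors.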

Details

Method: Identify the subset of parameters in the pre-trained model with low gradient sensitivity and sparsely update only those parameters to promote weight disentanglement. Result: TaLoS outperforms existing methods in task addition and task negation while improving training and inference efficiency. Conclusion: Through modular parameter editing, TaLoS offers a practical route to deploying adaptable foundation models in real-world applications. Abstract: Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which builds sparse task vectors with minimal interference without requiring explicit linearization or sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters promotes weight disentanglement during fine-tuning. Our experiments show that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
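
The mechanism in the abstract, updating only the parameters with low gradient sensitivity so that the resulting task vector is sparse, can be sketched on plain lists. This is an illustration under assumptions, not the TaLoS implementation: the per-parameter `sensitivity` scores are taken as given (how they are computed is not specified here), `keep_fraction` is a hypothetical hyperparameter, and a single plain SGD step stands in for fine-tuning.

```python
def low_sensitivity_mask(sensitivity, keep_fraction):
    # Select the keep_fraction of parameters with the lowest sensitivity scores.
    k = max(1, int(len(sensitivity) * keep_fraction))
    order = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i])
    chosen = set(order[:k])
    return [i in chosen for i in range(len(sensitivity))]

def sparse_sgd_step(params, grads, mask, lr=0.1):
    # Update only the masked parameters; all others stay at their current values.
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]

def task_vector(finetuned, pretrained):
    # Task vector = fine-tuned weights minus pre-trained weights.
    return [f - p for f, p in zip(finetuned, pretrained)]

pretrained = [1.0, 1.0, 1.0, 1.0]
sensitivity = [0.9, 0.1, 0.5, 0.05]   # hypothetical per-parameter scores
grads = [1.0, 2.0, 3.0, 4.0]          # hypothetical task gradients

mask = low_sensitivity_mask(sensitivity, keep_fraction=0.5)
finetuned = sparse_sgd_step(pretrained, grads, mask)
tv = task_vector(finetuned, pretrained)
```

Entries of `tv` outside the mask are exactly zero, so adding or negating several such vectors only touches the selected parameters, which is the property that limits interference between tasks.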

Towards Computation- and Communication-efficient Computational Pathology

Chu Han,Bingchao Zhao,Jiatai Lin,Shanshan Lyu,Longfei Wang,Tianpeng Deng,Cheng Lu,Changhong Liang,Hannah Y. Wen,Xiaojing Guo,Zhenwei Shi,Zaiyi Liu

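Task: Propose MAGA-GLTrans, a computation- and communication-efficient framework that addresses the inefficiency of current computational pathology models in analyzing high-magnification whole-slide images.

Motivation: Current computational pathology models depend on high-magnification image analysis, which lowers diagnostic efficiency and limits clinical use in time-sensitive scenarios.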

Details

Method: Use a magnification alignment (MAGA) mechanism that aligns the feature representations of low- and high-magnification images through self-supervised learning, reducing computation time and storage requirements. Result: MAGA-GLTrans performs strongly across a range of tasks, with up to a 10.7-fold reduction in computation time and an over 20-fold reduction in file transfer and storage requirements. Conclusion: MAGA-GLTrans provides an efficient and accurate solution for time-sensitive applications such as intraoperative frozen section diagnosis. Abstract: Despite their impressive performance across a wide range of applications, current computational pathology models face significant diagnostic efficiency challenges due to their reliance on high-magnification whole-slide image analysis. This limitation severely compromises their clinical utility, especially in time-sensitive diagnostic scenarios and situations requiring efficient data transfer. To address these issues, we present a novel computation- and communication-efficient framework called Magnification-Aligned Global-Local Transformer (MAGA-GLTrans). Our approach significantly reduces computational time, file transfer requirements, and storage overhead by enabling effective analysis using low-magnification inputs rather than high-magnification ones. The key innovation lies in our proposed magnification alignment (MAGA) mechanism, which employs self-supervised learning to bridge the information gap between low and high magnification levels by effectively aligning their feature representations. Through extensive evaluation across various fundamental CPath tasks, MAGA-GLTrans demonstrates state-of-the-art classification performance while achieving remarkable efficiency gains: up to 10.7 times reduction in computational time and over 20 times reduction in file transfer and storage requirements. Furthermore, we highlight the versatility of our MAGA framework through two significant extensions: (1) its applicability as a feature extractor to enhance the efficiency of any CPath architecture, and (2) its compatibility with existing foundation models and histopathology-specific encoders, enabling them to process low-magnification inputs with minimal information loss. These advancements position MAGA-GLTrans as a particularly promising solution for time-sensitive applications, especially in the context of intraoperative frozen section diagnosis where both accuracy and efficiency are paramount.

Adaptive Frequency Enhancement Network for Remote Sensing Image Semantic Segmentation

Feng Gao,Miao Fu,Jingchao Cao,Junyu Dong,Qian Du

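Task: Propose the Adaptive Frequency Enhancement Network (AFENet) for semantic segmentation of high-resolution remote sensing images.

Motivation: Existing methods struggle to adapt to different land cover distributions and to strengthen the interaction between spatial and frequency domain features.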

Details

Method: AFENet comprises the Adaptive Frequency and Spatial feature Interaction Module (AFSIM), which dynamically separates and modulates high- and low-frequency features, and the Selective feature Fusion Module (SFM), which selectively fuses global and local features. Result: On three public datasets, AFENet outperforms existing methods, and the effectiveness of AFSIM and SFM is validated. Conclusion: By strengthening the interaction between frequency and spatial features, AFENet improves semantic segmentation performance. Abstract: Semantic segmentation of high-resolution remote sensing images plays a crucial role in land-use monitoring and urban planning. Recent remarkable progress in deep learning-based methods makes it possible to generate satisfactory segmentation results. However, existing methods still face challenges in adapting network parameters to various land cover distributions and enhancing the interaction between spatial and frequency domain features. To address these challenges, we propose the Adaptive Frequency Enhancement Network (AFENet), which integrates two key components: the Adaptive Frequency and Spatial feature Interaction Module (AFSIM) and the Selective feature Fusion Module (SFM). AFSIM dynamically separates and modulates high- and low-frequency features according to the content of the input image. It adaptively generates two masks to separate high- and low-frequency components, thereby providing optimal details and contextual supplementary information for ground object feature representation. SFM selectively fuses global context and local detailed features to enhance the network's representation capability. Hence, the interactions between frequency and spatial features are further enhanced. Extensive experiments on three publicly available datasets demonstrate that the proposed AFENet outperforms state-of-the-art methods. In addition, we validate the effectiveness of AFSIM and SFM in managing diverse land cover types and complex scenarios. Our code is available at https://github.com/oucailab/AFENet.

BECAME: BayEsian Continual Learning with Adaptive Model MErging

Mei Li,Yuxiang Lu,Qinyan Dai,Suizhi Huang,Yue Ding,Hongtao Lu

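Task: Explore how model merging can improve the stability-plasticity trade-off in continual learning.

Motivation: Balancing stability (retaining prior knowledge) and plasticity (learning new tasks) is a key challenge in continual learning; existing gradient projection methods ensure stability but limit plasticity, while model merging techniques rely on empirical assumptions and hyperparameter selection.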

Details

Method: Reformulate the merging mechanism using Bayesian continual learning principles, derive a closed-form solution for the optimal merging coefficient that adapts to task diversity, and propose BECAME, a two-stage framework that combines gradient projection with adaptive merging. Result: Experiments show that BECAME outperforms existing continual learning methods and merging strategies. Conclusion: With theoretical grounding and an adaptive design, model merging significantly improves the stability-plasticity trade-off in continual learning. Abstract: Continual Learning (CL) strives to learn incrementally across tasks while mitigating catastrophic forgetting. A key challenge in CL is balancing stability (retaining prior knowledge) and plasticity (learning new tasks). While representative gradient projection methods ensure stability, they often limit plasticity. Model merging techniques offer promising solutions, but prior methods typically rely on empirical assumptions and carefully selected hyperparameters. In this paper, we explore the potential of model merging to enhance the stability-plasticity trade-off, providing theoretical insights that underscore its benefits. Specifically, we reformulate the merging mechanism using Bayesian continual learning principles and derive a closed-form solution for the optimal merging coefficient that adapts to the diverse characteristics of tasks. To validate our approach, we introduce a two-stage framework named BECAME, which synergizes the expertise of gradient projection and adaptive merging. Extensive experiments show that our approach outperforms state-of-the-art CL methods and existing merging strategies.
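
For orientation, a merging coefficient is commonly used in a linear interpolation between the weights kept from earlier tasks and the weights trained on the new task. The sketch below assumes that form, which the abstract does not spell out, and takes the coefficient `lam` as an input: the paper's closed-form expression for the optimal coefficient is not given in the abstract and is not reproduced or guessed here.

```python
def merge(theta_old, theta_new, lam):
    # Linear interpolation: lam = 0 keeps the old model, lam = 1 takes the new one.
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [(1.0 - lam) * old + lam * new
            for old, new in zip(theta_old, theta_new)]

theta_old = [0.0, 2.0]   # weights after previous tasks (stability)
theta_new = [1.0, 4.0]   # weights after training on the new task (plasticity)
merged = merge(theta_old, theta_new, lam=0.25)
```

A small `lam` favours stability and a large one favours plasticity, which is why an adaptive, task-dependent choice of the coefficient is the quantity of interest.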

Spline-based Transformers

Prashanth Chandran,Agon Serifi,Markus Gross,Moritz Bächer

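Task: Propose a spline-based Transformer model that needs no positional encoding.

Motivation: Overcome the limitations of positional encoding in conventional Transformers, such as sequence length extrapolation, and give users a new way to interact with the latent space.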

Details

Method: Use splines to embed the input sequence as a smooth trajectory in latent space; users can directly manipulate the latent control points to generate new sequences. Result: Outperforms conventional positional encoding on synthetic 2D data and on large-scale real-world datasets of images, 3D shapes, and animations. Conclusion: Spline-based Transformers offer advantages in both performance and interactivity and open new possibilities for Transformer architectures. Abstract: We introduce Spline-based Transformers, a novel class of Transformer models that eliminate the need for positional encoding. Inspired by workflows using splines in computer animation, our Spline-based Transformers embed an input sequence of elements as a smooth trajectory in latent space. Overcoming drawbacks of positional encoding such as sequence length extrapolation, Spline-based Transformers also provide a novel way for users to interact with transformer latent spaces by directly manipulating the latent control points to create new latent trajectories and sequences. We demonstrate the superior performance of our approach in comparison to conventional positional encoding on a variety of datasets, ranging from synthetic 2D to large-scale real-world datasets of images, 3D shapes, and animations.
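
The idea of a sequence as a smooth latent trajectory governed by a few control points can be illustrated without any Transformer. The sketch below is an assumption made for illustration only: the abstract does not state which spline family is used, so a Bézier curve evaluated with de Casteljau's algorithm stands in for it, and the 2D control points are hypothetical latent vectors.

```python
def bezier(control_points, t):
    # De Casteljau evaluation of a Bezier curve at parameter t in [0, 1].
    pts = [list(p) for p in control_points]
    while len(pts) > 1:
        pts = [[(1 - t) * a + t * b for a, b in zip(p, q)]
               for p, q in zip(pts, pts[1:])]
    return pts[0]

def latent_trajectory(control_points, length):
    # Sample `length` latent vectors at evenly spaced parameters along the curve.
    if length == 1:
        return [bezier(control_points, 0.0)]
    return [bezier(control_points, i / (length - 1)) for i in range(length)]

# Four hypothetical 2D latent control points.
controls = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
short = latent_trajectory(controls, 5)    # a 5-element sequence
longer = latent_trajectory(controls, 50)  # same controls, longer sequence
```

Position along the sequence is encoded by the curve parameter rather than by a separate positional code, so the same control points yield sequences of any length, and editing a control point reshapes the whole trajectory smoothly.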